Changed behavior — review or accept
2 agents show behavioral drift from established patterns · 1 dropped previously used tools
Not compared — create a baseline to enable drift detection
1 agent skipped — create a baseline to include it in behavioral testing
Behavior within baseline expectations
Avg score 96/100 — all agents operating within baseline expectations