
Compare

The compare command computes deltas between two evaluation runs for A/B testing.

Run two evaluations and compare them:

agentv eval evals/my-eval.yaml --out before.jsonl
# ... make changes to your agent ...
agentv eval evals/my-eval.yaml --out after.jsonl
agentv compare before.jsonl after.jsonl
Option           Description
--threshold, -t  Score delta threshold for win/loss classification (default: 0.1)
--format, -f     Output format: table (default) or json
--json           Shorthand for --format=json
How it works:

  1. Load Results — reads both JSONL files containing evaluation results
  2. Match by test_id — pairs results with matching test_id fields
  3. Compute Deltas — calculates delta = score2 - score1 for each pair
  4. Compute Normalized Gain — calculates g = delta / (1 - score1) for each pair (see below)
  5. Classify Outcomes:
    • win: delta >= threshold (candidate better)
    • loss: delta <= -threshold (baseline better)
    • tie: |delta| < threshold (no significant difference)
  6. Output Summary — human-readable table or JSON
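The matching and classification steps above can be sketched as follows. This is an illustrative model, not the tool's actual internals; the function names and the record shape (`test_id`, `score`) are assumptions:

```python
def classify(delta, threshold=0.1):
    """Classify a score delta as win/loss/tie relative to a threshold."""
    if delta >= threshold:
        return "win"
    if delta <= -threshold:
        return "loss"
    return "tie"

def match_and_classify(results1, results2, threshold=0.1):
    """Pair results by test_id, then compute delta, normalized gain, outcome."""
    by_id = {r["test_id"]: r for r in results2}
    matched = []
    for r1 in results1:
        r2 = by_id.get(r1["test_id"])
        if r2 is None:
            continue  # unmatched: present only in file1
        delta = r2["score"] - r1["score"]
        # g is undefined when the baseline is already 1.0 (no headroom)
        g = None if r1["score"] == 1.0 else delta / (1 - r1["score"])
        matched.append({
            "test_id": r1["test_id"],
            "delta": round(delta, 3),
            "normalized_gain": None if g is None else round(g, 3),
            "outcome": classify(delta, threshold),
        })
    return matched
```

With the default threshold of 0.1, a pair scoring 0.4 → 0.8 classifies as a win with delta +0.4 and normalized gain ≈ 0.667.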

In addition to raw delta, compare reports normalized gain (g):

g = (score_candidate − score_baseline) / (1 − score_baseline)

g measures improvement relative to remaining headroom rather than as an absolute number. This matters when baselines differ across tasks:

Baseline  Candidate  Δ      g      Interpretation
0.10      0.55       +0.45  +0.50  Captured 50% of remaining headroom
0.90      0.95       +0.05  +0.50  Same proportional gain despite smaller Δ
0.50      0.25       −0.25  −0.50  Regression: lost 50% of headroom

g is null when the baseline is already 1.0 (no headroom to improve). Null values are excluded from the mean.
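The normalized-gain formula and its null handling can be written out directly. A minimal sketch (function names are mine, not part of the tool):

```python
def normalized_gain(baseline, candidate):
    """g = (candidate - baseline) / (1 - baseline); None when baseline is 1.0."""
    if baseline == 1.0:
        return None  # no headroom left to improve
    return (candidate - baseline) / (1 - baseline)

def mean_gain(pairs):
    """Mean of normalized gains over (baseline, candidate) pairs,
    excluding None values, matching the doc's exclusion rule."""
    gains = [g for g in (normalized_gain(b, c) for b, c in pairs) if g is not None]
    return sum(gains) / len(gains) if gains else None
```

Plugging in the rows from the table above reproduces g = +0.50, +0.50, and −0.50 respectively.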

Comparing: baseline/ → candidate/
Test ID Baseline Candidate Delta Result
─────────────────── ──────── ───────── ──────── ────────
fix-cwd-bug 0.00 0.60 +0.60 ✓ win
spec-driven-impl 0.40 0.80 +0.40 ✓ win
multi-file-refactor 0.60 0.40 -0.20 ✗ loss
Summary: 2 wins, 1 loss, 0 ties | Mean Δ: +0.267 | g: +0.256 | Status: improved

Wins are highlighted green, losses red, and ties gray. Colors are automatically disabled when output is piped or NO_COLOR is set.

Use --json or --format=json for machine-readable output. Fields use snake_case for Python ecosystem compatibility:

{
  "matched": [
    {
      "test_id": "fix-cwd-bug",
      "score1": 0.0,
      "score2": 0.6,
      "delta": 0.6,
      "normalized_gain": 0.6,
      "outcome": "win"
    }
  ],
  "unmatched": {
    "file1": 0,
    "file2": 0
  },
  "summary": {
    "total": 6,
    "matched": 3,
    "wins": 2,
    "losses": 1,
    "ties": 0,
    "mean_delta": 0.267,
    "mean_normalized_gain": 0.256
  }
}
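Downstream tooling can consume this JSON directly. A small sketch of reading the summary fields (the schema follows the example above; how you capture the CLI's stdout is up to you):

```python
import json

def has_regression(report_json):
    """Return True when the mean delta indicates the candidate regressed."""
    report = json.loads(report_json)
    return report["summary"]["mean_delta"] < 0

def win_rate(report_json):
    """Fraction of matched tests classified as wins."""
    s = json.loads(report_json)["summary"]
    return s["wins"] / s["matched"] if s["matched"] else 0.0
```

For the example report above, `has_regression` is False and the win rate is 2/3.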
Code  Meaning
0     Candidate is equal or better (mean delta >= 0)
1     Baseline is better (regression detected)

Use exit codes to gate CI pipelines — a non-zero exit signals regression.

Compare different model versions:

# Run baseline evaluation
agentv eval evals/*.yaml --target gpt-4 --out baseline.jsonl
# Run candidate evaluation
agentv eval evals/*.yaml --target gpt-4o --out candidate.jsonl
# Compare results
agentv compare baseline.jsonl candidate.jsonl

Compare before/after prompt changes:

# Run with original prompt
agentv eval evals/*.yaml --out before.jsonl
# Modify prompt, then run again
agentv eval evals/*.yaml --out after.jsonl
# Compare with strict threshold
agentv compare before.jsonl after.jsonl --threshold 0.05

Fail CI if the candidate regresses:

#!/bin/bash
agentv compare baseline.jsonl candidate.jsonl
if [ $? -eq 1 ]; then
  echo "Regression detected! Candidate performs worse than baseline."
  exit 1
fi
echo "Candidate is equal or better than baseline."
  • Threshold selection — the default 0.1 means a 10% difference is required for a win or loss. Use stricter thresholds (0.05) for critical evaluations.
  • Normalized gain vs delta — use g to compare across tasks with different baseline difficulty; use Δ for absolute improvement tracking.
  • Unmatched results — check unmatched counts in JSON output to identify tests that only exist in one file.
  • Multiple comparisons — compare against multiple baselines by running the command multiple times.