Regression Detection
Why regression is the core feature and how it works
Regression detection is the most important reason to use Regtrace. It answers a question that single-run evaluations cannot: is quality getting worse?
The problem with absolute scores
A single evaluation produces an absolute score: "suite score: 0.82". Is 0.82 good? It depends entirely on context:
- For a well-tuned production model on easy test cases, 0.82 might be bad
- For a new model on hard test cases, 0.82 might be excellent
- For a model in a controlled benchmark, 0.82 might be expected
Absolute scores are hard to interpret without deep domain knowledge and historical context. Regression detection solves this by comparing against a known-good reference point.
The baseline strategy
Every run produces an evaluation record. The next run compares its score against a baseline, a known-good earlier run, and reports the difference as a delta.
The delta transforms an opaque absolute score into a meaningful comparison, for example: "this run is 16.5% worse than the baseline."
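As a rough illustration of the comparison, the sketch below computes a relative delta between two suite scores. It assumes the delta is relative to the baseline score; Regtrace may define it differently (for example as an absolute difference). The function name and the example scores are hypothetical.

```python
def relative_delta(current: float, baseline: float) -> float:
    """Signed relative change of `current` versus `baseline`.

    Negative values mean the current run scored worse.
    Assumption: the delta is relative to the baseline score.
    """
    if baseline == 0:
        raise ValueError("baseline score must be non-zero for a relative delta")
    return (current - baseline) / baseline


# Illustrative scores only, not taken from a real Regtrace run.
baseline_score = 0.85
current_score = 0.71

delta = relative_delta(current_score, baseline_score)
direction = "worse" if delta < 0 else "better"
print(f"this run is {abs(delta):.1%} {direction} than the baseline")
```

With these example values the drop is about 16.5%, which is the kind of statement the delta makes possible.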
Tolerance vs critical thresholds
Two thresholds control regression behavior:
| Threshold | Effect |
|---|---|
| tolerance (default 5%) | Score drops larger than this trigger a warning |
| critical_threshold (default 15%) | Score drops larger than this trigger a critical failure |
These are separate because they serve different purposes:
- A warning tells you to investigate — quality might be slipping
- A critical failure tells you to stop — quality has degraded unacceptably
The gap between 5% and 15% gives you room to detect trends before they become emergencies.
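One way to picture how the two thresholds partition a score drop is the sketch below. It assumes drops are expressed as positive fractions (0.08 for an 8% drop) and that the comparison is strict; the function name and the status strings are hypothetical, and Regtrace's boundary handling may differ.

```python
def classify_drop(
    drop: float,
    tolerance: float = 0.05,
    critical_threshold: float = 0.15,
) -> str:
    """Map a score drop (positive fraction) to a status.

    Assumption: a drop must strictly exceed a threshold to trigger it.
    """
    if drop > critical_threshold:
        return "critical"
    if drop > tolerance:
        return "warning"
    return "pass"


for drop in (0.03, 0.08, 0.165):
    print(f"{drop:.1%} drop -> {classify_drop(drop)}")
```

An 8% drop lands in the warning band between the two defaults, which is the window for investigating before the drop becomes critical.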
What typically changes
If you track scores over time and a regression appears, the question is: what changed? Common candidates:
The model changed
Model updates, prompt tweaks, system prompt changes, or a new version deployed. The model's behavior shifted.
The golden set changed
New test cases, modified expectations, or version bumps. If the golden set evolved, score changes might reflect different standards rather than model degradation.
The exclude_new_test_cases option helps here — it compares only test cases
that existed in both the baseline and current run.
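The effect of restricting the comparison to shared test cases can be sketched as follows. This is an illustration of the idea only, not Regtrace's implementation: the function name, the per-case score dictionaries, and the use of a plain mean as the suite score are all assumptions.

```python
from statistics import mean


def shared_case_means(
    baseline: dict[str, float],
    current: dict[str, float],
) -> tuple[float, float]:
    """Mean score over test cases present in BOTH runs.

    Assumption: the suite score is a plain mean of per-case scores.
    """
    shared = sorted(baseline.keys() & current.keys())
    if not shared:
        raise ValueError("no test cases are shared between the two runs")
    return (
        mean(baseline[case] for case in shared),
        mean(current[case] for case in shared),
    )


# Illustrative data: tc3 is a new, hard case added after the baseline run.
baseline_scores = {"tc1": 0.9, "tc2": 0.8}
current_scores = {"tc1": 0.9, "tc2": 0.8, "tc3": 0.2}

base_mean, cur_mean = shared_case_means(baseline_scores, current_scores)
naive_mean = mean(current_scores.values())
print(f"shared-only: {base_mean:.2f} -> {cur_mean:.2f}, naive current mean: {naive_mean:.2f}")
```

In this example the naive mean drops only because tc3 was added, while the shared-only comparison shows no change on the cases both runs have in common.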
The judge changed
Switching judge providers or models can change scoring. The judge is a separate system from the model being evaluated, and changes to either system affect the scores.
Regtrace records the judge provider and model in each run record, so you can cross-reference score changes with judge changes.
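A cross-reference of this kind could look like the sketch below. The record layout is an assumption: the field names judge_provider and judge_model, and the example values, are hypothetical placeholders for whatever the run records actually contain.

```python
def judge_changed(baseline_run: dict, current_run: dict) -> bool:
    """Return True if the judge provider or model differs between two runs.

    Assumption: run records expose `judge_provider` and `judge_model` keys.
    """
    judge_fields = ("judge_provider", "judge_model")
    return any(baseline_run.get(f) != current_run.get(f) for f in judge_fields)


# Hypothetical run records for illustration.
baseline_run = {"judge_provider": "provider-a", "judge_model": "judge-v1", "score": 0.85}
current_run = {"judge_provider": "provider-a", "judge_model": "judge-v2", "score": 0.71}

if judge_changed(baseline_run, current_run):
    print("judge differs between runs; the score drop may not reflect the model")
```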
The data changed
RAG context documents, customer input distribution, or environment variables. External factors that affect model output.
Branch-aware baselines
When running in CI on feature branches, the baseline must match the branch
context. Without branch awareness, a feature branch would always compare
against the latest main baseline — even if main moved forward since the
branch was created.
Branch-aware mode (enabled by default via metrics.regression.branch_aware)
solves this:
- Auto-detects the current branch from CI environment variables (GITHUB_HEAD_REF, CI_COMMIT_BRANCH, CIRCLE_BRANCH) or git rev-parse
- Filters baselines to only passing runs from the same branch
- Falls back to the fallback_baseline (default: main) when no same-branch baseline exists
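The branch auto-detection step might be approximated like this. The sketch assumes the environment variables are checked in the order listed and that the git fallback uses `git rev-parse --abbrev-ref HEAD`; both the precedence and the exact git arguments are assumptions, and the function name is hypothetical.

```python
import subprocess
from typing import Mapping, Optional

# CI variables named in the text; the lookup order is an assumption.
CI_BRANCH_VARS = ("GITHUB_HEAD_REF", "CI_COMMIT_BRANCH", "CIRCLE_BRANCH")


def detect_branch(env: Mapping[str, str]) -> Optional[str]:
    """Return the current branch name, or None if it cannot be determined."""
    for name in CI_BRANCH_VARS:
        value = env.get(name)
        if value:  # skip variables that are unset or empty
            return value
    try:
        # Fallback outside CI: ask git for the checked-out branch.
        result = subprocess.run(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip() or None


print(detect_branch({"GITHUB_HEAD_REF": "feature/new-prompt"}))
```

Passing the environment as an argument (for example `os.environ`) keeps the function easy to exercise outside a CI runner.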
Baselines are stored as regular run records in .regtrace/runs/, each of
which includes the originating branch field for cross-referencing.
The branch_aware flag works with both last_passing and pinned baseline
strategies. Pinned baselines are global (same run ID regardless of branch).
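Branch-aware selection under a last-passing strategy could be sketched as below. The record fields (id, branch, passed), the in-memory list ordered oldest to newest, and the function name are all assumptions made for illustration; the actual on-disk record format is not shown here.

```python
from typing import Optional


def select_baseline(
    runs: list[dict],
    branch: str,
    fallback_branch: str = "main",
) -> Optional[dict]:
    """Pick the most recent passing run on `branch`, else on the fallback.

    Assumption: `runs` is ordered oldest to newest and each record has
    hypothetical `branch` and `passed` fields.
    """
    for candidate in (branch, fallback_branch):
        passing = [r for r in runs if r["branch"] == candidate and r["passed"]]
        if passing:
            return passing[-1]
    return None


# Hypothetical run history for illustration.
runs = [
    {"id": "run-1", "branch": "main", "passed": True},
    {"id": "run-2", "branch": "feature/x", "passed": True},
    {"id": "run-3", "branch": "feature/x", "passed": False},
    {"id": "run-4", "branch": "main", "passed": True},
]

print(select_baseline(runs, "feature/x")["id"])  # same-branch passing run
print(select_baseline(runs, "feature/y")["id"])  # falls back to main
```

A pinned strategy would skip this search entirely and look up one fixed run ID, which is why pinned baselines are the same on every branch.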
Multi-level delta reporting
When a regression is detected, Regtrace reports deltas at three levels: suite, per-metric, and per-test-case.
This hierarchy lets you drill in:
- Suite delta tells you if there's a problem
- Per-metric delta tells you which dimension is affected
- Per-test-case delta tells you which input triggers the problem
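The three levels can be illustrated with a small aggregation sketch. It assumes scores are stored per test case and per metric, that deltas are absolute differences, and that each level is a plain mean of the level below; the data layout, metric names, and function name are hypothetical.

```python
from statistics import mean

# test case id -> metric name -> score (hypothetical layout)
Scores = dict[str, dict[str, float]]


def multi_level_deltas(baseline: Scores, current: Scores) -> dict:
    """Compute suite, per-metric, and per-test-case deltas (current - baseline).

    Assumption: every shared test case has at least one shared metric.
    """
    cases = sorted(baseline.keys() & current.keys())
    diffs = {
        c: {m: current[c][m] - baseline[c][m]
            for m in baseline[c].keys() & current[c].keys()}
        for c in cases
    }
    per_case = {c: mean(d.values()) for c, d in diffs.items()}
    metrics = sorted({m for d in diffs.values() for m in d})
    per_metric = {m: mean(d[m] for d in diffs.values() if m in d) for m in metrics}
    return {
        "suite": mean(per_case.values()),
        "per_metric": per_metric,
        "per_test_case": per_case,
    }


# Illustrative data: only tc2's accuracy regressed.
baseline = {"tc1": {"accuracy": 0.9, "tone": 0.8}, "tc2": {"accuracy": 0.9, "tone": 0.8}}
current = {"tc1": {"accuracy": 0.9, "tone": 0.8}, "tc2": {"accuracy": 0.5, "tone": 0.8}}

report = multi_level_deltas(baseline, current)
worst_metric = min(report["per_metric"], key=report["per_metric"].get)
worst_case = min(report["per_test_case"], key=report["per_test_case"].get)
print(f"suite delta {report['suite']:+.2f}, worst metric {worst_metric}, worst case {worst_case}")
```

Reading the report top-down mirrors the drill-in: the suite delta flags a problem, the per-metric view points at accuracy, and the per-test-case view points at tc2.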
The baseline as a decision tool
The real power of regression detection is in automation. A CI pipeline doesn't need to know what a "good" absolute score is. It just needs to know whether quality regressed compared to what was already deployed.
This makes the baseline a decision tool:
- Baseline = "known good" reference
- Current run = "what we're proposing to deploy"
- Regression check = "is the proposal worse than what's running?"
Questions it answers:
- Did the prompt change make things better or worse?
- Did the model update introduce regressions?
- Can we deploy, or should we investigate first?
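In a pipeline, the outcome of the regression check usually has to become a process exit code. The sketch below shows one possible mapping; the status strings, the function name, and the fail_on_warning switch are hypothetical and not part of any documented Regtrace interface.

```python
def ci_exit_code(status: str, fail_on_warning: bool = False) -> int:
    """Translate a regression status into a CI exit code.

    Assumption: statuses are "pass", "warning", or "critical"; a warning
    only blocks the pipeline when `fail_on_warning` is set.
    """
    if status == "critical":
        return 1
    if status == "warning" and fail_on_warning:
        return 1
    if status in ("pass", "warning"):
        return 0
    raise ValueError(f"unknown regression status: {status!r}")


for status in ("pass", "warning", "critical"):
    print(status, "->", ci_exit_code(status))
```

A CI job would pass the returned value to `sys.exit`, so that a critical regression stops the deployment while a warning leaves room to investigate.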