Glossary

Baseline

A stored run record used as a reference for regression comparison. Each suite uses either the latest passing run or a pinned run as its baseline. See Regression.
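As an illustration of the two baseline choices, the sketch below picks either a pinned run or the latest passing run. The RunSummary type, its field names, and pick_baseline are hypothetical names introduced for this example; the real run record schema and selection logic may differ.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RunSummary:
    # Hypothetical fields; the real run record schema may differ.
    run_id: str
    passed: bool
    timestamp: float  # seconds since the epoch


def pick_baseline(
    runs: Sequence[RunSummary], pinned_id: Optional[str] = None
) -> Optional[RunSummary]:
    """Return the pinned run if one is named, else the most recent passing run."""
    if pinned_id is not None:
        for run in runs:
            if run.run_id == pinned_id:
                return run
        return None  # the pinned run was not found
    passing = [run for run in runs if run.passed]
    # default=None covers the case where no run has passed yet
    return max(passing, key=lambda run: run.timestamp, default=None)
```

Returning None when no baseline exists leaves it to the caller to decide how a first run is treated; the source does not say how the tool handles that case.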

Controls how factuality evaluation extracts claims from text. The shallow mode uses heuristic n-gram overlap (deterministic, no API call); the deep mode uses the LLM judge to extract and verify claims. See Config File.
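To make the shallow mode more concrete, here is a minimal sketch of one possible n-gram overlap score. It is not the tool's actual heuristic, which the glossary does not specify; the function names, whitespace tokenization, and the choice of bigrams are all assumptions made for illustration.

```python
def ngrams(text: str, n: int = 2) -> set:
    """Return the set of word n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(expected: str, actual: str, n: int = 2) -> float:
    """Fraction of the expected text's n-grams that also occur in the actual text."""
    expected_grams = ngrams(expected, n)
    if not expected_grams:
        return 0.0  # too short to form any n-gram
    return len(expected_grams & ngrams(actual, n)) / len(expected_grams)
```

Because the score depends only on the two strings, the same input always yields the same result, which is what makes this kind of check deterministic.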

Deterministic metric

A metric that produces the same score for the same input every time. Deterministic metrics (format, shallow factuality) use rule-based evaluation and require no LLM API call.

Golden set

A YAML file containing a curated collection of test cases. Each test case defines an input, expected output, metrics to evaluate, and optionally an actual output. See Golden Set.

Judge provider

An LLM service used to evaluate factuality and tone metrics. Supported providers: Anthropic, OpenAI, Gemini, Groq, Ollama. See Judge Providers.

LLM-judged metric

A metric that uses an LLM judge to evaluate responses. LLM-judged metrics (deep factuality, tone) capture nuance that deterministic rules miss but require API calls and incur latency and cost.

Quality gate

A configurable threshold that translates evaluation scores into a pass/fail decision. Four gates exist: suite score, max failed cases, low-confidence ratio, and regression status. See Quality Gates.
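The four gates can be pictured as four independent checks whose results are combined into one pass/fail decision. The sketch below assumes hypothetical config names (min_suite_score, max_failed_cases, max_low_confidence_ratio) and default values; none of these are taken from the real config file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateConfig:
    # Hypothetical names and defaults, chosen only for this example.
    min_suite_score: float = 0.8
    max_failed_cases: int = 0
    max_low_confidence_ratio: float = 0.2


def evaluate_gates(
    suite_score: float,
    failed_cases: int,
    low_confidence_ratio: float,
    regression_status: str,
    config: GateConfig = GateConfig(),
) -> dict:
    """Return a pass/fail flag per gate; the run passes only if all are True."""
    return {
        "suite_score": suite_score >= config.min_suite_score,
        "max_failed_cases": failed_cases <= config.max_failed_cases,
        "low_confidence_ratio": low_confidence_ratio <= config.max_low_confidence_ratio,
        # Only a critical regression fails this gate; a warning does not.
        "regression_status": regression_status != "critical",
    }
```

A caller would typically reduce the result with all(results.values()) and keep the per-gate flags for reporting which gate failed.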

Regression status

The classification of a suite's score change relative to its baseline. Three states: clean (within tolerance), warning (exceeds tolerance but stays below the critical threshold), and critical (exceeds the critical threshold; the gate fails). See Regression.
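The three states map naturally onto a two-threshold comparison. The following sketch assumes the score change is measured as a drop from the baseline score, that improvements count as clean, and that a drop exactly equal to a threshold falls in the lower state; the function name and default values are hypothetical.

```python
def classify_regression(
    baseline_score: float,
    current_score: float,
    tolerance: float = 0.02,
    critical_threshold: float = 0.10,
) -> str:
    """Classify a score change as "clean", "warning", or "critical"."""
    drop = baseline_score - current_score  # negative when the score improved
    if drop <= tolerance:
        return "clean"
    if drop <= critical_threshold:
        return "warning"
    return "critical"
```

For example, with these defaults a suite that falls from 0.90 to 0.85 is classified as a warning, while a fall to 0.70 is critical.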

Run record

A JSON file stored in .regtrace/runs/ containing the full evaluation result, config snapshot, metadata, and timestamps. Each regtrace run produces one run record. See Run Record.
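Since run records are plain JSON files, they can be read with standard tooling. The sketch below loads every record under .regtrace/runs/; it assumes the files carry a .json extension and makes no assumption about the keys inside each record. The function name load_run_records is hypothetical.

```python
import json
from pathlib import Path


def load_run_records(root: str = ".") -> list:
    """Load every JSON run record found under <root>/.regtrace/runs/."""
    runs_dir = Path(root) / ".regtrace" / "runs"
    records = []
    # Sorting by file name gives a stable order; it is not necessarily
    # chronological unless file names encode the timestamp.
    for path in sorted(runs_dir.glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            records.append(json.load(handle))
    return records
```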

Suite

A collection of test cases from one golden set file. Each golden set produces one suite. A run can evaluate multiple suites if multiple golden sets are configured.

Test case

A single evaluation unit in a golden set. Defines an input, expected output, enabled metrics, and optionally an actual output. See Golden Set.

Tolerance / Critical threshold

Two-tier regression thresholds in quality gates. Tolerance sets the boundary for a warning status. Critical threshold sets the boundary for a critical status that fails the gate. See Config File.