Regtrace

Metrics Reference

Complete reference for all evaluation metrics

Metric types

| Metric | Type | Deterministic | Requires LLM judge |
| --- | --- | --- | --- |
| Factuality | LLM-judged | No | Yes (deep mode) |
| Format | Deterministic | Yes | No |
| Tone | LLM-judged | Partial | Yes |
| Regression | Statistical | Yes | No |

Factuality

Evaluates whether the actual output is factually consistent with the expected output or context.

Modes:

| Mode | Behavior |
| --- | --- |
| strict | Fail on contradicted claims AND unverifiable claims |
| lenient | Fail only on contradicted claims |
| json_structural | Auto-detected when both expected and actual parse as JSON. Compares each leaf value type-appropriately (string exact, number within 0.01, boolean exact). Arrays of primitives use set-based comparison (order-independent), arrays of objects use positional. Reports specific mismatches in explanations. Falls back to n-gram overlap for non-JSON. |

JSON auto-detection triggers in any mode when both expected and actual output are valid JSON. Explicitly setting json_structural forces this comparison and produces a diagnostic error if either value isn't JSON.
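The comparison rules in the table above can be illustrated with a minimal Python sketch. This is not Regtrace's implementation: the names `parse_both`, `compare_json`, and `is_primitive`, the mismatch message format, and the treatment of missing keys are all assumptions made for illustration. The sketch also compares numbers inside arrays of primitives exactly, which is a simplification.

```python
import json

NUMBER_TOLERANCE = 0.01  # numbers match when they differ by at most this much


def parse_both(expected_text, actual_text):
    """Return (expected, actual) when both parse as JSON, else None (hypothetical helper)."""
    try:
        return json.loads(expected_text), json.loads(actual_text)
    except json.JSONDecodeError:
        return None


def is_primitive(value):
    return value is None or isinstance(value, (str, int, float, bool))


def compare_json(expected, actual, path="$"):
    """Return a list of mismatch messages; an empty list means the values agree."""
    mismatch = [f"{path}: expected {expected!r}, got {actual!r}"]
    # bool is a subclass of int in Python, so handle booleans before numbers.
    if isinstance(expected, bool) or isinstance(actual, bool):
        return [] if expected is actual else mismatch
    if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
        return [] if abs(expected - actual) <= NUMBER_TOLERANCE else mismatch
    if isinstance(expected, dict) and isinstance(actual, dict):
        found = []
        for key in sorted(set(expected) | set(actual)):
            child = f"{path}.{key}"
            if key not in actual:
                found.append(f"{child}: missing from actual")
            elif key not in expected:
                found.append(f"{child}: not present in expected")
            else:
                found.extend(compare_json(expected[key], actual[key], child))
        return found
    if isinstance(expected, list) and isinstance(actual, list):
        if all(is_primitive(v) for v in expected + actual):
            # Arrays of primitives: order-independent, set-based comparison.
            # Serializing keeps 1 and true distinct (they hash alike in Python).
            same = {json.dumps(v) for v in expected} == {json.dumps(v) for v in actual}
            return [] if same else mismatch
        # Arrays of objects: positional comparison.
        if len(expected) != len(actual):
            return [f"{path}: expected {len(expected)} items, got {len(actual)}"]
        found = []
        for index, (exp_item, act_item) in enumerate(zip(expected, actual)):
            found.extend(compare_json(exp_item, act_item, f"{path}[{index}]"))
        return found
    # Strings, null, and type mismatches: exact equality.
    return [] if expected == actual else mismatch
```

With this sketch, reordering an array of strings or changing a number by less than 0.01 produces no mismatch, while a changed field inside an array of objects is reported with its path.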

Claim extraction:

| Depth | Behavior |
| --- | --- |
| shallow (default) | Heuristic N-gram overlap. Deterministic, no API call. |
| deep | Use LLM to extract claims (requires configured judge provider). |

If no LLM provider is configured or the API call fails, factuality falls back to heuristic (shallow) scoring regardless of the configured depth.
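The fallback behavior can be sketched as follows. The exact n-gram heuristic is not specified in this reference, so the bigram overlap below is an assumption, as are the function names and the `judge` callable that stands in for a configured LLM provider.

```python
def ngrams(text, n=2):
    """Lowercased word n-grams; short texts collapse to a single gram."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shallow_score(expected, actual, n=2):
    """Fraction of the expected n-grams that also occur in the actual output."""
    expected_grams = ngrams(expected, n)
    if not expected_grams:
        return 1.0  # assumption: nothing to verify counts as a pass
    return len(expected_grams & ngrams(actual, n)) / len(expected_grams)


def factuality_score(expected, actual, depth="shallow", judge=None):
    """Use the judge for deep mode; fall back to the heuristic if it is missing or fails."""
    if depth == "deep" and judge is not None:
        try:
            return judge(expected, actual)
        except Exception:
            pass  # API call failed: fall through to shallow scoring
    return shallow_score(expected, actual)
```

Because the shallow path involves no network call, repeated runs on the same inputs return the same score.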

Scoring:

  • Score = ratio of supported claims to total claims
  • Contradicted and unverifiable claims each contribute to the failure count
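As a rough illustration of how the score and the two modes interact, the sketch below assumes each claim has already been labeled `supported`, `contradicted`, or `unverifiable`. The function name, the result keys, and the choice to return 1.0 when there are no claims are assumptions, not documented behavior.

```python
from collections import Counter


def score_claims(verdicts, mode="strict"):
    """Score a list of per-claim verdicts under strict or lenient mode."""
    counts = Counter(verdicts)
    total = len(verdicts)
    # Score = supported claims / total claims.
    score = counts["supported"] / total if total else 1.0
    # strict fails on contradicted AND unverifiable; lenient only on contradicted.
    failing = counts["contradicted"]
    if mode == "strict":
        failing += counts["unverifiable"]
    return {
        "score": score,
        "contradicted": counts["contradicted"],
        "unverifiable": counts["unverifiable"],
        "passed": failing == 0,
    }
```

Under this reading, an output with one supported and one unverifiable claim scores 0.5 in both modes but only fails in strict mode.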

Format

Deterministic structural checks on the actual output.

Sub-checks:

| Sub-check | What it does |
| --- | --- |
| length | Compares output length to expected; tolerance controls allowed deviation |
| json_validity | Checks if output is valid JSON |
| json_schema | Validates JSON against a schema derived from the expected output |
| markdown_structure | Checks heading hierarchy, list formatting |
| required_fields | Checks specific terms appear in output |
| forbidden_content | Checks specific terms do NOT appear |
| regex_match | Checks output matches a regex pattern (patterns > 500 chars rejected; complex quantified patterns rejected as ReDoS protection) |
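The guard on regex_match might look something like the sketch below. The 500-character limit comes from the table above; the nested-quantifier heuristic, the use of `re.search`, and the returned messages are assumptions, since this reference does not say how "complex quantified patterns" are detected or whether the match is anchored.

```python
import re

MAX_PATTERN_LENGTH = 500
# Assumed heuristic: a group that contains + or * and is itself quantified,
# such as (a+)+, which is a classic source of catastrophic backtracking.
NESTED_QUANTIFIER = re.compile(r"\([^()]*[+*][^()]*\)[+*{]")


def check_regex_match(pattern, output):
    """Return (passed, message) for a guarded regex check."""
    if len(pattern) > MAX_PATTERN_LENGTH:
        return False, "pattern rejected: longer than 500 characters"
    if NESTED_QUANTIFIER.search(pattern):
        return False, "pattern rejected: nested quantifier"
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        return False, f"invalid pattern: {exc}"
    if compiled.search(output):
        return True, "matched"
    return False, "no match"
```

Rejecting suspicious patterns before compiling them keeps a single bad test case from stalling the whole evaluation run.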

Scoring:

  • Score = (passed sub-checks) / (total sub-checks)
  • 1.0 if all enabled sub-checks pass
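The scoring rule can be sketched with three of the sub-checks. The function signature and the plain substring matching for required and forbidden terms are assumptions for illustration; only sub-checks that are enabled count toward the total.

```python
import json


def json_validity(output):
    """True when the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def format_score(output, required=(), forbidden=(), check_json=False):
    """Return (score, per-check results) over the enabled sub-checks only."""
    results = {}
    if check_json:
        results["json_validity"] = json_validity(output)
    if required:
        results["required_fields"] = all(term in output for term in required)
    if forbidden:
        results["forbidden_content"] = not any(term in output for term in forbidden)
    # Score = passed sub-checks / total sub-checks; 1.0 when nothing is enabled.
    score = sum(results.values()) / len(results) if results else 1.0
    return score, results
```

For example, an output that is invalid JSON and contains a forbidden term, but does include the required term, passes one of three enabled sub-checks.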

Tone

Evaluates the tone and voice of the actual output against expectations.

Sub-dimensions:

| Dimension | What it checks |
| --- | --- |
| formality | Formal vs informal register |
| sentiment | Positive, negative, or neutral sentiment |
| assertiveness | Confident vs tentative language |
| persona_consistency | Consistent voice matching the tone profile |
| verbosity | Appropriate level of detail |

Scoring:

Weighted average of active sub-dimensions. Default weights are 1.0 for all.
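A weighted average over the active sub-dimensions could be computed as in this sketch. The function name and the dictionary inputs are hypothetical; any dimension without an explicit weight uses the default of 1.0.

```python
def tone_score(dimension_scores, weights=None):
    """Weighted average of the active sub-dimension scores."""
    weights = weights or {}
    total_weight = sum(weights.get(name, 1.0) for name in dimension_scores)
    if total_weight == 0:
        return 0.0  # assumption: no active dimensions or all weights zero
    weighted_sum = sum(
        score * weights.get(name, 1.0) for name, score in dimension_scores.items()
    )
    return weighted_sum / total_weight
```

With default weights this reduces to a plain mean, so raising one weight shifts the result toward that dimension's score.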

Judge fallback:

When the LLM judge is unavailable, tone uses heuristic rules:

  • formality: Ratio of formal hedging terms
  • sentiment: Keyword-based scoring
  • verbosity: Word and sentence length ratios

Regression

Statistical comparison against a baseline run.

Strategies:

| Strategy | Description |
| --- | --- |
| last_passing | Compare against most recent passing run |
| pinned | Compare against a specific pinned run |

Statuses:

| Status | Condition |
| --- | --- |
| clean | No significant delta or improvement |
| warning | Delta below tolerance (above threshold) |
| critical | Delta below critical_threshold |
| new | No baseline exists for comparison |

Test case deltas:

| Delta | Condition |
| --- | --- |
| improved | Score increased |
| regressed | Score decreased |
| unchanged | Score unchanged |
| new | Test case didn't exist in baseline |
| removed | Test case existed in baseline but not in current |
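The per-test-case labels in the table above can be derived from two score maps, as in this sketch. The function name, the mapping from test case name to score, and the optional `epsilon` for treating tiny differences as unchanged are assumptions.

```python
def classify_deltas(baseline, current, epsilon=0.0):
    """Label each test case by comparing current scores against baseline scores."""
    deltas = {}
    for name in sorted(baseline.keys() | current.keys()):
        if name not in baseline:
            deltas[name] = "new"
        elif name not in current:
            deltas[name] = "removed"
        else:
            diff = current[name] - baseline[name]
            if diff > epsilon:
                deltas[name] = "improved"
            elif diff < -epsilon:
                deltas[name] = "regressed"
            else:
                deltas[name] = "unchanged"
    return deltas
```

Taking the union of both key sets is what lets removed test cases show up even though they have no current score.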

Scoring cascade

Scoring method: weighted_average (default). All weights default to 1.

Assertion-level details

For deterministic metrics (format sub-checks, JSON factuality comparisons), each metric result includes an optional details[] array. Each entry describes a single assertion:

| Field | Type | Description |
| --- | --- | --- |
| check | string | Hierarchical check name (e.g. format.length, json_path.$.amount) |
| passed | boolean | Whether this specific assertion passed |
| expected | string (optional) | Expected value (truncated to 80 chars) |
| actual | string (optional) | Actual value (truncated to 80 chars) |
| message | string (optional) | Human-readable description |

Up to 10 mismatches are reported per metric. Beyond 10, a "+ N more" entry summarizes the rest.
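The two limits (10 entries, 80 characters per value) can be sketched like this. The input shape and the fields of the summary entry, in particular its `check` name, are assumptions; the reference only states that an entry of the form "+ N more" is added.

```python
MAX_DETAILS = 10
MAX_VALUE_CHARS = 80


def build_details(mismatches):
    """Turn raw mismatches into a capped details list with truncated values."""
    details = []
    for item in mismatches[:MAX_DETAILS]:
        details.append({
            "check": item["check"],
            "passed": False,
            "expected": str(item["expected"])[:MAX_VALUE_CHARS],
            "actual": str(item["actual"])[:MAX_VALUE_CHARS],
        })
    remaining = len(mismatches) - MAX_DETAILS
    if remaining > 0:
        # Hypothetical summary entry standing in for the mismatches not listed.
        details.append({"check": "summary", "passed": False, "message": f"+ {remaining} more"})
    return details
```

Capping both the count and the value length keeps reports readable when a large JSON payload diverges in many places.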

In markdown reports, failed details are shown in collapsible sections per metric. In JSON output, the full details[] array is included in the metric_results object.
