Configuring Metrics
Tune thresholds, weights, and sub-checks per metric
Each metric has its own configuration block in regtrace.config.yaml.
Global settings
metrics:
  enabled: [factuality, format, tone, regression]
  default_threshold: 0.7
  default_weight: 1

default_threshold sets the passing bar for all metrics. A score below 0.7
marks the metric as failed.
default_weight controls how much each metric contributes to a test case's
aggregate score.
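
To make the two global settings concrete, here is a minimal Python sketch. The pass check follows the text above. The aggregate is written as a weighted mean, which is an assumption: the source only says that weight controls how much a metric contributes. The function names and the per-metric `weights` argument are hypothetical, not part of regtrace.

```python
from typing import Dict, Optional


def metric_passes(score: float, threshold: float = 0.7) -> bool:
    # A score below the threshold marks the metric as failed.
    return score >= threshold


def aggregate_score(
    scores: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,  # hypothetical per-metric weights
    default_weight: float = 1.0,
) -> float:
    # Assumption: the aggregate is a weighted mean of the metric scores.
    weights = weights or {}
    total_weight = sum(weights.get(name, default_weight) for name in scores)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(
        score * weights.get(name, default_weight) for name, score in scores.items()
    )
    return weighted_sum / total_weight
```

With every weight left at the default of 1, this reduces to a plain average of the metric scores.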
Per-metric configuration
Factuality
metrics:
  factuality:
    mode: strict # strict | lenient | json_structural
    claim_extraction_depth: shallow # shallow | deep
    rag_faithfulness_only: false

strict mode fails on both contradicted claims and unverifiable claims.
lenient mode only fails on direct contradictions.
json_structural mode enables leaf-value comparison for JSON outputs.
When both expected and actual parse as valid JSON, each value is compared
according to its type: strings must match exactly, numbers must agree
within a 0.01 tolerance, booleans must match exactly, arrays of primitives
are compared order-independently (set-based), and arrays of objects are
compared positionally. Specific mismatches (key, expected, actual) are
reported in the explanation instead of a generic overlap score. JSON
comparison is auto-detected in any mode when both values are valid JSON.
Setting json_structural explicitly forces JSON comparison and raises an
error if the values don't parse.
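
The comparison rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not regtrace's implementation: the function name, the `$.key` path format, the inclusive tolerance boundary, and the handling of arrays of different lengths are all hypothetical.

```python
import json
from typing import Any, List, Tuple

NUMBER_TOLERANCE = 0.01  # tolerance stated in the text above

Mismatch = Tuple[str, Any, Any]  # (key, expected, actual)


def _is_primitive(value: Any) -> bool:
    return value is None or isinstance(value, (str, int, float, bool))


def _set_key(value: Any) -> tuple:
    # Tag with the type name so True and 1 stay distinct inside a set.
    return (type(value).__name__, value)


def compare_leaves(expected: Any, actual: Any, path: str = "$") -> List[Mismatch]:
    # Booleans first, because bool is a subclass of int in Python.
    if isinstance(expected, bool) or isinstance(actual, bool):
        return [] if expected is actual else [(path, expected, actual)]
    if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
        # Assumption: the tolerance boundary is inclusive.
        ok = abs(expected - actual) <= NUMBER_TOLERANCE
        return [] if ok else [(path, expected, actual)]
    if isinstance(expected, dict) and isinstance(actual, dict):
        mismatches: List[Mismatch] = []
        for key in sorted(expected.keys() | actual.keys()):
            child = f"{path}.{key}"
            if key not in expected or key not in actual:
                mismatches.append((child, expected.get(key), actual.get(key)))
            else:
                mismatches += compare_leaves(expected[key], actual[key], child)
        return mismatches
    if isinstance(expected, list) and isinstance(actual, list):
        if all(_is_primitive(v) for v in expected + actual):
            # Arrays of primitives: order-independent, set-based comparison.
            same = {_set_key(v) for v in expected} == {_set_key(v) for v in actual}
            return [] if same else [(path, expected, actual)]
        # Arrays of objects: positional comparison.
        mismatches = []
        if len(expected) != len(actual):
            mismatches.append((f"{path}.length", len(expected), len(actual)))
        for i, (exp_item, act_item) in enumerate(zip(expected, actual)):
            mismatches += compare_leaves(exp_item, act_item, f"{path}[{i}]")
        return mismatches
    # Strings, nulls, and mismatched types: exact equality.
    return [] if expected == actual else [(path, expected, actual)]


expected = json.loads('{"name": "Ada", "score": 0.905, "tags": ["a", "b"], "ok": true}')
actual = json.loads('{"name": "Ada", "score": 0.9, "tags": ["b", "a"], "ok": false}')
mismatches = compare_leaves(expected, actual)
```

In this example only the boolean differs beyond what the rules allow, so `mismatches` holds a single entry for `$.ok`. The score difference falls inside the numeric tolerance and the reordered tags match as a set.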
claim_extraction_depth: shallow uses heuristic n-gram overlap. It is deterministic and runs entirely locally, with no API call and no API key needed. Shallow mode is fast, free, and always available. Use it for daily runs and CI pipelines where speed matters.
claim_extraction_depth: deep uses the LLM judge to extract and verify claims. It requires a configured LLM provider (Groq, Anthropic, etc.). It is more accurate but incurs API costs and latency. Deep mode calls the LLM judge for every test case, so costs scale with the number of test cases and the length of responses.
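
For intuition about what an n-gram overlap heuristic looks like, here is a small Python sketch. The source does not give the exact formula that shallow mode uses, so the choice of bigrams, whitespace tokenization, and scoring by the fraction of expected n-grams found in the actual output are all assumptions made for illustration.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int = 2) -> Set[Tuple[str, ...]]:
    # Lowercase and split on whitespace, then collect every run of n tokens.
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(expected: str, actual: str, n: int = 2) -> float:
    # Fraction of the expected n-grams that also appear in the actual output.
    expected_grams = ngrams(expected, n)
    if not expected_grams:
        # Texts shorter than n tokens: match only if both have no n-grams.
        return 1.0 if not ngrams(actual, n) else 0.0
    return len(expected_grams & ngrams(actual, n)) / len(expected_grams)
```

Because the function is pure and uses no randomness or network calls, the same inputs always give the same score, which is the property that makes a shallow check suitable for CI.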
Format
metrics:
  format:
    sub_checks:
      length: true
      json_validity: true
      json_schema: true
      markdown_structure: true
      required_fields: true
      forbidden_content: true
      regex_match: true
    length_tolerance: 0.2
    strict_json: false

length_tolerance is a ratio of the expected length. If the expected output
is 100 words and the tolerance is 0.2, outputs between 80 and 120 words pass.
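
The arithmetic can be sketched in Python as below. The function name is hypothetical, and counting words by splitting on whitespace and treating the bounds as inclusive are assumptions. The source only states the 80 to 120 range for a 100-word expectation.

```python
def length_within_tolerance(expected: str, actual: str, tolerance: float = 0.2) -> bool:
    # Assumption: length is measured in whitespace-separated words.
    expected_words = len(expected.split())
    actual_words = len(actual.split())
    # Allowed deviation is a ratio of the expected length (0.2 -> 20 words per 100).
    return abs(actual_words - expected_words) <= tolerance * expected_words


reference = " ".join(["word"] * 100)
short_ok = length_within_tolerance(reference, " ".join(["word"] * 80))
too_long = length_within_tolerance(reference, " ".join(["word"] * 121))
```

Here `short_ok` is true because 80 words sits on the lower bound, and `too_long` is false because 121 words exceeds the upper bound of 120.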
Tone
metrics:
  tone:
    tone_profile: null
    sub_dimensions:
      formality: true
      sentiment: true
      assertiveness: true
      persona_consistency: true
      verbosity: true
    sub_dimension_weights:
      formality: 1.0
      sentiment: 1.0

tone_profile is a plain-text description of the expected tone, passed to
the LLM judge. Example: "Professional but approachable, avoid jargon."
The tone config block is required by the schema. To skip tone evaluation,
disable all sub-dimensions:
metrics:
  tone:
    sub_dimensions:
      formality: false
      sentiment: false
      assertiveness: false
      persona_consistency: false
      verbosity: false

Regression
metrics:
  regression:
    enabled: true
    baseline_strategy: last_passing # last_passing | pinned
    tolerance: 0.05
    critical_threshold: 0.15
    exclude_new_test_cases: true

See regression detection for details.
Per-test-case overrides
Override thresholds for a specific test case in the golden set:
- id: qa-001
  input: "What is 2+2?"
  expected_output: "4"
  actual_output: "4"
  metrics: [factuality, format, tone]
  thresholds:
    factuality: 0.9

This overrides the global default_threshold for factuality on this case only.
Configuration priority
Global config (defaults)
↓
Test case thresholds (overrides for one case)
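
The priority order above amounts to a simple lookup, sketched here in Python. The function name and the dictionary shapes are hypothetical. They mirror the YAML keys shown on this page rather than any documented regtrace API.

```python
from typing import Any, Dict


def resolve_threshold(metric: str, global_config: Dict[str, Any], test_case: Dict[str, Any]) -> float:
    # A threshold set on the test case wins for that case only.
    case_thresholds = test_case.get("thresholds", {})
    if metric in case_thresholds:
        return case_thresholds[metric]
    # Otherwise fall back to the global default.
    return global_config["metrics"]["default_threshold"]


global_config = {"metrics": {"default_threshold": 0.7, "default_weight": 1}}
test_case = {"id": "qa-001", "thresholds": {"factuality": 0.9}}

factuality_bar = resolve_threshold("factuality", global_config, test_case)
format_bar = resolve_threshold("format", global_config, test_case)
```

For qa-001, factuality is held to 0.9 while format, which has no override, stays at the global 0.7.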