Configuring Metrics

Each metric has its own configuration block in regtrace.config.yaml.

Global settings

metrics:
  enabled: [factuality, format, tone, regression]
  default_threshold: 0.7
  default_weight: 1

default_threshold sets the default passing bar for all metrics. With the value above, a score below 0.7 marks the metric as failed.

default_weight controls how much each metric contributes to a test case's aggregate score.
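The exact aggregation formula is not stated here. As a minimal sketch, assuming the aggregate is a weighted mean of the per-metric scores, the effect of default_weight could look like this. The function name aggregate_score and the per-metric weights argument are hypothetical and only serve the illustration.

```python
def aggregate_score(scores, weights=None, default_weight=1.0):
    """Weighted mean of per-metric scores.

    Assumption: the aggregate is a weighted mean. The real tool may
    combine scores differently.
    """
    weights = weights or {}
    weighted_sum = 0.0
    total_weight = 0.0
    for metric, score in scores.items():
        # Metrics without an explicit weight fall back to default_weight.
        w = weights.get(metric, default_weight)
        weighted_sum += w * score
        total_weight += w
    if total_weight == 0:
        raise ValueError("at least one metric must have a non-zero weight")
    return weighted_sum / total_weight


scores = {"factuality": 0.9, "format": 1.0, "tone": 0.5}
print(aggregate_score(scores))                     # every metric weighted equally
print(aggregate_score(scores, {"factuality": 3}))  # factuality counts three times as much
```

With equal weights every metric pulls the aggregate equally; raising one weight moves the aggregate toward that metric's score.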

Per-metric configuration

Factuality

metrics:
  factuality:
    mode: strict           # strict | lenient | json_structural
    claim_extraction_depth: shallow  # shallow | deep
    rag_faithfulness_only: false

strict mode fails on both contradicted claims and unverifiable claims. lenient mode only fails on direct contradictions.
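The difference between the two modes can be sketched as a filter over per-claim verdicts. This is an illustration only: the verdict labels ("supported", "contradicted", "unverifiable") and the function failing_claims are hypothetical names, not part of the tool's documented interface.

```python
def failing_claims(claims, mode="strict"):
    """Return the claims that count as failures under the given mode.

    claims: list of (claim_text, verdict) pairs, where verdict is one of
    'supported', 'contradicted', 'unverifiable' (hypothetical labels).
    """
    if mode == "strict":
        failing_verdicts = {"contradicted", "unverifiable"}
    elif mode == "lenient":
        failing_verdicts = {"contradicted"}
    else:
        raise ValueError(f"unsupported mode for this sketch: {mode!r}")
    return [text for text, verdict in claims if verdict in failing_verdicts]


claims = [
    ("Paris is the capital of France", "supported"),
    ("The library was founded in 1802", "unverifiable"),
    ("The Seine flows through Berlin", "contradicted"),
]
print(failing_claims(claims, "strict"))
print(failing_claims(claims, "lenient"))
```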

json_structural mode enables leaf-value comparison for JSON outputs. When both expected and actual parse as valid JSON, each value is compared according to its type: strings exactly, numbers within a tolerance of 0.01, booleans exactly, arrays of primitives order-independently (set-based), and arrays of objects positionally. Specific mismatches (key, expected, actual) are reported in the explanation instead of a generic overlap score. JSON comparison is auto-detected in any mode when both values are valid JSON; setting json_structural explicitly forces JSON comparison and raises an error if the values don't parse.
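The comparison rules above can be sketched as a small recursive function. This is an assumption-laden illustration, not the tool's implementation: the name compare_json, the "$.key[0]" path notation, and the handling of missing keys are all hypothetical.

```python
import json

NUMBER_TOLERANCE = 0.01  # tolerance taken from the description above


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _is_primitive(value):
    return value is None or isinstance(value, (str, int, float, bool))


def compare_json(expected, actual, path="$"):
    """Return a list of (key, expected, actual) mismatches."""
    if _is_number(expected) and _is_number(actual):
        ok = abs(expected - actual) <= NUMBER_TOLERANCE
        return [] if ok else [(path, expected, actual)]
    if isinstance(expected, dict) and isinstance(actual, dict):
        mismatches = []
        for key in sorted(set(expected) | set(actual)):
            child = f"{path}.{key}"
            if key not in expected or key not in actual:
                # A missing key is reported with None on the missing side.
                mismatches.append((child, expected.get(key), actual.get(key)))
            else:
                mismatches.extend(compare_json(expected[key], actual[key], child))
        return mismatches
    if isinstance(expected, list) and isinstance(actual, list):
        if all(_is_primitive(v) for v in expected + actual):
            # Arrays of primitives: order-independent, set-based.
            # Simplification: numbers inside such arrays are matched exactly.
            same = set(map(json.dumps, expected)) == set(map(json.dumps, actual))
            return [] if same else [(path, expected, actual)]
        if len(expected) != len(actual):
            return [(path, expected, actual)]
        mismatches = []
        for i, (e, a) in enumerate(zip(expected, actual)):
            # Arrays of objects: positional comparison.
            mismatches.extend(compare_json(e, a, f"{path}[{i}]"))
        return mismatches
    # Strings, booleans, null, and type mismatches: exact comparison.
    if type(expected) is type(actual) and expected == actual:
        return []
    return [(path, expected, actual)]


expected = json.loads('{"name": "Ada", "price": 19.99, "tags": ["a", "b"], "items": [{"id": 1}, {"id": 2}]}')
actual = json.loads('{"name": "Ada", "price": 19.995, "tags": ["b", "a"], "items": [{"id": 1}, {"id": 3}]}')
print(compare_json(expected, actual))  # [('$.items[1].id', 2, 3)]
```

In the example, the price difference falls inside the tolerance and the reordered tags match as a set, so only the positional mismatch in items is reported.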

With claim_extraction_depth set to shallow, claims are checked by heuristic n-gram overlap. The check is deterministic and runs entirely locally, so it needs no API call and no API key.

Shallow mode is fast, free, and always available. Use it for daily runs and CI pipelines where speed matters.
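The exact heuristic is not specified here. One common form of n-gram overlap, shown purely as a sketch under that assumption, is the fraction of the expected output's word n-grams that also appear in the actual output. The function names and the default of n=2 are hypothetical.

```python
def ngrams(tokens, n):
    """Set of contiguous n-token tuples; empty if there are fewer than n tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(expected, actual, n=2):
    """Fraction of expected n-grams found in actual, in the range 0.0 to 1.0."""
    expected_tokens = expected.lower().split()
    actual_tokens = actual.lower().split()
    if not expected_tokens:
        return 1.0 if not actual_tokens else 0.0
    # Fall back to shorter n-grams when the expected text is very short.
    n = min(n, len(expected_tokens))
    expected_grams = ngrams(expected_tokens, n)
    actual_grams = ngrams(actual_tokens, n)
    return len(expected_grams & actual_grams) / len(expected_grams)


print(ngram_overlap("the cat sat on the mat", "the cat sat on a rug"))
```

Because the function only splits and compares strings, the same inputs always give the same score, which matches the deterministic, local behavior described above.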

With claim_extraction_depth set to deep, an LLM judge extracts and verifies claims. This requires a configured LLM provider (Groq, Anthropic, etc.). It is more accurate but incurs API costs and latency.

Deep mode calls the LLM judge for every test case. Costs scale with the number of test cases and the length of responses.

Format

metrics:
  format:
    sub_checks:
      length: true
      json_validity: true
      json_schema: true
      markdown_structure: true
      required_fields: true
      forbidden_content: true
      regex_match: true
    length_tolerance: 0.2
    strict_json: false

length_tolerance is a ratio of the expected length. If the expected output is 100 words and the tolerance is 0.2, outputs between 80 and 120 words pass.
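The arithmetic behind the 80 to 120 word window can be sketched as follows. The function name is hypothetical, and the sketch assumes words are split on whitespace and that both ends of the range are inclusive.

```python
def length_within_tolerance(expected_text, actual_text, tolerance=0.2):
    """Check that the actual word count is within +/- tolerance of the expected count."""
    expected_words = len(expected_text.split())
    actual_words = len(actual_text.split())
    low = expected_words * (1 - tolerance)
    high = expected_words * (1 + tolerance)
    return low <= actual_words <= high


expected = " ".join(["word"] * 100)
print(length_within_tolerance(expected, " ".join(["word"] * 90)))   # inside the window
print(length_within_tolerance(expected, " ".join(["word"] * 130)))  # outside the window
```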

Tone

metrics:
  tone:
    tone_profile: null
    sub_dimensions:
      formality: true
      sentiment: true
      assertiveness: true
      persona_consistency: true
      verbosity: true
    sub_dimension_weights:
      formality: 1.0
      sentiment: 1.0

tone_profile is a plain-text description of the expected tone, passed to the LLM judge. Example: "Professional but approachable, avoid jargon."

The tone config block is required by the schema. To skip tone evaluation, disable all sub-dimensions:

metrics:
  tone:
    sub_dimensions:
      formality: false
      sentiment: false
      assertiveness: false
      persona_consistency: false
      verbosity: false

Regression

metrics:
  regression:
    enabled: true
    baseline_strategy: last_passing  # last_passing | pinned
    tolerance: 0.05
    critical_threshold: 0.15
    exclude_new_test_cases: true

See regression detection for details.
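This page does not define how tolerance and critical_threshold are applied. As a rough sketch, assuming both are compared against the drop from the baseline score to the current score, the classification might look like this. The function name and the labels "ok", "regression", and "critical" are hypothetical; the regression detection page is the authority on the real behavior.

```python
def classify_regression(baseline_score, current_score,
                        tolerance=0.05, critical_threshold=0.15):
    """Classify a score change relative to the baseline.

    Assumption: a drop within `tolerance` is ignored, and a drop of at
    least `critical_threshold` is treated as critical.
    """
    drop = baseline_score - current_score
    if drop <= tolerance:
        return "ok"  # improvements and small drops are not regressions
    if drop >= critical_threshold:
        return "critical"
    return "regression"


print(classify_regression(0.90, 0.88))
print(classify_regression(0.90, 0.80))
print(classify_regression(0.90, 0.60))
```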

Per-test-case overrides

Override thresholds for a specific test case in the golden set:

- id: qa-001
  input: "What is 2+2?"
  expected_output: "4"
  actual_output: "4"
  metrics: [factuality, format, tone]
  thresholds:
    factuality: 0.9

This overrides the global default_threshold for factuality on this case only.

Configuration priority

Global config (defaults)
    ↓
Test case thresholds (overrides for one case)
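The priority order above amounts to a simple lookup: use the test case's threshold if it has one for the metric, otherwise fall back to the global default. A minimal sketch, with hypothetical function names and plain dictionaries standing in for the parsed config and golden-set entry:

```python
def resolve_threshold(metric, global_config, test_case):
    """Return the threshold that applies to `metric` for this test case."""
    overrides = test_case.get("thresholds") or {}
    if metric in overrides:
        return overrides[metric]  # a test-case override wins
    return global_config["default_threshold"]


def metric_passes(score, threshold):
    # A score below the threshold marks the metric as failed.
    return score >= threshold


global_config = {"default_threshold": 0.7}
test_case = {"id": "qa-001", "thresholds": {"factuality": 0.9}}

print(resolve_threshold("factuality", global_config, test_case))  # 0.9
print(resolve_threshold("format", global_config, test_case))      # 0.7
```

In this example a factuality score of 0.8 would fail on qa-001 but pass on any case without the override.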
