Regtrace

Creating a Golden Set

Write your own evaluation test cases

A golden set is a YAML file containing a collection of test cases. Each test case defines an input, an expected output, and which metrics to evaluate.

Structure

name: my-qa-set
version: "1.0"
description: QA test cases for my chatbot
interaction_type: single_turn
tags: [qa, chatbot]
author: you@example.com
created_at: "2026-01-01"
updated_at: "2026-01-01"

test_cases:
  - id: qa-001
    description: Capital of France
    input: "What is the capital of France?"
    system_prompt: null
    expected_output: "Paris is the capital of France."
    actual_output: null
    metrics: [factuality, format, tone]
    tags: [geography]
    weight: 1

Fields

Field             Required  Description
name              Yes       Human-readable identifier
version           Yes       Semantic version for diff tracking
description       Yes       What this set tests
interaction_type  Yes       single_turn or rag
tags              Yes       For filtering during runs
author            Yes       Who owns the set
created_at        Yes       ISO 8601 date
updated_at        Yes       ISO 8601 date

Each test case:

Field            Required        Description
id               Yes             Unique within the set, stable across versions
description      Yes             One-line summary
input            Yes             Prompt or user message
system_prompt    Yes (nullable)  System instruction in effect
expected_output  Yes             Ground truth ideal response
actual_output    Yes (nullable)  Model output; null at authoring time
metrics          Yes             Which metrics to run
tags             Yes             For case-level filtering
weight           Yes             Score multiplier for suite aggregation
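
The field tables above lend themselves to a quick pre-flight check before a run. The sketch below is not part of Regtrace: it assumes the golden set has already been parsed into a Python dict (for example with a YAML library), it assumes that "Yes (nullable)" means the key must be present even when its value is null, and the names `SET_FIELDS`, `CASE_FIELDS`, and `missing_fields` are hypothetical.

```python
# Hypothetical pre-flight check; not part of Regtrace.
# Field names are copied from the tables above.
SET_FIELDS = ["name", "version", "description", "interaction_type",
              "tags", "author", "created_at", "updated_at"]
CASE_FIELDS = ["id", "description", "input", "system_prompt",
               "expected_output", "actual_output", "metrics", "tags", "weight"]


def missing_fields(golden_set: dict) -> list[str]:
    """Return a list of problems: absent keys and duplicate case ids."""
    problems = [f"set: missing '{f}'" for f in SET_FIELDS if f not in golden_set]
    seen_ids = set()
    for index, case in enumerate(golden_set.get("test_cases", [])):
        label = case.get("id", f"#{index}")
        # Assumption: nullable fields must still exist as keys,
        # so this tests membership rather than truthiness.
        problems += [f"case {label}: missing '{f}'"
                     for f in CASE_FIELDS if f not in case]
        if "id" in case:
            if case["id"] in seen_ids:
                problems.append(f"case {label}: duplicate id")
            seen_ids.add(case["id"])
    return problems
```

An empty list means every required key is present and every id is unique within the set; it says nothing about whether the values themselves are valid.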

Providing actual output

When you run regtrace run, the tool evaluates whatever is in actual_output. You have two options:

Set it manually

Fill in the model's response yourself. Metrics run deterministically where possible. This is the fastest path — no API key needed.

actual_output: "Paris is the capital of France."

Leave it null

Regtrace uses the LLM judge to score the expected_output against the input. This requires a configured judge provider.

actual_output: null

All-null actual_output across the entire golden set is rejected. You must either provide inline outputs or configure a judge provider.
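
The rule above can be sketched as a small check. This is an illustration, not Regtrace's implementation: it reads the rule as "an all-null set is rejected when no judge provider is configured", which is an assumption, and the function name `check_actual_outputs` and the `judge_configured` flag are hypothetical.

```python
def check_actual_outputs(test_cases: list[dict],
                         judge_configured: bool) -> tuple[list[str], list[str]]:
    """Split case ids into (inline, needs_judge).

    Raises ValueError when no case has an inline actual_output and no
    judge provider is available (assumed reading of the rule above).
    """
    inline = [c["id"] for c in test_cases if c.get("actual_output") is not None]
    needs_judge = [c["id"] for c in test_cases if c.get("actual_output") is None]
    if not inline and not judge_configured:
        raise ValueError(
            "every actual_output is null: provide inline outputs "
            "or configure a judge provider"
        )
    return inline, needs_judge
```

Returning both lists makes it easy to see, before a run, which cases can be scored without an API key and which depend on the judge.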

RAG test cases

For RAG evaluation, add a context block:

- id: rag-001
  description: API rate limit
  input: "What is the API rate limit?"
  system_prompt: null
  context:
    documents:
      - source: "docs/api-reference.md"
        content: "The API rate limit is 500 requests per minute."
        retrieval_score: 0.94
  expected_output: "500 requests per minute"
  actual_output: null
  metrics: [factuality, format]
  tags: [api]
  weight: 1

Each document in the context includes a source, content, and optional retrieval_score.
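
The document shape described above can be checked with a few lines of code. The following is a minimal sketch, not Regtrace behavior: it assumes the test case is already a Python dict, it treats source and content as mandatory and retrieval_score as an optional number, and the function name `check_context` is hypothetical.

```python
def check_context(case: dict) -> list[str]:
    """Return problems found in a RAG test case's context block."""
    problems = []
    documents = (case.get("context") or {}).get("documents") or []
    if not documents:
        problems.append("context.documents is missing or empty")
    for i, doc in enumerate(documents):
        # source and content are described as always present.
        for key in ("source", "content"):
            if not doc.get(key):
                problems.append(f"documents[{i}]: missing '{key}'")
        # retrieval_score is optional; if given, it should be numeric.
        score = doc.get("retrieval_score")
        if score is not None and not isinstance(score, (int, float)):
            problems.append(f"documents[{i}]: retrieval_score must be a number")
    return problems
```

The sketch deliberately does not enforce a range for retrieval_score, since the page does not state one.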

Versioning

The golden set version follows semver:

  • Patch — typo fixes, clarifying wording without changing meaning
  • Minor — new test cases or meaningful modifications
  • Major — restructuring, removing cases, changing interaction type

Regtrace flags golden set version changes in regression reports so you can distinguish score changes caused by the golden set evolving from changes caused by the model degrading.
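
To make the bump levels concrete, here is a small sketch that classifies the change between two version strings. It is an illustration only and not how Regtrace itself compares versions. The helper names `parse_version` and `bump_kind` are hypothetical, and padding a short version such as "1.0" to 1.0.0 is an assumption made so the example value used on this page parses.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH'; missing parts default to 0 (assumption)."""
    parts = [int(p) for p in version.split(".")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unexpected version format: {version!r}")
    parts += [0] * (3 - len(parts))
    return (parts[0], parts[1], parts[2])


def bump_kind(old: str, new: str) -> str:
    """Return 'major', 'minor', 'patch', or 'none' for old -> new."""
    o, n = parse_version(old), parse_version(new)
    if n == o:
        return "none"
    if n < o:
        raise ValueError(f"version went backwards: {old} -> {new}")
    if n[0] != o[0]:
        return "major"
    if n[1] != o[1]:
        return "minor"
    return "patch"
```

For example, `bump_kind("1.0.1", "1.1.0")` returns `"minor"`, which under the scheme above would correspond to new or meaningfully modified test cases.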

Register the set

Add the golden set to your config:

golden_sets:
  - path: golden-sets/my-set.yaml
    enabled: true
    weight: 1
