Judge Provider Reference
Complete reference for LLM judge providers
Provider support
| Provider | Endpoint | Auth | Status |
|---|---|---|---|
| Anthropic | https://api.anthropic.com/v1/messages | ANTHROPIC_API_KEY | ✅ Stable |
| OpenAI | https://api.openai.com/v1/chat/completions | OPENAI_API_KEY | ✅ Stable |
| Gemini | https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent | GEMINI_API_KEY (via x-goog-api-key header) | ✅ Stable |
| Groq | https://api.groq.com/openai/v1/chat/completions | GROQ_API_KEY | ✅ Stable |
| Ollama | http://localhost:11434/api/chat | None | ✅ Stable |
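As an illustration of how the table above might translate into request construction, here is a minimal sketch. The helper `buildRequestTarget` and its types are hypothetical and not taken from Regtrace. The endpoints come from the table. The `x-api-key`, `anthropic-version`, and `Authorization: Bearer` headers follow each provider's public API conventions and are assumptions about what Regtrace actually sends.

```typescript
// Hypothetical helper: maps a provider to the URL and auth headers for one request.
type Provider = "anthropic" | "openai" | "gemini" | "groq" | "ollama";

interface RequestTarget {
  url: string;
  headers: Record<string, string>;
}

function buildRequestTarget(provider: Provider, model: string, apiKey?: string): RequestTarget {
  const headers: Record<string, string> = { "content-type": "application/json" };
  // Every hosted provider needs a key; Ollama (local) does not.
  const requireKey = (): string => {
    if (!apiKey) throw new Error(`${provider} requires an API key`);
    return apiKey;
  };
  switch (provider) {
    case "anthropic":
      headers["x-api-key"] = requireKey();
      headers["anthropic-version"] = "2023-06-01"; // assumed API version
      return { url: "https://api.anthropic.com/v1/messages", headers };
    case "openai":
      headers["authorization"] = `Bearer ${requireKey()}`;
      return { url: "https://api.openai.com/v1/chat/completions", headers };
    case "gemini":
      // The model name is part of the URL path; the key travels in a header.
      headers["x-goog-api-key"] = requireKey();
      return {
        url: `https://generativelanguage.googleapis.com/v1beta/models/${encodeURIComponent(model)}:generateContent`,
        headers,
      };
    case "groq":
      headers["authorization"] = `Bearer ${requireKey()}`;
      return { url: "https://api.groq.com/openai/v1/chat/completions", headers };
    case "ollama":
      return { url: "http://localhost:11434/api/chat", headers };
  }
}
```

Gemini is the only provider in the table whose endpoint embeds the model name, which is why the sketch takes `model` as a parameter even though the other branches ignore it.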
Provider configuration
Each provider uses a common configuration shape:
judge:
  primary:
    provider: anthropic                # provider name
    model: claude-haiku-4-5-20251001   # model identifier
    temperature: 0.1                   # generation temperature
    max_tokens: 4096                   # max response tokens
    timeout_ms: 30000                  # request timeout
    retry_attempts: 3                  # retries on failure
Judge prompts
Each provider receives identical judge prompts. The prompt structure differs between factuality evaluation and tone evaluation.
Factuality prompt
Sent to the judge when the factuality metric is enabled. The judge receives:
- input: The original prompt
- expected_output: The ground truth
- actual_output: The model's response
- context: Retrieved documents (RAG mode only)
The judge returns a structured response with:
- Claim-by-claim support assessment
- Overall score
- Confidence level
- Explanation
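To make the shape of this exchange concrete, the following is a sketch of what the request and response types could look like. The input field names mirror the list above. Everything on the result side (`claims`, `score`, `confidence`, `explanation`, the 0 to 1 ranges, and the assumption that the judge replies with JSON) is hypothetical and not confirmed by this reference.

```typescript
// Input fields mirror the documented judge inputs.
interface FactualityJudgeInput {
  input: string;
  expected_output: string;
  actual_output: string;
  context?: string[]; // RAG mode only
}

// Hypothetical result shape; field names and ranges are assumptions.
interface ClaimAssessment {
  claim: string;
  supported: boolean;
}

interface FactualityJudgeResult {
  claims: ClaimAssessment[];
  score: number; // assumed range 0..1
  confidence: number; // assumed range 0..1
  explanation: string;
}

const isUnitInterval = (v: unknown): v is number =>
  typeof v === "number" && v >= 0 && v <= 1;

// Parses and validates a raw judge reply, assuming the judge answers in JSON.
function parseFactualityResult(raw: string): FactualityJudgeResult {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("judge response is not a JSON object");
  }
  const obj = data as Record<string, unknown>;
  const rawClaims = obj["claims"];
  const score = obj["score"];
  const confidence = obj["confidence"];
  const explanation = obj["explanation"];

  if (!Array.isArray(rawClaims)) throw new Error("claims must be an array");
  const claims = rawClaims.map((entry: unknown, i: number): ClaimAssessment => {
    const item = (entry ?? {}) as Record<string, unknown>;
    const claim = item["claim"];
    const supported = item["supported"];
    if (typeof claim !== "string" || typeof supported !== "boolean") {
      throw new Error(`claims[${i}] is malformed`);
    }
    return { claim, supported };
  });
  if (!isUnitInterval(score)) throw new Error("score must be a number in [0, 1]");
  if (!isUnitInterval(confidence)) throw new Error("confidence must be a number in [0, 1]");
  if (typeof explanation !== "string") throw new Error("explanation must be a string");
  return { claims, score, confidence, explanation };
}
```

Validating the reply before using it means a malformed judge response surfaces as an explicit error instead of a silently wrong score.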
Tone prompt
Sent to the judge when the tone metric is enabled. The judge receives:
- input: The original prompt
- expected_output: The ground truth
- actual_output: The model's response
- tone_profile: Expected tone description from config
The judge returns a structured assessment for each sub-dimension.
Default provider
Anthropic is the default provider when no judge block is specified.
Retry with exponential backoff
When a provider request fails (network error, rate limit, server error), Regtrace retries with exponential backoff and jitter:
delay = min(1000 × 2^attempt + random(500), 30000)
- Base: 1s, doubles per attempt
- Jitter: up to 500ms random added
- Cap: 30s max delay
- Total attempts: configured by retry_attempts (default 3)
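The formula above can be sketched in a few lines. This is an illustration, not Regtrace's actual code: the function names are hypothetical, `attempt` is assumed to be zero-based, the jitter is assumed to be a uniform fractional value below 500 ms, and the random source is injectable so the result can be checked deterministically.

```typescript
// delay = min(1000 × 2^attempt + random(500), 30000), all values in milliseconds.
// `attempt` is assumed to start at 0 for the first retry delay.
function backoffDelayMs(attempt: number, random: () => number = Math.random): number {
  const base = 1000 * 2 ** attempt; // 1s, 2s, 4s, ...
  const jitter = random() * 500; // up to 500ms
  return Math.min(base + jitter, 30000); // capped at 30s
}

// Assumption: no wait is needed after the final attempt, so N attempts yield N - 1 delays.
function backoffSchedule(retryAttempts: number, random: () => number = Math.random): number[] {
  const waits = Math.max(retryAttempts - 1, 0);
  return Array.from({ length: waits }, (_, i) => backoffDelayMs(i, random));
}
```

With jitter pinned to zero, `backoffSchedule(3, () => 0)` yields `[1000, 2000]`, and `backoffDelayMs(5, () => 0)` hits the 30000 ms cap because 1000 × 2^5 is 32000.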
Fallback judge
When the primary judge provider fails after all retries, Regtrace can fall back to a secondary provider:
judge:
  primary:
    provider: anthropic
    model: claude-haiku-4-5-20251001
    retry_attempts: 3
  fallback:
    provider: openai
    model: gpt-5.4-mini-2026-03-17
    retry_attempts: 2
The fallback is only invoked when the primary exhausts all retries. There is no double-fallback: if the fallback also fails, the metric falls back to heuristic scoring.
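The ordering described here (primary, then fallback, then heuristic) can be sketched as a simple loop. All names below are hypothetical. For brevity the sketch is synchronous and omits the backoff wait between attempts, whereas real provider calls would be asynchronous. It treats `retry_attempts` as the total number of attempts per provider.

```typescript
interface JudgeStage {
  provider: string;
  retryAttempts: number; // total attempts, as in retry_attempts
  call: () => number; // hypothetical: returns a score, throws on failure
}

interface JudgeOutcome {
  score: number;
  source: string; // provider name, or "heuristic"
  attempts: number; // provider calls made in total
}

function scoreWithFallback(
  primary: JudgeStage,
  fallback: JudgeStage | undefined,
  heuristic: () => number,
): JudgeOutcome {
  let attempts = 0;
  const stages = fallback ? [primary, fallback] : [primary];
  for (const stage of stages) {
    for (let i = 0; i < stage.retryAttempts; i++) {
      attempts++;
      try {
        return { score: stage.call(), source: stage.provider, attempts };
      } catch {
        // Failure: try again; a real implementation would wait with backoff here.
      }
    }
  }
  // Both providers exhausted: no second fallback, use heuristic scoring.
  return { score: heuristic(), source: "heuristic", attempts };
}
```

With the configuration above, a run in which Anthropic fails three times and OpenAI succeeds on its second try makes five provider calls in total and reports `openai` as the source.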
API key validation
Regtrace validates required API keys before making any HTTP calls. If a key is missing, the CLI exits immediately with a descriptive error naming the provider and expected environment variable:
Anthropic API key not configured. Set ANTHROPIC_API_KEY or pass apiKey in judge config.
This fail-fast behavior prevents silent degradation: you never discover a missing key halfway through a run after spending time on earlier test cases.
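A fail-fast check of this kind might look like the sketch below. The function `resolveApiKey` and the lookup table are hypothetical. The environment variable names come from the provider table, and the message template is generalized from the Anthropic example above, so the exact wording for other providers is an assumption.

```typescript
// Env var per provider, from the provider support table; null means no key is needed.
const PROVIDER_KEYS: Record<string, { label: string; envVar: string | null }> = {
  anthropic: { label: "Anthropic", envVar: "ANTHROPIC_API_KEY" },
  openai: { label: "OpenAI", envVar: "OPENAI_API_KEY" },
  gemini: { label: "Gemini", envVar: "GEMINI_API_KEY" },
  groq: { label: "Groq", envVar: "GROQ_API_KEY" },
  ollama: { label: "Ollama", envVar: null },
};

// Returns the key to use, or null for keyless providers. Throws before any HTTP call
// if a required key is missing. `configKey` stands in for apiKey from the judge config.
function resolveApiKey(
  provider: string,
  env: Record<string, string | undefined>,
  configKey?: string,
): string | null {
  const spec = PROVIDER_KEYS[provider];
  if (!spec) throw new Error(`Unknown judge provider: ${provider}`);
  if (spec.envVar === null) return null;
  const key = configKey || env[spec.envVar];
  if (!key) {
    throw new Error(
      `${spec.label} API key not configured. Set ${spec.envVar} or pass apiKey in judge config.`,
    );
  }
  return key;
}
```

In a Node.js CLI the `env` argument would typically be `process.env`, and the check would run once for the primary and fallback providers before the first test case starts.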
Error sanitization
API error responses are sanitized before display to prevent credential leaks:
- Response body truncated to 400 characters
- API key patterns redacted (sk-..., gsk_..., AIza...)
- Error format: ${provider} API error ${status}: ${sanitized}
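The sanitization rules above could be implemented roughly as follows. This is a sketch under stated assumptions: the regular expressions are illustrative guesses at what counts as a key (the reference only gives the prefixes), the `[REDACTED]` placeholder is invented, and redacting before truncating is a design choice made here so that a key straddling the 400-character boundary is not partially exposed.

```typescript
// Illustrative patterns for the documented prefixes; the real patterns may differ.
const KEY_PATTERNS: RegExp[] = [
  /\bgsk_[A-Za-z0-9]{8,}/g,
  /\bsk-[A-Za-z0-9_-]{8,}/g,
  /\bAIza[A-Za-z0-9_-]{8,}/g,
];

const MAX_BODY_CHARS = 400;

function sanitizeErrorBody(body: string): string {
  let out = body;
  for (const pattern of KEY_PATTERNS) {
    out = out.replace(pattern, "[REDACTED]");
  }
  // Truncate after redaction so a key cannot be cut in half and slip through.
  return out.slice(0, MAX_BODY_CHARS);
}

function formatApiError(provider: string, status: number, body: string): string {
  return `${provider} API error ${status}: ${sanitizeErrorBody(body)}`;
}
```

For example, `formatApiError("openai", 401, "Incorrect API key provided: sk-abcdefghijklmnop")` produces `openai API error 401: Incorrect API key provided: [REDACTED]`.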
Generation
The same providers are also used by regtrace run --generate to produce
actual_output for test cases with null output. By default the generator uses
judge.primary. Override with an optional generator block in the config file
(see config file reference).
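The selection rule described here is small enough to sketch directly. The types and the function `resolveGenerator` are hypothetical. In particular, placing `generator` at the top level of the config and giving it the same shape as a judge provider block are assumptions, since this page does not show the generator block itself.

```typescript
// Same fields as the judge provider block shown under "Provider configuration".
interface ProviderConfig {
  provider: string;
  model: string;
  temperature?: number;
  max_tokens?: number;
  timeout_ms?: number;
  retry_attempts?: number;
}

// Hypothetical config shape; the position of `generator` is an assumption.
interface RegtraceConfig {
  judge?: { primary?: ProviderConfig; fallback?: ProviderConfig };
  generator?: ProviderConfig;
}

// An explicit generator block wins; otherwise generation reuses judge.primary.
function resolveGenerator(config: RegtraceConfig): ProviderConfig | undefined {
  return config.generator ?? config.judge?.primary;
}
```

The sketch returns `undefined` when neither block is present and leaves the handling of that case to the caller.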
Known limitations
LLM-as-judge evaluation has known biases and failure modes. See Limitations & Caveats for a detailed discussion of verbosity bias, self-preference, position bias, and how Regtrace mitigates them.