Regtrace

Architecture & Design Decisions

Why certain design choices were made and what they mean for users

This page explains the rationale behind Regtrace's major design decisions.

Standalone binary over npm package

Regtrace is distributed as a standalone ELF binary, built with bun build --compile, rather than as an npm package or through a language-specific installer.

Why: Eval tools that require a specific runtime (Python, Node, Java) force every CI pipeline to install that runtime first. A standalone binary has zero prerequisites — download, chmod +x, run. This matters most in:

  • CI runners in heterogeneous environments (GitHub Actions, GitLab CI, Jenkins)
  • Teams that don't use the runtime language (e.g., a Go shop adding LLM eval)
  • Security-constrained environments that restrict runtime installation

Trade-off: The binary is ~90 MB because it bundles the Bun runtime. This is acceptable because a CI runner downloads it once per pipeline run and can cache it between runs.

Deterministic fallbacks for LLM-judged metrics

Factuality and tone metrics can use an LLM judge for more nuanced evaluation. When the judge is unavailable (no API key configured, network error, rate limit), these metrics fall back to deterministic heuristics.

Why: An evaluation pipeline should never be blocked by an external API. Deterministic fallbacks guarantee that every regtrace run produces scores, even if every LLM provider is down. The scores are less nuanced than LLM-judged ones, but they're always available.

The fallback happens silently and requires no user configuration. The run record still notes which evaluation mode was used for each metric, so a fallback score can be told apart from an LLM-judged one.
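The fallback behaviour described above can be sketched roughly as follows. This is a minimal illustration, not Regtrace's actual implementation: the names EvalMode, MetricResult, Judge, tokenOverlap, and scoreWithFallback are all hypothetical, and the token-overlap heuristic is only a stand-in for whatever deterministic heuristics Regtrace really uses.

```typescript
// Hypothetical sketch: none of these names come from Regtrace itself.
type EvalMode = "llm-judge" | "deterministic";

interface MetricResult {
  metric: string;
  score: number;  // normalized to [0, 1]
  mode: EvalMode; // recorded so the run record shows how the score was produced
}

// An LLM judge is modelled as an async function that may reject
// (network error, rate limit) or be absent altogether (no API key).
type Judge = (output: string, expected: string) => Promise<number>;

// Stand-in deterministic heuristic: the fraction of expected tokens
// that also appear in the output.
function tokenOverlap(output: string, expected: string): number {
  const tokenize = (s: string): string[] =>
    s.toLowerCase().split(/\W+/).filter((t) => t.length > 0);
  const expectedTokens = tokenize(expected);
  if (expectedTokens.length === 0) return 1;
  const outputTokens = tokenize(output);
  const hits = expectedTokens.filter((t) => outputTokens.indexOf(t) !== -1);
  return hits.length / expectedTokens.length;
}

async function scoreWithFallback(
  metric: string,
  output: string,
  expected: string,
  judge?: Judge,
): Promise<MetricResult> {
  if (judge) {
    try {
      const score = await judge(output, expected);
      return { metric, score, mode: "llm-judge" };
    } catch {
      // Judge unavailable: fall through to the heuristic without failing the run.
    }
  }
  return { metric, score: tokenOverlap(output, expected), mode: "deterministic" };
}
```

In this sketch a score is always returned, and the mode field carries the information that the run record would need to distinguish the two evaluation paths.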

Four-pillar architecture

Regtrace's design decomposes evaluation into four independent concepts:

Pillar | Responsibility
Golden sets | Define what to test (inputs + expected outputs)
Metrics | Define how to score (LLM-judged or deterministic)
Baselines | Define what "good enough" means historically
Quality gates | Define pass/fail thresholds for automation

Each pillar has its own schema, storage, and CLI subcommand. They compose through well-defined interfaces: golden sets feed into metrics, metrics produce scores for baselines, baselines drive quality gates.

This decomposition makes each part testable in isolation and replaceable without affecting the others.
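To make the composition concrete, here is a small sketch of how the four pillars could fit together. Everything in it is an assumption made for illustration: the types GoldenCase, Metric, Baseline, and QualityGate, the maxDrop field, and the functions exactMatch, meanScore, and passesGate are hypothetical and do not reflect Regtrace's real schemas or CLI.

```typescript
// Illustrative sketch only: every type and function name here is hypothetical,
// not Regtrace's real schema or API.

// Pillar 1: golden set, i.e. what to test.
interface GoldenCase {
  input: string;
  expected: string;
}

// Pillar 2: metric, i.e. how to score one output in [0, 1].
type Metric = (output: string, expected: string) => number;

// Pillar 3: baseline, i.e. the historical score to compare against.
interface Baseline {
  meanScore: number;
}

// Pillar 4: quality gate, i.e. the pass/fail rule for automation.
interface QualityGate {
  maxDrop: number; // largest tolerated drop below the baseline mean
}

// A simple deterministic metric: 1 for an exact match, 0 otherwise.
const exactMatch: Metric = (output, expected) =>
  output.trim() === expected.trim() ? 1 : 0;

// Golden sets feed into metrics: run the system under test on each case
// and average the metric's scores.
function meanScore(
  goldenSet: GoldenCase[],
  generate: (input: string) => string,
  metric: Metric,
): number {
  if (goldenSet.length === 0) throw new Error("golden set is empty");
  const total = goldenSet.reduce(
    (sum, c) => sum + metric(generate(c.input), c.expected),
    0,
  );
  return total / goldenSet.length;
}

// Baselines drive quality gates: pass if the current score has not
// dropped more than the tolerated amount below the baseline.
function passesGate(current: number, baseline: Baseline, gate: QualityGate): boolean {
  return current >= baseline.meanScore - gate.maxDrop;
}
```

Because each piece only depends on the small interface of its neighbour, a different Metric or a stricter QualityGate can be swapped in without touching the golden set or the baseline, which is the isolation property described above.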
