Architecture & Design Decisions
Why certain design choices were made and what they mean for users
This page explains the rationale behind Regtrace's major design decisions.
Standalone binary over npm package
Regtrace is distributed as a standalone ELF binary, built with bun build --compile,
rather than as an npm package or language-specific installer.
Why: Eval tools that require a specific runtime (Python, Node, Java) force
every CI pipeline to install that runtime first. A standalone binary has zero
prerequisites — download, chmod +x, run. This matters most in:
- CI runners in heterogeneous environments (GitHub Actions, GitLab CI, Jenkins)
- Teams that don't use the runtime language (e.g., a Go shop adding LLM eval)
- Security-constrained environments that restrict runtime installation
Trade-off: The binary is ~90 MB because it bundles the Bun runtime. This is acceptable because a CI runner downloads it at most once per pipeline run and can cache it after that.
Deterministic fallbacks for LLM-judged metrics
Factuality and tone metrics can use an LLM judge for more nuanced evaluation. When the judge is unavailable (no API key configured, network error, rate limit), these metrics fall back to deterministic heuristics.
Why: An evaluation pipeline should never be blocked by an external API.
Deterministic fallbacks guarantee that every regtrace run produces scores,
even if every LLM provider is down. The scores are less nuanced than
LLM-judged ones, but they're always available.
The fallback is automatic and requires no user configuration. It is not hidden, though: the run record notes which evaluation mode was used for each metric.
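The fallback pattern described above can be sketched roughly as follows. This is a minimal illustration, not Regtrace's actual implementation: the names `scoreFactuality`, `Judge`, `MetricResult`, and `tokenOverlap` are hypothetical, and the token-overlap heuristic is only a stand-in for whatever deterministic heuristic the tool really uses.

```typescript
// Hypothetical types: none of these names come from Regtrace itself.
type EvalMode = "llm-judge" | "deterministic";

interface MetricResult {
  metric: string;
  score: number; // normalized to [0, 1]
  mode: EvalMode; // recorded so a run record can report which path ran
}

// A judge is any async function that scores an output against an expectation.
// It is expected to reject on network errors, rate limits, and similar failures.
type Judge = (output: string, expected: string) => Promise<number>;

// Stand-in heuristic (an assumption): fraction of expected tokens found in the output.
function tokenOverlap(output: string, expected: string): number {
  const tokenize = (s: string): string[] =>
    s.toLowerCase().split(/\W+/).filter((t) => t.length > 0);
  const expectedTokens = tokenize(expected);
  if (expectedTokens.length === 0) return 1;
  const outputTokens = new Set(tokenize(output));
  const hits = expectedTokens.filter((t) => outputTokens.has(t)).length;
  return hits / expectedTokens.length;
}

async function scoreFactuality(
  output: string,
  expected: string,
  judge?: Judge, // undefined models "no API key configured"
): Promise<MetricResult> {
  if (judge) {
    try {
      const score = await judge(output, expected);
      return { metric: "factuality", score, mode: "llm-judge" };
    } catch {
      // Judge failed: fall through to the deterministic path.
    }
  }
  return {
    metric: "factuality",
    score: tokenOverlap(output, expected),
    mode: "deterministic",
  };
}
```

The point of the sketch is that the caller always gets a score back, and the `mode` field keeps the degraded path visible in the results rather than hiding it.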
Four-pillar architecture
Regtrace's design decomposes evaluation into four independent concepts:
| Pillar | Responsibility |
|---|---|
| Golden sets | Define what to test (inputs + expected outputs) |
| Metrics | Define how to score (LLM-judged or deterministic) |
| Baselines | Define what "good enough" means historically |
| Quality gates | Define pass/fail thresholds for automation |
Each pillar has its own schema, storage, and CLI subcommand. The pillars compose through well-defined interfaces: golden sets feed into metrics, metrics produce scores for baselines, and baselines drive quality gates.
This decomposition makes each part testable in isolation and replaceable without affecting the others.
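One way to picture how the four pillars compose is the sketch below. All type and function names are hypothetical and chosen only to mirror the table above; in particular, modeling a quality gate as a maximum allowed drop below the baseline is an assumption, not Regtrace's documented schema.

```typescript
// Golden set: what to test (inputs plus expected outputs).
interface GoldenCase {
  input: string;
  expected: string;
}
type GoldenSet = GoldenCase[];

// Metric: how to score one model output against its expected output, in [0, 1].
type Metric = (output: string, expected: string) => number;

// Baseline: the historical score per metric name.
type Baseline = Record<string, number>;

// Quality gate (assumed shape): largest tolerated drop below the baseline.
interface QualityGate {
  metric: string;
  maxRegression: number;
}

// Golden sets feed into metrics: returns the mean score over the set.
function runMetric(set: GoldenSet, outputs: string[], metric: Metric): number {
  if (set.length === 0) return 0;
  const total = set.reduce(
    (sum, c, i) => sum + metric(outputs[i] ?? "", c.expected),
    0,
  );
  return total / set.length;
}

// Baselines drive quality gates: pass if the score has not regressed too far.
function checkGate(gate: QualityGate, score: number, baseline: Baseline): boolean {
  const reference = baseline[gate.metric] ?? 0;
  return score >= reference - gate.maxRegression;
}
```

Because each piece depends only on the plain data produced by the previous one, a metric can be swapped or unit-tested without touching the golden set, baseline, or gate definitions.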