System Prompt Testing Status
This document describes the benchmark tool that exists today. It is the current
source of truth for ai_evals/.
The longer planning document in system-prompt-testing-plan.md still contains useful background, but parts of its workflow are now historical because the old variants/history system was removed.
Current Tool
There is one repo-level benchmark CLI under ai_evals/ with three commands:
- bun run cli -- models
- bun run cli -- cases [mode]
- bun run cli -- run <mode> [caseIds...]
Supported modes:
- cli
- flow
- script
- app
Public run options:
- --runs <n>
- --output <path>
- --model <alias>
- --verbose
- --record
There is no variant workflow and no compare command in the current tool.
Tracked history is intentionally minimal: run --record appends one compact
summary line to ai_evals/history/<mode>.jsonl. This is only allowed for
full-suite runs, not selected case ids. History lines include average token
usage when the benchmark mode reports it, plus average judge score and per-case
duration/judge/token usage summaries.
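The recording rule above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the type names, field names, and the buildHistoryLine helper are all hypothetical, and the real history schema may differ.

```typescript
// Hypothetical shapes for illustration; the real history schema may differ.
interface CaseResult {
  caseId: string;
  passed: boolean;
  durationMs: number;
  judgeScore: number;
  tokens?: number; // present only when the mode reports token usage
}

interface RunRequest {
  mode: string;
  caseIds: string[]; // empty array means a full-suite run
  record: boolean;
}

// Mean of a list, or undefined when there is nothing to average.
function average(values: number[]): number | undefined {
  if (values.length === 0) return undefined;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Builds one compact JSONL line (no trailing newline) for a recorded run.
// Throws when --record is combined with selected case ids.
function buildHistoryLine(request: RunRequest, results: CaseResult[]): string {
  if (request.record && request.caseIds.length > 0) {
    throw new Error("--record is only allowed for full-suite runs");
  }
  const tokens = results
    .map((r) => r.tokens)
    .filter((t): t is number => t !== undefined);
  const summary = {
    mode: request.mode,
    passRate: average(results.map((r) => (r.passed ? 1 : 0))),
    avgJudgeScore: average(results.map((r) => r.judgeScore)),
    avgTokens: average(tokens), // dropped by JSON.stringify when undefined
    cases: results,
  };
  return JSON.stringify(summary);
}
```

In this sketch, token averages are left out of the line when no case reports token usage, which mirrors the "when the benchmark mode reports it" wording above.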
How It Works
Each attempt runs:
- the current production prompts, tools, and guidance from this checkout
- deterministic validation
- LLM judging
Results are written locally under ai_evals/results/ as:
- a summary JSON file
- a sibling artifacts directory containing the generated flow/script/app/workspace
If --record is used, the CLI also appends a compact JSONL summary line to the
tracked file for that mode under ai_evals/history/.
Current Architecture
- ai_evals/cases/: one YAML manifest per mode
- ai_evals/fixtures/: initial and expected fixtures
- ai_evals/core/: shared case loading, model resolution, validation, judging, and result writing
- ai_evals/history/: optional tracked pass-rate history written by run --record, one JSONL file per mode
- ai_evals/modes/: one runner per mode
Execution model:
- flow, script, and app reuse the production frontend chat loop and production tool definitions through the frontend Vitest bridge
- cli creates a temp workspace, writes the current checkout guidance into it, and runs the Anthropic agent SDK against that workspace
Case Model
Each case is intentionally small:
- prompt
- optional initial
- optional expected
- optional validate
validate is mainly used for stronger deterministic checks where exact fixture
matching would be too strict, especially for flow creation cases.
Examples of current deterministic checks:
- schema contains one of several accepted input shapes
- results.* references resolve
- required code/input characteristics exist in some module
- expected workspace files are created in cli mode
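As a rough illustration of the case shape and of one deterministic check, the sketch below models a case and verifies that results.* references resolve. The EvalCase and FlowModule types, the field types, and the rule that a reference must point at an earlier module are assumptions made for this example, not the validator's actual definitions.

```typescript
// Hypothetical case shape; the actual YAML manifest fields may be typed differently.
interface EvalCase {
  prompt: string;
  initial?: string; // assumed: path to an initial fixture
  expected?: string; // assumed: path to an expected fixture
  validate?: Record<string, unknown>; // assumed: free-form check settings
}

// Hypothetical, simplified view of a generated flow module.
interface FlowModule {
  id: string;
  inputs: string[]; // input expressions that may mention results.<moduleId>
}

// Returns every results.<id> reference that does not point at an earlier module.
function findUnresolvedResultRefs(modules: FlowModule[]): string[] {
  const seen = new Set<string>();
  const unresolved: string[] = [];
  for (const mod of modules) {
    for (const expr of mod.inputs) {
      const pattern = /\bresults\.([A-Za-z_]\w*)/g;
      let match: RegExpExecArray | null;
      while ((match = pattern.exec(expr)) !== null) {
        const ref = match[1];
        if (ref !== undefined && !seen.has(ref)) {
          unresolved.push(`${mod.id} -> results.${ref}`);
        }
      }
    }
    seen.add(mod.id); // later modules may reference this one
  }
  return unresolved;
}
```

A check of this kind accepts any flow whose wiring is valid rather than requiring one exact fixture, which is the property the validate field is meant to provide.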
Model Selection
Model aliases are resolved through a shared registry in ai_evals/core/models.ts.
Current aliases:
- haiku
- sonnet
- opus
- 4o
Notes:
- the models command also shows accepted alias spellings such as gpt-4o and claude-opus-4.6
- frontend modes can use Anthropic and OpenAI-backed aliases
- cli mode is Anthropic-only because it runs through the Anthropic agent SDK
- the judge model is separate and currently defaults to claude-sonnet-4-6
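The alias rules above could look roughly like the sketch below. It does not reproduce ai_evals/core/models.ts: the registry layout, the resolveModel function, and the mapping of each alternate spelling to an alias are assumptions for illustration only.

```typescript
type Provider = "anthropic" | "openai";
type Mode = "cli" | "flow" | "script" | "app";

// Hypothetical registry entry; the real registry may store more fields.
interface ModelEntry {
  alias: string;
  provider: Provider;
  spellings: string[]; // alternate accepted spellings (assumed mapping)
}

const MODEL_REGISTRY: ModelEntry[] = [
  { alias: "haiku", provider: "anthropic", spellings: [] },
  { alias: "sonnet", provider: "anthropic", spellings: [] },
  { alias: "opus", provider: "anthropic", spellings: ["claude-opus-4.6"] },
  { alias: "4o", provider: "openai", spellings: ["gpt-4o"] },
];

// Resolves an alias or alternate spelling and enforces the cli-mode restriction.
function resolveModel(name: string, mode: Mode): ModelEntry {
  const entry = MODEL_REGISTRY.find(
    (m) => m.alias === name || m.spellings.includes(name),
  );
  if (!entry) {
    throw new Error(`Unknown model alias: ${name}`);
  }
  if (mode === "cli" && entry.provider !== "anthropic") {
    throw new Error(`Model ${entry.alias} is not available in cli mode`);
  }
  return entry;
}
```

Keeping the provider on each entry lets the mode restriction live in one place instead of being repeated in every runner.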
What Is Working Well
- one simple local benchmark CLI
- real production execution paths instead of synthetic prompt variants
- local result and artifact persistence by default
- live frontend progress output
- reusable flow/script/app/cli runners under one tool
- deterministic validation can now catch real runtime-invalid flow wiring
What Still Needs Work
- broader case coverage across all four modes
- stronger deterministic validators for more cases, especially app/script semantics
- clearer per-case validation metadata as the corpus grows
- CI automation for smoke and nightly runs
Recommended Next Focus
The next high-value work is:
- add more realistic benchmark cases
- keep simplifying deterministic validators so they check correctness, not one exact implementation
- add CI only after the local benchmark signal is trustworthy