AI Evals
Small benchmark runner for the four Windmill AI generation modes:
- cli
- flow
- script
- app
The benchmark always tests the current production prompts, tools, and guidance in this checkout.
Each attempt runs:
- the real production path
- deterministic validation
- LLM judging
Install
cd ai_evals
bun install
Frontend modes also require frontend dependencies:
cd frontend
bun install
Commands
List model aliases:
cd ai_evals
bun run cli -- models
List cases:
cd ai_evals
bun run cli -- cases
bun run cli -- cases flow
Run benchmarks:
cd ai_evals
bun run cli -- run flow
bun run cli -- run flow flow-test4-order-processing-loop --model opus
bun run cli -- run flow flow-test0-sum-two-numbers --models haiku,opus,4o
bun run cli -- run flow flow-test0-sum-two-numbers --runs 3 --verbose
bun run cli -- run flow --record
WMILL_AI_EVAL_BACKEND_URL=http://127.0.0.1:8000 bun run cli -- run flow --backend-validation preview
bun run cli -- run cli bun-hello-script
Public CLI surface:
- models
- cases [mode]
- run <mode> [caseIds...]
run options:
- --runs <n>: repeat each case n times
- --output <path>: custom result JSON path
- --model <alias>: choose the model under test
- --models <a,b,c>: run the same cases sequentially against several model aliases
- --verbose: stream assistant output for frontend runs
- --record: append a compact tracked summary line to ai_evals/history/<mode>.jsonl, for full-suite runs only
- --backend-validation <mode>: optional backend smoke validation (off or preview) for script and flow evals
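When the benchmark is launched from another script, it can help to assemble the argument list in one place. The following is a minimal sketch, not part of ai_evals: `RunOptions` and `buildRunArgs` are hypothetical names, and only the flags documented above are used. Rejecting `--record` together with case ids is a choice made by this sketch; how the real CLI reacts to that combination is not stated here.

```typescript
// Hypothetical helper (not part of ai_evals): builds the arguments that
// follow `bun` for a `run` invocation, using only the documented flags.
type RunOptions = {
  mode: "cli" | "flow" | "script" | "app";
  caseIds?: string[];
  runs?: number;
  models?: string[];
  verbose?: boolean;
  record?: boolean;
  backendValidation?: "off" | "preview";
};

function buildRunArgs(opts: RunOptions): string[] {
  const caseIds = opts.caseIds ?? [];
  // --record is documented as full-suite only, so this sketch refuses to
  // combine it with explicit case ids (an assumption about desired behavior).
  if (opts.record && caseIds.length > 0) {
    throw new Error("--record is for full-suite runs only");
  }
  const args = ["run", "cli", "--", "run", opts.mode, ...caseIds];
  if (opts.runs !== undefined) args.push("--runs", String(opts.runs));
  if (opts.models && opts.models.length > 0) {
    args.push("--models", opts.models.join(","));
  }
  if (opts.verbose) args.push("--verbose");
  if (opts.record) args.push("--record");
  if (opts.backendValidation) {
    args.push("--backend-validation", opts.backendValidation);
  }
  return args;
}

// Rebuilds one of the example commands from the Commands section.
const exampleArgs = buildRunArgs({
  mode: "flow",
  caseIds: ["flow-test0-sum-two-numbers"],
  runs: 3,
  verbose: true,
});
console.log(["bun", ...exampleArgs].join(" "));
```

The returned array is meant to be passed to a process spawner with ai_evals as the working directory.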
Models
Use bun run cli -- models to see the current aliases.
Today:
- haiku
- sonnet
- opus
- 4o
- gemini-flash
- gemini-pro
- gemini-3-flash-preview
- gemini-3.1-pro-preview
Notes:
- the command also prints accepted alias spellings such as gpt-4o, claude-opus-4.6, and claude-haiku-4.5
- frontend modes (flow, script, app) can use Anthropic, OpenAI, and Gemini-backed aliases
- cli mode always uses the Anthropic agent SDK, so only Anthropic aliases are valid there
- the judge model is separate and currently defaults to claude-sonnet-4-6
Case Format
Cases live in one YAML file per mode under ai_evals/cases/.
Minimal shape:
- id: flow-test0-sum-two-numbers
  prompt: |-
    Create a flow that takes two numbers, `a` and `b`, and returns their sum.
  initial: ai_evals/fixtures/...
  expected: ai_evals/fixtures/...
Optional fields:
- initial: starting state fixture
- expected: expected artifact fixture
- validate: extra deterministic validation rules
- runtime.backendPreview: optional real backend preview config for smoke validation
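The case fields above can be sketched as a TypeScript type. This is an assumed shape based only on the fields listed in this section, not the loader's actual types: `EvalCase` and `checkCase` are hypothetical names, `initial` and `expected` are assumed to be fixture paths as in the minimal example, and `validate` and `runtime.backendPreview` are left as `unknown` because their schema is not described here.

```typescript
// Assumed shape of one case entry; names and types are illustrative only.
interface EvalCase {
  id: string;
  prompt: string;
  initial?: string; // assumed: path to the starting state fixture
  expected?: string; // assumed: path to the expected artifact fixture
  validate?: unknown; // schema not documented in this section
  runtime?: { backendPreview?: unknown };
}

// Hypothetical sanity check: returns the problems found in a parsed entry.
function checkCase(value: unknown): string[] {
  if (typeof value !== "object" || value === null) {
    return ["case is not an object"];
  }
  const entry = value as Record<string, unknown>;
  const problems: string[] = [];
  for (const key of ["id", "prompt"]) {
    const field = entry[key];
    if (typeof field !== "string" || field.length === 0) {
      problems.push(`${key} must be a non-empty string`);
    }
  }
  for (const key of ["initial", "expected"]) {
    const field = entry[key];
    if (field !== undefined && typeof field !== "string") {
      problems.push(`${key} must be a fixture path`);
    }
  }
  return problems;
}

const minimalCase: EvalCase = {
  id: "flow-test0-sum-two-numbers",
  prompt: "Create a flow that takes two numbers, `a` and `b`, and returns their sum.",
};
console.log(checkCase(minimalCase)); // no problems expected for this entry
```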
For flow mode, validate can express requirements such as:
- accepted input schema shapes
- required results.* reference validity
- required module/code/input characteristics
For flow mode, an initial fixture can also include a benchmark workspace catalog of
existing scripts and flows. That lets the real search_workspace and
get_runnable_details tools discover reusable workspace runnables during evals.
If --backend-validation preview is enabled:
- script evals run a real backend script preview in an isolated temp workspace
- flow evals run a real backend flow preview only for cases that define runtime.backendPreview
- flow cases with initial.workspace fixtures seed those scripts and flows into the preview workspace before preview
- when WMILL_AI_EVAL_BACKEND_WORKSPACE is set, ai_evals treats that workspace as a dedicated test workspace, clears managed eval assets under f/evals/* before each preview run, and then reseeds the current case fixtures
Supported backend validation env vars:
- WMILL_AI_EVAL_BACKEND_VALIDATION=preview
- WMILL_AI_EVAL_BACKEND_URL=http://127.0.0.1:8000
- WMILL_AI_EVAL_BACKEND_EMAIL=admin@windmill.dev
- WMILL_AI_EVAL_BACKEND_PASSWORD=changeme
- WMILL_AI_EVAL_BACKEND_WORKSPACE=integration-tests to reuse an existing workspace on CE installs with low workspace limits
- WMILL_AI_EVAL_KEEP_WORKSPACES=1
- WMILL_AI_EVAL_WORKSPACE_PREFIX=ai-evals
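To make the variables above concrete, here is a minimal sketch of collecting them into a typed object. It is not the loader ai_evals uses: `BackendEnv` and `readBackendEnv` are hypothetical names, and two behaviors are assumptions of the sketch, namely that validation is off when the variable is unset and that only the value 1 enables WMILL_AI_EVAL_KEEP_WORKSPACES.

```typescript
type BackendValidationMode = "off" | "preview";

// Hypothetical container for the documented variables.
interface BackendEnv {
  validation: BackendValidationMode;
  url?: string;
  email?: string;
  password?: string;
  workspace?: string; // when set, an existing workspace is reused
  keepWorkspaces: boolean;
  workspacePrefix?: string;
}

function readBackendEnv(env: Record<string, string | undefined>): BackendEnv {
  // Assumption: validation is off when the variable is unset.
  const raw = env["WMILL_AI_EVAL_BACKEND_VALIDATION"] ?? "off";
  let validation: BackendValidationMode;
  if (raw === "off" || raw === "preview") {
    validation = raw;
  } else {
    throw new Error(`unsupported backend validation mode: ${raw}`);
  }
  return {
    validation,
    url: env["WMILL_AI_EVAL_BACKEND_URL"],
    email: env["WMILL_AI_EVAL_BACKEND_EMAIL"],
    password: env["WMILL_AI_EVAL_BACKEND_PASSWORD"],
    workspace: env["WMILL_AI_EVAL_BACKEND_WORKSPACE"],
    // Assumption: "1" is the only value treated as enabled.
    keepWorkspaces: env["WMILL_AI_EVAL_KEEP_WORKSPACES"] === "1",
    workspacePrefix: env["WMILL_AI_EVAL_WORKSPACE_PREFIX"],
  };
}

// Uses the example values from the list above.
const backendEnv = readBackendEnv({
  WMILL_AI_EVAL_BACKEND_VALIDATION: "preview",
  WMILL_AI_EVAL_BACKEND_URL: "http://127.0.0.1:8000",
  WMILL_AI_EVAL_BACKEND_WORKSPACE: "integration-tests",
});
console.log(backendEnv.validation, backendEnv.url, backendEnv.workspace);
```

In a real script the argument would be `process.env` rather than an inline object.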
Results And Artifacts
Every run writes:
- a summary JSON under ai_evals/results/
- generated artifacts in a sibling directory
If --record is used, the CLI also appends one compact JSON line to:
- ai_evals/history/flow.jsonl
- ai_evals/history/script.jsonl
- ai_evals/history/app.jsonl
- ai_evals/history/cli.jsonl
Each recorded line contains:
- run metadata (createdAt, gitSha, mode, runModel, judgeModel)
- suite totals (caseCount, attemptCount, passedAttempts, passRate, averageDurationMs, averageJudgeScore)
- average token usage (averageTokenUsagePerAttempt)
- per-case metrics under cases[] (averageDurationMs, averageJudgeScore, averageTokenUsagePerAttempt, pass rate)
- failedCaseIds
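Because each recorded line is a standalone JSON object, a history file can be read line by line. The sketch below assumes a subset of the fields listed above with guessed types. `HistoryLine`, `parseHistory`, and `passRatio` are hypothetical names, and the sample data is made up for illustration. The ratio is recomputed from passedAttempts and attemptCount so the sketch does not depend on how passRate is scaled.

```typescript
// Assumed subset of one recorded line; the field types are guesses.
interface HistoryLine {
  createdAt: string;
  gitSha: string;
  runModel: string;
  attemptCount: number;
  passedAttempts: number;
  failedCaseIds: string[];
}

// Parses JSONL text: one JSON object per non-empty line.
function parseHistory(jsonl: string): HistoryLine[] {
  return jsonl
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as HistoryLine);
}

// Fraction of attempts that passed, guarded against empty runs.
function passRatio(line: HistoryLine): number {
  return line.attemptCount === 0 ? 0 : line.passedAttempts / line.attemptCount;
}

// Made-up sample data, not real benchmark results.
const sampleHistory = [
  JSON.stringify({
    createdAt: "2026-01-01T00:00:00.000Z",
    gitSha: "aaaaaaa",
    runModel: "haiku",
    attemptCount: 4,
    passedAttempts: 3,
    failedCaseIds: ["example-case"],
  }),
  JSON.stringify({
    createdAt: "2026-01-02T00:00:00.000Z",
    gitSha: "bbbbbbb",
    runModel: "haiku",
    attemptCount: 4,
    passedAttempts: 4,
    failedCaseIds: [],
  }),
].join("\n");

for (const line of parseHistory(sampleHistory)) {
  console.log(line.gitSha, line.runModel, passRatio(line), line.failedCaseIds);
}
```

With Bun, the text of a real history file could be obtained with `await Bun.file(path).text()` before passing it to `parseHistory`.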
Example:
- summary: ai_evals/results/2026-04-09T09-40-33.051Z__flow.json
- artifacts: ai_evals/results/2026-04-09T09-40-33.051Z__flow/
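In the example above, the artifacts directory has the same name as the summary file without its .json extension. Assuming that naming holds for every run, which the example suggests but does not guarantee, the directory can be derived from the summary path. `artifactsDirFor` is a hypothetical helper, not part of ai_evals.

```typescript
// Derives the sibling artifacts directory from a summary JSON path,
// assuming the naming pattern shown in the example above.
function artifactsDirFor(summaryPath: string): string {
  const extension = ".json";
  if (!summaryPath.endsWith(extension)) {
    throw new Error("expected a path ending in .json");
  }
  return summaryPath.slice(0, summaryPath.length - extension.length) + "/";
}

console.log(
  artifactsDirFor("ai_evals/results/2026-04-09T09-40-33.051Z__flow.json"),
);
```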
Typical artifacts by mode:
- flow: flow.json
- script: script.json plus the generated script file
- app: app.json plus frontend/backend files
- cli: assistant-output.txt plus generated workspace files
- backend-validated attempts also include backend-preview.json
Layout
- cases/: one YAML file per mode
- fixtures/: initial and expected fixtures
- core/: shared loading, model resolution, validation, judging, and result writing
- modes/: one runner per mode
- history/: optional tracked pass-rate history written by run --record, one JSONL file per mode
- results/: local benchmark output and artifacts
Notes
- Frontend modes reuse the production frontend chat code through the Vitest bridge.
- CLI mode creates an isolated workspace, writes the current checkout guidance into it, and benchmarks the real skills / AGENTS.md flow.
- Frontend progress streams live while the benchmark is running.
- Deterministic validators should stay focused on real correctness constraints, not one exact implementation shape.