Merge branch 'main' into reload-setting-load-split

chore(main): release 1.683.0 (#8802)
* chore(main): release 1.683.0
* Apply automatic changes
---------
Co-authored-by: rubenfiszel <275584+rubenfiszel@users.noreply.github.com>
2026-04-13 18:31:13 -04:00 · 2026-04-13 22:21:47 +00:00 · 2026-04-13 22:05:09 +00:00 · 2026-04-13 21:50:50 +00:00 · 2026-04-13 20:35:01 +00:00 · 2026-04-13 20:31:28 +00:00
582 changed files with 30588 additions and 7337 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -25,6 +25,8 @@ rust-client/Cargo.toml
 backend/target
 frontend/node_modules
 typescript-client/node_modules
+ai_evals/node_modules
+ai_evals/results/
 frontend/.svelte-kit
 backend/chrome_profiler.json
 .fast-check/
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,89 @@
 # Changelog

+## [1.683.0](https://github.com/windmill-labs/windmill/compare/v1.682.0...v1.683.0) (2026-04-13)
+
+
+### Features
+
+* add black-box ai eval benchmarks ([#8618](https://github.com/windmill-labs/windmill/issues/8618)) ([cdcc564](https://github.com/windmill-labs/windmill/commit/cdcc56461b77554964622f490ae901f170886595))
+* add deploy restriction rule and fork review requests ([#8804](https://github.com/windmill-labs/windmill/issues/8804)) ([64c58c8](https://github.com/windmill-labs/windmill/commit/64c58c824fcefe00f15405b7e3877eb566a3ffa2))
+* allow non-admins to create and edit HTTP triggers ([#8810](https://github.com/windmill-labs/windmill/issues/8810)) ([9fb7816](https://github.com/windmill-labs/windmill/commit/9fb78164b4baa14c10d10f91ae969d48590c29f3))
+* display agent message in flow graph ([#8806](https://github.com/windmill-labs/windmill/issues/8806)) ([95411b2](https://github.com/windmill-labs/windmill/commit/95411b256332fa41816a93b19906f1534da9b300))
+* folder default_permissioned_as rules for ownership defaults on deploy ([#8801](https://github.com/windmill-labs/windmill/issues/8801)) ([60211c1](https://github.com/windmill-labs/windmill/commit/60211c1d1910b5f7ac6fed112f790201d2047a4c))
+* instance-level ruff config auto-pulled by LSP container ([#8803](https://github.com/windmill-labs/windmill/issues/8803)) ([3f5841f](https://github.com/windmill-labs/windmill/commit/3f5841f84d878cd3f43c435fa237d3f0c2265fb9))
+
+
+### Bug Fixes
+
+* **cli:** make cli help resilient to npm registry fetch failures ([#8809](https://github.com/windmill-labs/windmill/issues/8809)) ([b6f1cc7](https://github.com/windmill-labs/windmill/commit/b6f1cc70cd87c61df7112d3838fbb5fe9bcdc145))
+* enrich OTEL log records with per-request LogContext ([#8812](https://github.com/windmill-labs/windmill/issues/8812)) ([42d3e8c](https://github.com/windmill-labs/windmill/commit/42d3e8c7893cd959c7faffd19cd210c869c604f8))
+* silence user-facing toast for non-critical hub script tracking error ([#8808](https://github.com/windmill-labs/windmill/issues/8808)) ([378ba78](https://github.com/windmill-labs/windmill/commit/378ba7828456c871b5778f1144c4bb559bd5a733))
+
+
+### Performance Improvements
+
+* add inline-persist fast path for WAC v2 step() ([#8807](https://github.com/windmill-labs/windmill/issues/8807)) ([b3ef4bc](https://github.com/windmill-labs/windmill/commit/b3ef4bc26c5696624efee89b5e4e33e77e10cf15))
+
+## [1.682.0](https://github.com/windmill-labs/windmill/compare/v1.681.0...v1.682.0) (2026-04-10)
+
+
+### Features
+
+* enrich hanging flow error with worker and service log info ([#8800](https://github.com/windmill-labs/windmill/issues/8800)) ([59c457a](https://github.com/windmill-labs/windmill/commit/59c457a13881e35c229baed3edd87e618f89b9a0))
+
+
+### Bug Fixes
+
+* bypass OTEL MITM tracing proxy for git sync jobs ([#8796](https://github.com/windmill-labs/windmill/issues/8796)) ([9c85565](https://github.com/windmill-labs/windmill/commit/9c855652212dbac0e49f87dedd447d3d7d7b500a))
+* show full path on hover in deploy drawer and widen drawer ([#8799](https://github.com/windmill-labs/windmill/issues/8799)) ([b783bf2](https://github.com/windmill-labs/windmill/commit/b783bf2d835cde0843739f7d1099193bb0af042e))
+
+## [1.681.0](https://github.com/windmill-labs/windmill/compare/v1.680.0...v1.681.0) (2026-04-10)
+
+
+### Features
+
+* add CI test scripts with auto-trigger on deploy ([#8736](https://github.com/windmill-labs/windmill/issues/8736)) ([c57c769](https://github.com/windmill-labs/windmill/commit/c57c769deaa207e7ba7995f75649d3630774e898))
+* add edit yaml button to raw app settings ([#8771](https://github.com/windmill-labs/windmill/issues/8771)) ([b73be37](https://github.com/windmill-labs/windmill/commit/b73be37916de808dc64bec1337edf6e7d3993c5e))
+* add user offboarding flow with object reassignment ([#8647](https://github.com/windmill-labs/windmill/issues/8647)) ([435b25e](https://github.com/windmill-labs/windmill/commit/435b25e6a4c7272c0189cbcfb83526379f41ebf0))
+* allow selecting hub flows as raw app backend runnables ([#8772](https://github.com/windmill-labs/windmill/issues/8772)) ([5f57727](https://github.com/windmill-labs/windmill/commit/5f57727a4d956a9066b005b3c55f08dd6780475a))
+* list external JWT tokens in instance settings ([#8783](https://github.com/windmill-labs/windmill/issues/8783)) ([ce3e676](https://github.com/windmill-labs/windmill/commit/ce3e676f4ab0c442058c64db4ebf35545a805ef5))
+* oauth manual connect option ([#8770](https://github.com/windmill-labs/windmill/issues/8770)) ([4b87639](https://github.com/windmill-labs/windmill/commit/4b876392a0ce41ae42bd882ced10fe0187e532bc))
+* unify CLI config to workspaces, deprecate gitBranches/environments ([#8767](https://github.com/windmill-labs/windmill/issues/8767)) ([5b97092](https://github.com/windmill-labs/windmill/commit/5b9709299761b83a88df17a4259c431dfcd244f9))
+* **vault:** add skip_ssl_verify option for HashiCorp Vault ([#8791](https://github.com/windmill-labs/windmill/issues/8791)) ([6cf7ffc](https://github.com/windmill-labs/windmill/commit/6cf7ffc26bcbc8f4ef0e4ad2879fcd114332c4e2))
+
+
+### Bug Fixes
+
+* bypass sql type injection during formatting to prevent offset corruption ([#8786](https://github.com/windmill-labs/windmill/issues/8786)) ([8957d8f](https://github.com/windmill-labs/windmill/commit/8957d8f19bce3430871c2858b3accd53e0be178f))
+* CLI falls back to workspace whoami for workspace-scoped tokens ([#8789](https://github.com/windmill-labs/windmill/issues/8789)) ([d243eb3](https://github.com/windmill-labs/windmill/commit/d243eb31b014781a249f903b2a467aa58909ddd6))
+* disable scroll-to-change-number on number inputs ([#8777](https://github.com/windmill-labs/windmill/issues/8777)) ([e63924e](https://github.com/windmill-labs/windmill/commit/e63924e3778b40486813192dc2913e565e0a765e))
+* error on flow/app folder suffix format mismatch during sync push/pull ([#8775](https://github.com/windmill-labs/windmill/issues/8775)) ([1deb31f](https://github.com/windmill-labs/windmill/commit/1deb31f1e01d6168eee3c2cc242cb483272d1965))
+* flow dev page layout and compact toolbar improvements ([#8776](https://github.com/windmill-labs/windmill/issues/8776)) ([89920e7](https://github.com/windmill-labs/windmill/commit/89920e77f3f5dc45db939ec938d92c881dccc8a0))
+* Flow status viewer layout nits (avoid excess y space and scroll) ([#8780](https://github.com/windmill-labs/windmill/issues/8780)) ([6d36eca](https://github.com/windmill-labs/windmill/commit/6d36eca21684f9d3ab36658c2b66f85b9be8d331))
+* flow step testing UX improvements ([#8781](https://github.com/windmill-labs/windmill/issues/8781)) ([3fb557a](https://github.com/windmill-labs/windmill/commit/3fb557a7f51dbbd3fac445734196f1b9a1d2e287))
+* hide legacy global_settings.worker_configs ghost row ([#8790](https://github.com/windmill-labs/windmill/issues/8790)) ([4fff89f](https://github.com/windmill-labs/windmill/commit/4fff89f98ce72997a055cc313c8fe217d2f1fe78))
+* limit multi-runnable dedicated workers to one job at a time ([#8782](https://github.com/windmill-labs/windmill/issues/8782)) ([946848f](https://github.com/windmill-labs/windmill/commit/946848feef60aba2a54bc2f5b686b33cc96ec9ef))
+* normalize multi-word pg types in build_parameters to fix float8 serialization ([#8778](https://github.com/windmill-labs/windmill/issues/8778)) ([3d02be9](https://github.com/windmill-labs/windmill/commit/3d02be98f748d985f688243f3215d15ca4227f8f))
+* refresh custom instance user password if auth failed ([#8787](https://github.com/windmill-labs/windmill/issues/8787)) ([3d43d31](https://github.com/windmill-labs/windmill/commit/3d43d31aba276f23903f16f06035a4c4955b52e2))
+* treat empty global setting strings as unset ([#8793](https://github.com/windmill-labs/windmill/issues/8793)) ([ec9cec1](https://github.com/windmill-labs/windmill/commit/ec9cec1d02d87328db92a71a1b3a945e9e0c6bd2))
+* zero-downtime coordinated restarts for OTEL and other setting changes ([#8768](https://github.com/windmill-labs/windmill/issues/8768)) ([506b7f5](https://github.com/windmill-labs/windmill/commit/506b7f55e17472d1384e9676c1b6df7a9d7a118b))
+
+## [1.680.0](https://github.com/windmill-labs/windmill/compare/v1.679.0...v1.680.0) (2026-04-08)
+
+
+### Features
+
+* add CLI workspace merge command and enhance fork with datatable/color support ([#8756](https://github.com/windmill-labs/windmill/issues/8756)) ([4342c18](https://github.com/windmill-labs/windmill/commit/4342c1854134500d3b2bc46280f9885ee84e2c9e))
+* add scheduled job deletion with configurable retention period ([#8753](https://github.com/windmill-labs/windmill/issues/8753)) ([2d18a68](https://github.com/windmill-labs/windmill/commit/2d18a680991babe317ca315bbce40e6ce733afda))
+* add status indicator dots to parallel loop iteration picker ([#8761](https://github.com/windmill-labs/windmill/issues/8761)) ([470b8aa](https://github.com/windmill-labs/windmill/commit/470b8aa5f1870e26fea022c1e2a9f48471d8a205))
+
+
+### Bug Fixes
+
+* move alert config from config table to global_settings ([#8762](https://github.com/windmill-labs/windmill/issues/8762)) ([fa66870](https://github.com/windmill-labs/windmill/commit/fa668707c0ee7f261d78e145666b1073471259fd))
+* resolve esbuild host/binary version mismatch in app sync push ([#8765](https://github.com/windmill-labs/windmill/issues/8765)) ([e36d440](https://github.com/windmill-labs/windmill/commit/e36d440a251a43ea888e3ce378d0bb8ed8f42e11))
+* skip serializing ws_specific on resources when false ([#8764](https://github.com/windmill-labs/windmill/issues/8764)) ([c69f10d](https://github.com/windmill-labs/windmill/commit/c69f10d20dd064f0c329934096c2945424ff81f2))
+
 ## [1.679.0](https://github.com/windmill-labs/windmill/compare/v1.678.0...v1.679.0) (2026-04-07)


--- a/ai_evals/.gitignore
+++ b/ai_evals/.gitignore
@@ -0,0 +1,2 @@
+.env
+results/
--- a/ai_evals/AGENTS.md
+++ b/ai_evals/AGENTS.md
@@ -0,0 +1,172 @@
+# AI Evals Authoring Guide
+
+This folder contains black-box benchmark cases for:
+
+- `flow`
+- `app`
+- `script`
+- `cli`
+
+The goal is to test the current production prompts and guidance with realistic user requests, not to test one exact implementation shape.
+
+## Core rules
+
+1. Write prompts like a real user request.
+2. Prefer behavior, inputs, constraints, and outcomes over internal implementation details.
+3. Keep deterministic validation narrow and hard.
+4. Put semantic expectations in `judgeChecklist`.
+5. Use `expected` fixtures only when exact structure really matters.
+
+## Prompt writing
+
+Prompts should sound like something a user would naturally ask.
+
+Good:
+
+- "Create a flow that routes support requests based on customer tier."
+- "Add a reset button that sets the counter back to 0."
+- "Create a flow that reuses the existing greeting script instead of duplicating the logic."
+
+Bad:
+
+- "Use `branchone` with 3 branches and a default branch."
+- "Create a `rawscript` step with this exact topology."
+- "This is a benchmark harness."
+
+Do not write prompts as if the user knows Windmill internals unless the case is explicitly testing a power-user workflow.
+
+## Flow-specific rules
+
+The main principle for flow cases:
+
+- flow prompts should read like requests from a user who does not know the product internals
+- the user should ask for behavior, not for `branchone`, `branchall`, `rawscript`, `preprocessor_module`, `failure_module`, exact graph topology, or other internal constructs
+
+That means:
+
+- creation cases should describe the business behavior and expected result
+- modification cases may mention existing step names, because the user can see the current flow
+- only mention special Windmill constructs when the case is explicitly about those constructs
+
+Examples:
+
+- acceptable creation prompt:
+  "Create a purchase approval flow that pauses for approval and asks the approver for a comment."
+- avoid:
+  "Create a suspend step with one required event and a resume form."
+
+For flow cases, do not fail a case just because the model chose a different valid topology.
+
+## App-specific rules
+
+App prompts should focus on user-visible behavior:
+
+- what the UI should let the user do
+- what should persist
+- what backend behavior is needed
+
+Avoid prompting in terms of React structure, component names, or implementation unless the case is specifically about editing an existing app.
+
+## CLI-specific rules
+
+CLI prompts can be more explicit about paths and file names because real CLI users often do specify them.
+
+Still, avoid benchmark phrasing. The prompt should read like a repo task, not a harness instruction.
+
+When relevant, ask the assistant to tell the user which `wmill` commands to run next. That is part of the benchmarked behavior.
+
+## Deterministic validation
+
+Use deterministic validation only for hard failures such as:
+
+- missing required files
+- unexpected extra files when the prompt says not to create them
+- syntax errors
+- unresolved flow refs
+- missing required special modules or suspend config
+- obvious artifact corruption
+
+Do not use deterministic validation to enforce one preferred implementation for broad creation tasks.
+
+Examples of bad hard checks:
+
+- exact step topology for a creation flow
+- exact branch structure when the prompt only asked for routing behavior
+- exact input shape when multiple reasonable shapes are acceptable
+
+## Judge checklist
+
+Every non-trivial case should have a `judgeChecklist`.
+
+The checklist should capture:
+
+- the user-visible behavior that must be present
+- important constraints
+- key completion criteria
+
+The checklist should not duplicate low-level implementation details unless they are truly required by the task.
+
+Good checklist items:
+
+- "the flow calculates the order total with 8% tax"
+- "the app persists recipes appropriately for a raw Windmill app"
+- "the flow reuses the existing workspace script instead of rewriting the logic"
+
+Bad checklist items:
+
+- "uses `branchone`"
+- "contains a `rawscript` node"
+
+## When to use `expected`
+
+Use `expected` fixtures when the case is structure-sensitive, for example:
+
+- exact file creation
+- exact script content
+- modification cases where a specific file must change in a specific way
+- cases where preserving an existing structure is part of the requirement
+
+Do not use a full `expected` artifact as the semantic oracle for broad creation tasks when multiple valid outputs should pass.
+
+## When to use `initial`
+
+Use `initial` when the benchmark is about:
+
+- editing an existing artifact
+- reusing existing workspace assets
+- preserving existing behavior while adding a change
+
+If the case is greenfield, prefer no `initial`.
+
+## Case design ladder
+
+Prefer suites that get gradually harder:
+
+1. trivial create case
+2. realistic create case
+3. reuse-existing-assets case
+4. modification case
+5. refactor case
+6. edge-case or niche product behavior
+
+The last cases in a suite should cover unusual or product-specific behavior.
+
+## Anti-patterns
+
+Avoid these:
+
+- benchmark framing in prompts
+- over-specified internal topology for creation tasks
+- judge checklists that just restate implementation details
+- deterministic validation that encodes one preferred solution
+- fixtures that are so minimal or brittle that they create false negatives
+
+## Before adding a case
+
+Ask:
+
+1. Would a real user plausibly write this prompt?
+2. If the model solves it in a different valid way, would the case still pass?
+3. Are the hard deterministic checks only catching objectively broken output?
+4. Does the `judgeChecklist` describe the real success criteria?
+5. If this case fails, will the reason be understandable from the saved artifacts?
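Some of the authoring rules in `AGENTS.md` above could be checked mechanically before a case is added. The sketch below is a hypothetical lint helper and is not part of this change: `EvalCaseSketch`, `INTERNAL_TERMS`, and `lintCase` are illustrative names, and the real case schema may carry more fields. It only produces warnings, since the guide allows internal constructs when a case is explicitly about them.

```typescript
// Hypothetical case shape. Field names follow the guide (prompt, initial,
// expected, judgeChecklist); the real schema may differ.
interface EvalCaseSketch {
  id: string;
  prompt: string;
  initial?: string;
  expected?: string;
  judgeChecklist?: string[];
}

// Internal constructs the guide says prompts and checklists should not name.
const INTERNAL_TERMS = [
  "branchone",
  "branchall",
  "rawscript",
  "preprocessor_module",
  "failure_module",
];

function lintCase(c: EvalCaseSketch): string[] {
  const warnings: string[] = [];
  const prompt = c.prompt.toLowerCase();

  // Flag prompts that read like Windmill internals instead of a user request.
  for (const term of INTERNAL_TERMS) {
    if (prompt.includes(term)) {
      warnings.push(`prompt names internal construct: ${term}`);
    }
  }
  // Flag benchmark framing such as "This is a benchmark harness."
  if (prompt.includes("benchmark")) {
    warnings.push("prompt uses benchmark framing");
  }

  // Every non-trivial case should have a judgeChecklist.
  const checklist = c.judgeChecklist ?? [];
  if (checklist.length === 0) {
    warnings.push("missing judgeChecklist");
  }
  // Checklist items should describe behavior, not implementation details.
  for (const item of checklist) {
    for (const term of INTERNAL_TERMS) {
      if (item.toLowerCase().includes(term)) {
        warnings.push(`checklist item restates implementation detail: ${term}`);
      }
    }
  }
  return warnings;
}

const good = lintCase({
  id: "flow-routing-example",
  prompt: "Create a flow that routes support requests based on customer tier.",
  judgeChecklist: ["the flow routes requests by customer tier"],
});

const bad = lintCase({
  id: "flow-overspecified-example",
  prompt: "Use `branchone` with 3 branches and a default branch.",
});

console.log(good, bad);
```

The two sample cases reuse the "Good" and "Bad" prompts from the guide; the case ids are made up for illustration.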
--- a/ai_evals/CLAUDE.md
+++ b/ai_evals/CLAUDE.md
@@ -0,0 +1 @@
+@AGENTS.md
--- a/ai_evals/README.md
+++ b/ai_evals/README.md
@@ -0,0 +1,176 @@
+# AI Evals
+
+Small benchmark runner for the four Windmill AI generation modes:
+
+- `cli`
+- `flow`
+- `script`
+- `app`
+
+The benchmark always tests the current production prompts, tools, and guidance in this checkout.
+
+Each attempt runs:
+
+1. the real production path
+2. deterministic validation
+3. LLM judging
+
+## Install
+
+```bash
+cd ai_evals
+bun install
+```
+
+Frontend modes also require frontend dependencies:
+
+```bash
+cd frontend
+bun install
+```
+
+## Commands
+
+List model aliases:
+
+```bash
+cd ai_evals
+bun run cli -- models
+```
+
+List cases:
+
+```bash
+cd ai_evals
+bun run cli -- cases
+bun run cli -- cases flow
+```
+
+Run benchmarks:
+
+```bash
+cd ai_evals
+bun run cli -- run flow
+bun run cli -- run flow flow-test4-order-processing-loop --model opus
+bun run cli -- run flow flow-test0-sum-two-numbers --models haiku,opus,4o
+bun run cli -- run flow flow-test0-sum-two-numbers --runs 3 --verbose
+bun run cli -- run flow --record
+bun run cli -- run cli bun-hello-script
+```
+
+Public CLI surface:
+
+- `models`
+- `cases [mode]`
+- `run <mode> [caseIds...]`
+
+`run` options:
+
+- `--runs <n>`: repeat each case `n` times
+- `--output <path>`: custom result JSON path
+- `--model <alias>`: choose the model under test
+- `--models <a,b,c>`: run the same cases sequentially against several model aliases
+- `--verbose`: stream assistant output for frontend runs
+- `--record`: append a compact tracked summary line to `ai_evals/history/<mode>.jsonl` for full-suite runs only
+
+## Models
+
+Use `bun run cli -- models` to see the current aliases.
+
+Today:
+
+- `haiku`
+- `sonnet`
+- `opus`
+- `4o`
+- `gemini-flash`
+- `gemini-pro`
+- `gemini-3-flash-preview`
+- `gemini-3.1-pro-preview`
+
+Notes:
+
+- the command also prints accepted alias spellings such as `gpt-4o`, `claude-opus-4.6`, and `claude-haiku-4.5`
+- frontend modes (`flow`, `script`, `app`) can use Anthropic, OpenAI, and Gemini-backed aliases
+- `cli` mode always uses the Anthropic agent SDK, so only Anthropic aliases are valid there
+- the judge model is separate and currently defaults to `claude-sonnet-4-6`
+
+## Case Format
+
+Cases live in one YAML file per mode under `ai_evals/cases/`.
+
+Minimal shape:
+
+```yaml
+- id: flow-test0-sum-two-numbers
+  prompt: |-
+    Create a flow that takes two numbers, `a` and `b`, and returns their sum.
+  initial: ai_evals/fixtures/...
+  expected: ai_evals/fixtures/...
+```
+
+Optional fields:
+
+- `initial`: starting state fixture
+- `expected`: expected artifact fixture
+- `validate`: extra deterministic validation rules
+
+For `flow` mode, `validate` can express requirements such as:
+
+- accepted input schema shapes
+- required `results.*` reference validity
+- required module/code/input characteristics
+
+For `flow` mode, an `initial` fixture can also include a benchmark workspace catalog of
+existing scripts and flows. That lets the real `search_workspace` and
+`get_runnable_details` tools discover reusable workspace runnables during evals.
+
+## Results And Artifacts
+
+Every run writes:
+
+- a summary JSON under `ai_evals/results/`
+- generated artifacts in a sibling directory
+
+If `--record` is used, the CLI also appends one compact JSON line to:
+
+- `ai_evals/history/flow.jsonl`
+- `ai_evals/history/script.jsonl`
+- `ai_evals/history/app.jsonl`
+- `ai_evals/history/cli.jsonl`
+
+Each recorded line contains:
+
+- run metadata (`createdAt`, `gitSha`, `mode`, `runModel`, `judgeModel`)
+- suite totals (`caseCount`, `attemptCount`, `passedAttempts`, `passRate`, `averageDurationMs`, `averageJudgeScore`)
+- average token usage (`averageTokenUsagePerAttempt`)
+- per-case metrics under `cases[]` (`averageDurationMs`, `averageJudgeScore`, `averageTokenUsagePerAttempt`, pass rate)
+- `failedCaseIds`
+
+Example:
+
+- summary: `ai_evals/results/2026-04-09T09-40-33.051Z__flow.json`
+- artifacts: `ai_evals/results/2026-04-09T09-40-33.051Z__flow/`
+
+Typical artifacts by mode:
+
+- `flow`: `flow.json`
+- `script`: `script.json` plus the generated script file
+- `app`: `app.json` plus frontend/backend files
+- `cli`: `assistant-output.txt` plus generated workspace files
+
+## Layout
+
+- `cases/`: one YAML file per mode
+- `fixtures/`: initial and expected fixtures
+- `core/`: shared loading, model resolution, validation, judging, and result writing
+- `modes/`: one runner per mode
+- `history/`: optional tracked pass-rate history written by `run --record`, one JSONL file per mode
+- `results/`: local benchmark output and artifacts
+
+## Notes
+
+- Frontend modes reuse the production frontend chat code through the Vitest bridge.
+- CLI mode creates an isolated workspace, writes the current checkout guidance into it, and benchmarks the real skills / `AGENTS.md` flow.
+- Frontend progress streams live while the benchmark is running.
+- Deterministic validators should stay focused on real correctness constraints, not one exact implementation shape.
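The README above describes `--record` as appending one compact JSON line per full-suite run to `ai_evals/history/<mode>.jsonl`. As a rough illustration of how that history could be consumed, the sketch below parses JSONL text and reports cases that fail in the latest line but not in the previous one. `HistoryLineSketch`, `parseHistory`, and `newlyFailingCases` are hypothetical names, only a subset of the documented fields is modelled, and the sample lines are made up.

```typescript
// Subset of the documented per-line fields; the real records carry more
// (suite totals, token usage, per-case metrics).
interface HistoryLineSketch {
  createdAt: string;
  gitSha: string;
  mode: string;
  failedCaseIds: string[];
}

// Parse JSONL text: one JSON object per non-empty line.
function parseHistory(jsonl: string): HistoryLineSketch[] {
  return jsonl
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as HistoryLineSketch);
}

// Lines are appended, so the last entry is assumed to be the most recent run.
function newlyFailingCases(history: HistoryLineSketch[]): string[] {
  if (history.length < 2) {
    return [];
  }
  const previous = new Set(history[history.length - 2].failedCaseIds);
  const latest = history[history.length - 1];
  return latest.failedCaseIds.filter((id) => !previous.has(id));
}

// Made-up sample history with two recorded runs.
const sampleJsonl =
  [
    JSON.stringify({
      createdAt: "2026-04-09T09:40:33.051Z",
      gitSha: "aaaaaaa",
      mode: "flow",
      failedCaseIds: [],
    }),
    JSON.stringify({
      createdAt: "2026-04-10T09:40:33.051Z",
      gitSha: "bbbbbbb",
      mode: "flow",
      failedCaseIds: ["flow-test4-order-processing-loop"],
    }),
  ].join("\n") + "\n";

const regressions = newlyFailingCases(parseHistory(sampleJsonl));
console.log(regressions);
```

Comparing `failedCaseIds` rather than `passRate` sidesteps the question of how the pass rate is scaled, which the README does not state.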
--- a/ai_evals/adapters/cli/runtime.test.ts
+++ b/ai_evals/adapters/cli/runtime.test.ts
@@ -0,0 +1,72 @@
+import { describe, expect, it } from "bun:test";
+import {
+  anthropicUsageToBenchmarkTokenUsage,
+  extractCliResultTokenUsage,
+} from "./runtime";
+
+describe("anthropicUsageToBenchmarkTokenUsage", () => {
+  it("includes cache tokens in prompt usage", () => {
+    expect(
+      anthropicUsageToBenchmarkTokenUsage({
+        input_tokens: 120,
+        output_tokens: 45,
+        cache_creation_input_tokens: 30,
+        cache_read_input_tokens: 5,
+      })
+    ).toEqual({
+      prompt: 155,
+      completion: 45,
+      total: 200,
+    });
+  });
+
+  it("returns null when usage is absent", () => {
+    expect(anthropicUsageToBenchmarkTokenUsage(null)).toBeNull();
+  });
+});
+
+describe("extractCliResultTokenUsage", () => {
+  it("reads aggregate usage from the SDK result event", () => {
+    expect(
+      extractCliResultTokenUsage({
+        type: "result",
+        usage: {
+          input_tokens: 400,
+          output_tokens: 120,
+          cache_creation_input_tokens: 50,
+          cache_read_input_tokens: 25,
+        },
+      })
+    ).toEqual({
+      prompt: 475,
+      completion: 120,
+      total: 595,
+    });
+  });
+
+  it("falls back to modelUsage when aggregate usage is unavailable", () => {
+    expect(
+      extractCliResultTokenUsage({
+        type: "result",
+        modelUsage: {
+          opus: {
+            inputTokens: 200,
+            outputTokens: 60,
+            cacheCreationInputTokens: 10,
+            cacheReadInputTokens: 5,
+          },
+          haiku: {
+            inputTokens: 80,
+            outputTokens: 20,
+            cacheCreationInputTokens: 0,
+            cacheReadInputTokens: 15,
+          },
+        },
+      })
+    ).toEqual({
+      prompt: 310,
+      completion: 80,
+      total: 390,
+    });
+  });
+});
--- a/ai_evals/adapters/cli/runtime.ts
+++ b/ai_evals/adapters/cli/runtime.ts
@@ -0,0 +1,199 @@
+import { query, type Options } from "@anthropic-ai/claude-agent-sdk";
+import { join } from "path";
+import { fileURLToPath } from "url";
+import { getCliEvalModel, resolveEvalModel, type CliEvalModelConfig } from "../../core/models";
+import type { BenchmarkTokenUsage } from "../../core/types";
+
+export interface ToolInvocation {
+  tool: string;
+  input: Record<string, unknown>;
+  timestamp: number;
+}
+
+export interface PromptRunResult {
+  toolsUsed: ToolInvocation[];
+  skillsInvoked: string[];
+  output: string;
+  durationMs: number;
+  assistantMessageCount: number;
+  tokenUsage: BenchmarkTokenUsage | null;
+}
+
+interface AnthropicUsageLike {
+  input_tokens?: number | null;
+  output_tokens?: number | null;
+  cache_creation_input_tokens?: number | null;
+  cache_read_input_tokens?: number | null;
+}
+
+interface AnthropicModelUsageLike {
+  inputTokens?: number | null;
+  outputTokens?: number | null;
+  cacheCreationInputTokens?: number | null;
+  cacheReadInputTokens?: number | null;
+}
+
+interface CliResultMessageLike {
+  type?: string;
+  usage?: AnthropicUsageLike | null;
+  modelUsage?: Record<string, AnthropicModelUsageLike> | null;
+}
+
+const REPO_ROOT = fileURLToPath(new URL("../../../", import.meta.url));
+export const DEFAULT_CLI_EVAL_MODEL: CliEvalModelConfig = getCliEvalModel(resolveEvalModel("cli"));
+
+export function getGeneratedSkillsSource(): string {
+  return join(REPO_ROOT, "system_prompts", "auto-generated", "skills");
+}
+
+export function anthropicUsageToBenchmarkTokenUsage(
+  usage: AnthropicUsageLike | null | undefined
+): BenchmarkTokenUsage | null {
+  if (!usage) {
+    return null;
+  }
+
+  const prompt =
+    (usage.input_tokens ?? 0) +
+    (usage.cache_creation_input_tokens ?? 0) +
+    (usage.cache_read_input_tokens ?? 0);
+  const completion = usage.output_tokens ?? 0;
+
+  return {
+    prompt,
+    completion,
+    total: prompt + completion,
+  };
+}
+
+export function extractCliResultTokenUsage(message: unknown): BenchmarkTokenUsage | null {
+  if (!message || typeof message !== "object") {
+    return null;
+  }
+
+  const resultMessage = message as CliResultMessageLike;
+  if (resultMessage.type !== "result") {
+    return null;
+  }
+
+  const usage = anthropicUsageToBenchmarkTokenUsage(resultMessage.usage);
+  if (usage) {
+    return usage;
+  }
+
+  if (!resultMessage.modelUsage || typeof resultMessage.modelUsage !== "object") {
+    return null;
+  }
+
+  let prompt = 0;
+  let completion = 0;
+  let sawModelUsage = false;
+
+  for (const modelUsage of Object.values(resultMessage.modelUsage)) {
+    if (!modelUsage || typeof modelUsage !== "object") {
+      continue;
+    }
+
+    prompt +=
+      (modelUsage.inputTokens ?? 0) +
+      (modelUsage.cacheCreationInputTokens ?? 0) +
+      (modelUsage.cacheReadInputTokens ?? 0);
+    completion += modelUsage.outputTokens ?? 0;
+    sawModelUsage = true;
+  }
+
+  if (!sawModelUsage) {
+    return null;
+  }
+
+  return {
+    prompt,
+    completion,
+    total: prompt + completion,
+  };
+}
+
+export async function runPromptAndCapture(
+  prompt: string,
+  cwd: string,
+  maxTurns: number = 3,
+  modelConfig: CliEvalModelConfig = DEFAULT_CLI_EVAL_MODEL
+): Promise<PromptRunResult> {
+  const toolsUsed: ToolInvocation[] = [];
+  const skillsInvoked: string[] = [];
+  let output = "";
+  let assistantMessageCount = 0;
+  let tokenUsage: BenchmarkTokenUsage | null = null;
+  const startedAt = Date.now();
+
+  const options: Options = {
+    cwd,
+    model: modelConfig.model,
+    maxTurns,
+    settingSources: ["project"],
+    allowedTools: ["Skill", "Read", "Glob", "Grep", "Bash", "Write", "Edit"]
+  };
+
+  for await (const message of query({ prompt, options })) {
+    if (message.type === "assistant") {
+      assistantMessageCount += 1;
+      const content = message.message?.content;
+      if (Array.isArray(content)) {
+        for (const block of content) {
+          if (block.type === "tool_use") {
+            toolsUsed.push({
+              tool: block.name,
+              input: block.input as Record<string, unknown>,
+              timestamp: Date.now()
+            });
+
+            if (block.name === "Skill" && typeof block.input === "object" && block.input !== null) {
+              const skillInput = block.input as { skill?: string };
+              if (skillInput.skill) {
+                skillsInvoked.push(skillInput.skill);
+              }
+            }
+          } else if (block.type === "text") {
+            output += block.text;
+          }
+        }
+      }
+    } else if (message.type === "result") {
+      const resultMessage = message as { result?: string };
+      tokenUsage = extractCliResultTokenUsage(message) ?? tokenUsage;
+      if (typeof resultMessage.result === "string") {
+        output += resultMessage.result;
+      }
+    }
+  }
+
+  return {
+    toolsUsed,
+    skillsInvoked,
+    output,
+    durationMs: Date.now() - startedAt,
+    assistantMessageCount,
+    tokenUsage,
+  };
+}
+
+export function wasSkillInvoked(result: PromptRunResult, skillName: string): boolean {
+  return result.skillsInvoked.some((skill) => skill === skillName || skill.includes(skillName));
+}
+
+export function wasToolUsed(result: PromptRunResult, toolName: string): boolean {
+  return result.toolsUsed.some((tool) => tool.tool === toolName);
+}
+
+export function formatCliRunModelLabel(modelConfig: CliEvalModelConfig): string {
+  return `${modelConfig.provider}:${modelConfig.model}`;
+}
+
+export function getToolInputs(
+  result: PromptRunResult,
+  toolName: string
+): Record<string, unknown>[] {
+  return result.toolsUsed
+    .filter((tool) => tool.tool === toolName)
+    .map((tool) => tool.input);
+}
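One behavior of the token accounting in `runtime.ts` above may be worth a test of its own: because the aggregate path only returns `null` for a missing `usage` object, an empty but non-null `usage` yields all zeros, and the `modelUsage` fallback is never reached. The sketch below is a condensed standalone copy of that logic, not an import of the module; `fromAggregate` and `fromResult` are illustrative names, and the `type === "result"` guard is omitted for brevity.

```typescript
interface UsageLike {
  input_tokens?: number | null;
  output_tokens?: number | null;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

interface ModelUsageLike {
  inputTokens?: number | null;
  outputTokens?: number | null;
  cacheCreationInputTokens?: number | null;
  cacheReadInputTokens?: number | null;
}

interface TokenUsage {
  prompt: number;
  completion: number;
  total: number;
}

// Cache creation and cache read tokens count toward the prompt total.
function fromAggregate(usage: UsageLike | null | undefined): TokenUsage | null {
  if (!usage) {
    return null;
  }
  const prompt =
    (usage.input_tokens ?? 0) +
    (usage.cache_creation_input_tokens ?? 0) +
    (usage.cache_read_input_tokens ?? 0);
  const completion = usage.output_tokens ?? 0;
  return { prompt, completion, total: prompt + completion };
}

// Prefer aggregate usage; otherwise sum the per-model entries.
function fromResult(message: {
  usage?: UsageLike | null;
  modelUsage?: Record<string, ModelUsageLike> | null;
}): TokenUsage | null {
  const aggregate = fromAggregate(message.usage);
  if (aggregate) {
    return aggregate;
  }
  const entries = Object.values(message.modelUsage ?? {});
  if (entries.length === 0) {
    return null;
  }
  let prompt = 0;
  let completion = 0;
  for (const m of entries) {
    prompt +=
      (m.inputTokens ?? 0) +
      (m.cacheCreationInputTokens ?? 0) +
      (m.cacheReadInputTokens ?? 0);
    completion += m.outputTokens ?? 0;
  }
  return { prompt, completion, total: prompt + completion };
}

const perModel = { opus: { inputTokens: 200, outputTokens: 60 } };

// No aggregate usage: the per-model numbers are used.
const fallback = fromResult({ modelUsage: perModel });

// Empty aggregate usage: zeros win and the per-model numbers are ignored.
const shadowed = fromResult({ usage: {}, modelUsage: perModel });

console.log(fallback, shadowed);
```

Whether the SDK ever emits an empty `usage` object alongside a populated `modelUsage` is not established here; if it does, treating an all-zero aggregate as absent would be one possible adjustment.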
--- a/ai_evals/adapters/frontend/benchmarkRunner.ts
+++ b/ai_evals/adapters/frontend/benchmarkRunner.ts
@@ -0,0 +1,87 @@
+import { loadSelectedCases } from "../../core/cases";
+import {
+  formatRunModelLabel,
+  getFrontendEvalModel,
+  resolveEvalModel,
+} from "../../core/models";
+import { buildRunResult } from "../../core/results";
+import { runSuite } from "../../core/runSuite";
+import type { BenchmarkRunResult, ModeRunner } from "../../core/types";
+import { emitFrontendBenchmarkProgress } from "./progress";
+import { createAppModeRunner } from "../../modes/app";
+import { createFlowModeRunner } from "../../modes/flow";
+import { createScriptModeRunner } from "../../modes/script";
+import { DEFAULT_JUDGE_MODEL } from "../../core/judge";
+
+export type FrontendBenchmarkMode = "flow" | "app" | "script";
+
+export async function runFrontendBenchmarkFromEnv(): Promise<BenchmarkRunResult> {
+  const mode = parseMode(process.env.WMILL_FRONTEND_AI_EVAL_MODE);
+  const caseIds = parseOptionalJsonStringArray(process.env.WMILL_FRONTEND_AI_EVAL_CASE_IDS);
+  const runs = parsePositiveInteger(process.env.WMILL_FRONTEND_AI_EVAL_RUNS, "WMILL_FRONTEND_AI_EVAL_RUNS");
+  const emitProgress = process.env.WMILL_FRONTEND_AI_EVAL_PROGRESS === "1";
+  const verbose = process.env.WMILL_FRONTEND_AI_EVAL_VERBOSE === "1";
+  const model = resolveEvalModel(mode, process.env.WMILL_FRONTEND_AI_EVAL_MODEL);
+
+  const selectedCases = await loadSelectedCases(mode, caseIds);
+  const modeRunner = getModeRunner(mode, getFrontendEvalModel(model));
+  const runModel = formatRunModelLabel(mode, model);
+  const caseResults = await runSuite({
+    modeRunner,
+    cases: selectedCases,
+    runs,
+    runModel,
+    judgeModel: DEFAULT_JUDGE_MODEL,
+    concurrency: verbose ? 1 : undefined,
+    verbose,
+    onProgress: emitProgress ? (event) => emitFrontendBenchmarkProgress(event) : undefined,
+  });
+
+  return buildRunResult({
+    mode,
+    runs,
+    runModel,
+    judgeModel: DEFAULT_JUDGE_MODEL,
+    caseResults,
+  });
+}
+
+function getModeRunner(
+  mode: FrontendBenchmarkMode,
+  model: ReturnType<typeof getFrontendEvalModel>
+): ModeRunner<any, any, any> {
+  switch (mode) {
+    case "flow":
+      return createFlowModeRunner(model);
+    case "app":
+      return createAppModeRunner(model);
+    case "script":
+      return createScriptModeRunner(model);
+  }
+}
+
+function parseMode(value: string | undefined): FrontendBenchmarkMode {
+  if (value === "flow" || value === "app" || value === "script") {
+    return value;
+  }
+  throw new Error(`Unsupported frontend benchmark mode: ${String(value)}`);
+}
+
+function parseOptionalJsonStringArray(value: string | undefined): string[] {
+  if (!value) {
+    return [];
+  }
+  const parsed = JSON.parse(value) as unknown;
+  if (!Array.isArray(parsed) || parsed.some((entry) => typeof entry !== "string")) {
+    throw new Error("WMILL_FRONTEND_AI_EVAL_CASE_IDS must be a JSON string array");
+  }
+  return parsed;
+}
+
+function parsePositiveInteger(value: string | undefined, envName: string): number {
+  const parsed = Number(value);
+  if (!Number.isInteger(parsed) || parsed <= 0) {
+    throw new Error(`${envName} must be a positive integer`);
+  }
+  return parsed;
+}
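Note on the env parsers above: both reject malformed input instead of falling back to defaults. The following is a minimal standalone sketch of that behaviour, copied out of the diff so it runs without `process.env`. The `tryParse` helper and the `RUNS` label are hypothetical and only serve the illustration.

```typescript
// Standalone copies of the parsing rules from the diff, for illustration only.
function parsePositiveInteger(value: string | undefined, envName: string): number {
  const parsed = Number(value);
  if (!Number.isInteger(parsed) || parsed <= 0) {
    throw new Error(`${envName} must be a positive integer`);
  }
  return parsed;
}

function parseOptionalJsonStringArray(value: string | undefined): string[] {
  if (!value) {
    return []; // unset or empty means "no case filter"
  }
  const parsed = JSON.parse(value) as unknown;
  if (!Array.isArray(parsed) || parsed.some((entry) => typeof entry !== "string")) {
    throw new Error("value must be a JSON string array");
  }
  return parsed as string[];
}

// Hypothetical helper: returns the parsed value, or the error message on failure.
function tryParse<T>(fn: () => T): T | string {
  try {
    return fn();
  } catch (error) {
    return error instanceof Error ? error.message : String(error);
  }
}

console.log(tryParse(() => parsePositiveInteger("3", "RUNS")));
console.log(tryParse(() => parsePositiveInteger(undefined, "RUNS")));
console.log(tryParse(() => parsePositiveInteger("1.5", "RUNS")));
console.log(tryParse(() => parseOptionalJsonStringArray('["case-a","case-b"]')));
console.log(tryParse(() => parseOptionalJsonStringArray("[1, 2]")));
```

An unset runs variable fails because `Number(undefined)` is `NaN`, so callers have to set it explicitly. An unset case-id variable simply yields an empty array.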
--- a/ai_evals/adapters/frontend/core/app/appEvalRunner.ts
+++ b/ai_evals/adapters/frontend/core/app/appEvalRunner.ts
@@ -0,0 +1,92 @@
+import { mkdtemp } from 'fs/promises'
+import { tmpdir } from 'os'
+import { join } from 'path'
+import type {
+	AppFiles,
+	BackendRunnable,
+	AppAIChatHelpers
+} from '../../../../../frontend/src/lib/components/copilot/chat/app/core'
+import {
+	getAppTools,
+	prepareAppSystemMessage,
+	prepareAppUserMessage
+} from '../../../../../frontend/src/lib/components/copilot/chat/app/core'
+import type { Tool as ProductionTool } from '../../../../../frontend/src/lib/components/copilot/chat/shared'
+import { createAppFileHelpers } from './fileHelpers'
+import { runEval } from '../shared'
+import type { AIProvider } from '$lib/gen/types.gen'
+import type { ModeRunContext } from '../../../../core/types'
+import type { TokenUsage } from '../shared/types'
+
+export interface AppEvalResult {
+	success: boolean
+	files: AppFiles
+	error?: string
+	assistantMessageCount: number
+	toolCallCount: number
+	toolsUsed: string[]
+	tokenUsage: TokenUsage
+}
+
+export interface AppEvalOptions {
+	initialFrontend?: Record<string, string>
+	initialBackend?: Record<string, BackendRunnable>
+	model?: string
+	maxIterations?: number
+	provider?: AIProvider
+	workspaceRoot?: string
+	runContext?: ModeRunContext
+}
+
+export async function runAppEval(
+	userPrompt: string,
+	apiKey: string,
+	options?: AppEvalOptions
+): Promise<AppEvalResult> {
+	const workspaceRoot =
+		options?.workspaceRoot ??
+		(await mkdtemp(join(tmpdir(), 'wmill-frontend-app-benchmark-')))
+	const { helpers, getFiles, cleanup } = await createAppFileHelpers(
+		options?.initialFrontend ?? {},
+		options?.initialBackend ?? {},
+		workspaceRoot
+	)
+
+	try {
+		const systemMessage = prepareAppSystemMessage()
+		const tools = getAppTools() as ProductionTool<AppAIChatHelpers>[]
+		const model = options?.model ?? 'claude-haiku-4-5-20251001'
+		const userMessage = prepareAppUserMessage(userPrompt, helpers.getSelectedContext())
+
+		const rawResult = await runEval({
+			userPrompt,
+			systemMessage,
+			userMessage,
+			tools,
+			helpers,
+			apiKey,
+			getOutput: getFiles,
+			onAssistantMessageStart: options?.runContext?.onAssistantMessageStart,
+			onAssistantToken: options?.runContext?.onAssistantChunk,
+			onAssistantMessageEnd: options?.runContext?.onAssistantMessageEnd,
+			options: {
+				maxIterations: options?.maxIterations,
+				model,
+				workspace: workspaceRoot,
+				provider: options?.provider
+			}
+		})
+
+		return {
+			files: rawResult.output,
+			success: rawResult.success,
+			error: rawResult.error,
+			assistantMessageCount: rawResult.iterations,
+			toolCallCount: rawResult.toolCallsCount,
+			toolsUsed: rawResult.toolsCalled,
+			tokenUsage: rawResult.tokenUsage
+		}
+	} finally {
+		await cleanup()
+	}
+}
--- a/frontend/src/lib/components/copilot/chat/tests/app/appFixtureLoader.ts
+++ b/frontend/src/lib/components/copilot/chat/tests/app/appFixtureLoader.ts
@@ -1,4 +1,8 @@
-import type { AppFiles, BackendRunnable, InlineScript } from '../../app/core'
+import type {
+	AppFiles,
+	BackendRunnable,
+	InlineScript
+} from '../../../../../frontend/src/lib/components/copilot/chat/app/core'

 /**
 * Backend runnable metadata stored in meta.json files.
--- a/ai_evals/adapters/frontend/core/app/fileHelpers.ts
+++ b/ai_evals/adapters/frontend/core/app/fileHelpers.ts
@@ -0,0 +1,255 @@
+import { mkdir, rm, writeFile } from 'fs/promises'
+import { dirname, join } from 'path'
+import type {
+	AppAIChatHelpers,
+	AppFiles,
+	BackendRunnable,
+	DataTableSchema,
+	LintResult,
+	SelectedContext
+} from '../../../../../frontend/src/lib/components/copilot/chat/app/core'
+
+function createEmptyLintResult(): LintResult {
+	return {
+		errorCount: 0,
+		warningCount: 0,
+		errors: { frontend: {}, backend: {} },
+		warnings: { frontend: {}, backend: {} }
+	}
+}
+
+async function writeFrontendFile(
+	workspaceRoot: string | undefined,
+	path: string,
+	content: string
+): Promise<void> {
+	if (!workspaceRoot) {
+		return
+	}
+	const relativePath = path.startsWith('/') ? path.slice(1) : path
+	const fullPath = join(workspaceRoot, 'frontend', relativePath)
+	await mkdir(dirname(fullPath), { recursive: true })
+	await writeFile(fullPath, content, 'utf8')
+}
+
+async function removeFrontendFile(workspaceRoot: string | undefined, path: string): Promise<void> {
+	if (!workspaceRoot) {
+		return
+	}
+	const relativePath = path.startsWith('/') ? path.slice(1) : path
+	await rm(join(workspaceRoot, 'frontend', relativePath), { force: true })
+}
+
+async function writeBackendRunnable(
+	workspaceRoot: string | undefined,
+	key: string,
+	runnable: BackendRunnable
+): Promise<void> {
+	if (!workspaceRoot) {
+		return
+	}
+	const runnableDir = join(workspaceRoot, 'backend', key)
+	await mkdir(runnableDir, { recursive: true })
+
+	const meta: { name: string; language?: string; type?: string; path?: string } = {
+		name: runnable.name
+	}
+
+	if (runnable.type === 'inline' && runnable.inlineScript) {
+		meta.language = runnable.inlineScript.language
+		const extension = runnable.inlineScript.language === 'python3' ? 'py' : 'ts'
+		await writeFile(
+			join(runnableDir, `main.${extension}`),
+			runnable.inlineScript.content,
+			'utf8'
+		)
+	} else {
+		meta.type = runnable.type
+		if (runnable.path) {
+			meta.path = runnable.path
+		}
+	}
+
+	await writeFile(join(runnableDir, 'meta.json'), JSON.stringify(meta, null, 2) + '\n', 'utf8')
+}
+
+async function removeBackendRunnable(workspaceRoot: string | undefined, key: string): Promise<void> {
+	if (!workspaceRoot) {
+		return
+	}
+	await rm(join(workspaceRoot, 'backend', key), { recursive: true, force: true })
+}
+
+async function persistDatatables(
+	workspaceRoot: string | undefined,
+	datatables: DataTableSchema[]
+): Promise<void> {
+	if (!workspaceRoot) {
+		return
+	}
+	await writeFile(
+		join(workspaceRoot, 'datatables.json'),
+		JSON.stringify(datatables, null, 2) + '\n',
+		'utf8'
+	)
+}
+
+export async function createAppFileHelpers(
+	initialFrontend: Record<string, string> = {},
+	initialBackend: Record<string, BackendRunnable> = {},
+	workspaceRoot?: string
+): Promise<{
+	helpers: AppAIChatHelpers
+	getFiles: () => AppFiles
+	getFrontend: () => Record<string, string>
+	getBackend: () => Record<string, BackendRunnable>
+	cleanup: () => Promise<void>
+	workspaceDir: string | null
+}> {
+	let frontend = { ...initialFrontend }
+	let backend = { ...initialBackend }
+	let snapshotId = 0
+	const snapshots = new Map<
+		number,
+		{ frontend: Record<string, string>; backend: Record<string, BackendRunnable> }
+	>()
+	const datatables: DataTableSchema[] = []
+
+	for (const [path, content] of Object.entries(frontend)) {
+		await writeFrontendFile(workspaceRoot, path, content)
+	}
+	for (const [key, runnable] of Object.entries(backend)) {
+		await writeBackendRunnable(workspaceRoot, key, runnable)
+	}
+	await persistDatatables(workspaceRoot, datatables)
+
+	const helpers: AppAIChatHelpers = {
+		listFrontendFiles: () => Object.keys(frontend),
+		getFrontendFile: (path: string) => frontend[path],
+		getFrontendFiles: () => ({ ...frontend }),
+		setFrontendFile: (path: string, content: string) => {
+			frontend[path] = content
+			void writeFrontendFile(workspaceRoot, path, content)
+			return createEmptyLintResult()
+		},
+		deleteFrontendFile: (path: string) => {
+			delete frontend[path]
+			void removeFrontendFile(workspaceRoot, path)
+		},
+		listBackendRunnables: () =>
+			Object.entries(backend).map(([key, runnable]) => ({
+				key,
+				name: runnable.name
+			})),
+		getBackendRunnable: (key: string) => backend[key],
+		getBackendRunnables: () => ({ ...backend }),
+		setBackendRunnable: async (key: string, runnable: BackendRunnable) => {
+			backend[key] = runnable
+			await writeBackendRunnable(workspaceRoot, key, runnable)
+			return createEmptyLintResult()
+		},
+		deleteBackendRunnable: (key: string) => {
+			delete backend[key]
+			void removeBackendRunnable(workspaceRoot, key)
+		},
+		getFiles: (): AppFiles => ({
+			frontend: { ...frontend },
+			backend: { ...backend }
+		}),
+		getSelectedContext: (): SelectedContext => ({ type: 'none' }),
+		snapshot: () => {
+			const id = ++snapshotId
+			snapshots.set(id, {
+				frontend: { ...frontend },
+				backend: { ...backend }
+			})
+			return id
+		},
+		revertToSnapshot: (id: number) => {
+			const snapshot = snapshots.get(id)
+			if (!snapshot) {
+				return
+			}
+			frontend = { ...snapshot.frontend }
+			backend = { ...snapshot.backend }
+			void syncWorkspace()
+		},
+		lint: () => createEmptyLintResult(),
+		getDatatables: async () => structuredClone(datatables),
+		getAvailableDatatableNames: () => datatables.map((datatable) => datatable.datatable_name),
+		execDatatableSql: async (
+			datatableName: string,
+			sql: string,
+			newTable?: { schema: string; name: string }
+		) => {
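+			if (newTable) {
+				// Merge into an existing entry (as addTableToWhitelist does) to avoid duplicates.
+				let target = datatables.find((entry) => entry.datatable_name === datatableName)
+				if (!target) {
+					target = { datatable_name: datatableName, schemas: {} }
+					datatables.push(target)
+				}
+				target.schemas[newTable.schema] ??= {}
+				target.schemas[newTable.schema][newTable.name] ??= {}
+				await persistDatatables(workspaceRoot, datatables)
+			}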
+			return {
+				success: true,
+				result: [
+					{
+						datatableName,
+						sql
+					}
+				]
+			}
+		},
+		addTableToWhitelist: (datatableName: string, schemaName: string, tableName: string) => {
+			const existing = datatables.find((entry) => entry.datatable_name === datatableName)
+			if (existing) {
+				existing.schemas[schemaName] ??= {}
+				existing.schemas[schemaName][tableName] ??= {}
+			} else {
+				datatables.push({
+					datatable_name: datatableName,
+					schemas: {
+						[schemaName]: {
+							[tableName]: {}
+						}
+					}
+				})
+			}
+			void persistDatatables(workspaceRoot, datatables)
+		}
+	}
+
+	async function syncWorkspace(): Promise<void> {
+		if (!workspaceRoot) {
+			return
+		}
+		await rm(join(workspaceRoot, 'frontend'), { recursive: true, force: true })
+		await rm(join(workspaceRoot, 'backend'), { recursive: true, force: true })
+		for (const [path, content] of Object.entries(frontend)) {
+			await writeFrontendFile(workspaceRoot, path, content)
+		}
+		for (const [key, runnable] of Object.entries(backend)) {
+			await writeBackendRunnable(workspaceRoot, key, runnable)
+		}
+		await persistDatatables(workspaceRoot, datatables)
+	}
+
+	return {
+		helpers,
+		getFiles: () => ({
+			frontend: { ...frontend },
+			backend: { ...backend }
+		}),
+		getFrontend: () => ({ ...frontend }),
+		getBackend: () => ({ ...backend }),
+		cleanup: async () => {
+			if (workspaceRoot) {
+				await rm(workspaceRoot, { recursive: true, force: true })
+			}
+		},
+		workspaceDir: workspaceRoot ?? null
+	}
+}
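Note on `snapshot` / `revertToSnapshot` above: the helpers keep shallow copies of the in-memory state keyed by an incrementing id. A minimal sketch of that pattern, reduced to a single string map, could look like the following. `createSnapshotStore` and its method names are illustrative and do not appear in the diff.

```typescript
// Minimal in-memory model of the snapshot/revert helpers; names are illustrative.
type Files = Record<string, string>;

function createSnapshotStore(initial: Files) {
  let files: Files = { ...initial };
  let nextId = 0;
  const snapshots = new Map<number, Files>();

  return {
    get: (): Files => ({ ...files }),
    set: (path: string, content: string): void => {
      files[path] = content;
    },
    snapshot: (): number => {
      const id = ++nextId;
      // A shallow copy is enough here because the values are immutable strings.
      snapshots.set(id, { ...files });
      return id;
    },
    revert: (id: number): boolean => {
      const saved = snapshots.get(id);
      if (!saved) {
        return false; // unknown id: leave the current state untouched
      }
      files = { ...saved };
      return true;
    },
  };
}

const store = createSnapshotStore({ "/index.tsx": "v1" });
const id = store.snapshot();
store.set("/index.tsx", "v2");
store.set("/extra.css", "body {}");
store.revert(id);
console.log(store.get()); // only "/index.tsx" with content "v1" remains
```

The shallow copy is safe in the diff as long as backend runnables are replaced wholesale rather than mutated in place, which is how `setBackendRunnable` appears to work.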
--- a/ai_evals/adapters/frontend/core/flow/fileHelpers.ts
+++ b/ai_evals/adapters/frontend/core/flow/fileHelpers.ts
@@ -0,0 +1,161 @@
+import { mkdir, rm, writeFile } from 'fs/promises'
+import { dirname, join } from 'path'
+import type { FlowModule, InputTransform } from '../../../../../frontend/src/lib/gen'
+import type { ExtendedOpenFlow } from '../../../../../frontend/src/lib/components/flows/types'
+import type { FlowAIChatHelpers } from '../../../../../frontend/src/lib/components/copilot/chat/flow/core'
+import type { ScriptLintResult } from '../../../../../frontend/src/lib/components/copilot/chat/shared'
+import { findModuleById } from '../../../../../frontend/src/lib/components/copilot/chat/shared'
+import {
+	createInlineScriptSession
+} from '../../../../../frontend/src/lib/components/copilot/chat/flow/inlineScriptsUtils'
+import {
+	registerBenchmarkWorkspace,
+	registerBenchmarkWorkspaceRunnables,
+	unregisterBenchmarkWorkspaceRunnables,
+	createBenchmarkCompletedJob,
+	type BenchmarkWorkspaceFlow,
+	type BenchmarkWorkspaceScript
+} from '../../mockBackend'
+
+const EMPTY_SCRIPT_LINT_RESULT: ScriptLintResult = {
+	errorCount: 0,
+	warningCount: 0,
+	errors: [],
+	warnings: []
+}
+
+export interface FlowWorkspaceFixtures {
+	scripts?: BenchmarkWorkspaceScript[]
+	flows?: BenchmarkWorkspaceFlow[]
+}
+
+export async function createFlowFileHelpers(
+	initialModules: FlowModule[] = [],
+	initialSchema?: Record<string, any>,
+	workspaceRoot?: string,
+	workspaceFixtures?: FlowWorkspaceFixtures
+): Promise<{
+	helpers: FlowAIChatHelpers
+	getFlow: () => ExtendedOpenFlow
+	getModules: () => FlowModule[]
+	cleanup: () => Promise<void>
+	workspaceDir: string | null
+}> {
+	let flow: ExtendedOpenFlow = {
+		value: { modules: structuredClone(initialModules) },
+		summary: '',
+		schema: initialSchema ?? {
+			$schema: 'https://json-schema.org/draft/2020-12/schema',
+			properties: {},
+			required: [],
+			type: 'object'
+		}
+	}
+	const inlineScriptSession = createInlineScriptSession()
+
+	const flowFilePath = workspaceRoot ? join(workspaceRoot, 'flow.json') : null
+
+	async function persistFlow(): Promise<void> {
+		if (!flowFilePath) {
+			return
+		}
+		await mkdir(dirname(flowFilePath), { recursive: true })
+		await writeFile(flowFilePath, JSON.stringify(flow, null, 2) + '\n', 'utf8')
+	}
+
+	await persistFlow()
+
+	if (workspaceRoot) {
+		registerBenchmarkWorkspace(workspaceRoot)
+		if (workspaceFixtures) {
+			registerBenchmarkWorkspaceRunnables(workspaceRoot, workspaceFixtures)
+		}
+	}
+
+	const helpers: FlowAIChatHelpers = {
+		getFlowAndSelectedId: () => ({ flow, selectedId: '' }),
+		getModules: (id?: string) => {
+			if (!id) return flow.value.modules
+			const module = findModuleById(flow.value.modules, id)
+			return module ? [module] : []
+		},
+		inlineScriptSession,
+		setSnapshot: () => {},
+		revertToSnapshot: () => {},
+		setCode: async (id: string, code: string) => {
+			const module = findModuleById(flow.value.modules, id)
+			if (module && module.value.type === 'rawscript') {
+				module.value.content = code
+			}
+			inlineScriptSession.set(id, code)
+			await persistFlow()
+		},
+		setFlowJson: async (
+			modules: FlowModule[] | undefined,
+			schema: Record<string, any> | undefined
+		) => {
+			if (modules) {
+				flow.value.modules = inlineScriptSession.restoreInlineScriptReferences(modules)
+				const unresolvedRefs = inlineScriptSession.findUnresolvedInlineScriptRefs(flow.value.modules)
+				if (unresolvedRefs.length > 0) {
+					throw new Error(
+						`Unresolved inline script references: ${unresolvedRefs.join(', ')}`
+					)
+				}
+			}
+			if (schema !== undefined) {
+				flow.schema = schema
+			}
+			await persistFlow()
+		},
+		getFlowInputsSchema: async () => flow.schema ?? {},
+		updateExprsToSet: (_id: string, _inputTransforms: Record<string, InputTransform>) => {},
+		acceptAllModuleActions: () => {},
+		rejectAllModuleActions: () => {},
+		hasPendingChanges: () => false,
+		selectStep: (_id: string) => {},
+		testFlow: async (args?: Record<string, any>) => {
+			if (workspaceRoot) {
+				const runPath = join(workspaceRoot, 'test-run.json')
+				await writeFile(
+					runPath,
+					JSON.stringify(
+						{
+							requestedArgs: args ?? {},
+							modules: flow.value.modules.map((module) => module.id)
+						},
+						null,
+						2
+					) + '\n',
+					'utf8'
+				)
+			}
+			return createBenchmarkCompletedJob({
+				workspace: workspaceRoot ?? 'benchmark',
+				jobKind: 'flowpreview',
+				result: {
+					requestedArgs: args ?? {},
+					modules: flow.value.modules.map((module) => module.id),
+					mocked: true
+				},
+				logs: 'Mock benchmark flow test run completed successfully.'
+			})
+		},
+		getLintErrors: async () => EMPTY_SCRIPT_LINT_RESULT
+	}
+
+	return {
+		helpers,
+		getFlow: () => flow,
+		getModules: () => flow.value.modules,
+		cleanup: async () => {
+			if (workspaceRoot) {
+				// Drop the mock-backend runnables before deleting the
+				// workspace directory they were registered under.
+				unregisterBenchmarkWorkspaceRunnables(workspaceRoot)
+				await rm(workspaceRoot, { recursive: true, force: true })
+			}
+		},
+		workspaceDir: workspaceRoot ?? null
+	}
+}
--- a/ai_evals/adapters/frontend/core/flow/flowEvalRunner.ts
+++ b/ai_evals/adapters/frontend/core/flow/flowEvalRunner.ts
@@ -0,0 +1,103 @@
+import { mkdtemp } from 'fs/promises'
+import { tmpdir } from 'os'
+import { join } from 'path'
+import type { FlowModule } from '$lib/gen'
+import type { AIProvider } from '$lib/gen/types.gen'
+import type { ExtendedOpenFlow } from '$lib/components/flows/types'
+import {
+	flowTools,
+	prepareFlowSystemMessage,
+	prepareFlowUserMessage,
+	type FlowAIChatHelpers
+} from '../../../../../frontend/src/lib/components/copilot/chat/flow/core'
+import type { Tool as ProductionTool } from '../../../../../frontend/src/lib/components/copilot/chat/shared'
+import { createFlowFileHelpers, type FlowWorkspaceFixtures } from './fileHelpers'
+import { runEval } from '../shared'
+import type { ModeRunContext } from '../../../../core/types'
+import type { TokenUsage } from '../shared/types'
+
+export interface FlowFixture {
+	value?: {
+		modules?: FlowModule[]
+	}
+	schema?: Record<string, unknown>
+}
+
+export interface FlowEvalResult {
+	success: boolean
+	flow: ExtendedOpenFlow
+	error?: string
+	assistantMessageCount: number
+	toolCallCount: number
+	toolsUsed: string[]
+	tokenUsage: TokenUsage
+}
+
+export interface FlowEvalOptions {
+	initialFlow?: FlowFixture
+	workspaceFixtures?: FlowWorkspaceFixtures
+	model?: string
+	maxIterations?: number
+	provider?: AIProvider
+	workspaceRoot?: string
+	runContext?: ModeRunContext
+}
+
+export async function runFlowEval(
+	userPrompt: string,
+	apiKey: string,
+	options?: FlowEvalOptions
+): Promise<FlowEvalResult> {
+	const workspaceRoot =
+		options?.workspaceRoot ??
+		(await mkdtemp(join(tmpdir(), 'wmill-frontend-flow-benchmark-')))
+	const { helpers, getFlow, cleanup } = await createFlowFileHelpers(
+		options?.initialFlow?.value?.modules ?? [],
+		options?.initialFlow?.schema,
+		workspaceRoot,
+		options?.workspaceFixtures
+	)
+
+	try {
+		const systemMessage = prepareFlowSystemMessage()
+		const tools = flowTools as ProductionTool<FlowAIChatHelpers>[]
+		const model = options?.model ?? 'claude-haiku-4-5-20251001'
+		const userMessage = prepareFlowUserMessage(
+			userPrompt,
+			helpers.getFlowAndSelectedId(),
+			[],
+			helpers.inlineScriptSession
+		)
+
+		const rawResult = await runEval({
+			userPrompt,
+			systemMessage,
+			userMessage,
+			tools,
+			helpers,
+			apiKey,
+			getOutput: getFlow,
+			onAssistantMessageStart: options?.runContext?.onAssistantMessageStart,
+			onAssistantToken: options?.runContext?.onAssistantChunk,
+			onAssistantMessageEnd: options?.runContext?.onAssistantMessageEnd,
+			options: {
+				maxIterations: options?.maxIterations,
+				model,
+				workspace: workspaceRoot,
+				provider: options?.provider
+			}
+		})
+
+		return {
+			flow: rawResult.output,
+			success: rawResult.success,
+			error: rawResult.error,
+			assistantMessageCount: rawResult.iterations,
+			toolCallCount: rawResult.toolCallsCount,
+			toolsUsed: rawResult.toolsCalled,
+			tokenUsage: rawResult.tokenUsage
+		}
+	} finally {
+		await cleanup()
+	}
+}
--- a/ai_evals/adapters/frontend/core/script/fileHelpers.ts
+++ b/ai_evals/adapters/frontend/core/script/fileHelpers.ts
@@ -0,0 +1,73 @@
+import { mkdir, rm, writeFile } from 'fs/promises'
+import { dirname, join } from 'path'
+import type { ScriptLang } from '../../../../../frontend/src/lib/gen/types.gen'
+import type { ReviewChangesOpts } from '../../../../../frontend/src/lib/components/copilot/chat/monaco-adapter'
+import type { ScriptChatHelpers } from '../../../../../frontend/src/lib/components/copilot/chat/script/core'
+import { buildScriptLintResult } from './preview'
+import { registerBenchmarkWorkspace, unregisterBenchmarkWorkspace } from '../../mockBackend'
+
+export interface ScriptEvalState {
+	code: string
+	lang: ScriptLang | 'bunnative'
+	path: string
+	args: Record<string, any>
+}
+
+export async function createScriptFileHelpers(
+	initialScript: ScriptEvalState,
+	workspaceRoot?: string
+): Promise<{
+	helpers: ScriptChatHelpers
+	getScript: () => ScriptEvalState
+	cleanup: () => Promise<void>
+	workspaceDir: string | null
+}> {
+	let script = structuredClone(initialScript)
+	const scriptFilePath = workspaceRoot ? join(workspaceRoot, script.path) : null
+
+	async function persistScript(): Promise<void> {
+		if (!scriptFilePath) {
+			return
+		}
+		await mkdir(dirname(scriptFilePath), { recursive: true })
+		await writeFile(scriptFilePath, script.code, 'utf8')
+	}
+
+	await persistScript()
+
+	if (workspaceRoot) {
+		registerBenchmarkWorkspace(workspaceRoot)
+	}
+
+	const helpers: ScriptChatHelpers = {
+		getScriptOptions: () => ({
+			code: script.code,
+			lang: script.lang,
+			path: script.path,
+			args: structuredClone(script.args)
+		}),
+		applyCode: async (code: string, opts?: ReviewChangesOpts) => {
+			if (opts?.mode === 'revert') {
+				return
+			}
+			script = {
+				...script,
+				code
+			}
+			await persistScript()
+		},
+		getLintErrors: () => buildScriptLintResult(script.code, script.lang)
+	}
+
+	return {
+		helpers,
+		getScript: () => structuredClone(script),
+		cleanup: async () => {
+			if (workspaceRoot) {
+				unregisterBenchmarkWorkspace(workspaceRoot)
+				await rm(workspaceRoot, { recursive: true, force: true })
+			}
+		},
+		workspaceDir: workspaceRoot ?? null
+	}
+}
--- a/ai_evals/adapters/frontend/core/script/preview.ts
+++ b/ai_evals/adapters/frontend/core/script/preview.ts
@@ -0,0 +1,96 @@
+import ts from 'typescript'
+import type { ScriptLang } from '../../../../../frontend/src/lib/gen/types.gen'
+import type { ScriptLintResult } from '../../../../../frontend/src/lib/components/copilot/chat/shared'
+
+export type ScriptPreviewLanguage = ScriptLang | 'bunnative'
+
+const TS_LIKE_LANGUAGES = new Set<ScriptPreviewLanguage>(['bun', 'deno', 'nativets', 'bunnative'])
+const JS_LIKE_LANGUAGES = new Set<ScriptPreviewLanguage>(['bun', 'deno', 'nativets', 'bunnative'])
+
+function hasSupportedEntrypoint(code: string): boolean {
+	return (
+		/export\s+(async\s+)?function\s+main\s*\(/.test(code) ||
+		/export\s+(async\s+)?function\s+preprocessor\s*\(/.test(code)
+	)
+}
+
+function compilerOptionsForLanguage(lang: ScriptPreviewLanguage): ts.CompilerOptions | null {
+	if (!TS_LIKE_LANGUAGES.has(lang)) {
+		return null
+	}
+
+	return {
+		target: ts.ScriptTarget.ES2022,
+		module: ts.ModuleKind.ESNext,
+		moduleResolution: ts.ModuleResolutionKind.Bundler,
+		noEmit: true,
+		allowJs: true,
+		checkJs: false,
+		strict: false,
+		skipLibCheck: true
+	}
+}
+
+function getLineAndColumn(sourceText: string, start: number): { line: number; column: number } {
+	const prefix = sourceText.slice(0, Math.max(0, start))
+	const line = prefix.split('\n').length
+	const lastNewline = prefix.lastIndexOf('\n')
+	const column = lastNewline === -1 ? prefix.length + 1 : prefix.length - lastNewline
+	return { line, column }
+}
+
+export function buildScriptLintResult(
+	code: string,
+	lang: ScriptPreviewLanguage
+): ScriptLintResult {
+	const diagnostics: ScriptLintResult['errors'] = []
+	const compilerOptions = compilerOptionsForLanguage(lang)
+
+	if (compilerOptions) {
+		const sourceFile = ts.createSourceFile(
+			'script.ts',
+			code,
+			ts.ScriptTarget.ES2022,
+			true,
+			JS_LIKE_LANGUAGES.has(lang) ? ts.ScriptKind.TS : ts.ScriptKind.JS
+		)
+		const output = ts.transpileModule(code, {
+			compilerOptions,
+			fileName: sourceFile.fileName,
+			reportDiagnostics: true
+		})
+
+		for (const diagnostic of output.diagnostics ?? []) {
+			const start = diagnostic.start ?? 0
+			const length = diagnostic.length ?? 1
+			const { line, column } = getLineAndColumn(code, start)
+			const message = ts.flattenDiagnosticMessageText(diagnostic.messageText, '\n')
+			diagnostics.push({
+				startLineNumber: line,
+				startColumn: column,
+				endLineNumber: line,
+				endColumn: column + Math.max(1, length),
+				message,
+				severity: 8
+			} as ScriptLintResult['errors'][number])
+		}
+	}
+
+	if (compilerOptions && !hasSupportedEntrypoint(code)) {
+		diagnostics.push({
+			startLineNumber: 1,
+			startColumn: 1,
+			endLineNumber: 1,
+			endColumn: 1,
+			message: 'Script must export a main or preprocessor function.',
+			severity: 8
+		} as ScriptLintResult['errors'][number])
+	}
+
+	return {
+		errorCount: diagnostics.length,
+		warningCount: 0,
+		errors: diagnostics,
+		warnings: []
+	}
+}
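Note on `getLineAndColumn` above: it converts a 0-based character offset from the TypeScript diagnostic into the 1-based line and column that the lint result expects. A small standalone copy shows the conversion; the sample source string is made up for the example.

```typescript
// Standalone copy of the offset-to-position conversion from the diff.
function getLineAndColumn(sourceText: string, start: number): { line: number; column: number } {
  const prefix = sourceText.slice(0, Math.max(0, start));
  // Every newline before the offset moves the position one line down.
  const line = prefix.split("\n").length;
  const lastNewline = prefix.lastIndexOf("\n");
  // Columns are 1-based and counted from the character after the last newline.
  const column = lastNewline === -1 ? prefix.length + 1 : prefix.length - lastNewline;
  return { line, column };
}

const source = "export function main() {\n  return 1\n}\n";
const offset = source.indexOf("return");
console.log(getLineAndColumn(source, offset)); // { line: 2, column: 3 }
```

The function only splits on `\n`, so a `\r\n` source would count the carriage return as part of the preceding line's columns.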
--- a/ai_evals/adapters/frontend/core/script/scriptEvalRunner.ts
+++ b/ai_evals/adapters/frontend/core/script/scriptEvalRunner.ts
@@ -0,0 +1,109 @@
+import { mkdtemp } from 'fs/promises'
+import { tmpdir } from 'os'
+import { join } from 'path'
+import type { AIProvider, AIProviderModel, ScriptLang } from '$lib/gen/types.gen'
+import type { ContextElement } from '../../../../../frontend/src/lib/components/copilot/chat/context'
+import {
+	prepareScriptSystemMessage,
+	prepareScriptTools,
+	prepareScriptUserMessage,
+	type ScriptChatHelpers
+} from '../../../../../frontend/src/lib/components/copilot/chat/script/core'
+import type { Tool as ProductionTool } from '../../../../../frontend/src/lib/components/copilot/chat/shared'
+import { createScriptFileHelpers, type ScriptEvalState } from './fileHelpers'
+import { runEval } from '../shared'
+import type { ModeRunContext } from '../../../../core/types'
+import type { TokenUsage } from '../shared/types'
+
+export interface ScriptEvalResult {
+	success: boolean
+	script: ScriptEvalState
+	error?: string
+	assistantMessageCount: number
+	toolCallCount: number
+	toolsUsed: string[]
+	tokenUsage: TokenUsage
+}
+
+export interface ScriptEvalOptions {
+	initialScript: ScriptEvalState
+	model?: string
+	maxIterations?: number
+	provider?: AIProvider
+	workspaceRoot?: string
+	runContext?: ModeRunContext
+}
+
+function resolveModelProvider(
+	model: string,
+	provider?: AIProvider
+): AIProviderModel {
+	if (provider) {
+		return { provider, model }
+	}
+	if (model.startsWith('claude')) {
+		return { provider: 'anthropic', model }
+	}
+	return { provider: 'openai', model }
+}
+
+export async function runScriptEval(
+	userPrompt: string,
+	apiKey: string,
+	options: ScriptEvalOptions
+): Promise<ScriptEvalResult> {
+	const workspaceRoot =
+		options.workspaceRoot ?? (await mkdtemp(join(tmpdir(), 'wmill-frontend-script-benchmark-')))
+	const { helpers, getScript, cleanup } = await createScriptFileHelpers(
+		options.initialScript,
+		workspaceRoot
+	)
+
+	try {
+		const model = options.model ?? 'claude-haiku-4-5-20251001'
+		const modelProvider = resolveModelProvider(model, options.provider)
+		const selectedContext: ContextElement[] = []
+		const systemMessage = prepareScriptSystemMessage(
+			modelProvider,
+			options.initialScript.lang,
+			{}
+		)
+		const tools = prepareScriptTools(
+			modelProvider,
+			options.initialScript.lang,
+			selectedContext
+		) as ProductionTool<ScriptChatHelpers>[]
+		const userMessage = prepareScriptUserMessage(userPrompt, selectedContext)
+
+		const rawResult = await runEval({
+			userPrompt,
+			systemMessage,
+			userMessage,
+			tools,
+			helpers,
+			apiKey,
+			getOutput: getScript,
+			onAssistantMessageStart: options.runContext?.onAssistantMessageStart,
+			onAssistantToken: options.runContext?.onAssistantChunk,
+			onAssistantMessageEnd: options.runContext?.onAssistantMessageEnd,
+			options: {
+				maxIterations: options.maxIterations,
+				model,
+				workspace: workspaceRoot,
+				provider: modelProvider.provider
+			}
+		})
+
+		return {
+			script: rawResult.output,
+			success: rawResult.success,
+			error: rawResult.error,
+			assistantMessageCount: rawResult.iterations,
+			toolCallCount: rawResult.toolCallsCount,
+			toolsUsed: rawResult.toolsCalled,
+			tokenUsage: rawResult.tokenUsage
+		}
+	} finally {
+		await cleanup()
+	}
+}
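Note on `resolveModelProvider` above: an explicit provider always wins, and otherwise the model-name prefix decides, with OpenAI as the fallback. A minimal sketch of that rule follows. The `Provider` and `ProviderModel` types are local stand-ins with assumed shapes, since the generated `AIProvider` and `AIProviderModel` types are not shown in this diff.

```typescript
// Local stand-ins for the generated AIProvider / AIProviderModel types (assumed shapes).
type Provider = "anthropic" | "openai";

interface ProviderModel {
  provider: Provider;
  model: string;
}

function resolveModelProvider(model: string, provider?: Provider): ProviderModel {
  if (provider) {
    return { provider, model }; // an explicit provider overrides any prefix guess
  }
  if (model.startsWith("claude")) {
    return { provider: "anthropic", model };
  }
  return { provider: "openai", model }; // fallback for every other model name
}

console.log(resolveModelProvider("claude-haiku-4-5-20251001"));
console.log(resolveModelProvider("gpt-4o"));
console.log(resolveModelProvider("claude-haiku-4-5-20251001", "openai"));
```

With this rule, any model whose name does not start with `claude` resolves to OpenAI unless the caller passes a provider explicitly.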
--- a/frontend/src/lib/components/copilot/chat/tests/shared/baseEvalRunner.ts
+++ b/frontend/src/lib/components/copilot/chat/tests/shared/baseEvalRunner.ts
@@ -1,29 +1,19 @@
-import OpenAI from 'openai'
-import Anthropic from '@anthropic-ai/sdk'
 import type {
 	ChatCompletionMessageParam,
 	ChatCompletionSystemMessageParam
 } from 'openai/resources/chat/completions.mjs'
-import type { AIProvider, AIProviderModel } from '$lib/gen/types.gen'
-import type { TokenUsage, ToolCallDetail, EvalRunnerOptions } from './types'
-import type { Tool } from './baseVariants'
-import { runChatLoop, type ChatClients } from '../../chatLoop'
-import type { Tool as ProductionTool, ToolCallbacks } from '../../shared'
-
-/**
- * Result from a single eval run (before domain-specific evaluation).
- */
-export interface RawEvalResult<TOutput> {
-	success: boolean
-	output: TOutput
-	error?: string
-	tokenUsage: TokenUsage
-	toolCallsCount: number
-	toolsCalled: string[]
-	toolCallDetails: ToolCallDetail[]
-	iterations: number
-	messages: ChatCompletionMessageParam[]
-}
+import type { AIProviderModel } from '$lib/gen/types.gen'
+import type { TokenUsage, ToolCallDetail, EvalRunnerOptions, RawEvalResult } from './types'
+import { runChatLoop, type ChatClients } from '../../../../../frontend/src/lib/components/copilot/chat/chatLoop'
+import type {
+	Tool as ProductionTool,
+	ToolCallbacks
+} from '../../../../../frontend/src/lib/components/copilot/chat/shared'
+import {
+	createEvalClients,
+	type FrontendEvalProvider,
+	resolveEvalModelProvider
+} from './providerConfig'

 /**
 * Parameters for running a base evaluation.
@@ -38,7 +28,7 @@ export interface RunEvalParams<THelpers, TOutput> {
 	/** Tool definitions for the LLM API (unused — derived from tools) */
 	toolDefs?: unknown
 	/** Full tool implementations for execution */
-	tools: Tool<THelpers>[]
+	tools: ProductionTool<THelpers>[]
 	/** Domain-specific helpers for tool execution */
 	helpers: THelpers
 	/** API key for the provider */
@@ -47,35 +37,9 @@ export interface RunEvalParams<THelpers, TOutput> {
 	getOutput: () => TOutput
 	/** Optional configuration */
 	options?: EvalRunnerOptions
-}
-
-/**
- * Creates SDK clients for the given provider.
- */
-function createEvalClients(provider: AIProvider, apiKey: string): ChatClients {
-	if (provider === 'anthropic') {
-		return {
-			openai: new OpenAI({ apiKey: 'unused' }),
-			anthropic: new Anthropic({ apiKey })
-		}
-	}
-	return {
-		openai: new OpenAI({ apiKey }),
-		anthropic: new Anthropic({ apiKey: 'unused' })
-	}
-}
-
-/**
- * Resolves model string to AIProviderModel.
- */
-function resolveModelProvider(
-	model: string,
-	provider?: AIProvider
-): AIProviderModel {
-	if (provider) return { provider, model }
-	if (model.startsWith('claude')) return { provider: 'anthropic', model }
-	if (model.startsWith('gpt') || model.startsWith('o')) return { provider: 'openai', model }
-	return { provider: 'openai', model }
+	onAssistantMessageStart?: () => void
+	onAssistantToken?: (token: string) => void
+	onAssistantMessageEnd?: () => void
 }

 /**
@@ -92,16 +56,23 @@ export async function runEval<THelpers, TOutput>(
 		helpers,
 		apiKey,
 		getOutput,
-		options
+		options,
+		onAssistantMessageStart,
+		onAssistantToken,
+		onAssistantMessageEnd
 	} = params
+	let shouldEmitMessageStart = true

 	const model = options?.model ?? 'gpt-4o'
 	const maxIterations = options?.maxIterations ?? 20
 	const workspace = options?.workspace ?? 'test-workspace'
 	const provider = options?.provider

-	const modelProvider = resolveModelProvider(model, provider)
-	const clients = createEvalClients(modelProvider.provider, apiKey)
+	const modelProvider = resolveEvalModelProvider(
+		model,
+		provider as FrontendEvalProvider | undefined
+	) as AIProviderModel
+	const clients = createEvalClients(modelProvider.provider, apiKey) as ChatClients

 	const messages: ChatCompletionMessageParam[] = [userMessage]
 	let toolCallsCount = 0
@@ -128,7 +99,7 @@ export async function runEval<THelpers, TOutput>(
 			}
 			return tool.fn(p)
 		}
-	})) as ProductionTool<THelpers>[]
+	}))

 	// No-op callbacks for eval
 	const callbacks: ToolCallbacks & {
@@ -137,8 +108,19 @@ export async function runEval<THelpers, TOutput>(
 	} = {
 		setToolStatus: () => {},
 		removeToolStatus: () => {},
-		onNewToken: () => {},
-		onMessageEnd: () => {}
+		onNewToken: (token: string) => {
+			if (shouldEmitMessageStart) {
+				onAssistantMessageStart?.()
+				shouldEmitMessageStart = false
+			}
+			onAssistantToken?.(token)
+		},
+		onMessageEnd: () => {
+			if (!shouldEmitMessageStart) {
+				onAssistantMessageEnd?.()
+			}
+			shouldEmitMessageStart = true
+		}
 	}

 	const abortController = new AbortController()
@@ -161,7 +143,7 @@ export async function runEval<THelpers, TOutput>(
 		return {
 			success: true,
 			output: getOutput(),
-			tokenUsage: { prompt: 0, completion: 0, total: 0 },
+			tokenUsage: result.tokenUsage,
 			toolCallsCount,
 			toolsCalled,
 			toolCallDetails,
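The callback change above gates the start and end notifications on whether any token was streamed. A minimal sketch of that gating in isolation follows; `createStreamCallbacks` and `AssistantStreamHooks` are hypothetical names introduced for illustration, not part of the diff.

```typescript
// Hypothetical wrapper: fires "start" once before the first token of each
// message and fires "end" only for messages that produced at least one token.
interface AssistantStreamHooks {
	onAssistantMessageStart?: () => void
	onAssistantToken?: (token: string) => void
	onAssistantMessageEnd?: () => void
}

function createStreamCallbacks(hooks: AssistantStreamHooks) {
	let shouldEmitMessageStart = true
	return {
		onNewToken(token: string): void {
			if (shouldEmitMessageStart) {
				hooks.onAssistantMessageStart?.()
				shouldEmitMessageStart = false
			}
			hooks.onAssistantToken?.(token)
		},
		onMessageEnd(): void {
			// Skip the end notification when no token opened the message.
			if (!shouldEmitMessageStart) {
				hooks.onAssistantMessageEnd?.()
			}
			shouldEmitMessageStart = true
		}
	}
}

// Usage: two messages, the second one without any tokens.
const events: string[] = []
const callbacks = createStreamCallbacks({
	onAssistantMessageStart: () => events.push('start'),
	onAssistantToken: (token) => events.push(`token:${token}`),
	onAssistantMessageEnd: () => events.push('end')
})
callbacks.onNewToken('Hel')
callbacks.onNewToken('lo')
callbacks.onMessageEnd()
callbacks.onMessageEnd() // empty message: neither start nor end is recorded
```

Resetting the flag in `onMessageEnd` regardless of whether a token arrived keeps the next message independent of the previous one.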
--- a/ai_evals/adapters/frontend/core/shared/index.ts
+++ b/ai_evals/adapters/frontend/core/shared/index.ts
@@ -0,0 +1,3 @@
+export type { TokenUsage, ToolCallDetail, EvalRunnerOptions, RawEvalResult } from './types'
+export type { RunEvalParams } from './baseEvalRunner'
+export { runEval } from './baseEvalRunner'
--- a/ai_evals/adapters/frontend/core/shared/providerConfig.test.ts
+++ b/ai_evals/adapters/frontend/core/shared/providerConfig.test.ts
@@ -0,0 +1,41 @@
+import { describe, expect, it } from "bun:test";
+import {
+  buildOpenAICompatibleClientOptions,
+  resolveEvalModelProvider,
+} from "./providerConfig";
+
+describe("buildOpenAICompatibleClientOptions", () => {
+  it("adds Gemini's OpenAI-compatible base URL and client header", () => {
+    const options = buildOpenAICompatibleClientOptions("googleai", "gemini-test-key");
+
+    expect(options).toMatchObject({
+      apiKey: "gemini-test-key",
+      baseURL: "https://generativelanguage.googleapis.com/v1beta/openai/",
+      defaultHeaders: {
+        "x-goog-api-client": "windmill-ai-evals/1.0",
+      },
+    });
+  });
+
+  it("keeps the default OpenAI-compatible config for OpenAI", () => {
+    expect(buildOpenAICompatibleClientOptions("openai", "openai-test-key")).toEqual({
+      apiKey: "openai-test-key",
+    });
+  });
+});
+
+describe("resolveEvalModelProvider", () => {
+  it("infers googleai from Gemini model ids", () => {
+    expect(resolveEvalModelProvider("gemini-2.5-flash")).toEqual({
+      provider: "googleai",
+      model: "gemini-2.5-flash",
+    });
+  });
+
+  it("preserves an explicit provider", () => {
+    expect(resolveEvalModelProvider("gemini-2.5-pro", "googleai")).toEqual({
+      provider: "googleai",
+      model: "gemini-2.5-pro",
+    });
+  });
+});
--- a/ai_evals/adapters/frontend/core/shared/providerConfig.ts
+++ b/ai_evals/adapters/frontend/core/shared/providerConfig.ts
@@ -0,0 +1,71 @@
+import Anthropic from "@anthropic-ai/sdk";
+import OpenAI from "openai";
+import type { FrontendEvalModelConfig } from "../../../../core/models";
+
+export type FrontendEvalProvider = FrontendEvalModelConfig["provider"];
+
+export interface EvalClients {
+  openai: OpenAI;
+  anthropic: Anthropic;
+}
+
+export interface ResolvedEvalModelProvider {
+  provider: FrontendEvalProvider;
+  model: string;
+}
+
+const GEMINI_OPENAI_BASE_URL = "https://generativelanguage.googleapis.com/v1beta/openai/";
+const GEMINI_GOOG_API_CLIENT = "windmill-ai-evals/1.0";
+
+export function buildOpenAICompatibleClientOptions(
+  provider: Exclude<FrontendEvalProvider, "anthropic">,
+  apiKey: string
+): ConstructorParameters<typeof OpenAI>[0] {
+  if (provider === "googleai") {
+    return {
+      apiKey,
+      baseURL: GEMINI_OPENAI_BASE_URL,
+      defaultHeaders: {
+        "x-goog-api-client": GEMINI_GOOG_API_CLIENT,
+      },
+    };
+  }
+
+  return { apiKey };
+}
+
+export function createEvalClients(
+  provider: FrontendEvalProvider,
+  apiKey: string
+): EvalClients {
+  if (provider === "anthropic") {
+    return {
+      openai: new OpenAI({ apiKey: "unused" }),
+      anthropic: new Anthropic({ apiKey }),
+    };
+  }
+
+  return {
+    openai: new OpenAI(buildOpenAICompatibleClientOptions(provider, apiKey)),
+    anthropic: new Anthropic({ apiKey: "unused" }),
+  };
+}
+
+export function resolveEvalModelProvider(
+  model: string,
+  provider?: FrontendEvalProvider
+): ResolvedEvalModelProvider {
+  if (provider) {
+    return { provider, model };
+  }
+  if (model.startsWith("claude")) {
+    return { provider: "anthropic", model };
+  }
+  if (model.startsWith("gemini")) {
+    return { provider: "googleai", model };
+  }
+  if (model.startsWith("gpt") || model.startsWith("o")) {
+    return { provider: "openai", model };
+  }
+  return { provider: "openai", model };
+}
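The provider resolution in `providerConfig.ts` can be sketched without the SDK clients, assuming a three-member provider union. `Provider`, `resolveProvider`, and `buildClientOptions` are illustrative stand-ins; the `gpt`/`o` prefix branch is folded into the fallback because both return `openai`.

```typescript
// Assumed provider union for illustration; the real type comes from
// FrontendEvalModelConfig in the diff.
type Provider = 'openai' | 'anthropic' | 'googleai'

interface ResolvedModel {
	provider: Provider
	model: string
}

// An explicit provider wins; otherwise infer from the model id prefix.
function resolveProvider(model: string, provider?: Provider): ResolvedModel {
	if (provider) return { provider, model }
	if (model.startsWith('claude')) return { provider: 'anthropic', model }
	if (model.startsWith('gemini')) return { provider: 'googleai', model }
	return { provider: 'openai', model }
}

interface ClientOptions {
	apiKey: string
	baseURL?: string
	defaultHeaders?: Record<string, string>
}

// Gemini is reached through its OpenAI-compatible endpoint, so it only
// needs a different base URL and an extra header on the same client shape.
function buildClientOptions(provider: Exclude<Provider, 'anthropic'>, apiKey: string): ClientOptions {
	if (provider === 'googleai') {
		return {
			apiKey,
			baseURL: 'https://generativelanguage.googleapis.com/v1beta/openai/',
			defaultHeaders: { 'x-goog-api-client': 'windmill-ai-evals/1.0' }
		}
	}
	return { apiKey }
}

const inferred = resolveProvider('gemini-2.5-flash')
const explicit = resolveProvider('my-custom-model', 'anthropic')
const geminiOptions = buildClientOptions('googleai', 'example-key')
```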
--- a/ai_evals/adapters/frontend/core/shared/types.ts
+++ b/ai_evals/adapters/frontend/core/shared/types.ts
@@ -0,0 +1,32 @@
+import type { ChatCompletionMessageParam } from 'openai/resources/chat/completions.mjs'
+import type { AIProvider } from '$lib/gen/types.gen'
+
+export interface TokenUsage {
+	prompt: number
+	completion: number
+	total: number
+}
+
+export interface ToolCallDetail {
+	name: string
+	arguments: Record<string, unknown>
+}
+
+export interface EvalRunnerOptions {
+	maxIterations?: number
+	model?: string
+	workspace?: string
+	provider?: AIProvider
+}
+
+export interface RawEvalResult<TOutput> {
+	success: boolean
+	output: TOutput
+	error?: string
+	tokenUsage: TokenUsage
+	toolCallsCount: number
+	toolsCalled: string[]
+	toolCallDetails: ToolCallDetail[]
+	iterations: number
+	messages: ChatCompletionMessageParam[]
+}
--- a/ai_evals/adapters/frontend/mockBackend.ts
+++ b/ai_evals/adapters/frontend/mockBackend.ts
@@ -0,0 +1,270 @@
+import { randomUUID } from 'node:crypto'
+import type { CompletedJob, Flow, Script } from '../../../frontend/src/lib/gen'
+import type { ScriptLang } from '../../../frontend/src/lib/gen/types.gen'
+import { buildScriptLintResult } from './core/script/preview'
+
+const BENCHMARK_TIMESTAMP = '1970-01-01T00:00:00.000Z'
+
+export interface BenchmarkWorkspaceScript {
+	path: string
+	summary: string
+	description?: string
+	language: Script['language']
+	schema?: Record<string, unknown>
+	content: string
+}
+
+export interface BenchmarkWorkspaceFlow {
+	path: string
+	summary: string
+	description?: string
+	schema?: Record<string, unknown>
+	value: Flow['value']
+}
+
+export interface BenchmarkWorkspaceRunnables {
+	scripts?: BenchmarkWorkspaceScript[]
+	flows?: BenchmarkWorkspaceFlow[]
+}
+
+type BenchmarkCompletedJob = CompletedJob & { type: 'CompletedJob' }
+
+const benchmarkWorkspaces = new Set<string>()
+const benchmarkWorkspaceRunnables = new Map<string, BenchmarkWorkspaceRunnables>()
+const benchmarkJobs = new Map<string, { workspace: string; job: BenchmarkCompletedJob }>()
+
+export function resetBenchmarkMockBackend(): void {
+	benchmarkWorkspaces.clear()
+	benchmarkWorkspaceRunnables.clear()
+	benchmarkJobs.clear()
+}
+
+export function registerBenchmarkWorkspace(workspace: string): void {
+	benchmarkWorkspaces.add(workspace)
+}
+
+export function registerBenchmarkWorkspaceRunnables(
+	workspace: string,
+	runnables: BenchmarkWorkspaceRunnables
+): void {
+	benchmarkWorkspaces.add(workspace)
+	benchmarkWorkspaceRunnables.set(workspace, runnables)
+}
+
+export function unregisterBenchmarkWorkspace(workspace: string): void {
+	benchmarkWorkspaces.delete(workspace)
+	benchmarkWorkspaceRunnables.delete(workspace)
+	for (const [jobId, entry] of benchmarkJobs.entries()) {
+		if (entry.workspace === workspace) {
+			benchmarkJobs.delete(jobId)
+		}
+	}
+}
+
+export function unregisterBenchmarkWorkspaceRunnables(workspace: string): void {
+	unregisterBenchmarkWorkspace(workspace)
+}
+
+export function hasBenchmarkWorkspace(workspace: string): boolean {
+	return benchmarkWorkspaces.has(workspace)
+}
+
+export function listBenchmarkScripts(workspace: string): Script[] | null {
+	const runnables = benchmarkWorkspaceRunnables.get(workspace)
+	if (!runnables) {
+		return null
+	}
+	return (runnables.scripts ?? []).map(buildBenchmarkScript)
+}
+
+export function listBenchmarkFlows(workspace: string): Flow[] | null {
+	const runnables = benchmarkWorkspaceRunnables.get(workspace)
+	if (!runnables) {
+		return null
+	}
+	return (runnables.flows ?? []).map(buildBenchmarkFlow)
+}
+
+export function getBenchmarkScriptByPath(workspace: string, path: string): Script | null {
+	const script = benchmarkWorkspaceRunnables
+		.get(workspace)
+		?.scripts?.find((entry) => entry.path === path)
+
+	return script ? buildBenchmarkScript(script) : null
+}
+
+export function getBenchmarkScriptByHash(workspace: string, hash: string): Script | null {
+	const script = benchmarkWorkspaceRunnables
+		.get(workspace)
+		?.scripts?.find((entry) => buildBenchmarkScriptHash(entry.path) === hash)
+
+	return script ? buildBenchmarkScript(script) : null
+}
+
+export function getBenchmarkFlowByPath(workspace: string, path: string): Flow | null {
+	const flow = benchmarkWorkspaceRunnables
+		.get(workspace)
+		?.flows?.find((entry) => entry.path === path)
+
+	return flow ? buildBenchmarkFlow(flow) : null
+}
+
+export function createBenchmarkCompletedJob(input: {
+	workspace: string
+	jobKind: CompletedJob['job_kind']
+	success?: boolean
+	result?: unknown
+	logs?: string
+	scriptPath?: string
+	scriptHash?: string
+	args?: Record<string, unknown>
+}): string {
+	const jobId = `benchmark-job-${randomUUID()}`
+	const now = new Date().toISOString()
+	const job: BenchmarkCompletedJob = {
+		type: 'CompletedJob',
+		id: jobId,
+		workspace_id: input.workspace,
+		created_by: 'ai-evals',
+		created_at: now,
+		started_at: now,
+		completed_at: now,
+		duration_ms: 0,
+		success: input.success ?? true,
+		script_path: input.scriptPath,
+		script_hash: input.scriptHash,
+		args: input.args,
+		result: input.result,
+		logs: input.logs,
+		canceled: false,
+		job_kind: input.jobKind,
+		permissioned_as: 'u/ai-evals',
+		is_flow_step: false,
+		is_skipped: false,
+		email: 'ai-evals@local',
+		visible_to_owner: true,
+		tag: 'benchmark'
+	}
+
+	benchmarkJobs.set(jobId, { workspace: input.workspace, job })
+	return jobId
+}
+
+export function getBenchmarkCompletedJob(
+	workspace: string,
+	jobId: string
+): BenchmarkCompletedJob | null {
+	const entry = benchmarkJobs.get(jobId)
+	if (!entry || entry.workspace !== workspace) {
+		return null
+	}
+	return structuredClone(entry.job)
+}
+
+export function runBenchmarkScriptPreview(input: {
+	workspace: string
+	requestBody: {
+		content?: string
+		language?: ScriptLang | 'bunnative'
+		args?: Record<string, unknown>
+		path?: string
+	}
+}): string {
+	const content = input.requestBody.content ?? ''
+	const language = input.requestBody.language ?? 'bun'
+	const lintResult = buildScriptLintResult(content, language)
+	const success = lintResult.errorCount === 0
+
+	return createBenchmarkCompletedJob({
+		workspace: input.workspace,
+		jobKind: 'preview',
+		success,
+		scriptPath: input.requestBody.path,
+		args: input.requestBody.args,
+		result: success
+			? {
+					path: input.requestBody.path,
+					args: input.requestBody.args ?? {},
+					validated: true
+				}
+			: {
+					path: input.requestBody.path,
+					args: input.requestBody.args ?? {},
+					errorCount: lintResult.errorCount,
+					errors: lintResult.errors.map((entry) => ({
+						line: entry.startLineNumber,
+						message: entry.message
+					}))
+				}
+	})
+}
+
+export function runBenchmarkFlowByPath(input: {
+	workspace: string
+	path: string
+	args?: Record<string, unknown>
+}): string {
+	const flow = getBenchmarkFlowByPath(input.workspace, input.path)
+	return createBenchmarkCompletedJob({
+		workspace: input.workspace,
+		jobKind: 'flowpreview',
+		success: flow !== null,
+		args: input.args,
+		result:
+			flow !== null
+				? {
+						path: input.path,
+						args: input.args ?? {},
+						mocked: true
+					}
+				: {
+						error: `Flow "${input.path}" not found in benchmark workspace`
+					},
+		logs:
+			flow !== null
+				? 'Mock benchmark flow run completed successfully.'
+				: `Flow "${input.path}" not found in benchmark workspace.`
+	})
+}
+
+function buildBenchmarkScriptHash(path: string): string {
+	return `benchmark:${path}`
+}
+
+function buildBenchmarkScript(script: BenchmarkWorkspaceScript): Script {
+	return {
+		workspace_id: 'benchmark',
+		hash: buildBenchmarkScriptHash(script.path),
+		path: script.path,
+		parent_hashes: [],
+		summary: script.summary,
+		description: script.description ?? '',
+		content: script.content,
+		created_by: 'benchmark',
+		created_at: BENCHMARK_TIMESTAMP,
+		archived: false,
+		schema: script.schema ?? {},
+		deleted: false,
+		is_template: false,
+		extra_perms: {},
+		language: script.language,
+		kind: 'script',
+		starred: false,
+		has_preprocessor: false,
+		modules: null
+	}
+}
+
+function buildBenchmarkFlow(flow: BenchmarkWorkspaceFlow): Flow {
+	return {
+		path: flow.path,
+		summary: flow.summary,
+		description: flow.description ?? '',
+		value: flow.value,
+		schema: flow.schema ?? {},
+		edited_by: 'benchmark',
+		edited_at: BENCHMARK_TIMESTAMP,
+		archived: false,
+		extra_perms: {}
+	} as Flow
+}
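The job store in `mockBackend.ts` scopes lookups by workspace and hands out clones so callers cannot mutate stored state. A reduced sketch of that pattern is shown below; the `ExampleJob` shape and the counter-based ids are simplifications chosen for illustration (the diff uses `randomUUID` and the generated `CompletedJob` type).

```typescript
// Simplified job shape, assumed for this sketch only.
interface ExampleJob {
	id: string
	workspace: string
	success: boolean
	result: unknown
}

const jobs = new Map<string, ExampleJob>()
let nextJobNumber = 0 // deterministic ids for the sketch

function createJob(workspace: string, success: boolean, result: unknown): string {
	nextJobNumber += 1
	const id = `example-job-${nextJobNumber}`
	jobs.set(id, { id, workspace, success, result })
	return id
}

// A job is only visible from the workspace that created it, and the
// caller receives a deep copy rather than the stored object.
function getJob(workspace: string, id: string): ExampleJob | null {
	const job = jobs.get(id)
	if (!job || job.workspace !== workspace) return null
	return structuredClone(job)
}

// Dropping a workspace also drops every job recorded for it.
function unregisterWorkspace(workspace: string): void {
	for (const [id, job] of jobs.entries()) {
		if (job.workspace === workspace) jobs.delete(id)
	}
}

const jobId = createJob('bench-a', true, { validated: true })
const copy = getJob('bench-a', jobId)
```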
--- a/ai_evals/adapters/frontend/progress.ts
+++ b/ai_evals/adapters/frontend/progress.ts
@@ -0,0 +1,133 @@
+export type FrontendBenchmarkProgressSurface = 'flow' | 'app' | 'script'
+
+export type FrontendBenchmarkProgressEvent =
+	| {
+			type: 'run-start'
+			surface: FrontendBenchmarkProgressSurface
+			totalCases: number
+			runs: number
+			concurrency: number
+	  }
+	| {
+			type: 'attempt-start'
+			surface: FrontendBenchmarkProgressSurface
+			caseId: string
+			caseNumber: number
+			totalCases: number
+			attempt: number
+			runs: number
+	  }
+	| {
+			type: 'attempt-finish'
+			surface: FrontendBenchmarkProgressSurface
+			caseId: string
+			caseNumber: number
+			totalCases: number
+			attempt: number
+			runs: number
+			passed: boolean
+			durationMs: number
+			judgeScore: number | null
+			error: string | null
+	  }
+	| {
+			type: 'assistant-message-start'
+			surface: FrontendBenchmarkProgressSurface
+			caseId: string
+			caseNumber: number
+			totalCases: number
+			attempt: number
+			runs: number
+	  }
+	| {
+			type: 'assistant-chunk'
+			surface: FrontendBenchmarkProgressSurface
+			caseId: string
+			caseNumber: number
+			totalCases: number
+			attempt: number
+			runs: number
+			chunk: string
+	  }
+	| {
+			type: 'assistant-message-end'
+			surface: FrontendBenchmarkProgressSurface
+			caseId: string
+			caseNumber: number
+			totalCases: number
+			attempt: number
+			runs: number
+	  }
+
+export const FRONTEND_BENCHMARK_PROGRESS_PREFIX = 'WMILL_FRONTEND_AI_EVAL_PROGRESS '
+
+export function emitFrontendBenchmarkProgress(event: FrontendBenchmarkProgressEvent): void {
+	process.stderr.write(
+		`${FRONTEND_BENCHMARK_PROGRESS_PREFIX}${JSON.stringify(event)}\n`
+	)
+}
+
+export function parseFrontendBenchmarkProgressLine(
+	line: string
+): FrontendBenchmarkProgressEvent | null {
+	if (!line.startsWith(FRONTEND_BENCHMARK_PROGRESS_PREFIX)) {
+		return null
+	}
+
+	try {
+		const parsed = JSON.parse(
+			line.slice(FRONTEND_BENCHMARK_PROGRESS_PREFIX.length)
+		) as FrontendBenchmarkProgressEvent
+		return parsed?.type ? parsed : null
+	} catch {
+		return null
+	}
+}
+
+export function formatFrontendBenchmarkProgressEvent(
+	event: FrontendBenchmarkProgressEvent
+): string {
+	switch (event.type) {
+		case 'run-start':
+			return `Running ${event.surface}: ${event.totalCases} cases x ${event.runs} run${event.runs === 1 ? '' : 's'}, concurrency ${event.concurrency}`
+		case 'attempt-start':
+			return `${formatCasePrefix(event.caseNumber, event.totalCases)} ${event.caseId} attempt ${event.attempt}/${event.runs}...`
+		case 'attempt-finish': {
+			const parts = [
+				`${formatCasePrefix(event.caseNumber, event.totalCases)} ${event.caseId} attempt ${event.attempt}/${event.runs} ${event.passed ? 'pass' : 'fail'}`,
+				formatDuration(event.durationMs)
+			]
+			if (event.judgeScore !== null) {
+				parts.push(`judge ${formatNumber(event.judgeScore)}`)
+			}
+			if (event.error) {
+				parts.push(truncateSingleLine(event.error, 120))
+			}
+			return parts.join(' | ')
+		}
+		case 'assistant-message-start':
+		case 'assistant-chunk':
+		case 'assistant-message-end':
+			return ''
+	}
+}
+
+function formatCasePrefix(caseNumber: number, totalCases: number): string {
+	return `[${caseNumber}/${totalCases}]`
+}
+
+function formatDuration(durationMs: number): string {
+	return `${formatNumber(durationMs / 1000)}s`
+}
+
+function formatNumber(value: number): string {
+	return Number.isInteger(value) ? String(value) : value.toFixed(1)
+}
+
+function truncateSingleLine(value: string, maxLength: number): string {
+	const normalized = value.replace(/\s+/g, ' ').trim()
+	if (normalized.length <= maxLength) {
+		return normalized
+	}
+	return `${normalized.slice(0, Math.max(0, maxLength - 3))}...`
+}
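The progress protocol in `progress.ts` frames each event as a fixed prefix followed by one line of JSON, so the parent process can separate events from ordinary stderr output. A minimal round-trip sketch, with a hypothetical prefix and a reduced event union, might look like this:

```typescript
const EXAMPLE_PREFIX = 'EXAMPLE_PROGRESS ' // hypothetical prefix for this sketch

// Reduced event union; the diff defines more variants and fields.
type ExampleProgressEvent =
	| { type: 'attempt-start'; caseId: string; attempt: number }
	| { type: 'attempt-finish'; caseId: string; attempt: number; passed: boolean }

// One event per line: prefix, JSON payload, newline.
function encodeProgressLine(event: ExampleProgressEvent): string {
	return `${EXAMPLE_PREFIX}${JSON.stringify(event)}\n`
}

// Returns null for lines that are not progress events or fail to parse,
// so the caller can pass them through unchanged.
function parseProgressLine(line: string): ExampleProgressEvent | null {
	if (!line.startsWith(EXAMPLE_PREFIX)) return null
	try {
		const parsed = JSON.parse(line.slice(EXAMPLE_PREFIX.length)) as ExampleProgressEvent | null
		return parsed?.type ? parsed : null
	} catch {
		return null
	}
}

const encoded = encodeProgressLine({ type: 'attempt-finish', caseId: 'case-1', attempt: 1, passed: true })
const decoded = parseProgressLine(encoded.trimEnd())
```

Note that the cast only asserts the shape; neither this sketch nor the diff validates individual fields beyond the presence of `type`.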
--- a/ai_evals/adapters/frontend/runtime.ts
+++ b/ai_evals/adapters/frontend/runtime.ts
@@ -0,0 +1,216 @@
+import { spawn } from 'node:child_process'
+import { mkdtemp, readFile, rm } from 'node:fs/promises'
+import { tmpdir } from 'node:os'
+import path from 'node:path'
+import { fileURLToPath } from 'node:url'
+import {
+	formatFrontendBenchmarkProgressEvent,
+	parseFrontendBenchmarkProgressLine
+} from './progress'
+import type { BenchmarkRunResult } from '../../core/types'
+
+const REPO_ROOT = fileURLToPath(new URL('../../../', import.meta.url))
+const FRONTEND_DIR = path.join(REPO_ROOT, 'frontend')
+const FRONTEND_BENCHMARK_TEST = '../ai_evals/adapters/frontend/vitestAdapter.test.ts'
+const FRONTEND_BENCHMARK_CONFIG = '../ai_evals/adapters/frontend/vitest.config.ts'
+
+export type FrontendMode = 'flow' | 'app' | 'script'
+
+export async function runFrontendBenchmarkAdapter(input: {
+	mode: FrontendMode
+	caseIds: string[]
+	runs: number
+	model?: string
+	verbose?: boolean
+}): Promise<BenchmarkRunResult> {
+	const tempDir = await mkdtemp(path.join(tmpdir(), 'wmill-frontend-benchmark-'))
+	const outputPath = path.join(tempDir, 'result.json')
+
+	try {
+		await runVitestBenchmark(
+			path.join(FRONTEND_DIR, 'node_modules', '.bin', 'vitest'),
+			[
+				'run',
+				FRONTEND_BENCHMARK_TEST,
+				'--project',
+				'server',
+				'--config',
+				FRONTEND_BENCHMARK_CONFIG
+			],
+			{
+				cwd: FRONTEND_DIR,
+				env: {
+					...process.env,
+					BROWSERSLIST_IGNORE_OLD_DATA: '1',
+					WMILL_FRONTEND_AI_EVAL_OUTPUT_PATH: outputPath,
+					WMILL_FRONTEND_AI_EVAL_MODE: input.mode,
+					WMILL_FRONTEND_AI_EVAL_CASE_IDS: JSON.stringify(input.caseIds),
+					WMILL_FRONTEND_AI_EVAL_RUNS: String(input.runs),
+					WMILL_FRONTEND_AI_EVAL_MODEL: input.model ?? '',
+					WMILL_FRONTEND_AI_EVAL_PROGRESS: '1',
+					WMILL_FRONTEND_AI_EVAL_VERBOSE: input.verbose ? '1' : '0'
+				}
+			}
+		)
+
+		const raw = await readFile(outputPath, 'utf8')
+		return JSON.parse(raw) as BenchmarkRunResult
+	} catch (error) {
+		throw new Error(`Frontend benchmark adapter failed:\n${toErrorMessage(error)}`)
+	} finally {
+		await rm(tempDir, { recursive: true, force: true })
+	}
+}
+
+async function runVitestBenchmark(
+	command: string,
+	args: string[],
+	options: {
+		cwd: string
+		env: NodeJS.ProcessEnv
+	}
+): Promise<void> {
+	const child = spawn(command, args, {
+		cwd: options.cwd,
+		env: options.env,
+		stdio: ['ignore', 'pipe', 'pipe']
+	})
+
+	let stdout = ''
+	let stderr = ''
+	let stderrLineBuffer = ''
+	let assistantStreamOpen = false
+
+	child.stdout?.setEncoding('utf8')
+	child.stdout?.on('data', (chunk: string) => {
+		stdout += chunk
+	})
+
+	child.stderr?.setEncoding('utf8')
+	child.stderr?.on('data', (chunk: string) => {
+		stderrLineBuffer += chunk
+		const { remainder, passthrough, nextAssistantStreamOpen } = drainProgressLines(
+			stderrLineBuffer,
+			assistantStreamOpen
+		)
+		stderrLineBuffer = remainder
+		stderr += passthrough
+		assistantStreamOpen = nextAssistantStreamOpen
+	})
+
+	await new Promise<void>((resolve, reject) => {
+		child.once('error', reject)
+		child.once('close', (code) => {
+			if (stderrLineBuffer.length > 0) {
+				const {
+					remainder,
+					passthrough,
+					nextAssistantStreamOpen
+				} = drainProgressLines(`${stderrLineBuffer}\n`, assistantStreamOpen)
+				stderrLineBuffer = remainder
+				stderr += passthrough
+				assistantStreamOpen = nextAssistantStreamOpen
+			}
+
+			if (code === 0) {
+				if (assistantStreamOpen) {
+					process.stderr.write('\n')
+				}
+				resolve()
+				return
+			}
+
+			const details = [`vitest exited with code ${code}`, stdout, stderr].filter(Boolean).join('\n')
+			reject(new Error(details))
+		})
+	})
+}
+
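+/**
+ * Splits complete lines off `buffer`, renders progress events to stderr,
+ * and returns the unterminated remainder plus any passthrough stderr text.
+ * `initialAssistantStreamOpen` carries the streaming state across chunks.
+ */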
+function drainProgressLines(
+	buffer: string,
+	initialAssistantStreamOpen: boolean
+): {
+	remainder: string
+	passthrough: string
+	nextAssistantStreamOpen: boolean
+} {
+	let remainder = buffer
+	let passthrough = ''
+	let assistantStreamOpen = initialAssistantStreamOpen
+
+	while (true) {
+		const newlineIndex = remainder.indexOf('\n')
+		if (newlineIndex === -1) {
+			return { remainder, passthrough, nextAssistantStreamOpen: assistantStreamOpen }
+		}
+
+		const line = remainder.slice(0, newlineIndex).replace(/\r$/, '')
+		remainder = remainder.slice(newlineIndex + 1)
+
+		const progressEvent = parseFrontendBenchmarkProgressLine(line)
+		if (progressEvent) {
+			if (progressEvent.type === 'assistant-message-start') {
+				if (assistantStreamOpen) {
+					process.stderr.write('\n')
+				}
+				process.stderr.write(
+					`${formatCasePrefix(progressEvent.caseNumber, progressEvent.totalCases)} ${progressEvent.caseId} attempt ${progressEvent.attempt}/${progressEvent.runs} assistant:\n`
+				)
+				assistantStreamOpen = true
+				continue
+			}
+
+			if (progressEvent.type === 'assistant-chunk') {
+				process.stderr.write(progressEvent.chunk)
+				continue
+			}
+
+			if (progressEvent.type === 'assistant-message-end') {
+				if (assistantStreamOpen) {
+					process.stderr.write('\n')
+				}
+				assistantStreamOpen = false
+				continue
+			}
+
+			if (assistantStreamOpen) {
+				process.stderr.write('\n')
+				assistantStreamOpen = false
+			}
+			process.stderr.write(`${formatFrontendBenchmarkProgressEvent(progressEvent)}\n`)
+			continue
+		}
+
+		if (shouldSuppressFrontendStderrLine(line)) {
+			continue
+		}
+
+		passthrough += `${line}\n`
+		process.stderr.write(`${line}\n`)
+	}
+}
+
+function formatCasePrefix(caseNumber: number, totalCases: number): string {
+	return `[${caseNumber}/${totalCases}]`
+}
+
+function shouldSuppressFrontendStderrLine(line: string): boolean {
+	return (
+		line.startsWith('[baseline-browser-mapping] ') ||
+		line.startsWith('Browserslist: browsers data (caniuse-lite) is ') ||
+		line.includes('update-browserslist-db@latest') ||
+		line.includes('update-db#readme')
+	)
+}
+
+function toErrorMessage(error: unknown): string {
+	if (error instanceof Error) {
+		return error.message
+	}
+	return String(error)
+}
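`runVitestBenchmark` receives stderr in arbitrary chunks, so lines can be split across `data` events. The buffering approach can be sketched on its own as follows; `drainLines` is an illustrative reduction that only splits lines and leaves out the progress-event handling.

```typescript
// Splits every complete line off the front of `buffer` and returns the
// unterminated tail so it can be prepended to the next chunk.
function drainLines(buffer: string): { lines: string[]; remainder: string } {
	const lines: string[] = []
	let remainder = buffer
	while (true) {
		const newlineIndex = remainder.indexOf('\n')
		if (newlineIndex === -1) {
			return { lines, remainder }
		}
		// Strip a trailing carriage return so CRLF input yields clean lines.
		lines.push(remainder.slice(0, newlineIndex).replace(/\r$/, ''))
		remainder = remainder.slice(newlineIndex + 1)
	}
}

// Usage: simulate three chunks that cut lines at arbitrary positions.
const received: string[] = []
let pending = ''
for (const chunk of ['first li', 'ne\r\nsecond', ' line\npartial']) {
	pending += chunk
	const drained = drainLines(pending)
	received.push(...drained.lines)
	pending = drained.remainder
}
```

Whatever is left in `pending` when the stream closes still needs a final flush, which the diff does by appending a newline in the `close` handler.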
--- a/ai_evals/adapters/frontend/vitest.config.ts
+++ b/ai_evals/adapters/frontend/vitest.config.ts
@@ -0,0 +1,28 @@
+import { fileURLToPath } from 'node:url'
+import frontendConfig from '../../../frontend/vite.config.js'
+
+const FRONTEND_VITE_CONFIG_PATH = fileURLToPath(new URL('../../../frontend/vite.config.js', import.meta.url))
+const FRONTEND_TEST_SETUP_PATH = fileURLToPath(
+	new URL('../../../frontend/src/lib/test-setup.ts', import.meta.url)
+)
+const ADAPTER_TEST_PATH = fileURLToPath(new URL('./vitestAdapter.test.ts', import.meta.url))
+
+const config = {
+	...frontendConfig,
+	test: {
+		...frontendConfig.test,
+		projects: [
+			{
+				extends: FRONTEND_VITE_CONFIG_PATH,
+				test: {
+					name: 'server',
+					environment: 'node',
+					include: [ADAPTER_TEST_PATH],
+					setupFiles: [FRONTEND_TEST_SETUP_PATH]
+				}
+			}
+		]
+	}
+}
+
+export default config
--- a/ai_evals/adapters/frontend/vitestAdapter.test.ts
+++ b/ai_evals/adapters/frontend/vitestAdapter.test.ts
@@ -0,0 +1,165 @@
+import { expect, it, vi } from 'vitest'
+// @ts-ignore - Node.js fs/promises
+import { mkdir, writeFile } from 'fs/promises'
+// @ts-ignore - Node.js path
+import { dirname, resolve } from 'path'
+
+vi.mock('monaco-editor', () => ({
+	editor: {},
+	languages: {},
+	KeyCode: {},
+	Uri: {
+		parse: (value: string) => ({ toString: () => value })
+	},
+	MarkerSeverity: {
+		Error: 8,
+		Warning: 4,
+		Info: 2,
+		Hint: 1
+	}
+}))
+
+vi.mock('@codingame/monaco-vscode-standalone-typescript-language-features', () => ({
+	getTypeScriptWorker: async () => async () => ({}),
+	typescriptVersion: 'test'
+}))
+
+vi.mock('@codingame/monaco-vscode-languages-service-override', () => ({
+	default: () => ({})
+}))
+
+vi.mock('$lib/components/vscode', () => ({}))
+
+vi.mock('$lib/gen', async () => {
+	const actual = await vi.importActual<any>('$lib/gen')
+	const {
+		getBenchmarkCompletedJob,
+		getBenchmarkFlowByPath,
+		getBenchmarkScriptByHash,
+		getBenchmarkScriptByPath,
+		hasBenchmarkWorkspace,
+		listBenchmarkFlows,
+		listBenchmarkScripts,
+		runBenchmarkFlowByPath,
+		runBenchmarkScriptPreview
+	} = await import('./mockBackend')
+
+	function wrapService<T extends object>(target: T, overrides: Record<string, unknown>): T {
+		return new Proxy(target, {
+			get(source, property, receiver) {
+				if (typeof property === 'string' && property in overrides) {
+					return overrides[property]
+				}
+				return Reflect.get(source, property, receiver)
+			}
+		})
+	}
+
+	return {
+		...actual,
+		ScriptService: wrapService(actual.ScriptService, {
+			listScripts: async (data: { workspace: string }) =>
+				hasBenchmarkWorkspace(data.workspace)
+					? (listBenchmarkScripts(data.workspace) ?? [])
+					: actual.ScriptService.listScripts(data),
+			getScriptByPath: async (data: { workspace: string; path: string }) => {
+				if (hasBenchmarkWorkspace(data.workspace)) {
+					const script = getBenchmarkScriptByPath(data.workspace, data.path)
+					if (!script) {
+						throw new Error(`Script "${data.path}" not found in benchmark workspace`)
+					}
+					return script
+				}
+				return actual.ScriptService.getScriptByPath(data)
+			},
+			getScriptByHash: async (data: { workspace: string; hash: string }) => {
+				if (hasBenchmarkWorkspace(data.workspace)) {
+					const script = getBenchmarkScriptByHash(data.workspace, data.hash)
+					if (!script) {
+						throw new Error(`Script hash "${data.hash}" not found in benchmark workspace`)
+					}
+					return script
+				}
+				return actual.ScriptService.getScriptByHash(data)
+			}
+		}),
+		FlowService: wrapService(actual.FlowService, {
+			listFlows: async (data: { workspace: string }) =>
+				hasBenchmarkWorkspace(data.workspace)
+					? (listBenchmarkFlows(data.workspace) ?? [])
+					: actual.FlowService.listFlows(data),
+			getFlowByPath: async (data: { workspace: string; path: string }) => {
+				if (hasBenchmarkWorkspace(data.workspace)) {
+					const flow = getBenchmarkFlowByPath(data.workspace, data.path)
+					if (!flow) {
+						throw new Error(`Flow "${data.path}" not found in benchmark workspace`)
+					}
+					return flow
+				}
+				return actual.FlowService.getFlowByPath(data)
+			}
+		}),
+		JobService: wrapService(actual.JobService, {
+			runScriptPreview: async (data: {
+				workspace: string
+				requestBody?: {
+					content?: string
+					language?: string
+					args?: Record<string, unknown>
+					path?: string
+				}
+			}) =>
+				hasBenchmarkWorkspace(data.workspace)
+					? runBenchmarkScriptPreview({
+							workspace: data.workspace,
+							requestBody: data.requestBody ?? {}
+						})
+					: actual.JobService.runScriptPreview(data),
+			runFlowByPath: async (data: {
+				workspace: string
+				path: string
+				requestBody?: Record<string, unknown>
+			}) =>
+				hasBenchmarkWorkspace(data.workspace)
+					? runBenchmarkFlowByPath({
+							workspace: data.workspace,
+							path: data.path,
+							args: data.requestBody
+						})
+					: actual.JobService.runFlowByPath(data),
+			getJob: async (data: { workspace: string; id: string }) => {
+				if (hasBenchmarkWorkspace(data.workspace)) {
+					const job = getBenchmarkCompletedJob(data.workspace, data.id)
+					if (!job) {
+						throw new Error(`Job "${data.id}" not found in benchmark workspace`)
+					}
+					return job
+				}
+				return actual.JobService.getJob(data)
+			}
+		})
+	}
+})
+
+const benchmarkOutputPath = process.env.WMILL_FRONTEND_AI_EVAL_OUTPUT_PATH
+const benchmarkIt = benchmarkOutputPath ? it : it.skip
+
+benchmarkIt(
+	'runs the frontend benchmark adapter from environment input',
+	async () => {
+		const { resetBenchmarkMockBackend } = await import('./mockBackend')
+		resetBenchmarkMockBackend()
+		const { runFrontendBenchmarkFromEnv } = await import('./benchmarkRunner')
+		try {
+			const payload = await runFrontendBenchmarkFromEnv()
+			const absoluteOutputPath = resolve(benchmarkOutputPath!)
+			await mkdir(dirname(absoluteOutputPath), { recursive: true })
+			await writeFile(absoluteOutputPath, JSON.stringify(payload, null, 2) + '\n', 'utf8')
+
+			expect(payload.cases.length).toBeGreaterThan(0)
+		} finally {
+			resetBenchmarkMockBackend()
+		}
+	},
+	600_000
+)
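The `wrapService` helper in the test file overrides selected methods through a `Proxy` and forwards everything else to the real service. The sketch below shows the same pattern against a hypothetical synchronous service; `realService` and `mockedWorkspaces` are stand-ins for the generated API client and the benchmark workspace registry.

```typescript
// Returns a proxy that serves overridden members first and falls back
// to the wrapped target for every other property.
function wrapService<T extends object>(target: T, overrides: Record<string, unknown>): T {
	return new Proxy(target, {
		get(source, property, receiver) {
			if (typeof property === 'string' && property in overrides) {
				return overrides[property]
			}
			return Reflect.get(source, property, receiver)
		}
	})
}

// Hypothetical service standing in for a generated API client.
const realService = {
	listItems: (workspace: string): string[] => [`remote:${workspace}`],
	ping: (): string => 'pong'
}

const mockedWorkspaces = new Set(['bench'])

// Only registered workspaces are served from fixtures; other workspaces
// still reach the real implementation.
const service = wrapService(realService, {
	listItems: (workspace: string): string[] =>
		mockedWorkspaces.has(workspace) ? ['local:fixture'] : realService.listItems(workspace)
})
```

Routing on the workspace inside each override keeps the mock opt-in, so code paths that use unregistered workspaces behave as before.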
--- a/ai_evals/bun.lock
+++ b/ai_evals/bun.lock
@@ -0,0 +1,313 @@
+{
+  "lockfileVersion": 1,
+  "configVersion": 1,
+  "workspaces": {
+    "": {
+      "name": "windmill-ai-evals",
+      "dependencies": {
+        "@anthropic-ai/claude-agent-sdk": "^0.2.25",
+        "@anthropic-ai/sdk": "^0.39.0",
+        "commander": "^14.0.3",
+        "openai": "^6.9.1",
+        "yaml": "^2.8.3",
+      },
+      "devDependencies": {
+        "@types/bun": "latest",
+        "typescript": "^5.0.0",
+      },
+    },
+  },
+  "packages": {
+    "@anthropic-ai/claude-agent-sdk": ["@anthropic-ai/claude-agent-sdk@0.2.87", "", { "dependencies": { "@anthropic-ai/sdk": "^0.74.0", "@modelcontextprotocol/sdk": "^1.27.1" }, "optionalDependencies": { "@img/sharp-darwin-arm64": "^0.34.2", "@img/sharp-darwin-x64": "^0.34.2", "@img/sharp-linux-arm": "^0.34.2", "@img/sharp-linux-arm64": "^0.34.2", "@img/sharp-linux-x64": "^0.34.2", "@img/sharp-linuxmusl-arm64": "^0.34.2", "@img/sharp-linuxmusl-x64": "^0.34.2", "@img/sharp-win32-arm64": "^0.34.2", "@img/sharp-win32-x64": "^0.34.2" }, "peerDependencies": { "zod": "^4.0.0" } }, "sha512-WWmgBPxPhBOvNT0ujI8vPTI2lK+w5YEkEZ/y1mH0EDkK/0kBnxVJNhCtG5vnueiAViwLoUOFn66pbkDiivijdA=="],
+
+    "@anthropic-ai/sdk": ["@anthropic-ai/sdk@0.39.0", "", { "dependencies": { "@types/node": "^18.11.18", "@types/node-fetch": "^2.6.4", "abort-controller": "^3.0.0", "agentkeepalive": "^4.2.1", "form-data-encoder": "1.7.2", "formdata-node": "^4.3.2", "node-fetch": "^2.6.7" } }, "sha512-eMyDIPRZbt1CCLErRCi3exlAvNkBtRe+kW5vvJyef93PmNr/clstYgHhtvmkxN82nlKgzyGPCyGxrm0JQ1ZIdg=="],
+
+    "@babel/runtime": ["@babel/runtime@7.29.2", "", {}, "sha512-JiDShH45zKHWyGe4ZNVRrCjBz8Nh9TMmZG1kh4QTK8hCBTWBi8Da+i7s1fJw7/lYpM4ccepSNfqzZ/QvABBi5g=="],
+
+    "@hono/node-server": ["@hono/node-server@1.19.12", "", { "peerDependencies": { "hono": "^4" } }, "sha512-txsUW4SQ1iilgE0l9/e9VQWmELXifEFvmdA1j6WFh/aFPj99hIntrSsq/if0UWyGVkmrRPKA1wCeP+UCr1B9Uw=="],
+
+    "@img/sharp-darwin-arm64": ["@img/sharp-darwin-arm64@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-darwin-arm64": "1.2.4" }, "os": "darwin", "cpu": "arm64" }, "sha512-imtQ3WMJXbMY4fxb/Ndp6HBTNVtWCUI0WdobyheGf5+ad6xX8VIDO8u2xE4qc/fr08CKG/7dDseFtn6M6g/r3w=="],
+
+    "@img/sharp-darwin-x64": ["@img/sharp-darwin-x64@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-darwin-x64": "1.2.4" }, "os": "darwin", "cpu": "x64" }, "sha512-YNEFAF/4KQ/PeW0N+r+aVVsoIY0/qxxikF2SWdp+NRkmMB7y9LBZAVqQ4yhGCm/H3H270OSykqmQMKLBhBJDEw=="],
+
+    "@img/sharp-libvips-darwin-arm64": ["@img/sharp-libvips-darwin-arm64@1.2.4", "", { "os": "darwin", "cpu": "arm64" }, "sha512-zqjjo7RatFfFoP0MkQ51jfuFZBnVE2pRiaydKJ1G/rHZvnsrHAOcQALIi9sA5co5xenQdTugCvtb1cuf78Vf4g=="],
+
+    "@img/sharp-libvips-darwin-x64": ["@img/sharp-libvips-darwin-x64@1.2.4", "", { "os": "darwin", "cpu": "x64" }, "sha512-1IOd5xfVhlGwX+zXv2N93k0yMONvUlANylbJw1eTah8K/Jtpi15KC+WSiaX/nBmbm2HxRM1gZ0nSdjSsrZbGKg=="],
+
+    "@img/sharp-libvips-linux-arm": ["@img/sharp-libvips-linux-arm@1.2.4", "", { "os": "linux", "cpu": "arm" }, "sha512-bFI7xcKFELdiNCVov8e44Ia4u2byA+l3XtsAj+Q8tfCwO6BQ8iDojYdvoPMqsKDkuoOo+X6HZA0s0q11ANMQ8A=="],
+
+    "@img/sharp-libvips-linux-arm64": ["@img/sharp-libvips-linux-arm64@1.2.4", "", { "os": "linux", "cpu": "arm64" }, "sha512-excjX8DfsIcJ10x1Kzr4RcWe1edC9PquDRRPx3YVCvQv+U5p7Yin2s32ftzikXojb1PIFc/9Mt28/y+iRklkrw=="],
+
+    "@img/sharp-libvips-linux-x64": ["@img/sharp-libvips-linux-x64@1.2.4", "", { "os": "linux", "cpu": "x64" }, "sha512-tJxiiLsmHc9Ax1bz3oaOYBURTXGIRDODBqhveVHonrHJ9/+k89qbLl0bcJns+e4t4rvaNBxaEZsFtSfAdquPrw=="],
+
+    "@img/sharp-libvips-linuxmusl-arm64": ["@img/sharp-libvips-linuxmusl-arm64@1.2.4", "", { "os": "linux", "cpu": "arm64" }, "sha512-FVQHuwx1IIuNow9QAbYUzJ+En8KcVm9Lk5+uGUQJHaZmMECZmOlix9HnH7n1TRkXMS0pGxIJokIVB9SuqZGGXw=="],
+
+    "@img/sharp-libvips-linuxmusl-x64": ["@img/sharp-libvips-linuxmusl-x64@1.2.4", "", { "os": "linux", "cpu": "x64" }, "sha512-+LpyBk7L44ZIXwz/VYfglaX/okxezESc6UxDSoyo2Ks6Jxc4Y7sGjpgU9s4PMgqgjj1gZCylTieNamqA1MF7Dg=="],
+
+    "@img/sharp-linux-arm": ["@img/sharp-linux-arm@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-linux-arm": "1.2.4" }, "os": "linux", "cpu": "arm" }, "sha512-9dLqsvwtg1uuXBGZKsxem9595+ujv0sJ6Vi8wcTANSFpwV/GONat5eCkzQo/1O6zRIkh0m/8+5BjrRr7jDUSZw=="],
+
+    "@img/sharp-linux-arm64": ["@img/sharp-linux-arm64@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-linux-arm64": "1.2.4" }, "os": "linux", "cpu": "arm64" }, "sha512-bKQzaJRY/bkPOXyKx5EVup7qkaojECG6NLYswgktOZjaXecSAeCWiZwwiFf3/Y+O1HrauiE3FVsGxFg8c24rZg=="],
+
+    "@img/sharp-linux-x64": ["@img/sharp-linux-x64@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-linux-x64": "1.2.4" }, "os": "linux", "cpu": "x64" }, "sha512-MEzd8HPKxVxVenwAa+JRPwEC7QFjoPWuS5NZnBt6B3pu7EG2Ge0id1oLHZpPJdn3OQK+BQDiw9zStiHBTJQQQQ=="],
+
+    "@img/sharp-linuxmusl-arm64": ["@img/sharp-linuxmusl-arm64@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-linuxmusl-arm64": "1.2.4" }, "os": "linux", "cpu": "arm64" }, "sha512-fprJR6GtRsMt6Kyfq44IsChVZeGN97gTD331weR1ex1c1rypDEABN6Tm2xa1wE6lYb5DdEnk03NZPqA7Id21yg=="],
+
+    "@img/sharp-linuxmusl-x64": ["@img/sharp-linuxmusl-x64@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-linuxmusl-x64": "1.2.4" }, "os": "linux", "cpu": "x64" }, "sha512-Jg8wNT1MUzIvhBFxViqrEhWDGzqymo3sV7z7ZsaWbZNDLXRJZoRGrjulp60YYtV4wfY8VIKcWidjojlLcWrd8Q=="],
+
+    "@img/sharp-win32-arm64": ["@img/sharp-win32-arm64@0.34.5", "", { "os": "win32", "cpu": "arm64" }, "sha512-WQ3AgWCWYSb2yt+IG8mnC6Jdk9Whs7O0gxphblsLvdhSpSTtmu69ZG1Gkb6NuvxsNACwiPV6cNSZNzt0KPsw7g=="],
+
+    "@img/sharp-win32-x64": ["@img/sharp-win32-x64@0.34.5", "", { "os": "win32", "cpu": "x64" }, "sha512-+29YMsqY2/9eFEiW93eqWnuLcWcufowXewwSNIT6UwZdUUCrM3oFjMWH/Z6/TMmb4hlFenmfAVbpWeup2jryCw=="],
+
+    "@modelcontextprotocol/sdk": ["@modelcontextprotocol/sdk@1.29.0", "", { "dependencies": { "@hono/node-server": "^1.19.9", "ajv": "^8.17.1", "ajv-formats": "^3.0.1", "content-type": "^1.0.5", "cors": "^2.8.5", "cross-spawn": "^7.0.5", "eventsource": "^3.0.2", "eventsource-parser": "^3.0.0", "express": "^5.2.1", "express-rate-limit": "^8.2.1", "hono": "^4.11.4", "jose": "^6.1.3", "json-schema-typed": "^8.0.2", "pkce-challenge": "^5.0.0", "raw-body": "^3.0.0", "zod": "^3.25 || ^4.0", "zod-to-json-schema": "^3.25.1" }, "peerDependencies": { "@cfworker/json-schema": "^4.1.1" }, "optionalPeers": ["@cfworker/json-schema"] }, "sha512-zo37mZA9hJWpULgkRpowewez1y6ML5GsXJPY8FI0tBBCd77HEvza4jDqRKOXgHNn867PVGCyTdzqpz0izu5ZjQ=="],
+
+    "@types/bun": ["@types/bun@1.3.11", "", { "dependencies": { "bun-types": "1.3.11" } }, "sha512-5vPne5QvtpjGpsGYXiFyycfpDF2ECyPcTSsFBMa0fraoxiQyMJ3SmuQIGhzPg2WJuWxVBoxWJ2kClYTcw/4fAg=="],
+
+    "@types/node": ["@types/node@18.19.130", "", { "dependencies": { "undici-types": "~5.26.4" } }, "sha512-GRaXQx6jGfL8sKfaIDD6OupbIHBr9jv7Jnaml9tB7l4v068PAOXqfcujMMo5PhbIs6ggR1XODELqahT2R8v0fg=="],
+
+    "@types/node-fetch": ["@types/node-fetch@2.6.13", "", { "dependencies": { "@types/node": "*", "form-data": "^4.0.4" } }, "sha512-QGpRVpzSaUs30JBSGPjOg4Uveu384erbHBoT1zeONvyCfwQxIkUshLAOqN/k9EjGviPRmWTTe6aH2qySWKTVSw=="],
+
+    "abort-controller": ["abort-controller@3.0.0", "", { "dependencies": { "event-target-shim": "^5.0.0" } }, "sha512-h8lQ8tacZYnR3vNQTgibj+tODHI5/+l06Au2Pcriv/Gmet0eaj4TwWH41sO9wnHDiQsEj19q0drzdWdeAHtweg=="],
+
+    "accepts": ["accepts@2.0.0", "", { "dependencies": { "mime-types": "^3.0.0", "negotiator": "^1.0.0" } }, "sha512-5cvg6CtKwfgdmVqY1WIiXKc3Q1bkRqGLi+2W/6ao+6Y7gu/RCwRuAhGEzh5B4KlszSuTLgZYuqFqo5bImjNKng=="],
+
+    "agentkeepalive": ["agentkeepalive@4.6.0", "", { "dependencies": { "humanize-ms": "^1.2.1" } }, "sha512-kja8j7PjmncONqaTsB8fQ+wE2mSU2DJ9D4XKoJ5PFWIdRMa6SLSN1ff4mOr4jCbfRSsxR4keIiySJU0N9T5hIQ=="],
+
+    "ajv": ["ajv@8.18.0", "", { "dependencies": { "fast-deep-equal": "^3.1.3", "fast-uri": "^3.0.1", "json-schema-traverse": "^1.0.0", "require-from-string": "^2.0.2" } }, "sha512-PlXPeEWMXMZ7sPYOHqmDyCJzcfNrUr3fGNKtezX14ykXOEIvyK81d+qydx89KY5O71FKMPaQ2vBfBFI5NHR63A=="],
+
+    "ajv-formats": ["ajv-formats@3.0.1", "", { "dependencies": { "ajv": "^8.0.0" } }, "sha512-8iUql50EUR+uUcdRQ3HDqa6EVyo3docL8g5WJ3FNcWmu62IbkGUue/pEyLBW8VGKKucTPgqeks4fIU1DA4yowQ=="],
+
+    "asynckit": ["asynckit@0.4.0", "", {}, "sha512-Oei9OH4tRh0YqU3GxhX79dM/mwVgvbZJaSNaRk+bshkj0S5cfHcgYakreBjrHwatXKbz+IoIdYLxrKim2MjW0Q=="],
+
+    "body-parser": ["body-parser@2.2.2", "", { "dependencies": { "bytes": "^3.1.2", "content-type": "^1.0.5", "debug": "^4.4.3", "http-errors": "^2.0.0", "iconv-lite": "^0.7.0", "on-finished": "^2.4.1", "qs": "^6.14.1", "raw-body": "^3.0.1", "type-is": "^2.0.1" } }, "sha512-oP5VkATKlNwcgvxi0vM0p/D3n2C3EReYVX+DNYs5TjZFn/oQt2j+4sVJtSMr18pdRr8wjTcBl6LoV+FUwzPmNA=="],
+
+    "bun-types": ["bun-types@1.3.11", "", { "dependencies": { "@types/node": "*" } }, "sha512-1KGPpoxQWl9f6wcZh57LvrPIInQMn2TQ7jsgxqpRzg+l0QPOFvJVH7HmvHo/AiPgwXy+/Thf6Ov3EdVn1vOabg=="],
+
+    "bytes": ["bytes@3.1.2", "", {}, "sha512-/Nf7TyzTx6S3yRJObOAV7956r8cr2+Oj8AC5dt8wSP3BQAoeX58NoHyCU8P8zGkNXStjTSi6fzO6F0pBdcYbEg=="],
+
+    "call-bind-apply-helpers": ["call-bind-apply-helpers@1.0.2", "", { "dependencies": { "es-errors": "^1.3.0", "function-bind": "^1.1.2" } }, "sha512-Sp1ablJ0ivDkSzjcaJdxEunN5/XvksFJ2sMBFfq6x0ryhQV/2b/KwFe21cMpmHtPOSij8K99/wSfoEuTObmuMQ=="],
+
+    "call-bound": ["call-bound@1.0.4", "", { "dependencies": { "call-bind-apply-helpers": "^1.0.2", "get-intrinsic": "^1.3.0" } }, "sha512-+ys997U96po4Kx/ABpBCqhA9EuxJaQWDQg7295H4hBphv3IZg0boBKuwYpt4YXp6MZ5AmZQnU/tyMTlRpaSejg=="],
+
+    "combined-stream": ["combined-stream@1.0.8", "", { "dependencies": { "delayed-stream": "~1.0.0" } }, "sha512-FQN4MRfuJeHf7cBbBMJFXhKSDq+2kAArBlmRBvcvFE5BB1HZKXtSFASDhdlz9zOYwxh8lDdnvmMOe/+5cdoEdg=="],
+
+    "commander": ["commander@14.0.3", "", {}, "sha512-H+y0Jo/T1RZ9qPP4Eh1pkcQcLRglraJaSLoyOtHxu6AapkjWVCy2Sit1QQ4x3Dng8qDlSsZEet7g5Pq06MvTgw=="],
+
+    "content-disposition": ["content-disposition@1.0.1", "", {}, "sha512-oIXISMynqSqm241k6kcQ5UwttDILMK4BiurCfGEREw6+X9jkkpEe5T9FZaApyLGGOnFuyMWZpdolTXMtvEJ08Q=="],
+
+    "content-type": ["content-type@1.0.5", "", {}, "sha512-nTjqfcBFEipKdXCv4YDQWCfmcLZKm81ldF0pAopTvyrFGVbcR6P/VAAd5G7N+0tTr8QqiU0tFadD6FK4NtJwOA=="],
+
+    "cookie": ["cookie@0.7.2", "", {}, "sha512-yki5XnKuf750l50uGTllt6kKILY4nQ1eNIQatoXEByZ5dWgnKqbnqmTrBE5B4N7lrMJKQ2ytWMiTO2o0v6Ew/w=="],
+
+    "cookie-signature": ["cookie-signature@1.2.2", "", {}, "sha512-D76uU73ulSXrD1UXF4KE2TMxVVwhsnCgfAyTg9k8P6KGZjlXKrOLe4dJQKI3Bxi5wjesZoFXJWElNWBjPZMbhg=="],
+
+    "cors": ["cors@2.8.6", "", { "dependencies": { "object-assign": "^4", "vary": "^1" } }, "sha512-tJtZBBHA6vjIAaF6EnIaq6laBBP9aq/Y3ouVJjEfoHbRBcHBAHYcMh/w8LDrk2PvIMMq8gmopa5D4V8RmbrxGw=="],
+
+    "cross-spawn": ["cross-spawn@7.0.6", "", { "dependencies": { "path-key": "^3.1.0", "shebang-command": "^2.0.0", "which": "^2.0.1" } }, "sha512-uV2QOWP2nWzsy2aMp8aRibhi9dlzF5Hgh5SHaB9OiTGEyDTiJJyx0uy51QXdyWbtAHNua4XJzUKca3OzKUd3vA=="],
+
+    "debug": ["debug@4.4.3", "", { "dependencies": { "ms": "^2.1.3" } }, "sha512-RGwwWnwQvkVfavKVt22FGLw+xYSdzARwm0ru6DhTVA3umU5hZc28V3kO4stgYryrTlLpuvgI9GiijltAjNbcqA=="],
+
+    "delayed-stream": ["delayed-stream@1.0.0", "", {}, "sha512-ZySD7Nf91aLB0RxL4KGrKHBXl7Eds1DAmEdcoVawXnLD7SDhpNgtuII2aAkg7a7QS41jxPSZ17p4VdGnMHk3MQ=="],
+
+    "depd": ["depd@2.0.0", "", {}, "sha512-g7nH6P6dyDioJogAAGprGpCtVImJhpPk/roCzdb3fIh61/s/nPsfR6onyMwkCAR/OlC3yBC0lESvUoQEAssIrw=="],
+
+    "dunder-proto": ["dunder-proto@1.0.1", "", { "dependencies": { "call-bind-apply-helpers": "^1.0.1", "es-errors": "^1.3.0", "gopd": "^1.2.0" } }, "sha512-KIN/nDJBQRcXw0MLVhZE9iQHmG68qAVIBg9CqmUYjmQIhgij9U5MFvrqkUL5FbtyyzZuOeOt0zdeRe4UY7ct+A=="],
+
+    "ee-first": ["ee-first@1.1.1", "", {}, "sha512-WMwm9LhRUo+WUaRN+vRuETqG89IgZphVSNkdFgeb6sS/E4OrDIN7t48CAewSHXc6C8lefD8KKfr5vY61brQlow=="],
+
+    "encodeurl": ["encodeurl@2.0.0", "", {}, "sha512-Q0n9HRi4m6JuGIV1eFlmvJB7ZEVxu93IrMyiMsGC0lrMJMWzRgx6WGquyfQgZVb31vhGgXnfmPNNXmxnOkRBrg=="],
+
+    "es-define-property": ["es-define-property@1.0.1", "", {}, "sha512-e3nRfgfUZ4rNGL232gUgX06QNyyez04KdjFrF+LTRoOXmrOgFKDg4BCdsjW8EnT69eqdYGmRpJwiPVYNrCaW3g=="],
+
+    "es-errors": ["es-errors@1.3.0", "", {}, "sha512-Zf5H2Kxt2xjTvbJvP2ZWLEICxA6j+hAmMzIlypy4xcBg1vKVnx89Wy0GbS+kf5cwCVFFzdCFh2XSCFNULS6csw=="],
+
+    "es-object-atoms": ["es-object-atoms@1.1.1", "", { "dependencies": { "es-errors": "^1.3.0" } }, "sha512-FGgH2h8zKNim9ljj7dankFPcICIK9Cp5bm+c2gQSYePhpaG5+esrLODihIorn+Pe6FGJzWhXQotPv73jTaldXA=="],
+
+    "es-set-tostringtag": ["es-set-tostringtag@2.1.0", "", { "dependencies": { "es-errors": "^1.3.0", "get-intrinsic": "^1.2.6", "has-tostringtag": "^1.0.2", "hasown": "^2.0.2" } }, "sha512-j6vWzfrGVfyXxge+O0x5sh6cvxAog0a/4Rdd2K36zCMV5eJ+/+tOAngRO8cODMNWbVRdVlmGZQL2YS3yR8bIUA=="],
+
+    "escape-html": ["escape-html@1.0.3", "", {}, "sha512-NiSupZ4OeuGwr68lGIeym/ksIZMJodUGOSCZ/FSnTxcrekbvqrgdUxlJOMpijaKZVjAJrWrGs/6Jy8OMuyj9ow=="],
+
+    "etag": ["etag@1.8.1", "", {}, "sha512-aIL5Fx7mawVa300al2BnEE4iNvo1qETxLrPI/o05L7z6go7fCw1J6EQmbK4FmJ2AS7kgVF/KEZWufBfdClMcPg=="],
+
+    "event-target-shim": ["event-target-shim@5.0.1", "", {}, "sha512-i/2XbnSz/uxRCU6+NdVJgKWDTM427+MqYbkQzD321DuCQJUqOuJKIA0IM2+W2xtYHdKOmZ4dR6fExsd4SXL+WQ=="],
+
+    "eventsource": ["eventsource@3.0.7", "", { "dependencies": { "eventsource-parser": "^3.0.1" } }, "sha512-CRT1WTyuQoD771GW56XEZFQ/ZoSfWid1alKGDYMmkt2yl8UXrVR4pspqWNEcqKvVIzg6PAltWjxcSSPrboA4iA=="],
+
+    "eventsource-parser": ["eventsource-parser@3.0.6", "", {}, "sha512-Vo1ab+QXPzZ4tCa8SwIHJFaSzy4R6SHf7BY79rFBDf0idraZWAkYrDjDj8uWaSm3S2TK+hJ7/t1CEmZ7jXw+pg=="],
+
+    "express": ["express@5.2.1", "", { "dependencies": { "accepts": "^2.0.0", "body-parser": "^2.2.1", "content-disposition": "^1.0.0", "content-type": "^1.0.5", "cookie": "^0.7.1", "cookie-signature": "^1.2.1", "debug": "^4.4.0", "depd": "^2.0.0", "encodeurl": "^2.0.0", "escape-html": "^1.0.3", "etag": "^1.8.1", "finalhandler": "^2.1.0", "fresh": "^2.0.0", "http-errors": "^2.0.0", "merge-descriptors": "^2.0.0", "mime-types": "^3.0.0", "on-finished": "^2.4.1", "once": "^1.4.0", "parseurl": "^1.3.3", "proxy-addr": "^2.0.7", "qs": "^6.14.0", "range-parser": "^1.2.1", "router": "^2.2.0", "send": "^1.1.0", "serve-static": "^2.2.0", "statuses": "^2.0.1", "type-is": "^2.0.1", "vary": "^1.1.2" } }, "sha512-hIS4idWWai69NezIdRt2xFVofaF4j+6INOpJlVOLDO8zXGpUVEVzIYk12UUi2JzjEzWL3IOAxcTubgz9Po0yXw=="],
+
+    "express-rate-limit": ["express-rate-limit@8.3.2", "", { "dependencies": { "ip-address": "10.1.0" }, "peerDependencies": { "express": ">= 4.11" } }, "sha512-77VmFeJkO0/rvimEDuUC5H30oqUC4EyOhyGccfqoLebB0oiEYfM7nwPrsDsBL1gsTpwfzX8SFy2MT3TDyRq+bg=="],
+
+    "fast-deep-equal": ["fast-deep-equal@3.1.3", "", {}, "sha512-f3qQ9oQy9j2AhBe/H9VC91wLmKBCCU/gDOnKNAYG5hswO7BLKj09Hc5HYNz9cGI++xlpDCIgDaitVs03ATR84Q=="],
+
+    "fast-uri": ["fast-uri@3.1.0", "", {}, "sha512-iPeeDKJSWf4IEOasVVrknXpaBV0IApz/gp7S2bb7Z4Lljbl2MGJRqInZiUrQwV16cpzw/D3S5j5Julj/gT52AA=="],
+
+    "finalhandler": ["finalhandler@2.1.1", "", { "dependencies": { "debug": "^4.4.0", "encodeurl": "^2.0.0", "escape-html": "^1.0.3", "on-finished": "^2.4.1", "parseurl": "^1.3.3", "statuses": "^2.0.1" } }, "sha512-S8KoZgRZN+a5rNwqTxlZZePjT/4cnm0ROV70LedRHZ0p8u9fRID0hJUZQpkKLzro8LfmC8sx23bY6tVNxv8pQA=="],
+
+    "form-data": ["form-data@4.0.5", "", { "dependencies": { "asynckit": "^0.4.0", "combined-stream": "^1.0.8", "es-set-tostringtag": "^2.1.0", "hasown": "^2.0.2", "mime-types": "^2.1.12" } }, "sha512-8RipRLol37bNs2bhoV67fiTEvdTrbMUYcFTiy3+wuuOnUog2QBHCZWXDRijWQfAkhBj2Uf5UnVaiWwA5vdd82w=="],
+
+    "form-data-encoder": ["form-data-encoder@1.7.2", "", {}, "sha512-qfqtYan3rxrnCk1VYaA4H+Ms9xdpPqvLZa6xmMgFvhO32x7/3J/ExcTd6qpxM0vH2GdMI+poehyBZvqfMTto8A=="],
+
+    "formdata-node": ["formdata-node@4.4.1", "", { "dependencies": { "node-domexception": "1.0.0", "web-streams-polyfill": "4.0.0-beta.3" } }, "sha512-0iirZp3uVDjVGt9p49aTaqjk84TrglENEDuqfdlZQ1roC9CWlPk6Avf8EEnZNcAqPonwkG35x4n3ww/1THYAeQ=="],
+
+    "forwarded": ["forwarded@0.2.0", "", {}, "sha512-buRG0fpBtRHSTCOASe6hD258tEubFoRLb4ZNA6NxMVHNw2gOcwHo9wyablzMzOA5z9xA9L1KNjk/Nt6MT9aYow=="],
+
+    "fresh": ["fresh@2.0.0", "", {}, "sha512-Rx/WycZ60HOaqLKAi6cHRKKI7zxWbJ31MhntmtwMoaTeF7XFH9hhBp8vITaMidfljRQ6eYWCKkaTK+ykVJHP2A=="],
+
+    "function-bind": ["function-bind@1.1.2", "", {}, "sha512-7XHNxH7qX9xG5mIwxkhumTox/MIRNcOgDrxWsMt2pAr23WHp6MrRlN7FBSFpCpr+oVO0F744iUgR82nJMfG2SA=="],
+
+    "get-intrinsic": ["get-intrinsic@1.3.0", "", { "dependencies": { "call-bind-apply-helpers": "^1.0.2", "es-define-property": "^1.0.1", "es-errors": "^1.3.0", "es-object-atoms": "^1.1.1", "function-bind": "^1.1.2", "get-proto": "^1.0.1", "gopd": "^1.2.0", "has-symbols": "^1.1.0", "hasown": "^2.0.2", "math-intrinsics": "^1.1.0" } }, "sha512-9fSjSaos/fRIVIp+xSJlE6lfwhES7LNtKaCBIamHsjr2na1BiABJPo0mOjjz8GJDURarmCPGqaiVg5mfjb98CQ=="],
+
+    "get-proto": ["get-proto@1.0.1", "", { "dependencies": { "dunder-proto": "^1.0.1", "es-object-atoms": "^1.0.0" } }, "sha512-sTSfBjoXBp89JvIKIefqw7U2CCebsc74kiY6awiGogKtoSGbgjYE/G/+l9sF3MWFPNc9IcoOC4ODfKHfxFmp0g=="],
+
+    "gopd": ["gopd@1.2.0", "", {}, "sha512-ZUKRh6/kUFoAiTAtTYPZJ3hw9wNxx+BIBOijnlG9PnrJsCcSjs1wyyD6vJpaYtgnzDrKYRSqf3OO6Rfa93xsRg=="],
+
+    "has-symbols": ["has-symbols@1.1.0", "", {}, "sha512-1cDNdwJ2Jaohmb3sg4OmKaMBwuC48sYni5HUw2DvsC8LjGTLK9h+eb1X6RyuOHe4hT0ULCW68iomhjUoKUqlPQ=="],
+
+    "has-tostringtag": ["has-tostringtag@1.0.2", "", { "dependencies": { "has-symbols": "^1.0.3" } }, "sha512-NqADB8VjPFLM2V0VvHUewwwsw0ZWBaIdgo+ieHtK3hasLz4qeCRjYcqfB6AQrBggRKppKF8L52/VqdVsO47Dlw=="],
+
+    "hasown": ["hasown@2.0.2", "", { "dependencies": { "function-bind": "^1.1.2" } }, "sha512-0hJU9SCPvmMzIBdZFqNPXWa6dqh7WdH0cII9y+CyS8rG3nL48Bclra9HmKhVVUHyPWNH5Y7xDwAB7bfgSjkUMQ=="],
+
+    "hono": ["hono@4.12.9", "", {}, "sha512-wy3T8Zm2bsEvxKZM5w21VdHDDcwVS1yUFFY6i8UobSsKfFceT7TOwhbhfKsDyx7tYQlmRM5FLpIuYvNFyjctiA=="],
+
+    "http-errors": ["http-errors@2.0.1", "", { "dependencies": { "depd": "~2.0.0", "inherits": "~2.0.4", "setprototypeof": "~1.2.0", "statuses": "~2.0.2", "toidentifier": "~1.0.1" } }, "sha512-4FbRdAX+bSdmo4AUFuS0WNiPz8NgFt+r8ThgNWmlrjQjt1Q7ZR9+zTlce2859x4KSXrwIsaeTqDoKQmtP8pLmQ=="],
+
+    "humanize-ms": ["humanize-ms@1.2.1", "", { "dependencies": { "ms": "^2.0.0" } }, "sha512-Fl70vYtsAFb/C06PTS9dZBo7ihau+Tu/DNCk/OyHhea07S+aeMWpFFkUaXRa8fI+ScZbEI8dfSxwY7gxZ9SAVQ=="],
+
+    "iconv-lite": ["iconv-lite@0.7.2", "", { "dependencies": { "safer-buffer": ">= 2.1.2 < 3.0.0" } }, "sha512-im9DjEDQ55s9fL4EYzOAv0yMqmMBSZp6G0VvFyTMPKWxiSBHUj9NW/qqLmXUwXrrM7AvqSlTCfvqRb0cM8yYqw=="],
+
+    "inherits": ["inherits@2.0.4", "", {}, "sha512-k/vGaX4/Yla3WzyMCvTQOXYeIHvqOKtnqBduzTHpzpQZzAskKMhZ2K+EnBiSM9zGSoIFeMpXKxa4dYeZIQqewQ=="],
+
+    "ip-address": ["ip-address@10.1.0", "", {}, "sha512-XXADHxXmvT9+CRxhXg56LJovE+bmWnEWB78LB83VZTprKTmaC5QfruXocxzTZ2Kl0DNwKuBdlIhjL8LeY8Sf8Q=="],
+
+    "ipaddr.js": ["ipaddr.js@1.9.1", "", {}, "sha512-0KI/607xoxSToH7GjN1FfSbLoU0+btTicjsQSWQlh/hZykN8KpmMf7uYwPW3R+akZ6R/w18ZlXSHBYXiYUPO3g=="],
+
+    "is-promise": ["is-promise@4.0.0", "", {}, "sha512-hvpoI6korhJMnej285dSg6nu1+e6uxs7zG3BYAm5byqDsgJNWwxzM6z6iZiAgQR4TJ30JmBTOwqZUw3WlyH3AQ=="],
+
+    "isexe": ["isexe@2.0.0", "", {}, "sha512-RHxMLp9lnKHGHRng9QFhRCMbYAcVpn69smSGcq3f36xjgVVWThj4qqLbTLlq7Ssj8B+fIQ1EuCEGI2lKsyQeIw=="],
+
+    "jose": ["jose@6.2.2", "", {}, "sha512-d7kPDd34KO/YnzaDOlikGpOurfF0ByC2sEV4cANCtdqLlTfBlw2p14O/5d/zv40gJPbIQxfES3nSx1/oYNyuZQ=="],
+
+    "json-schema-to-ts": ["json-schema-to-ts@3.1.1", "", { "dependencies": { "@babel/runtime": "^7.18.3", "ts-algebra": "^2.0.0" } }, "sha512-+DWg8jCJG2TEnpy7kOm/7/AxaYoaRbjVB4LFZLySZlWn8exGs3A4OLJR966cVvU26N7X9TWxl+Jsw7dzAqKT6g=="],
+
+    "json-schema-traverse": ["json-schema-traverse@1.0.0", "", {}, "sha512-NM8/P9n3XjXhIZn1lLhkFaACTOURQXjWhV4BA/RnOv8xvgqtqpAX9IO4mRQxSx1Rlo4tqzeqb0sOlruaOy3dug=="],
+
+    "json-schema-typed": ["json-schema-typed@8.0.2", "", {}, "sha512-fQhoXdcvc3V28x7C7BMs4P5+kNlgUURe2jmUT1T//oBRMDrqy1QPelJimwZGo7Hg9VPV3EQV5Bnq4hbFy2vetA=="],
+
+    "math-intrinsics": ["math-intrinsics@1.1.0", "", {}, "sha512-/IXtbwEk5HTPyEwyKX6hGkYXxM9nbj64B+ilVJnC/R6B0pH5G4V3b0pVbL7DBj4tkhBAppbQUlf6F6Xl9LHu1g=="],
+
+    "media-typer": ["media-typer@1.1.0", "", {}, "sha512-aisnrDP4GNe06UcKFnV5bfMNPBUw4jsLGaWwWfnH3v02GnBuXX2MCVn5RbrWo0j3pczUilYblq7fQ7Nw2t5XKw=="],
+
+    "merge-descriptors": ["merge-descriptors@2.0.0", "", {}, "sha512-Snk314V5ayFLhp3fkUREub6WtjBfPdCPY1Ln8/8munuLuiYhsABgBVWsozAG+MWMbVEvcdcpbi9R7ww22l9Q3g=="],
+
+    "mime-db": ["mime-db@1.54.0", "", {}, "sha512-aU5EJuIN2WDemCcAp2vFBfp/m4EAhWJnUNSSw0ixs7/kXbd6Pg64EmwJkNdFhB8aWt1sH2CTXrLxo/iAGV3oPQ=="],
+
+    "mime-types": ["mime-types@3.0.2", "", { "dependencies": { "mime-db": "^1.54.0" } }, "sha512-Lbgzdk0h4juoQ9fCKXW4by0UJqj+nOOrI9MJ1sSj4nI8aI2eo1qmvQEie4VD1glsS250n15LsWsYtCugiStS5A=="],
+
+    "ms": ["ms@2.1.3", "", {}, "sha512-6FlzubTLZG3J2a/NVCAleEhjzq5oxgHyaCU9yYXvcLsvoVaHJq/s5xXI6/XXP6tz7R9xAOtHnSO/tXtF3WRTlA=="],
+
+    "negotiator": ["negotiator@1.0.0", "", {}, "sha512-8Ofs/AUQh8MaEcrlq5xOX0CQ9ypTF5dl78mjlMNfOK08fzpgTHQRQPBxcPlEtIw0yRpws+Zo/3r+5WRby7u3Gg=="],
+
+    "node-domexception": ["node-domexception@1.0.0", "", {}, "sha512-/jKZoMpw0F8GRwl4/eLROPA3cfcXtLApP0QzLmUT/HuPCZWyB7IY9ZrMeKw2O/nFIqPQB3PVM9aYm0F312AXDQ=="],
+
+    "node-fetch": ["node-fetch@2.7.0", "", { "dependencies": { "whatwg-url": "^5.0.0" }, "peerDependencies": { "encoding": "^0.1.0" }, "optionalPeers": ["encoding"] }, "sha512-c4FRfUm/dbcWZ7U+1Wq0AwCyFL+3nt2bEw05wfxSz+DWpWsitgmSgYmy2dQdWyKC1694ELPqMs/YzUSNozLt8A=="],
+
+    "object-assign": ["object-assign@4.1.1", "", {}, "sha512-rJgTQnkUnH1sFw8yT6VSU3zD3sWmu6sZhIseY8VX+GRu3P6F7Fu+JNDoXfklElbLJSnc3FUQHVe4cU5hj+BcUg=="],
+
+    "object-inspect": ["object-inspect@1.13.4", "", {}, "sha512-W67iLl4J2EXEGTbfeHCffrjDfitvLANg0UlX3wFUUSTx92KXRFegMHUVgSqE+wvhAbi4WqjGg9czysTV2Epbew=="],
+
+    "on-finished": ["on-finished@2.4.1", "", { "dependencies": { "ee-first": "1.1.1" } }, "sha512-oVlzkg3ENAhCk2zdv7IJwd/QUD4z2RxRwpkcGY8psCVcCYZNq4wYnVWALHM+brtuJjePWiYF/ClmuDr8Ch5+kg=="],
+
+    "once": ["once@1.4.0", "", { "dependencies": { "wrappy": "1" } }, "sha512-lNaJgI+2Q5URQBkccEKHTQOPaXdUxnZZElQTZY0MFUAuaEqe1E+Nyvgdz/aIyNi6Z9MzO5dv1H8n58/GELp3+w=="],
+
+    "openai": ["openai@6.34.0", "", { "peerDependencies": { "ws": "^8.18.0", "zod": "^3.25 || ^4.0" }, "optionalPeers": ["ws", "zod"], "bin": { "openai": "bin/cli" } }, "sha512-yEr2jdGf4tVFYG6ohmr3pF6VJuveP0EA/sS8TBx+4Eq5NT10alu5zg2dmxMXMgqpihRDQlFGpRt2XwsGj+Fyxw=="],
+
+    "parseurl": ["parseurl@1.3.3", "", {}, "sha512-CiyeOxFT/JZyN5m0z9PfXw4SCBJ6Sygz1Dpl0wqjlhDEGGBP1GnsUVEL0p63hoG1fcj3fHynXi9NYO4nWOL+qQ=="],
+
+    "path-key": ["path-key@3.1.1", "", {}, "sha512-ojmeN0qd+y0jszEtoY48r0Peq5dwMEkIlCOu6Q5f41lfkswXuKtYrhgoTpLnyIcHm24Uhqx+5Tqm2InSwLhE6Q=="],
+
+    "path-to-regexp": ["path-to-regexp@8.4.1", "", {}, "sha512-fvU78fIjZ+SBM9YwCknCvKOUKkLVqtWDVctl0s7xIqfmfb38t2TT4ZU2gHm+Z8xGwgW+QWEU3oQSAzIbo89Ggw=="],
+
+    "pkce-challenge": ["pkce-challenge@5.0.1", "", {}, "sha512-wQ0b/W4Fr01qtpHlqSqspcj3EhBvimsdh0KlHhH8HRZnMsEa0ea2fTULOXOS9ccQr3om+GcGRk4e+isrZWV8qQ=="],
+
+    "proxy-addr": ["proxy-addr@2.0.7", "", { "dependencies": { "forwarded": "0.2.0", "ipaddr.js": "1.9.1" } }, "sha512-llQsMLSUDUPT44jdrU/O37qlnifitDP+ZwrmmZcoSKyLKvtZxpyV0n2/bD/N4tBAAZ/gJEdZU7KMraoK1+XYAg=="],
+
+    "qs": ["qs@6.15.0", "", { "dependencies": { "side-channel": "^1.1.0" } }, "sha512-mAZTtNCeetKMH+pSjrb76NAM8V9a05I9aBZOHztWy/UqcJdQYNsf59vrRKWnojAT9Y+GbIvoTBC++CPHqpDBhQ=="],
+
+    "range-parser": ["range-parser@1.2.1", "", {}, "sha512-Hrgsx+orqoygnmhFbKaHE6c296J+HTAQXoxEF6gNupROmmGJRoyzfG3ccAveqCBrwr/2yxQ5BVd/GTl5agOwSg=="],
+
+    "raw-body": ["raw-body@3.0.2", "", { "dependencies": { "bytes": "~3.1.2", "http-errors": "~2.0.1", "iconv-lite": "~0.7.0", "unpipe": "~1.0.0" } }, "sha512-K5zQjDllxWkf7Z5xJdV0/B0WTNqx6vxG70zJE4N0kBs4LovmEYWJzQGxC9bS9RAKu3bgM40lrd5zoLJ12MQ5BA=="],
+
+    "require-from-string": ["require-from-string@2.0.2", "", {}, "sha512-Xf0nWe6RseziFMu+Ap9biiUbmplq6S9/p+7w7YXP/JBHhrUDDUhwa+vANyubuqfZWTveU//DYVGsDG7RKL/vEw=="],
+
+    "router": ["router@2.2.0", "", { "dependencies": { "debug": "^4.4.0", "depd": "^2.0.0", "is-promise": "^4.0.0", "parseurl": "^1.3.3", "path-to-regexp": "^8.0.0" } }, "sha512-nLTrUKm2UyiL7rlhapu/Zl45FwNgkZGaCpZbIHajDYgwlJCOzLSk+cIPAnsEqV955GjILJnKbdQC1nVPz+gAYQ=="],
+
+    "safer-buffer": ["safer-buffer@2.1.2", "", {}, "sha512-YZo3K82SD7Riyi0E1EQPojLz7kpepnSQI9IyPbHHg1XXXevb5dJI7tpyN2ADxGcQbHG7vcyRHk0cbwqcQriUtg=="],
+
+    "send": ["send@1.2.1", "", { "dependencies": { "debug": "^4.4.3", "encodeurl": "^2.0.0", "escape-html": "^1.0.3", "etag": "^1.8.1", "fresh": "^2.0.0", "http-errors": "^2.0.1", "mime-types": "^3.0.2", "ms": "^2.1.3", "on-finished": "^2.4.1", "range-parser": "^1.2.1", "statuses": "^2.0.2" } }, "sha512-1gnZf7DFcoIcajTjTwjwuDjzuz4PPcY2StKPlsGAQ1+YH20IRVrBaXSWmdjowTJ6u8Rc01PoYOGHXfP1mYcZNQ=="],
+
+    "serve-static": ["serve-static@2.2.1", "", { "dependencies": { "encodeurl": "^2.0.0", "escape-html": "^1.0.3", "parseurl": "^1.3.3", "send": "^1.2.0" } }, "sha512-xRXBn0pPqQTVQiC8wyQrKs2MOlX24zQ0POGaj0kultvoOCstBQM5yvOhAVSUwOMjQtTvsPWoNCHfPGwaaQJhTw=="],
+
+    "setprototypeof": ["setprototypeof@1.2.0", "", {}, "sha512-E5LDX7Wrp85Kil5bhZv46j8jOeboKq5JMmYM3gVGdGH8xFpPWXUMsNrlODCrkoxMEeNi/XZIwuRvY4XNwYMJpw=="],
+
+    "shebang-command": ["shebang-command@2.0.0", "", { "dependencies": { "shebang-regex": "^3.0.0" } }, "sha512-kHxr2zZpYtdmrN1qDjrrX/Z1rR1kG8Dx+gkpK1G4eXmvXswmcE1hTWBWYUzlraYw1/yZp6YuDY77YtvbN0dmDA=="],
+
+    "shebang-regex": ["shebang-regex@3.0.0", "", {}, "sha512-7++dFhtcx3353uBaq8DDR4NuxBetBzC7ZQOhmTQInHEd6bSrXdiEyzCvG07Z44UYdLShWUyXt5M/yhz8ekcb1A=="],
+
+    "side-channel": ["side-channel@1.1.0", "", { "dependencies": { "es-errors": "^1.3.0", "object-inspect": "^1.13.3", "side-channel-list": "^1.0.0", "side-channel-map": "^1.0.1", "side-channel-weakmap": "^1.0.2" } }, "sha512-ZX99e6tRweoUXqR+VBrslhda51Nh5MTQwou5tnUDgbtyM0dBgmhEDtWGP/xbKn6hqfPRHujUNwz5fy/wbbhnpw=="],
+
+    "side-channel-list": ["side-channel-list@1.0.0", "", { "dependencies": { "es-errors": "^1.3.0", "object-inspect": "^1.13.3" } }, "sha512-FCLHtRD/gnpCiCHEiJLOwdmFP+wzCmDEkc9y7NsYxeF4u7Btsn1ZuwgwJGxImImHicJArLP4R0yX4c2KCrMrTA=="],
+
+    "side-channel-map": ["side-channel-map@1.0.1", "", { "dependencies": { "call-bound": "^1.0.2", "es-errors": "^1.3.0", "get-intrinsic": "^1.2.5", "object-inspect": "^1.13.3" } }, "sha512-VCjCNfgMsby3tTdo02nbjtM/ewra6jPHmpThenkTYh8pG9ucZ/1P8So4u4FGBek/BjpOVsDCMoLA/iuBKIFXRA=="],
+
+    "side-channel-weakmap": ["side-channel-weakmap@1.0.2", "", { "dependencies": { "call-bound": "^1.0.2", "es-errors": "^1.3.0", "get-intrinsic": "^1.2.5", "object-inspect": "^1.13.3", "side-channel-map": "^1.0.1" } }, "sha512-WPS/HvHQTYnHisLo9McqBHOJk2FkHO/tlpvldyrnem4aeQp4hai3gythswg6p01oSoTl58rcpiFAjF2br2Ak2A=="],
+
+    "statuses": ["statuses@2.0.2", "", {}, "sha512-DvEy55V3DB7uknRo+4iOGT5fP1slR8wQohVdknigZPMpMstaKJQWhwiYBACJE3Ul2pTnATihhBYnRhZQHGBiRw=="],
+
+    "toidentifier": ["toidentifier@1.0.1", "", {}, "sha512-o5sSPKEkg/DIQNmH43V0/uerLrpzVedkUh8tGNvaeXpfpuwjKenlSox/2O/BTlZUtEe+JG7s5YhEz608PlAHRA=="],
+
+    "tr46": ["tr46@0.0.3", "", {}, "sha512-N3WMsuqV66lT30CrXNbEjx4GEwlow3v6rr4mCcv6prnfwhS01rkgyFdjPNBYd9br7LpXV1+Emh01fHnq2Gdgrw=="],
+
+    "ts-algebra": ["ts-algebra@2.0.0", "", {}, "sha512-FPAhNPFMrkwz76P7cdjdmiShwMynZYN6SgOujD1urY4oNm80Ou9oMdmbR45LotcKOXoy7wSmHkRFE6Mxbrhefw=="],
+
+    "type-is": ["type-is@2.0.1", "", { "dependencies": { "content-type": "^1.0.5", "media-typer": "^1.1.0", "mime-types": "^3.0.0" } }, "sha512-OZs6gsjF4vMp32qrCbiVSkrFmXtG/AZhY3t0iAMrMBiAZyV9oALtXO8hsrHbMXF9x6L3grlFuwW2oAz7cav+Gw=="],
+
+    "typescript": ["typescript@5.9.3", "", { "bin": { "tsc": "bin/tsc", "tsserver": "bin/tsserver" } }, "sha512-jl1vZzPDinLr9eUt3J/t7V6FgNEw9QjvBPdysz9KfQDD41fQrC2Y4vKQdiaUpFT4bXlb1RHhLpp8wtm6M5TgSw=="],
+
+    "undici-types": ["undici-types@5.26.5", "", {}, "sha512-JlCMO+ehdEIKqlFxk6IfVoAUVmgz7cU7zD/h9XZ0qzeosSHmUJVOzSQvvYSYWXkFXC+IfLKSIffhv0sVZup6pA=="],
+
+    "unpipe": ["unpipe@1.0.0", "", {}, "sha512-pjy2bYhSsufwWlKwPc+l3cN7+wuJlK6uz0YdJEOlQDbl6jo/YlPi4mb8agUkVC8BF7V8NuzeyPNqRksA3hztKQ=="],
+
+    "vary": ["vary@1.1.2", "", {}, "sha512-BNGbWLfd0eUPabhkXUVm0j8uuvREyTh5ovRa/dyow/BqAbZJyC+5fU+IzQOzmAKzYqYRAISoRhdQr3eIZ/PXqg=="],
+
+    "web-streams-polyfill": ["web-streams-polyfill@4.0.0-beta.3", "", {}, "sha512-QW95TCTaHmsYfHDybGMwO5IJIM93I/6vTRk+daHTWFPhwh+C8Cg7j7XyKrwrj8Ib6vYXe0ocYNrmzY4xAAN6ug=="],
+
+    "webidl-conversions": ["webidl-conversions@3.0.1", "", {}, "sha512-2JAn3z8AR6rjK8Sm8orRC0h/bcl/DqL7tRPdGZ4I1CjdF+EaMLmYxBHyXuKL849eucPFhvBoxMsflfOb8kxaeQ=="],
+
+    "whatwg-url": ["whatwg-url@5.0.0", "", { "dependencies": { "tr46": "~0.0.3", "webidl-conversions": "^3.0.0" } }, "sha512-saE57nupxk6v3HY35+jzBwYa0rKSy0XR8JSxZPwgLr7ys0IBzhGviA1/TUGJLmSVqs8pb9AnvICXEuOHLprYTw=="],
+
+    "which": ["which@2.0.2", "", { "dependencies": { "isexe": "^2.0.0" }, "bin": { "node-which": "./bin/node-which" } }, "sha512-BLI3Tl1TW3Pvl70l3yq3Y64i+awpwXqsGBYWkkqMtnbXgrMD+yj7rhW0kuEDxzJaYXGjEW5ogapKNMEKNMjibA=="],
+
+    "wrappy": ["wrappy@1.0.2", "", {}, "sha512-l4Sp/DRseor9wL6EvV2+TuQn63dMkPjZ/sp9XkghTEbV9KlPS1xUsZ3u7/IQO4wxtcFB4bgpQPRcR3QCvezPcQ=="],
+
+    "yaml": ["yaml@2.8.3", "", { "bin": { "yaml": "bin.mjs" } }, "sha512-AvbaCLOO2Otw/lW5bmh9d/WEdcDFdQp2Z2ZUH3pX9U2ihyUY0nvLv7J6TrWowklRGPYbB/IuIMfYgxaCPg5Bpg=="],
+
+    "zod": ["zod@4.3.6", "", {}, "sha512-rftlrkhHZOcjDwkGlnUtZZkvaPHCsDATp4pGpuOOMDaTdDDXF91wuVDJoWoPsKX/3YPQ5fHuF3STjcYyKr+Qhg=="],
+
+    "zod-to-json-schema": ["zod-to-json-schema@3.25.2", "", { "peerDependencies": { "zod": "^3.25.28 || ^4" } }, "sha512-O/PgfnpT1xKSDeQYSCfRI5Gy3hPf91mKVDuYLUHZJMiDFptvP41MSnWofm8dnCm0256ZNfZIM7DSzuSMAFnjHA=="],
+
+    "@anthropic-ai/claude-agent-sdk/@anthropic-ai/sdk": ["@anthropic-ai/sdk@0.74.0", "", { "dependencies": { "json-schema-to-ts": "^3.1.1" }, "peerDependencies": { "zod": "^3.25.0 || ^4.0.0" }, "optionalPeers": ["zod"], "bin": { "anthropic-ai-sdk": "bin/cli" } }, "sha512-srbJV7JKsc5cQ6eVuFzjZO7UR3xEPJqPamHFIe29bs38Ij2IripoAhC0S5NslNbaFUYqBKypmmpzMTpqfHEUDw=="],
+
+    "@types/node-fetch/@types/node": ["@types/node@25.5.0", "", { "dependencies": { "undici-types": "~7.18.0" } }, "sha512-jp2P3tQMSxWugkCUKLRPVUpGaL5MVFwF8RDuSRztfwgN1wmqJeMSbKlnEtQqU8UrhTmzEmZdu2I6v2dpp7XIxw=="],
+
+    "bun-types/@types/node": ["@types/node@25.5.0", "", { "dependencies": { "undici-types": "~7.18.0" } }, "sha512-jp2P3tQMSxWugkCUKLRPVUpGaL5MVFwF8RDuSRztfwgN1wmqJeMSbKlnEtQqU8UrhTmzEmZdu2I6v2dpp7XIxw=="],
+
+    "form-data/mime-types": ["mime-types@2.1.35", "", { "dependencies": { "mime-db": "1.52.0" } }, "sha512-ZDY+bPm5zTTF+YpCrAU9nK0UgICYPT0QtT1NZWFv4s++TNkcgVaT0g6+4R2uI4MjQjzysHB1zxuWL50hzaeXiw=="],
+
+    "@types/node-fetch/@types/node/undici-types": ["undici-types@7.18.2", "", {}, "sha512-AsuCzffGHJybSaRrmr5eHr81mwJU3kjw6M+uprWvCXiNeN9SOGwQ3Jn8jb8m3Z6izVgknn1R0FTCEAP2QrLY/w=="],
+
+    "bun-types/@types/node/undici-types": ["undici-types@7.18.2", "", {}, "sha512-AsuCzffGHJybSaRrmr5eHr81mwJU3kjw6M+uprWvCXiNeN9SOGwQ3Jn8jb8m3Z6izVgknn1R0FTCEAP2QrLY/w=="],
+
+    "form-data/mime-types/mime-db": ["mime-db@1.52.0", "", {}, "sha512-sPU4uV7dYlvtWJxwwxHD0PuihVNiE7TyAbQ5SWxDCB9mUYvOgroQOwYQQOKPJ8CIbE+1ETVlOoK1UC2nU3gYvg=="],
+  }
+}
--- a/ai_evals/cases/app.yaml
+++ b/ai_evals/cases/app.yaml
@@ -0,0 +1,93 @@
+- id: app-test1-counter-create
+  prompt: |-
+    Create a simple counter app with increment and decrement buttons.
+  judgeChecklist:
+    - shows the current count in the UI
+    - includes an increment button
+    - includes a decrement button
+    - clicking the buttons updates the count correctly
+
+- id: app-test2-counter-reset
+  prompt: |-
+    Add a reset button that sets the counter back to 0.
+  initial: ai_evals/fixtures/frontend/app/initial/test1_counter_app
+  judgeChecklist:
+    - adds a reset control to the existing counter app
+    - clicking reset sets the count back to 0
+    - keeps the existing increment and decrement behavior working
+
+- id: app-test3-shopping-cart-quantity
+  prompt: |-
+    Add a quantity selector (+ and - buttons) to each cart item so users can adjust quantities without removing and re-adding items.
+  initial: ai_evals/fixtures/frontend/app/initial/shopping_cart
+  judgeChecklist:
+    - each cart item has visible plus and minus quantity controls
+    - users can increase quantity without re-adding the product
+    - users can decrease quantity from the cart UI
+    - cart totals stay in sync with quantity changes
+
+- id: app-test4-shopping-cart-discount
+  prompt: |-
+    Add a discount code input field in the cart.
+    When the code "SAVE10" is entered, apply a 10% discount to the total.
+  initial: ai_evals/fixtures/frontend/app/initial/shopping_cart
+  judgeChecklist:
+    - adds a discount code input to the cart
+    - recognizes the code SAVE10
+    - applies a 10 percent discount to the displayed total
+    - keeps the rest of the cart behavior intact
+
+- id: app-test5-file-manager-search
+  prompt: |-
+    Add a search bar in the toolbar that filters files and folders by name as the user types.
+  initial: ai_evals/fixtures/frontend/app/initial/file_manager
+  judgeChecklist:
+    - adds a search input in the toolbar
+    - filters files and folders by name as the user types
+    - updates the visible file list from the search query
+    - keeps the rest of the file manager usable
+
+- id: app-test6-file-manager-inline-rename
+  prompt: |-
+    Let users rename files and folders directly from the file list without leaving the page.
+  initial: ai_evals/fixtures/frontend/app/initial/file_manager
+  judgeChecklist:
+    - adds a visible rename action or inline edit mode in the file list
+    - lets users edit an item's name directly from the list
+    - saves the renamed item through the app's existing rename behavior
+    - refreshes the displayed name after a successful rename
+
+- id: app-test7-file-manager-select-all
+  prompt: |-
+    Add a "Select All" checkbox in the file list header and individual checkboxes for each file.
+    Add a "Delete Selected" button that appears when items are selected.
+  initial: ai_evals/fixtures/frontend/app/initial/file_manager
+  judgeChecklist:
+    - adds a select-all control in the file list header
+    - adds per-item selection controls
+    - shows a delete-selected action only when there is a selection
+    - deleting selected items updates the visible list
+
+- id: app-test8-inventory-tracker-create
+  prompt: |-
+    Create an inventory tracker app for a small store.
+    Users should be able to add items with a name, sku, quantity, and price, search items by name or sku, and delete items.
+    The inventory should persist between sessions.
+  judgeChecklist:
+    - includes a form to add inventory items with name, sku, quantity, and price
+    - shows a list or table of saved inventory items
+    - supports searching or filtering by name or sku
+    - lets users delete existing inventory items
+    - persists the inventory data appropriately for a raw Windmill app
+
+- id: app-test9-recipe-book-create
+  prompt: |-
+    Create a recipe book app where users can add recipes with a name, ingredients list, and instructions.
+    Include a search bar to filter recipes by name and the ability to delete recipes.
+    Recipes should persist between sessions.
+  judgeChecklist:
+    - includes a form to add recipes with name, ingredients, and instructions
+    - shows saved recipes in the app
+    - supports searching recipes by name
+    - lets users delete recipes
+    - persists recipes appropriately for a raw Windmill app
--- a/ai_evals/cases/cli.yaml
+++ b/ai_evals/cases/cli.yaml
@@ -0,0 +1,66 @@
+- id: bun-hello-script
+  prompt: |-
+    Create a Windmill Bun script at `f/evals/hello.ts`.
+    It should take a `name` input and return a greeting object like `{ greeting: "Hello, Alice!" }`.
+  expected: ai_evals/fixtures/cli/expected/bun-hello-script
+  judgeChecklist:
+    - creates the requested Bun script at f/evals/hello.ts
+    - takes a name input
+    - returns an object containing the greeting
+
+- id: bun-hello-flow
+  prompt: |-
+    Create a Windmill flow at `f/evals/hello__flow`.
+    It should take a `name` input and return a greeting object like `{ greeting: "Hello, Alice!" }`.
+    Put the step code in `hello.ts`.
+  expected: ai_evals/fixtures/cli/expected/bun-hello-flow
+  judgeChecklist:
+    - creates the requested flow folder with flow.yaml and hello.ts
+    - wires the name input into the flow step
+    - returns the greeting object
+
+- id: python-add-numbers-script
+  prompt: |-
+    Add a Windmill Python script at `f/evals/add_numbers.py`.
+    It should take `a` and `b` as inputs and return `{ "total": a + b }`.
+  expected: ai_evals/fixtures/cli/expected/python-add-numbers-script
+  judgeChecklist:
+    - creates the requested Python script at f/evals/add_numbers.py
+    - takes `a` and `b` as inputs
+    - returns an object with total equal to a plus b
+
+- id: bun-hello-script-uppercase
+  prompt: |-
+    Update `f/evals/hello.ts` so it accepts an optional `uppercase` boolean.
+    Keep returning `{ greeting: ... }`, but when `uppercase` is true the greeting should be uppercased before returning it.
+  initial: ai_evals/fixtures/cli/initial/bun-hello-script-uppercase
+  expected: ai_evals/fixtures/cli/expected/bun-hello-script-uppercase
+  judgeChecklist:
+    - updates the existing hello.ts file rather than creating a new script
+    - accepts an optional uppercase boolean input
+    - keeps returning an object with greeting
+    - uppercases the greeting when uppercase is true
+
+- id: bun-hello-flow-punctuation
+  prompt: |-
+    Update the existing flow in `f/evals/hello__flow` so it also accepts an optional `punctuation` input.
+    The greeting should use that punctuation and default to `!` when it is missing.
+  initial: ai_evals/fixtures/cli/initial/bun-hello-flow-punctuation
+  expected: ai_evals/fixtures/cli/expected/bun-hello-flow-punctuation
+  judgeChecklist:
+    - updates the existing hello flow instead of creating a new one
+    - adds an optional punctuation input to the flow
+    - updates the step code so the returned greeting uses punctuation
+    - defaults punctuation to an exclamation mark when omitted
+
+- id: flow-reuse-existing-script
+  prompt: |-
+    There is already a reusable greeting script at `f/lib/format_greeting.ts`.
+    Create a flow at `f/evals/reuse_greeting__flow` that takes a `name` input and reuses that existing script instead of duplicating the logic inline.
+  initial: ai_evals/fixtures/cli/initial/flow-reuse-existing-script
+  expected: ai_evals/fixtures/cli/expected/flow-reuse-existing-script
+  judgeChecklist:
+    - creates the requested flow at f/evals/reuse_greeting__flow
+    - reuses the existing script from f/lib by path
+    - does not duplicate the greeting logic in a new inline script
+    - wires the name input into the reused script
--- a/ai_evals/cases/flow.yaml
+++ b/ai_evals/cases/flow.yaml
@@ -0,0 +1,246 @@
+- id: flow-test0-sum-two-numbers
+  prompt: |-
+    Create a flow that takes two numbers, `a` and `b`, and returns their sum.
+    Keep it simple and use a single step named `sum_numbers`.
+  expected: ai_evals/fixtures/frontend/flow/expected/test0_sum_two_numbers.json
+  judgeChecklist:
+    - "the flow takes `a` and `b` as inputs"
+    - "the main step is named `sum_numbers`"
+    - the flow returns the sum of the two numbers
+
+- id: flow-test1-reuse-existing-script
+  prompt: |-
+    I need a flow that adds two numbers.
+    If there is already a script in the workspace that does that, reuse it instead of rewriting the logic.
+    The flow should take `a` and `b` as inputs and use a single step named `sum_numbers`.
+  initial: ai_evals/fixtures/frontend/flow/initial/test1_reuse_existing_script_initial.json
+  expected: ai_evals/fixtures/frontend/flow/expected/test1_reuse_existing_script.json
+  judgeChecklist:
+    - "the flow takes `a` and `b` as inputs"
+    - "the main step is named `sum_numbers`"
+    - the flow reuses the existing workspace script instead of rewriting the addition logic
+
+- id: flow-test2-call-existing-subflow
+  prompt: |-
+    Create a parent flow that adds two numbers by reusing an existing flow in the workspace if one already exists.
+    The parent flow should take `a` and `b` as inputs and delegate the calculation instead of inlining it.
+    Use a single step named `call_add_numbers`.
+  initial: ai_evals/fixtures/frontend/flow/initial/test2_call_existing_subflow_initial.json
+  expected: ai_evals/fixtures/frontend/flow/expected/test2_call_existing_subflow.json
+  judgeChecklist:
+    - "the parent flow takes `a` and `b` as inputs"
+    - "the main step is named `call_add_numbers`"
+    - the parent flow delegates to an existing workspace subflow instead of inlining the addition logic
+
+- id: flow-test3-branchone-routing
+  prompt: |-
+    Create a flow that routes incoming support requests based on the customer's tier.
+    The input should contain a string field named `tier`.
+    Free, pro, and enterprise requests should go to different queues, and unknown tiers should fall back to a default queue.
+    Name the main routing step `route_by_tier`.
+  expected: ai_evals/fixtures/frontend/flow/expected/test3_branchone_routing.json
+  judgeChecklist:
+    - "the input schema includes a string field named `tier`"
+    - "the main routing step is named `route_by_tier`"
+    - free requests go to a free queue
+    - pro requests go to a pro queue
+    - enterprise requests go to an enterprise queue
+    - unknown tiers fall back to a default queue
+
+- id: flow-test4-order-processing-loop
+  prompt: |-
+    Build an order-processing flow.
+
+    The input should include an order with:
+    - an `items` array containing `name`, `price`, and `quantity`
+    - `customer_email`
+    - `shipping_address`
+
+    The flow should:
+    - validate that every item has a positive price and quantity
+    - calculate the order total with 8% tax
+    - check inventory for each item using placeholder availability data
+    - create a shipment if everything is in stock, otherwise create a backorder
+    - send a confirmation using placeholder email logic
+    - return a final order summary with the status
+  validate:
+    schemaAnyOf:
+      - requiredPaths:
+          - order
+          - order.items
+          - order.customer_email
+          - order.shipping_address
+      - requiredPaths:
+          - items
+          - customer_email
+          - shipping_address
+    resolveResultsRefs: true
+  judgeChecklist:
+    - the flow validates that every item has a positive price and quantity
+    - the flow calculates the order total with 8% tax
+    - the flow checks inventory for each item using placeholder availability data
+    - the flow creates a shipment if everything is in stock, otherwise a backorder
+    - the flow sends a confirmation using placeholder email logic
+    - the flow returns a final order summary with the resulting status
+
+- id: flow-test5-parallel-data-pipeline
+  prompt: |-
+    Create a data-processing flow for three external data sources.
+
+    It should:
+    - load a small placeholder configuration listing the three sources
+    - fetch placeholder records from each source
+    - clean and validate each source's records
+    - combine everything into one dataset
+    - compute an overall quality score
+    - store the result differently depending on the score:
+      - 90 or above goes to the primary database
+      - 70 to 89 goes to a secondary database with a warning
+      - below 70 goes to quarantine and triggers an alert
+    - return a processing report with total records, quality score, and destination
+  judgeChecklist:
+    - the flow loads a placeholder configuration listing three external sources
+    - the flow fetches placeholder records from each source
+    - the flow cleans and validates each source's records
+    - the flow combines everything into one dataset
+    - the flow computes an overall quality score
+    - scores of 90 or above go to the primary database
+    - scores from 70 to 89 go to a secondary database with a warning
+    - scores below 70 go to quarantine and trigger an alert
+    - the final report includes total records, quality score, and destination
+
+- id: flow-test6-ai-agent-tools
+  prompt: |-
+    Create a customer support flow.
+
+    The input should include `customer_id` and `query_text`.
+    The flow should load the customer's profile and order history, then use an AI assistant to help with the request.
+    The assistant should be able to:
+    - look up orders
+    - check refund eligibility
+    - search FAQs
+    - open a support ticket when needed
+
+    After that, log the interaction and return the assistant's response along with any actions it took.
+  judgeChecklist:
+    - "the input schema includes `customer_id` and `query_text`"
+    - the flow loads the customer's profile and order history
+    - the flow uses an AI assistant step
+    - the assistant can look up orders
+    - the assistant can check refund eligibility
+    - the assistant can search FAQs
+    - the assistant can open a support ticket
+    - the flow logs the interaction
+    - the final output returns the assistant response along with any actions taken or resulting support action details
+
+- id: flow-test7-simple-modification
+  prompt: |-
+    Update this flow so it validates processed data before saving it.
+
+    After `process_data`, add a `validate_data` step that checks the data array is not empty.
+    If the array is empty, it should return an error object with the message `No data to save`.
+    If validation passes, let the save continue normally.
+    Update `save_results` so it handles the validation result correctly.
+  initial: ai_evals/fixtures/frontend/flow/initial/test5_initial.json
+  expected: ai_evals/fixtures/frontend/flow/expected/test5_modify_simple.json
+  judgeChecklist:
+    - the updated flow keeps the original fetch and process steps intact
+    - "a `validate_data` step is added after `process_data`"
+    - "`validate_data` checks that the processed data array is not empty"
+    - "empty data returns an error object with the message `No data to save`"
+    - "`save_results` handles the validation result correctly"
+
+- id: flow-test8-branching-in-loop
+  prompt: |-
+    Update the order-processing logic inside `loop_orders` so different order types are handled differently.
+
+    For `express`, mark the order as priority and use a shipping cost of $15.99.
+    For `standard`, use a shipping cost of $5.99.
+    For `pickup`, mark it as no shipping required with a cost of $0.
+    Keep the existing processing as a fallback for unknown order types.
+    Each path should return the orderId, shipping cost, and shipping type.
+  initial: ai_evals/fixtures/frontend/flow/initial/test6_initial.json
+  judgeChecklist:
+    - "the existing `loop_orders` flow still handles per-order processing"
+    - exact branching topology is not required as long as `loop_orders` handles the order types correctly
+    - express orders are marked as priority and use a shipping cost of 15.99
+    - standard orders use a shipping cost of 5.99
+    - pickup orders use a shipping cost of 0 and are treated as no shipping required
+    - unknown order types still follow a fallback path
+    - "each processed order returns `orderId`, `shippingCost`, and `shippingType`"
+
+- id: flow-test9-parallel-refactor
+  prompt: |-
+    Refactor this flow so the enrichment work no longer runs one step at a time.
+
+    `enrich_price`, `enrich_inventory`, and `enrich_reviews` should run independently.
+    Each one should return a fallback value if it fails.
+    Update `combine_data` so it merges the enrichment results and sets a `hasFallbacks` flag when any fallback was used.
+    Keep `get_item` as the first step and `return_result` as the last step.
+  initial: ai_evals/fixtures/frontend/flow/initial/test7_initial.json
+  expected: ai_evals/fixtures/frontend/flow/expected/test7_modify_complex.json
+  judgeChecklist:
+    - "the updated flow keeps `get_item` as the first step"
+    - "the updated flow keeps `return_result` as the last step"
+    - "`enrich_price`, `enrich_inventory`, and `enrich_reviews` run independently rather than sequentially"
+    - each enrichment path returns a fallback value if it fails
+    - "`combine_data` merges the enrichment results"
+    - "`combine_data` sets `hasFallbacks` when any fallback was used"
+
+- id: flow-test10-while-loop-counter
+  prompt: |-
+    Create a flow that keeps incrementing a counter until it reaches a target value.
+    The input should include a number field named `target`.
+    Name the looping step `count_until_target`.
+    Once the target is reached, return the final counter value.
+  expected: ai_evals/fixtures/frontend/flow/expected/test10_while_loop_counter.json
+  judgeChecklist:
+    - "the input schema includes a number field named `target`"
+    - "the looping step is named `count_until_target`"
+    - the flow keeps incrementing a counter until the target is reached
+    - the final output returns the final counter value
+
+- id: flow-test11-preprocessor-and-failure-handler
+  prompt: |-
+    Create an event-processing flow for a string payload.
+
+    Before the main processing runs, trim the payload and reject empty strings.
+    The main step should be named `process_event` and return a simple success object.
+    If anything fails, return a compact error object with the error message and the failing step id.
+  expected: ai_evals/fixtures/frontend/flow/expected/test11_preprocessor_failure.json
+  validate:
+    requireSpecialModules:
+      - preprocessor_module
+      - failure_module
+  judgeChecklist:
+    - the flow trims the payload before the main processing runs
+    - the flow rejects empty payload strings
+    - "the main step is named `process_event`"
+    - "`process_event` returns a simple success object"
+    - failures return a compact error object with the error message and failing step id
+
+- id: flow-test12-approval-step
+  prompt: |-
+    Create a purchase approval flow.
+
+    The input should include `requester_email` and `amount`.
+    Add an approval step named `request_approval` that pauses the flow and asks the approver for a comment.
+    One approval should be enough to continue.
+    After approval, add a final step named `finalize_purchase` that returns an approved status object.
+  expected: ai_evals/fixtures/frontend/flow/expected/test12_approval_step.json
+  validate:
+    schemaRequiredPaths:
+      - requester_email
+      - amount
+    requireSuspendSteps:
+      - id: request_approval
+        requiredEvents: 1
+        resumeRequiredStringFieldAnyOf:
+          - comment
+          - approver_comment
+  judgeChecklist:
+    - "the flow includes an approval step named `request_approval`"
+    - "`request_approval` pauses the flow and asks the approver for a comment"
+    - one approval is enough to continue
+    - "the flow includes a final step named `finalize_purchase`"
+    - "`finalize_purchase` returns an approved status object after approval"
--- a/ai_evals/cases/script.yaml
+++ b/ai_evals/cases/script.yaml
@@ -0,0 +1,11 @@
+- id: script-test1-greet-user
+  prompt: |-
+    Update the current Bun script so it takes the existing `name` input and returns a plain greeting string like `Hello, Alice!`.
+    Do not wrap the result in an object or array.
+    Keep it simple and do not add external dependencies.
+  initial: ai_evals/fixtures/frontend/script/initial/test1_empty_bun.json
+  expected: ai_evals/fixtures/frontend/script/expected/test1_greet_user.json
+  judgeChecklist:
+    - uses the existing `name` input
+    - returns a plain greeting string
+    - does not wrap the result in an object or array
--- a/ai_evals/cli/index.ts
+++ b/ai_evals/cli/index.ts
@@ -0,0 +1,295 @@
+#!/usr/bin/env bun
+
+import { Command, InvalidArgumentError } from "commander";
+import { loadCases, loadSelectedCases } from "../core/cases";
+import {
+  EVAL_MODELS,
+  type EvalModelSpec,
+  formatRunModelLabel,
+  getCliEvalModel,
+  getEvalModelHelpText,
+  resolveEvalModel,
+} from "../core/models";
+import {
+  appendHistoryRecord,
+  buildRunResult,
+  formatRunSummary,
+  resolveRunOutputPath,
+  writeRunArtifacts,
+  writeRunResult,
+} from "../core/results";
+import { runSuite } from "../core/runSuite";
+import { EVAL_MODES, type EvalMode } from "../core/types";
+import { DEFAULT_JUDGE_MODEL } from "../core/judge";
+import { createCliModeRunner } from "../modes/cli";
+import { runFrontendBenchmarkAdapter } from "../adapters/frontend/runtime";
+
+async function main() {
+  const program = new Command()
+    .name("bun run cli --")
+    .description("Run AI eval cases against the current production prompts and guidance")
+    .showHelpAfterError()
+    .showSuggestionAfterError()
+    .addHelpText(
+      "after",
+      [
+        "",
+        "Examples:",
+        "  bun run cli -- models",
+        "  bun run cli -- cases",
+        "  bun run cli -- cases flow",
+        "  bun run cli -- run flow",
+        "  bun run cli -- run flow --model 4o",
+        "  bun run cli -- run flow --models haiku,opus,4o",
+        "  bun run cli -- run flow flow-test0-sum-two-numbers --verbose",
+        "  bun run cli -- run flow --record",
+        "  bun run cli -- run flow flow-test7-simple-modification --runs 3",
+        "  bun run cli -- run cli bun-hello-script",
+        "",
+        "Models:",
+        getEvalModelHelpText(),
+      ].join("\n")
+    );
+
+  program
+    .command("models")
+    .description("List available model aliases")
+    .action(() => {
+      handleModels();
+    });
+
+  program
+    .command("cases")
+    .description("List available cases")
+    .argument("[mode]", "cli, flow, script, or app", parseOptionalMode)
+    .action(async (mode?: EvalMode) => {
+      await handleCases(mode);
+    });
+
+  program
+    .command("run")
+    .description("Run one benchmark mode")
+    .argument("<mode>", "cli, flow, script, or app", parseMode)
+    .argument("[caseIds...]", "specific case ids to run")
+    .option("--runs <n>", "number of attempts per case", parsePositiveInteger, 1)
+    .option("--output <path>", "write the result JSON to this path")
+    .option("--model <name>", `model alias (${EVAL_MODELS.map((entry) => entry.id).join(", ")})`)
+    .option("--models <names>", "comma-separated model aliases to run sequentially")
+    .option("--verbose", "stream assistant output during frontend runs")
+    .option("--record", "append a compact summary line to ai_evals/history/<mode>.jsonl")
+    .action(
+      async (
+        mode: EvalMode,
+        caseIds: string[],
+        options: {
+          runs: number;
+          output?: string;
+          model?: string;
+          models?: string;
+          verbose?: boolean;
+          record?: boolean;
+        }
+      ) => {
+        await handleRun({
+          mode,
+          caseIds,
+          runs: options.runs,
+          outputPath: options.output,
+          model: options.model,
+          models: options.models,
+          verbose: options.verbose ?? false,
+          record: options.record ?? false,
+        });
+      }
+    );
+
+  await program.parseAsync(process.argv);
+}
+
+async function handleCases(mode?: EvalMode) {
+  const modes = mode ? [mode] : [...EVAL_MODES];
+
+  for (const entry of modes) {
+    const cases = await loadCases(entry);
+    process.stdout.write(`${entry} (${cases.length})\n`);
+    for (const evalCase of cases) {
+      process.stdout.write(`- ${evalCase.id}\n`);
+    }
+    process.stdout.write("\n");
+  }
+}
+
+function handleModels() {
+  process.stdout.write("Available models\n");
+  for (const model of EVAL_MODELS) {
+    const supports = [
+      ...(model.frontend ? ["flow", "script", "app"] : []),
+      ...(model.cli ? ["cli"] : []),
+    ];
+    const aliases = [model.id, ...model.aliases.filter((alias) => alias !== model.id)];
+    process.stdout.write(`- ${model.id}: ${model.label}\n`);
+    process.stdout.write(`  aliases: ${aliases.join(", ")}\n`);
+    process.stdout.write(`  modes: ${supports.join(", ")}\n`);
+  }
+  process.stdout.write(`\nJudge model: ${DEFAULT_JUDGE_MODEL}\n`);
+}
+
+async function handleRun(input: {
+  mode: EvalMode;
+  caseIds: string[];
+  runs: number;
+  outputPath?: string;
+  model?: string;
+  models?: string;
+  verbose: boolean;
+  record: boolean;
+}) {
+  if (input.record && input.caseIds.length > 0) {
+    throw new Error("--record only supports full-suite runs; omit case ids to record history");
+  }
+  if (input.model && input.models) {
+    throw new Error("Use either --model or --models, not both");
+  }
+
+  const selectedCases = await loadSelectedCases(input.mode, input.caseIds);
+  const models = resolveRequestedModels(input.mode, input.model, input.models);
+  if (input.outputPath && models.length > 1) {
+    throw new Error("--output only supports a single model run");
+  }
+
+  const summaries: Array<{ label: string; passRate: number; averageDurationMs: number }> = [];
+
+  for (const [index, model] of models.entries()) {
+    const runModel = formatRunModelLabel(input.mode, model);
+    if (models.length > 1) {
+      process.stdout.write(
+        `${index > 0 ? "\n" : ""}=== ${input.mode} ${model.id} (${runModel}) ===\n`
+      );
+    }
+    process.stderr.write(`Starting ${input.mode} benchmark...\n`);
+
+    const result =
+      input.mode === "cli"
+        ? await runCliBenchmark(selectedCases, input.runs, getCliEvalModel(model), runModel)
+        : await runFrontendBenchmarkAdapter({
+            mode: input.mode,
+            caseIds: input.caseIds,
+            runs: input.runs,
+            model: model.id,
+            verbose: input.verbose,
+          });
+
+    const resolvedOutputPath =
+      models.length === 1
+        ? resolveRunOutputPath(input.mode, input.outputPath)
+        : resolveRunOutputPath(input.mode);
+    const artifactsPath = await writeRunArtifacts(result, resolvedOutputPath);
+    const resultPath = await writeRunResult(result, resolvedOutputPath);
+    const historyPath = input.record ? await appendHistoryRecord(result) : null;
+    process.stdout.write(`${formatRunSummary(result)}\n`);
+    process.stdout.write(`Saved: ${resultPath}\n`);
+    if (artifactsPath) {
+      process.stdout.write(`Artifacts: ${artifactsPath}\n`);
+    }
+    if (historyPath) {
+      process.stdout.write(`Recorded: ${historyPath}\n`);
+    }
+
+    summaries.push({
+      label: `${model.id} (${runModel})`,
+      passRate: result.passRate,
+      averageDurationMs: result.averageDurationMs,
+    });
+  }
+
+  if (summaries.length > 1) {
+    process.stdout.write("\nModel summary\n");
+    for (const summary of summaries) {
+      process.stdout.write(
+        `- ${summary.label}: ${formatPercent(summary.passRate)} | ${Math.round(summary.averageDurationMs)}ms\n`
+      );
+    }
+  }
+}
+
+async function runCliBenchmark(
+  cases: Awaited<ReturnType<typeof loadSelectedCases>>,
+  runs: number,
+  model: ReturnType<typeof getCliEvalModel>,
+  runModel: string
+) {
+  const caseResults = await runSuite({
+    modeRunner: createCliModeRunner(model),
+    cases,
+    runs,
+    runModel,
+    judgeModel: DEFAULT_JUDGE_MODEL,
+  });
+
+  return buildRunResult({
+    mode: "cli",
+    runs,
+    runModel,
+    judgeModel: DEFAULT_JUDGE_MODEL,
+    caseResults,
+  });
+}
+
+function parseMode(value: string): EvalMode {
+  if (EVAL_MODES.includes(value as EvalMode)) {
+    return value as EvalMode;
+  }
+  throw new InvalidArgumentError(`mode must be one of: ${EVAL_MODES.join(", ")}`);
+}
+
+function parseOptionalMode(value: string | undefined): EvalMode | undefined {
+  return value ? parseMode(value) : undefined;
+}
+
+function parsePositiveInteger(value: string): number {
+  const parsed = Number(value);
+  if (!Number.isInteger(parsed) || parsed <= 0) {
+    throw new InvalidArgumentError("must be a positive integer");
+  }
+  return parsed;
+}
+
+function resolveRequestedModels(
+  mode: EvalMode,
+  singleModel?: string,
+  multipleModels?: string
+): EvalModelSpec[] {
+  if (!multipleModels) {
+    return [resolveEvalModel(mode, singleModel)];
+  }
+
+  const aliases = multipleModels
+    .split(",")
+    .map((value) => value.trim())
+    .filter(Boolean);
+  if (aliases.length === 0) {
+    throw new Error("--models requires at least one model alias");
+  }
+
+  const seen = new Set<string>();
+  const models: EvalModelSpec[] = [];
+  for (const alias of aliases) {
+    const model = resolveEvalModel(mode, alias);
+    if (seen.has(model.id)) {
+      continue;
+    }
+    seen.add(model.id);
+    models.push(model);
+  }
+  return models;
+}
+
+function formatPercent(value: number): string {
+  return `${(value * 100).toFixed(1)}%`;
+}
+
+void main().catch((error) => {
+  const message = error instanceof Error ? error.message : String(error);
+  process.stderr.write(`${message}\n`);
+  process.exit(1);
+});
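Note on `--models` handling: `resolveRequestedModels` splits the comma-separated value, trims each entry, drops empty entries, and keeps only the first occurrence of each resolved model id. The sketch below reproduces that shape in isolation. The alias table and the `resolveAlias` and `parseModelList` names are hypothetical stand-ins, since `resolveEvalModel` lives in `../core/models` and is not shown in this part of the diff.

```typescript
// Hypothetical alias table; the real mapping lives in ../core/models.
const ALIASES = new Map<string, string>([
  ["haiku", "haiku"],
  ["claude-haiku", "haiku"],
  ["opus", "opus"],
  ["4o", "4o"],
]);

// Hypothetical stand-in for resolveEvalModel: maps an alias to a model id.
function resolveAlias(alias: string): string {
  const id = ALIASES.get(alias);
  if (id === undefined) {
    throw new Error(`Unknown model alias: ${alias}`);
  }
  return id;
}

// Same steps as resolveRequestedModels: split, trim, drop empties,
// then keep the first occurrence of each resolved id.
function parseModelList(value: string): string[] {
  const aliases = value
    .split(",")
    .map((entry) => entry.trim())
    .filter(Boolean);
  if (aliases.length === 0) {
    throw new Error("--models requires at least one model alias");
  }

  const seen = new Set<string>();
  const ids: string[] = [];
  for (const alias of aliases) {
    const id = resolveAlias(alias);
    if (seen.has(id)) {
      continue;
    }
    seen.add(id);
    ids.push(id);
  }
  return ids;
}

console.log(parseModelList("haiku, claude-haiku,opus,,4o"));
```

Under these assumptions, two aliases that resolve to the same id produce a single run, so a list such as `haiku,claude-haiku` would not benchmark the same model twice.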
--- a/ai_evals/core/cases.ts
+++ b/ai_evals/core/cases.ts
@@ -0,0 +1,71 @@
+import { readFile } from "node:fs/promises";
+import path from "node:path";
+import { fileURLToPath } from "node:url";
+import { parse } from "yaml";
+import type { EvalCase, EvalMode, FlowValidationSpec } from "./types";
+
+const REPO_ROOT = fileURLToPath(new URL("../../", import.meta.url));
+const CASES_DIR = path.join(REPO_ROOT, "ai_evals", "cases");
+
+interface RawEvalCase {
+  id: string;
+  prompt: string;
+  initial?: string;
+  expected?: string;
+  validate?: FlowValidationSpec;
+  judgeChecklist?: string[];
+}
+
+export function getRepoRoot(): string {
+  return REPO_ROOT;
+}
+
+export function getAiEvalsRoot(): string {
+  return path.join(REPO_ROOT, "ai_evals");
+}
+
+export async function loadCases(mode: EvalMode): Promise<EvalCase[]> {
+  const filePath = path.join(CASES_DIR, `${mode}.yaml`);
+  const raw = await readFile(filePath, "utf8");
+  const parsed = parse(raw);
+
+  if (!Array.isArray(parsed)) {
+    throw new Error(`Expected ${filePath} to contain a YAML list of cases`);
+  }
+
+  return (parsed as RawEvalCase[]).map((entry) => ({
+    id: entry.id,
+    prompt: entry.prompt,
+    initialPath: resolveFixturePath(entry.initial),
+    expectedPath: resolveFixturePath(entry.expected),
+    validate: entry.validate,
+    judgeChecklist: entry.judgeChecklist,
+  }));
+}
+
+export async function loadSelectedCases(
+  mode: EvalMode,
+  selectedIds: string[]
+): Promise<EvalCase[]> {
+  const allCases = await loadCases(mode);
+  if (selectedIds.length === 0) {
+    return allCases;
+  }
+
+  const caseMap = new Map(allCases.map((entry) => [entry.id, entry]));
+  const missing = selectedIds.filter((id) => !caseMap.has(id));
+  if (missing.length > 0) {
+    throw new Error(
+      `Unknown ${mode} case${missing.length === 1 ? "" : "s"}: ${missing.join(", ")}`
+    );
+  }
+
+  return selectedIds.map((id) => caseMap.get(id)!);
+}
+
+function resolveFixturePath(value: string | undefined): string | undefined {
+  if (!value) {
+    return undefined;
+  }
+  return path.isAbsolute(value) ? value : path.join(REPO_ROOT, value);
+}
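Note on case selection: `loadSelectedCases` returns every case when no ids are given, rejects the whole request if any id is unknown, and otherwise returns cases in the order the ids were requested rather than in file order. The sketch below isolates that logic over an in-memory list. The `selectCases` name and the sample data are illustrative assumptions, not part of this change.

```typescript
// Illustrative helper mirroring the selection logic of loadSelectedCases,
// but taking the already-loaded cases as an argument instead of reading YAML.
function selectCases<T extends { id: string }>(
  allCases: T[],
  selectedIds: string[],
  mode: string
): T[] {
  if (selectedIds.length === 0) {
    return allCases;
  }

  const caseMap = new Map(allCases.map((entry) => [entry.id, entry] as const));
  const missing = selectedIds.filter((id) => !caseMap.has(id));
  if (missing.length > 0) {
    throw new Error(
      `Unknown ${mode} case${missing.length === 1 ? "" : "s"}: ${missing.join(", ")}`
    );
  }

  // Output follows the requested order, not the order in the case file.
  return selectedIds.map((id) => caseMap.get(id)!);
}

// Sample data using ids from ai_evals/cases/cli.yaml; prompts are shortened.
const sampleCases = [
  { id: "bun-hello-script", prompt: "Create a Windmill Bun script" },
  { id: "bun-hello-flow", prompt: "Create a Windmill flow" },
  { id: "python-add-numbers-script", prompt: "Add a Windmill Python script" },
];

const picked = selectCases(sampleCases, ["python-add-numbers-script", "bun-hello-script"], "cli");
console.log(picked.map((entry) => entry.id));
```

One consequence of this design is that a typo in any case id fails the run before any model call is made, which avoids spending benchmark time on a partially valid selection.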
--- a/ai_evals/core/files.ts
+++ b/ai_evals/core/files.ts
@@ -0,0 +1,67 @@
+import { access, copyFile, mkdir, readdir, readFile } from "node:fs/promises";
+import path from "node:path";
+
+export async function exists(filePath: string): Promise<boolean> {
+  try {
+    await access(filePath);
+    return true;
+  } catch {
+    return false;
+  }
+}
+
+export async function readJsonFile<T>(filePath: string): Promise<T> {
+  const raw = await readFile(filePath, "utf8");
+  return JSON.parse(raw) as T;
+}
+
+export async function readDirectoryFiles(
+  rootDir: string,
+  options: {
+    ignore?: Set<string>;
+  } = {}
+): Promise<Record<string, string>> {
+  const files: Record<string, string> = {};
+  await walkDirectory(rootDir, "", files, options.ignore ?? new Set());
+  return files;
+}
+
+export async function copyDirectory(sourceDir: string, targetDir: string): Promise<void> {
+  const entries = await readdir(sourceDir, { withFileTypes: true });
+  await mkdir(targetDir, { recursive: true });
+
+  for (const entry of entries) {
+    const sourcePath = path.join(sourceDir, entry.name);
+    const targetPath = path.join(targetDir, entry.name);
+    if (entry.isDirectory()) {
+      await copyDirectory(sourcePath, targetPath);
+      continue;
+    }
+    await mkdir(path.dirname(targetPath), { recursive: true });
+    await copyFile(sourcePath, targetPath);
+  }
+}
+
+async function walkDirectory(
+  absoluteDir: string,
+  relativeDir: string,
+  output: Record<string, string>,
+  ignore: Set<string>
+): Promise<void> {
+  const entries = await readdir(absoluteDir, { withFileTypes: true });
+
+  for (const entry of entries) {
+    const relativePath = relativeDir ? `${relativeDir}/${entry.name}` : entry.name;
+    if (ignore.has(relativePath) || ignore.has(entry.name)) {
+      continue;
+    }
+
+    const absolutePath = path.join(absoluteDir, entry.name);
+    if (entry.isDirectory()) {
+      await walkDirectory(absolutePath, relativePath, output, ignore);
+      continue;
+    }
+
+    output[relativePath] = await readFile(absolutePath, "utf8");
+  }
+}
--- a/ai_evals/core/judge.ts
+++ b/ai_evals/core/judge.ts
@@ -0,0 +1,149 @@
+import Anthropic from "@anthropic-ai/sdk";
+import type { EvalMode, JudgeResult } from "./types";
+
+export const DEFAULT_JUDGE_MODEL = "claude-sonnet-4-6";
+
+const JUDGE_TOOL_NAME = "submit_judgement";
+
+export async function judgeOutput(input: {
+  mode: EvalMode;
+  prompt: string;
+  checklist?: string[];
+  initial?: unknown;
+  expected?: unknown;
+  actual: unknown;
+  model?: string;
+}): Promise<JudgeResult> {
+  const apiKey = process.env.ANTHROPIC_API_KEY;
+  if (!apiKey) {
+    return {
+      success: false,
+      score: 0,
+      summary: "Judge unavailable",
+      error: "ANTHROPIC_API_KEY is not set",
+    };
+  }
+
+  const client = new Anthropic({ apiKey });
+  const model = input.model ?? DEFAULT_JUDGE_MODEL;
+
+  const system = [
+    "You evaluate benchmark outputs for Windmill AI generation.",
+    "Deterministic checks are already run separately. Focus on whether the final output satisfies the user request.",
+    "If expected state is provided, treat it as a valid example and reward semantically equivalent outputs.",
+    "If a checklist is provided, treat it as the explicit acceptance criteria for this case.",
+    "Be strict about missing requested functionality.",
+    "When the prompt wording is ambiguous, prefer the checklist over inferred structural requirements.",
+    "Do not require exact ids, exact topology, or exact field names unless the prompt, checklist, or expected state clearly requires them.",
+    `Always respond by calling the ${JUDGE_TOOL_NAME} tool exactly once.`,
+  ].join("\n\n");
+
+  const user = [
+    `Mode: ${input.mode}`,
+    "",
+    "User prompt:",
+    input.prompt,
+    "",
+    "Checklist:",
+    formatChecklist(input.checklist),
+    "",
+    "Initial state:",
+    formatJsonBlock(input.initial),
+    "",
+    "Expected state:",
+    formatJsonBlock(input.expected),
+    "",
+    "Actual result:",
+    formatJsonBlock(input.actual),
+  ].join("\n");
+
+  try {
+    const response = await client.messages.create({
+      model,
+      max_tokens: 1024,
+      temperature: 0,
+      system,
+      messages: [{ role: "user", content: user }],
+      tools: [
+        {
+          name: JUDGE_TOOL_NAME,
+          description: "Submit the benchmark judgement as structured data.",
+          input_schema: {
+            type: "object",
+            properties: {
+              score: {
+                type: "integer",
+                minimum: 0,
+                maximum: 100,
+              },
+              summary: {
+                type: "string",
+              },
+            },
+            required: ["score", "summary"],
+          },
+        },
+      ],
+      tool_choice: {
+        type: "tool",
+        name: JUDGE_TOOL_NAME,
+        disable_parallel_tool_use: true,
+      },
+    });
+
+    const toolUseBlock = response.content.find(
+      (block): block is Anthropic.ToolUseBlock =>
+        block.type === "tool_use" && block.name === JUDGE_TOOL_NAME
+    );
+
+    if (!toolUseBlock) {
+      return {
+        success: false,
+        score: 0,
+        summary: "Judge returned no tool output",
+        error: "Expected structured tool output from judge",
+      };
+    }
+
+    const parsed = toolUseBlock.input as {
+      score: number;
+      summary: string;
+    };
+
+    return {
+      success: true,
+      score: normalizeScore(parsed.score),
+      summary: typeof parsed.summary === "string" ? parsed.summary : "",
+    };
+  } catch (error) {
+    const message = error instanceof Error ? error.message : String(error);
+    return {
+      success: false,
+      score: 0,
+      summary: "Judge failed",
+      error: message,
+    };
+  }
+}
+
+function formatJsonBlock(value: unknown): string {
+  if (value === undefined) {
+    return "(none)";
+  }
+  return JSON.stringify(value, null, 2);
+}
+
+function formatChecklist(checklist: string[] | undefined): string {
+  if (!checklist || checklist.length === 0) {
+    return "(none)";
+  }
+
+  return checklist.map((item) => `- ${item}`).join("\n");
+}
+
+function normalizeScore(value: number): number {
+  if (!Number.isFinite(value)) {
+    return 0;
+  }
+  return Math.max(0, Math.min(100, Math.round(value)));
+}
--- a/ai_evals/core/models.test.ts
+++ b/ai_evals/core/models.test.ts
@@ -0,0 +1,29 @@
+import { describe, expect, it } from "bun:test";
+import { resolveEvalModel } from "./models";
+
+describe("resolveEvalModel", () => {
+  it("supports Gemini aliases for frontend evals", () => {
+    expect(resolveEvalModel("flow", "gemini").frontend).toEqual({
+      provider: "googleai",
+      model: "gemini-2.5-flash",
+    });
+    expect(resolveEvalModel("app", "gemini-pro").frontend).toEqual({
+      provider: "googleai",
+      model: "gemini-2.5-pro",
+    });
+    expect(resolveEvalModel("script", "gemini-3-flash-preview").frontend).toEqual({
+      provider: "googleai",
+      model: "gemini-3-flash-preview",
+    });
+    expect(resolveEvalModel("flow", "gemini-3.1-pro-preview").frontend).toEqual({
+      provider: "googleai",
+      model: "gemini-3.1-pro-preview",
+    });
+  });
+
+  it("rejects Gemini aliases for cli evals", () => {
+    expect(() => resolveEvalModel("cli", "gemini")).toThrow(
+      "Model gemini-flash is not supported for cli mode"
+    );
+  });
+});
--- a/ai_evals/core/models.ts
+++ b/ai_evals/core/models.ts
@@ -0,0 +1,185 @@
+import type { EvalMode } from "./types";
+
+export interface FrontendEvalModelConfig {
+  provider: "anthropic" | "openai" | "googleai";
+  model: string;
+}
+
+export interface CliEvalModelConfig {
+  provider: "anthropic";
+  model: string;
+}
+
+export interface EvalModelSpec {
+  id: string;
+  label: string;
+  aliases: string[];
+  frontend?: FrontendEvalModelConfig;
+  cli?: CliEvalModelConfig;
+}
+
+export const EVAL_MODELS: EvalModelSpec[] = [
+  {
+    id: "haiku",
+    label: "Claude Haiku 4.5",
+    aliases: [
+      "haiku",
+      "haiku-4.5",
+      "claude-haiku",
+      "claude-haiku-4.5",
+      "claude-haiku-4-5",
+      "claude-haiku-4-5-20251001",
+    ],
+    frontend: {
+      provider: "anthropic",
+      model: "claude-haiku-4-5-20251001",
+    },
+    cli: {
+      provider: "anthropic",
+      model: "haiku",
+    },
+  },
+  {
+    id: "sonnet",
+    label: "Claude Sonnet 4.5",
+    aliases: [
+      "sonnet",
+      "sonnet-4.5",
+      "claude-sonnet",
+      "claude-sonnet-4.5",
+      "claude-sonnet-4-5",
+      "claude-sonnet-4-5-20250929",
+    ],
+    frontend: {
+      provider: "anthropic",
+      model: "claude-sonnet-4-5-20250929",
+    },
+    cli: {
+      provider: "anthropic",
+      model: "sonnet",
+    },
+  },
+  {
+    id: "opus",
+    label: "Claude Opus 4.6",
+    aliases: [
+      "opus",
+      "opus-4.6",
+      "claude-opus",
+      "claude-opus-4.6",
+      "claude-opus-4-6",
+    ],
+    frontend: {
+      provider: "anthropic",
+      model: "claude-opus-4-6",
+    },
+    cli: {
+      provider: "anthropic",
+      model: "opus",
+    },
+  },
+  {
+    id: "4o",
+    label: "GPT-4o",
+    aliases: ["4o", "gpt-4o"],
+    frontend: {
+      provider: "openai",
+      model: "gpt-4o",
+    },
+  },
+  {
+    id: "gemini-flash",
+    label: "Gemini 2.5 Flash",
+    aliases: ["gemini", "gemini-flash", "gemini-2.5-flash"],
+    frontend: {
+      provider: "googleai",
+      model: "gemini-2.5-flash",
+    },
+  },
+  {
+    id: "gemini-pro",
+    label: "Gemini 2.5 Pro",
+    aliases: ["gemini-pro", "gemini-2.5-pro"],
+    frontend: {
+      provider: "googleai",
+      model: "gemini-2.5-pro",
+    },
+  },
+  {
+    id: "gemini-3-flash-preview",
+    label: "Gemini 3 Flash Preview",
+    aliases: ["gemini-3-flash-preview", "gemini-3-flash"],
+    frontend: {
+      provider: "googleai",
+      model: "gemini-3-flash-preview",
+    },
+  },
+  {
+    id: "gemini-3.1-pro-preview",
+    label: "Gemini 3.1 Pro Preview",
+    aliases: ["gemini-3.1-pro-preview", "gemini-3.1-pro", "gemini-3-pro-preview"],
+    frontend: {
+      provider: "googleai",
+      model: "gemini-3.1-pro-preview",
+    },
+  },
+];
+
+export function resolveEvalModel(mode: EvalMode, alias?: string): EvalModelSpec {
+  const spec = alias ? findEvalModel(alias) : getDefaultEvalModel(mode);
+  if (!spec) {
+    throw new Error(`Unknown model: ${alias}`);
+  }
+
+  if (mode === "cli" && !spec.cli) {
+    throw new Error(`Model ${spec.id} is not supported for cli mode`);
+  }
+
+  if (mode !== "cli" && !spec.frontend) {
+    throw new Error(`Model ${spec.id} is not supported for ${mode} mode`);
+  }
+
+  return spec;
+}
+
+export function getEvalModelHelpText(): string {
+  return EVAL_MODELS.map((model) => {
+    const modes = [
+      ...(model.frontend ? ["flow", "script", "app"] : []),
+      ...(model.cli ? ["cli"] : []),
+    ];
+    return `  ${model.id.padEnd(22)} ${model.label} (${modes.join(", ")})`;
+  }).join("\n");
+}
+
+export function formatRunModelLabel(mode: EvalMode, model: EvalModelSpec): string {
+  if (mode === "cli") {
+    return `${model.cli!.provider}:${model.cli!.model}`;
+  }
+  return `${model.frontend!.provider}:${model.frontend!.model}`;
+}
+
+export function getFrontendEvalModel(model: EvalModelSpec): FrontendEvalModelConfig {
+  if (!model.frontend) {
+    throw new Error(`Model ${model.id} does not support frontend evals`);
+  }
+  return model.frontend;
+}
+
+export function getCliEvalModel(model: EvalModelSpec): CliEvalModelConfig {
+  if (!model.cli) {
+    throw new Error(`Model ${model.id} does not support cli evals`);
+  }
+  return model.cli;
+}
+
+function getDefaultEvalModel(_mode: EvalMode): EvalModelSpec {
+  return EVAL_MODELS[0]!; // every mode currently defaults to the first entry (haiku)
+}
+
+function findEvalModel(alias: string): EvalModelSpec | undefined {
+  const normalized = alias.trim().toLowerCase();
+  return EVAL_MODELS.find((model) =>
+    [model.id, ...model.aliases].some((candidate) => candidate.toLowerCase() === normalized)
+  );
+}
--- a/ai_evals/core/results.ts
+++ b/ai_evals/core/results.ts
@@ -0,0 +1,296 @@
+import { appendFile, mkdir, rm, writeFile } from "node:fs/promises";
+import path from "node:path";
+import { execFileSync } from "node:child_process";
+import { getAiEvalsRoot, getRepoRoot } from "./cases";
+import type {
+  BenchmarkArtifactFile,
+  BenchmarkCaseResult,
+  BenchmarkRunResult,
+  BenchmarkTokenUsage,
+  EvalMode,
+} from "./types";
+
+export async function writeRunResult(
+  result: BenchmarkRunResult,
+  outputPath?: string
+): Promise<string> {
+  const targetPath = resolveRunOutputPath(result.mode, outputPath);
+  await mkdir(path.dirname(targetPath), { recursive: true });
+  await writeFile(targetPath, JSON.stringify(toSerializableRunResult(result), null, 2) + "\n", "utf8");
+  return targetPath;
+}
+
+export async function appendHistoryRecord(
+  result: BenchmarkRunResult,
+  historyPath = resolveHistoryPath(result.mode)
+): Promise<string> {
+  await mkdir(path.dirname(historyPath), { recursive: true });
+  await appendFile(historyPath, JSON.stringify(toHistoryRecord(result)) + "\n", "utf8");
+  return historyPath;
+}
+
+export async function writeRunArtifacts(
+  result: BenchmarkRunResult,
+  outputPath?: string
+): Promise<string | null> {
+  const targetPath = resolveRunOutputPath(result.mode, outputPath);
+  const artifactRoot = defaultArtifactsRoot(targetPath);
+
+  await rm(artifactRoot, { recursive: true, force: true });
+
+  let wroteArtifacts = false;
+  for (const caseResult of result.cases) {
+    for (const attempt of caseResult.attempts) {
+      const artifactFiles = attempt.artifactFiles ?? [];
+      if (artifactFiles.length === 0) {
+        attempt.artifactsPath = null;
+        continue;
+      }
+
+      const attemptDir = path.join(artifactRoot, caseResult.id, `attempt-${attempt.attempt}`);
+      await writeArtifactFiles(attemptDir, artifactFiles);
+      attempt.artifactsPath = attemptDir;
+      wroteArtifacts = true;
+    }
+  }
+
+  result.artifactsPath = wroteArtifacts ? artifactRoot : null;
+  return result.artifactsPath ?? null;
+}
+
+export function buildRunResult(input: {
+  mode: EvalMode;
+  runs: number;
+  runModel: string | null;
+  judgeModel: string | null;
+  caseResults: BenchmarkCaseResult[];
+}): BenchmarkRunResult {
+  const attemptCount = input.caseResults.reduce((sum, entry) => sum + entry.attempts.length, 0);
+  const passedAttempts = input.caseResults.reduce(
+    (sum, entry) => sum + entry.attempts.filter((attempt) => attempt.passed).length,
+    0
+  );
+  const durationTotal = input.caseResults.reduce(
+    (sum, entry) => sum + entry.attempts.reduce((inner, attempt) => inner + attempt.durationMs, 0),
+    0
+  );
+  const tokenUsageTotal = input.caseResults.reduce<BenchmarkTokenUsage | null>(
+    (sum, entry) => {
+      for (const attempt of entry.attempts) {
+        if (!attempt.tokenUsage) {
+          continue;
+        }
+        sum ??= { prompt: 0, completion: 0, total: 0 };
+        sum.prompt += attempt.tokenUsage.prompt;
+        sum.completion += attempt.tokenUsage.completion;
+        sum.total += attempt.tokenUsage.total;
+      }
+      return sum;
+    },
+    null
+  );
+
+  return {
+    version: 1,
+    mode: input.mode,
+    createdAt: new Date().toISOString(),
+    gitSha: getGitSha(),
+    runs: input.runs,
+    runModel: input.runModel,
+    judgeModel: input.judgeModel,
+    caseCount: input.caseResults.length,
+    attemptCount,
+    passedAttempts,
+    passRate: attemptCount === 0 ? 0 : passedAttempts / attemptCount,
+    averageDurationMs: attemptCount === 0 ? 0 : durationTotal / attemptCount,
+    totalTokenUsage: tokenUsageTotal,
+    averageTokenUsagePerAttempt:
+      attemptCount === 0 || !tokenUsageTotal
+        ? null
+        : {
+            prompt: tokenUsageTotal.prompt / attemptCount,
+            completion: tokenUsageTotal.completion / attemptCount,
+            total: tokenUsageTotal.total / attemptCount,
+          },
+    cases: input.caseResults,
+  };
+}
+
+export function formatRunSummary(result: BenchmarkRunResult): string {
+  const lines = [
+    `${result.mode} benchmark complete`,
+    `Pass rate: ${formatPercent(result.passRate)} (${result.passedAttempts}/${result.attemptCount})`,
+    `Average duration: ${Math.round(result.averageDurationMs)}ms`,
+  ];
+
+  const failures = collectFailures(result);
+  if (failures.length > 0) {
+    lines.push("Failures:");
+    for (const entry of failures.slice(0, 10)) {
+      lines.push(`- ${entry}`);
+    }
+  }
+
+  return lines.join("\n");
+}
+
+function collectFailures(result: BenchmarkRunResult): string[] {
+  const failures: string[] = [];
+
+  for (const caseResult of result.cases) {
+    for (const attempt of caseResult.attempts) {
+      if (attempt.passed) {
+        continue;
+      }
+      const failedChecks = attempt.checks.filter((check) => !check.passed).map((check) => check.name);
+      failures.push(
+        `${caseResult.id} attempt ${attempt.attempt}: ${failedChecks.join(", ") || attempt.error || "failed"}`
+      );
+    }
+  }
+
+  return failures;
+}
+
+function defaultFileName(mode: EvalMode): string {
+  return `${new Date().toISOString().replaceAll(":", "-")}__${mode}.json`;
+}
+
+export function resolveRunOutputPath(mode: EvalMode, outputPath?: string): string {
+  return outputPath ?? path.join(getAiEvalsRoot(), "results", defaultFileName(mode));
+}
+
+export function resolveHistoryPath(mode: EvalMode): string {
+  return path.join(getAiEvalsRoot(), "history", `${mode}.jsonl`);
+}
+
+function defaultArtifactsRoot(resultPath: string): string {
+  return resultPath.endsWith(".json")
+    ? resultPath.slice(0, -".json".length)
+    : `${resultPath}.artifacts`;
+}
+
+async function writeArtifactFiles(
+  rootDir: string,
+  files: BenchmarkArtifactFile[]
+): Promise<void> {
+  for (const file of files) {
+    const relativePath = normalizeArtifactPath(file.path);
+    const targetPath = path.join(rootDir, relativePath);
+    await mkdir(path.dirname(targetPath), { recursive: true });
+    await writeFile(targetPath, file.content, "utf8");
+  }
+}
+
+function normalizeArtifactPath(filePath: string): string {
+  const normalized = filePath.replaceAll("\\", "/").replace(/^\/+/, "");
+  const parts = normalized.split("/").filter(Boolean);
+  if (parts.length === 0 || parts.some((part) => part === "." || part === "..")) {
+    throw new Error(`Invalid artifact path: ${filePath}`);
+  }
+  return parts.join("/");
+}
+
+function toSerializableRunResult(result: BenchmarkRunResult): BenchmarkRunResult {
+  return {
+    ...result,
+    cases: result.cases.map((caseResult) => ({
+      ...caseResult,
+      attempts: caseResult.attempts.map(({ artifactFiles, ...attempt }) => attempt),
+    })),
+  };
+}
+
+function toHistoryRecord(result: BenchmarkRunResult) {
+  const judgeScores = result.cases.flatMap((caseResult) =>
+    caseResult.attempts.flatMap((attempt) =>
+      typeof attempt.judgeScore === "number" ? [attempt.judgeScore] : []
+    )
+  );
+
+  return {
+    createdAt: result.createdAt,
+    gitSha: result.gitSha,
+    mode: result.mode,
+    runs: result.runs,
+    runModel: result.runModel,
+    judgeModel: result.judgeModel,
+    caseCount: result.caseCount,
+    attemptCount: result.attemptCount,
+    passedAttempts: result.passedAttempts,
+    passRate: result.passRate,
+    averageDurationMs: result.averageDurationMs,
+    averageJudgeScore:
+      judgeScores.length === 0
+        ? null
+        : judgeScores.reduce((sum, score) => sum + score, 0) / judgeScores.length,
+    averageTokenUsagePerAttempt: result.averageTokenUsagePerAttempt ?? null,
+    failedCaseIds: Array.from(
+      new Set(
+        result.cases
+          .filter((caseResult) => caseResult.attempts.some((attempt) => !attempt.passed))
+          .map((caseResult) => caseResult.id)
+      )
+    ),
+    cases: result.cases.map((caseResult) => {
+      const attemptCount = caseResult.attempts.length;
+      const passedAttempts = caseResult.attempts.filter((attempt) => attempt.passed).length;
+      const totalDurationMs = caseResult.attempts.reduce(
+        (sum, attempt) => sum + attempt.durationMs,
+        0
+      );
+      const judgeScores = caseResult.attempts.flatMap((attempt) =>
+        typeof attempt.judgeScore === "number" ? [attempt.judgeScore] : []
+      );
+      const totalTokenUsage = caseResult.attempts.reduce<BenchmarkTokenUsage | null>(
+        (sum, attempt) => {
+          if (!attempt.tokenUsage) {
+            return sum;
+          }
+          sum ??= { prompt: 0, completion: 0, total: 0 };
+          sum.prompt += attempt.tokenUsage.prompt;
+          sum.completion += attempt.tokenUsage.completion;
+          sum.total += attempt.tokenUsage.total;
+          return sum;
+        },
+        null
+      );
+
+      return {
+        id: caseResult.id,
+        attemptCount,
+        passedAttempts,
+        passRate: attemptCount === 0 ? 0 : passedAttempts / attemptCount,
+        averageDurationMs: attemptCount === 0 ? 0 : totalDurationMs / attemptCount,
+        averageJudgeScore:
+          judgeScores.length === 0
+            ? null
+            : judgeScores.reduce((sum, score) => sum + score, 0) / judgeScores.length,
+        averageTokenUsagePerAttempt:
+          attemptCount === 0 || !totalTokenUsage
+            ? null
+            : {
+                prompt: totalTokenUsage.prompt / attemptCount,
+                completion: totalTokenUsage.completion / attemptCount,
+                total: totalTokenUsage.total / attemptCount,
+              },
+      };
+    }),
+  };
+}
+
+function getGitSha(): string | null {
+  try {
+    return execFileSync("git", ["rev-parse", "HEAD"], {
+      cwd: getRepoRoot(),
+      encoding: "utf8",
+      stdio: ["ignore", "pipe", "ignore"],
+    }).trim();
+  } catch {
+    return null;
+  }
+}
+
+function formatPercent(value: number): string {
+  return `${(value * 100).toFixed(1)}%`;
+}
--- a/ai_evals/core/runSuite.ts
+++ b/ai_evals/core/runSuite.ts
@@ -0,0 +1,264 @@
+import { judgeOutput, DEFAULT_JUDGE_MODEL } from "./judge";
+import type {
+  BenchmarkAttemptResult,
+  BenchmarkCaseResult,
+  BenchmarkCheck,
+  EvalCase,
+  FrontendBenchmarkProgressEvent,
+  ModeRunner,
+} from "./types";
+
+export async function runSuite<TInitial, TExpected, TActual>(input: {
+  modeRunner: ModeRunner<TInitial, TExpected, TActual>;
+  cases: EvalCase[];
+  runs: number;
+  runModel: string | null;
+  judgeModel?: string | null;
+  concurrency?: number;
+  verbose?: boolean;
+  onProgress?: (event: FrontendBenchmarkProgressEvent) => void;
+}): Promise<BenchmarkCaseResult[]> {
+  const judgeModel = input.judgeModel ?? DEFAULT_JUDGE_MODEL;
+  const concurrency = Math.max(1, input.concurrency ?? input.modeRunner.concurrency);
+  const results = new Array<BenchmarkCaseResult>(input.cases.length);
+  let cursor = 0;
+
+  if (input.modeRunner.mode !== "cli") {
+    input.onProgress?.({
+      type: "run-start",
+      surface: input.modeRunner.mode,
+      totalCases: input.cases.length,
+      runs: input.runs,
+      concurrency,
+    });
+  }
+
+  async function worker(): Promise<void> {
+    while (true) {
+      const caseIndex = cursor++;
+      if (caseIndex >= input.cases.length) {
+        return;
+      }
+      const evalCase = input.cases[caseIndex];
+      results[caseIndex] = {
+        id: evalCase.id,
+        prompt: evalCase.prompt,
+        initialPath: evalCase.initialPath,
+        expectedPath: evalCase.expectedPath,
+        attempts: await runCaseAttempts({
+          caseIndex,
+          evalCase,
+          runs: input.runs,
+          judgeModel,
+          judgeThreshold: input.modeRunner.judgeThreshold ?? 80,
+          modeRunner: input.modeRunner,
+          totalCases: input.cases.length,
+          verbose: input.verbose ?? false,
+          onProgress: input.onProgress,
+        }),
+      };
+    }
+  }
+
+  await Promise.all(
+    Array.from({ length: Math.min(concurrency, input.cases.length) }, () => worker())
+  );
+
+  return results;
+}
+
+async function runCaseAttempts<TInitial, TExpected, TActual>(input: {
+  caseIndex: number;
+  evalCase: EvalCase;
+  runs: number;
+  judgeModel: string;
+  judgeThreshold: number;
+  modeRunner: ModeRunner<TInitial, TExpected, TActual>;
+  totalCases: number;
+  verbose: boolean;
+  onProgress?: (event: FrontendBenchmarkProgressEvent) => void;
+}): Promise<BenchmarkAttemptResult[]> {
+  const attempts: BenchmarkAttemptResult[] = [];
+  const surface = input.modeRunner.mode === "cli" ? null : input.modeRunner.mode;
+
+  for (let attempt = 1; attempt <= input.runs; attempt += 1) {
+    if (surface) {
+      input.onProgress?.({
+        type: "attempt-start",
+        surface,
+        caseId: input.evalCase.id,
+        caseNumber: input.caseIndex + 1,
+        totalCases: input.totalCases,
+        attempt,
+        runs: input.runs,
+      });
+    }
+
+    const startedAt = Date.now();
+
+    try {
+      const initial = await input.modeRunner.loadInitial(input.evalCase.initialPath);
+      const expected = await input.modeRunner.loadExpected(input.evalCase.expectedPath);
+      const run = await input.modeRunner.run(input.evalCase.prompt, initial, {
+        caseId: input.evalCase.id,
+        caseNumber: input.caseIndex + 1,
+        totalCases: input.totalCases,
+        attempt,
+        runs: input.runs,
+        verbose: input.verbose,
+        onAssistantMessageStart: input.verbose && surface
+          ? () =>
+              input.onProgress?.({
+                type: "assistant-message-start",
+                surface,
+                caseId: input.evalCase.id,
+                caseNumber: input.caseIndex + 1,
+                totalCases: input.totalCases,
+                attempt,
+                runs: input.runs,
+              })
+          : undefined,
+        onAssistantChunk: input.verbose && surface
+          ? (chunk: string) =>
+              input.onProgress?.({
+                type: "assistant-chunk",
+                surface,
+                caseId: input.evalCase.id,
+                caseNumber: input.caseIndex + 1,
+                totalCases: input.totalCases,
+                attempt,
+                runs: input.runs,
+                chunk,
+              })
+          : undefined,
+        onAssistantMessageEnd: input.verbose && surface
+          ? () =>
+              input.onProgress?.({
+                type: "assistant-message-end",
+                surface,
+                caseId: input.evalCase.id,
+                caseNumber: input.caseIndex + 1,
+                totalCases: input.totalCases,
+                attempt,
+                runs: input.runs,
+              })
+          : undefined,
+      });
+      const checks: BenchmarkCheck[] = [
+        buildCheck("run succeeded", run.success, run.error),
+        ...input.modeRunner.validate({
+          evalCase: input.evalCase,
+          prompt: input.evalCase.prompt,
+          initial,
+          expected,
+          actual: run.actual,
+          run,
+        }),
+      ];
+
+      let judgeScore: number | null = null;
+      let judgeSummary: string | null = null;
+
+      if (run.success) {
+        const judge = await judgeOutput({
+          mode: input.modeRunner.mode,
+          prompt: input.evalCase.prompt,
+          checklist: input.evalCase.judgeChecklist,
+          initial,
+          expected: input.modeRunner.mode === "cli" ? undefined : expected,
+          actual: run.actual,
+          model: input.judgeModel,
+        });
+
+        judgeScore = judge.success ? judge.score : null;
+        judgeSummary = judge.summary;
+        checks.push(buildCheck("judge succeeded", judge.success, judge.error));
+        checks.push(
+          buildCheck(
+            `judge score >= ${input.judgeThreshold}`,
+            (judgeScore ?? 0) >= input.judgeThreshold,
+            judge.success ? `score=${judgeScore}` : judge.error
+          )
+        );
+      }
+
+      const artifactFiles = input.modeRunner.buildArtifacts?.(run.actual) ?? [];
+      const attemptResult: BenchmarkAttemptResult = {
+        attempt,
+        passed: checks.every((check) => check.passed),
+        durationMs: Date.now() - startedAt,
+        assistantMessageCount: run.assistantMessageCount,
+        toolCallCount: run.toolCallCount,
+        toolsUsed: uniqueStrings(run.toolsUsed),
+        skillsInvoked: uniqueStrings(run.skillsInvoked),
+        checks,
+        judgeScore,
+        judgeSummary,
+        error: run.error ?? null,
+        tokenUsage: run.tokenUsage ?? null,
+        artifactsPath: null,
+        artifactFiles,
+      };
+
+      if (surface) {
+        input.onProgress?.({
+          type: "attempt-finish",
+          surface,
+          caseId: input.evalCase.id,
+          caseNumber: input.caseIndex + 1,
+          totalCases: input.totalCases,
+          attempt,
+          runs: input.runs,
+          passed: attemptResult.passed,
+          durationMs: attemptResult.durationMs,
+          judgeScore: attemptResult.judgeScore,
+          error: attemptResult.error,
+        });
+      }
+
+      attempts.push(attemptResult);
+    } catch (error) {
+      const message = error instanceof Error ? error.message : String(error);
+      const failedAttempt: BenchmarkAttemptResult = {
+        attempt,
+        passed: false,
+        durationMs: Date.now() - startedAt,
+        assistantMessageCount: 0,
+        toolCallCount: 0,
+        toolsUsed: [],
+        skillsInvoked: [],
+        checks: [buildCheck("run crashed", false, message)],
+        judgeScore: null,
+        judgeSummary: null,
+        error: message,
+        tokenUsage: null,
+      };
+      if (surface) {
+        input.onProgress?.({
+          type: "attempt-finish",
+          surface,
+          caseId: input.evalCase.id,
+          caseNumber: input.caseIndex + 1,
+          totalCases: input.totalCases,
+          attempt,
+          runs: input.runs,
+          passed: false,
+          durationMs: failedAttempt.durationMs,
+          judgeScore: null,
+          error: message,
+        });
+      }
+      attempts.push(failedAttempt);
+    }
+  }
+
+  return attempts;
+}
+
+function buildCheck(name: string, passed: boolean, details?: string): BenchmarkCheck {
+  return details ? { name, passed, details } : { name, passed };
+}
+
+function uniqueStrings(values: string[]): string[] {
+  return [...new Set(values)];
+}
--- a/ai_evals/core/types.ts
+++ b/ai_evals/core/types.ts
@@ -0,0 +1,198 @@
+export const EVAL_MODES = ["cli", "flow", "script", "app"] as const;
+
+export type EvalMode = (typeof EVAL_MODES)[number];
+
+export interface FlowValidationSpec {
+  schemaRequiredPaths?: string[];
+  schemaAnyOf?: Array<{
+    requiredPaths: string[];
+  }>;
+  resolveResultsRefs?: boolean;
+  requireSpecialModules?: Array<"preprocessor_module" | "failure_module">;
+  requireSuspendSteps?: Array<{
+    id: string;
+    requiredEvents?: number;
+    resumeRequiredStringFieldAnyOf?: string[];
+  }>;
+}
+
+export interface EvalCase {
+  id: string;
+  prompt: string;
+  initialPath?: string;
+  expectedPath?: string;
+  validate?: FlowValidationSpec;
+  judgeChecklist?: string[];
+}
+
+export interface BenchmarkCheck {
+  name: string;
+  passed: boolean;
+  details?: string;
+}
+
+export interface JudgeResult {
+  success: boolean;
+  score: number;
+  summary: string;
+  error?: string;
+}
+
+export interface BenchmarkArtifactFile {
+  path: string;
+  content: string;
+}
+
+export interface BenchmarkTokenUsage {
+  prompt: number;
+  completion: number;
+  total: number;
+}
+
+export interface ModeRunOutput<TActual> {
+  success: boolean;
+  actual: TActual;
+  error?: string;
+  assistantMessageCount: number;
+  toolCallCount: number;
+  toolsUsed: string[];
+  skillsInvoked: string[];
+  tokenUsage?: BenchmarkTokenUsage | null;
+}
+
+export interface ModeRunContext {
+  caseId: string;
+  caseNumber: number;
+  totalCases: number;
+  attempt: number;
+  runs: number;
+  verbose: boolean;
+  onAssistantMessageStart?: () => void;
+  onAssistantChunk?: (chunk: string) => void;
+  onAssistantMessageEnd?: () => void;
+}
+
+export interface ModeRunner<TInitial, TExpected, TActual> {
+  mode: EvalMode;
+  concurrency: number;
+  judgeThreshold?: number;
+  loadInitial(path?: string): Promise<TInitial | undefined>;
+  loadExpected(path?: string): Promise<TExpected | undefined>;
+  run(
+    prompt: string,
+    initial: TInitial | undefined,
+    context: ModeRunContext
+  ): Promise<ModeRunOutput<TActual>>;
+  validate(input: {
+    evalCase: EvalCase;
+    prompt: string;
+    initial: TInitial | undefined;
+    expected: TExpected | undefined;
+    actual: TActual;
+    run: ModeRunOutput<TActual>;
+  }): BenchmarkCheck[];
+  buildArtifacts?(actual: TActual): BenchmarkArtifactFile[];
+}
+
+export interface BenchmarkAttemptResult {
+  attempt: number;
+  passed: boolean;
+  durationMs: number;
+  assistantMessageCount: number;
+  toolCallCount: number;
+  toolsUsed: string[];
+  skillsInvoked: string[];
+  checks: BenchmarkCheck[];
+  judgeScore: number | null;
+  judgeSummary: string | null;
+  error: string | null;
+  tokenUsage?: BenchmarkTokenUsage | null;
+  artifactsPath?: string | null;
+  artifactFiles?: BenchmarkArtifactFile[];
+}
+
+export interface BenchmarkCaseResult {
+  id: string;
+  prompt: string;
+  initialPath?: string;
+  expectedPath?: string;
+  attempts: BenchmarkAttemptResult[];
+}
+
+export interface BenchmarkRunResult {
+  version: 1;
+  mode: EvalMode;
+  createdAt: string;
+  gitSha: string | null;
+  runs: number;
+  runModel: string | null;
+  judgeModel: string | null;
+  caseCount: number;
+  attemptCount: number;
+  passedAttempts: number;
+  passRate: number;
+  averageDurationMs: number;
+  totalTokenUsage?: BenchmarkTokenUsage | null;
+  averageTokenUsagePerAttempt?: BenchmarkTokenUsage | null;
+  artifactsPath?: string | null;
+  cases: BenchmarkCaseResult[];
+}
+
+export type FrontendBenchmarkProgressEvent =
+  | {
+      type: "run-start";
+      surface: Exclude<EvalMode, "cli">;
+      totalCases: number;
+      runs: number;
+      concurrency: number;
+    }
+  | {
+      type: "attempt-start";
+      surface: Exclude<EvalMode, "cli">;
+      caseId: string;
+      caseNumber: number;
+      totalCases: number;
+      attempt: number;
+      runs: number;
+    }
+  | {
+      type: "attempt-finish";
+      surface: Exclude<EvalMode, "cli">;
+      caseId: string;
+      caseNumber: number;
+      totalCases: number;
+      attempt: number;
+      runs: number;
+      passed: boolean;
+      durationMs: number;
+      judgeScore: number | null;
+      error: string | null;
+    }
+  | {
+      type: "assistant-message-start";
+      surface: Exclude<EvalMode, "cli">;
+      caseId: string;
+      caseNumber: number;
+      totalCases: number;
+      attempt: number;
+      runs: number;
+    }
+  | {
+      type: "assistant-chunk";
+      surface: Exclude<EvalMode, "cli">;
+      caseId: string;
+      caseNumber: number;
+      totalCases: number;
+      attempt: number;
+      runs: number;
+      chunk: string;
+    }
+  | {
+      type: "assistant-message-end";
+      surface: Exclude<EvalMode, "cli">;
+      caseId: string;
+      caseNumber: number;
+      totalCases: number;
+      attempt: number;
+      runs: number;
+    };
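
The aggregate fields on `BenchmarkRunResult` (`attemptCount`, `passedAttempts`, `passRate`, `averageDurationMs`) can be illustrated with a small sketch. The diff does not show how the runner computes them, so the formulas here are assumptions: `passRate` is taken to be `passedAttempts / attemptCount`, and an empty run is assumed to yield 0. `AttemptSummary`, `RunTotals`, and `summarizeAttempts` are hypothetical names.

```typescript
// Hypothetical, trimmed-down view of BenchmarkAttemptResult for this sketch.
interface AttemptSummary {
  passed: boolean;
  durationMs: number;
}

// Hypothetical subset of the aggregate fields on BenchmarkRunResult.
interface RunTotals {
  attemptCount: number;
  passedAttempts: number;
  passRate: number;
  averageDurationMs: number;
}

// Assumption: passRate = passedAttempts / attemptCount, and an empty run yields 0.
function summarizeAttempts(attempts: AttemptSummary[]): RunTotals {
  const attemptCount = attempts.length;
  const passedAttempts = attempts.filter((attempt) => attempt.passed).length;
  const totalDurationMs = attempts.reduce((sum, attempt) => sum + attempt.durationMs, 0);
  return {
    attemptCount,
    passedAttempts,
    passRate: attemptCount === 0 ? 0 : passedAttempts / attemptCount,
    averageDurationMs: attemptCount === 0 ? 0 : totalDurationMs / attemptCount,
  };
}

const totals = summarizeAttempts([
  { passed: true, durationMs: 1200 },
  { passed: false, durationMs: 800 },
]);
console.log(totals);
```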
--- a/ai_evals/core/validators.test.ts
+++ b/ai_evals/core/validators.test.ts
@@ -0,0 +1,36 @@
+import { describe, expect, it } from "bun:test";
+import { validateScriptState } from "./validators";
+
+describe("validateScriptState", () => {
+  it("accepts semantically equivalent script implementations", () => {
+    const checks = validateScriptState({
+      actual: {
+        path: "f/evals/greet_user.ts",
+        lang: "bun",
+        code: "export async function main(name: string): Promise<string> {\n  return `Hello, ${name}!`;\n}\n",
+      },
+      expected: {
+        path: "f/evals/greet_user.ts",
+        lang: "bun",
+        code: "export async function main(name: string) {\n\treturn `Hello, ${name}!`\n}\n",
+      },
+    });
+
+    expect(checks.every((check) => check.passed)).toBe(true);
+  });
+
+  it("still requires an exported main entrypoint", () => {
+    const checks = validateScriptState({
+      actual: {
+        path: "f/evals/greet_user.ts",
+        lang: "bun",
+        code: "async function main(name: string) {\n  return `Hello, ${name}!`;\n}\n",
+      },
+    });
+
+    expect(checks).toContainEqual({
+      name: "script exports entrypoint",
+      passed: false,
+    });
+  });
+});
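
The second test above relies on the entrypoint check rejecting an unexported `main`. As a rough illustration, the sketch below copies the two regular expressions from `hasSupportedEntrypoint` in `validators.ts` and runs them against a few sample inputs. The sample strings are made up for this sketch.

```typescript
// Same two patterns as hasSupportedEntrypoint in validators.ts.
function hasSupportedEntrypoint(code: string): boolean {
  return (
    /export\s+(async\s+)?function\s+main\s*\(/.test(code) ||
    /export\s+default\s+(async\s+)?function\s*\(/.test(code)
  );
}

// Hypothetical sample inputs.
const exportedMain = "export async function main(name: string) { return name; }";
const anonymousDefault = "export default async function (name: string) { return name; }";
const unexportedMain = "async function main(name: string) { return name; }";
const namedDefault = "export default function main(name: string) { return name; }";

const entrypointResults = {
  exportedMain: hasSupportedEntrypoint(exportedMain),
  anonymousDefault: hasSupportedEntrypoint(anonymousDefault),
  unexportedMain: hasSupportedEntrypoint(unexportedMain),
  namedDefault: hasSupportedEntrypoint(namedDefault),
};
console.log(entrypointResults);
```

Because the second pattern expects `(` directly after `function`, a named default export is not recognized, so the check is stricter than "any exported function".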
--- a/ai_evals/core/validators.ts
+++ b/ai_evals/core/validators.ts
@@ -0,0 +1,997 @@
+import path from "node:path";
+import ts from "typescript";
+import type { BenchmarkCheck, FlowValidationSpec } from "./types";
+
+export interface ScriptState {
+  path: string;
+  lang: string;
+  args?: Record<string, unknown>;
+  code: string;
+}
+
+export interface FlowState {
+  summary?: string;
+  value?: {
+    preprocessor_module?: Record<string, unknown>;
+    failure_module?: Record<string, unknown>;
+    modules?: Array<Record<string, unknown>>;
+    [key: string]: unknown;
+  };
+  schema?: Record<string, unknown>;
+}
+
+export interface AppFilesState {
+  frontend: Record<string, string>;
+  backend: Record<string, AppRunnableState>;
+}
+
+export interface AppRunnableState {
+  type?: string;
+  name?: string;
+  path?: string;
+  inlineScript?: {
+    language?: string;
+    content?: string;
+  };
+}
+
+const TS_LIKE_LANGUAGES = new Set(["bun", "deno", "nativets", "bunnative", "ts", "typescript"]);
+const CONTROL_FLOW_MODULE_TYPES = new Set(["branchone", "branchall", "forloopflow", "whileloopflow"]);
+
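+// Structural checks for a generated script: entrypoint, syntax (TS-like languages only),
+// path/language against `expected`, and a changed-from-`initial` check.
+// `expected.code` is not compared against `actual.code`.
+export function validateScriptState(input: {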
+  actual: ScriptState;
+  initial?: ScriptState;
+  expected?: ScriptState;
+}): BenchmarkCheck[] {
+  const checks: BenchmarkCheck[] = [
+    check("script exports entrypoint", hasSupportedEntrypoint(input.actual.code)),
+    check("script has no syntax errors", getScriptSyntaxErrors(input.actual.code, input.actual.lang).length === 0),
+  ];
+
+  if (input.expected) {
+    checks.push(
+      check(
+        "script path matches expected",
+        input.actual.path === input.expected.path,
+        `expected ${input.expected.path}, got ${input.actual.path}`
+      )
+    );
+    checks.push(
+      check(
+        "script language matches expected",
+        input.actual.lang === input.expected.lang,
+        `expected ${input.expected.lang}, got ${input.actual.lang}`
+      )
+    );
+  }
+
+  if (input.initial) {
+    checks.push(
+      check(
+        "script differs from initial",
+        normalizeText(input.actual.code) !== normalizeText(input.initial.code)
+      )
+    );
+  }
+
+  return checks;
+}
+
+export function validateFlowState(input: {
+  actual: FlowState;
+  initial?: FlowState;
+  expected?: FlowState;
+  validate?: FlowValidationSpec;
+}): BenchmarkCheck[] {
+  const actualModules = getFlowModules(input.actual);
+  const placeholderModuleIds = getInlineScriptPlaceholderModuleIds(input.actual);
+  const checks: BenchmarkCheck[] = [
+    check("flow has modules", actualModules.length > 0),
+    check(
+      "flow has no inline placeholder code",
+      placeholderModuleIds.length === 0,
+      placeholderModuleIds.length > 0
+        ? `placeholder content in: ${placeholderModuleIds.join(", ")}`
+        : undefined
+    ),
+  ];
+
+  if (input.initial) {
+    checks.push(
+      check(
+        "flow differs from initial",
+        normalizeJson(input.actual) !== normalizeJson(input.initial)
+      )
+    );
+  }
+
+  if (input.expected) {
+    checks.push(...validateFlowExpectedStructure(input.actual, input.expected));
+  }
+
+  if (input.validate) {
+    checks.push(...validateFlowRequirements(input.actual, input.validate));
+  }
+
+  return checks;
+}
+
+export function validateAppState(input: {
+  actual: AppFilesState;
+  initial?: AppFilesState;
+  expected?: AppFilesState;
+}): BenchmarkCheck[] {
+  const checks: BenchmarkCheck[] = [];
+  const frontendEntries = Object.entries(input.actual.frontend ?? {});
+  const backendEntries = Object.entries(input.actual.backend ?? {});
+  const frontendSyntaxProblems = getAppFrontendSyntaxProblems(input.actual.frontend);
+  const backendSyntaxProblems = getAppBackendSyntaxProblems(input.actual.backend);
+  const unresolvedBackendRefs = getUnresolvedBackendReferences(
+    input.actual.frontend,
+    input.actual.backend
+  );
+
+  checks.push(check("app has frontend entrypoint", Boolean(input.actual.frontend["/index.tsx"])));
+  checks.push(
+    check(
+      "app has non-empty frontend files",
+      frontendEntries.some(([, content]) => content.trim().length > 0)
+    )
+  );
+  checks.push(
+    check(
+      "frontend files have no syntax errors",
+      frontendSyntaxProblems.length === 0,
+      summarizeProblems(frontendSyntaxProblems)
+    )
+  );
+  checks.push(
+    check(
+      "backend inline scripts have entrypoints",
+      backendEntries.every(([, runnable]) => {
+        if (runnable.type !== "inline") {
+          return true;
+        }
+        return hasSupportedEntrypoint(runnable.inlineScript?.content ?? "");
+      })
+    )
+  );
+  checks.push(
+    check(
+      "backend inline scripts have no syntax errors",
+      backendSyntaxProblems.length === 0,
+      summarizeProblems(backendSyntaxProblems)
+    )
+  );
+  checks.push(
+    check(
+      "frontend backend references resolve",
+      unresolvedBackendRefs.length === 0,
+      summarizeProblems(unresolvedBackendRefs)
+    )
+  );
+
+  if (input.initial) {
+    checks.push(check("app differs from initial", !appStatesEqual(input.actual, input.initial)));
+  }
+
+  if (input.expected) {
+    for (const [filePath, content] of Object.entries(input.expected.frontend)) {
+      checks.push(
+        check(
+          `frontend includes ${filePath}`,
+          normalizeText(input.actual.frontend[filePath] ?? "") === normalizeText(content)
+        )
+      );
+    }
+    for (const [runnableName, runnable] of Object.entries(input.expected.backend)) {
+      const actualRunnable = input.actual.backend[runnableName];
+      checks.push(check(`backend includes ${runnableName}`, Boolean(actualRunnable)));
+      if (actualRunnable && runnable.inlineScript?.content) {
+        checks.push(
+          check(
+            `${runnableName} code matches expected`,
+            normalizeText(actualRunnable.inlineScript?.content ?? "") ===
+              normalizeText(runnable.inlineScript.content)
+          )
+        );
+      }
+    }
+  }
+
+  return checks;
+}
+
+export function validateCliWorkspace(input: {
+  actualFiles: Record<string, string>;
+  expectedFiles?: Record<string, string>;
+  initialFiles?: Record<string, string>;
+}): BenchmarkCheck[] {
+  const checks: BenchmarkCheck[] = [];
+
+  if (input.expectedFiles) {
+    for (const [filePath, expectedContent] of Object.entries(input.expectedFiles)) {
+      const actualContent = input.actualFiles[filePath];
+      checks.push(check(`creates ${filePath}`, actualContent !== undefined));
+      if (actualContent !== undefined) {
+        checks.push(
+          check(
+            `${filePath} contains expected content`,
+            cliFileContainsExpectedContent(actualContent, expectedContent)
+          )
+        );
+      }
+    }
+
+    const expectedPaths = new Set(Object.keys(input.expectedFiles));
+    const unexpectedPaths = Object.keys(input.actualFiles).filter((filePath) => !expectedPaths.has(filePath));
+    checks.push(
+      check(
+        "workspace contains no unexpected files",
+        unexpectedPaths.length === 0,
+        summarizeProblems(unexpectedPaths)
+      )
+    );
+  }
+
+  if (input.initialFiles) {
+    checks.push(check("workspace differs from initial", !fileMapsEqual(input.actualFiles, input.initialFiles)));
+  }
+
+  return checks;
+}
+
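+// Each non-empty, trimmed line of `expectedContent` is treated as a snippet that must appear
+// somewhere in `actualContent`; snippet order is not enforced.
+function cliFileContainsExpectedContent(actualContent: string, expectedContent: string): boolean {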
+  const expectedSnippets = expectedContent
+    .replace(/\r\n/g, "\n")
+    .split("\n")
+    .map((line) => line.trim())
+    .filter((line) => line.length > 0);
+
+  if (expectedSnippets.length === 0) {
+    return true;
+  }
+
+  const normalizedActual = actualContent.replace(/\r\n/g, "\n");
+
+  return expectedSnippets.every((snippet) => normalizedActual.includes(snippet));
+}
+
+function check(name: string, passed: boolean, details?: string): BenchmarkCheck {
+  return !passed && details ? { name, passed, details } : { name, passed };
+}
+
+function normalizeText(value: string): string {
+  return value.replace(/\r\n/g, "\n").trim();
+}
+
+// Serializes with JSON.stringify, so the comparison is sensitive to object key order.
+function normalizeJson(value: unknown): string {
+  return JSON.stringify(value);
+}
+
+function summarizeProblems(problems: string[], limit = 5): string | undefined {
+  if (problems.length === 0) {
+    return undefined;
+  }
+
+  if (problems.length <= limit) {
+    return problems.join("; ");
+  }
+
+  return `${problems.slice(0, limit).join("; ")}; ...and ${problems.length - limit} more`;
+}
+
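+// Matches `export [async] function main(` or an anonymous `export default [async] function (`.
+// A named default export such as `export default function main(` is not matched.
+function hasSupportedEntrypoint(code: string): boolean {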
+  return (
+    /export\s+(async\s+)?function\s+main\s*\(/.test(code) ||
+    /export\s+default\s+(async\s+)?function\s*\(/.test(code)
+  );
+}
+
+function getScriptSyntaxErrors(code: string, lang: string): string[] {
+  if (!TS_LIKE_LANGUAGES.has(lang)) {
+    return [];
+  }
+
+  return getTypeScriptSyntaxErrors(code, "eval.ts");
+}
+
+function getTypeScriptSyntaxErrors(code: string, fileName: string): string[] {
+  const result = ts.transpileModule(code, {
+    compilerOptions: {
+      target: ts.ScriptTarget.ES2022,
+      module: ts.ModuleKind.ESNext,
+      jsx: ts.JsxEmit.ReactJSX,
+    },
+    reportDiagnostics: true,
+    fileName,
+  });
+
+  return (result.diagnostics ?? []).map((diagnostic) =>
+    ts.flattenDiagnosticMessageText(diagnostic.messageText, "\n")
+  );
+}
+
+function getAppFrontendSyntaxProblems(frontend: Record<string, string>): string[] {
+  const problems: string[] = [];
+
+  for (const [filePath, content] of Object.entries(frontend)) {
+    if (!isFrontendCodeFile(filePath)) {
+      continue;
+    }
+
+    const errors = getTypeScriptSyntaxErrors(content, filePath);
+    for (const error of errors) {
+      problems.push(`${filePath}: ${error}`);
+    }
+  }
+
+  return problems;
+}
+
+function getAppBackendSyntaxProblems(backend: Record<string, AppRunnableState>): string[] {
+  const problems: string[] = [];
+
+  for (const [key, runnable] of Object.entries(backend)) {
+    if (runnable.type !== "inline") {
+      continue;
+    }
+
+    const language = runnable.inlineScript?.language ?? "";
+    const content = runnable.inlineScript?.content ?? "";
+    for (const error of getScriptSyntaxErrors(content, language)) {
+      problems.push(`${key}: ${error}`);
+    }
+  }
+
+  return problems;
+}
+
+function isFrontendCodeFile(filePath: string): boolean {
+  const extension = path.extname(filePath).toLowerCase();
+  return extension === ".ts" || extension === ".tsx" || extension === ".js" || extension === ".jsx";
+}
+
+function getUnresolvedBackendReferences(
+  frontend: Record<string, string>,
+  backend: Record<string, AppRunnableState>
+): string[] {
+  const backendKeys = new Set(Object.keys(backend));
+  const unresolved = new Set<string>();
+
+  for (const [filePath, content] of Object.entries(frontend)) {
+    for (const key of extractBackendCallKeys(content)) {
+      if (!backendKeys.has(key)) {
+        unresolved.add(`${filePath} references missing backend.${key}()`);
+      }
+    }
+  }
+
+  return [...unresolved];
+}
+
+function extractBackendCallKeys(content: string): string[] {
+  const matches = content.matchAll(/\bbackend\.([A-Za-z_][A-Za-z0-9_]*)\s*\(/g);
+  return [...new Set([...matches].map((match) => match[1]))];
+}
+
+function getFlowModules(flow: FlowState): Array<Record<string, unknown>> {
+  return Array.isArray(flow.value?.modules) ? flow.value.modules : [];
+}
+
+function validateFlowExpectedStructure(
+  actual: FlowState,
+  expected: FlowState
+): BenchmarkCheck[] {
+  const checks: BenchmarkCheck[] = [];
+  const expectedTopLevelModules = getFlowModules(expected);
+  const actualTopLevelModules = getFlowModules(actual);
+
+  const expectedSchemaFields = getTopLevelSchemaFields(expected.schema);
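+  if (expectedSchemaFields.length > 0) {
+    // Report only the fields that are actually missing, not the full expected list.
+    const missingSchemaFields = expectedSchemaFields.filter(
+      (field) => !hasSchemaPath(actual.schema, field)
+    );
+    checks.push(
+      check(
+        "flow schema includes expected top-level fields",
+        missingSchemaFields.length === 0,
+        `missing: ${missingSchemaFields.join(", ")}`
+      )
+    );
+  }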
+
+  if (expectedTopLevelModules.length > 0) {
+    const actualIds = actualTopLevelModules
+      .map((module) => (typeof module.id === "string" ? module.id : null))
+      .filter((id): id is string => Boolean(id));
+    const expectedIds = expectedTopLevelModules
+      .map((module) => (typeof module.id === "string" ? module.id : null))
+      .filter((id): id is string => Boolean(id));
+
+    checks.push(
+      check(
+        "flow includes expected top-level step ids",
+        expectedIds.every((id) => actualIds.includes(id)),
+        `expected ids: ${expectedIds.join(", ")}; actual ids: ${actualIds.join(", ")}`
+      )
+    );
+
+    checks.push(
+      check(
+        "flow preserves expected top-level step order",
+        preservesRelativeOrder(actualIds, expectedIds),
+        `expected order: ${expectedIds.join(" -> ")}; actual ids: ${actualIds.join(" -> ")}`
+      )
+    );
+
+    for (const expectedModule of expectedTopLevelModules) {
+      const moduleId = typeof expectedModule.id === "string" ? expectedModule.id : null;
+      if (!moduleId) {
+        continue;
+      }
+
+      const actualModule = actualTopLevelModules.find((module) => module.id === moduleId);
+      if (!actualModule) {
+        continue;
+      }
+
+      const expectedType = getModuleType(expectedModule);
+      if (expectedType && !(hasSuspendConfig(expectedModule) || hasSuspendConfig(actualModule))) {
+        checks.push(
+          check(
+            `${moduleId} type matches expected`,
+            getModuleType(actualModule) === expectedType,
+            `expected ${expectedType}, got ${getModuleType(actualModule) ?? "(missing)"}`
+          )
+        );
+      }
+
+      const expectedPath = getModulePath(expectedModule);
+      if (expectedPath) {
+        checks.push(
+          check(
+            `${moduleId} path matches expected`,
+            getModulePath(actualModule) === expectedPath,
+            `expected ${expectedPath}, got ${getModulePath(actualModule) ?? "(missing)"}`
+          )
+        );
+      }
+    }
+  }
+
+  for (const specialModuleKey of ["preprocessor_module", "failure_module"] as const) {
+    const expectedSpecialModule = getSpecialFlowModule(expected, specialModuleKey);
+    if (!expectedSpecialModule) {
+      continue;
+    }
+
+    const actualSpecialModule = getSpecialFlowModule(actual, specialModuleKey);
+    checks.push(check(`${specialModuleKey} matches expected presence`, Boolean(actualSpecialModule)));
+
+    if (!actualSpecialModule) {
+      continue;
+    }
+
+    const expectedType = getModuleType(expectedSpecialModule);
+    if (expectedType) {
+      checks.push(
+        check(
+          `${specialModuleKey} type matches expected`,
+          getModuleType(actualSpecialModule) === expectedType,
+          `expected ${expectedType}, got ${getModuleType(actualSpecialModule) ?? "(missing)"}`
+        )
+      );
+    }
+  }
+
+  return checks;
+}
+
+function validateFlowRequirements(
+  flow: FlowState,
+  validate: FlowValidationSpec
+): BenchmarkCheck[] {
+  const checks: BenchmarkCheck[] = [];
+
+  for (const requiredPath of validate.schemaRequiredPaths ?? []) {
+    checks.push(
+      check(
+        `schema includes ${requiredPath}`,
+        hasSchemaPath(flow.schema, requiredPath),
+        `missing schema path ${requiredPath}`
+      )
+    );
+  }
+
+  if (validate.schemaAnyOf && validate.schemaAnyOf.length > 0) {
+    const matchingVariant = validate.schemaAnyOf.find((variant) =>
+      variant.requiredPaths.every((requiredPath) => hasSchemaPath(flow.schema, requiredPath))
+    );
+
+    checks.push(
+      check(
+        "schema matches one accepted input shape",
+        Boolean(matchingVariant),
+        matchingVariant
+          ? undefined
+          : `expected one of: ${validate.schemaAnyOf
+              .map((variant) => `[${variant.requiredPaths.join(", ")}]`)
+              .join(" or ")}`
+      )
+    );
+  }
+
+  if (validate.resolveResultsRefs) {
+    const unresolved = collectUnresolvedResultsRefs(flow);
+    checks.push(
+      check(
+        "results references resolve",
+        unresolved.length === 0,
+        unresolved.length > 0 ? unresolved.join("; ") : undefined
+      )
+    );
+  }
+
+  for (const specialModule of validate.requireSpecialModules ?? []) {
+    checks.push(
+      check(
+        `${specialModule} exists`,
+        Boolean(getSpecialFlowModule(flow, specialModule))
+      )
+    );
+  }
+
+  for (const suspendStep of validate.requireSuspendSteps ?? []) {
+    const module = findFlowModuleById(flow, suspendStep.id);
+    checks.push(check(`${suspendStep.id} step exists`, Boolean(module)));
+    if (!module) {
+      continue;
+    }
+
+    checks.push(check(`${suspendStep.id} includes suspend config`, hasSuspendConfig(module)));
+    if (!hasSuspendConfig(module)) {
+      continue;
+    }
+
+    if (suspendStep.requiredEvents !== undefined) {
+      checks.push(
+        check(
+          `${suspendStep.id} requires ${suspendStep.requiredEvents} approval event${suspendStep.requiredEvents === 1 ? "" : "s"}`,
+          getSuspendRequiredEvents(module) === suspendStep.requiredEvents,
+          `expected ${suspendStep.requiredEvents}, got ${getSuspendRequiredEvents(module) ?? "(missing)"}`
+        )
+      );
+    }
+
+    if (
+      suspendStep.resumeRequiredStringFieldAnyOf &&
+      suspendStep.resumeRequiredStringFieldAnyOf.length > 0
+    ) {
+      const stringFields = getSuspendResumeStringFields(module);
+      checks.push(
+        check(
+          `${suspendStep.id} resume form includes one accepted comment field`,
+          suspendStep.resumeRequiredStringFieldAnyOf.some((field) =>
+            stringFields.includes(field)
+          ),
+          `expected one of [${suspendStep.resumeRequiredStringFieldAnyOf.join(", ")}], got [${stringFields.join(", ")}]`
+        )
+      );
+    }
+  }
+
+  return checks;
+}
+
+function hasSchemaPath(schema: Record<string, unknown> | undefined, dottedPath: string): boolean {
+  if (!schema || typeof schema !== "object") {
+    return false;
+  }
+
+  const segments = dottedPath.split(".").filter(Boolean);
+  if (segments.length === 0) {
+    return false;
+  }
+
+  let current: Record<string, unknown> | undefined = schema;
+  for (const segment of segments) {
+    const properties = current?.properties;
+    if (!properties || typeof properties !== "object") {
+      return false;
+    }
+
+    const next = (properties as Record<string, unknown>)[segment];
+    if (!next || typeof next !== "object") {
+      return false;
+    }
+    current = next as Record<string, unknown>;
+  }
+
+  return true;
+}
+
+function getTopLevelSchemaFields(schema: Record<string, unknown> | undefined): string[] {
+  if (!schema || typeof schema !== "object") {
+    return [];
+  }
+
+  const properties = schema.properties;
+  if (!properties || typeof properties !== "object") {
+    return [];
+  }
+
+  return Object.keys(properties as Record<string, unknown>).filter((key) => key.length > 0);
+}
+
+// True when `expectedIds` appears in `actualIds` as an in-order subsequence (extra ids allowed).
+function preservesRelativeOrder(actualIds: string[], expectedIds: string[]): boolean {
+  if (expectedIds.length === 0) {
+    return true;
+  }
+
+  let cursor = 0;
+  for (const actualId of actualIds) {
+    if (actualId === expectedIds[cursor]) {
+      cursor += 1;
+      if (cursor === expectedIds.length) {
+        return true;
+      }
+    }
+  }
+
+  return false;
+}
+
+function collectUnresolvedResultsRefs(flow: FlowState): string[] {
+  const unresolved = new Set<string>();
+  validateModuleSequence(getFlowModules(flow), new Map<string, Record<string, unknown>>(), unresolved);
+  return [...unresolved];
+}
+
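+// A module may reference results only from modules that precede it in its own sequence
+// or in an enclosing sequence; sibling branches do not see each other's modules.
+function validateModuleSequence(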
+  modules: Array<Record<string, unknown>>,
+  parentVisibleModules: Map<string, Record<string, unknown>>,
+  unresolved: Set<string>
+): void {
+  const visibleModules = new Map(parentVisibleModules);
+
+  for (const module of modules) {
+    validateResultsRefsInRecord(module, visibleModules, unresolved);
+    validateNestedModuleResultsRefs(module, visibleModules, unresolved);
+
+    if (typeof module.id === "string" && module.id.length > 0) {
+      visibleModules.set(module.id, module);
+    }
+  }
+}
+
+function validateNestedModuleResultsRefs(
+  module: Record<string, unknown>,
+  visibleModules: Map<string, Record<string, unknown>>,
+  unresolved: Set<string>
+): void {
+  const value = isObjectRecord(module.value) ? module.value : null;
+  if (!value) {
+    return;
+  }
+
+  const nestedSequences: Array<Array<Record<string, unknown>>> = [];
+
+  if (Array.isArray(value.modules)) {
+    nestedSequences.push(asModuleArray(value.modules));
+  }
+
+  if (Array.isArray(value.default)) {
+    nestedSequences.push(asModuleArray(value.default));
+  }
+
+  if (Array.isArray(value.branches)) {
+    for (const branch of value.branches) {
+      if (!isObjectRecord(branch)) {
+        continue;
+      }
+      if (typeof branch.expr === "string") {
+        validateResultsRefsInExpression(
+          branch.expr,
+          `branch ${module.id ?? "(unnamed)"}`,
+          visibleModules,
+          unresolved
+        );
+      }
+      if (Array.isArray(branch.modules)) {
+        nestedSequences.push(asModuleArray(branch.modules));
+      }
+    }
+  }
+
+  for (const sequence of nestedSequences) {
+    validateModuleSequence(sequence, visibleModules, unresolved);
+  }
+}
+
+function validateResultsRefsInRecord(
+  value: unknown,
+  visibleModules: Map<string, Record<string, unknown>>,
+  unresolved: Set<string>,
+  context = "expression"
+): void {
+  if (typeof value === "string") {
+    validateResultsRefsInExpression(value, context, visibleModules, unresolved);
+    return;
+  }
+
+  if (Array.isArray(value)) {
+    for (const entry of value) {
+      validateResultsRefsInRecord(entry, visibleModules, unresolved, context);
+    }
+    return;
+  }
+
+  if (!isObjectRecord(value)) {
+    return;
+  }
+
+  for (const [key, entry] of Object.entries(value)) {
+    if (key === "content" || key === "modules" || key === "branches" || key === "default") {
+      continue;
+    }
+    validateResultsRefsInRecord(entry, visibleModules, unresolved, key);
+  }
+}
+
+function validateResultsRefsInExpression(
+  expression: string,
+  context: string,
+  visibleModules: Map<string, Record<string, unknown>>,
+  unresolved: Set<string>
+): void {
+  for (const ref of extractResultsRefs(expression)) {
+    const module = visibleModules.get(ref.root);
+    if (!module) {
+      unresolved.add(`${context} references missing results.${ref.root}`);
+      continue;
+    }
+    validateNestedResultsRefPath(ref.root, ref.path, module, context, unresolved);
+  }
+}
+
+function extractResultsRefs(
+  expression: string
+): Array<{ root: string; path: string[] }> {
+  const matches = expression.matchAll(/\bresults\.([A-Za-z0-9_-]+)((?:\.[A-Za-z0-9_-]+)*)/g);
+  const refs = new Map<string, { root: string; path: string[] }>();
+
+  for (const match of matches) {
+    const root = match[1];
+    const path = match[2]
+      .split(".")
+      .filter(Boolean);
+    const key = `${root}:${path.join(".")}`;
+    refs.set(key, { root, path });
+  }
+
+  return [...refs.values()];
+}
+
+function validateNestedResultsRefPath(
+  rootId: string,
+  path: string[],
+  module: Record<string, unknown>,
+  context: string,
+  unresolved: Set<string>
+): void {
+  if (path.length === 0) {
+    return;
+  }
+
+  const moduleType = getModuleType(module);
+  if (!moduleType || !CONTROL_FLOW_MODULE_TYPES.has(moduleType)) {
+    return;
+  }
+
+  const nestedIds = new Set(getImmediateNestedModuleIds(module));
+  const [firstSegment] = path;
+  if (nestedIds.has(firstSegment)) {
+    unresolved.add(
+      `${context} references nested results.${rootId}.${firstSegment} inside ${moduleType} ${rootId}`
+    );
+  }
+}
+
+function getAllFlowModules(flow: FlowState): Array<Record<string, unknown>> {
+  const modules: Array<Record<string, unknown>> = [];
+  const specialModules = ["preprocessor_module", "failure_module"] as const;
+
+  for (const key of specialModules) {
+    const specialModule = getSpecialFlowModule(flow, key);
+    if (specialModule) {
+      modules.push(specialModule);
+      modules.push(...collectNestedModules(specialModule));
+    }
+  }
+
+  for (const module of getFlowModules(flow)) {
+    modules.push(module);
+    modules.push(...collectNestedModules(module));
+  }
+
+  return modules;
+}
+
+function collectNestedModules(module: Record<string, unknown>): Array<Record<string, unknown>> {
+  const nested: Array<Record<string, unknown>> = [];
+  const value = isObjectRecord(module.value) ? module.value : null;
+  if (!value) {
+    return nested;
+  }
+
+  if (Array.isArray(value.modules)) {
+    for (const child of asModuleArray(value.modules)) {
+      nested.push(child, ...collectNestedModules(child));
+    }
+  }
+
+  if (Array.isArray(value.default)) {
+    for (const child of asModuleArray(value.default)) {
+      nested.push(child, ...collectNestedModules(child));
+    }
+  }
+
+  if (Array.isArray(value.branches)) {
+    for (const branch of value.branches) {
+      if (!isObjectRecord(branch) || !Array.isArray(branch.modules)) {
+        continue;
+      }
+      for (const child of asModuleArray(branch.modules)) {
+        nested.push(child, ...collectNestedModules(child));
+      }
+    }
+  }
+
+  return nested;
+}
+
+function findFlowModuleById(flow: FlowState, id: string): Record<string, unknown> | null {
+  for (const module of getAllFlowModules(flow)) {
+    if (module.id === id) {
+      return module;
+    }
+  }
+  return null;
+}
+
+function getInlineScriptPlaceholderModuleIds(flow: FlowState): string[] {
+  return getAllFlowModules(flow).flatMap((module) => {
+    const code = getModuleCode(module)?.trim();
+    if (!code || !/^inline_script\.[A-Za-z0-9_-]+$/.test(code)) {
+      return [];
+    }
+
+    if (typeof module.id === "string" && module.id.length > 0) {
+      return [module.id];
+    }
+
+    return ["(unnamed)"];
+  });
+}
+
+function getImmediateNestedModuleIds(module: Record<string, unknown>): string[] {
+  const ids: string[] = [];
+  const value = isObjectRecord(module.value) ? module.value : null;
+  if (!value) {
+    return ids;
+  }
+
+  if (Array.isArray(value.modules)) {
+    ids.push(...asModuleArray(value.modules).flatMap((child) => (typeof child.id === "string" ? [child.id] : [])));
+  }
+
+  if (Array.isArray(value.default)) {
+    ids.push(...asModuleArray(value.default).flatMap((child) => (typeof child.id === "string" ? [child.id] : [])));
+  }
+
+  if (Array.isArray(value.branches)) {
+    for (const branch of value.branches) {
+      if (!isObjectRecord(branch) || !Array.isArray(branch.modules)) {
+        continue;
+      }
+      ids.push(
+        ...asModuleArray(branch.modules).flatMap((child) => (typeof child.id === "string" ? [child.id] : []))
+      );
+    }
+  }
+
+  return ids;
+}
+
+function getModuleCode(module: Record<string, unknown>): string | null {
+  const value = isObjectRecord(module.value) ? module.value : null;
+  return typeof value?.content === "string" ? value.content : null;
+}
+
+function asModuleArray(value: unknown[]): Array<Record<string, unknown>> {
+  return value.filter(isObjectRecord);
+}
+
+function isObjectRecord(value: unknown): value is Record<string, any> {
+  return typeof value === "object" && value !== null && !Array.isArray(value);
+}
+
+function getSpecialFlowModule(
+  flow: FlowState,
+  key: "preprocessor_module" | "failure_module"
+): Record<string, unknown> | null {
+  if (!flow.value || typeof flow.value !== "object") {
+    return null;
+  }
+  const module = (flow.value as Record<string, unknown>)[key];
+  return module && typeof module === "object" ? (module as Record<string, unknown>) : null;
+}
+
+function getModuleType(module: Record<string, unknown>): string | null {
+  const value = module.value;
+  if (!value || typeof value !== "object") {
+    return null;
+  }
+  return typeof (value as Record<string, unknown>).type === "string"
+    ? ((value as Record<string, string>).type)
+    : null;
+}
+
+function getModulePath(module: Record<string, unknown>): string | null {
+  const value = module.value;
+  if (!value || typeof value !== "object") {
+    return null;
+  }
+
+  return typeof (value as Record<string, unknown>).path === "string"
+    ? ((value as Record<string, string>).path)
+    : null;
+}
+
+function hasSuspendConfig(module: Record<string, unknown>): boolean {
+  return typeof module.suspend === "object" && module.suspend !== null;
+}
+
+function getSuspendRequiredEvents(module: Record<string, unknown>): number | null {
+  const suspend = isObjectRecord(module.suspend) ? module.suspend : null;
+  return typeof suspend?.required_events === "number" ? suspend.required_events : null;
+}
+
+function getSuspendResumeStringFields(module: Record<string, unknown>): string[] {
+  const suspend = isObjectRecord(module.suspend) ? module.suspend : null;
+  const resumeForm = isObjectRecord(suspend?.resume_form) ? suspend.resume_form : null;
+  const schema = isObjectRecord(resumeForm?.schema) ? resumeForm.schema : null;
+  const properties = isObjectRecord(schema?.properties) ? schema.properties : null;
+  if (!properties) {
+    return [];
+  }
+
+  return Object.entries(properties).flatMap(([field, property]) => {
+    if (!isObjectRecord(property) || property.type !== "string") {
+      return [];
+    }
+    return [field];
+  });
+}
+
+function appStatesEqual(left: AppFilesState, right: AppFilesState): boolean {
+  return fileMapsEqual(left.frontend, right.frontend) && fileMapsEqual(stringifyBackend(left.backend), stringifyBackend(right.backend));
+}
+
+function stringifyBackend(backend: Record<string, AppRunnableState>): Record<string, string> {
+  const result: Record<string, string> = {};
+  for (const [key, value] of Object.entries(backend)) {
+    result[key] = JSON.stringify(value);
+  }
+  return result;
+}
+
+function fileMapsEqual(left: Record<string, string>, right: Record<string, string>): boolean {
+  const leftEntries = Object.entries(left).sort(([a], [b]) => a.localeCompare(b));
+  const rightEntries = Object.entries(right).sort(([a], [b]) => a.localeCompare(b));
+  if (leftEntries.length !== rightEntries.length) {
+    return false;
+  }
+  return leftEntries.every(([key, value], index) => {
+    const [otherKey, otherValue] = rightEntries[index];
+    return key === otherKey && normalizeText(value) === normalizeText(otherValue);
+  });
+}
--- a/ai_evals/fixtures/cli/expected/bun-hello-flow-punctuation/f/evals/hello__flow/flow.yaml
+++ b/ai_evals/fixtures/cli/expected/bun-hello-flow-punctuation/f/evals/hello__flow/flow.yaml
--- a/ai_evals/fixtures/cli/expected/bun-hello-flow-punctuation/f/evals/hello__flow/hello.ts
+++ b/ai_evals/fixtures/cli/expected/bun-hello-flow-punctuation/f/evals/hello__flow/hello.ts
--- a/ai_evals/fixtures/cli/expected/bun-hello-flow/f/evals/hello__flow/flow.yaml
+++ b/ai_evals/fixtures/cli/expected/bun-hello-flow/f/evals/hello__flow/flow.yaml
--- a/ai_evals/fixtures/cli/expected/bun-hello-flow/f/evals/hello__flow/hello.ts
+++ b/ai_evals/fixtures/cli/expected/bun-hello-flow/f/evals/hello__flow/hello.ts
@@ -0,0 +1,2 @@
+main(name: string)
+greeting: `Hello, ${name}!`
--- a/ai_evals/fixtures/cli/expected/bun-hello-script-uppercase/f/evals/hello.ts
+++ b/ai_evals/fixtures/cli/expected/bun-hello-script-uppercase/f/evals/hello.ts
--- a/ai_evals/fixtures/cli/expected/bun-hello-script/f/evals/hello.ts
+++ b/ai_evals/fixtures/cli/expected/bun-hello-script/f/evals/hello.ts
@@ -0,0 +1,3 @@
+export async function main(name: string) {
+  return { greeting: `Hello, ${name}!` };
+}
--- a/ai_evals/fixtures/cli/expected/flow-reuse-existing-script/f/evals/reuse_greeting__flow/flow.yaml
+++ b/ai_evals/fixtures/cli/expected/flow-reuse-existing-script/f/evals/reuse_greeting__flow/flow.yaml
@@ -0,0 +1,2 @@
+type: script
+path: f/lib/format_greeting
--- a/ai_evals/fixtures/cli/expected/flow-reuse-existing-script/f/lib/format_greeting.ts
+++ b/ai_evals/fixtures/cli/expected/flow-reuse-existing-script/f/lib/format_greeting.ts
@@ -0,0 +1,3 @@
+export async function main(name: string) {
+  return { greeting: `Hello, ${name}!` };
+}
--- a/ai_evals/fixtures/cli/expected/python-add-numbers-script/f/evals/add_numbers.py
+++ b/ai_evals/fixtures/cli/expected/python-add-numbers-script/f/evals/add_numbers.py
@@ -0,0 +1,2 @@
+def main(
+return {"total": a + b}
--- a/ai_evals/fixtures/cli/initial/bun-hello-flow-punctuation/f/evals/hello__flow/flow.yaml
+++ b/ai_evals/fixtures/cli/initial/bun-hello-flow-punctuation/f/evals/hello__flow/flow.yaml
@@ -0,0 +1,20 @@
+summary: Simple greeting flow
+schema:
+  type: object
+  properties:
+    name:
+      type: string
+      description: Name to greet
+  required:
+    - name
+value:
+  modules:
+    - id: hello_step
+      value:
+        type: rawscript
+        language: bun
+        content: !inline hello.ts
+        input_transforms:
+          name:
+            type: javascript
+            expr: flow_input.name
--- a/ai_evals/fixtures/cli/initial/bun-hello-flow-punctuation/f/evals/hello__flow/hello.ts
+++ b/ai_evals/fixtures/cli/initial/bun-hello-flow-punctuation/f/evals/hello__flow/hello.ts
@@ -0,0 +1,3 @@
+export async function main(name: string) {
+  return { greeting: `Hello, ${name}!` };
+}
--- a/ai_evals/fixtures/cli/initial/bun-hello-script-uppercase/f/evals/hello.ts
+++ b/ai_evals/fixtures/cli/initial/bun-hello-script-uppercase/f/evals/hello.ts
@@ -0,0 +1,3 @@
+export async function main(name: string) {
+  return { greeting: `Hello, ${name}!` };
+}
--- a/ai_evals/fixtures/cli/initial/flow-reuse-existing-script/f/lib/format_greeting.ts
+++ b/ai_evals/fixtures/cli/initial/flow-reuse-existing-script/f/lib/format_greeting.ts
@@ -0,0 +1,3 @@
+export async function main(name: string) {
+  return { greeting: `Hello, ${name}!` };
+}
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/createFolder/main.ts
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/createFolder/main.ts
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/createFolder/meta.json
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/createFolder/meta.json
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/deleteItem/main.ts
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/deleteItem/main.ts
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/deleteItem/meta.json
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/deleteItem/meta.json
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/listFiles/main.ts
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/listFiles/main.ts
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/listFiles/meta.json
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/listFiles/meta.json
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/listFolders/main.ts
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/listFolders/main.ts
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/listFolders/meta.json
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/listFolders/meta.json
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/moveItem/main.ts
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/moveItem/main.ts
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/moveItem/meta.json
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/moveItem/meta.json
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/renameItem/main.ts
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/renameItem/main.ts
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/renameItem/meta.json
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/renameItem/meta.json
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/frontend/components/Breadcrumb.tsx
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/frontend/components/Breadcrumb.tsx
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/frontend/components/FileItem.tsx
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/frontend/components/FileItem.tsx
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/frontend/components/FileList.tsx
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/frontend/components/FileList.tsx
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/frontend/components/FolderTree.tsx
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/frontend/components/FolderTree.tsx
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/frontend/components/Toolbar.tsx
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/frontend/components/Toolbar.tsx
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/frontend/index.tsx
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/frontend/index.tsx
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/addToCart/main.ts
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/addToCart/main.ts
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/addToCart/meta.json
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/addToCart/meta.json
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/calculateTotal/main.ts
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/calculateTotal/main.ts
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/calculateTotal/meta.json
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/calculateTotal/meta.json
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/getProducts/main.ts
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/getProducts/main.ts
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/getProducts/meta.json
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/getProducts/meta.json
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/removeFromCart/main.ts
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/removeFromCart/main.ts
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/removeFromCart/meta.json
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/removeFromCart/meta.json
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/frontend/components/Cart.tsx
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/frontend/components/Cart.tsx
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/frontend/components/ProductCard.tsx
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/frontend/components/ProductCard.tsx
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/frontend/components/ProductList.tsx
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/frontend/components/ProductList.tsx
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/frontend/index.tsx
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/frontend/index.tsx
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/test1_counter_app/backend/decrementCounter/main.ts
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/test1_counter_app/backend/decrementCounter/main.ts
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/test1_counter_app/backend/decrementCounter/meta.json
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/test1_counter_app/backend/decrementCounter/meta.json
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/test1_counter_app/backend/incrementCounter/main.ts
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/test1_counter_app/backend/incrementCounter/main.ts
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/test1_counter_app/backend/incrementCounter/meta.json
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/test1_counter_app/backend/incrementCounter/meta.json
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/test1_counter_app/frontend/index.tsx
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/test1_counter_app/frontend/index.tsx
--- a/ai_evals/fixtures/frontend/flow/expected/test0_sum_two_numbers.json
+++ b/ai_evals/fixtures/frontend/flow/expected/test0_sum_two_numbers.json
@@ -0,0 +1,31 @@
+{
+  "summary": "",
+  "value": {
+    "modules": [
+      {
+        "id": "sum_numbers",
+        "value": {
+          "type": "rawscript",
+          "language": "bun",
+          "content": "export async function main(a: number, b: number) {\n  return a + b;\n}",
+          "input_transforms": {
+            "a": {
+              "type": "javascript",
+              "expr": "flow_input.a"
+            },
+            "b": {
+              "type": "javascript",
+              "expr": "flow_input.b"
+            }
+          }
+        }
+      }
+    ]
+  },
+  "schema": {
+    "$schema": "https://json-schema.org/draft/2020-12/schema",
+    "properties": {},
+    "required": [],
+    "type": "object"
+  }
+}
--- a/frontend/src/lib/components/copilot/chat/tests/flow/expected/test1.json
+++ b/frontend/src/lib/components/copilot/chat/tests/flow/expected/test1.json
--- a/ai_evals/fixtures/frontend/flow/expected/test10_while_loop_counter.json
+++ b/ai_evals/fixtures/frontend/flow/expected/test10_while_loop_counter.json
@@ -0,0 +1,30 @@
+{
+  "value": {
+    "modules": [
+      {
+        "id": "count_until_target",
+        "value": {
+          "type": "whileloopflow"
+        }
+      },
+      {
+        "id": "return_final_count",
+        "value": {
+          "type": "rawscript"
+        }
+      }
+    ]
+  },
+  "schema": {
+    "$schema": "https://json-schema.org/draft/2020-12/schema",
+    "type": "object",
+    "properties": {
+      "target": {
+        "type": "number"
+      }
+    },
+    "required": [
+      "target"
+    ]
+  }
+}
--- a/ai_evals/fixtures/frontend/flow/expected/test11_preprocessor_failure.json
+++ b/ai_evals/fixtures/frontend/flow/expected/test11_preprocessor_failure.json
@@ -0,0 +1,36 @@
+{
+  "value": {
+    "preprocessor_module": {
+      "id": "preprocessor",
+      "value": {
+        "type": "rawscript"
+      }
+    },
+    "failure_module": {
+      "id": "failure",
+      "value": {
+        "type": "rawscript"
+      }
+    },
+    "modules": [
+      {
+        "id": "process_event",
+        "value": {
+          "type": "rawscript"
+        }
+      }
+    ]
+  },
+  "schema": {
+    "$schema": "https://json-schema.org/draft/2020-12/schema",
+    "type": "object",
+    "properties": {
+      "payload": {
+        "type": "string"
+      }
+    },
+    "required": [
+      "payload"
+    ]
+  }
+}
--- a/ai_evals/fixtures/frontend/flow/expected/test12_approval_step.json
+++ b/ai_evals/fixtures/frontend/flow/expected/test12_approval_step.json
@@ -0,0 +1,44 @@
+{
+  "value": {
+    "modules": [
+      {
+        "id": "request_approval",
+        "suspend": {
+          "required_events": 1,
+          "resume_form": {
+            "schema": {
+              "approver_comment": {
+                "type": "string"
+              }
+            }
+          }
+        },
+        "value": {
+          "type": "rawscript"
+        }
+      },
+      {
+        "id": "finalize_purchase",
+        "value": {
+          "type": "rawscript"
+        }
+      }
+    ]
+  },
+  "schema": {
+    "$schema": "https://json-schema.org/draft/2020-12/schema",
+    "type": "object",
+    "properties": {
+      "requester_email": {
+        "type": "string"
+      },
+      "amount": {
+        "type": "number"
+      }
+    },
+    "required": [
+      "requester_email",
+      "amount"
+    ]
+  }
+}
--- a/ai_evals/fixtures/frontend/flow/expected/test1_reuse_existing_script.json
+++ b/ai_evals/fixtures/frontend/flow/expected/test1_reuse_existing_script.json
@@ -0,0 +1,39 @@
+{
+  "value": {
+    "modules": [
+      {
+        "id": "sum_numbers",
+        "value": {
+          "type": "script",
+          "path": "f/evals/add_two_numbers.ts",
+          "input_transforms": {
+            "a": {
+              "type": "javascript",
+              "expr": "flow_input.a"
+            },
+            "b": {
+              "type": "javascript",
+              "expr": "flow_input.b"
+            }
+          }
+        }
+      }
+    ]
+  },
+  "schema": {
+    "$schema": "https://json-schema.org/draft/2020-12/schema",
+    "type": "object",
+    "properties": {
+      "a": {
+        "type": "number"
+      },
+      "b": {
+        "type": "number"
+      }
+    },
+    "required": [
+      "a",
+      "b"
+    ]
+  }
+}
--- a/frontend/src/lib/components/copilot/chat/tests/flow/expected/test2.json
+++ b/frontend/src/lib/components/copilot/chat/tests/flow/expected/test2.json
--- a/ai_evals/fixtures/frontend/flow/expected/test2_call_existing_subflow.json
+++ b/ai_evals/fixtures/frontend/flow/expected/test2_call_existing_subflow.json
@@ -0,0 +1,39 @@
+{
+  "value": {
+    "modules": [
+      {
+        "id": "call_add_numbers",
+        "value": {
+          "type": "flow",
+          "path": "f/evals/add_numbers_flow",
+          "input_transforms": {
+            "a": {
+              "type": "javascript",
+              "expr": "flow_input.a"
+            },
+            "b": {
+              "type": "javascript",
+              "expr": "flow_input.b"
+            }
+          }
+        }
+      }
+    ]
+  },
+  "schema": {
+    "$schema": "https://json-schema.org/draft/2020-12/schema",
+    "type": "object",
+    "properties": {
+      "a": {
+        "type": "number"
+      },
+      "b": {
+        "type": "number"
+      }
+    },
+    "required": [
+      "a",
+      "b"
+    ]
+  }
+}
--- a/frontend/src/lib/components/copilot/chat/tests/flow/expected/test3.json
+++ b/frontend/src/lib/components/copilot/chat/tests/flow/expected/test3.json