feat: add black-box ai eval benchmarks (#8618)

2026-04-13 14:05:46 +02:00
parent 60211c1d19
commit cdcc56461b
166 changed files with 8640 additions and 3647 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -25,6 +25,8 @@ rust-client/Cargo.toml
 backend/target
 frontend/node_modules
 typescript-client/node_modules
 ai_evals/node_modules
 ai_evals/results/
 frontend/.svelte-kit
 backend/chrome_profiler.json
 .fast-check/
--- a/ai_evals/.gitignore
+++ b/ai_evals/.gitignore
@@ -0,0 +1,2 @@
 .env
 results/
--- a/ai_evals/AGENTS.md
+++ b/ai_evals/AGENTS.md
@@ -0,0 +1,172 @@
 # AI Evals Authoring Guide
 This folder contains black-box benchmark cases for:
 - `flow`
 - `app`
 - `script`
 - `cli`
 The goal is to test the current production prompts and guidance with realistic user requests, not to test one exact implementation shape.
 ## Core rules
 1. Write prompts like a real user request.
 2. Prefer behavior, inputs, constraints, and outcomes over internal implementation details.
 3. Keep deterministic validation narrow and hard.
 4. Put semantic expectations in `judgeChecklist`.
 5. Use `expected` fixtures only when exact structure really matters.
 ## Prompt writing
 Prompts should sound like something a user would naturally ask.
 Good:
 - "Create a flow that routes support requests based on customer tier."
 - "Add a reset button that sets the counter back to 0."
 - "Create a flow that reuses the existing greeting script instead of duplicating the logic."
 Bad:
 - "Use `branchone` with 3 branches and a default branch."
 - "Create a `rawscript` step with this exact topology."
 - "This is a benchmark harness."
 Do not write prompts as if the user knows Windmill internals unless the case is explicitly testing a power-user workflow.
 ## Flow-specific rules
 The main principle:
 - flow prompts should read like requests from a user who does not know the product internals
 - the user should ask for behavior, not for `branchone`, `branchall`, `rawscript`, `preprocessor_module`, `failure_module`, exact graph topology, or other internal constructs
 That means:
 - creation cases should describe the business behavior and expected result
 - modification cases may mention existing step names, because the user can see the current flow
 - only mention special Windmill constructs when the case is explicitly about those constructs
 Examples:
 - acceptable creation prompt:
  "Create a purchase approval flow that pauses for approval and asks the approver for a comment."
 - avoid:
  "Create a suspend step with one required event and a resume form."
 For flow cases, do not fail a case just because the model chose a different valid topology.
 ## App-specific rules
 App prompts should focus on user-visible behavior:
 - what the UI should let the user do
 - what should persist
 - what backend behavior is needed
 Avoid prompting in terms of React structure, component names, or implementation unless the case is specifically about editing an existing app.
 ## CLI-specific rules
 CLI prompts can be more explicit about paths and file names because real CLI users often do specify them.
 Still, avoid benchmark phrasing. The prompt should read like a repo task, not a harness instruction.
 When relevant, ask the assistant to tell the user which `wmill` commands to run next. That is part of the benchmarked behavior.
 ## Deterministic validation
 Use deterministic validation only for hard failures such as:
 - missing required files
 - unexpected extra files when the prompt says not to create them
 - syntax errors
 - unresolved flow refs
 - missing required special modules or suspend config
 - obvious artifact corruption
 Do not use deterministic validation to enforce one preferred implementation for broad creation tasks.
 Examples of bad hard checks:
 - exact step topology for a creation flow
 - exact branch structure when the prompt only asked for routing behavior
 - exact input shape when multiple reasonable shapes are acceptable
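As an illustration of a narrow hard check, the sketch below flags only unresolved `results.<id>` references and says nothing about topology. It is a minimal sketch under stated assumptions: `SketchStep` and `findUnresolvedResultRefs` are hypothetical names, the flow shape is heavily simplified, and the check does not consider step ordering or nesting. It is not the validator used by this repository.

```typescript
// Hypothetical, simplified step shape used only for this illustration.
interface SketchStep {
  id: string;
  // Input expressions that may reference other steps, e.g. "results.a.total".
  inputExprs: string[];
}

// Returns one entry per reference to a step id that does not exist in the flow.
function findUnresolvedResultRefs(steps: SketchStep[]): string[] {
  const knownIds = steps.map((step) => step.id);
  const unresolved: string[] = [];
  for (const step of steps) {
    for (const expr of step.inputExprs) {
      // A fresh regex per expression keeps `lastIndex` state from leaking.
      const refPattern = /\bresults\.([A-Za-z_][A-Za-z0-9_]*)/g;
      let match: RegExpExecArray | null;
      while ((match = refPattern.exec(expr)) !== null) {
        const refId = match[1];
        if (knownIds.indexOf(refId) === -1) {
          unresolved.push(`${step.id} -> results.${refId}`);
        }
      }
    }
  }
  return unresolved;
}
```

A check like this fails only on objectively broken output: any valid topology passes as long as every reference points at an existing step.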
 ## Judge checklist
 Every non-trivial case should have a `judgeChecklist`.
 The checklist should capture:
 - the user-visible behavior that must be present
 - important constraints
 - key completion criteria
 The checklist should not duplicate low-level implementation details unless they are truly required by the task.
 Good checklist items:
 - "the flow calculates the order total with 8% tax"
 - "the app persists recipes appropriately for a raw Windmill app"
 - "the flow reuses the existing workspace script instead of rewriting the logic"
 Bad checklist items:
 - "uses `branchone`"
 - "contains a `rawscript` node"
 ## When to use `expected`
 Use `expected` fixtures when the case is structure-sensitive, for example:
 - exact file creation
 - exact script content
 - modification cases where a specific file must change in a specific way
 - cases where preserving an existing structure is part of the requirement
 Do not use a full `expected` artifact as the semantic oracle for broad creation tasks when multiple valid outputs should pass.
 ## When to use `initial`
 Use `initial` when the benchmark is about:
 - editing an existing artifact
 - reusing existing workspace assets
 - preserving existing behavior while adding a change
 If the case is greenfield, prefer no `initial`.
 ## Case design ladder
 Prefer suites that get gradually harder:
 1. trivial create case
 2. realistic create case
 3. reuse-existing-assets case
 4. modification case
 5. refactor case
 6. edge-case or niche product behavior
 The last cases in a suite should cover unusual or product-specific behavior.
 ## Anti-patterns
 Avoid these:
 - benchmark framing in prompts
 - over-specified internal topology for creation tasks
 - judge checklists that just restate implementation details
 - deterministic validation that encodes one preferred solution
 - fixtures that are so minimal or brittle that they create false negatives
 ## Before adding a case
 Ask:
 1. Would a real user plausibly write this prompt?
 2. If the model solves it in a different valid way, would the case still pass?
 3. Are the hard deterministic checks only catching objectively broken output?
 4. Does the `judgeChecklist` describe the real success criteria?
 5. If this case fails, will the reason be understandable from the saved artifacts?
--- a/ai_evals/CLAUDE.md
+++ b/ai_evals/CLAUDE.md
@@ -0,0 +1 @@
@AGENTS.md
--- a/ai_evals/README.md
+++ b/ai_evals/README.md
@@ -0,0 +1,176 @@
 # AI Evals
 Small benchmark runner for the four Windmill AI generation modes:
 - `cli`
 - `flow`
 - `script`
 - `app`
 The benchmark always tests the current production prompts, tools, and guidance in this checkout.
 Each attempt runs:
 1. the real production path
 2. deterministic validation
 3. LLM judging
 ## Install
 ```bash
 cd ai_evals
 bun install
 ```
 Frontend modes also require frontend dependencies (run from the repository root):
 ```bash
 cd frontend
 bun install
 ```
 ## Commands
 List model aliases:
 ```bash
 cd ai_evals
 bun run cli -- models
 ```
 List cases:
 ```bash
 cd ai_evals
 bun run cli -- cases
 bun run cli -- cases flow
 ```
 Run benchmarks:
 ```bash
 cd ai_evals
 bun run cli -- run flow
 bun run cli -- run flow flow-test4-order-processing-loop --model opus
 bun run cli -- run flow flow-test0-sum-two-numbers --models haiku,opus,4o
 bun run cli -- run flow flow-test0-sum-two-numbers --runs 3 --verbose
 bun run cli -- run flow --record
 bun run cli -- run cli bun-hello-script
 ```
 Public CLI surface:
 - `models`
 - `cases [mode]`
 - `run <mode> [caseIds...]`
 `run` options:
 - `--runs <n>`: repeat each case `n` times
 - `--output <path>`: custom result JSON path
 - `--model <alias>`: choose the model under test
 - `--models <a,b,c>`: run the same cases sequentially against several model aliases
 - `--verbose`: stream assistant output for frontend runs
 - `--record`: append a compact tracked summary line to `ai_evals/history/<mode>.jsonl` for full-suite runs only
 ## Models
 Use `bun run cli -- models` to see the current aliases.
 Today:
 - `haiku`
 - `sonnet`
 - `opus`
 - `4o`
 - `gemini-flash`
 - `gemini-pro`
 - `gemini-3-flash-preview`
 - `gemini-3.1-pro-preview`
 Notes:
 - the command also prints accepted alias spellings such as `gpt-4o`, `claude-opus-4.6`, and `claude-haiku-4.5`
 - frontend modes (`flow`, `script`, `app`) can use Anthropic, OpenAI, and Gemini-backed aliases
 - `cli` mode always uses the Anthropic agent SDK, so only Anthropic aliases are valid there
 - the judge model is separate and currently defaults to `claude-sonnet-4-6`
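The provider restriction in the notes above can be sketched as a small lookup. This is a hedged illustration only: `ALIAS_PROVIDER` and `isAliasAllowedForMode` are hypothetical names, and the alias-to-provider assignments are inferred from the alias names rather than read from `core/models`.

```typescript
type Provider = "anthropic" | "openai" | "gemini";
type Mode = "cli" | "flow" | "script" | "app";

// Assumption: providers are inferred from the alias names listed above.
const ALIAS_PROVIDER: Record<string, Provider> = {
  haiku: "anthropic",
  sonnet: "anthropic",
  opus: "anthropic",
  "4o": "openai",
  "gemini-flash": "gemini",
  "gemini-pro": "gemini",
  "gemini-3-flash-preview": "gemini",
  "gemini-3.1-pro-preview": "gemini",
};

// `cli` mode accepts only Anthropic aliases; frontend modes accept any known alias.
function isAliasAllowedForMode(mode: Mode, alias: string): boolean {
  if (!Object.prototype.hasOwnProperty.call(ALIAS_PROVIDER, alias)) {
    return false;
  }
  const provider = ALIAS_PROVIDER[alias];
  return mode === "cli" ? provider === "anthropic" : true;
}
```

Under these assumptions, `isAliasAllowedForMode("cli", "4o")` is rejected while the same alias is accepted for `flow`.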
 ## Case Format
 Cases live in one YAML file per mode under `ai_evals/cases/`.
 Minimal shape:
 ```yaml
 - id: flow-test0-sum-two-numbers
  prompt: |-
    Create a flow that takes two numbers, `a` and `b`, and returns their sum.
  initial: ai_evals/fixtures/...
  expected: ai_evals/fixtures/...
 ```
 Optional fields:
 - `initial`: starting state fixture
 - `expected`: expected artifact fixture
 - `validate`: extra deterministic validation rules
 For `flow` mode, `validate` can express requirements such as:
 - accepted input schema shapes
 - required `results.*` reference validity
 - required module/code/input characteristics
 For `flow` mode, an `initial` fixture can also include a benchmark workspace catalog of
 existing scripts and flows. That lets the real `search_workspace` and
 `get_runnable_details` tools discover reusable workspace runnables during evals.
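For readers who prefer types, the case shape described above could be modeled roughly as follows. This is a sketch under assumptions: `EvalCaseSketch` and `checkCaseShape` are hypothetical names, `judgeChecklist` is taken from the authoring guide, and the `validate` field is left opaque because its schema is not specified here.

```typescript
// Hypothetical model of one entry in a per-mode cases YAML file.
interface EvalCaseSketch {
  id: string;
  prompt: string;
  initial?: string; // path to a starting-state fixture
  expected?: string; // path to an expected-artifact fixture
  judgeChecklist?: string[];
  validate?: unknown; // schema not modeled in this sketch
}

// Returns a list of human-readable problems; an empty list means the shape looks fine.
function checkCaseShape(value: unknown): string[] {
  if (typeof value !== "object" || value === null) {
    return ["case must be an object"];
  }
  const candidate = value as Record<string, unknown>;
  const problems: string[] = [];
  if (typeof candidate.id !== "string" || candidate.id.length === 0) {
    problems.push("id must be a non-empty string");
  }
  if (typeof candidate.prompt !== "string" || candidate.prompt.trim().length === 0) {
    problems.push("prompt must be a non-empty string");
  }
  for (const key of ["initial", "expected"]) {
    if (candidate[key] !== undefined && typeof candidate[key] !== "string") {
      problems.push(`${key} must be a fixture path string`);
    }
  }
  const checklist = candidate.judgeChecklist;
  if (checklist !== undefined) {
    const allStrings =
      Array.isArray(checklist) && checklist.every((item) => typeof item === "string");
    if (!allStrings) {
      problems.push("judgeChecklist must be an array of strings");
    }
  }
  return problems;
}
```

Returning a list of problems rather than throwing on the first one keeps authoring feedback readable when a case has several mistakes.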
 ## Results And Artifacts
 Every run writes:
 - a summary JSON under `ai_evals/results/`
 - generated artifacts in a sibling directory
 If `--record` is used, the CLI also appends one compact JSON line to:
 - `ai_evals/history/flow.jsonl`
 - `ai_evals/history/script.jsonl`
 - `ai_evals/history/app.jsonl`
 - `ai_evals/history/cli.jsonl`
 Each recorded line contains:
 - run metadata (`createdAt`, `gitSha`, `mode`, `runModel`, `judgeModel`)
 - suite totals (`caseCount`, `attemptCount`, `passedAttempts`, `passRate`, `averageDurationMs`, `averageJudgeScore`)
 - average token usage (`averageTokenUsagePerAttempt`)
 - per-case metrics under `cases[]` (`averageDurationMs`, `averageJudgeScore`, `averageTokenUsagePerAttempt`, pass rate)
 - `failedCaseIds`
 Example output paths for a `flow` run:
 - summary: `ai_evals/results/2026-04-09T09-40-33.051Z__flow.json`
 - artifacts: `ai_evals/results/2026-04-09T09-40-33.051Z__flow/`
 Typical artifacts by mode:
 - `flow`: `flow.json`
 - `script`: `script.json` plus the generated script file
 - `app`: `app.json` plus frontend/backend files
 - `cli`: `assistant-output.txt` plus generated workspace files
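The suite totals stored in a `--record` history line can be derived from per-attempt data. The sketch below shows one plausible derivation; it is not the code in `core/`. `AttemptSketch` and `summarizeAttempts` are hypothetical names, and two rules are assumptions: `passRate` is computed per attempt, and a case counts as failed when any of its attempts fails.

```typescript
// Hypothetical per-attempt record used only for this illustration.
interface AttemptSketch {
  caseId: string;
  passed: boolean;
  durationMs: number;
  judgeScore: number | null; // null when the attempt was not judged
}

interface SuiteTotalsSketch {
  caseCount: number;
  attemptCount: number;
  passedAttempts: number;
  passRate: number;
  averageDurationMs: number;
  averageJudgeScore: number | null;
  failedCaseIds: string[];
}

function summarizeAttempts(attempts: AttemptSketch[]): SuiteTotalsSketch {
  const caseIds: string[] = [];
  const failedCaseIds: string[] = [];
  let passedAttempts = 0;
  let totalDurationMs = 0;
  let judgeScoreSum = 0;
  let judgedAttempts = 0;
  for (const attempt of attempts) {
    if (caseIds.indexOf(attempt.caseId) === -1) {
      caseIds.push(attempt.caseId);
    }
    if (attempt.passed) {
      passedAttempts += 1;
    } else if (failedCaseIds.indexOf(attempt.caseId) === -1) {
      // Assumption: one failed attempt is enough to list the case as failed.
      failedCaseIds.push(attempt.caseId);
    }
    totalDurationMs += attempt.durationMs;
    if (attempt.judgeScore !== null) {
      judgeScoreSum += attempt.judgeScore;
      judgedAttempts += 1;
    }
  }
  const attemptCount = attempts.length;
  return {
    caseCount: caseIds.length,
    attemptCount,
    passedAttempts,
    passRate: attemptCount === 0 ? 0 : passedAttempts / attemptCount,
    averageDurationMs: attemptCount === 0 ? 0 : totalDurationMs / attemptCount,
    averageJudgeScore: judgedAttempts === 0 ? null : judgeScoreSum / judgedAttempts,
    failedCaseIds,
  };
}
```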
 ## Layout
 - `cases/`: one YAML file per mode
 - `fixtures/`: initial and expected fixtures
 - `core/`: shared loading, model resolution, validation, judging, and result writing
 - `modes/`: one runner per mode
 - `history/`: optional tracked pass-rate history written by `run --record`, one JSONL file per mode
 - `results/`: local benchmark output and artifacts
 ## Notes
 - Frontend modes reuse the production frontend chat code through the Vitest bridge.
 - CLI mode creates an isolated workspace, writes the current checkout guidance into it, and benchmarks the real skills / `AGENTS.md` flow.
 - Frontend progress streams live while the benchmark is running.
 - Deterministic validators should stay focused on real correctness constraints, not one exact implementation shape.
--- a/ai_evals/adapters/cli/runtime.test.ts
+++ b/ai_evals/adapters/cli/runtime.test.ts
@@ -0,0 +1,72 @@
 import { describe, expect, it } from "bun:test";
 import {
  anthropicUsageToBenchmarkTokenUsage,
  extractCliResultTokenUsage,
 } from "./runtime";
 describe("anthropicUsageToBenchmarkTokenUsage", () => {
  it("includes cache tokens in prompt usage", () => {
    expect(
      anthropicUsageToBenchmarkTokenUsage({
        input_tokens: 120,
        output_tokens: 45,
        cache_creation_input_tokens: 30,
        cache_read_input_tokens: 5,
      })
    ).toEqual({
      prompt: 155,
      completion: 45,
      total: 200,
    });
  });
  it("returns null when usage is absent", () => {
    expect(anthropicUsageToBenchmarkTokenUsage(null)).toBeNull();
  });
 });
 describe("extractCliResultTokenUsage", () => {
  it("reads aggregate usage from the SDK result event", () => {
    expect(
      extractCliResultTokenUsage({
        type: "result",
        usage: {
          input_tokens: 400,
          output_tokens: 120,
          cache_creation_input_tokens: 50,
          cache_read_input_tokens: 25,
        },
      })
    ).toEqual({
      prompt: 475,
      completion: 120,
      total: 595,
    });
  });
  it("falls back to modelUsage when aggregate usage is unavailable", () => {
    expect(
      extractCliResultTokenUsage({
        type: "result",
        modelUsage: {
          opus: {
            inputTokens: 200,
            outputTokens: 60,
            cacheCreationInputTokens: 10,
            cacheReadInputTokens: 5,
          },
          haiku: {
            inputTokens: 80,
            outputTokens: 20,
            cacheCreationInputTokens: 0,
            cacheReadInputTokens: 15,
          },
        },
      })
    ).toEqual({
      prompt: 310,
      completion: 80,
      total: 390,
    });
  });
 });
--- a/ai_evals/adapters/cli/runtime.ts
+++ b/ai_evals/adapters/cli/runtime.ts
@@ -0,0 +1,199 @@
 import { query, type Options } from "@anthropic-ai/claude-agent-sdk";
 import { join } from "path";
 import { fileURLToPath } from "url";
 import { getCliEvalModel, resolveEvalModel, type CliEvalModelConfig } from "../../core/models";
 import type { BenchmarkTokenUsage } from "../../core/types";
 export interface ToolInvocation {
  tool: string;
  input: Record<string, unknown>;
  timestamp: number;
 }
 export interface PromptRunResult {
  toolsUsed: ToolInvocation[];
  skillsInvoked: string[];
  output: string;
  durationMs: number;
  assistantMessageCount: number;
  tokenUsage: BenchmarkTokenUsage | null;
 }
 interface AnthropicUsageLike {
  input_tokens?: number | null;
  output_tokens?: number | null;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
 }
 interface AnthropicModelUsageLike {
  inputTokens?: number | null;
  outputTokens?: number | null;
  cacheCreationInputTokens?: number | null;
  cacheReadInputTokens?: number | null;
 }
 interface CliResultMessageLike {
  type?: string;
  usage?: AnthropicUsageLike | null;
  modelUsage?: Record<string, AnthropicModelUsageLike> | null;
 }
 const REPO_ROOT = fileURLToPath(new URL("../../../", import.meta.url));
 export const DEFAULT_CLI_EVAL_MODEL: CliEvalModelConfig = getCliEvalModel(resolveEvalModel("cli"));
 export function getGeneratedSkillsSource(): string {
  return join(REPO_ROOT, "system_prompts", "auto-generated", "skills");
 }
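 /**
 * Converts Anthropic-style usage counters into benchmark token usage.
 * Cache creation and cache read tokens are counted as prompt tokens.
 */
 export function anthropicUsageToBenchmarkTokenUsage(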
  usage: AnthropicUsageLike | null | undefined
 ): BenchmarkTokenUsage | null {
  if (!usage) {
    return null;
  }
  const prompt =
    (usage.input_tokens ?? 0) +
    (usage.cache_creation_input_tokens ?? 0) +
    (usage.cache_read_input_tokens ?? 0);
  const completion = usage.output_tokens ?? 0;
  return {
    prompt,
    completion,
    total: prompt + completion,
  };
 }
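 /**
 * Extracts token usage from an SDK `result` message.
 * Prefers the aggregate `usage` field and falls back to summing `modelUsage` across models.
 */
 export function extractCliResultTokenUsage(message: unknown): BenchmarkTokenUsage | null {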
  if (!message || typeof message !== "object") {
    return null;
  }
  const resultMessage = message as CliResultMessageLike;
  if (resultMessage.type !== "result") {
    return null;
  }
  const usage = anthropicUsageToBenchmarkTokenUsage(resultMessage.usage);
  if (usage) {
    return usage;
  }
  if (!resultMessage.modelUsage || typeof resultMessage.modelUsage !== "object") {
    return null;
  }
  let prompt = 0;
  let completion = 0;
  let sawModelUsage = false;
  for (const modelUsage of Object.values(resultMessage.modelUsage)) {
    if (!modelUsage || typeof modelUsage !== "object") {
      continue;
    }
    prompt +=
      (modelUsage.inputTokens ?? 0) +
      (modelUsage.cacheCreationInputTokens ?? 0) +
      (modelUsage.cacheReadInputTokens ?? 0);
    completion += modelUsage.outputTokens ?? 0;
    sawModelUsage = true;
  }
  if (!sawModelUsage) {
    return null;
  }
  return {
    prompt,
    completion,
    total: prompt + completion,
  };
 }
 export async function runPromptAndCapture(
  prompt: string,
  cwd: string,
  maxTurns: number = 3,
  modelConfig: CliEvalModelConfig = DEFAULT_CLI_EVAL_MODEL
 ): Promise<PromptRunResult> {
  const toolsUsed: ToolInvocation[] = [];
  const skillsInvoked: string[] = [];
  let output = "";
  let assistantMessageCount = 0;
  let tokenUsage: BenchmarkTokenUsage | null = null;
  const startedAt = Date.now();
  const options: Options = {
    cwd,
    model: modelConfig.model,
    maxTurns,
    settingSources: ["project"],
    allowedTools: ["Skill", "Read", "Glob", "Grep", "Bash", "Write", "Edit"]
  };
  for await (const message of query({ prompt, options })) {
    if (message.type === "assistant") {
      assistantMessageCount += 1;
      const content = message.message?.content;
      if (Array.isArray(content)) {
        for (const block of content) {
          if (block.type === "tool_use") {
            toolsUsed.push({
              tool: block.name,
              input: block.input as Record<string, unknown>,
              timestamp: Date.now()
            });
            if (block.name === "Skill" && typeof block.input === "object" && block.input !== null) {
              const skillInput = block.input as { skill?: string };
              if (skillInput.skill) {
                skillsInvoked.push(skillInput.skill);
              }
            }
          } else if (block.type === "text") {
            output += block.text;
          }
        }
      }
    } else if (message.type === "result") {
      const resultMessage = message as { result?: string };
      tokenUsage = extractCliResultTokenUsage(message) ?? tokenUsage;
      if (typeof resultMessage.result === "string") {
        output += resultMessage.result;
      }
    }
  }
  return {
    toolsUsed,
    skillsInvoked,
    output,
    durationMs: Date.now() - startedAt,
    assistantMessageCount,
    tokenUsage,
  };
 }
 export function wasSkillInvoked(result: PromptRunResult, skillName: string): boolean {
  return result.skillsInvoked.some((skill) => skill.includes(skillName));
 }
 export function wasToolUsed(result: PromptRunResult, toolName: string): boolean {
  return result.toolsUsed.some((tool) => tool.tool === toolName);
 }
 export function formatCliRunModelLabel(modelConfig: CliEvalModelConfig): string {
  return `${modelConfig.provider}:${modelConfig.model}`;
 }
 export function getToolInputs(
  result: PromptRunResult,
  toolName: string
 ): Record<string, unknown>[] {
  return result.toolsUsed
    .filter((tool) => tool.tool === toolName)
    .map((tool) => tool.input);
 }
--- a/ai_evals/adapters/frontend/benchmarkRunner.ts
+++ b/ai_evals/adapters/frontend/benchmarkRunner.ts
@@ -0,0 +1,87 @@
 import { loadSelectedCases } from "../../core/cases";
 import {
  formatRunModelLabel,
  getFrontendEvalModel,
  resolveEvalModel,
 } from "../../core/models";
 import { buildRunResult } from "../../core/results";
 import { runSuite } from "../../core/runSuite";
 import type { BenchmarkRunResult, ModeRunner } from "../../core/types";
 import { emitFrontendBenchmarkProgress } from "./progress";
 import { createAppModeRunner } from "../../modes/app";
 import { createFlowModeRunner } from "../../modes/flow";
 import { createScriptModeRunner } from "../../modes/script";
 import { DEFAULT_JUDGE_MODEL } from "../../core/judge";
 export type FrontendBenchmarkMode = "flow" | "app" | "script";
 export async function runFrontendBenchmarkFromEnv(): Promise<BenchmarkRunResult> {
  const mode = parseMode(process.env.WMILL_FRONTEND_AI_EVAL_MODE);
  const caseIds = parseOptionalJsonStringArray(process.env.WMILL_FRONTEND_AI_EVAL_CASE_IDS);
  const runs = parsePositiveInteger(process.env.WMILL_FRONTEND_AI_EVAL_RUNS, "WMILL_FRONTEND_AI_EVAL_RUNS");
  const emitProgress = process.env.WMILL_FRONTEND_AI_EVAL_PROGRESS === "1";
  const verbose = process.env.WMILL_FRONTEND_AI_EVAL_VERBOSE === "1";
  const model = resolveEvalModel(mode, process.env.WMILL_FRONTEND_AI_EVAL_MODEL);
  const selectedCases = await loadSelectedCases(mode, caseIds);
  const modeRunner = getModeRunner(mode, getFrontendEvalModel(model));
  const runModel = formatRunModelLabel(mode, model);
  const caseResults = await runSuite({
    modeRunner,
    cases: selectedCases,
    runs,
    runModel,
    judgeModel: DEFAULT_JUDGE_MODEL,
    concurrency: verbose ? 1 : undefined,
    verbose,
    onProgress: emitProgress ? (event) => emitFrontendBenchmarkProgress(event) : undefined,
  });
  return buildRunResult({
    mode,
    runs,
    runModel,
    judgeModel: DEFAULT_JUDGE_MODEL,
    caseResults,
  });
 }
 function getModeRunner(
  mode: FrontendBenchmarkMode,
  model: ReturnType<typeof getFrontendEvalModel>
 ): ModeRunner<any, any, any> {
  switch (mode) {
    case "flow":
      return createFlowModeRunner(model);
    case "app":
      return createAppModeRunner(model);
    case "script":
      return createScriptModeRunner(model);
  }
 }
 function parseMode(value: string | undefined): FrontendBenchmarkMode {
  if (value === "flow" || value === "app" || value === "script") {
    return value;
  }
  throw new Error(`Unsupported frontend benchmark mode: ${String(value)}`);
 }
 function parseOptionalJsonStringArray(value: string | undefined): string[] {
  if (!value) {
    return [];
  }
  const parsed = JSON.parse(value) as unknown;
  if (!Array.isArray(parsed) || parsed.some((entry) => typeof entry !== "string")) {
    throw new Error("WMILL_FRONTEND_AI_EVAL_CASE_IDS must be a JSON string array");
  }
  return parsed;
 }
 function parsePositiveInteger(value: string | undefined, envName: string): number {
  const parsed = Number(value);
  if (!Number.isInteger(parsed) || parsed <= 0) {
    throw new Error(`${envName} must be a positive integer`);
  }
  return parsed;
 }
--- a/ai_evals/adapters/frontend/core/app/appEvalRunner.ts
+++ b/ai_evals/adapters/frontend/core/app/appEvalRunner.ts
@@ -0,0 +1,92 @@
 import { mkdtemp } from 'fs/promises'
 import { tmpdir } from 'os'
 import { join } from 'path'
 import type {
 	AppFiles,
 	BackendRunnable,
 	AppAIChatHelpers
 } from '../../../../../frontend/src/lib/components/copilot/chat/app/core'
 import {
 	getAppTools,
 	prepareAppSystemMessage,
 	prepareAppUserMessage
 } from '../../../../../frontend/src/lib/components/copilot/chat/app/core'
 import type { Tool as ProductionTool } from '../../../../../frontend/src/lib/components/copilot/chat/shared'
 import { createAppFileHelpers } from './fileHelpers'
 import { runEval } from '../shared'
 import type { AIProvider } from '$lib/gen/types.gen'
 import type { ModeRunContext } from '../../../../core/types'
 import type { TokenUsage } from '../shared/types'
 export interface AppEvalResult {
 	success: boolean
 	files: AppFiles
 	error?: string
 	assistantMessageCount: number
 	toolCallCount: number
 	toolsUsed: string[]
 	tokenUsage: TokenUsage
 }
 export interface AppEvalOptions {
 	initialFrontend?: Record<string, string>
 	initialBackend?: Record<string, BackendRunnable>
 	model?: string
 	maxIterations?: number
 	provider?: AIProvider
 	workspaceRoot?: string
 	runContext?: ModeRunContext
 }
 export async function runAppEval(
 	userPrompt: string,
 	apiKey: string,
 	options?: AppEvalOptions
 ): Promise<AppEvalResult> {
 	const workspaceRoot =
 		options?.workspaceRoot ??
 		(await mkdtemp(join(tmpdir(), 'wmill-frontend-app-benchmark-')))
 	const { helpers, getFiles, cleanup } = await createAppFileHelpers(
 		options?.initialFrontend ?? {},
 		options?.initialBackend ?? {},
 		workspaceRoot
 	)
 	try {
 		const systemMessage = prepareAppSystemMessage()
 		const tools = getAppTools() as ProductionTool<AppAIChatHelpers>[]
 		const model = options?.model ?? 'claude-haiku-4-5-20251001'
 		const userMessage = prepareAppUserMessage(userPrompt, helpers.getSelectedContext())
 		const rawResult = await runEval({
 			userPrompt,
 			systemMessage,
 			userMessage,
 			tools,
 			helpers,
 			apiKey,
 			getOutput: getFiles,
 			onAssistantMessageStart: options?.runContext?.onAssistantMessageStart,
 			onAssistantToken: options?.runContext?.onAssistantChunk,
 			onAssistantMessageEnd: options?.runContext?.onAssistantMessageEnd,
 			options: {
 				maxIterations: options?.maxIterations,
 				model,
 				workspace: workspaceRoot,
 				provider: options?.provider
 			}
 		})
 		return {
 			files: rawResult.output,
 			success: rawResult.success,
 			error: rawResult.error,
 			assistantMessageCount: rawResult.iterations,
 			toolCallCount: rawResult.toolCallsCount,
 			toolsUsed: rawResult.toolsCalled,
 			tokenUsage: rawResult.tokenUsage
 		}
 	} finally {
 		await cleanup()
 	}
 }
--- a/frontend/src/lib/components/copilot/chat/tests/app/appFixtureLoader.ts
+++ b/frontend/src/lib/components/copilot/chat/tests/app/appFixtureLoader.ts
@@ -1,4 +1,8 @@
-import type { AppFiles, BackendRunnable, InlineScript } from '../../app/core'
+import type {
 	AppFiles,
 	BackendRunnable,
 	InlineScript
 } from '../../../../../frontend/src/lib/components/copilot/chat/app/core'
 /**
 * Backend runnable metadata stored in meta.json files.
--- a/ai_evals/adapters/frontend/core/app/fileHelpers.ts
+++ b/ai_evals/adapters/frontend/core/app/fileHelpers.ts
@@ -0,0 +1,255 @@
 import { mkdir, rm, writeFile } from 'fs/promises'
 import { dirname, join } from 'path'
 import type {
 	AppAIChatHelpers,
 	AppFiles,
 	BackendRunnable,
 	DataTableSchema,
 	LintResult,
 	SelectedContext
 } from '../../../../../frontend/src/lib/components/copilot/chat/app/core'
 function createEmptyLintResult(): LintResult {
 	return {
 		errorCount: 0,
 		warningCount: 0,
 		errors: { frontend: {}, backend: {} },
 		warnings: { frontend: {}, backend: {} }
 	}
 }
 async function writeFrontendFile(
 	workspaceRoot: string | undefined,
 	path: string,
 	content: string
 ): Promise<void> {
 	if (!workspaceRoot) {
 		return
 	}
 	const relativePath = path.startsWith('/') ? path.slice(1) : path
 	const fullPath = join(workspaceRoot, 'frontend', relativePath)
 	await mkdir(dirname(fullPath), { recursive: true })
 	await writeFile(fullPath, content, 'utf8')
 }
 async function removeFrontendFile(workspaceRoot: string | undefined, path: string): Promise<void> {
 	if (!workspaceRoot) {
 		return
 	}
 	const relativePath = path.startsWith('/') ? path.slice(1) : path
 	await rm(join(workspaceRoot, 'frontend', relativePath), { force: true })
 }
 async function writeBackendRunnable(
 	workspaceRoot: string | undefined,
 	key: string,
 	runnable: BackendRunnable
 ): Promise<void> {
 	if (!workspaceRoot) {
 		return
 	}
 	const runnableDir = join(workspaceRoot, 'backend', key)
 	await mkdir(runnableDir, { recursive: true })
 	const meta: { name: string; language?: string; type?: string; path?: string } = {
 		name: runnable.name
 	}
 	if (runnable.type === 'inline' && runnable.inlineScript) {
 		meta.language = runnable.inlineScript.language
 		const extension = runnable.inlineScript.language === 'python3' ? 'py' : 'ts'
 		await writeFile(
 			join(runnableDir, `main.${extension}`),
 			runnable.inlineScript.content,
 			'utf8'
 		)
 	} else {
 		meta.type = runnable.type
 		if (runnable.path) {
 			meta.path = runnable.path
 		}
 	}
 	await writeFile(join(runnableDir, 'meta.json'), JSON.stringify(meta, null, 2) + '\n', 'utf8')
 }
 async function removeBackendRunnable(workspaceRoot: string | undefined, key: string): Promise<void> {
 	if (!workspaceRoot) {
 		return
 	}
 	await rm(join(workspaceRoot, 'backend', key), { recursive: true, force: true })
 }
 async function persistDatatables(
 	workspaceRoot: string | undefined,
 	datatables: DataTableSchema[]
 ): Promise<void> {
 	if (!workspaceRoot) {
 		return
 	}
 	await writeFile(
 		join(workspaceRoot, 'datatables.json'),
 		JSON.stringify(datatables, null, 2) + '\n',
 		'utf8'
 	)
 }
 export async function createAppFileHelpers(
 	initialFrontend: Record<string, string> = {},
 	initialBackend: Record<string, BackendRunnable> = {},
 	workspaceRoot?: string
 ): Promise<{
 	helpers: AppAIChatHelpers
 	getFiles: () => AppFiles
 	getFrontend: () => Record<string, string>
 	getBackend: () => Record<string, BackendRunnable>
 	cleanup: () => Promise<void>
 	workspaceDir: string | null
 }> {
 	let frontend = { ...initialFrontend }
 	let backend = { ...initialBackend }
 	let snapshotId = 0
 	const snapshots = new Map<
 		number,
 		{ frontend: Record<string, string>; backend: Record<string, BackendRunnable> }
 	>()
 	const datatables: DataTableSchema[] = []
 	for (const [path, content] of Object.entries(frontend)) {
 		await writeFrontendFile(workspaceRoot, path, content)
 	}
 	for (const [key, runnable] of Object.entries(backend)) {
 		await writeBackendRunnable(workspaceRoot, key, runnable)
 	}
 	await persistDatatables(workspaceRoot, datatables)
 	const helpers: AppAIChatHelpers = {
 		listFrontendFiles: () => Object.keys(frontend),
 		getFrontendFile: (path: string) => frontend[path],
 		getFrontendFiles: () => ({ ...frontend }),
 		setFrontendFile: (path: string, content: string) => {
 			frontend[path] = content
 			void writeFrontendFile(workspaceRoot, path, content)
 			return createEmptyLintResult()
 		},
 		deleteFrontendFile: (path: string) => {
 			delete frontend[path]
 			void removeFrontendFile(workspaceRoot, path)
 		},
 		listBackendRunnables: () =>
 			Object.entries(backend).map(([key, runnable]) => ({
 				key,
 				name: runnable.name
 			})),
 		getBackendRunnable: (key: string) => backend[key],
 		getBackendRunnables: () => ({ ...backend }),
 		setBackendRunnable: async (key: string, runnable: BackendRunnable) => {
 			backend[key] = runnable
 			await writeBackendRunnable(workspaceRoot, key, runnable)
 			return createEmptyLintResult()
 		},
 		deleteBackendRunnable: (key: string) => {
 			delete backend[key]
 			void removeBackendRunnable(workspaceRoot, key)
 		},
 		getFiles: (): AppFiles => ({
 			frontend: { ...frontend },
 			backend: { ...backend }
 		}),
 		getSelectedContext: (): SelectedContext => ({ type: 'none' }),
 		snapshot: () => {
 			const id = ++snapshotId
 			snapshots.set(id, {
 				frontend: { ...frontend },
 				backend: { ...backend }
 			})
 			return id
 		},
 		revertToSnapshot: (id: number) => {
 			const snapshot = snapshots.get(id)
 			if (!snapshot) {
 				return
 			}
 			frontend = { ...snapshot.frontend }
 			backend = { ...snapshot.backend }
 			void syncWorkspace()
 		},
 		lint: () => createEmptyLintResult(),
 		getDatatables: async () => structuredClone(datatables),
 		getAvailableDatatableNames: () => datatables.map((datatable) => datatable.datatable_name),
 		execDatatableSql: async (
 			datatableName: string,
 			sql: string,
 			newTable?: { schema: string; name: string }
 		) => {
 			if (newTable) {
 				datatables.push({
 					datatable_name: datatableName,
 					schemas: {
 						[newTable.schema]: {
 							[newTable.name]: {}
 						}
 					}
 				})
 				await persistDatatables(workspaceRoot, datatables)
 			}
 			return {
 				success: true,
 				result: [
 					{
 						datatableName,
 						sql
 					}
 				]
 			}
 		},
 		addTableToWhitelist: (datatableName: string, schemaName: string, tableName: string) => {
 			const existing = datatables.find((entry) => entry.datatable_name === datatableName)
 			if (existing) {
 				existing.schemas[schemaName] ??= {}
 				existing.schemas[schemaName][tableName] ??= {}
 			} else {
 				datatables.push({
 					datatable_name: datatableName,
 					schemas: {
 						[schemaName]: {
 							[tableName]: {}
 						}
 					}
 				})
 			}
 			void persistDatatables(workspaceRoot, datatables)
 		}
 	}
 	async function syncWorkspace(): Promise<void> {
 		if (!workspaceRoot) {
 			return
 		}
 		await rm(join(workspaceRoot, 'frontend'), { recursive: true, force: true })
 		await rm(join(workspaceRoot, 'backend'), { recursive: true, force: true })
 		for (const [path, content] of Object.entries(frontend)) {
 			await writeFrontendFile(workspaceRoot, path, content)
 		}
 		for (const [key, runnable] of Object.entries(backend)) {
 			await writeBackendRunnable(workspaceRoot, key, runnable)
 		}
 		await persistDatatables(workspaceRoot, datatables)
 	}
 	return {
 		helpers,
 		getFiles: () => ({
 			frontend: { ...frontend },
 			backend: { ...backend }
 		}),
 		getFrontend: () => ({ ...frontend }),
 		getBackend: () => ({ ...backend }),
 		cleanup: async () => {
 			if (workspaceRoot) {
 				await rm(workspaceRoot, { recursive: true, force: true })
 			}
 		},
 		workspaceDir: workspaceRoot ?? null
 	}
 }
--- a/ai_evals/adapters/frontend/core/flow/fileHelpers.ts
+++ b/ai_evals/adapters/frontend/core/flow/fileHelpers.ts
@@ -0,0 +1,161 @@
 import { mkdir, rm, writeFile } from 'fs/promises'
 import { dirname, join } from 'path'
 import type { FlowModule, InputTransform } from '../../../../../frontend/src/lib/gen'
 import type { ExtendedOpenFlow } from '../../../../../frontend/src/lib/components/flows/types'
 import type { FlowAIChatHelpers } from '../../../../../frontend/src/lib/components/copilot/chat/flow/core'
 import type { ScriptLintResult } from '../../../../../frontend/src/lib/components/copilot/chat/shared'
 import { findModuleById } from '../../../../../frontend/src/lib/components/copilot/chat/shared'
 import {
 	createInlineScriptSession
 } from '../../../../../frontend/src/lib/components/copilot/chat/flow/inlineScriptsUtils'
 import {
 	registerBenchmarkWorkspace,
 	registerBenchmarkWorkspaceRunnables,
 	unregisterBenchmarkWorkspaceRunnables,
 	createBenchmarkCompletedJob,
 	type BenchmarkWorkspaceFlow,
 	type BenchmarkWorkspaceScript
 } from '../../mockBackend'
 const EMPTY_SCRIPT_LINT_RESULT: ScriptLintResult = {
 	errorCount: 0,
 	warningCount: 0,
 	errors: [],
 	warnings: []
 }
 export interface FlowWorkspaceFixtures {
 	scripts?: BenchmarkWorkspaceScript[]
 	flows?: BenchmarkWorkspaceFlow[]
 }
 export async function createFlowFileHelpers(
 	initialModules: FlowModule[] = [],
 	initialSchema?: Record<string, any>,
 	workspaceRoot?: string,
 	workspaceFixtures?: FlowWorkspaceFixtures
 ): Promise<{
 	helpers: FlowAIChatHelpers
 	getFlow: () => ExtendedOpenFlow
 	getModules: () => FlowModule[]
 	cleanup: () => Promise<void>
 	workspaceDir: string | null
 }> {
 	let flow: ExtendedOpenFlow = {
 		value: { modules: structuredClone(initialModules) },
 		summary: '',
 		schema: initialSchema ?? {
 			$schema: 'https://json-schema.org/draft/2020-12/schema',
 			properties: {},
 			required: [],
 			type: 'object'
 		}
 	}
 	const inlineScriptSession = createInlineScriptSession()
 	const flowFilePath = workspaceRoot ? join(workspaceRoot, 'flow.json') : null
 	async function persistFlow(): Promise<void> {
 		if (!flowFilePath) {
 			return
 		}
 		await mkdir(dirname(flowFilePath), { recursive: true })
 		await writeFile(flowFilePath, JSON.stringify(flow, null, 2) + '\n', 'utf8')
 	}
 	await persistFlow()
 	if (workspaceRoot) {
 		registerBenchmarkWorkspace(workspaceRoot)
 		if (workspaceFixtures) {
 			registerBenchmarkWorkspaceRunnables(workspaceRoot, workspaceFixtures)
 		}
 	}
 	const helpers: FlowAIChatHelpers = {
 		getFlowAndSelectedId: () => ({ flow, selectedId: '' }),
 		getModules: (id?: string) => {
 			if (!id) return flow.value.modules
 			const module = findModuleById(flow.value.modules, id)
 			return module ? [module] : []
 		},
 		inlineScriptSession,
 		setSnapshot: () => {},
 		revertToSnapshot: () => {},
 		setCode: async (id: string, code: string) => {
 			const module = findModuleById(flow.value.modules, id)
 			if (module && module.value.type === 'rawscript') {
 				module.value.content = code
 			}
 			inlineScriptSession.set(id, code)
 			await persistFlow()
 		},
 		setFlowJson: async (
 			modules: FlowModule[] | undefined,
 			schema: Record<string, any> | undefined
 		) => {
 			if (modules) {
 				flow.value.modules = inlineScriptSession.restoreInlineScriptReferences(modules)
 				const unresolvedRefs = inlineScriptSession.findUnresolvedInlineScriptRefs(flow.value.modules)
 				if (unresolvedRefs.length > 0) {
 					throw new Error(
 						`Unresolved inline script references: ${unresolvedRefs.join(', ')}`
 					)
 				}
 			}
 			if (schema !== undefined) {
 				flow.schema = schema
 			}
 			await persistFlow()
 		},
 		getFlowInputsSchema: async () => flow.schema ?? {},
 		updateExprsToSet: (_id: string, _inputTransforms: Record<string, InputTransform>) => {},
 		acceptAllModuleActions: () => {},
 		rejectAllModuleActions: () => {},
 		hasPendingChanges: () => false,
 		selectStep: (_id: string) => {},
 		testFlow: async (args?: Record<string, any>) => {
 			if (workspaceRoot) {
 				const runPath = join(workspaceRoot, 'test-run.json')
 				await writeFile(
 					runPath,
 					JSON.stringify(
 						{
 							requestedArgs: args ?? {},
 							modules: flow.value.modules.map((module) => module.id)
 						},
 						null,
 						2
 					) + '\n',
 					'utf8'
 				)
 			}
 			return createBenchmarkCompletedJob({
 				workspace: workspaceRoot ?? 'benchmark',
 				jobKind: 'flowpreview',
 				result: {
 					requestedArgs: args ?? {},
 					modules: flow.value.modules.map((module) => module.id),
 					mocked: true
 				},
 				logs: 'Mock benchmark flow test run completed successfully.'
 			})
 		},
 		getLintErrors: async () => EMPTY_SCRIPT_LINT_RESULT
 	}
 	return {
 		helpers,
 		getFlow: () => flow,
 		getModules: () => flow.value.modules,
 		cleanup: async () => {
 			if (workspaceRoot) {
 				unregisterBenchmarkWorkspaceRunnables(workspaceRoot)
 				await rm(workspaceRoot, { recursive: true, force: true })
 			}
 		},
 		workspaceDir: workspaceRoot ?? null
 	}
 }
--- a/ai_evals/adapters/frontend/core/flow/flowEvalRunner.ts
+++ b/ai_evals/adapters/frontend/core/flow/flowEvalRunner.ts
@@ -0,0 +1,103 @@
 import { mkdtemp } from 'fs/promises'
 import { tmpdir } from 'os'
 import { join } from 'path'
 import type { FlowModule } from '$lib/gen'
 import type { AIProvider } from '$lib/gen/types.gen'
 import type { ExtendedOpenFlow } from '$lib/components/flows/types'
 import {
 	flowTools,
 	prepareFlowSystemMessage,
 	prepareFlowUserMessage,
 	type FlowAIChatHelpers
 } from '../../../../../frontend/src/lib/components/copilot/chat/flow/core'
 import type { Tool as ProductionTool } from '../../../../../frontend/src/lib/components/copilot/chat/shared'
 import { createFlowFileHelpers, type FlowWorkspaceFixtures } from './fileHelpers'
 import { runEval } from '../shared'
 import type { ModeRunContext } from '../../../../core/types'
 import type { TokenUsage } from '../shared/types'
 export interface FlowFixture {
 	value?: {
 		modules?: FlowModule[]
 	}
 	schema?: Record<string, unknown>
 }
 export interface FlowEvalResult {
 	success: boolean
 	flow: ExtendedOpenFlow
 	error?: string
 	assistantMessageCount: number
 	toolCallCount: number
 	toolsUsed: string[]
 	tokenUsage: TokenUsage
 }
 export interface FlowEvalOptions {
 	initialFlow?: FlowFixture
 	workspaceFixtures?: FlowWorkspaceFixtures
 	model?: string
 	maxIterations?: number
 	provider?: AIProvider
 	workspaceRoot?: string
 	runContext?: ModeRunContext
 }
 export async function runFlowEval(
 	userPrompt: string,
 	apiKey: string,
 	options?: FlowEvalOptions
 ): Promise<FlowEvalResult> {
 	const workspaceRoot =
 		options?.workspaceRoot ??
 		(await mkdtemp(join(tmpdir(), 'wmill-frontend-flow-benchmark-')))
 	const { helpers, getFlow, cleanup } = await createFlowFileHelpers(
 		options?.initialFlow?.value?.modules ?? [],
 		options?.initialFlow?.schema,
 		workspaceRoot,
 		options?.workspaceFixtures
 	)
 	try {
 		const systemMessage = prepareFlowSystemMessage()
 		const tools = flowTools as ProductionTool<FlowAIChatHelpers>[]
 		const model = options?.model ?? 'claude-haiku-4-5-20251001'
 		const userMessage = prepareFlowUserMessage(
 			userPrompt,
 			helpers.getFlowAndSelectedId(),
 			[],
 			helpers.inlineScriptSession
 		)
 		const rawResult = await runEval({
 			userPrompt,
 			systemMessage,
 			userMessage,
 			tools,
 			helpers,
 			apiKey,
 			getOutput: getFlow,
 			onAssistantMessageStart: options?.runContext?.onAssistantMessageStart,
 			onAssistantToken: options?.runContext?.onAssistantChunk,
 			onAssistantMessageEnd: options?.runContext?.onAssistantMessageEnd,
 			options: {
 				maxIterations: options?.maxIterations,
 				model,
 				workspace: workspaceRoot,
 				provider: options?.provider
 			}
 		})
 		return {
 			flow: rawResult.output,
 			success: rawResult.success,
 			error: rawResult.error,
 			assistantMessageCount: rawResult.iterations,
 			toolCallCount: rawResult.toolCallsCount,
 			toolsUsed: rawResult.toolsCalled,
 			tokenUsage: rawResult.tokenUsage
 		}
 	} finally {
 		await cleanup()
 	}
 }
--- a/ai_evals/adapters/frontend/core/script/fileHelpers.ts
+++ b/ai_evals/adapters/frontend/core/script/fileHelpers.ts
@@ -0,0 +1,73 @@
 import { mkdir, rm, writeFile } from 'fs/promises'
 import { dirname, join } from 'path'
 import type { ScriptLang } from '../../../../../frontend/src/lib/gen/types.gen'
 import type { ReviewChangesOpts } from '../../../../../frontend/src/lib/components/copilot/chat/monaco-adapter'
 import type { ScriptChatHelpers } from '../../../../../frontend/src/lib/components/copilot/chat/script/core'
 import { buildScriptLintResult } from './preview'
 import { registerBenchmarkWorkspace, unregisterBenchmarkWorkspace } from '../../mockBackend'
 export interface ScriptEvalState {
 	code: string
 	lang: ScriptLang | 'bunnative'
 	path: string
 	args: Record<string, any>
 }
 export async function createScriptFileHelpers(
 	initialScript: ScriptEvalState,
 	workspaceRoot?: string
 ): Promise<{
 	helpers: ScriptChatHelpers
 	getScript: () => ScriptEvalState
 	cleanup: () => Promise<void>
 	workspaceDir: string | null
 }> {
 	let script = structuredClone(initialScript)
 	const scriptFilePath = workspaceRoot ? join(workspaceRoot, script.path) : null
 	async function persistScript(): Promise<void> {
 		if (!scriptFilePath) {
 			return
 		}
 		await mkdir(dirname(scriptFilePath), { recursive: true })
 		await writeFile(scriptFilePath, script.code, 'utf8')
 	}
 	await persistScript()
 	if (workspaceRoot) {
 		registerBenchmarkWorkspace(workspaceRoot)
 	}
 	const helpers: ScriptChatHelpers = {
 		getScriptOptions: () => ({
 			code: script.code,
 			lang: script.lang,
 			path: script.path,
 			args: structuredClone(script.args)
 		}),
 		applyCode: async (code: string, opts?: ReviewChangesOpts) => {
 			if (opts?.mode === 'revert') {
 				return
 			}
 			script = {
 				...script,
 				code
 			}
 			await persistScript()
 		},
 		getLintErrors: () => buildScriptLintResult(script.code, script.lang)
 	}
 	return {
 		helpers,
 		getScript: () => structuredClone(script),
 		cleanup: async () => {
 			if (workspaceRoot) {
 				unregisterBenchmarkWorkspace(workspaceRoot)
 				await rm(workspaceRoot, { recursive: true, force: true })
 			}
 		},
 		workspaceDir: workspaceRoot ?? null
 	}
 }
--- a/ai_evals/adapters/frontend/core/script/preview.ts
+++ b/ai_evals/adapters/frontend/core/script/preview.ts
@@ -0,0 +1,96 @@
 import ts from 'typescript'
 import type { ScriptLang } from '../../../../../frontend/src/lib/gen/types.gen'
 import type { ScriptLintResult } from '../../../../../frontend/src/lib/components/copilot/chat/shared'
 export type ScriptPreviewLanguage = ScriptLang | 'bunnative'
 const TS_LIKE_LANGUAGES = new Set<ScriptPreviewLanguage>(['bun', 'deno', 'nativets', 'bunnative'])
 const JS_LIKE_LANGUAGES = new Set<ScriptPreviewLanguage>(['bun', 'deno', 'nativets', 'bunnative'])
 function hasSupportedEntrypoint(code: string): boolean {
 	return (
 		/export\s+(async\s+)?function\s+main\s*\(/.test(code) ||
 		/export\s+(async\s+)?function\s+preprocessor\s*\(/.test(code)
 	)
 }
 function compilerOptionsForLanguage(lang: ScriptPreviewLanguage): ts.CompilerOptions | null {
 	if (!TS_LIKE_LANGUAGES.has(lang)) {
 		return null
 	}
 	return {
 		target: ts.ScriptTarget.ES2022,
 		module: ts.ModuleKind.ESNext,
 		moduleResolution: ts.ModuleResolutionKind.Bundler,
 		noEmit: true,
 		allowJs: true,
 		checkJs: false,
 		strict: false,
 		skipLibCheck: true
 	}
 }
 function getLineAndColumn(sourceText: string, start: number): { line: number; column: number } {
 	const prefix = sourceText.slice(0, Math.max(0, start))
 	const line = prefix.split('\n').length
 	const lastNewline = prefix.lastIndexOf('\n')
 	const column = lastNewline === -1 ? prefix.length + 1 : prefix.length - lastNewline
 	return { line, column }
 }
 export function buildScriptLintResult(
 	code: string,
 	lang: ScriptPreviewLanguage
 ): ScriptLintResult {
 	const diagnostics: ScriptLintResult['errors'] = []
 	const compilerOptions = compilerOptionsForLanguage(lang)
 	if (compilerOptions) {
 		const sourceFile = ts.createSourceFile(
 			'script.ts',
 			code,
 			ts.ScriptTarget.ES2022,
 			true,
 			JS_LIKE_LANGUAGES.has(lang) ? ts.ScriptKind.TS : ts.ScriptKind.JS
 		)
 		const output = ts.transpileModule(code, {
 			compilerOptions,
 			fileName: sourceFile.fileName,
 			reportDiagnostics: true
 		})
 		for (const diagnostic of output.diagnostics ?? []) {
 			const start = diagnostic.start ?? 0
 			const length = diagnostic.length ?? 1
 			const { line, column } = getLineAndColumn(code, start)
 			const message = ts.flattenDiagnosticMessageText(diagnostic.messageText, '\n')
 			diagnostics.push({
 				startLineNumber: line,
 				startColumn: column,
 				endLineNumber: line,
 				endColumn: column + Math.max(1, length),
 				message,
 				severity: 8
 			} as ScriptLintResult['errors'][number])
 		}
 	}
 	if (!hasSupportedEntrypoint(code)) {
 		diagnostics.push({
 			startLineNumber: 1,
 			startColumn: 1,
 			endLineNumber: 1,
 			endColumn: 1,
 			message: 'Script must export a main or preprocessor function.',
 			severity: 8
 		} as ScriptLintResult['errors'][number])
 	}
 	return {
 		errorCount: diagnostics.length,
 		warningCount: 0,
 		errors: diagnostics,
 		warnings: []
 	}
 }
--- a/ai_evals/adapters/frontend/core/script/scriptEvalRunner.ts
+++ b/ai_evals/adapters/frontend/core/script/scriptEvalRunner.ts
@@ -0,0 +1,109 @@
 import { mkdtemp } from 'fs/promises'
 import { tmpdir } from 'os'
 import { join } from 'path'
 import type { AIProvider, AIProviderModel, ScriptLang } from '$lib/gen/types.gen'
 import type { ContextElement } from '../../../../../frontend/src/lib/components/copilot/chat/context'
 import {
 	prepareScriptSystemMessage,
 	prepareScriptTools,
 	prepareScriptUserMessage,
 	type ScriptChatHelpers
 } from '../../../../../frontend/src/lib/components/copilot/chat/script/core'
 import type { Tool as ProductionTool } from '../../../../../frontend/src/lib/components/copilot/chat/shared'
 import { createScriptFileHelpers, type ScriptEvalState } from './fileHelpers'
 import { runEval } from '../shared'
 import type { ModeRunContext } from '../../../../core/types'
 import type { TokenUsage } from '../shared/types'
 export interface ScriptEvalResult {
 	success: boolean
 	script: ScriptEvalState
 	error?: string
 	assistantMessageCount: number
 	toolCallCount: number
 	toolsUsed: string[]
 	tokenUsage: TokenUsage
 }
 export interface ScriptEvalOptions {
 	initialScript: ScriptEvalState
 	model?: string
 	maxIterations?: number
 	provider?: AIProvider
 	workspaceRoot?: string
 	runContext?: ModeRunContext
 }
 function resolveModelProvider(
 	model: string,
 	provider?: AIProvider
 ): AIProviderModel {
 	if (provider) {
 		return { provider, model }
 	}
 	if (model.startsWith('claude')) {
 		return { provider: 'anthropic', model }
 	}
 	return { provider: 'openai', model }
 }
 export async function runScriptEval(
 	userPrompt: string,
 	apiKey: string,
 	options: ScriptEvalOptions
 ): Promise<ScriptEvalResult> {
 	const workspaceRoot =
 		options.workspaceRoot ?? (await mkdtemp(join(tmpdir(), 'wmill-frontend-script-benchmark-')))
 	const { helpers, getScript, cleanup } = await createScriptFileHelpers(
 		options.initialScript,
 		workspaceRoot
 	)
 	try {
 		const model = options.model ?? 'claude-haiku-4-5-20251001'
 		const modelProvider = resolveModelProvider(model, options.provider)
 		const selectedContext: ContextElement[] = []
 		const systemMessage = prepareScriptSystemMessage(
 			modelProvider,
 			options.initialScript.lang,
 			{}
 		)
 		const tools = prepareScriptTools(
 			modelProvider,
 			options.initialScript.lang,
 			selectedContext
 		) as ProductionTool<ScriptChatHelpers>[]
 		const userMessage = prepareScriptUserMessage(userPrompt, selectedContext)
 		const rawResult = await runEval({
 			userPrompt,
 			systemMessage,
 			userMessage,
 			tools,
 			helpers,
 			apiKey,
 			getOutput: getScript,
 			onAssistantMessageStart: options.runContext?.onAssistantMessageStart,
 			onAssistantToken: options.runContext?.onAssistantChunk,
 			onAssistantMessageEnd: options.runContext?.onAssistantMessageEnd,
 			options: {
 				maxIterations: options.maxIterations,
 				model,
 				workspace: workspaceRoot,
 				provider: modelProvider.provider
 			}
 		})
 		return {
 			script: rawResult.output,
 			success: rawResult.success,
 			error: rawResult.error,
 			assistantMessageCount: rawResult.iterations,
 			toolCallCount: rawResult.toolCallsCount,
 			toolsUsed: rawResult.toolsCalled,
 			tokenUsage: rawResult.tokenUsage
 		}
 	} finally {
 		await cleanup()
 	}
 }
--- a/frontend/src/lib/components/copilot/chat/tests/shared/baseEvalRunner.ts
+++ b/frontend/src/lib/components/copilot/chat/tests/shared/baseEvalRunner.ts
@@ -1,29 +1,19 @@
 import OpenAI from 'openai'
 import Anthropic from '@anthropic-ai/sdk'
 import type {
 	ChatCompletionMessageParam,
 	ChatCompletionSystemMessageParam
 } from 'openai/resources/chat/completions.mjs'
-import type { AIProvider, AIProviderModel } from '$lib/gen/types.gen'
-import type { TokenUsage, ToolCallDetail, EvalRunnerOptions } from './types'
-import type { Tool } from './baseVariants'
-import { runChatLoop, type ChatClients } from '../../chatLoop'
-import type { Tool as ProductionTool, ToolCallbacks } from '../../shared'
-
-/**
- * Result from a single eval run (before domain-specific evaluation).
- */
-export interface RawEvalResult<TOutput> {
-	success: boolean
-	output: TOutput
-	error?: string
-	tokenUsage: TokenUsage
-	toolCallsCount: number
-	toolsCalled: string[]
-	toolCallDetails: ToolCallDetail[]
-	iterations: number
-	messages: ChatCompletionMessageParam[]
-}
+import type { AIProviderModel } from '$lib/gen/types.gen'
+import type { TokenUsage, ToolCallDetail, EvalRunnerOptions, RawEvalResult } from './types'
+import { runChatLoop, type ChatClients } from '../../../../../frontend/src/lib/components/copilot/chat/chatLoop'
+import type {
+	Tool as ProductionTool,
+	ToolCallbacks
+} from '../../../../../frontend/src/lib/components/copilot/chat/shared'
+import {
+	createEvalClients,
+	type FrontendEvalProvider,
+	resolveEvalModelProvider
+} from './providerConfig'
 /**
 * Parameters for running a base evaluation.
@@ -38,7 +28,7 @@ export interface RunEvalParams<THelpers, TOutput> {
 	/** Tool definitions for the LLM API (unused — derived from tools) */
 	toolDefs?: unknown
 	/** Full tool implementations for execution */
-	tools: Tool<THelpers>[]
+	tools: ProductionTool<THelpers>[]
 	/** Domain-specific helpers for tool execution */
 	helpers: THelpers
 	/** API key for the provider */
@@ -47,35 +37,9 @@ export interface RunEvalParams<THelpers, TOutput> {
 	getOutput: () => TOutput
 	/** Optional configuration */
 	options?: EvalRunnerOptions
+	onAssistantMessageStart?: () => void
+	onAssistantToken?: (token: string) => void
+	onAssistantMessageEnd?: () => void
 }
-
-/**
- * Creates SDK clients for the given provider.
- */
-function createEvalClients(provider: AIProvider, apiKey: string): ChatClients {
-	if (provider === 'anthropic') {
-		return {
-			openai: new OpenAI({ apiKey: 'unused' }),
-			anthropic: new Anthropic({ apiKey })
-		}
-	}
-	return {
-		openai: new OpenAI({ apiKey }),
-		anthropic: new Anthropic({ apiKey: 'unused' })
-	}
-}
-/**
- * Resolves model string to AIProviderModel.
- */
-function resolveModelProvider(
-	model: string,
-	provider?: AIProvider
-): AIProviderModel {
-	if (provider) return { provider, model }
-	if (model.startsWith('claude')) return { provider: 'anthropic', model }
-	if (model.startsWith('gpt') || model.startsWith('o')) return { provider: 'openai', model }
-	return { provider: 'openai', model }
-}
 /**
@@ -92,16 +56,23 @@ export async function runEval<THelpers, TOutput>(
 		helpers,
 		apiKey,
 		getOutput,
-		options
+		options,
+		onAssistantMessageStart,
+		onAssistantToken,
+		onAssistantMessageEnd
 	} = params
+	let shouldEmitMessageStart = true
 	const model = options?.model ?? 'gpt-4o'
 	const maxIterations = options?.maxIterations ?? 20
 	const workspace = options?.workspace ?? 'test-workspace'
 	const provider = options?.provider
-	const modelProvider = resolveModelProvider(model, provider)
-	const clients = createEvalClients(modelProvider.provider, apiKey)
+	const modelProvider = resolveEvalModelProvider(
+		model,
+		provider as FrontendEvalProvider | undefined
+	) as AIProviderModel
+	const clients = createEvalClients(modelProvider.provider, apiKey) as ChatClients
 	const messages: ChatCompletionMessageParam[] = [userMessage]
 	let toolCallsCount = 0
@@ -128,7 +99,7 @@ export async function runEval<THelpers, TOutput>(
 			}
 			return tool.fn(p)
 		}
-	})) as ProductionTool<THelpers>[]
+	}))
 	// No-op callbacks for eval
 	const callbacks: ToolCallbacks & {
@@ -137,8 +108,19 @@ export async function runEval<THelpers, TOutput>(
 	} = {
 		setToolStatus: () => {},
 		removeToolStatus: () => {},
-		onNewToken: () => {},
-		onMessageEnd: () => {}
+		onNewToken: (token: string) => {
+			if (shouldEmitMessageStart) {
+				onAssistantMessageStart?.()
+				shouldEmitMessageStart = false
+			}
+			onAssistantToken?.(token)
+		},
+		onMessageEnd: () => {
+			if (!shouldEmitMessageStart) {
+				onAssistantMessageEnd?.()
+			}
+			shouldEmitMessageStart = true
+		}
 	}
 	const abortController = new AbortController()
@@ -161,7 +143,7 @@ export async function runEval<THelpers, TOutput>(
 		return {
 			success: true,
 			output: getOutput(),
-			tokenUsage: { prompt: 0, completion: 0, total: 0 },
+			tokenUsage: result.tokenUsage,
 			toolCallsCount,
 			toolsCalled,
 			toolCallDetails,
--- a/ai_evals/adapters/frontend/core/shared/index.ts
+++ b/ai_evals/adapters/frontend/core/shared/index.ts
@@ -0,0 +1,3 @@
 export type { TokenUsage, ToolCallDetail, EvalRunnerOptions, RawEvalResult } from './types'
 export type { RunEvalParams } from './baseEvalRunner'
 export { runEval } from './baseEvalRunner'
--- a/ai_evals/adapters/frontend/core/shared/providerConfig.test.ts
+++ b/ai_evals/adapters/frontend/core/shared/providerConfig.test.ts
@@ -0,0 +1,41 @@
 import { describe, expect, it } from "bun:test";
 import {
  buildOpenAICompatibleClientOptions,
  resolveEvalModelProvider,
 } from "./providerConfig";
 describe("buildOpenAICompatibleClientOptions", () => {
  it("adds Gemini's OpenAI-compatible base URL and client header", () => {
    const options = buildOpenAICompatibleClientOptions("googleai", "gemini-test-key");
    expect(options).toMatchObject({
      apiKey: "gemini-test-key",
      baseURL: "https://generativelanguage.googleapis.com/v1beta/openai/",
      defaultHeaders: {
        "x-goog-api-client": "windmill-ai-evals/1.0",
      },
    });
  });
  it("keeps the default OpenAI-compatible config for OpenAI", () => {
    expect(buildOpenAICompatibleClientOptions("openai", "openai-test-key")).toEqual({
      apiKey: "openai-test-key",
    });
  });
 });
 describe("resolveEvalModelProvider", () => {
  it("infers googleai from Gemini model ids", () => {
    expect(resolveEvalModelProvider("gemini-2.5-flash")).toEqual({
      provider: "googleai",
      model: "gemini-2.5-flash",
    });
  });
  it("preserves an explicit provider", () => {
    expect(resolveEvalModelProvider("gemini-2.5-pro", "googleai")).toEqual({
      provider: "googleai",
      model: "gemini-2.5-pro",
    });
  });
 });
--- a/ai_evals/adapters/frontend/core/shared/providerConfig.ts
+++ b/ai_evals/adapters/frontend/core/shared/providerConfig.ts
@@ -0,0 +1,71 @@
 import Anthropic from "@anthropic-ai/sdk";
 import OpenAI from "openai";
 import type { FrontendEvalModelConfig } from "../../../../core/models";
 export type FrontendEvalProvider = FrontendEvalModelConfig["provider"];
 export interface EvalClients {
  openai: OpenAI;
  anthropic: Anthropic;
 }
 export interface ResolvedEvalModelProvider {
  provider: FrontendEvalProvider;
  model: string;
 }
 const GEMINI_OPENAI_BASE_URL = "https://generativelanguage.googleapis.com/v1beta/openai/";
 const GEMINI_GOOG_API_CLIENT = "windmill-ai-evals/1.0";
 export function buildOpenAICompatibleClientOptions(
  provider: Exclude<FrontendEvalProvider, "anthropic">,
  apiKey: string
 ): ConstructorParameters<typeof OpenAI>[0] {
  if (provider === "googleai") {
    return {
      apiKey,
      baseURL: GEMINI_OPENAI_BASE_URL,
      defaultHeaders: {
        "x-goog-api-client": GEMINI_GOOG_API_CLIENT,
      },
    };
  }
  return { apiKey };
 }
 export function createEvalClients(
  provider: FrontendEvalProvider,
  apiKey: string
 ): EvalClients {
  if (provider === "anthropic") {
    return {
      openai: new OpenAI({ apiKey: "unused" }),
      anthropic: new Anthropic({ apiKey }),
    };
  }
  return {
    openai: new OpenAI(buildOpenAICompatibleClientOptions(provider, apiKey)),
    anthropic: new Anthropic({ apiKey: "unused" }),
  };
 }
 export function resolveEvalModelProvider(
  model: string,
  provider?: FrontendEvalProvider
 ): ResolvedEvalModelProvider {
  if (provider) {
    return { provider, model };
  }
  if (model.startsWith("claude")) {
    return { provider: "anthropic", model };
  }
  if (model.startsWith("gemini")) {
    return { provider: "googleai", model };
  }
  if (model.startsWith("gpt") || model.startsWith("o")) {
    return { provider: "openai", model };
  }
  return { provider: "openai", model };
 }
--- a/ai_evals/adapters/frontend/core/shared/types.ts
+++ b/ai_evals/adapters/frontend/core/shared/types.ts
@@ -0,0 +1,32 @@
 import type { ChatCompletionMessageParam } from 'openai/resources/chat/completions.mjs'
 import type { AIProvider } from '$lib/gen/types.gen'
 export interface TokenUsage {
 	prompt: number
 	completion: number
 	total: number
 }
 export interface ToolCallDetail {
 	name: string
 	arguments: Record<string, unknown>
 }
 export interface EvalRunnerOptions {
 	maxIterations?: number
 	model?: string
 	workspace?: string
 	provider?: AIProvider
 }
 export interface RawEvalResult<TOutput> {
 	success: boolean
 	output: TOutput
 	error?: string
 	tokenUsage: TokenUsage
 	toolCallsCount: number
 	toolsCalled: string[]
 	toolCallDetails: ToolCallDetail[]
 	iterations: number
 	messages: ChatCompletionMessageParam[]
 }
--- a/ai_evals/adapters/frontend/mockBackend.ts
+++ b/ai_evals/adapters/frontend/mockBackend.ts
@@ -0,0 +1,270 @@
 import { randomUUID } from 'node:crypto'
 import type { CompletedJob, Flow, Script } from '../../../frontend/src/lib/gen'
 import type { ScriptLang } from '../../../frontend/src/lib/gen/types.gen'
 import { buildScriptLintResult } from './core/script/preview'
 const BENCHMARK_TIMESTAMP = '1970-01-01T00:00:00.000Z'
 export interface BenchmarkWorkspaceScript {
 	path: string
 	summary: string
 	description?: string
 	language: Script['language']
 	schema?: Record<string, unknown>
 	content: string
 }
 export interface BenchmarkWorkspaceFlow {
 	path: string
 	summary: string
 	description?: string
 	schema?: Record<string, unknown>
 	value: Flow['value']
 }
 export interface BenchmarkWorkspaceRunnables {
 	scripts?: BenchmarkWorkspaceScript[]
 	flows?: BenchmarkWorkspaceFlow[]
 }
 type BenchmarkCompletedJob = CompletedJob & { type: 'CompletedJob' }
 const benchmarkWorkspaces = new Set<string>()
 const benchmarkWorkspaceRunnables = new Map<string, BenchmarkWorkspaceRunnables>()
 const benchmarkJobs = new Map<string, { workspace: string; job: BenchmarkCompletedJob }>()
 export function resetBenchmarkMockBackend(): void {
 	benchmarkWorkspaces.clear()
 	benchmarkWorkspaceRunnables.clear()
 	benchmarkJobs.clear()
 }
 export function registerBenchmarkWorkspace(workspace: string): void {
 	benchmarkWorkspaces.add(workspace)
 }
 export function registerBenchmarkWorkspaceRunnables(
 	workspace: string,
 	runnables: BenchmarkWorkspaceRunnables
 ): void {
 	benchmarkWorkspaces.add(workspace)
 	benchmarkWorkspaceRunnables.set(workspace, runnables)
 }
 export function unregisterBenchmarkWorkspace(workspace: string): void {
 	benchmarkWorkspaces.delete(workspace)
 	benchmarkWorkspaceRunnables.delete(workspace)
 	for (const [jobId, entry] of benchmarkJobs.entries()) {
 		if (entry.workspace === workspace) {
 			benchmarkJobs.delete(jobId)
 		}
 	}
 }
 export function unregisterBenchmarkWorkspaceRunnables(workspace: string): void {
 	unregisterBenchmarkWorkspace(workspace)
 }
 export function hasBenchmarkWorkspace(workspace: string): boolean {
 	return benchmarkWorkspaces.has(workspace)
 }
 export function listBenchmarkScripts(workspace: string): Script[] | null {
 	const runnables = benchmarkWorkspaceRunnables.get(workspace)
 	if (!runnables) {
 		return null
 	}
 	return (runnables.scripts ?? []).map(buildBenchmarkScript)
 }
 export function listBenchmarkFlows(workspace: string): Flow[] | null {
 	const runnables = benchmarkWorkspaceRunnables.get(workspace)
 	if (!runnables) {
 		return null
 	}
 	return (runnables.flows ?? []).map(buildBenchmarkFlow)
 }
 export function getBenchmarkScriptByPath(workspace: string, path: string): Script | null {
 	const script = benchmarkWorkspaceRunnables
 		.get(workspace)
 		?.scripts?.find((entry) => entry.path === path)
 	return script ? buildBenchmarkScript(script) : null
 }
 export function getBenchmarkScriptByHash(workspace: string, hash: string): Script | null {
 	const script = benchmarkWorkspaceRunnables
 		.get(workspace)
 		?.scripts?.find((entry) => buildBenchmarkScriptHash(entry.path) === hash)
 	return script ? buildBenchmarkScript(script) : null
 }
 export function getBenchmarkFlowByPath(workspace: string, path: string): Flow | null {
 	const flow = benchmarkWorkspaceRunnables
 		.get(workspace)
 		?.flows?.find((entry) => entry.path === path)
 	return flow ? buildBenchmarkFlow(flow) : null
 }
 export function createBenchmarkCompletedJob(input: {
 	workspace: string
 	jobKind: CompletedJob['job_kind']
 	success?: boolean
 	result?: unknown
 	logs?: string
 	scriptPath?: string
 	scriptHash?: string
 	args?: Record<string, unknown>
 }): string {
 	const jobId = `benchmark-job-${randomUUID()}`
 	const now = new Date().toISOString()
 	const job: BenchmarkCompletedJob = {
 		type: 'CompletedJob',
 		id: jobId,
 		workspace_id: input.workspace,
 		created_by: 'ai-evals',
 		created_at: now,
 		started_at: now,
 		completed_at: now,
 		duration_ms: 0,
 		success: input.success ?? true,
 		script_path: input.scriptPath,
 		script_hash: input.scriptHash,
 		args: input.args,
 		result: input.result,
 		logs: input.logs,
 		canceled: false,
 		job_kind: input.jobKind,
 		permissioned_as: 'u/ai-evals',
 		is_flow_step: false,
 		is_skipped: false,
 		email: 'ai-evals@local',
 		visible_to_owner: true,
 		tag: 'benchmark'
 	}
 	benchmarkJobs.set(jobId, { workspace: input.workspace, job })
 	return jobId
 }
 export function getBenchmarkCompletedJob(
 	workspace: string,
 	jobId: string
 ): BenchmarkCompletedJob | null {
 	const entry = benchmarkJobs.get(jobId)
 	if (!entry || entry.workspace !== workspace) {
 		return null
 	}
 	return structuredClone(entry.job)
 }
 export function runBenchmarkScriptPreview(input: {
 	workspace: string
 	requestBody: {
 		content?: string
 		language?: ScriptLang | 'bunnative'
 		args?: Record<string, unknown>
 		path?: string
 	}
 }): string {
 	const content = input.requestBody.content ?? ''
 	const language = input.requestBody.language ?? 'bun'
 	const lintResult = buildScriptLintResult(content, language)
 	const success = lintResult.errorCount === 0
 	return createBenchmarkCompletedJob({
 		workspace: input.workspace,
 		jobKind: 'preview',
 		success,
 		scriptPath: input.requestBody.path,
 		args: input.requestBody.args,
 		result: success
 			? {
 					path: input.requestBody.path,
 					args: input.requestBody.args ?? {},
 					validated: true
 				}
 			: {
 					path: input.requestBody.path,
 					args: input.requestBody.args ?? {},
 					errorCount: lintResult.errorCount,
 					errors: lintResult.errors.map((entry) => ({
 						line: entry.startLineNumber,
 						message: entry.message
 					}))
 				}
 	})
 }
 export function runBenchmarkFlowByPath(input: {
 	workspace: string
 	path: string
 	args?: Record<string, unknown>
 }): string {
 	const flow = getBenchmarkFlowByPath(input.workspace, input.path)
 	return createBenchmarkCompletedJob({
 		workspace: input.workspace,
 		jobKind: 'flowpreview',
 		success: flow !== null,
 		args: input.args,
 		result:
 			flow !== null
 				? {
 						path: input.path,
 						args: input.args ?? {},
 						mocked: true
 					}
 				: {
 						error: `Flow "${input.path}" not found in benchmark workspace`
 					},
 		logs:
 			flow !== null
 				? 'Mock benchmark flow run completed successfully.'
 				: `Flow "${input.path}" not found in benchmark workspace.`
 	})
 }
 function buildBenchmarkScriptHash(path: string): string {
 	return `benchmark:${path}`
 }
 function buildBenchmarkScript(script: BenchmarkWorkspaceScript): Script {
 	return {
 		workspace_id: 'benchmark',
 		hash: buildBenchmarkScriptHash(script.path),
 		path: script.path,
 		parent_hashes: [],
 		summary: script.summary,
 		description: script.description ?? '',
 		content: script.content,
 		created_by: 'benchmark',
 		created_at: BENCHMARK_TIMESTAMP,
 		archived: false,
 		schema: script.schema ?? {},
 		deleted: false,
 		is_template: false,
 		extra_perms: {},
 		language: script.language,
 		kind: 'script',
 		starred: false,
 		has_preprocessor: false,
 		modules: null
 	}
 }
 function buildBenchmarkFlow(flow: BenchmarkWorkspaceFlow): Flow {
 	return {
 		path: flow.path,
 		summary: flow.summary,
 		description: flow.description ?? '',
 		value: flow.value,
 		schema: flow.schema ?? {},
 		edited_by: 'benchmark',
 		edited_at: BENCHMARK_TIMESTAMP,
 		archived: false,
 		extra_perms: {}
 	} as Flow
 }
--- a/ai_evals/adapters/frontend/progress.ts
+++ b/ai_evals/adapters/frontend/progress.ts
@@ -0,0 +1,133 @@
 export type FrontendBenchmarkProgressSurface = 'flow' | 'app' | 'script'
 export type FrontendBenchmarkProgressEvent =
 	| {
 			type: 'run-start'
 			surface: FrontendBenchmarkProgressSurface
 			totalCases: number
 			runs: number
 			concurrency: number
 	  }
 	| {
 			type: 'attempt-start'
 			surface: FrontendBenchmarkProgressSurface
 			caseId: string
 			caseNumber: number
 			totalCases: number
 			attempt: number
 			runs: number
 	  }
 	| {
 			type: 'attempt-finish'
 			surface: FrontendBenchmarkProgressSurface
 			caseId: string
 			caseNumber: number
 			totalCases: number
 			attempt: number
 			runs: number
 			passed: boolean
 			durationMs: number
 			judgeScore: number | null
 			error: string | null
 	  }
 	| {
 			type: 'assistant-message-start'
 			surface: FrontendBenchmarkProgressSurface
 			caseId: string
 			caseNumber: number
 			totalCases: number
 			attempt: number
 			runs: number
 	  }
 	| {
 			type: 'assistant-chunk'
 			surface: FrontendBenchmarkProgressSurface
 			caseId: string
 			caseNumber: number
 			totalCases: number
 			attempt: number
 			runs: number
 			chunk: string
 	  }
 	| {
 			type: 'assistant-message-end'
 			surface: FrontendBenchmarkProgressSurface
 			caseId: string
 			caseNumber: number
 			totalCases: number
 			attempt: number
 			runs: number
 	  }
 export const FRONTEND_BENCHMARK_PROGRESS_PREFIX = 'WMILL_FRONTEND_AI_EVAL_PROGRESS '
 export function emitFrontendBenchmarkProgress(event: FrontendBenchmarkProgressEvent): void {
 	process.stderr.write(
 		`${FRONTEND_BENCHMARK_PROGRESS_PREFIX}${JSON.stringify(event)}\n`
 	)
 }
 export function parseFrontendBenchmarkProgressLine(
 	line: string
 ): FrontendBenchmarkProgressEvent | null {
 	if (!line.startsWith(FRONTEND_BENCHMARK_PROGRESS_PREFIX)) {
 		return null
 	}
 	try {
 		const parsed = JSON.parse(
 			line.slice(FRONTEND_BENCHMARK_PROGRESS_PREFIX.length)
 		) as FrontendBenchmarkProgressEvent
 		return parsed?.type ? parsed : null
 	} catch {
 		return null
 	}
 }
 export function formatFrontendBenchmarkProgressEvent(
 	event: FrontendBenchmarkProgressEvent
 ): string {
 	switch (event.type) {
 		case 'run-start':
 			return `Running ${event.surface}: ${event.totalCases} cases x ${event.runs} run${event.runs === 1 ? '' : 's'}, concurrency ${event.concurrency}`
 		case 'attempt-start':
 			return `${formatCasePrefix(event.caseNumber, event.totalCases)} ${event.caseId} attempt ${event.attempt}/${event.runs}...`
 		case 'attempt-finish': {
 			const parts = [
 				`${formatCasePrefix(event.caseNumber, event.totalCases)} ${event.caseId} attempt ${event.attempt}/${event.runs} ${event.passed ? 'pass' : 'fail'}`,
 				formatDuration(event.durationMs)
 			]
 			if (event.judgeScore !== null) {
 				parts.push(`judge ${formatNumber(event.judgeScore)}`)
 			}
 			if (event.error) {
 				parts.push(truncateSingleLine(event.error, 120))
 			}
 			return parts.join(' | ')
 		}
 		case 'assistant-message-start':
 		case 'assistant-chunk':
 		case 'assistant-message-end':
 			return ''
 	}
 }
 function formatCasePrefix(caseNumber: number, totalCases: number): string {
 	return `[${caseNumber}/${totalCases}]`
 }
 function formatDuration(durationMs: number): string {
 	return `${formatNumber(durationMs / 1000)}s`
 }
 function formatNumber(value: number): string {
 	return Number.isInteger(value) ? String(value) : value.toFixed(1)
 }
 function truncateSingleLine(value: string, maxLength: number): string {
 	const normalized = value.replace(/\s+/g, ' ').trim()
 	if (normalized.length <= maxLength) {
 		return normalized
 	}
 	return `${normalized.slice(0, Math.max(0, maxLength - 3))}...`
 }
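
The progress helpers above frame each event as one prefixed JSON line on stderr. A minimal, self-contained sketch of that round trip follows. The event type is a hypothetical, trimmed-down stand-in for `FrontendBenchmarkProgressEvent`, and `encodeProgressLine` / `decodeProgressLine` are illustrative names, not functions from this commit.

```typescript
// Hypothetical, reduced event union used only for this sketch.
type SketchProgressEvent =
	| { type: 'run-start'; totalCases: number }
	| { type: 'assistant-chunk'; chunk: string }

// Same prefix string as FRONTEND_BENCHMARK_PROGRESS_PREFIX in progress.ts.
const PREFIX = 'WMILL_FRONTEND_AI_EVAL_PROGRESS '

// Serialize one event as a single prefixed line.
// JSON.stringify escapes newlines inside string values, so the only raw
// newline in the result is the trailing terminator.
function encodeProgressLine(event: SketchProgressEvent): string {
	return `${PREFIX}${JSON.stringify(event)}\n`
}

// Parse one line (without its trailing newline).
// Returns null for unprefixed lines, malformed JSON, or payloads lacking `type`.
function decodeProgressLine(line: string): SketchProgressEvent | null {
	if (!line.startsWith(PREFIX)) {
		return null
	}
	try {
		const parsed = JSON.parse(line.slice(PREFIX.length)) as SketchProgressEvent
		return parsed?.type ? parsed : null
	} catch {
		return null
	}
}
```

Because the payload is JSON, a streamed assistant chunk that itself contains line breaks still occupies exactly one stderr line, which is what lets the parent process split on `\n` safely.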
--- a/ai_evals/adapters/frontend/runtime.ts
+++ b/ai_evals/adapters/frontend/runtime.ts
@@ -0,0 +1,216 @@
 import { spawn } from 'node:child_process'
 import { mkdtemp, readFile, rm } from 'node:fs/promises'
 import { tmpdir } from 'node:os'
 import path from 'node:path'
 import { fileURLToPath } from 'node:url'
 import {
 	formatFrontendBenchmarkProgressEvent,
 	parseFrontendBenchmarkProgressLine
 } from './progress'
 import type { BenchmarkRunResult } from '../../core/types'
 const REPO_ROOT = fileURLToPath(new URL('../../../', import.meta.url))
 const FRONTEND_DIR = path.join(REPO_ROOT, 'frontend')
 const FRONTEND_BENCHMARK_TEST = '../ai_evals/adapters/frontend/vitestAdapter.test.ts'
 const FRONTEND_BENCHMARK_CONFIG = '../ai_evals/adapters/frontend/vitest.config.ts'
 export type FrontendMode = 'flow' | 'app' | 'script'
 export async function runFrontendBenchmarkAdapter(input: {
 	mode: FrontendMode
 	caseIds: string[]
 	runs: number
 	model?: string
 	verbose?: boolean
 }): Promise<BenchmarkRunResult> {
 	const tempDir = await mkdtemp(path.join(tmpdir(), 'wmill-frontend-benchmark-'))
 	const outputPath = path.join(tempDir, 'result.json')
 	try {
 		await runVitestBenchmark(
 			path.join(FRONTEND_DIR, 'node_modules', '.bin', 'vitest'),
 			[
 				'run',
 				FRONTEND_BENCHMARK_TEST,
 				'--project',
 				'server',
 				'--config',
 				FRONTEND_BENCHMARK_CONFIG
 			],
 			{
 				cwd: FRONTEND_DIR,
 				env: {
 					...process.env,
 					BROWSERSLIST_IGNORE_OLD_DATA: '1',
 					WMILL_FRONTEND_AI_EVAL_OUTPUT_PATH: outputPath,
 					WMILL_FRONTEND_AI_EVAL_MODE: input.mode,
 					WMILL_FRONTEND_AI_EVAL_CASE_IDS: JSON.stringify(input.caseIds),
 					WMILL_FRONTEND_AI_EVAL_RUNS: String(input.runs),
 					WMILL_FRONTEND_AI_EVAL_MODEL: input.model ?? '',
 					WMILL_FRONTEND_AI_EVAL_PROGRESS: '1',
 					WMILL_FRONTEND_AI_EVAL_VERBOSE: input.verbose ? '1' : '0'
 				}
 			}
 		)
 		const raw = await readFile(outputPath, 'utf8')
 		return JSON.parse(raw) as BenchmarkRunResult
 	} catch (error) {
 		throw new Error(`Frontend benchmark adapter failed:\n${toErrorMessage(error)}`)
 	} finally {
 		await rm(tempDir, { recursive: true, force: true })
 	}
 }
 async function runVitestBenchmark(
 	command: string,
 	args: string[],
 	options: {
 		cwd: string
 		env: NodeJS.ProcessEnv
 	}
 ): Promise<void> {
 	const child = spawn(command, args, {
 		cwd: options.cwd,
 		env: options.env,
 		stdio: ['ignore', 'pipe', 'pipe']
 	})
 	let stdout = ''
 	let stderr = ''
 	let stderrLineBuffer = ''
 	let assistantStreamOpen = false
 	child.stdout?.setEncoding('utf8')
 	child.stdout?.on('data', (chunk: string) => {
 		stdout += chunk
 	})
 	child.stderr?.setEncoding('utf8')
 	child.stderr?.on('data', (chunk: string) => {
 		stderrLineBuffer += chunk
 		const { remainder, passthrough, nextAssistantStreamOpen } = drainProgressLines(
 			stderrLineBuffer,
 			assistantStreamOpen
 		)
 		stderrLineBuffer = remainder
 		stderr += passthrough
 		assistantStreamOpen = nextAssistantStreamOpen
 	})
 	await new Promise<void>((resolve, reject) => {
 		child.once('error', reject)
 		child.once('close', (code) => {
 			if (stderrLineBuffer.length > 0) {
 				const {
 					remainder,
 					passthrough,
 					nextAssistantStreamOpen
 				} = drainProgressLines(`${stderrLineBuffer}\n`, assistantStreamOpen)
 				stderrLineBuffer = remainder
 				stderr += passthrough
 				assistantStreamOpen = nextAssistantStreamOpen
 			}
 			if (code === 0) {
 				if (assistantStreamOpen) {
 					process.stderr.write('\n')
 				}
 				resolve()
 				return
 			}
 			const details = [`vitest exited with code ${code}`, stdout, stderr].filter(Boolean).join('\n')
 			reject(new Error(details))
 		})
 	})
 }
 function drainProgressLines(
 	buffer: string,
 	initialAssistantStreamOpen: boolean
 ): {
 	remainder: string
 	passthrough: string
 	nextAssistantStreamOpen: boolean
 } {
 	let remainder = buffer
 	let passthrough = ''
 	let assistantStreamOpen = initialAssistantStreamOpen
 	while (true) {
 		const newlineIndex = remainder.indexOf('\n')
 		if (newlineIndex === -1) {
 			return { remainder, passthrough, nextAssistantStreamOpen: assistantStreamOpen }
 		}
 		const line = remainder.slice(0, newlineIndex).replace(/\r$/, '')
 		remainder = remainder.slice(newlineIndex + 1)
 		const progressEvent = parseFrontendBenchmarkProgressLine(line)
 		if (progressEvent) {
 			if (progressEvent.type === 'assistant-message-start') {
 				if (assistantStreamOpen) {
 					process.stderr.write('\n')
 				}
 				process.stderr.write(
 					`${formatCasePrefix(progressEvent.caseNumber, progressEvent.totalCases)} ${progressEvent.caseId} attempt ${progressEvent.attempt}/${progressEvent.runs} assistant:\n`
 				)
 				assistantStreamOpen = true
 				continue
 			}
 			if (progressEvent.type === 'assistant-chunk') {
 				process.stderr.write(progressEvent.chunk)
 				continue
 			}
 			if (progressEvent.type === 'assistant-message-end') {
 				if (assistantStreamOpen) {
 					process.stderr.write('\n')
 				}
 				assistantStreamOpen = false
 				continue
 			}
 			if (assistantStreamOpen) {
 				process.stderr.write('\n')
 				assistantStreamOpen = false
 			}
 			process.stderr.write(`${formatFrontendBenchmarkProgressEvent(progressEvent)}\n`)
 			continue
 		}
 		if (shouldSuppressFrontendStderrLine(line)) {
 			continue
 		}
 		passthrough += `${line}\n`
 		process.stderr.write(`${line}\n`)
 	}
 }
 function formatCasePrefix(caseNumber: number, totalCases: number): string {
 	return `[${caseNumber}/${totalCases}]`
 }
 function shouldSuppressFrontendStderrLine(line: string): boolean {
 	return (
 		line.startsWith('[baseline-browser-mapping] ') ||
 		line.startsWith('Browserslist: browsers data (caniuse-lite) is ') ||
 		line.includes('update-browserslist-db@latest') ||
 		line.includes('update-db#readme')
 	)
 }
 function toErrorMessage(error: unknown): string {
 	if (error instanceof Error) {
 		return error.message
 	}
 	return String(error)
 }
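
`drainProgressLines` relies on a common buffering pattern: stderr arrives in arbitrary chunks, so only text up to the last newline is processed and the tail is carried over to the next chunk. The sketch below isolates that pattern. `splitCompleteLines` is a hypothetical helper name, and it leaves out the progress parsing and stderr passthrough that the real function performs.

```typescript
// Split a buffer into complete lines plus an unterminated remainder.
// A trailing '\r' is stripped from each line so CRLF output is handled too.
function splitCompleteLines(buffer: string): { lines: string[]; remainder: string } {
	const lines: string[] = []
	let remainder = buffer
	while (true) {
		const newlineIndex = remainder.indexOf('\n')
		if (newlineIndex === -1) {
			// No complete line left: keep the tail for the next chunk.
			return { lines, remainder }
		}
		lines.push(remainder.slice(0, newlineIndex).replace(/\r$/, ''))
		remainder = remainder.slice(newlineIndex + 1)
	}
}
```

Feeding the chunks `'foo\nba'` and then `'r\r\nbaz'` yields `foo` after the first chunk and `bar` after the second, with `baz` left in the remainder. That leftover is why the `close` handler above appends a final `\n` before draining one last time.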
--- a/ai_evals/adapters/frontend/vitest.config.ts
+++ b/ai_evals/adapters/frontend/vitest.config.ts
@@ -0,0 +1,28 @@
 import { fileURLToPath } from 'node:url'
 import frontendConfig from '../../../frontend/vite.config.js'
 const FRONTEND_VITE_CONFIG_PATH = fileURLToPath(new URL('../../../frontend/vite.config.js', import.meta.url))
 const FRONTEND_TEST_SETUP_PATH = fileURLToPath(
 	new URL('../../../frontend/src/lib/test-setup.ts', import.meta.url)
 )
 const ADAPTER_TEST_PATH = fileURLToPath(new URL('./vitestAdapter.test.ts', import.meta.url))
 const config = {
 	...frontendConfig,
 	test: {
 		...frontendConfig.test,
 		projects: [
 			{
 				extends: FRONTEND_VITE_CONFIG_PATH,
 				test: {
 					name: 'server',
 					environment: 'node',
 					include: [ADAPTER_TEST_PATH],
 					setupFiles: [FRONTEND_TEST_SETUP_PATH]
 				}
 			}
 		]
 	}
 }
 export default config
--- a/ai_evals/adapters/frontend/vitestAdapter.test.ts
+++ b/ai_evals/adapters/frontend/vitestAdapter.test.ts
@@ -0,0 +1,165 @@
 import { expect, it, vi } from 'vitest'
 // @ts-ignore - Node.js fs/promises
 import { mkdir, writeFile } from 'fs/promises'
 // @ts-ignore - Node.js path
 import { dirname, resolve } from 'path'
 vi.mock('monaco-editor', () => ({
 	editor: {},
 	languages: {},
 	KeyCode: {},
 	Uri: {
 		parse: (value: string) => ({ toString: () => value })
 	},
 	MarkerSeverity: {
 		Error: 8,
 		Warning: 4,
 		Info: 2,
 		Hint: 1
 	}
 }))
 vi.mock('@codingame/monaco-vscode-standalone-typescript-language-features', () => ({
 	getTypeScriptWorker: async () => async () => ({}),
 	typescriptVersion: 'test'
 }))
 vi.mock('@codingame/monaco-vscode-languages-service-override', () => ({
 	default: () => ({})
 }))
 vi.mock('$lib/components/vscode', () => ({}))
 vi.mock('$lib/gen', async () => {
 	const actual = await vi.importActual<any>('$lib/gen')
 	const {
 		getBenchmarkCompletedJob,
 		getBenchmarkFlowByPath,
 		getBenchmarkScriptByHash,
 		getBenchmarkScriptByPath,
 		hasBenchmarkWorkspace,
 		listBenchmarkFlows,
 		listBenchmarkScripts,
 		runBenchmarkFlowByPath,
 		runBenchmarkScriptPreview
 	} = await import('./mockBackend')
 	function wrapService<T extends object>(target: T, overrides: Record<string, unknown>): T {
 		return new Proxy(target, {
 			get(source, property, receiver) {
 				if (typeof property === 'string' && property in overrides) {
 					return overrides[property]
 				}
 				return Reflect.get(source, property, receiver)
 			}
 		})
 	}
 	return {
 		...actual,
 		ScriptService: wrapService(actual.ScriptService, {
 			listScripts: async (data: { workspace: string }) =>
 				hasBenchmarkWorkspace(data.workspace)
 					? (listBenchmarkScripts(data.workspace) ?? [])
 					: actual.ScriptService.listScripts(data),
 			getScriptByPath: async (data: { workspace: string; path: string }) => {
 				if (hasBenchmarkWorkspace(data.workspace)) {
 					const script = getBenchmarkScriptByPath(data.workspace, data.path)
 					if (!script) {
 						throw new Error(`Script "${data.path}" not found in benchmark workspace`)
 					}
 					return script
 				}
 				return actual.ScriptService.getScriptByPath(data)
 			},
 			getScriptByHash: async (data: { workspace: string; hash: string }) => {
 				if (hasBenchmarkWorkspace(data.workspace)) {
 					const script = getBenchmarkScriptByHash(data.workspace, data.hash)
 					if (!script) {
 						throw new Error(`Script hash "${data.hash}" not found in benchmark workspace`)
 					}
 					return script
 				}
 				return actual.ScriptService.getScriptByHash(data)
 			}
 		}),
 		FlowService: wrapService(actual.FlowService, {
 			listFlows: async (data: { workspace: string }) =>
 				hasBenchmarkWorkspace(data.workspace)
 					? (listBenchmarkFlows(data.workspace) ?? [])
 					: actual.FlowService.listFlows(data),
 			getFlowByPath: async (data: { workspace: string; path: string }) => {
 				if (hasBenchmarkWorkspace(data.workspace)) {
 					const flow = getBenchmarkFlowByPath(data.workspace, data.path)
 					if (!flow) {
 						throw new Error(`Flow "${data.path}" not found in benchmark workspace`)
 					}
 					return flow
 				}
 				return actual.FlowService.getFlowByPath(data)
 			}
 		}),
 		JobService: wrapService(actual.JobService, {
 			runScriptPreview: async (data: {
 				workspace: string
 				requestBody?: {
 					content?: string
 					language?: string
 					args?: Record<string, unknown>
 					path?: string
 				}
 			}) =>
 				hasBenchmarkWorkspace(data.workspace)
 					? runBenchmarkScriptPreview({
 							workspace: data.workspace,
 							requestBody: data.requestBody ?? {}
 						})
 					: actual.JobService.runScriptPreview(data),
 			runFlowByPath: async (data: {
 				workspace: string
 				path: string
 				requestBody?: Record<string, unknown>
 			}) =>
 				hasBenchmarkWorkspace(data.workspace)
 					? runBenchmarkFlowByPath({
 							workspace: data.workspace,
 							path: data.path,
 							args: data.requestBody
 						})
 					: actual.JobService.runFlowByPath(data),
 			getJob: async (data: { workspace: string; id: string }) => {
 				if (hasBenchmarkWorkspace(data.workspace)) {
 					const job = getBenchmarkCompletedJob(data.workspace, data.id)
 					if (!job) {
 						throw new Error(`Job "${data.id}" not found in benchmark workspace`)
 					}
 					return job
 				}
 				return actual.JobService.getJob(data)
 			}
 		})
 	}
 })
 const benchmarkOutputPath = process.env.WMILL_FRONTEND_AI_EVAL_OUTPUT_PATH
 const benchmarkIt = benchmarkOutputPath ? it : it.skip
 benchmarkIt(
 	'runs the frontend benchmark adapter from environment input',
 	async () => {
 		const { resetBenchmarkMockBackend } = await import('./mockBackend')
 		resetBenchmarkMockBackend()
 		const { runFrontendBenchmarkFromEnv } = await import('./benchmarkRunner')
 		try {
 			const payload = await runFrontendBenchmarkFromEnv()
 			const absoluteOutputPath = resolve(benchmarkOutputPath!)
 			await mkdir(dirname(absoluteOutputPath), { recursive: true })
 			await writeFile(absoluteOutputPath, JSON.stringify(payload, null, 2) + '\n', 'utf8')
 			expect(payload.cases.length).toBeGreaterThan(0)
 		} finally {
 			resetBenchmarkMockBackend()
 		}
 	},
 	600_000
 )
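
The `wrapService` helper in the `$lib/gen` mock uses a `Proxy` to override selected methods while every other property falls through to the real service. A standalone sketch of that technique follows. `realService` and its methods are hypothetical stand-ins, not the generated Windmill client.

```typescript
// Return a proxy that serves overridden properties first and delegates the rest.
function wrapService<T extends object>(target: T, overrides: Record<string, unknown>): T {
	return new Proxy(target, {
		get(source, property, receiver) {
			if (typeof property === 'string' && property in overrides) {
				return overrides[property]
			}
			return Reflect.get(source, property, receiver)
		}
	})
}

// Hypothetical service used only to illustrate the behavior.
const realService = {
	list: (): string[] => ['real'],
	get: (id: string): string => `real:${id}`
}

// Only `list` is replaced; `get` still reaches the original object.
const wrappedService = wrapService(realService, {
	list: (): string[] => ['mock']
})
```

Here `wrappedService.list()` returns the mocked value while `wrappedService.get('a')` still returns `'real:a'`. Keeping the override map separate from the target means the mock never mutates the real module export.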
--- a/ai_evals/bun.lock
+++ b/ai_evals/bun.lock
@@ -0,0 +1,313 @@
 {
  "lockfileVersion": 1,
  "configVersion": 1,
  "workspaces": {
    "": {
      "name": "windmill-ai-evals",
      "dependencies": {
        "@anthropic-ai/claude-agent-sdk": "^0.2.25",
        "@anthropic-ai/sdk": "^0.39.0",
        "commander": "^14.0.3",
        "openai": "^6.9.1",
        "yaml": "^2.8.3",
      },
      "devDependencies": {
        "@types/bun": "latest",
        "typescript": "^5.0.0",
      },
    },
  },
  "packages": {
    "@anthropic-ai/claude-agent-sdk": ["@anthropic-ai/claude-agent-sdk@0.2.87", "", { "dependencies": { "@anthropic-ai/sdk": "^0.74.0", "@modelcontextprotocol/sdk": "^1.27.1" }, "optionalDependencies": { "@img/sharp-darwin-arm64": "^0.34.2", "@img/sharp-darwin-x64": "^0.34.2", "@img/sharp-linux-arm": "^0.34.2", "@img/sharp-linux-arm64": "^0.34.2", "@img/sharp-linux-x64": "^0.34.2", "@img/sharp-linuxmusl-arm64": "^0.34.2", "@img/sharp-linuxmusl-x64": "^0.34.2", "@img/sharp-win32-arm64": "^0.34.2", "@img/sharp-win32-x64": "^0.34.2" }, "peerDependencies": { "zod": "^4.0.0" } }, "sha512-WWmgBPxPhBOvNT0ujI8vPTI2lK+w5YEkEZ/y1mH0EDkK/0kBnxVJNhCtG5vnueiAViwLoUOFn66pbkDiivijdA=="],
    "@anthropic-ai/sdk": ["@anthropic-ai/sdk@0.39.0", "", { "dependencies": { "@types/node": "^18.11.18", "@types/node-fetch": "^2.6.4", "abort-controller": "^3.0.0", "agentkeepalive": "^4.2.1", "form-data-encoder": "1.7.2", "formdata-node": "^4.3.2", "node-fetch": "^2.6.7" } }, "sha512-eMyDIPRZbt1CCLErRCi3exlAvNkBtRe+kW5vvJyef93PmNr/clstYgHhtvmkxN82nlKgzyGPCyGxrm0JQ1ZIdg=="],
    "@babel/runtime": ["@babel/runtime@7.29.2", "", {}, "sha512-JiDShH45zKHWyGe4ZNVRrCjBz8Nh9TMmZG1kh4QTK8hCBTWBi8Da+i7s1fJw7/lYpM4ccepSNfqzZ/QvABBi5g=="],
    "@hono/node-server": ["@hono/node-server@1.19.12", "", { "peerDependencies": { "hono": "^4" } }, "sha512-txsUW4SQ1iilgE0l9/e9VQWmELXifEFvmdA1j6WFh/aFPj99hIntrSsq/if0UWyGVkmrRPKA1wCeP+UCr1B9Uw=="],
    "@img/sharp-darwin-arm64": ["@img/sharp-darwin-arm64@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-darwin-arm64": "1.2.4" }, "os": "darwin", "cpu": "arm64" }, "sha512-imtQ3WMJXbMY4fxb/Ndp6HBTNVtWCUI0WdobyheGf5+ad6xX8VIDO8u2xE4qc/fr08CKG/7dDseFtn6M6g/r3w=="],
    "@img/sharp-darwin-x64": ["@img/sharp-darwin-x64@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-darwin-x64": "1.2.4" }, "os": "darwin", "cpu": "x64" }, "sha512-YNEFAF/4KQ/PeW0N+r+aVVsoIY0/qxxikF2SWdp+NRkmMB7y9LBZAVqQ4yhGCm/H3H270OSykqmQMKLBhBJDEw=="],
    "@img/sharp-libvips-darwin-arm64": ["@img/sharp-libvips-darwin-arm64@1.2.4", "", { "os": "darwin", "cpu": "arm64" }, "sha512-zqjjo7RatFfFoP0MkQ51jfuFZBnVE2pRiaydKJ1G/rHZvnsrHAOcQALIi9sA5co5xenQdTugCvtb1cuf78Vf4g=="],
    "@img/sharp-libvips-darwin-x64": ["@img/sharp-libvips-darwin-x64@1.2.4", "", { "os": "darwin", "cpu": "x64" }, "sha512-1IOd5xfVhlGwX+zXv2N93k0yMONvUlANylbJw1eTah8K/Jtpi15KC+WSiaX/nBmbm2HxRM1gZ0nSdjSsrZbGKg=="],
    "@img/sharp-libvips-linux-arm": ["@img/sharp-libvips-linux-arm@1.2.4", "", { "os": "linux", "cpu": "arm" }, "sha512-bFI7xcKFELdiNCVov8e44Ia4u2byA+l3XtsAj+Q8tfCwO6BQ8iDojYdvoPMqsKDkuoOo+X6HZA0s0q11ANMQ8A=="],
    "@img/sharp-libvips-linux-arm64": ["@img/sharp-libvips-linux-arm64@1.2.4", "", { "os": "linux", "cpu": "arm64" }, "sha512-excjX8DfsIcJ10x1Kzr4RcWe1edC9PquDRRPx3YVCvQv+U5p7Yin2s32ftzikXojb1PIFc/9Mt28/y+iRklkrw=="],
    "@img/sharp-libvips-linux-x64": ["@img/sharp-libvips-linux-x64@1.2.4", "", { "os": "linux", "cpu": "x64" }, "sha512-tJxiiLsmHc9Ax1bz3oaOYBURTXGIRDODBqhveVHonrHJ9/+k89qbLl0bcJns+e4t4rvaNBxaEZsFtSfAdquPrw=="],
    "@img/sharp-libvips-linuxmusl-arm64": ["@img/sharp-libvips-linuxmusl-arm64@1.2.4", "", { "os": "linux", "cpu": "arm64" }, "sha512-FVQHuwx1IIuNow9QAbYUzJ+En8KcVm9Lk5+uGUQJHaZmMECZmOlix9HnH7n1TRkXMS0pGxIJokIVB9SuqZGGXw=="],
    "@img/sharp-libvips-linuxmusl-x64": ["@img/sharp-libvips-linuxmusl-x64@1.2.4", "", { "os": "linux", "cpu": "x64" }, "sha512-+LpyBk7L44ZIXwz/VYfglaX/okxezESc6UxDSoyo2Ks6Jxc4Y7sGjpgU9s4PMgqgjj1gZCylTieNamqA1MF7Dg=="],
    "@img/sharp-linux-arm": ["@img/sharp-linux-arm@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-linux-arm": "1.2.4" }, "os": "linux", "cpu": "arm" }, "sha512-9dLqsvwtg1uuXBGZKsxem9595+ujv0sJ6Vi8wcTANSFpwV/GONat5eCkzQo/1O6zRIkh0m/8+5BjrRr7jDUSZw=="],
    "@img/sharp-linux-arm64": ["@img/sharp-linux-arm64@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-linux-arm64": "1.2.4" }, "os": "linux", "cpu": "arm64" }, "sha512-bKQzaJRY/bkPOXyKx5EVup7qkaojECG6NLYswgktOZjaXecSAeCWiZwwiFf3/Y+O1HrauiE3FVsGxFg8c24rZg=="],
    "@img/sharp-linux-x64": ["@img/sharp-linux-x64@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-linux-x64": "1.2.4" }, "os": "linux", "cpu": "x64" }, "sha512-MEzd8HPKxVxVenwAa+JRPwEC7QFjoPWuS5NZnBt6B3pu7EG2Ge0id1oLHZpPJdn3OQK+BQDiw9zStiHBTJQQQQ=="],
    "@img/sharp-linuxmusl-arm64": ["@img/sharp-linuxmusl-arm64@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-linuxmusl-arm64": "1.2.4" }, "os": "linux", "cpu": "arm64" }, "sha512-fprJR6GtRsMt6Kyfq44IsChVZeGN97gTD331weR1ex1c1rypDEABN6Tm2xa1wE6lYb5DdEnk03NZPqA7Id21yg=="],
    "@img/sharp-linuxmusl-x64": ["@img/sharp-linuxmusl-x64@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-linuxmusl-x64": "1.2.4" }, "os": "linux", "cpu": "x64" }, "sha512-Jg8wNT1MUzIvhBFxViqrEhWDGzqymo3sV7z7ZsaWbZNDLXRJZoRGrjulp60YYtV4wfY8VIKcWidjojlLcWrd8Q=="],
    "@img/sharp-win32-arm64": ["@img/sharp-win32-arm64@0.34.5", "", { "os": "win32", "cpu": "arm64" }, "sha512-WQ3AgWCWYSb2yt+IG8mnC6Jdk9Whs7O0gxphblsLvdhSpSTtmu69ZG1Gkb6NuvxsNACwiPV6cNSZNzt0KPsw7g=="],
    "@img/sharp-win32-x64": ["@img/sharp-win32-x64@0.34.5", "", { "os": "win32", "cpu": "x64" }, "sha512-+29YMsqY2/9eFEiW93eqWnuLcWcufowXewwSNIT6UwZdUUCrM3oFjMWH/Z6/TMmb4hlFenmfAVbpWeup2jryCw=="],
    "@modelcontextprotocol/sdk": ["@modelcontextprotocol/sdk@1.29.0", "", { "dependencies": { "@hono/node-server": "^1.19.9", "ajv": "^8.17.1", "ajv-formats": "^3.0.1", "content-type": "^1.0.5", "cors": "^2.8.5", "cross-spawn": "^7.0.5", "eventsource": "^3.0.2", "eventsource-parser": "^3.0.0", "express": "^5.2.1", "express-rate-limit": "^8.2.1", "hono": "^4.11.4", "jose": "^6.1.3", "json-schema-typed": "^8.0.2", "pkce-challenge": "^5.0.0", "raw-body": "^3.0.0", "zod": "^3.25 || ^4.0", "zod-to-json-schema": "^3.25.1" }, "peerDependencies": { "@cfworker/json-schema": "^4.1.1" }, "optionalPeers": ["@cfworker/json-schema"] }, "sha512-zo37mZA9hJWpULgkRpowewez1y6ML5GsXJPY8FI0tBBCd77HEvza4jDqRKOXgHNn867PVGCyTdzqpz0izu5ZjQ=="],
    "@types/bun": ["@types/bun@1.3.11", "", { "dependencies": { "bun-types": "1.3.11" } }, "sha512-5vPne5QvtpjGpsGYXiFyycfpDF2ECyPcTSsFBMa0fraoxiQyMJ3SmuQIGhzPg2WJuWxVBoxWJ2kClYTcw/4fAg=="],
    "@types/node": ["@types/node@18.19.130", "", { "dependencies": { "undici-types": "~5.26.4" } }, "sha512-GRaXQx6jGfL8sKfaIDD6OupbIHBr9jv7Jnaml9tB7l4v068PAOXqfcujMMo5PhbIs6ggR1XODELqahT2R8v0fg=="],
    "@types/node-fetch": ["@types/node-fetch@2.6.13", "", { "dependencies": { "@types/node": "*", "form-data": "^4.0.4" } }, "sha512-QGpRVpzSaUs30JBSGPjOg4Uveu384erbHBoT1zeONvyCfwQxIkUshLAOqN/k9EjGviPRmWTTe6aH2qySWKTVSw=="],
    "abort-controller": ["abort-controller@3.0.0", "", { "dependencies": { "event-target-shim": "^5.0.0" } }, "sha512-h8lQ8tacZYnR3vNQTgibj+tODHI5/+l06Au2Pcriv/Gmet0eaj4TwWH41sO9wnHDiQsEj19q0drzdWdeAHtweg=="],
    "accepts": ["accepts@2.0.0", "", { "dependencies": { "mime-types": "^3.0.0", "negotiator": "^1.0.0" } }, "sha512-5cvg6CtKwfgdmVqY1WIiXKc3Q1bkRqGLi+2W/6ao+6Y7gu/RCwRuAhGEzh5B4KlszSuTLgZYuqFqo5bImjNKng=="],
    "agentkeepalive": ["agentkeepalive@4.6.0", "", { "dependencies": { "humanize-ms": "^1.2.1" } }, "sha512-kja8j7PjmncONqaTsB8fQ+wE2mSU2DJ9D4XKoJ5PFWIdRMa6SLSN1ff4mOr4jCbfRSsxR4keIiySJU0N9T5hIQ=="],
    "ajv": ["ajv@8.18.0", "", { "dependencies": { "fast-deep-equal": "^3.1.3", "fast-uri": "^3.0.1", "json-schema-traverse": "^1.0.0", "require-from-string": "^2.0.2" } }, "sha512-PlXPeEWMXMZ7sPYOHqmDyCJzcfNrUr3fGNKtezX14ykXOEIvyK81d+qydx89KY5O71FKMPaQ2vBfBFI5NHR63A=="],
    "ajv-formats": ["ajv-formats@3.0.1", "", { "dependencies": { "ajv": "^8.0.0" } }, "sha512-8iUql50EUR+uUcdRQ3HDqa6EVyo3docL8g5WJ3FNcWmu62IbkGUue/pEyLBW8VGKKucTPgqeks4fIU1DA4yowQ=="],
    "asynckit": ["asynckit@0.4.0", "", {}, "sha512-Oei9OH4tRh0YqU3GxhX79dM/mwVgvbZJaSNaRk+bshkj0S5cfHcgYakreBjrHwatXKbz+IoIdYLxrKim2MjW0Q=="],
    "body-parser": ["body-parser@2.2.2", "", { "dependencies": { "bytes": "^3.1.2", "content-type": "^1.0.5", "debug": "^4.4.3", "http-errors": "^2.0.0", "iconv-lite": "^0.7.0", "on-finished": "^2.4.1", "qs": "^6.14.1", "raw-body": "^3.0.1", "type-is": "^2.0.1" } }, "sha512-oP5VkATKlNwcgvxi0vM0p/D3n2C3EReYVX+DNYs5TjZFn/oQt2j+4sVJtSMr18pdRr8wjTcBl6LoV+FUwzPmNA=="],
    "bun-types": ["bun-types@1.3.11", "", { "dependencies": { "@types/node": "*" } }, "sha512-1KGPpoxQWl9f6wcZh57LvrPIInQMn2TQ7jsgxqpRzg+l0QPOFvJVH7HmvHo/AiPgwXy+/Thf6Ov3EdVn1vOabg=="],
    "bytes": ["bytes@3.1.2", "", {}, "sha512-/Nf7TyzTx6S3yRJObOAV7956r8cr2+Oj8AC5dt8wSP3BQAoeX58NoHyCU8P8zGkNXStjTSi6fzO6F0pBdcYbEg=="],
    "call-bind-apply-helpers": ["call-bind-apply-helpers@1.0.2", "", { "dependencies": { "es-errors": "^1.3.0", "function-bind": "^1.1.2" } }, "sha512-Sp1ablJ0ivDkSzjcaJdxEunN5/XvksFJ2sMBFfq6x0ryhQV/2b/KwFe21cMpmHtPOSij8K99/wSfoEuTObmuMQ=="],
    "call-bound": ["call-bound@1.0.4", "", { "dependencies": { "call-bind-apply-helpers": "^1.0.2", "get-intrinsic": "^1.3.0" } }, "sha512-+ys997U96po4Kx/ABpBCqhA9EuxJaQWDQg7295H4hBphv3IZg0boBKuwYpt4YXp6MZ5AmZQnU/tyMTlRpaSejg=="],
    "combined-stream": ["combined-stream@1.0.8", "", { "dependencies": { "delayed-stream": "~1.0.0" } }, "sha512-FQN4MRfuJeHf7cBbBMJFXhKSDq+2kAArBlmRBvcvFE5BB1HZKXtSFASDhdlz9zOYwxh8lDdnvmMOe/+5cdoEdg=="],
    "commander": ["commander@14.0.3", "", {}, "sha512-H+y0Jo/T1RZ9qPP4Eh1pkcQcLRglraJaSLoyOtHxu6AapkjWVCy2Sit1QQ4x3Dng8qDlSsZEet7g5Pq06MvTgw=="],
    "content-disposition": ["content-disposition@1.0.1", "", {}, "sha512-oIXISMynqSqm241k6kcQ5UwttDILMK4BiurCfGEREw6+X9jkkpEe5T9FZaApyLGGOnFuyMWZpdolTXMtvEJ08Q=="],
    "content-type": ["content-type@1.0.5", "", {}, "sha512-nTjqfcBFEipKdXCv4YDQWCfmcLZKm81ldF0pAopTvyrFGVbcR6P/VAAd5G7N+0tTr8QqiU0tFadD6FK4NtJwOA=="],
    "cookie": ["cookie@0.7.2", "", {}, "sha512-yki5XnKuf750l50uGTllt6kKILY4nQ1eNIQatoXEByZ5dWgnKqbnqmTrBE5B4N7lrMJKQ2ytWMiTO2o0v6Ew/w=="],
    "cookie-signature": ["cookie-signature@1.2.2", "", {}, "sha512-D76uU73ulSXrD1UXF4KE2TMxVVwhsnCgfAyTg9k8P6KGZjlXKrOLe4dJQKI3Bxi5wjesZoFXJWElNWBjPZMbhg=="],
    "cors": ["cors@2.8.6", "", { "dependencies": { "object-assign": "^4", "vary": "^1" } }, "sha512-tJtZBBHA6vjIAaF6EnIaq6laBBP9aq/Y3ouVJjEfoHbRBcHBAHYcMh/w8LDrk2PvIMMq8gmopa5D4V8RmbrxGw=="],
    "cross-spawn": ["cross-spawn@7.0.6", "", { "dependencies": { "path-key": "^3.1.0", "shebang-command": "^2.0.0", "which": "^2.0.1" } }, "sha512-uV2QOWP2nWzsy2aMp8aRibhi9dlzF5Hgh5SHaB9OiTGEyDTiJJyx0uy51QXdyWbtAHNua4XJzUKca3OzKUd3vA=="],
    "debug": ["debug@4.4.3", "", { "dependencies": { "ms": "^2.1.3" } }, "sha512-RGwwWnwQvkVfavKVt22FGLw+xYSdzARwm0ru6DhTVA3umU5hZc28V3kO4stgYryrTlLpuvgI9GiijltAjNbcqA=="],
    "delayed-stream": ["delayed-stream@1.0.0", "", {}, "sha512-ZySD7Nf91aLB0RxL4KGrKHBXl7Eds1DAmEdcoVawXnLD7SDhpNgtuII2aAkg7a7QS41jxPSZ17p4VdGnMHk3MQ=="],
    "depd": ["depd@2.0.0", "", {}, "sha512-g7nH6P6dyDioJogAAGprGpCtVImJhpPk/roCzdb3fIh61/s/nPsfR6onyMwkCAR/OlC3yBC0lESvUoQEAssIrw=="],
    "dunder-proto": ["dunder-proto@1.0.1", "", { "dependencies": { "call-bind-apply-helpers": "^1.0.1", "es-errors": "^1.3.0", "gopd": "^1.2.0" } }, "sha512-KIN/nDJBQRcXw0MLVhZE9iQHmG68qAVIBg9CqmUYjmQIhgij9U5MFvrqkUL5FbtyyzZuOeOt0zdeRe4UY7ct+A=="],
    "ee-first": ["ee-first@1.1.1", "", {}, "sha512-WMwm9LhRUo+WUaRN+vRuETqG89IgZphVSNkdFgeb6sS/E4OrDIN7t48CAewSHXc6C8lefD8KKfr5vY61brQlow=="],
    "encodeurl": ["encodeurl@2.0.0", "", {}, "sha512-Q0n9HRi4m6JuGIV1eFlmvJB7ZEVxu93IrMyiMsGC0lrMJMWzRgx6WGquyfQgZVb31vhGgXnfmPNNXmxnOkRBrg=="],
    "es-define-property": ["es-define-property@1.0.1", "", {}, "sha512-e3nRfgfUZ4rNGL232gUgX06QNyyez04KdjFrF+LTRoOXmrOgFKDg4BCdsjW8EnT69eqdYGmRpJwiPVYNrCaW3g=="],
    "es-errors": ["es-errors@1.3.0", "", {}, "sha512-Zf5H2Kxt2xjTvbJvP2ZWLEICxA6j+hAmMzIlypy4xcBg1vKVnx89Wy0GbS+kf5cwCVFFzdCFh2XSCFNULS6csw=="],
    "es-object-atoms": ["es-object-atoms@1.1.1", "", { "dependencies": { "es-errors": "^1.3.0" } }, "sha512-FGgH2h8zKNim9ljj7dankFPcICIK9Cp5bm+c2gQSYePhpaG5+esrLODihIorn+Pe6FGJzWhXQotPv73jTaldXA=="],
    "es-set-tostringtag": ["es-set-tostringtag@2.1.0", "", { "dependencies": { "es-errors": "^1.3.0", "get-intrinsic": "^1.2.6", "has-tostringtag": "^1.0.2", "hasown": "^2.0.2" } }, "sha512-j6vWzfrGVfyXxge+O0x5sh6cvxAog0a/4Rdd2K36zCMV5eJ+/+tOAngRO8cODMNWbVRdVlmGZQL2YS3yR8bIUA=="],
    "escape-html": ["escape-html@1.0.3", "", {}, "sha512-NiSupZ4OeuGwr68lGIeym/ksIZMJodUGOSCZ/FSnTxcrekbvqrgdUxlJOMpijaKZVjAJrWrGs/6Jy8OMuyj9ow=="],
    "etag": ["etag@1.8.1", "", {}, "sha512-aIL5Fx7mawVa300al2BnEE4iNvo1qETxLrPI/o05L7z6go7fCw1J6EQmbK4FmJ2AS7kgVF/KEZWufBfdClMcPg=="],
    "event-target-shim": ["event-target-shim@5.0.1", "", {}, "sha512-i/2XbnSz/uxRCU6+NdVJgKWDTM427+MqYbkQzD321DuCQJUqOuJKIA0IM2+W2xtYHdKOmZ4dR6fExsd4SXL+WQ=="],
    "eventsource": ["eventsource@3.0.7", "", { "dependencies": { "eventsource-parser": "^3.0.1" } }, "sha512-CRT1WTyuQoD771GW56XEZFQ/ZoSfWid1alKGDYMmkt2yl8UXrVR4pspqWNEcqKvVIzg6PAltWjxcSSPrboA4iA=="],
    "eventsource-parser": ["eventsource-parser@3.0.6", "", {}, "sha512-Vo1ab+QXPzZ4tCa8SwIHJFaSzy4R6SHf7BY79rFBDf0idraZWAkYrDjDj8uWaSm3S2TK+hJ7/t1CEmZ7jXw+pg=="],
    "express": ["express@5.2.1", "", { "dependencies": { "accepts": "^2.0.0", "body-parser": "^2.2.1", "content-disposition": "^1.0.0", "content-type": "^1.0.5", "cookie": "^0.7.1", "cookie-signature": "^1.2.1", "debug": "^4.4.0", "depd": "^2.0.0", "encodeurl": "^2.0.0", "escape-html": "^1.0.3", "etag": "^1.8.1", "finalhandler": "^2.1.0", "fresh": "^2.0.0", "http-errors": "^2.0.0", "merge-descriptors": "^2.0.0", "mime-types": "^3.0.0", "on-finished": "^2.4.1", "once": "^1.4.0", "parseurl": "^1.3.3", "proxy-addr": "^2.0.7", "qs": "^6.14.0", "range-parser": "^1.2.1", "router": "^2.2.0", "send": "^1.1.0", "serve-static": "^2.2.0", "statuses": "^2.0.1", "type-is": "^2.0.1", "vary": "^1.1.2" } }, "sha512-hIS4idWWai69NezIdRt2xFVofaF4j+6INOpJlVOLDO8zXGpUVEVzIYk12UUi2JzjEzWL3IOAxcTubgz9Po0yXw=="],
    "express-rate-limit": ["express-rate-limit@8.3.2", "", { "dependencies": { "ip-address": "10.1.0" }, "peerDependencies": { "express": ">= 4.11" } }, "sha512-77VmFeJkO0/rvimEDuUC5H30oqUC4EyOhyGccfqoLebB0oiEYfM7nwPrsDsBL1gsTpwfzX8SFy2MT3TDyRq+bg=="],
    "fast-deep-equal": ["fast-deep-equal@3.1.3", "", {}, "sha512-f3qQ9oQy9j2AhBe/H9VC91wLmKBCCU/gDOnKNAYG5hswO7BLKj09Hc5HYNz9cGI++xlpDCIgDaitVs03ATR84Q=="],
    "fast-uri": ["fast-uri@3.1.0", "", {}, "sha512-iPeeDKJSWf4IEOasVVrknXpaBV0IApz/gp7S2bb7Z4Lljbl2MGJRqInZiUrQwV16cpzw/D3S5j5Julj/gT52AA=="],
    "finalhandler": ["finalhandler@2.1.1", "", { "dependencies": { "debug": "^4.4.0", "encodeurl": "^2.0.0", "escape-html": "^1.0.3", "on-finished": "^2.4.1", "parseurl": "^1.3.3", "statuses": "^2.0.1" } }, "sha512-S8KoZgRZN+a5rNwqTxlZZePjT/4cnm0ROV70LedRHZ0p8u9fRID0hJUZQpkKLzro8LfmC8sx23bY6tVNxv8pQA=="],
    "form-data": ["form-data@4.0.5", "", { "dependencies": { "asynckit": "^0.4.0", "combined-stream": "^1.0.8", "es-set-tostringtag": "^2.1.0", "hasown": "^2.0.2", "mime-types": "^2.1.12" } }, "sha512-8RipRLol37bNs2bhoV67fiTEvdTrbMUYcFTiy3+wuuOnUog2QBHCZWXDRijWQfAkhBj2Uf5UnVaiWwA5vdd82w=="],
    "form-data-encoder": ["form-data-encoder@1.7.2", "", {}, "sha512-qfqtYan3rxrnCk1VYaA4H+Ms9xdpPqvLZa6xmMgFvhO32x7/3J/ExcTd6qpxM0vH2GdMI+poehyBZvqfMTto8A=="],
    "formdata-node": ["formdata-node@4.4.1", "", { "dependencies": { "node-domexception": "1.0.0", "web-streams-polyfill": "4.0.0-beta.3" } }, "sha512-0iirZp3uVDjVGt9p49aTaqjk84TrglENEDuqfdlZQ1roC9CWlPk6Avf8EEnZNcAqPonwkG35x4n3ww/1THYAeQ=="],
    "forwarded": ["forwarded@0.2.0", "", {}, "sha512-buRG0fpBtRHSTCOASe6hD258tEubFoRLb4ZNA6NxMVHNw2gOcwHo9wyablzMzOA5z9xA9L1KNjk/Nt6MT9aYow=="],
    "fresh": ["fresh@2.0.0", "", {}, "sha512-Rx/WycZ60HOaqLKAi6cHRKKI7zxWbJ31MhntmtwMoaTeF7XFH9hhBp8vITaMidfljRQ6eYWCKkaTK+ykVJHP2A=="],
    "function-bind": ["function-bind@1.1.2", "", {}, "sha512-7XHNxH7qX9xG5mIwxkhumTox/MIRNcOgDrxWsMt2pAr23WHp6MrRlN7FBSFpCpr+oVO0F744iUgR82nJMfG2SA=="],
    "get-intrinsic": ["get-intrinsic@1.3.0", "", { "dependencies": { "call-bind-apply-helpers": "^1.0.2", "es-define-property": "^1.0.1", "es-errors": "^1.3.0", "es-object-atoms": "^1.1.1", "function-bind": "^1.1.2", "get-proto": "^1.0.1", "gopd": "^1.2.0", "has-symbols": "^1.1.0", "hasown": "^2.0.2", "math-intrinsics": "^1.1.0" } }, "sha512-9fSjSaos/fRIVIp+xSJlE6lfwhES7LNtKaCBIamHsjr2na1BiABJPo0mOjjz8GJDURarmCPGqaiVg5mfjb98CQ=="],
    "get-proto": ["get-proto@1.0.1", "", { "dependencies": { "dunder-proto": "^1.0.1", "es-object-atoms": "^1.0.0" } }, "sha512-sTSfBjoXBp89JvIKIefqw7U2CCebsc74kiY6awiGogKtoSGbgjYE/G/+l9sF3MWFPNc9IcoOC4ODfKHfxFmp0g=="],
    "gopd": ["gopd@1.2.0", "", {}, "sha512-ZUKRh6/kUFoAiTAtTYPZJ3hw9wNxx+BIBOijnlG9PnrJsCcSjs1wyyD6vJpaYtgnzDrKYRSqf3OO6Rfa93xsRg=="],
    "has-symbols": ["has-symbols@1.1.0", "", {}, "sha512-1cDNdwJ2Jaohmb3sg4OmKaMBwuC48sYni5HUw2DvsC8LjGTLK9h+eb1X6RyuOHe4hT0ULCW68iomhjUoKUqlPQ=="],
    "has-tostringtag": ["has-tostringtag@1.0.2", "", { "dependencies": { "has-symbols": "^1.0.3" } }, "sha512-NqADB8VjPFLM2V0VvHUewwwsw0ZWBaIdgo+ieHtK3hasLz4qeCRjYcqfB6AQrBggRKppKF8L52/VqdVsO47Dlw=="],
    "hasown": ["hasown@2.0.2", "", { "dependencies": { "function-bind": "^1.1.2" } }, "sha512-0hJU9SCPvmMzIBdZFqNPXWa6dqh7WdH0cII9y+CyS8rG3nL48Bclra9HmKhVVUHyPWNH5Y7xDwAB7bfgSjkUMQ=="],
    "hono": ["hono@4.12.9", "", {}, "sha512-wy3T8Zm2bsEvxKZM5w21VdHDDcwVS1yUFFY6i8UobSsKfFceT7TOwhbhfKsDyx7tYQlmRM5FLpIuYvNFyjctiA=="],
    "http-errors": ["http-errors@2.0.1", "", { "dependencies": { "depd": "~2.0.0", "inherits": "~2.0.4", "setprototypeof": "~1.2.0", "statuses": "~2.0.2", "toidentifier": "~1.0.1" } }, "sha512-4FbRdAX+bSdmo4AUFuS0WNiPz8NgFt+r8ThgNWmlrjQjt1Q7ZR9+zTlce2859x4KSXrwIsaeTqDoKQmtP8pLmQ=="],
    "humanize-ms": ["humanize-ms@1.2.1", "", { "dependencies": { "ms": "^2.0.0" } }, "sha512-Fl70vYtsAFb/C06PTS9dZBo7ihau+Tu/DNCk/OyHhea07S+aeMWpFFkUaXRa8fI+ScZbEI8dfSxwY7gxZ9SAVQ=="],
    "iconv-lite": ["iconv-lite@0.7.2", "", { "dependencies": { "safer-buffer": ">= 2.1.2 < 3.0.0" } }, "sha512-im9DjEDQ55s9fL4EYzOAv0yMqmMBSZp6G0VvFyTMPKWxiSBHUj9NW/qqLmXUwXrrM7AvqSlTCfvqRb0cM8yYqw=="],
    "inherits": ["inherits@2.0.4", "", {}, "sha512-k/vGaX4/Yla3WzyMCvTQOXYeIHvqOKtnqBduzTHpzpQZzAskKMhZ2K+EnBiSM9zGSoIFeMpXKxa4dYeZIQqewQ=="],
    "ip-address": ["ip-address@10.1.0", "", {}, "sha512-XXADHxXmvT9+CRxhXg56LJovE+bmWnEWB78LB83VZTprKTmaC5QfruXocxzTZ2Kl0DNwKuBdlIhjL8LeY8Sf8Q=="],
    "ipaddr.js": ["ipaddr.js@1.9.1", "", {}, "sha512-0KI/607xoxSToH7GjN1FfSbLoU0+btTicjsQSWQlh/hZykN8KpmMf7uYwPW3R+akZ6R/w18ZlXSHBYXiYUPO3g=="],
    "is-promise": ["is-promise@4.0.0", "", {}, "sha512-hvpoI6korhJMnej285dSg6nu1+e6uxs7zG3BYAm5byqDsgJNWwxzM6z6iZiAgQR4TJ30JmBTOwqZUw3WlyH3AQ=="],
    "isexe": ["isexe@2.0.0", "", {}, "sha512-RHxMLp9lnKHGHRng9QFhRCMbYAcVpn69smSGcq3f36xjgVVWThj4qqLbTLlq7Ssj8B+fIQ1EuCEGI2lKsyQeIw=="],
    "jose": ["jose@6.2.2", "", {}, "sha512-d7kPDd34KO/YnzaDOlikGpOurfF0ByC2sEV4cANCtdqLlTfBlw2p14O/5d/zv40gJPbIQxfES3nSx1/oYNyuZQ=="],
    "json-schema-to-ts": ["json-schema-to-ts@3.1.1", "", { "dependencies": { "@babel/runtime": "^7.18.3", "ts-algebra": "^2.0.0" } }, "sha512-+DWg8jCJG2TEnpy7kOm/7/AxaYoaRbjVB4LFZLySZlWn8exGs3A4OLJR966cVvU26N7X9TWxl+Jsw7dzAqKT6g=="],
    "json-schema-traverse": ["json-schema-traverse@1.0.0", "", {}, "sha512-NM8/P9n3XjXhIZn1lLhkFaACTOURQXjWhV4BA/RnOv8xvgqtqpAX9IO4mRQxSx1Rlo4tqzeqb0sOlruaOy3dug=="],
    "json-schema-typed": ["json-schema-typed@8.0.2", "", {}, "sha512-fQhoXdcvc3V28x7C7BMs4P5+kNlgUURe2jmUT1T//oBRMDrqy1QPelJimwZGo7Hg9VPV3EQV5Bnq4hbFy2vetA=="],
    "math-intrinsics": ["math-intrinsics@1.1.0", "", {}, "sha512-/IXtbwEk5HTPyEwyKX6hGkYXxM9nbj64B+ilVJnC/R6B0pH5G4V3b0pVbL7DBj4tkhBAppbQUlf6F6Xl9LHu1g=="],
    "media-typer": ["media-typer@1.1.0", "", {}, "sha512-aisnrDP4GNe06UcKFnV5bfMNPBUw4jsLGaWwWfnH3v02GnBuXX2MCVn5RbrWo0j3pczUilYblq7fQ7Nw2t5XKw=="],
    "merge-descriptors": ["merge-descriptors@2.0.0", "", {}, "sha512-Snk314V5ayFLhp3fkUREub6WtjBfPdCPY1Ln8/8munuLuiYhsABgBVWsozAG+MWMbVEvcdcpbi9R7ww22l9Q3g=="],
    "mime-db": ["mime-db@1.54.0", "", {}, "sha512-aU5EJuIN2WDemCcAp2vFBfp/m4EAhWJnUNSSw0ixs7/kXbd6Pg64EmwJkNdFhB8aWt1sH2CTXrLxo/iAGV3oPQ=="],
    "mime-types": ["mime-types@3.0.2", "", { "dependencies": { "mime-db": "^1.54.0" } }, "sha512-Lbgzdk0h4juoQ9fCKXW4by0UJqj+nOOrI9MJ1sSj4nI8aI2eo1qmvQEie4VD1glsS250n15LsWsYtCugiStS5A=="],
    "ms": ["ms@2.1.3", "", {}, "sha512-6FlzubTLZG3J2a/NVCAleEhjzq5oxgHyaCU9yYXvcLsvoVaHJq/s5xXI6/XXP6tz7R9xAOtHnSO/tXtF3WRTlA=="],
    "negotiator": ["negotiator@1.0.0", "", {}, "sha512-8Ofs/AUQh8MaEcrlq5xOX0CQ9ypTF5dl78mjlMNfOK08fzpgTHQRQPBxcPlEtIw0yRpws+Zo/3r+5WRby7u3Gg=="],
    "node-domexception": ["node-domexception@1.0.0", "", {}, "sha512-/jKZoMpw0F8GRwl4/eLROPA3cfcXtLApP0QzLmUT/HuPCZWyB7IY9ZrMeKw2O/nFIqPQB3PVM9aYm0F312AXDQ=="],
    "node-fetch": ["node-fetch@2.7.0", "", { "dependencies": { "whatwg-url": "^5.0.0" }, "peerDependencies": { "encoding": "^0.1.0" }, "optionalPeers": ["encoding"] }, "sha512-c4FRfUm/dbcWZ7U+1Wq0AwCyFL+3nt2bEw05wfxSz+DWpWsitgmSgYmy2dQdWyKC1694ELPqMs/YzUSNozLt8A=="],
    "object-assign": ["object-assign@4.1.1", "", {}, "sha512-rJgTQnkUnH1sFw8yT6VSU3zD3sWmu6sZhIseY8VX+GRu3P6F7Fu+JNDoXfklElbLJSnc3FUQHVe4cU5hj+BcUg=="],
    "object-inspect": ["object-inspect@1.13.4", "", {}, "sha512-W67iLl4J2EXEGTbfeHCffrjDfitvLANg0UlX3wFUUSTx92KXRFegMHUVgSqE+wvhAbi4WqjGg9czysTV2Epbew=="],
    "on-finished": ["on-finished@2.4.1", "", { "dependencies": { "ee-first": "1.1.1" } }, "sha512-oVlzkg3ENAhCk2zdv7IJwd/QUD4z2RxRwpkcGY8psCVcCYZNq4wYnVWALHM+brtuJjePWiYF/ClmuDr8Ch5+kg=="],
    "once": ["once@1.4.0", "", { "dependencies": { "wrappy": "1" } }, "sha512-lNaJgI+2Q5URQBkccEKHTQOPaXdUxnZZElQTZY0MFUAuaEqe1E+Nyvgdz/aIyNi6Z9MzO5dv1H8n58/GELp3+w=="],
    "openai": ["openai@6.34.0", "", { "peerDependencies": { "ws": "^8.18.0", "zod": "^3.25 || ^4.0" }, "optionalPeers": ["ws", "zod"], "bin": { "openai": "bin/cli" } }, "sha512-yEr2jdGf4tVFYG6ohmr3pF6VJuveP0EA/sS8TBx+4Eq5NT10alu5zg2dmxMXMgqpihRDQlFGpRt2XwsGj+Fyxw=="],
    "parseurl": ["parseurl@1.3.3", "", {}, "sha512-CiyeOxFT/JZyN5m0z9PfXw4SCBJ6Sygz1Dpl0wqjlhDEGGBP1GnsUVEL0p63hoG1fcj3fHynXi9NYO4nWOL+qQ=="],
    "path-key": ["path-key@3.1.1", "", {}, "sha512-ojmeN0qd+y0jszEtoY48r0Peq5dwMEkIlCOu6Q5f41lfkswXuKtYrhgoTpLnyIcHm24Uhqx+5Tqm2InSwLhE6Q=="],
    "path-to-regexp": ["path-to-regexp@8.4.1", "", {}, "sha512-fvU78fIjZ+SBM9YwCknCvKOUKkLVqtWDVctl0s7xIqfmfb38t2TT4ZU2gHm+Z8xGwgW+QWEU3oQSAzIbo89Ggw=="],
    "pkce-challenge": ["pkce-challenge@5.0.1", "", {}, "sha512-wQ0b/W4Fr01qtpHlqSqspcj3EhBvimsdh0KlHhH8HRZnMsEa0ea2fTULOXOS9ccQr3om+GcGRk4e+isrZWV8qQ=="],
    "proxy-addr": ["proxy-addr@2.0.7", "", { "dependencies": { "forwarded": "0.2.0", "ipaddr.js": "1.9.1" } }, "sha512-llQsMLSUDUPT44jdrU/O37qlnifitDP+ZwrmmZcoSKyLKvtZxpyV0n2/bD/N4tBAAZ/gJEdZU7KMraoK1+XYAg=="],
    "qs": ["qs@6.15.0", "", { "dependencies": { "side-channel": "^1.1.0" } }, "sha512-mAZTtNCeetKMH+pSjrb76NAM8V9a05I9aBZOHztWy/UqcJdQYNsf59vrRKWnojAT9Y+GbIvoTBC++CPHqpDBhQ=="],
    "range-parser": ["range-parser@1.2.1", "", {}, "sha512-Hrgsx+orqoygnmhFbKaHE6c296J+HTAQXoxEF6gNupROmmGJRoyzfG3ccAveqCBrwr/2yxQ5BVd/GTl5agOwSg=="],
    "raw-body": ["raw-body@3.0.2", "", { "dependencies": { "bytes": "~3.1.2", "http-errors": "~2.0.1", "iconv-lite": "~0.7.0", "unpipe": "~1.0.0" } }, "sha512-K5zQjDllxWkf7Z5xJdV0/B0WTNqx6vxG70zJE4N0kBs4LovmEYWJzQGxC9bS9RAKu3bgM40lrd5zoLJ12MQ5BA=="],
    "require-from-string": ["require-from-string@2.0.2", "", {}, "sha512-Xf0nWe6RseziFMu+Ap9biiUbmplq6S9/p+7w7YXP/JBHhrUDDUhwa+vANyubuqfZWTveU//DYVGsDG7RKL/vEw=="],
    "router": ["router@2.2.0", "", { "dependencies": { "debug": "^4.4.0", "depd": "^2.0.0", "is-promise": "^4.0.0", "parseurl": "^1.3.3", "path-to-regexp": "^8.0.0" } }, "sha512-nLTrUKm2UyiL7rlhapu/Zl45FwNgkZGaCpZbIHajDYgwlJCOzLSk+cIPAnsEqV955GjILJnKbdQC1nVPz+gAYQ=="],
    "safer-buffer": ["safer-buffer@2.1.2", "", {}, "sha512-YZo3K82SD7Riyi0E1EQPojLz7kpepnSQI9IyPbHHg1XXXevb5dJI7tpyN2ADxGcQbHG7vcyRHk0cbwqcQriUtg=="],
    "send": ["send@1.2.1", "", { "dependencies": { "debug": "^4.4.3", "encodeurl": "^2.0.0", "escape-html": "^1.0.3", "etag": "^1.8.1", "fresh": "^2.0.0", "http-errors": "^2.0.1", "mime-types": "^3.0.2", "ms": "^2.1.3", "on-finished": "^2.4.1", "range-parser": "^1.2.1", "statuses": "^2.0.2" } }, "sha512-1gnZf7DFcoIcajTjTwjwuDjzuz4PPcY2StKPlsGAQ1+YH20IRVrBaXSWmdjowTJ6u8Rc01PoYOGHXfP1mYcZNQ=="],
    "serve-static": ["serve-static@2.2.1", "", { "dependencies": { "encodeurl": "^2.0.0", "escape-html": "^1.0.3", "parseurl": "^1.3.3", "send": "^1.2.0" } }, "sha512-xRXBn0pPqQTVQiC8wyQrKs2MOlX24zQ0POGaj0kultvoOCstBQM5yvOhAVSUwOMjQtTvsPWoNCHfPGwaaQJhTw=="],
    "setprototypeof": ["setprototypeof@1.2.0", "", {}, "sha512-E5LDX7Wrp85Kil5bhZv46j8jOeboKq5JMmYM3gVGdGH8xFpPWXUMsNrlODCrkoxMEeNi/XZIwuRvY4XNwYMJpw=="],
    "shebang-command": ["shebang-command@2.0.0", "", { "dependencies": { "shebang-regex": "^3.0.0" } }, "sha512-kHxr2zZpYtdmrN1qDjrrX/Z1rR1kG8Dx+gkpK1G4eXmvXswmcE1hTWBWYUzlraYw1/yZp6YuDY77YtvbN0dmDA=="],
    "shebang-regex": ["shebang-regex@3.0.0", "", {}, "sha512-7++dFhtcx3353uBaq8DDR4NuxBetBzC7ZQOhmTQInHEd6bSrXdiEyzCvG07Z44UYdLShWUyXt5M/yhz8ekcb1A=="],
    "side-channel": ["side-channel@1.1.0", "", { "dependencies": { "es-errors": "^1.3.0", "object-inspect": "^1.13.3", "side-channel-list": "^1.0.0", "side-channel-map": "^1.0.1", "side-channel-weakmap": "^1.0.2" } }, "sha512-ZX99e6tRweoUXqR+VBrslhda51Nh5MTQwou5tnUDgbtyM0dBgmhEDtWGP/xbKn6hqfPRHujUNwz5fy/wbbhnpw=="],
    "side-channel-list": ["side-channel-list@1.0.0", "", { "dependencies": { "es-errors": "^1.3.0", "object-inspect": "^1.13.3" } }, "sha512-FCLHtRD/gnpCiCHEiJLOwdmFP+wzCmDEkc9y7NsYxeF4u7Btsn1ZuwgwJGxImImHicJArLP4R0yX4c2KCrMrTA=="],
    "side-channel-map": ["side-channel-map@1.0.1", "", { "dependencies": { "call-bound": "^1.0.2", "es-errors": "^1.3.0", "get-intrinsic": "^1.2.5", "object-inspect": "^1.13.3" } }, "sha512-VCjCNfgMsby3tTdo02nbjtM/ewra6jPHmpThenkTYh8pG9ucZ/1P8So4u4FGBek/BjpOVsDCMoLA/iuBKIFXRA=="],
    "side-channel-weakmap": ["side-channel-weakmap@1.0.2", "", { "dependencies": { "call-bound": "^1.0.2", "es-errors": "^1.3.0", "get-intrinsic": "^1.2.5", "object-inspect": "^1.13.3", "side-channel-map": "^1.0.1" } }, "sha512-WPS/HvHQTYnHisLo9McqBHOJk2FkHO/tlpvldyrnem4aeQp4hai3gythswg6p01oSoTl58rcpiFAjF2br2Ak2A=="],
    "statuses": ["statuses@2.0.2", "", {}, "sha512-DvEy55V3DB7uknRo+4iOGT5fP1slR8wQohVdknigZPMpMstaKJQWhwiYBACJE3Ul2pTnATihhBYnRhZQHGBiRw=="],
    "toidentifier": ["toidentifier@1.0.1", "", {}, "sha512-o5sSPKEkg/DIQNmH43V0/uerLrpzVedkUh8tGNvaeXpfpuwjKenlSox/2O/BTlZUtEe+JG7s5YhEz608PlAHRA=="],
    "tr46": ["tr46@0.0.3", "", {}, "sha512-N3WMsuqV66lT30CrXNbEjx4GEwlow3v6rr4mCcv6prnfwhS01rkgyFdjPNBYd9br7LpXV1+Emh01fHnq2Gdgrw=="],
    "ts-algebra": ["ts-algebra@2.0.0", "", {}, "sha512-FPAhNPFMrkwz76P7cdjdmiShwMynZYN6SgOujD1urY4oNm80Ou9oMdmbR45LotcKOXoy7wSmHkRFE6Mxbrhefw=="],
    "type-is": ["type-is@2.0.1", "", { "dependencies": { "content-type": "^1.0.5", "media-typer": "^1.1.0", "mime-types": "^3.0.0" } }, "sha512-OZs6gsjF4vMp32qrCbiVSkrFmXtG/AZhY3t0iAMrMBiAZyV9oALtXO8hsrHbMXF9x6L3grlFuwW2oAz7cav+Gw=="],
    "typescript": ["typescript@5.9.3", "", { "bin": { "tsc": "bin/tsc", "tsserver": "bin/tsserver" } }, "sha512-jl1vZzPDinLr9eUt3J/t7V6FgNEw9QjvBPdysz9KfQDD41fQrC2Y4vKQdiaUpFT4bXlb1RHhLpp8wtm6M5TgSw=="],
    "undici-types": ["undici-types@5.26.5", "", {}, "sha512-JlCMO+ehdEIKqlFxk6IfVoAUVmgz7cU7zD/h9XZ0qzeosSHmUJVOzSQvvYSYWXkFXC+IfLKSIffhv0sVZup6pA=="],
    "unpipe": ["unpipe@1.0.0", "", {}, "sha512-pjy2bYhSsufwWlKwPc+l3cN7+wuJlK6uz0YdJEOlQDbl6jo/YlPi4mb8agUkVC8BF7V8NuzeyPNqRksA3hztKQ=="],
    "vary": ["vary@1.1.2", "", {}, "sha512-BNGbWLfd0eUPabhkXUVm0j8uuvREyTh5ovRa/dyow/BqAbZJyC+5fU+IzQOzmAKzYqYRAISoRhdQr3eIZ/PXqg=="],
    "web-streams-polyfill": ["web-streams-polyfill@4.0.0-beta.3", "", {}, "sha512-QW95TCTaHmsYfHDybGMwO5IJIM93I/6vTRk+daHTWFPhwh+C8Cg7j7XyKrwrj8Ib6vYXe0ocYNrmzY4xAAN6ug=="],
    "webidl-conversions": ["webidl-conversions@3.0.1", "", {}, "sha512-2JAn3z8AR6rjK8Sm8orRC0h/bcl/DqL7tRPdGZ4I1CjdF+EaMLmYxBHyXuKL849eucPFhvBoxMsflfOb8kxaeQ=="],
    "whatwg-url": ["whatwg-url@5.0.0", "", { "dependencies": { "tr46": "~0.0.3", "webidl-conversions": "^3.0.0" } }, "sha512-saE57nupxk6v3HY35+jzBwYa0rKSy0XR8JSxZPwgLr7ys0IBzhGviA1/TUGJLmSVqs8pb9AnvICXEuOHLprYTw=="],
    "which": ["which@2.0.2", "", { "dependencies": { "isexe": "^2.0.0" }, "bin": { "node-which": "./bin/node-which" } }, "sha512-BLI3Tl1TW3Pvl70l3yq3Y64i+awpwXqsGBYWkkqMtnbXgrMD+yj7rhW0kuEDxzJaYXGjEW5ogapKNMEKNMjibA=="],
    "wrappy": ["wrappy@1.0.2", "", {}, "sha512-l4Sp/DRseor9wL6EvV2+TuQn63dMkPjZ/sp9XkghTEbV9KlPS1xUsZ3u7/IQO4wxtcFB4bgpQPRcR3QCvezPcQ=="],
    "yaml": ["yaml@2.8.3", "", { "bin": { "yaml": "bin.mjs" } }, "sha512-AvbaCLOO2Otw/lW5bmh9d/WEdcDFdQp2Z2ZUH3pX9U2ihyUY0nvLv7J6TrWowklRGPYbB/IuIMfYgxaCPg5Bpg=="],
    "zod": ["zod@4.3.6", "", {}, "sha512-rftlrkhHZOcjDwkGlnUtZZkvaPHCsDATp4pGpuOOMDaTdDDXF91wuVDJoWoPsKX/3YPQ5fHuF3STjcYyKr+Qhg=="],
    "zod-to-json-schema": ["zod-to-json-schema@3.25.2", "", { "peerDependencies": { "zod": "^3.25.28 || ^4" } }, "sha512-O/PgfnpT1xKSDeQYSCfRI5Gy3hPf91mKVDuYLUHZJMiDFptvP41MSnWofm8dnCm0256ZNfZIM7DSzuSMAFnjHA=="],
    "@anthropic-ai/claude-agent-sdk/@anthropic-ai/sdk": ["@anthropic-ai/sdk@0.74.0", "", { "dependencies": { "json-schema-to-ts": "^3.1.1" }, "peerDependencies": { "zod": "^3.25.0 || ^4.0.0" }, "optionalPeers": ["zod"], "bin": { "anthropic-ai-sdk": "bin/cli" } }, "sha512-srbJV7JKsc5cQ6eVuFzjZO7UR3xEPJqPamHFIe29bs38Ij2IripoAhC0S5NslNbaFUYqBKypmmpzMTpqfHEUDw=="],
    "@types/node-fetch/@types/node": ["@types/node@25.5.0", "", { "dependencies": { "undici-types": "~7.18.0" } }, "sha512-jp2P3tQMSxWugkCUKLRPVUpGaL5MVFwF8RDuSRztfwgN1wmqJeMSbKlnEtQqU8UrhTmzEmZdu2I6v2dpp7XIxw=="],
    "bun-types/@types/node": ["@types/node@25.5.0", "", { "dependencies": { "undici-types": "~7.18.0" } }, "sha512-jp2P3tQMSxWugkCUKLRPVUpGaL5MVFwF8RDuSRztfwgN1wmqJeMSbKlnEtQqU8UrhTmzEmZdu2I6v2dpp7XIxw=="],
    "form-data/mime-types": ["mime-types@2.1.35", "", { "dependencies": { "mime-db": "1.52.0" } }, "sha512-ZDY+bPm5zTTF+YpCrAU9nK0UgICYPT0QtT1NZWFv4s++TNkcgVaT0g6+4R2uI4MjQjzysHB1zxuWL50hzaeXiw=="],
    "@types/node-fetch/@types/node/undici-types": ["undici-types@7.18.2", "", {}, "sha512-AsuCzffGHJybSaRrmr5eHr81mwJU3kjw6M+uprWvCXiNeN9SOGwQ3Jn8jb8m3Z6izVgknn1R0FTCEAP2QrLY/w=="],
    "bun-types/@types/node/undici-types": ["undici-types@7.18.2", "", {}, "sha512-AsuCzffGHJybSaRrmr5eHr81mwJU3kjw6M+uprWvCXiNeN9SOGwQ3Jn8jb8m3Z6izVgknn1R0FTCEAP2QrLY/w=="],
    "form-data/mime-types/mime-db": ["mime-db@1.52.0", "", {}, "sha512-sPU4uV7dYlvtWJxwwxHD0PuihVNiE7TyAbQ5SWxDCB9mUYvOgroQOwYQQOKPJ8CIbE+1ETVlOoK1UC2nU3gYvg=="],
  }
 }
--- a/ai_evals/cases/app.yaml
+++ b/ai_evals/cases/app.yaml
@@ -0,0 +1,93 @@
 - id: app-test1-counter-create
  prompt: |-
    Create a simple counter app with increment and decrement buttons.
  judgeChecklist:
    - shows the current count in the UI
    - includes an increment button
    - includes a decrement button
    - clicking the buttons updates the count correctly
 - id: app-test2-counter-reset
  prompt: |-
    Add a reset button that sets the counter back to 0
  initial: ai_evals/fixtures/frontend/app/initial/test1_counter_app
  judgeChecklist:
    - adds a reset control to the existing counter app
    - clicking reset sets the count back to 0
    - keeps the existing increment and decrement behavior working
 - id: app-test3-shopping-cart-quantity
  prompt: |-
    Add a quantity selector (+ and - buttons) to each cart item so users can adjust quantities without removing and re-adding items
  initial: ai_evals/fixtures/frontend/app/initial/shopping_cart
  judgeChecklist:
    - each cart item has visible plus and minus quantity controls
    - users can increase quantity without re-adding the product
    - users can decrease quantity from the cart UI
    - cart totals stay in sync with quantity changes
 - id: app-test4-shopping-cart-discount
  prompt: |-
    Add a discount code input field in the cart.
    When the code "SAVE10" is entered, apply a 10% discount to the total
  initial: ai_evals/fixtures/frontend/app/initial/shopping_cart
  judgeChecklist:
    - adds a discount code input to the cart
    - recognizes the code SAVE10
    - applies a 10 percent discount to the displayed total
    - keeps the rest of the cart behavior intact
 - id: app-test5-file-manager-search
  prompt: |-
    Add a search bar in the toolbar that filters files and folders by name as the user types
  initial: ai_evals/fixtures/frontend/app/initial/file_manager
  judgeChecklist:
    - adds a search input in the toolbar
    - filters files and folders by name as the user types
    - updates the visible file list from the search query
    - keeps the rest of the file manager usable
 - id: app-test6-file-manager-inline-rename
  prompt: |-
    Let users rename files and folders directly from the file list without leaving the page.
  initial: ai_evals/fixtures/frontend/app/initial/file_manager
  judgeChecklist:
    - adds a visible rename action or inline edit mode in the file list
    - lets users edit an item's name directly from the list
    - saves the renamed item through the app's existing rename behavior
    - refreshes the displayed name after a successful rename
 - id: app-test7-file-manager-select-all
  prompt: |-
    Add a "Select All" checkbox in the file list header and individual checkboxes for each file.
    Add a "Delete Selected" button that appears when items are selected
  initial: ai_evals/fixtures/frontend/app/initial/file_manager
  judgeChecklist:
    - adds a select-all control in the file list header
    - adds per-item selection controls
    - shows a delete-selected action only when there is a selection
    - deleting selected items updates the visible list
 - id: app-test8-inventory-tracker-create
  prompt: |-
    Create an inventory tracker app for a small store.
    Users should be able to add items with a name, sku, quantity, and price, search items by name or sku, and delete items.
    The inventory should persist between sessions.
  judgeChecklist:
    - includes a form to add inventory items with name, sku, quantity, and price
    - shows a list or table of saved inventory items
    - supports searching or filtering by name or sku
    - lets users delete existing inventory items
    - persists the inventory data appropriately for a raw Windmill app
 - id: app-test9-recipe-book-create
  prompt: |-
    Create a recipe book app where users can add recipes with a name, ingredients list, and instructions.
    Include a search bar to filter recipes by name and the ability to delete recipes.
    Recipes should persist between sessions.
  judgeChecklist:
    - includes a form to add recipes with name, ingredients, and instructions
    - shows saved recipes in the app
    - supports searching recipes by name
    - lets users delete recipes
    - persists recipes appropriately for a raw Windmill app
--- a/ai_evals/cases/cli.yaml
+++ b/ai_evals/cases/cli.yaml
@@ -0,0 +1,66 @@
 - id: bun-hello-script
  prompt: |-
    Create a Windmill Bun script at `f/evals/hello.ts`.
    It should take a `name` input and return a greeting object like `{ greeting: "Hello, Alice!" }`.
  expected: ai_evals/fixtures/cli/expected/bun-hello-script
  judgeChecklist:
    - creates the requested Bun script at f/evals/hello.ts
    - takes a name input
    - returns an object containing the greeting
 - id: bun-hello-flow
  prompt: |-
    Create a Windmill flow at `f/evals/hello__flow`.
    It should take a `name` input and return a greeting object like `{ greeting: "Hello, Alice!" }`.
    Put the step code in `hello.ts`.
  expected: ai_evals/fixtures/cli/expected/bun-hello-flow
  judgeChecklist:
    - creates the requested flow folder with flow.yaml and hello.ts
    - wires the name input into the flow step
    - returns the greeting object
 - id: python-add-numbers-script
  prompt: |-
    Add a Windmill Python script at `f/evals/add_numbers.py`.
    It should take `a` and `b` as inputs and return `{ "total": a + b }`.
  expected: ai_evals/fixtures/cli/expected/python-add-numbers-script
  judgeChecklist:
    - creates the requested Python script at f/evals/add_numbers.py
    - takes `a` and `b` as inputs
    - returns an object with total equal to a plus b
 - id: bun-hello-script-uppercase
  prompt: |-
    Update `f/evals/hello.ts` so it accepts an optional `uppercase` boolean.
    Keep returning `{ greeting: ... }`, but when `uppercase` is true the greeting should be uppercased before returning it.
  initial: ai_evals/fixtures/cli/initial/bun-hello-script-uppercase
  expected: ai_evals/fixtures/cli/expected/bun-hello-script-uppercase
  judgeChecklist:
    - updates the existing hello.ts file rather than creating a new script
    - accepts an optional uppercase boolean input
    - keeps returning an object with greeting
    - uppercases the greeting when uppercase is true
 - id: bun-hello-flow-punctuation
  prompt: |-
    Update the existing flow in `f/evals/hello__flow` so it also accepts an optional `punctuation` input.
    The greeting should use that punctuation and default to `!` when it is missing.
  initial: ai_evals/fixtures/cli/initial/bun-hello-flow-punctuation
  expected: ai_evals/fixtures/cli/expected/bun-hello-flow-punctuation
  judgeChecklist:
    - updates the existing hello flow instead of creating a new one
    - adds an optional punctuation input to the flow
    - updates the step code so the returned greeting uses punctuation
    - defaults punctuation to an exclamation mark when omitted
 - id: flow-reuse-existing-script
  prompt: |-
    There is already a reusable greeting script at `f/lib/format_greeting.ts`.
    Create a flow at `f/evals/reuse_greeting__flow` that takes a `name` input and reuses that existing script instead of duplicating the logic inline.
  initial: ai_evals/fixtures/cli/initial/flow-reuse-existing-script
  expected: ai_evals/fixtures/cli/expected/flow-reuse-existing-script
  judgeChecklist:
    - creates the requested flow at f/evals/reuse_greeting__flow
    - reuses the existing script from f/lib by path
    - does not duplicate the greeting logic in a new inline script
    - wires the name input into the reused script
--- a/ai_evals/cases/flow.yaml
+++ b/ai_evals/cases/flow.yaml
@@ -0,0 +1,246 @@
 - id: flow-test0-sum-two-numbers
  prompt: |-
    Create a flow that takes two numbers, `a` and `b`, and returns their sum.
    Keep it simple and use a single step named `sum_numbers`.
  expected: ai_evals/fixtures/frontend/flow/expected/test0_sum_two_numbers.json
  judgeChecklist:
    - "the flow takes `a` and `b` as inputs"
    - "the main step is named `sum_numbers`"
    - the flow returns the sum of the two numbers
 - id: flow-test1-reuse-existing-script
  prompt: |-
    I need a flow that adds two numbers.
    If there is already a script in the workspace that does that, reuse it instead of rewriting the logic.
    The flow should take `a` and `b` as inputs and use a single step named `sum_numbers`.
  initial: ai_evals/fixtures/frontend/flow/initial/test1_reuse_existing_script_initial.json
  expected: ai_evals/fixtures/frontend/flow/expected/test1_reuse_existing_script.json
  judgeChecklist:
    - "the flow takes `a` and `b` as inputs"
    - "the main step is named `sum_numbers`"
    - the flow reuses the existing workspace script instead of rewriting the addition logic
 - id: flow-test2-call-existing-subflow
  prompt: |-
    Create a parent flow that adds two numbers by reusing an existing flow in the workspace if one already exists.
    The parent flow should take `a` and `b` as inputs and delegate the calculation instead of inlining it.
    Use a single step named `call_add_numbers`.
  initial: ai_evals/fixtures/frontend/flow/initial/test2_call_existing_subflow_initial.json
  expected: ai_evals/fixtures/frontend/flow/expected/test2_call_existing_subflow.json
  judgeChecklist:
    - "the parent flow takes `a` and `b` as inputs"
    - "the main step is named `call_add_numbers`"
    - the parent flow delegates to an existing workspace subflow instead of inlining the addition logic
 - id: flow-test3-branchone-routing
  prompt: |-
    Create a flow that routes incoming support requests based on the customer's tier.
    The input should contain a string field named `tier`.
    Free, pro, and enterprise requests should go to different queues, and unknown tiers should fall back to a default queue.
    Name the main routing step `route_by_tier`.
  expected: ai_evals/fixtures/frontend/flow/expected/test3_branchone_routing.json
  judgeChecklist:
    - "the input schema includes a string field named `tier`"
    - "the main routing step is named `route_by_tier`"
    - free requests go to a free queue
    - pro requests go to a pro queue
    - enterprise requests go to an enterprise queue
    - unknown tiers fall back to a default queue
 - id: flow-test4-order-processing-loop
  prompt: |-
    Build an order-processing flow.
    The input should include an order with:
    - an `items` array containing `name`, `price`, and `quantity`
    - `customer_email`
    - `shipping_address`
    The flow should:
    - validate that every item has a positive price and quantity
    - calculate the order total with 8% tax
    - check inventory for each item using placeholder availability data
    - create a shipment if everything is in stock, otherwise create a backorder
    - send a confirmation using placeholder email logic
    - return a final order summary with the status
  validate:
    schemaAnyOf:
      - requiredPaths:
          - order
          - order.items
          - order.customer_email
          - order.shipping_address
      - requiredPaths:
          - items
          - customer_email
          - shipping_address
    resolveResultsRefs: true
  judgeChecklist:
    - the flow validates that every item has a positive price and quantity
    - the flow calculates the order total with 8% tax
    - the flow checks inventory for each item using placeholder availability data
    - the flow creates a shipment if everything is in stock, otherwise a backorder
    - the flow sends a confirmation using placeholder email logic
    - the flow returns a final order summary with the resulting status
 - id: flow-test5-parallel-data-pipeline
  prompt: |-
    Create a data-processing flow for three external data sources.
    It should:
    - load a small placeholder configuration listing the three sources
    - fetch placeholder records from each source
    - clean and validate each source's records
    - combine everything into one dataset
    - compute an overall quality score
    - store the result differently depending on the score:
      - 90 or above goes to the primary database
      - 70 to 89 goes to a secondary database with a warning
      - below 70 goes to quarantine and triggers an alert
    - return a processing report with total records, quality score, and destination
  judgeChecklist:
    - the flow loads a placeholder configuration listing three external sources
    - the flow fetches placeholder records from each source
    - the flow cleans and validates each source's records
    - the flow combines everything into one dataset
    - the flow computes an overall quality score
    - scores of 90 or above go to the primary database
    - scores from 70 to 89 go to a secondary database with a warning
    - scores below 70 go to quarantine and trigger an alert
    - the final report includes total records, quality score, and destination
 - id: flow-test6-ai-agent-tools
  prompt: |-
    Create a customer support flow.
    The input should include `customer_id` and `query_text`.
    The flow should load the customer's profile and order history, then use an AI assistant to help with the request.
    The assistant should be able to:
    - look up orders
    - check refund eligibility
    - search FAQs
    - open a support ticket when needed
    After that, log the interaction and return the assistant's response along with any actions it took.
  judgeChecklist:
    - "the input schema includes `customer_id` and `query_text`"
    - the flow loads the customer's profile and order history
    - the flow uses an AI assistant step
    - the assistant can look up orders
    - the assistant can check refund eligibility
    - the assistant can search FAQs
    - the assistant can open a support ticket
    - the flow logs the interaction
    - the final output returns the assistant response along with any actions taken or resulting support action details
 - id: flow-test7-simple-modification
  prompt: |-
    Update this flow so it validates processed data before saving it.
    After `process_data`, add a `validate_data` step that checks the data array is not empty.
    If the array is empty, it should return an error object with the message `No data to save`.
    If validation passes, let the save continue normally.
    Update `save_results` so it handles the validation result correctly.
  initial: ai_evals/fixtures/frontend/flow/initial/test5_initial.json
  expected: ai_evals/fixtures/frontend/flow/expected/test5_modify_simple.json
  judgeChecklist:
    - the updated flow keeps the original fetch and process steps intact
    - "a `validate_data` step is added after `process_data`"
    - "`validate_data` checks that the processed data array is not empty"
    - "empty data returns an error object with the message `No data to save`"
    - "`save_results` handles the validation result correctly"
 - id: flow-test8-branching-in-loop
  prompt: |-
    Update the order-processing logic inside `loop_orders` so different order types are handled differently.
    For `express`, mark the order as priority and use a shipping cost of $15.99.
    For `standard`, use a shipping cost of $5.99.
    For `pickup`, mark it as no shipping required with a cost of $0.
    Keep the existing processing as a fallback for unknown order types.
    Each path should return the orderId, shipping cost, and shipping type.
  initial: ai_evals/fixtures/frontend/flow/initial/test6_initial.json
  judgeChecklist:
    - "the existing `loop_orders` flow still handles per-order processing"
    - exact branching topology is not required as long as `loop_orders` handles the order types correctly
    - express orders are marked as priority and use a shipping cost of 15.99
    - standard orders use a shipping cost of 5.99
    - pickup orders use a shipping cost of 0 and are treated as no shipping required
    - unknown order types still follow a fallback path
    - "each processed order returns `orderId`, the shipping cost, and the shipping type"
 - id: flow-test9-parallel-refactor
  prompt: |-
    Refactor this flow so the enrichment work no longer runs one step at a time.
    `enrich_price`, `enrich_inventory`, and `enrich_reviews` should run independently.
    Each one should return a fallback value if it fails.
    Update `combine_data` so it merges the enrichment results and sets a `hasFallbacks` flag when any fallback was used.
    Keep `get_item` as the first step and `return_result` as the last step.
  initial: ai_evals/fixtures/frontend/flow/initial/test7_initial.json
  expected: ai_evals/fixtures/frontend/flow/expected/test7_modify_complex.json
  judgeChecklist:
    - "the updated flow keeps `get_item` as the first step"
    - "the updated flow keeps `return_result` as the last step"
    - "`enrich_price`, `enrich_inventory`, and `enrich_reviews` run independently rather than sequentially"
    - each enrichment path returns a fallback value if it fails
    - "`combine_data` merges the enrichment results"
    - "`combine_data` sets `hasFallbacks` when any fallback was used"
 - id: flow-test10-while-loop-counter
  prompt: |-
    Create a flow that keeps incrementing a counter until it reaches a target value.
    The input should include a number field named `target`.
    Name the looping step `count_until_target`.
    Once the target is reached, return the final counter value.
  expected: ai_evals/fixtures/frontend/flow/expected/test10_while_loop_counter.json
  judgeChecklist:
    - "the input schema includes a number field named `target`"
    - "the looping step is named `count_until_target`"
    - the flow keeps incrementing a counter until the target is reached
    - the final output returns the final counter value
 - id: flow-test11-preprocessor-and-failure-handler
  prompt: |-
    Create an event-processing flow for a string payload.
    Before the main processing runs, trim the payload and reject empty strings.
    The main step should be named `process_event` and return a simple success object.
    If anything fails, return a compact error object with the error message and the failing step id.
  expected: ai_evals/fixtures/frontend/flow/expected/test11_preprocessor_failure.json
  validate:
    requireSpecialModules:
      - preprocessor_module
      - failure_module
  judgeChecklist:
    - the flow trims the payload before the main processing runs
    - the flow rejects empty payload strings
    - "the main step is named `process_event`"
    - "`process_event` returns a simple success object"
    - failures return a compact error object with the error message and failing step id
 - id: flow-test12-approval-step
  prompt: |-
    Create a purchase approval flow.
    The input should include `requester_email` and `amount`.
    Add an approval step named `request_approval` that pauses the flow and asks the approver for a comment.
    One approval should be enough to continue.
    After approval, add a final step named `finalize_purchase` that returns an approved status object.
  expected: ai_evals/fixtures/frontend/flow/expected/test12_approval_step.json
  validate:
    schemaRequiredPaths:
      - requester_email
      - amount
    requireSuspendSteps:
      - id: request_approval
        requiredEvents: 1
        resumeRequiredStringFieldAnyOf:
          - comment
          - approver_comment
  judgeChecklist:
    - "the flow includes an approval step named `request_approval`"
    - "`request_approval` pauses the flow and asks the approver for a comment"
    - one approval is enough to continue
    - "the flow includes a final step named `finalize_purchase`"
    - "`finalize_purchase` returns an approved status object after approval"
--- a/ai_evals/cases/script.yaml
+++ b/ai_evals/cases/script.yaml
@@ -0,0 +1,11 @@
 - id: script-test1-greet-user
  prompt: |-
    Update the current Bun script so it takes the existing `name` input and returns a plain greeting string like `Hello, Alice!`.
    Do not wrap the result in an object or array.
    Keep it simple and do not add external dependencies.
  initial: ai_evals/fixtures/frontend/script/initial/test1_empty_bun.json
  expected: ai_evals/fixtures/frontend/script/expected/test1_greet_user.json
  judgeChecklist:
    - uses the existing `name` input
    - returns a plain greeting string
    - does not wrap the result in an object or array
--- a/ai_evals/cli/index.ts
+++ b/ai_evals/cli/index.ts
@@ -0,0 +1,295 @@
 #!/usr/bin/env bun
 import { Command, InvalidArgumentError } from "commander";
 import { loadCases, loadSelectedCases } from "../core/cases";
 import {
  EVAL_MODELS,
  type EvalModelSpec,
  formatRunModelLabel,
  getCliEvalModel,
  getEvalModelHelpText,
  resolveEvalModel,
 } from "../core/models";
 import {
  appendHistoryRecord,
  buildRunResult,
  formatRunSummary,
  resolveRunOutputPath,
  writeRunArtifacts,
  writeRunResult,
 } from "../core/results";
 import { runSuite } from "../core/runSuite";
 import { EVAL_MODES, type EvalMode } from "../core/types";
 import { DEFAULT_JUDGE_MODEL } from "../core/judge";
 import { createCliModeRunner } from "../modes/cli";
 import { runFrontendBenchmarkAdapter } from "../adapters/frontend/runtime";
 async function main() {
  const program = new Command()
    .name("bun run cli --")
    .description("Run AI eval cases against the current production prompts and guidance")
    .showHelpAfterError()
    .showSuggestionAfterError()
    .addHelpText(
      "after",
      [
        "",
        "Examples:",
        "  bun run cli -- models",
        "  bun run cli -- cases",
        "  bun run cli -- cases flow",
        "  bun run cli -- run flow",
        "  bun run cli -- run flow --model 4o",
        "  bun run cli -- run flow --models haiku,opus,4o",
        "  bun run cli -- run flow flow-test0-sum-two-numbers --verbose",
        "  bun run cli -- run flow --record",
        "  bun run cli -- run flow flow-test7-simple-modification --runs 3",
        "  bun run cli -- run cli bun-hello-script",
        "",
        "Models:",
        getEvalModelHelpText(),
      ].join("\n")
    );
  program
    .command("models")
    .description("List available model aliases")
    .action(() => {
      handleModels();
    });
  program
    .command("cases")
    .description("List available cases")
    .argument("[mode]", "cli, flow, script, or app", parseOptionalMode)
    .action(async (mode?: EvalMode) => {
      await handleCases(mode);
    });
  program
    .command("run")
    .description("Run one benchmark mode")
    .argument("<mode>", "cli, flow, script, or app", parseMode)
    .argument("[caseIds...]", "specific case ids to run")
    .option("--runs <n>", "number of attempts per case", parsePositiveInteger, 1)
    .option("--output <path>", "write the result JSON to this path")
    .option("--model <name>", `model alias (${EVAL_MODELS.map((entry) => entry.id).join(", ")})`)
    .option("--models <names>", "comma-separated model aliases to run sequentially")
    .option("--verbose", "stream assistant output during frontend runs")
    .option("--record", "append a compact summary line to ai_evals/history/<mode>.jsonl")
    .action(
      async (
        mode: EvalMode,
        caseIds: string[],
        options: {
          runs: number;
          output?: string;
          model?: string;
          models?: string;
          verbose?: boolean;
          record?: boolean;
        }
      ) => {
        await handleRun({
          mode,
          caseIds,
          runs: options.runs,
          outputPath: options.output,
          model: options.model,
          models: options.models,
          verbose: options.verbose ?? false,
          record: options.record ?? false,
        });
      }
    );
  await program.parseAsync(process.argv);
 }
 async function handleCases(mode?: EvalMode) {
  const modes = mode ? [mode] : [...EVAL_MODES];
  for (const entry of modes) {
    const cases = await loadCases(entry);
    process.stdout.write(`${entry} (${cases.length})\n`);
    for (const evalCase of cases) {
      process.stdout.write(`- ${evalCase.id}\n`);
    }
    process.stdout.write("\n");
  }
 }
 function handleModels() {
  process.stdout.write("Available models\n");
  for (const model of EVAL_MODELS) {
    const supports = [
      ...(model.frontend ? ["flow", "script", "app"] : []),
      ...(model.cli ? ["cli"] : []),
    ];
    const aliases = [model.id, ...model.aliases.filter((alias) => alias !== model.id)];
    process.stdout.write(`- ${model.id}: ${model.label}\n`);
    process.stdout.write(`  aliases: ${aliases.join(", ")}\n`);
    process.stdout.write(`  modes: ${supports.join(", ")}\n`);
  }
  process.stdout.write(`\nJudge model: ${DEFAULT_JUDGE_MODEL}\n`);
 }
 async function handleRun(input: {
  mode: EvalMode;
  caseIds: string[];
  runs: number;
  outputPath?: string;
  model?: string;
  models?: string;
  verbose: boolean;
  record: boolean;
 }) {
  if (input.record && input.caseIds.length > 0) {
    throw new Error("--record only supports full-suite runs; omit case ids to record history");
  }
  if (input.model && input.models) {
    throw new Error("Use either --model or --models, not both");
  }
  const selectedCases = await loadSelectedCases(input.mode, input.caseIds);
  const models = resolveRequestedModels(input.mode, input.model, input.models);
  if (input.outputPath && models.length > 1) {
    throw new Error("--output only supports a single model run");
  }
  const summaries: Array<{ label: string; passRate: number; averageDurationMs: number }> = [];
  for (const [index, model] of models.entries()) {
    const runModel = formatRunModelLabel(input.mode, model);
    if (models.length > 1) {
      process.stdout.write(
        `${index > 0 ? "\n" : ""}=== ${input.mode} ${model.id} (${runModel}) ===\n`
      );
    }
    process.stderr.write(`Starting ${input.mode} benchmark...\n`);
    const result =
      input.mode === "cli"
        ? await runCliBenchmark(selectedCases, input.runs, getCliEvalModel(model), runModel)
        : await runFrontendBenchmarkAdapter({
            mode: input.mode,
            caseIds: input.caseIds,
            runs: input.runs,
            model: model.id,
            verbose: input.verbose,
          });
    const resolvedOutputPath =
      models.length === 1
        ? resolveRunOutputPath(input.mode, input.outputPath)
        : resolveRunOutputPath(input.mode);
    const artifactsPath = await writeRunArtifacts(result, resolvedOutputPath);
    const resultPath = await writeRunResult(result, resolvedOutputPath);
    const historyPath = input.record ? await appendHistoryRecord(result) : null;
    process.stdout.write(`${formatRunSummary(result)}\n`);
    process.stdout.write(`Saved: ${resultPath}\n`);
    if (artifactsPath) {
      process.stdout.write(`Artifacts: ${artifactsPath}\n`);
    }
    if (historyPath) {
      process.stdout.write(`Recorded: ${historyPath}\n`);
    }
    summaries.push({
      label: `${model.id} (${runModel})`,
      passRate: result.passRate,
      averageDurationMs: result.averageDurationMs,
    });
  }
  if (summaries.length > 1) {
    process.stdout.write("\nModel summary\n");
    for (const summary of summaries) {
      process.stdout.write(
        `- ${summary.label}: ${formatPercent(summary.passRate)} | ${Math.round(summary.averageDurationMs)}ms\n`
      );
    }
  }
 }
 async function runCliBenchmark(
  cases: Awaited<ReturnType<typeof loadSelectedCases>>,
  runs: number,
  model: ReturnType<typeof getCliEvalModel>,
  runModel: string
 ) {
  const caseResults = await runSuite({
    modeRunner: createCliModeRunner(model),
    cases,
    runs,
    runModel,
    judgeModel: DEFAULT_JUDGE_MODEL,
  });
  return buildRunResult({
    mode: "cli",
    runs,
    runModel,
    judgeModel: DEFAULT_JUDGE_MODEL,
    caseResults,
  });
 }
 function parseMode(value: string): EvalMode {
  if (EVAL_MODES.includes(value as EvalMode)) {
    return value as EvalMode;
  }
  throw new InvalidArgumentError(`mode must be one of: ${EVAL_MODES.join(", ")}`);
 }
 function parseOptionalMode(value: string | undefined): EvalMode | undefined {
  return value ? parseMode(value) : undefined;
 }
 function parsePositiveInteger(value: string): number {
  const parsed = Number(value);
  if (!Number.isInteger(parsed) || parsed <= 0) {
    throw new InvalidArgumentError("must be a positive integer");
  }
  return parsed;
 }
 function resolveRequestedModels(
  mode: EvalMode,
  singleModel?: string,
  multipleModels?: string
 ): EvalModelSpec[] {
  if (!multipleModels) {
    return [resolveEvalModel(mode, singleModel)];
  }
  const aliases = multipleModels
    .split(",")
    .map((value) => value.trim())
    .filter(Boolean);
  if (aliases.length === 0) {
    throw new Error("--models requires at least one model alias");
  }
  const seen = new Set<string>();
  const models: EvalModelSpec[] = [];
  for (const alias of aliases) {
    const model = resolveEvalModel(mode, alias);
    if (seen.has(model.id)) {
      continue;
    }
    seen.add(model.id);
    models.push(model);
  }
  return models;
 }
 function formatPercent(value: number): string {
  return `${(value * 100).toFixed(1)}%`;
 }
 void main().catch((error) => {
  const message = error instanceof Error ? error.message : String(error);
  process.stderr.write(`${message}\n`);
  process.exit(1);
 });
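
Note on `--models` handling: the list is split on commas, blank entries are dropped, and aliases that resolve to the same canonical id run only once, in first-seen order. The following is a minimal standalone sketch of that rule, not the code in this commit. `sketchDedupeModels`, `sketchResolveId`, and the alias table are hypothetical stand-ins for `resolveRequestedModels` and the real model registry.

```typescript
// Splits a comma-separated alias list and keeps one entry per canonical id,
// preserving the order in which ids were first requested.
function sketchDedupeModels(list: string, resolveId: (alias: string) => string): string[] {
  const aliases = list
    .split(",")
    .map((value) => value.trim())
    .filter((value) => value.length > 0);
  if (aliases.length === 0) {
    throw new Error("--models requires at least one model alias");
  }
  const seen = new Set<string>();
  const ids: string[] = [];
  for (const alias of aliases) {
    const id = resolveId(alias);
    if (seen.has(id)) {
      continue; // a second alias for the same model is ignored
    }
    seen.add(id);
    ids.push(id);
  }
  return ids;
}

// Hypothetical alias table standing in for the real registry.
const SKETCH_ALIAS_TO_ID = new Map<string, string>([
  ["haiku", "haiku"],
  ["claude-haiku", "haiku"],
  ["4o", "4o"],
  ["gpt-4o", "4o"],
]);

function sketchResolveId(alias: string): string {
  const id = SKETCH_ALIAS_TO_ID.get(alias.toLowerCase());
  if (id === undefined) {
    throw new Error(`Unknown model: ${alias}`);
  }
  return id;
}

// "claude-haiku" duplicates "haiku", and the trailing comma yields an empty entry.
const sketchRunOrder = sketchDedupeModels("haiku, GPT-4o, claude-haiku,", sketchResolveId);
// sketchRunOrder is ["haiku", "4o"]
```

Deduplicating on the resolved id rather than on the raw alias avoids benchmarking the same model twice when a user mixes short and long names.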
--- a/ai_evals/core/cases.ts
+++ b/ai_evals/core/cases.ts
@@ -0,0 +1,71 @@
 import { readFile } from "node:fs/promises";
 import path from "node:path";
 import { fileURLToPath } from "node:url";
 import { parse } from "yaml";
 import type { EvalCase, EvalMode, FlowValidationSpec } from "./types";
 const REPO_ROOT = fileURLToPath(new URL("../../", import.meta.url));
 const CASES_DIR = path.join(REPO_ROOT, "ai_evals", "cases");
 interface RawEvalCase {
  id: string;
  prompt: string;
  initial?: string;
  expected?: string;
  validate?: FlowValidationSpec;
  judgeChecklist?: string[];
 }
 export function getRepoRoot(): string {
  return REPO_ROOT;
 }
 export function getAiEvalsRoot(): string {
  return path.join(REPO_ROOT, "ai_evals");
 }
 export async function loadCases(mode: EvalMode): Promise<EvalCase[]> {
  const filePath = path.join(CASES_DIR, `${mode}.yaml`);
  const raw = await readFile(filePath, "utf8");
  const parsed = parse(raw);
  if (!Array.isArray(parsed)) {
    throw new Error(`Expected ${filePath} to contain a YAML list of cases`);
  }
  return (parsed as RawEvalCase[]).map((entry) => ({
    id: entry.id,
    prompt: entry.prompt,
    initialPath: resolveFixturePath(entry.initial),
    expectedPath: resolveFixturePath(entry.expected),
    validate: entry.validate,
    judgeChecklist: entry.judgeChecklist,
  }));
 }
 export async function loadSelectedCases(
  mode: EvalMode,
  selectedIds: string[]
 ): Promise<EvalCase[]> {
  const allCases = await loadCases(mode);
  if (selectedIds.length === 0) {
    return allCases;
  }
  const caseMap = new Map(allCases.map((entry) => [entry.id, entry]));
  const missing = selectedIds.filter((id) => !caseMap.has(id));
  if (missing.length > 0) {
    throw new Error(
      `Unknown ${mode} case${missing.length === 1 ? "" : "s"}: ${missing.join(", ")}`
    );
  }
  return selectedIds.map((id) => caseMap.get(id)!);
 }
 function resolveFixturePath(value: string | undefined): string | undefined {
  if (!value) {
    return undefined;
  }
  return path.isAbsolute(value) ? value : path.join(REPO_ROOT, value);
 }
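
Note on case selection: `loadSelectedCases` returns every case when no ids are given, otherwise it returns the requested cases in the order the ids were passed, and it reports all unknown ids in one error. The sketch below illustrates that behavior on in-memory data. `SketchCase` and `sketchSelectCases` are hypothetical names, and the real `EvalCase` carries more fields.

```typescript
// Hypothetical minimal case shape for illustration only.
interface SketchCase {
  id: string;
  prompt: string;
}

// Returns the requested cases in request order, or all cases when no ids are given.
function sketchSelectCases(allCases: SketchCase[], selectedIds: string[]): SketchCase[] {
  if (selectedIds.length === 0) {
    return allCases;
  }
  const byId = new Map<string, SketchCase>();
  for (const entry of allCases) {
    byId.set(entry.id, entry);
  }
  const missing = selectedIds.filter((id) => !byId.has(id));
  if (missing.length > 0) {
    // Every unknown id is listed, so a typo-ridden command line fails once, not repeatedly.
    throw new Error(`Unknown case${missing.length === 1 ? "" : "s"}: ${missing.join(", ")}`);
  }
  return selectedIds.map((id) => byId.get(id) as SketchCase);
}

const sketchCases: SketchCase[] = [
  { id: "a", prompt: "first" },
  { id: "b", prompt: "second" },
  { id: "c", prompt: "third" },
];

// Request order wins over file order.
const sketchPicked = sketchSelectCases(sketchCases, ["c", "a"]).map((entry) => entry.id);
// sketchPicked is ["c", "a"]
```

One consequence of mapping over the requested ids is that an id passed twice is also run twice.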
--- a/ai_evals/core/files.ts
+++ b/ai_evals/core/files.ts
@@ -0,0 +1,67 @@
 import { access, copyFile, mkdir, readdir, readFile } from "node:fs/promises";
 import path from "node:path";
 export async function exists(filePath: string): Promise<boolean> {
  try {
    await access(filePath);
    return true;
  } catch {
    return false;
  }
 }
 export async function readJsonFile<T>(filePath: string): Promise<T> {
  const raw = await readFile(filePath, "utf8");
  return JSON.parse(raw) as T;
 }
 export async function readDirectoryFiles(
  rootDir: string,
  options: {
    ignore?: Set<string>;
  } = {}
 ): Promise<Record<string, string>> {
  const files: Record<string, string> = {};
  await walkDirectory(rootDir, "", files, options.ignore ?? new Set());
  return files;
 }
 export async function copyDirectory(sourceDir: string, targetDir: string): Promise<void> {
  const entries = await readdir(sourceDir, { withFileTypes: true });
  await mkdir(targetDir, { recursive: true });
  for (const entry of entries) {
    const sourcePath = path.join(sourceDir, entry.name);
    const targetPath = path.join(targetDir, entry.name);
    if (entry.isDirectory()) {
      await copyDirectory(sourcePath, targetPath);
      continue;
    }
    await mkdir(path.dirname(targetPath), { recursive: true });
    await copyFile(sourcePath, targetPath);
  }
 }
 async function walkDirectory(
  absoluteDir: string,
  relativeDir: string,
  output: Record<string, string>,
  ignore: Set<string>
 ): Promise<void> {
  const entries = await readdir(absoluteDir, { withFileTypes: true });
  for (const entry of entries) {
    const relativePath = relativeDir ? `${relativeDir}/${entry.name}` : entry.name;
    if (ignore.has(relativePath) || ignore.has(entry.name)) {
      continue;
    }
    const absolutePath = path.join(absoluteDir, entry.name);
    if (entry.isDirectory()) {
      await walkDirectory(absolutePath, relativePath, output, ignore);
      continue;
    }
    output[relativePath] = await readFile(absolutePath, "utf8");
  }
 }
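
Note on the `ignore` option of `readDirectoryFiles`: an entry is skipped when either its path relative to the root or its bare name is in the set, and a skipped directory hides everything beneath it. The sketch below models that rule on a flat list of relative paths instead of the file system. `sketchIsIgnored` is a hypothetical helper, not part of this commit.

```typescript
// Walks the path one segment at a time, the way the recursive directory walk would,
// and reports whether any level would have been skipped.
function sketchIsIgnored(relativePath: string, ignore: Set<string>): boolean {
  const segments = relativePath.split("/");
  let prefix = "";
  for (const segment of segments) {
    prefix = prefix ? `${prefix}/${segment}` : segment;
    if (ignore.has(prefix) || ignore.has(segment)) {
      return true;
    }
  }
  return false;
}

// "node_modules" is a bare name and matches at any depth;
// "dist/cache" is a relative path and matches only at that location.
const sketchIgnore = new Set<string>(["node_modules", "dist/cache"]);

const sketchKept = [
  "src/index.ts",
  "node_modules/pkg/index.js",
  "packages/a/node_modules/x.js",
  "dist/cache/blob.bin",
  "dist/main.js",
].filter((entry) => !sketchIsIgnored(entry, sketchIgnore));
// sketchKept is ["src/index.ts", "dist/main.js"]
```

Because bare names match at any depth, an ignore entry such as `results` would also hide a nested directory or file with that name.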
--- a/ai_evals/core/judge.ts
+++ b/ai_evals/core/judge.ts
@@ -0,0 +1,149 @@
 import Anthropic from "@anthropic-ai/sdk";
 import type { EvalMode, JudgeResult } from "./types";
 export const DEFAULT_JUDGE_MODEL = "claude-sonnet-4-6";
 const JUDGE_TOOL_NAME = "submit_judgement";
 export async function judgeOutput(input: {
  mode: EvalMode;
  prompt: string;
  checklist?: string[];
  initial?: unknown;
  expected?: unknown;
  actual: unknown;
  model?: string;
 }): Promise<JudgeResult> {
  const apiKey = process.env.ANTHROPIC_API_KEY;
  if (!apiKey) {
    return {
      success: false,
      score: 0,
      summary: "Judge unavailable",
      error: "ANTHROPIC_API_KEY is not set",
    };
  }
  const client = new Anthropic({ apiKey });
  const model = input.model ?? DEFAULT_JUDGE_MODEL;
  const system = [
    "You evaluate benchmark outputs for Windmill AI generation.",
    "Deterministic checks already run separately. Focus on whether the final output satisfies the user request.",
    "If expected state is provided, treat it as a valid example and reward semantically equivalent outputs.",
    "If a checklist is provided, treat it as the explicit acceptance criteria for this case.",
    "Be strict about missing requested functionality.",
    "When the prompt wording is ambiguous, prefer the checklist over inferred structural requirements.",
    "Do not require exact ids, exact topology, or exact field names unless the prompt, checklist, or expected state clearly requires them.",
    `Always respond by calling the ${JUDGE_TOOL_NAME} tool exactly once.`,
  ].join("\n\n");
  const user = [
    `Mode: ${input.mode}`,
    "",
    "User prompt:",
    input.prompt,
    "",
    "Checklist:",
    formatChecklist(input.checklist),
    "",
    "Initial state:",
    formatJsonBlock(input.initial),
    "",
    "Expected state:",
    formatJsonBlock(input.expected),
    "",
    "Actual result:",
    formatJsonBlock(input.actual),
  ].join("\n");
  try {
    const response = await client.messages.create({
      model,
      max_tokens: 1024,
      temperature: 0,
      system,
      messages: [{ role: "user", content: user }],
      tools: [
        {
          name: JUDGE_TOOL_NAME,
          description: "Submit the benchmark judgement as structured data.",
          input_schema: {
            type: "object",
            properties: {
              score: {
                type: "integer",
                minimum: 0,
                maximum: 100,
              },
              summary: {
                type: "string",
              },
            },
            required: ["score", "summary"],
          },
        },
      ],
      tool_choice: {
        type: "tool",
        name: JUDGE_TOOL_NAME,
        disable_parallel_tool_use: true,
      },
    });
    const toolUseBlock = response.content.find(
      (block): block is Anthropic.ToolUseBlock =>
        block.type === "tool_use" && block.name === JUDGE_TOOL_NAME
    );
    if (!toolUseBlock) {
      return {
        success: false,
        score: 0,
        summary: "Judge returned no tool output",
        error: "Expected structured tool output from judge",
      };
    }
    const parsed = toolUseBlock.input as {
      score: number;
      summary: string;
    };
    return {
      success: true,
      score: normalizeScore(parsed.score),
      summary: parsed.summary,
    };
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    return {
      success: false,
      score: 0,
      summary: "Judge failed",
      error: message,
    };
  }
 }
 function formatJsonBlock(value: unknown): string {
  if (value === undefined) {
    return "(none)";
  }
  return JSON.stringify(value, null, 2);
 }
 function formatChecklist(checklist: string[] | undefined): string {
  if (!checklist || checklist.length === 0) {
    return "(none)";
  }
  return checklist.map((item) => `- ${item}`).join("\n");
 }
 function normalizeScore(value: number): number {
  if (!Number.isFinite(value)) {
    return 0;
  }
  return Math.max(0, Math.min(100, Math.round(value)));
 }
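
Note on judge output handling: the tool input is cast to `{ score, summary }` and the score is then clamped to an integer from 0 to 100. The sketch below repeats the clamping rule and adds a runtime shape check as one possible hardening. The shape check is an assumption and is not in this commit, and `SketchJudgement`, `sketchNormalizeScore`, and `sketchParseJudgement` are hypothetical names.

```typescript
interface SketchJudgement {
  score: number;
  summary: string;
}

// Same rule as normalizeScore in the diff: non-finite becomes 0,
// everything else is rounded and clamped to the 0..100 range.
function sketchNormalizeScore(value: number): number {
  if (!Number.isFinite(value)) {
    return 0;
  }
  return Math.max(0, Math.min(100, Math.round(value)));
}

// Hypothetical hardening: verify the tool input at runtime instead of casting it.
// Returns null when the payload does not have the expected shape.
function sketchParseJudgement(input: unknown): SketchJudgement | null {
  if (typeof input !== "object" || input === null) {
    return null;
  }
  const record = input as Record<string, unknown>;
  const score = record["score"];
  const summary = record["summary"];
  if (typeof score !== "number" || typeof summary !== "string") {
    return null;
  }
  return { score: sketchNormalizeScore(score), summary };
}

const sketchClamped = sketchParseJudgement({ score: 104.6, summary: "meets all checklist items" });
// sketchClamped is { score: 100, summary: "meets all checklist items" }
const sketchRejected = sketchParseJudgement({ score: "90", summary: "score sent as a string" });
// sketchRejected is null
```

Returning `null` lets the caller report a malformed judgement as a judge failure instead of silently scoring it.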
--- a/ai_evals/core/models.test.ts
+++ b/ai_evals/core/models.test.ts
@@ -0,0 +1,29 @@
 import { describe, expect, it } from "bun:test";
 import { resolveEvalModel } from "./models";
 describe("resolveEvalModel", () => {
  it("supports Gemini aliases for frontend evals", () => {
    expect(resolveEvalModel("flow", "gemini").frontend).toEqual({
      provider: "googleai",
      model: "gemini-2.5-flash",
    });
    expect(resolveEvalModel("app", "gemini-pro").frontend).toEqual({
      provider: "googleai",
      model: "gemini-2.5-pro",
    });
    expect(resolveEvalModel("script", "gemini-3-flash-preview").frontend).toEqual({
      provider: "googleai",
      model: "gemini-3-flash-preview",
    });
    expect(resolveEvalModel("flow", "gemini-3.1-pro-preview").frontend).toEqual({
      provider: "googleai",
      model: "gemini-3.1-pro-preview",
    });
  });
  it("rejects Gemini aliases for cli evals", () => {
    expect(() => resolveEvalModel("cli", "gemini")).toThrow(
      "Model gemini-flash is not supported for cli mode"
    );
  });
 });
--- a/ai_evals/core/models.ts
+++ b/ai_evals/core/models.ts
@@ -0,0 +1,185 @@
 import type { EvalMode } from "./types";
 export interface FrontendEvalModelConfig {
  provider: "anthropic" | "openai" | "googleai";
  model: string;
 }
 export interface CliEvalModelConfig {
  provider: "anthropic";
  model: string;
 }
 export interface EvalModelSpec {
  id: string;
  label: string;
  aliases: string[];
  frontend?: FrontendEvalModelConfig;
  cli?: CliEvalModelConfig;
 }
 export const EVAL_MODELS: EvalModelSpec[] = [
  {
    id: "haiku",
    label: "Claude Haiku 4.5",
    aliases: [
      "haiku",
      "haiku-4.5",
      "claude-haiku",
      "claude-haiku-4.5",
      "claude-haiku-4-5",
      "claude-haiku-4-5-20251001",
    ],
    frontend: {
      provider: "anthropic",
      model: "claude-haiku-4-5-20251001",
    },
    cli: {
      provider: "anthropic",
      model: "haiku",
    },
  },
  {
    id: "sonnet",
    label: "Claude Sonnet 4.5",
    aliases: [
      "sonnet",
      "sonnet-4.5",
      "claude-sonnet",
      "claude-sonnet-4.5",
      "claude-sonnet-4-5",
      "claude-sonnet-4-5-20250929",
    ],
    frontend: {
      provider: "anthropic",
      model: "claude-sonnet-4-5-20250929",
    },
    cli: {
      provider: "anthropic",
      model: "sonnet",
    },
  },
  {
    id: "opus",
    label: "Claude Opus 4.6",
    aliases: [
      "opus",
      "opus-4.6",
      "claude-opus",
      "claude-opus-4.6",
      "claude-opus-4-6",
    ],
    frontend: {
      provider: "anthropic",
      model: "claude-opus-4-6",
    },
    cli: {
      provider: "anthropic",
      model: "opus",
    },
  },
  {
    id: "4o",
    label: "GPT-4o",
    aliases: ["4o", "gpt-4o"],
    frontend: {
      provider: "openai",
      model: "gpt-4o",
    },
  },
  {
    id: "gemini-flash",
    label: "Gemini 2.5 Flash",
    aliases: ["gemini", "gemini-flash", "gemini-2.5-flash"],
    frontend: {
      provider: "googleai",
      model: "gemini-2.5-flash",
    },
  },
  {
    id: "gemini-pro",
    label: "Gemini 2.5 Pro",
    aliases: ["gemini-pro", "gemini-2.5-pro"],
    frontend: {
      provider: "googleai",
      model: "gemini-2.5-pro",
    },
  },
  {
    id: "gemini-3-flash-preview",
    label: "Gemini 3 Flash Preview",
    aliases: ["gemini-3-flash-preview", "gemini-3-flash"],
    frontend: {
      provider: "googleai",
      model: "gemini-3-flash-preview",
    },
  },
  {
    id: "gemini-3.1-pro-preview",
    label: "Gemini 3.1 Pro Preview",
    aliases: ["gemini-3.1-pro-preview", "gemini-3.1-pro", "gemini-3-pro-preview"],
    frontend: {
      provider: "googleai",
      model: "gemini-3.1-pro-preview",
    },
  },
 ];
 export function resolveEvalModel(mode: EvalMode, alias?: string): EvalModelSpec {
  const spec = alias ? findEvalModel(alias) : getDefaultEvalModel(mode);
  if (!spec) {
    throw new Error(`Unknown model: ${alias}`);
  }
  if (mode === "cli" && !spec.cli) {
    throw new Error(`Model ${spec.id} is not supported for cli mode`);
  }
  if (mode !== "cli" && !spec.frontend) {
    throw new Error(`Model ${spec.id} is not supported for ${mode} mode`);
  }
  return spec;
 }
 export function getEvalModelHelpText(): string {
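  // Pad to the longest id so the label column stays aligned for long ids such as `gemini-3.1-pro-preview`.
  const idWidth = Math.max(...EVAL_MODELS.map((model) => model.id.length));
  return EVAL_MODELS.map((model) => {
    const modes = [
      ...(model.frontend ? ["flow", "script", "app"] : []),
      ...(model.cli ? ["cli"] : []),
    ];
    return `  ${model.id.padEnd(idWidth)} ${model.label} (${modes.join(", ")})`;
  }).join("\n");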
 }
 export function formatRunModelLabel(mode: EvalMode, model: EvalModelSpec): string {
  if (mode === "cli") {
    return `${model.cli!.provider}:${model.cli!.model}`;
  }
  return `${model.frontend!.provider}:${model.frontend!.model}`;
 }
 export function getFrontendEvalModel(model: EvalModelSpec): FrontendEvalModelConfig {
  if (!model.frontend) {
    throw new Error(`Model ${model.id} does not support frontend evals`);
  }
  return model.frontend;
 }
 export function getCliEvalModel(model: EvalModelSpec): CliEvalModelConfig {
  if (!model.cli) {
    throw new Error(`Model ${model.id} does not support cli evals`);
  }
  return model.cli;
 }
 function getDefaultEvalModel(_mode: EvalMode): EvalModelSpec {
  // Every mode currently defaults to the first registry entry (haiku).
  return EVAL_MODELS[0]!;
 }
 function findEvalModel(alias: string): EvalModelSpec | undefined {
  const normalized = alias.trim().toLowerCase();
  return EVAL_MODELS.find((model) =>
    [model.id, ...model.aliases].some((candidate) => candidate.toLowerCase() === normalized)
  );
 }
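
Note on mode gating: a model entry is usable in `cli` mode only if it has a `cli` config, and in `flow`, `script`, or `app` mode only if it has a `frontend` config. The sketch below shows that check against a hypothetical two-entry registry. `SketchSpec`, `SKETCH_REGISTRY`, and `sketchResolveForMode` are illustrative names, and the config values are simplified to plain strings.

```typescript
type SketchMode = "flow" | "script" | "app" | "cli";

interface SketchSpec {
  id: string;
  aliases: string[];
  frontend?: string; // model name used for flow, script, and app runs
  cli?: string; // model name used for cli runs
}

// Hypothetical registry: one entry supports both families, one is frontend-only.
const SKETCH_REGISTRY: SketchSpec[] = [
  { id: "haiku", aliases: ["claude-haiku"], frontend: "claude-haiku-4-5-20251001", cli: "haiku" },
  { id: "gemini-flash", aliases: ["gemini"], frontend: "gemini-2.5-flash" },
];

function sketchResolveForMode(mode: SketchMode, alias?: string): SketchSpec {
  // Without an alias the first registry entry is the default.
  const spec = alias
    ? SKETCH_REGISTRY.find((entry) =>
        [entry.id, ...entry.aliases].some(
          (candidate) => candidate.toLowerCase() === alias.trim().toLowerCase()
        )
      )
    : SKETCH_REGISTRY[0];
  if (!spec) {
    throw new Error(`Unknown model: ${alias}`);
  }
  const supported = mode === "cli" ? spec.cli !== undefined : spec.frontend !== undefined;
  if (!supported) {
    throw new Error(`Model ${spec.id} is not supported for ${mode} mode`);
  }
  return spec;
}

const sketchFlowModel = sketchResolveForMode("flow", "GEMINI").id;
// sketchFlowModel is "gemini-flash"
const sketchDefaultCliModel = sketchResolveForMode("cli").id;
// sketchDefaultCliModel is "haiku"
```

The error message names the canonical id rather than the alias the user typed, which is why the test in `models.test.ts` expects `gemini-flash` after passing `gemini`.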
--- a/ai_evals/core/results.ts
+++ b/ai_evals/core/results.ts
@@ -0,0 +1,296 @@
 import { appendFile, mkdir, rm, writeFile } from "node:fs/promises";
 import path from "node:path";
 import { execFileSync } from "node:child_process";
 import { getAiEvalsRoot, getRepoRoot } from "./cases";
 import type {
  BenchmarkArtifactFile,
  BenchmarkCaseResult,
  BenchmarkRunResult,
  BenchmarkTokenUsage,
  EvalMode,
 } from "./types";
 export async function writeRunResult(
  result: BenchmarkRunResult,
  outputPath?: string
 ): Promise<string> {
  const targetPath = resolveRunOutputPath(result.mode, outputPath);
  await mkdir(path.dirname(targetPath), { recursive: true });
  await writeFile(targetPath, JSON.stringify(toSerializableRunResult(result), null, 2) + "\n", "utf8");
  return targetPath;
 }
 export async function appendHistoryRecord(
  result: BenchmarkRunResult,
  historyPath = resolveHistoryPath(result.mode)
 ): Promise<string> {
  await mkdir(path.dirname(historyPath), { recursive: true });
  await appendFile(historyPath, JSON.stringify(toHistoryRecord(result)) + "\n", "utf8");
  return historyPath;
 }
 export async function writeRunArtifacts(
  result: BenchmarkRunResult,
  outputPath?: string
 ): Promise<string | null> {
  const targetPath = resolveRunOutputPath(result.mode, outputPath);
  const artifactRoot = defaultArtifactsRoot(targetPath);
  await rm(artifactRoot, { recursive: true, force: true });
  let wroteArtifacts = false;
  for (const caseResult of result.cases) {
    for (const attempt of caseResult.attempts) {
      const artifactFiles = attempt.artifactFiles ?? [];
      if (artifactFiles.length === 0) {
        attempt.artifactsPath = null;
        continue;
      }
      const attemptDir = path.join(artifactRoot, caseResult.id, `attempt-${attempt.attempt}`);
      await writeArtifactFiles(attemptDir, artifactFiles);
      attempt.artifactsPath = attemptDir;
      wroteArtifacts = true;
    }
  }
  result.artifactsPath = wroteArtifacts ? artifactRoot : null;
  return result.artifactsPath ?? null;
 }
 export function buildRunResult(input: {
  mode: EvalMode;
  runs: number;
  runModel: string | null;
  judgeModel: string | null;
  caseResults: BenchmarkCaseResult[];
 }): BenchmarkRunResult {
  const attemptCount = input.caseResults.reduce((sum, entry) => sum + entry.attempts.length, 0);
  const passedAttempts = input.caseResults.reduce(
    (sum, entry) => sum + entry.attempts.filter((attempt) => attempt.passed).length,
    0
  );
  const durationTotal = input.caseResults.reduce(
    (sum, entry) => sum + entry.attempts.reduce((inner, attempt) => inner + attempt.durationMs, 0),
    0
  );
  const tokenUsageTotal = input.caseResults.reduce<BenchmarkTokenUsage | null>(
    (sum, entry) => {
      for (const attempt of entry.attempts) {
        if (!attempt.tokenUsage) {
          continue;
        }
        sum ??= { prompt: 0, completion: 0, total: 0 };
        sum.prompt += attempt.tokenUsage.prompt;
        sum.completion += attempt.tokenUsage.completion;
        sum.total += attempt.tokenUsage.total;
      }
      return sum;
    },
    null
  );
  return {
    version: 1,
    mode: input.mode,
    createdAt: new Date().toISOString(),
    gitSha: getGitSha(),
    runs: input.runs,
    runModel: input.runModel,
    judgeModel: input.judgeModel,
    caseCount: input.caseResults.length,
    attemptCount,
    passedAttempts,
    passRate: attemptCount === 0 ? 0 : passedAttempts / attemptCount,
    averageDurationMs: attemptCount === 0 ? 0 : durationTotal / attemptCount,
    totalTokenUsage: tokenUsageTotal,
    averageTokenUsagePerAttempt:
      attemptCount === 0 || !tokenUsageTotal
        ? null
        : {
            prompt: tokenUsageTotal.prompt / attemptCount,
            completion: tokenUsageTotal.completion / attemptCount,
            total: tokenUsageTotal.total / attemptCount,
          },
    cases: input.caseResults,
  };
 }
 // Renders a short human-readable summary; at most the first 10 failures are listed.
 export function formatRunSummary(result: BenchmarkRunResult): string {
  const lines = [
    `${result.mode} benchmark complete`,
    `Pass rate: ${formatPercent(result.passRate)} (${result.passedAttempts}/${result.attemptCount})`,
    `Average duration: ${Math.round(result.averageDurationMs)}ms`,
  ];
  const failures = collectFailures(result);
  if (failures.length > 0) {
    lines.push("Failures:");
    for (const entry of failures.slice(0, 10)) {
      lines.push(`- ${entry}`);
    }
  }
  return lines.join("\n");
 }
 function collectFailures(result: BenchmarkRunResult): string[] {
  const failures: string[] = [];
  for (const caseResult of result.cases) {
    for (const attempt of caseResult.attempts) {
      if (attempt.passed) {
        continue;
      }
      const failedChecks = attempt.checks.filter((check) => !check.passed).map((check) => check.name);
      failures.push(
        `${caseResult.id} attempt ${attempt.attempt}: ${failedChecks.join(", ") || attempt.error || "failed"}`
      );
    }
  }
  return failures;
 }
 function defaultFileName(mode: EvalMode): string {
  return `${new Date().toISOString().replaceAll(":", "-")}__${mode}.json`;
 }
 export function resolveRunOutputPath(mode: EvalMode, outputPath?: string): string {
  return outputPath ?? path.join(getAiEvalsRoot(), "results", defaultFileName(mode));
 }
 export function resolveHistoryPath(mode: EvalMode): string {
  return path.join(getAiEvalsRoot(), "history", `${mode}.jsonl`);
 }
 function defaultArtifactsRoot(resultPath: string): string {
  return resultPath.endsWith(".json")
    ? resultPath.slice(0, -".json".length)
    : `${resultPath}.artifacts`;
 }
 async function writeArtifactFiles(
  rootDir: string,
  files: BenchmarkArtifactFile[]
 ): Promise<void> {
  for (const file of files) {
    const relativePath = normalizeArtifactPath(file.path);
    const targetPath = path.join(rootDir, relativePath);
    await mkdir(path.dirname(targetPath), { recursive: true });
    await writeFile(targetPath, file.content, "utf8");
  }
 }
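 // Converts an artifact path to a relative, forward-slash path and rejects "." and ".."
 // segments so that written files cannot escape the artifact root.
 function normalizeArtifactPath(filePath: string): string {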
  const normalized = filePath.replaceAll("\\", "/").replace(/^\/+/, "");
  const parts = normalized.split("/").filter(Boolean);
  if (parts.length === 0 || parts.some((part) => part === "." || part === "..")) {
    throw new Error(`Invalid artifact path: ${filePath}`);
  }
  return parts.join("/");
 }
 // Returns a copy of the result without the in-memory `artifactFiles` payload on each attempt.
 function toSerializableRunResult(result: BenchmarkRunResult): BenchmarkRunResult {
  return {
    ...result,
    cases: result.cases.map((caseResult) => ({
      ...caseResult,
      attempts: caseResult.attempts.map(({ artifactFiles, ...attempt }) => attempt),
    })),
  };
 }
 function toHistoryRecord(result: BenchmarkRunResult) {
  const judgeScores = result.cases.flatMap((caseResult) =>
    caseResult.attempts.flatMap((attempt) =>
      typeof attempt.judgeScore === "number" ? [attempt.judgeScore] : []
    )
  );
  return {
    createdAt: result.createdAt,
    gitSha: result.gitSha,
    mode: result.mode,
    runs: result.runs,
    runModel: result.runModel,
    judgeModel: result.judgeModel,
    caseCount: result.caseCount,
    attemptCount: result.attemptCount,
    passedAttempts: result.passedAttempts,
    passRate: result.passRate,
    averageDurationMs: result.averageDurationMs,
    averageJudgeScore:
      judgeScores.length === 0
        ? null
        : judgeScores.reduce((sum, score) => sum + score, 0) / judgeScores.length,
    averageTokenUsagePerAttempt: result.averageTokenUsagePerAttempt ?? null,
    failedCaseIds: Array.from(
      new Set(
        result.cases
          .filter((caseResult) => caseResult.attempts.some((attempt) => !attempt.passed))
          .map((caseResult) => caseResult.id)
      )
    ),
    cases: result.cases.map((caseResult) => {
      const attemptCount = caseResult.attempts.length;
      const passedAttempts = caseResult.attempts.filter((attempt) => attempt.passed).length;
      const totalDurationMs = caseResult.attempts.reduce(
        (sum, attempt) => sum + attempt.durationMs,
        0
      );
      const judgeScores = caseResult.attempts.flatMap((attempt) =>
        typeof attempt.judgeScore === "number" ? [attempt.judgeScore] : []
      );
      const totalTokenUsage = caseResult.attempts.reduce<BenchmarkTokenUsage | null>(
        (sum, attempt) => {
          if (!attempt.tokenUsage) {
            return sum;
          }
          sum ??= { prompt: 0, completion: 0, total: 0 };
          sum.prompt += attempt.tokenUsage.prompt;
          sum.completion += attempt.tokenUsage.completion;
          sum.total += attempt.tokenUsage.total;
          return sum;
        },
        null
      );
      return {
        id: caseResult.id,
        attemptCount,
        passedAttempts,
        passRate: attemptCount === 0 ? 0 : passedAttempts / attemptCount,
        averageDurationMs: attemptCount === 0 ? 0 : totalDurationMs / attemptCount,
        averageJudgeScore:
          judgeScores.length === 0
            ? null
            : judgeScores.reduce((sum, score) => sum + score, 0) / judgeScores.length,
        averageTokenUsagePerAttempt:
          attemptCount === 0 || !totalTokenUsage
            ? null
            : {
                prompt: totalTokenUsage.prompt / attemptCount,
                completion: totalTokenUsage.completion / attemptCount,
                total: totalTokenUsage.total / attemptCount,
              },
      };
    }),
  };
 }
 function getGitSha(): string | null {
  try {
    return execFileSync("git", ["rev-parse", "HEAD"], {
      cwd: getRepoRoot(),
      encoding: "utf8",
      stdio: ["ignore", "pipe", "ignore"],
    }).trim();
  } catch {
    return null;
  }
 }
 function formatPercent(value: number): string {
  return `${(value * 100).toFixed(1)}%`;
 }
--- a/ai_evals/core/runSuite.ts
+++ b/ai_evals/core/runSuite.ts
@@ -0,0 +1,264 @@
 import { judgeOutput, DEFAULT_JUDGE_MODEL } from "./judge";
 import type {
  BenchmarkAttemptResult,
  BenchmarkCaseResult,
  BenchmarkCheck,
  EvalCase,
  FrontendBenchmarkProgressEvent,
  ModeRunner,
 } from "./types";
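 // Runs every case `runs` times. Workers pull case indexes from a shared cursor, and each
 // result is stored at its case index, so the output keeps the input case order.
 export async function runSuite<TInitial, TExpected, TActual>(input: {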
  modeRunner: ModeRunner<TInitial, TExpected, TActual>;
  cases: EvalCase[];
  runs: number;
  runModel: string | null;
  judgeModel?: string | null;
  concurrency?: number;
  verbose?: boolean;
  onProgress?: (event: FrontendBenchmarkProgressEvent) => void;
 }): Promise<BenchmarkCaseResult[]> {
  const judgeModel = input.judgeModel ?? DEFAULT_JUDGE_MODEL;
  const concurrency = Math.max(1, input.concurrency ?? input.modeRunner.concurrency);
  const results = new Array<BenchmarkCaseResult>(input.cases.length);
  let cursor = 0;
  if (input.modeRunner.mode !== "cli") {
    input.onProgress?.({
      type: "run-start",
      surface: input.modeRunner.mode,
      totalCases: input.cases.length,
      runs: input.runs,
      concurrency,
    });
  }
  async function worker(): Promise<void> {
    while (true) {
      const caseIndex = cursor++;
      if (caseIndex >= input.cases.length) {
        return;
      }
      const evalCase = input.cases[caseIndex];
      results[caseIndex] = {
        id: evalCase.id,
        prompt: evalCase.prompt,
        initialPath: evalCase.initialPath,
        expectedPath: evalCase.expectedPath,
        attempts: await runCaseAttempts({
          caseIndex,
          evalCase,
          runs: input.runs,
          judgeModel,
          judgeThreshold: input.modeRunner.judgeThreshold ?? 80,
          modeRunner: input.modeRunner,
          totalCases: input.cases.length,
          verbose: input.verbose ?? false,
          onProgress: input.onProgress,
        }),
      };
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(concurrency, input.cases.length) }, () => worker())
  );
  return results;
 }
 async function runCaseAttempts<TInitial, TExpected, TActual>(input: {
  caseIndex: number;
  evalCase: EvalCase;
  runs: number;
  judgeModel: string;
  judgeThreshold: number;
  modeRunner: ModeRunner<TInitial, TExpected, TActual>;
  totalCases: number;
  verbose: boolean;
  onProgress?: (event: FrontendBenchmarkProgressEvent) => void;
 }): Promise<BenchmarkAttemptResult[]> {
  const attempts: BenchmarkAttemptResult[] = [];
  const surface = input.modeRunner.mode === "cli" ? null : input.modeRunner.mode;
  for (let attempt = 1; attempt <= input.runs; attempt += 1) {
    if (surface) {
      input.onProgress?.({
        type: "attempt-start",
        surface,
        caseId: input.evalCase.id,
        caseNumber: input.caseIndex + 1,
        totalCases: input.totalCases,
        attempt,
        runs: input.runs,
      });
    }
    const startedAt = Date.now();
    try {
      const initial = await input.modeRunner.loadInitial(input.evalCase.initialPath);
      const expected = await input.modeRunner.loadExpected(input.evalCase.expectedPath);
      const run = await input.modeRunner.run(input.evalCase.prompt, initial, {
        caseId: input.evalCase.id,
        caseNumber: input.caseIndex + 1,
        totalCases: input.totalCases,
        attempt,
        runs: input.runs,
        verbose: input.verbose,
        onAssistantMessageStart: input.verbose && surface
          ? () =>
              input.onProgress?.({
                type: "assistant-message-start",
                surface,
                caseId: input.evalCase.id,
                caseNumber: input.caseIndex + 1,
                totalCases: input.totalCases,
                attempt,
                runs: input.runs,
              })
          : undefined,
        onAssistantChunk: input.verbose && surface
          ? (chunk: string) =>
              input.onProgress?.({
                type: "assistant-chunk",
                surface,
                caseId: input.evalCase.id,
                caseNumber: input.caseIndex + 1,
                totalCases: input.totalCases,
                attempt,
                runs: input.runs,
                chunk,
              })
          : undefined,
        onAssistantMessageEnd: input.verbose && surface
          ? () =>
              input.onProgress?.({
                type: "assistant-message-end",
                surface,
                caseId: input.evalCase.id,
                caseNumber: input.caseIndex + 1,
                totalCases: input.totalCases,
                attempt,
                runs: input.runs,
              })
          : undefined,
      });
      const checks: BenchmarkCheck[] = [
        buildCheck("run succeeded", run.success, run.error),
        ...input.modeRunner.validate({
          evalCase: input.evalCase,
          prompt: input.evalCase.prompt,
          initial,
          expected,
          actual: run.actual,
          run,
        }),
      ];
      let judgeScore: number | null = null;
      let judgeSummary: string | null = null;
      if (run.success) {
        const judge = await judgeOutput({
          mode: input.modeRunner.mode,
          prompt: input.evalCase.prompt,
          checklist: input.evalCase.judgeChecklist,
          initial,
          expected: input.modeRunner.mode === "cli" ? undefined : expected,
          actual: run.actual,
          model: input.judgeModel,
        });
        judgeScore = judge.success ? judge.score : null;
        judgeSummary = judge.summary;
        checks.push(buildCheck("judge succeeded", judge.success, judge.error));
        checks.push(
          buildCheck(
            `judge score >= ${input.judgeThreshold}`,
            (judgeScore ?? 0) >= input.judgeThreshold,
            judge.success ? `score=${judgeScore}` : judge.error
          )
        );
      }
      const artifactFiles = input.modeRunner.buildArtifacts?.(run.actual) ?? [];
      const attemptResult: BenchmarkAttemptResult = {
        attempt,
        passed: checks.every((check) => check.passed),
        durationMs: Date.now() - startedAt,
        assistantMessageCount: run.assistantMessageCount,
        toolCallCount: run.toolCallCount,
        toolsUsed: uniqueStrings(run.toolsUsed),
        skillsInvoked: uniqueStrings(run.skillsInvoked),
        checks,
        judgeScore,
        judgeSummary,
        error: run.error ?? null,
        tokenUsage: run.tokenUsage ?? null,
        artifactsPath: null,
        artifactFiles,
      };
      if (surface) {
        input.onProgress?.({
          type: "attempt-finish",
          surface,
          caseId: input.evalCase.id,
          caseNumber: input.caseIndex + 1,
          totalCases: input.totalCases,
          attempt,
          runs: input.runs,
          passed: attemptResult.passed,
          durationMs: attemptResult.durationMs,
          judgeScore: attemptResult.judgeScore,
          error: attemptResult.error,
        });
      }
      attempts.push(attemptResult);
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      const failedAttempt: BenchmarkAttemptResult = {
        attempt,
        passed: false,
        durationMs: Date.now() - startedAt,
        assistantMessageCount: 0,
        toolCallCount: 0,
        toolsUsed: [],
        skillsInvoked: [],
        checks: [buildCheck("run crashed", false, message)],
        judgeScore: null,
        judgeSummary: null,
        error: message,
        tokenUsage: null,
      };
      if (surface) {
        input.onProgress?.({
          type: "attempt-finish",
          surface,
          caseId: input.evalCase.id,
          caseNumber: input.caseIndex + 1,
          totalCases: input.totalCases,
          attempt,
          runs: input.runs,
          passed: false,
          durationMs: failedAttempt.durationMs,
          judgeScore: null,
          error: message,
        });
      }
      attempts.push(failedAttempt);
    }
  }
  return attempts;
 }
 // Unlike `check` in validators.ts, this keeps `details` on passing checks too (e.g. `score=...`).
 function buildCheck(name: string, passed: boolean, details?: string): BenchmarkCheck {
  return details ? { name, passed, details } : { name, passed };
 }
 function uniqueStrings(values: string[]): string[] {
  return [...new Set(values)];
 }
--- a/ai_evals/core/types.ts
+++ b/ai_evals/core/types.ts
@@ -0,0 +1,198 @@
 export const EVAL_MODES = ["cli", "flow", "script", "app"] as const;
 export type EvalMode = (typeof EVAL_MODES)[number];
 export interface FlowValidationSpec {
  schemaRequiredPaths?: string[];
  schemaAnyOf?: Array<{
    requiredPaths: string[];
  }>;
  resolveResultsRefs?: boolean;
  requireSpecialModules?: Array<"preprocessor_module" | "failure_module">;
  requireSuspendSteps?: Array<{
    id: string;
    requiredEvents?: number;
    resumeRequiredStringFieldAnyOf?: string[];
  }>;
 }
 export interface EvalCase {
  id: string;
  prompt: string;
  initialPath?: string;
  expectedPath?: string;
  validate?: FlowValidationSpec;
  judgeChecklist?: string[];
 }
 export interface BenchmarkCheck {
  name: string;
  passed: boolean;
  details?: string;
 }
 export interface JudgeResult {
  success: boolean;
  score: number;
  summary: string;
  error?: string;
 }
 export interface BenchmarkArtifactFile {
  path: string;
  content: string;
 }
 export interface BenchmarkTokenUsage {
  prompt: number;
  completion: number;
  total: number;
 }
 export interface ModeRunOutput<TActual> {
  success: boolean;
  actual: TActual;
  error?: string;
  assistantMessageCount: number;
  toolCallCount: number;
  toolsUsed: string[];
  skillsInvoked: string[];
  tokenUsage?: BenchmarkTokenUsage | null;
 }
 export interface ModeRunContext {
  caseId: string;
  caseNumber: number;
  totalCases: number;
  attempt: number;
  runs: number;
  verbose: boolean;
  onAssistantMessageStart?: () => void;
  onAssistantChunk?: (chunk: string) => void;
  onAssistantMessageEnd?: () => void;
 }
 export interface ModeRunner<TInitial, TExpected, TActual> {
  mode: EvalMode;
  concurrency: number;
  judgeThreshold?: number;
  loadInitial(path?: string): Promise<TInitial | undefined>;
  loadExpected(path?: string): Promise<TExpected | undefined>;
  run(
    prompt: string,
    initial: TInitial | undefined,
    context: ModeRunContext
  ): Promise<ModeRunOutput<TActual>>;
  validate(input: {
    evalCase: EvalCase;
    prompt: string;
    initial: TInitial | undefined;
    expected: TExpected | undefined;
    actual: TActual;
    run: ModeRunOutput<TActual>;
  }): BenchmarkCheck[];
  buildArtifacts?(actual: TActual): BenchmarkArtifactFile[];
 }
 export interface BenchmarkAttemptResult {
  attempt: number;
  passed: boolean;
  durationMs: number;
  assistantMessageCount: number;
  toolCallCount: number;
  toolsUsed: string[];
  skillsInvoked: string[];
  checks: BenchmarkCheck[];
  judgeScore: number | null;
  judgeSummary: string | null;
  error: string | null;
  tokenUsage?: BenchmarkTokenUsage | null;
  artifactsPath?: string | null;
  artifactFiles?: BenchmarkArtifactFile[];
 }
 export interface BenchmarkCaseResult {
  id: string;
  prompt: string;
  initialPath?: string;
  expectedPath?: string;
  attempts: BenchmarkAttemptResult[];
 }
 export interface BenchmarkRunResult {
  version: 1;
  mode: EvalMode;
  createdAt: string;
  gitSha: string | null;
  runs: number;
  runModel: string | null;
  judgeModel: string | null;
  caseCount: number;
  attemptCount: number;
  passedAttempts: number;
  passRate: number;
  averageDurationMs: number;
  totalTokenUsage?: BenchmarkTokenUsage | null;
  averageTokenUsagePerAttempt?: BenchmarkTokenUsage | null;
  artifactsPath?: string | null;
  cases: BenchmarkCaseResult[];
 }
 export type FrontendBenchmarkProgressEvent =
  | {
      type: "run-start";
      surface: Exclude<EvalMode, "cli">;
      totalCases: number;
      runs: number;
      concurrency: number;
    }
  | {
      type: "attempt-start";
      surface: Exclude<EvalMode, "cli">;
      caseId: string;
      caseNumber: number;
      totalCases: number;
      attempt: number;
      runs: number;
    }
  | {
      type: "attempt-finish";
      surface: Exclude<EvalMode, "cli">;
      caseId: string;
      caseNumber: number;
      totalCases: number;
      attempt: number;
      runs: number;
      passed: boolean;
      durationMs: number;
      judgeScore: number | null;
      error: string | null;
    }
  | {
      type: "assistant-message-start";
      surface: Exclude<EvalMode, "cli">;
      caseId: string;
      caseNumber: number;
      totalCases: number;
      attempt: number;
      runs: number;
    }
  | {
      type: "assistant-chunk";
      surface: Exclude<EvalMode, "cli">;
      caseId: string;
      caseNumber: number;
      totalCases: number;
      attempt: number;
      runs: number;
      chunk: string;
    }
  | {
      type: "assistant-message-end";
      surface: Exclude<EvalMode, "cli">;
      caseId: string;
      caseNumber: number;
      totalCases: number;
      attempt: number;
      runs: number;
    };
--- a/ai_evals/core/validators.test.ts
+++ b/ai_evals/core/validators.test.ts
@@ -0,0 +1,36 @@
 import { describe, expect, it } from "bun:test";
 import { validateScriptState } from "./validators";
 describe("validateScriptState", () => {
  it("accepts semantically equivalent script implementations", () => {
    const checks = validateScriptState({
      actual: {
        path: "f/evals/greet_user.ts",
        lang: "bun",
        code: "export async function main(name: string): Promise<string> {\n  return `Hello, ${name}!`;\n}\n",
      },
      expected: {
        path: "f/evals/greet_user.ts",
        lang: "bun",
        code: "export async function main(name: string) {\n\treturn `Hello, ${name}!`\n}\n",
      },
    });
    expect(checks.every((check) => check.passed)).toBe(true);
  });
  it("still requires an exported main entrypoint", () => {
    const checks = validateScriptState({
      actual: {
        path: "f/evals/greet_user.ts",
        lang: "bun",
        code: "async function main(name: string) {\n  return `Hello, ${name}!`;\n}\n",
      },
    });
    expect(checks).toContainEqual({
      name: "script exports entrypoint",
      passed: false,
    });
  });
 });
--- a/ai_evals/core/validators.ts
+++ b/ai_evals/core/validators.ts
@@ -0,0 +1,997 @@
 import path from "node:path";
 import ts from "typescript";
 import type { BenchmarkCheck, FlowValidationSpec } from "./types";
 export interface ScriptState {
  path: string;
  lang: string;
  args?: Record<string, unknown>;
  code: string;
 }
 export interface FlowState {
  summary?: string;
  value?: {
    preprocessor_module?: Record<string, unknown>;
    failure_module?: Record<string, unknown>;
    modules?: Array<Record<string, unknown>>;
    [key: string]: unknown;
  };
  schema?: Record<string, unknown>;
 }
 export interface AppFilesState {
  frontend: Record<string, string>;
  backend: Record<string, AppRunnableState>;
 }
 export interface AppRunnableState {
  type?: string;
  name?: string;
  path?: string;
  inlineScript?: {
    language?: string;
    content?: string;
  };
 }
 const TS_LIKE_LANGUAGES = new Set(["bun", "deno", "nativets", "bunnative", "ts", "typescript"]);
 const CONTROL_FLOW_MODULE_TYPES = new Set(["branchone", "branchall", "forloopflow", "whileloopflow"]);
 export function validateScriptState(input: {
  actual: ScriptState;
  initial?: ScriptState;
  expected?: ScriptState;
 }): BenchmarkCheck[] {
  const checks: BenchmarkCheck[] = [
    check("script exports entrypoint", hasSupportedEntrypoint(input.actual.code)),
    check("script has no syntax errors", getScriptSyntaxErrors(input.actual.code, input.actual.lang).length === 0),
  ];
  if (input.expected) {
    checks.push(
      check(
        "script path matches expected",
        input.actual.path === input.expected.path,
        `expected ${input.expected.path}, got ${input.actual.path}`
      )
    );
    checks.push(
      check(
        "script language matches expected",
        input.actual.lang === input.expected.lang,
        `expected ${input.expected.lang}, got ${input.actual.lang}`
      )
    );
  }
  if (input.initial) {
    checks.push(
      check(
        "script differs from initial",
        normalizeText(input.actual.code) !== normalizeText(input.initial.code)
      )
    );
  }
  return checks;
 }
 export function validateFlowState(input: {
  actual: FlowState;
  initial?: FlowState;
  expected?: FlowState;
  validate?: FlowValidationSpec;
 }): BenchmarkCheck[] {
  const actualModules = getFlowModules(input.actual);
  const placeholderModuleIds = getInlineScriptPlaceholderModuleIds(input.actual);
  const checks: BenchmarkCheck[] = [
    check("flow has modules", actualModules.length > 0),
    check(
      "flow has no inline placeholder code",
      placeholderModuleIds.length === 0,
      placeholderModuleIds.length > 0
        ? `placeholder content in: ${placeholderModuleIds.join(", ")}`
        : undefined
    ),
  ];
  if (input.initial) {
    checks.push(
      check(
        "flow differs from initial",
        normalizeJson(input.actual) !== normalizeJson(input.initial)
      )
    );
  }
  if (input.expected) {
    checks.push(...validateFlowExpectedStructure(input.actual, input.expected));
  }
  if (input.validate) {
    checks.push(...validateFlowRequirements(input.actual, input.validate));
  }
  return checks;
 }
 export function validateAppState(input: {
  actual: AppFilesState;
  initial?: AppFilesState;
  expected?: AppFilesState;
 }): BenchmarkCheck[] {
  const checks: BenchmarkCheck[] = [];
  const frontendEntries = Object.entries(input.actual.frontend ?? {});
  const backendEntries = Object.entries(input.actual.backend ?? {});
  const frontendSyntaxProblems = getAppFrontendSyntaxProblems(input.actual.frontend);
  const backendSyntaxProblems = getAppBackendSyntaxProblems(input.actual.backend);
  const unresolvedBackendRefs = getUnresolvedBackendReferences(
    input.actual.frontend,
    input.actual.backend
  );
  checks.push(check("app has frontend entrypoint", Boolean(input.actual.frontend["/index.tsx"])));
  checks.push(
    check(
      "app has non-empty frontend files",
      frontendEntries.some(([, content]) => content.trim().length > 0)
    )
  );
  checks.push(
    check(
      "frontend files have no syntax errors",
      frontendSyntaxProblems.length === 0,
      summarizeProblems(frontendSyntaxProblems)
    )
  );
  checks.push(
    check(
      "backend inline scripts have entrypoints",
      backendEntries.every(([, runnable]) => {
        if (runnable.type !== "inline") {
          return true;
        }
        return hasSupportedEntrypoint(runnable.inlineScript?.content ?? "");
      })
    )
  );
  checks.push(
    check(
      "backend inline scripts have no syntax errors",
      backendSyntaxProblems.length === 0,
      summarizeProblems(backendSyntaxProblems)
    )
  );
  checks.push(
    check(
      "frontend backend references resolve",
      unresolvedBackendRefs.length === 0,
      summarizeProblems(unresolvedBackendRefs)
    )
  );
  if (input.initial) {
    checks.push(check("app differs from initial", !appStatesEqual(input.actual, input.initial)));
  }
  if (input.expected) {
    for (const [filePath, content] of Object.entries(input.expected.frontend)) {
      checks.push(
        check(
          `frontend includes ${filePath}`,
          normalizeText(input.actual.frontend[filePath] ?? "") === normalizeText(content)
        )
      );
    }
    for (const [runnableName, runnable] of Object.entries(input.expected.backend)) {
      const actualRunnable = input.actual.backend[runnableName];
      checks.push(check(`backend includes ${runnableName}`, Boolean(actualRunnable)));
      if (actualRunnable && runnable.inlineScript?.content) {
        checks.push(
          check(
            `${runnableName} code matches expected`,
            normalizeText(actualRunnable.inlineScript?.content ?? "") ===
              normalizeText(runnable.inlineScript.content)
          )
        );
      }
    }
  }
  return checks;
 }
 export function validateCliWorkspace(input: {
  actualFiles: Record<string, string>;
  expectedFiles?: Record<string, string>;
  initialFiles?: Record<string, string>;
 }): BenchmarkCheck[] {
  const checks: BenchmarkCheck[] = [];
  if (input.expectedFiles) {
    for (const [filePath, expectedContent] of Object.entries(input.expectedFiles)) {
      const actualContent = input.actualFiles[filePath];
      checks.push(check(`creates ${filePath}`, actualContent !== undefined));
      if (actualContent !== undefined) {
        checks.push(
          check(
            `${filePath} contains expected content`,
            cliFileContainsExpectedContent(actualContent, expectedContent)
          )
        );
      }
    }
    const expectedPaths = new Set(Object.keys(input.expectedFiles));
    const unexpectedPaths = Object.keys(input.actualFiles).filter((filePath) => !expectedPaths.has(filePath));
    checks.push(
      check(
        "workspace contains no unexpected files",
        unexpectedPaths.length === 0,
        summarizeProblems(unexpectedPaths)
      )
    );
  }
  if (input.initialFiles) {
    checks.push(check("workspace differs from initial", !fileMapsEqual(input.actualFiles, input.initialFiles)));
  }
  return checks;
 }
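 // Treats each non-empty, trimmed line of the expected file as a snippet that must appear
 // somewhere in the actual file. Line order and surrounding content are not checked.
 function cliFileContainsExpectedContent(actualContent: string, expectedContent: string): boolean {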
  const expectedSnippets = expectedContent
    .replace(/\r\n/g, "\n")
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  if (expectedSnippets.length === 0) {
    return true;
  }
  const normalizedActual = actualContent.replace(/\r\n/g, "\n");
  return expectedSnippets.every((snippet) => normalizedActual.includes(snippet));
 }
 // Builds a check result; `details` are attached only when the check failed.
 function check(name: string, passed: boolean, details?: string): BenchmarkCheck {
  return !passed && details ? { name, passed, details } : { name, passed };
 }
 function normalizeText(value: string): string {
  return value.replace(/\r\n/g, "\n").trim();
 }
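 function normalizeJson(value: unknown): string {
  // Sort object keys so that key order alone never registers as a difference.
  const sortKeys = (_key: string, entry: unknown) =>
    entry && typeof entry === "object" && !Array.isArray(entry)
      ? Object.fromEntries(Object.entries(entry).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0)))
      : entry;
  return JSON.stringify(value, sortKeys);
 }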
 function summarizeProblems(problems: string[], limit = 5): string | undefined {
  if (problems.length === 0) {
    return undefined;
  }
  if (problems.length <= limit) {
    return problems.join("; ");
  }
  return `${problems.slice(0, limit).join("; ")}; ...and ${problems.length - limit} more`;
 }
 // Accepts `export [async] function main(` or an anonymous `export default [async] function (`.
 function hasSupportedEntrypoint(code: string): boolean {
  return (
    /export\s+(async\s+)?function\s+main\s*\(/.test(code) ||
    /export\s+default\s+(async\s+)?function\s*\(/.test(code)
  );
 }
 // Only TypeScript-like languages are syntax-checked; any other language yields no errors.
 function getScriptSyntaxErrors(code: string, lang: string): string[] {
  if (!TS_LIKE_LANGUAGES.has(lang)) {
    return [];
  }
  return getTypeScriptSyntaxErrors(code, "eval.ts");
 }
 // Relies on `ts.transpileModule`, which reports syntactic diagnostics only (no type checking).
 function getTypeScriptSyntaxErrors(code: string, fileName: string): string[] {
  const result = ts.transpileModule(code, {
    compilerOptions: {
      target: ts.ScriptTarget.ES2022,
      module: ts.ModuleKind.ESNext,
      jsx: ts.JsxEmit.ReactJSX,
    },
    reportDiagnostics: true,
    fileName,
  });
  return (result.diagnostics ?? []).map((diagnostic) =>
    ts.flattenDiagnosticMessageText(diagnostic.messageText, "\n")
  );
 }
 function getAppFrontendSyntaxProblems(frontend: Record<string, string>): string[] {
  const problems: string[] = [];
  for (const [filePath, content] of Object.entries(frontend)) {
    if (!isFrontendCodeFile(filePath)) {
      continue;
    }
    const errors = getTypeScriptSyntaxErrors(content, filePath);
    for (const error of errors) {
      problems.push(`${filePath}: ${error}`);
    }
  }
  return problems;
 }
 function getAppBackendSyntaxProblems(backend: Record<string, AppRunnableState>): string[] {
  const problems: string[] = [];
  for (const [key, runnable] of Object.entries(backend)) {
    if (runnable.type !== "inline") {
      continue;
    }
    const language = runnable.inlineScript?.language ?? "";
    const content = runnable.inlineScript?.content ?? "";
    for (const error of getScriptSyntaxErrors(content, language)) {
      problems.push(`${key}: ${error}`);
    }
  }
  return problems;
 }
 function isFrontendCodeFile(filePath: string): boolean {
  const extension = path.extname(filePath).toLowerCase();
  return extension === ".ts" || extension === ".tsx" || extension === ".js" || extension === ".jsx";
 }
 function getUnresolvedBackendReferences(
  frontend: Record<string, string>,
  backend: Record<string, AppRunnableState>
 ): string[] {
  const backendKeys = new Set(Object.keys(backend));
  const unresolved = new Set<string>();
  for (const [filePath, content] of Object.entries(frontend)) {
    for (const key of extractBackendCallKeys(content)) {
      if (!backendKeys.has(key)) {
        unresolved.add(`${filePath} references missing backend.${key}()`);
      }
    }
  }
  return [...unresolved];
 }
 function extractBackendCallKeys(content: string): string[] {
  const matches = content.matchAll(/\bbackend\.([A-Za-z_][A-Za-z0-9_]*)\s*\(/g);
  return [...new Set([...matches].map((match) => match[1]))];
 }
 function getFlowModules(flow: FlowState): Array<Record<string, unknown>> {
  return Array.isArray(flow.value?.modules) ? flow.value.modules : [];
 }
 function validateFlowExpectedStructure(
  actual: FlowState,
  expected: FlowState
 ): BenchmarkCheck[] {
  const checks: BenchmarkCheck[] = [];
  const expectedTopLevelModules = getFlowModules(expected);
  const actualTopLevelModules = getFlowModules(actual);
  const expectedSchemaFields = getTopLevelSchemaFields(expected.schema);
  if (expectedSchemaFields.length > 0) {
    checks.push(
      check(
        "flow schema includes expected top-level fields",
        expectedSchemaFields.every((field) => hasSchemaPath(actual.schema, field)),
        `missing at least one of: ${expectedSchemaFields.join(", ")}`
      )
    );
  }
  if (expectedTopLevelModules.length > 0) {
    const actualIds = actualTopLevelModules
      .map((module) => (typeof module.id === "string" ? module.id : null))
      .filter((id): id is string => Boolean(id));
    const expectedIds = expectedTopLevelModules
      .map((module) => (typeof module.id === "string" ? module.id : null))
      .filter((id): id is string => Boolean(id));
    checks.push(
      check(
        "flow includes expected top-level step ids",
        expectedIds.every((id) => actualIds.includes(id)),
        `expected ids: ${expectedIds.join(", ")}; actual ids: ${actualIds.join(", ")}`
      )
    );
    checks.push(
      check(
        "flow preserves expected top-level step order",
        preservesRelativeOrder(actualIds, expectedIds),
        `expected order: ${expectedIds.join(" -> ")}; actual ids: ${actualIds.join(" -> ")}`
      )
    );
    for (const expectedModule of expectedTopLevelModules) {
      const moduleId = typeof expectedModule.id === "string" ? expectedModule.id : null;
      if (!moduleId) {
        continue;
      }
      const actualModule = actualTopLevelModules.find((module) => module.id === moduleId);
      if (!actualModule) {
        continue;
      }
      const expectedType = getModuleType(expectedModule);
      // Skip the type comparison when either side is a suspend (approval) step.
      if (expectedType && !(hasSuspendConfig(expectedModule) || hasSuspendConfig(actualModule))) {
        checks.push(
          check(
            `${moduleId} type matches expected`,
            getModuleType(actualModule) === expectedType,
            `expected ${expectedType}, got ${getModuleType(actualModule) ?? "(missing)"}`
          )
        );
      }
      const expectedPath = getModulePath(expectedModule);
      if (expectedPath) {
        checks.push(
          check(
            `${moduleId} path matches expected`,
            getModulePath(actualModule) === expectedPath,
            `expected ${expectedPath}, got ${getModulePath(actualModule) ?? "(missing)"}`
          )
        );
      }
    }
  }
  for (const specialModuleKey of ["preprocessor_module", "failure_module"] as const) {
    const expectedSpecialModule = getSpecialFlowModule(expected, specialModuleKey);
    if (!expectedSpecialModule) {
      continue;
    }
    const actualSpecialModule = getSpecialFlowModule(actual, specialModuleKey);
    checks.push(
      check(
        `${specialModuleKey} matches expected presence`,
        Boolean(actualSpecialModule),
        `${specialModuleKey} is missing`
      )
    );
    if (!actualSpecialModule) {
      continue;
    }
    const expectedType = getModuleType(expectedSpecialModule);
    if (expectedType) {
      checks.push(
        check(
          `${specialModuleKey} type matches expected`,
          getModuleType(actualSpecialModule) === expectedType,
          `expected ${expectedType}, got ${getModuleType(actualSpecialModule) ?? "(missing)"}`
        )
      );
    }
  }
  return checks;
 }
 function validateFlowRequirements(
  flow: FlowState,
  validate: FlowValidationSpec
 ): BenchmarkCheck[] {
  const checks: BenchmarkCheck[] = [];
  for (const requiredPath of validate.schemaRequiredPaths ?? []) {
    checks.push(
      check(
        `schema includes ${requiredPath}`,
        hasSchemaPath(flow.schema, requiredPath),
        `missing schema path ${requiredPath}`
      )
    );
  }
  if (validate.schemaAnyOf && validate.schemaAnyOf.length > 0) {
    const matchingVariant = validate.schemaAnyOf.find((variant) =>
      variant.requiredPaths.every((requiredPath) => hasSchemaPath(flow.schema, requiredPath))
    );
    checks.push(
      check(
        "schema matches one accepted input shape",
        Boolean(matchingVariant),
        matchingVariant
          ? undefined
          : `expected one of: ${validate.schemaAnyOf
              .map((variant) => `[${variant.requiredPaths.join(", ")}]`)
              .join(" or ")}`
      )
    );
  }
  if (validate.resolveResultsRefs) {
    const unresolved = collectUnresolvedResultsRefs(flow);
    checks.push(
      check(
        "results references resolve",
        unresolved.length === 0,
        unresolved.length > 0 ? unresolved.join("; ") : undefined
      )
    );
  }
  for (const specialModule of validate.requireSpecialModules ?? []) {
    checks.push(
      check(
        `${specialModule} exists`,
        Boolean(getSpecialFlowModule(flow, specialModule))
      )
    );
  }
  for (const suspendStep of validate.requireSuspendSteps ?? []) {
    const module = findFlowModuleById(flow, suspendStep.id);
    checks.push(check(`${suspendStep.id} step exists`, Boolean(module)));
    if (!module) {
      continue;
    }
    checks.push(check(`${suspendStep.id} includes suspend config`, hasSuspendConfig(module)));
    if (!hasSuspendConfig(module)) {
      continue;
    }
    if (suspendStep.requiredEvents !== undefined) {
      checks.push(
        check(
          `${suspendStep.id} requires ${suspendStep.requiredEvents} approval event${suspendStep.requiredEvents === 1 ? "" : "s"}`,
          getSuspendRequiredEvents(module) === suspendStep.requiredEvents,
          `expected ${suspendStep.requiredEvents}, got ${getSuspendRequiredEvents(module) ?? "(missing)"}`
        )
      );
    }
    if (
      suspendStep.resumeRequiredStringFieldAnyOf &&
      suspendStep.resumeRequiredStringFieldAnyOf.length > 0
    ) {
      const stringFields = getSuspendResumeStringFields(module);
      checks.push(
        check(
          `${suspendStep.id} resume form includes one accepted comment field`,
          suspendStep.resumeRequiredStringFieldAnyOf.some((field) =>
            stringFields.includes(field)
          ),
          `expected one of [${suspendStep.resumeRequiredStringFieldAnyOf.join(", ")}], got [${stringFields.join(", ")}]`
        )
      );
    }
  }
  return checks;
 }
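 // Returns true when every segment of `dottedPath` resolves through nested
 // `properties` objects of a JSON-schema-like value (e.g. "customer.tier").
 function hasSchemaPath(schema: Record<string, unknown> | undefined, dottedPath: string): boolean {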
  if (!schema || typeof schema !== "object") {
    return false;
  }
  const segments = dottedPath.split(".").filter(Boolean);
  if (segments.length === 0) {
    return false;
  }
  let current: Record<string, unknown> | undefined = schema;
  for (const segment of segments) {
    const properties = current?.properties;
    if (!properties || typeof properties !== "object") {
      return false;
    }
    const next = (properties as Record<string, unknown>)[segment];
    if (!next || typeof next !== "object") {
      return false;
    }
    current = next as Record<string, unknown>;
  }
  return true;
 }
 function getTopLevelSchemaFields(schema: Record<string, unknown> | undefined): string[] {
  if (!schema || typeof schema !== "object") {
    return [];
  }
  const properties = schema.properties;
  if (!properties || typeof properties !== "object") {
    return [];
  }
  return Object.keys(properties as Record<string, unknown>).filter((key) => key.length > 0);
 }
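 // Returns true when `expectedIds` appears within `actualIds` in the same
 // relative order; additional ids in between are allowed.
 function preservesRelativeOrder(actualIds: string[], expectedIds: string[]): boolean {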
  if (expectedIds.length === 0) {
    return true;
  }
  let cursor = 0;
  for (const actualId of actualIds) {
    if (actualId === expectedIds[cursor]) {
      cursor += 1;
      if (cursor === expectedIds.length) {
        return true;
      }
    }
  }
  return false;
 }
 function collectUnresolvedResultsRefs(flow: FlowState): string[] {
  const unresolved = new Set<string>();
  validateModuleSequence(getFlowModules(flow), new Map<string, Record<string, unknown>>(), unresolved);
  return [...unresolved];
 }
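 // Walks a module sequence in order. A module may reference results only from
 // modules that precede it in this sequence or in an enclosing sequence; ids
 // registered inside a nested sequence do not leak back out to the parent.
 function validateModuleSequence(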
  modules: Array<Record<string, unknown>>,
  parentVisibleModules: Map<string, Record<string, unknown>>,
  unresolved: Set<string>
 ): void {
  const visibleModules = new Map(parentVisibleModules);
  for (const module of modules) {
    validateResultsRefsInRecord(module, visibleModules, unresolved);
    validateNestedModuleResultsRefs(module, visibleModules, unresolved);
    if (typeof module.id === "string" && module.id.length > 0) {
      visibleModules.set(module.id, module);
    }
  }
 }
 function validateNestedModuleResultsRefs(
  module: Record<string, unknown>,
  visibleModules: Map<string, Record<string, unknown>>,
  unresolved: Set<string>
 ): void {
  const value = isObjectRecord(module.value) ? module.value : null;
  if (!value) {
    return;
  }
  const nestedSequences: Array<Array<Record<string, unknown>>> = [];
  if (Array.isArray(value.modules)) {
    nestedSequences.push(asModuleArray(value.modules));
  }
  if (Array.isArray(value.default)) {
    nestedSequences.push(asModuleArray(value.default));
  }
  if (Array.isArray(value.branches)) {
    for (const branch of value.branches) {
      if (!isObjectRecord(branch)) {
        continue;
      }
      if (typeof branch.expr === "string") {
        validateResultsRefsInExpression(
          branch.expr,
          `branch ${module.id ?? "(unnamed)"}`,
          visibleModules,
          unresolved
        );
      }
      if (Array.isArray(branch.modules)) {
        nestedSequences.push(asModuleArray(branch.modules));
      }
    }
  }
  for (const sequence of nestedSequences) {
    validateModuleSequence(sequence, visibleModules, unresolved);
  }
 }
 function validateResultsRefsInRecord(
  value: unknown,
  visibleModules: Map<string, Record<string, unknown>>,
  unresolved: Set<string>,
  context = "expression"
 ): void {
  if (typeof value === "string") {
    validateResultsRefsInExpression(value, context, visibleModules, unresolved);
    return;
  }
  if (Array.isArray(value)) {
    for (const entry of value) {
      validateResultsRefsInRecord(entry, visibleModules, unresolved, context);
    }
    return;
  }
  if (!isObjectRecord(value)) {
    return;
  }
  for (const [key, entry] of Object.entries(value)) {
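    // Script bodies (`content`) are not scanned here; nested sequences are
    // walked separately by validateNestedModuleResultsRefs.
    if (key === "content" || key === "modules" || key === "branches" || key === "default") {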
      continue;
    }
    validateResultsRefsInRecord(entry, visibleModules, unresolved, key);
  }
 }
 function validateResultsRefsInExpression(
  expression: string,
  context: string,
  visibleModules: Map<string, Record<string, unknown>>,
  unresolved: Set<string>
 ): void {
  for (const ref of extractResultsRefs(expression)) {
    const module = visibleModules.get(ref.root);
    if (!module) {
      unresolved.add(`${context} references missing results.${ref.root}`);
      continue;
    }
    validateNestedResultsRefPath(ref.root, ref.path, module, context, unresolved);
  }
 }
 function extractResultsRefs(
  expression: string
 ): Array<{ root: string; path: string[] }> {
  const matches = expression.matchAll(/\bresults\.([A-Za-z0-9_-]+)((?:\.[A-Za-z0-9_-]+)*)/g);
  const refs = new Map<string, { root: string; path: string[] }>();
  for (const match of matches) {
    const root = match[1];
    const path = match[2]
      .split(".")
      .filter(Boolean);
    const key = `${root}:${path.join(".")}`;
    refs.set(key, { root, path });
  }
  return [...refs.values()];
 }
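 // Flags `results.<root>.<child>` when `<root>` is a control-flow module and
 // `<child>` is the id of a step nested directly inside it.
 function validateNestedResultsRefPath(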
  rootId: string,
  path: string[],
  module: Record<string, unknown>,
  context: string,
  unresolved: Set<string>
 ): void {
  if (path.length === 0) {
    return;
  }
  const moduleType = getModuleType(module);
  if (!moduleType || !CONTROL_FLOW_MODULE_TYPES.has(moduleType)) {
    return;
  }
  const nestedIds = new Set(getImmediateNestedModuleIds(module));
  const [firstSegment] = path;
  if (nestedIds.has(firstSegment)) {
    unresolved.add(
      `${context} references nested results.${rootId}.${firstSegment} inside ${moduleType} ${rootId}`
    );
  }
 }
 function getAllFlowModules(flow: FlowState): Array<Record<string, unknown>> {
  const modules: Array<Record<string, unknown>> = [];
  const specialModules = ["preprocessor_module", "failure_module"] as const;
  for (const key of specialModules) {
    const specialModule = getSpecialFlowModule(flow, key);
    if (specialModule) {
      modules.push(specialModule);
      modules.push(...collectNestedModules(specialModule));
    }
  }
  for (const module of getFlowModules(flow)) {
    modules.push(module);
    modules.push(...collectNestedModules(module));
  }
  return modules;
 }
 function collectNestedModules(module: Record<string, unknown>): Array<Record<string, unknown>> {
  const nested: Array<Record<string, unknown>> = [];
  const value = isObjectRecord(module.value) ? module.value : null;
  if (!value) {
    return nested;
  }
  if (Array.isArray(value.modules)) {
    for (const child of asModuleArray(value.modules)) {
      nested.push(child, ...collectNestedModules(child));
    }
  }
  if (Array.isArray(value.default)) {
    for (const child of asModuleArray(value.default)) {
      nested.push(child, ...collectNestedModules(child));
    }
  }
  if (Array.isArray(value.branches)) {
    for (const branch of value.branches) {
      if (!isObjectRecord(branch) || !Array.isArray(branch.modules)) {
        continue;
      }
      for (const child of asModuleArray(branch.modules)) {
        nested.push(child, ...collectNestedModules(child));
      }
    }
  }
  return nested;
 }
 function findFlowModuleById(flow: FlowState, id: string): Record<string, unknown> | null {
  for (const module of getAllFlowModules(flow)) {
    if (module.id === id) {
      return module;
    }
  }
  return null;
 }
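 // Lists modules whose code is still a bare `inline_script.<name>` placeholder
 // rather than actual script content.
 function getInlineScriptPlaceholderModuleIds(flow: FlowState): string[] {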
  return getAllFlowModules(flow).flatMap((module) => {
    const code = getModuleCode(module)?.trim();
    if (!code || !/^inline_script\.[A-Za-z0-9_-]+$/.test(code)) {
      return [];
    }
    if (typeof module.id === "string" && module.id.length > 0) {
      return [module.id];
    }
    return ["(unnamed)"];
  });
 }
 function getImmediateNestedModuleIds(module: Record<string, unknown>): string[] {
  const ids: string[] = [];
  const value = isObjectRecord(module.value) ? module.value : null;
  if (!value) {
    return ids;
  }
  if (Array.isArray(value.modules)) {
    ids.push(...asModuleArray(value.modules).flatMap((child) => (typeof child.id === "string" ? [child.id] : [])));
  }
  if (Array.isArray(value.default)) {
    ids.push(...asModuleArray(value.default).flatMap((child) => (typeof child.id === "string" ? [child.id] : [])));
  }
  if (Array.isArray(value.branches)) {
    for (const branch of value.branches) {
      if (!isObjectRecord(branch) || !Array.isArray(branch.modules)) {
        continue;
      }
      ids.push(
        ...asModuleArray(branch.modules).flatMap((child) => (typeof child.id === "string" ? [child.id] : []))
      );
    }
  }
  return ids;
 }
 function getModuleCode(module: Record<string, unknown>): string | null {
  const value = isObjectRecord(module.value) ? module.value : null;
  return typeof value?.content === "string" ? value.content : null;
 }
 function asModuleArray(value: unknown[]): Array<Record<string, unknown>> {
  return value.filter(isObjectRecord);
 }
 function isObjectRecord(value: unknown): value is Record<string, any> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
 }
 function getSpecialFlowModule(
  flow: FlowState,
  key: "preprocessor_module" | "failure_module"
 ): Record<string, unknown> | null {
  if (!flow.value || typeof flow.value !== "object") {
    return null;
  }
  const module = (flow.value as Record<string, unknown>)[key];
  return module && typeof module === "object" ? (module as Record<string, unknown>) : null;
 }
 function getModuleType(module: Record<string, unknown>): string | null {
  const value = module.value;
  if (!value || typeof value !== "object") {
    return null;
  }
  return typeof (value as Record<string, unknown>).type === "string"
    ? ((value as Record<string, string>).type)
    : null;
 }
 function getModulePath(module: Record<string, unknown>): string | null {
  const value = module.value;
  if (!value || typeof value !== "object") {
    return null;
  }
  return typeof (value as Record<string, unknown>).path === "string"
    ? ((value as Record<string, string>).path)
    : null;
 }
 function hasSuspendConfig(module: Record<string, unknown>): boolean {
  return typeof module.suspend === "object" && module.suspend !== null;
 }
 function getSuspendRequiredEvents(module: Record<string, unknown>): number | null {
  const suspend = isObjectRecord(module.suspend) ? module.suspend : null;
  return typeof suspend?.required_events === "number" ? suspend.required_events : null;
 }
 function getSuspendResumeStringFields(module: Record<string, unknown>): string[] {
  const suspend = isObjectRecord(module.suspend) ? module.suspend : null;
  const resumeForm = isObjectRecord(suspend?.resume_form) ? suspend.resume_form : null;
  const schema = isObjectRecord(resumeForm?.schema) ? resumeForm.schema : null;
  const properties = isObjectRecord(schema?.properties) ? schema.properties : null;
  if (!properties) {
    return [];
  }
  return Object.entries(properties).flatMap(([field, property]) => {
    if (!isObjectRecord(property) || property.type !== "string") {
      return [];
    }
    return [field];
  });
 }
 function appStatesEqual(left: AppFilesState, right: AppFilesState): boolean {
  return (
    fileMapsEqual(left.frontend, right.frontend) &&
    fileMapsEqual(stringifyBackend(left.backend), stringifyBackend(right.backend))
  );
 }
 function stringifyBackend(backend: Record<string, AppRunnableState>): Record<string, string> {
  const result: Record<string, string> = {};
  for (const [key, value] of Object.entries(backend)) {
    result[key] = JSON.stringify(value);
  }
  return result;
 }
 function fileMapsEqual(left: Record<string, string>, right: Record<string, string>): boolean {
  const leftEntries = Object.entries(left).sort(([a], [b]) => a.localeCompare(b));
  const rightEntries = Object.entries(right).sort(([a], [b]) => a.localeCompare(b));
  if (leftEntries.length !== rightEntries.length) {
    return false;
  }
  return leftEntries.every(([key, value], index) => {
    const [otherKey, otherValue] = rightEntries[index];
    return key === otherKey && normalizeText(value) === normalizeText(otherValue);
  });
 }
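
The results-reference checks in the validator above come down to two steps: extract `results.<id>` references from an expression with a regular expression, then confirm that each `<id>` belongs to a step that appears earlier in the sequence. The following is a minimal standalone sketch of that idea, not the benchmark code itself. It reuses the same pattern, but `Step`, `extractRoots`, and `findUnresolved` are hypothetical names, and the flat step shape ignores branches and loops.

```typescript
// Hypothetical, flattened step shape used only for this illustration.
type Step = { id: string; exprs: string[] };

// Same pattern as the validator: `results.`, an id, then an optional dotted path.
const RESULTS_REF = /\bresults\.([A-Za-z0-9_-]+)((?:\.[A-Za-z0-9_-]+)*)/g;

// Collects the distinct root ids referenced as `results.<id>` in an expression.
function extractRoots(expression: string): string[] {
  const roots = new Set<string>();
  for (const match of expression.matchAll(RESULTS_REF)) {
    roots.add(match[1]);
  }
  return [...roots];
}

// A step may only reference steps that appear before it in the sequence.
function findUnresolved(steps: Step[]): string[] {
  const visible = new Set<string>();
  const unresolved: string[] = [];
  for (const step of steps) {
    for (const expr of step.exprs) {
      for (const root of extractRoots(expr)) {
        if (!visible.has(root)) {
          unresolved.push(`${step.id} references missing results.${root}`);
        }
      }
    }
    // Register the step only after its own expressions have been checked.
    visible.add(step.id);
  }
  return unresolved;
}

const problems = findUnresolved([
  { id: "a", exprs: ["flow_input.name"] },
  { id: "b", exprs: ["results.a.greeting + results.c"] },
  { id: "c", exprs: ["results.b"] },
]);
console.log(problems); // ["b references missing results.c"]
```

Registering a step only after checking it is what makes forward references (here `b` reading `results.c`) fail while backward references pass.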
--- a/ai_evals/fixtures/cli/expected/bun-hello-flow-punctuation/f/evals/hello__flow/flow.yaml
+++ b/ai_evals/fixtures/cli/expected/bun-hello-flow-punctuation/f/evals/hello__flow/flow.yaml
--- a/ai_evals/fixtures/cli/expected/bun-hello-flow-punctuation/f/evals/hello__flow/hello.ts
+++ b/ai_evals/fixtures/cli/expected/bun-hello-flow-punctuation/f/evals/hello__flow/hello.ts
--- a/ai_evals/fixtures/cli/expected/bun-hello-flow/f/evals/hello__flow/flow.yaml
+++ b/ai_evals/fixtures/cli/expected/bun-hello-flow/f/evals/hello__flow/flow.yaml
--- a/ai_evals/fixtures/cli/expected/bun-hello-flow/f/evals/hello__flow/hello.ts
+++ b/ai_evals/fixtures/cli/expected/bun-hello-flow/f/evals/hello__flow/hello.ts
@@ -0,0 +1,2 @@
 main(name: string)
 greeting: `Hello, ${name}!`
--- a/ai_evals/fixtures/cli/expected/bun-hello-script-uppercase/f/evals/hello.ts
+++ b/ai_evals/fixtures/cli/expected/bun-hello-script-uppercase/f/evals/hello.ts
--- a/ai_evals/fixtures/cli/expected/bun-hello-script/f/evals/hello.ts
+++ b/ai_evals/fixtures/cli/expected/bun-hello-script/f/evals/hello.ts
@@ -0,0 +1,3 @@
 export async function main(name: string) {
  return { greeting: `Hello, ${name}!` };
 }
--- a/ai_evals/fixtures/cli/expected/flow-reuse-existing-script/f/evals/reuse_greeting__flow/flow.yaml
+++ b/ai_evals/fixtures/cli/expected/flow-reuse-existing-script/f/evals/reuse_greeting__flow/flow.yaml
@@ -0,0 +1,2 @@
 type: script
 path: f/lib/format_greeting
--- a/ai_evals/fixtures/cli/expected/flow-reuse-existing-script/f/lib/format_greeting.ts
+++ b/ai_evals/fixtures/cli/expected/flow-reuse-existing-script/f/lib/format_greeting.ts
@@ -0,0 +1,3 @@
 export async function main(name: string) {
  return { greeting: `Hello, ${name}!` };
 }
--- a/ai_evals/fixtures/cli/expected/python-add-numbers-script/f/evals/add_numbers.py
+++ b/ai_evals/fixtures/cli/expected/python-add-numbers-script/f/evals/add_numbers.py
@@ -0,0 +1,2 @@
 def main(
 return {"total": a + b}
--- a/ai_evals/fixtures/cli/initial/bun-hello-flow-punctuation/f/evals/hello__flow/flow.yaml
+++ b/ai_evals/fixtures/cli/initial/bun-hello-flow-punctuation/f/evals/hello__flow/flow.yaml
@@ -0,0 +1,20 @@
 summary: Simple greeting flow
 schema:
  type: object
  properties:
    name:
      type: string
      description: Name to greet
  required:
    - name
 value:
  modules:
    - id: hello_step
      value:
        type: rawscript
        language: bun
        content: !inline hello.ts
        input_transforms:
          name:
            type: javascript
            expr: flow_input.name
--- a/ai_evals/fixtures/cli/initial/bun-hello-flow-punctuation/f/evals/hello__flow/hello.ts
+++ b/ai_evals/fixtures/cli/initial/bun-hello-flow-punctuation/f/evals/hello__flow/hello.ts
@@ -0,0 +1,3 @@
 export async function main(name: string) {
  return { greeting: `Hello, ${name}!` };
 }
--- a/ai_evals/fixtures/cli/initial/bun-hello-script-uppercase/f/evals/hello.ts
+++ b/ai_evals/fixtures/cli/initial/bun-hello-script-uppercase/f/evals/hello.ts
@@ -0,0 +1,3 @@
 export async function main(name: string) {
  return { greeting: `Hello, ${name}!` };
 }
--- a/ai_evals/fixtures/cli/initial/flow-reuse-existing-script/f/lib/format_greeting.ts
+++ b/ai_evals/fixtures/cli/initial/flow-reuse-existing-script/f/lib/format_greeting.ts
@@ -0,0 +1,3 @@
 export async function main(name: string) {
  return { greeting: `Hello, ${name}!` };
 }
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/createFolder/main.ts
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/createFolder/main.ts
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/createFolder/meta.json
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/createFolder/meta.json
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/deleteItem/main.ts
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/deleteItem/main.ts
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/deleteItem/meta.json
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/deleteItem/meta.json
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/listFiles/main.ts
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/listFiles/main.ts
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/listFiles/meta.json
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/listFiles/meta.json
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/listFolders/main.ts
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/listFolders/main.ts
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/listFolders/meta.json
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/listFolders/meta.json
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/moveItem/main.ts
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/moveItem/main.ts
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/moveItem/meta.json
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/moveItem/meta.json
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/renameItem/main.ts
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/renameItem/main.ts
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/renameItem/meta.json
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/backend/renameItem/meta.json
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/frontend/components/Breadcrumb.tsx
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/frontend/components/Breadcrumb.tsx
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/frontend/components/FileItem.tsx
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/frontend/components/FileItem.tsx
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/frontend/components/FileList.tsx
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/frontend/components/FileList.tsx
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/frontend/components/FolderTree.tsx
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/frontend/components/FolderTree.tsx
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/frontend/components/Toolbar.tsx
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/frontend/components/Toolbar.tsx
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/frontend/index.tsx
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/file_manager/frontend/index.tsx
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/addToCart/main.ts
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/addToCart/main.ts
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/addToCart/meta.json
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/addToCart/meta.json
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/calculateTotal/main.ts
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/calculateTotal/main.ts
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/calculateTotal/meta.json
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/calculateTotal/meta.json
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/getProducts/main.ts
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/getProducts/main.ts
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/getProducts/meta.json
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/getProducts/meta.json
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/removeFromCart/main.ts
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/removeFromCart/main.ts
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/removeFromCart/meta.json
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/backend/removeFromCart/meta.json
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/frontend/components/Cart.tsx
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/frontend/components/Cart.tsx
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/frontend/components/ProductCard.tsx
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/frontend/components/ProductCard.tsx
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/frontend/components/ProductList.tsx
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/frontend/components/ProductList.tsx
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/frontend/index.tsx
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/shopping_cart/frontend/index.tsx
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/test1_counter_app/backend/decrementCounter/main.ts
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/test1_counter_app/backend/decrementCounter/main.ts
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/test1_counter_app/backend/decrementCounter/meta.json
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/test1_counter_app/backend/decrementCounter/meta.json
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/test1_counter_app/backend/incrementCounter/main.ts
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/test1_counter_app/backend/incrementCounter/main.ts
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/test1_counter_app/backend/incrementCounter/meta.json
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/test1_counter_app/backend/incrementCounter/meta.json
--- a/frontend/src/lib/components/copilot/chat/tests/app/initial/test1_counter_app/frontend/index.tsx
+++ b/frontend/src/lib/components/copilot/chat/tests/app/initial/test1_counter_app/frontend/index.tsx
--- a/ai_evals/fixtures/frontend/flow/expected/test0_sum_two_numbers.json
+++ b/ai_evals/fixtures/frontend/flow/expected/test0_sum_two_numbers.json
@@ -0,0 +1,31 @@
 {
  "summary": "",
  "value": {
    "modules": [
      {
        "id": "sum_numbers",
        "value": {
          "type": "rawscript",
          "language": "bun",
          "content": "export async function main(a: number, b: number) {\n  return a + b;\n}",
          "input_transforms": {
            "a": {
              "type": "javascript",
              "expr": "flow_input.a"
            },
            "b": {
              "type": "javascript",
              "expr": "flow_input.b"
            }
          }
        }
      }
    ]
  },
  "schema": {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "properties": {},
    "required": [],
    "type": "object"
  }
 }
--- a/frontend/src/lib/components/copilot/chat/tests/flow/expected/test1.json
+++ b/frontend/src/lib/components/copilot/chat/tests/flow/expected/test1.json
--- a/ai_evals/fixtures/frontend/flow/expected/test10_while_loop_counter.json
+++ b/ai_evals/fixtures/frontend/flow/expected/test10_while_loop_counter.json
@@ -0,0 +1,30 @@
 {
  "value": {
    "modules": [
      {
        "id": "count_until_target",
        "value": {
          "type": "whileloopflow"
        }
      },
      {
        "id": "return_final_count",
        "value": {
          "type": "rawscript"
        }
      }
    ]
  },
  "schema": {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
      "target": {
        "type": "number"
      }
    },
    "required": [
      "target"
    ]
  }
 }
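
An expected fixture like the one above is compared by step id and relative order rather than by exact equality, so a generated flow may contain extra steps. The sketch below illustrates that ordering rule under simplified assumptions: `isSubsequence` is a hypothetical stand-in for the validator's order check, and `log_progress` is a made-up step id.

```typescript
// Returns true when `expected` appears in `actual` in the same relative order;
// ids that are not in `expected` may sit in between.
function isSubsequence(actual: string[], expected: string[]): boolean {
  let cursor = 0;
  for (const id of actual) {
    if (cursor < expected.length && id === expected[cursor]) {
      cursor += 1;
    }
  }
  return cursor === expected.length;
}

const expectedIds = ["count_until_target", "return_final_count"];

// An extra step in between still passes.
const withExtraStep = isSubsequence(
  ["count_until_target", "log_progress", "return_final_count"],
  expectedIds
);

// Swapping the two expected steps fails.
const swapped = isSubsequence(["return_final_count", "count_until_target"], expectedIds);

console.log(withExtraStep, swapped); // true false
```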
--- a/ai_evals/fixtures/frontend/flow/expected/test11_preprocessor_failure.json
+++ b/ai_evals/fixtures/frontend/flow/expected/test11_preprocessor_failure.json
@@ -0,0 +1,36 @@
 {
  "value": {
    "preprocessor_module": {
      "id": "preprocessor",
      "value": {
        "type": "rawscript"
      }
    },
    "failure_module": {
      "id": "failure",
      "value": {
        "type": "rawscript"
      }
    },
    "modules": [
      {
        "id": "process_event",
        "value": {
          "type": "rawscript"
        }
      }
    ]
  },
  "schema": {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
      "payload": {
        "type": "string"
      }
    },
    "required": [
      "payload"
    ]
  }
 }
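
For fixtures that declare `preprocessor_module` or `failure_module`, the structural comparison only looks at the special modules the fixture actually lists, then checks presence and type. A rough sketch of that behavior follows. `FlowLike`, `specialModuleType`, and `compareSpecialModules` are hypothetical names for illustration and do not mirror the validator's real signatures.

```typescript
type FlowLike = { value?: Record<string, unknown> };
type SpecialKey = "preprocessor_module" | "failure_module";

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Reads `value.type` of a special module, or null when it is absent.
function specialModuleType(flow: FlowLike, key: SpecialKey): string | null {
  const special = flow.value?.[key];
  if (!isRecord(special) || !isRecord(special.value)) {
    return null;
  }
  return typeof special.value.type === "string" ? special.value.type : null;
}

function compareSpecialModules(expected: FlowLike, actual: FlowLike): string[] {
  const failures: string[] = [];
  for (const key of ["preprocessor_module", "failure_module"] as const) {
    // Only check the special modules that the fixture declares.
    if (!isRecord(expected.value?.[key])) {
      continue;
    }
    if (!isRecord(actual.value?.[key])) {
      failures.push(`${key} is missing`);
      continue;
    }
    const want = specialModuleType(expected, key);
    const got = specialModuleType(actual, key);
    if (want && got !== want) {
      failures.push(`${key}: expected ${want}, got ${got ?? "(missing)"}`);
    }
  }
  return failures;
}

const expectedFlow: FlowLike = {
  value: {
    preprocessor_module: { id: "preprocessor", value: { type: "rawscript" } },
    failure_module: { id: "failure", value: { type: "rawscript" } },
  },
};
const actualFlow: FlowLike = {
  value: { preprocessor_module: { id: "preprocessor", value: { type: "rawscript" } } },
};

const specialFailures = compareSpecialModules(expectedFlow, actualFlow);
console.log(specialFailures); // ["failure_module is missing"]
```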
--- a/ai_evals/fixtures/frontend/flow/expected/test12_approval_step.json
+++ b/ai_evals/fixtures/frontend/flow/expected/test12_approval_step.json
@@ -0,0 +1,44 @@
 {
  "value": {
    "modules": [
      {
        "id": "request_approval",
        "suspend": {
          "required_events": 1,
          "resume_form": {
            "schema": {
              "approver_comment": {
                "type": "string"
              }
            }
          }
        },
        "value": {
          "type": "rawscript"
        }
      },
      {
        "id": "finalize_purchase",
        "value": {
          "type": "rawscript"
        }
      }
    ]
  },
  "schema": {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
      "requester_email": {
        "type": "string"
      },
      "amount": {
        "type": "number"
      }
    },
    "required": [
      "requester_email",
      "amount"
    ]
  }
 }
--- a/ai_evals/fixtures/frontend/flow/expected/test1_reuse_existing_script.json
+++ b/ai_evals/fixtures/frontend/flow/expected/test1_reuse_existing_script.json
@@ -0,0 +1,39 @@
 {
  "value": {
    "modules": [
      {
        "id": "sum_numbers",
        "value": {
          "type": "script",
          "path": "f/evals/add_two_numbers.ts",
          "input_transforms": {
            "a": {
              "type": "javascript",
              "expr": "flow_input.a"
            },
            "b": {
              "type": "javascript",
              "expr": "flow_input.b"
            }
          }
        }
      }
    ]
  },
  "schema": {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
      "a": {
        "type": "number"
      },
      "b": {
        "type": "number"
      }
    },
    "required": [
      "a",
      "b"
    ]
  }
 }
--- a/frontend/src/lib/components/copilot/chat/tests/flow/expected/test2.json
+++ b/frontend/src/lib/components/copilot/chat/tests/flow/expected/test2.json
--- a/ai_evals/fixtures/frontend/flow/expected/test2_call_existing_subflow.json
+++ b/ai_evals/fixtures/frontend/flow/expected/test2_call_existing_subflow.json
@@ -0,0 +1,39 @@
 {
  "value": {
    "modules": [
      {
        "id": "call_add_numbers",
        "value": {
          "type": "flow",
          "path": "f/evals/add_numbers_flow",
          "input_transforms": {
            "a": {
              "type": "javascript",
              "expr": "flow_input.a"
            },
            "b": {
              "type": "javascript",
              "expr": "flow_input.b"
            }
          }
        }
      }
    ]
  },
  "schema": {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
      "a": {
        "type": "number"
      },
      "b": {
        "type": "number"
      }
    },
    "required": [
      "a",
      "b"
    ]
  }
 }
--- a/frontend/src/lib/components/copilot/chat/tests/flow/expected/test3.json
+++ b/frontend/src/lib/components/copilot/chat/tests/flow/expected/test3.json
--- a/ai_evals/fixtures/frontend/flow/expected/test3_branchone_routing.json
+++ b/ai_evals/fixtures/frontend/flow/expected/test3_branchone_routing.json
@@ -0,0 +1,24 @@
 {
  "value": {
    "modules": [
      {
        "id": "route_by_tier",
        "value": {
          "type": "branchone"
        }
      }
    ]
  },
  "schema": {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
      "tier": {
        "type": "string"
      }
    },
    "required": [
      "tier"
    ]
  }
 }
@@ -0,0 +1,2 @@
 main(name: string)
 greeting: `Hello, ${name}!`
@@ -0,0 +1,2 @@
 type: script
 path: f/lib/format_greeting