kalugin66/windmill

Fork 0

Files

centdix cdcc56461b feat: add black-box ai eval benchmarks (#8618 )

2026-04-13 14:05:46 +02:00

5.5 KiB

Raw Blame History

AI Evals Authoring Guide

This folder contains black-box benchmark cases for:

flow
app
script
cli

The goal is to test the current production prompts and guidance with realistic user requests, not to test one exact implementation shape.

Core rules

Write prompts like a real user request.
Prefer behavior, inputs, constraints, and outcomes over internal implementation details.
Keep deterministic validation narrow and hard.
Put semantic expectations in judgeChecklist.
Use expected fixtures only when exact structure really matters.

Prompt writing

Prompts should sound like something a user would naturally ask.

Good:

"Create a flow that routes support requests based on customer tier."
"Add a reset button that sets the counter back to 0."
"Create a flow that reuses the existing greeting script instead of duplicating the logic."

Bad:

"Use branchone with 3 branches and a default branch."
"Create a rawscript step with this exact topology."
"This is a benchmark harness."

Do not write prompts as if the user knows Windmill internals unless the case is explicitly testing a power-user workflow.

Flow-specific rules

This is the main principle you asked for:

flow prompts should read like requests from a user who does not know the product internals
the user should ask for behavior, not for branchone, branchall, rawscript, preprocessor_module, failure_module, exact graph topology, or other internal constructs

That means:

creation cases should describe the business behavior and expected result
modification cases may mention existing step names, because the user can see the current flow
only mention special Windmill constructs when the case is explicitly about those constructs

Examples:

acceptable creation prompt: "Create a purchase approval flow that pauses for approval and asks the approver for a comment."
avoid: "Create a suspend step with one required event and a resume form."

For flow cases, do not fail a case just because the model chose a different valid topology.

App-specific rules

App prompts should focus on user-visible behavior:

what the UI should let the user do
what should persist
what backend behavior is needed

Avoid prompting in terms of React structure, component names, or implementation unless the case is specifically about editing an existing app.

CLI-specific rules

CLI prompts can be more explicit about paths and file names because real CLI users often do specify them.

Still, avoid benchmark phrasing. The prompt should read like a repo task, not a harness instruction.

When relevant, ask the assistant to tell the user which wmill commands to run next. That is part of the benchmarked behavior.

Deterministic validation

Use deterministic validation only for hard failures such as:

missing required files
unexpected extra files when the prompt says not to create them
syntax errors
unresolved flow refs
missing required special modules or suspend config
obvious artifact corruption

Do not use deterministic validation to enforce one preferred implementation for broad creation tasks.

Examples of bad hard checks:

exact step topology for a creation flow
exact branch structure when the prompt only asked for routing behavior
exact input shape when multiple reasonable shapes are acceptable

Judge checklist

Every non-trivial case should have a judgeChecklist.

The checklist should capture:

the user-visible behavior that must be present
important constraints
key completion criteria

The checklist should not duplicate low-level implementation details unless they are truly required by the task.

Good checklist items:

"the flow calculates the order total with 8% tax"
"the app persists recipes appropriately for a raw Windmill app"
"the flow reuses the existing workspace script instead of rewriting the logic"

Bad checklist items:

"uses branchone"
"contains a rawscript node"

When to use `expected`

Use expected fixtures when the case is structure-sensitive, for example:

exact file creation
exact script content
modification cases where a specific file must change in a specific way
cases where preserving an existing structure is part of the requirement

Do not use a full expected artifact as the semantic oracle for broad creation tasks when multiple valid outputs should pass.

When to use `initial`

Use initial when the benchmark is about:

editing an existing artifact
reusing existing workspace assets
preserving existing behavior while adding a change

If the case is greenfield, prefer no initial.

Case design ladder

Prefer suites that get gradually harder:

trivial create case
realistic create case
reuse-existing-assets case
modification case
refactor case
edge-case or niche product behavior

The last cases in a suite should cover unusual or product-specific behavior.

Anti-patterns

Avoid these:

benchmark framing in prompts
over-specified internal topology for creation tasks
judge checklists that just restate implementation details
deterministic validation that encodes one preferred solution
fixtures that are so minimal or brittle that they create false negatives

Before adding a case

Ask:

Would a real user plausibly write this prompt?
If the model solves it in a different valid way, would the case still pass?
Are the hard deterministic checks only catching objectively broken output?
Does the judgeChecklist describe the real success criteria?
If this case fails, will the reason be understandable from the saved artifacts?

5.5 KiB Raw Blame History