Compare commits

58 Commits

Author SHA1 Message Date
Ruben Fiszel
7f639d8f32 Merge branch 'main' into reload-setting-load-split 2026-04-13 18:31:13 -04:00
Ruben Fiszel
e0066b266f chore(main): release 1.683.0 (#8802)
* chore(main): release 1.683.0

* Apply automatic changes

---------

Co-authored-by: rubenfiszel <275584+rubenfiszel@users.noreply.github.com>
2026-04-13 22:21:47 +00:00
Ruben Fiszel
7f619c2ee1 refactor: extract load helpers from reload_setting family
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 22:05:09 +00:00
Ruben Fiszel
42d3e8c789 fix: enrich OTEL log records with per-request LogContext (#8812)
* fix: enrich OTEL log records with per-request LogContext

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: add otlp_smoke example for manual OTEL log bridge verification

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update ee-repo-ref to 5d6b713b74fc46735807f5c32883002e8d976fbc

This commit updates the EE repository reference after PR #529 was merged in windmill-ee-private.

Previous ee-repo-ref: 45959d063bc941c567488d330b5819601cdd2d3d

New ee-repo-ref: 5d6b713b74fc46735807f5c32883002e8d976fbc

Automated by sync-ee-ref workflow.

* refactor: store LogContext in ArcSwap instead of Mutex

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: pin ee-repo-ref to ArcSwap branch commit

* chore: update ee-repo-ref to be2f3d4d11bb7110200524d7157caab3aac53996

This commit updates the EE repository reference after PR #530 was merged in windmill-ee-private.

Previous ee-repo-ref: 45b4d7963a9ebcd583d1a87abe7d07d3d521584a

New ee-repo-ref: be2f3d4d11bb7110200524d7157caab3aac53996

Automated by sync-ee-ref workflow.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: windmill-internal-app[bot] <windmill-internal-app[bot]@users.noreply.github.com>
2026-04-13 21:50:50 +00:00
centdix
c889a185d5 refactor: extract flow delete helpers (#8746)
* refactor: extract flow delete helpers

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor: unify flow delete planning

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor: stabilize flow delete execution

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor: simplify flow delete plan execution

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-04-13 20:35:01 +00:00
Ruben Fiszel
baeb202037 nit npm check 2026-04-13 20:31:28 +00:00
hugocasa
9fb78164b4 feat: allow non-admins to create and edit HTTP triggers (#8810)
* feat: allow non-admin users to create HTTP triggers with forced workspaced routes

Non-admin users can now create and fully edit HTTP triggers, but are forced
to use workspaced routes (workspace-prefixed URLs). Instance-wide routes
remain admin-only to prevent cross-workspace URL conflicts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: add missing RLS INSERT/DELETE policies for http_trigger table

Non-admin users were blocked by row-level security when creating HTTP triggers.
Added INSERT, DELETE, see_own, and see_member policies matching other trigger tables.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: allow user paths for HTTP triggers

Remove the hideUser restriction on the Path component so HTTP triggers
can be created under user paths (u/username/...) in addition to folder paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: remove added note from instance settings description

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: allow non-admins to edit non-workspaced routes without changing route config

Non-admins can now open and edit existing non-workspaced HTTP triggers
(created by admins) as long as they don't modify route_path, http_method,
or workspaced_route. The workspaced prefix is only forced on new triggers.
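
For illustration only, the edit rule described above might be sketched as follows. The `HttpTrigger` shape, the `script_path` field, and the `canNonAdminSave` helper are hypothetical; only the `route_path`, `http_method`, and `workspaced_route` names come from this message. The assumption that a workspaced route must stay workspaced is inferred from the earlier "forced workspaced routes" commit, not stated here.

```typescript
// Hypothetical trigger shape: only route_path, http_method and
// workspaced_route are named in the commit message.
interface HttpTrigger {
  route_path: string;
  http_method: string;
  workspaced_route: boolean;
  script_path: string; // assumed example of a freely editable field
}

// Returns true when an edit made by a non-admin should be accepted.
function canNonAdminSave(before: HttpTrigger, after: HttpTrigger): boolean {
  if (before.workspaced_route) {
    // Assumption: workspaced routes are fully editable by non-admins,
    // but cannot be turned into instance-wide routes.
    return after.workspaced_route;
  }
  // Instance-wide route created by an admin: the route config is frozen.
  return (
    before.route_path === after.route_path &&
    before.http_method === after.http_method &&
    before.workspaced_route === after.workspaced_route
  );
}
```

Under these assumptions, changing `script_path` on an admin-created route passes, while changing its `route_path` is rejected.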

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: allow non-admins to change route_path on workspaced routes

The prevent_route_path_change DB trigger blocked all route_path changes
for windmill_user, even on workspaced routes. Now only instance-wide
(non-workspaced) routes are protected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: add explicit GRANT and force workspaced routes in OpenAPI generator

- Add explicit GRANT INSERT, DELETE on http_trigger to windmill_user
  for safety on customer instances
- Force workspaced_route: true for non-admins in OpenAPI route generator

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 18:43:49 +00:00
Ruben Fiszel
64c58c824f feat: add deploy restriction rule and fork review requests (#8804)
* feat: add deploy restriction rule and fork review requests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update ee-repo-ref.txt for fork review requests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address PR review comments on fork review requests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: rename fork review requests to deployment requests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update ee-repo-ref.txt for deployment request rename

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: inline deployment request panel into deploy layout

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: place Request deployment button to the left of Deploy

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: inline fork triggers into main deploy list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: open real trigger detail drawer for inline fork triggers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: email notifications for merge completion and reply pings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: update deployment_request + protection_rule tables on workspace id rename

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update ee-repo-ref to 972893c3870e4c4a70a35748abed282d88904805

This commit updates the EE repository reference after PR #528 was merged in windmill-ee-private.

Previous ee-repo-ref: 5684d1c17d930b17849c1e5d7577891e64682d45

New ee-repo-ref: 972893c3870e4c4a70a35748abed282d88904805

Automated by sync-ee-ref workflow.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: windmill-internal-app[bot] <windmill-internal-app[bot]@users.noreply.github.com>
2026-04-13 17:10:37 +00:00
Ruben Fiszel
b3ef4bc26c perf: add inline-persist fast path for WAC v2 step() (#8807) 2026-04-13 16:49:53 +00:00
Ruben Fiszel
3f5841f84d feat: instance-level ruff config auto-pulled by LSP container (#8803)
* feat: add instance-level ruff config auto-pulled by LSP container

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: move ruff config to new LSP tab in instance settings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 16:04:49 +00:00
Ruben Fiszel
78a877eb96 avoid lock file race in repro_diffname CLI test on Windows (#8811)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 14:27:36 +00:00
hugocasa
378ba78284 fix: silence user-facing toast for non-critical hub script tracking error (#8808)
* fix: silence user-facing toast for non-critical hub script tracking error

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* n

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-04-13 14:21:54 +00:00
hugocasa
95411b2563 feat: display agent message in flow graph (#8806)
* feat: display message and web search content in agent graph node status

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: use markdown renderer for agent message output

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: revert web search output display — content not useful

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: improve web search alert text

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: align message title styling with other node status sections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 13:39:14 +00:00
Ruben Fiszel
b6f1cc70cd fix(cli): make cli help resilient to npm registry fetch failures (#8809)
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-04-13 13:38:38 +00:00
centdix
cdcc56461b feat: add black-box ai eval benchmarks (#8618) 2026-04-13 14:05:46 +02:00
Ruben Fiszel
60211c1d19 feat: folder default_permissioned_as rules for ownership defaults on deploy (#8801)
* feat: add folder default_permissioned_as rules for ownership defaults on deploy

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: remove unnecessary auth guard on default_permissioned_as — rules are advisory only

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: regenerate system prompts with new CLI commands

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address CI review findings — TOCTOU, race condition, email validation, type coercion

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: add sqlx offline cache for test queries (fixes cargo_test CI)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address remaining review findings — incomplete request bodies, dead code, redundant import

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address remaining review findings — full script fields, reactive stores, catch-all validation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: app/schedule/trigger set-permissioned-as fetch remote first to avoid data loss

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: app set-permissioned-as avoid creating redundant app version

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: compact user/group toggle + select for folder default_permissioned_as rules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: collapse default_permissioned_as section by default in folder editor

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: include default_permissioned_as in FolderFile CLI type for YAML round-trip

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: process folder.meta changes before items in push to apply new rules immediately

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: clone default_permissioned_as on fork/rename + add full lifecycle tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add no-op guarantee test — folder without rules behaves like before

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: rename cliBehavior to syncBehavior — more accurate scope

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 20:14:07 +00:00
Ruben Fiszel
6cebc6f61b chore(main): release 1.682.0 (#8798)
* chore(main): release 1.682.0

* Apply automatic changes

---------

Co-authored-by: rubenfiszel <275584+rubenfiszel@users.noreply.github.com>
2026-04-10 17:41:42 +00:00
Ruben Fiszel
59c457a138 feat: enrich hanging flow error with worker and service log info (#8800)
* feat: enrich hanging flow error with worker and service log info

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address PR review on hanging flow diagnostics

- Widen log_file lookup window to [-90s, +30s] around worker last ping
  so the batch containing the crash is captured (log files are
  minute-aligned; looking forward only was missing the relevant bucket).
- Log a warning on log_file query errors instead of silently swallowing,
  so a misconfigured table is not reported as "no log files found".
- Note that service log download URLs require S3/parquet collection.
- Fix memory display when only worker_memory_total is known.
- Regenerate sqlx offline cache for the new/modified queries.
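
As a rough illustration of the widened lookup window in the first bullet, the sketch below filters minute-aligned log batches around a worker's last ping. The actual change lives in the backend query; the `LogFile` shape and the `logFilesAroundLastPing` helper are hypothetical, and only the [-90s, +30s] offsets come from this message.

```typescript
// Offsets from the commit message: [-90s, +30s] around the last ping.
const LOOKBACK_MS = 90 * 1000;
const LOOKAHEAD_MS = 30 * 1000;

// Hypothetical shape of a log batch record.
interface LogFile {
  path: string;
  startedAt: Date; // minute-aligned start of the batch
}

// Keeps the batches whose start falls inside the window around lastPing.
function logFilesAroundLastPing(files: LogFile[], lastPing: Date): LogFile[] {
  const from = lastPing.getTime() - LOOKBACK_MS;
  const to = lastPing.getTime() + LOOKAHEAD_MS;
  return files.filter((f) => {
    const t = f.startedAt.getTime();
    return t >= from && t <= to;
  });
}
```

With a last ping at 12:00:50, a forward-only lookup would skip the batch that started at 12:00:00, which is the one most likely to contain the crash; the widened window keeps it.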

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 17:19:20 +00:00
Ruben Fiszel
b783bf2d83 fix: show full path on hover in deploy drawer and widen drawer (#8799)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 16:40:23 +00:00
Ruben Fiszel
9c85565221 fix: bypass OTEL MITM tracing proxy for git sync jobs (#8796)
Git sync runs as a DeploymentCallback job. When the OTEL MITM tracing
proxy is enabled, all HTTP/HTTPS traffic from the script is rerouted
through a local intercepting proxy that chains to the corporate upstream
proxy. Git's HTTPS to GitHub fails in this setup (TLS interception with
chained CONNECT tunneling is fragile, and git's CA env handling diverges
from what the proxy injects), so customers see "GitHub.com URL couldn't
be reached" until they disable OTEL.

Detect DeploymentCallback jobs in get_proxy_envs_for_lang and fall back
to the stock PROXY_ENVS so git talks to the corporate proxy directly,
unmodified. The git sync script is system code; we don't need HTTP spans
for it.
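
The selection logic described above might look roughly like this. The real change is in the backend's get_proxy_envs_for_lang; this TypeScript version is only an illustration, and the `JobKind` values, the `ProxyEnv` type, and the `selectProxyEnvs` name are assumptions.

```typescript
// List of environment variable name/value pairs passed to the job process.
type ProxyEnv = Array<[string, string]>;

// Assumed subset of job kinds, for illustration.
type JobKind = "script" | "flow" | "deploymentcallback";

// Picks the proxy environment for a job. Git sync runs as a
// DeploymentCallback job, so it always gets the stock proxy settings.
function selectProxyEnvs(
  jobKind: JobKind,
  otelProxyEnabled: boolean,
  stockProxyEnvs: ProxyEnv,
  tracingProxyEnvs: ProxyEnv,
): ProxyEnv {
  if (jobKind === "deploymentcallback") {
    return stockProxyEnvs; // bypass the intercepting proxy entirely
  }
  return otelProxyEnabled ? tracingProxyEnvs : stockProxyEnvs;
}
```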

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 16:32:19 +00:00
Ruben Fiszel
e48c7cf448 move CiTestResult schema outside python-client inline markers (#8795)
CiTestResult was defined between the INLINE START/END markers, which
python-client/build.sh strips and replaces with a wildcard $ref to
openflow.openapi.yaml, breaking the PyPI publish job.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 15:28:09 +00:00
Ruben Fiszel
8b2a8882bc chore(main): release 1.681.0 (#8769)
* chore(main): release 1.681.0

* Apply automatic changes

---------

Co-authored-by: rubenfiszel <275584+rubenfiszel@users.noreply.github.com>
2026-04-10 14:53:04 +00:00
Alexander Petric
5eb9a2e965 add instance onboarding telemetry (#8792)
* [ee] feat: add instance onboarding telemetry

Update ee-repo-ref to include instance_onboarding telemetry field
in the daily stats payload.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update ee-repo-ref to 5f912375340225876a8c1740c3301f39cd6cbd6d

This commit updates the EE repository reference after PR #527 was merged in windmill-ee-private.

Previous ee-repo-ref: b0b10d81060ab6dabee81a5a067ffadc6b48e074

New ee-repo-ref: 5f912375340225876a8c1740c3301f39cd6cbd6d

Automated by sync-ee-ref workflow.

* sqlx

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: windmill-internal-app[bot] <windmill-internal-app[bot]@users.noreply.github.com>
Co-authored-by: Ruben Fiszel <ruben@windmill.dev>
2026-04-10 14:48:38 +00:00
hugocasa
946848feef fix: limit multi-runnable dedicated workers to one job at a time (#8782)
* feat: thread concurrency semaphore through dedicated worker executors

Pass the concurrency_semaphore parameter through bun, deno, and python
start_worker functions to handle_dedicated_process. Also fix the
DedicatedWorkersSelector to use listWorkspacesAsSuperAdmin (so all
workspaces including admins are visible) and skip loading when disabled.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update ee-repo-ref to limit-workers-one-job branch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update ee-repo-ref to 54037e77cdd37777560755fef7075d35906c96d8

This commit updates the EE repository reference after PR #523 was merged in windmill-ee-private.

Previous ee-repo-ref: 56890ea8fca2c1c44a1338a27011b4dd1137d9c9

New ee-repo-ref: 54037e77cdd37777560755fef7075d35906c96d8

Automated by sync-ee-ref workflow.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: windmill-internal-app[bot] <windmill-internal-app[bot]@users.noreply.github.com>
Co-authored-by: Ruben Fiszel <ruben@windmill.dev>
2026-04-10 14:35:13 +00:00
Diego Imbert
3d43d31aba fix: refresh custom instance user password if auth failed (#8787)
* Refresh custom instance user pwd if connection failed

* No longer need to check on startup

* nit: unneeded inner function

* fix
2026-04-10 14:26:53 +00:00
Diego Imbert
8957d8f19b fix: bypass sql type injection during formatting to prevent offset corruption (#8786)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 14:26:33 +00:00
Diego Imbert
3c64a4282d Prompt to analyse assets for whole flow on undetected assets (#8784) 2026-04-10 14:26:20 +00:00
Ruben Fiszel
ec9cec1d02 fix: treat empty global setting strings as unset (#8793)
* fix: treat empty global setting strings as unset

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: close protected-setting whitespace gap in diff and preserve empty ws override

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 14:23:37 +00:00
Ruben Fiszel
09666af157 refactor(git-sync): remove force_branch UI option (#8794)
The new workspaces: section in wmill.yaml lets the CLI auto-select the
right entry by matching baseUrl + workspaceId against the existing
--base-url and --workspace flags the backend already passes, making the
force_branch override redundant.

Backend field and serializer are intentionally left intact for backward
compat with any repository that already has force_branch saved.
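
A minimal sketch of the auto-selection described above, assuming a simplified entry shape. The `WorkspaceEntry` interface and `selectWorkspaceEntry` helper are hypothetical and do not reflect the actual wmill.yaml schema; the trailing-slash normalization is also an assumption.

```typescript
// Hypothetical, simplified view of one entry under `workspaces:`.
interface WorkspaceEntry {
  baseUrl: string;
  workspaceId: string;
}

// Returns the name of the entry matching the given flags, if any.
function selectWorkspaceEntry(
  workspaces: Record<string, WorkspaceEntry>,
  baseUrl: string,
  workspaceId: string,
): string | undefined {
  // Assumption: URLs are compared ignoring trailing slashes.
  const normalize = (url: string): string => url.replace(/\/+$/, "");
  for (const name of Object.keys(workspaces)) {
    const entry = workspaces[name];
    if (entry === undefined) continue;
    if (normalize(entry.baseUrl) === normalize(baseUrl) && entry.workspaceId === workspaceId) {
      return name;
    }
  }
  return undefined;
}
```

Because both values are already passed by the backend, a lookup of this kind needs no extra per-repository override.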

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 14:22:25 +00:00
Ruben Fiszel
6cf7ffc26b feat(vault): add skip_ssl_verify option for HashiCorp Vault (#8791)
* [ee] feat(vault): add skip_ssl_verify option for HashiCorp Vault

Adds an optional skip_ssl_verify boolean to VaultSettings so
self-signed Vault deployments can be used in development without
needing a custom CA bundle. The flag is surfaced as a Toggle in the
HashiCorp Vault section of the secret backend instance settings and
plumbed through to the EE Vault HTTP client builder.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update ee-repo-ref to bcfb663f9e902539abbbf69c517715eb8d4ce8f9

This commit updates the EE repository reference after PR #526 was merged in windmill-ee-private.

Previous ee-repo-ref: 7e1372b8f59fe81aaf61212970ebdf2286be864d

New ee-repo-ref: bcfb663f9e902539abbbf69c517715eb8d4ce8f9

Automated by sync-ee-ref workflow.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: windmill-internal-app[bot] <windmill-internal-app[bot]@users.noreply.github.com>
2026-04-10 13:55:22 +00:00
hugocasa
ce3e676f4a feat: list external JWT tokens in instance settings (#8783)
* [ee] feat: add external JWT tokens listing in instance settings

Add the ability for superadmins to view all external JWT tokens that have
been used for authentication, along with their claim metadata.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update ee-repo-ref.txt

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: move external JWT tokens listing to users tab

- Move list endpoint from /oidc/ext_jwt_tokens to /users/ext_jwt_tokens
- Display as a sub-tab below the instance Users tab, only shown when tokens exist
- Use DataTable's built-in load-more pattern for pagination
- Add "Recently active only" toggle (tokens used in the last 30 days)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add dev_override cargo feature to windmill-common

* feat: show placeholder for legacy external JWT entries

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update ee-repo-ref to 62a462461271b900351c18b0ab1ca78651154b2a

This commit updates the EE repository reference after PR #524 was merged in windmill-ee-private.

Previous ee-repo-ref: 7b493a337abe00a47cf9d94847babe3cb3a6799f

New ee-repo-ref: 62a462461271b900351c18b0ab1ca78651154b2a

Automated by sync-ee-ref workflow.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: windmill-internal-app[bot] <windmill-internal-app[bot]@users.noreply.github.com>
2026-04-10 13:11:00 +00:00
Ruben Fiszel
4fff89f98c fix: hide legacy global_settings.worker_configs ghost row (#8790)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 06:00:03 +00:00
Ruben Fiszel
d243eb31b0 fix: CLI falls back to workspace whoami for workspace-scoped tokens (#8789)
* fix: CLI falls back to workspace whoami when global whoami is 401

Workspace-scoped tokens (token.workspace_id set) cannot call
/api/users/whoami — the backend's token lookup filters by workspace_id
which is NULL on global paths, so auth returns 401 before the handler
runs. This breaks the CLI entirely: requireLogin calls globalWhoami at
the start of every command, so no command works with a
workspace-scoped token, not even `wmill workspace whoami`.

Fix it CLI-side: if the global whoami returns 401, fall back to the
workspace-scoped /api/w/{w}/users/whoami using the workspace already
known from the CLI profile, and adapt the response shape to
GlobalUserInfo. Also drop the redundant second globalWhoami call in
`wmill workspace whoami` — use requireLogin's return value instead.

No backend changes: the workspace_id binding on the token stays
strictly enforced for every global endpoint.
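
The fallback described above could be sketched like this. The fetchers are injected so the example stays self-contained; `whoamiWithFallback`, `isUnauthorized`, and the reduced `GlobalUserInfo` and `WorkspaceUser` shapes are hypothetical and do not mirror the CLI's generated client.

```typescript
// Reduced, hypothetical shapes for illustration.
interface GlobalUserInfo {
  email: string;
  username: string;
  super_admin: boolean;
}
interface WorkspaceUser {
  email: string;
  username: string;
}

// Duck-typed check for an HTTP 401 carried on a thrown value.
function isUnauthorized(e: unknown): boolean {
  return typeof e === "object" && e !== null && (e as { status?: unknown }).status === 401;
}

// Tries the global whoami first; on 401, falls back to the
// workspace-scoped endpoint and adapts the response shape.
async function whoamiWithFallback(
  globalWhoami: () => Promise<GlobalUserInfo>,
  workspaceWhoami: (workspace: string) => Promise<WorkspaceUser>,
  workspace: string,
): Promise<GlobalUserInfo> {
  try {
    return await globalWhoami();
  } catch (e) {
    if (!isUnauthorized(e)) throw e; // other failures are not masked
    const user = await workspaceWhoami(workspace);
    return {
      email: user.email,
      username: user.username,
      super_admin: false, // placeholder: not derivable from the workspace response
    };
  }
}
```

Only a 401 triggers the fallback, so network errors and other failures still surface to the caller.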

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: use name-based ApiError check in whoami fallback

Review feedback from PR #8789: `instanceof ApiError` can silently
return false when bundling produces multiple module instances of
`gen/core/ApiError.ts` (bun build for npm, JSR dev path), which would
skip the workspace-whoami fallback and reintroduce the exact bug this
PR fixes. Match the name-based check already used at
`cli/src/main.ts:232` and drop the `ApiError` import.

Also add a comment on `workspaceUserToGlobalUserInfo` listing the
fields that aren't derivable from the workspace-scoped User response
and are filled with placeholder values, so future callers don't trust
them downstream.
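
The `instanceof` pitfall described above can be reproduced in isolation. In this sketch two separate classes stand in for two bundled copies of the same module; the class and helper names are hypothetical, and the stand-ins are simplified (they do not extend `Error`).

```typescript
// Stand-ins for two copies of the same ApiError module ending up in one bundle.
class ApiErrorFromBundleA {
  name = "ApiError";
  status: number;
  constructor(status: number) {
    this.status = status;
  }
}
class ApiErrorFromBundleB {
  name = "ApiError";
  status: number;
  constructor(status: number) {
    this.status = status;
  }
}

// Identity-based check: only recognizes instances of copy A.
function isApiErrorByInstance(e: unknown): boolean {
  return e instanceof ApiErrorFromBundleA;
}

// Name-based check: works regardless of which copy created the error.
function isApiErrorByName(e: unknown): boolean {
  return typeof e === "object" && e !== null && (e as { name?: unknown }).name === "ApiError";
}
```

An error created by copy B fails the identity-based check but passes the name-based one, which is why the latter is the safer guard when module duplication is possible.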

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 05:55:53 +00:00
Ruben Fiszel
a7512f9034 chore: update git sync script version to 28191 (#8788)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 20:18:01 -04:00
Ruben Fiszel
5b97092997 feat: unify CLI config to workspaces, deprecate gitBranches/environments (#8767)
* refactor: unify CLI config to workspaces, deprecate gitBranches/environments

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* chore: update frontend examples and regenerate system prompts for workspaces config

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* test: update test files to use workspaces config instead of gitBranches

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: handle --branch with --base-url correctly in sync pull/push

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: warn when --workspace overrides auto-detected branch or misses config entry

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: show reason why workspace was selected in log message

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: clarify specificItems file naming uses gitBranch as suffix

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor: rename branch-specific to workspace-specific, use workspace name as file suffix

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor: rename branch-specific to workspace-specific, add comprehensive integration tests

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor: simplify bind and init to be workspace-centric

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor: make bind/unbind interactive with --workspace and --branch flags

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor: make bind interactive with profile selection, workspace name, and optional branch

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor: init offers to bind workspace using same flow as wmill workspace bind

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: skip backend git-sync check in init when no workspace was bound

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: skip all API calls in init when no workspace was bound

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: log when RT namespace is skipped, offer to generate it after bind

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: warn when no workspace bound during init

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: init git-sync check uses bound workspace, not active profile

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: init uses selected profile directly, avoids re-resolving and duplicate prompt

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: init skips requireLogin, uses bound profile token directly

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: auto-pick or prompt workspace from config when no branch matches

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: show configured workspaces list and bind hint in resolution messages

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: cache bound profile to avoid duplicate profile selection prompts in init

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: hoist boundProfile scope, add 2 comprehensive integration tests covering all flows

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: rt.d.ts prompt defaults to no when file exists, better description

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: remove empty overrides from generated config, add specificItems hint

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: add inline comments for non-trivial fields, add overrides/promotionOverrides hints to bound workspaces

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* chore: regenerate system prompts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-04-09 19:46:34 +00:00
Diego Imbert
29e7701972 nit text too long (#8785) 2026-04-09 19:33:16 +00:00
hugocasa
435b25e6a4 feat: add user offboarding flow with object reassignment (#8647)
* feat: add user offboarding flow with object reassignment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: require new_operator for permissioned_as when reassigning to folder

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: update on_behalf_of_email on scripts/flows during offboarding

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: extract offboarding to separate module and add integration tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: delete tokens, add operator preview counts, remove token reassignment UI

Tokens are now always deleted during offboarding. Preview now shows
scripts/flows/apps with on_behalf_of and schedules/triggers with
permissioned_as referencing the departing user (even outside their path).
Token reassignment UI removed since webhooks break on path changes anyway.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: rich preview with path lists, warnings, and downloadable report

Preview now returns full path lists (not just counts) for owned objects
and objects executing on behalf of the user. Adds warnings for:
- HTTP triggers (webhook URLs will change)
- Email triggers (addresses will change)
- Broken $var:/$res: references in resources/variables
Frontend provides "Export list" button to download affected content.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add coverage for dynamic queries (triggers, extra_perms, operator schedules)

Adds HTTP trigger, extra_perms reference, and shared schedule to test
fixture. Tests verify that non-macro sqlx queries (trigger reassignment,
extra_perms cleanup, operator schedule update) work correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: remove broken_references, add full dynamic query test coverage

Remove broken_references field from preview (user's resources/variables
are already in the owned paths list). Add shared HTTP trigger fixture
to test all dynamic query paths: trigger operator preview (line 232),
trigger permissioned_as update for non-user-path (line 951), and
extra_perms cleanup on trigger tables (line 983).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add referencing field to preview for content/value path references

Preview now includes a 'referencing' section listing scripts (by content),
flows (by value JSON), apps (by policy/extra_perms), and resources (by value)
that contain references to u/{username}/ paths. These references may break
after reassignment. Shown in export list and as a warning in the UI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: rename objects to items in UI, detect on_behalf_of items in hasItems

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: replace remaining objects with items in UI text

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: rename operator to on_behalf_of, separate owned vs on-behalf UI sections

- Rename new_operator to new_on_behalf_of_user in API and frontend
- Rename op_ prefixed variables to obo_ in backend
- UI now shows separate sections for owned items and items running
  on behalf, with the operator selector shown only when needed
- canSubmit logic updated: operator needed for folder targets OR
  when on-behalf items exist

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: CSV export, side-by-side layout, always accept new_on_behalf_of_user

- Export affected items as CSV instead of text
- Owned items and on-behalf items shown side by side in summary boxes
- new_on_behalf_of_user always accepted (defaults to target user for
  user targets, required for folder targets)
- On_behalf_of selector always visible, auto-defaults when user target
  is selected
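
The defaulting rule for new_on_behalf_of_user might be expressed as below. This is a sketch, not the backend implementation: the `ReassignTarget` type and `resolveOnBehalfOfUser` helper are hypothetical, and treating an empty string as "not provided" is an assumption.

```typescript
// Hypothetical representation of where items are reassigned.
type ReassignTarget =
  | { kind: "user"; username: string }
  | { kind: "folder"; name: string };

// Resolves the user that triggers/runnables will run on behalf of.
function resolveOnBehalfOfUser(target: ReassignTarget, newOnBehalfOfUser?: string): string {
  // An explicit value is always accepted.
  if (newOnBehalfOfUser !== undefined && newOnBehalfOfUser !== "") {
    return newOnBehalfOfUser;
  }
  // User targets default to the target user.
  if (target.kind === "user") {
    return target.username;
  }
  // Folder targets have no natural default, so the value is required.
  throw new Error("new_on_behalf_of_user is required when reassigning to a folder");
}
```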

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: proper pluralization and bottom-aligned counts in summary boxes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: stack info boxes vertically, referencing box as warning style at top

Info boxes (owned, on-behalf, referencing) now one per row instead of
side-by-side. Referencing box uses warning colors. Webhook/email trigger
alerts shown below boxes. Proper pluralization in global modal too.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: CSV exports only referencing items, export button inside warning box

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: use ToggleButtonGroup for user/folder, add reassign toggle on remove

- User/Folder selection now uses ToggleButtonGroup component
- When removing a user, a "Reassign items before removing" toggle lets
  the admin skip reassignment and just delete directly
- In reassign-only mode, the toggle is not shown (always reassigns)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: show token details with labels and scopes in preview

Preview now returns token label, scopes, and expiration instead of just
a count. Frontend shows a dedicated token box listing each token with
its scopes. Test updated to verify token label in preview response.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: extract shared offboarding components, per-type trigger links, hash deep linking

- Extract OffboardItemsBox, OffboardReassignControls, OffboardWorkspaceSection,
  and offboarding-utils.ts as shared components used by both workspace and global modals
- Change triggers in OffboardAffectedPaths from Vec<String> to HashMap<String, Vec<String>>
  so frontend knows which trigger page to link to
- Add hash-based deep linking to all 9 trigger pages and schedules page
- Preserve URL hash in updateQueryFilters across all trigger pages
- Only open editor drawer if the item is found in the list
- Reassign toggle at top with warning alert when disabled (both modals)
- Referencing items box uses yellow warning variant with expandable path links
- Cleaner labels: "Move u/{username}/* items to", "Update triggers/runnables permissions to"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: rename on_behalf_of section label to match flow advanced settings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: regenerate sqlx query cache

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address PR review issues for offboarding

- Add 9 trigger tables to check_path_conflicts for user-friendly conflict messages
- Fix submit button no-op when user has only on-behalf items (show target selector, fix canSubmit)
- Only delete workspace user when reassignment entry exists (prevent orphaned objects)
- Add $azure_kv: prefix to vault secret query (match rename_user pattern)
- Use Svelte 5 onSelected callback instead of deprecated on:selected
- Make ScriptBuilder section label conditional on canPreserve
- Fix CSV export to include trigger paths via flattenPaths utility
- Fix test_offboard_reassign_only to remove conflicts and assert on response
- Parallelize workspace config fetches in global modal with Promise.all
- Delete tokens when deleting workspace user
- Return structured JSON from global offboard endpoint

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* sqlx

* fix: address second round of PR review issues

- Accumulate per-workspace OffboardSummary in global offboard instead of returning zeros
- Delete workspace user unconditionally when delete_user=true (prevent orphaned usr rows)
- Filter archived/deleted scripts in check_path_conflicts to match preview
- Reset form state when workspace offboard modal reopens
- Move hashHandled=true inside trigger-found guard on all 10 deep-link pages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: improve offboarding integration tests

- Add second workspace to fixture for multi-workspace global offboard testing
- Add test_global_offboard_execution: verifies items reassigned across 2
  workspaces, user deleted from both, and password row deleted from instance
- Add test_offboard_invalid_target: verifies 400 for nonexistent user,
  nonexistent folder, and invalid target format
- Fix test_offboard_to_user: use single DELETE, add explicit new_on_behalf_of_user
- Fix test_global_offboard_preview: assert 2 workspaces instead of 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address third round of PR review issues

- Fix ScriptBuilder tooltip to match conditional section label wording
- Clear stale conflicts in global modal on reopen
- Fix test_offboard_to_folder to assert on specific moved path, not pre-existing data
- Allow deleting user with zero items (show Offboard button, skip reassignment)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: add global token deletion warning in instance-level offboard modal

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* update sqlx

* fix: add raw_app path and dependency_map path reassignment to offboarding

Audit found these tables with user-scoped paths were not being updated:
- raw_app: mirrors app paths, needs path reassignment
- dependency_map: importer_path and imported_path reference user paths

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: move user cleanup to delete_workspace_user_internal, fix review issues

- Move extra_perms, folder owners, drafts, favorites, inputs, captures
  cleanup into delete_workspace_user_internal so any user deletion gets
  proper cleanup (not just offboard path)
- Fix flow INSERT missing labels and lock_error_logs columns (data loss)
- Fix validate_target returning 404 instead of 400 for nonexistent targets
- Fix canSubmit blocking delete when user has no items to reassign
- Fix token preview query filtering out tokens without scopes
- Fix token warning messages: workspace-level mentions webhooks/HTTP triggers,
  instance-level mentions API calls using credentials
- Fix "Schedules and triggers" -> "Triggers and runnables" wording
- Show token section at instance level only when tokens exist
- Show Offboard button at instance level when user has no items but deleteUser=true

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 17:31:36 +00:00
Ruben Fiszel
1deb31f1e0 fix: error on flow/app folder suffix format mismatch during sync push/pull (#8775)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 17:30:59 +00:00
Ruben Fiszel
c57c769dea feat: add CI test scripts with auto-trigger on deploy (#8736)
* feat: add CI test scripts with auto-trigger on deploy

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: fix annotation parser early return and handle renames correctly

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: move CI test results to top of script/flow detail pages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: improve CI test results spacing, icon, and remove pass label

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: support one-line annotation and use script/path format

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: move CI test trigger logic to EE

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: move CI badge next to New badge and add deduplicated CI summary

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add CI test e2e tests and fix nullable column annotations

Add integration tests for CI test annotation parsing (creates/removes
ci_test_reference rows) and the CI test results API (single + batch
endpoints). Add backend test for auto-trigger on deploy (private+python).

Fix sqlx LEFT JOIN LATERAL nullable column annotations in
get_ci_test_results and get_ci_test_results_batch queries — sqlx
cannot infer nullability from LATERAL subqueries, causing runtime
decode errors when no matching job exists.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix build/sqlx

* fix

* feat: CI test improvements and templates

- Fix windmill-dep-map/private feature propagation in worker, api-scripts,
  and api-flows Cargo.toml so CI test triggers actually fire in EE mode
- Clone ci_test_reference rows during workspace fork
- Add polling to CiTestResults component (refetch every 3s while running)
- Add running state and auto-refresh to ForkWorkspaceBanner CI summary
- Add yellow "CI test" badge on script list rows and detail page
- Fix Library badge border color (remove indigo border override)
- Add CI Test TypeScript and CI Test Python templates in ScriptBuilder
- Update sqlx offline cache
- Add debug tracing for CI test trigger in worker_lockfiles

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: add missing children prop to WorkspaceDeployLayout

Fixes svelte-fast-check type error when passing named snippets as
children content inside the component tag.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address PR review feedback

- Remove empty wrapper divs around CiTestResults, move mb-4 into component
- Add batch endpoint size cap (max 200 items)
- Add ON DELETE CASCADE to ci_test_reference workspace FK (new migration)
- Downgrade CI test trigger logs from info to debug
- Fix false-positive polling: only treat status='running' as running,
  not null status (CiTestResults, CompareWorkspaces, ForkWorkspaceBanner)
- Fix test numbering in integration tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update ee-repo-ref to latest EE commit

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update ee-repo-ref to d9d68c2406df0b59f413ea0b2cb24780a9817d04

This commit updates the EE repository reference after PR #516 was merged in windmill-ee-private.

Previous ee-repo-ref: d7ccd9b86da99ec056a0e8708e3637d64290387a

New ee-repo-ref: d9d68c2406df0b59f413ea0b2cb24780a9817d04

Automated by sync-ee-ref workflow.

* fix: treat queued jobs (job_id set, null status) as running

Jobs that have been pushed but not yet picked up by a worker have a
job_id but null status. Treat these as 'running' to avoid showing
misleading 'pass' badges or '0 passing'. Tests that were never
triggered (no job_id, null status) remain neutral/hidden.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: hugocasa <hugo@casademont.ch>
Co-authored-by: windmill-internal-app[bot] <windmill-internal-app[bot]@users.noreply.github.com>
2026-04-09 17:21:36 +00:00
centdix
b73be37916 feat: add edit yaml button to raw app settings (#8771)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 17:21:02 +00:00
Diego Imbert
4b876392a0 feat: oauth manual connect option (#8770) 2026-04-09 17:19:25 +00:00
centdix
5f57727a4d feat: allow selecting hub flows as raw app backend runnables (#8772)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 17:18:40 +00:00
Diego Imbert
6d36eca216 fix: Flow status viewer layout nits (avoid excess y space and scroll) (#8780) 2026-04-09 17:15:14 +00:00
Ruben Fiszel
3fb557a7f5 fix: flow step testing UX improvements (#8781)
* fix: flow dev page layout and compact toolbar improvements

- Fix JSON.parse error on /flows/dev page when editor not yet initialized
- Increase compact topbar threshold from 700px to 800px
- Reposition "Test flow" button below settings bar when pane is narrow on dev pages

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: flow step testing UX improvements

- Store and display logs for step test results (previewLogs in flowState)
- Add logs toggle button in output picker popover
- Fix AI proxy 401 in VS Code extension by passing OpenAPI.TOKEN
- Prevent output picker from closing when clicking Run on same node
- Make toggleOpen idempotent to avoid flicker
- Show loading placeholder in badge area during test execution
- Keep pin button visible during test runs
- Auto-refresh step history when new test completes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: track previous previewJobId to avoid redundant history refreshes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: dev page insert popover z-index, summary editing, output picker UX

- Add #flow-editor portal div to /flows/dev page for correct popover stacking
- Add summary text field at bottom of dev pages when a step node is selected
- Keep pin button visible during test runs
- Show loading placeholder badge to prevent content shift
- Exclude same-node run button from output picker outside-click detection
- Make toggleOpen idempotent when popover already open

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: reuse findModuleInFlow instead of duplicated findModule

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 16:40:34 +00:00
Diego Imbert
e63924e377 fix: disable scroll-to-change-number on number inputs (#8777)
* fix: disable scroll-to-change-number on number inputs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: add comment explaining wheel handler

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 15:39:49 +00:00
Diego Imbert
3d02be98f7 fix: normalize multi-word pg types in build_parameters to fix float8 serialization (#8778)
Multi-word Postgres type names like "double precision" caused the SQL
parser regex to fail (no spaces allowed in type group), falling back to
otyp="text". When Postgres inferred float8 for the column, the
text-typed null couldn't serialize, breaking DB Manager inserts/updates.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 15:39:32 +00:00
Ruben Fiszel
89920e77f3 fix: flow dev page layout and compact toolbar improvements (#8776)
- Fix JSON.parse error on /flows/dev page when editor not yet initialized
- Increase compact topbar threshold from 700px to 800px
- Reposition "Test flow" button below settings bar when pane is narrow on dev pages

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 15:08:35 +00:00
Henri Courdent
11ecb5a774 Volumes link (#8773) 2026-04-09 08:00:35 -04:00
Ruben Fiszel
506b7f55e1 fix: zero-downtime coordinated restarts for OTEL and other setting changes (#8768)
* fix: zero-downtime coordinated restarts for OTEL and other setting changes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: use background_task_state for server heartbeats and fix stale heartbeat detection

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: show restart propagation toast when saving settings that trigger server restarts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 20:35:27 +00:00
Ruben Fiszel
25f4242a87 chore(main): release 1.680.0 (#8757)
* chore(main): release 1.680.0

* Apply automatic changes

---------

Co-authored-by: rubenfiszel <275584+rubenfiszel@users.noreply.github.com>
2026-04-08 16:39:17 +00:00
Ruben Fiszel
609d94aa31 fix bun lock 2026-04-08 16:22:56 +00:00
Ruben Fiszel
80c8e076fc cli nit 2026-04-08 16:16:34 +00:00
Diego Imbert
d2992af8be refactor: move ws_specific from resource column to separate table (#8766)
* Move ws_specific to separate table

* on delete cascade

* feat: handle ws_specific on resource rename and delete

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* is_false never used

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-04-08 16:10:41 +00:00
Ruben Fiszel
e36d440a25 fix: resolve esbuild host/binary version mismatch in app sync push (#8765)
* fix: resolve esbuild host/binary version mismatch in app sync push

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Revert "fix: resolve esbuild host/binary version mismatch in app sync push"

This reverts commit 8822614f8e.

* fix: update esbuild to 0.28.0 and pin version exactly

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 16:09:58 +00:00
Ruben Fiszel
fa668707c0 fix: move alert config from config table to global_settings (#8762)
* feat: move alert config from config table to global_settings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update ee-repo-ref.txt

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: rename alert setting to alert_job_queue_waiting

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update ee-repo-ref.txt

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add CLI unit tests for pullInstanceConfigs/pushInstanceConfigs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update ee-repo-ref.txt

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update ee-repo-ref.txt to merged main

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 15:54:44 +00:00
Diego Imbert
c69f10d20d fix: skip serializing ws_specific on resources when false (#8764)
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-04-08 15:53:01 +00:00
Ruben Fiszel
84778ca3e9 chore: update ee-repo-ref.txt
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 15:52:10 +00:00
Ruben Fiszel
c4c003dab8 chore: update ee-repo-ref.txt
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 15:51:14 +00:00
582 changed files with 30588 additions and 7337 deletions

2
.gitignore vendored

@@ -25,6 +25,8 @@ rust-client/Cargo.toml
backend/target
frontend/node_modules
typescript-client/node_modules
ai_evals/node_modules
ai_evals/results/
frontend/.svelte-kit
backend/chrome_profiler.json
.fast-check/


@@ -1,5 +1,89 @@
# Changelog
## [1.683.0](https://github.com/windmill-labs/windmill/compare/v1.682.0...v1.683.0) (2026-04-13)
### Features
* add black-box ai eval benchmarks ([#8618](https://github.com/windmill-labs/windmill/issues/8618)) ([cdcc564](https://github.com/windmill-labs/windmill/commit/cdcc56461b77554964622f490ae901f170886595))
* add deploy restriction rule and fork review requests ([#8804](https://github.com/windmill-labs/windmill/issues/8804)) ([64c58c8](https://github.com/windmill-labs/windmill/commit/64c58c824fcefe00f15405b7e3877eb566a3ffa2))
* allow non-admins to create and edit HTTP triggers ([#8810](https://github.com/windmill-labs/windmill/issues/8810)) ([9fb7816](https://github.com/windmill-labs/windmill/commit/9fb78164b4baa14c10d10f91ae969d48590c29f3))
* display agent message in flow graph ([#8806](https://github.com/windmill-labs/windmill/issues/8806)) ([95411b2](https://github.com/windmill-labs/windmill/commit/95411b256332fa41816a93b19906f1534da9b300))
* folder default_permissioned_as rules for ownership defaults on deploy ([#8801](https://github.com/windmill-labs/windmill/issues/8801)) ([60211c1](https://github.com/windmill-labs/windmill/commit/60211c1d1910b5f7ac6fed112f790201d2047a4c))
* instance-level ruff config auto-pulled by LSP container ([#8803](https://github.com/windmill-labs/windmill/issues/8803)) ([3f5841f](https://github.com/windmill-labs/windmill/commit/3f5841f84d878cd3f43c435fa237d3f0c2265fb9))
### Bug Fixes
* **cli:** make cli help resilient to npm registry fetch failures ([#8809](https://github.com/windmill-labs/windmill/issues/8809)) ([b6f1cc7](https://github.com/windmill-labs/windmill/commit/b6f1cc70cd87c61df7112d3838fbb5fe9bcdc145))
* enrich OTEL log records with per-request LogContext ([#8812](https://github.com/windmill-labs/windmill/issues/8812)) ([42d3e8c](https://github.com/windmill-labs/windmill/commit/42d3e8c7893cd959c7faffd19cd210c869c604f8))
* silence user-facing toast for non-critical hub script tracking error ([#8808](https://github.com/windmill-labs/windmill/issues/8808)) ([378ba78](https://github.com/windmill-labs/windmill/commit/378ba7828456c871b5778f1144c4bb559bd5a733))
### Performance Improvements
* add inline-persist fast path for WAC v2 step() ([#8807](https://github.com/windmill-labs/windmill/issues/8807)) ([b3ef4bc](https://github.com/windmill-labs/windmill/commit/b3ef4bc26c5696624efee89b5e4e33e77e10cf15))
## [1.682.0](https://github.com/windmill-labs/windmill/compare/v1.681.0...v1.682.0) (2026-04-10)
### Features
* enrich hanging flow error with worker and service log info ([#8800](https://github.com/windmill-labs/windmill/issues/8800)) ([59c457a](https://github.com/windmill-labs/windmill/commit/59c457a13881e35c229baed3edd87e618f89b9a0))
### Bug Fixes
* bypass OTEL MITM tracing proxy for git sync jobs ([#8796](https://github.com/windmill-labs/windmill/issues/8796)) ([9c85565](https://github.com/windmill-labs/windmill/commit/9c855652212dbac0e49f87dedd447d3d7d7b500a))
* show full path on hover in deploy drawer and widen drawer ([#8799](https://github.com/windmill-labs/windmill/issues/8799)) ([b783bf2](https://github.com/windmill-labs/windmill/commit/b783bf2d835cde0843739f7d1099193bb0af042e))
## [1.681.0](https://github.com/windmill-labs/windmill/compare/v1.680.0...v1.681.0) (2026-04-10)
### Features
* add CI test scripts with auto-trigger on deploy ([#8736](https://github.com/windmill-labs/windmill/issues/8736)) ([c57c769](https://github.com/windmill-labs/windmill/commit/c57c769deaa207e7ba7995f75649d3630774e898))
* add edit yaml button to raw app settings ([#8771](https://github.com/windmill-labs/windmill/issues/8771)) ([b73be37](https://github.com/windmill-labs/windmill/commit/b73be37916de808dc64bec1337edf6e7d3993c5e))
* add user offboarding flow with object reassignment ([#8647](https://github.com/windmill-labs/windmill/issues/8647)) ([435b25e](https://github.com/windmill-labs/windmill/commit/435b25e6a4c7272c0189cbcfb83526379f41ebf0))
* allow selecting hub flows as raw app backend runnables ([#8772](https://github.com/windmill-labs/windmill/issues/8772)) ([5f57727](https://github.com/windmill-labs/windmill/commit/5f57727a4d956a9066b005b3c55f08dd6780475a))
* list external JWT tokens in instance settings ([#8783](https://github.com/windmill-labs/windmill/issues/8783)) ([ce3e676](https://github.com/windmill-labs/windmill/commit/ce3e676f4ab0c442058c64db4ebf35545a805ef5))
* oauth manual connect option ([#8770](https://github.com/windmill-labs/windmill/issues/8770)) ([4b87639](https://github.com/windmill-labs/windmill/commit/4b876392a0ce41ae42bd882ced10fe0187e532bc))
* unify CLI config to workspaces, deprecate gitBranches/environments ([#8767](https://github.com/windmill-labs/windmill/issues/8767)) ([5b97092](https://github.com/windmill-labs/windmill/commit/5b9709299761b83a88df17a4259c431dfcd244f9))
* **vault:** add skip_ssl_verify option for HashiCorp Vault ([#8791](https://github.com/windmill-labs/windmill/issues/8791)) ([6cf7ffc](https://github.com/windmill-labs/windmill/commit/6cf7ffc26bcbc8f4ef0e4ad2879fcd114332c4e2))
### Bug Fixes
* bypass sql type injection during formatting to prevent offset corruption ([#8786](https://github.com/windmill-labs/windmill/issues/8786)) ([8957d8f](https://github.com/windmill-labs/windmill/commit/8957d8f19bce3430871c2858b3accd53e0be178f))
* CLI falls back to workspace whoami for workspace-scoped tokens ([#8789](https://github.com/windmill-labs/windmill/issues/8789)) ([d243eb3](https://github.com/windmill-labs/windmill/commit/d243eb31b014781a249f903b2a467aa58909ddd6))
* disable scroll-to-change-number on number inputs ([#8777](https://github.com/windmill-labs/windmill/issues/8777)) ([e63924e](https://github.com/windmill-labs/windmill/commit/e63924e3778b40486813192dc2913e565e0a765e))
* error on flow/app folder suffix format mismatch during sync push/pull ([#8775](https://github.com/windmill-labs/windmill/issues/8775)) ([1deb31f](https://github.com/windmill-labs/windmill/commit/1deb31f1e01d6168eee3c2cc242cb483272d1965))
* flow dev page layout and compact toolbar improvements ([#8776](https://github.com/windmill-labs/windmill/issues/8776)) ([89920e7](https://github.com/windmill-labs/windmill/commit/89920e77f3f5dc45db939ec938d92c881dccc8a0))
* Flow status viewer layout nits (avoid excess y space and scroll) ([#8780](https://github.com/windmill-labs/windmill/issues/8780)) ([6d36eca](https://github.com/windmill-labs/windmill/commit/6d36eca21684f9d3ab36658c2b66f85b9be8d331))
* flow step testing UX improvements ([#8781](https://github.com/windmill-labs/windmill/issues/8781)) ([3fb557a](https://github.com/windmill-labs/windmill/commit/3fb557a7f51dbbd3fac445734196f1b9a1d2e287))
* hide legacy global_settings.worker_configs ghost row ([#8790](https://github.com/windmill-labs/windmill/issues/8790)) ([4fff89f](https://github.com/windmill-labs/windmill/commit/4fff89f98ce72997a055cc313c8fe217d2f1fe78))
* limit multi-runnable dedicated workers to one job at a time ([#8782](https://github.com/windmill-labs/windmill/issues/8782)) ([946848f](https://github.com/windmill-labs/windmill/commit/946848feef60aba2a54bc2f5b686b33cc96ec9ef))
* normalize multi-word pg types in build_parameters to fix float8 serialization ([#8778](https://github.com/windmill-labs/windmill/issues/8778)) ([3d02be9](https://github.com/windmill-labs/windmill/commit/3d02be98f748d985f688243f3215d15ca4227f8f))
* refresh custom instance user password if auth failed ([#8787](https://github.com/windmill-labs/windmill/issues/8787)) ([3d43d31](https://github.com/windmill-labs/windmill/commit/3d43d31aba276f23903f16f06035a4c4955b52e2))
* treat empty global setting strings as unset ([#8793](https://github.com/windmill-labs/windmill/issues/8793)) ([ec9cec1](https://github.com/windmill-labs/windmill/commit/ec9cec1d02d87328db92a71a1b3a945e9e0c6bd2))
* zero-downtime coordinated restarts for OTEL and other setting changes ([#8768](https://github.com/windmill-labs/windmill/issues/8768)) ([506b7f5](https://github.com/windmill-labs/windmill/commit/506b7f55e17472d1384e9676c1b6df7a9d7a118b))
## [1.680.0](https://github.com/windmill-labs/windmill/compare/v1.679.0...v1.680.0) (2026-04-08)
### Features
* add CLI workspace merge command and enhance fork with datatable/color support ([#8756](https://github.com/windmill-labs/windmill/issues/8756)) ([4342c18](https://github.com/windmill-labs/windmill/commit/4342c1854134500d3b2bc46280f9885ee84e2c9e))
* add scheduled job deletion with configurable retention period ([#8753](https://github.com/windmill-labs/windmill/issues/8753)) ([2d18a68](https://github.com/windmill-labs/windmill/commit/2d18a680991babe317ca315bbce40e6ce733afda))
* add status indicator dots to parallel loop iteration picker ([#8761](https://github.com/windmill-labs/windmill/issues/8761)) ([470b8aa](https://github.com/windmill-labs/windmill/commit/470b8aa5f1870e26fea022c1e2a9f48471d8a205))
### Bug Fixes
* move alert config from config table to global_settings ([#8762](https://github.com/windmill-labs/windmill/issues/8762)) ([fa66870](https://github.com/windmill-labs/windmill/commit/fa668707c0ee7f261d78e145666b1073471259fd))
* resolve esbuild host/binary version mismatch in app sync push ([#8765](https://github.com/windmill-labs/windmill/issues/8765)) ([e36d440](https://github.com/windmill-labs/windmill/commit/e36d440a251a43ea888e3ce378d0bb8ed8f42e11))
* skip serializing ws_specific on resources when false ([#8764](https://github.com/windmill-labs/windmill/issues/8764)) ([c69f10d](https://github.com/windmill-labs/windmill/commit/c69f10d20dd064f0c329934096c2945424ff81f2))
## [1.679.0](https://github.com/windmill-labs/windmill/compare/v1.678.0...v1.679.0) (2026-04-07)

2
ai_evals/.gitignore vendored Normal file

@@ -0,0 +1,2 @@
.env
results/

172
ai_evals/AGENTS.md Normal file

@@ -0,0 +1,172 @@
# AI Evals Authoring Guide
This folder contains black-box benchmark cases for:
- `flow`
- `app`
- `script`
- `cli`
The goal is to test the current production prompts and guidance with realistic user requests, not to test one exact implementation shape.
## Core rules
1. Write prompts like a real user request.
2. Prefer behavior, inputs, constraints, and outcomes over internal implementation details.
3. Keep deterministic validation narrow and hard.
4. Put semantic expectations in `judgeChecklist`.
5. Use `expected` fixtures only when exact structure really matters.
## Prompt writing
Prompts should sound like something a user would naturally ask.
Good:
- "Create a flow that routes support requests based on customer tier."
- "Add a reset button that sets the counter back to 0."
- "Create a flow that reuses the existing greeting script instead of duplicating the logic."
Bad:
- "Use `branchone` with 3 branches and a default branch."
- "Create a `rawscript` step with this exact topology."
- "This is a benchmark harness."
Do not write prompts as if the user knows Windmill internals unless the case is explicitly testing a power-user workflow.
## Flow-specific rules
The main principle:
- flow prompts should read like requests from a user who does not know the product internals
- the user should ask for behavior, not for `branchone`, `branchall`, `rawscript`, `preprocessor_module`, `failure_module`, exact graph topology, or other internal constructs
That means:
- creation cases should describe the business behavior and expected result
- modification cases may mention existing step names, because the user can see the current flow
- only mention special Windmill constructs when the case is explicitly about those constructs
Examples:
- acceptable creation prompt:
"Create a purchase approval flow that pauses for approval and asks the approver for a comment."
- avoid:
"Create a suspend step with one required event and a resume form."
For flow cases, do not fail a case just because the model chose a different valid topology.
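One way to catch internal vocabulary leaking into creation prompts is a small lint pass. The sketch below is illustrative only: `findInternalTerms` and `INTERNAL_TERMS` are hypothetical names that do not come from the benchmark runner, and the term list is simply copied from the rules above.

```typescript
// Hypothetical helper, not part of the ai_evals runner.
// The terms are the internal constructs named in the flow-specific rules.
const INTERNAL_TERMS: string[] = [
  "branchone",
  "branchall",
  "rawscript",
  "preprocessor_module",
  "failure_module",
];

// Returns the internal construct names that appear in a prompt (case-insensitive).
function findInternalTerms(prompt: string): string[] {
  const lower = prompt.toLowerCase();
  return INTERNAL_TERMS.filter((term) => lower.includes(term));
}
```

An empty result suggests the prompt reads like a user request; a non-empty result is only a hint to review the wording, since modification cases and power-user cases may legitimately mention these names.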
## App-specific rules
App prompts should focus on user-visible behavior:
- what the UI should let the user do
- what should persist
- what backend behavior is needed
Avoid prompting in terms of React structure, component names, or implementation unless the case is specifically about editing an existing app.
## CLI-specific rules
CLI prompts can be more explicit about paths and file names because real CLI users often do specify them.
Still, avoid benchmark phrasing. The prompt should read like a repo task, not a harness instruction.
When relevant, ask the assistant to tell the user which `wmill` commands to run next. That is part of the benchmarked behavior.
## Deterministic validation
Use deterministic validation only for hard failures such as:
- missing required files
- unexpected extra files when the prompt says not to create them
- syntax errors
- unresolved flow refs
- missing required special modules or suspend config
- obvious artifact corruption
Do not use deterministic validation to enforce one preferred implementation for broad creation tasks.
Examples of bad hard checks:
- exact step topology for a creation flow
- exact branch structure when the prompt only asked for routing behavior
- exact input shape when multiple reasonable shapes are acceptable
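As a rough illustration of keeping deterministic validation narrow, the sketch below checks only two of the hard failures listed above: missing required files and unexpected extra files. The `FileCheckSpec` shape and the `checkFiles` function are assumptions made for this example; the actual validators in this folder may be structured differently.

```typescript
// Hypothetical shapes for illustration; the real runner's types may differ.
interface FileCheckSpec {
  // Files the prompt explicitly requires.
  requiredFiles: string[];
  // Set to true only when the prompt says not to create other files.
  forbidExtraFiles: boolean;
}

// Returns a list of hard failures; an empty list means the output passes this check.
function checkFiles(produced: string[], spec: FileCheckSpec): string[] {
  const failures: string[] = [];
  for (const required of spec.requiredFiles) {
    if (!produced.includes(required)) {
      failures.push(`missing required file: ${required}`);
    }
  }
  if (spec.forbidExtraFiles) {
    for (const file of produced) {
      if (!spec.requiredFiles.includes(file)) {
        failures.push(`unexpected extra file: ${file}`);
      }
    }
  }
  return failures;
}
```

Note that the check says nothing about topology, branch structure, or input shape; those belong in the judge checklist.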
## Judge checklist
Every non-trivial case should have a `judgeChecklist`.
The checklist should capture:
- the user-visible behavior that must be present
- important constraints
- key completion criteria
The checklist should not duplicate low-level implementation details unless they are truly required by the task.
Good checklist items:
- "the flow calculates the order total with 8% tax"
- "the app persists recipes appropriately for a raw Windmill app"
- "the flow reuses the existing workspace script instead of rewriting the logic"
Bad checklist items:
- "uses `branchone`"
- "contains a `rawscript` node"
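To make the shape of a checklist concrete, here is a minimal sketch of a case definition. Only the `judgeChecklist` field name comes from this guide; the `EvalCase` interface, the `id` and `prompt` fields, and the example prompt are hypothetical and may not match the real case schema.

```typescript
// Hypothetical case shape; only `judgeChecklist` is a name used in this guide.
interface EvalCase {
  id: string; // hypothetical identifier
  prompt: string; // hypothetical field holding the user-style request
  judgeChecklist: string[];
}

// The checklist states user-visible behavior, not which node types were used.
const orderTotalCase: EvalCase = {
  id: "flow-order-total",
  prompt: "Create a flow that calculates an order total including 8% tax.",
  judgeChecklist: ["the flow calculates the order total with 8% tax"],
};
```

A different valid topology would still satisfy this checklist, which is the intent.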
## When to use `expected`
Use `expected` fixtures when the case is structure-sensitive, for example:
- exact file creation
- exact script content
- modification cases where a specific file must change in a specific way
- cases where preserving an existing structure is part of the requirement
Do not use a full `expected` artifact as the semantic oracle for broad creation tasks when multiple valid outputs should pass.
## When to use `initial`
Use `initial` when the benchmark is about:
- editing an existing artifact
- reusing existing workspace assets
- preserving existing behavior while adding a change
If the case is greenfield, prefer no `initial`.
## Case design ladder
Prefer suites that get gradually harder:
1. trivial create case
2. realistic create case
3. reuse-existing-assets case
4. modification case
5. refactor case
6. edge-case or niche product behavior
The last cases in a suite should cover unusual or product-specific behavior.
## Anti-patterns
Avoid these:
- benchmark framing in prompts
- over-specified internal topology for creation tasks
- judge checklists that just restate implementation details
- deterministic validation that encodes one preferred solution
- fixtures that are so minimal or brittle that they create false negatives
## Before adding a case
Ask:
1. Would a real user plausibly write this prompt?
2. If the model solves it in a different valid way, would the case still pass?
3. Are the hard deterministic checks only catching objectively broken output?
4. Does the `judgeChecklist` describe the real success criteria?
5. If this case fails, will the reason be understandable from the saved artifacts?

ai_evals/CLAUDE.md Normal file

@@ -0,0 +1 @@
@AGENTS.md

ai_evals/README.md Normal file

@@ -0,0 +1,176 @@
# AI Evals
Small benchmark runner for the four Windmill AI generation modes:
- `cli`
- `flow`
- `script`
- `app`
The benchmark always tests the current production prompts, tools, and guidance in this checkout.
Each attempt runs:
1. the real production path
2. deterministic validation
3. LLM judging
## Install
```bash
cd ai_evals
bun install
```
Frontend modes also require frontend dependencies:
```bash
cd frontend
bun install
```
## Commands
List model aliases:
```bash
cd ai_evals
bun run cli -- models
```
List cases:
```bash
cd ai_evals
bun run cli -- cases
bun run cli -- cases flow
```
Run benchmarks:
```bash
cd ai_evals
bun run cli -- run flow
bun run cli -- run flow flow-test4-order-processing-loop --model opus
bun run cli -- run flow flow-test0-sum-two-numbers --models haiku,opus,4o
bun run cli -- run flow flow-test0-sum-two-numbers --runs 3 --verbose
bun run cli -- run flow --record
bun run cli -- run cli bun-hello-script
```
Public CLI surface:
- `models`
- `cases [mode]`
- `run <mode> [caseIds...]`
`run` options:
- `--runs <n>`: repeat each case `n` times
- `--output <path>`: custom result JSON path
- `--model <alias>`: choose the model under test
- `--models <a,b,c>`: run the same cases sequentially against several model aliases
- `--verbose`: stream assistant output for frontend runs
- `--record`: append a compact tracked summary line to `ai_evals/history/<mode>.jsonl` for full-suite runs only
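For illustration only, the `--models <a,b,c>` behavior could be sketched as splitting the comma-separated value and running one suite per alias in order. `parseModelAliases` and `runSequentially` are hypothetical helpers; the real CLI may parse its flags differently.

```typescript
// Hypothetical helper: splits "haiku,opus,4o" into individual aliases.
function parseModelAliases(value: string): string[] {
  return value
    .split(",")
    .map((alias) => alias.trim())
    .filter((alias) => alias.length > 0);
}

// Runs one suite per alias, strictly one after another (never in parallel).
async function runSequentially<T>(
  aliases: string[],
  runSuiteForModel: (alias: string) => Promise<T>
): Promise<Map<string, T>> {
  const results = new Map<string, T>();
  for (const alias of aliases) {
    results.set(alias, await runSuiteForModel(alias));
  }
  return results;
}
```

Awaiting inside the loop is what keeps the runs sequential, which matches the option's description.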
## Models
Use `bun run cli -- models` to see the current aliases.
Today:
- `haiku`
- `sonnet`
- `opus`
- `4o`
- `gemini-flash`
- `gemini-pro`
- `gemini-3-flash-preview`
- `gemini-3.1-pro-preview`
Notes:
- the command also prints accepted alias spellings such as `gpt-4o`, `claude-opus-4.6`, and `claude-haiku-4.5`
- frontend modes (`flow`, `script`, `app`) can use Anthropic, OpenAI, and Gemini-backed aliases
- `cli` mode always uses the Anthropic agent SDK, so only Anthropic aliases are valid there
- the judge model is separate and currently defaults to `claude-sonnet-4-6`
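The provider restriction in these notes can be sketched as a small lookup. The alias-to-provider table below is inferred from the alias names listed above and is an assumption, not the table in `core/models`; `isAliasValidForMode` is a hypothetical name.

```typescript
type EvalMode = "cli" | "flow" | "script" | "app";
type Provider = "anthropic" | "openai" | "gemini";

// Assumed mapping, inferred from the alias names in this README.
const ALIAS_PROVIDERS = new Map<string, Provider>([
  ["haiku", "anthropic"],
  ["sonnet", "anthropic"],
  ["opus", "anthropic"],
  ["4o", "openai"],
  ["gemini-flash", "gemini"],
  ["gemini-pro", "gemini"],
  ["gemini-3-flash-preview", "gemini"],
  ["gemini-3.1-pro-preview", "gemini"],
]);

function isAliasValidForMode(mode: EvalMode, alias: string): boolean {
  const provider = ALIAS_PROVIDERS.get(alias);
  if (provider === undefined) {
    return false;
  }
  // `cli` mode goes through the Anthropic agent SDK, so other providers are rejected.
  return mode === "cli" ? provider === "anthropic" : true;
}
```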
## Case Format
Cases live in one YAML file per mode under `ai_evals/cases/`.
Minimal shape:
```yaml
- id: flow-test0-sum-two-numbers
  prompt: |-
    Create a flow that takes two numbers, `a` and `b`, and returns their sum.
  initial: ai_evals/fixtures/...
  expected: ai_evals/fixtures/...
```
Optional fields:
- `initial`: starting state fixture
- `expected`: expected artifact fixture
- `validate`: extra deterministic validation rules
For `flow` mode, `validate` can express requirements such as:
- accepted input schema shapes
- required `results.*` reference validity
- required module/code/input characteristics
For `flow` mode, an `initial` fixture can also include a benchmark workspace catalog of
existing scripts and flows. That lets the real `search_workspace` and
`get_runnable_details` tools discover reusable workspace runnables during evals.
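The case shape described in this section could be checked after YAML parsing roughly as follows. This is a sketch under assumptions: `EvalCase` and `parseCases` are hypothetical names, the `validate` payload is left opaque, and the real loader in `core/` may differ.

```typescript
// Field names follow this README; `validate` is kept opaque in this sketch.
interface EvalCase {
  id: string;
  prompt: string;
  initial?: string;
  expected?: string;
  validate?: unknown;
}

// Hypothetical check for an already-parsed YAML document.
function parseCases(parsed: unknown): EvalCase[] {
  if (!Array.isArray(parsed)) {
    throw new Error("case file must contain a list of cases");
  }
  return parsed.map((entry: unknown, index: number): EvalCase => {
    if (typeof entry !== "object" || entry === null) {
      throw new Error(`case #${index} must be a mapping`);
    }
    const record = entry as Record<string, unknown>;
    const id = record["id"];
    const prompt = record["prompt"];
    const initial = record["initial"];
    const expected = record["expected"];
    if (typeof id !== "string" || id.length === 0) {
      throw new Error(`case #${index} is missing a string id`);
    }
    if (typeof prompt !== "string" || prompt.length === 0) {
      throw new Error(`case ${id} is missing a prompt`);
    }
    if (initial !== undefined && typeof initial !== "string") {
      throw new Error(`case ${id}: initial must be a fixture path`);
    }
    if (expected !== undefined && typeof expected !== "string") {
      throw new Error(`case ${id}: expected must be a fixture path`);
    }
    return { id, prompt, initial, expected, validate: record["validate"] };
  });
}
```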
## Results And Artifacts
Every run writes:
- a summary JSON under `ai_evals/results/`
- generated artifacts in a sibling directory
If `--record` is used, the CLI also appends one compact JSON line to:
- `ai_evals/history/flow.jsonl`
- `ai_evals/history/script.jsonl`
- `ai_evals/history/app.jsonl`
- `ai_evals/history/cli.jsonl`
Each recorded line contains:
- run metadata (`createdAt`, `gitSha`, `mode`, `runModel`, `judgeModel`)
- suite totals (`caseCount`, `attemptCount`, `passedAttempts`, `passRate`, `averageDurationMs`, `averageJudgeScore`)
- average token usage (`averageTokenUsagePerAttempt`)
- per-case metrics under `cases[]` (`averageDurationMs`, `averageJudgeScore`, `averageTokenUsagePerAttempt`, pass rate)
- `failedCaseIds`
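To make the recorded fields concrete, here is a minimal sketch of how such a line could be derived from per-attempt results. The field names come from the list above; everything else is an assumption, including `buildHistoryLine`, `toJsonLine`, treating `passRate` as a fraction between 0 and 1, and counting a case as failed when any of its attempts fails. Judge scores, token usage, and `cases[]` are omitted for brevity.

```typescript
interface AttemptResult {
  caseId: string;
  passed: boolean;
  durationMs: number;
}

interface RunMeta {
  createdAt: string;
  gitSha: string;
  mode: string;
  runModel: string;
  judgeModel: string;
}

interface HistoryLine extends RunMeta {
  caseCount: number;
  attemptCount: number;
  passedAttempts: number;
  passRate: number; // assumed to be a fraction in [0, 1]
  averageDurationMs: number;
  failedCaseIds: string[];
}

function buildHistoryLine(meta: RunMeta, attempts: AttemptResult[]): HistoryLine {
  const caseIds = new Set<string>();
  const failedCaseIds = new Set<string>();
  let passedAttempts = 0;
  let totalDurationMs = 0;

  for (const attempt of attempts) {
    caseIds.add(attempt.caseId);
    totalDurationMs += attempt.durationMs;
    if (attempt.passed) {
      passedAttempts += 1;
    } else {
      // Assumption: one failed attempt is enough to list the case as failed.
      failedCaseIds.add(attempt.caseId);
    }
  }

  const attemptCount = attempts.length;
  return {
    ...meta,
    caseCount: caseIds.size,
    attemptCount,
    passedAttempts,
    passRate: attemptCount === 0 ? 0 : passedAttempts / attemptCount,
    averageDurationMs: attemptCount === 0 ? 0 : totalDurationMs / attemptCount,
    failedCaseIds: Array.from(failedCaseIds),
  };
}

// One compact JSON object per line, as expected by a JSONL file.
function toJsonLine(line: HistoryLine): string {
  return JSON.stringify(line) + "\n";
}
```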
Example:
- summary: `ai_evals/results/2026-04-09T09-40-33.051Z__flow.json`
- artifacts: `ai_evals/results/2026-04-09T09-40-33.051Z__flow/`
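The example paths above are consistent with an ISO timestamp whose colons are replaced by dashes, followed by `__<mode>`. A minimal sketch of that naming, assuming this is how the names are derived (`resultPaths` is a hypothetical helper, not the runner's actual code):

```typescript
// Hypothetical helper reproducing the naming pattern seen in the example paths.
function resultPaths(mode: string, createdAt: Date): { summary: string; artifacts: string } {
  // Colons are replaced so the timestamp is safe to use in file names.
  const stamp = createdAt.toISOString().replace(/:/g, "-");
  const base = `ai_evals/results/${stamp}__${mode}`;
  return { summary: `${base}.json`, artifacts: `${base}/` };
}
```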
Typical artifacts by mode:
- `flow`: `flow.json`
- `script`: `script.json` plus the generated script file
- `app`: `app.json` plus frontend/backend files
- `cli`: `assistant-output.txt` plus generated workspace files
## Layout
- `cases/`: one YAML file per mode
- `fixtures/`: initial and expected fixtures
- `core/`: shared loading, model resolution, validation, judging, and result writing
- `modes/`: one runner per mode
- `history/`: optional tracked pass-rate history written by `run --record`, one JSONL file per mode
- `results/`: local benchmark output and artifacts
## Notes
- Frontend modes reuse the production frontend chat code through the Vitest bridge.
- CLI mode creates an isolated workspace, writes the current checkout guidance into it, and benchmarks the real skills / `AGENTS.md` flow.
- Frontend progress streams live while the benchmark is running.
- Deterministic validators should stay focused on real correctness constraints, not one exact implementation shape.

@@ -0,0 +1,72 @@
import { describe, expect, it } from "bun:test";
import {
anthropicUsageToBenchmarkTokenUsage,
extractCliResultTokenUsage,
} from "./runtime";
describe("anthropicUsageToBenchmarkTokenUsage", () => {
it("includes cache tokens in prompt usage", () => {
expect(
anthropicUsageToBenchmarkTokenUsage({
input_tokens: 120,
output_tokens: 45,
cache_creation_input_tokens: 30,
cache_read_input_tokens: 5,
})
).toEqual({
prompt: 155,
completion: 45,
total: 200,
});
});
it("returns null when usage is absent", () => {
expect(anthropicUsageToBenchmarkTokenUsage(null)).toBeNull();
});
});
describe("extractCliResultTokenUsage", () => {
it("reads aggregate usage from the SDK result event", () => {
expect(
extractCliResultTokenUsage({
type: "result",
usage: {
input_tokens: 400,
output_tokens: 120,
cache_creation_input_tokens: 50,
cache_read_input_tokens: 25,
},
})
).toEqual({
prompt: 475,
completion: 120,
total: 595,
});
});
it("falls back to modelUsage when aggregate usage is unavailable", () => {
expect(
extractCliResultTokenUsage({
type: "result",
modelUsage: {
opus: {
inputTokens: 200,
outputTokens: 60,
cacheCreationInputTokens: 10,
cacheReadInputTokens: 5,
},
haiku: {
inputTokens: 80,
outputTokens: 20,
cacheCreationInputTokens: 0,
cacheReadInputTokens: 15,
},
},
})
).toEqual({
prompt: 310,
completion: 80,
total: 390,
});
});
});

@@ -0,0 +1,199 @@
import { query, type Options } from "@anthropic-ai/claude-agent-sdk";
import { join } from "path";
import { fileURLToPath } from "url";
import { getCliEvalModel, resolveEvalModel, type CliEvalModelConfig } from "../../core/models";
import type { BenchmarkTokenUsage } from "../../core/types";
export interface ToolInvocation {
tool: string;
input: Record<string, unknown>;
timestamp: number;
}
export interface PromptRunResult {
toolsUsed: ToolInvocation[];
skillsInvoked: string[];
output: string;
durationMs: number;
assistantMessageCount: number;
tokenUsage: BenchmarkTokenUsage | null;
}
interface AnthropicUsageLike {
input_tokens?: number | null;
output_tokens?: number | null;
cache_creation_input_tokens?: number | null;
cache_read_input_tokens?: number | null;
}
interface AnthropicModelUsageLike {
inputTokens?: number | null;
outputTokens?: number | null;
cacheCreationInputTokens?: number | null;
cacheReadInputTokens?: number | null;
}
interface CliResultMessageLike {
type?: string;
usage?: AnthropicUsageLike | null;
modelUsage?: Record<string, AnthropicModelUsageLike> | null;
}
const REPO_ROOT = fileURLToPath(new URL("../../../", import.meta.url));
export const DEFAULT_CLI_EVAL_MODEL: CliEvalModelConfig = getCliEvalModel(resolveEvalModel("cli"));
export function getGeneratedSkillsSource(): string {
return join(REPO_ROOT, "system_prompts", "auto-generated", "skills");
}
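/**
 * Converts Anthropic-style usage counters into the benchmark token usage shape.
 * Cache creation and cache read tokens are counted as prompt tokens.
 * Returns null when no usage object is available.
 */
export function anthropicUsageToBenchmarkTokenUsage(
  usage: AnthropicUsageLike | null | undefined
): BenchmarkTokenUsage | null {
  if (!usage) {
    return null;
  }
  const prompt =
    (usage.input_tokens ?? 0) +
    (usage.cache_creation_input_tokens ?? 0) +
    (usage.cache_read_input_tokens ?? 0);
  const completion = usage.output_tokens ?? 0;
  return {
    prompt,
    completion,
    total: prompt + completion,
  };
}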
/**
 * Extracts token usage from an SDK `result` message.
 * Prefers the aggregate `usage` field and falls back to summing the
 * per-model entries in `modelUsage` when the aggregate is unavailable.
 * Returns null for non-result messages or when no usage data is present.
 */
export function extractCliResultTokenUsage(message: unknown): BenchmarkTokenUsage | null {
  if (!message || typeof message !== "object") {
    return null;
  }
  const resultMessage = message as CliResultMessageLike;
  if (resultMessage.type !== "result") {
    return null;
  }
  const usage = anthropicUsageToBenchmarkTokenUsage(resultMessage.usage);
  if (usage) {
    return usage;
  }
  if (!resultMessage.modelUsage || typeof resultMessage.modelUsage !== "object") {
    return null;
  }
  let prompt = 0;
  let completion = 0;
  let sawModelUsage = false;
  for (const modelUsage of Object.values(resultMessage.modelUsage)) {
    if (!modelUsage || typeof modelUsage !== "object") {
      continue;
    }
    // Cache tokens count toward the prompt, matching the aggregate path.
    prompt +=
      (modelUsage.inputTokens ?? 0) +
      (modelUsage.cacheCreationInputTokens ?? 0) +
      (modelUsage.cacheReadInputTokens ?? 0);
    completion += modelUsage.outputTokens ?? 0;
    sawModelUsage = true;
  }
  if (!sawModelUsage) {
    return null;
  }
  return {
    prompt,
    completion,
    total: prompt + completion,
  };
}
/**
 * Runs one prompt through the agent SDK in `cwd` and captures tool calls,
 * invoked skills, assistant text, duration, and token usage.
 */
export async function runPromptAndCapture(
prompt: string,
cwd: string,
maxTurns: number = 3,
modelConfig: CliEvalModelConfig = DEFAULT_CLI_EVAL_MODEL
): Promise<PromptRunResult> {
const toolsUsed: ToolInvocation[] = [];
const skillsInvoked: string[] = [];
let output = "";
let assistantMessageCount = 0;
let tokenUsage: BenchmarkTokenUsage | null = null;
const startedAt = Date.now();
const options: Options = {
cwd,
model: modelConfig.model,
maxTurns,
settingSources: ["project"],
allowedTools: ["Skill", "Read", "Glob", "Grep", "Bash", "Write", "Edit"]
};
for await (const message of query({ prompt, options })) {
if (message.type === "assistant") {
assistantMessageCount += 1;
const content = message.message?.content;
if (Array.isArray(content)) {
for (const block of content) {
if (block.type === "tool_use") {
toolsUsed.push({
tool: block.name,
input: block.input as Record<string, unknown>,
timestamp: Date.now()
});
if (block.name === "Skill" && typeof block.input === "object" && block.input !== null) {
const skillInput = block.input as { skill?: string };
if (skillInput.skill) {
skillsInvoked.push(skillInput.skill);
}
}
} else if (block.type === "text") {
output += block.text;
}
}
}
} else if (message.type === "result") {
const resultMessage = message as { result?: string };
tokenUsage = extractCliResultTokenUsage(message) ?? tokenUsage;
if (typeof resultMessage.result === "string") {
output += resultMessage.result;
}
}
}
return {
toolsUsed,
skillsInvoked,
output,
durationMs: Date.now() - startedAt,
assistantMessageCount,
tokenUsage,
};
}
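/** Returns true when any invoked skill name equals or contains `skillName`. */
export function wasSkillInvoked(result: PromptRunResult, skillName: string): boolean {
  // `includes` also matches exact equality, so no separate `===` check is needed.
  return result.skillsInvoked.some((skill) => skill.includes(skillName));
}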
export function wasToolUsed(result: PromptRunResult, toolName: string): boolean {
return result.toolsUsed.some((tool) => tool.tool === toolName);
}
export function formatCliRunModelLabel(modelConfig: CliEvalModelConfig): string {
return `${modelConfig.provider}:${modelConfig.model}`;
}
export function getToolInputs(
result: PromptRunResult,
toolName: string
): Record<string, unknown>[] {
return result.toolsUsed
.filter((tool) => tool.tool === toolName)
.map((tool) => tool.input);
}

@@ -0,0 +1,87 @@
import { loadSelectedCases } from "../../core/cases";
import {
formatRunModelLabel,
getFrontendEvalModel,
resolveEvalModel,
} from "../../core/models";
import { buildRunResult } from "../../core/results";
import { runSuite } from "../../core/runSuite";
import type { BenchmarkRunResult, ModeRunner } from "../../core/types";
import { emitFrontendBenchmarkProgress } from "./progress";
import { createAppModeRunner } from "../../modes/app";
import { createFlowModeRunner } from "../../modes/flow";
import { createScriptModeRunner } from "../../modes/script";
import { DEFAULT_JUDGE_MODEL } from "../../core/judge";
export type FrontendBenchmarkMode = "flow" | "app" | "script";
export async function runFrontendBenchmarkFromEnv(): Promise<BenchmarkRunResult> {
const mode = parseMode(process.env.WMILL_FRONTEND_AI_EVAL_MODE);
const caseIds = parseOptionalJsonStringArray(process.env.WMILL_FRONTEND_AI_EVAL_CASE_IDS);
const runs = parsePositiveInteger(process.env.WMILL_FRONTEND_AI_EVAL_RUNS, "WMILL_FRONTEND_AI_EVAL_RUNS");
const emitProgress = process.env.WMILL_FRONTEND_AI_EVAL_PROGRESS === "1";
const verbose = process.env.WMILL_FRONTEND_AI_EVAL_VERBOSE === "1";
const model = resolveEvalModel(mode, process.env.WMILL_FRONTEND_AI_EVAL_MODEL);
const selectedCases = await loadSelectedCases(mode, caseIds);
const modeRunner = getModeRunner(mode, getFrontendEvalModel(model));
const runModel = formatRunModelLabel(mode, model);
const caseResults = await runSuite({
modeRunner,
cases: selectedCases,
runs,
runModel,
judgeModel: DEFAULT_JUDGE_MODEL,
concurrency: verbose ? 1 : undefined,
verbose,
onProgress: emitProgress ? (event) => emitFrontendBenchmarkProgress(event) : undefined,
});
return buildRunResult({
mode,
runs,
runModel,
judgeModel: DEFAULT_JUDGE_MODEL,
caseResults,
});
}
function getModeRunner(
mode: FrontendBenchmarkMode,
model: ReturnType<typeof getFrontendEvalModel>
): ModeRunner<any, any, any> {
switch (mode) {
case "flow":
return createFlowModeRunner(model);
case "app":
return createAppModeRunner(model);
case "script":
return createScriptModeRunner(model);
}
}
function parseMode(value: string | undefined): FrontendBenchmarkMode {
if (value === "flow" || value === "app" || value === "script") {
return value;
}
throw new Error(`Unsupported frontend benchmark mode: ${String(value)}`);
}
function parseOptionalJsonStringArray(value: string | undefined): string[] {
if (!value) {
return [];
}
const parsed = JSON.parse(value) as unknown;
if (!Array.isArray(parsed) || parsed.some((entry) => typeof entry !== "string")) {
throw new Error("WMILL_FRONTEND_AI_EVAL_CASE_IDS must be a JSON string array");
}
return parsed;
}
function parsePositiveInteger(value: string | undefined, envName: string): number {
const parsed = Number(value);
if (!Number.isInteger(parsed) || parsed <= 0) {
throw new Error(`${envName} must be a positive integer`);
}
return parsed;
}

@@ -0,0 +1,92 @@
import { mkdtemp } from 'fs/promises'
import { tmpdir } from 'os'
import { join } from 'path'
import type {
AppFiles,
BackendRunnable,
AppAIChatHelpers
} from '../../../../../frontend/src/lib/components/copilot/chat/app/core'
import {
getAppTools,
prepareAppSystemMessage,
prepareAppUserMessage
} from '../../../../../frontend/src/lib/components/copilot/chat/app/core'
import type { Tool as ProductionTool } from '../../../../../frontend/src/lib/components/copilot/chat/shared'
import { createAppFileHelpers } from './fileHelpers'
import { runEval } from '../shared'
import type { AIProvider } from '$lib/gen/types.gen'
import type { ModeRunContext } from '../../../../core/types'
import type { TokenUsage } from '../shared/types'
export interface AppEvalResult {
success: boolean
files: AppFiles
error?: string
assistantMessageCount: number
toolCallCount: number
toolsUsed: string[]
tokenUsage: TokenUsage
}
export interface AppEvalOptions {
initialFrontend?: Record<string, string>
initialBackend?: Record<string, BackendRunnable>
model?: string
maxIterations?: number
provider?: AIProvider
workspaceRoot?: string
runContext?: ModeRunContext
}
export async function runAppEval(
userPrompt: string,
apiKey: string,
options?: AppEvalOptions
): Promise<AppEvalResult> {
const workspaceRoot =
options?.workspaceRoot ??
(await mkdtemp(join(tmpdir(), 'wmill-frontend-app-benchmark-')))
const { helpers, getFiles, cleanup } = await createAppFileHelpers(
options?.initialFrontend ?? {},
options?.initialBackend ?? {},
workspaceRoot
)
try {
const systemMessage = prepareAppSystemMessage()
const tools = getAppTools() as ProductionTool<AppAIChatHelpers>[]
const model = options?.model ?? 'claude-haiku-4-5-20251001'
const userMessage = prepareAppUserMessage(userPrompt, helpers.getSelectedContext())
const rawResult = await runEval({
userPrompt,
systemMessage,
userMessage,
tools,
helpers,
apiKey,
getOutput: getFiles,
onAssistantMessageStart: options?.runContext?.onAssistantMessageStart,
onAssistantToken: options?.runContext?.onAssistantChunk,
onAssistantMessageEnd: options?.runContext?.onAssistantMessageEnd,
options: {
maxIterations: options?.maxIterations,
model,
workspace: workspaceRoot,
provider: options?.provider
}
})
return {
files: rawResult.output,
success: rawResult.success,
error: rawResult.error,
assistantMessageCount: rawResult.iterations,
toolCallCount: rawResult.toolCallsCount,
toolsUsed: rawResult.toolsCalled,
tokenUsage: rawResult.tokenUsage
}
} finally {
await cleanup()
}
}

@@ -1,4 +1,8 @@
import type { AppFiles, BackendRunnable, InlineScript } from '../../app/core'
import type {
AppFiles,
BackendRunnable,
InlineScript
} from '../../../../../frontend/src/lib/components/copilot/chat/app/core'
/**
* Backend runnable metadata stored in meta.json files.

@@ -0,0 +1,255 @@
import { mkdir, rm, writeFile } from 'fs/promises'
import { dirname, join } from 'path'
import type {
AppAIChatHelpers,
AppFiles,
BackendRunnable,
DataTableSchema,
LintResult,
SelectedContext
} from '../../../../../frontend/src/lib/components/copilot/chat/app/core'
function createEmptyLintResult(): LintResult {
return {
errorCount: 0,
warningCount: 0,
errors: { frontend: {}, backend: {} },
warnings: { frontend: {}, backend: {} }
}
}
async function writeFrontendFile(
workspaceRoot: string | undefined,
path: string,
content: string
): Promise<void> {
if (!workspaceRoot) {
return
}
const relativePath = path.startsWith('/') ? path.slice(1) : path
const fullPath = join(workspaceRoot, 'frontend', relativePath)
await mkdir(dirname(fullPath), { recursive: true })
await writeFile(fullPath, content, 'utf8')
}
async function removeFrontendFile(workspaceRoot: string | undefined, path: string): Promise<void> {
if (!workspaceRoot) {
return
}
const relativePath = path.startsWith('/') ? path.slice(1) : path
await rm(join(workspaceRoot, 'frontend', relativePath), { force: true })
}
async function writeBackendRunnable(
workspaceRoot: string | undefined,
key: string,
runnable: BackendRunnable
): Promise<void> {
if (!workspaceRoot) {
return
}
const runnableDir = join(workspaceRoot, 'backend', key)
await mkdir(runnableDir, { recursive: true })
const meta: { name: string; language?: string; type?: string; path?: string } = {
name: runnable.name
}
if (runnable.type === 'inline' && runnable.inlineScript) {
meta.language = runnable.inlineScript.language
const extension = runnable.inlineScript.language === 'python3' ? 'py' : 'ts'
await writeFile(
join(runnableDir, `main.${extension}`),
runnable.inlineScript.content,
'utf8'
)
} else {
meta.type = runnable.type
if (runnable.path) {
meta.path = runnable.path
}
}
await writeFile(join(runnableDir, 'meta.json'), JSON.stringify(meta, null, 2) + '\n', 'utf8')
}
async function removeBackendRunnable(workspaceRoot: string | undefined, key: string): Promise<void> {
if (!workspaceRoot) {
return
}
await rm(join(workspaceRoot, 'backend', key), { recursive: true, force: true })
}
async function persistDatatables(
workspaceRoot: string | undefined,
datatables: DataTableSchema[]
): Promise<void> {
if (!workspaceRoot) {
return
}
await writeFile(
join(workspaceRoot, 'datatables.json'),
JSON.stringify(datatables, null, 2) + '\n',
'utf8'
)
}
/**
 * Builds in-memory `AppAIChatHelpers` for benchmark runs.
 * When `workspaceRoot` is provided, frontend files, backend runnables, and
 * datatables are mirrored to disk under it. Lint helpers always return empty
 * results and `execDatatableSql` is mocked.
 */
export async function createAppFileHelpers(
initialFrontend: Record<string, string> = {},
initialBackend: Record<string, BackendRunnable> = {},
workspaceRoot?: string
): Promise<{
helpers: AppAIChatHelpers
getFiles: () => AppFiles
getFrontend: () => Record<string, string>
getBackend: () => Record<string, BackendRunnable>
cleanup: () => Promise<void>
workspaceDir: string | null
}> {
let frontend = { ...initialFrontend }
let backend = { ...initialBackend }
let snapshotId = 0
const snapshots = new Map<
number,
{ frontend: Record<string, string>; backend: Record<string, BackendRunnable> }
>()
const datatables: DataTableSchema[] = []
for (const [path, content] of Object.entries(frontend)) {
await writeFrontendFile(workspaceRoot, path, content)
}
for (const [key, runnable] of Object.entries(backend)) {
await writeBackendRunnable(workspaceRoot, key, runnable)
}
await persistDatatables(workspaceRoot, datatables)
const helpers: AppAIChatHelpers = {
listFrontendFiles: () => Object.keys(frontend),
getFrontendFile: (path: string) => frontend[path],
getFrontendFiles: () => ({ ...frontend }),
setFrontendFile: (path: string, content: string) => {
frontend[path] = content
void writeFrontendFile(workspaceRoot, path, content)
return createEmptyLintResult()
},
deleteFrontendFile: (path: string) => {
delete frontend[path]
void removeFrontendFile(workspaceRoot, path)
},
listBackendRunnables: () =>
Object.entries(backend).map(([key, runnable]) => ({
key,
name: runnable.name
})),
getBackendRunnable: (key: string) => backend[key],
getBackendRunnables: () => ({ ...backend }),
setBackendRunnable: async (key: string, runnable: BackendRunnable) => {
backend[key] = runnable
await writeBackendRunnable(workspaceRoot, key, runnable)
return createEmptyLintResult()
},
deleteBackendRunnable: (key: string) => {
delete backend[key]
void removeBackendRunnable(workspaceRoot, key)
},
getFiles: (): AppFiles => ({
frontend: { ...frontend },
backend: { ...backend }
}),
getSelectedContext: (): SelectedContext => ({ type: 'none' }),
snapshot: () => {
const id = ++snapshotId
snapshots.set(id, {
frontend: { ...frontend },
backend: { ...backend }
})
return id
},
revertToSnapshot: (id: number) => {
const snapshot = snapshots.get(id)
if (!snapshot) {
return
}
frontend = { ...snapshot.frontend }
backend = { ...snapshot.backend }
void syncWorkspace()
},
lint: () => createEmptyLintResult(),
getDatatables: async () => structuredClone(datatables),
getAvailableDatatableNames: () => datatables.map((datatable) => datatable.datatable_name),
execDatatableSql: async (
datatableName: string,
sql: string,
newTable?: { schema: string; name: string }
) => {
if (newTable) {
datatables.push({
datatable_name: datatableName,
schemas: {
[newTable.schema]: {
[newTable.name]: {}
}
}
})
await persistDatatables(workspaceRoot, datatables)
}
return {
success: true,
result: [
{
datatableName,
sql
}
]
}
},
addTableToWhitelist: (datatableName: string, schemaName: string, tableName: string) => {
const existing = datatables.find((entry) => entry.datatable_name === datatableName)
if (existing) {
existing.schemas[schemaName] ??= {}
existing.schemas[schemaName][tableName] ??= {}
} else {
datatables.push({
datatable_name: datatableName,
schemas: {
[schemaName]: {
[tableName]: {}
}
}
})
}
void persistDatatables(workspaceRoot, datatables)
}
}
async function syncWorkspace(): Promise<void> {
if (!workspaceRoot) {
return
}
await rm(join(workspaceRoot, 'frontend'), { recursive: true, force: true })
await rm(join(workspaceRoot, 'backend'), { recursive: true, force: true })
for (const [path, content] of Object.entries(frontend)) {
await writeFrontendFile(workspaceRoot, path, content)
}
for (const [key, runnable] of Object.entries(backend)) {
await writeBackendRunnable(workspaceRoot, key, runnable)
}
await persistDatatables(workspaceRoot, datatables)
}
return {
helpers,
getFiles: () => ({
frontend: { ...frontend },
backend: { ...backend }
}),
getFrontend: () => ({ ...frontend }),
getBackend: () => ({ ...backend }),
cleanup: async () => {
if (workspaceRoot) {
await rm(workspaceRoot, { recursive: true, force: true })
}
},
workspaceDir: workspaceRoot ?? null
}
}

@@ -0,0 +1,161 @@
import { mkdir, rm, writeFile } from 'fs/promises'
import { dirname, join } from 'path'
import type { FlowModule, InputTransform } from '../../../../../frontend/src/lib/gen'
import type { ExtendedOpenFlow } from '../../../../../frontend/src/lib/components/flows/types'
import type { FlowAIChatHelpers } from '../../../../../frontend/src/lib/components/copilot/chat/flow/core'
import type { ScriptLintResult } from '../../../../../frontend/src/lib/components/copilot/chat/shared'
import { findModuleById } from '../../../../../frontend/src/lib/components/copilot/chat/shared'
import {
createInlineScriptSession
} from '../../../../../frontend/src/lib/components/copilot/chat/flow/inlineScriptsUtils'
import {
registerBenchmarkWorkspace,
registerBenchmarkWorkspaceRunnables,
unregisterBenchmarkWorkspaceRunnables,
createBenchmarkCompletedJob,
type BenchmarkWorkspaceFlow,
type BenchmarkWorkspaceScript
} from '../../mockBackend'
const EMPTY_SCRIPT_LINT_RESULT: ScriptLintResult = {
errorCount: 0,
warningCount: 0,
errors: [],
warnings: []
}
export interface FlowWorkspaceFixtures {
scripts?: BenchmarkWorkspaceScript[]
flows?: BenchmarkWorkspaceFlow[]
}
/**
 * Builds in-memory `FlowAIChatHelpers` for benchmark runs.
 * When `workspaceRoot` is provided, the flow is persisted to `flow.json` under
 * it and the workspace (plus any fixture scripts and flows) is registered with
 * the mock backend. `testFlow` returns a mocked completed job.
 */
export async function createFlowFileHelpers(
initialModules: FlowModule[] = [],
initialSchema?: Record<string, any>,
workspaceRoot?: string,
workspaceFixtures?: FlowWorkspaceFixtures
): Promise<{
helpers: FlowAIChatHelpers
getFlow: () => ExtendedOpenFlow
getModules: () => FlowModule[]
cleanup: () => Promise<void>
workspaceDir: string | null
}> {
let flow: ExtendedOpenFlow = {
value: { modules: structuredClone(initialModules) },
summary: '',
schema: initialSchema ?? {
$schema: 'https://json-schema.org/draft/2020-12/schema',
properties: {},
required: [],
type: 'object'
}
}
const inlineScriptSession = createInlineScriptSession()
const flowFilePath = workspaceRoot ? join(workspaceRoot, 'flow.json') : null
async function persistFlow(): Promise<void> {
if (!flowFilePath) {
return
}
await mkdir(dirname(flowFilePath), { recursive: true })
await writeFile(flowFilePath, JSON.stringify(flow, null, 2) + '\n', 'utf8')
}
await persistFlow()
if (workspaceRoot) {
registerBenchmarkWorkspace(workspaceRoot)
if (workspaceFixtures) {
registerBenchmarkWorkspaceRunnables(workspaceRoot, workspaceFixtures)
}
}
const helpers: FlowAIChatHelpers = {
getFlowAndSelectedId: () => ({ flow, selectedId: '' }),
getModules: (id?: string) => {
if (!id) return flow.value.modules
const module = findModuleById(flow.value.modules, id)
return module ? [module] : []
},
inlineScriptSession,
setSnapshot: () => {},
revertToSnapshot: () => {},
setCode: async (id: string, code: string) => {
const module = findModuleById(flow.value.modules, id)
if (module && module.value.type === 'rawscript') {
module.value.content = code
}
inlineScriptSession.set(id, code)
await persistFlow()
},
setFlowJson: async (
modules: FlowModule[] | undefined,
schema: Record<string, any> | undefined
) => {
if (modules) {
flow.value.modules = inlineScriptSession.restoreInlineScriptReferences(modules)
const unresolvedRefs = inlineScriptSession.findUnresolvedInlineScriptRefs(flow.value.modules)
if (unresolvedRefs.length > 0) {
throw new Error(
`Unresolved inline script references: ${unresolvedRefs.join(', ')}`
)
}
}
if (schema !== undefined) {
flow.schema = schema
}
await persistFlow()
},
getFlowInputsSchema: async () => flow.schema ?? {},
updateExprsToSet: (_id: string, _inputTransforms: Record<string, InputTransform>) => {},
acceptAllModuleActions: () => {},
rejectAllModuleActions: () => {},
hasPendingChanges: () => false,
selectStep: (_id: string) => {},
testFlow: async (args?: Record<string, any>) => {
if (workspaceRoot) {
const runPath = join(workspaceRoot, 'test-run.json')
await writeFile(
runPath,
JSON.stringify(
{
requestedArgs: args ?? {},
modules: flow.value.modules.map((module) => module.id)
},
null,
2
) + '\n',
'utf8'
)
}
return createBenchmarkCompletedJob({
workspace: workspaceRoot ?? 'benchmark',
jobKind: 'flowpreview',
result: {
requestedArgs: args ?? {},
modules: flow.value.modules.map((module) => module.id),
mocked: true
},
logs: 'Mock benchmark flow test run completed successfully.'
})
},
getLintErrors: async () => EMPTY_SCRIPT_LINT_RESULT
}
return {
helpers,
getFlow: () => flow,
getModules: () => flow.value.modules,
cleanup: async () => {
if (workspaceRoot) {
unregisterBenchmarkWorkspaceRunnables(workspaceRoot)
await rm(workspaceRoot, { recursive: true, force: true })
}
},
workspaceDir: workspaceRoot ?? null
}
}

@@ -0,0 +1,103 @@
import { mkdtemp } from 'fs/promises'
import { tmpdir } from 'os'
import { join } from 'path'
import type { FlowModule } from '$lib/gen'
import type { AIProvider } from '$lib/gen/types.gen'
import type { ExtendedOpenFlow } from '$lib/components/flows/types'
import {
flowTools,
prepareFlowSystemMessage,
prepareFlowUserMessage,
type FlowAIChatHelpers
} from '../../../../../frontend/src/lib/components/copilot/chat/flow/core'
import type { Tool as ProductionTool } from '../../../../../frontend/src/lib/components/copilot/chat/shared'
import { createFlowFileHelpers, type FlowWorkspaceFixtures } from './fileHelpers'
import { runEval } from '../shared'
import type { ModeRunContext } from '../../../../core/types'
import type { TokenUsage } from '../shared/types'
export interface FlowFixture {
value?: {
modules?: FlowModule[]
}
schema?: Record<string, unknown>
}
export interface FlowEvalResult {
success: boolean
flow: ExtendedOpenFlow
error?: string
assistantMessageCount: number
toolCallCount: number
toolsUsed: string[]
tokenUsage: TokenUsage
}
export interface FlowEvalOptions {
initialFlow?: FlowFixture
workspaceFixtures?: FlowWorkspaceFixtures
model?: string
maxIterations?: number
provider?: AIProvider
workspaceRoot?: string
runContext?: ModeRunContext
}
export async function runFlowEval(
userPrompt: string,
apiKey: string,
options?: FlowEvalOptions
): Promise<FlowEvalResult> {
const workspaceRoot =
options?.workspaceRoot ??
(await mkdtemp(join(tmpdir(), 'wmill-frontend-flow-benchmark-')))
const { helpers, getFlow, cleanup } = await createFlowFileHelpers(
options?.initialFlow?.value?.modules ?? [],
options?.initialFlow?.schema,
workspaceRoot,
options?.workspaceFixtures
)
try {
const systemMessage = prepareFlowSystemMessage()
const tools = flowTools as ProductionTool<FlowAIChatHelpers>[]
const model = options?.model ?? 'claude-haiku-4-5-20251001'
const userMessage = prepareFlowUserMessage(
userPrompt,
helpers.getFlowAndSelectedId(),
[],
helpers.inlineScriptSession
)
const rawResult = await runEval({
userPrompt,
systemMessage,
userMessage,
tools,
helpers,
apiKey,
getOutput: getFlow,
onAssistantMessageStart: options?.runContext?.onAssistantMessageStart,
onAssistantToken: options?.runContext?.onAssistantChunk,
onAssistantMessageEnd: options?.runContext?.onAssistantMessageEnd,
options: {
maxIterations: options?.maxIterations,
model,
workspace: workspaceRoot,
provider: options?.provider
}
})
return {
flow: rawResult.output,
success: rawResult.success,
error: rawResult.error,
assistantMessageCount: rawResult.iterations,
toolCallCount: rawResult.toolCallsCount,
toolsUsed: rawResult.toolsCalled,
tokenUsage: rawResult.tokenUsage
}
} finally {
await cleanup()
}
}
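
The `try`/`finally` shape of `runFlowEval` appears to be what guarantees the temporary workspace is cleaned up even when the evaluation throws. A minimal synchronous sketch of that pattern follows. `Workspace`, `withWorkspace`, and `onCleanup` are hypothetical names used only for illustration, and the real function awaits an asynchronous `cleanup()` instead.

```typescript
// Hypothetical stand-in for a benchmark workspace; not part of the diff.
interface Workspace {
  root: string
}

// Runs `body` against a workspace and always invokes `onCleanup` afterwards,
// whether `body` returns normally or throws.
function withWorkspace<T>(
  root: string,
  body: (workspace: Workspace) => T,
  onCleanup: (workspace: Workspace) => void
): T {
  const workspace: Workspace = { root }
  try {
    return body(workspace)
  } finally {
    onCleanup(workspace)
  }
}

// Usage: record which workspaces were cleaned up.
const cleaned: string[] = []
const rootLength = withWorkspace(
  '/tmp/demo-a',
  (workspace) => workspace.root.length,
  (workspace) => {
    cleaned.push(workspace.root)
  }
)
```

If `body` throws, the error still propagates to the caller, but only after `onCleanup` has run.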

@@ -0,0 +1,73 @@
import { mkdir, rm, writeFile } from 'fs/promises'
import { dirname, join } from 'path'
import type { ScriptLang } from '../../../../../frontend/src/lib/gen/types.gen'
import type { ReviewChangesOpts } from '../../../../../frontend/src/lib/components/copilot/chat/monaco-adapter'
import type { ScriptChatHelpers } from '../../../../../frontend/src/lib/components/copilot/chat/script/core'
import { buildScriptLintResult } from './preview'
import { registerBenchmarkWorkspace, unregisterBenchmarkWorkspace } from '../../mockBackend'
export interface ScriptEvalState {
code: string
lang: ScriptLang | 'bunnative'
path: string
args: Record<string, any>
}
export async function createScriptFileHelpers(
initialScript: ScriptEvalState,
workspaceRoot?: string
): Promise<{
helpers: ScriptChatHelpers
getScript: () => ScriptEvalState
cleanup: () => Promise<void>
workspaceDir: string | null
}> {
let script = structuredClone(initialScript)
const scriptFilePath = workspaceRoot ? join(workspaceRoot, script.path) : null
async function persistScript(): Promise<void> {
if (!scriptFilePath) {
return
}
await mkdir(dirname(scriptFilePath), { recursive: true })
await writeFile(scriptFilePath, script.code, 'utf8')
}
await persistScript()
if (workspaceRoot) {
registerBenchmarkWorkspace(workspaceRoot)
}
const helpers: ScriptChatHelpers = {
getScriptOptions: () => ({
code: script.code,
lang: script.lang,
path: script.path,
args: structuredClone(script.args)
}),
applyCode: async (code: string, opts?: ReviewChangesOpts) => {
if (opts?.mode === 'revert') {
return
}
script = {
...script,
code
}
await persistScript()
},
getLintErrors: () => buildScriptLintResult(script.code, script.lang)
}
return {
helpers,
getScript: () => structuredClone(script),
cleanup: async () => {
if (workspaceRoot) {
unregisterBenchmarkWorkspace(workspaceRoot)
await rm(workspaceRoot, { recursive: true, force: true })
}
},
workspaceDir: workspaceRoot ?? null
}
}
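
`createScriptFileHelpers` clones the script state on the way in and on the way out, so callers never hold a reference to the helper's private copy. The sketch below isolates that idea under simplifying assumptions: `DemoScriptState` and `createDemoScriptStore` are hypothetical, and file persistence is left out.

```typescript
// Hypothetical, trimmed-down version of the script state in the diff.
interface DemoScriptState {
  code: string
  args: Record<string, unknown>
}

// Keeps a private copy of the state and only ever hands out deep clones,
// so callers cannot mutate the store by mutating a snapshot.
function createDemoScriptStore(initial: DemoScriptState) {
  let current = structuredClone(initial)
  return {
    getScript: (): DemoScriptState => structuredClone(current),
    applyCode: (code: string): void => {
      current = { ...current, code }
    }
  }
}

const initialScript: DemoScriptState = {
  code: 'export function main() {}',
  args: { retries: 1 }
}
const scriptStore = createDemoScriptStore(initialScript)

const snapshot = scriptStore.getScript()
snapshot.args.retries = 99 // mutates only the snapshot
initialScript.code = 'changed' // mutates only the caller's object
scriptStore.applyCode('export function main() { return 1 }')
```

Neither mutation reaches the store; only `applyCode` changes what later `getScript()` calls return.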

@@ -0,0 +1,96 @@
import ts from 'typescript'
import type { ScriptLang } from '../../../../../frontend/src/lib/gen/types.gen'
import type { ScriptLintResult } from '../../../../../frontend/src/lib/components/copilot/chat/shared'
export type ScriptPreviewLanguage = ScriptLang | 'bunnative'
const TS_LIKE_LANGUAGES = new Set<ScriptPreviewLanguage>(['bun', 'deno', 'nativets', 'bunnative'])
const JS_LIKE_LANGUAGES = new Set<ScriptPreviewLanguage>(['bun', 'deno', 'nativets', 'bunnative'])
function hasSupportedEntrypoint(code: string): boolean {
return (
/export\s+(async\s+)?function\s+main\s*\(/.test(code) ||
/export\s+(async\s+)?function\s+preprocessor\s*\(/.test(code)
)
}
function compilerOptionsForLanguage(lang: ScriptPreviewLanguage): ts.CompilerOptions | null {
if (!TS_LIKE_LANGUAGES.has(lang)) {
return null
}
return {
target: ts.ScriptTarget.ES2022,
module: ts.ModuleKind.ESNext,
moduleResolution: ts.ModuleResolutionKind.Bundler,
noEmit: true,
allowJs: true,
checkJs: false,
strict: false,
skipLibCheck: true
}
}
function getLineAndColumn(sourceText: string, start: number): { line: number; column: number } {
const prefix = sourceText.slice(0, Math.max(0, start))
const line = prefix.split('\n').length
const lastNewline = prefix.lastIndexOf('\n')
const column = lastNewline === -1 ? prefix.length + 1 : prefix.length - lastNewline
return { line, column }
}
export function buildScriptLintResult(
code: string,
lang: ScriptPreviewLanguage
): ScriptLintResult {
const diagnostics: ScriptLintResult['errors'] = []
const compilerOptions = compilerOptionsForLanguage(lang)
if (compilerOptions) {
const sourceFile = ts.createSourceFile(
'script.ts',
code,
ts.ScriptTarget.ES2022,
true,
JS_LIKE_LANGUAGES.has(lang) ? ts.ScriptKind.TS : ts.ScriptKind.JS
)
const output = ts.transpileModule(code, {
compilerOptions,
fileName: sourceFile.fileName,
reportDiagnostics: true
})
for (const diagnostic of output.diagnostics ?? []) {
const start = diagnostic.start ?? 0
const length = diagnostic.length ?? 1
const { line, column } = getLineAndColumn(code, start)
const message = ts.flattenDiagnosticMessageText(diagnostic.messageText, '\n')
diagnostics.push({
startLineNumber: line,
startColumn: column,
endLineNumber: line,
endColumn: column + Math.max(1, length),
message,
severity: 8
} as ScriptLintResult['errors'][number])
}
}
if (!hasSupportedEntrypoint(code)) {
diagnostics.push({
startLineNumber: 1,
startColumn: 1,
endLineNumber: 1,
endColumn: 1,
message: 'Script must export a main or preprocessor function.',
severity: 8
} as ScriptLintResult['errors'][number])
}
return {
errorCount: diagnostics.length,
warningCount: 0,
errors: diagnostics,
warnings: []
}
}
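
The offset arithmetic in `getLineAndColumn` is easy to misread, so here is a small worked example. It is a sketch rather than the file's code: `offsetToLineAndColumn` is a hypothetical name, and it folds the ternary into a single expression by relying on `lastIndexOf` returning `-1` on the first line.

```typescript
// Converts a 0-based character offset into a 1-based line and column,
// matching the startLineNumber/startColumn fields used in the diff.
function offsetToLineAndColumn(
  text: string,
  offset: number
): { line: number; column: number } {
  const prefix = text.slice(0, Math.max(0, offset))
  const line = prefix.split('\n').length
  // lastIndexOf returns -1 on the first line, which makes the column
  // prefix.length + 1 without needing a separate branch.
  const column = prefix.length - prefix.lastIndexOf('\n')
  return { line, column }
}

const sample = 'const a = 1\nconst b = ;'
const position = offsetToLineAndColumn(sample, sample.indexOf(';'))
// position is { line: 2, column: 11 }
```

The `;` sits at offset 22. The newline is at index 11, so the column is 22 - 11 = 11 on line 2.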

@@ -0,0 +1,109 @@
import { mkdtemp } from 'fs/promises'
import { tmpdir } from 'os'
import { join } from 'path'
import type { AIProvider, AIProviderModel, ScriptLang } from '$lib/gen/types.gen'
import type { ContextElement } from '../../../../../frontend/src/lib/components/copilot/chat/context'
import {
prepareScriptSystemMessage,
prepareScriptTools,
prepareScriptUserMessage,
type ScriptChatHelpers
} from '../../../../../frontend/src/lib/components/copilot/chat/script/core'
import type { Tool as ProductionTool } from '../../../../../frontend/src/lib/components/copilot/chat/shared'
import { createScriptFileHelpers, type ScriptEvalState } from './fileHelpers'
import { runEval } from '../shared'
import type { ModeRunContext } from '../../../../core/types'
import type { TokenUsage } from '../shared/types'
export interface ScriptEvalResult {
success: boolean
script: ScriptEvalState
error?: string
assistantMessageCount: number
toolCallCount: number
toolsUsed: string[]
tokenUsage: TokenUsage
}
export interface ScriptEvalOptions {
initialScript: ScriptEvalState
model?: string
maxIterations?: number
provider?: AIProvider
workspaceRoot?: string
runContext?: ModeRunContext
}
function resolveModelProvider(
model: string,
provider?: AIProvider
): AIProviderModel {
if (provider) {
return { provider, model }
}
if (model.startsWith('claude')) {
return { provider: 'anthropic', model }
}
return { provider: 'openai', model }
}
export async function runScriptEval(
userPrompt: string,
apiKey: string,
options: ScriptEvalOptions
): Promise<ScriptEvalResult> {
const workspaceRoot =
options.workspaceRoot ?? (await mkdtemp(join(tmpdir(), 'wmill-frontend-script-benchmark-')))
const { helpers, getScript, cleanup } = await createScriptFileHelpers(
options.initialScript,
workspaceRoot
)
try {
const model = options.model ?? 'claude-haiku-4-5-20251001'
const modelProvider = resolveModelProvider(model, options.provider)
const selectedContext: ContextElement[] = []
const systemMessage = prepareScriptSystemMessage(
modelProvider,
options.initialScript.lang,
{}
)
const tools = prepareScriptTools(
modelProvider,
options.initialScript.lang,
selectedContext
) as ProductionTool<ScriptChatHelpers>[]
const userMessage = prepareScriptUserMessage(userPrompt, selectedContext)
const rawResult = await runEval({
userPrompt,
systemMessage,
userMessage,
tools,
helpers,
apiKey,
getOutput: getScript,
onAssistantMessageStart: options.runContext?.onAssistantMessageStart,
onAssistantToken: options.runContext?.onAssistantChunk,
onAssistantMessageEnd: options.runContext?.onAssistantMessageEnd,
options: {
maxIterations: options.maxIterations,
model,
workspace: workspaceRoot,
provider: modelProvider.provider
}
})
return {
script: rawResult.output,
success: rawResult.success,
error: rawResult.error,
assistantMessageCount: rawResult.iterations,
toolCallCount: rawResult.toolCallsCount,
toolsUsed: rawResult.toolsCalled,
tokenUsage: rawResult.tokenUsage
}
} finally {
await cleanup()
}
}

@@ -1,29 +1,19 @@
import OpenAI from 'openai'
import Anthropic from '@anthropic-ai/sdk'
import type {
ChatCompletionMessageParam,
ChatCompletionSystemMessageParam
} from 'openai/resources/chat/completions.mjs'
import type { AIProvider, AIProviderModel } from '$lib/gen/types.gen'
import type { TokenUsage, ToolCallDetail, EvalRunnerOptions } from './types'
import type { Tool } from './baseVariants'
import { runChatLoop, type ChatClients } from '../../chatLoop'
import type { Tool as ProductionTool, ToolCallbacks } from '../../shared'
/**
* Result from a single eval run (before domain-specific evaluation).
*/
export interface RawEvalResult<TOutput> {
success: boolean
output: TOutput
error?: string
tokenUsage: TokenUsage
toolCallsCount: number
toolsCalled: string[]
toolCallDetails: ToolCallDetail[]
iterations: number
messages: ChatCompletionMessageParam[]
}
import type { AIProviderModel } from '$lib/gen/types.gen'
import type { TokenUsage, ToolCallDetail, EvalRunnerOptions, RawEvalResult } from './types'
import { runChatLoop, type ChatClients } from '../../../../../frontend/src/lib/components/copilot/chat/chatLoop'
import type {
Tool as ProductionTool,
ToolCallbacks
} from '../../../../../frontend/src/lib/components/copilot/chat/shared'
import {
createEvalClients,
type FrontendEvalProvider,
resolveEvalModelProvider
} from './providerConfig'
/**
* Parameters for running a base evaluation.
@@ -38,7 +28,7 @@ export interface RunEvalParams<THelpers, TOutput> {
/** Tool definitions for the LLM API (unused — derived from tools) */
toolDefs?: unknown
/** Full tool implementations for execution */
tools: Tool<THelpers>[]
tools: ProductionTool<THelpers>[]
/** Domain-specific helpers for tool execution */
helpers: THelpers
/** API key for the provider */
@@ -47,35 +37,9 @@ export interface RunEvalParams<THelpers, TOutput> {
getOutput: () => TOutput
/** Optional configuration */
options?: EvalRunnerOptions
}
/**
* Creates SDK clients for the given provider.
*/
function createEvalClients(provider: AIProvider, apiKey: string): ChatClients {
if (provider === 'anthropic') {
return {
openai: new OpenAI({ apiKey: 'unused' }),
anthropic: new Anthropic({ apiKey })
}
}
return {
openai: new OpenAI({ apiKey }),
anthropic: new Anthropic({ apiKey: 'unused' })
}
}
/**
* Resolves model string to AIProviderModel.
*/
function resolveModelProvider(
model: string,
provider?: AIProvider
): AIProviderModel {
if (provider) return { provider, model }
if (model.startsWith('claude')) return { provider: 'anthropic', model }
if (model.startsWith('gpt') || model.startsWith('o')) return { provider: 'openai', model }
return { provider: 'openai', model }
onAssistantMessageStart?: () => void
onAssistantToken?: (token: string) => void
onAssistantMessageEnd?: () => void
}
/**
@@ -92,16 +56,23 @@ export async function runEval<THelpers, TOutput>(
helpers,
apiKey,
getOutput,
options
options,
onAssistantMessageStart,
onAssistantToken,
onAssistantMessageEnd
} = params
let shouldEmitMessageStart = true
const model = options?.model ?? 'gpt-4o'
const maxIterations = options?.maxIterations ?? 20
const workspace = options?.workspace ?? 'test-workspace'
const provider = options?.provider
const modelProvider = resolveModelProvider(model, provider)
const clients = createEvalClients(modelProvider.provider, apiKey)
const modelProvider = resolveEvalModelProvider(
model,
provider as FrontendEvalProvider | undefined
) as AIProviderModel
const clients = createEvalClients(modelProvider.provider, apiKey) as ChatClients
const messages: ChatCompletionMessageParam[] = [userMessage]
let toolCallsCount = 0
@@ -128,7 +99,7 @@ export async function runEval<THelpers, TOutput>(
}
return tool.fn(p)
}
})) as ProductionTool<THelpers>[]
}))
// No-op callbacks for eval
const callbacks: ToolCallbacks & {
@@ -137,8 +108,19 @@ export async function runEval<THelpers, TOutput>(
} = {
setToolStatus: () => {},
removeToolStatus: () => {},
onNewToken: () => {},
onMessageEnd: () => {}
onNewToken: (token: string) => {
if (shouldEmitMessageStart) {
onAssistantMessageStart?.()
shouldEmitMessageStart = false
}
onAssistantToken?.(token)
},
onMessageEnd: () => {
if (!shouldEmitMessageStart) {
onAssistantMessageEnd?.()
}
shouldEmitMessageStart = true
}
}
const abortController = new AbortController()
@@ -161,7 +143,7 @@ export async function runEval<THelpers, TOutput>(
return {
success: true,
output: getOutput(),
tokenUsage: { prompt: 0, completion: 0, total: 0 },
tokenUsage: result.tokenUsage,
toolCallsCount,
toolsCalled,
toolCallDetails,
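
The new `onNewToken` and `onMessageEnd` callbacks derive message boundaries from a single boolean flag. The sketch below reproduces that small state machine in isolation so the event order is easier to see. `BoundaryEvent` and `createBoundaryTracker` are hypothetical names, not part of the runner.

```typescript
type BoundaryEvent =
  | { kind: 'start' }
  | { kind: 'token'; token: string }
  | { kind: 'end' }

function createBoundaryTracker(emit: (event: BoundaryEvent) => void) {
  let awaitingFirstToken = true
  return {
    onNewToken(token: string): void {
      // The first token of a message implicitly opens it.
      if (awaitingFirstToken) {
        emit({ kind: 'start' })
        awaitingFirstToken = false
      }
      emit({ kind: 'token', token })
    },
    onMessageEnd(): void {
      // Only close a message that actually produced tokens.
      if (!awaitingFirstToken) {
        emit({ kind: 'end' })
      }
      awaitingFirstToken = true
    }
  }
}

const trace: string[] = []
const tracker = createBoundaryTracker((boundaryEvent) => {
  trace.push(boundaryEvent.kind === 'token' ? boundaryEvent.token : boundaryEvent.kind)
})
tracker.onNewToken('Hel')
tracker.onNewToken('lo')
tracker.onMessageEnd()
tracker.onMessageEnd() // no tokens since the last end, so nothing is emitted
tracker.onNewToken('!')
tracker.onMessageEnd()
// trace is ['start', 'Hel', 'lo', 'end', 'start', '!', 'end']
```

One consequence of this design is that a turn with no streamed tokens, for example one containing only tool calls, produces neither a start nor an end event.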

@@ -0,0 +1,3 @@
export type { TokenUsage, ToolCallDetail, EvalRunnerOptions, RawEvalResult } from './types'
export type { RunEvalParams } from './baseEvalRunner'
export { runEval } from './baseEvalRunner'

@@ -0,0 +1,41 @@
import { describe, expect, it } from "bun:test";
import {
buildOpenAICompatibleClientOptions,
resolveEvalModelProvider,
} from "./providerConfig";
describe("buildOpenAICompatibleClientOptions", () => {
it("adds Gemini's OpenAI-compatible base URL and client header", () => {
const options = buildOpenAICompatibleClientOptions("googleai", "gemini-test-key");
expect(options).toMatchObject({
apiKey: "gemini-test-key",
baseURL: "https://generativelanguage.googleapis.com/v1beta/openai/",
defaultHeaders: {
"x-goog-api-client": "windmill-ai-evals/1.0",
},
});
});
it("keeps the default OpenAI-compatible config for OpenAI", () => {
expect(buildOpenAICompatibleClientOptions("openai", "openai-test-key")).toEqual({
apiKey: "openai-test-key",
});
});
});
describe("resolveEvalModelProvider", () => {
it("infers googleai from Gemini model ids", () => {
expect(resolveEvalModelProvider("gemini-2.5-flash")).toEqual({
provider: "googleai",
model: "gemini-2.5-flash",
});
});
it("preserves an explicit provider", () => {
expect(resolveEvalModelProvider("gemini-2.5-pro", "googleai")).toEqual({
provider: "googleai",
model: "gemini-2.5-pro",
});
});
});

@@ -0,0 +1,71 @@
import Anthropic from "@anthropic-ai/sdk";
import OpenAI from "openai";
import type { FrontendEvalModelConfig } from "../../../../core/models";
export type FrontendEvalProvider = FrontendEvalModelConfig["provider"];
export interface EvalClients {
openai: OpenAI;
anthropic: Anthropic;
}
export interface ResolvedEvalModelProvider {
provider: FrontendEvalProvider;
model: string;
}
const GEMINI_OPENAI_BASE_URL = "https://generativelanguage.googleapis.com/v1beta/openai/";
const GEMINI_GOOG_API_CLIENT = "windmill-ai-evals/1.0";
export function buildOpenAICompatibleClientOptions(
provider: Exclude<FrontendEvalProvider, "anthropic">,
apiKey: string
): ConstructorParameters<typeof OpenAI>[0] {
if (provider === "googleai") {
return {
apiKey,
baseURL: GEMINI_OPENAI_BASE_URL,
defaultHeaders: {
"x-goog-api-client": GEMINI_GOOG_API_CLIENT,
},
};
}
return { apiKey };
}
export function createEvalClients(
provider: FrontendEvalProvider,
apiKey: string
): EvalClients {
if (provider === "anthropic") {
return {
openai: new OpenAI({ apiKey: "unused" }),
anthropic: new Anthropic({ apiKey }),
};
}
return {
openai: new OpenAI(buildOpenAICompatibleClientOptions(provider, apiKey)),
anthropic: new Anthropic({ apiKey: "unused" }),
};
}
export function resolveEvalModelProvider(
model: string,
provider?: FrontendEvalProvider
): ResolvedEvalModelProvider {
if (provider) {
return { provider, model };
}
if (model.startsWith("claude")) {
return { provider: "anthropic", model };
}
if (model.startsWith("gemini")) {
return { provider: "googleai", model };
}
if (model.startsWith("gpt") || model.startsWith("o")) {
return { provider: "openai", model };
}
return { provider: "openai", model };
}
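
In `resolveEvalModelProvider` the `gpt`/`o` branch and the final fallback return the same value, so the function is effectively an ordered prefix lookup with an OpenAI default. A table-driven sketch of that reading is shown below. `DemoProvider`, `PROVIDER_PREFIXES`, and `resolveDemoProvider` are hypothetical, and the real `FrontendEvalProvider` union may contain other members.

```typescript
// Hypothetical provider union; the real FrontendEvalProvider type is defined elsewhere.
type DemoProvider = 'anthropic' | 'googleai' | 'openai'

// Ordered prefix table; the first matching prefix wins.
const PROVIDER_PREFIXES: ReadonlyArray<readonly [string, DemoProvider]> = [
  ['claude', 'anthropic'],
  ['gemini', 'googleai']
]

function resolveDemoProvider(
  model: string,
  explicit?: DemoProvider
): { provider: DemoProvider; model: string } {
  // An explicit provider always takes precedence over inference.
  if (explicit) {
    return { provider: explicit, model }
  }
  for (const [prefix, provider] of PROVIDER_PREFIXES) {
    if (model.startsWith(prefix)) {
      return { provider, model }
    }
  }
  // Everything else, including gpt-* and o* ids, falls back to OpenAI.
  return { provider: 'openai', model }
}

const inferred = resolveDemoProvider('gemini-2.5-flash')
const overridden = resolveDemoProvider('gpt-4o', 'anthropic')
```

Adding a provider then means adding one row to the table rather than another `if` branch.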

@@ -0,0 +1,32 @@
import type { ChatCompletionMessageParam } from 'openai/resources/chat/completions.mjs'
import type { AIProvider } from '$lib/gen/types.gen'
export interface TokenUsage {
prompt: number
completion: number
total: number
}
export interface ToolCallDetail {
name: string
arguments: Record<string, unknown>
}
export interface EvalRunnerOptions {
maxIterations?: number
model?: string
workspace?: string
provider?: AIProvider
}
export interface RawEvalResult<TOutput> {
success: boolean
output: TOutput
error?: string
tokenUsage: TokenUsage
toolCallsCount: number
toolsCalled: string[]
toolCallDetails: ToolCallDetail[]
iterations: number
messages: ChatCompletionMessageParam[]
}

@@ -0,0 +1,270 @@
import { randomUUID } from 'node:crypto'
import type { CompletedJob, Flow, Script } from '../../../frontend/src/lib/gen'
import type { ScriptLang } from '../../../frontend/src/lib/gen/types.gen'
import { buildScriptLintResult } from './core/script/preview'
const BENCHMARK_TIMESTAMP = '1970-01-01T00:00:00.000Z'
export interface BenchmarkWorkspaceScript {
path: string
summary: string
description?: string
language: Script['language']
schema?: Record<string, unknown>
content: string
}
export interface BenchmarkWorkspaceFlow {
path: string
summary: string
description?: string
schema?: Record<string, unknown>
value: Flow['value']
}
export interface BenchmarkWorkspaceRunnables {
scripts?: BenchmarkWorkspaceScript[]
flows?: BenchmarkWorkspaceFlow[]
}
type BenchmarkCompletedJob = CompletedJob & { type: 'CompletedJob' }
const benchmarkWorkspaces = new Set<string>()
const benchmarkWorkspaceRunnables = new Map<string, BenchmarkWorkspaceRunnables>()
const benchmarkJobs = new Map<string, { workspace: string; job: BenchmarkCompletedJob }>()
export function resetBenchmarkMockBackend(): void {
benchmarkWorkspaces.clear()
benchmarkWorkspaceRunnables.clear()
benchmarkJobs.clear()
}
export function registerBenchmarkWorkspace(workspace: string): void {
benchmarkWorkspaces.add(workspace)
}
export function registerBenchmarkWorkspaceRunnables(
workspace: string,
runnables: BenchmarkWorkspaceRunnables
): void {
benchmarkWorkspaces.add(workspace)
benchmarkWorkspaceRunnables.set(workspace, runnables)
}
export function unregisterBenchmarkWorkspace(workspace: string): void {
benchmarkWorkspaces.delete(workspace)
benchmarkWorkspaceRunnables.delete(workspace)
for (const [jobId, entry] of benchmarkJobs.entries()) {
if (entry.workspace === workspace) {
benchmarkJobs.delete(jobId)
}
}
}
export function unregisterBenchmarkWorkspaceRunnables(workspace: string): void {
unregisterBenchmarkWorkspace(workspace)
}
export function hasBenchmarkWorkspace(workspace: string): boolean {
return benchmarkWorkspaces.has(workspace)
}
export function listBenchmarkScripts(workspace: string): Script[] | null {
const runnables = benchmarkWorkspaceRunnables.get(workspace)
if (!runnables) {
return null
}
return (runnables.scripts ?? []).map(buildBenchmarkScript)
}
export function listBenchmarkFlows(workspace: string): Flow[] | null {
const runnables = benchmarkWorkspaceRunnables.get(workspace)
if (!runnables) {
return null
}
return (runnables.flows ?? []).map(buildBenchmarkFlow)
}
export function getBenchmarkScriptByPath(workspace: string, path: string): Script | null {
const script = benchmarkWorkspaceRunnables
.get(workspace)
?.scripts?.find((entry) => entry.path === path)
return script ? buildBenchmarkScript(script) : null
}
export function getBenchmarkScriptByHash(workspace: string, hash: string): Script | null {
const script = benchmarkWorkspaceRunnables
.get(workspace)
?.scripts?.find((entry) => buildBenchmarkScriptHash(entry.path) === hash)
return script ? buildBenchmarkScript(script) : null
}
export function getBenchmarkFlowByPath(workspace: string, path: string): Flow | null {
const flow = benchmarkWorkspaceRunnables
.get(workspace)
?.flows?.find((entry) => entry.path === path)
return flow ? buildBenchmarkFlow(flow) : null
}
export function createBenchmarkCompletedJob(input: {
workspace: string
jobKind: CompletedJob['job_kind']
success?: boolean
result?: unknown
logs?: string
scriptPath?: string
scriptHash?: string
args?: Record<string, unknown>
}): string {
const jobId = `benchmark-job-${randomUUID()}`
const now = new Date().toISOString()
const job: BenchmarkCompletedJob = {
type: 'CompletedJob',
id: jobId,
workspace_id: input.workspace,
created_by: 'ai-evals',
created_at: now,
started_at: now,
completed_at: now,
duration_ms: 0,
success: input.success ?? true,
script_path: input.scriptPath,
script_hash: input.scriptHash,
args: input.args,
result: input.result,
logs: input.logs,
canceled: false,
job_kind: input.jobKind,
permissioned_as: 'u/ai-evals',
is_flow_step: false,
is_skipped: false,
email: 'ai-evals@local',
visible_to_owner: true,
tag: 'benchmark'
}
benchmarkJobs.set(jobId, { workspace: input.workspace, job })
return jobId
}
export function getBenchmarkCompletedJob(
workspace: string,
jobId: string
): BenchmarkCompletedJob | null {
const entry = benchmarkJobs.get(jobId)
if (!entry || entry.workspace !== workspace) {
return null
}
return structuredClone(entry.job)
}
export function runBenchmarkScriptPreview(input: {
workspace: string
requestBody: {
content?: string
language?: ScriptLang | 'bunnative'
args?: Record<string, unknown>
path?: string
}
}): string {
const content = input.requestBody.content ?? ''
const language = input.requestBody.language ?? 'bun'
const lintResult = buildScriptLintResult(content, language)
const success = lintResult.errorCount === 0
return createBenchmarkCompletedJob({
workspace: input.workspace,
jobKind: 'preview',
success,
scriptPath: input.requestBody.path,
args: input.requestBody.args,
result: success
? {
path: input.requestBody.path,
args: input.requestBody.args ?? {},
validated: true
}
: {
path: input.requestBody.path,
args: input.requestBody.args ?? {},
errorCount: lintResult.errorCount,
errors: lintResult.errors.map((entry) => ({
line: entry.startLineNumber,
message: entry.message
}))
}
})
}
export function runBenchmarkFlowByPath(input: {
workspace: string
path: string
args?: Record<string, unknown>
}): string {
const flow = getBenchmarkFlowByPath(input.workspace, input.path)
return createBenchmarkCompletedJob({
workspace: input.workspace,
jobKind: 'flowpreview',
success: flow !== null,
args: input.args,
result:
flow !== null
? {
path: input.path,
args: input.args ?? {},
mocked: true
}
: {
error: `Flow "${input.path}" not found in benchmark workspace`
},
logs:
flow !== null
? 'Mock benchmark flow run completed successfully.'
: `Flow "${input.path}" not found in benchmark workspace.`
})
}
function buildBenchmarkScriptHash(path: string): string {
return `benchmark:${path}`
}
function buildBenchmarkScript(script: BenchmarkWorkspaceScript): Script {
return {
workspace_id: 'benchmark',
hash: buildBenchmarkScriptHash(script.path),
path: script.path,
parent_hashes: [],
summary: script.summary,
description: script.description ?? '',
content: script.content,
created_by: 'benchmark',
created_at: BENCHMARK_TIMESTAMP,
archived: false,
schema: script.schema ?? {},
deleted: false,
is_template: false,
extra_perms: {},
language: script.language,
kind: 'script',
starred: false,
has_preprocessor: false,
modules: null
}
}
function buildBenchmarkFlow(flow: BenchmarkWorkspaceFlow): Flow {
return {
path: flow.path,
summary: flow.summary,
description: flow.description ?? '',
value: flow.value,
schema: flow.schema ?? {},
edited_by: 'benchmark',
edited_at: BENCHMARK_TIMESTAMP,
archived: false,
extra_perms: {}
} as Flow
}
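
The mock backend scopes completed jobs to the workspace that created them and drops them when the workspace is unregistered. The sketch below shows that scoping on its own, under simplifying assumptions: `DemoJobRegistry` is a hypothetical class, and ids come from a counter here rather than `randomUUID`.

```typescript
// Hypothetical in-memory registry; not the module's actual API.
class DemoJobRegistry<TJob> {
  private readonly jobs = new Map<string, { workspace: string; job: TJob }>()
  private nextId = 0

  add(workspace: string, job: TJob): string {
    const id = `demo-job-${this.nextId++}`
    this.jobs.set(id, { workspace, job })
    return id
  }

  // A job is only visible from the workspace that created it.
  get(workspace: string, id: string): TJob | null {
    const entry = this.jobs.get(id)
    return entry && entry.workspace === workspace ? entry.job : null
  }

  // Deleting the visited entry while iterating a Map is safe in JavaScript.
  removeWorkspace(workspace: string): void {
    for (const [id, entry] of this.jobs) {
      if (entry.workspace === workspace) {
        this.jobs.delete(id)
      }
    }
  }
}

const registry = new DemoJobRegistry<{ success: boolean }>()
const demoJobId = registry.add('ws-a', { success: true })
const sameWorkspace = registry.get('ws-a', demoJobId)
const otherWorkspace = registry.get('ws-b', demoJobId) // null: wrong workspace
```

Checking the workspace on lookup keeps concurrent benchmark runs from reading each other's jobs even though they share one module-level map.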

@@ -0,0 +1,133 @@
export type FrontendBenchmarkProgressSurface = 'flow' | 'app' | 'script'
export type FrontendBenchmarkProgressEvent =
| {
type: 'run-start'
surface: FrontendBenchmarkProgressSurface
totalCases: number
runs: number
concurrency: number
}
| {
type: 'attempt-start'
surface: FrontendBenchmarkProgressSurface
caseId: string
caseNumber: number
totalCases: number
attempt: number
runs: number
}
| {
type: 'attempt-finish'
surface: FrontendBenchmarkProgressSurface
caseId: string
caseNumber: number
totalCases: number
attempt: number
runs: number
passed: boolean
durationMs: number
judgeScore: number | null
error: string | null
}
| {
type: 'assistant-message-start'
surface: FrontendBenchmarkProgressSurface
caseId: string
caseNumber: number
totalCases: number
attempt: number
runs: number
}
| {
type: 'assistant-chunk'
surface: FrontendBenchmarkProgressSurface
caseId: string
caseNumber: number
totalCases: number
attempt: number
runs: number
chunk: string
}
| {
type: 'assistant-message-end'
surface: FrontendBenchmarkProgressSurface
caseId: string
caseNumber: number
totalCases: number
attempt: number
runs: number
}
export const FRONTEND_BENCHMARK_PROGRESS_PREFIX = 'WMILL_FRONTEND_AI_EVAL_PROGRESS '
export function emitFrontendBenchmarkProgress(event: FrontendBenchmarkProgressEvent): void {
process.stderr.write(
`${FRONTEND_BENCHMARK_PROGRESS_PREFIX}${JSON.stringify(event)}\n`
)
}
export function parseFrontendBenchmarkProgressLine(
line: string
): FrontendBenchmarkProgressEvent | null {
if (!line.startsWith(FRONTEND_BENCHMARK_PROGRESS_PREFIX)) {
return null
}
try {
const parsed = JSON.parse(
line.slice(FRONTEND_BENCHMARK_PROGRESS_PREFIX.length)
) as FrontendBenchmarkProgressEvent
return parsed?.type ? parsed : null
} catch {
return null
}
}
export function formatFrontendBenchmarkProgressEvent(
event: FrontendBenchmarkProgressEvent
): string {
switch (event.type) {
case 'run-start':
return `Running ${event.surface}: ${event.totalCases} cases x ${event.runs} run${event.runs === 1 ? '' : 's'}, concurrency ${event.concurrency}`
case 'attempt-start':
return `${formatCasePrefix(event.caseNumber, event.totalCases)} ${event.caseId} attempt ${event.attempt}/${event.runs}...`
case 'attempt-finish': {
const parts = [
`${formatCasePrefix(event.caseNumber, event.totalCases)} ${event.caseId} attempt ${event.attempt}/${event.runs} ${event.passed ? 'pass' : 'fail'}`,
formatDuration(event.durationMs)
]
if (event.judgeScore !== null) {
parts.push(`judge ${formatNumber(event.judgeScore)}`)
}
if (event.error) {
parts.push(truncateSingleLine(event.error, 120))
}
return parts.join(' | ')
}
case 'assistant-message-start':
case 'assistant-chunk':
case 'assistant-message-end':
return ''
}
}
function formatCasePrefix(caseNumber: number, totalCases: number): string {
return `[${caseNumber}/${totalCases}]`
}
function formatDuration(durationMs: number): string {
return `${formatNumber(durationMs / 1000)}s`
}
function formatNumber(value: number): string {
return Number.isInteger(value) ? String(value) : value.toFixed(1)
}
function truncateSingleLine(value: string, maxLength: number): string {
const normalized = value.replace(/\s+/g, ' ').trim()
if (normalized.length <= maxLength) {
return normalized
}
return `${normalized.slice(0, Math.max(0, maxLength - 3))}...`
}
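
Progress events travel over stderr as one prefixed JSON document per line, and the parser treats anything it cannot decode as ordinary output. A minimal round-trip sketch of that framing follows. `DEMO_PROGRESS_PREFIX`, `DemoProgressEvent`, `encodeProgressLine`, and `decodeProgressLine` are hypothetical stand-ins for the module's real names.

```typescript
// Hypothetical prefix and event shape, used only for this sketch.
const DEMO_PROGRESS_PREFIX = 'DEMO_PROGRESS '

interface DemoProgressEvent {
  type: string
  caseId: string
}

function encodeProgressLine(progress: DemoProgressEvent): string {
  return `${DEMO_PROGRESS_PREFIX}${JSON.stringify(progress)}`
}

function decodeProgressLine(line: string): DemoProgressEvent | null {
  if (!line.startsWith(DEMO_PROGRESS_PREFIX)) {
    return null // ordinary stderr output
  }
  try {
    const parsed = JSON.parse(
      line.slice(DEMO_PROGRESS_PREFIX.length)
    ) as DemoProgressEvent | null
    return parsed?.type ? parsed : null
  } catch {
    return null // prefixed but not valid JSON
  }
}

const encodedLine = encodeProgressLine({ type: 'attempt-start', caseId: 'flow-1' })
const decodedEvent = decodeProgressLine(encodedLine)
```

Because `JSON.stringify` escapes newlines inside strings, each event stays on a single line, which is what makes line-based parsing on the receiving side workable.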

@@ -0,0 +1,216 @@
import { spawn } from 'node:child_process'
import { mkdtemp, readFile, rm } from 'node:fs/promises'
import { tmpdir } from 'node:os'
import path from 'node:path'
import { fileURLToPath } from 'node:url'
import {
formatFrontendBenchmarkProgressEvent,
parseFrontendBenchmarkProgressLine
} from './progress'
import type { BenchmarkRunResult } from '../../core/types'
const REPO_ROOT = fileURLToPath(new URL('../../../', import.meta.url))
const FRONTEND_DIR = path.join(REPO_ROOT, 'frontend')
const FRONTEND_BENCHMARK_TEST = '../ai_evals/adapters/frontend/vitestAdapter.test.ts'
const FRONTEND_BENCHMARK_CONFIG = '../ai_evals/adapters/frontend/vitest.config.ts'
export type FrontendMode = 'flow' | 'app' | 'script'
export async function runFrontendBenchmarkAdapter(input: {
mode: FrontendMode
caseIds: string[]
runs: number
model?: string
verbose?: boolean
}): Promise<BenchmarkRunResult> {
const tempDir = await mkdtemp(path.join(tmpdir(), 'wmill-frontend-benchmark-'))
const outputPath = path.join(tempDir, 'result.json')
try {
await runVitestBenchmark(
path.join(FRONTEND_DIR, 'node_modules', '.bin', 'vitest'),
[
'run',
FRONTEND_BENCHMARK_TEST,
'--project',
'server',
'--config',
FRONTEND_BENCHMARK_CONFIG
],
{
cwd: FRONTEND_DIR,
env: {
...process.env,
BROWSERSLIST_IGNORE_OLD_DATA: '1',
WMILL_FRONTEND_AI_EVAL_OUTPUT_PATH: outputPath,
WMILL_FRONTEND_AI_EVAL_MODE: input.mode,
WMILL_FRONTEND_AI_EVAL_CASE_IDS: JSON.stringify(input.caseIds),
WMILL_FRONTEND_AI_EVAL_RUNS: String(input.runs),
WMILL_FRONTEND_AI_EVAL_MODEL: input.model ?? '',
WMILL_FRONTEND_AI_EVAL_PROGRESS: '1',
WMILL_FRONTEND_AI_EVAL_VERBOSE: input.verbose ? '1' : '0'
}
}
)
const raw = await readFile(outputPath, 'utf8')
return JSON.parse(raw) as BenchmarkRunResult
} catch (error) {
throw new Error(`Frontend benchmark adapter failed:\n${toErrorMessage(error)}`)
} finally {
await rm(tempDir, { recursive: true, force: true })
}
}
async function runVitestBenchmark(
command: string,
args: string[],
options: {
cwd: string
env: NodeJS.ProcessEnv
}
): Promise<void> {
const child = spawn(command, args, {
cwd: options.cwd,
env: options.env,
stdio: ['ignore', 'pipe', 'pipe']
})
let stdout = ''
let stderr = ''
let stderrLineBuffer = ''
let assistantStreamOpen = false
child.stdout?.setEncoding('utf8')
child.stdout?.on('data', (chunk: string) => {
stdout += chunk
})
child.stderr?.setEncoding('utf8')
child.stderr?.on('data', (chunk: string) => {
stderrLineBuffer += chunk
const { remainder, passthrough, nextAssistantStreamOpen } = drainProgressLines(
stderrLineBuffer,
assistantStreamOpen
)
stderrLineBuffer = remainder
stderr += passthrough
assistantStreamOpen = nextAssistantStreamOpen
})
await new Promise<void>((resolve, reject) => {
child.once('error', reject)
child.once('close', (code) => {
if (stderrLineBuffer.length > 0) {
const {
remainder,
passthrough,
nextAssistantStreamOpen
} = drainProgressLines(`${stderrLineBuffer}\n`, assistantStreamOpen)
stderrLineBuffer = remainder
stderr += passthrough
assistantStreamOpen = nextAssistantStreamOpen
}
if (code === 0) {
if (assistantStreamOpen) {
process.stderr.write('\n')
}
resolve()
return
}
const details = [`vitest exited with code ${code}`, stdout, stderr].filter(Boolean).join('\n')
reject(new Error(details))
})
})
}
// Splits buffered stderr into complete lines, renders any progress events, and
// returns the unterminated tail together with the non-progress output.
function drainProgressLines(
buffer: string,
initialAssistantStreamOpen: boolean
): {
remainder: string
passthrough: string
nextAssistantStreamOpen: boolean
} {
let remainder = buffer
let passthrough = ''
let assistantStreamOpen = initialAssistantStreamOpen
while (true) {
const newlineIndex = remainder.indexOf('\n')
if (newlineIndex === -1) {
return { remainder, passthrough, nextAssistantStreamOpen: assistantStreamOpen }
}
const line = remainder.slice(0, newlineIndex).replace(/\r$/, '')
remainder = remainder.slice(newlineIndex + 1)
const progressEvent = parseFrontendBenchmarkProgressLine(line)
if (progressEvent) {
if (progressEvent.type === 'assistant-message-start') {
if (assistantStreamOpen) {
process.stderr.write('\n')
}
process.stderr.write(
`${formatCasePrefix(progressEvent.caseNumber, progressEvent.totalCases)} ${progressEvent.caseId} attempt ${progressEvent.attempt}/${progressEvent.runs} assistant:\n`
)
assistantStreamOpen = true
continue
}
if (progressEvent.type === 'assistant-chunk') {
process.stderr.write(progressEvent.chunk)
continue
}
if (progressEvent.type === 'assistant-message-end') {
if (assistantStreamOpen) {
process.stderr.write('\n')
}
assistantStreamOpen = false
continue
}
if (assistantStreamOpen) {
process.stderr.write('\n')
assistantStreamOpen = false
}
process.stderr.write(`${formatFrontendBenchmarkProgressEvent(progressEvent)}\n`)
continue
}
if (shouldSuppressFrontendStderrLine(line)) {
continue
}
passthrough += `${line}\n`
process.stderr.write(`${line}\n`)
}
}
function formatCasePrefix(caseNumber: number, totalCases: number): string {
return `[${caseNumber}/${totalCases}]`
}
function shouldSuppressFrontendStderrLine(line: string): boolean {
return (
line.startsWith('[baseline-browser-mapping] ') ||
line.startsWith('Browserslist: browsers data (caniuse-lite) is ') ||
line.includes('update-browserslist-db@latest') ||
line.includes('update-db#readme')
)
}
function toErrorMessage(error: unknown): string {
if (error instanceof Error) {
return error.message
}
return String(error)
}
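
`drainProgressLines` has to cope with stderr chunks that split a line at arbitrary positions, so it consumes only complete lines and carries the rest forward. The sketch below isolates that buffering step. `drainCompleteLines` is a hypothetical helper that leaves out progress parsing and suppression.

```typescript
// Splits a buffer into complete lines and returns the unterminated tail.
function drainCompleteLines(buffer: string): { lines: string[]; remainder: string } {
  const lines: string[] = []
  let remainder = buffer
  while (true) {
    const newlineIndex = remainder.indexOf('\n')
    if (newlineIndex === -1) {
      return { lines, remainder }
    }
    // Drop a trailing carriage return so CRLF input yields clean lines.
    lines.push(remainder.slice(0, newlineIndex).replace(/\r$/, ''))
    remainder = remainder.slice(newlineIndex + 1)
  }
}

// Usage: chunks may split a line anywhere, so carry the remainder forward.
const received: string[] = []
let pending = ''
for (const chunk of ['first li', 'ne\nsecond', ' line\r\nthi', 'rd']) {
  const drained = drainCompleteLines(pending + chunk)
  received.push(...drained.lines)
  pending = drained.remainder
}
// received is ['first line', 'second line'] and pending is 'third'
```

The adapter handles the leftover tail on process close by appending a newline and draining once more, which is why `third` would not be lost there.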

@@ -0,0 +1,28 @@
import { fileURLToPath } from 'node:url'
import frontendConfig from '../../../frontend/vite.config.js'

const FRONTEND_VITE_CONFIG_PATH = fileURLToPath(new URL('../../../frontend/vite.config.js', import.meta.url))
const FRONTEND_TEST_SETUP_PATH = fileURLToPath(
  new URL('../../../frontend/src/lib/test-setup.ts', import.meta.url)
)
const ADAPTER_TEST_PATH = fileURLToPath(new URL('./vitestAdapter.test.ts', import.meta.url))

const config = {
  ...frontendConfig,
  test: {
    ...frontendConfig.test,
    projects: [
      {
        extends: FRONTEND_VITE_CONFIG_PATH,
        test: {
          name: 'server',
          environment: 'node',
          include: [ADAPTER_TEST_PATH],
          setupFiles: [FRONTEND_TEST_SETUP_PATH]
        }
      }
    ]
  }
}

export default config

@@ -0,0 +1,165 @@
import { expect, it, vi } from 'vitest'
// @ts-ignore - Node.js fs/promises
import { mkdir, writeFile } from 'fs/promises'
// @ts-ignore - Node.js path
import { dirname, resolve } from 'path'

vi.mock('monaco-editor', () => ({
  editor: {},
  languages: {},
  KeyCode: {},
  Uri: {
    parse: (value: string) => ({ toString: () => value })
  },
  MarkerSeverity: {
    Error: 8,
    Warning: 4,
    Info: 2,
    Hint: 1
  }
}))

vi.mock('@codingame/monaco-vscode-standalone-typescript-language-features', () => ({
  getTypeScriptWorker: async () => async () => ({}),
  typescriptVersion: 'test'
}))

vi.mock('@codingame/monaco-vscode-languages-service-override', () => ({
  default: () => ({})
}))

vi.mock('$lib/components/vscode', () => ({}))
vi.mock('$lib/gen', async () => {
  const actual = await vi.importActual<any>('$lib/gen')
  const {
    getBenchmarkCompletedJob,
    getBenchmarkFlowByPath,
    getBenchmarkScriptByHash,
    getBenchmarkScriptByPath,
    hasBenchmarkWorkspace,
    listBenchmarkFlows,
    listBenchmarkScripts,
    runBenchmarkFlowByPath,
    runBenchmarkScriptPreview
  } = await import('./mockBackend')

  function wrapService<T extends object>(target: T, overrides: Record<string, unknown>): T {
    return new Proxy(target, {
      get(source, property, receiver) {
        if (typeof property === 'string' && property in overrides) {
          return overrides[property]
        }
        return Reflect.get(source, property, receiver)
      }
    })
  }

  return {
    ...actual,
    ScriptService: wrapService(actual.ScriptService, {
      listScripts: async (data: { workspace: string }) =>
        hasBenchmarkWorkspace(data.workspace)
          ? (listBenchmarkScripts(data.workspace) ?? [])
          : actual.ScriptService.listScripts(data),
      getScriptByPath: async (data: { workspace: string; path: string }) => {
        if (hasBenchmarkWorkspace(data.workspace)) {
          const script = getBenchmarkScriptByPath(data.workspace, data.path)
          if (!script) {
            throw new Error(`Script "${data.path}" not found in benchmark workspace`)
          }
          return script
        }
        return actual.ScriptService.getScriptByPath(data)
      },
      getScriptByHash: async (data: { workspace: string; hash: string }) => {
        if (hasBenchmarkWorkspace(data.workspace)) {
          const script = getBenchmarkScriptByHash(data.workspace, data.hash)
          if (!script) {
            throw new Error(`Script hash "${data.hash}" not found in benchmark workspace`)
          }
          return script
        }
        return actual.ScriptService.getScriptByHash(data)
      }
    }),
    FlowService: wrapService(actual.FlowService, {
      listFlows: async (data: { workspace: string }) =>
        hasBenchmarkWorkspace(data.workspace)
          ? (listBenchmarkFlows(data.workspace) ?? [])
          : actual.FlowService.listFlows(data),
      getFlowByPath: async (data: { workspace: string; path: string }) => {
        if (hasBenchmarkWorkspace(data.workspace)) {
          const flow = getBenchmarkFlowByPath(data.workspace, data.path)
          if (!flow) {
            throw new Error(`Flow "${data.path}" not found in benchmark workspace`)
          }
          return flow
        }
        return actual.FlowService.getFlowByPath(data)
      }
    }),
    JobService: wrapService(actual.JobService, {
      runScriptPreview: async (data: {
        workspace: string
        requestBody?: {
          content?: string
          language?: string
          args?: Record<string, unknown>
          path?: string
        }
      }) =>
        hasBenchmarkWorkspace(data.workspace)
          ? runBenchmarkScriptPreview({
              workspace: data.workspace,
              requestBody: data.requestBody ?? {}
            })
          : actual.JobService.runScriptPreview(data),
      runFlowByPath: async (data: {
        workspace: string
        path: string
        requestBody?: Record<string, unknown>
      }) =>
        hasBenchmarkWorkspace(data.workspace)
          ? runBenchmarkFlowByPath({
              workspace: data.workspace,
              path: data.path,
              args: data.requestBody
            })
          : actual.JobService.runFlowByPath(data),
      getJob: async (data: { workspace: string; id: string }) => {
        if (hasBenchmarkWorkspace(data.workspace)) {
          const job = getBenchmarkCompletedJob(data.workspace, data.id)
          if (!job) {
            throw new Error(`Job "${data.id}" not found in benchmark workspace`)
          }
          return job
        }
        return actual.JobService.getJob(data)
      }
    })
  }
})
const benchmarkOutputPath = process.env.WMILL_FRONTEND_AI_EVAL_OUTPUT_PATH
const benchmarkIt = benchmarkOutputPath ? it : it.skip

benchmarkIt(
  'runs the frontend benchmark adapter from environment input',
  async () => {
    const { resetBenchmarkMockBackend } = await import('./mockBackend')
    resetBenchmarkMockBackend()
    const { runFrontendBenchmarkFromEnv } = await import('./benchmarkRunner')
    try {
      const payload = await runFrontendBenchmarkFromEnv()
      const absoluteOutputPath = resolve(benchmarkOutputPath!)
      await mkdir(dirname(absoluteOutputPath), { recursive: true })
      await writeFile(absoluteOutputPath, JSON.stringify(payload, null, 2) + '\n', 'utf8')
      expect(payload.cases.length).toBeGreaterThan(0)
    } finally {
      resetBenchmarkMockBackend()
    }
  },
  600_000
)
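
The `$lib/gen` mock in this test file relies on `wrapService`, which uses a `Proxy` so that only the named methods are replaced and every other property falls through to the real service. Below is a small standalone sketch of that pattern. `GreeterService` and its methods are made up purely for illustration and are not part of this diff.

```typescript
// Same shape as the helper in the test file: overrides win, the rest falls through.
function wrapService<T extends object>(target: T, overrides: Record<string, unknown>): T {
  return new Proxy(target, {
    get(source, property, receiver) {
      if (typeof property === 'string' && property in overrides) {
        return overrides[property]
      }
      return Reflect.get(source, property, receiver)
    }
  })
}

// Hypothetical service object, standing in for a generated API client.
const GreeterService = {
  greet: (name: string) => `hello ${name}`,
  farewell: (name: string) => `bye ${name}`
}

const wrapped = wrapService(GreeterService, {
  greet: (name: string) => `mocked ${name}`
})

const greeting = wrapped.greet('ada') // served by the override
const goodbye = wrapped.farewell('ada') // falls through to the original
```

One detail to be aware of: the `in` operator also matches keys inherited from `Object.prototype`, such as `toString`, so a stricter variant could check `Object.hasOwn(overrides, property)` instead. Whether that matters here depends on how the generated services are used.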

ai_evals/bun.lock Normal file

@@ -0,0 +1,313 @@
{
"lockfileVersion": 1,
"configVersion": 1,
"workspaces": {
"": {
"name": "windmill-ai-evals",
"dependencies": {
"@anthropic-ai/claude-agent-sdk": "^0.2.25",
"@anthropic-ai/sdk": "^0.39.0",
"commander": "^14.0.3",
"openai": "^6.9.1",
"yaml": "^2.8.3",
},
"devDependencies": {
"@types/bun": "latest",
"typescript": "^5.0.0",
},
},
},
"packages": {
"@anthropic-ai/claude-agent-sdk": ["@anthropic-ai/claude-agent-sdk@0.2.87", "", { "dependencies": { "@anthropic-ai/sdk": "^0.74.0", "@modelcontextprotocol/sdk": "^1.27.1" }, "optionalDependencies": { "@img/sharp-darwin-arm64": "^0.34.2", "@img/sharp-darwin-x64": "^0.34.2", "@img/sharp-linux-arm": "^0.34.2", "@img/sharp-linux-arm64": "^0.34.2", "@img/sharp-linux-x64": "^0.34.2", "@img/sharp-linuxmusl-arm64": "^0.34.2", "@img/sharp-linuxmusl-x64": "^0.34.2", "@img/sharp-win32-arm64": "^0.34.2", "@img/sharp-win32-x64": "^0.34.2" }, "peerDependencies": { "zod": "^4.0.0" } }, "sha512-WWmgBPxPhBOvNT0ujI8vPTI2lK+w5YEkEZ/y1mH0EDkK/0kBnxVJNhCtG5vnueiAViwLoUOFn66pbkDiivijdA=="],
"@anthropic-ai/sdk": ["@anthropic-ai/sdk@0.39.0", "", { "dependencies": { "@types/node": "^18.11.18", "@types/node-fetch": "^2.6.4", "abort-controller": "^3.0.0", "agentkeepalive": "^4.2.1", "form-data-encoder": "1.7.2", "formdata-node": "^4.3.2", "node-fetch": "^2.6.7" } }, "sha512-eMyDIPRZbt1CCLErRCi3exlAvNkBtRe+kW5vvJyef93PmNr/clstYgHhtvmkxN82nlKgzyGPCyGxrm0JQ1ZIdg=="],
"@babel/runtime": ["@babel/runtime@7.29.2", "", {}, "sha512-JiDShH45zKHWyGe4ZNVRrCjBz8Nh9TMmZG1kh4QTK8hCBTWBi8Da+i7s1fJw7/lYpM4ccepSNfqzZ/QvABBi5g=="],
"@hono/node-server": ["@hono/node-server@1.19.12", "", { "peerDependencies": { "hono": "^4" } }, "sha512-txsUW4SQ1iilgE0l9/e9VQWmELXifEFvmdA1j6WFh/aFPj99hIntrSsq/if0UWyGVkmrRPKA1wCeP+UCr1B9Uw=="],
"@img/sharp-darwin-arm64": ["@img/sharp-darwin-arm64@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-darwin-arm64": "1.2.4" }, "os": "darwin", "cpu": "arm64" }, "sha512-imtQ3WMJXbMY4fxb/Ndp6HBTNVtWCUI0WdobyheGf5+ad6xX8VIDO8u2xE4qc/fr08CKG/7dDseFtn6M6g/r3w=="],
"@img/sharp-darwin-x64": ["@img/sharp-darwin-x64@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-darwin-x64": "1.2.4" }, "os": "darwin", "cpu": "x64" }, "sha512-YNEFAF/4KQ/PeW0N+r+aVVsoIY0/qxxikF2SWdp+NRkmMB7y9LBZAVqQ4yhGCm/H3H270OSykqmQMKLBhBJDEw=="],
"@img/sharp-libvips-darwin-arm64": ["@img/sharp-libvips-darwin-arm64@1.2.4", "", { "os": "darwin", "cpu": "arm64" }, "sha512-zqjjo7RatFfFoP0MkQ51jfuFZBnVE2pRiaydKJ1G/rHZvnsrHAOcQALIi9sA5co5xenQdTugCvtb1cuf78Vf4g=="],
"@img/sharp-libvips-darwin-x64": ["@img/sharp-libvips-darwin-x64@1.2.4", "", { "os": "darwin", "cpu": "x64" }, "sha512-1IOd5xfVhlGwX+zXv2N93k0yMONvUlANylbJw1eTah8K/Jtpi15KC+WSiaX/nBmbm2HxRM1gZ0nSdjSsrZbGKg=="],
"@img/sharp-libvips-linux-arm": ["@img/sharp-libvips-linux-arm@1.2.4", "", { "os": "linux", "cpu": "arm" }, "sha512-bFI7xcKFELdiNCVov8e44Ia4u2byA+l3XtsAj+Q8tfCwO6BQ8iDojYdvoPMqsKDkuoOo+X6HZA0s0q11ANMQ8A=="],
"@img/sharp-libvips-linux-arm64": ["@img/sharp-libvips-linux-arm64@1.2.4", "", { "os": "linux", "cpu": "arm64" }, "sha512-excjX8DfsIcJ10x1Kzr4RcWe1edC9PquDRRPx3YVCvQv+U5p7Yin2s32ftzikXojb1PIFc/9Mt28/y+iRklkrw=="],
"@img/sharp-libvips-linux-x64": ["@img/sharp-libvips-linux-x64@1.2.4", "", { "os": "linux", "cpu": "x64" }, "sha512-tJxiiLsmHc9Ax1bz3oaOYBURTXGIRDODBqhveVHonrHJ9/+k89qbLl0bcJns+e4t4rvaNBxaEZsFtSfAdquPrw=="],
"@img/sharp-libvips-linuxmusl-arm64": ["@img/sharp-libvips-linuxmusl-arm64@1.2.4", "", { "os": "linux", "cpu": "arm64" }, "sha512-FVQHuwx1IIuNow9QAbYUzJ+En8KcVm9Lk5+uGUQJHaZmMECZmOlix9HnH7n1TRkXMS0pGxIJokIVB9SuqZGGXw=="],
"@img/sharp-libvips-linuxmusl-x64": ["@img/sharp-libvips-linuxmusl-x64@1.2.4", "", { "os": "linux", "cpu": "x64" }, "sha512-+LpyBk7L44ZIXwz/VYfglaX/okxezESc6UxDSoyo2Ks6Jxc4Y7sGjpgU9s4PMgqgjj1gZCylTieNamqA1MF7Dg=="],
"@img/sharp-linux-arm": ["@img/sharp-linux-arm@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-linux-arm": "1.2.4" }, "os": "linux", "cpu": "arm" }, "sha512-9dLqsvwtg1uuXBGZKsxem9595+ujv0sJ6Vi8wcTANSFpwV/GONat5eCkzQo/1O6zRIkh0m/8+5BjrRr7jDUSZw=="],
"@img/sharp-linux-arm64": ["@img/sharp-linux-arm64@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-linux-arm64": "1.2.4" }, "os": "linux", "cpu": "arm64" }, "sha512-bKQzaJRY/bkPOXyKx5EVup7qkaojECG6NLYswgktOZjaXecSAeCWiZwwiFf3/Y+O1HrauiE3FVsGxFg8c24rZg=="],
"@img/sharp-linux-x64": ["@img/sharp-linux-x64@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-linux-x64": "1.2.4" }, "os": "linux", "cpu": "x64" }, "sha512-MEzd8HPKxVxVenwAa+JRPwEC7QFjoPWuS5NZnBt6B3pu7EG2Ge0id1oLHZpPJdn3OQK+BQDiw9zStiHBTJQQQQ=="],
"@img/sharp-linuxmusl-arm64": ["@img/sharp-linuxmusl-arm64@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-linuxmusl-arm64": "1.2.4" }, "os": "linux", "cpu": "arm64" }, "sha512-fprJR6GtRsMt6Kyfq44IsChVZeGN97gTD331weR1ex1c1rypDEABN6Tm2xa1wE6lYb5DdEnk03NZPqA7Id21yg=="],
"@img/sharp-linuxmusl-x64": ["@img/sharp-linuxmusl-x64@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-linuxmusl-x64": "1.2.4" }, "os": "linux", "cpu": "x64" }, "sha512-Jg8wNT1MUzIvhBFxViqrEhWDGzqymo3sV7z7ZsaWbZNDLXRJZoRGrjulp60YYtV4wfY8VIKcWidjojlLcWrd8Q=="],
"@img/sharp-win32-arm64": ["@img/sharp-win32-arm64@0.34.5", "", { "os": "win32", "cpu": "arm64" }, "sha512-WQ3AgWCWYSb2yt+IG8mnC6Jdk9Whs7O0gxphblsLvdhSpSTtmu69ZG1Gkb6NuvxsNACwiPV6cNSZNzt0KPsw7g=="],
"@img/sharp-win32-x64": ["@img/sharp-win32-x64@0.34.5", "", { "os": "win32", "cpu": "x64" }, "sha512-+29YMsqY2/9eFEiW93eqWnuLcWcufowXewwSNIT6UwZdUUCrM3oFjMWH/Z6/TMmb4hlFenmfAVbpWeup2jryCw=="],
"@modelcontextprotocol/sdk": ["@modelcontextprotocol/sdk@1.29.0", "", { "dependencies": { "@hono/node-server": "^1.19.9", "ajv": "^8.17.1", "ajv-formats": "^3.0.1", "content-type": "^1.0.5", "cors": "^2.8.5", "cross-spawn": "^7.0.5", "eventsource": "^3.0.2", "eventsource-parser": "^3.0.0", "express": "^5.2.1", "express-rate-limit": "^8.2.1", "hono": "^4.11.4", "jose": "^6.1.3", "json-schema-typed": "^8.0.2", "pkce-challenge": "^5.0.0", "raw-body": "^3.0.0", "zod": "^3.25 || ^4.0", "zod-to-json-schema": "^3.25.1" }, "peerDependencies": { "@cfworker/json-schema": "^4.1.1" }, "optionalPeers": ["@cfworker/json-schema"] }, "sha512-zo37mZA9hJWpULgkRpowewez1y6ML5GsXJPY8FI0tBBCd77HEvza4jDqRKOXgHNn867PVGCyTdzqpz0izu5ZjQ=="],
"@types/bun": ["@types/bun@1.3.11", "", { "dependencies": { "bun-types": "1.3.11" } }, "sha512-5vPne5QvtpjGpsGYXiFyycfpDF2ECyPcTSsFBMa0fraoxiQyMJ3SmuQIGhzPg2WJuWxVBoxWJ2kClYTcw/4fAg=="],
"@types/node": ["@types/node@18.19.130", "", { "dependencies": { "undici-types": "~5.26.4" } }, "sha512-GRaXQx6jGfL8sKfaIDD6OupbIHBr9jv7Jnaml9tB7l4v068PAOXqfcujMMo5PhbIs6ggR1XODELqahT2R8v0fg=="],
"@types/node-fetch": ["@types/node-fetch@2.6.13", "", { "dependencies": { "@types/node": "*", "form-data": "^4.0.4" } }, "sha512-QGpRVpzSaUs30JBSGPjOg4Uveu384erbHBoT1zeONvyCfwQxIkUshLAOqN/k9EjGviPRmWTTe6aH2qySWKTVSw=="],
"abort-controller": ["abort-controller@3.0.0", "", { "dependencies": { "event-target-shim": "^5.0.0" } }, "sha512-h8lQ8tacZYnR3vNQTgibj+tODHI5/+l06Au2Pcriv/Gmet0eaj4TwWH41sO9wnHDiQsEj19q0drzdWdeAHtweg=="],
"accepts": ["accepts@2.0.0", "", { "dependencies": { "mime-types": "^3.0.0", "negotiator": "^1.0.0" } }, "sha512-5cvg6CtKwfgdmVqY1WIiXKc3Q1bkRqGLi+2W/6ao+6Y7gu/RCwRuAhGEzh5B4KlszSuTLgZYuqFqo5bImjNKng=="],
"agentkeepalive": ["agentkeepalive@4.6.0", "", { "dependencies": { "humanize-ms": "^1.2.1" } }, "sha512-kja8j7PjmncONqaTsB8fQ+wE2mSU2DJ9D4XKoJ5PFWIdRMa6SLSN1ff4mOr4jCbfRSsxR4keIiySJU0N9T5hIQ=="],
"ajv": ["ajv@8.18.0", "", { "dependencies": { "fast-deep-equal": "^3.1.3", "fast-uri": "^3.0.1", "json-schema-traverse": "^1.0.0", "require-from-string": "^2.0.2" } }, "sha512-PlXPeEWMXMZ7sPYOHqmDyCJzcfNrUr3fGNKtezX14ykXOEIvyK81d+qydx89KY5O71FKMPaQ2vBfBFI5NHR63A=="],
"ajv-formats": ["ajv-formats@3.0.1", "", { "dependencies": { "ajv": "^8.0.0" } }, "sha512-8iUql50EUR+uUcdRQ3HDqa6EVyo3docL8g5WJ3FNcWmu62IbkGUue/pEyLBW8VGKKucTPgqeks4fIU1DA4yowQ=="],
"asynckit": ["asynckit@0.4.0", "", {}, "sha512-Oei9OH4tRh0YqU3GxhX79dM/mwVgvbZJaSNaRk+bshkj0S5cfHcgYakreBjrHwatXKbz+IoIdYLxrKim2MjW0Q=="],
"body-parser": ["body-parser@2.2.2", "", { "dependencies": { "bytes": "^3.1.2", "content-type": "^1.0.5", "debug": "^4.4.3", "http-errors": "^2.0.0", "iconv-lite": "^0.7.0", "on-finished": "^2.4.1", "qs": "^6.14.1", "raw-body": "^3.0.1", "type-is": "^2.0.1" } }, "sha512-oP5VkATKlNwcgvxi0vM0p/D3n2C3EReYVX+DNYs5TjZFn/oQt2j+4sVJtSMr18pdRr8wjTcBl6LoV+FUwzPmNA=="],
"bun-types": ["bun-types@1.3.11", "", { "dependencies": { "@types/node": "*" } }, "sha512-1KGPpoxQWl9f6wcZh57LvrPIInQMn2TQ7jsgxqpRzg+l0QPOFvJVH7HmvHo/AiPgwXy+/Thf6Ov3EdVn1vOabg=="],
"bytes": ["bytes@3.1.2", "", {}, "sha512-/Nf7TyzTx6S3yRJObOAV7956r8cr2+Oj8AC5dt8wSP3BQAoeX58NoHyCU8P8zGkNXStjTSi6fzO6F0pBdcYbEg=="],
"call-bind-apply-helpers": ["call-bind-apply-helpers@1.0.2", "", { "dependencies": { "es-errors": "^1.3.0", "function-bind": "^1.1.2" } }, "sha512-Sp1ablJ0ivDkSzjcaJdxEunN5/XvksFJ2sMBFfq6x0ryhQV/2b/KwFe21cMpmHtPOSij8K99/wSfoEuTObmuMQ=="],
"call-bound": ["call-bound@1.0.4", "", { "dependencies": { "call-bind-apply-helpers": "^1.0.2", "get-intrinsic": "^1.3.0" } }, "sha512-+ys997U96po4Kx/ABpBCqhA9EuxJaQWDQg7295H4hBphv3IZg0boBKuwYpt4YXp6MZ5AmZQnU/tyMTlRpaSejg=="],
"combined-stream": ["combined-stream@1.0.8", "", { "dependencies": { "delayed-stream": "~1.0.0" } }, "sha512-FQN4MRfuJeHf7cBbBMJFXhKSDq+2kAArBlmRBvcvFE5BB1HZKXtSFASDhdlz9zOYwxh8lDdnvmMOe/+5cdoEdg=="],
"commander": ["commander@14.0.3", "", {}, "sha512-H+y0Jo/T1RZ9qPP4Eh1pkcQcLRglraJaSLoyOtHxu6AapkjWVCy2Sit1QQ4x3Dng8qDlSsZEet7g5Pq06MvTgw=="],
"content-disposition": ["content-disposition@1.0.1", "", {}, "sha512-oIXISMynqSqm241k6kcQ5UwttDILMK4BiurCfGEREw6+X9jkkpEe5T9FZaApyLGGOnFuyMWZpdolTXMtvEJ08Q=="],
"content-type": ["content-type@1.0.5", "", {}, "sha512-nTjqfcBFEipKdXCv4YDQWCfmcLZKm81ldF0pAopTvyrFGVbcR6P/VAAd5G7N+0tTr8QqiU0tFadD6FK4NtJwOA=="],
"cookie": ["cookie@0.7.2", "", {}, "sha512-yki5XnKuf750l50uGTllt6kKILY4nQ1eNIQatoXEByZ5dWgnKqbnqmTrBE5B4N7lrMJKQ2ytWMiTO2o0v6Ew/w=="],
"cookie-signature": ["cookie-signature@1.2.2", "", {}, "sha512-D76uU73ulSXrD1UXF4KE2TMxVVwhsnCgfAyTg9k8P6KGZjlXKrOLe4dJQKI3Bxi5wjesZoFXJWElNWBjPZMbhg=="],
"cors": ["cors@2.8.6", "", { "dependencies": { "object-assign": "^4", "vary": "^1" } }, "sha512-tJtZBBHA6vjIAaF6EnIaq6laBBP9aq/Y3ouVJjEfoHbRBcHBAHYcMh/w8LDrk2PvIMMq8gmopa5D4V8RmbrxGw=="],
"cross-spawn": ["cross-spawn@7.0.6", "", { "dependencies": { "path-key": "^3.1.0", "shebang-command": "^2.0.0", "which": "^2.0.1" } }, "sha512-uV2QOWP2nWzsy2aMp8aRibhi9dlzF5Hgh5SHaB9OiTGEyDTiJJyx0uy51QXdyWbtAHNua4XJzUKca3OzKUd3vA=="],
"debug": ["debug@4.4.3", "", { "dependencies": { "ms": "^2.1.3" } }, "sha512-RGwwWnwQvkVfavKVt22FGLw+xYSdzARwm0ru6DhTVA3umU5hZc28V3kO4stgYryrTlLpuvgI9GiijltAjNbcqA=="],
"delayed-stream": ["delayed-stream@1.0.0", "", {}, "sha512-ZySD7Nf91aLB0RxL4KGrKHBXl7Eds1DAmEdcoVawXnLD7SDhpNgtuII2aAkg7a7QS41jxPSZ17p4VdGnMHk3MQ=="],
"depd": ["depd@2.0.0", "", {}, "sha512-g7nH6P6dyDioJogAAGprGpCtVImJhpPk/roCzdb3fIh61/s/nPsfR6onyMwkCAR/OlC3yBC0lESvUoQEAssIrw=="],
"dunder-proto": ["dunder-proto@1.0.1", "", { "dependencies": { "call-bind-apply-helpers": "^1.0.1", "es-errors": "^1.3.0", "gopd": "^1.2.0" } }, "sha512-KIN/nDJBQRcXw0MLVhZE9iQHmG68qAVIBg9CqmUYjmQIhgij9U5MFvrqkUL5FbtyyzZuOeOt0zdeRe4UY7ct+A=="],
"ee-first": ["ee-first@1.1.1", "", {}, "sha512-WMwm9LhRUo+WUaRN+vRuETqG89IgZphVSNkdFgeb6sS/E4OrDIN7t48CAewSHXc6C8lefD8KKfr5vY61brQlow=="],
"encodeurl": ["encodeurl@2.0.0", "", {}, "sha512-Q0n9HRi4m6JuGIV1eFlmvJB7ZEVxu93IrMyiMsGC0lrMJMWzRgx6WGquyfQgZVb31vhGgXnfmPNNXmxnOkRBrg=="],
"es-define-property": ["es-define-property@1.0.1", "", {}, "sha512-e3nRfgfUZ4rNGL232gUgX06QNyyez04KdjFrF+LTRoOXmrOgFKDg4BCdsjW8EnT69eqdYGmRpJwiPVYNrCaW3g=="],
"es-errors": ["es-errors@1.3.0", "", {}, "sha512-Zf5H2Kxt2xjTvbJvP2ZWLEICxA6j+hAmMzIlypy4xcBg1vKVnx89Wy0GbS+kf5cwCVFFzdCFh2XSCFNULS6csw=="],
"es-object-atoms": ["es-object-atoms@1.1.1", "", { "dependencies": { "es-errors": "^1.3.0" } }, "sha512-FGgH2h8zKNim9ljj7dankFPcICIK9Cp5bm+c2gQSYePhpaG5+esrLODihIorn+Pe6FGJzWhXQotPv73jTaldXA=="],
"es-set-tostringtag": ["es-set-tostringtag@2.1.0", "", { "dependencies": { "es-errors": "^1.3.0", "get-intrinsic": "^1.2.6", "has-tostringtag": "^1.0.2", "hasown": "^2.0.2" } }, "sha512-j6vWzfrGVfyXxge+O0x5sh6cvxAog0a/4Rdd2K36zCMV5eJ+/+tOAngRO8cODMNWbVRdVlmGZQL2YS3yR8bIUA=="],
"escape-html": ["escape-html@1.0.3", "", {}, "sha512-NiSupZ4OeuGwr68lGIeym/ksIZMJodUGOSCZ/FSnTxcrekbvqrgdUxlJOMpijaKZVjAJrWrGs/6Jy8OMuyj9ow=="],
"etag": ["etag@1.8.1", "", {}, "sha512-aIL5Fx7mawVa300al2BnEE4iNvo1qETxLrPI/o05L7z6go7fCw1J6EQmbK4FmJ2AS7kgVF/KEZWufBfdClMcPg=="],
"event-target-shim": ["event-target-shim@5.0.1", "", {}, "sha512-i/2XbnSz/uxRCU6+NdVJgKWDTM427+MqYbkQzD321DuCQJUqOuJKIA0IM2+W2xtYHdKOmZ4dR6fExsd4SXL+WQ=="],
"eventsource": ["eventsource@3.0.7", "", { "dependencies": { "eventsource-parser": "^3.0.1" } }, "sha512-CRT1WTyuQoD771GW56XEZFQ/ZoSfWid1alKGDYMmkt2yl8UXrVR4pspqWNEcqKvVIzg6PAltWjxcSSPrboA4iA=="],
"eventsource-parser": ["eventsource-parser@3.0.6", "", {}, "sha512-Vo1ab+QXPzZ4tCa8SwIHJFaSzy4R6SHf7BY79rFBDf0idraZWAkYrDjDj8uWaSm3S2TK+hJ7/t1CEmZ7jXw+pg=="],
"express": ["express@5.2.1", "", { "dependencies": { "accepts": "^2.0.0", "body-parser": "^2.2.1", "content-disposition": "^1.0.0", "content-type": "^1.0.5", "cookie": "^0.7.1", "cookie-signature": "^1.2.1", "debug": "^4.4.0", "depd": "^2.0.0", "encodeurl": "^2.0.0", "escape-html": "^1.0.3", "etag": "^1.8.1", "finalhandler": "^2.1.0", "fresh": "^2.0.0", "http-errors": "^2.0.0", "merge-descriptors": "^2.0.0", "mime-types": "^3.0.0", "on-finished": "^2.4.1", "once": "^1.4.0", "parseurl": "^1.3.3", "proxy-addr": "^2.0.7", "qs": "^6.14.0", "range-parser": "^1.2.1", "router": "^2.2.0", "send": "^1.1.0", "serve-static": "^2.2.0", "statuses": "^2.0.1", "type-is": "^2.0.1", "vary": "^1.1.2" } }, "sha512-hIS4idWWai69NezIdRt2xFVofaF4j+6INOpJlVOLDO8zXGpUVEVzIYk12UUi2JzjEzWL3IOAxcTubgz9Po0yXw=="],
"express-rate-limit": ["express-rate-limit@8.3.2", "", { "dependencies": { "ip-address": "10.1.0" }, "peerDependencies": { "express": ">= 4.11" } }, "sha512-77VmFeJkO0/rvimEDuUC5H30oqUC4EyOhyGccfqoLebB0oiEYfM7nwPrsDsBL1gsTpwfzX8SFy2MT3TDyRq+bg=="],
"fast-deep-equal": ["fast-deep-equal@3.1.3", "", {}, "sha512-f3qQ9oQy9j2AhBe/H9VC91wLmKBCCU/gDOnKNAYG5hswO7BLKj09Hc5HYNz9cGI++xlpDCIgDaitVs03ATR84Q=="],
"fast-uri": ["fast-uri@3.1.0", "", {}, "sha512-iPeeDKJSWf4IEOasVVrknXpaBV0IApz/gp7S2bb7Z4Lljbl2MGJRqInZiUrQwV16cpzw/D3S5j5Julj/gT52AA=="],
"finalhandler": ["finalhandler@2.1.1", "", { "dependencies": { "debug": "^4.4.0", "encodeurl": "^2.0.0", "escape-html": "^1.0.3", "on-finished": "^2.4.1", "parseurl": "^1.3.3", "statuses": "^2.0.1" } }, "sha512-S8KoZgRZN+a5rNwqTxlZZePjT/4cnm0ROV70LedRHZ0p8u9fRID0hJUZQpkKLzro8LfmC8sx23bY6tVNxv8pQA=="],
"form-data": ["form-data@4.0.5", "", { "dependencies": { "asynckit": "^0.4.0", "combined-stream": "^1.0.8", "es-set-tostringtag": "^2.1.0", "hasown": "^2.0.2", "mime-types": "^2.1.12" } }, "sha512-8RipRLol37bNs2bhoV67fiTEvdTrbMUYcFTiy3+wuuOnUog2QBHCZWXDRijWQfAkhBj2Uf5UnVaiWwA5vdd82w=="],
"form-data-encoder": ["form-data-encoder@1.7.2", "", {}, "sha512-qfqtYan3rxrnCk1VYaA4H+Ms9xdpPqvLZa6xmMgFvhO32x7/3J/ExcTd6qpxM0vH2GdMI+poehyBZvqfMTto8A=="],
"formdata-node": ["formdata-node@4.4.1", "", { "dependencies": { "node-domexception": "1.0.0", "web-streams-polyfill": "4.0.0-beta.3" } }, "sha512-0iirZp3uVDjVGt9p49aTaqjk84TrglENEDuqfdlZQ1roC9CWlPk6Avf8EEnZNcAqPonwkG35x4n3ww/1THYAeQ=="],
"forwarded": ["forwarded@0.2.0", "", {}, "sha512-buRG0fpBtRHSTCOASe6hD258tEubFoRLb4ZNA6NxMVHNw2gOcwHo9wyablzMzOA5z9xA9L1KNjk/Nt6MT9aYow=="],
"fresh": ["fresh@2.0.0", "", {}, "sha512-Rx/WycZ60HOaqLKAi6cHRKKI7zxWbJ31MhntmtwMoaTeF7XFH9hhBp8vITaMidfljRQ6eYWCKkaTK+ykVJHP2A=="],
"function-bind": ["function-bind@1.1.2", "", {}, "sha512-7XHNxH7qX9xG5mIwxkhumTox/MIRNcOgDrxWsMt2pAr23WHp6MrRlN7FBSFpCpr+oVO0F744iUgR82nJMfG2SA=="],
"get-intrinsic": ["get-intrinsic@1.3.0", "", { "dependencies": { "call-bind-apply-helpers": "^1.0.2", "es-define-property": "^1.0.1", "es-errors": "^1.3.0", "es-object-atoms": "^1.1.1", "function-bind": "^1.1.2", "get-proto": "^1.0.1", "gopd": "^1.2.0", "has-symbols": "^1.1.0", "hasown": "^2.0.2", "math-intrinsics": "^1.1.0" } }, "sha512-9fSjSaos/fRIVIp+xSJlE6lfwhES7LNtKaCBIamHsjr2na1BiABJPo0mOjjz8GJDURarmCPGqaiVg5mfjb98CQ=="],
"get-proto": ["get-proto@1.0.1", "", { "dependencies": { "dunder-proto": "^1.0.1", "es-object-atoms": "^1.0.0" } }, "sha512-sTSfBjoXBp89JvIKIefqw7U2CCebsc74kiY6awiGogKtoSGbgjYE/G/+l9sF3MWFPNc9IcoOC4ODfKHfxFmp0g=="],
"gopd": ["gopd@1.2.0", "", {}, "sha512-ZUKRh6/kUFoAiTAtTYPZJ3hw9wNxx+BIBOijnlG9PnrJsCcSjs1wyyD6vJpaYtgnzDrKYRSqf3OO6Rfa93xsRg=="],
"has-symbols": ["has-symbols@1.1.0", "", {}, "sha512-1cDNdwJ2Jaohmb3sg4OmKaMBwuC48sYni5HUw2DvsC8LjGTLK9h+eb1X6RyuOHe4hT0ULCW68iomhjUoKUqlPQ=="],
"has-tostringtag": ["has-tostringtag@1.0.2", "", { "dependencies": { "has-symbols": "^1.0.3" } }, "sha512-NqADB8VjPFLM2V0VvHUewwwsw0ZWBaIdgo+ieHtK3hasLz4qeCRjYcqfB6AQrBggRKppKF8L52/VqdVsO47Dlw=="],
"hasown": ["hasown@2.0.2", "", { "dependencies": { "function-bind": "^1.1.2" } }, "sha512-0hJU9SCPvmMzIBdZFqNPXWa6dqh7WdH0cII9y+CyS8rG3nL48Bclra9HmKhVVUHyPWNH5Y7xDwAB7bfgSjkUMQ=="],
"hono": ["hono@4.12.9", "", {}, "sha512-wy3T8Zm2bsEvxKZM5w21VdHDDcwVS1yUFFY6i8UobSsKfFceT7TOwhbhfKsDyx7tYQlmRM5FLpIuYvNFyjctiA=="],
"http-errors": ["http-errors@2.0.1", "", { "dependencies": { "depd": "~2.0.0", "inherits": "~2.0.4", "setprototypeof": "~1.2.0", "statuses": "~2.0.2", "toidentifier": "~1.0.1" } }, "sha512-4FbRdAX+bSdmo4AUFuS0WNiPz8NgFt+r8ThgNWmlrjQjt1Q7ZR9+zTlce2859x4KSXrwIsaeTqDoKQmtP8pLmQ=="],
"humanize-ms": ["humanize-ms@1.2.1", "", { "dependencies": { "ms": "^2.0.0" } }, "sha512-Fl70vYtsAFb/C06PTS9dZBo7ihau+Tu/DNCk/OyHhea07S+aeMWpFFkUaXRa8fI+ScZbEI8dfSxwY7gxZ9SAVQ=="],
"iconv-lite": ["iconv-lite@0.7.2", "", { "dependencies": { "safer-buffer": ">= 2.1.2 < 3.0.0" } }, "sha512-im9DjEDQ55s9fL4EYzOAv0yMqmMBSZp6G0VvFyTMPKWxiSBHUj9NW/qqLmXUwXrrM7AvqSlTCfvqRb0cM8yYqw=="],
"inherits": ["inherits@2.0.4", "", {}, "sha512-k/vGaX4/Yla3WzyMCvTQOXYeIHvqOKtnqBduzTHpzpQZzAskKMhZ2K+EnBiSM9zGSoIFeMpXKxa4dYeZIQqewQ=="],
"ip-address": ["ip-address@10.1.0", "", {}, "sha512-XXADHxXmvT9+CRxhXg56LJovE+bmWnEWB78LB83VZTprKTmaC5QfruXocxzTZ2Kl0DNwKuBdlIhjL8LeY8Sf8Q=="],
"ipaddr.js": ["ipaddr.js@1.9.1", "", {}, "sha512-0KI/607xoxSToH7GjN1FfSbLoU0+btTicjsQSWQlh/hZykN8KpmMf7uYwPW3R+akZ6R/w18ZlXSHBYXiYUPO3g=="],
"is-promise": ["is-promise@4.0.0", "", {}, "sha512-hvpoI6korhJMnej285dSg6nu1+e6uxs7zG3BYAm5byqDsgJNWwxzM6z6iZiAgQR4TJ30JmBTOwqZUw3WlyH3AQ=="],
"isexe": ["isexe@2.0.0", "", {}, "sha512-RHxMLp9lnKHGHRng9QFhRCMbYAcVpn69smSGcq3f36xjgVVWThj4qqLbTLlq7Ssj8B+fIQ1EuCEGI2lKsyQeIw=="],
"jose": ["jose@6.2.2", "", {}, "sha512-d7kPDd34KO/YnzaDOlikGpOurfF0ByC2sEV4cANCtdqLlTfBlw2p14O/5d/zv40gJPbIQxfES3nSx1/oYNyuZQ=="],
"json-schema-to-ts": ["json-schema-to-ts@3.1.1", "", { "dependencies": { "@babel/runtime": "^7.18.3", "ts-algebra": "^2.0.0" } }, "sha512-+DWg8jCJG2TEnpy7kOm/7/AxaYoaRbjVB4LFZLySZlWn8exGs3A4OLJR966cVvU26N7X9TWxl+Jsw7dzAqKT6g=="],
"json-schema-traverse": ["json-schema-traverse@1.0.0", "", {}, "sha512-NM8/P9n3XjXhIZn1lLhkFaACTOURQXjWhV4BA/RnOv8xvgqtqpAX9IO4mRQxSx1Rlo4tqzeqb0sOlruaOy3dug=="],
"json-schema-typed": ["json-schema-typed@8.0.2", "", {}, "sha512-fQhoXdcvc3V28x7C7BMs4P5+kNlgUURe2jmUT1T//oBRMDrqy1QPelJimwZGo7Hg9VPV3EQV5Bnq4hbFy2vetA=="],
"math-intrinsics": ["math-intrinsics@1.1.0", "", {}, "sha512-/IXtbwEk5HTPyEwyKX6hGkYXxM9nbj64B+ilVJnC/R6B0pH5G4V3b0pVbL7DBj4tkhBAppbQUlf6F6Xl9LHu1g=="],
"media-typer": ["media-typer@1.1.0", "", {}, "sha512-aisnrDP4GNe06UcKFnV5bfMNPBUw4jsLGaWwWfnH3v02GnBuXX2MCVn5RbrWo0j3pczUilYblq7fQ7Nw2t5XKw=="],
"merge-descriptors": ["merge-descriptors@2.0.0", "", {}, "sha512-Snk314V5ayFLhp3fkUREub6WtjBfPdCPY1Ln8/8munuLuiYhsABgBVWsozAG+MWMbVEvcdcpbi9R7ww22l9Q3g=="],
"mime-db": ["mime-db@1.54.0", "", {}, "sha512-aU5EJuIN2WDemCcAp2vFBfp/m4EAhWJnUNSSw0ixs7/kXbd6Pg64EmwJkNdFhB8aWt1sH2CTXrLxo/iAGV3oPQ=="],
"mime-types": ["mime-types@3.0.2", "", { "dependencies": { "mime-db": "^1.54.0" } }, "sha512-Lbgzdk0h4juoQ9fCKXW4by0UJqj+nOOrI9MJ1sSj4nI8aI2eo1qmvQEie4VD1glsS250n15LsWsYtCugiStS5A=="],
"ms": ["ms@2.1.3", "", {}, "sha512-6FlzubTLZG3J2a/NVCAleEhjzq5oxgHyaCU9yYXvcLsvoVaHJq/s5xXI6/XXP6tz7R9xAOtHnSO/tXtF3WRTlA=="],
"negotiator": ["negotiator@1.0.0", "", {}, "sha512-8Ofs/AUQh8MaEcrlq5xOX0CQ9ypTF5dl78mjlMNfOK08fzpgTHQRQPBxcPlEtIw0yRpws+Zo/3r+5WRby7u3Gg=="],
"node-domexception": ["node-domexception@1.0.0", "", {}, "sha512-/jKZoMpw0F8GRwl4/eLROPA3cfcXtLApP0QzLmUT/HuPCZWyB7IY9ZrMeKw2O/nFIqPQB3PVM9aYm0F312AXDQ=="],
"node-fetch": ["node-fetch@2.7.0", "", { "dependencies": { "whatwg-url": "^5.0.0" }, "peerDependencies": { "encoding": "^0.1.0" }, "optionalPeers": ["encoding"] }, "sha512-c4FRfUm/dbcWZ7U+1Wq0AwCyFL+3nt2bEw05wfxSz+DWpWsitgmSgYmy2dQdWyKC1694ELPqMs/YzUSNozLt8A=="],
"object-assign": ["object-assign@4.1.1", "", {}, "sha512-rJgTQnkUnH1sFw8yT6VSU3zD3sWmu6sZhIseY8VX+GRu3P6F7Fu+JNDoXfklElbLJSnc3FUQHVe4cU5hj+BcUg=="],
"object-inspect": ["object-inspect@1.13.4", "", {}, "sha512-W67iLl4J2EXEGTbfeHCffrjDfitvLANg0UlX3wFUUSTx92KXRFegMHUVgSqE+wvhAbi4WqjGg9czysTV2Epbew=="],
"on-finished": ["on-finished@2.4.1", "", { "dependencies": { "ee-first": "1.1.1" } }, "sha512-oVlzkg3ENAhCk2zdv7IJwd/QUD4z2RxRwpkcGY8psCVcCYZNq4wYnVWALHM+brtuJjePWiYF/ClmuDr8Ch5+kg=="],
"once": ["once@1.4.0", "", { "dependencies": { "wrappy": "1" } }, "sha512-lNaJgI+2Q5URQBkccEKHTQOPaXdUxnZZElQTZY0MFUAuaEqe1E+Nyvgdz/aIyNi6Z9MzO5dv1H8n58/GELp3+w=="],
"openai": ["openai@6.34.0", "", { "peerDependencies": { "ws": "^8.18.0", "zod": "^3.25 || ^4.0" }, "optionalPeers": ["ws", "zod"], "bin": { "openai": "bin/cli" } }, "sha512-yEr2jdGf4tVFYG6ohmr3pF6VJuveP0EA/sS8TBx+4Eq5NT10alu5zg2dmxMXMgqpihRDQlFGpRt2XwsGj+Fyxw=="],
"parseurl": ["parseurl@1.3.3", "", {}, "sha512-CiyeOxFT/JZyN5m0z9PfXw4SCBJ6Sygz1Dpl0wqjlhDEGGBP1GnsUVEL0p63hoG1fcj3fHynXi9NYO4nWOL+qQ=="],
"path-key": ["path-key@3.1.1", "", {}, "sha512-ojmeN0qd+y0jszEtoY48r0Peq5dwMEkIlCOu6Q5f41lfkswXuKtYrhgoTpLnyIcHm24Uhqx+5Tqm2InSwLhE6Q=="],
"path-to-regexp": ["path-to-regexp@8.4.1", "", {}, "sha512-fvU78fIjZ+SBM9YwCknCvKOUKkLVqtWDVctl0s7xIqfmfb38t2TT4ZU2gHm+Z8xGwgW+QWEU3oQSAzIbo89Ggw=="],
"pkce-challenge": ["pkce-challenge@5.0.1", "", {}, "sha512-wQ0b/W4Fr01qtpHlqSqspcj3EhBvimsdh0KlHhH8HRZnMsEa0ea2fTULOXOS9ccQr3om+GcGRk4e+isrZWV8qQ=="],
"proxy-addr": ["proxy-addr@2.0.7", "", { "dependencies": { "forwarded": "0.2.0", "ipaddr.js": "1.9.1" } }, "sha512-llQsMLSUDUPT44jdrU/O37qlnifitDP+ZwrmmZcoSKyLKvtZxpyV0n2/bD/N4tBAAZ/gJEdZU7KMraoK1+XYAg=="],
"qs": ["qs@6.15.0", "", { "dependencies": { "side-channel": "^1.1.0" } }, "sha512-mAZTtNCeetKMH+pSjrb76NAM8V9a05I9aBZOHztWy/UqcJdQYNsf59vrRKWnojAT9Y+GbIvoTBC++CPHqpDBhQ=="],
"range-parser": ["range-parser@1.2.1", "", {}, "sha512-Hrgsx+orqoygnmhFbKaHE6c296J+HTAQXoxEF6gNupROmmGJRoyzfG3ccAveqCBrwr/2yxQ5BVd/GTl5agOwSg=="],
"raw-body": ["raw-body@3.0.2", "", { "dependencies": { "bytes": "~3.1.2", "http-errors": "~2.0.1", "iconv-lite": "~0.7.0", "unpipe": "~1.0.0" } }, "sha512-K5zQjDllxWkf7Z5xJdV0/B0WTNqx6vxG70zJE4N0kBs4LovmEYWJzQGxC9bS9RAKu3bgM40lrd5zoLJ12MQ5BA=="],
"require-from-string": ["require-from-string@2.0.2", "", {}, "sha512-Xf0nWe6RseziFMu+Ap9biiUbmplq6S9/p+7w7YXP/JBHhrUDDUhwa+vANyubuqfZWTveU//DYVGsDG7RKL/vEw=="],
"router": ["router@2.2.0", "", { "dependencies": { "debug": "^4.4.0", "depd": "^2.0.0", "is-promise": "^4.0.0", "parseurl": "^1.3.3", "path-to-regexp": "^8.0.0" } }, "sha512-nLTrUKm2UyiL7rlhapu/Zl45FwNgkZGaCpZbIHajDYgwlJCOzLSk+cIPAnsEqV955GjILJnKbdQC1nVPz+gAYQ=="],
"safer-buffer": ["safer-buffer@2.1.2", "", {}, "sha512-YZo3K82SD7Riyi0E1EQPojLz7kpepnSQI9IyPbHHg1XXXevb5dJI7tpyN2ADxGcQbHG7vcyRHk0cbwqcQriUtg=="],
"send": ["send@1.2.1", "", { "dependencies": { "debug": "^4.4.3", "encodeurl": "^2.0.0", "escape-html": "^1.0.3", "etag": "^1.8.1", "fresh": "^2.0.0", "http-errors": "^2.0.1", "mime-types": "^3.0.2", "ms": "^2.1.3", "on-finished": "^2.4.1", "range-parser": "^1.2.1", "statuses": "^2.0.2" } }, "sha512-1gnZf7DFcoIcajTjTwjwuDjzuz4PPcY2StKPlsGAQ1+YH20IRVrBaXSWmdjowTJ6u8Rc01PoYOGHXfP1mYcZNQ=="],
"serve-static": ["serve-static@2.2.1", "", { "dependencies": { "encodeurl": "^2.0.0", "escape-html": "^1.0.3", "parseurl": "^1.3.3", "send": "^1.2.0" } }, "sha512-xRXBn0pPqQTVQiC8wyQrKs2MOlX24zQ0POGaj0kultvoOCstBQM5yvOhAVSUwOMjQtTvsPWoNCHfPGwaaQJhTw=="],
"setprototypeof": ["setprototypeof@1.2.0", "", {}, "sha512-E5LDX7Wrp85Kil5bhZv46j8jOeboKq5JMmYM3gVGdGH8xFpPWXUMsNrlODCrkoxMEeNi/XZIwuRvY4XNwYMJpw=="],
"shebang-command": ["shebang-command@2.0.0", "", { "dependencies": { "shebang-regex": "^3.0.0" } }, "sha512-kHxr2zZpYtdmrN1qDjrrX/Z1rR1kG8Dx+gkpK1G4eXmvXswmcE1hTWBWYUzlraYw1/yZp6YuDY77YtvbN0dmDA=="],
"shebang-regex": ["shebang-regex@3.0.0", "", {}, "sha512-7++dFhtcx3353uBaq8DDR4NuxBetBzC7ZQOhmTQInHEd6bSrXdiEyzCvG07Z44UYdLShWUyXt5M/yhz8ekcb1A=="],
"side-channel": ["side-channel@1.1.0", "", { "dependencies": { "es-errors": "^1.3.0", "object-inspect": "^1.13.3", "side-channel-list": "^1.0.0", "side-channel-map": "^1.0.1", "side-channel-weakmap": "^1.0.2" } }, "sha512-ZX99e6tRweoUXqR+VBrslhda51Nh5MTQwou5tnUDgbtyM0dBgmhEDtWGP/xbKn6hqfPRHujUNwz5fy/wbbhnpw=="],
"side-channel-list": ["side-channel-list@1.0.0", "", { "dependencies": { "es-errors": "^1.3.0", "object-inspect": "^1.13.3" } }, "sha512-FCLHtRD/gnpCiCHEiJLOwdmFP+wzCmDEkc9y7NsYxeF4u7Btsn1ZuwgwJGxImImHicJArLP4R0yX4c2KCrMrTA=="],
"side-channel-map": ["side-channel-map@1.0.1", "", { "dependencies": { "call-bound": "^1.0.2", "es-errors": "^1.3.0", "get-intrinsic": "^1.2.5", "object-inspect": "^1.13.3" } }, "sha512-VCjCNfgMsby3tTdo02nbjtM/ewra6jPHmpThenkTYh8pG9ucZ/1P8So4u4FGBek/BjpOVsDCMoLA/iuBKIFXRA=="],
"side-channel-weakmap": ["side-channel-weakmap@1.0.2", "", { "dependencies": { "call-bound": "^1.0.2", "es-errors": "^1.3.0", "get-intrinsic": "^1.2.5", "object-inspect": "^1.13.3", "side-channel-map": "^1.0.1" } }, "sha512-WPS/HvHQTYnHisLo9McqBHOJk2FkHO/tlpvldyrnem4aeQp4hai3gythswg6p01oSoTl58rcpiFAjF2br2Ak2A=="],
"statuses": ["statuses@2.0.2", "", {}, "sha512-DvEy55V3DB7uknRo+4iOGT5fP1slR8wQohVdknigZPMpMstaKJQWhwiYBACJE3Ul2pTnATihhBYnRhZQHGBiRw=="],
"toidentifier": ["toidentifier@1.0.1", "", {}, "sha512-o5sSPKEkg/DIQNmH43V0/uerLrpzVedkUh8tGNvaeXpfpuwjKenlSox/2O/BTlZUtEe+JG7s5YhEz608PlAHRA=="],
"tr46": ["tr46@0.0.3", "", {}, "sha512-N3WMsuqV66lT30CrXNbEjx4GEwlow3v6rr4mCcv6prnfwhS01rkgyFdjPNBYd9br7LpXV1+Emh01fHnq2Gdgrw=="],
"ts-algebra": ["ts-algebra@2.0.0", "", {}, "sha512-FPAhNPFMrkwz76P7cdjdmiShwMynZYN6SgOujD1urY4oNm80Ou9oMdmbR45LotcKOXoy7wSmHkRFE6Mxbrhefw=="],
"type-is": ["type-is@2.0.1", "", { "dependencies": { "content-type": "^1.0.5", "media-typer": "^1.1.0", "mime-types": "^3.0.0" } }, "sha512-OZs6gsjF4vMp32qrCbiVSkrFmXtG/AZhY3t0iAMrMBiAZyV9oALtXO8hsrHbMXF9x6L3grlFuwW2oAz7cav+Gw=="],
"typescript": ["typescript@5.9.3", "", { "bin": { "tsc": "bin/tsc", "tsserver": "bin/tsserver" } }, "sha512-jl1vZzPDinLr9eUt3J/t7V6FgNEw9QjvBPdysz9KfQDD41fQrC2Y4vKQdiaUpFT4bXlb1RHhLpp8wtm6M5TgSw=="],
"undici-types": ["undici-types@5.26.5", "", {}, "sha512-JlCMO+ehdEIKqlFxk6IfVoAUVmgz7cU7zD/h9XZ0qzeosSHmUJVOzSQvvYSYWXkFXC+IfLKSIffhv0sVZup6pA=="],
"unpipe": ["unpipe@1.0.0", "", {}, "sha512-pjy2bYhSsufwWlKwPc+l3cN7+wuJlK6uz0YdJEOlQDbl6jo/YlPi4mb8agUkVC8BF7V8NuzeyPNqRksA3hztKQ=="],
"vary": ["vary@1.1.2", "", {}, "sha512-BNGbWLfd0eUPabhkXUVm0j8uuvREyTh5ovRa/dyow/BqAbZJyC+5fU+IzQOzmAKzYqYRAISoRhdQr3eIZ/PXqg=="],
"web-streams-polyfill": ["web-streams-polyfill@4.0.0-beta.3", "", {}, "sha512-QW95TCTaHmsYfHDybGMwO5IJIM93I/6vTRk+daHTWFPhwh+C8Cg7j7XyKrwrj8Ib6vYXe0ocYNrmzY4xAAN6ug=="],
"webidl-conversions": ["webidl-conversions@3.0.1", "", {}, "sha512-2JAn3z8AR6rjK8Sm8orRC0h/bcl/DqL7tRPdGZ4I1CjdF+EaMLmYxBHyXuKL849eucPFhvBoxMsflfOb8kxaeQ=="],
"whatwg-url": ["whatwg-url@5.0.0", "", { "dependencies": { "tr46": "~0.0.3", "webidl-conversions": "^3.0.0" } }, "sha512-saE57nupxk6v3HY35+jzBwYa0rKSy0XR8JSxZPwgLr7ys0IBzhGviA1/TUGJLmSVqs8pb9AnvICXEuOHLprYTw=="],
"which": ["which@2.0.2", "", { "dependencies": { "isexe": "^2.0.0" }, "bin": { "node-which": "./bin/node-which" } }, "sha512-BLI3Tl1TW3Pvl70l3yq3Y64i+awpwXqsGBYWkkqMtnbXgrMD+yj7rhW0kuEDxzJaYXGjEW5ogapKNMEKNMjibA=="],
"wrappy": ["wrappy@1.0.2", "", {}, "sha512-l4Sp/DRseor9wL6EvV2+TuQn63dMkPjZ/sp9XkghTEbV9KlPS1xUsZ3u7/IQO4wxtcFB4bgpQPRcR3QCvezPcQ=="],
"yaml": ["yaml@2.8.3", "", { "bin": { "yaml": "bin.mjs" } }, "sha512-AvbaCLOO2Otw/lW5bmh9d/WEdcDFdQp2Z2ZUH3pX9U2ihyUY0nvLv7J6TrWowklRGPYbB/IuIMfYgxaCPg5Bpg=="],
"zod": ["zod@4.3.6", "", {}, "sha512-rftlrkhHZOcjDwkGlnUtZZkvaPHCsDATp4pGpuOOMDaTdDDXF91wuVDJoWoPsKX/3YPQ5fHuF3STjcYyKr+Qhg=="],
"zod-to-json-schema": ["zod-to-json-schema@3.25.2", "", { "peerDependencies": { "zod": "^3.25.28 || ^4" } }, "sha512-O/PgfnpT1xKSDeQYSCfRI5Gy3hPf91mKVDuYLUHZJMiDFptvP41MSnWofm8dnCm0256ZNfZIM7DSzuSMAFnjHA=="],
"@anthropic-ai/claude-agent-sdk/@anthropic-ai/sdk": ["@anthropic-ai/sdk@0.74.0", "", { "dependencies": { "json-schema-to-ts": "^3.1.1" }, "peerDependencies": { "zod": "^3.25.0 || ^4.0.0" }, "optionalPeers": ["zod"], "bin": { "anthropic-ai-sdk": "bin/cli" } }, "sha512-srbJV7JKsc5cQ6eVuFzjZO7UR3xEPJqPamHFIe29bs38Ij2IripoAhC0S5NslNbaFUYqBKypmmpzMTpqfHEUDw=="],
"@types/node-fetch/@types/node": ["@types/node@25.5.0", "", { "dependencies": { "undici-types": "~7.18.0" } }, "sha512-jp2P3tQMSxWugkCUKLRPVUpGaL5MVFwF8RDuSRztfwgN1wmqJeMSbKlnEtQqU8UrhTmzEmZdu2I6v2dpp7XIxw=="],
"bun-types/@types/node": ["@types/node@25.5.0", "", { "dependencies": { "undici-types": "~7.18.0" } }, "sha512-jp2P3tQMSxWugkCUKLRPVUpGaL5MVFwF8RDuSRztfwgN1wmqJeMSbKlnEtQqU8UrhTmzEmZdu2I6v2dpp7XIxw=="],
"form-data/mime-types": ["mime-types@2.1.35", "", { "dependencies": { "mime-db": "1.52.0" } }, "sha512-ZDY+bPm5zTTF+YpCrAU9nK0UgICYPT0QtT1NZWFv4s++TNkcgVaT0g6+4R2uI4MjQjzysHB1zxuWL50hzaeXiw=="],
"@types/node-fetch/@types/node/undici-types": ["undici-types@7.18.2", "", {}, "sha512-AsuCzffGHJybSaRrmr5eHr81mwJU3kjw6M+uprWvCXiNeN9SOGwQ3Jn8jb8m3Z6izVgknn1R0FTCEAP2QrLY/w=="],
"bun-types/@types/node/undici-types": ["undici-types@7.18.2", "", {}, "sha512-AsuCzffGHJybSaRrmr5eHr81mwJU3kjw6M+uprWvCXiNeN9SOGwQ3Jn8jb8m3Z6izVgknn1R0FTCEAP2QrLY/w=="],
"form-data/mime-types/mime-db": ["mime-db@1.52.0", "", {}, "sha512-sPU4uV7dYlvtWJxwwxHD0PuihVNiE7TyAbQ5SWxDCB9mUYvOgroQOwYQQOKPJ8CIbE+1ETVlOoK1UC2nU3gYvg=="],
}
}

93
ai_evals/cases/app.yaml Normal file

@@ -0,0 +1,93 @@
- id: app-test1-counter-create
prompt: |-
Create a simple counter app with increment and decrement buttons.
judgeChecklist:
- shows the current count in the UI
- includes an increment button
- includes a decrement button
- clicking the buttons updates the count correctly
- id: app-test2-counter-reset
prompt: |-
Add a reset button that sets the counter back to 0
initial: ai_evals/fixtures/frontend/app/initial/test1_counter_app
judgeChecklist:
- adds a reset control to the existing counter app
- clicking reset sets the count back to 0
- keeps the existing increment and decrement behavior working
- id: app-test3-shopping-cart-quantity
prompt: |-
Add a quantity selector (+ and - buttons) to each cart item so users can adjust quantities without removing and re-adding items
initial: ai_evals/fixtures/frontend/app/initial/shopping_cart
judgeChecklist:
- each cart item has visible plus and minus quantity controls
- users can increase quantity without re-adding the product
- users can decrease quantity from the cart UI
- cart totals stay in sync with quantity changes
- id: app-test4-shopping-cart-discount
prompt: |-
Add a discount code input field in the cart.
When the code "SAVE10" is entered, apply a 10% discount to the total
initial: ai_evals/fixtures/frontend/app/initial/shopping_cart
judgeChecklist:
- adds a discount code input to the cart
- recognizes the code SAVE10
- applies a 10 percent discount to the displayed total
- keeps the rest of the cart behavior intact
- id: app-test5-file-manager-search
prompt: |-
Add a search bar in the toolbar that filters files and folders by name as the user types
initial: ai_evals/fixtures/frontend/app/initial/file_manager
judgeChecklist:
- adds a search input in the toolbar
- filters files and folders by name as the user types
- updates the visible file list from the search query
- keeps the rest of the file manager usable
- id: app-test6-file-manager-inline-rename
prompt: |-
Let users rename files and folders directly from the file list without leaving the page.
initial: ai_evals/fixtures/frontend/app/initial/file_manager
judgeChecklist:
- adds a visible rename action or inline edit mode in the file list
- lets users edit an item's name directly from the list
- saves the renamed item through the app's existing rename behavior
- refreshes the displayed name after a successful rename
- id: app-test7-file-manager-select-all
prompt: |-
Add a "Select All" checkbox in the file list header and individual checkboxes for each file.
Add a "Delete Selected" button that appears when items are selected
initial: ai_evals/fixtures/frontend/app/initial/file_manager
judgeChecklist:
- adds a select-all control in the file list header
- adds per-item selection controls
- shows a delete-selected action only when there is a selection
- deleting selected items updates the visible list
- id: app-test8-inventory-tracker-create
prompt: |-
Create an inventory tracker app for a small store.
Users should be able to add items with a name, sku, quantity, and price, search items by name or sku, and delete items.
The inventory should persist between sessions.
judgeChecklist:
- includes a form to add inventory items with name, sku, quantity, and price
- shows a list or table of saved inventory items
- supports searching or filtering by name or sku
- lets users delete existing inventory items
- persists the inventory data appropriately for a raw Windmill app
- id: app-test9-recipe-book-create
prompt: |-
Create a recipe book app where users can add recipes with a name, ingredients list, and instructions.
Include a search bar to filter recipes by name and the ability to delete recipes.
Recipes should persist between sessions.
judgeChecklist:
- includes a form to add recipes with name, ingredients, and instructions
- shows saved recipes in the app
- supports searching recipes by name
- lets users delete recipes
- persists recipes appropriately for a raw Windmill app

66
ai_evals/cases/cli.yaml Normal file

@@ -0,0 +1,66 @@
- id: bun-hello-script
prompt: |-
Create a Windmill Bun script at `f/evals/hello.ts`.
It should take a `name` input and return a greeting object like `{ greeting: "Hello, Alice!" }`.
expected: ai_evals/fixtures/cli/expected/bun-hello-script
judgeChecklist:
- creates the requested Bun script at f/evals/hello.ts
- takes a name input
- returns an object containing the greeting
- id: bun-hello-flow
prompt: |-
Create a Windmill flow at `f/evals/hello__flow`.
It should take a `name` input and return a greeting object like `{ greeting: "Hello, Alice!" }`.
Put the step code in `hello.ts`.
expected: ai_evals/fixtures/cli/expected/bun-hello-flow
judgeChecklist:
- creates the requested flow folder with flow.yaml and hello.ts
- wires the name input into the flow step
- returns the greeting object
- id: python-add-numbers-script
prompt: |-
Add a Windmill Python script at `f/evals/add_numbers.py`.
It should take `a` and `b` as inputs and return `{ "total": a + b }`.
expected: ai_evals/fixtures/cli/expected/python-add-numbers-script
judgeChecklist:
- creates the requested Python script at f/evals/add_numbers.py
- takes `a` and `b` as inputs
- returns an object with total equal to a plus b
- id: bun-hello-script-uppercase
prompt: |-
Update `f/evals/hello.ts` so it accepts an optional `uppercase` boolean.
Keep returning `{ greeting: ... }`, but when `uppercase` is true the greeting should be uppercased before returning it.
initial: ai_evals/fixtures/cli/initial/bun-hello-script-uppercase
expected: ai_evals/fixtures/cli/expected/bun-hello-script-uppercase
judgeChecklist:
- updates the existing hello.ts file rather than creating a new script
- accepts an optional uppercase boolean input
- keeps returning an object with greeting
- uppercases the greeting when uppercase is true
- id: bun-hello-flow-punctuation
prompt: |-
Update the existing flow in `f/evals/hello__flow` so it also accepts an optional `punctuation` input.
The greeting should use that punctuation and default to `!` when it is missing.
initial: ai_evals/fixtures/cli/initial/bun-hello-flow-punctuation
expected: ai_evals/fixtures/cli/expected/bun-hello-flow-punctuation
judgeChecklist:
- updates the existing hello flow instead of creating a new one
- adds an optional punctuation input to the flow
- updates the step code so the returned greeting uses punctuation
- defaults punctuation to an exclamation mark when omitted
- id: flow-reuse-existing-script
prompt: |-
There is already a reusable greeting script at `f/lib/format_greeting.ts`.
Create a flow at `f/evals/reuse_greeting__flow` that takes a `name` input and reuses that existing script instead of duplicating the logic inline.
initial: ai_evals/fixtures/cli/initial/flow-reuse-existing-script
expected: ai_evals/fixtures/cli/expected/flow-reuse-existing-script
judgeChecklist:
- creates the requested flow at f/evals/reuse_greeting__flow
- reuses the existing script from f/lib by path
- does not duplicate the greeting logic in a new inline script
- wires the name input into the reused script

246
ai_evals/cases/flow.yaml Normal file

@@ -0,0 +1,246 @@
- id: flow-test0-sum-two-numbers
prompt: |-
Create a flow that takes two numbers, `a` and `b`, and returns their sum.
Keep it simple and use a single step named `sum_numbers`.
expected: ai_evals/fixtures/frontend/flow/expected/test0_sum_two_numbers.json
judgeChecklist:
- "the flow takes `a` and `b` as inputs"
- "the main step is named `sum_numbers`"
- the flow returns the sum of the two numbers
- id: flow-test1-reuse-existing-script
prompt: |-
I need a flow that adds two numbers.
If there is already a script in the workspace that does that, reuse it instead of rewriting the logic.
The flow should take `a` and `b` as inputs and use a single step named `sum_numbers`.
initial: ai_evals/fixtures/frontend/flow/initial/test1_reuse_existing_script_initial.json
expected: ai_evals/fixtures/frontend/flow/expected/test1_reuse_existing_script.json
judgeChecklist:
- "the flow takes `a` and `b` as inputs"
- "the main step is named `sum_numbers`"
- the flow reuses the existing workspace script instead of rewriting the addition logic
- id: flow-test2-call-existing-subflow
prompt: |-
Create a parent flow that adds two numbers by reusing an existing flow in the workspace if one already exists.
The parent flow should take `a` and `b` as inputs and delegate the calculation instead of inlining it.
Use a single step named `call_add_numbers`.
initial: ai_evals/fixtures/frontend/flow/initial/test2_call_existing_subflow_initial.json
expected: ai_evals/fixtures/frontend/flow/expected/test2_call_existing_subflow.json
judgeChecklist:
- "the parent flow takes `a` and `b` as inputs"
- "the main step is named `call_add_numbers`"
- the parent flow delegates to an existing workspace subflow instead of inlining the addition logic
- id: flow-test3-branchone-routing
prompt: |-
Create a flow that routes incoming support requests based on the customer's tier.
The input should contain a string field named `tier`.
Free, pro, and enterprise requests should go to different queues, and unknown tiers should fall back to a default queue.
Name the main routing step `route_by_tier`.
expected: ai_evals/fixtures/frontend/flow/expected/test3_branchone_routing.json
judgeChecklist:
- "the input schema includes a string field named `tier`"
- "the main routing step is named `route_by_tier`"
- free requests go to a free queue
- pro requests go to a pro queue
- enterprise requests go to an enterprise queue
- unknown tiers fall back to a default queue
- id: flow-test4-order-processing-loop
  prompt: |-
    Build an order-processing flow.
    The input should include an order with:
    - an `items` array containing `name`, `price`, and `quantity`
    - `customer_email`
    - `shipping_address`
    The flow should:
    - validate that every item has a positive price and quantity
    - calculate the order total with 8% tax
    - check inventory for each item using placeholder availability data
    - create a shipment if everything is in stock, otherwise create a backorder
    - send a confirmation using placeholder email logic
    - return a final order summary with the status
  validate:
    schemaAnyOf:
      - requiredPaths:
          - order
          - order.items
          - order.customer_email
          - order.shipping_address
      - requiredPaths:
          - items
          - customer_email
          - shipping_address
    resolveResultsRefs: true
  judgeChecklist:
    - the flow validates that every item has a positive price and quantity
    - the flow calculates the order total with 8% tax
    - the flow checks inventory for each item using placeholder availability data
    - the flow creates a shipment if everything is in stock, otherwise a backorder
    - the flow sends a confirmation using placeholder email logic
    - the flow returns a final order summary with the resulting status
- id: flow-test5-parallel-data-pipeline
prompt: |-
Create a data-processing flow for three external data sources.
It should:
- load a small placeholder configuration listing the three sources
- fetch placeholder records from each source
- clean and validate each source's records
- combine everything into one dataset
- compute an overall quality score
- store the result differently depending on the score:
- 90 or above goes to the primary database
- 70 to 89 goes to a secondary database with a warning
- below 70 goes to quarantine and triggers an alert
- return a processing report with total records, quality score, and destination
judgeChecklist:
- the flow loads a placeholder configuration listing three external sources
- the flow fetches placeholder records from each source
- the flow cleans and validates each source's records
- the flow combines everything into one dataset
- the flow computes an overall quality score
- scores of 90 or above go to the primary database
- scores from 70 to 89 go to a secondary database with a warning
- scores below 70 go to quarantine and trigger an alert
- the final report includes total records, quality score, and destination
- id: flow-test6-ai-agent-tools
prompt: |-
Create a customer support flow.
The input should include `customer_id` and `query_text`.
The flow should load the customer's profile and order history, then use an AI assistant to help with the request.
The assistant should be able to:
- look up orders
- check refund eligibility
- search FAQs
- open a support ticket when needed
After that, log the interaction and return the assistant's response along with any actions it took.
judgeChecklist:
- "the input schema includes `customer_id` and `query_text`"
- the flow loads the customer's profile and order history
- the flow uses an AI assistant step
- the assistant can look up orders
- the assistant can check refund eligibility
- the assistant can search FAQs
- the assistant can open a support ticket
- the flow logs the interaction
- the final output returns the assistant response along with any actions taken or resulting support action details
- id: flow-test7-simple-modification
prompt: |-
Update this flow so it validates processed data before saving it.
After `process_data`, add a `validate_data` step that checks the data array is not empty.
If the array is empty, it should return an error object with the message `No data to save`.
If validation passes, let the save continue normally.
Update `save_results` so it handles the validation result correctly.
initial: ai_evals/fixtures/frontend/flow/initial/test5_initial.json
expected: ai_evals/fixtures/frontend/flow/expected/test5_modify_simple.json
judgeChecklist:
- the updated flow keeps the original fetch and process steps intact
- "a `validate_data` step is added after `process_data`"
- "`validate_data` checks that the processed data array is not empty"
- "empty data returns an error object with the message `No data to save`"
- "`save_results` handles the validation result correctly"
- id: flow-test8-branching-in-loop
prompt: |-
Update the order-processing logic inside `loop_orders` so different order types are handled differently.
For `express`, mark the order as priority and use a shipping cost of $15.99.
For `standard`, use a shipping cost of $5.99.
For `pickup`, mark it as no shipping required with a cost of $0.
Keep the existing processing as a fallback for unknown order types.
Each path should return the orderId, shipping cost, and shipping type.
initial: ai_evals/fixtures/frontend/flow/initial/test6_initial.json
judgeChecklist:
- "the existing `loop_orders` flow still handles per-order processing"
- exact branching topology is not required as long as `loop_orders` handles the order types correctly
- express orders are marked as priority and use a shipping cost of 15.99
- standard orders use a shipping cost of 5.99
- pickup orders use a shipping cost of 0 and are treated as no shipping required
- unknown order types still follow a fallback path
- "each processed order returns `orderId`, `shippingCost`, and `shippingType`"
- id: flow-test9-parallel-refactor
prompt: |-
Refactor this flow so the enrichment work no longer runs one step at a time.
`enrich_price`, `enrich_inventory`, and `enrich_reviews` should run independently.
Each one should return a fallback value if it fails.
Update `combine_data` so it merges the enrichment results and sets a `hasFallbacks` flag when any fallback was used.
Keep `get_item` as the first step and `return_result` as the last step.
initial: ai_evals/fixtures/frontend/flow/initial/test7_initial.json
expected: ai_evals/fixtures/frontend/flow/expected/test7_modify_complex.json
judgeChecklist:
- "the updated flow keeps `get_item` as the first step"
- "the updated flow keeps `return_result` as the last step"
- "`enrich_price`, `enrich_inventory`, and `enrich_reviews` run independently rather than sequentially"
- each enrichment path returns a fallback value if it fails
- "`combine_data` merges the enrichment results"
- "`combine_data` sets `hasFallbacks` when any fallback was used"
- id: flow-test10-while-loop-counter
prompt: |-
Create a flow that keeps incrementing a counter until it reaches a target value.
The input should include a number field named `target`.
Name the looping step `count_until_target`.
Once the target is reached, return the final counter value.
expected: ai_evals/fixtures/frontend/flow/expected/test10_while_loop_counter.json
judgeChecklist:
- "the input schema includes a number field named `target`"
- "the looping step is named `count_until_target`"
- the flow keeps incrementing a counter until the target is reached
- the final output returns the final counter value
- id: flow-test11-preprocessor-and-failure-handler
prompt: |-
Create an event-processing flow for a string payload.
Before the main processing runs, trim the payload and reject empty strings.
The main step should be named `process_event` and return a simple success object.
If anything fails, return a compact error object with the error message and the failing step id.
expected: ai_evals/fixtures/frontend/flow/expected/test11_preprocessor_failure.json
validate:
requireSpecialModules:
- preprocessor_module
- failure_module
judgeChecklist:
- the flow trims the payload before the main processing runs
- the flow rejects empty payload strings
- "the main step is named `process_event`"
- "`process_event` returns a simple success object"
- failures return a compact error object with the error message and failing step id
- id: flow-test12-approval-step
prompt: |-
Create a purchase approval flow.
The input should include `requester_email` and `amount`.
Add an approval step named `request_approval` that pauses the flow and asks the approver for a comment.
One approval should be enough to continue.
After approval, add a final step named `finalize_purchase` that returns an approved status object.
expected: ai_evals/fixtures/frontend/flow/expected/test12_approval_step.json
validate:
schemaRequiredPaths:
- requester_email
- amount
requireSuspendSteps:
- id: request_approval
requiredEvents: 1
resumeRequiredStringFieldAnyOf:
- comment
- approver_comment
judgeChecklist:
- "the flow includes an approval step named `request_approval`"
- "`request_approval` pauses the flow and asks the approver for a comment"
- one approval is enough to continue
- "the flow includes a final step named `finalize_purchase`"
- "`finalize_purchase` returns an approved status object after approval"
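
The `validate` blocks in this file (`schemaAnyOf`, `requiredPaths`, `schemaRequiredPaths`) are declarative, and the code that evaluates them is not part of this excerpt. The following is a minimal sketch of one plausible reading, assuming dotted paths such as `order.items` are resolved through nested JSON-schema `properties` and that `schemaAnyOf` passes when at least one alternative has all of its paths. `JsonSchema`, `hasSchemaPath`, and `matchesAnyOf` are hypothetical names, not the repository's validators.

```typescript
// Hypothetical helpers for illustration; the real validator in ai_evals may differ.
interface JsonSchema {
  properties?: Record<string, JsonSchema>;
}

// Walk a dotted path such as "order.items" through nested `properties`.
function hasSchemaPath(schema: JsonSchema, dottedPath: string): boolean {
  let current: JsonSchema | undefined = schema;
  for (const segment of dottedPath.split(".")) {
    current = current?.properties?.[segment];
    if (!current) {
      return false;
    }
  }
  return true;
}

// Assumed semantics: at least one alternative must have every required path.
function matchesAnyOf(
  schema: JsonSchema,
  alternatives: Array<{ requiredPaths: string[] }>
): boolean {
  return alternatives.some((alternative) =>
    alternative.requiredPaths.every((requiredPath) => hasSchemaPath(schema, requiredPath))
  );
}

// A flow whose inputs are flat (no wrapping `order` object).
const flatSchema: JsonSchema = {
  properties: { items: {}, customer_email: {}, shipping_address: {} },
};
const alternatives = [
  {
    requiredPaths: [
      "order",
      "order.items",
      "order.customer_email",
      "order.shipping_address",
    ],
  },
  { requiredPaths: ["items", "customer_email", "shipping_address"] },
];
const flatSchemaMatches = matchesAnyOf(flatSchema, alternatives);
```

Under these assumptions the two alternatives in `flow-test4-order-processing-loop` let a generated flow either nest its inputs under `order` or expose them at the top level, and both shapes would be accepted.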


@@ -0,0 +1,11 @@
- id: script-test1-greet-user
  prompt: |-
    Update the current Bun script so it takes the existing `name` input and returns a plain greeting string like `Hello, Alice!`.
    Do not wrap the result in an object or array.
    Keep it simple and do not add external dependencies.
  initial: ai_evals/fixtures/frontend/script/initial/test1_empty_bun.json
  expected: ai_evals/fixtures/frontend/script/expected/test1_greet_user.json
  judgeChecklist:
    - uses the existing `name` input
    - returns a plain greeting string
    - does not wrap the result in an object or array

295
ai_evals/cli/index.ts Normal file

@@ -0,0 +1,295 @@
#!/usr/bin/env bun
import { Command, InvalidArgumentError } from "commander";
import { loadCases, loadSelectedCases } from "../core/cases";
import {
EVAL_MODELS,
type EvalModelSpec,
formatRunModelLabel,
getCliEvalModel,
getEvalModelHelpText,
resolveEvalModel,
} from "../core/models";
import {
appendHistoryRecord,
buildRunResult,
formatRunSummary,
resolveRunOutputPath,
writeRunArtifacts,
writeRunResult,
} from "../core/results";
import { runSuite } from "../core/runSuite";
import { EVAL_MODES, type EvalMode } from "../core/types";
import { DEFAULT_JUDGE_MODEL } from "../core/judge";
import { createCliModeRunner } from "../modes/cli";
import { runFrontendBenchmarkAdapter } from "../adapters/frontend/runtime";
async function main() {
const program = new Command()
.name("bun run cli --")
.description("Run AI eval cases against the current production prompts and guidance")
.showHelpAfterError()
.showSuggestionAfterError()
.addHelpText(
"after",
[
"",
"Examples:",
" bun run cli -- models",
" bun run cli -- cases",
" bun run cli -- cases flow",
" bun run cli -- run flow",
" bun run cli -- run flow --model 4o",
" bun run cli -- run flow --models haiku,opus,4o",
" bun run cli -- run flow flow-test0-sum-two-numbers --verbose",
" bun run cli -- run flow --record",
" bun run cli -- run flow flow-test7-simple-modification --runs 3",
" bun run cli -- run cli bun-hello-script",
"",
"Models:",
getEvalModelHelpText(),
].join("\n")
);
program
.command("models")
.description("List available model aliases")
.action(() => {
handleModels();
});
program
.command("cases")
.description("List available cases")
.argument("[mode]", "cli, flow, script, or app", parseOptionalMode)
.action(async (mode?: EvalMode) => {
await handleCases(mode);
});
program
.command("run")
.description("Run one benchmark mode")
.argument("<mode>", "cli, flow, script, or app", parseMode)
.argument("[caseIds...]", "specific case ids to run")
.option("--runs <n>", "number of attempts per case", parsePositiveInteger, 1)
.option("--output <path>", "write the result JSON to this path")
.option("--model <name>", `model alias (${EVAL_MODELS.map((entry) => entry.id).join(", ")})`)
.option("--models <names>", "comma-separated model aliases to run sequentially")
.option("--verbose", "stream assistant output during frontend runs")
.option("--record", "append a compact summary line to ai_evals/history/<mode>.jsonl")
.action(
async (
mode: EvalMode,
caseIds: string[],
options: {
runs: number;
output?: string;
model?: string;
models?: string;
verbose?: boolean;
record?: boolean;
}
) => {
await handleRun({
mode,
caseIds,
runs: options.runs,
outputPath: options.output,
model: options.model,
models: options.models,
verbose: options.verbose ?? false,
record: options.record ?? false,
});
}
);
await program.parseAsync(process.argv);
}
async function handleCases(mode?: EvalMode) {
const modes = mode ? [mode] : [...EVAL_MODES];
for (const entry of modes) {
const cases = await loadCases(entry);
process.stdout.write(`${entry} (${cases.length})\n`);
for (const evalCase of cases) {
process.stdout.write(`- ${evalCase.id}\n`);
}
process.stdout.write("\n");
}
}
function handleModels() {
process.stdout.write("Available models\n");
for (const model of EVAL_MODELS) {
const supports = [
...(model.frontend ? ["flow", "script", "app"] : []),
...(model.cli ? ["cli"] : []),
];
const aliases = [model.id, ...model.aliases.filter((alias) => alias !== model.id)];
process.stdout.write(`- ${model.id}: ${model.label}\n`);
process.stdout.write(` aliases: ${aliases.join(", ")}\n`);
process.stdout.write(` modes: ${supports.join(", ")}\n`);
}
process.stdout.write(`\nJudge model: ${DEFAULT_JUDGE_MODEL}\n`);
}
async function handleRun(input: {
mode: EvalMode;
caseIds: string[];
runs: number;
outputPath?: string;
model?: string;
models?: string;
verbose: boolean;
record: boolean;
}) {
if (input.record && input.caseIds.length > 0) {
throw new Error("--record only supports full-suite runs; omit case ids to record history");
}
if (input.model && input.models) {
throw new Error("Use either --model or --models, not both");
}
const selectedCases = await loadSelectedCases(input.mode, input.caseIds);
const models = resolveRequestedModels(input.mode, input.model, input.models);
if (input.outputPath && models.length > 1) {
throw new Error("--output only supports a single model run");
}
const summaries: Array<{ label: string; passRate: number; averageDurationMs: number }> = [];
for (const [index, model] of models.entries()) {
const runModel = formatRunModelLabel(input.mode, model);
if (models.length > 1) {
process.stdout.write(
`${index > 0 ? "\n" : ""}=== ${input.mode} ${model.id} (${runModel}) ===\n`
);
}
process.stderr.write(`Starting ${input.mode} benchmark...\n`);
const result =
input.mode === "cli"
? await runCliBenchmark(selectedCases, input.runs, getCliEvalModel(model), runModel)
: await runFrontendBenchmarkAdapter({
mode: input.mode,
caseIds: input.caseIds,
runs: input.runs,
model: model.id,
verbose: input.verbose,
});
const resolvedOutputPath =
models.length === 1
? resolveRunOutputPath(input.mode, input.outputPath)
: resolveRunOutputPath(input.mode);
const artifactsPath = await writeRunArtifacts(result, resolvedOutputPath);
const resultPath = await writeRunResult(result, resolvedOutputPath);
const historyPath = input.record ? await appendHistoryRecord(result) : null;
process.stdout.write(`${formatRunSummary(result)}\n`);
process.stdout.write(`Saved: ${resultPath}\n`);
if (artifactsPath) {
process.stdout.write(`Artifacts: ${artifactsPath}\n`);
}
if (historyPath) {
process.stdout.write(`Recorded: ${historyPath}\n`);
}
summaries.push({
label: `${model.id} (${runModel})`,
passRate: result.passRate,
averageDurationMs: result.averageDurationMs,
});
}
if (summaries.length > 1) {
process.stdout.write("\nModel summary\n");
for (const summary of summaries) {
process.stdout.write(
`- ${summary.label}: ${formatPercent(summary.passRate)} | ${Math.round(summary.averageDurationMs)}ms\n`
);
}
}
}
async function runCliBenchmark(
cases: Awaited<ReturnType<typeof loadSelectedCases>>,
runs: number,
model: ReturnType<typeof getCliEvalModel>,
runModel: string
) {
const caseResults = await runSuite({
modeRunner: createCliModeRunner(model),
cases,
runs,
runModel,
judgeModel: DEFAULT_JUDGE_MODEL,
});
return buildRunResult({
mode: "cli",
runs,
runModel,
judgeModel: DEFAULT_JUDGE_MODEL,
caseResults,
});
}
function parseMode(value: string): EvalMode {
if (EVAL_MODES.includes(value as EvalMode)) {
return value as EvalMode;
}
throw new InvalidArgumentError(`mode must be one of: ${EVAL_MODES.join(", ")}`);
}
function parseOptionalMode(value: string | undefined): EvalMode | undefined {
return value ? parseMode(value) : undefined;
}
function parsePositiveInteger(value: string): number {
const parsed = Number(value);
if (!Number.isInteger(parsed) || parsed <= 0) {
throw new InvalidArgumentError("must be a positive integer");
}
return parsed;
}
function resolveRequestedModels(
mode: EvalMode,
singleModel?: string,
multipleModels?: string
): EvalModelSpec[] {
if (!multipleModels) {
return [resolveEvalModel(mode, singleModel)];
}
const aliases = multipleModels
.split(",")
.map((value) => value.trim())
.filter(Boolean);
if (aliases.length === 0) {
throw new Error("--models requires at least one model alias");
}
const seen = new Set<string>();
const models: EvalModelSpec[] = [];
for (const alias of aliases) {
const model = resolveEvalModel(mode, alias);
if (seen.has(model.id)) {
continue;
}
seen.add(model.id);
models.push(model);
}
return models;
}
function formatPercent(value: number): string {
return `${(value * 100).toFixed(1)}%`;
}
void main().catch((error) => {
const message = error instanceof Error ? error.message : String(error);
process.stderr.write(`${message}\n`);
process.exit(1);
});
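
`resolveRequestedModels` above splits the `--models` value on commas, drops empty entries, and deduplicates by resolved model id, so two aliases of the same model run only once. A self-contained sketch of that behavior follows. The alias table `ALIASES` and the helper `resolveAlias` are made-up stand-ins for `resolveEvalModel`, whose implementation is not shown in this excerpt.

```typescript
// Illustrative alias table only; the real aliases live in ../core/models.
const ALIASES: Record<string, string> = {
  haiku: "haiku",
  "claude-haiku": "haiku",
  opus: "opus",
};

// Stand-in for resolveEvalModel: map an alias to a canonical id or throw.
function resolveAlias(alias: string): string {
  const id = ALIASES[alias];
  if (!id) {
    throw new Error(`Unknown model alias: ${alias}`);
  }
  return id;
}

// Split a comma-separated list, ignore blanks, keep the first occurrence of each id.
function parseModelList(raw: string): string[] {
  const aliases = raw
    .split(",")
    .map((value) => value.trim())
    .filter(Boolean);
  if (aliases.length === 0) {
    throw new Error("--models requires at least one model alias");
  }
  const seen = new Set<string>();
  const ids: string[] = [];
  for (const alias of aliases) {
    const id = resolveAlias(alias);
    if (!seen.has(id)) {
      seen.add(id);
      ids.push(id);
    }
  }
  return ids;
}

// "haiku" and "claude-haiku" collapse into one run; the empty entry is dropped.
const parsedModels = parseModelList("haiku, claude-haiku,,opus");
```

With this table, `parsedModels` is `["haiku", "opus"]`. Deduplicating on the resolved id rather than the raw alias is what keeps the per-model summary from listing the same model twice.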

71
ai_evals/core/cases.ts Normal file

@@ -0,0 +1,71 @@
import { readFile } from "node:fs/promises";
import path from "node:path";
import { fileURLToPath } from "node:url";
import { parse } from "yaml";
import type { EvalCase, EvalMode, FlowValidationSpec } from "./types";
const REPO_ROOT = fileURLToPath(new URL("../../", import.meta.url));
const CASES_DIR = path.join(REPO_ROOT, "ai_evals", "cases");
interface RawEvalCase {
id: string;
prompt: string;
initial?: string;
expected?: string;
validate?: FlowValidationSpec;
judgeChecklist?: string[];
}
export function getRepoRoot(): string {
return REPO_ROOT;
}
export function getAiEvalsRoot(): string {
return path.join(REPO_ROOT, "ai_evals");
}
export async function loadCases(mode: EvalMode): Promise<EvalCase[]> {
const filePath = path.join(CASES_DIR, `${mode}.yaml`);
const raw = await readFile(filePath, "utf8");
const parsed = parse(raw);
if (!Array.isArray(parsed)) {
throw new Error(`Expected ${filePath} to contain a YAML list of cases`);
}
// Entries are cast to RawEvalCase without per-field validation.
return (parsed as RawEvalCase[]).map((entry) => ({
id: entry.id,
prompt: entry.prompt,
initialPath: resolveFixturePath(entry.initial),
expectedPath: resolveFixturePath(entry.expected),
validate: entry.validate,
judgeChecklist: entry.judgeChecklist,
}));
}
export async function loadSelectedCases(
mode: EvalMode,
selectedIds: string[]
): Promise<EvalCase[]> {
const allCases = await loadCases(mode);
if (selectedIds.length === 0) {
return allCases;
}
const caseMap = new Map(allCases.map((entry) => [entry.id, entry]));
const missing = selectedIds.filter((id) => !caseMap.has(id));
if (missing.length > 0) {
throw new Error(
`Unknown ${mode} case${missing.length === 1 ? "" : "s"}: ${missing.join(", ")}`
);
}
return selectedIds.map((id) => caseMap.get(id)!);
}
function resolveFixturePath(value: string | undefined): string | undefined {
if (!value) {
return undefined;
}
return path.isAbsolute(value) ? value : path.join(REPO_ROOT, value);
}
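
`loadSelectedCases` returns every case when no ids are given, rejects unknown ids in a single error, and otherwise returns the cases in the order the ids were requested rather than in file order. The sketch below reproduces that selection logic on in-memory data so it can be read without the YAML loader. `CaseLike` and `selectCases` are illustrative names that do not exist in the repository.

```typescript
// Minimal shape needed for selection; the real EvalCase has more fields.
interface CaseLike {
  id: string;
}

// Same selection rules as loadSelectedCases, minus file loading.
function selectCases<T extends CaseLike>(allCases: T[], selectedIds: string[]): T[] {
  if (selectedIds.length === 0) {
    return allCases;
  }
  const caseMap = new Map(allCases.map((entry) => [entry.id, entry] as const));
  const missing = selectedIds.filter((id) => !caseMap.has(id));
  if (missing.length > 0) {
    throw new Error(
      `Unknown case${missing.length === 1 ? "" : "s"}: ${missing.join(", ")}`
    );
  }
  // Requested order wins over file order.
  return selectedIds.map((id) => caseMap.get(id)!);
}

const sampleCases = [
  { id: "bun-hello-script" },
  { id: "bun-hello-flow" },
  { id: "python-add-numbers-script" },
];
const picked = selectCases(sampleCases, ["python-add-numbers-script", "bun-hello-script"]);
```

Here `picked` holds `python-add-numbers-script` first and `bun-hello-script` second, matching the argument order on the command line.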

ai_evals/core/files.ts Normal file

@@ -0,0 +1,67 @@
import { access, copyFile, mkdir, readdir, readFile } from "node:fs/promises";
import path from "node:path";
export async function exists(filePath: string): Promise<boolean> {
try {
await access(filePath);
return true;
} catch {
return false;
}
}
export async function readJsonFile<T>(filePath: string): Promise<T> {
const raw = await readFile(filePath, "utf8");
return JSON.parse(raw) as T;
}
export async function readDirectoryFiles(
rootDir: string,
options: {
ignore?: Set<string>;
} = {}
): Promise<Record<string, string>> {
const files: Record<string, string> = {};
await walkDirectory(rootDir, "", files, options.ignore ?? new Set());
return files;
}
export async function copyDirectory(sourceDir: string, targetDir: string): Promise<void> {
const entries = await readdir(sourceDir, { withFileTypes: true });
await mkdir(targetDir, { recursive: true });
for (const entry of entries) {
const sourcePath = path.join(sourceDir, entry.name);
const targetPath = path.join(targetDir, entry.name);
if (entry.isDirectory()) {
await copyDirectory(sourcePath, targetPath);
continue;
}
await mkdir(path.dirname(targetPath), { recursive: true });
await copyFile(sourcePath, targetPath);
}
}
async function walkDirectory(
absoluteDir: string,
relativeDir: string,
output: Record<string, string>,
ignore: Set<string>
): Promise<void> {
const entries = await readdir(absoluteDir, { withFileTypes: true });
for (const entry of entries) {
const relativePath = relativeDir ? `${relativeDir}/${entry.name}` : entry.name;
if (ignore.has(relativePath) || ignore.has(entry.name)) {
continue;
}
const absolutePath = path.join(absoluteDir, entry.name);
if (entry.isDirectory()) {
await walkDirectory(absolutePath, relativePath, output, ignore);
continue;
}
output[relativePath] = await readFile(absolutePath, "utf8");
}
}
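
The ignore set passed to `readDirectoryFiles` matches either a path relative to the root or a bare entry name, and an ignored directory is never descended into. The sketch below restates that rule as a pure function; `isIgnored` is a hypothetical helper written for illustration and does not exist in the change.

```typescript
// Returns true when walkDirectory would skip this relative path.
// Every ancestor is checked as well as the entry itself, because the walk
// stops at the first ignored directory and never visits its children.
function isIgnored(relativePath: string, ignore: Set<string>): boolean {
  const parts = relativePath.split("/");
  return parts.some((part, index) => {
    const prefix = parts.slice(0, index + 1).join("/");
    return ignore.has(prefix) || ignore.has(part);
  });
}

// Example ignore set: one bare name and one root-relative path.
const sampleIgnore = new Set<string>(["node_modules", "dist/app.js"]);
```

A bare name such as `node_modules` matches at any depth, while `dist/app.js` only matches that exact path from the root.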

ai_evals/core/judge.ts Normal file

@@ -0,0 +1,149 @@
import Anthropic from "@anthropic-ai/sdk";
import type { EvalMode, JudgeResult } from "./types";
export const DEFAULT_JUDGE_MODEL = "claude-sonnet-4-6";
const JUDGE_TOOL_NAME = "submit_judgement";
export async function judgeOutput(input: {
mode: EvalMode;
prompt: string;
checklist?: string[];
initial?: unknown;
expected?: unknown;
actual: unknown;
model?: string;
}): Promise<JudgeResult> {
const apiKey = process.env.ANTHROPIC_API_KEY;
if (!apiKey) {
return {
success: false,
score: 0,
summary: "Judge unavailable",
error: "ANTHROPIC_API_KEY is not set",
};
}
const client = new Anthropic({ apiKey });
const model = input.model ?? DEFAULT_JUDGE_MODEL;
const system = [
"You evaluate benchmark outputs for Windmill AI generation.",
"Deterministic checks already run separately. Focus on whether the final output satisfies the user request.",
"If expected state is provided, treat it as a valid example and reward semantically equivalent outputs.",
"If a checklist is provided, treat it as the explicit acceptance criteria for this case.",
"Be strict about missing requested functionality.",
"When the prompt wording is ambiguous, prefer the checklist over inferred structural requirements.",
"Do not require exact ids, exact topology, or exact field names unless the prompt, checklist, or expected state clearly requires them.",
`Always respond by calling the ${JUDGE_TOOL_NAME} tool exactly once.`,
].join("\n\n");
const user = [
`Mode: ${input.mode}`,
"",
"User prompt:",
input.prompt,
"",
"Checklist:",
formatChecklist(input.checklist),
"",
"Initial state:",
formatJsonBlock(input.initial),
"",
"Expected state:",
formatJsonBlock(input.expected),
"",
"Actual result:",
formatJsonBlock(input.actual),
].join("\n");
try {
const response = await client.messages.create({
model,
max_tokens: 1024,
temperature: 0,
system,
messages: [{ role: "user", content: user }],
tools: [
{
name: JUDGE_TOOL_NAME,
description: "Submit the benchmark judgement as structured data.",
input_schema: {
type: "object",
properties: {
score: {
type: "integer",
minimum: 0,
maximum: 100,
},
summary: {
type: "string",
},
},
required: ["score", "summary"],
},
},
],
tool_choice: {
type: "tool",
name: JUDGE_TOOL_NAME,
disable_parallel_tool_use: true,
},
});
const toolUseBlock = response.content.find(
(block): block is Anthropic.ToolUseBlock =>
block.type === "tool_use" && block.name === JUDGE_TOOL_NAME
);
if (!toolUseBlock) {
return {
success: false,
score: 0,
summary: "Judge returned no tool output",
error: "Expected structured tool output from judge",
};
}
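// The tool schema constrains this shape, but the cast is unchecked at runtime;
// normalizeScore below guards against a non-finite or out-of-range score.
const parsed = toolUseBlock.input as {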
score: number;
summary: string;
};
return {
success: true,
score: normalizeScore(parsed.score),
summary: parsed.summary,
};
} catch (error) {
const message = error instanceof Error ? error.message : String(error);
return {
success: false,
score: 0,
summary: "Judge failed",
error: message,
};
}
}
function formatJsonBlock(value: unknown): string {
if (value === undefined) {
return "(none)";
}
return JSON.stringify(value, null, 2);
}
function formatChecklist(checklist: string[] | undefined): string {
if (!checklist || checklist.length === 0) {
return "(none)";
}
return checklist.map((item) => `- ${item}`).join("\n");
}
function normalizeScore(value: number): number {
if (!Number.isFinite(value)) {
return 0;
}
return Math.max(0, Math.min(100, Math.round(value)));
}
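
`judgeOutput` casts the tool input to `{ score, summary }` without checking it. A stricter variant could validate the shape first. This is only a sketch under assumptions: `ContentBlock` is a simplified, hypothetical stand-in for the SDK's content block types, and `extractJudgement` is not a function in this change.

```typescript
// Simplified, hypothetical shapes; the real SDK types carry more fields.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; name: string; input: unknown };

interface Judgement {
  score: number;
  summary: string;
}

// Same clamping rule as normalizeScore in judge.ts.
function normalizeScore(value: number): number {
  if (!Number.isFinite(value)) {
    return 0;
  }
  return Math.max(0, Math.min(100, Math.round(value)));
}

// Finds the first matching tool_use block and validates its input at runtime.
// Returns null when the block is missing or its input has the wrong shape.
function extractJudgement(content: ContentBlock[], toolName: string): Judgement | null {
  for (const block of content) {
    if (block.type !== "tool_use" || block.name !== toolName) {
      continue;
    }
    const input = block.input;
    if (typeof input !== "object" || input === null) {
      return null;
    }
    const { score, summary } = input as Record<string, unknown>;
    if (typeof score !== "number" || typeof summary !== "string") {
      return null;
    }
    return { score: normalizeScore(score), summary };
  }
  return null;
}
```

With this shape, a malformed tool call surfaces as a `null` result that the caller can report as a judge failure, instead of an `undefined` summary leaking into the run result.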


@@ -0,0 +1,29 @@
import { describe, expect, it } from "bun:test";
import { resolveEvalModel } from "./models";
describe("resolveEvalModel", () => {
it("supports Gemini aliases for frontend evals", () => {
expect(resolveEvalModel("flow", "gemini").frontend).toEqual({
provider: "googleai",
model: "gemini-2.5-flash",
});
expect(resolveEvalModel("app", "gemini-pro").frontend).toEqual({
provider: "googleai",
model: "gemini-2.5-pro",
});
expect(resolveEvalModel("script", "gemini-3-flash-preview").frontend).toEqual({
provider: "googleai",
model: "gemini-3-flash-preview",
});
expect(resolveEvalModel("flow", "gemini-3.1-pro-preview").frontend).toEqual({
provider: "googleai",
model: "gemini-3.1-pro-preview",
});
});
it("rejects Gemini aliases for cli evals", () => {
expect(() => resolveEvalModel("cli", "gemini")).toThrow(
"Model gemini-flash is not supported for cli mode"
);
});
});

ai_evals/core/models.ts Normal file

@@ -0,0 +1,185 @@
import type { EvalMode } from "./types";
export interface FrontendEvalModelConfig {
provider: "anthropic" | "openai" | "googleai";
model: string;
}
export interface CliEvalModelConfig {
provider: "anthropic";
model: string;
}
export interface EvalModelSpec {
id: string;
label: string;
aliases: string[];
frontend?: FrontendEvalModelConfig;
cli?: CliEvalModelConfig;
}
export const EVAL_MODELS: EvalModelSpec[] = [
{
id: "haiku",
label: "Claude Haiku 4.5",
aliases: [
"haiku",
"haiku-4.5",
"claude-haiku",
"claude-haiku-4.5",
"claude-haiku-4-5",
"claude-haiku-4-5-20251001",
],
frontend: {
provider: "anthropic",
model: "claude-haiku-4-5-20251001",
},
cli: {
provider: "anthropic",
model: "haiku",
},
},
{
id: "sonnet",
label: "Claude Sonnet 4.5",
aliases: [
"sonnet",
"sonnet-4.5",
"claude-sonnet",
"claude-sonnet-4.5",
"claude-sonnet-4-5",
"claude-sonnet-4-5-20250929",
],
frontend: {
provider: "anthropic",
model: "claude-sonnet-4-5-20250929",
},
cli: {
provider: "anthropic",
model: "sonnet",
},
},
{
id: "opus",
label: "Claude Opus 4.6",
aliases: [
"opus",
"opus-4.6",
"claude-opus",
"claude-opus-4.6",
"claude-opus-4-6",
],
frontend: {
provider: "anthropic",
model: "claude-opus-4-6",
},
cli: {
provider: "anthropic",
model: "opus",
},
},
{
id: "4o",
label: "GPT-4o",
aliases: ["4o", "gpt-4o"],
frontend: {
provider: "openai",
model: "gpt-4o",
},
},
{
id: "gemini-flash",
label: "Gemini 2.5 Flash",
aliases: ["gemini", "gemini-flash", "gemini-2.5-flash"],
frontend: {
provider: "googleai",
model: "gemini-2.5-flash",
},
},
{
id: "gemini-pro",
label: "Gemini 2.5 Pro",
aliases: ["gemini-pro", "gemini-2.5-pro"],
frontend: {
provider: "googleai",
model: "gemini-2.5-pro",
},
},
{
id: "gemini-3-flash-preview",
label: "Gemini 3 Flash Preview",
aliases: ["gemini-3-flash-preview", "gemini-3-flash"],
frontend: {
provider: "googleai",
model: "gemini-3-flash-preview",
},
},
{
id: "gemini-3.1-pro-preview",
label: "Gemini 3.1 Pro Preview",
aliases: ["gemini-3.1-pro-preview", "gemini-3.1-pro", "gemini-3-pro-preview"],
frontend: {
provider: "googleai",
model: "gemini-3.1-pro-preview",
},
},
];
export function resolveEvalModel(mode: EvalMode, alias?: string): EvalModelSpec {
const spec = alias ? findEvalModel(alias) : getDefaultEvalModel(mode);
if (!spec) {
throw new Error(`Unknown model: ${alias}`);
}
if (mode === "cli" && !spec.cli) {
throw new Error(`Model ${spec.id} is not supported for cli mode`);
}
if (mode !== "cli" && !spec.frontend) {
throw new Error(`Model ${spec.id} is not supported for ${mode} mode`);
}
return spec;
}
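export function getEvalModelHelpText(): string {
// Pad to the longest id so the label column stays aligned; ids such as
// "gemini-3.1-pro-preview" are longer than a fixed width of 8.
const idWidth = Math.max(...EVAL_MODELS.map((model) => model.id.length));
return EVAL_MODELS.map((model) => {
const modes = [
...(model.frontend ? ["flow", "script", "app"] : []),
...(model.cli ? ["cli"] : []),
];
return ` ${model.id.padEnd(idWidth)} ${model.label} (${modes.join(", ")})`;
}).join("\n");
}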
export function formatRunModelLabel(mode: EvalMode, model: EvalModelSpec): string {
if (mode === "cli") {
return `${model.cli!.provider}:${model.cli!.model}`;
}
return `${model.frontend!.provider}:${model.frontend!.model}`;
}
export function getFrontendEvalModel(model: EvalModelSpec): FrontendEvalModelConfig {
if (!model.frontend) {
throw new Error(`Model ${model.id} does not support frontend evals`);
}
return model.frontend;
}
export function getCliEvalModel(model: EvalModelSpec): CliEvalModelConfig {
if (!model.cli) {
throw new Error(`Model ${model.id} does not support cli evals`);
}
return model.cli;
}
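function getDefaultEvalModel(_mode: EvalMode): EvalModelSpec {
// Every mode currently defaults to the first registry entry (haiku), so the
// previous ternary with two identical branches is collapsed.
return EVAL_MODELS[0]!;
}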
function findEvalModel(alias: string): EvalModelSpec | undefined {
const normalized = alias.trim().toLowerCase();
return EVAL_MODELS.find((model) =>
[model.id, ...model.aliases].some((candidate) => candidate.toLowerCase() === normalized)
);
}
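
The alias lookup and the per-mode support check can be seen in isolation with a two-entry registry. This is a minimal sketch, assuming a hypothetical `resolveSampleModel` and a trimmed `SAMPLE_MODELS` table whose values are copied from `EVAL_MODELS`; it is not the exported API.

```typescript
// Trimmed stand-in for EvalModelSpec (label omitted).
interface ModelSpecLike {
  id: string;
  aliases: string[];
  frontend?: { provider: string; model: string };
  cli?: { provider: string; model: string };
}

// Two entries copied from EVAL_MODELS: one with cli support, one without.
const SAMPLE_MODELS: ModelSpecLike[] = [
  {
    id: "haiku",
    aliases: ["claude-haiku"],
    frontend: { provider: "anthropic", model: "claude-haiku-4-5-20251001" },
    cli: { provider: "anthropic", model: "haiku" },
  },
  {
    id: "gemini-flash",
    aliases: ["gemini", "gemini-2.5-flash"],
    frontend: { provider: "googleai", model: "gemini-2.5-flash" },
  },
];

// Case-insensitive, whitespace-tolerant alias match followed by a mode check.
function resolveSampleModel(mode: "cli" | "flow" | "script" | "app", alias: string): ModelSpecLike {
  const normalized = alias.trim().toLowerCase();
  const spec = SAMPLE_MODELS.find((model) =>
    [model.id, ...model.aliases].some((candidate) => candidate.toLowerCase() === normalized)
  );
  if (!spec) {
    throw new Error(`Unknown model: ${alias}`);
  }
  if (mode === "cli" && !spec.cli) {
    throw new Error(`Model ${spec.id} is not supported for cli mode`);
  }
  if (mode !== "cli" && !spec.frontend) {
    throw new Error(`Model ${spec.id} is not supported for ${mode} mode`);
  }
  return spec;
}
```

Note that the error message reports the canonical `id` rather than the alias the caller typed, which is why the test above expects `gemini-flash` when `gemini` is rejected for cli mode.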

ai_evals/core/results.ts Normal file

@@ -0,0 +1,296 @@
import { appendFile, mkdir, rm, writeFile } from "node:fs/promises";
import path from "node:path";
import { execFileSync } from "node:child_process";
import { getAiEvalsRoot, getRepoRoot } from "./cases";
import type {
BenchmarkArtifactFile,
BenchmarkCaseResult,
BenchmarkRunResult,
BenchmarkTokenUsage,
EvalMode,
} from "./types";
export async function writeRunResult(
result: BenchmarkRunResult,
outputPath?: string
): Promise<string> {
const targetPath = resolveRunOutputPath(result.mode, outputPath);
await mkdir(path.dirname(targetPath), { recursive: true });
await writeFile(targetPath, JSON.stringify(toSerializableRunResult(result), null, 2) + "\n", "utf8");
return targetPath;
}
export async function appendHistoryRecord(
result: BenchmarkRunResult,
historyPath = resolveHistoryPath(result.mode)
): Promise<string> {
await mkdir(path.dirname(historyPath), { recursive: true });
await appendFile(historyPath, JSON.stringify(toHistoryRecord(result)) + "\n", "utf8");
return historyPath;
}
export async function writeRunArtifacts(
result: BenchmarkRunResult,
outputPath?: string
): Promise<string | null> {
const targetPath = resolveRunOutputPath(result.mode, outputPath);
const artifactRoot = defaultArtifactsRoot(targetPath);
await rm(artifactRoot, { recursive: true, force: true });
let wroteArtifacts = false;
for (const caseResult of result.cases) {
for (const attempt of caseResult.attempts) {
const artifactFiles = attempt.artifactFiles ?? [];
if (artifactFiles.length === 0) {
attempt.artifactsPath = null;
continue;
}
const attemptDir = path.join(artifactRoot, caseResult.id, `attempt-${attempt.attempt}`);
await writeArtifactFiles(attemptDir, artifactFiles);
attempt.artifactsPath = attemptDir;
wroteArtifacts = true;
}
}
result.artifactsPath = wroteArtifacts ? artifactRoot : null;
return result.artifactsPath ?? null;
}
export function buildRunResult(input: {
mode: EvalMode;
runs: number;
runModel: string | null;
judgeModel: string | null;
caseResults: BenchmarkCaseResult[];
}): BenchmarkRunResult {
const attemptCount = input.caseResults.reduce((sum, entry) => sum + entry.attempts.length, 0);
const passedAttempts = input.caseResults.reduce(
(sum, entry) => sum + entry.attempts.filter((attempt) => attempt.passed).length,
0
);
const durationTotal = input.caseResults.reduce(
(sum, entry) => sum + entry.attempts.reduce((inner, attempt) => inner + attempt.durationMs, 0),
0
);
const tokenUsageTotal = input.caseResults.reduce<BenchmarkTokenUsage | null>(
(sum, entry) => {
for (const attempt of entry.attempts) {
if (!attempt.tokenUsage) {
continue;
}
sum ??= { prompt: 0, completion: 0, total: 0 };
sum.prompt += attempt.tokenUsage.prompt;
sum.completion += attempt.tokenUsage.completion;
sum.total += attempt.tokenUsage.total;
}
return sum;
},
null
);
return {
version: 1,
mode: input.mode,
createdAt: new Date().toISOString(),
gitSha: getGitSha(),
runs: input.runs,
runModel: input.runModel,
judgeModel: input.judgeModel,
caseCount: input.caseResults.length,
attemptCount,
passedAttempts,
passRate: attemptCount === 0 ? 0 : passedAttempts / attemptCount,
averageDurationMs: attemptCount === 0 ? 0 : durationTotal / attemptCount,
totalTokenUsage: tokenUsageTotal,
averageTokenUsagePerAttempt:
attemptCount === 0 || !tokenUsageTotal
? null
: {
prompt: tokenUsageTotal.prompt / attemptCount,
completion: tokenUsageTotal.completion / attemptCount,
total: tokenUsageTotal.total / attemptCount,
},
cases: input.caseResults,
};
}
export function formatRunSummary(result: BenchmarkRunResult): string {
const lines = [
`${result.mode} benchmark complete`,
`Pass rate: ${formatPercent(result.passRate)} (${result.passedAttempts}/${result.attemptCount})`,
`Average duration: ${Math.round(result.averageDurationMs)}ms`,
];
const failures = collectFailures(result);
if (failures.length > 0) {
lines.push("Failures:");
for (const entry of failures.slice(0, 10)) {
lines.push(`- ${entry}`);
}
}
return lines.join("\n");
}
function collectFailures(result: BenchmarkRunResult): string[] {
const failures: string[] = [];
for (const caseResult of result.cases) {
for (const attempt of caseResult.attempts) {
if (attempt.passed) {
continue;
}
const failedChecks = attempt.checks.filter((check) => !check.passed).map((check) => check.name);
failures.push(
`${caseResult.id} attempt ${attempt.attempt}: ${failedChecks.join(", ") || attempt.error || "failed"}`
);
}
}
return failures;
}
function defaultFileName(mode: EvalMode): string {
return `${new Date().toISOString().replaceAll(":", "-")}__${mode}.json`;
}
export function resolveRunOutputPath(mode: EvalMode, outputPath?: string): string {
return outputPath ?? path.join(getAiEvalsRoot(), "results", defaultFileName(mode));
}
export function resolveHistoryPath(mode: EvalMode): string {
return path.join(getAiEvalsRoot(), "history", `${mode}.jsonl`);
}
function defaultArtifactsRoot(resultPath: string): string {
return resultPath.endsWith(".json")
? resultPath.slice(0, -".json".length)
: `${resultPath}.artifacts`;
}
async function writeArtifactFiles(
rootDir: string,
files: BenchmarkArtifactFile[]
): Promise<void> {
for (const file of files) {
const relativePath = normalizeArtifactPath(file.path);
const targetPath = path.join(rootDir, relativePath);
await mkdir(path.dirname(targetPath), { recursive: true });
await writeFile(targetPath, file.content, "utf8");
}
}
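// Normalizes separators, strips leading slashes, and rejects empty paths and
// any "." or ".." segment so an artifact can never be written outside rootDir.
function normalizeArtifactPath(filePath: string): string {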
const normalized = filePath.replaceAll("\\", "/").replace(/^\/+/, "");
const parts = normalized.split("/").filter(Boolean);
if (parts.length === 0 || parts.some((part) => part === "." || part === "..")) {
throw new Error(`Invalid artifact path: ${filePath}`);
}
return parts.join("/");
}
function toSerializableRunResult(result: BenchmarkRunResult): BenchmarkRunResult {
return {
...result,
cases: result.cases.map((caseResult) => ({
...caseResult,
attempts: caseResult.attempts.map(({ artifactFiles, ...attempt }) => attempt),
})),
};
}
function toHistoryRecord(result: BenchmarkRunResult) {
const judgeScores = result.cases.flatMap((caseResult) =>
caseResult.attempts.flatMap((attempt) =>
typeof attempt.judgeScore === "number" ? [attempt.judgeScore] : []
)
);
return {
createdAt: result.createdAt,
gitSha: result.gitSha,
mode: result.mode,
runs: result.runs,
runModel: result.runModel,
judgeModel: result.judgeModel,
caseCount: result.caseCount,
attemptCount: result.attemptCount,
passedAttempts: result.passedAttempts,
passRate: result.passRate,
averageDurationMs: result.averageDurationMs,
averageJudgeScore:
judgeScores.length === 0
? null
: judgeScores.reduce((sum, score) => sum + score, 0) / judgeScores.length,
averageTokenUsagePerAttempt: result.averageTokenUsagePerAttempt ?? null,
failedCaseIds: Array.from(
new Set(
result.cases
.filter((caseResult) => caseResult.attempts.some((attempt) => !attempt.passed))
.map((caseResult) => caseResult.id)
)
),
cases: result.cases.map((caseResult) => {
const attemptCount = caseResult.attempts.length;
const passedAttempts = caseResult.attempts.filter((attempt) => attempt.passed).length;
const totalDurationMs = caseResult.attempts.reduce(
(sum, attempt) => sum + attempt.durationMs,
0
);
const judgeScores = caseResult.attempts.flatMap((attempt) =>
typeof attempt.judgeScore === "number" ? [attempt.judgeScore] : []
);
const totalTokenUsage = caseResult.attempts.reduce<BenchmarkTokenUsage | null>(
(sum, attempt) => {
if (!attempt.tokenUsage) {
return sum;
}
sum ??= { prompt: 0, completion: 0, total: 0 };
sum.prompt += attempt.tokenUsage.prompt;
sum.completion += attempt.tokenUsage.completion;
sum.total += attempt.tokenUsage.total;
return sum;
},
null
);
return {
id: caseResult.id,
attemptCount,
passedAttempts,
passRate: attemptCount === 0 ? 0 : passedAttempts / attemptCount,
averageDurationMs: attemptCount === 0 ? 0 : totalDurationMs / attemptCount,
averageJudgeScore:
judgeScores.length === 0
? null
: judgeScores.reduce((sum, score) => sum + score, 0) / judgeScores.length,
averageTokenUsagePerAttempt:
attemptCount === 0 || !totalTokenUsage
? null
: {
prompt: totalTokenUsage.prompt / attemptCount,
completion: totalTokenUsage.completion / attemptCount,
total: totalTokenUsage.total / attemptCount,
},
};
}),
};
}
function getGitSha(): string | null {
try {
return execFileSync("git", ["rev-parse", "HEAD"], {
cwd: getRepoRoot(),
encoding: "utf8",
stdio: ["ignore", "pipe", "ignore"],
}).trim();
} catch {
return null;
}
}
function formatPercent(value: number): string {
return `${(value * 100).toFixed(1)}%`;
}
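
The path guard in `normalizeArtifactPath` is what keeps model-generated file names inside the artifact directory. The sketch below copies its logic so the accepted and rejected inputs can be checked directly; the only deliberate difference is a regex `replace` in place of `replaceAll`, used here so the snippet also compiles against older targets.

```typescript
// Copy of the guard above, with a regex in place of replaceAll.
function normalizeArtifactPath(filePath: string): string {
  // Convert Windows separators and drop leading slashes so the path is relative.
  const normalized = filePath.replace(/\\/g, "/").replace(/^\/+/, "");
  // Collapse repeated slashes by discarding empty segments.
  const parts = normalized.split("/").filter(Boolean);
  // Reject empty paths and any traversal segment.
  if (parts.length === 0 || parts.some((part) => part === "." || part === "..")) {
    throw new Error(`Invalid artifact path: ${filePath}`);
  }
  return parts.join("/");
}

// Small helper for the examples: true when the path is rejected.
function isRejected(filePath: string): boolean {
  try {
    normalizeArtifactPath(filePath);
    return false;
  } catch (error) {
    return true;
  }
}
```

For example, `/frontend//index.tsx` normalizes to `frontend/index.tsx`, while `../secrets` and the empty string are rejected before anything is written.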

ai_evals/core/runSuite.ts Normal file

@@ -0,0 +1,264 @@
import { judgeOutput, DEFAULT_JUDGE_MODEL } from "./judge";
import type {
BenchmarkAttemptResult,
BenchmarkCaseResult,
BenchmarkCheck,
EvalCase,
FrontendBenchmarkProgressEvent,
ModeRunner,
} from "./types";
export async function runSuite<TInitial, TExpected, TActual>(input: {
modeRunner: ModeRunner<TInitial, TExpected, TActual>;
cases: EvalCase[];
runs: number;
runModel: string | null;
judgeModel?: string | null;
concurrency?: number;
verbose?: boolean;
onProgress?: (event: FrontendBenchmarkProgressEvent) => void;
}): Promise<BenchmarkCaseResult[]> {
const judgeModel = input.judgeModel ?? DEFAULT_JUDGE_MODEL;
const concurrency = Math.max(1, input.concurrency ?? input.modeRunner.concurrency);
const results = new Array<BenchmarkCaseResult>(input.cases.length);
let cursor = 0;
if (input.modeRunner.mode !== "cli") {
input.onProgress?.({
type: "run-start",
surface: input.modeRunner.mode,
totalCases: input.cases.length,
runs: input.runs,
concurrency,
});
}
async function worker(): Promise<void> {
while (true) {
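// Claim the next case index. No lock is needed: the read and increment run
// synchronously, and workers only yield to each other at an await.
const caseIndex = cursor++;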
if (caseIndex >= input.cases.length) {
return;
}
const evalCase = input.cases[caseIndex];
results[caseIndex] = {
id: evalCase.id,
prompt: evalCase.prompt,
initialPath: evalCase.initialPath,
expectedPath: evalCase.expectedPath,
attempts: await runCaseAttempts({
caseIndex,
evalCase,
runs: input.runs,
judgeModel,
judgeThreshold: input.modeRunner.judgeThreshold ?? 80,
modeRunner: input.modeRunner,
totalCases: input.cases.length,
verbose: input.verbose ?? false,
onProgress: input.onProgress,
}),
};
}
}
await Promise.all(
Array.from({ length: Math.min(concurrency, input.cases.length) }, () => worker())
);
return results;
}
async function runCaseAttempts<TInitial, TExpected, TActual>(input: {
caseIndex: number;
evalCase: EvalCase;
runs: number;
judgeModel: string;
judgeThreshold: number;
modeRunner: ModeRunner<TInitial, TExpected, TActual>;
totalCases: number;
verbose: boolean;
onProgress?: (event: FrontendBenchmarkProgressEvent) => void;
}): Promise<BenchmarkAttemptResult[]> {
const attempts: BenchmarkAttemptResult[] = [];
const surface = input.modeRunner.mode === "cli" ? null : input.modeRunner.mode;
for (let attempt = 1; attempt <= input.runs; attempt += 1) {
if (surface) {
input.onProgress?.({
type: "attempt-start",
surface,
caseId: input.evalCase.id,
caseNumber: input.caseIndex + 1,
totalCases: input.totalCases,
attempt,
runs: input.runs,
});
}
const startedAt = Date.now();
try {
const initial = await input.modeRunner.loadInitial(input.evalCase.initialPath);
const expected = await input.modeRunner.loadExpected(input.evalCase.expectedPath);
const run = await input.modeRunner.run(input.evalCase.prompt, initial, {
caseId: input.evalCase.id,
caseNumber: input.caseIndex + 1,
totalCases: input.totalCases,
attempt,
runs: input.runs,
verbose: input.verbose,
onAssistantMessageStart: input.verbose && surface
? () =>
input.onProgress?.({
type: "assistant-message-start",
surface,
caseId: input.evalCase.id,
caseNumber: input.caseIndex + 1,
totalCases: input.totalCases,
attempt,
runs: input.runs,
})
: undefined,
onAssistantChunk: input.verbose && surface
? (chunk: string) =>
input.onProgress?.({
type: "assistant-chunk",
surface,
caseId: input.evalCase.id,
caseNumber: input.caseIndex + 1,
totalCases: input.totalCases,
attempt,
runs: input.runs,
chunk,
})
: undefined,
onAssistantMessageEnd: input.verbose && surface
? () =>
input.onProgress?.({
type: "assistant-message-end",
surface,
caseId: input.evalCase.id,
caseNumber: input.caseIndex + 1,
totalCases: input.totalCases,
attempt,
runs: input.runs,
})
: undefined,
});
const checks: BenchmarkCheck[] = [
buildCheck("run succeeded", run.success, run.error),
...input.modeRunner.validate({
evalCase: input.evalCase,
prompt: input.evalCase.prompt,
initial,
expected,
actual: run.actual,
run,
}),
];
let judgeScore: number | null = null;
let judgeSummary: string | null = null;
if (run.success) {
const judge = await judgeOutput({
mode: input.modeRunner.mode,
prompt: input.evalCase.prompt,
checklist: input.evalCase.judgeChecklist,
initial,
expected: input.modeRunner.mode === "cli" ? undefined : expected,
actual: run.actual,
model: input.judgeModel,
});
judgeScore = judge.success ? judge.score : null;
judgeSummary = judge.summary;
checks.push(buildCheck("judge succeeded", judge.success, judge.error));
checks.push(
buildCheck(
`judge score >= ${input.judgeThreshold}`,
(judgeScore ?? 0) >= input.judgeThreshold,
judge.success ? `score=${judgeScore}` : judge.error
)
);
}
const artifactFiles = input.modeRunner.buildArtifacts?.(run.actual) ?? [];
const attemptResult: BenchmarkAttemptResult = {
attempt,
passed: checks.every((check) => check.passed),
durationMs: Date.now() - startedAt,
assistantMessageCount: run.assistantMessageCount,
toolCallCount: run.toolCallCount,
toolsUsed: uniqueStrings(run.toolsUsed),
skillsInvoked: uniqueStrings(run.skillsInvoked),
checks,
judgeScore,
judgeSummary,
error: run.error ?? null,
tokenUsage: run.tokenUsage ?? null,
artifactsPath: null,
artifactFiles,
};
if (surface) {
input.onProgress?.({
type: "attempt-finish",
surface,
caseId: input.evalCase.id,
caseNumber: input.caseIndex + 1,
totalCases: input.totalCases,
attempt,
runs: input.runs,
passed: attemptResult.passed,
durationMs: attemptResult.durationMs,
judgeScore: attemptResult.judgeScore,
error: attemptResult.error,
});
}
attempts.push(attemptResult);
} catch (error) {
const message = error instanceof Error ? error.message : String(error);
const failedAttempt: BenchmarkAttemptResult = {
attempt,
passed: false,
durationMs: Date.now() - startedAt,
assistantMessageCount: 0,
toolCallCount: 0,
toolsUsed: [],
skillsInvoked: [],
checks: [buildCheck("run crashed", false, message)],
judgeScore: null,
judgeSummary: null,
error: message,
tokenUsage: null,
};
if (surface) {
input.onProgress?.({
type: "attempt-finish",
surface,
caseId: input.evalCase.id,
caseNumber: input.caseIndex + 1,
totalCases: input.totalCases,
attempt,
runs: input.runs,
passed: false,
durationMs: failedAttempt.durationMs,
judgeScore: null,
error: message,
});
}
attempts.push(failedAttempt);
}
}
return attempts;
}
function buildCheck(name: string, passed: boolean, details?: string): BenchmarkCheck {
return details ? { name, passed, details } : { name, passed };
}
function uniqueStrings(values: string[]): string[] {
return [...new Set(values)];
}
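
`runSuite` bounds parallelism with a fixed number of workers that pull indices from a shared cursor. The same pattern, reduced to a generic helper, looks roughly like this; `mapWithConcurrency` is a hypothetical name used for illustration and is not defined in this change.

```typescript
// Runs task over items with at most `concurrency` calls in flight,
// returning results in input order regardless of completion order.
async function mapWithConcurrency<T, R>(
  items: T[],
  concurrency: number,
  task: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let cursor = 0;

  async function worker(): Promise<void> {
    while (true) {
      // Claiming an index is synchronous, so two workers never get the same one.
      const index = cursor++;
      if (index >= items.length) {
        return;
      }
      results[index] = await task(items[index] as T, index);
    }
  }

  // Never start more workers than there are items, and always at least one
  // when the list is non-empty.
  const workerCount = Math.min(Math.max(1, concurrency), items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Because each worker writes to `results[index]`, the output order matches the input order even when later items finish first, which is why `runSuite` can preallocate its results array and return it directly.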

ai_evals/core/types.ts Normal file

@@ -0,0 +1,198 @@
export const EVAL_MODES = ["cli", "flow", "script", "app"] as const;
export type EvalMode = (typeof EVAL_MODES)[number];
export interface FlowValidationSpec {
schemaRequiredPaths?: string[];
schemaAnyOf?: Array<{
requiredPaths: string[];
}>;
resolveResultsRefs?: boolean;
requireSpecialModules?: Array<"preprocessor_module" | "failure_module">;
requireSuspendSteps?: Array<{
id: string;
requiredEvents?: number;
resumeRequiredStringFieldAnyOf?: string[];
}>;
}
export interface EvalCase {
id: string;
prompt: string;
initialPath?: string;
expectedPath?: string;
validate?: FlowValidationSpec;
judgeChecklist?: string[];
}
export interface BenchmarkCheck {
name: string;
passed: boolean;
details?: string;
}
export interface JudgeResult {
success: boolean;
score: number;
summary: string;
error?: string;
}
export interface BenchmarkArtifactFile {
path: string;
content: string;
}
export interface BenchmarkTokenUsage {
prompt: number;
completion: number;
total: number;
}
export interface ModeRunOutput<TActual> {
success: boolean;
actual: TActual;
error?: string;
assistantMessageCount: number;
toolCallCount: number;
toolsUsed: string[];
skillsInvoked: string[];
tokenUsage?: BenchmarkTokenUsage | null;
}
export interface ModeRunContext {
caseId: string;
caseNumber: number;
totalCases: number;
attempt: number;
runs: number;
verbose: boolean;
onAssistantMessageStart?: () => void;
onAssistantChunk?: (chunk: string) => void;
onAssistantMessageEnd?: () => void;
}
export interface ModeRunner<TInitial, TExpected, TActual> {
mode: EvalMode;
concurrency: number;
judgeThreshold?: number;
loadInitial(path?: string): Promise<TInitial | undefined>;
loadExpected(path?: string): Promise<TExpected | undefined>;
run(
prompt: string,
initial: TInitial | undefined,
context: ModeRunContext
): Promise<ModeRunOutput<TActual>>;
validate(input: {
evalCase: EvalCase;
prompt: string;
initial: TInitial | undefined;
expected: TExpected | undefined;
actual: TActual;
run: ModeRunOutput<TActual>;
}): BenchmarkCheck[];
buildArtifacts?(actual: TActual): BenchmarkArtifactFile[];
}
export interface BenchmarkAttemptResult {
attempt: number;
passed: boolean;
durationMs: number;
assistantMessageCount: number;
toolCallCount: number;
toolsUsed: string[];
skillsInvoked: string[];
checks: BenchmarkCheck[];
judgeScore: number | null;
judgeSummary: string | null;
error: string | null;
tokenUsage?: BenchmarkTokenUsage | null;
artifactsPath?: string | null;
artifactFiles?: BenchmarkArtifactFile[];
}
export interface BenchmarkCaseResult {
id: string;
prompt: string;
initialPath?: string;
expectedPath?: string;
attempts: BenchmarkAttemptResult[];
}
export interface BenchmarkRunResult {
version: 1;
mode: EvalMode;
createdAt: string;
gitSha: string | null;
runs: number;
runModel: string | null;
judgeModel: string | null;
caseCount: number;
attemptCount: number;
passedAttempts: number;
passRate: number;
averageDurationMs: number;
totalTokenUsage?: BenchmarkTokenUsage | null;
averageTokenUsagePerAttempt?: BenchmarkTokenUsage | null;
artifactsPath?: string | null;
cases: BenchmarkCaseResult[];
}
export type FrontendBenchmarkProgressEvent =
| {
type: "run-start";
surface: Exclude<EvalMode, "cli">;
totalCases: number;
runs: number;
concurrency: number;
}
| {
type: "attempt-start";
surface: Exclude<EvalMode, "cli">;
caseId: string;
caseNumber: number;
totalCases: number;
attempt: number;
runs: number;
}
| {
type: "attempt-finish";
surface: Exclude<EvalMode, "cli">;
caseId: string;
caseNumber: number;
totalCases: number;
attempt: number;
runs: number;
passed: boolean;
durationMs: number;
judgeScore: number | null;
error: string | null;
}
| {
type: "assistant-message-start";
surface: Exclude<EvalMode, "cli">;
caseId: string;
caseNumber: number;
totalCases: number;
attempt: number;
runs: number;
}
| {
type: "assistant-chunk";
surface: Exclude<EvalMode, "cli">;
caseId: string;
caseNumber: number;
totalCases: number;
attempt: number;
runs: number;
chunk: string;
}
| {
type: "assistant-message-end";
surface: Exclude<EvalMode, "cli">;
caseId: string;
caseNumber: number;
totalCases: number;
attempt: number;
runs: number;
};
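
Since `FrontendBenchmarkProgressEvent` is a union discriminated by `type`, an `onProgress` consumer can narrow each variant with a `switch`. The sketch below assumes a trimmed, hypothetical `ProgressEvent` subset and a made-up `describeProgress` formatter; the message wording is illustrative only.

```typescript
// Trimmed, hypothetical subset of FrontendBenchmarkProgressEvent.
type ProgressEvent =
  | { type: "run-start"; totalCases: number; runs: number; concurrency: number }
  | { type: "attempt-start"; caseId: string; attempt: number; runs: number }
  | {
      type: "attempt-finish";
      caseId: string;
      attempt: number;
      runs: number;
      passed: boolean;
      durationMs: number;
    }
  | { type: "assistant-chunk"; caseId: string; chunk: string };

// Formats one event as a log line. Each case only sees its own variant's fields.
function describeProgress(event: ProgressEvent): string {
  switch (event.type) {
    case "run-start":
      return `Running ${event.totalCases} case(s) x ${event.runs} run(s), concurrency ${event.concurrency}`;
    case "attempt-start":
      return `${event.caseId}: attempt ${event.attempt}/${event.runs} started`;
    case "attempt-finish":
      return `${event.caseId}: attempt ${event.attempt}/${event.runs} ${
        event.passed ? "passed" : "failed"
      } in ${event.durationMs}ms`;
    case "assistant-chunk":
      return event.chunk;
    default: {
      // Compile-time exhaustiveness check: adding a variant without a case
      // makes this assignment a type error.
      const unreachable: never = event;
      return unreachable;
    }
  }
}
```

The `never` assignment in the default branch is one way to make the compiler flag a handler that falls behind when a new event type is added to the union.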


@@ -0,0 +1,36 @@
import { describe, expect, it } from "bun:test";
import { validateScriptState } from "./validators";
describe("validateScriptState", () => {
it("accepts semantically equivalent script implementations", () => {
const checks = validateScriptState({
actual: {
path: "f/evals/greet_user.ts",
lang: "bun",
code: "export async function main(name: string): Promise<string> {\n return `Hello, ${name}!`;\n}\n",
},
expected: {
path: "f/evals/greet_user.ts",
lang: "bun",
code: "export async function main(name: string) {\n\treturn `Hello, ${name}!`\n}\n",
},
});
expect(checks.every((check) => check.passed)).toBe(true);
});
it("still requires an exported main entrypoint", () => {
const checks = validateScriptState({
actual: {
path: "f/evals/greet_user.ts",
lang: "bun",
code: "async function main(name: string) {\n return `Hello, ${name}!`;\n}\n",
},
});
expect(checks).toContainEqual({
name: "script exports entrypoint",
passed: false,
});
});
});

ai_evals/core/validators.ts Normal file

@@ -0,0 +1,997 @@
import path from "node:path";
import ts from "typescript";
import type { BenchmarkCheck, FlowValidationSpec } from "./types";
export interface ScriptState {
path: string;
lang: string;
args?: Record<string, unknown>;
code: string;
}
export interface FlowState {
summary?: string;
value?: {
preprocessor_module?: Record<string, unknown>;
failure_module?: Record<string, unknown>;
modules?: Array<Record<string, unknown>>;
[key: string]: unknown;
};
schema?: Record<string, unknown>;
}
export interface AppFilesState {
frontend: Record<string, string>;
backend: Record<string, AppRunnableState>;
}
export interface AppRunnableState {
type?: string;
name?: string;
path?: string;
inlineScript?: {
language?: string;
content?: string;
};
}
const TS_LIKE_LANGUAGES = new Set(["bun", "deno", "nativets", "bunnative", "ts", "typescript"]);
const CONTROL_FLOW_MODULE_TYPES = new Set(["branchone", "branchall", "forloopflow", "whileloopflow"]);
export function validateScriptState(input: {
actual: ScriptState;
initial?: ScriptState;
expected?: ScriptState;
}): BenchmarkCheck[] {
const checks: BenchmarkCheck[] = [
check("script exports entrypoint", hasSupportedEntrypoint(input.actual.code)),
check("script has no syntax errors", getScriptSyntaxErrors(input.actual.code, input.actual.lang).length === 0),
];
if (input.expected) {
checks.push(
check(
"script path matches expected",
input.actual.path === input.expected.path,
`expected ${input.expected.path}, got ${input.actual.path}`
)
);
checks.push(
check(
"script language matches expected",
input.actual.lang === input.expected.lang,
`expected ${input.expected.lang}, got ${input.actual.lang}`
)
);
}
if (input.initial) {
checks.push(
check(
"script differs from initial",
normalizeText(input.actual.code) !== normalizeText(input.initial.code)
)
);
}
return checks;
}
export function validateFlowState(input: {
actual: FlowState;
initial?: FlowState;
expected?: FlowState;
validate?: FlowValidationSpec;
}): BenchmarkCheck[] {
const actualModules = getFlowModules(input.actual);
const placeholderModuleIds = getInlineScriptPlaceholderModuleIds(input.actual);
const checks: BenchmarkCheck[] = [
check("flow has modules", actualModules.length > 0),
check(
"flow has no inline placeholder code",
placeholderModuleIds.length === 0,
placeholderModuleIds.length > 0
? `placeholder content in: ${placeholderModuleIds.join(", ")}`
: undefined
),
];
if (input.initial) {
checks.push(
check(
"flow differs from initial",
normalizeJson(input.actual) !== normalizeJson(input.initial)
)
);
}
if (input.expected) {
checks.push(...validateFlowExpectedStructure(input.actual, input.expected));
}
if (input.validate) {
checks.push(...validateFlowRequirements(input.actual, input.validate));
}
return checks;
}
export function validateAppState(input: {
actual: AppFilesState;
initial?: AppFilesState;
expected?: AppFilesState;
}): BenchmarkCheck[] {
const checks: BenchmarkCheck[] = [];
const frontendEntries = Object.entries(input.actual.frontend ?? {});
const backendEntries = Object.entries(input.actual.backend ?? {});
const frontendSyntaxProblems = getAppFrontendSyntaxProblems(input.actual.frontend);
const backendSyntaxProblems = getAppBackendSyntaxProblems(input.actual.backend);
const unresolvedBackendRefs = getUnresolvedBackendReferences(
input.actual.frontend,
input.actual.backend
);
checks.push(check("app has frontend entrypoint", Boolean(input.actual.frontend["/index.tsx"])));
checks.push(
check(
"app has non-empty frontend files",
frontendEntries.some(([, content]) => content.trim().length > 0)
)
);
checks.push(
check(
"frontend files have no syntax errors",
frontendSyntaxProblems.length === 0,
summarizeProblems(frontendSyntaxProblems)
)
);
checks.push(
check(
"backend inline scripts have entrypoints",
backendEntries.every(([, runnable]) => {
if (runnable.type !== "inline") {
return true;
}
return hasSupportedEntrypoint(runnable.inlineScript?.content ?? "");
})
)
);
checks.push(
check(
"backend inline scripts have no syntax errors",
backendSyntaxProblems.length === 0,
summarizeProblems(backendSyntaxProblems)
)
);
checks.push(
check(
"frontend backend references resolve",
unresolvedBackendRefs.length === 0,
summarizeProblems(unresolvedBackendRefs)
)
);
if (input.initial) {
checks.push(check("app differs from initial", !appStatesEqual(input.actual, input.initial)));
}
if (input.expected) {
for (const [filePath, content] of Object.entries(input.expected.frontend)) {
checks.push(
check(
`frontend includes ${filePath}`,
normalizeText(input.actual.frontend[filePath] ?? "") === normalizeText(content)
)
);
}
for (const [runnableName, runnable] of Object.entries(input.expected.backend)) {
const actualRunnable = input.actual.backend[runnableName];
checks.push(check(`backend includes ${runnableName}`, Boolean(actualRunnable)));
if (actualRunnable && runnable.inlineScript?.content) {
checks.push(
check(
`${runnableName} code matches expected`,
normalizeText(actualRunnable.inlineScript?.content ?? "") ===
normalizeText(runnable.inlineScript.content)
)
);
}
}
}
return checks;
}
export function validateCliWorkspace(input: {
actualFiles: Record<string, string>;
expectedFiles?: Record<string, string>;
initialFiles?: Record<string, string>;
}): BenchmarkCheck[] {
const checks: BenchmarkCheck[] = [];
if (input.expectedFiles) {
for (const [filePath, expectedContent] of Object.entries(input.expectedFiles)) {
const actualContent = input.actualFiles[filePath];
checks.push(check(`creates ${filePath}`, actualContent !== undefined));
if (actualContent !== undefined) {
checks.push(
check(
`${filePath} contains expected content`,
cliFileContainsExpectedContent(actualContent, expectedContent)
)
);
}
}
const expectedPaths = new Set(Object.keys(input.expectedFiles));
const unexpectedPaths = Object.keys(input.actualFiles).filter((filePath) => !expectedPaths.has(filePath));
checks.push(
check(
"workspace contains no unexpected files",
unexpectedPaths.length === 0,
summarizeProblems(unexpectedPaths)
)
);
}
if (input.initialFiles) {
checks.push(check("workspace differs from initial", !fileMapsEqual(input.actualFiles, input.initialFiles)));
}
return checks;
}
/**
 * Returns true when every non-empty, trimmed line of `expectedContent` occurs
 * as a substring of `actualContent`. Line order and surrounding code are ignored.
 */
function cliFileContainsExpectedContent(actualContent: string, expectedContent: string): boolean {
  const expectedSnippets = expectedContent
    .replace(/\r\n/g, "\n")
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  if (expectedSnippets.length === 0) {
    return true;
  }
  const normalizedActual = actualContent.replace(/\r\n/g, "\n");
  return expectedSnippets.every((snippet) => normalizedActual.includes(snippet));
}
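/** Builds a check result; `details` is attached only when the check failed. */
function check(name: string, passed: boolean, details?: string): BenchmarkCheck {
  return !passed && details ? { name, passed, details } : { name, passed };
}
/** Normalizes CRLF line endings to LF and trims surrounding whitespace. */
function normalizeText(value: string): string {
  return value.replace(/\r\n/g, "\n").trim();
}
/** Serializes with JSON.stringify, so object key order is significant when comparing. */
function normalizeJson(value: unknown): string {
  return JSON.stringify(value);
}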
function summarizeProblems(problems: string[], limit = 5): string | undefined {
if (problems.length === 0) {
return undefined;
}
if (problems.length <= limit) {
return problems.join("; ");
}
return `${problems.slice(0, limit).join("; ")}; ...and ${problems.length - limit} more`;
}
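/**
 * Regex-based entrypoint detection. Accepts `export [async] function main(` or an
 * anonymous `export default [async] function (`; a named default export is not matched.
 */
function hasSupportedEntrypoint(code: string): boolean {
  return (
    /export\s+(async\s+)?function\s+main\s*\(/.test(code) ||
    /export\s+default\s+(async\s+)?function\s*\(/.test(code)
  );
}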
function getScriptSyntaxErrors(code: string, lang: string): string[] {
if (!TS_LIKE_LANGUAGES.has(lang)) {
return [];
}
return getTypeScriptSyntaxErrors(code, "eval.ts");
}
function getTypeScriptSyntaxErrors(code: string, fileName: string): string[] {
const result = ts.transpileModule(code, {
compilerOptions: {
target: ts.ScriptTarget.ES2022,
module: ts.ModuleKind.ESNext,
jsx: ts.JsxEmit.ReactJSX,
},
reportDiagnostics: true,
fileName,
});
return (result.diagnostics ?? []).map((diagnostic) =>
ts.flattenDiagnosticMessageText(diagnostic.messageText, "\n")
);
}
function getAppFrontendSyntaxProblems(frontend: Record<string, string>): string[] {
const problems: string[] = [];
for (const [filePath, content] of Object.entries(frontend)) {
if (!isFrontendCodeFile(filePath)) {
continue;
}
const errors = getTypeScriptSyntaxErrors(content, filePath);
for (const error of errors) {
problems.push(`${filePath}: ${error}`);
}
}
return problems;
}
function getAppBackendSyntaxProblems(backend: Record<string, AppRunnableState>): string[] {
const problems: string[] = [];
for (const [key, runnable] of Object.entries(backend)) {
if (runnable.type !== "inline") {
continue;
}
const language = runnable.inlineScript?.language ?? "";
const content = runnable.inlineScript?.content ?? "";
for (const error of getScriptSyntaxErrors(content, language)) {
problems.push(`${key}: ${error}`);
}
}
return problems;
}
function isFrontendCodeFile(filePath: string): boolean {
const extension = path.extname(filePath).toLowerCase();
return extension === ".ts" || extension === ".tsx" || extension === ".js" || extension === ".jsx";
}
function getUnresolvedBackendReferences(
frontend: Record<string, string>,
backend: Record<string, AppRunnableState>
): string[] {
const backendKeys = new Set(Object.keys(backend));
const unresolved = new Set<string>();
for (const [filePath, content] of Object.entries(frontend)) {
for (const key of extractBackendCallKeys(content)) {
if (!backendKeys.has(key)) {
unresolved.add(`${filePath} references missing backend.${key}()`);
}
}
}
return [...unresolved];
}
function extractBackendCallKeys(content: string): string[] {
const matches = content.matchAll(/\bbackend\.([A-Za-z_][A-Za-z0-9_]*)\s*\(/g);
return [...new Set([...matches].map((match) => match[1]))];
}
function getFlowModules(flow: FlowState): Array<Record<string, unknown>> {
return Array.isArray(flow.value?.modules) ? flow.value.modules : [];
}
function validateFlowExpectedStructure(
actual: FlowState,
expected: FlowState
): BenchmarkCheck[] {
const checks: BenchmarkCheck[] = [];
const expectedTopLevelModules = getFlowModules(expected);
const actualTopLevelModules = getFlowModules(actual);
const expectedSchemaFields = getTopLevelSchemaFields(expected.schema);
if (expectedSchemaFields.length > 0) {
checks.push(
check(
"flow schema includes expected top-level fields",
expectedSchemaFields.every((field) => hasSchemaPath(actual.schema, field)),
`missing one of: ${expectedSchemaFields.join(", ")}`
)
);
}
if (expectedTopLevelModules.length > 0) {
const actualIds = actualTopLevelModules
.map((module) => (typeof module.id === "string" ? module.id : null))
.filter((id): id is string => Boolean(id));
const expectedIds = expectedTopLevelModules
.map((module) => (typeof module.id === "string" ? module.id : null))
.filter((id): id is string => Boolean(id));
checks.push(
check(
"flow includes expected top-level step ids",
expectedIds.every((id) => actualIds.includes(id)),
`expected ids: ${expectedIds.join(", ")}; actual ids: ${actualIds.join(", ")}`
)
);
checks.push(
check(
"flow preserves expected top-level step order",
preservesRelativeOrder(actualIds, expectedIds),
`expected order: ${expectedIds.join(" -> ")}; actual ids: ${actualIds.join(" -> ")}`
)
);
for (const expectedModule of expectedTopLevelModules) {
const moduleId = typeof expectedModule.id === "string" ? expectedModule.id : null;
if (!moduleId) {
continue;
}
const actualModule = actualTopLevelModules.find((module) => module.id === moduleId);
if (!actualModule) {
continue;
}
const expectedType = getModuleType(expectedModule);
if (expectedType && !(hasSuspendConfig(expectedModule) || hasSuspendConfig(actualModule))) {
checks.push(
check(
`${moduleId} type matches expected`,
getModuleType(actualModule) === expectedType,
`expected ${expectedType}, got ${getModuleType(actualModule) ?? "(missing)"}`
)
);
}
const expectedPath = getModulePath(expectedModule);
if (expectedPath) {
checks.push(
check(
`${moduleId} path matches expected`,
getModulePath(actualModule) === expectedPath,
`expected ${expectedPath}, got ${getModulePath(actualModule) ?? "(missing)"}`
)
);
}
}
}
for (const specialModuleKey of ["preprocessor_module", "failure_module"] as const) {
const expectedSpecialModule = getSpecialFlowModule(expected, specialModuleKey);
if (!expectedSpecialModule) {
continue;
}
const actualSpecialModule = getSpecialFlowModule(actual, specialModuleKey);
checks.push(check(`${specialModuleKey} matches expected presence`, Boolean(actualSpecialModule)));
if (!actualSpecialModule) {
continue;
}
const expectedType = getModuleType(expectedSpecialModule);
if (expectedType) {
checks.push(
check(
`${specialModuleKey} type matches expected`,
getModuleType(actualSpecialModule) === expectedType,
`expected ${expectedType}, got ${getModuleType(actualSpecialModule) ?? "(missing)"}`
)
);
}
}
return checks;
}
function validateFlowRequirements(
flow: FlowState,
validate: FlowValidationSpec
): BenchmarkCheck[] {
const checks: BenchmarkCheck[] = [];
for (const requiredPath of validate.schemaRequiredPaths ?? []) {
checks.push(
check(
`schema includes ${requiredPath}`,
hasSchemaPath(flow.schema, requiredPath),
`missing schema path ${requiredPath}`
)
);
}
if (validate.schemaAnyOf && validate.schemaAnyOf.length > 0) {
const matchingVariant = validate.schemaAnyOf.find((variant) =>
variant.requiredPaths.every((requiredPath) => hasSchemaPath(flow.schema, requiredPath))
);
checks.push(
check(
"schema matches one accepted input shape",
Boolean(matchingVariant),
matchingVariant
? undefined
: `expected one of: ${validate.schemaAnyOf
.map((variant) => `[${variant.requiredPaths.join(", ")}]`)
.join(" or ")}`
)
);
}
if (validate.resolveResultsRefs) {
const unresolved = collectUnresolvedResultsRefs(flow);
checks.push(
check(
"results references resolve",
unresolved.length === 0,
unresolved.length > 0 ? unresolved.join("; ") : undefined
)
);
}
for (const specialModule of validate.requireSpecialModules ?? []) {
checks.push(
check(
`${specialModule} exists`,
Boolean(getSpecialFlowModule(flow, specialModule))
)
);
}
for (const suspendStep of validate.requireSuspendSteps ?? []) {
const module = findFlowModuleById(flow, suspendStep.id);
checks.push(check(`${suspendStep.id} step exists`, Boolean(module)));
if (!module) {
continue;
}
checks.push(check(`${suspendStep.id} includes suspend config`, hasSuspendConfig(module)));
if (!hasSuspendConfig(module)) {
continue;
}
if (suspendStep.requiredEvents !== undefined) {
checks.push(
check(
`${suspendStep.id} requires ${suspendStep.requiredEvents} approval event${suspendStep.requiredEvents === 1 ? "" : "s"}`,
getSuspendRequiredEvents(module) === suspendStep.requiredEvents,
`expected ${suspendStep.requiredEvents}, got ${getSuspendRequiredEvents(module) ?? "(missing)"}`
)
);
}
if (
suspendStep.resumeRequiredStringFieldAnyOf &&
suspendStep.resumeRequiredStringFieldAnyOf.length > 0
) {
const stringFields = getSuspendResumeStringFields(module);
checks.push(
check(
`${suspendStep.id} resume form includes one accepted comment field`,
suspendStep.resumeRequiredStringFieldAnyOf.some((field) =>
stringFields.includes(field)
),
`expected one of [${suspendStep.resumeRequiredStringFieldAnyOf.join(", ")}], got [${stringFields.join(", ")}]`
)
);
}
}
return checks;
}
/**
 * Walks a dotted path (e.g. "user.email") through nested JSON-schema
 * `properties` objects and returns true only if every segment resolves to an object.
 */
function hasSchemaPath(schema: Record<string, unknown> | undefined, dottedPath: string): boolean {
  if (!schema || typeof schema !== "object") {
    return false;
  }
  const segments = dottedPath.split(".").filter(Boolean);
  if (segments.length === 0) {
    return false;
  }
  let current: Record<string, unknown> | undefined = schema;
  for (const segment of segments) {
    const properties = current?.properties;
    if (!properties || typeof properties !== "object") {
      return false;
    }
    const next = (properties as Record<string, unknown>)[segment];
    if (!next || typeof next !== "object") {
      return false;
    }
    current = next as Record<string, unknown>;
  }
  return true;
}
function getTopLevelSchemaFields(schema: Record<string, unknown> | undefined): string[] {
if (!schema || typeof schema !== "object") {
return [];
}
const properties = schema.properties;
if (!properties || typeof properties !== "object") {
return [];
}
return Object.keys(properties as Record<string, unknown>).filter((key) => key.length > 0);
}
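/**
 * Subsequence check: true when `expectedIds` appear in `actualIds` in the same
 * relative order. Extra ids in `actualIds` are allowed between or around them.
 */
function preservesRelativeOrder(actualIds: string[], expectedIds: string[]): boolean {
  if (expectedIds.length === 0) {
    return true;
  }
  let cursor = 0;
  for (const actualId of actualIds) {
    if (actualId === expectedIds[cursor]) {
      cursor += 1;
      if (cursor === expectedIds.length) {
        return true;
      }
    }
  }
  return false;
}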
function collectUnresolvedResultsRefs(flow: FlowState): string[] {
const unresolved = new Set<string>();
validateModuleSequence(getFlowModules(flow), new Map<string, Record<string, unknown>>(), unresolved);
return [...unresolved];
}
function validateModuleSequence(
modules: Array<Record<string, unknown>>,
parentVisibleModules: Map<string, Record<string, unknown>>,
unresolved: Set<string>
): void {
const visibleModules = new Map(parentVisibleModules);
for (const module of modules) {
validateResultsRefsInRecord(module, visibleModules, unresolved);
validateNestedModuleResultsRefs(module, visibleModules, unresolved);
if (typeof module.id === "string" && module.id.length > 0) {
visibleModules.set(module.id, module);
}
}
}
function validateNestedModuleResultsRefs(
module: Record<string, unknown>,
visibleModules: Map<string, Record<string, unknown>>,
unresolved: Set<string>
): void {
const value = isObjectRecord(module.value) ? module.value : null;
if (!value) {
return;
}
const nestedSequences: Array<Array<Record<string, unknown>>> = [];
if (Array.isArray(value.modules)) {
nestedSequences.push(asModuleArray(value.modules));
}
if (Array.isArray(value.default)) {
nestedSequences.push(asModuleArray(value.default));
}
if (Array.isArray(value.branches)) {
for (const branch of value.branches) {
if (!isObjectRecord(branch)) {
continue;
}
if (typeof branch.expr === "string") {
validateResultsRefsInExpression(
branch.expr,
`branch ${module.id ?? "(unnamed)"}`,
visibleModules,
unresolved
);
}
if (Array.isArray(branch.modules)) {
nestedSequences.push(asModuleArray(branch.modules));
}
}
}
for (const sequence of nestedSequences) {
validateModuleSequence(sequence, visibleModules, unresolved);
}
}
function validateResultsRefsInRecord(
value: unknown,
visibleModules: Map<string, Record<string, unknown>>,
unresolved: Set<string>,
context = "expression"
): void {
if (typeof value === "string") {
validateResultsRefsInExpression(value, context, visibleModules, unresolved);
return;
}
if (Array.isArray(value)) {
for (const entry of value) {
validateResultsRefsInRecord(entry, visibleModules, unresolved, context);
}
return;
}
if (!isObjectRecord(value)) {
return;
}
for (const [key, entry] of Object.entries(value)) {
if (key === "content" || key === "modules" || key === "branches" || key === "default") {
continue;
}
validateResultsRefsInRecord(entry, visibleModules, unresolved, key);
}
}
function validateResultsRefsInExpression(
expression: string,
context: string,
visibleModules: Map<string, Record<string, unknown>>,
unresolved: Set<string>
): void {
for (const ref of extractResultsRefs(expression)) {
const module = visibleModules.get(ref.root);
if (!module) {
unresolved.add(`${context} references missing results.${ref.root}`);
continue;
}
validateNestedResultsRefPath(ref.root, ref.path, module, context, unresolved);
}
}
/**
 * Extracts `results.<root>[.<segment>...]` references from an expression,
 * deduplicated by root and path.
 */
function extractResultsRefs(
  expression: string
): Array<{ root: string; path: string[] }> {
  const matches = expression.matchAll(/\bresults\.([A-Za-z0-9_-]+)((?:\.[A-Za-z0-9_-]+)*)/g);
  const refs = new Map<string, { root: string; path: string[] }>();
  for (const match of matches) {
    const root = match[1];
    const path = match[2]
      .split(".")
      .filter(Boolean);
    const key = `${root}:${path.join(".")}`;
    refs.set(key, { root, path });
  }
  return [...refs.values()];
}
function validateNestedResultsRefPath(
rootId: string,
path: string[],
module: Record<string, unknown>,
context: string,
unresolved: Set<string>
): void {
if (path.length === 0) {
return;
}
const moduleType = getModuleType(module);
if (!moduleType || !CONTROL_FLOW_MODULE_TYPES.has(moduleType)) {
return;
}
const nestedIds = new Set(getImmediateNestedModuleIds(module));
const [firstSegment] = path;
if (nestedIds.has(firstSegment)) {
unresolved.add(
`${context} references nested results.${rootId}.${firstSegment} inside ${moduleType} ${rootId}`
);
}
}
function getAllFlowModules(flow: FlowState): Array<Record<string, unknown>> {
const modules: Array<Record<string, unknown>> = [];
const specialModules = ["preprocessor_module", "failure_module"] as const;
for (const key of specialModules) {
const specialModule = getSpecialFlowModule(flow, key);
if (specialModule) {
modules.push(specialModule);
modules.push(...collectNestedModules(specialModule));
}
}
for (const module of getFlowModules(flow)) {
modules.push(module);
modules.push(...collectNestedModules(module));
}
return modules;
}
function collectNestedModules(module: Record<string, unknown>): Array<Record<string, unknown>> {
const nested: Array<Record<string, unknown>> = [];
const value = isObjectRecord(module.value) ? module.value : null;
if (!value) {
return nested;
}
if (Array.isArray(value.modules)) {
for (const child of asModuleArray(value.modules)) {
nested.push(child, ...collectNestedModules(child));
}
}
if (Array.isArray(value.default)) {
for (const child of asModuleArray(value.default)) {
nested.push(child, ...collectNestedModules(child));
}
}
if (Array.isArray(value.branches)) {
for (const branch of value.branches) {
if (!isObjectRecord(branch) || !Array.isArray(branch.modules)) {
continue;
}
for (const child of asModuleArray(branch.modules)) {
nested.push(child, ...collectNestedModules(child));
}
}
}
return nested;
}
function findFlowModuleById(flow: FlowState, id: string): Record<string, unknown> | null {
for (const module of getAllFlowModules(flow)) {
if (module.id === id) {
return module;
}
}
return null;
}
function getInlineScriptPlaceholderModuleIds(flow: FlowState): string[] {
return getAllFlowModules(flow).flatMap((module) => {
const code = getModuleCode(module)?.trim();
if (!code || !/^inline_script\.[A-Za-z0-9_-]+$/.test(code)) {
return [];
}
if (typeof module.id === "string" && module.id.length > 0) {
return [module.id];
}
return ["(unnamed)"];
});
}
function getImmediateNestedModuleIds(module: Record<string, unknown>): string[] {
const ids: string[] = [];
const value = isObjectRecord(module.value) ? module.value : null;
if (!value) {
return ids;
}
if (Array.isArray(value.modules)) {
ids.push(...asModuleArray(value.modules).flatMap((child) => (typeof child.id === "string" ? [child.id] : [])));
}
if (Array.isArray(value.default)) {
ids.push(...asModuleArray(value.default).flatMap((child) => (typeof child.id === "string" ? [child.id] : [])));
}
if (Array.isArray(value.branches)) {
for (const branch of value.branches) {
if (!isObjectRecord(branch) || !Array.isArray(branch.modules)) {
continue;
}
ids.push(
...asModuleArray(branch.modules).flatMap((child) => (typeof child.id === "string" ? [child.id] : []))
);
}
}
return ids;
}
function getModuleCode(module: Record<string, unknown>): string | null {
const value = isObjectRecord(module.value) ? module.value : null;
return typeof value?.content === "string" ? value.content : null;
}
function asModuleArray(value: unknown[]): Array<Record<string, unknown>> {
return value.filter(isObjectRecord);
}
function isObjectRecord(value: unknown): value is Record<string, any> {
return typeof value === "object" && value !== null && !Array.isArray(value);
}
function getSpecialFlowModule(
flow: FlowState,
key: "preprocessor_module" | "failure_module"
): Record<string, unknown> | null {
if (!flow.value || typeof flow.value !== "object") {
return null;
}
const module = (flow.value as Record<string, unknown>)[key];
return module && typeof module === "object" ? (module as Record<string, unknown>) : null;
}
function getModuleType(module: Record<string, unknown>): string | null {
const value = module.value;
if (!value || typeof value !== "object") {
return null;
}
return typeof (value as Record<string, unknown>).type === "string"
? ((value as Record<string, string>).type)
: null;
}
function getModulePath(module: Record<string, unknown>): string | null {
const value = module.value;
if (!value || typeof value !== "object") {
return null;
}
return typeof (value as Record<string, unknown>).path === "string"
? ((value as Record<string, string>).path)
: null;
}
function hasSuspendConfig(module: Record<string, unknown>): boolean {
return typeof module.suspend === "object" && module.suspend !== null;
}
function getSuspendRequiredEvents(module: Record<string, unknown>): number | null {
const suspend = isObjectRecord(module.suspend) ? module.suspend : null;
return typeof suspend?.required_events === "number" ? suspend.required_events : null;
}
function getSuspendResumeStringFields(module: Record<string, unknown>): string[] {
const suspend = isObjectRecord(module.suspend) ? module.suspend : null;
const resumeForm = isObjectRecord(suspend?.resume_form) ? suspend.resume_form : null;
const schema = isObjectRecord(resumeForm?.schema) ? resumeForm.schema : null;
const properties = isObjectRecord(schema?.properties) ? schema.properties : null;
if (!properties) {
return [];
}
return Object.entries(properties).flatMap(([field, property]) => {
if (!isObjectRecord(property) || property.type !== "string") {
return [];
}
return [field];
});
}
function appStatesEqual(left: AppFilesState, right: AppFilesState): boolean {
return fileMapsEqual(left.frontend, right.frontend) && fileMapsEqual(stringifyBackend(left.backend), stringifyBackend(right.backend));
}
function stringifyBackend(backend: Record<string, AppRunnableState>): Record<string, string> {
const result: Record<string, string> = {};
for (const [key, value] of Object.entries(backend)) {
result[key] = JSON.stringify(value);
}
return result;
}
function fileMapsEqual(left: Record<string, string>, right: Record<string, string>): boolean {
const leftEntries = Object.entries(left).sort(([a], [b]) => a.localeCompare(b));
const rightEntries = Object.entries(right).sort(([a], [b]) => a.localeCompare(b));
if (leftEntries.length !== rightEntries.length) {
return false;
}
return leftEntries.every(([key, value], index) => {
const [otherKey, otherValue] = rightEntries[index];
return key === otherKey && normalizeText(value) === normalizeText(otherValue);
});
}
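
Reviewer note: the `results.*` reference scan in validators.ts rests on a single regex pass. The following is a minimal standalone sketch of that pass, not the file's own code. The name `extractRefsSketch` and the sample expression are assumptions made for illustration.

```typescript
// Standalone sketch of the regex pass used by extractResultsRefs in validators.ts.
// `extractRefsSketch` and the sample expression are illustrative, not part of the diff.
type ResultsRef = { root: string; path: string[] };

function extractRefsSketch(expression: string): ResultsRef[] {
  const refs = new Map<string, ResultsRef>();
  const pattern = /\bresults\.([A-Za-z0-9_-]+)((?:\.[A-Za-z0-9_-]+)*)/g;
  for (const match of expression.matchAll(pattern)) {
    const root = match[1];
    const path = match[2].split(".").filter(Boolean);
    // Key by root and path so repeated references collapse into one entry.
    refs.set(`${root}:${path.join(".")}`, { root, path });
  }
  return [...refs.values()];
}

const refs = extractRefsSketch(
  "results.fetch.rows.length + results.sum + results.fetch.rows.length"
);
console.log(JSON.stringify(refs));
```

The duplicate `results.fetch.rows.length` collapses into one entry, which is why the unresolved-reference messages in the validator are reported once per distinct reference.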

@@ -0,0 +1,2 @@
main(name: string)
greeting: `Hello, ${name}!`
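
Reviewer note: this two-line file reads like an expected-content fixture for the CLI workspace check, where each non-empty line is treated as a substring that must appear in the generated file. That pairing is an assumption, since the file name is not shown here. A minimal sketch of the matching rule, with `containsExpectedSnippets` as a hypothetical stand-in for `cliFileContainsExpectedContent`:

```typescript
// Sketch of line-snippet matching; mirrors the logic of cliFileContainsExpectedContent.
// `containsExpectedSnippets` is an illustrative name, not part of the diff.
function containsExpectedSnippets(actual: string, expected: string): boolean {
  const snippets = expected
    .replace(/\r\n/g, "\n")
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  const normalizedActual = actual.replace(/\r\n/g, "\n");
  return snippets.every((snippet) => normalizedActual.includes(snippet));
}

// Double-quoted strings keep `${name}` literal, as it appears in the fixture.
const expectedSnippets = "main(name: string)\ngreeting: `Hello, ${name}!`\n";
const generated =
  "export async function main(name: string) {\n  return { greeting: `Hello, ${name}!` };\n}\n";

const matchesFixture = containsExpectedSnippets(generated, expectedSnippets);
const matchesAfterRename = containsExpectedSnippets(
  generated.replace("greeting", "message"),
  expectedSnippets
);
console.log(matchesFixture, matchesAfterRename);
```

Because the rule is substring-based, formatting differences around a snippet are tolerated, while renaming a key inside a snippet is not.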

@@ -0,0 +1,3 @@
export async function main(name: string) {
return { greeting: `Hello, ${name}!` };
}
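
Reviewer note: a script like this passes the "script exports entrypoint" check because of the `export` keyword. Below is a small sketch of the two regexes from `hasSupportedEntrypoint`, copied into a hypothetical helper named `detectsEntrypoint`, to show which shapes are accepted.

```typescript
// Sketch of the regex-based entrypoint detection from validators.ts.
// `detectsEntrypoint` is an illustrative name, not part of the diff.
function detectsEntrypoint(code: string): boolean {
  return (
    /export\s+(async\s+)?function\s+main\s*\(/.test(code) ||
    /export\s+default\s+(async\s+)?function\s*\(/.test(code)
  );
}

const exportedMain = detectsEntrypoint("export async function main(name: string) {}");
const unexportedMain = detectsEntrypoint("async function main(name: string) {}");
const anonymousDefault = detectsEntrypoint("export default async function (name: string) {}");
// A named default export matches neither pattern as the regexes are written.
const namedDefault = detectsEntrypoint("export default async function main(name: string) {}");
console.log(exportedMain, unexportedMain, anonymousDefault, namedDefault);
```

The last case may be worth a follow-up: whether named default exports should count as an entrypoint is a design decision the diff does not state.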

@@ -0,0 +1,2 @@
type: script
path: f/lib/format_greeting

@@ -0,0 +1,3 @@
export async function main(name: string) {
return { greeting: `Hello, ${name}!` };
}

@@ -0,0 +1,2 @@
def main(
return {"total": a + b}

@@ -0,0 +1,20 @@
summary: Simple greeting flow
schema:
type: object
properties:
name:
type: string
description: Name to greet
required:
- name
value:
modules:
- id: hello_step
value:
type: rawscript
language: bun
content: !inline hello.ts
input_transforms:
name:
type: javascript
expr: flow_input.name
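
Reviewer note: the `schema` block of this flow is the kind of structure the dotted-path lookup in validators.ts walks. A minimal sketch, assuming the YAML has already been parsed into a plain object; `hasSchemaPathSketch` is a hypothetical copy of `hasSchemaPath`, and the `name.first` path is made up to show a miss.

```typescript
// Sketch of the dotted-path schema lookup; mirrors hasSchemaPath in validators.ts.
// `hasSchemaPathSketch` and the "name.first" path are illustrative only.
type Schema = Record<string, unknown>;

function hasSchemaPathSketch(schema: Schema | undefined, dottedPath: string): boolean {
  const segments = dottedPath.split(".").filter(Boolean);
  if (!schema || segments.length === 0) {
    return false;
  }
  let current: Schema = schema;
  for (const segment of segments) {
    // Each segment must be found under the current node's `properties`.
    const properties = current.properties;
    if (!properties || typeof properties !== "object") {
      return false;
    }
    const next = (properties as Schema)[segment];
    if (!next || typeof next !== "object") {
      return false;
    }
    current = next as Schema;
  }
  return true;
}

// The schema from the greeting flow above, as a parsed object.
const greetingSchema: Schema = {
  type: "object",
  properties: { name: { type: "string", description: "Name to greet" } },
  required: ["name"],
};

const hasName = hasSchemaPathSketch(greetingSchema, "name");
const hasNested = hasSchemaPathSketch(greetingSchema, "name.first");
console.log(hasName, hasNested);
```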

@@ -0,0 +1,3 @@
export async function main(name: string) {
return { greeting: `Hello, ${name}!` };
}

@@ -0,0 +1,3 @@
export async function main(name: string) {
return { greeting: `Hello, ${name}!` };
}

@@ -0,0 +1,3 @@
export async function main(name: string) {
return { greeting: `Hello, ${name}!` };
}

@@ -0,0 +1,31 @@
{
"summary": "",
"value": {
"modules": [
{
"id": "sum_numbers",
"value": {
"type": "rawscript",
"language": "bun",
"content": "export async function main(a: number, b: number) {\n return a + b;\n}",
"input_transforms": {
"a": {
"type": "javascript",
"expr": "flow_input.a"
},
"b": {
"type": "javascript",
"expr": "flow_input.b"
}
}
}
}
]
},
"schema": {
"$schema": "https://json-schema.org/draft/2020-12/schema",
"properties": {},
"required": [],
"type": "object"
}
}

@@ -0,0 +1,30 @@
{
"value": {
"modules": [
{
"id": "count_until_target",
"value": {
"type": "whileloopflow"
}
},
{
"id": "return_final_count",
"value": {
"type": "rawscript"
}
}
]
},
"schema": {
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"properties": {
"target": {
"type": "number"
}
},
"required": [
"target"
]
}
}
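
Reviewer note: an expected flow like this one constrains step order only loosely. The order check in validators.ts is a subsequence test, so extra steps are allowed around the expected ids. A minimal sketch follows; `isSubsequence` is a hypothetical equivalent of `preservesRelativeOrder`, and the ids `init_counter` and `log_progress` are invented for the example.

```typescript
// Sketch of the subsequence-style order check; equivalent in behavior to
// preservesRelativeOrder in validators.ts. `isSubsequence` is an illustrative name.
function isSubsequence(actualIds: string[], expectedIds: string[]): boolean {
  let cursor = 0;
  for (const id of actualIds) {
    if (cursor < expectedIds.length && id === expectedIds[cursor]) {
      cursor += 1;
    }
  }
  return cursor === expectedIds.length;
}

const expectedOrder = ["count_until_target", "return_final_count"];

// Extra steps (hypothetical ids) do not break the expected order.
const withExtras = isSubsequence(
  ["init_counter", "count_until_target", "log_progress", "return_final_count"],
  expectedOrder
);
// Swapping the two expected steps does.
const swapped = isSubsequence(["return_final_count", "count_until_target"], expectedOrder);
console.log(withExtras, swapped);
```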

@@ -0,0 +1,36 @@
{
"value": {
"preprocessor_module": {
"id": "preprocessor",
"value": {
"type": "rawscript"
}
},
"failure_module": {
"id": "failure",
"value": {
"type": "rawscript"
}
},
"modules": [
{
"id": "process_event",
"value": {
"type": "rawscript"
}
}
]
},
"schema": {
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"properties": {
"payload": {
"type": "string"
}
},
"required": [
"payload"
]
}
}

@@ -0,0 +1,44 @@
{
"value": {
"modules": [
{
"id": "request_approval",
"suspend": {
"required_events": 1,
"resume_form": {
"schema": {
"approver_comment": {
"type": "string"
}
}
}
},
"value": {
"type": "rawscript"
}
},
{
"id": "finalize_purchase",
"value": {
"type": "rawscript"
}
}
]
},
"schema": {
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"properties": {
"requester_email": {
"type": "string"
},
"amount": {
"type": "number"
}
},
"required": [
"requester_email",
"amount"
]
}
}
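
Reviewer note: in this fixture, `resume_form.schema` lists `approver_comment` directly rather than under a `properties` key. The string-field lookup in validators.ts reads `schema.properties`, so if that lookup were applied to a step shaped like this one, it would find no fields. Whether this fixture is ever passed through that lookup is not visible in the diff, so treat the following as a sketch under that assumption. `resumeStringFields` is a hypothetical copy of `getSuspendResumeStringFields`.

```typescript
// Sketch of the resume-form string-field lookup; mirrors getSuspendResumeStringFields.
// `resumeStringFields` and `wrappedStep` are illustrative, not part of the diff.
type Json = Record<string, unknown>;

function isRecord(value: unknown): value is Json {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function resumeStringFields(step: Json): string[] {
  const suspend = isRecord(step.suspend) ? step.suspend : null;
  const resumeForm = suspend && isRecord(suspend.resume_form) ? suspend.resume_form : null;
  const schema = resumeForm && isRecord(resumeForm.schema) ? resumeForm.schema : null;
  const properties = schema && isRecord(schema.properties) ? schema.properties : null;
  if (!properties) {
    return [];
  }
  return Object.entries(properties)
    .filter(([, property]) => isRecord(property) && property.type === "string")
    .map(([field]) => field);
}

// Shape used by the fixture above: field sits directly under `schema`.
const fixtureStep: Json = {
  id: "request_approval",
  suspend: {
    required_events: 1,
    resume_form: { schema: { approver_comment: { type: "string" } } },
  },
};

// Same field wrapped in `properties`, the shape the lookup reads.
const wrappedStep: Json = {
  id: "request_approval",
  suspend: {
    required_events: 1,
    resume_form: { schema: { properties: { approver_comment: { type: "string" } } } },
  },
};

const fixtureFields = resumeStringFields(fixtureStep);
const wrappedFields = resumeStringFields(wrappedStep);
console.log(fixtureFields, wrappedFields);
```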

@@ -0,0 +1,39 @@
{
"value": {
"modules": [
{
"id": "sum_numbers",
"value": {
"type": "script",
"path": "f/evals/add_two_numbers.ts",
"input_transforms": {
"a": {
"type": "javascript",
"expr": "flow_input.a"
},
"b": {
"type": "javascript",
"expr": "flow_input.b"
}
}
}
}
]
},
"schema": {
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"properties": {
"a": {
"type": "number"
},
"b": {
"type": "number"
}
},
"required": [
"a",
"b"
]
}
}

@@ -0,0 +1,39 @@
{
"value": {
"modules": [
{
"id": "call_add_numbers",
"value": {
"type": "flow",
"path": "f/evals/add_numbers_flow",
"input_transforms": {
"a": {
"type": "javascript",
"expr": "flow_input.a"
},
"b": {
"type": "javascript",
"expr": "flow_input.b"
}
}
}
}
]
},
"schema": {
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"properties": {
"a": {
"type": "number"
},
"b": {
"type": "number"
}
},
"required": [
"a",
"b"
]
}
}

Some files were not shown because too many files have changed in this diff.