Compare commits

2 Commits

| Author | SHA1 | Message | Date |
|--------|------|---------|------|
| Ruben Fiszel | 1aac43977e | test(cli): add non-dotted path tests for generate-metadata and sync pull (Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>) | 2026-03-15 04:30:21 +00:00 |
| Travis | 05d4d6dd48 | fix(cli): preserve non-dotted flow lock filenames | 2026-03-14 22:41:59 -04:00 |
1808 changed files with 21712 additions and 125275 deletions

@@ -1,59 +0,0 @@
---
name: commit
description: Create a git commit with conventional commit format. MUST use anytime you want to commit changes.
---
# Git Commit Skill
Create a focused, single-line commit following conventional commit conventions.
## Instructions
1. **Analyze changes**: Run `git status` and `git diff` to understand what was modified
2. **Stage only modified files**: Add files individually by name. NEVER use `git add -A` or `git add .`
3. **Write commit message**: Follow the conventional commit format as a single line
## Conventional Commit Format
```
<type>: <description>
```
### Types
- `feat`: New feature or capability
- `fix`: Bug fix
- `refactor`: Code change that neither fixes a bug nor adds a feature
- `docs`: Documentation only changes
- `style`: Formatting, missing semicolons, etc. (no code change)
- `test`: Adding or correcting tests
- `chore`: Maintenance tasks, dependency updates, etc.
- `perf`: Performance improvement
### Rules
- Subject MUST be a single line (no multi-line body); the `Co-Authored-By` trailer from the execution steps is the only extra line
- Description should be lowercase, imperative mood ("add" not "added")
- No period at the end
- Keep under 72 characters total
### Examples
```
feat: add token usage tracking for AI providers
fix: resolve null pointer in job executor
refactor: extract common validation logic
docs: update API endpoint documentation
chore: upgrade sqlx to 0.7
```
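The rules above can be checked mechanically. The following is a minimal sketch, assuming the plain `<type>: <description>` format shown here (no scopes); `check_subject` and its error strings are hypothetical names, and imperative mood is not checked because it cannot be verified by simple string inspection. "Lowercase" is interpreted as a lowercase first letter, since the examples contain acronyms such as `AI` and `API`.

```rust
/// Commit types listed in the "Types" section above.
const TYPES: [&str; 8] = [
    "feat", "fix", "refactor", "docs", "style", "test", "chore", "perf",
];

/// Hypothetical helper: returns Ok(()) when `subject` follows the rules above,
/// otherwise a short reason describing the first violated rule.
fn check_subject(subject: &str) -> Result<(), &'static str> {
    if subject.contains('\n') {
        return Err("subject must be a single line");
    }
    if subject.chars().count() >= 72 {
        return Err("subject must be under 72 characters");
    }
    // Split on the first ": " into the type and the description.
    let (kind, description) = subject
        .split_once(": ")
        .ok_or("expected `<type>: <description>`")?;
    if !TYPES.contains(&kind) {
        return Err("unknown commit type");
    }
    match description.chars().next() {
        None => return Err("description is empty"),
        Some(c) if c.is_uppercase() => return Err("description should start lowercase"),
        _ => {}
    }
    if description.ends_with('.') {
        return Err("no period at the end");
    }
    Ok(())
}
```

Note that this sketch rejects scoped types such as `fix(cli): ...`, because the format described in this skill has no scope component.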
## Execution Steps
1. Run `git status` to see all changes
2. Run `git diff` to understand the changes in detail
3. Run `git log --oneline -5` to see recent commit style
4. Stage ONLY the modified/relevant files: `git add <file1> <file2> ...`
5. Create the commit with conventional format:
```bash
git commit -m "<type>: <description>

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>"
```
6. Run `git status` to verify the commit succeeded

@@ -1,97 +0,0 @@
---
name: local-review
description: Code review a pull request for bugs and CLAUDE.md compliance. MUST use when asked to review code.
---
# Local Code Review Skill
Review a pull request for real bugs and CLAUDE.md compliance violations. This review targets HIGH SIGNAL issues only.
## Review Philosophy
- **Only flag issues you are certain about.** If you are not sure an issue is real, do not flag it. False positives erode trust and waste reviewer time.
- Think like a senior engineer doing a final review — flag things that would cause incidents, not things that are merely imperfect.
## What to Flag
- Code that won't compile or parse (syntax errors, type errors, missing imports)
- Code that will definitely produce wrong results regardless of inputs
- Clear, unambiguous CLAUDE.md violations (quote the exact rule being violated)
- Security issues in introduced code (injection, auth bypass, data exposure)
- Incorrect logic that will fail in production
## What NOT to Flag
- Code style or quality concerns
- Potential issues that depend on specific inputs or runtime state
- Subjective suggestions or improvements
- Pre-existing issues not introduced by this PR
- Pedantic nitpicks a senior engineer wouldn't flag
- Issues a linter or type checker will catch
- General quality concerns unless explicitly prohibited in CLAUDE.md
- Issues silenced via lint ignore comments
## Execution Steps
1. **Determine the PR scope**:
   - If an argument is provided, use it as the PR number or branch
   - Otherwise, detect from the current branch vs main
   - Run `gh pr view` if a PR exists, or use `git diff main...HEAD`
2. **Find relevant CLAUDE.md files**:
   - Read the root `CLAUDE.md`
   - Check for CLAUDE.md files in directories containing changed files
3. **Get the diff and metadata**:
   - `gh pr diff` or `git diff main...HEAD` for the full diff
   - `gh pr view` or `git log main..HEAD --oneline` for context
4. **Read changed files** where the diff alone is insufficient to understand context
5. **Review for**:
   - CLAUDE.md compliance — check each rule against the changed code
   - Bugs and logic errors — will this code work correctly?
   - Security issues — injection, auth, data exposure in new code
6. **Self-validate each finding**: Before reporting, ask yourself:
   - "Is this definitely a real issue, not a false positive?"
   - "Would a senior engineer flag this in review?"
   - If the answer to either is no, discard the finding
7. **Output findings** to the terminal (default) or post as PR comments (with `--comment` flag)
## Output Format
```
## Code review
Found N issues:
1. <description> (<reason: CLAUDE.md adherence | bug | security>)
<file_path:line_number>
2. <description> (<reason>)
<file_path:line_number>
```
If no issues are found:
```
## Code review
No issues found. Checked for bugs and CLAUDE.md compliance.
```
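The two output shapes above can be produced by one small formatter. This is an illustrative sketch only; `Finding`, `Reason`, and `render_review` are hypothetical names and not part of any existing tool.

```rust
/// Reason categories from the output format above.
enum Reason {
    ClaudeMd,
    Bug,
    Security,
}

/// One validated finding (hypothetical type for this sketch).
struct Finding {
    description: String,
    reason: Reason,
    file_path: String,
    line: u32,
}

/// Renders findings in the format described above, including the
/// "No issues found" variant when the list is empty.
fn render_review(findings: &[Finding]) -> String {
    let mut out = String::from("## Code review\n");
    if findings.is_empty() {
        out.push_str("No issues found. Checked for bugs and CLAUDE.md compliance.\n");
        return out;
    }
    out.push_str(&format!("Found {} issues:\n", findings.len()));
    for (index, finding) in findings.iter().enumerate() {
        let reason = match finding.reason {
            Reason::ClaudeMd => "CLAUDE.md adherence",
            Reason::Bug => "bug",
            Reason::Security => "security",
        };
        // Numbered description line followed by the file:line location.
        out.push_str(&format!(
            "{}. {} ({})\n{}:{}\n",
            index + 1,
            finding.description,
            reason,
            finding.file_path,
            finding.line
        ));
    }
    out
}
```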
## Posting Comments (--comment flag)
If the user passes `--comment`, post findings as a single review comment on the PR using:
```bash
gh pr review --comment --body "<summary>"
```
Or for inline comments on specific lines (replace `{pr}` with the PR number; `gh` only fills in `{owner}` and `{repo}` automatically):
```bash
gh api repos/{owner}/{repo}/pulls/{pr}/reviews -f body="<summary>" -f event="COMMENT" -f comments="[...]"
```

@@ -1,782 +0,0 @@
---
name: native-trigger
description: Guidance for adding native trigger services to Windmill. Use when implementing or modifying native trigger integrations across the backend and frontend.
---
# Skill: Adding Native Trigger Services
This skill provides comprehensive guidance for adding new native trigger services to Windmill. Native triggers allow external services (like Nextcloud, Google Drive, etc.) to trigger Windmill scripts/flows via webhooks or push notifications.
## Architecture Overview
The native trigger system consists of:
1. **Database Layer** - PostgreSQL tables and enum types
2. **Backend Rust Implementation** - Core trait, handlers, and service modules in the `windmill-native-triggers` crate
3. **Frontend Svelte Components** - Configuration forms and UI components
### Key Files
| Component | Path |
|-----------|------|
| Core module with `External` trait | `backend/windmill-native-triggers/src/lib.rs` |
| Generic CRUD handlers | `backend/windmill-native-triggers/src/handler.rs` |
| Background sync logic | `backend/windmill-native-triggers/src/sync.rs` |
| OAuth/workspace integration | `backend/windmill-native-triggers/src/workspace_integrations.rs` |
| Re-export shim (windmill-api) | `backend/windmill-api/src/native_triggers/mod.rs` |
| TriggerKind enum | `backend/windmill-common/src/triggers.rs` |
| JobTriggerKind enum | `backend/windmill-common/src/jobs.rs` |
| Frontend service registry | `frontend/src/lib/components/triggers/native/utils.ts` |
| Frontend trigger utilities | `frontend/src/lib/components/triggers/utils.ts` |
| Trigger badges (icons + counts) | `frontend/src/lib/components/graph/renderers/triggers/TriggersBadge.svelte` |
| Workspace integrations UI | `frontend/src/lib/components/workspaceSettings/WorkspaceIntegrations.svelte` |
| OAuth config form component | `frontend/src/lib/components/workspaceSettings/OAuthClientConfig.svelte` |
| OpenAPI spec | `backend/windmill-api/openapi.yaml` |
| Reference: Nextcloud module | `backend/windmill-native-triggers/src/nextcloud/` |
| Reference: Google module | `backend/windmill-native-triggers/src/google/` |
### Crate Structure
The native trigger code lives in the `windmill-native-triggers` crate (`backend/windmill-native-triggers/`). The `windmill-api` crate re-exports everything via a shim:
```rust
// backend/windmill-api/src/native_triggers/mod.rs
pub use windmill_native_triggers::*;
```
All new service modules go in `backend/windmill-native-triggers/src/`.
---
## Core Concepts
### The `External` Trait
Every native trigger service implements the `External` trait defined in `lib.rs`:
```rust
#[async_trait]
pub trait External: Send + Sync + 'static {
    // Associated types:
    type ServiceConfig: Debug + DeserializeOwned + Serialize + Send + Sync;
    type TriggerData: Debug + Serialize + Send + Sync;
    type OAuthData: DeserializeOwned + Serialize + Clone + Send + Sync;
    type CreateResponse: DeserializeOwned + Send + Sync;

    // Constants:
    const SUPPORT_WEBHOOK: bool;
    const SERVICE_NAME: ServiceName;
    const DISPLAY_NAME: &'static str;
    const TOKEN_ENDPOINT: &'static str;
    const REFRESH_ENDPOINT: &'static str;
    const AUTH_ENDPOINT: &'static str;

    // Required methods (parameter types elided; see the Step 3 implementation for full signatures):
    async fn create(&self, w_id, oauth_data, webhook_token, data, db, tx) -> Result<Self::CreateResponse>;
    async fn update(&self, w_id, oauth_data, external_id, webhook_token, data, db, tx) -> Result<serde_json::Value>;
    async fn get(&self, w_id, oauth_data, external_id, db, tx) -> Result<Self::TriggerData>;
    async fn delete(&self, w_id, oauth_data, external_id, db, tx) -> Result<()>;
    async fn exists(&self, w_id, oauth_data, external_id, db, tx) -> Result<bool>;
    async fn maintain_triggers(&self, db, workspace_id, triggers, oauth_data, synced, errors);
    fn external_id_and_metadata_from_response(&self, resp) -> (String, Option<serde_json::Value>);

    // Methods with defaults:
    async fn prepare_webhook(&self, db, w_id, headers, body, script_path, is_flow) -> Result<PushArgsOwned>;
    fn service_config_from_create_response(&self, data, resp) -> Option<serde_json::Value>;
    fn additional_routes(&self) -> axum::Router;
    async fn http_client_request<T, B>(&self, url, method, workspace_id, tx, db, headers, body) -> Result<T>;
}
```
Key design points:
- **`update()` returns `serde_json::Value`** - the resolved service_config to store. Each service is responsible for building the final config.
- **`maintain_triggers()`** - periodic background maintenance. Each service implements its own strategy (Nextcloud: reconcile with external state; Google: renew expiring channels).
- **No `list_all()` in the trait** - services that need it (Nextcloud) implement it privately; services that don't (Google) use different maintenance strategies.
- **No `get_external_id_from_trigger_data()` or `extract_service_config_from_trigger_data()`** - removed in favor of the `maintain_triggers` pattern.
### Create Lifecycle: Two Paths
The `create_native_trigger` handler in `handler.rs` supports two creation flows, controlled by `service_config_from_create_response()`:
**Path A: Short (Google pattern)** - `service_config_from_create_response()` returns `Some(config)`:
1. `create()` registers on external service
2. `external_id_and_metadata_from_response()` extracts the ID
3. `service_config_from_create_response()` builds the config directly from input data + response metadata
4. Stores trigger in DB -- done, no extra round-trip
Use this when the external_id is known before the create call (e.g., Google generates the channel_id as a UUID upfront and includes it in the webhook URL).
**Path B: Long (Nextcloud pattern)** - `service_config_from_create_response()` returns `None` (default):
1. `create()` registers on external service (webhook URL has no external_id yet)
2. `external_id_and_metadata_from_response()` extracts the ID
3. `update()` is called to fix the webhook URL with the now-known external_id
4. `update()` returns the resolved service_config
5. Stores trigger in DB
Use this when the external_id is assigned by the remote service and the webhook URL needs to be corrected after creation.
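The branching between the two paths can be sketched with a simplified, synchronous stand-in for the trait. Everything below is hypothetical illustration (`ExternalSketch`, `create_trigger`, string-typed configs); the real handler is async and passes database and OAuth parameters.

```rust
/// Simplified stand-in for `External` (hypothetical; the real trait is async).
trait ExternalSketch {
    /// Registers the trigger remotely; here the response is just the external id.
    fn create(&self, config: &str) -> String;

    /// Extracts the external id from the create response.
    fn external_id_from_response(&self, resp: &str) -> String {
        resp.to_string()
    }

    /// Default `None` selects Path B (update after create).
    fn service_config_from_create_response(&self, _config: &str, _resp: &str) -> Option<String> {
        None
    }

    /// Fixes up the remote webhook and returns the resolved config.
    fn update(&self, external_id: &str, config: &str) -> String;
}

/// Returns `(external_id, resolved_config)`, i.e. what would be stored in the DB.
fn create_trigger<S: ExternalSketch>(service: &S, config: &str) -> (String, String) {
    let resp = service.create(config);
    let external_id = service.external_id_from_response(&resp);
    let resolved = match service.service_config_from_create_response(config, &resp) {
        Some(resolved) => resolved,                   // Path A: no extra round-trip
        None => service.update(&external_id, config), // Path B: correct the webhook URL
    };
    (external_id, resolved)
}

/// Path A example: the id is known upfront, so the config is built directly.
struct ShortPath;
impl ExternalSketch for ShortPath {
    fn create(&self, _config: &str) -> String {
        "chan-1".to_string()
    }
    fn service_config_from_create_response(&self, config: &str, resp: &str) -> Option<String> {
        Some(format!("{};resource={}", config, resp))
    }
    fn update(&self, _: &str, _: &str) -> String {
        unreachable!("Path A never calls update during create")
    }
}

/// Path B example: the remote service assigns the id, so update runs after create.
struct LongPath;
impl ExternalSketch for LongPath {
    fn create(&self, _config: &str) -> String {
        "42".to_string()
    }
    fn update(&self, external_id: &str, config: &str) -> String {
        format!("{};webhook=/hook/{}", config, external_id)
    }
}
```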
### OAuth Token Storage (Three-Table Pattern)
OAuth tokens are stored across three tables, NOT in `workspace_integrations.oauth_data` directly:
| Table | What's Stored |
|-------|---------------|
| `workspace_integrations` | `oauth_data` JSON with `base_url`, `client_id`, `client_secret`, `instance_shared` flag; `resource_path` pointing to the variable |
| `variable` | Encrypted `access_token` (at the path stored in `resource_path`), linked to `account` via `account` column |
| `account` | `refresh_token`, keyed by `workspace_id` + `client` (service name) + `is_workspace_integration = true` |
The `decrypt_oauth_data()` function in `lib.rs` assembles these into a unified struct:
```rust
pub struct OAuthConfig {
    pub base_url: String,
    pub access_token: String,          // decrypted from variable
    pub refresh_token: Option<String>, // from account table
    pub client_id: String,             // from oauth_data or instance settings
    pub client_secret: String,         // from oauth_data or instance settings
}
```
Instance-level sharing: when `oauth_data.instance_shared == true`, `client_id` and `client_secret` are read from global settings instead of workspace_integrations.
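As a rough illustration of the `instance_shared` rule, the credential selection could look like the sketch below. The types and the `resolve_credentials` function are hypothetical and only model the decision; the actual `decrypt_oauth_data()` also reads the `variable` and `account` tables.

```rust
/// OAuth client credentials (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq)]
struct ClientCredentials {
    client_id: String,
    client_secret: String,
}

/// Hypothetical view of the relevant fields in `workspace_integrations.oauth_data`.
struct WorkspaceOAuthData {
    instance_shared: bool,
    client_id: Option<String>,
    client_secret: Option<String>,
}

/// Picks instance-level credentials when `instance_shared` is set,
/// otherwise the workspace's own credentials.
fn resolve_credentials(
    workspace: &WorkspaceOAuthData,
    instance: Option<&ClientCredentials>,
) -> Result<ClientCredentials, String> {
    if workspace.instance_shared {
        return instance
            .cloned()
            .ok_or_else(|| "instance-level OAuth credentials are not configured".to_string());
    }
    match (&workspace.client_id, &workspace.client_secret) {
        (Some(id), Some(secret)) => Ok(ClientCredentials {
            client_id: id.clone(),
            client_secret: secret.clone(),
        }),
        _ => Err("workspace OAuth credentials are incomplete".to_string()),
    }
}
```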
### URL Resolution
The `resolve_endpoint()` helper handles both absolute and relative OAuth URLs:
```rust
pub fn resolve_endpoint(base_url: &str, endpoint: &str) -> String {
    if endpoint.starts_with("http://") || endpoint.starts_with("https://") {
        endpoint.to_string() // Google: absolute URLs
    } else {
        format!("{}{}", base_url, endpoint) // Nextcloud: relative paths
    }
}
```
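A short usage sketch of the helper above, repeated here so the snippet is self-contained. The host `cloud.example.com` and the `normalize_base_url` helper are assumptions added for illustration: since the function concatenates without normalizing, a `base_url` ending in `/` combined with a relative endpoint starting with `/` would yield a double slash.

```rust
/// Same logic as `resolve_endpoint` above.
fn resolve_endpoint(base_url: &str, endpoint: &str) -> String {
    if endpoint.starts_with("http://") || endpoint.starts_with("https://") {
        endpoint.to_string() // absolute URL: base_url is ignored
    } else {
        format!("{}{}", base_url, endpoint) // relative path: appended to base_url
    }
}

/// Hypothetical helper: strips trailing slashes so that joining with a
/// relative endpoint such as "/oauth/token" does not produce "//".
fn normalize_base_url(base_url: &str) -> &str {
    base_url.trim_end_matches('/')
}

/// Example calls: a relative (self-hosted) endpoint and an absolute one.
fn example_endpoints() -> (String, String) {
    let relative = resolve_endpoint(
        normalize_base_url("https://cloud.example.com/"),
        "/apps/oauth2/api/v1/token",
    );
    let absolute = resolve_endpoint("", "https://oauth2.googleapis.com/token");
    (relative, absolute)
}
```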
### ServiceName Methods
`ServiceName` is the central registry enum. Each variant must implement these match arms:
| Method | Purpose |
|--------|---------|
| `as_str()` | Lowercase identifier (e.g., `"google"`) |
| `as_trigger_kind()` | Maps to `TriggerKind` enum |
| `as_job_trigger_kind()` | Maps to `JobTriggerKind` enum |
| `token_endpoint()` | OAuth token endpoint (relative or absolute) |
| `auth_endpoint()` | OAuth authorization endpoint |
| `oauth_scopes()` | Space-separated OAuth scopes |
| `resource_type()` | Resource type for token storage (e.g., `"gworkspace"`) |
| `extra_auth_params()` | Extra OAuth params (e.g., Google needs `access_type=offline`, `prompt=consent`) |
| `integration_service()` | Maps to the workspace integration service (usually `*self`) |
| `TryFrom<String>` | Parse from string |
| `Display` | Delegates to `as_str()` |
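
A reduced sketch of three rows of the table (`as_str()`, `Display`, `TryFrom<String>`), assuming the lowercase identifiers `"nextcloud"` and `"google"`. The error type and message are hypothetical; the real enum has more methods and may differ in detail.

```rust
use std::convert::TryFrom;
use std::fmt;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ServiceName {
    Nextcloud,
    Google,
}

impl ServiceName {
    /// Lowercase identifier used in routes and storage.
    fn as_str(&self) -> &'static str {
        match self {
            ServiceName::Nextcloud => "nextcloud",
            ServiceName::Google => "google",
        }
    }
}

impl fmt::Display for ServiceName {
    /// Delegates to `as_str()`, as described in the table.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.as_str())
    }
}

impl TryFrom<String> for ServiceName {
    type Error = String; // assumption: the real code likely uses its own error type

    fn try_from(value: String) -> Result<Self, Self::Error> {
        match value.as_str() {
            "nextcloud" => Ok(ServiceName::Nextcloud),
            "google" => Ok(ServiceName::Google),
            other => Err(format!("unknown native trigger service: {}", other)),
        }
    }
}
```

Because every method is an exhaustive `match`, adding a new variant makes the compiler point at each arm that still needs updating.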
---
## Step-by-Step Implementation Guide
### Step 1: Database Migration
Create a new migration file: `backend/migrations/YYYYMMDDHHMMSS_newservice_trigger.up.sql`
```sql
-- Add the service to the native_trigger_service enum
ALTER TYPE native_trigger_service ADD VALUE IF NOT EXISTS 'newservice';
-- Add to TRIGGER_KIND enum (used for trigger tracking)
ALTER TYPE TRIGGER_KIND ADD VALUE IF NOT EXISTS 'newservice';
-- Add to job_trigger_kind enum (used for job tracking)
ALTER TYPE job_trigger_kind ADD VALUE IF NOT EXISTS 'newservice';
```
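Also create the corresponding down migration. PostgreSQL has no `ALTER TYPE ... DROP VALUE`, so the down migration cannot simply remove the added enum values; it is typically left as a no-op.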
### Step 2: Update windmill-common Enums
#### `backend/windmill-common/src/triggers.rs`
Add variant to `TriggerKind` enum, and update `to_key()` and `fmt()` implementations.
#### `backend/windmill-common/src/jobs.rs`
Add variant to `JobTriggerKind` enum and update the `Display` implementation.
### Step 3: Backend Service Module
Create a new directory: `backend/windmill-native-triggers/src/newservice/`
#### `mod.rs` - Type Definitions
```rust
use serde::{Deserialize, Serialize};

pub mod external;
// pub mod routes; // Only if you need additional service-specific routes

/// OAuth data deserialized from the three-table pattern.
/// The actual structure is built by decrypt_oauth_data() from variable + account + workspace_integrations.
#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct NewServiceOAuthData {
    pub base_url: String,              // from workspace_integrations.oauth_data
    pub access_token: String,          // decrypted from variable table
    pub refresh_token: Option<String>, // from account table
    // Note: client_id and client_secret are in OAuthConfig, not here
    // unless the service needs them at runtime for API calls
}

/// Configuration provided by user when creating/updating a trigger.
/// Stored as JSON in native_trigger.service_config.
#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(rename_all = "camelCase")]
pub struct NewServiceConfig {
    // Service-specific configuration fields
    pub folder_path: String,
    pub file_filter: Option<String>,
}

/// Data retrieved from the external service about a trigger.
/// Returned by the get() method and shown in the UI.
#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(rename_all = "camelCase")]
pub struct NewServiceTriggerData {
    // Fields that shouldn't affect service_config comparison should use #[serde(skip_serializing)].
    // The id is read in maintain_triggers() (illustrative field; adapt to the service's API).
    #[serde(skip_serializing)]
    pub id: String,
    pub folder_path: String,
    pub file_filter: Option<String>,
}

/// Response from external service when creating a trigger/webhook.
#[derive(Debug, Deserialize)]
pub struct CreateTriggerResponse {
    pub id: String,
}

/// Handler struct (stateless, used for routing)
#[derive(Copy, Clone)]
pub struct NewService;
```
#### `external.rs` - External Trait Implementation
```rust
use async_trait::async_trait;
use reqwest::Method;
use sqlx::PgConnection;
use std::collections::HashMap;
use windmill_common::{
    error::{Error, Result},
    BASE_URL, DB,
};

use crate::{
    generate_webhook_service_url, External, NativeTrigger, NativeTriggerData, ServiceName,
    sync::{SyncError, TriggerSyncInfo},
};

use super::{NewService, NewServiceConfig, NewServiceOAuthData, NewServiceTriggerData, CreateTriggerResponse};

#[async_trait]
impl External for NewService {
    type ServiceConfig = NewServiceConfig;
    type TriggerData = NewServiceTriggerData;
    type OAuthData = NewServiceOAuthData;
    type CreateResponse = CreateTriggerResponse;

    const SERVICE_NAME: ServiceName = ServiceName::NewService;
    const DISPLAY_NAME: &'static str = "New Service";
    const SUPPORT_WEBHOOK: bool = true;
    const TOKEN_ENDPOINT: &'static str = "/oauth/token";
    const REFRESH_ENDPOINT: &'static str = "/oauth/token";
    const AUTH_ENDPOINT: &'static str = "/oauth/authorize";

    async fn create(
        &self,
        w_id: &str,
        oauth_data: &Self::OAuthData,
        webhook_token: &str,
        data: &NativeTriggerData<Self::ServiceConfig>,
        db: &DB,
        tx: &mut PgConnection,
    ) -> Result<Self::CreateResponse> {
        let base_url = &*BASE_URL.read().await;

        // external_id is None during create (we get it from the response)
        let webhook_url = generate_webhook_service_url(
            base_url, w_id, &data.script_path, data.is_flow,
            None, Self::SERVICE_NAME, webhook_token,
        );

        let url = format!("{}/api/webhooks/create", oauth_data.base_url);
        let payload = serde_json::json!({
            "callback_url": webhook_url,
            "folder_path": data.service_config.folder_path,
        });

        let response: CreateTriggerResponse = self
            .http_client_request(&url, Method::POST, w_id, tx, db, None, Some(&payload))
            .await?;
        Ok(response)
    }

    /// Update returns the resolved service_config as JSON.
    /// For services using the update+get pattern, call self.get() and serialize.
    async fn update(
        &self,
        w_id: &str,
        oauth_data: &Self::OAuthData,
        external_id: &str,
        webhook_token: &str,
        data: &NativeTriggerData<Self::ServiceConfig>,
        db: &DB,
        tx: &mut PgConnection,
    ) -> Result<serde_json::Value> {
        let base_url = &*BASE_URL.read().await;
        let webhook_url = generate_webhook_service_url(
            base_url, w_id, &data.script_path, data.is_flow,
            Some(external_id), Self::SERVICE_NAME, webhook_token,
        );

        let url = format!("{}/api/webhooks/{}", oauth_data.base_url, external_id);
        let payload = serde_json::json!({
            "callback_url": webhook_url,
            "folder_path": data.service_config.folder_path,
        });

        let _: serde_json::Value = self
            .http_client_request(&url, Method::PUT, w_id, tx, db, None, Some(&payload))
            .await?;

        // Fetch back the updated state to get the resolved config
        let trigger_data = self.get(w_id, oauth_data, external_id, db, tx).await?;
        serde_json::to_value(&trigger_data)
            .map_err(|e| Error::InternalErr(format!("Failed to serialize trigger data: {}", e)))
    }

    async fn get(
        &self,
        w_id: &str,
        oauth_data: &Self::OAuthData,
        external_id: &str,
        db: &DB,
        tx: &mut PgConnection,
    ) -> Result<Self::TriggerData> {
        let url = format!("{}/api/webhooks/{}", oauth_data.base_url, external_id);
        self.http_client_request::<_, ()>(&url, Method::GET, w_id, tx, db, None, None).await
    }

    async fn delete(
        &self,
        w_id: &str,
        oauth_data: &Self::OAuthData,
        external_id: &str,
        db: &DB,
        tx: &mut PgConnection,
    ) -> Result<()> {
        let url = format!("{}/api/webhooks/{}", oauth_data.base_url, external_id);
        // A 404 means the trigger is already gone remotely; treat it as success.
        let _: serde_json::Value = self
            .http_client_request::<_, ()>(&url, Method::DELETE, w_id, tx, db, None, None)
            .await
            .or_else(|e| match &e {
                Error::InternalErr(msg) if msg.contains("404") => Ok(serde_json::Value::Null),
                _ => Err(e),
            })?;
        Ok(())
    }

    async fn exists(
        &self,
        w_id: &str,
        oauth_data: &Self::OAuthData,
        external_id: &str,
        db: &DB,
        tx: &mut PgConnection,
    ) -> Result<bool> {
        match self.get(w_id, oauth_data, external_id, db, tx).await {
            Ok(_) => Ok(true),
            Err(Error::NotFound(_)) => Ok(false),
            Err(e) => Err(e),
        }
    }

    /// Background maintenance. Choose the right pattern for your service:
    /// - For services with queryable external state: use reconcile_with_external_state()
    /// - For channel-based services with expiration: implement renewal logic
    async fn maintain_triggers(
        &self,
        db: &DB,
        workspace_id: &str,
        triggers: &[NativeTrigger],
        oauth_data: &Self::OAuthData,
        synced: &mut Vec<TriggerSyncInfo>,
        errors: &mut Vec<SyncError>,
    ) {
        // Option A: Reconcile with external state (Nextcloud pattern)
        // Fetch all triggers from external service and compare with DB
        let external_triggers = match self.list_all(workspace_id, oauth_data, db).await {
            Ok(triggers) => triggers,
            Err(e) => {
                errors.push(SyncError {
                    resource_path: format!("workspace:{}", workspace_id),
                    error_message: format!("Failed to list triggers: {}", e),
                    error_type: "api_error".to_string(),
                });
                return;
            }
        };

        // Convert to (external_id, config_json) pairs
        let external_pairs: Vec<(String, serde_json::Value)> = external_triggers
            .into_iter()
            .map(|t| (t.id.clone(), serde_json::to_value(&t).unwrap_or_default()))
            .collect();

        crate::sync::reconcile_with_external_state(
            db, workspace_id, Self::SERVICE_NAME, triggers, &external_pairs, synced, errors,
        ).await;
    }

    fn external_id_and_metadata_from_response(
        &self,
        resp: &Self::CreateResponse,
    ) -> (String, Option<serde_json::Value>) {
        (resp.id.clone(), None)
    }

    // service_config_from_create_response: NOT overridden (returns None).
    // This means the handler uses the update+get pattern after create.
    // Override and return Some(...) to skip the update+get cycle (Google pattern).
}

impl NewService {
    /// Private helper to list all triggers from the external service.
    async fn list_all(
        &self,
        w_id: &str,
        oauth_data: &<Self as External>::OAuthData,
        db: &DB,
    ) -> Result<Vec<<Self as External>::TriggerData>> {
        // Implementation depends on the external service's API
        todo!()
    }
}
```
### Step 4: Update lib.rs Registry
In `backend/windmill-native-triggers/src/lib.rs`:
```rust
// Service modules - add new services here:
#[cfg(feature = "native_trigger")]
pub mod newservice; // <-- Add this

// ServiceName enum - add variant:
pub enum ServiceName {
    Nextcloud,
    Google,
    NewService, // <-- Add this
}

// Then add match arms in ALL ServiceName methods:
// as_str(), as_trigger_kind(), as_job_trigger_kind(), token_endpoint(),
// auth_endpoint(), oauth_scopes(), resource_type(), extra_auth_params(),
// integration_service(), TryFrom<String>, Display
```
### Step 5: Update handler.rs Routes
In `backend/windmill-native-triggers/src/handler.rs`:
```rust
pub fn generate_native_trigger_routers() -> Router {
    // ...
    #[cfg(feature = "native_trigger")]
    {
        use crate::newservice::NewService;
        return router
            .nest("/nextcloud", service_routes(NextCloud))
            .nest("/google", service_routes(Google))
            .nest("/newservice", service_routes(NewService)); // <-- Add this
    }
    // ...
}
```
### Step 6: Update sync.rs
In `backend/windmill-native-triggers/src/sync.rs`:
```rust
pub async fn sync_all_triggers(db: &DB) -> Result<BackgroundSyncResult> {
    // ...
    #[cfg(feature = "native_trigger")]
    {
        use crate::newservice::NewService;
        // ... existing service syncs ...

        // New service sync
        let (service_name, result) = sync_service_triggers(db, NewService).await;
        total_synced += result.synced_triggers.len();
        total_errors += result.errors.len();
        service_results.insert(service_name, result);
    }
    // ...
}
```
### Step 7: Frontend Service Registry
In `frontend/src/lib/components/triggers/native/utils.ts`:
Add to `NATIVE_TRIGGER_SERVICES`, `getTriggerIconName()`, and `getServiceIcon()`.
### Step 8: Frontend Trigger Form Component
Create: `frontend/src/lib/components/triggers/native/services/newservice/NewServiceTriggerForm.svelte`
### Step 9: Frontend Icon Component
Create: `frontend/src/lib/components/icons/NewServiceIcon.svelte`
### Step 10: Update NativeTriggerEditor
Check `frontend/src/lib/components/triggers/native/NativeTriggerEditor.svelte` to ensure it dynamically loads form components based on service name.
### Step 11: Workspace Integration UI
Add your service to the `supportedServices` map in `frontend/src/lib/components/workspaceSettings/WorkspaceIntegrations.svelte`:
```typescript
const supportedServices: Record<string, ServiceConfig> = {
    // ... existing services ...
    newservice: {
        name: 'newservice',
        displayName: 'New Service',
        description: 'Connect to New Service for triggers',
        icon: NewServiceIcon,
        docsUrl: 'https://www.windmill.dev/docs/integrations/newservice',
        requiresBaseUrl: false, // false for cloud services, true for self-hosted
        setupInstructions: [
            'Step 1: Create an OAuth app on the service',
            'Step 2: Configure the redirect URI shown below',
            'Step 3: Enter the client credentials below'
        ]
    }
}
```
### Step 12: Update `frontend/src/lib/components/triggers/utils.ts`
Update ALL of these maps/functions:
1. `triggerIconMap` - import and add icon
2. `triggerDisplayNamesMap` - add display name
3. `triggerTypeOrder` in `sortTriggers()` - add type
4. `getLightConfig()` - add case for your service
5. `getTriggerLabel()` - add case for your service
6. `jobTriggerKinds` - add to array
7. `countPropertyMap` - add count property
8. `triggerSaveFunctions` - add save function
### Step 13: Update TriggersBadge Component
In `frontend/src/lib/components/graph/renderers/triggers/TriggersBadge.svelte`:
1. Import the icon
2. Add to `baseConfig` with `countKey` (the dynamic `availableNativeServices` loop does NOT set `countKey`)
3. Add to the `allTypes` array
### Step 14: Update TriggersWrapper.svelte
In `frontend/src/lib/components/triggers/TriggersWrapper.svelte`:
Add a `{:else if selectedTrigger.type === 'newservice'}` case that renders `<NativeTriggersPanel service="newservice" ...>` with the same props pattern as the existing native trigger cases (e.g., `nextcloud`).
### Step 15: Update AddTriggersButton.svelte
In `frontend/src/lib/components/triggers/AddTriggersButton.svelte`:
1. Add a `newserviceAvailable` state variable
2. Add a `setNewserviceState()` async function using `isServiceAvailable('newservice', $workspaceStore!)`
3. Call it at module level
4. Add a dropdown entry to `addTriggerItems` with `hidden: !newserviceAvailable`
### Step 16: Update TriggersEditor.svelte Delete Handling
In `frontend/src/lib/components/triggers/TriggersEditor.svelte`:
Add your service to the `nativeTriggerServices` map in `deleteDeployedTrigger()`. Native triggers use `NativeTriggerService.deleteNativeTrigger({ workspace, serviceName, externalId })` instead of the standard `path`-based delete.
### Step 17: Update OpenAPI Spec and Regenerate Types
Add to `JobTriggerKind` enum in `backend/windmill-api/openapi.yaml`, then:
```bash
cd frontend && npm run generate-backend-client
```
---
## Special Patterns
### Unified Service with `trigger_type` (Google Pattern)
When a single service handles multiple trigger types (e.g., Google Drive + Calendar share OAuth and API patterns), use a single `ServiceName` variant with a discriminator field:
```rust
pub enum GoogleTriggerType { Drive, Calendar }

pub struct GoogleServiceConfig {
    pub trigger_type: GoogleTriggerType,
    // Drive-specific fields (only used when trigger_type = Drive)
    pub resource_id: Option<String>,
    pub resource_name: Option<String>,
    // Calendar-specific fields (only used when trigger_type = Calendar)
    pub calendar_id: Option<String>,
    pub calendar_name: Option<String>,
    // Metadata set after creation
    pub google_resource_id: Option<String>,
    pub expiration: Option<String>,
}
```
Branch in trait methods based on `trigger_type`. Frontend uses a `ToggleButtonGroup` to switch between types. This keeps the codebase simpler (one service, one OAuth flow, one set of routes).
See `backend/windmill-native-triggers/src/google/` for the reference implementation.
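Branching on the discriminator can be sketched as follows. This is a reduced, hypothetical version of the config (only the two id fields) with an illustrative `watch_target` helper that is not taken from the reference implementation; it shows how each trigger type reads only its own optional fields and reports a missing one.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum GoogleTriggerType {
    Drive,
    Calendar,
}

/// Reduced config for this sketch; the real struct has more fields.
struct GoogleServiceConfig {
    trigger_type: GoogleTriggerType,
    resource_id: Option<String>, // Drive only
    calendar_id: Option<String>, // Calendar only
}

/// Hypothetical helper: returns the id of the resource to watch for the
/// active trigger type, or an error naming the missing field.
fn watch_target(config: &GoogleServiceConfig) -> Result<&str, String> {
    let (field, value) = match config.trigger_type {
        GoogleTriggerType::Drive => ("resource_id", config.resource_id.as_deref()),
        GoogleTriggerType::Calendar => ("calendar_id", config.calendar_id.as_deref()),
    };
    value.ok_or_else(|| format!("{} is required for {:?} triggers", field, config.trigger_type))
}
```

Fields belonging to the other trigger type are ignored, so a stale `resource_id` on a Calendar config does not affect the result.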
### Skipping update+get After Create (Google Pattern)
Override `service_config_from_create_response()` to return `Some(config)` when the external_id is known before the create call:
```rust
fn service_config_from_create_response(
    &self,
    data: &NativeTriggerData<Self::ServiceConfig>,
    resp: &Self::CreateResponse,
) -> Option<serde_json::Value> {
    // Clone input config, add metadata from response
    let mut config = data.service_config.clone();
    config.google_resource_id = Some(resp.resource_id.clone());
    config.expiration = Some(resp.expiration.clone());
    Some(serde_json::to_value(&config).unwrap())
}
```
### Services with Absolute OAuth Endpoints (Google)
Unlike self-hosted services where OAuth endpoints are relative paths appended to `base_url`, services like Google have absolute URLs:
```rust
// Nextcloud: relative paths
ServiceName::Nextcloud => "/apps/oauth2/api/v1/token",
// Google: absolute URLs
ServiceName::Google => "https://oauth2.googleapis.com/token",
```
The `resolve_endpoint()` function handles both. For services with absolute endpoints:
- `base_url` can be empty
- `requiresBaseUrl: false` in the frontend workspace integration config
- Add `extra_auth_params()` if needed (Google requires `access_type=offline` and `prompt=consent`)
### Channel-Based Push Notifications with Renewal (Google Pattern)
For services using expiring watch channels instead of persistent webhooks:
1. Store expiration in `service_config` (as part of `ServiceConfig`)
2. In `maintain_triggers()`, implement renewal logic instead of using `reconcile_with_external_state()`:
```rust
async fn maintain_triggers(&self, db, workspace_id, triggers, oauth_data, synced, errors) {
    for trigger in triggers {
        if should_renew_channel(trigger) {
            self.renew_channel(db, trigger, oauth_data).await;
        }
    }
}
```
3. Renewal: best-effort stop old channel, create new one with same external_id, update service_config with new expiration
4. Google example: Drive channels expire in 24h (renew when <1h left), Calendar channels expire in 7 days (renew when <1 day left)
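
The renewal decision in the steps above could be sketched like this, using the thresholds quoted for Google (under 1 hour left for Drive, under 1 day left for Calendar). `ChannelKind`, `renewal_threshold`, and `should_renew` are hypothetical names; the real `should_renew_channel(trigger)` reads the expiration from the stored `service_config`.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy)]
enum ChannelKind {
    Drive,
    Calendar,
}

/// How much remaining lifetime triggers a renewal, per the Google example above.
fn renewal_threshold(kind: ChannelKind) -> Duration {
    match kind {
        ChannelKind::Drive => Duration::from_secs(60 * 60),         // renew when < 1h left
        ChannelKind::Calendar => Duration::from_secs(24 * 60 * 60), // renew when < 1 day left
    }
}

/// `time_left` is the remaining lifetime of the channel; pass `Duration::ZERO`
/// for a channel that has already expired.
fn should_renew(kind: ChannelKind, time_left: Duration) -> bool {
    time_left < renewal_threshold(kind)
}
```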
### reconcile_with_external_state (Nextcloud Pattern)
The reusable function in `sync.rs` compares external triggers with DB state:
- Triggers missing externally: sets error "Trigger no longer exists on external service"
- Triggers present externally: clears errors, updates service_config if it differs
Usage in `maintain_triggers()`:
```rust
let external_pairs: Vec<(String, serde_json::Value)> = /* fetch from external */;
crate::sync::reconcile_with_external_state(
db, workspace_id, Self::SERVICE_NAME, triggers, &external_pairs, synced, errors,
).await;
```
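To make the reconciliation rules concrete, here is an in-memory sketch of the comparison. It is not the actual `sync.rs` code: `StoredTrigger` and `reconcile` are hypothetical, configs are plain strings instead of JSON values, and no database is involved.

```rust
use std::collections::HashMap;

/// In-memory stand-in for a trigger row (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
struct StoredTrigger {
    external_id: String,
    service_config: String,
    error: Option<String>,
}

/// Applies the two rules above and returns how many triggers were changed.
fn reconcile(stored: &mut [StoredTrigger], external: &[(String, String)]) -> usize {
    let external_by_id: HashMap<&str, &str> = external
        .iter()
        .map(|(id, config)| (id.as_str(), config.as_str()))
        .collect();

    let mut changed = 0;
    for trigger in stored.iter_mut() {
        match external_by_id.get(trigger.external_id.as_str()) {
            // Missing externally: record the error (once).
            None => {
                let message = "Trigger no longer exists on external service".to_string();
                if trigger.error.as_ref() != Some(&message) {
                    trigger.error = Some(message);
                    changed += 1;
                }
            }
            // Present externally: clear errors, update config if it differs.
            Some(config) => {
                let mut touched = trigger.error.take().is_some();
                if trigger.service_config != *config {
                    trigger.service_config = config.to_string();
                    touched = true;
                }
                if touched {
                    changed += 1;
                }
            }
        }
    }
    changed
}
```

Running the function a second time on already reconciled data changes nothing, which is the property a periodic background sync needs.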
### Webhook Payload Processing
Override `prepare_webhook()` to parse service-specific payloads into script/flow args:
```rust
async fn prepare_webhook(&self, db, w_id, headers, body, script_path, is_flow) -> Result<PushArgsOwned> {
    let mut args = HashMap::new();
    args.insert("event_type".to_string(), Box::new(headers.get("x-event-type").cloned()) as _);
    args.insert("payload".to_string(), Box::new(serde_json::from_str::<serde_json::Value>(&body)?) as _);
    Ok(PushArgsOwned { extra: None, args })
}
```
Then register in `prepare_native_trigger_args()` in `lib.rs`:
```rust
pub async fn prepare_native_trigger_args(service_name, db, w_id, headers, body) -> Result<Option<PushArgsOwned>> {
    match service_name {
        ServiceName::Google => { /* ... */ Ok(Some(args)) }
        ServiceName::NewService => { /* ... */ Ok(Some(args)) }
        ServiceName::Nextcloud => Ok(None), // Uses default body parsing
    }
}
```
### Instance-Level OAuth Credentials
When `workspace_integrations.oauth_data.instance_shared == true`, `decrypt_oauth_data()` reads `client_id` and `client_secret` from instance-level global settings instead of workspace-level. This allows admins to share OAuth app credentials across workspaces.
The frontend handles this via the `generate_instance_connect_url` endpoint in `workspace_integrations.rs`.
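The credential selection described above can be sketched as follows. This is an assumption-laden illustration, not the body of `decrypt_oauth_data()`: `OAuthApp` and `select_oauth_app` are hypothetical names, and the sketch assumes there is no fallback from one source to the other.

```rust
/// Hypothetical pair of OAuth app credentials.
#[derive(Debug, Clone, PartialEq)]
struct OAuthApp {
    client_id: String,
    client_secret: String,
}

/// Picks the credential source based on the `instance_shared` flag.
/// Returns None when the selected source is not configured (assumed: no fallback).
fn select_oauth_app<'a>(
    instance_shared: bool,
    workspace: Option<&'a OAuthApp>,
    instance: Option<&'a OAuthApp>,
) -> Option<&'a OAuthApp> {
    if instance_shared {
        instance
    } else {
        workspace
    }
}
```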
---
## Testing Checklist
- [ ] Database migration runs successfully
- [ ] `cargo check -p windmill-native-triggers --features native_trigger` passes
- [ ] `npx svelte-check --threshold error` passes (in frontend/)
- [ ] Service appears in workspace integrations list
- [ ] OAuth flow completes successfully
- [ ] Can create a new trigger
- [ ] Can view trigger details
- [ ] Can update trigger configuration
- [ ] Can delete trigger
- [ ] Webhook receives and processes payloads
- [ ] Background sync works correctly (reconciliation or channel renewal)
- [ ] Error handling works (expired tokens, service unavailable)
---
## Reference Implementations
### Nextcloud (Self-Hosted, Update+Get Pattern)
| File | Purpose |
|------|---------|
| `nextcloud/mod.rs` | Types: NextCloudOAuthData, NextcloudServiceConfig, NextCloudTriggerData |
| `nextcloud/external.rs` | External trait: uses update+get pattern, reconcile_with_external_state for sync |
| `nextcloud/routes.rs` | Additional route: `GET /events` |
Key patterns: relative OAuth endpoints, base_url required, list_all + reconcile for sync, update returns JSON from get().
### Google (Cloud, Unified Service, Short Create)
| File | Purpose |
|------|---------|
| `google/mod.rs` | Types: GoogleServiceConfig with trigger_type discriminator, GoogleTriggerType enum |
| `google/external.rs` | External trait: overrides service_config_from_create_response, channel renewal for sync |
| `google/routes.rs` | Additional routes: `GET /calendars`, `GET /drive/files`, `GET /drive/shared_drives` |
Key patterns: absolute OAuth endpoints, empty base_url, trigger_type for Drive/Calendar, expiring watch channels with renewal, service_config_from_create_response skips update+get, get() reconstructs data from stored service_config (no external "get channel" API).


@@ -1,109 +0,0 @@
---
name: pr
description: Open a draft pull request on GitHub. MUST use when you want to create/open a PR.
---
# Pull Request Skill
Create a draft pull request with a clear title and explicit description of changes.
## Instructions
1. **Analyze branch changes**: Understand all commits since diverging from main
2. **Push to remote**: Ensure all commits are pushed
3. **Create draft PR**: Always open as draft for review before merging
## PR Title Format
Follow conventional commit format for the PR title:
```
<type>: <description>
```
### Types
- `feat`: New feature or capability
- `fix`: Bug fix
- `refactor`: Code restructuring
- `docs`: Documentation changes
- `chore`: Maintenance tasks
- `perf`: Performance improvements
### Title Rules
- Keep under 70 characters
- Use lowercase, imperative mood
- No period at the end
- If `*_ee.rs` files were modified, prefix with `[ee]`: `[ee] <type>: <description>`
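The mechanical title rules above can be checked with a short function. This is a sketch, not tooling that exists in the repository: `check_pr_title` is a hypothetical helper. It only checks that the description starts with a lowercase letter, since acronyms may appear later in a title, and it cannot check imperative mood.

```rust
/// Types listed in the PR skill.
const TYPES: [&str; 6] = ["feat", "fix", "refactor", "docs", "chore", "perf"];

/// Returns a list of rule violations; an empty list means the title passes.
fn check_pr_title(title: &str) -> Vec<&'static str> {
    let mut problems = Vec::new();
    if title.chars().count() >= 70 {
        problems.push("title must be under 70 characters");
    }
    // The optional `[ee] ` prefix is allowed in front of the conventional title.
    let rest = title.strip_prefix("[ee] ").unwrap_or(title);
    match rest.split_once(": ") {
        Some((ty, desc)) => {
            if !TYPES.contains(&ty) {
                problems.push("unknown type");
            }
            match desc.chars().next() {
                None => problems.push("description is empty"),
                Some(c) if c.is_uppercase() => {
                    problems.push("description should start with a lowercase letter")
                }
                Some(_) => {}
            }
            if desc.ends_with('.') {
                problems.push("description must not end with a period");
            }
        }
        None => problems.push("expected `<type>: <description>`"),
    }
    problems
}
```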
## PR Body Format
The body MUST be explicit about what changed. Structure:
```markdown
## Summary
<Clear description of what this PR does and why>
## Changes
- <Specific change 1>
- <Specific change 2>
- <Specific change 3>
## Test plan
- [ ] <How to verify change 1>
- [ ] <How to verify change 2>
---
Generated with [Claude Code](https://claude.com/claude-code)
```
## Execution Steps
1. Run `git status` to check for uncommitted changes
2. Run `git log main..HEAD --oneline` to see all commits in this branch
3. Run `git diff main...HEAD` to see the full diff against main
4. Check if remote branch exists and is up to date:
```bash
git rev-parse --abbrev-ref --symbolic-full-name @{u} 2>/dev/null || echo "no upstream"
```
5. Push to remote if needed: `git push -u origin HEAD`
6. Create draft PR using gh CLI:
```bash
gh pr create --draft --title "<type>: <description>" --body "$(cat <<'EOF'
## Summary
<description>
## Changes
- <change 1>
- <change 2>
## Test plan
- [ ] <test 1>
- [ ] <test 2>
---
Generated with [Claude Code](https://claude.com/claude-code)
EOF
)"
```
7. Return the PR URL to the user
## EE Companion PR (when `*_ee.rs` files were modified)
The `*_ee.rs` files in the windmill repo are **symlinks** to `windmill-ee-private` — changes won't appear in `git diff` of the windmill repo. Instead, check the EE repo for uncommitted or unpushed changes.
Follow the full EE PR workflow in `docs/enterprise.md`. The key PR-specific details:
1. Find the EE repo/worktree: see "Finding the EE Repo" in `docs/enterprise.md`
2. Check for changes: `git -C <ee-path> status --short`
- If there are no changes in the EE repo, skip this entire section
3. Follow steps 1-5 from the "EE PR Workflow" in `docs/enterprise.md`
4. Create the companion PR (title does NOT get the `[ee]` prefix):
```bash
gh pr create --draft --repo windmill-labs/windmill-ee-private --title "<type>: <description>" --body "$(cat <<'EOF'
Companion PR for windmill-labs/windmill#<PR_NUMBER>
---
Generated with [Claude Code](https://claude.com/claude-code)
EOF
)"
```
5. Commit `ee-repo-ref.txt` and push the updated windmill branch


@@ -1,38 +0,0 @@
---
name: refine
description: End-of-session reflection. Reviews friction encountered during the session and proposes updates to docs/ to capture lessons learned.
---
# Refine Skill
Reflect on the current session and update documentation with lessons learned.
## Instructions
1. **Identify friction**: Review what happened in this session:
- Run `git diff main...HEAD --stat` to see what files were touched
- Think about: what was slow, what failed, what required multiple attempts, what information was missing or hard to find
2. **Read current docs**: Read the docs that were relevant to this session:
- `docs/validation.md`
- `docs/enterprise.md`
- `docs/autonomous-mode.md`
- Any skills that were invoked
3. **Propose updates**: For each piece of friction, decide if it warrants a doc update:
- **Missing knowledge**: Information you had to discover that should be documented
- **Wrong guidance**: Instructions that led you astray
- **Missing validation rule**: A check that should be in the validation matrix
- **New pattern**: A codebase pattern worth capturing for next time
4. **Apply updates**: Edit the relevant `docs/` files. Keep changes minimal and specific — add only what would have saved time this session.
5. **Report**: Summarize what was added/changed and why.
## Rules
- Only add knowledge confirmed by this session — no speculative additions
- Keep docs concise — add a line or two, not a paragraph
- If a whole new doc is needed, create it in `docs/` and add a pointer in `CLAUDE.md`
- Don't update skills unless a coding pattern was genuinely wrong
- Don't add things Claude already knows — only Windmill-specific knowledge


@@ -1,107 +0,0 @@
---
name: rust-backend
description: Rust coding guidelines for the Windmill backend. MUST use when writing or modifying Rust code in the backend directory.
---
# Windmill Rust Patterns
Apply these Windmill-specific patterns when writing Rust code in `backend/`.
## Error Handling
Use `Error` from `windmill_common::error`. Return `Result<T, Error>` or `JsonResult<T>`:
```rust
use windmill_common::error::{Error, Result};
pub async fn get_job(db: &DB, id: Uuid) -> Result<Job> {
sqlx::query_as!(Job, "SELECT id, workspace_id FROM v2_job WHERE id = $1", id)
.fetch_optional(db)
.await?
.ok_or_else(|| Error::NotFound("job not found".to_string()))
}
```
Never panic in library code. Reserve `.unwrap()` for compile-time guarantees.
## SQLx Patterns
**Never use `SELECT *`** — always list columns explicitly. Critical for backwards compatibility when workers lag behind API version:
```rust
// Correct
sqlx::query_as!(Job, "SELECT id, workspace_id, path FROM v2_job WHERE id = $1", id)
// Wrong — breaks when columns are added
sqlx::query_as!(Job, "SELECT * FROM v2_job WHERE id = $1", id)
```
Use batch operations to avoid N+1:
```rust
// Preferred: a single query using `= ANY($1)` bound to a slice of ids
sqlx::query!("SELECT ... WHERE id = ANY($1)", &ids[..]).fetch_all(db).await?
```
Use transactions for multi-step operations. Parameterize all queries.
## JSON Handling
Prefer `Box<serde_json::value::RawValue>` over `serde_json::Value` when storing/passing JSON without inspection:
```rust
pub struct Job {
pub args: Option<Box<serde_json::value::RawValue>>,
}
```
Only use `serde_json::Value` when you need to inspect or modify the JSON.
## Serde Optimizations
```rust
#[derive(Serialize, Deserialize)]
pub struct Job {
#[serde(skip_serializing_if = "Option::is_none")]
pub parent_job: Option<Uuid>,
#[serde(skip_serializing_if = "Vec::is_empty")]
pub tags: Vec<String>,
#[serde(default)]
pub priority: i32,
}
```
## Async & Concurrency
Never block the async runtime. Use `spawn_blocking` for CPU-intensive work:
```rust
let result = tokio::task::spawn_blocking(move || expensive_computation(&data)).await?;
```
**Mutex selection**: Prefer `std::sync::Mutex` (or `parking_lot::Mutex`) for data protection. Only use `tokio::sync::Mutex` when holding locks across `.await` points.
Use `tokio::sync::mpsc` (bounded) for channels. Avoid `std::thread::sleep` in async contexts.
## Module Structure & Visibility
- Use `pub(crate)` instead of `pub` when possible
- Place new code in the appropriate crate based on functionality
- API endpoints go in `windmill-api/src/` organized by domain
- Shared functionality goes in `windmill-common/src/`
## Code Navigation
Always use rust-analyzer LSP for go-to-definition, find-references, and type info. Do not guess at module paths.
## Axum Handlers
Destructure extractors directly in function signatures:
```rust
async fn process_job(
Extension(db): Extension<DB>,
Path((workspace, job_id)): Path<(String, Uuid)>,
Query(pagination): Query<Pagination>,
) -> Result<Json<Job>> { ... }
```


@@ -1,80 +0,0 @@
---
name: svelte-frontend
description: Svelte coding guidelines for the Windmill frontend. MUST use when writing or modifying code in the frontend directory.
---
# Windmill Svelte Patterns
Apply these Windmill-specific patterns when writing Svelte code in `frontend/`. For general Svelte 5 syntax (runes, snippets, event handling), use the Svelte MCP server.
## Windmill UI Components (MUST use)
Always use Windmill's design-system components. Never use raw HTML elements.
### Buttons — `<Button>`
```svelte
<script>
import { Button } from '$lib/components/common'
import { ChevronLeft } from 'lucide-svelte'
</script>
<Button variant="default" onclick={handleClick}>Label</Button>
<Button startIcon={{ icon: ChevronLeft }} iconOnly onclick={prev} />
```
Props: `variant?: 'accent' | 'accent-secondary' | 'default' | 'subtle'`, `unifiedSize?: 'sm' | 'md' | 'lg'`, `startIcon?: { icon: SvelteComponent }`, `iconOnly?: boolean`, `disabled?: boolean`
### Text inputs — `<TextInput>`
```svelte
<script>
import { TextInput } from '$lib/components/common'
</script>
<TextInput bind:value={val} placeholder="Enter value" />
```
Props: `value?: string | number` (bindable), `placeholder?: string`, `disabled?: boolean`, `error?: string | boolean`, `size?: 'sm' | 'md' | 'lg'`
### Selects — `<Select>`
```svelte
<script>
import Select from '$lib/components/select/Select.svelte'
</script>
<Select items={[{ label: 'Jan', value: 1 }]} bind:value={selected} />
```
Props: `items?: Array<{ label?: string; value: any }>`, `value` (bindable), `placeholder?: string`, `clearable?: boolean`, `size?: 'sm' | 'md' | 'lg'`
### Icons — `lucide-svelte`
Never write inline SVGs. Import from `lucide-svelte`:
```svelte
<script>
import { ChevronLeft, X } from 'lucide-svelte'
</script>
<ChevronLeft size={16} />
```
## Form Components
Form components (TextInput, Toggle, Select, etc.) should use the unified size system when placed together.
## Styling
- Use Tailwind CSS for all styling — no custom CSS
- Use Windmill's theming classes for colors/surfaces (see `frontend/brand-guidelines.md`)
- Read component props JSDoc before using them
## Svelte MCP Server
Use the Svelte MCP tools when working on Svelte code:
1. **list-sections**: Call first to discover available docs
2. **get-documentation**: Fetch relevant sections based on use_cases
3. **svelte-autofixer**: MUST use on all Svelte code before finalizing — keep calling until no issues
4. **playground-link**: Only after user confirms and code was NOT written to project files


@@ -1,25 +0,0 @@
# Code Review Instructions
Review this pull request and provide comprehensive feedback.
## Focus Areas
- **Code quality and best practices** — does the code follow established patterns?
- **Potential bugs or issues** — will this code work correctly in all cases?
- **Performance considerations** — are there unnecessary allocations, N+1 queries, or bottlenecks?
- **Security implications** — injection, auth bypass, data exposure?
## CLAUDE.md Compliance
Read all relevant CLAUDE.md files (root and in directories containing changed files). Check each rule against the changed code. Quote the exact rule when flagging a violation.
## Review Guidelines
- Provide detailed feedback using inline comments for specific issues
- Use top-level comments for general observations or praise
- Only flag issues introduced by this PR, not pre-existing problems
- Self-validate each finding: "Is this definitely a real issue?" If uncertain, discard it
## Testing Instructions
At the end of your review, add complete instructions to reproduce the added changes through the app interface. These instructions will be given to a tester so they can verify the changes. It should be a short descriptive text (not a step-by-step or a list) on how to navigate the app (what page, what action, what input, etc.) to see the changes.


@@ -6,24 +6,53 @@ description: Code review a pull request for bugs and CLAUDE.md compliance. MUST
# Local Code Review Skill
Run the same review locally that the GitHub Claude Auto Review action runs on PRs. The shared review instructions live in `.claude/review-prompt.md` — read that file first and follow its instructions.
Review a pull request for real bugs and CLAUDE.md compliance violations. This review targets HIGH SIGNAL issues only.
## Review Philosophy
- **Only flag issues you are certain about.** If you are not sure an issue is real, do not flag it. False positives erode trust and waste reviewer time.
- Think like a senior engineer doing a final review — flag things that would cause incidents, not things that are merely imperfect.
## What to Flag
- Code that won't compile or parse (syntax errors, type errors, missing imports)
- Code that will definitely produce wrong results regardless of inputs
- Clear, unambiguous CLAUDE.md violations (quote the exact rule being violated)
- Security issues in introduced code (injection, auth bypass, data exposure)
- Incorrect logic that will fail in production
## What NOT to Flag
- Code style or quality concerns
- Potential issues that depend on specific inputs or runtime state
- Subjective suggestions or improvements
- Pre-existing issues not introduced by this PR
- Pedantic nitpicks a senior engineer wouldn't flag
- Issues a linter or type checker will catch
- General quality concerns unless explicitly prohibited in CLAUDE.md
- Issues silenced via lint ignore comments
## Execution Steps
1. **Read `.claude/review-prompt.md`** for the review criteria and focus areas
2. **Determine the PR scope**:
1. **Determine the PR scope**:
- If an argument is provided, use it as the PR number or branch
- Otherwise, detect from the current branch vs main
- Run `gh pr view` if a PR exists, or use `git diff main...HEAD`
2. **Find relevant CLAUDE.md files**:
- Read the root `CLAUDE.md`
- Check for CLAUDE.md files in directories containing changed files
3. **Get the diff and metadata**:
- `gh pr diff` or `git diff main...HEAD` for the full diff
- `gh pr view` or `git log main..HEAD --oneline` for context
4. **Read changed files** where the diff alone is insufficient to understand context
5. **Apply the review instructions from `.claude/review-prompt.md`**
5. **Review for**:
- CLAUDE.md compliance — check each rule against the changed code
- Bugs and logic errors — will this code work correctly?
- Security issues — injection, auth, data exposure in new code
6. **Self-validate each finding**: Before reporting, ask yourself:
- "Is this definitely a real issue, not a false positive?"


@@ -1,8 +1,3 @@
---
name: native-trigger
description: Guidance for adding native trigger services to Windmill. Use when implementing or modifying native trigger integrations across the backend and frontend.
---
# Skill: Adding Native Trigger Services
This skill provides comprehensive guidance for adding new native trigger services to Windmill. Native triggers allow external services (like Nextcloud, Google Drive, etc.) to trigger Windmill scripts/flows via webhooks or push notifications.


@@ -61,13 +61,12 @@ Generated with [Claude Code](https://claude.com/claude-code)
1. Run `git status` to check for uncommitted changes
2. Run `git log main..HEAD --oneline` to see all commits in this branch
3. Run `git diff main...HEAD` to see the full diff against main
4. **Run `/local-review`** before creating the PR. If issues are found, fix them and commit before proceeding. Do not skip this step.
5. Check if remote branch exists and is up to date:
4. Check if remote branch exists and is up to date:
```bash
git rev-parse --abbrev-ref --symbolic-full-name @{u} 2>/dev/null || echo "no upstream"
```
6. Push to remote if needed: `git push -u origin HEAD`
7. Create draft PR using gh CLI:
5. Push to remote if needed: `git push -u origin HEAD`
6. Create draft PR using gh CLI:
```bash
gh pr create --draft --title "<type>: <description>" --body "$(cat <<'EOF'
## Summary
@@ -86,7 +85,7 @@ Generated with [Claude Code](https://claude.com/claude-code)
EOF
)"
```
8. Return the PR URL to the user
7. Return the PR URL to the user
## EE Companion PR (when `*_ee.rs` files were modified)

.envrc

@@ -1,7 +1 @@
use flake
# Per-worktree overrides (ports, DATABASE_URL, etc.) written by webmux/workmux
# post-create hooks. Must come after `use flake` so they take precedence over
# the flake's defaults.
# shellcheck source=/dev/null
[ -f .env.local ] && source .env.local


@@ -1,23 +0,0 @@
You are reviewing a GitHub pull request for this repository.
Review policy:
- Read `CLAUDE.md` before reviewing code.
- Only report issues you are confident are real and introduced by this pull request.
- Focus on bugs, security problems, and clear `CLAUDE.md` violations.
- Do not report style nits, speculative concerns, pre-existing issues, or problems that a normal linter/typechecker would obviously catch.
- Keep the review high signal. If there is no clear issue, return no findings.
Repository context:
- Read `./.github/codex/pr-review-context.md` for the PR metadata and the exact diff commands to use.
- Review only the changes introduced by this PR.
- Read additional files only when the diff is not enough to validate a finding.
- Do not modify any files.
Output requirements:
- Return a GitHub PR comment in markdown, not JSON.
- Start with `## Codex Review`.
- Give a short overall summary first.
- If you found high-signal issues, list them in a short numbered list with file paths and line numbers when you know them confidently.
- If you found no high-signal issues, say that explicitly.
- End with a `### Reproduction instructions` section containing a short descriptive paragraph for a tester explaining how to navigate the app to observe the change. Do not make it a numbered list. If the diff is not enough to infer this safely, say that plainly.
- Prefer at most 10 findings.


@@ -5,8 +5,6 @@ on:
push:
branches:
- "ci-windows-tests"
tags:
- "v*"
env:
CARGO_INCREMENTAL: 0


@@ -290,49 +290,6 @@ jobs:
path: |
*.json
benchmark_wac:
runs-on: ubicloud-standard-8
services:
postgres:
image: postgres
env:
POSTGRES_DB: windmill
POSTGRES_PASSWORD: changeme
POSTGRES_INITDB_ARGS: "-c shared_buffers=2GB -c work_mem=32MB -c effective_cache_size=4GB"
options: >-
--health-cmd pg_isready --health-interval 10s --health-timeout 5s
--health-retries 5
--shm-size=2g
windmill:
image: ghcr.io/windmill-labs/windmill-ee:main
env:
DATABASE_URL: postgres://postgres:changeme@postgres:5432/windmill
LICENSE_KEY: ${{ secrets.WM_LICENSE_KEY_CI }}
WORKER_GROUP: main
WORKER_TAGS: deno,bun,go,python3,bash,dependency,flow,nativets
options: >-
--pull always --health-interval 10s --health-timeout 5s
--health-retries 5 --health-cmd "curl
http://localhost:8000/api/version"
ports:
- 8000:8000
steps:
- uses: denoland/setup-deno@v2
with:
deno-version: v2.x
- name: benchmark
timeout-minutes: 30
run: deno run -A -r
https://raw.githubusercontent.com/windmill-labs/windmill/${GITHUB_REF##ref/head/}/benchmarks/benchmark_suite.ts
-c
https://raw.githubusercontent.com/windmill-labs/windmill/${GITHUB_REF##ref/head/}/benchmarks/suite_wac.json
- name: Save benchmark results
uses: actions/upload-artifact@v4
with:
name: benchmark_wac
path: |
*.json
benchmark_graphs:
runs-on: ubicloud
needs:
@@ -340,7 +297,6 @@ jobs:
- benchmark_dedicated
- benchmark_4workers
- benchmark_8workers
- benchmark_wac
steps:
- uses: denoland/setup-deno@v2
with:


@@ -13,10 +13,10 @@ on:
jobs:
check-membership:
if: |
(github.event_name == 'issue_comment' && startsWith(github.event.comment.body, '/ai') && !startsWith(github.event.comment.body, '/ai-fast')) ||
(github.event_name == 'pull_request_review_comment' && startsWith(github.event.comment.body, '/ai') && !startsWith(github.event.comment.body, '/ai-fast')) ||
(github.event_name == 'pull_request_review' && startsWith(github.event.review.body, '/ai') && !startsWith(github.event.review.body, '/ai-fast')) ||
(github.event_name == 'issues' && startsWith(github.event.issue.body, '/ai') && !startsWith(github.event.issue.body, '/ai-fast'))
(github.event_name == 'issue_comment' && contains(github.event.comment.body, '/ai')) ||
(github.event_name == 'pull_request_review_comment' && contains(github.event.comment.body, '/ai')) ||
(github.event_name == 'pull_request_review' && contains(github.event.review.body, '/ai')) ||
(github.event_name == 'issues' && contains(github.event.issue.body, '/ai'))
uses: ./.github/workflows/check-org-membership.yml
secrets:
access_token: ${{ secrets.ORG_ACCESS_TOKEN }}


@@ -1,18 +1,15 @@
name: CLI Tests
on:
workflow_dispatch:
push:
branches: [main]
paths:
- "cli/**"
- "backend/migrations/**"
- ".github/workflows/cli-tests.yml"
pull_request:
branches: [main]
paths:
- "cli/**"
- "backend/migrations/**"
- ".github/workflows/cli-tests.yml"
env:


@@ -1,145 +0,0 @@
name: Codex Auto Review
on:
pull_request:
types: [ready_for_review, opened]
concurrency:
group: codex-review-${{ github.event.pull_request.number }}
cancel-in-progress: true
jobs:
codex-review:
runs-on: ubicloud-standard-2
timeout-minutes: 30
if: github.event.pull_request.draft == false && github.event.pull_request.head.repo.fork == false
permissions:
contents: read
issues: write
steps:
- name: Check Codex configuration
id: codex_config
env:
CODEX_AUTH_JSON: ${{ secrets.CODEX_AUTH_JSON }}
run: |
if [ -n "$CODEX_AUTH_JSON" ]; then
echo "enabled=true" >> "$GITHUB_OUTPUT"
else
echo "enabled=false" >> "$GITHUB_OUTPUT"
echo "CODEX_AUTH_JSON is not configured; skipping Codex review."
fi
- name: Checkout repository
if: steps.codex_config.outputs.enabled == 'true'
uses: actions/checkout@v5
with:
ref: refs/pull/${{ github.event.pull_request.number }}/merge
fetch-depth: 1
- name: Set up Node.js
if: steps.codex_config.outputs.enabled == 'true'
uses: actions/setup-node@v4
with:
node-version: 22
- name: Install Codex CLI
if: steps.codex_config.outputs.enabled == 'true'
run: npm install --global @openai/codex@0.117.0
- name: Configure file-backed Codex auth
if: steps.codex_config.outputs.enabled == 'true'
env:
CODEX_AUTH_JSON: ${{ secrets.CODEX_AUTH_JSON }}
run: |
CODEX_HOME="$HOME/.codex"
echo "CODEX_HOME=$CODEX_HOME" >> "$GITHUB_ENV"
mkdir -p "$CODEX_HOME"
chmod 700 "$CODEX_HOME"
cat > "$CODEX_HOME/config.toml" <<'EOF'
cli_auth_credentials_store = "file"
EOF
printf '%s' "$CODEX_AUTH_JSON" > "$CODEX_HOME/auth.json"
chmod 600 "$CODEX_HOME/auth.json"
node -e 'JSON.parse(require("fs").readFileSync(process.argv[1], "utf8"))' "$CODEX_HOME/auth.json"
- name: Pre-fetch base and head refs for the PR
if: steps.codex_config.outputs.enabled == 'true'
env:
PR_BASE_REF: ${{ github.event.pull_request.base.ref }}
PR_NUMBER: ${{ github.event.pull_request.number }}
run: |
git fetch --no-tags origin \
"$PR_BASE_REF" \
"+refs/pull/$PR_NUMBER/head"
- name: Write Codex review context
if: steps.codex_config.outputs.enabled == 'true'
env:
PR_REPOSITORY: ${{ github.repository }}
PR_NUMBER: ${{ github.event.pull_request.number }}
PR_BASE_SHA: ${{ github.event.pull_request.base.sha }}
PR_HEAD_SHA: ${{ github.event.pull_request.head.sha }}
PR_TITLE: ${{ github.event.pull_request.title }}
PR_BODY: ${{ github.event.pull_request.body || '' }}
run: |
mkdir -p .github/codex
node <<'NODE'
const fs = require('fs');
const lines = [
`Repository: ${process.env.PR_REPOSITORY}`,
`PR number: ${process.env.PR_NUMBER}`,
`Base SHA: ${process.env.PR_BASE_SHA}`,
`Head SHA: ${process.env.PR_HEAD_SHA}`,
'',
'PR title:',
process.env.PR_TITLE || '(empty)',
'',
'PR body:',
process.env.PR_BODY || '(empty)',
'',
'Changed commits command:',
`git log --oneline ${process.env.PR_BASE_SHA}...${process.env.PR_HEAD_SHA}`,
'',
'Changed files command:',
`git diff --stat ${process.env.PR_BASE_SHA}...${process.env.PR_HEAD_SHA}`,
'',
'Full review diff command:',
`git diff --unified=0 ${process.env.PR_BASE_SHA}...${process.env.PR_HEAD_SHA}`
];
fs.writeFileSync('.github/codex/pr-review-context.md', `${lines.join('\n')}\n`);
NODE
- name: Run Codex review
if: steps.codex_config.outputs.enabled == 'true'
run: |
codex exec \
-C "$GITHUB_WORKSPACE" \
-m gpt-5.4 \
-c 'model_reasoning_effort="xhigh"' \
-s read-only \
-o codex-final-message.md \
- < .github/codex/pr-review.prompt.md
- name: Post Codex review comment
if: steps.codex_config.outputs.enabled == 'true'
uses: actions/github-script@v7
with:
github-token: ${{ github.token }}
script: |
const fs = require('fs');
const path = `${process.env.GITHUB_WORKSPACE}/codex-final-message.md`;
if (!fs.existsSync(path)) {
core.info('Codex did not produce a final message; skipping PR comment.');
return;
}
const body = fs.readFileSync(path, 'utf8').trim();
if (!body) {
core.info('Codex final message was empty; skipping PR comment.');
return;
}
await github.rest.issues.createComment({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: context.payload.pull_request.number,
body,
});


@@ -212,59 +212,6 @@ jobs:
${{ steps.extract-ee.outputs.destination }}/*
${{ steps.extract-duckdb-ffi-internal.outputs.destination }}/*
attach_ee_debug_to_release:
needs: [build_ee]
runs-on: ubicloud
if: ${{ startsWith(github.ref, 'refs/tags/v') }}
strategy:
matrix:
platform: [linux/amd64, linux/arm64]
include:
- platform: linux/amd64
arch: amd64
- platform: linux/arm64
arch: arm64
steps:
- uses: actions/checkout@v4
with:
ref: ${{ github.ref }}
- name: Read EE repo commit hash
run: |
echo "ee_repo_ref=$(cat ./backend/ee-repo-ref.txt)" >> "$GITHUB_ENV"
- uses: actions/checkout@v4
with:
repository: windmill-labs/windmill-ee-private
path: ./windmill-ee-private
ref: ${{ env.ee_repo_ref }}
token: ${{ secrets.WINDMILL_EE_PRIVATE_ACCESS }}
- name: Substitute EE code
run: |
./backend/substitute_ee_code.sh --copy --dir ./windmill-ee-private
- uses: depot/setup-action@v1
- name: Extract EE debug info from builder stage (depot cache hit)
uses: depot/build-push-action@v1
with:
context: .
platforms: ${{ matrix.platform }}
target: debuginfo
build-args: |
features=ee
outputs: type=local,dest=./debuginfo
- name: Rename debug file with corresponding architecture
run: |
mv ./debuginfo/windmill.debug ./debuginfo/windmill-ee-${{ matrix.arch }}.debug
- name: Attach debug file to release
uses: softprops/action-gh-release@v2
with:
files: ./debuginfo/windmill-ee-${{ matrix.arch }}.debug
# attach_arm64_binary_to_release:
# needs: [build, build_ee]
# runs-on: ubicoud


@@ -106,19 +106,6 @@ jobs:
git config --local user.name "windmill-internal-app[bot]"
git config pull.rebase true
git pull origin $BRANCH_NAME
# Checkout the correct windmill-ee-private commit from ee-repo-ref.txt
if [ -f backend/ee-repo-ref.txt ]; then
EE_REF=$(cat backend/ee-repo-ref.txt | tr -d '[:space:]')
echo "Checking out windmill-ee-private at commit: $EE_REF"
cd windmill-ee-private
git fetch origin $EE_REF
git checkout $EE_REF
cd ..
else
echo "Warning: ee-repo-ref.txt not found, using default branch"
fi
mkdir -p frontend/build
cd backend
cargo install sqlx-cli --version 0.8.5


@@ -22,15 +22,6 @@ jobs:
with:
fetch-depth: 1
- name: Read review prompt
id: review-prompt
run: |
{
echo 'REVIEW_PROMPT<<EOF'
cat .claude/review-prompt.md
echo 'EOF'
} >> "$GITHUB_ENV"
- name: Automatic PR Review
uses: anthropics/claude-code-action@v1
with:
@@ -40,7 +31,18 @@ jobs:
REPO: ${{ github.repository }}
PR NUMBER: ${{ github.event.pull_request.number }}
${{ env.REVIEW_PROMPT }}
Please review this pull request and provide comprehensive feedback.
Focus on:
- Code quality and best practices
- Potential bugs or issues
- Performance considerations
- Security implications
Provide detailed feedback using inline comments for specific issues.
Use top-level comments for general observations or praise.
At the end of your review, add complete instructions to reproduce the added changes through the app interface. These instructions will be given to a tester so he can verify the changes. It should be a short descriptive text (not a step by step or a list) on how to navigate the app (what page, what action, what input, etc) to see the changes.
claude_args: |
--allowedTools "mcp__github_inline_comment__create_inline_comment,Bash(gh pr comment:*),Bash(gh pr diff:*),Bash(gh pr view:*)"
--model opus


@@ -18,7 +18,10 @@ jobs:
runs-on: ubicloud-standard-8
steps:
- uses: actions/checkout@v4
- uses: cachix/install-nix-action@v31
- uses: cachix/install-nix-action@v20
with:
extra_nix_config: |
experimental-features = nix-command flakes
- name: Check rust client builds
run: cd rust-client && nix develop ../ --command ./dev.nu --check
timeout-minutes: 16


@@ -10,7 +10,10 @@ jobs:
runs-on: ubicloud-standard-8
steps:
- uses: actions/checkout@v4
- uses: cachix/install-nix-action@v31
- uses: cachix/install-nix-action@v20
with:
extra_nix_config: |
experimental-features = nix-command flakes
- run: cd rust-client && nix develop ../ --command ./dev.nu --check --publish
env:
CRATES_IO_TOKEN: ${{ secrets.CRATES_IO_TOKEN }}

.gitignore

@@ -16,7 +16,6 @@ rust-client/Cargo.toml
# Worktree-generated port isolation
.env.local
.webmux.local.yaml
# Worktree-specific Claude Code settings (generated by scripts/worktree-env)
.claude/settings.local.json
@@ -25,10 +24,7 @@ rust-client/Cargo.toml
backend/target
frontend/node_modules
typescript-client/node_modules
ai_evals/node_modules
ai_evals/results/
frontend/.svelte-kit
backend/chrome_profiler.json
.fast-check/
__pycache__/
.playwright-mcp/


@@ -43,8 +43,8 @@ profiles:
- Pane 0: this pane (claude agent)
- Pane 1: backend (cargo watch -x run)
- Pane 2: frontend (npm run dev)
To check logs, use: \`tmux capture-pane -t $(tmux display-message -t "$TMUX_PANE" -p '#{session_name}:#{window_name}').1 -p -S -50\` (backend) or \`tmux capture-pane -t $(tmux display-message -t "$TMUX_PANE" -p '#{session_name}:#{window_name}').2 -p -S -50\` (frontend).
For this window specifically, backend is running on: ${BACKEND_PORT} and frontend is running on: ${FRONTEND_PORT}.
To check logs, use: \`tmux capture-pane -t .1 -p -S -50\` (backend) or \`tmux capture-pane -t .2 -p -S -50\` (frontend).
When restarting backend or frontend, make sure to use ${BACKEND_PORT} and ${FRONTEND_PORT}.
To connect to the database, use this connection string: ${DATABASE_URL}
Because we are running backend with cargo watch, to verify your changes, just check the logs in the backend pane. No need for cargo check.
IMPORTANT: Read docs/autonomous-mode.md before starting any work.
@@ -55,13 +55,11 @@ profiles:
- id: backend
kind: command
split: right
workingDir: backend
command: PORT=${BACKEND_PORT:-8000} cargo watch -x "run ${CARGO_FEATURES:+--features $CARGO_FEATURES}"
command: ROOT="$(git rev-parse --show-toplevel)"; cd "$ROOT/backend" && PORT=${BACKEND_PORT:-8000} cargo watch -x "run ${CARGO_FEATURES:+--features $CARGO_FEATURES}"
- id: frontend
kind: command
split: bottom
workingDir: frontend
command: npm run generate-backend-client && REMOTE=${REMOTE:-http://localhost:${BACKEND_PORT:-8000}} npm run dev -- --port ${FRONTEND_PORT:-3000} --host 0.0.0.0
command: ROOT="$(git rev-parse --show-toplevel)"; cd "$ROOT/frontend" && npm run generate-backend-client && REMOTE=${REMOTE:-http://localhost:${BACKEND_PORT:-8000}} npm run dev -- --port ${FRONTEND_PORT:-3000} --host 0.0.0.0
frontendOnly:
runtime: host
@@ -72,8 +70,8 @@ profiles:
Pane layout (current window):
- Pane 0: this pane (claude agent)
- Pane 1: frontend (npm run dev)
To check logs, use: \`tmux capture-pane -t $(tmux display-message -t "$TMUX_PANE" -p '#{session_name}:#{window_name}').1 -p -S -50\` (frontend).
On this window specifically, frontend is running on: ${FRONTEND_PORT}.
To check logs, use: \`tmux capture-pane -t .1 -p -S -50\` (frontend).
When restarting frontend, make sure to use ${FRONTEND_PORT}.
To connect to the database, use this connection string: ${DATABASE_URL}
Because we are running frontend with npm run dev, to verify your changes, just check the logs in the frontend pane. No need for npm run build.
IMPORTANT: Read docs/autonomous-mode.md before starting any work.
@@ -84,8 +82,7 @@ profiles:
- id: frontend
kind: command
split: right
workingDir: frontend
command: npm run generate-backend-client && npm run dev -- --port ${FRONTEND_PORT:-3000} --host 0.0.0.0
command: ROOT="$(git rev-parse --show-toplevel)"; cd "$ROOT/frontend" && npm run generate-backend-client && npm run dev -- --port ${FRONTEND_PORT:-3000} --host 0.0.0.0
agentOnly:
runtime: host
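
The pane commands in these hunks rely on two POSIX shell parameter expansions: `${BACKEND_PORT:-8000}` for a default port and `${CARGO_FEATURES:+--features $CARGO_FEATURES}` for an optional flag. A minimal sketch of how they behave is below. The function names `resolve_backend_port` and `cargo_watch_args` are hypothetical helpers for illustration only and do not exist in the repository.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the repository) that mirror the two
# parameter expansions used in the pane commands.

# ${VAR:-default}: use the default when VAR is unset or empty.
resolve_backend_port() {
  printf '%s\n' "${BACKEND_PORT:-8000}"
}

# ${VAR:+word}: expand to word only when VAR is set and non-empty,
# so the --features flag disappears entirely when no features are requested.
cargo_watch_args() {
  printf '%s\n' "run ${CARGO_FEATURES:+--features $CARGO_FEATURES}"
}
```

With `CARGO_FEATURES` unset, the argument string is just `run ` (with a trailing space), so `cargo watch -x` receives a plain `run`. With `CARGO_FEATURES` set, the `--features` flag and its value are appended.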

74
.workmux.yaml Normal file

@@ -0,0 +1,74 @@
main_branch: main
merge_strategy: rebase
# worktree_dir: .worktrees
worktree_naming: basename
worktree_prefix: ""
# Default: "wm-"
window_prefix: "wm-"
auto_name:
model: "gemini-2.5-flash-lite"
system_prompt: |
Generate a concise git branch name based on the task description.
Rules:
- Use kebab-case (lowercase with hyphens)
- Keep it short: 1-3 words, max 4 if necessary
- Focus on the core task/feature, not implementation details
- No prefixes like feat/, fix/, chore/
Examples of good branch names:
- "Add dark mode toggle" → dark-mode
- "Fix the search results not showing" → fix-search
- "Refactor the authentication module" → auth-refactor
- "Add CSV export to reports" → export-csv
- "Shell completion is broken" → shell-completion
Output ONLY the branch name, nothing else.
background: true
# Commands to run in new worktree before tmux window opens.
# These block window creation - use for short tasks only.
# Use "<global>" to inherit from global config.
# Set to empty list to disable: `post_create: []`
# post_create:
# - "<global>"
# - mise use
post_create:
- ./scripts/worktree-env
pre_remove:
- ./scripts/worktree-cleanup
panes:
- command: >-
claude --dangerously-skip-permissions --append-system-prompt
"You are running inside a tmux session with other panes running services.\n
Pane layout (current window):\n
- Pane 0: this pane (claude agent)\n
- Pane 1: backend (cargo watch -x run)\n
- Pane 2: frontend (npm run dev)\n\n
To check logs, use: \`tmux capture-pane -t .1 -p -S -50\` (backend) or \`tmux capture-pane -t .2 -p -S -50\` (frontend).\n
When restarting backend or frontend, make sure to use the ports listed in .env.local.\n
Because we are running backend with cargo watch, to verify your changes, just check the logs in the backend pane. No need for cargo check.\n\n
IMPORTANT: Read docs/autonomous-mode.md before starting any work."
focus: true
- command: 'ROOT="$(git rev-parse --show-toplevel)"; [ -f "$ROOT/.env.local" ] && source "$ROOT/.env.local"; cd "$ROOT/backend" && PORT=${BACKEND_PORT:-8000} cargo watch -x "run ${CARGO_FEATURES:+--features $CARGO_FEATURES}"'
split: horizontal
- command: 'ROOT="$(git rev-parse --show-toplevel)"; [ -f "$ROOT/.env.local" ] && source "$ROOT/.env.local"; cd "$ROOT/frontend" && npm run generate-backend-client && REMOTE=${REMOTE:-http://localhost:${BACKEND_PORT:-8000}} npm run dev -- --port ${FRONTEND_PORT:-3000} --host 0.0.0.0'
split: vertical
files:
copy:
- backend/.env
- scripts/
- wm-ts-nav/target/release/wm-ts-nav
sandbox:
enabled: false
toolchain: off
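
The branch-naming rules in the `auto_name.system_prompt` above (kebab-case, at most four words, no `feat/`-style prefixes) could also be checked mechanically after the model responds. The sketch below is one way to do that. `is_valid_branch_name` is a hypothetical helper, not something workmux provides, and allowing digits within words is an assumption the prompt does not state.

```shell
#!/bin/sh
# Hypothetical check (not part of workmux): accept 1 to 4 lowercase
# alphanumeric words joined by single hyphens. Prefixes such as feat/
# are rejected because "/" is not an allowed character.
is_valid_branch_name() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9]+(-[a-z0-9]+){0,3}$'
}
```

The example names from the prompt (`dark-mode`, `fix-search`, `auth-refactor`, `export-csv`, `shell-completion`) all pass this check, while `feat/dark-mode` and names with uppercase letters or more than four words do not.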


@@ -1,603 +1,5 @@
# Changelog
## [1.683.0](https://github.com/windmill-labs/windmill/compare/v1.682.0...v1.683.0) (2026-04-13)
### Features
* add black-box ai eval benchmarks ([#8618](https://github.com/windmill-labs/windmill/issues/8618)) ([cdcc564](https://github.com/windmill-labs/windmill/commit/cdcc56461b77554964622f490ae901f170886595))
* add deploy restriction rule and fork review requests ([#8804](https://github.com/windmill-labs/windmill/issues/8804)) ([64c58c8](https://github.com/windmill-labs/windmill/commit/64c58c824fcefe00f15405b7e3877eb566a3ffa2))
* allow non-admins to create and edit HTTP triggers ([#8810](https://github.com/windmill-labs/windmill/issues/8810)) ([9fb7816](https://github.com/windmill-labs/windmill/commit/9fb78164b4baa14c10d10f91ae969d48590c29f3))
* display agent message in flow graph ([#8806](https://github.com/windmill-labs/windmill/issues/8806)) ([95411b2](https://github.com/windmill-labs/windmill/commit/95411b256332fa41816a93b19906f1534da9b300))
* folder default_permissioned_as rules for ownership defaults on deploy ([#8801](https://github.com/windmill-labs/windmill/issues/8801)) ([60211c1](https://github.com/windmill-labs/windmill/commit/60211c1d1910b5f7ac6fed112f790201d2047a4c))
* instance-level ruff config auto-pulled by LSP container ([#8803](https://github.com/windmill-labs/windmill/issues/8803)) ([3f5841f](https://github.com/windmill-labs/windmill/commit/3f5841f84d878cd3f43c435fa237d3f0c2265fb9))
### Bug Fixes
* **cli:** make cli help resilient to npm registry fetch failures ([#8809](https://github.com/windmill-labs/windmill/issues/8809)) ([b6f1cc7](https://github.com/windmill-labs/windmill/commit/b6f1cc70cd87c61df7112d3838fbb5fe9bcdc145))
* enrich OTEL log records with per-request LogContext ([#8812](https://github.com/windmill-labs/windmill/issues/8812)) ([42d3e8c](https://github.com/windmill-labs/windmill/commit/42d3e8c7893cd959c7faffd19cd210c869c604f8))
* silence user-facing toast for non-critical hub script tracking error ([#8808](https://github.com/windmill-labs/windmill/issues/8808)) ([378ba78](https://github.com/windmill-labs/windmill/commit/378ba7828456c871b5778f1144c4bb559bd5a733))
### Performance Improvements
* add inline-persist fast path for WAC v2 step() ([#8807](https://github.com/windmill-labs/windmill/issues/8807)) ([b3ef4bc](https://github.com/windmill-labs/windmill/commit/b3ef4bc26c5696624efee89b5e4e33e77e10cf15))
## [1.682.0](https://github.com/windmill-labs/windmill/compare/v1.681.0...v1.682.0) (2026-04-10)
### Features
* enrich hanging flow error with worker and service log info ([#8800](https://github.com/windmill-labs/windmill/issues/8800)) ([59c457a](https://github.com/windmill-labs/windmill/commit/59c457a13881e35c229baed3edd87e618f89b9a0))
### Bug Fixes
* bypass OTEL MITM tracing proxy for git sync jobs ([#8796](https://github.com/windmill-labs/windmill/issues/8796)) ([9c85565](https://github.com/windmill-labs/windmill/commit/9c855652212dbac0e49f87dedd447d3d7d7b500a))
* show full path on hover in deploy drawer and widen drawer ([#8799](https://github.com/windmill-labs/windmill/issues/8799)) ([b783bf2](https://github.com/windmill-labs/windmill/commit/b783bf2d835cde0843739f7d1099193bb0af042e))
## [1.681.0](https://github.com/windmill-labs/windmill/compare/v1.680.0...v1.681.0) (2026-04-10)
### Features
* add CI test scripts with auto-trigger on deploy ([#8736](https://github.com/windmill-labs/windmill/issues/8736)) ([c57c769](https://github.com/windmill-labs/windmill/commit/c57c769deaa207e7ba7995f75649d3630774e898))
* add edit yaml button to raw app settings ([#8771](https://github.com/windmill-labs/windmill/issues/8771)) ([b73be37](https://github.com/windmill-labs/windmill/commit/b73be37916de808dc64bec1337edf6e7d3993c5e))
* add user offboarding flow with object reassignment ([#8647](https://github.com/windmill-labs/windmill/issues/8647)) ([435b25e](https://github.com/windmill-labs/windmill/commit/435b25e6a4c7272c0189cbcfb83526379f41ebf0))
* allow selecting hub flows as raw app backend runnables ([#8772](https://github.com/windmill-labs/windmill/issues/8772)) ([5f57727](https://github.com/windmill-labs/windmill/commit/5f57727a4d956a9066b005b3c55f08dd6780475a))
* list external JWT tokens in instance settings ([#8783](https://github.com/windmill-labs/windmill/issues/8783)) ([ce3e676](https://github.com/windmill-labs/windmill/commit/ce3e676f4ab0c442058c64db4ebf35545a805ef5))
* oauth manual connect option ([#8770](https://github.com/windmill-labs/windmill/issues/8770)) ([4b87639](https://github.com/windmill-labs/windmill/commit/4b876392a0ce41ae42bd882ced10fe0187e532bc))
* unify CLI config to workspaces, deprecate gitBranches/environments ([#8767](https://github.com/windmill-labs/windmill/issues/8767)) ([5b97092](https://github.com/windmill-labs/windmill/commit/5b9709299761b83a88df17a4259c431dfcd244f9))
* **vault:** add skip_ssl_verify option for HashiCorp Vault ([#8791](https://github.com/windmill-labs/windmill/issues/8791)) ([6cf7ffc](https://github.com/windmill-labs/windmill/commit/6cf7ffc26bcbc8f4ef0e4ad2879fcd114332c4e2))
### Bug Fixes
* bypass sql type injection during formatting to prevent offset corruption ([#8786](https://github.com/windmill-labs/windmill/issues/8786)) ([8957d8f](https://github.com/windmill-labs/windmill/commit/8957d8f19bce3430871c2858b3accd53e0be178f))
* CLI falls back to workspace whoami for workspace-scoped tokens ([#8789](https://github.com/windmill-labs/windmill/issues/8789)) ([d243eb3](https://github.com/windmill-labs/windmill/commit/d243eb31b014781a249f903b2a467aa58909ddd6))
* disable scroll-to-change-number on number inputs ([#8777](https://github.com/windmill-labs/windmill/issues/8777)) ([e63924e](https://github.com/windmill-labs/windmill/commit/e63924e3778b40486813192dc2913e565e0a765e))
* error on flow/app folder suffix format mismatch during sync push/pull ([#8775](https://github.com/windmill-labs/windmill/issues/8775)) ([1deb31f](https://github.com/windmill-labs/windmill/commit/1deb31f1e01d6168eee3c2cc242cb483272d1965))
* flow dev page layout and compact toolbar improvements ([#8776](https://github.com/windmill-labs/windmill/issues/8776)) ([89920e7](https://github.com/windmill-labs/windmill/commit/89920e77f3f5dc45db939ec938d92c881dccc8a0))
* Flow status viewer layout nits (avoid excess y space and scroll) ([#8780](https://github.com/windmill-labs/windmill/issues/8780)) ([6d36eca](https://github.com/windmill-labs/windmill/commit/6d36eca21684f9d3ab36658c2b66f85b9be8d331))
* flow step testing UX improvements ([#8781](https://github.com/windmill-labs/windmill/issues/8781)) ([3fb557a](https://github.com/windmill-labs/windmill/commit/3fb557a7f51dbbd3fac445734196f1b9a1d2e287))
* hide legacy global_settings.worker_configs ghost row ([#8790](https://github.com/windmill-labs/windmill/issues/8790)) ([4fff89f](https://github.com/windmill-labs/windmill/commit/4fff89f98ce72997a055cc313c8fe217d2f1fe78))
* limit multi-runnable dedicated workers to one job at a time ([#8782](https://github.com/windmill-labs/windmill/issues/8782)) ([946848f](https://github.com/windmill-labs/windmill/commit/946848feef60aba2a54bc2f5b686b33cc96ec9ef))
* normalize multi-word pg types in build_parameters to fix float8 serialization ([#8778](https://github.com/windmill-labs/windmill/issues/8778)) ([3d02be9](https://github.com/windmill-labs/windmill/commit/3d02be98f748d985f688243f3215d15ca4227f8f))
* refresh custom instance user password if auth failed ([#8787](https://github.com/windmill-labs/windmill/issues/8787)) ([3d43d31](https://github.com/windmill-labs/windmill/commit/3d43d31aba276f23903f16f06035a4c4955b52e2))
* treat empty global setting strings as unset ([#8793](https://github.com/windmill-labs/windmill/issues/8793)) ([ec9cec1](https://github.com/windmill-labs/windmill/commit/ec9cec1d02d87328db92a71a1b3a945e9e0c6bd2))
* zero-downtime coordinated restarts for OTEL and other setting changes ([#8768](https://github.com/windmill-labs/windmill/issues/8768)) ([506b7f5](https://github.com/windmill-labs/windmill/commit/506b7f55e17472d1384e9676c1b6df7a9d7a118b))
## [1.680.0](https://github.com/windmill-labs/windmill/compare/v1.679.0...v1.680.0) (2026-04-08)
### Features
* add CLI workspace merge command and enhance fork with datatable/color support ([#8756](https://github.com/windmill-labs/windmill/issues/8756)) ([4342c18](https://github.com/windmill-labs/windmill/commit/4342c1854134500d3b2bc46280f9885ee84e2c9e))
* add scheduled job deletion with configurable retention period ([#8753](https://github.com/windmill-labs/windmill/issues/8753)) ([2d18a68](https://github.com/windmill-labs/windmill/commit/2d18a680991babe317ca315bbce40e6ce733afda))
* add status indicator dots to parallel loop iteration picker ([#8761](https://github.com/windmill-labs/windmill/issues/8761)) ([470b8aa](https://github.com/windmill-labs/windmill/commit/470b8aa5f1870e26fea022c1e2a9f48471d8a205))
### Bug Fixes
* move alert config from config table to global_settings ([#8762](https://github.com/windmill-labs/windmill/issues/8762)) ([fa66870](https://github.com/windmill-labs/windmill/commit/fa668707c0ee7f261d78e145666b1073471259fd))
* resolve esbuild host/binary version mismatch in app sync push ([#8765](https://github.com/windmill-labs/windmill/issues/8765)) ([e36d440](https://github.com/windmill-labs/windmill/commit/e36d440a251a43ea888e3ce378d0bb8ed8f42e11))
* skip serializing ws_specific on resources when false ([#8764](https://github.com/windmill-labs/windmill/issues/8764)) ([c69f10d](https://github.com/windmill-labs/windmill/commit/c69f10d20dd064f0c329934096c2945424ff81f2))
## [1.679.0](https://github.com/windmill-labs/windmill/compare/v1.678.0...v1.679.0) (2026-04-07)
### Features
* Fork datatables ([#8339](https://github.com/windmill-labs/windmill/issues/8339)) ([3d4f4c6](https://github.com/windmill-labs/windmill/commit/3d4f4c6c38155396e9b2236a6a7a7ad4e02da877))
## [1.678.0](https://github.com/windmill-labs/windmill/compare/v1.677.0...v1.678.0) (2026-04-07)
### Features
* accept any content type on webhooks/http triggers with fallback ([#8743](https://github.com/windmill-labs/windmill/issues/8743)) ([208a597](https://github.com/windmill-labs/windmill/commit/208a597d599b4d203f7ab817a5d8ce2c06f79d0a))
* add download all logs button for flow jobs ([#8748](https://github.com/windmill-labs/windmill/issues/8748)) ([d938625](https://github.com/windmill-labs/windmill/commit/d938625785ba301fbd2c5f3d001c320eab1c504c))
### Bug Fixes
* delete raw_script_temp rows before workspace deletion to avoid FK violation ([#8752](https://github.com/windmill-labs/windmill/issues/8752)) ([8b9523e](https://github.com/windmill-labs/windmill/commit/8b9523e03c82c5a095b7cb2d5f70a87b7bbc8608))
* Fix FlowTimeline duplicate key ([#8754](https://github.com/windmill-labs/windmill/issues/8754)) ([2413dbe](https://github.com/windmill-labs/windmill/commit/2413dbefe3cc3b65c28bea437cd4471cf7e9ecba))
* remove span.enter() in dedicated worker to prevent tracing panic ([#8749](https://github.com/windmill-labs/windmill/issues/8749)) ([db55e8e](https://github.com/windmill-labs/windmill/commit/db55e8efb0c9ae198ca5ac7013439a94dfe9f550))
* restore ai agent tool deletion ([#8744](https://github.com/windmill-labs/windmill/issues/8744)) ([2f7ba9e](https://github.com/windmill-labs/windmill/commit/2f7ba9edac1a57dfc0eb3417574c72292855fc56))
## [1.677.0](https://github.com/windmill-labs/windmill/compare/v1.676.0...v1.677.0) (2026-04-06)
### Features
* add AWS Secrets Manager as secret storage backend (Beta) ([#8734](https://github.com/windmill-labs/windmill/issues/8734)) ([09bbc18](https://github.com/windmill-labs/windmill/commit/09bbc18bb773d9ffaa5aaa4bd9d7ce296f3ac468))
### Bug Fixes
* remove stale KMS openapi/description, restore stripped doc comments ([c09a431](https://github.com/windmill-labs/windmill/commit/c09a4311fd73c58acc8f3997428f002598dacce6))
* use runnable key for file naming in generate-metadata to prevent duplicate scripts in raw apps ([#8740](https://github.com/windmill-labs/windmill/issues/8740)) ([edfe074](https://github.com/windmill-labs/windmill/commit/edfe074e98cb3955be0768de7ed19e6ed8525916))
## [1.676.0](https://github.com/windmill-labs/windmill/compare/v1.675.1...v1.676.0) (2026-04-06)
### Features
* add path name autocomplete with ghost text and folder cycling ([#8731](https://github.com/windmill-labs/windmill/issues/8731)) ([e326621](https://github.com/windmill-labs/windmill/commit/e32662169a9762605de2dbe058514ddefbe07982))
### Bug Fixes
* fix custom urls not found ([d2abc0d](https://github.com/windmill-labs/windmill/commit/d2abc0d4300bb53f4035102f214d3c05bf0976a1))
### Performance Improvements
* add partial index for expired cache resource cleanup ([#8728](https://github.com/windmill-labs/windmill/issues/8728)) ([c721fac](https://github.com/windmill-labs/windmill/commit/c721fac466524747de04e3623c8cd62de8bd4dae))
## [1.675.1](https://github.com/windmill-labs/windmill/compare/v1.675.0...v1.675.1) (2026-04-05)
### Bug Fixes
* log cleanup scans S3 orphans and works cross-server ([#8729](https://github.com/windmill-labs/windmill/issues/8729)) ([f703fba](https://github.com/windmill-labs/windmill/commit/f703fba1ef56c89a97b2b4da7b4c188158f4c982))
### Performance Improvements
* add indexes for cleanup deletes on concurrency_key and autoscaling_event ([#8726](https://github.com/windmill-labs/windmill/issues/8726)) ([eae46a2](https://github.com/windmill-labs/windmill/commit/eae46a21a93fe7ab191228658dd5825f472bd851))
## [1.675.0](https://github.com/windmill-labs/windmill/compare/v1.674.2...v1.675.0) (2026-04-05)
### Features
* add object storage usage view and manual log cleanup ([#8724](https://github.com/windmill-labs/windmill/issues/8724)) ([02d0ee9](https://github.com/windmill-labs/windmill/commit/02d0ee919880823a33b112bcaf626a8933e1f715))
### Bug Fixes
* add admin check to count_completed_jobs_detail and document query builder SQL safety ([#8722](https://github.com/windmill-labs/windmill/issues/8722)) ([dd39c11](https://github.com/windmill-labs/windmill/commit/dd39c110a8468bf31d42428fc978cd302426fa86))
* allow private AI base URLs in ai_proxy integration test ([#8715](https://github.com/windmill-labs/windmill/issues/8715)) ([2b865c0](https://github.com/windmill-labs/windmill/commit/2b865c0694d79ce6477e5f14a077b73837007500))
* enrich OTEL spans with job_kind, trigger_kind, trigger, created_by, and script_hash ([#8718](https://github.com/windmill-labs/windmill/issues/8718)) ([7bf6ac2](https://github.com/windmill-labs/windmill/commit/7bf6ac2b694fc829327248ff2480c20c97e03e48))
* split DB health endpoint and add slow query controls ([#8725](https://github.com/windmill-labs/windmill/issues/8725)) ([01e39d9](https://github.com/windmill-labs/windmill/commit/01e39d9cd1b841d085bcc28a578654a5486cf76e))
## [1.674.2](https://github.com/windmill-labs/windmill/compare/v1.674.1...v1.674.2) (2026-04-04)
### Bug Fixes
* enforce RLS on $var: resolution in AI proxy (GHSA-jwg4-v3cj-rvfm) ([#8713](https://github.com/windmill-labs/windmill/issues/8713)) ([ff8e39c](https://github.com/windmill-labs/windmill/commit/ff8e39c69b1438defcaabd9d4906e7adafa7010c))
* SSRF via X-Resource-Path header in AI proxy endpoint ([#8712](https://github.com/windmill-labs/windmill/issues/8712)) ([f394e67](https://github.com/windmill-labs/windmill/commit/f394e674f22af13bb77915f33aa1e8de402b6fe1))
## [1.674.1](https://github.com/windmill-labs/windmill/compare/v1.674.0...v1.674.1) (2026-04-04)
### Bug Fixes
* create pg connection for cloud-hosted jobs instead of panicking ([#8710](https://github.com/windmill-labs/windmill/issues/8710)) ([aff95c3](https://github.com/windmill-labs/windmill/commit/aff95c33b2fd4c248dfaf595b8d18a6dbc50f0e6))
## [1.674.0](https://github.com/windmill-labs/windmill/compare/v1.673.0...v1.674.0) (2026-04-03)
### Features
* add application-level heartbeat support for websocket triggers ([#8686](https://github.com/windmill-labs/windmill/issues/8686)) ([5b7fa63](https://github.com/windmill-labs/windmill/commit/5b7fa63bf1800313e9b82465b8a4399a48634371))
* add Azure Key Vault as secret storage backend ([#8704](https://github.com/windmill-labs/windmill/issues/8704)) ([dcd615f](https://github.com/windmill-labs/windmill/commit/dcd615fdc3c66ec2a8e39c01f8142a7e7c82f534))
* add http/protobuf support for OTEL exporters ([#8702](https://github.com/windmill-labs/windmill/issues/8702)) ([0aea49f](https://github.com/windmill-labs/windmill/commit/0aea49f9607d5cbb5bcfa3068a179c9b7bf9afd6))
* add optional labels to scripts, flows, apps, schedules, triggers ([#8609](https://github.com/windmill-labs/windmill/issues/8609)) ([c4c9ef5](https://github.com/windmill-labs/windmill/commit/c4c9ef5fd7b41052b08ee941725434e8ca4ac970))
* add powershell common parameters support ([#8683](https://github.com/windmill-labs/windmill/issues/8683)) ([0317d58](https://github.com/windmill-labs/windmill/commit/0317d5891cfcfbde7b04795c034c088e933ee3d0))
* sql.raw in Typescript client ([#8706](https://github.com/windmill-labs/windmill/issues/8706)) ([ce290f6](https://github.com/windmill-labs/windmill/commit/ce290f68db866c07b30c97c2c0b3e39fee0a26d8))
* Support .ducklake() and .datatable() in agent workers ([#8697](https://github.com/windmill-labs/windmill/issues/8697)) ([fda68a7](https://github.com/windmill-labs/windmill/commit/fda68a72e5dfcded2350d1ff33ca4c695ab337b7))
### Bug Fixes
* add secretKeyRef support for jwt_secret and rsa_keys ([#8698](https://github.com/windmill-labs/windmill/issues/8698)) ([ba21470](https://github.com/windmill-labs/windmill/commit/ba214709b94f9467738e66b016331e97ac7d5d10))
* align script push metadata warning with generated locks ([#8690](https://github.com/windmill-labs/windmill/issues/8690)) ([6656b46](https://github.com/windmill-labs/windmill/commit/6656b46f10408e1c15961a72cde4c13b5c5b3923))
* debounce S3 proxy logs ([#8694](https://github.com/windmill-labs/windmill/issues/8694)) ([a3073ad](https://github.com/windmill-labs/windmill/commit/a3073ad8244efd9043e27f6731f7b53dbda662c1))
* dedicated worker dispatch, cross-workspace deps, UI improvements ([#8689](https://github.com/windmill-labs/windmill/issues/8689)) ([bffa61e](https://github.com/windmill-labs/windmill/commit/bffa61e33f2305bbeb79a2c91989a47baa7dff31))
* gate relock_skip tests on private feature and update ee-repo-ref ([#8703](https://github.com/windmill-labs/windmill/issues/8703)) ([adc9fe7](https://github.com/windmill-labs/windmill/commit/adc9fe722d8511a5914d81faac40af757e7f5e3f))
* hide deprecated cli metadata commands ([#8699](https://github.com/windmill-labs/windmill/issues/8699)) ([b960598](https://github.com/windmill-labs/windmill/commit/b96059843168c072f24072f93fecd80431e5d4cf))
* optimize S3 proxy performance ([#8685](https://github.com/windmill-labs/windmill/issues/8685)) ([0cfa462](https://github.com/windmill-labs/windmill/commit/0cfa462c379e887fdb5ad5e3bbff7798648d4e91))
* pipeline DISCARD ALL with first query on cached pg connections ([#8707](https://github.com/windmill-labs/windmill/issues/8707)) ([6d58d1a](https://github.com/windmill-labs/windmill/commit/6d58d1a74d1e69b163210a795502a7b3931001b5))
* resolve schedule update deadlock ([#8701](https://github.com/windmill-labs/windmill/issues/8701)) ([27ca417](https://github.com/windmill-labs/windmill/commit/27ca417201c99cf6fe0ae5b52a63c0395033e196))
* support raw app deployment history ([#8657](https://github.com/windmill-labs/windmill/issues/8657)) ([f234df9](https://github.com/windmill-labs/windmill/commit/f234df97ec3cdc480ee9d403370a3512496b024b))
* use pre-aggregated stats for telemetry job usage queries ([#8688](https://github.com/windmill-labs/windmill/issues/8688)) ([cdf3c29](https://github.com/windmill-labs/windmill/commit/cdf3c29664e4142c0f4487c07e585d1af3f97f91))
## [1.673.0](https://github.com/windmill-labs/windmill/compare/v1.672.0...v1.673.0) (2026-04-02)
### Features
* add endpoint to restart workers in a worker group ([#8659](https://github.com/windmill-labs/windmill/issues/8659)) ([f0437eb](https://github.com/windmill-labs/windmill/commit/f0437eba1925a9aa4c430008027d637a0c89ee39))
* add Entra ID (Azure Workload Identity) database auth ([#8526](https://github.com/windmill-labs/windmill/issues/8526)) ([6a5cfbc](https://github.com/windmill-labs/windmill/commit/6a5cfbc159a0ad7925fd7ce5eefc8eaa21bbb70b))
* add LIMIT_WINDOWS_TO_1CU env var for Windows worker memory limits ([#8681](https://github.com/windmill-labs/windmill/issues/8681)) ([d2d6810](https://github.com/windmill-labs/windmill/commit/d2d6810db954114f3333853bd3476cb8fc735f92))
* restore bun for dedicated workers, fix dispatch & serialization, cross-workspace deps ([#8645](https://github.com/windmill-labs/windmill/issues/8645)) ([619ebb6](https://github.com/windmill-labs/windmill/commit/619ebb65ce8dce8264add31c3147919802a8286a))
### Bug Fixes
* add HMAC signature verification to Slack interactive callback endpoint ([#8611](https://github.com/windmill-labs/windmill/issues/8611)) ([55e8a5c](https://github.com/windmill-labs/windmill/commit/55e8a5cff1f185b1dbd332d37b877972efa1ed7d))
* correct raw app flow inputs ([#8667](https://github.com/windmill-labs/windmill/issues/8667)) ([28c0730](https://github.com/windmill-labs/windmill/commit/28c073056c65d4ed1600e39679497e5af964347f))
* pass selected language to AI agent when generating flow scripts ([#8680](https://github.com/windmill-labs/windmill/issues/8680)) ([381011a](https://github.com/windmill-labs/windmill/commit/381011a4a8e48454e9c146c64db502293e646b99))
* poll for preview results to avoid undici headers timeout ([#8682](https://github.com/windmill-labs/windmill/issues/8682)) ([ff5fa9f](https://github.com/windmill-labs/windmill/commit/ff5fa9f64fe4aaf33e06b20f02373894b5df0f95))
* pre-fix trigger edited_by for superadmins not in workspace ([#8669](https://github.com/windmill-labs/windmill/issues/8669)) ([350ffdc](https://github.com/windmill-labs/windmill/commit/350ffdce297ba5b84f9dd247eede6da0c6b0956c))
* resolve race condition where flow sync push reverts to stale version ([#8673](https://github.com/windmill-labs/windmill/issues/8673)) ([d569e9e](https://github.com/windmill-labs/windmill/commit/d569e9e29c588243a90b1cd25f866efb0d178640))
* respect disabled fields in JSON input mode ([#8663](https://github.com/windmill-labs/windmill/issues/8663)) ([7fd0bf9](https://github.com/windmill-labs/windmill/commit/7fd0bf974d2ba2644bb01dd5e9ddc84749e166f5))
* Run typed pg queries in a single protocol conversation ([#8679](https://github.com/windmill-labs/windmill/issues/8679)) ([8581a33](https://github.com/windmill-labs/windmill/commit/8581a3300d056040b7e3ab77d629c74f034c9c97))
* sanitize MCP tool schemas for JSON Schema draft 2020-12 compliance ([#8666](https://github.com/windmill-labs/windmill/issues/8666)) ([8c3c97f](https://github.com/windmill-labs/windmill/commit/8c3c97f7a670d47019cc666219f8187f48499672))
* skip generate-metadata confirmation prompt in non-interactive CI ([#8678](https://github.com/windmill-labs/windmill/issues/8678)) ([39af1b7](https://github.com/windmill-labs/windmill/commit/39af1b75afc8458f85dec4fe51dfaed3d0cb000d))
* strip f/ prefix from folder paths when deploying from workspace forks ([#8662](https://github.com/windmill-labs/windmill/issues/8662)) ([7ab0ea5](https://github.com/windmill-labs/windmill/commit/7ab0ea581d349fbfdb56d22cf9903a90efa045bb))
* support branch-specific folder.meta.yaml in missing-meta check ([#8661](https://github.com/windmill-labs/windmill/issues/8661)) ([c87a6a0](https://github.com/windmill-labs/windmill/commit/c87a6a0f2c1346bf5e21f128d32d89bdca039243))
* validate rd redirect on login with same rules as logout ([#8655](https://github.com/windmill-labs/windmill/issues/8655)) ([bcce627](https://github.com/windmill-labs/windmill/commit/bcce62738791a4e9b9f4dbc64731eef163230172))
## [1.672.0](https://github.com/windmill-labs/windmill/compare/v1.671.0...v1.672.0) (2026-04-01)
### Features
* add R language support ([#8263](https://github.com/windmill-labs/windmill/issues/8263)) ([a46aa64](https://github.com/windmill-labs/windmill/commit/a46aa641f9d72809c52a0eb11a877a0f2d587c32))
### Bug Fixes
* approval page freeze, stale state, and missing approval link ([#8653](https://github.com/windmill-labs/windmill/issues/8653)) ([7069202](https://github.com/windmill-labs/windmill/commit/70692021909443b86ed61fa621fe49f28742fb54))
## [1.671.0](https://github.com/windmill-labs/windmill/compare/v1.670.0...v1.671.0) (2026-03-31)
### Features
* add configurable preview job tag override in default tags settings ([#8649](https://github.com/windmill-labs/windmill/issues/8649)) ([da8886b](https://github.com/windmill-labs/windmill/commit/da8886be8575dd925b6d24c55ab379bc6984c5f8))
* improve CLI flow log streaming and job inspection ([#8644](https://github.com/windmill-labs/windmill/issues/8644)) ([6c3c971](https://github.com/windmill-labs/windmill/commit/6c3c971af5aa1362632ee0deeddf91b8bc47c853))
* support hub flows in raw app runnables ([#8627](https://github.com/windmill-labs/windmill/issues/8627)) ([040a199](https://github.com/windmill-labs/windmill/commit/040a199685cea5c99c944bacb5584a381d6ec829))
### Bug Fixes
* return default_args/enums in approval info and fix subflow resume buttons ([#8648](https://github.com/windmill-labs/windmill/issues/8648)) ([852c59e](https://github.com/windmill-labs/windmill/commit/852c59efbb04510e5e6f99919707effcf6769a2f))
## [1.670.0](https://github.com/windmill-labs/windmill/compare/v1.669.1...v1.670.0) (2026-03-31)
### Features
* add OR logic support to kafka/websocket trigger filters ([#8580](https://github.com/windmill-labs/windmill/issues/8580)) ([3876902](https://github.com/windmill-labs/windmill/commit/3876902a7be798fd5ef208bc5756b28fb55e569e))
* expose getJob and getJobLogs as MCP tools ([#8632](https://github.com/windmill-labs/windmill/issues/8632)) ([cd8edcd](https://github.com/windmill-labs/windmill/commit/cd8edcd94f2bf44c3e771000cb0bbad08accc0e7))
* support multiline secrets in resource password fields ([#8637](https://github.com/windmill-labs/windmill/issues/8637)) ([26050f9](https://github.com/windmill-labs/windmill/commit/26050f96c34f14826298760174a45f3559d3266c))
* support sensitive/secret fields for non-string types ([#8635](https://github.com/windmill-labs/windmill/issues/8635)) ([375fb66](https://github.com/windmill-labs/windmill/commit/375fb66abe2d1861b53dc2b36d2cf0e2eb82c3a8))
### Bug Fixes
* cap input history per_page to 100 on cloud ([#8624](https://github.com/windmill-labs/windmill/issues/8624)) ([8e973c8](https://github.com/windmill-labs/windmill/commit/8e973c892d768be2da2e6b4b7af9e40b62333052))
* compute highest workspace role across all instance groups ([#8633](https://github.com/windmill-labs/windmill/issues/8633)) ([92b9ac7](https://github.com/windmill-labs/windmill/commit/92b9ac72c5fc9a5085fcb2e9d835ccbb53bcd4b0))
* Ducklake UI Nits ([#8628](https://github.com/windmill-labs/windmill/issues/8628)) ([ef1757f](https://github.com/windmill-labs/windmill/commit/ef1757f5d747e513d201eb6fa48918dba8248abe))
* preserve flow notes/groups and field ordering in generate-metadata ([#8641](https://github.com/windmill-labs/windmill/issues/8641)) ([#8642](https://github.com/windmill-labs/windmill/issues/8642)) ([52a04d2](https://github.com/windmill-labs/windmill/commit/52a04d210f476f4598007f67770bc6520b045950))
* remove timeout on python client httpx to prevent ducklake query timeouts ([#8636](https://github.com/windmill-labs/windmill/issues/8636)) ([c5fccd2](https://github.com/windmill-labs/windmill/commit/c5fccd2f69ad8a6e46c514cf89b9aa21b380e6fe))
* resolve missing form schema for nested suspend steps in FlowNode sub-flows ([#8643](https://github.com/windmill-labs/windmill/issues/8643)) ([12ea7e7](https://github.com/windmill-labs/windmill/commit/12ea7e74237560a9dfc99b6bc1338e3343b57640))
* smarter secret masking based on secret length ([#8629](https://github.com/windmill-labs/windmill/issues/8629)) ([bfc2aef](https://github.com/windmill-labs/windmill/commit/bfc2aefdb8ab92b7284de7f9e485a5504502d944))
## [1.669.1](https://github.com/windmill-labs/windmill/compare/v1.669.0...v1.669.1) (2026-03-30)
### Bug Fixes
* avoid doubled /oauth2 path in Okta custom authorization server URLs ([#8620](https://github.com/windmill-labs/windmill/issues/8620)) ([4817913](https://github.com/windmill-labs/windmill/commit/4817913f0cab49980bfeb442089631d7953955ff))
* improve db health UI text and prevent label wrapping ([d532c1d](https://github.com/windmill-labs/windmill/commit/d532c1d470fcb0ef02ebc5342ad1cf22e58b1f4d))
## [1.669.0](https://github.com/windmill-labs/windmill/compare/v1.668.5...v1.669.0) (2026-03-30)
### Features
* WAC workflow diagram visualization via WASM ([#8604](https://github.com/windmill-labs/windmill/issues/8604)) ([abc6b12](https://github.com/windmill-labs/windmill/commit/abc6b12d6815edc4dda3ddf5f0572ecedcb670dd))
### Bug Fixes
* add path traversal check in service_logs get_log_file endpoint ([#8605](https://github.com/windmill-labs/windmill/issues/8605)) ([5f2d3e6](https://github.com/windmill-labs/windmill/commit/5f2d3e6812f01fe6194bcfd976970a6e3c4186cc))
* cast DuckDB IS_NULLABLE to string in metadata query ([#8607](https://github.com/windmill-labs/windmill/issues/8607)) ([f3012ee](https://github.com/windmill-labs/windmill/commit/f3012ee7ccc7a8947b5f6bd7c7df77984437f91e))
* enable S3 bundle cache for PHP previews without lock file ([#8608](https://github.com/windmill-labs/windmill/issues/8608)) ([ee62315](https://github.com/windmill-labs/windmill/commit/ee6231590ed91063f104e6d054b52e88b569986f))
* enforce workspace isolation on flow resume endpoint ([#8612](https://github.com/windmill-labs/windmill/issues/8612)) ([33032ed](https://github.com/windmill-labs/windmill/commit/33032ed297cf9ea867388d4ea2ece607c9d36dc7))
* handle DuckDB boolean types in ColumnDef deserializers ([#8610](https://github.com/windmill-labs/windmill/issues/8610)) ([22da5bd](https://github.com/windmill-labs/windmill/commit/22da5bd9ea1ca000cfab3eecf1e3fb0fc01200cb))
* use route_service instead of fallback_service for MCP router ([#8614](https://github.com/windmill-labs/windmill/issues/8614)) ([98934d5](https://github.com/windmill-labs/windmill/commit/98934d59c552325fcf88c016e31ae977970e8c9a))
## [1.668.5](https://github.com/windmill-labs/windmill/compare/v1.668.4...v1.668.5) (2026-03-29)
### Bug Fixes
* add per-IP and per-account brute force protection on login endpoint ([#8601](https://github.com/windmill-labs/windmill/issues/8601)) ([06bbe7b](https://github.com/windmill-labs/windmill/commit/06bbe7b94bfb846bd73aaf6abdc83e4c14e70adc))
* add timestamp validation to webhook signature verification ([#8596](https://github.com/windmill-labs/windmill/issues/8596)) ([74fba2a](https://github.com/windmill-labs/windmill/commit/74fba2abf3dc68b682777c01da360258786fded8))
* disable workspace webhook events when CLOUD_HOSTED ([#8598](https://github.com/windmill-labs/windmill/issues/8598)) ([be7fbeb](https://github.com/windmill-labs/windmill/commit/be7fbeb8b1f31d15e33b0783b2a504d6a01e532e))
* harden login rate limiting with CLOUD_HOSTED gating and memory eviction ([#8602](https://github.com/windmill-labs/windmill/issues/8602)) ([754b88a](https://github.com/windmill-labs/windmill/commit/754b88a52c4e76421cb21c1eed87ad9d8385e9aa))
* prevent SSRF and local file read via git repository resource URLs ([#8600](https://github.com/windmill-labs/windmill/issues/8600)) ([845db72](https://github.com/windmill-labs/windmill/commit/845db72b7344fb87ac9c5e24697750549665c7bf))
* rename snippet param to avoid svelte compiler shadowing bug in asset usages drawer ([#8595](https://github.com/windmill-labs/windmill/issues/8595)) ([8c770a2](https://github.com/windmill-labs/windmill/commit/8c770a206a3b0704642c0bda2ab2aeb199d8af3f))
* require mcp: scope for MCP endpoints instead of blanket bypass ([#8597](https://github.com/windmill-labs/windmill/issues/8597)) ([f5fc9f8](https://github.com/windmill-labs/windmill/commit/f5fc9f8485d2ec3e20f8b451305195446b90e5a3))
* use constant-time comparison for API key and basic auth validation ([#8593](https://github.com/windmill-labs/windmill/issues/8593)) ([b4d1f2a](https://github.com/windmill-labs/windmill/commit/b4d1f2aac789306c2e35e123ac93e12c47c26f99))
* validate JSON before sql_builder bind to prevent injection via JSONB queries ([#8599](https://github.com/windmill-labs/windmill/issues/8599)) ([970e859](https://github.com/windmill-labs/windmill/commit/970e859a410b0144847a1a30d7059955effdd402))
## [1.668.4](https://github.com/windmill-labs/windmill/compare/v1.668.3...v1.668.4) (2026-03-29)
### Bug Fixes
* update git sync version to latest cli ([0549f68](https://github.com/windmill-labs/windmill/commit/0549f682fe14f4d4b2f67941362ed2cc29d974a1))
## [1.668.3](https://github.com/windmill-labs/windmill/compare/v1.668.2...v1.668.3) (2026-03-28)
### Bug Fixes
* **cli:** phantom diffs, flow safety, trigger DX, lint watch, error clarity ([#8588](https://github.com/windmill-labs/windmill/issues/8588)) ([c6ce319](https://github.com/windmill-labs/windmill/commit/c6ce3197a72ceeffd702cf2263b1074ecbf1ca33))
## [1.668.2](https://github.com/windmill-labs/windmill/compare/v1.668.1...v1.668.2) (2026-03-28)
### Bug Fixes
* **cli:** app push crash, lint path, push --message, run validation, history timestamps ([#8585](https://github.com/windmill-labs/windmill/issues/8585)) ([f40cdaf](https://github.com/windmill-labs/windmill/commit/f40cdaf43453d2643800ed730d6abe6873bbe8e7))
## [1.668.1](https://github.com/windmill-labs/windmill/compare/v1.668.0...v1.668.1) (2026-03-28)
### Bug Fixes
* **cli:** fix 13 CLI bugs — exit codes, sync tar fallback, variable encryption, JSON output ([#8582](https://github.com/windmill-labs/windmill/issues/8582)) ([38acaa3](https://github.com/windmill-labs/windmill/commit/38acaa3653728bf9e0ae6f746edf433703b4ab63))
## [1.668.0](https://github.com/windmill-labs/windmill/compare/v1.667.0...v1.668.0) (2026-03-28)
### Features
* add DB health diagnostic dashboard for superadmins ([#8574](https://github.com/windmill-labs/windmill/issues/8574)) ([9ceab73](https://github.com/windmill-labs/windmill/commit/9ceab730d7def09c2b46527f8a586789d14f2ce0))
* **cli:** add job, group, audit, token commands and schedule enable/disable ([#8581](https://github.com/windmill-labs/windmill/issues/8581)) ([d29cb23](https://github.com/windmill-labs/windmill/commit/d29cb234dbff07473b911e5e75e362def8a47650))
* IAM RDS auth for PostgreSQL worker resources ([#8573](https://github.com/windmill-labs/windmill/issues/8573)) ([56253c0](https://github.com/windmill-labs/windmill/commit/56253c04cb679c58d00750da699a6cb62ed52aca))
### Bug Fixes
* add Authority Key Identifier to MITM proxy leaf certs ([#8576](https://github.com/windmill-labs/windmill/issues/8576)) ([ce2e6c8](https://github.com/windmill-labs/windmill/commit/ce2e6c8c015110d0385e6afecdc8313aabca1364))
* Improve CLI developer experience: error handling, sync workflow, JSON output, workspace forks ([#8578](https://github.com/windmill-labs/windmill/issues/8578)) ([501a4ff](https://github.com/windmill-labs/windmill/commit/501a4ff2a94510145952686d24ccc639781beefe))
* trigger capture filter and focus issues ([#8579](https://github.com/windmill-labs/windmill/issues/8579)) ([820f28f](https://github.com/windmill-labs/windmill/commit/820f28f8799f8dad5cfab94b51ac9921d664f04a))
## [1.667.0](https://github.com/windmill-labs/windmill/compare/v1.666.0...v1.667.0) (2026-03-27)
### Features
* add schedule support to CLI branch-specific items ([#8570](https://github.com/windmill-labs/windmill/issues/8570)) ([b592996](https://github.com/windmill-labs/windmill/commit/b592996eee98ddb664f1b007b95a2096d5d4e3a6))
* add workspace-level service accounts ([#8560](https://github.com/windmill-labs/windmill/issues/8560)) ([3959fe8](https://github.com/windmill-labs/windmill/commit/3959fe82974f5f0383e94fd83a5d78fe4212d56a))
* **cli:** generate commented wmill.yaml and add config reference command ([#8546](https://github.com/windmill-labs/windmill/issues/8546)) ([d06b426](https://github.com/windmill-labs/windmill/commit/d06b42613f73c4a7b31c990be22b0c97efab2666))
* DB-coordinated graceful restart staggering for settings changes ([#8555](https://github.com/windmill-labs/windmill/issues/8555)) ([2f32675](https://github.com/windmill-labs/windmill/commit/2f326758013dd1f1e6ae732e5784a32f1fb6e4bd))
* improve-replay-ui ([#8250](https://github.com/windmill-labs/windmill/issues/8250)) ([c0aafee](https://github.com/windmill-labs/windmill/commit/c0aafee9a9923d5dc2fa3b99da4378e923933a06))
* support multiple folder selection in MCP scope selector ([#8557](https://github.com/windmill-labs/windmill/issues/8557)) ([ad19ac9](https://github.com/windmill-labs/windmill/commit/ad19ac9b37b04591c921f93f180bdda961af6cef))
### Bug Fixes
* **cli:** preserve inline script files during flow generate-locks ([#8561](https://github.com/windmill-labs/windmill/issues/8561)) ([a8b651d](https://github.com/windmill-labs/windmill/commit/a8b651da9ff86766119e14c0b61652be8a7b453a))
* emit 0 for OTEL queue metrics when tag queue is empty ([#8559](https://github.com/windmill-labs/windmill/issues/8559)) ([79cc4a9](https://github.com/windmill-labs/windmill/commit/79cc4a92d88486c999799826bd0c9663767103f5))
* handle inline script deletion in sync push + flow new nonDottedPaths ([#8553](https://github.com/windmill-labs/windmill/issues/8553)) ([943fe9c](https://github.com/windmill-labs/windmill/commit/943fe9c6cc9b046e24007e45b5c37afc4804256a))
* include importer_kind in dependency debounce key to prevent cross-kind collisions ([#8567](https://github.com/windmill-labs/windmill/issues/8567)) ([bc7007b](https://github.com/windmill-labs/windmill/commit/bc7007bb4265e1f1375c1f0678b74325882a4e92))
* multi-script dedicated workers race on shared job_dir ([#8551](https://github.com/windmill-labs/windmill/issues/8551)) ([#8569](https://github.com/windmill-labs/windmill/issues/8569)) ([63a3573](https://github.com/windmill-labs/windmill/commit/63a3573951d1f724cc63728ed973d039a5468072))
* preserve notes on nodes inside collapsed groups ([#8552](https://github.com/windmill-labs/windmill/issues/8552)) ([0fb1153](https://github.com/windmill-labs/windmill/commit/0fb115304afc49812420e9ce24e5048502621059))
* sanitize flow step summaries for filesystem-safe names ([#8554](https://github.com/windmill-labs/windmill/issues/8554)) ([e15bfbf](https://github.com/windmill-labs/windmill/commit/e15bfbf91ee1517432a6861ebb48e129485006aa))
* use admin db pool in get_copilot_settings_state ([#8564](https://github.com/windmill-labs/windmill/issues/8564)) ([70f3ee5](https://github.com/windmill-labs/windmill/commit/70f3ee5ed4470e9993be822874f2b38e83a96611))
### Performance Improvements
* enable bun bundle caching for WAC v2 scripts ([#8556](https://github.com/windmill-labs/windmill/issues/8556)) ([ab868e9](https://github.com/windmill-labs/windmill/commit/ab868e9ebceadaa55e54770d9d59dc5524da13ff))
## [1.666.0](https://github.com/windmill-labs/windmill/compare/v1.665.0...v1.666.0) (2026-03-26)
### Features
* add PDF input support to AI agent ([#8525](https://github.com/windmill-labs/windmill/issues/8525)) ([e44504c](https://github.com/windmill-labs/windmill/commit/e44504c6e93e7a4ee94ced03ab626b79a4fd0754))
### Bug Fixes
* add relative imports to the dependency list in deploymentUI ([#8548](https://github.com/windmill-labs/windmill/issues/8548)) ([d760ea5](https://github.com/windmill-labs/windmill/commit/d760ea5eaf4dc33007f1fd3e5e07b86925a0aa11))
* filter null entries in FileUpload initialValue to prevent s3 access error ([#8544](https://github.com/windmill-labs/windmill/issues/8544)) ([1a73012](https://github.com/windmill-labs/windmill/commit/1a73012e0737a6ebea8307013dc0f79982269d91))
* pass pre-bound TcpListener to run_server to fix Windows CI test race ([#8542](https://github.com/windmill-labs/windmill/issues/8542)) ([d7f4b95](https://github.com/windmill-labs/windmill/commit/d7f4b950ce6e966ed1b410e03d48fe96bc036e73))
* resolve parent_hash race condition in sync push with auto_parent ([#8545](https://github.com/windmill-labs/windmill/issues/8545)) ([71549c3](https://github.com/windmill-labs/windmill/commit/71549c3db053bcc209c7065ac8cd42f1e8047cc3))
* upload_s3_file not working in VS Code extension ([#8547](https://github.com/windmill-labs/windmill/issues/8547)) ([1fa4d91](https://github.com/windmill-labs/windmill/commit/1fa4d919b30ac9eff2d1789fba2695450ba115e7))
## [1.665.0](https://github.com/windmill-labs/windmill/compare/v1.664.0...v1.665.0) (2026-03-26)
### Features
* add instance setting to enforce workspace prefix for HTTP routes ([#8528](https://github.com/windmill-labs/windmill/issues/8528)) ([9b3e558](https://github.com/windmill-labs/windmill/commit/9b3e558d84f15052e9c32695a467f8ef7e4ad1f5))
* add trashbin system for soft-deleting items ([#8519](https://github.com/windmill-labs/windmill/issues/8519)) ([69ce946](https://github.com/windmill-labs/windmill/commit/69ce946241d98ea90bc7135d44ca0c87f928be88))
* mask sensitive values in job logs ([#8520](https://github.com/windmill-labs/windmill/issues/8520)) ([0885d8c](https://github.com/windmill-labs/windmill/commit/0885d8c986f13ac210e4db3ad38febe9be391ba4))
* move basic git sync from EE to CE with runtime user count gating ([#8493](https://github.com/windmill-labs/windmill/issues/8493)) ([79d2bd5](https://github.com/windmill-labs/windmill/commit/79d2bd51a00654162754046308d7670242120df6))
* runner groups for shared-process multi-script dedicated workers ([#8434](https://github.com/windmill-labs/windmill/issues/8434)) ([c28314f](https://github.com/windmill-labs/windmill/commit/c28314f424ea0e04b86565ce88e6c91e0df1a0cf))
* SCIM user deprovisioning (active:false) + instance-level user disable ([#8484](https://github.com/windmill-labs/windmill/issues/8484)) ([0bd7568](https://github.com/windmill-labs/windmill/commit/0bd756839c0261f255111d62088bdaaecb838085))
* show groups and notes in flow status viewer ([#8535](https://github.com/windmill-labs/windmill/issues/8535)) ([167084a](https://github.com/windmill-labs/windmill/commit/167084a0ebe73384fa0d31f0b24017a47686a072))
### Bug Fixes
* auto-generate datatable SDK reference for app mode system prompt ([#8522](https://github.com/windmill-labs/windmill/issues/8522)) ([8a32322](https://github.com/windmill-labs/windmill/commit/8a32322c187ccc60ec7eafb61a9678f267a82282))
* consider wmill.yaml environments alias in git sync ([#8532](https://github.com/windmill-labs/windmill/issues/8532)) ([b7475c7](https://github.com/windmill-labs/windmill/commit/b7475c73094a28f520f798f6cb1a0c6b4807ccb7))
* GitHub Enterprise Server support for self-managed GitHub Apps ([#8507](https://github.com/windmill-labs/windmill/issues/8507)) ([935fb44](https://github.com/windmill-labs/windmill/commit/935fb44c848b8bf9430b5600dd3c3bedb2f89efd))
* raw apps bundle not found during deployment error ([#8515](https://github.com/windmill-labs/windmill/issues/8515)) ([34e3115](https://github.com/windmill-labs/windmill/commit/34e3115bcbd19a8e0b6f483435586a2ab43d0a8e))
* require admin for workspace encryption key export ([#8523](https://github.com/windmill-labs/windmill/issues/8523)) ([0317668](https://github.com/windmill-labs/windmill/commit/031766808945aefc926f0836d011c0b2a5d2243d))
* restrict logout redirect to whitelisted domains ([#8524](https://github.com/windmill-labs/windmill/issues/8524)) ([4c8edd5](https://github.com/windmill-labs/windmill/commit/4c8edd5e944d77ed2d41c2b87171c1115c0fdcdc))
* serve index disk storage sizes from /srch/ endpoint ([#8511](https://github.com/windmill-labs/windmill/issues/8511)) ([e3620e0](https://github.com/windmill-labs/windmill/commit/e3620e074e1bdb46b2b8d732f35a91d300589663))
* use /apps_raw/get/ redirect URL for raw apps set as workspace default ([#8508](https://github.com/windmill-labs/windmill/issues/8508)) ([85c52e2](https://github.com/windmill-labs/windmill/commit/85c52e2cded10606cc895d0d3b717e13c69bc9b3))
* use resource-level scope overrides during OAuth2 token refresh ([#8540](https://github.com/windmill-labs/windmill/issues/8540)) ([55ad0ff](https://github.com/windmill-labs/windmill/commit/55ad0ff5c499c33b766f47c6f32ba5d3eeb14763))
## [1.664.0](https://github.com/windmill-labs/windmill/compare/v1.663.0...v1.664.0) (2026-03-24)
### Features
* add instance-level AI settings ([#8453](https://github.com/windmill-labs/windmill/issues/8453)) ([db5e036](https://github.com/windmill-labs/windmill/commit/db5e03610da325288d53afdbca94b9cbfc7ceace))
* add selfApproval option to WAC + inline approval buttons ([#8440](https://github.com/windmill-labs/windmill/issues/8440)) ([d578e40](https://github.com/windmill-labs/windmill/commit/d578e40101a838d3dffda14157cf72ee4d5a93c0))
* flow group nodes with collapsible groups ([#8075](https://github.com/windmill-labs/windmill/issues/8075)) ([81eb446](https://github.com/windmill-labs/windmill/commit/81eb446eee359f44374b81320690e5345fd08c15))
### Bug Fixes
* add GIT_SSL_CAINFO to tracing proxy env vars ([#8502](https://github.com/windmill-labs/windmill/issues/8502)) ([bdfd5d5](https://github.com/windmill-labs/windmill/commit/bdfd5d57261a4bb760fc57ad41ee56aff9b9c0af))
* create parent dirs and accept 'python' alias in script bootstrap ([#8497](https://github.com/windmill-labs/windmill/issues/8497)) ([7f27d99](https://github.com/windmill-labs/windmill/commit/7f27d996accb3c3b471d1c50df397867d89c738a))
## [1.663.0](https://github.com/windmill-labs/windmill/compare/v1.662.0...v1.663.0) (2026-03-24)
### Features
* add summary field for native triggers ([#8476](https://github.com/windmill-labs/windmill/issues/8476)) ([5089a45](https://github.com/windmill-labs/windmill/commit/5089a458819abbc6f241bc354bebb91520bd1a52))
* add typed request body to OpenAPI spec generation ([#8481](https://github.com/windmill-labs/windmill/issues/8481)) ([37ebaf4](https://github.com/windmill-labs/windmill/commit/37ebaf4d0ac342703498733f97778a552f979f6a))
* **cli:** better stale scripts detection [#3](https://github.com/windmill-labs/windmill/issues/3) ([#8480](https://github.com/windmill-labs/windmill/issues/8480)) ([9643006](https://github.com/windmill-labs/windmill/commit/9643006f1e90b991b334bb58caf62301bc26d09d))
* Debounce node ([#8324](https://github.com/windmill-labs/windmill/issues/8324)) ([5d1c54d](https://github.com/windmill-labs/windmill/commit/5d1c54d9b33d6ff6f2c98481a2740d1e7629cdfa))
* surface permissioned_as selector in trigger editor UI ([#8475](https://github.com/windmill-labs/windmill/issues/8475)) ([f035b53](https://github.com/windmill-labs/windmill/commit/f035b538bbd786445526339f88be8f33a3628105))
### Bug Fixes
* clean up stale dependency map entries for renamed scripts ([#8492](https://github.com/windmill-labs/windmill/issues/8492)) ([47c0c36](https://github.com/windmill-labs/windmill/commit/47c0c363f4fc1d9af7efd07ea172e32989ce50d2))
* **cli:** add Svelte 5 event delegation guidance and safe push to raw-app skill ([#8466](https://github.com/windmill-labs/windmill/issues/8466)) ([911df95](https://github.com/windmill-labs/windmill/commit/911df958e78d2dab9823dfa7d7e5c9824fc2d565))
* Fix worker panic when job_isolation changed to unshare at runtime ([#8490](https://github.com/windmill-labs/windmill/issues/8490)) ([cbe47c0](https://github.com/windmill-labs/windmill/commit/cbe47c0b6c22f79452d020777e481ee26970f25b))
* improve SQS retries ([3c8d351](https://github.com/windmill-labs/windmill/commit/3c8d351c9722a089133871019d27cf3bc3cdc159))
* Move database manager SQL queries to backend ([#8306](https://github.com/windmill-labs/windmill/issues/8306)) ([aa30fd2](https://github.com/windmill-labs/windmill/commit/aa30fd252dcf40233d191c43a6293fb9feabf010))
* prevent SQL injection in job query parameters ([#8494](https://github.com/windmill-labs/windmill/issues/8494)) ([54f5a19](https://github.com/windmill-labs/windmill/commit/54f5a19377e9df712e18f85f896e21b1776981ed))
* respect NO_COLOR env variable for stdout log output ([#8483](https://github.com/windmill-labs/windmill/issues/8483)) ([f329ee7](https://github.com/windmill-labs/windmill/commit/f329ee7aaefbae0ad344743c40825440a936bd30))
* show effective isolation level on workers page ([#8491](https://github.com/windmill-labs/windmill/issues/8491)) ([37886ed](https://github.com/windmill-labs/windmill/commit/37886edda1443293806a9b1b810196b72e076b12))
* skip debounce arg accumulation when batch table is empty (CE) ([#8485](https://github.com/windmill-labs/windmill/issues/8485)) ([010753c](https://github.com/windmill-labs/windmill/commit/010753c73ac85237af50acadf9c08567b1bc993c))
* stop_after_if with empty error_message prevents flow from stopping ([#8464](https://github.com/windmill-labs/windmill/issues/8464)) ([1503bf9](https://github.com/windmill-labs/windmill/commit/1503bf948e3340b8a6933d71885f8f2cb8dc1867))
## [1.662.0](https://github.com/windmill-labs/windmill/compare/v1.661.0...v1.662.0) (2026-03-20)
### Features
* mcp oauth gateway ([#8443](https://github.com/windmill-labs/windmill/issues/8443)) ([51957f7](https://github.com/windmill-labs/windmill/commit/51957f7d921b624fc132ca9ea03cdd30a5810e51))
### Bug Fixes
* replace email with permissioned_as for triggers/schedules ([#8439](https://github.com/windmill-labs/windmill/issues/8439)) ([efb4a27](https://github.com/windmill-labs/windmill/commit/efb4a27d5181bf9db3deb5e8100ec60adbe45e7f))
* strip invalid enum values from MCP schemas ([#8462](https://github.com/windmill-labs/windmill/issues/8462)) ([88ad376](https://github.com/windmill-labs/windmill/commit/88ad3767916b86c4e0b272d040ee0b75a0580d76))
## [1.661.0](https://github.com/windmill-labs/windmill/compare/v1.660.1...v1.661.0) (2026-03-19)
### Features
* add OTel metrics support ([#8442](https://github.com/windmill-labs/windmill/issues/8442)) ([7de98c0](https://github.com/windmill-labs/windmill/commit/7de98c0df464d8a7c9cf5d04228753294183f759))
### Bug Fixes
* fix datatable setup on RDS ([#8450](https://github.com/windmill-labs/windmill/issues/8450)) ([446afb5](https://github.com/windmill-labs/windmill/commit/446afb5b36211e5cbe8a279ce68f2f790a5953b9))
* full code apps deployable on merge UI and deploy UI ([#8451](https://github.com/windmill-labs/windmill/issues/8451)) ([0e022b1](https://github.com/windmill-labs/windmill/commit/0e022b14fd36e897106219010917bd7ceabf4078))
* improve DND drag feedback in EditableSchemaForm ([#8449](https://github.com/windmill-labs/windmill/issues/8449)) ([fd7f0d3](https://github.com/windmill-labs/windmill/commit/fd7f0d3da9153d91c15df5847aaae51e67479cde))
* prevent raw app iframe reload on userStore refresh ([#8455](https://github.com/windmill-labs/windmill/issues/8455)) ([4e59a1a](https://github.com/windmill-labs/windmill/commit/4e59a1a166847045897a6b576812bb53546e683b))
* resolve blank inline script panel for components with underscores in ID ([#8457](https://github.com/windmill-labs/windmill/issues/8457)) ([b2c1e3d](https://github.com/windmill-labs/windmill/commit/b2c1e3de0a263f606127f0decedb11a2ce0b822b))
## [1.660.1](https://github.com/windmill-labs/windmill/compare/v1.660.0...v1.660.1) (2026-03-19)
### Bug Fixes
* prevent S3 file browser crash when selecting storage ([#8444](https://github.com/windmill-labs/windmill/issues/8444)) ([a8fa0cc](https://github.com/windmill-labs/windmill/commit/a8fa0cccef870f841c68be77832d9be12109badb))
* schema inference not updating on reset and language switch ([#8446](https://github.com/windmill-labs/windmill/issues/8446)) ([c0edbe4](https://github.com/windmill-labs/windmill/commit/c0edbe431773f878201e96a79ce291d4b37a10bb))
## [1.660.0](https://github.com/windmill-labs/windmill/compare/v1.659.1...v1.660.0) (2026-03-18)
### Features
* **cli:** use local scripts when previewing flows ([#8365](https://github.com/windmill-labs/windmill/issues/8365)) ([435de95](https://github.com/windmill-labs/windmill/commit/435de95e7d5c9433dafac5369cfc533fd738fc22))
* MCP server readiness for Anthropic connectors directory ([#8438](https://github.com/windmill-labs/windmill/issues/8438)) ([1cfb40b](https://github.com/windmill-labs/windmill/commit/1cfb40bdaa877f1616fc1c1cf5fb6b6aa1832b86))
### Bug Fixes
* exclude wm_deployers group from CE group limit check ([#8429](https://github.com/windmill-labs/windmill/issues/8429)) ([9a6ce44](https://github.com/windmill-labs/windmill/commit/9a6ce44c8414810292ebc8a1ae64950ee2c76307))
* prevent AI agent tool jobs from becoming zombies on cancellation ([#8437](https://github.com/windmill-labs/windmill/issues/8437)) ([f4489cb](https://github.com/windmill-labs/windmill/commit/f4489cbe645489a892994c70d17df2284b494568))
* show cancelled WAC jobs as done in workflow timeline ([#8436](https://github.com/windmill-labs/windmill/issues/8436)) ([bee9282](https://github.com/windmill-labs/windmill/commit/bee928276e098ce7b17e20af74e34458e5c5353e))
### Performance Improvements
* cache composer vendor dir to skip reinstall on repeated php executions ([#8330](https://github.com/windmill-labs/windmill/issues/8330)) ([66a8e84](https://github.com/windmill-labs/windmill/commit/66a8e844a64d91d57dcabb7ad31d9308dec99032))
## [1.659.1](https://github.com/windmill-labs/windmill/compare/v1.659.0...v1.659.1) (2026-03-18)
### Bug Fixes
* add checkpoint.json mount to python nsjail config for WAC v2 ([#8421](https://github.com/windmill-labs/windmill/issues/8421)) ([4829f44](https://github.com/windmill-labs/windmill/commit/4829f447ed3df8489995c5e54955fbfe6b31e37d))
* cleanup job debounce batch ([#8420](https://github.com/windmill-labs/windmill/issues/8420)) ([ad03a5d](https://github.com/windmill-labs/windmill/commit/ad03a5dbd7f93748115037791143249ba0ab6161))
* **frontend:** fix output of resultnode + svelte5 nits ([#8424](https://github.com/windmill-labs/windmill/issues/8424)) ([f481ea4](https://github.com/windmill-labs/windmill/commit/f481ea4059b4e5cb01273cffeb53ff340e8bd5bd))
* per-tab test panel in script editor for WAC v2 modules ([#8422](https://github.com/windmill-labs/windmill/issues/8422)) ([0f26169](https://github.com/windmill-labs/windmill/commit/0f261695a3cb2c3a95d16390e54aa7a6ac3e11e7))
## [1.659.0](https://github.com/windmill-labs/windmill/compare/v1.658.0...v1.659.0) (2026-03-17)
### Features
* add end_user_email claim to OIDC ID tokens ([#8401](https://github.com/windmill-labs/windmill/issues/8401)) ([de5b13b](https://github.com/windmill-labs/windmill/commit/de5b13b840f90e23df1871f80317fdcc2b98174d))
* add ws_base_url instance setting for WebSocket URL override ([#8405](https://github.com/windmill-labs/windmill/issues/8405)) ([372023e](https://github.com/windmill-labs/windmill/commit/372023e99560885a76e8da3487ae705fd2f861d4))
* **cli:** add --env alias for --branch and environments config alias ([#8415](https://github.com/windmill-labs/windmill/issues/8415)) ([fe051aa](https://github.com/windmill-labs/windmill/commit/fe051aa22b59cc1c450b14af9c5f203448bb3dd5))
* DB-backed instance events webhook with superadmin UI ([#8402](https://github.com/windmill-labs/windmill/issues/8402)) ([7d9fb57](https://github.com/windmill-labs/windmill/commit/7d9fb57368ad3b2c719523ef649c9bd5fddf17a5))
* instance groups instance-level role support ([#8404](https://github.com/windmill-labs/windmill/issues/8404)) ([18b3528](https://github.com/windmill-labs/windmill/commit/18b3528ba4188721d918fd47f0f86a6b41209453))
* script module mode with CLI sync, preview, and WAC UI improvements ([#8380](https://github.com/windmill-labs/windmill/issues/8380)) ([31d6660](https://github.com/windmill-labs/windmill/commit/31d6660d56cd23d9269133d430b0607d58314229))
* store hashed tokens instead of plaintext ([#8217](https://github.com/windmill-labs/windmill/issues/8217)) ([f2be625](https://github.com/windmill-labs/windmill/commit/f2be625348ef308e9768d487e110abbd44d27855))
* workspace-specific registry overrides ([#8406](https://github.com/windmill-labs/windmill/issues/8406)) ([73fe45b](https://github.com/windmill-labs/windmill/commit/73fe45b6cb97ce50d029240c6bd63917b301abe1))
### Bug Fixes
* devops getting logged out on workers page ([#8416](https://github.com/windmill-labs/windmill/issues/8416)) ([920a7f9](https://github.com/windmill-labs/windmill/commit/920a7f9fa4719015885947b9de0c35e5e618fcc8))
* Folders as presets in FilterSearchbar ([#8409](https://github.com/windmill-labs/windmill/issues/8409)) ([ebf9347](https://github.com/windmill-labs/windmill/commit/ebf9347d3fd876689dba58bc24399e9036ef5b67))
* improve OOM killer observability for debugging pod-level kills ([#8398](https://github.com/windmill-labs/windmill/issues/8398)) ([fd41cd1](https://github.com/windmill-labs/windmill/commit/fd41cd12b444fb2439214fcd25536280e5baacb2))
## [1.658.0](https://github.com/windmill-labs/windmill/compare/v1.657.2...v1.658.0) (2026-03-16)
### Features
* add GET /api/saml/metadata endpoint ([#8394](https://github.com/windmill-labs/windmill/issues/8394)) ([50b24cf](https://github.com/windmill-labs/windmill/commit/50b24cfdc8bf54656adbdc3315037aa773632076))
* support custom headers in customai resource type ([#8364](https://github.com/windmill-labs/windmill/issues/8364)) ([5acb367](https://github.com/windmill-labs/windmill/commit/5acb367cf9b4b96ac7129c91df229d1a25258f5b))
* support multiple secret variables during resource creation ([#8386](https://github.com/windmill-labs/windmill/issues/8386)) ([54841b7](https://github.com/windmill-labs/windmill/commit/54841b7549d5c9719d4dc3cb43e282ba057cd0f3))
### Bug Fixes
* /updatesqlx now uses ee-repo-ref.txt commit hash ([#8387](https://github.com/windmill-labs/windmill/issues/8387)) ([a519d41](https://github.com/windmill-labs/windmill/commit/a519d4113086430ace1d7ac8795bd2c2a8cf99e9))
* **native-triggers:** preserve API error response body in HttpRequestError ([#8392](https://github.com/windmill-labs/windmill/issues/8392)) ([1eee89d](https://github.com/windmill-labs/windmill/commit/1eee89d99fbf31751d6257a4015e0b22e3871372))
* OutputPicker shows stale result after 'Test up to here' ([#8390](https://github.com/windmill-labs/windmill/issues/8390)) ([2907084](https://github.com/windmill-labs/windmill/commit/2907084ca653fc5540bb04a409d2789ddaeec05b))
* propagate enterprise feature to windmill-api-schedule ([#8391](https://github.com/windmill-labs/windmill/issues/8391)) ([50ef9e7](https://github.com/windmill-labs/windmill/commit/50ef9e79fcef8ee2cccd789b5eb1aacf5647365f))
* set nsjail time_limit from job timeout so configured defaults are respected ([#8389](https://github.com/windmill-labs/windmill/issues/8389)) ([65a92d9](https://github.com/windmill-labs/windmill/commit/65a92d98994dbe4ae90a5e554e55b3ab44463f86))
* soft error on AI agent max iterations + rename retries tab to error handling ([#8366](https://github.com/windmill-labs/windmill/issues/8366)) ([1a1e8a1](https://github.com/windmill-labs/windmill/commit/1a1e8a164cccbfcc663b963cb062af9208ff51be))
* use bookworm-based php image to fix glibc 2.38 incompatibility ([#8381](https://github.com/windmill-labs/windmill/issues/8381)) ([68fd900](https://github.com/windmill-labs/windmill/commit/68fd900076ecf8b20f6622cd5794f1b52c0f5cab))
## [1.657.2](https://github.com/windmill-labs/windmill/compare/v1.657.1...v1.657.2) (2026-03-15)
### Bug Fixes
* **cli:** Fix nonDottedPaths handling in cli flow lock generation ([#8375](https://github.com/windmill-labs/windmill/issues/8375)) ([eb03ebb](https://github.com/windmill-labs/windmill/commit/eb03ebbb0486b33c290fba3c34ea959e6e82fd13))
## [1.657.1](https://github.com/windmill-labs/windmill/compare/v1.657.0...v1.657.1) (2026-03-14)
### Bug Fixes
* powershell WindmillClient module loading on Windows workers ([#8370](https://github.com/windmill-labs/windmill/issues/8370)) ([3a268a9](https://github.com/windmill-labs/windmill/commit/3a268a9cf16add2ea2530e6eab247120a4d4754e))
## [1.657.0](https://github.com/windmill-labs/windmill/compare/v1.656.0...v1.657.0) (2026-03-14)


@@ -26,7 +26,6 @@ Open-source platform for internal tools, workflows, API integrations, background
- **DB**: `psql postgres://postgres:changeme@localhost:5432/windmill`
- **Login**: `admin@windmill.dev` / `changeme`
- **Instance settings**: navigate to `/#superadmin-settings`
- **Migrations**: use `cargo sqlx migrate add -r <name>` from `backend/` to create new migrations (never generate timestamps manually)
## Banned Patterns
@@ -59,8 +58,6 @@ let { my_prop = $bindable(default_value) }: { my_prop?: string } = $props()
- `refs "X" --caller` instead of reading files to find which function contains each reference
- `callers "X"` / `callees "X"` for call-graph questions
EE files (`*_ee.rs`, `*_ee.ts`, `*_ee.svelte`) are indexed — you can `outline`, `def`, `body`, `refs` etc. on them just like regular files.
```bash
NAV="sh wm-ts-nav/nav"
# Use --root backend for Rust, --root frontend/src for TS/Svelte


@@ -11,8 +11,18 @@
{$BASE_URL} {
bind {$ADDRESS}
# Extra services: LSP, Multiplayer, Debugger (windmill_extra gateway)
reverse_proxy /ws/* /ws_mp/* /ws_debug/* http://windmill_extra:3000
# LSP - Language Server Protocol for code intelligence (windmill_extra:3001)
reverse_proxy /ws/* http://windmill_extra:3001
# Multiplayer - Real-time collaboration, Enterprise Edition (windmill_extra:3002)
# Uncomment and set ENABLE_MULTIPLAYER=true in docker-compose.yml
# reverse_proxy /ws_mp/* http://windmill_extra:3002
# Debugger - Interactive debugging via DAP WebSocket (windmill_extra:3003)
# Set ENABLE_DEBUGGER=true in docker-compose.yml to enable
handle_path /ws_debug/* {
reverse_proxy http://windmill_extra:3003
}
# Search indexer, Enterprise Edition (windmill_indexer:8002)
# reverse_proxy /api/srch/* http://windmill_indexer:8002


@@ -118,18 +118,6 @@ RUN --mount=type=cache,target=/usr/local/cargo/registry \
--mount=type=cache,target=$SCCACHE_DIR,sharing=locked \
CARGO_NET_GIT_FETCH_WITH_CLI=true cargo build --release --features "$features"
# Split debug info into a separate file, then strip the binary.
# The .debug file can be extracted as a CI artifact for production debugging.
# The debuglink allows gdb to auto-discover the debug file when placed next to the binary.
RUN objcopy --only-keep-debug /windmill/target/release/windmill /windmill/target/release/windmill.debug \
&& strip /windmill/target/release/windmill \
&& objcopy --add-gnu-debuglink=/windmill/target/release/windmill.debug /windmill/target/release/windmill
# Standalone stage for extracting the .debug file without including it in the final image.
# Build with: docker build --target debuginfo --output type=local,dest=./out .
FROM scratch AS debuginfo
COPY --from=builder /windmill/target/release/windmill.debug /windmill.debug
FROM ${DEBIAN_IMAGE}
ARG TARGETPLATFORM
@@ -162,19 +150,11 @@ ENV PATH /usr/local/bin:/root/.local/bin:/tmp/.local/bin:$PATH
RUN apt-get update \
&& apt-get install -y --no-install-recommends netbase tzdata ca-certificates wget curl jq unzip build-essential unixodbc xmlsec1 software-properties-common tini gnupg lsb-release \
&& apt-get install -y --no-install-recommends netbase tzdata ca-certificates wget curl jq unzip build-essential unixodbc xmlsec1 software-properties-common tini \
&& if echo "$features" | grep -q "ee"; then apt-get install -y --no-install-recommends libsasl2-modules-gssapi-mit krb5-user; fi \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
# Install latest PostgreSQL client (pg_dump) from official PostgreSQL apt repository
RUN curl -fsSL https://www.postgresql.org/media/keys/ACCC4CF8.asc | gpg --dearmor -o /usr/share/keyrings/postgresql-archive-keyring.gpg \
&& echo "deb [signed-by=/usr/share/keyrings/postgresql-archive-keyring.gpg] https://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list \
&& apt-get update \
&& apt-get install -y --no-install-recommends postgresql-client \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
RUN if [ "$WITH_GIT" = "true" ]; then \
apt-get update -y \
&& apt-get install -y git \
@@ -288,7 +268,7 @@ RUN bun install -g windmill-cli \
RUN curl -fsSL https://claude.ai/install.sh | bash \
&& cp /root/.local/share/claude/versions/* /usr/bin/claude
COPY --from=php:8.3.30-cli-bookworm /usr/local/bin/php /usr/bin/php
COPY --from=php:8.3.30-cli /usr/local/bin/php /usr/bin/php
COPY --from=composer:2.9.5 /usr/bin/composer /usr/bin/composer
# add the docker client to call docker from a worker if enabled

ai_evals/.gitignore

@@ -1,2 +0,0 @@
.env
results/


@@ -1,172 +0,0 @@
# AI Evals Authoring Guide
This folder contains black-box benchmark cases for:
- `flow`
- `app`
- `script`
- `cli`
The goal is to test the current production prompts and guidance with realistic user requests, not to test one exact implementation shape.
## Core rules
1. Write prompts like a real user request.
2. Prefer behavior, inputs, constraints, and outcomes over internal implementation details.
3. Keep deterministic validation narrow and hard.
4. Put semantic expectations in `judgeChecklist`.
5. Use `expected` fixtures only when exact structure really matters.
## Prompt writing
Prompts should sound like something a user would naturally ask.
Good:
- "Create a flow that routes support requests based on customer tier."
- "Add a reset button that sets the counter back to 0."
- "Create a flow that reuses the existing greeting script instead of duplicating the logic."
Bad:
- "Use `branchone` with 3 branches and a default branch."
- "Create a `rawscript` step with this exact topology."
- "This is a benchmark harness."
Do not write prompts as if the user knows Windmill internals unless the case is explicitly testing a power-user workflow.
## Flow-specific rules
The main principle:
- flow prompts should read like requests from a user who does not know the product internals
- the user should ask for behavior, not for `branchone`, `branchall`, `rawscript`, `preprocessor_module`, `failure_module`, exact graph topology, or other internal constructs
That means:
- creation cases should describe the business behavior and expected result
- modification cases may mention existing step names, because the user can see the current flow
- only mention special Windmill constructs when the case is explicitly about those constructs
Examples:
- acceptable creation prompt:
"Create a purchase approval flow that pauses for approval and asks the approver for a comment."
- avoid:
"Create a suspend step with one required event and a resume form."
For flow cases, do not fail a case just because the model chose a different valid topology.
## App-specific rules
App prompts should focus on user-visible behavior:
- what the UI should let the user do
- what should persist
- what backend behavior is needed
Avoid prompting in terms of React structure, component names, or implementation unless the case is specifically about editing an existing app.
## CLI-specific rules
CLI prompts can be more explicit about paths and file names because real CLI users often do specify them.
Still, avoid benchmark phrasing. The prompt should read like a repo task, not a harness instruction.
When relevant, ask the assistant to tell the user which `wmill` commands to run next. That is part of the benchmarked behavior.
## Deterministic validation
Use deterministic validation only for hard failures such as:
- missing required files
- unexpected extra files when the prompt says not to create them
- syntax errors
- unresolved flow refs
- missing required special modules or suspend config
- obvious artifact corruption
Do not use deterministic validation to enforce one preferred implementation for broad creation tasks.
Examples of bad hard checks:
- exact step topology for a creation flow
- exact branch structure when the prompt only asked for routing behavior
- exact input shape when multiple reasonable shapes are acceptable
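A minimal sketch of what a narrow hard check could look like. The `GeneratedArtifact` shape and the `validateHardFailures` helper are hypothetical names used for illustration only, not taken from the benchmark code. The check flags only objectively broken output (missing required files, unexpected extra files) and says nothing about topology, which is left to the judge.

```typescript
// Hypothetical artifact shape: relative file path -> file content.
type GeneratedArtifact = Record<string, string>;

// Hypothetical options; not the real validator configuration.
interface HardCheckOptions {
  requiredFiles: string[];
  // When true, any file outside requiredFiles counts as a hard failure.
  forbidExtraFiles?: boolean;
}

// Returns a list of hard failures; an empty list means the attempt passes this check.
function validateHardFailures(
  artifact: GeneratedArtifact,
  options: HardCheckOptions
): string[] {
  const failures: string[] = [];
  const presentFiles = Object.keys(artifact);

  for (const file of options.requiredFiles) {
    if (presentFiles.indexOf(file) === -1) {
      failures.push(`missing required file: ${file}`);
    }
  }

  if (options.forbidExtraFiles) {
    for (const file of presentFiles) {
      if (options.requiredFiles.indexOf(file) === -1) {
        failures.push(`unexpected extra file: ${file}`);
      }
    }
  }

  return failures;
}
```

Anything beyond this kind of check, such as how the model structured its steps, is better expressed as a `judgeChecklist` item.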
## Judge checklist
Every non-trivial case should have a `judgeChecklist`.
The checklist should capture:
- the user-visible behavior that must be present
- important constraints
- key completion criteria
The checklist should not duplicate low-level implementation details unless they are truly required by the task.
Good checklist items:
- "the flow calculates the order total with 8% tax"
- "the app persists recipes appropriately for a raw Windmill app"
- "the flow reuses the existing workspace script instead of rewriting the logic"
Bad checklist items:
- "uses `branchone`"
- "contains a `rawscript` node"
## When to use `expected`
Use `expected` fixtures when the case is structure-sensitive, for example:
- exact file creation
- exact script content
- modification cases where a specific file must change in a specific way
- cases where preserving an existing structure is part of the requirement
Do not use a full `expected` artifact as the semantic oracle for broad creation tasks when multiple valid outputs should pass.
## When to use `initial`
Use `initial` when the benchmark is about:
- editing an existing artifact
- reusing existing workspace assets
- preserving existing behavior while adding a change
If the case is greenfield, prefer no `initial`.
## Case design ladder
Prefer suites that get gradually harder:
1. trivial create case
2. realistic create case
3. reuse-existing-assets case
4. modification case
5. refactor case
6. edge-case or niche product behavior
The last cases in a suite should cover unusual or product-specific behavior.
## Anti-patterns
Avoid these:
- benchmark framing in prompts
- over-specified internal topology for creation tasks
- judge checklists that just restate implementation details
- deterministic validation that encodes one preferred solution
- fixtures that are so minimal or brittle that they create false negatives
## Before adding a case
Ask:
1. Would a real user plausibly write this prompt?
2. If the model solves it in a different valid way, would the case still pass?
3. Are the hard deterministic checks only catching objectively broken output?
4. Does the `judgeChecklist` describe the real success criteria?
5. If this case fails, will the reason be understandable from the saved artifacts?


@@ -1 +0,0 @@
@AGENTS.md


@@ -1,176 +0,0 @@
# AI Evals
Small benchmark runner for the four Windmill AI generation modes:
- `cli`
- `flow`
- `script`
- `app`
The benchmark always tests the current production prompts, tools, and guidance in this checkout.
Each attempt runs:
1. the real production path
2. deterministic validation
3. LLM judging
## Install
```bash
cd ai_evals
bun install
```
Frontend modes also require frontend dependencies:
```bash
cd frontend
bun install
```
## Commands
List model aliases:
```bash
cd ai_evals
bun run cli -- models
```
List cases:
```bash
cd ai_evals
bun run cli -- cases
bun run cli -- cases flow
```
Run benchmarks:
```bash
cd ai_evals
bun run cli -- run flow
bun run cli -- run flow flow-test4-order-processing-loop --model opus
bun run cli -- run flow flow-test0-sum-two-numbers --models haiku,opus,4o
bun run cli -- run flow flow-test0-sum-two-numbers --runs 3 --verbose
bun run cli -- run flow --record
bun run cli -- run cli bun-hello-script
```
Public CLI surface:
- `models`
- `cases [mode]`
- `run <mode> [caseIds...]`
`run` options:
- `--runs <n>`: repeat each case `n` times
- `--output <path>`: custom result JSON path
- `--model <alias>`: choose the model under test
- `--models <a,b,c>`: run the same cases sequentially against several model aliases
- `--verbose`: stream assistant output for frontend runs
- `--record`: append a compact tracked summary line to `ai_evals/history/<mode>.jsonl` for full-suite runs only
## Models
Use `bun run cli -- models` to see the current aliases.
Today:
- `haiku`
- `sonnet`
- `opus`
- `4o`
- `gemini-flash`
- `gemini-pro`
- `gemini-3-flash-preview`
- `gemini-3.1-pro-preview`
Notes:
- the command also prints accepted alias spellings such as `gpt-4o`, `claude-opus-4.6`, and `claude-haiku-4.5`
- frontend modes (`flow`, `script`, `app`) can use Anthropic, OpenAI, and Gemini-backed aliases
- `cli` mode always uses the Anthropic agent SDK, so only Anthropic aliases are valid there
- the judge model is separate and currently defaults to `claude-sonnet-4-6`
## Case Format
Cases live in one YAML file per mode under `ai_evals/cases/`.
Minimal shape:
```yaml
- id: flow-test0-sum-two-numbers
prompt: |-
Create a flow that takes two numbers, `a` and `b`, and returns their sum.
initial: ai_evals/fixtures/...
expected: ai_evals/fixtures/...
```
Optional fields:
- `initial`: starting state fixture
- `expected`: expected artifact fixture
- `validate`: extra deterministic validation rules
For `flow` mode, `validate` can express requirements such as:
- accepted input schema shapes
- required `results.*` reference validity
- required module/code/input characteristics
For `flow` mode, an `initial` fixture can also include a benchmark workspace catalog of
existing scripts and flows. That lets the real `search_workspace` and
`get_runnable_details` tools discover reusable workspace runnables during evals.
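For illustration, a parsed case entry could be modelled roughly as below. This is a sketch under assumptions: the `EvalCase` interface and the `isEvalCase` guard are hypothetical names, the real loader in `core/` may use a different shape, and placing `judgeChecklist` on the case entry itself is assumed rather than confirmed here.

```typescript
// Hypothetical shape of one entry in a per-mode YAML case file.
interface EvalCase {
  id: string;
  prompt: string;
  initial?: string; // path to a starting-state fixture
  expected?: string; // path to an expected-artifact fixture
  validate?: unknown; // extra deterministic validation rules (shape not modelled here)
  judgeChecklist?: string[]; // assumed location of the semantic checklist
}

// Narrow runtime guard: checks only the fields described above.
function isEvalCase(value: unknown): value is EvalCase {
  if (typeof value !== "object" || value === null) {
    return false;
  }
  const entry = value as Record<string, unknown>;
  if (typeof entry.id !== "string" || entry.id.length === 0) {
    return false;
  }
  if (typeof entry.prompt !== "string" || entry.prompt.length === 0) {
    return false;
  }
  for (const key of ["initial", "expected"]) {
    if (entry[key] !== undefined && typeof entry[key] !== "string") {
      return false;
    }
  }
  const checklist = entry.judgeChecklist;
  if (checklist !== undefined) {
    if (!Array.isArray(checklist) || checklist.some((item) => typeof item !== "string")) {
      return false;
    }
  }
  return true;
}
```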
## Results And Artifacts
Every run writes:
- a summary JSON under `ai_evals/results/`
- generated artifacts in a sibling directory
If `--record` is used, the CLI also appends one compact JSON line to:
- `ai_evals/history/flow.jsonl`
- `ai_evals/history/script.jsonl`
- `ai_evals/history/app.jsonl`
- `ai_evals/history/cli.jsonl`
Each recorded line contains:
- run metadata (`createdAt`, `gitSha`, `mode`, `runModel`, `judgeModel`)
- suite totals (`caseCount`, `attemptCount`, `passedAttempts`, `passRate`, `averageDurationMs`, `averageJudgeScore`)
- average token usage (`averageTokenUsagePerAttempt`)
- per-case metrics under `cases[]` (`averageDurationMs`, `averageJudgeScore`, `averageTokenUsagePerAttempt`, pass rate)
- `failedCaseIds`
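As a rough sketch of how the suite totals relate to individual attempts, the function below derives them from a flat list of attempt results. The field names in `SuiteTotals` come from the list above; the `AttemptResult` input shape and the `summarizeAttempts` helper are hypothetical, and treating a case as failed when any of its attempts fails is an assumption, not a statement about the real runner.

```typescript
// Hypothetical per-attempt input; the real result type may differ.
interface AttemptResult {
  caseId: string;
  passed: boolean;
  durationMs: number;
}

interface SuiteTotals {
  caseCount: number;
  attemptCount: number;
  passedAttempts: number;
  passRate: number;
  averageDurationMs: number;
  failedCaseIds: string[];
}

function summarizeAttempts(attempts: AttemptResult[]): SuiteTotals {
  const caseIds: string[] = [];
  const failedCaseIds: string[] = [];
  let passedAttempts = 0;
  let totalDurationMs = 0;

  for (const attempt of attempts) {
    if (caseIds.indexOf(attempt.caseId) === -1) {
      caseIds.push(attempt.caseId);
    }
    if (attempt.passed) {
      passedAttempts += 1;
    } else if (failedCaseIds.indexOf(attempt.caseId) === -1) {
      // Assumption: one failed attempt is enough to list the case as failed.
      failedCaseIds.push(attempt.caseId);
    }
    totalDurationMs += attempt.durationMs;
  }

  const attemptCount = attempts.length;
  return {
    caseCount: caseIds.length,
    attemptCount,
    passedAttempts,
    // Guard against division by zero for an empty run.
    passRate: attemptCount === 0 ? 0 : passedAttempts / attemptCount,
    averageDurationMs: attemptCount === 0 ? 0 : totalDurationMs / attemptCount,
    failedCaseIds,
  };
}
```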
Example:
- summary: `ai_evals/results/2026-04-09T09-40-33.051Z__flow.json`
- artifacts: `ai_evals/results/2026-04-09T09-40-33.051Z__flow/`
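The example names look like an ISO timestamp with colons replaced by hyphens, followed by `__<mode>`. The sketch below reproduces that pattern; it is inferred from the example paths only, and `buildResultPaths` is a hypothetical helper, not the function the runner actually uses.

```typescript
// Hypothetical helper that mirrors the naming pattern seen in the example paths.
function buildResultPaths(
  createdAt: Date,
  mode: string
): { summary: string; artifacts: string } {
  // toISOString() yields e.g. "2026-04-09T09:40:33.051Z"; colons are swapped
  // for hyphens so the name is safe on filesystems that reject ":".
  const stamp = createdAt.toISOString().replace(/:/g, "-");
  const base = `ai_evals/results/${stamp}__${mode}`;
  return {
    summary: `${base}.json`,
    artifacts: `${base}/`,
  };
}
```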
Typical artifacts by mode:
- `flow`: `flow.json`
- `script`: `script.json` plus the generated script file
- `app`: `app.json` plus frontend/backend files
- `cli`: `assistant-output.txt` plus generated workspace files
## Layout
- `cases/`: one YAML file per mode
- `fixtures/`: initial and expected fixtures
- `core/`: shared loading, model resolution, validation, judging, and result writing
- `modes/`: one runner per mode
- `history/`: optional tracked pass-rate history written by `run --record`, one JSONL file per mode
- `results/`: local benchmark output and artifacts
## Notes
- Frontend modes reuse the production frontend chat code through the Vitest bridge.
- CLI mode creates an isolated workspace, writes the current checkout guidance into it, and benchmarks the real skills / `AGENTS.md` flow.
- Frontend progress streams live while the benchmark is running.
- Deterministic validators should stay focused on real correctness constraints, not one exact implementation shape.


@@ -1,72 +0,0 @@
import { describe, expect, it } from "bun:test";
import {
anthropicUsageToBenchmarkTokenUsage,
extractCliResultTokenUsage,
} from "./runtime";
describe("anthropicUsageToBenchmarkTokenUsage", () => {
it("includes cache tokens in prompt usage", () => {
expect(
anthropicUsageToBenchmarkTokenUsage({
input_tokens: 120,
output_tokens: 45,
cache_creation_input_tokens: 30,
cache_read_input_tokens: 5,
})
).toEqual({
prompt: 155,
completion: 45,
total: 200,
});
});
it("returns null when usage is absent", () => {
expect(anthropicUsageToBenchmarkTokenUsage(null)).toBeNull();
});
});
describe("extractCliResultTokenUsage", () => {
it("reads aggregate usage from the SDK result event", () => {
expect(
extractCliResultTokenUsage({
type: "result",
usage: {
input_tokens: 400,
output_tokens: 120,
cache_creation_input_tokens: 50,
cache_read_input_tokens: 25,
},
})
).toEqual({
prompt: 475,
completion: 120,
total: 595,
});
});
it("falls back to modelUsage when aggregate usage is unavailable", () => {
expect(
extractCliResultTokenUsage({
type: "result",
modelUsage: {
opus: {
inputTokens: 200,
outputTokens: 60,
cacheCreationInputTokens: 10,
cacheReadInputTokens: 5,
},
haiku: {
inputTokens: 80,
outputTokens: 20,
cacheCreationInputTokens: 0,
cacheReadInputTokens: 15,
},
},
})
).toEqual({
prompt: 310,
completion: 80,
total: 390,
});
});
});


@@ -1,199 +0,0 @@
import { query, type Options } from "@anthropic-ai/claude-agent-sdk";
import { join } from "path";
import { fileURLToPath } from "url";
import { getCliEvalModel, resolveEvalModel, type CliEvalModelConfig } from "../../core/models";
import type { BenchmarkTokenUsage } from "../../core/types";
export interface ToolInvocation {
tool: string;
input: Record<string, unknown>;
timestamp: number;
}
export interface PromptRunResult {
toolsUsed: ToolInvocation[];
skillsInvoked: string[];
output: string;
durationMs: number;
assistantMessageCount: number;
tokenUsage: BenchmarkTokenUsage | null;
}
interface AnthropicUsageLike {
input_tokens?: number | null;
output_tokens?: number | null;
cache_creation_input_tokens?: number | null;
cache_read_input_tokens?: number | null;
}
interface AnthropicModelUsageLike {
inputTokens?: number | null;
outputTokens?: number | null;
cacheCreationInputTokens?: number | null;
cacheReadInputTokens?: number | null;
}
interface CliResultMessageLike {
type?: string;
usage?: AnthropicUsageLike | null;
modelUsage?: Record<string, AnthropicModelUsageLike> | null;
}
const REPO_ROOT = fileURLToPath(new URL("../../../", import.meta.url));
export const DEFAULT_CLI_EVAL_MODEL: CliEvalModelConfig = getCliEvalModel(resolveEvalModel("cli"));
export function getGeneratedSkillsSource(): string {
return join(REPO_ROOT, "system_prompts", "auto-generated", "skills");
}
export function anthropicUsageToBenchmarkTokenUsage(
usage: AnthropicUsageLike | null | undefined
): BenchmarkTokenUsage | null {
if (!usage) {
return null;
}
const prompt =
(usage.input_tokens ?? 0) +
(usage.cache_creation_input_tokens ?? 0) +
(usage.cache_read_input_tokens ?? 0);
const completion = usage.output_tokens ?? 0;
return {
prompt,
completion,
total: prompt + completion,
};
}
export function extractCliResultTokenUsage(message: unknown): BenchmarkTokenUsage | null {
if (!message || typeof message !== "object") {
return null;
}
const resultMessage = message as CliResultMessageLike;
if (resultMessage.type !== "result") {
return null;
}
const usage = anthropicUsageToBenchmarkTokenUsage(resultMessage.usage);
if (usage) {
return usage;
}
if (!resultMessage.modelUsage || typeof resultMessage.modelUsage !== "object") {
return null;
}
let prompt = 0;
let completion = 0;
let sawModelUsage = false;
for (const modelUsage of Object.values(resultMessage.modelUsage)) {
if (!modelUsage || typeof modelUsage !== "object") {
continue;
}
prompt +=
(modelUsage.inputTokens ?? 0) +
(modelUsage.cacheCreationInputTokens ?? 0) +
(modelUsage.cacheReadInputTokens ?? 0);
completion += modelUsage.outputTokens ?? 0;
sawModelUsage = true;
}
if (!sawModelUsage) {
return null;
}
return {
prompt,
completion,
total: prompt + completion,
};
}
export async function runPromptAndCapture(
prompt: string,
cwd: string,
maxTurns: number = 3,
modelConfig: CliEvalModelConfig = DEFAULT_CLI_EVAL_MODEL
): Promise<PromptRunResult> {
const toolsUsed: ToolInvocation[] = [];
const skillsInvoked: string[] = [];
let output = "";
let assistantMessageCount = 0;
let tokenUsage: BenchmarkTokenUsage | null = null;
const startedAt = Date.now();
const options: Options = {
cwd,
model: modelConfig.model,
maxTurns,
settingSources: ["project"],
allowedTools: ["Skill", "Read", "Glob", "Grep", "Bash", "Write", "Edit"]
};
for await (const message of query({ prompt, options })) {
if (message.type === "assistant") {
assistantMessageCount += 1;
const content = message.message?.content;
if (Array.isArray(content)) {
for (const block of content) {
if (block.type === "tool_use") {
toolsUsed.push({
tool: block.name,
input: block.input as Record<string, unknown>,
timestamp: Date.now()
});
if (block.name === "Skill" && typeof block.input === "object" && block.input !== null) {
const skillInput = block.input as { skill?: string };
if (skillInput.skill) {
skillsInvoked.push(skillInput.skill);
}
}
} else if (block.type === "text") {
output += block.text;
}
}
}
} else if (message.type === "result") {
const resultMessage = message as { result?: string };
tokenUsage = extractCliResultTokenUsage(message) ?? tokenUsage;
if (typeof resultMessage.result === "string") {
output += resultMessage.result;
}
}
}
return {
toolsUsed,
skillsInvoked,
output,
durationMs: Date.now() - startedAt,
assistantMessageCount,
tokenUsage,
};
}
export function wasSkillInvoked(result: PromptRunResult, skillName: string): boolean {
return result.skillsInvoked.some((skill) => skill === skillName || skill.includes(skillName));
}
export function wasToolUsed(result: PromptRunResult, toolName: string): boolean {
return result.toolsUsed.some((tool) => tool.tool === toolName);
}
export function formatCliRunModelLabel(modelConfig: CliEvalModelConfig): string {
return `${modelConfig.provider}:${modelConfig.model}`;
}
export function getToolInputs(
result: PromptRunResult,
toolName: string
): Record<string, unknown>[] {
return result.toolsUsed
.filter((tool) => tool.tool === toolName)
.map((tool) => tool.input);
}


@@ -1,87 +0,0 @@
import { loadSelectedCases } from "../../core/cases";
import {
formatRunModelLabel,
getFrontendEvalModel,
resolveEvalModel,
} from "../../core/models";
import { buildRunResult } from "../../core/results";
import { runSuite } from "../../core/runSuite";
import type { BenchmarkRunResult, ModeRunner } from "../../core/types";
import { emitFrontendBenchmarkProgress } from "./progress";
import { createAppModeRunner } from "../../modes/app";
import { createFlowModeRunner } from "../../modes/flow";
import { createScriptModeRunner } from "../../modes/script";
import { DEFAULT_JUDGE_MODEL } from "../../core/judge";
export type FrontendBenchmarkMode = "flow" | "app" | "script";
export async function runFrontendBenchmarkFromEnv(): Promise<BenchmarkRunResult> {
const mode = parseMode(process.env.WMILL_FRONTEND_AI_EVAL_MODE);
const caseIds = parseOptionalJsonStringArray(process.env.WMILL_FRONTEND_AI_EVAL_CASE_IDS);
const runs = parsePositiveInteger(process.env.WMILL_FRONTEND_AI_EVAL_RUNS, "WMILL_FRONTEND_AI_EVAL_RUNS");
const emitProgress = process.env.WMILL_FRONTEND_AI_EVAL_PROGRESS === "1";
const verbose = process.env.WMILL_FRONTEND_AI_EVAL_VERBOSE === "1";
const model = resolveEvalModel(mode, process.env.WMILL_FRONTEND_AI_EVAL_MODEL);
const selectedCases = await loadSelectedCases(mode, caseIds);
const modeRunner = getModeRunner(mode, getFrontendEvalModel(model));
const runModel = formatRunModelLabel(mode, model);
const caseResults = await runSuite({
modeRunner,
cases: selectedCases,
runs,
runModel,
judgeModel: DEFAULT_JUDGE_MODEL,
concurrency: verbose ? 1 : undefined,
verbose,
onProgress: emitProgress ? (event) => emitFrontendBenchmarkProgress(event) : undefined,
});
return buildRunResult({
mode,
runs,
runModel,
judgeModel: DEFAULT_JUDGE_MODEL,
caseResults,
});
}
function getModeRunner(
mode: FrontendBenchmarkMode,
model: ReturnType<typeof getFrontendEvalModel>
): ModeRunner<any, any, any> {
switch (mode) {
case "flow":
return createFlowModeRunner(model);
case "app":
return createAppModeRunner(model);
case "script":
return createScriptModeRunner(model);
}
}
function parseMode(value: string | undefined): FrontendBenchmarkMode {
if (value === "flow" || value === "app" || value === "script") {
return value;
}
throw new Error(`Unsupported frontend benchmark mode: ${String(value)}`);
}
function parseOptionalJsonStringArray(value: string | undefined): string[] {
if (!value) {
return [];
}
const parsed = JSON.parse(value) as unknown;
if (!Array.isArray(parsed) || parsed.some((entry) => typeof entry !== "string")) {
throw new Error("WMILL_FRONTEND_AI_EVAL_CASE_IDS must be a JSON string array");
}
return parsed;
}
function parsePositiveInteger(value: string | undefined, envName: string): number {
const parsed = Number(value);
if (!Number.isInteger(parsed) || parsed <= 0) {
throw new Error(`${envName} must be a positive integer`);
}
return parsed;
}


@@ -1,92 +0,0 @@
import { mkdtemp } from 'fs/promises'
import { tmpdir } from 'os'
import { join } from 'path'
import type {
AppFiles,
BackendRunnable,
AppAIChatHelpers
} from '../../../../../frontend/src/lib/components/copilot/chat/app/core'
import {
getAppTools,
prepareAppSystemMessage,
prepareAppUserMessage
} from '../../../../../frontend/src/lib/components/copilot/chat/app/core'
import type { Tool as ProductionTool } from '../../../../../frontend/src/lib/components/copilot/chat/shared'
import { createAppFileHelpers } from './fileHelpers'
import { runEval } from '../shared'
import type { AIProvider } from '$lib/gen/types.gen'
import type { ModeRunContext } from '../../../../core/types'
import type { TokenUsage } from '../shared/types'
export interface AppEvalResult {
success: boolean
files: AppFiles
error?: string
assistantMessageCount: number
toolCallCount: number
toolsUsed: string[]
tokenUsage: TokenUsage
}
export interface AppEvalOptions {
initialFrontend?: Record<string, string>
initialBackend?: Record<string, BackendRunnable>
model?: string
maxIterations?: number
provider?: AIProvider
workspaceRoot?: string
runContext?: ModeRunContext
}
export async function runAppEval(
userPrompt: string,
apiKey: string,
options?: AppEvalOptions
): Promise<AppEvalResult> {
const workspaceRoot =
options?.workspaceRoot ??
(await mkdtemp(join(tmpdir(), 'wmill-frontend-app-benchmark-')))
const { helpers, getFiles, cleanup } = await createAppFileHelpers(
options?.initialFrontend ?? {},
options?.initialBackend ?? {},
workspaceRoot
)
try {
const systemMessage = prepareAppSystemMessage()
const tools = getAppTools() as ProductionTool<AppAIChatHelpers>[]
const model = options?.model ?? 'claude-haiku-4-5-20251001'
const userMessage = prepareAppUserMessage(userPrompt, helpers.getSelectedContext())
const rawResult = await runEval({
userPrompt,
systemMessage,
userMessage,
tools,
helpers,
apiKey,
getOutput: getFiles,
onAssistantMessageStart: options?.runContext?.onAssistantMessageStart,
onAssistantToken: options?.runContext?.onAssistantChunk,
onAssistantMessageEnd: options?.runContext?.onAssistantMessageEnd,
options: {
maxIterations: options?.maxIterations,
model,
workspace: workspaceRoot,
provider: options?.provider
}
})
return {
files: rawResult.output,
success: rawResult.success,
error: rawResult.error,
assistantMessageCount: rawResult.iterations,
toolCallCount: rawResult.toolCallsCount,
toolsUsed: rawResult.toolsCalled,
tokenUsage: rawResult.tokenUsage
}
} finally {
await cleanup()
}
}


@@ -1,255 +0,0 @@
import { mkdir, rm, writeFile } from 'fs/promises'
import { dirname, join } from 'path'
import type {
AppAIChatHelpers,
AppFiles,
BackendRunnable,
DataTableSchema,
LintResult,
SelectedContext
} from '../../../../../frontend/src/lib/components/copilot/chat/app/core'
function createEmptyLintResult(): LintResult {
return {
errorCount: 0,
warningCount: 0,
errors: { frontend: {}, backend: {} },
warnings: { frontend: {}, backend: {} }
}
}
async function writeFrontendFile(
workspaceRoot: string | undefined,
path: string,
content: string
): Promise<void> {
if (!workspaceRoot) {
return
}
const relativePath = path.startsWith('/') ? path.slice(1) : path
const fullPath = join(workspaceRoot, 'frontend', relativePath)
await mkdir(dirname(fullPath), { recursive: true })
await writeFile(fullPath, content, 'utf8')
}
async function removeFrontendFile(workspaceRoot: string | undefined, path: string): Promise<void> {
if (!workspaceRoot) {
return
}
const relativePath = path.startsWith('/') ? path.slice(1) : path
await rm(join(workspaceRoot, 'frontend', relativePath), { force: true })
}
async function writeBackendRunnable(
workspaceRoot: string | undefined,
key: string,
runnable: BackendRunnable
): Promise<void> {
if (!workspaceRoot) {
return
}
const runnableDir = join(workspaceRoot, 'backend', key)
await mkdir(runnableDir, { recursive: true })
const meta: { name: string; language?: string; type?: string; path?: string } = {
name: runnable.name
}
if (runnable.type === 'inline' && runnable.inlineScript) {
meta.language = runnable.inlineScript.language
const extension = runnable.inlineScript.language === 'python3' ? 'py' : 'ts'
await writeFile(
join(runnableDir, `main.${extension}`),
runnable.inlineScript.content,
'utf8'
)
} else {
meta.type = runnable.type
if (runnable.path) {
meta.path = runnable.path
}
}
await writeFile(join(runnableDir, 'meta.json'), JSON.stringify(meta, null, 2) + '\n', 'utf8')
}
async function removeBackendRunnable(workspaceRoot: string | undefined, key: string): Promise<void> {
if (!workspaceRoot) {
return
}
await rm(join(workspaceRoot, 'backend', key), { recursive: true, force: true })
}
async function persistDatatables(
workspaceRoot: string | undefined,
datatables: DataTableSchema[]
): Promise<void> {
if (!workspaceRoot) {
return
}
await writeFile(
join(workspaceRoot, 'datatables.json'),
JSON.stringify(datatables, null, 2) + '\n',
'utf8'
)
}
export async function createAppFileHelpers(
initialFrontend: Record<string, string> = {},
initialBackend: Record<string, BackendRunnable> = {},
workspaceRoot?: string
): Promise<{
helpers: AppAIChatHelpers
getFiles: () => AppFiles
getFrontend: () => Record<string, string>
getBackend: () => Record<string, BackendRunnable>
cleanup: () => Promise<void>
workspaceDir: string | null
}> {
let frontend = { ...initialFrontend }
let backend = { ...initialBackend }
let snapshotId = 0
const snapshots = new Map<
number,
{ frontend: Record<string, string>; backend: Record<string, BackendRunnable> }
>()
const datatables: DataTableSchema[] = []
for (const [path, content] of Object.entries(frontend)) {
await writeFrontendFile(workspaceRoot, path, content)
}
for (const [key, runnable] of Object.entries(backend)) {
await writeBackendRunnable(workspaceRoot, key, runnable)
}
await persistDatatables(workspaceRoot, datatables)
const helpers: AppAIChatHelpers = {
listFrontendFiles: () => Object.keys(frontend),
getFrontendFile: (path: string) => frontend[path],
getFrontendFiles: () => ({ ...frontend }),
setFrontendFile: (path: string, content: string) => {
frontend[path] = content
void writeFrontendFile(workspaceRoot, path, content)
return createEmptyLintResult()
},
deleteFrontendFile: (path: string) => {
delete frontend[path]
void removeFrontendFile(workspaceRoot, path)
},
listBackendRunnables: () =>
Object.entries(backend).map(([key, runnable]) => ({
key,
name: runnable.name
})),
getBackendRunnable: (key: string) => backend[key],
getBackendRunnables: () => ({ ...backend }),
setBackendRunnable: async (key: string, runnable: BackendRunnable) => {
backend[key] = runnable
await writeBackendRunnable(workspaceRoot, key, runnable)
return createEmptyLintResult()
},
deleteBackendRunnable: (key: string) => {
delete backend[key]
void removeBackendRunnable(workspaceRoot, key)
},
getFiles: (): AppFiles => ({
frontend: { ...frontend },
backend: { ...backend }
}),
getSelectedContext: (): SelectedContext => ({ type: 'none' }),
snapshot: () => {
const id = ++snapshotId
snapshots.set(id, {
frontend: { ...frontend },
backend: { ...backend }
})
return id
},
revertToSnapshot: (id: number) => {
const snapshot = snapshots.get(id)
if (!snapshot) {
return
}
frontend = { ...snapshot.frontend }
backend = { ...snapshot.backend }
void syncWorkspace()
},
lint: () => createEmptyLintResult(),
getDatatables: async () => structuredClone(datatables),
getAvailableDatatableNames: () => datatables.map((datatable) => datatable.datatable_name),
execDatatableSql: async (
datatableName: string,
sql: string,
newTable?: { schema: string; name: string }
) => {
if (newTable) {
datatables.push({
datatable_name: datatableName,
schemas: {
[newTable.schema]: {
[newTable.name]: {}
}
}
})
await persistDatatables(workspaceRoot, datatables)
}
return {
success: true,
result: [
{
datatableName,
sql
}
]
}
},
addTableToWhitelist: (datatableName: string, schemaName: string, tableName: string) => {
const existing = datatables.find((entry) => entry.datatable_name === datatableName)
if (existing) {
existing.schemas[schemaName] ??= {}
existing.schemas[schemaName][tableName] ??= {}
} else {
datatables.push({
datatable_name: datatableName,
schemas: {
[schemaName]: {
[tableName]: {}
}
}
})
}
void persistDatatables(workspaceRoot, datatables)
}
}
async function syncWorkspace(): Promise<void> {
if (!workspaceRoot) {
return
}
await rm(join(workspaceRoot, 'frontend'), { recursive: true, force: true })
await rm(join(workspaceRoot, 'backend'), { recursive: true, force: true })
for (const [path, content] of Object.entries(frontend)) {
await writeFrontendFile(workspaceRoot, path, content)
}
for (const [key, runnable] of Object.entries(backend)) {
await writeBackendRunnable(workspaceRoot, key, runnable)
}
await persistDatatables(workspaceRoot, datatables)
}
return {
helpers,
getFiles: () => ({
frontend: { ...frontend },
backend: { ...backend }
}),
getFrontend: () => ({ ...frontend }),
getBackend: () => ({ ...backend }),
cleanup: async () => {
if (workspaceRoot) {
await rm(workspaceRoot, { recursive: true, force: true })
}
},
workspaceDir: workspaceRoot ?? null
}
}
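
The `snapshot` and `revertToSnapshot` helpers above store shallow copies of the in-memory file maps under an incrementing id. Below is a minimal sketch of that pattern in isolation, assuming plain string values. `createSnapshotStore` and the `Files` type are hypothetical names used only for illustration, and the disk sync done by `syncWorkspace` is left out.

```typescript
// Hypothetical, self-contained sketch of id-keyed snapshots over a string map.
type Files = Record<string, string>

function createSnapshotStore(initial: Files) {
  let files: Files = { ...initial }
  let nextId = 0
  const snapshots = new Map<number, Files>()

  return {
    get: (): Files => ({ ...files }),
    set: (path: string, content: string): void => {
      files[path] = content
    },
    // Store a shallow copy so later edits do not leak into the snapshot.
    snapshot: (): number => {
      const id = ++nextId
      snapshots.set(id, { ...files })
      return id
    },
    // Unknown ids are ignored, similar to the early return in the helper above.
    revert: (id: number): boolean => {
      const saved = snapshots.get(id)
      if (!saved) return false
      files = { ...saved }
      return true
    }
  }
}

const store = createSnapshotStore({ 'index.tsx': 'v1' })
const snapshotHandle = store.snapshot()
store.set('index.tsx', 'v2')
store.revert(snapshotHandle)
console.log(store.get()['index.tsx']) // prints "v1"
```

A shallow copy is enough here only because the values are strings. Nested objects such as backend runnables would still be shared between the live state and the snapshot.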

View File

@@ -1,161 +0,0 @@
import { mkdir, rm, writeFile } from 'fs/promises'
import { dirname, join } from 'path'
import type { FlowModule, InputTransform } from '../../../../../frontend/src/lib/gen'
import type { ExtendedOpenFlow } from '../../../../../frontend/src/lib/components/flows/types'
import type { FlowAIChatHelpers } from '../../../../../frontend/src/lib/components/copilot/chat/flow/core'
import type { ScriptLintResult } from '../../../../../frontend/src/lib/components/copilot/chat/shared'
import { findModuleById } from '../../../../../frontend/src/lib/components/copilot/chat/shared'
import { createInlineScriptSession } from '../../../../../frontend/src/lib/components/copilot/chat/flow/inlineScriptsUtils'
import {
registerBenchmarkWorkspace,
registerBenchmarkWorkspaceRunnables,
unregisterBenchmarkWorkspaceRunnables,
createBenchmarkCompletedJob,
type BenchmarkWorkspaceFlow,
type BenchmarkWorkspaceScript
} from '../../mockBackend'
const EMPTY_SCRIPT_LINT_RESULT: ScriptLintResult = {
errorCount: 0,
warningCount: 0,
errors: [],
warnings: []
}
export interface FlowWorkspaceFixtures {
scripts?: BenchmarkWorkspaceScript[]
flows?: BenchmarkWorkspaceFlow[]
}
export async function createFlowFileHelpers(
initialModules: FlowModule[] = [],
initialSchema?: Record<string, any>,
workspaceRoot?: string,
workspaceFixtures?: FlowWorkspaceFixtures
): Promise<{
helpers: FlowAIChatHelpers
getFlow: () => ExtendedOpenFlow
getModules: () => FlowModule[]
cleanup: () => Promise<void>
workspaceDir: string | null
}> {
let flow: ExtendedOpenFlow = {
value: { modules: structuredClone(initialModules) },
summary: '',
schema: initialSchema ?? {
$schema: 'https://json-schema.org/draft/2020-12/schema',
properties: {},
required: [],
type: 'object'
}
}
const inlineScriptSession = createInlineScriptSession()
const flowFilePath = workspaceRoot ? join(workspaceRoot, 'flow.json') : null
async function persistFlow(): Promise<void> {
if (!flowFilePath) {
return
}
await mkdir(dirname(flowFilePath), { recursive: true })
await writeFile(flowFilePath, JSON.stringify(flow, null, 2) + '\n', 'utf8')
}
await persistFlow()
if (workspaceRoot) {
registerBenchmarkWorkspace(workspaceRoot)
if (workspaceFixtures) {
registerBenchmarkWorkspaceRunnables(workspaceRoot, workspaceFixtures)
}
}
const helpers: FlowAIChatHelpers = {
getFlowAndSelectedId: () => ({ flow, selectedId: '' }),
getModules: (id?: string) => {
if (!id) return flow.value.modules
const module = findModuleById(flow.value.modules, id)
return module ? [module] : []
},
inlineScriptSession,
setSnapshot: () => {},
revertToSnapshot: () => {},
setCode: async (id: string, code: string) => {
const module = findModuleById(flow.value.modules, id)
if (module && module.value.type === 'rawscript') {
module.value.content = code
}
inlineScriptSession.set(id, code)
await persistFlow()
},
setFlowJson: async (
modules: FlowModule[] | undefined,
schema: Record<string, any> | undefined
) => {
if (modules) {
flow.value.modules = inlineScriptSession.restoreInlineScriptReferences(modules)
const unresolvedRefs = inlineScriptSession.findUnresolvedInlineScriptRefs(flow.value.modules)
if (unresolvedRefs.length > 0) {
throw new Error(
`Unresolved inline script references: ${unresolvedRefs.join(', ')}`
)
}
}
if (schema !== undefined) {
flow.schema = schema
}
await persistFlow()
},
getFlowInputsSchema: async () => flow.schema ?? {},
updateExprsToSet: (_id: string, _inputTransforms: Record<string, InputTransform>) => {},
acceptAllModuleActions: () => {},
rejectAllModuleActions: () => {},
hasPendingChanges: () => false,
selectStep: (_id: string) => {},
testFlow: async (args?: Record<string, any>) => {
if (workspaceRoot) {
const runPath = join(workspaceRoot, 'test-run.json')
await writeFile(
runPath,
JSON.stringify(
{
requestedArgs: args ?? {},
modules: flow.value.modules.map((module) => module.id)
},
null,
2
) + '\n',
'utf8'
)
}
return createBenchmarkCompletedJob({
workspace: workspaceRoot ?? 'benchmark',
jobKind: 'flowpreview',
result: {
requestedArgs: args ?? {},
modules: flow.value.modules.map((module) => module.id),
mocked: true
},
logs: 'Mock benchmark flow test run completed successfully.'
})
},
getLintErrors: async () => EMPTY_SCRIPT_LINT_RESULT
}
return {
helpers,
getFlow: () => flow,
getModules: () => flow.value.modules,
cleanup: async () => {
if (workspaceRoot) {
unregisterBenchmarkWorkspaceRunnables(workspaceRoot)
await rm(workspaceRoot, { recursive: true, force: true })
}
},
workspaceDir: workspaceRoot ?? null
}
}

View File

@@ -1,103 +0,0 @@
import { mkdtemp } from 'fs/promises'
import { tmpdir } from 'os'
import { join } from 'path'
import type { FlowModule } from '$lib/gen'
import type { AIProvider } from '$lib/gen/types.gen'
import type { ExtendedOpenFlow } from '$lib/components/flows/types'
import {
flowTools,
prepareFlowSystemMessage,
prepareFlowUserMessage,
type FlowAIChatHelpers
} from '../../../../../frontend/src/lib/components/copilot/chat/flow/core'
import type { Tool as ProductionTool } from '../../../../../frontend/src/lib/components/copilot/chat/shared'
import { createFlowFileHelpers, type FlowWorkspaceFixtures } from './fileHelpers'
import { runEval } from '../shared'
import type { ModeRunContext } from '../../../../core/types'
import type { TokenUsage } from '../shared/types'
export interface FlowFixture {
value?: {
modules?: FlowModule[]
}
schema?: Record<string, unknown>
}
export interface FlowEvalResult {
success: boolean
flow: ExtendedOpenFlow
error?: string
assistantMessageCount: number
toolCallCount: number
toolsUsed: string[]
tokenUsage: TokenUsage
}
export interface FlowEvalOptions {
initialFlow?: FlowFixture
workspaceFixtures?: FlowWorkspaceFixtures
model?: string
maxIterations?: number
provider?: AIProvider
workspaceRoot?: string
runContext?: ModeRunContext
}
export async function runFlowEval(
userPrompt: string,
apiKey: string,
options?: FlowEvalOptions
): Promise<FlowEvalResult> {
const workspaceRoot =
options?.workspaceRoot ??
(await mkdtemp(join(tmpdir(), 'wmill-frontend-flow-benchmark-')))
const { helpers, getFlow, cleanup } = await createFlowFileHelpers(
options?.initialFlow?.value?.modules ?? [],
options?.initialFlow?.schema,
workspaceRoot,
options?.workspaceFixtures
)
try {
const systemMessage = prepareFlowSystemMessage()
const tools = flowTools as ProductionTool<FlowAIChatHelpers>[]
const model = options?.model ?? 'claude-haiku-4-5-20251001'
const userMessage = prepareFlowUserMessage(
userPrompt,
helpers.getFlowAndSelectedId(),
[],
helpers.inlineScriptSession
)
const rawResult = await runEval({
userPrompt,
systemMessage,
userMessage,
tools,
helpers,
apiKey,
getOutput: getFlow,
onAssistantMessageStart: options?.runContext?.onAssistantMessageStart,
onAssistantToken: options?.runContext?.onAssistantChunk,
onAssistantMessageEnd: options?.runContext?.onAssistantMessageEnd,
options: {
maxIterations: options?.maxIterations,
model,
workspace: workspaceRoot,
provider: options?.provider
}
})
return {
flow: rawResult.output,
success: rawResult.success,
error: rawResult.error,
assistantMessageCount: rawResult.iterations,
toolCallCount: rawResult.toolCallsCount,
toolsUsed: rawResult.toolsCalled,
tokenUsage: rawResult.tokenUsage
}
} finally {
await cleanup()
}
}

View File

@@ -1,73 +0,0 @@
import { mkdir, rm, writeFile } from 'fs/promises'
import { dirname, join } from 'path'
import type { ScriptLang } from '../../../../../frontend/src/lib/gen/types.gen'
import type { ReviewChangesOpts } from '../../../../../frontend/src/lib/components/copilot/chat/monaco-adapter'
import type { ScriptChatHelpers } from '../../../../../frontend/src/lib/components/copilot/chat/script/core'
import { buildScriptLintResult } from './preview'
import { registerBenchmarkWorkspace, unregisterBenchmarkWorkspace } from '../../mockBackend'
export interface ScriptEvalState {
code: string
lang: ScriptLang | 'bunnative'
path: string
args: Record<string, any>
}
export async function createScriptFileHelpers(
initialScript: ScriptEvalState,
workspaceRoot?: string
): Promise<{
helpers: ScriptChatHelpers
getScript: () => ScriptEvalState
cleanup: () => Promise<void>
workspaceDir: string | null
}> {
let script = structuredClone(initialScript)
const scriptFilePath = workspaceRoot ? join(workspaceRoot, script.path) : null
async function persistScript(): Promise<void> {
if (!scriptFilePath) {
return
}
await mkdir(dirname(scriptFilePath), { recursive: true })
await writeFile(scriptFilePath, script.code, 'utf8')
}
await persistScript()
if (workspaceRoot) {
registerBenchmarkWorkspace(workspaceRoot)
}
const helpers: ScriptChatHelpers = {
getScriptOptions: () => ({
code: script.code,
lang: script.lang,
path: script.path,
args: structuredClone(script.args)
}),
applyCode: async (code: string, opts?: ReviewChangesOpts) => {
if (opts?.mode === 'revert') {
return
}
script = {
...script,
code
}
await persistScript()
},
getLintErrors: () => buildScriptLintResult(script.code, script.lang)
}
return {
helpers,
getScript: () => structuredClone(script),
cleanup: async () => {
if (workspaceRoot) {
unregisterBenchmarkWorkspace(workspaceRoot)
await rm(workspaceRoot, { recursive: true, force: true })
}
},
workspaceDir: workspaceRoot ?? null
}
}

View File

@@ -1,96 +0,0 @@
import ts from 'typescript'
import type { ScriptLang } from '../../../../../frontend/src/lib/gen/types.gen'
import type { ScriptLintResult } from '../../../../../frontend/src/lib/components/copilot/chat/shared'
export type ScriptPreviewLanguage = ScriptLang | 'bunnative'
const TS_LIKE_LANGUAGES = new Set<ScriptPreviewLanguage>(['bun', 'deno', 'nativets', 'bunnative'])
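// NOTE: this set is currently identical to TS_LIKE_LANGUAGES, so every language that
// reaches ts.createSourceFile below is parsed with ts.ScriptKind.TS.
const JS_LIKE_LANGUAGES = new Set<ScriptPreviewLanguage>(['bun', 'deno', 'nativets', 'bunnative'])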
function hasSupportedEntrypoint(code: string): boolean {
return (
/export\s+(async\s+)?function\s+main\s*\(/.test(code) ||
/export\s+(async\s+)?function\s+preprocessor\s*\(/.test(code)
)
}
function compilerOptionsForLanguage(lang: ScriptPreviewLanguage): ts.CompilerOptions | null {
if (!TS_LIKE_LANGUAGES.has(lang)) {
return null
}
return {
target: ts.ScriptTarget.ES2022,
module: ts.ModuleKind.ESNext,
moduleResolution: ts.ModuleResolutionKind.Bundler,
noEmit: true,
allowJs: true,
checkJs: false,
strict: false,
skipLibCheck: true
}
}
function getLineAndColumn(sourceText: string, start: number): { line: number; column: number } {
const prefix = sourceText.slice(0, Math.max(0, start))
const line = prefix.split('\n').length
const lastNewline = prefix.lastIndexOf('\n')
const column = lastNewline === -1 ? prefix.length + 1 : prefix.length - lastNewline
return { line, column }
}
export function buildScriptLintResult(
code: string,
lang: ScriptPreviewLanguage
): ScriptLintResult {
const diagnostics: ScriptLintResult['errors'] = []
const compilerOptions = compilerOptionsForLanguage(lang)
if (compilerOptions) {
const sourceFile = ts.createSourceFile(
'script.ts',
code,
ts.ScriptTarget.ES2022,
true,
JS_LIKE_LANGUAGES.has(lang) ? ts.ScriptKind.TS : ts.ScriptKind.JS
)
const output = ts.transpileModule(code, {
compilerOptions,
fileName: sourceFile.fileName,
reportDiagnostics: true
})
for (const diagnostic of output.diagnostics ?? []) {
const start = diagnostic.start ?? 0
const length = diagnostic.length ?? 1
const { line, column } = getLineAndColumn(code, start)
const message = ts.flattenDiagnosticMessageText(diagnostic.messageText, '\n')
diagnostics.push({
startLineNumber: line,
startColumn: column,
endLineNumber: line,
endColumn: column + Math.max(1, length),
message,
severity: 8
} as ScriptLintResult['errors'][number])
}
}
if (!hasSupportedEntrypoint(code)) {
diagnostics.push({
startLineNumber: 1,
startColumn: 1,
endLineNumber: 1,
endColumn: 1,
message: 'Script must export a main or preprocessor function.',
severity: 8
} as ScriptLintResult['errors'][number])
}
return {
errorCount: diagnostics.length,
warningCount: 0,
errors: diagnostics,
warnings: []
}
}
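
The conversion from a diagnostic's character offset to a 1-based line and column, as done in `getLineAndColumn`, can be checked on a small input. The sketch below is a standalone copy of that logic under the illustrative name `lineAndColumn`. It assumes `\n` line endings, as the original does.

```typescript
// Standalone copy of the offset-to-position logic, for illustration only.
function lineAndColumn(sourceText: string, start: number): { line: number; column: number } {
  // Everything before the offset decides the position.
  const prefix = sourceText.slice(0, Math.max(0, start))
  // Line number = number of newline-separated segments in the prefix.
  const line = prefix.split('\n').length
  const lastNewline = prefix.lastIndexOf('\n')
  // Column is 1-based and counted from the character after the last newline.
  const column = lastNewline === -1 ? prefix.length + 1 : prefix.length - lastNewline
  return { line, column }
}

const sampleSource = 'ab\ncd'
console.log(lineAndColumn(sampleSource, 0)) // { line: 1, column: 1 }
console.log(lineAndColumn(sampleSource, 4)) // { line: 2, column: 2 }
```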

View File

@@ -1,109 +0,0 @@
import { mkdtemp } from 'fs/promises'
import { tmpdir } from 'os'
import { join } from 'path'
import type { AIProvider, AIProviderModel, ScriptLang } from '$lib/gen/types.gen'
import type { ContextElement } from '../../../../../frontend/src/lib/components/copilot/chat/context'
import {
prepareScriptSystemMessage,
prepareScriptTools,
prepareScriptUserMessage,
type ScriptChatHelpers
} from '../../../../../frontend/src/lib/components/copilot/chat/script/core'
import type { Tool as ProductionTool } from '../../../../../frontend/src/lib/components/copilot/chat/shared'
import { createScriptFileHelpers, type ScriptEvalState } from './fileHelpers'
import { runEval } from '../shared'
import type { ModeRunContext } from '../../../../core/types'
import type { TokenUsage } from '../shared/types'
export interface ScriptEvalResult {
success: boolean
script: ScriptEvalState
error?: string
assistantMessageCount: number
toolCallCount: number
toolsUsed: string[]
tokenUsage: TokenUsage
}
export interface ScriptEvalOptions {
initialScript: ScriptEvalState
model?: string
maxIterations?: number
provider?: AIProvider
workspaceRoot?: string
runContext?: ModeRunContext
}
function resolveModelProvider(
model: string,
provider?: AIProvider
): AIProviderModel {
if (provider) {
return { provider, model }
}
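if (model.startsWith('claude')) {
return { provider: 'anthropic', model }
}
// Mirror resolveEvalModelProvider so Gemini model ids are not routed to OpenAI.
if (model.startsWith('gemini')) {
return { provider: 'googleai', model }
}
return { provider: 'openai', model }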
}
export async function runScriptEval(
userPrompt: string,
apiKey: string,
options: ScriptEvalOptions
): Promise<ScriptEvalResult> {
const workspaceRoot =
options.workspaceRoot ?? (await mkdtemp(join(tmpdir(), 'wmill-frontend-script-benchmark-')))
const { helpers, getScript, cleanup } = await createScriptFileHelpers(
options.initialScript,
workspaceRoot
)
try {
const model = options.model ?? 'claude-haiku-4-5-20251001'
const modelProvider = resolveModelProvider(model, options.provider)
const selectedContext: ContextElement[] = []
const systemMessage = prepareScriptSystemMessage(
modelProvider,
options.initialScript.lang,
{}
)
const tools = prepareScriptTools(
modelProvider,
options.initialScript.lang,
selectedContext
) as ProductionTool<ScriptChatHelpers>[]
const userMessage = prepareScriptUserMessage(userPrompt, selectedContext)
const rawResult = await runEval({
userPrompt,
systemMessage,
userMessage,
tools,
helpers,
apiKey,
getOutput: getScript,
onAssistantMessageStart: options.runContext?.onAssistantMessageStart,
onAssistantToken: options.runContext?.onAssistantChunk,
onAssistantMessageEnd: options.runContext?.onAssistantMessageEnd,
options: {
maxIterations: options.maxIterations,
model,
workspace: workspaceRoot,
provider: modelProvider.provider
}
})
return {
script: rawResult.output,
success: rawResult.success,
error: rawResult.error,
assistantMessageCount: rawResult.iterations,
toolCallCount: rawResult.toolCallsCount,
toolsUsed: rawResult.toolsCalled,
tokenUsage: rawResult.tokenUsage
}
} finally {
await cleanup()
}
}

View File

@@ -1,173 +0,0 @@
import type {
ChatCompletionMessageParam,
ChatCompletionSystemMessageParam
} from 'openai/resources/chat/completions.mjs'
import type { AIProviderModel } from '$lib/gen/types.gen'
import type { TokenUsage, ToolCallDetail, EvalRunnerOptions, RawEvalResult } from './types'
import { runChatLoop, type ChatClients } from '../../../../../frontend/src/lib/components/copilot/chat/chatLoop'
import type {
Tool as ProductionTool,
ToolCallbacks
} from '../../../../../frontend/src/lib/components/copilot/chat/shared'
import {
createEvalClients,
type FrontendEvalProvider,
resolveEvalModelProvider
} from './providerConfig'
/**
* Parameters for running a base evaluation.
*/
export interface RunEvalParams<THelpers, TOutput> {
/** The user's prompt/instruction */
userPrompt: string
/** System message for the LLM */
systemMessage: ChatCompletionSystemMessageParam
/** User message for the LLM */
userMessage: ChatCompletionMessageParam
/** Tool definitions for the LLM API (unused — derived from tools) */
toolDefs?: unknown
/** Full tool implementations for execution */
tools: ProductionTool<THelpers>[]
/** Domain-specific helpers for tool execution */
helpers: THelpers
/** API key for the provider */
apiKey: string
/** Function to get the current output state */
getOutput: () => TOutput
/** Optional configuration */
options?: EvalRunnerOptions
onAssistantMessageStart?: () => void
onAssistantToken?: (token: string) => void
onAssistantMessageEnd?: () => void
}
/**
* Runs a generic evaluation using the shared chat loop (same code path as production).
* Uses streaming via real provider SDKs instead of OpenRouter non-streaming.
*/
export async function runEval<THelpers, TOutput>(
params: RunEvalParams<THelpers, TOutput>
): Promise<RawEvalResult<TOutput>> {
const {
systemMessage,
userMessage,
tools,
helpers,
apiKey,
getOutput,
options,
onAssistantMessageStart,
onAssistantToken,
onAssistantMessageEnd
} = params
let shouldEmitMessageStart = true
const model = options?.model ?? 'gpt-4o'
const maxIterations = options?.maxIterations ?? 20
const workspace = options?.workspace ?? 'test-workspace'
const provider = options?.provider
const modelProvider = resolveEvalModelProvider(
model,
provider as FrontendEvalProvider | undefined
) as AIProviderModel
const clients = createEvalClients(modelProvider.provider, apiKey) as ChatClients
const messages: ChatCompletionMessageParam[] = [userMessage]
let toolCallsCount = 0
const toolsCalled: string[] = []
const toolCallDetails: ToolCallDetail[] = []
// Wrap each tool's fn to record the call count, tool name, and arguments before delegating.
// Arguments that arrive as a JSON string are parsed; if parsing fails, the raw value is recorded.
const wrappedTools = tools.map((tool) => ({
...tool,
fn: async (p: any) => {
toolCallsCount++
toolsCalled.push(tool.def.function.name)
try {
const args =
typeof p.args === 'string' ? JSON.parse(p.args) : p.args
toolCallDetails.push({ name: tool.def.function.name, arguments: args })
} catch {
toolCallDetails.push({
name: tool.def.function.name,
arguments: p.args
})
}
return tool.fn(p)
}
}))
// Tool-status callbacks are no-ops in evals; token and message-end callbacks
// forward to the optional run-context hooks.
const callbacks: ToolCallbacks & {
onNewToken: (token: string) => void
onMessageEnd: () => void
} = {
setToolStatus: () => {},
removeToolStatus: () => {},
onNewToken: (token: string) => {
if (shouldEmitMessageStart) {
onAssistantMessageStart?.()
shouldEmitMessageStart = false
}
onAssistantToken?.(token)
},
onMessageEnd: () => {
if (!shouldEmitMessageStart) {
onAssistantMessageEnd?.()
}
shouldEmitMessageStart = true
}
}
const abortController = new AbortController()
try {
const result = await runChatLoop({
messages,
systemMessage,
tools: wrappedTools,
helpers,
abortController,
callbacks,
modelProvider,
clients,
workspace,
maxIterations,
skipResponsesApi: modelProvider.provider !== 'openai' && modelProvider.provider !== 'azure_openai'
})
return {
success: true,
output: getOutput(),
tokenUsage: result.tokenUsage,
toolCallsCount,
toolsCalled,
toolCallDetails,
iterations: Math.max(1, result.addedMessages.filter((m) => m.role === 'assistant').length),
messages
}
} catch (err) {
let errorMessage: string
if (err instanceof Error) {
errorMessage = err.stack ?? err.message
} else {
errorMessage = String(err)
}
return {
success: false,
output: getOutput(),
error: errorMessage,
tokenUsage: { prompt: 0, completion: 0, total: 0 },
toolCallsCount,
toolsCalled,
toolCallDetails,
iterations: 0,
messages
}
}
}
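
The tool-wrapping step in `runEval` can be illustrated without the chat loop. The sketch below assumes a reduced tool shape: `SketchTool`, `CallRecord`, and `trackTools` are hypothetical names, and the real `Tool` type carries more fields (for example `def.function.name`). It shows only how each call is recorded before the original function runs.

```typescript
// Hypothetical minimal tool shape; the production Tool type has more fields.
interface SketchTool {
  name: string
  fn: (p: { args: unknown }) => Promise<string>
}

interface CallRecord {
  name: string
  arguments: unknown
}

function trackTools(tools: SketchTool[]) {
  const calls: CallRecord[] = []
  const wrapped = tools.map((tool) => ({
    ...tool,
    fn: async (p: { args: unknown }) => {
      let parsed: unknown = p.args
      if (typeof p.args === 'string') {
        try {
          parsed = JSON.parse(p.args)
        } catch {
          // Keep the raw string when it is not valid JSON.
        }
      }
      // Record the call, then delegate to the original implementation.
      calls.push({ name: tool.name, arguments: parsed })
      return tool.fn(p)
    }
  }))
  return { wrapped, calls }
}

const echoTool: SketchTool = { name: 'echo', fn: async (p) => String(p.args) }
const tracked = trackTools([echoTool])
void tracked.wrapped[0].fn({ args: '{"x":1}' })
console.log(tracked.calls.length) // prints 1
```

Because the wrapper has no `await` before `calls.push`, the record exists as soon as the wrapped function is invoked, even if the underlying tool later rejects.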

View File

@@ -1,3 +0,0 @@
export type { TokenUsage, ToolCallDetail, EvalRunnerOptions, RawEvalResult } from './types'
export type { RunEvalParams } from './baseEvalRunner'
export { runEval } from './baseEvalRunner'

View File

@@ -1,41 +0,0 @@
import { describe, expect, it } from "bun:test";
import {
buildOpenAICompatibleClientOptions,
resolveEvalModelProvider,
} from "./providerConfig";
describe("buildOpenAICompatibleClientOptions", () => {
it("adds Gemini's OpenAI-compatible base URL and client header", () => {
const options = buildOpenAICompatibleClientOptions("googleai", "gemini-test-key");
expect(options).toMatchObject({
apiKey: "gemini-test-key",
baseURL: "https://generativelanguage.googleapis.com/v1beta/openai/",
defaultHeaders: {
"x-goog-api-client": "windmill-ai-evals/1.0",
},
});
});
it("keeps the default OpenAI-compatible config for OpenAI", () => {
expect(buildOpenAICompatibleClientOptions("openai", "openai-test-key")).toEqual({
apiKey: "openai-test-key",
});
});
});
describe("resolveEvalModelProvider", () => {
it("infers googleai from Gemini model ids", () => {
expect(resolveEvalModelProvider("gemini-2.5-flash")).toEqual({
provider: "googleai",
model: "gemini-2.5-flash",
});
});
it("preserves an explicit provider", () => {
expect(resolveEvalModelProvider("gemini-2.5-pro", "googleai")).toEqual({
provider: "googleai",
model: "gemini-2.5-pro",
});
});
});

View File

@@ -1,71 +0,0 @@
import Anthropic from "@anthropic-ai/sdk";
import OpenAI from "openai";
import type { FrontendEvalModelConfig } from "../../../../core/models";
export type FrontendEvalProvider = FrontendEvalModelConfig["provider"];
export interface EvalClients {
openai: OpenAI;
anthropic: Anthropic;
}
export interface ResolvedEvalModelProvider {
provider: FrontendEvalProvider;
model: string;
}
const GEMINI_OPENAI_BASE_URL = "https://generativelanguage.googleapis.com/v1beta/openai/";
const GEMINI_GOOG_API_CLIENT = "windmill-ai-evals/1.0";
export function buildOpenAICompatibleClientOptions(
provider: Exclude<FrontendEvalProvider, "anthropic">,
apiKey: string
): ConstructorParameters<typeof OpenAI>[0] {
if (provider === "googleai") {
return {
apiKey,
baseURL: GEMINI_OPENAI_BASE_URL,
defaultHeaders: {
"x-goog-api-client": GEMINI_GOOG_API_CLIENT,
},
};
}
return { apiKey };
}
export function createEvalClients(
provider: FrontendEvalProvider,
apiKey: string
): EvalClients {
if (provider === "anthropic") {
return {
openai: new OpenAI({ apiKey: "unused" }),
anthropic: new Anthropic({ apiKey }),
};
}
return {
openai: new OpenAI(buildOpenAICompatibleClientOptions(provider, apiKey)),
anthropic: new Anthropic({ apiKey: "unused" }),
};
}
export function resolveEvalModelProvider(
model: string,
provider?: FrontendEvalProvider
): ResolvedEvalModelProvider {
if (provider) {
return { provider, model };
}
if (model.startsWith("claude")) {
return { provider: "anthropic", model };
}
if (model.startsWith("gemini")) {
return { provider: "googleai", model };
}
// OpenAI model ids (gpt-*, o*) and any unrecognized id fall back to OpenAI.
return { provider: "openai", model };
}
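
The resolution order in `resolveEvalModelProvider` is: an explicit provider wins, then the model id prefix is checked, then OpenAI is the fallback. A reduced sketch follows; `SketchProvider` and `inferProvider` are hypothetical names, and the provider union is narrowed to the three values the function above returns.

```typescript
// Hypothetical standalone version of prefix-based provider inference.
type SketchProvider = 'anthropic' | 'googleai' | 'openai'

function inferProvider(model: string, explicit?: SketchProvider): SketchProvider {
  // An explicit provider always takes precedence over the model id.
  if (explicit) return explicit
  if (model.startsWith('claude')) return 'anthropic'
  if (model.startsWith('gemini')) return 'googleai'
  // Everything else, including gpt-* and o* ids, falls back to OpenAI.
  return 'openai'
}

console.log(inferProvider('gemini-2.5-flash')) // googleai
console.log(inferProvider('claude-haiku-4-5-20251001')) // anthropic
console.log(inferProvider('gemini-2.5-pro', 'openai')) // openai
```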

View File

@@ -1,32 +0,0 @@
import type { ChatCompletionMessageParam } from 'openai/resources/chat/completions.mjs'
import type { AIProvider } from '$lib/gen/types.gen'
export interface TokenUsage {
prompt: number
completion: number
total: number
}
export interface ToolCallDetail {
name: string
arguments: Record<string, unknown>
}
export interface EvalRunnerOptions {
maxIterations?: number
model?: string
workspace?: string
provider?: AIProvider
}
export interface RawEvalResult<TOutput> {
success: boolean
output: TOutput
error?: string
tokenUsage: TokenUsage
toolCallsCount: number
toolsCalled: string[]
toolCallDetails: ToolCallDetail[]
iterations: number
messages: ChatCompletionMessageParam[]
}

View File

@@ -1,270 +0,0 @@
import { randomUUID } from 'node:crypto'
import type { CompletedJob, Flow, Script } from '../../../frontend/src/lib/gen'
import type { ScriptLang } from '../../../frontend/src/lib/gen/types.gen'
import { buildScriptLintResult } from './core/script/preview'
const BENCHMARK_TIMESTAMP = '1970-01-01T00:00:00.000Z'
export interface BenchmarkWorkspaceScript {
path: string
summary: string
description?: string
language: Script['language']
schema?: Record<string, unknown>
content: string
}
export interface BenchmarkWorkspaceFlow {
path: string
summary: string
description?: string
schema?: Record<string, unknown>
value: Flow['value']
}
export interface BenchmarkWorkspaceRunnables {
scripts?: BenchmarkWorkspaceScript[]
flows?: BenchmarkWorkspaceFlow[]
}
type BenchmarkCompletedJob = CompletedJob & { type: 'CompletedJob' }
const benchmarkWorkspaces = new Set<string>()
const benchmarkWorkspaceRunnables = new Map<string, BenchmarkWorkspaceRunnables>()
const benchmarkJobs = new Map<string, { workspace: string; job: BenchmarkCompletedJob }>()
export function resetBenchmarkMockBackend(): void {
benchmarkWorkspaces.clear()
benchmarkWorkspaceRunnables.clear()
benchmarkJobs.clear()
}
export function registerBenchmarkWorkspace(workspace: string): void {
benchmarkWorkspaces.add(workspace)
}
export function registerBenchmarkWorkspaceRunnables(
workspace: string,
runnables: BenchmarkWorkspaceRunnables
): void {
benchmarkWorkspaces.add(workspace)
benchmarkWorkspaceRunnables.set(workspace, runnables)
}
export function unregisterBenchmarkWorkspace(workspace: string): void {
benchmarkWorkspaces.delete(workspace)
benchmarkWorkspaceRunnables.delete(workspace)
for (const [jobId, entry] of benchmarkJobs.entries()) {
if (entry.workspace === workspace) {
benchmarkJobs.delete(jobId)
}
}
}
export function unregisterBenchmarkWorkspaceRunnables(workspace: string): void {
unregisterBenchmarkWorkspace(workspace)
}
export function hasBenchmarkWorkspace(workspace: string): boolean {
return benchmarkWorkspaces.has(workspace)
}
export function listBenchmarkScripts(workspace: string): Script[] | null {
const runnables = benchmarkWorkspaceRunnables.get(workspace)
if (!runnables) {
return null
}
return (runnables.scripts ?? []).map(buildBenchmarkScript)
}
export function listBenchmarkFlows(workspace: string): Flow[] | null {
const runnables = benchmarkWorkspaceRunnables.get(workspace)
if (!runnables) {
return null
}
return (runnables.flows ?? []).map(buildBenchmarkFlow)
}
export function getBenchmarkScriptByPath(workspace: string, path: string): Script | null {
const script = benchmarkWorkspaceRunnables
.get(workspace)
?.scripts?.find((entry) => entry.path === path)
return script ? buildBenchmarkScript(script) : null
}
export function getBenchmarkScriptByHash(workspace: string, hash: string): Script | null {
const script = benchmarkWorkspaceRunnables
.get(workspace)
?.scripts?.find((entry) => buildBenchmarkScriptHash(entry.path) === hash)
return script ? buildBenchmarkScript(script) : null
}
export function getBenchmarkFlowByPath(workspace: string, path: string): Flow | null {
const flow = benchmarkWorkspaceRunnables
.get(workspace)
?.flows?.find((entry) => entry.path === path)
return flow ? buildBenchmarkFlow(flow) : null
}
export function createBenchmarkCompletedJob(input: {
workspace: string
jobKind: CompletedJob['job_kind']
success?: boolean
result?: unknown
logs?: string
scriptPath?: string
scriptHash?: string
args?: Record<string, unknown>
}): string {
const jobId = `benchmark-job-${randomUUID()}`
const now = new Date().toISOString()
const job: BenchmarkCompletedJob = {
type: 'CompletedJob',
id: jobId,
workspace_id: input.workspace,
created_by: 'ai-evals',
created_at: now,
started_at: now,
completed_at: now,
duration_ms: 0,
success: input.success ?? true,
script_path: input.scriptPath,
script_hash: input.scriptHash,
args: input.args,
result: input.result,
logs: input.logs,
canceled: false,
job_kind: input.jobKind,
permissioned_as: 'u/ai-evals',
is_flow_step: false,
is_skipped: false,
email: 'ai-evals@local',
visible_to_owner: true,
tag: 'benchmark'
}
benchmarkJobs.set(jobId, { workspace: input.workspace, job })
return jobId
}
export function getBenchmarkCompletedJob(
workspace: string,
jobId: string
): BenchmarkCompletedJob | null {
const entry = benchmarkJobs.get(jobId)
if (!entry || entry.workspace !== workspace) {
return null
}
return structuredClone(entry.job)
}
export function runBenchmarkScriptPreview(input: {
workspace: string
requestBody: {
content?: string
language?: ScriptLang | 'bunnative'
args?: Record<string, unknown>
path?: string
}
}): string {
const content = input.requestBody.content ?? ''
const language = input.requestBody.language ?? 'bun'
const lintResult = buildScriptLintResult(content, language)
const success = lintResult.errorCount === 0
return createBenchmarkCompletedJob({
workspace: input.workspace,
jobKind: 'preview',
success,
scriptPath: input.requestBody.path,
args: input.requestBody.args,
result: success
? {
path: input.requestBody.path,
args: input.requestBody.args ?? {},
validated: true
}
: {
path: input.requestBody.path,
args: input.requestBody.args ?? {},
errorCount: lintResult.errorCount,
errors: lintResult.errors.map((entry) => ({
line: entry.startLineNumber,
message: entry.message
}))
}
})
}
export function runBenchmarkFlowByPath(input: {
workspace: string
path: string
args?: Record<string, unknown>
}): string {
const flow = getBenchmarkFlowByPath(input.workspace, input.path)
return createBenchmarkCompletedJob({
workspace: input.workspace,
jobKind: 'flowpreview',
success: flow !== null,
args: input.args,
result:
flow !== null
? {
path: input.path,
args: input.args ?? {},
mocked: true
}
: {
error: `Flow "${input.path}" not found in benchmark workspace`
},
logs:
flow !== null
? 'Mock benchmark flow run completed successfully.'
: `Flow "${input.path}" not found in benchmark workspace.`
})
}
function buildBenchmarkScriptHash(path: string): string {
return `benchmark:${path}`
}
function buildBenchmarkScript(script: BenchmarkWorkspaceScript): Script {
return {
workspace_id: 'benchmark',
hash: buildBenchmarkScriptHash(script.path),
path: script.path,
parent_hashes: [],
summary: script.summary,
description: script.description ?? '',
content: script.content,
created_by: 'benchmark',
created_at: BENCHMARK_TIMESTAMP,
archived: false,
schema: script.schema ?? {},
deleted: false,
is_template: false,
extra_perms: {},
language: script.language,
kind: 'script',
starred: false,
has_preprocessor: false,
modules: null
}
}
function buildBenchmarkFlow(flow: BenchmarkWorkspaceFlow): Flow {
return {
path: flow.path,
summary: flow.summary,
description: flow.description ?? '',
value: flow.value,
schema: flow.schema ?? {},
edited_by: 'benchmark',
edited_at: BENCHMARK_TIMESTAMP,
archived: false,
extra_perms: {}
} as Flow
}
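
The mock backend above keeps completed jobs in a module-level map and only returns a job when the requesting workspace matches the one that created it. The sketch below shows that scoping rule on its own. `SketchJob`, `createJob`, `getJob`, and `dropWorkspace` are hypothetical names, and a counter stands in for `randomUUID` to keep the example deterministic.

```typescript
// Hypothetical sketch of a workspace-scoped in-memory job store.
interface SketchJob {
  id: string
  workspace: string
  result: unknown
}

const sketchJobs = new Map<string, SketchJob>()
let sketchJobCounter = 0

function createJob(workspace: string, result: unknown): string {
  const id = `sketch-job-${++sketchJobCounter}`
  sketchJobs.set(id, { id, workspace, result })
  return id
}

// A job is visible only to the workspace that created it.
function getJob(workspace: string, id: string): SketchJob | null {
  const job = sketchJobs.get(id)
  if (!job || job.workspace !== workspace) return null
  return { ...job }
}

// Remove every job that belongs to the given workspace.
function dropWorkspace(workspace: string): void {
  sketchJobs.forEach((job, id) => {
    if (job.workspace === workspace) sketchJobs.delete(id)
  })
}

const sketchJobId = createJob('ws-a', { mocked: true })
console.log(getJob('ws-a', sketchJobId) !== null) // true
console.log(getJob('ws-b', sketchJobId)) // null
dropWorkspace('ws-a')
console.log(getJob('ws-a', sketchJobId)) // null
```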

View File

@@ -1,133 +0,0 @@
export type FrontendBenchmarkProgressSurface = 'flow' | 'app' | 'script'
export type FrontendBenchmarkProgressEvent =
| {
type: 'run-start'
surface: FrontendBenchmarkProgressSurface
totalCases: number
runs: number
concurrency: number
}
| {
type: 'attempt-start'
surface: FrontendBenchmarkProgressSurface
caseId: string
caseNumber: number
totalCases: number
attempt: number
runs: number
}
| {
type: 'attempt-finish'
surface: FrontendBenchmarkProgressSurface
caseId: string
caseNumber: number
totalCases: number
attempt: number
runs: number
passed: boolean
durationMs: number
judgeScore: number | null
error: string | null
}
| {
type: 'assistant-message-start'
surface: FrontendBenchmarkProgressSurface
caseId: string
caseNumber: number
totalCases: number
attempt: number
runs: number
}
| {
type: 'assistant-chunk'
surface: FrontendBenchmarkProgressSurface
caseId: string
caseNumber: number
totalCases: number
attempt: number
runs: number
chunk: string
}
| {
type: 'assistant-message-end'
surface: FrontendBenchmarkProgressSurface
caseId: string
caseNumber: number
totalCases: number
attempt: number
runs: number
}
export const FRONTEND_BENCHMARK_PROGRESS_PREFIX = 'WMILL_FRONTEND_AI_EVAL_PROGRESS '
export function emitFrontendBenchmarkProgress(event: FrontendBenchmarkProgressEvent): void {
process.stderr.write(
`${FRONTEND_BENCHMARK_PROGRESS_PREFIX}${JSON.stringify(event)}\n`
)
}
export function parseFrontendBenchmarkProgressLine(
line: string
): FrontendBenchmarkProgressEvent | null {
if (!line.startsWith(FRONTEND_BENCHMARK_PROGRESS_PREFIX)) {
return null
}
try {
const parsed = JSON.parse(
line.slice(FRONTEND_BENCHMARK_PROGRESS_PREFIX.length)
) as FrontendBenchmarkProgressEvent
return parsed?.type ? parsed : null
} catch {
return null
}
}
export function formatFrontendBenchmarkProgressEvent(
event: FrontendBenchmarkProgressEvent
): string {
switch (event.type) {
case 'run-start':
return `Running ${event.surface}: ${event.totalCases} cases x ${event.runs} run${event.runs === 1 ? '' : 's'}, concurrency ${event.concurrency}`
case 'attempt-start':
return `${formatCasePrefix(event.caseNumber, event.totalCases)} ${event.caseId} attempt ${event.attempt}/${event.runs}...`
case 'attempt-finish': {
const parts = [
`${formatCasePrefix(event.caseNumber, event.totalCases)} ${event.caseId} attempt ${event.attempt}/${event.runs} ${event.passed ? 'pass' : 'fail'}`,
formatDuration(event.durationMs)
]
if (event.judgeScore !== null) {
parts.push(`judge ${formatNumber(event.judgeScore)}`)
}
if (event.error) {
parts.push(truncateSingleLine(event.error, 120))
}
return parts.join(' | ')
}
case 'assistant-message-start':
case 'assistant-chunk':
case 'assistant-message-end':
return ''
}
}
function formatCasePrefix(caseNumber: number, totalCases: number): string {
return `[${caseNumber}/${totalCases}]`
}
function formatDuration(durationMs: number): string {
return `${formatNumber(durationMs / 1000)}s`
}
function formatNumber(value: number): string {
return Number.isInteger(value) ? String(value) : value.toFixed(1)
}
function truncateSingleLine(value: string, maxLength: number): string {
const normalized = value.replace(/\s+/g, ' ').trim()
if (normalized.length <= maxLength) {
return normalized
}
return `${normalized.slice(0, Math.max(0, maxLength - 3))}...`
}
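
The progress helpers above use a line-oriented protocol: each event is written to stderr as a fixed prefix followed by one JSON object, and the reader ignores any line that lacks the prefix or fails to parse. A minimal round-trip sketch of that idea follows. The names `DemoProgressEvent`, `DEMO_PREFIX`, `encodeProgressLine` and `decodeProgressLine` are hypothetical and the event union is reduced to two variants for illustration.

```typescript
// Hypothetical, simplified event type for illustration; the real union has more variants.
type DemoProgressEvent =
	| { type: 'run-start'; totalCases: number; runs: number }
	| { type: 'attempt-finish'; caseId: string; passed: boolean; durationMs: number }

// Assumed prefix for this sketch only.
const DEMO_PREFIX = 'DEMO_PROGRESS '

// Serialize an event as one prefixed line of JSON (the caller appends the newline).
function encodeProgressLine(event: DemoProgressEvent): string {
	return `${DEMO_PREFIX}${JSON.stringify(event)}`
}

// Return the event when the line carries the prefix and valid JSON with a `type`, else null.
function decodeProgressLine(line: string): DemoProgressEvent | null {
	if (!line.startsWith(DEMO_PREFIX)) {
		return null
	}
	try {
		const parsed = JSON.parse(line.slice(DEMO_PREFIX.length)) as DemoProgressEvent
		return parsed?.type ? parsed : null
	} catch {
		return null
	}
}

const encoded = encodeProgressLine({ type: 'run-start', totalCases: 3, runs: 2 })
const decoded = decodeProgressLine(encoded)
const ignored = decodeProgressLine('Browserslist: some unrelated warning')
const malformed = decodeProgressLine(`${DEMO_PREFIX}{not json`)
```

Returning `null` for both unrelated and malformed lines lets the consumer treat anything it cannot decode as ordinary stderr output.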

@@ -1,216 +0,0 @@
import { spawn } from 'node:child_process'
import { mkdtemp, readFile, rm } from 'node:fs/promises'
import { tmpdir } from 'node:os'
import path from 'node:path'
import { fileURLToPath } from 'node:url'
import {
formatFrontendBenchmarkProgressEvent,
parseFrontendBenchmarkProgressLine
} from './progress'
import type { BenchmarkRunResult } from '../../core/types'
const REPO_ROOT = fileURLToPath(new URL('../../../', import.meta.url))
const FRONTEND_DIR = path.join(REPO_ROOT, 'frontend')
const FRONTEND_BENCHMARK_TEST = '../ai_evals/adapters/frontend/vitestAdapter.test.ts'
const FRONTEND_BENCHMARK_CONFIG = '../ai_evals/adapters/frontend/vitest.config.ts'
export type FrontendMode = 'flow' | 'app' | 'script'
export async function runFrontendBenchmarkAdapter(input: {
mode: FrontendMode
caseIds: string[]
runs: number
model?: string
verbose?: boolean
}): Promise<BenchmarkRunResult> {
const tempDir = await mkdtemp(path.join(tmpdir(), 'wmill-frontend-benchmark-'))
const outputPath = path.join(tempDir, 'result.json')
try {
await runVitestBenchmark(
path.join(FRONTEND_DIR, 'node_modules', '.bin', 'vitest'),
[
'run',
FRONTEND_BENCHMARK_TEST,
'--project',
'server',
'--config',
FRONTEND_BENCHMARK_CONFIG
],
{
cwd: FRONTEND_DIR,
env: {
...process.env,
BROWSERSLIST_IGNORE_OLD_DATA: '1',
WMILL_FRONTEND_AI_EVAL_OUTPUT_PATH: outputPath,
WMILL_FRONTEND_AI_EVAL_MODE: input.mode,
WMILL_FRONTEND_AI_EVAL_CASE_IDS: JSON.stringify(input.caseIds),
WMILL_FRONTEND_AI_EVAL_RUNS: String(input.runs),
WMILL_FRONTEND_AI_EVAL_MODEL: input.model ?? '',
WMILL_FRONTEND_AI_EVAL_PROGRESS: '1',
WMILL_FRONTEND_AI_EVAL_VERBOSE: input.verbose ? '1' : '0'
}
}
)
const raw = await readFile(outputPath, 'utf8')
return JSON.parse(raw) as BenchmarkRunResult
} catch (error) {
throw new Error(`Frontend benchmark adapter failed:\n${toErrorMessage(error)}`)
} finally {
await rm(tempDir, { recursive: true, force: true })
}
}
async function runVitestBenchmark(
command: string,
args: string[],
options: {
cwd: string
env: NodeJS.ProcessEnv
}
): Promise<void> {
const child = spawn(command, args, {
cwd: options.cwd,
env: options.env,
stdio: ['ignore', 'pipe', 'pipe']
})
let stdout = ''
let stderr = ''
let stderrLineBuffer = ''
let assistantStreamOpen = false
child.stdout?.setEncoding('utf8')
child.stdout?.on('data', (chunk: string) => {
stdout += chunk
})
child.stderr?.setEncoding('utf8')
child.stderr?.on('data', (chunk: string) => {
stderrLineBuffer += chunk
const { remainder, passthrough, nextAssistantStreamOpen } = drainProgressLines(
stderrLineBuffer,
assistantStreamOpen
)
stderrLineBuffer = remainder
stderr += passthrough
assistantStreamOpen = nextAssistantStreamOpen
})
await new Promise<void>((resolve, reject) => {
child.once('error', reject)
child.once('close', (code) => {
if (stderrLineBuffer.length > 0) {
const {
remainder,
passthrough,
nextAssistantStreamOpen
} = drainProgressLines(`${stderrLineBuffer}\n`, assistantStreamOpen)
stderrLineBuffer = remainder
stderr += passthrough
assistantStreamOpen = nextAssistantStreamOpen
}
if (code === 0) {
if (assistantStreamOpen) {
process.stderr.write('\n')
}
resolve()
return
}
const details = [`vitest exited with code ${code}`, stdout, stderr].filter(Boolean).join('\n')
reject(new Error(details))
})
})
}
function drainProgressLines(
buffer: string,
initialAssistantStreamOpen: boolean
): {
remainder: string
passthrough: string
nextAssistantStreamOpen: boolean
} {
let remainder = buffer
let passthrough = ''
let assistantStreamOpen = initialAssistantStreamOpen
while (true) {
const newlineIndex = remainder.indexOf('\n')
if (newlineIndex === -1) {
return { remainder, passthrough, nextAssistantStreamOpen: assistantStreamOpen }
}
const line = remainder.slice(0, newlineIndex).replace(/\r$/, '')
remainder = remainder.slice(newlineIndex + 1)
const progressEvent = parseFrontendBenchmarkProgressLine(line)
if (progressEvent) {
if (progressEvent.type === 'assistant-message-start') {
if (assistantStreamOpen) {
process.stderr.write('\n')
}
process.stderr.write(
`${formatCasePrefix(progressEvent.caseNumber, progressEvent.totalCases)} ${progressEvent.caseId} attempt ${progressEvent.attempt}/${progressEvent.runs} assistant:\n`
)
assistantStreamOpen = true
continue
}
if (progressEvent.type === 'assistant-chunk') {
process.stderr.write(progressEvent.chunk)
continue
}
if (progressEvent.type === 'assistant-message-end') {
if (assistantStreamOpen) {
process.stderr.write('\n')
}
assistantStreamOpen = false
continue
}
if (assistantStreamOpen) {
process.stderr.write('\n')
assistantStreamOpen = false
}
process.stderr.write(`${formatFrontendBenchmarkProgressEvent(progressEvent)}\n`)
continue
}
if (shouldSuppressFrontendStderrLine(line)) {
continue
}
passthrough += `${line}\n`
process.stderr.write(`${line}\n`)
}
}
function formatCasePrefix(caseNumber: number, totalCases: number): string {
return `[${caseNumber}/${totalCases}]`
}
function shouldSuppressFrontendStderrLine(line: string): boolean {
return (
line.startsWith('[baseline-browser-mapping] ') ||
line.startsWith('Browserslist: browsers data (caniuse-lite) is ') ||
line.includes('update-browserslist-db@latest') ||
line.includes('update-db#readme')
)
}
function toErrorMessage(error: unknown): string {
if (error instanceof Error) {
return error.message
}
return String(error)
}
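
The stderr handler above buffers incoming chunks and only processes complete lines, because a pipe may deliver a line split across several `data` events. The sketch below isolates that buffering step under simplified assumptions: `drainLines` is a hypothetical helper that returns the complete lines instead of writing them, and the chunks are hard-coded rather than read from a child process.

```typescript
// Split complete lines off the front of `buffer`; whatever follows the last
// newline is returned as `remainder` so the caller can prepend it to the next chunk.
function drainLines(buffer: string): { lines: string[]; remainder: string } {
	const lines: string[] = []
	let remainder = buffer
	while (true) {
		const newlineIndex = remainder.indexOf('\n')
		if (newlineIndex === -1) {
			return { lines, remainder }
		}
		// Drop a trailing carriage return so CRLF and LF input yield the same line.
		lines.push(remainder.slice(0, newlineIndex).replace(/\r$/, ''))
		remainder = remainder.slice(newlineIndex + 1)
	}
}

// Feed chunks that split a line in the middle, as a pipe may deliver them.
const chunks = ['first li', 'ne\r\nsecond', ' line\npartial']
let pending = ''
const seen: string[] = []
for (const chunk of chunks) {
	const drained = drainLines(pending + chunk)
	seen.push(...drained.lines)
	pending = drained.remainder
}
```

After the loop, `pending` still holds the unterminated tail. This is why the `close` handler above appends a final newline before draining one last time.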

@@ -1,28 +0,0 @@
import { fileURLToPath } from 'node:url'
import frontendConfig from '../../../frontend/vite.config.js'
const FRONTEND_VITE_CONFIG_PATH = fileURLToPath(new URL('../../../frontend/vite.config.js', import.meta.url))
const FRONTEND_TEST_SETUP_PATH = fileURLToPath(
new URL('../../../frontend/src/lib/test-setup.ts', import.meta.url)
)
const ADAPTER_TEST_PATH = fileURLToPath(new URL('./vitestAdapter.test.ts', import.meta.url))
const config = {
...frontendConfig,
test: {
...frontendConfig.test,
projects: [
{
extends: FRONTEND_VITE_CONFIG_PATH,
test: {
name: 'server',
environment: 'node',
include: [ADAPTER_TEST_PATH],
setupFiles: [FRONTEND_TEST_SETUP_PATH]
}
}
]
}
}
export default config

@@ -1,165 +0,0 @@
import { expect, it, vi } from 'vitest'
// @ts-ignore - Node.js fs/promises
import { mkdir, writeFile } from 'fs/promises'
// @ts-ignore - Node.js path
import { dirname, resolve } from 'path'
vi.mock('monaco-editor', () => ({
editor: {},
languages: {},
KeyCode: {},
Uri: {
parse: (value: string) => ({ toString: () => value })
},
MarkerSeverity: {
Error: 8,
Warning: 4,
Info: 2,
Hint: 1
}
}))
vi.mock('@codingame/monaco-vscode-standalone-typescript-language-features', () => ({
getTypeScriptWorker: async () => async () => ({}),
typescriptVersion: 'test'
}))
vi.mock('@codingame/monaco-vscode-languages-service-override', () => ({
default: () => ({})
}))
vi.mock('$lib/components/vscode', () => ({}))
vi.mock('$lib/gen', async () => {
const actual = await vi.importActual<any>('$lib/gen')
const {
getBenchmarkCompletedJob,
getBenchmarkFlowByPath,
getBenchmarkScriptByHash,
getBenchmarkScriptByPath,
hasBenchmarkWorkspace,
listBenchmarkFlows,
listBenchmarkScripts,
runBenchmarkFlowByPath,
runBenchmarkScriptPreview
} = await import('./mockBackend')
function wrapService<T extends object>(target: T, overrides: Record<string, unknown>): T {
return new Proxy(target, {
get(source, property, receiver) {
if (typeof property === 'string' && property in overrides) {
return overrides[property]
}
return Reflect.get(source, property, receiver)
}
})
}
return {
...actual,
ScriptService: wrapService(actual.ScriptService, {
listScripts: async (data: { workspace: string }) =>
hasBenchmarkWorkspace(data.workspace)
? (listBenchmarkScripts(data.workspace) ?? [])
: actual.ScriptService.listScripts(data),
getScriptByPath: async (data: { workspace: string; path: string }) => {
if (hasBenchmarkWorkspace(data.workspace)) {
const script = getBenchmarkScriptByPath(data.workspace, data.path)
if (!script) {
throw new Error(`Script "${data.path}" not found in benchmark workspace`)
}
return script
}
return actual.ScriptService.getScriptByPath(data)
},
getScriptByHash: async (data: { workspace: string; hash: string }) => {
if (hasBenchmarkWorkspace(data.workspace)) {
const script = getBenchmarkScriptByHash(data.workspace, data.hash)
if (!script) {
throw new Error(`Script hash "${data.hash}" not found in benchmark workspace`)
}
return script
}
return actual.ScriptService.getScriptByHash(data)
}
}),
FlowService: wrapService(actual.FlowService, {
listFlows: async (data: { workspace: string }) =>
hasBenchmarkWorkspace(data.workspace)
? (listBenchmarkFlows(data.workspace) ?? [])
: actual.FlowService.listFlows(data),
getFlowByPath: async (data: { workspace: string; path: string }) => {
if (hasBenchmarkWorkspace(data.workspace)) {
const flow = getBenchmarkFlowByPath(data.workspace, data.path)
if (!flow) {
throw new Error(`Flow "${data.path}" not found in benchmark workspace`)
}
return flow
}
return actual.FlowService.getFlowByPath(data)
}
}),
JobService: wrapService(actual.JobService, {
runScriptPreview: async (data: {
workspace: string
requestBody?: {
content?: string
language?: string
args?: Record<string, unknown>
path?: string
}
}) =>
hasBenchmarkWorkspace(data.workspace)
? runBenchmarkScriptPreview({
workspace: data.workspace,
requestBody: data.requestBody ?? {}
})
: actual.JobService.runScriptPreview(data),
runFlowByPath: async (data: {
workspace: string
path: string
requestBody?: Record<string, unknown>
}) =>
hasBenchmarkWorkspace(data.workspace)
? runBenchmarkFlowByPath({
workspace: data.workspace,
path: data.path,
args: data.requestBody
})
: actual.JobService.runFlowByPath(data),
getJob: async (data: { workspace: string; id: string }) => {
if (hasBenchmarkWorkspace(data.workspace)) {
const job = getBenchmarkCompletedJob(data.workspace, data.id)
if (!job) {
throw new Error(`Job "${data.id}" not found in benchmark workspace`)
}
return job
}
return actual.JobService.getJob(data)
}
})
}
})
const benchmarkOutputPath = process.env.WMILL_FRONTEND_AI_EVAL_OUTPUT_PATH
const benchmarkIt = benchmarkOutputPath ? it : it.skip
benchmarkIt(
'runs the frontend benchmark adapter from environment input',
async () => {
const { resetBenchmarkMockBackend } = await import('./mockBackend')
resetBenchmarkMockBackend()
const { runFrontendBenchmarkFromEnv } = await import('./benchmarkRunner')
try {
const payload = await runFrontendBenchmarkFromEnv()
const absoluteOutputPath = resolve(benchmarkOutputPath!)
await mkdir(dirname(absoluteOutputPath), { recursive: true })
await writeFile(absoluteOutputPath, JSON.stringify(payload, null, 2) + '\n', 'utf8')
expect(payload.cases.length).toBeGreaterThan(0)
} finally {
resetBenchmarkMockBackend()
}
},
600_000
)
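
The `wrapService` helper in the mock above relies on a `Proxy` whose `get` trap serves selected properties from an overrides map and forwards everything else to the real object. Here is a self-contained sketch of that pattern. `wrapWithOverrides`, `realService` and `mockedService` are hypothetical names, and the methods are synchronous only to keep the example short.

```typescript
// Return a view of `target` where any property named in `overrides` is served
// from `overrides`, and everything else falls through to the original object.
function wrapWithOverrides<T extends object>(target: T, overrides: Record<string, unknown>): T {
	return new Proxy(target, {
		get(source, property, receiver) {
			if (typeof property === 'string' && property in overrides) {
				return overrides[property]
			}
			return Reflect.get(source, property, receiver)
		}
	})
}

// Hypothetical service object standing in for a generated API client.
const realService = {
	listItems: (): string[] => ['from-network'],
	ping: (): string => 'pong'
}

// Only `listItems` is replaced; `ping` still reaches the original implementation.
const mockedService = wrapWithOverrides(realService, {
	listItems: (): string[] => ['from-mock']
})
```

Because the proxy never mutates `realService`, the original object stays usable alongside the wrapped one. This suits a mock that must defer to the real implementation for non-benchmark workspaces.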

@@ -1,313 +0,0 @@
{
"lockfileVersion": 1,
"configVersion": 1,
"workspaces": {
"": {
"name": "windmill-ai-evals",
"dependencies": {
"@anthropic-ai/claude-agent-sdk": "^0.2.25",
"@anthropic-ai/sdk": "^0.39.0",
"commander": "^14.0.3",
"openai": "^6.9.1",
"yaml": "^2.8.3",
},
"devDependencies": {
"@types/bun": "latest",
"typescript": "^5.0.0",
},
},
},
"packages": {
"@anthropic-ai/claude-agent-sdk": ["@anthropic-ai/claude-agent-sdk@0.2.87", "", { "dependencies": { "@anthropic-ai/sdk": "^0.74.0", "@modelcontextprotocol/sdk": "^1.27.1" }, "optionalDependencies": { "@img/sharp-darwin-arm64": "^0.34.2", "@img/sharp-darwin-x64": "^0.34.2", "@img/sharp-linux-arm": "^0.34.2", "@img/sharp-linux-arm64": "^0.34.2", "@img/sharp-linux-x64": "^0.34.2", "@img/sharp-linuxmusl-arm64": "^0.34.2", "@img/sharp-linuxmusl-x64": "^0.34.2", "@img/sharp-win32-arm64": "^0.34.2", "@img/sharp-win32-x64": "^0.34.2" }, "peerDependencies": { "zod": "^4.0.0" } }, "sha512-WWmgBPxPhBOvNT0ujI8vPTI2lK+w5YEkEZ/y1mH0EDkK/0kBnxVJNhCtG5vnueiAViwLoUOFn66pbkDiivijdA=="],
"@anthropic-ai/sdk": ["@anthropic-ai/sdk@0.39.0", "", { "dependencies": { "@types/node": "^18.11.18", "@types/node-fetch": "^2.6.4", "abort-controller": "^3.0.0", "agentkeepalive": "^4.2.1", "form-data-encoder": "1.7.2", "formdata-node": "^4.3.2", "node-fetch": "^2.6.7" } }, "sha512-eMyDIPRZbt1CCLErRCi3exlAvNkBtRe+kW5vvJyef93PmNr/clstYgHhtvmkxN82nlKgzyGPCyGxrm0JQ1ZIdg=="],
"@babel/runtime": ["@babel/runtime@7.29.2", "", {}, "sha512-JiDShH45zKHWyGe4ZNVRrCjBz8Nh9TMmZG1kh4QTK8hCBTWBi8Da+i7s1fJw7/lYpM4ccepSNfqzZ/QvABBi5g=="],
"@hono/node-server": ["@hono/node-server@1.19.12", "", { "peerDependencies": { "hono": "^4" } }, "sha512-txsUW4SQ1iilgE0l9/e9VQWmELXifEFvmdA1j6WFh/aFPj99hIntrSsq/if0UWyGVkmrRPKA1wCeP+UCr1B9Uw=="],
"@img/sharp-darwin-arm64": ["@img/sharp-darwin-arm64@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-darwin-arm64": "1.2.4" }, "os": "darwin", "cpu": "arm64" }, "sha512-imtQ3WMJXbMY4fxb/Ndp6HBTNVtWCUI0WdobyheGf5+ad6xX8VIDO8u2xE4qc/fr08CKG/7dDseFtn6M6g/r3w=="],
"@img/sharp-darwin-x64": ["@img/sharp-darwin-x64@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-darwin-x64": "1.2.4" }, "os": "darwin", "cpu": "x64" }, "sha512-YNEFAF/4KQ/PeW0N+r+aVVsoIY0/qxxikF2SWdp+NRkmMB7y9LBZAVqQ4yhGCm/H3H270OSykqmQMKLBhBJDEw=="],
"@img/sharp-libvips-darwin-arm64": ["@img/sharp-libvips-darwin-arm64@1.2.4", "", { "os": "darwin", "cpu": "arm64" }, "sha512-zqjjo7RatFfFoP0MkQ51jfuFZBnVE2pRiaydKJ1G/rHZvnsrHAOcQALIi9sA5co5xenQdTugCvtb1cuf78Vf4g=="],
"@img/sharp-libvips-darwin-x64": ["@img/sharp-libvips-darwin-x64@1.2.4", "", { "os": "darwin", "cpu": "x64" }, "sha512-1IOd5xfVhlGwX+zXv2N93k0yMONvUlANylbJw1eTah8K/Jtpi15KC+WSiaX/nBmbm2HxRM1gZ0nSdjSsrZbGKg=="],
"@img/sharp-libvips-linux-arm": ["@img/sharp-libvips-linux-arm@1.2.4", "", { "os": "linux", "cpu": "arm" }, "sha512-bFI7xcKFELdiNCVov8e44Ia4u2byA+l3XtsAj+Q8tfCwO6BQ8iDojYdvoPMqsKDkuoOo+X6HZA0s0q11ANMQ8A=="],
"@img/sharp-libvips-linux-arm64": ["@img/sharp-libvips-linux-arm64@1.2.4", "", { "os": "linux", "cpu": "arm64" }, "sha512-excjX8DfsIcJ10x1Kzr4RcWe1edC9PquDRRPx3YVCvQv+U5p7Yin2s32ftzikXojb1PIFc/9Mt28/y+iRklkrw=="],
"@img/sharp-libvips-linux-x64": ["@img/sharp-libvips-linux-x64@1.2.4", "", { "os": "linux", "cpu": "x64" }, "sha512-tJxiiLsmHc9Ax1bz3oaOYBURTXGIRDODBqhveVHonrHJ9/+k89qbLl0bcJns+e4t4rvaNBxaEZsFtSfAdquPrw=="],
"@img/sharp-libvips-linuxmusl-arm64": ["@img/sharp-libvips-linuxmusl-arm64@1.2.4", "", { "os": "linux", "cpu": "arm64" }, "sha512-FVQHuwx1IIuNow9QAbYUzJ+En8KcVm9Lk5+uGUQJHaZmMECZmOlix9HnH7n1TRkXMS0pGxIJokIVB9SuqZGGXw=="],
"@img/sharp-libvips-linuxmusl-x64": ["@img/sharp-libvips-linuxmusl-x64@1.2.4", "", { "os": "linux", "cpu": "x64" }, "sha512-+LpyBk7L44ZIXwz/VYfglaX/okxezESc6UxDSoyo2Ks6Jxc4Y7sGjpgU9s4PMgqgjj1gZCylTieNamqA1MF7Dg=="],
"@img/sharp-linux-arm": ["@img/sharp-linux-arm@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-linux-arm": "1.2.4" }, "os": "linux", "cpu": "arm" }, "sha512-9dLqsvwtg1uuXBGZKsxem9595+ujv0sJ6Vi8wcTANSFpwV/GONat5eCkzQo/1O6zRIkh0m/8+5BjrRr7jDUSZw=="],
"@img/sharp-linux-arm64": ["@img/sharp-linux-arm64@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-linux-arm64": "1.2.4" }, "os": "linux", "cpu": "arm64" }, "sha512-bKQzaJRY/bkPOXyKx5EVup7qkaojECG6NLYswgktOZjaXecSAeCWiZwwiFf3/Y+O1HrauiE3FVsGxFg8c24rZg=="],
"@img/sharp-linux-x64": ["@img/sharp-linux-x64@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-linux-x64": "1.2.4" }, "os": "linux", "cpu": "x64" }, "sha512-MEzd8HPKxVxVenwAa+JRPwEC7QFjoPWuS5NZnBt6B3pu7EG2Ge0id1oLHZpPJdn3OQK+BQDiw9zStiHBTJQQQQ=="],
"@img/sharp-linuxmusl-arm64": ["@img/sharp-linuxmusl-arm64@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-linuxmusl-arm64": "1.2.4" }, "os": "linux", "cpu": "arm64" }, "sha512-fprJR6GtRsMt6Kyfq44IsChVZeGN97gTD331weR1ex1c1rypDEABN6Tm2xa1wE6lYb5DdEnk03NZPqA7Id21yg=="],
"@img/sharp-linuxmusl-x64": ["@img/sharp-linuxmusl-x64@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-linuxmusl-x64": "1.2.4" }, "os": "linux", "cpu": "x64" }, "sha512-Jg8wNT1MUzIvhBFxViqrEhWDGzqymo3sV7z7ZsaWbZNDLXRJZoRGrjulp60YYtV4wfY8VIKcWidjojlLcWrd8Q=="],
"@img/sharp-win32-arm64": ["@img/sharp-win32-arm64@0.34.5", "", { "os": "win32", "cpu": "arm64" }, "sha512-WQ3AgWCWYSb2yt+IG8mnC6Jdk9Whs7O0gxphblsLvdhSpSTtmu69ZG1Gkb6NuvxsNACwiPV6cNSZNzt0KPsw7g=="],
"@img/sharp-win32-x64": ["@img/sharp-win32-x64@0.34.5", "", { "os": "win32", "cpu": "x64" }, "sha512-+29YMsqY2/9eFEiW93eqWnuLcWcufowXewwSNIT6UwZdUUCrM3oFjMWH/Z6/TMmb4hlFenmfAVbpWeup2jryCw=="],
"@modelcontextprotocol/sdk": ["@modelcontextprotocol/sdk@1.29.0", "", { "dependencies": { "@hono/node-server": "^1.19.9", "ajv": "^8.17.1", "ajv-formats": "^3.0.1", "content-type": "^1.0.5", "cors": "^2.8.5", "cross-spawn": "^7.0.5", "eventsource": "^3.0.2", "eventsource-parser": "^3.0.0", "express": "^5.2.1", "express-rate-limit": "^8.2.1", "hono": "^4.11.4", "jose": "^6.1.3", "json-schema-typed": "^8.0.2", "pkce-challenge": "^5.0.0", "raw-body": "^3.0.0", "zod": "^3.25 || ^4.0", "zod-to-json-schema": "^3.25.1" }, "peerDependencies": { "@cfworker/json-schema": "^4.1.1" }, "optionalPeers": ["@cfworker/json-schema"] }, "sha512-zo37mZA9hJWpULgkRpowewez1y6ML5GsXJPY8FI0tBBCd77HEvza4jDqRKOXgHNn867PVGCyTdzqpz0izu5ZjQ=="],
"@types/bun": ["@types/bun@1.3.11", "", { "dependencies": { "bun-types": "1.3.11" } }, "sha512-5vPne5QvtpjGpsGYXiFyycfpDF2ECyPcTSsFBMa0fraoxiQyMJ3SmuQIGhzPg2WJuWxVBoxWJ2kClYTcw/4fAg=="],
"@types/node": ["@types/node@18.19.130", "", { "dependencies": { "undici-types": "~5.26.4" } }, "sha512-GRaXQx6jGfL8sKfaIDD6OupbIHBr9jv7Jnaml9tB7l4v068PAOXqfcujMMo5PhbIs6ggR1XODELqahT2R8v0fg=="],
"@types/node-fetch": ["@types/node-fetch@2.6.13", "", { "dependencies": { "@types/node": "*", "form-data": "^4.0.4" } }, "sha512-QGpRVpzSaUs30JBSGPjOg4Uveu384erbHBoT1zeONvyCfwQxIkUshLAOqN/k9EjGviPRmWTTe6aH2qySWKTVSw=="],
"abort-controller": ["abort-controller@3.0.0", "", { "dependencies": { "event-target-shim": "^5.0.0" } }, "sha512-h8lQ8tacZYnR3vNQTgibj+tODHI5/+l06Au2Pcriv/Gmet0eaj4TwWH41sO9wnHDiQsEj19q0drzdWdeAHtweg=="],
"accepts": ["accepts@2.0.0", "", { "dependencies": { "mime-types": "^3.0.0", "negotiator": "^1.0.0" } }, "sha512-5cvg6CtKwfgdmVqY1WIiXKc3Q1bkRqGLi+2W/6ao+6Y7gu/RCwRuAhGEzh5B4KlszSuTLgZYuqFqo5bImjNKng=="],
"agentkeepalive": ["agentkeepalive@4.6.0", "", { "dependencies": { "humanize-ms": "^1.2.1" } }, "sha512-kja8j7PjmncONqaTsB8fQ+wE2mSU2DJ9D4XKoJ5PFWIdRMa6SLSN1ff4mOr4jCbfRSsxR4keIiySJU0N9T5hIQ=="],
"ajv": ["ajv@8.18.0", "", { "dependencies": { "fast-deep-equal": "^3.1.3", "fast-uri": "^3.0.1", "json-schema-traverse": "^1.0.0", "require-from-string": "^2.0.2" } }, "sha512-PlXPeEWMXMZ7sPYOHqmDyCJzcfNrUr3fGNKtezX14ykXOEIvyK81d+qydx89KY5O71FKMPaQ2vBfBFI5NHR63A=="],
"ajv-formats": ["ajv-formats@3.0.1", "", { "dependencies": { "ajv": "^8.0.0" } }, "sha512-8iUql50EUR+uUcdRQ3HDqa6EVyo3docL8g5WJ3FNcWmu62IbkGUue/pEyLBW8VGKKucTPgqeks4fIU1DA4yowQ=="],
"asynckit": ["asynckit@0.4.0", "", {}, "sha512-Oei9OH4tRh0YqU3GxhX79dM/mwVgvbZJaSNaRk+bshkj0S5cfHcgYakreBjrHwatXKbz+IoIdYLxrKim2MjW0Q=="],
"body-parser": ["body-parser@2.2.2", "", { "dependencies": { "bytes": "^3.1.2", "content-type": "^1.0.5", "debug": "^4.4.3", "http-errors": "^2.0.0", "iconv-lite": "^0.7.0", "on-finished": "^2.4.1", "qs": "^6.14.1", "raw-body": "^3.0.1", "type-is": "^2.0.1" } }, "sha512-oP5VkATKlNwcgvxi0vM0p/D3n2C3EReYVX+DNYs5TjZFn/oQt2j+4sVJtSMr18pdRr8wjTcBl6LoV+FUwzPmNA=="],
"bun-types": ["bun-types@1.3.11", "", { "dependencies": { "@types/node": "*" } }, "sha512-1KGPpoxQWl9f6wcZh57LvrPIInQMn2TQ7jsgxqpRzg+l0QPOFvJVH7HmvHo/AiPgwXy+/Thf6Ov3EdVn1vOabg=="],
"bytes": ["bytes@3.1.2", "", {}, "sha512-/Nf7TyzTx6S3yRJObOAV7956r8cr2+Oj8AC5dt8wSP3BQAoeX58NoHyCU8P8zGkNXStjTSi6fzO6F0pBdcYbEg=="],
"call-bind-apply-helpers": ["call-bind-apply-helpers@1.0.2", "", { "dependencies": { "es-errors": "^1.3.0", "function-bind": "^1.1.2" } }, "sha512-Sp1ablJ0ivDkSzjcaJdxEunN5/XvksFJ2sMBFfq6x0ryhQV/2b/KwFe21cMpmHtPOSij8K99/wSfoEuTObmuMQ=="],
"call-bound": ["call-bound@1.0.4", "", { "dependencies": { "call-bind-apply-helpers": "^1.0.2", "get-intrinsic": "^1.3.0" } }, "sha512-+ys997U96po4Kx/ABpBCqhA9EuxJaQWDQg7295H4hBphv3IZg0boBKuwYpt4YXp6MZ5AmZQnU/tyMTlRpaSejg=="],
"combined-stream": ["combined-stream@1.0.8", "", { "dependencies": { "delayed-stream": "~1.0.0" } }, "sha512-FQN4MRfuJeHf7cBbBMJFXhKSDq+2kAArBlmRBvcvFE5BB1HZKXtSFASDhdlz9zOYwxh8lDdnvmMOe/+5cdoEdg=="],
"commander": ["commander@14.0.3", "", {}, "sha512-H+y0Jo/T1RZ9qPP4Eh1pkcQcLRglraJaSLoyOtHxu6AapkjWVCy2Sit1QQ4x3Dng8qDlSsZEet7g5Pq06MvTgw=="],
"content-disposition": ["content-disposition@1.0.1", "", {}, "sha512-oIXISMynqSqm241k6kcQ5UwttDILMK4BiurCfGEREw6+X9jkkpEe5T9FZaApyLGGOnFuyMWZpdolTXMtvEJ08Q=="],
"content-type": ["content-type@1.0.5", "", {}, "sha512-nTjqfcBFEipKdXCv4YDQWCfmcLZKm81ldF0pAopTvyrFGVbcR6P/VAAd5G7N+0tTr8QqiU0tFadD6FK4NtJwOA=="],
"cookie": ["cookie@0.7.2", "", {}, "sha512-yki5XnKuf750l50uGTllt6kKILY4nQ1eNIQatoXEByZ5dWgnKqbnqmTrBE5B4N7lrMJKQ2ytWMiTO2o0v6Ew/w=="],
"cookie-signature": ["cookie-signature@1.2.2", "", {}, "sha512-D76uU73ulSXrD1UXF4KE2TMxVVwhsnCgfAyTg9k8P6KGZjlXKrOLe4dJQKI3Bxi5wjesZoFXJWElNWBjPZMbhg=="],
"cors": ["cors@2.8.6", "", { "dependencies": { "object-assign": "^4", "vary": "^1" } }, "sha512-tJtZBBHA6vjIAaF6EnIaq6laBBP9aq/Y3ouVJjEfoHbRBcHBAHYcMh/w8LDrk2PvIMMq8gmopa5D4V8RmbrxGw=="],
"cross-spawn": ["cross-spawn@7.0.6", "", { "dependencies": { "path-key": "^3.1.0", "shebang-command": "^2.0.0", "which": "^2.0.1" } }, "sha512-uV2QOWP2nWzsy2aMp8aRibhi9dlzF5Hgh5SHaB9OiTGEyDTiJJyx0uy51QXdyWbtAHNua4XJzUKca3OzKUd3vA=="],
"debug": ["debug@4.4.3", "", { "dependencies": { "ms": "^2.1.3" } }, "sha512-RGwwWnwQvkVfavKVt22FGLw+xYSdzARwm0ru6DhTVA3umU5hZc28V3kO4stgYryrTlLpuvgI9GiijltAjNbcqA=="],
"delayed-stream": ["delayed-stream@1.0.0", "", {}, "sha512-ZySD7Nf91aLB0RxL4KGrKHBXl7Eds1DAmEdcoVawXnLD7SDhpNgtuII2aAkg7a7QS41jxPSZ17p4VdGnMHk3MQ=="],
"depd": ["depd@2.0.0", "", {}, "sha512-g7nH6P6dyDioJogAAGprGpCtVImJhpPk/roCzdb3fIh61/s/nPsfR6onyMwkCAR/OlC3yBC0lESvUoQEAssIrw=="],
"dunder-proto": ["dunder-proto@1.0.1", "", { "dependencies": { "call-bind-apply-helpers": "^1.0.1", "es-errors": "^1.3.0", "gopd": "^1.2.0" } }, "sha512-KIN/nDJBQRcXw0MLVhZE9iQHmG68qAVIBg9CqmUYjmQIhgij9U5MFvrqkUL5FbtyyzZuOeOt0zdeRe4UY7ct+A=="],
"ee-first": ["ee-first@1.1.1", "", {}, "sha512-WMwm9LhRUo+WUaRN+vRuETqG89IgZphVSNkdFgeb6sS/E4OrDIN7t48CAewSHXc6C8lefD8KKfr5vY61brQlow=="],
"encodeurl": ["encodeurl@2.0.0", "", {}, "sha512-Q0n9HRi4m6JuGIV1eFlmvJB7ZEVxu93IrMyiMsGC0lrMJMWzRgx6WGquyfQgZVb31vhGgXnfmPNNXmxnOkRBrg=="],
"es-define-property": ["es-define-property@1.0.1", "", {}, "sha512-e3nRfgfUZ4rNGL232gUgX06QNyyez04KdjFrF+LTRoOXmrOgFKDg4BCdsjW8EnT69eqdYGmRpJwiPVYNrCaW3g=="],
"es-errors": ["es-errors@1.3.0", "", {}, "sha512-Zf5H2Kxt2xjTvbJvP2ZWLEICxA6j+hAmMzIlypy4xcBg1vKVnx89Wy0GbS+kf5cwCVFFzdCFh2XSCFNULS6csw=="],
"es-object-atoms": ["es-object-atoms@1.1.1", "", { "dependencies": { "es-errors": "^1.3.0" } }, "sha512-FGgH2h8zKNim9ljj7dankFPcICIK9Cp5bm+c2gQSYePhpaG5+esrLODihIorn+Pe6FGJzWhXQotPv73jTaldXA=="],
"es-set-tostringtag": ["es-set-tostringtag@2.1.0", "", { "dependencies": { "es-errors": "^1.3.0", "get-intrinsic": "^1.2.6", "has-tostringtag": "^1.0.2", "hasown": "^2.0.2" } }, "sha512-j6vWzfrGVfyXxge+O0x5sh6cvxAog0a/4Rdd2K36zCMV5eJ+/+tOAngRO8cODMNWbVRdVlmGZQL2YS3yR8bIUA=="],
"escape-html": ["escape-html@1.0.3", "", {}, "sha512-NiSupZ4OeuGwr68lGIeym/ksIZMJodUGOSCZ/FSnTxcrekbvqrgdUxlJOMpijaKZVjAJrWrGs/6Jy8OMuyj9ow=="],
"etag": ["etag@1.8.1", "", {}, "sha512-aIL5Fx7mawVa300al2BnEE4iNvo1qETxLrPI/o05L7z6go7fCw1J6EQmbK4FmJ2AS7kgVF/KEZWufBfdClMcPg=="],
"event-target-shim": ["event-target-shim@5.0.1", "", {}, "sha512-i/2XbnSz/uxRCU6+NdVJgKWDTM427+MqYbkQzD321DuCQJUqOuJKIA0IM2+W2xtYHdKOmZ4dR6fExsd4SXL+WQ=="],
"eventsource": ["eventsource@3.0.7", "", { "dependencies": { "eventsource-parser": "^3.0.1" } }, "sha512-CRT1WTyuQoD771GW56XEZFQ/ZoSfWid1alKGDYMmkt2yl8UXrVR4pspqWNEcqKvVIzg6PAltWjxcSSPrboA4iA=="],
"eventsource-parser": ["eventsource-parser@3.0.6", "", {}, "sha512-Vo1ab+QXPzZ4tCa8SwIHJFaSzy4R6SHf7BY79rFBDf0idraZWAkYrDjDj8uWaSm3S2TK+hJ7/t1CEmZ7jXw+pg=="],
"express": ["express@5.2.1", "", { "dependencies": { "accepts": "^2.0.0", "body-parser": "^2.2.1", "content-disposition": "^1.0.0", "content-type": "^1.0.5", "cookie": "^0.7.1", "cookie-signature": "^1.2.1", "debug": "^4.4.0", "depd": "^2.0.0", "encodeurl": "^2.0.0", "escape-html": "^1.0.3", "etag": "^1.8.1", "finalhandler": "^2.1.0", "fresh": "^2.0.0", "http-errors": "^2.0.0", "merge-descriptors": "^2.0.0", "mime-types": "^3.0.0", "on-finished": "^2.4.1", "once": "^1.4.0", "parseurl": "^1.3.3", "proxy-addr": "^2.0.7", "qs": "^6.14.0", "range-parser": "^1.2.1", "router": "^2.2.0", "send": "^1.1.0", "serve-static": "^2.2.0", "statuses": "^2.0.1", "type-is": "^2.0.1", "vary": "^1.1.2" } }, "sha512-hIS4idWWai69NezIdRt2xFVofaF4j+6INOpJlVOLDO8zXGpUVEVzIYk12UUi2JzjEzWL3IOAxcTubgz9Po0yXw=="],
"express-rate-limit": ["express-rate-limit@8.3.2", "", { "dependencies": { "ip-address": "10.1.0" }, "peerDependencies": { "express": ">= 4.11" } }, "sha512-77VmFeJkO0/rvimEDuUC5H30oqUC4EyOhyGccfqoLebB0oiEYfM7nwPrsDsBL1gsTpwfzX8SFy2MT3TDyRq+bg=="],
"fast-deep-equal": ["fast-deep-equal@3.1.3", "", {}, "sha512-f3qQ9oQy9j2AhBe/H9VC91wLmKBCCU/gDOnKNAYG5hswO7BLKj09Hc5HYNz9cGI++xlpDCIgDaitVs03ATR84Q=="],
"fast-uri": ["fast-uri@3.1.0", "", {}, "sha512-iPeeDKJSWf4IEOasVVrknXpaBV0IApz/gp7S2bb7Z4Lljbl2MGJRqInZiUrQwV16cpzw/D3S5j5Julj/gT52AA=="],
"finalhandler": ["finalhandler@2.1.1", "", { "dependencies": { "debug": "^4.4.0", "encodeurl": "^2.0.0", "escape-html": "^1.0.3", "on-finished": "^2.4.1", "parseurl": "^1.3.3", "statuses": "^2.0.1" } }, "sha512-S8KoZgRZN+a5rNwqTxlZZePjT/4cnm0ROV70LedRHZ0p8u9fRID0hJUZQpkKLzro8LfmC8sx23bY6tVNxv8pQA=="],
"form-data": ["form-data@4.0.5", "", { "dependencies": { "asynckit": "^0.4.0", "combined-stream": "^1.0.8", "es-set-tostringtag": "^2.1.0", "hasown": "^2.0.2", "mime-types": "^2.1.12" } }, "sha512-8RipRLol37bNs2bhoV67fiTEvdTrbMUYcFTiy3+wuuOnUog2QBHCZWXDRijWQfAkhBj2Uf5UnVaiWwA5vdd82w=="],
"form-data-encoder": ["form-data-encoder@1.7.2", "", {}, "sha512-qfqtYan3rxrnCk1VYaA4H+Ms9xdpPqvLZa6xmMgFvhO32x7/3J/ExcTd6qpxM0vH2GdMI+poehyBZvqfMTto8A=="],
"formdata-node": ["formdata-node@4.4.1", "", { "dependencies": { "node-domexception": "1.0.0", "web-streams-polyfill": "4.0.0-beta.3" } }, "sha512-0iirZp3uVDjVGt9p49aTaqjk84TrglENEDuqfdlZQ1roC9CWlPk6Avf8EEnZNcAqPonwkG35x4n3ww/1THYAeQ=="],
"forwarded": ["forwarded@0.2.0", "", {}, "sha512-buRG0fpBtRHSTCOASe6hD258tEubFoRLb4ZNA6NxMVHNw2gOcwHo9wyablzMzOA5z9xA9L1KNjk/Nt6MT9aYow=="],
"fresh": ["fresh@2.0.0", "", {}, "sha512-Rx/WycZ60HOaqLKAi6cHRKKI7zxWbJ31MhntmtwMoaTeF7XFH9hhBp8vITaMidfljRQ6eYWCKkaTK+ykVJHP2A=="],
"function-bind": ["function-bind@1.1.2", "", {}, "sha512-7XHNxH7qX9xG5mIwxkhumTox/MIRNcOgDrxWsMt2pAr23WHp6MrRlN7FBSFpCpr+oVO0F744iUgR82nJMfG2SA=="],
"get-intrinsic": ["get-intrinsic@1.3.0", "", { "dependencies": { "call-bind-apply-helpers": "^1.0.2", "es-define-property": "^1.0.1", "es-errors": "^1.3.0", "es-object-atoms": "^1.1.1", "function-bind": "^1.1.2", "get-proto": "^1.0.1", "gopd": "^1.2.0", "has-symbols": "^1.1.0", "hasown": "^2.0.2", "math-intrinsics": "^1.1.0" } }, "sha512-9fSjSaos/fRIVIp+xSJlE6lfwhES7LNtKaCBIamHsjr2na1BiABJPo0mOjjz8GJDURarmCPGqaiVg5mfjb98CQ=="],
"get-proto": ["get-proto@1.0.1", "", { "dependencies": { "dunder-proto": "^1.0.1", "es-object-atoms": "^1.0.0" } }, "sha512-sTSfBjoXBp89JvIKIefqw7U2CCebsc74kiY6awiGogKtoSGbgjYE/G/+l9sF3MWFPNc9IcoOC4ODfKHfxFmp0g=="],
"gopd": ["gopd@1.2.0", "", {}, "sha512-ZUKRh6/kUFoAiTAtTYPZJ3hw9wNxx+BIBOijnlG9PnrJsCcSjs1wyyD6vJpaYtgnzDrKYRSqf3OO6Rfa93xsRg=="],
"has-symbols": ["has-symbols@1.1.0", "", {}, "sha512-1cDNdwJ2Jaohmb3sg4OmKaMBwuC48sYni5HUw2DvsC8LjGTLK9h+eb1X6RyuOHe4hT0ULCW68iomhjUoKUqlPQ=="],
"has-tostringtag": ["has-tostringtag@1.0.2", "", { "dependencies": { "has-symbols": "^1.0.3" } }, "sha512-NqADB8VjPFLM2V0VvHUewwwsw0ZWBaIdgo+ieHtK3hasLz4qeCRjYcqfB6AQrBggRKppKF8L52/VqdVsO47Dlw=="],
"hasown": ["hasown@2.0.2", "", { "dependencies": { "function-bind": "^1.1.2" } }, "sha512-0hJU9SCPvmMzIBdZFqNPXWa6dqh7WdH0cII9y+CyS8rG3nL48Bclra9HmKhVVUHyPWNH5Y7xDwAB7bfgSjkUMQ=="],
"hono": ["hono@4.12.9", "", {}, "sha512-wy3T8Zm2bsEvxKZM5w21VdHDDcwVS1yUFFY6i8UobSsKfFceT7TOwhbhfKsDyx7tYQlmRM5FLpIuYvNFyjctiA=="],
"http-errors": ["http-errors@2.0.1", "", { "dependencies": { "depd": "~2.0.0", "inherits": "~2.0.4", "setprototypeof": "~1.2.0", "statuses": "~2.0.2", "toidentifier": "~1.0.1" } }, "sha512-4FbRdAX+bSdmo4AUFuS0WNiPz8NgFt+r8ThgNWmlrjQjt1Q7ZR9+zTlce2859x4KSXrwIsaeTqDoKQmtP8pLmQ=="],
"humanize-ms": ["humanize-ms@1.2.1", "", { "dependencies": { "ms": "^2.0.0" } }, "sha512-Fl70vYtsAFb/C06PTS9dZBo7ihau+Tu/DNCk/OyHhea07S+aeMWpFFkUaXRa8fI+ScZbEI8dfSxwY7gxZ9SAVQ=="],
"iconv-lite": ["iconv-lite@0.7.2", "", { "dependencies": { "safer-buffer": ">= 2.1.2 < 3.0.0" } }, "sha512-im9DjEDQ55s9fL4EYzOAv0yMqmMBSZp6G0VvFyTMPKWxiSBHUj9NW/qqLmXUwXrrM7AvqSlTCfvqRb0cM8yYqw=="],
"inherits": ["inherits@2.0.4", "", {}, "sha512-k/vGaX4/Yla3WzyMCvTQOXYeIHvqOKtnqBduzTHpzpQZzAskKMhZ2K+EnBiSM9zGSoIFeMpXKxa4dYeZIQqewQ=="],
"ip-address": ["ip-address@10.1.0", "", {}, "sha512-XXADHxXmvT9+CRxhXg56LJovE+bmWnEWB78LB83VZTprKTmaC5QfruXocxzTZ2Kl0DNwKuBdlIhjL8LeY8Sf8Q=="],
"ipaddr.js": ["ipaddr.js@1.9.1", "", {}, "sha512-0KI/607xoxSToH7GjN1FfSbLoU0+btTicjsQSWQlh/hZykN8KpmMf7uYwPW3R+akZ6R/w18ZlXSHBYXiYUPO3g=="],
"is-promise": ["is-promise@4.0.0", "", {}, "sha512-hvpoI6korhJMnej285dSg6nu1+e6uxs7zG3BYAm5byqDsgJNWwxzM6z6iZiAgQR4TJ30JmBTOwqZUw3WlyH3AQ=="],
"isexe": ["isexe@2.0.0", "", {}, "sha512-RHxMLp9lnKHGHRng9QFhRCMbYAcVpn69smSGcq3f36xjgVVWThj4qqLbTLlq7Ssj8B+fIQ1EuCEGI2lKsyQeIw=="],
"jose": ["jose@6.2.2", "", {}, "sha512-d7kPDd34KO/YnzaDOlikGpOurfF0ByC2sEV4cANCtdqLlTfBlw2p14O/5d/zv40gJPbIQxfES3nSx1/oYNyuZQ=="],
"json-schema-to-ts": ["json-schema-to-ts@3.1.1", "", { "dependencies": { "@babel/runtime": "^7.18.3", "ts-algebra": "^2.0.0" } }, "sha512-+DWg8jCJG2TEnpy7kOm/7/AxaYoaRbjVB4LFZLySZlWn8exGs3A4OLJR966cVvU26N7X9TWxl+Jsw7dzAqKT6g=="],
"json-schema-traverse": ["json-schema-traverse@1.0.0", "", {}, "sha512-NM8/P9n3XjXhIZn1lLhkFaACTOURQXjWhV4BA/RnOv8xvgqtqpAX9IO4mRQxSx1Rlo4tqzeqb0sOlruaOy3dug=="],
"json-schema-typed": ["json-schema-typed@8.0.2", "", {}, "sha512-fQhoXdcvc3V28x7C7BMs4P5+kNlgUURe2jmUT1T//oBRMDrqy1QPelJimwZGo7Hg9VPV3EQV5Bnq4hbFy2vetA=="],
"math-intrinsics": ["math-intrinsics@1.1.0", "", {}, "sha512-/IXtbwEk5HTPyEwyKX6hGkYXxM9nbj64B+ilVJnC/R6B0pH5G4V3b0pVbL7DBj4tkhBAppbQUlf6F6Xl9LHu1g=="],
"media-typer": ["media-typer@1.1.0", "", {}, "sha512-aisnrDP4GNe06UcKFnV5bfMNPBUw4jsLGaWwWfnH3v02GnBuXX2MCVn5RbrWo0j3pczUilYblq7fQ7Nw2t5XKw=="],
"merge-descriptors": ["merge-descriptors@2.0.0", "", {}, "sha512-Snk314V5ayFLhp3fkUREub6WtjBfPdCPY1Ln8/8munuLuiYhsABgBVWsozAG+MWMbVEvcdcpbi9R7ww22l9Q3g=="],
"mime-db": ["mime-db@1.54.0", "", {}, "sha512-aU5EJuIN2WDemCcAp2vFBfp/m4EAhWJnUNSSw0ixs7/kXbd6Pg64EmwJkNdFhB8aWt1sH2CTXrLxo/iAGV3oPQ=="],
"mime-types": ["mime-types@3.0.2", "", { "dependencies": { "mime-db": "^1.54.0" } }, "sha512-Lbgzdk0h4juoQ9fCKXW4by0UJqj+nOOrI9MJ1sSj4nI8aI2eo1qmvQEie4VD1glsS250n15LsWsYtCugiStS5A=="],
"ms": ["ms@2.1.3", "", {}, "sha512-6FlzubTLZG3J2a/NVCAleEhjzq5oxgHyaCU9yYXvcLsvoVaHJq/s5xXI6/XXP6tz7R9xAOtHnSO/tXtF3WRTlA=="],
"negotiator": ["negotiator@1.0.0", "", {}, "sha512-8Ofs/AUQh8MaEcrlq5xOX0CQ9ypTF5dl78mjlMNfOK08fzpgTHQRQPBxcPlEtIw0yRpws+Zo/3r+5WRby7u3Gg=="],
"node-domexception": ["node-domexception@1.0.0", "", {}, "sha512-/jKZoMpw0F8GRwl4/eLROPA3cfcXtLApP0QzLmUT/HuPCZWyB7IY9ZrMeKw2O/nFIqPQB3PVM9aYm0F312AXDQ=="],
"node-fetch": ["node-fetch@2.7.0", "", { "dependencies": { "whatwg-url": "^5.0.0" }, "peerDependencies": { "encoding": "^0.1.0" }, "optionalPeers": ["encoding"] }, "sha512-c4FRfUm/dbcWZ7U+1Wq0AwCyFL+3nt2bEw05wfxSz+DWpWsitgmSgYmy2dQdWyKC1694ELPqMs/YzUSNozLt8A=="],
"object-assign": ["object-assign@4.1.1", "", {}, "sha512-rJgTQnkUnH1sFw8yT6VSU3zD3sWmu6sZhIseY8VX+GRu3P6F7Fu+JNDoXfklElbLJSnc3FUQHVe4cU5hj+BcUg=="],
"object-inspect": ["object-inspect@1.13.4", "", {}, "sha512-W67iLl4J2EXEGTbfeHCffrjDfitvLANg0UlX3wFUUSTx92KXRFegMHUVgSqE+wvhAbi4WqjGg9czysTV2Epbew=="],
"on-finished": ["on-finished@2.4.1", "", { "dependencies": { "ee-first": "1.1.1" } }, "sha512-oVlzkg3ENAhCk2zdv7IJwd/QUD4z2RxRwpkcGY8psCVcCYZNq4wYnVWALHM+brtuJjePWiYF/ClmuDr8Ch5+kg=="],
"once": ["once@1.4.0", "", { "dependencies": { "wrappy": "1" } }, "sha512-lNaJgI+2Q5URQBkccEKHTQOPaXdUxnZZElQTZY0MFUAuaEqe1E+Nyvgdz/aIyNi6Z9MzO5dv1H8n58/GELp3+w=="],
"openai": ["openai@6.34.0", "", { "peerDependencies": { "ws": "^8.18.0", "zod": "^3.25 || ^4.0" }, "optionalPeers": ["ws", "zod"], "bin": { "openai": "bin/cli" } }, "sha512-yEr2jdGf4tVFYG6ohmr3pF6VJuveP0EA/sS8TBx+4Eq5NT10alu5zg2dmxMXMgqpihRDQlFGpRt2XwsGj+Fyxw=="],
"parseurl": ["parseurl@1.3.3", "", {}, "sha512-CiyeOxFT/JZyN5m0z9PfXw4SCBJ6Sygz1Dpl0wqjlhDEGGBP1GnsUVEL0p63hoG1fcj3fHynXi9NYO4nWOL+qQ=="],
"path-key": ["path-key@3.1.1", "", {}, "sha512-ojmeN0qd+y0jszEtoY48r0Peq5dwMEkIlCOu6Q5f41lfkswXuKtYrhgoTpLnyIcHm24Uhqx+5Tqm2InSwLhE6Q=="],
"path-to-regexp": ["path-to-regexp@8.4.1", "", {}, "sha512-fvU78fIjZ+SBM9YwCknCvKOUKkLVqtWDVctl0s7xIqfmfb38t2TT4ZU2gHm+Z8xGwgW+QWEU3oQSAzIbo89Ggw=="],
"pkce-challenge": ["pkce-challenge@5.0.1", "", {}, "sha512-wQ0b/W4Fr01qtpHlqSqspcj3EhBvimsdh0KlHhH8HRZnMsEa0ea2fTULOXOS9ccQr3om+GcGRk4e+isrZWV8qQ=="],
"proxy-addr": ["proxy-addr@2.0.7", "", { "dependencies": { "forwarded": "0.2.0", "ipaddr.js": "1.9.1" } }, "sha512-llQsMLSUDUPT44jdrU/O37qlnifitDP+ZwrmmZcoSKyLKvtZxpyV0n2/bD/N4tBAAZ/gJEdZU7KMraoK1+XYAg=="],
"qs": ["qs@6.15.0", "", { "dependencies": { "side-channel": "^1.1.0" } }, "sha512-mAZTtNCeetKMH+pSjrb76NAM8V9a05I9aBZOHztWy/UqcJdQYNsf59vrRKWnojAT9Y+GbIvoTBC++CPHqpDBhQ=="],
"range-parser": ["range-parser@1.2.1", "", {}, "sha512-Hrgsx+orqoygnmhFbKaHE6c296J+HTAQXoxEF6gNupROmmGJRoyzfG3ccAveqCBrwr/2yxQ5BVd/GTl5agOwSg=="],
"raw-body": ["raw-body@3.0.2", "", { "dependencies": { "bytes": "~3.1.2", "http-errors": "~2.0.1", "iconv-lite": "~0.7.0", "unpipe": "~1.0.0" } }, "sha512-K5zQjDllxWkf7Z5xJdV0/B0WTNqx6vxG70zJE4N0kBs4LovmEYWJzQGxC9bS9RAKu3bgM40lrd5zoLJ12MQ5BA=="],
"require-from-string": ["require-from-string@2.0.2", "", {}, "sha512-Xf0nWe6RseziFMu+Ap9biiUbmplq6S9/p+7w7YXP/JBHhrUDDUhwa+vANyubuqfZWTveU//DYVGsDG7RKL/vEw=="],
"router": ["router@2.2.0", "", { "dependencies": { "debug": "^4.4.0", "depd": "^2.0.0", "is-promise": "^4.0.0", "parseurl": "^1.3.3", "path-to-regexp": "^8.0.0" } }, "sha512-nLTrUKm2UyiL7rlhapu/Zl45FwNgkZGaCpZbIHajDYgwlJCOzLSk+cIPAnsEqV955GjILJnKbdQC1nVPz+gAYQ=="],
"safer-buffer": ["safer-buffer@2.1.2", "", {}, "sha512-YZo3K82SD7Riyi0E1EQPojLz7kpepnSQI9IyPbHHg1XXXevb5dJI7tpyN2ADxGcQbHG7vcyRHk0cbwqcQriUtg=="],
"send": ["send@1.2.1", "", { "dependencies": { "debug": "^4.4.3", "encodeurl": "^2.0.0", "escape-html": "^1.0.3", "etag": "^1.8.1", "fresh": "^2.0.0", "http-errors": "^2.0.1", "mime-types": "^3.0.2", "ms": "^2.1.3", "on-finished": "^2.4.1", "range-parser": "^1.2.1", "statuses": "^2.0.2" } }, "sha512-1gnZf7DFcoIcajTjTwjwuDjzuz4PPcY2StKPlsGAQ1+YH20IRVrBaXSWmdjowTJ6u8Rc01PoYOGHXfP1mYcZNQ=="],
"serve-static": ["serve-static@2.2.1", "", { "dependencies": { "encodeurl": "^2.0.0", "escape-html": "^1.0.3", "parseurl": "^1.3.3", "send": "^1.2.0" } }, "sha512-xRXBn0pPqQTVQiC8wyQrKs2MOlX24zQ0POGaj0kultvoOCstBQM5yvOhAVSUwOMjQtTvsPWoNCHfPGwaaQJhTw=="],
"setprototypeof": ["setprototypeof@1.2.0", "", {}, "sha512-E5LDX7Wrp85Kil5bhZv46j8jOeboKq5JMmYM3gVGdGH8xFpPWXUMsNrlODCrkoxMEeNi/XZIwuRvY4XNwYMJpw=="],
"shebang-command": ["shebang-command@2.0.0", "", { "dependencies": { "shebang-regex": "^3.0.0" } }, "sha512-kHxr2zZpYtdmrN1qDjrrX/Z1rR1kG8Dx+gkpK1G4eXmvXswmcE1hTWBWYUzlraYw1/yZp6YuDY77YtvbN0dmDA=="],
"shebang-regex": ["shebang-regex@3.0.0", "", {}, "sha512-7++dFhtcx3353uBaq8DDR4NuxBetBzC7ZQOhmTQInHEd6bSrXdiEyzCvG07Z44UYdLShWUyXt5M/yhz8ekcb1A=="],
"side-channel": ["side-channel@1.1.0", "", { "dependencies": { "es-errors": "^1.3.0", "object-inspect": "^1.13.3", "side-channel-list": "^1.0.0", "side-channel-map": "^1.0.1", "side-channel-weakmap": "^1.0.2" } }, "sha512-ZX99e6tRweoUXqR+VBrslhda51Nh5MTQwou5tnUDgbtyM0dBgmhEDtWGP/xbKn6hqfPRHujUNwz5fy/wbbhnpw=="],
"side-channel-list": ["side-channel-list@1.0.0", "", { "dependencies": { "es-errors": "^1.3.0", "object-inspect": "^1.13.3" } }, "sha512-FCLHtRD/gnpCiCHEiJLOwdmFP+wzCmDEkc9y7NsYxeF4u7Btsn1ZuwgwJGxImImHicJArLP4R0yX4c2KCrMrTA=="],
"side-channel-map": ["side-channel-map@1.0.1", "", { "dependencies": { "call-bound": "^1.0.2", "es-errors": "^1.3.0", "get-intrinsic": "^1.2.5", "object-inspect": "^1.13.3" } }, "sha512-VCjCNfgMsby3tTdo02nbjtM/ewra6jPHmpThenkTYh8pG9ucZ/1P8So4u4FGBek/BjpOVsDCMoLA/iuBKIFXRA=="],
"side-channel-weakmap": ["side-channel-weakmap@1.0.2", "", { "dependencies": { "call-bound": "^1.0.2", "es-errors": "^1.3.0", "get-intrinsic": "^1.2.5", "object-inspect": "^1.13.3", "side-channel-map": "^1.0.1" } }, "sha512-WPS/HvHQTYnHisLo9McqBHOJk2FkHO/tlpvldyrnem4aeQp4hai3gythswg6p01oSoTl58rcpiFAjF2br2Ak2A=="],
"statuses": ["statuses@2.0.2", "", {}, "sha512-DvEy55V3DB7uknRo+4iOGT5fP1slR8wQohVdknigZPMpMstaKJQWhwiYBACJE3Ul2pTnATihhBYnRhZQHGBiRw=="],
"toidentifier": ["toidentifier@1.0.1", "", {}, "sha512-o5sSPKEkg/DIQNmH43V0/uerLrpzVedkUh8tGNvaeXpfpuwjKenlSox/2O/BTlZUtEe+JG7s5YhEz608PlAHRA=="],
"tr46": ["tr46@0.0.3", "", {}, "sha512-N3WMsuqV66lT30CrXNbEjx4GEwlow3v6rr4mCcv6prnfwhS01rkgyFdjPNBYd9br7LpXV1+Emh01fHnq2Gdgrw=="],
"ts-algebra": ["ts-algebra@2.0.0", "", {}, "sha512-FPAhNPFMrkwz76P7cdjdmiShwMynZYN6SgOujD1urY4oNm80Ou9oMdmbR45LotcKOXoy7wSmHkRFE6Mxbrhefw=="],
"type-is": ["type-is@2.0.1", "", { "dependencies": { "content-type": "^1.0.5", "media-typer": "^1.1.0", "mime-types": "^3.0.0" } }, "sha512-OZs6gsjF4vMp32qrCbiVSkrFmXtG/AZhY3t0iAMrMBiAZyV9oALtXO8hsrHbMXF9x6L3grlFuwW2oAz7cav+Gw=="],
"typescript": ["typescript@5.9.3", "", { "bin": { "tsc": "bin/tsc", "tsserver": "bin/tsserver" } }, "sha512-jl1vZzPDinLr9eUt3J/t7V6FgNEw9QjvBPdysz9KfQDD41fQrC2Y4vKQdiaUpFT4bXlb1RHhLpp8wtm6M5TgSw=="],
"undici-types": ["undici-types@5.26.5", "", {}, "sha512-JlCMO+ehdEIKqlFxk6IfVoAUVmgz7cU7zD/h9XZ0qzeosSHmUJVOzSQvvYSYWXkFXC+IfLKSIffhv0sVZup6pA=="],
"unpipe": ["unpipe@1.0.0", "", {}, "sha512-pjy2bYhSsufwWlKwPc+l3cN7+wuJlK6uz0YdJEOlQDbl6jo/YlPi4mb8agUkVC8BF7V8NuzeyPNqRksA3hztKQ=="],
"vary": ["vary@1.1.2", "", {}, "sha512-BNGbWLfd0eUPabhkXUVm0j8uuvREyTh5ovRa/dyow/BqAbZJyC+5fU+IzQOzmAKzYqYRAISoRhdQr3eIZ/PXqg=="],
"web-streams-polyfill": ["web-streams-polyfill@4.0.0-beta.3", "", {}, "sha512-QW95TCTaHmsYfHDybGMwO5IJIM93I/6vTRk+daHTWFPhwh+C8Cg7j7XyKrwrj8Ib6vYXe0ocYNrmzY4xAAN6ug=="],
"webidl-conversions": ["webidl-conversions@3.0.1", "", {}, "sha512-2JAn3z8AR6rjK8Sm8orRC0h/bcl/DqL7tRPdGZ4I1CjdF+EaMLmYxBHyXuKL849eucPFhvBoxMsflfOb8kxaeQ=="],
"whatwg-url": ["whatwg-url@5.0.0", "", { "dependencies": { "tr46": "~0.0.3", "webidl-conversions": "^3.0.0" } }, "sha512-saE57nupxk6v3HY35+jzBwYa0rKSy0XR8JSxZPwgLr7ys0IBzhGviA1/TUGJLmSVqs8pb9AnvICXEuOHLprYTw=="],
"which": ["which@2.0.2", "", { "dependencies": { "isexe": "^2.0.0" }, "bin": { "node-which": "./bin/node-which" } }, "sha512-BLI3Tl1TW3Pvl70l3yq3Y64i+awpwXqsGBYWkkqMtnbXgrMD+yj7rhW0kuEDxzJaYXGjEW5ogapKNMEKNMjibA=="],
"wrappy": ["wrappy@1.0.2", "", {}, "sha512-l4Sp/DRseor9wL6EvV2+TuQn63dMkPjZ/sp9XkghTEbV9KlPS1xUsZ3u7/IQO4wxtcFB4bgpQPRcR3QCvezPcQ=="],
"yaml": ["yaml@2.8.3", "", { "bin": { "yaml": "bin.mjs" } }, "sha512-AvbaCLOO2Otw/lW5bmh9d/WEdcDFdQp2Z2ZUH3pX9U2ihyUY0nvLv7J6TrWowklRGPYbB/IuIMfYgxaCPg5Bpg=="],
"zod": ["zod@4.3.6", "", {}, "sha512-rftlrkhHZOcjDwkGlnUtZZkvaPHCsDATp4pGpuOOMDaTdDDXF91wuVDJoWoPsKX/3YPQ5fHuF3STjcYyKr+Qhg=="],
"zod-to-json-schema": ["zod-to-json-schema@3.25.2", "", { "peerDependencies": { "zod": "^3.25.28 || ^4" } }, "sha512-O/PgfnpT1xKSDeQYSCfRI5Gy3hPf91mKVDuYLUHZJMiDFptvP41MSnWofm8dnCm0256ZNfZIM7DSzuSMAFnjHA=="],
"@anthropic-ai/claude-agent-sdk/@anthropic-ai/sdk": ["@anthropic-ai/sdk@0.74.0", "", { "dependencies": { "json-schema-to-ts": "^3.1.1" }, "peerDependencies": { "zod": "^3.25.0 || ^4.0.0" }, "optionalPeers": ["zod"], "bin": { "anthropic-ai-sdk": "bin/cli" } }, "sha512-srbJV7JKsc5cQ6eVuFzjZO7UR3xEPJqPamHFIe29bs38Ij2IripoAhC0S5NslNbaFUYqBKypmmpzMTpqfHEUDw=="],
"@types/node-fetch/@types/node": ["@types/node@25.5.0", "", { "dependencies": { "undici-types": "~7.18.0" } }, "sha512-jp2P3tQMSxWugkCUKLRPVUpGaL5MVFwF8RDuSRztfwgN1wmqJeMSbKlnEtQqU8UrhTmzEmZdu2I6v2dpp7XIxw=="],
"bun-types/@types/node": ["@types/node@25.5.0", "", { "dependencies": { "undici-types": "~7.18.0" } }, "sha512-jp2P3tQMSxWugkCUKLRPVUpGaL5MVFwF8RDuSRztfwgN1wmqJeMSbKlnEtQqU8UrhTmzEmZdu2I6v2dpp7XIxw=="],
"form-data/mime-types": ["mime-types@2.1.35", "", { "dependencies": { "mime-db": "1.52.0" } }, "sha512-ZDY+bPm5zTTF+YpCrAU9nK0UgICYPT0QtT1NZWFv4s++TNkcgVaT0g6+4R2uI4MjQjzysHB1zxuWL50hzaeXiw=="],
"@types/node-fetch/@types/node/undici-types": ["undici-types@7.18.2", "", {}, "sha512-AsuCzffGHJybSaRrmr5eHr81mwJU3kjw6M+uprWvCXiNeN9SOGwQ3Jn8jb8m3Z6izVgknn1R0FTCEAP2QrLY/w=="],
"bun-types/@types/node/undici-types": ["undici-types@7.18.2", "", {}, "sha512-AsuCzffGHJybSaRrmr5eHr81mwJU3kjw6M+uprWvCXiNeN9SOGwQ3Jn8jb8m3Z6izVgknn1R0FTCEAP2QrLY/w=="],
"form-data/mime-types/mime-db": ["mime-db@1.52.0", "", {}, "sha512-sPU4uV7dYlvtWJxwwxHD0PuihVNiE7TyAbQ5SWxDCB9mUYvOgroQOwYQQOKPJ8CIbE+1ETVlOoK1UC2nU3gYvg=="],
}
}


@@ -1,93 +0,0 @@
- id: app-test1-counter-create
prompt: |-
Create a simple counter app with increment and decrement buttons.
judgeChecklist:
- shows the current count in the UI
- includes an increment button
- includes a decrement button
- clicking the buttons updates the count correctly
- id: app-test2-counter-reset
prompt: |-
Add a reset button that sets the counter back to 0
initial: ai_evals/fixtures/frontend/app/initial/test1_counter_app
judgeChecklist:
- adds a reset control to the existing counter app
- clicking reset sets the count back to 0
- keeps the existing increment and decrement behavior working
- id: app-test3-shopping-cart-quantity
prompt: |-
Add a quantity selector (+ and - buttons) to each cart item so users can adjust quantities without removing and re-adding items
initial: ai_evals/fixtures/frontend/app/initial/shopping_cart
judgeChecklist:
- each cart item has visible plus and minus quantity controls
- users can increase quantity without re-adding the product
- users can decrease quantity from the cart UI
- cart totals stay in sync with quantity changes
- id: app-test4-shopping-cart-discount
prompt: |-
Add a discount code input field in the cart.
When the code "SAVE10" is entered, apply a 10% discount to the total
initial: ai_evals/fixtures/frontend/app/initial/shopping_cart
judgeChecklist:
- adds a discount code input to the cart
- recognizes the code SAVE10
- applies a 10 percent discount to the displayed total
- keeps the rest of the cart behavior intact
- id: app-test5-file-manager-search
prompt: |-
Add a search bar in the toolbar that filters files and folders by name as the user types
initial: ai_evals/fixtures/frontend/app/initial/file_manager
judgeChecklist:
- adds a search input in the toolbar
- filters files and folders by name as the user types
- updates the visible file list from the search query
- keeps the rest of the file manager usable
- id: app-test6-file-manager-inline-rename
prompt: |-
Let users rename files and folders directly from the file list without leaving the page.
initial: ai_evals/fixtures/frontend/app/initial/file_manager
judgeChecklist:
- adds a visible rename action or inline edit mode in the file list
- lets users edit an item's name directly from the list
- saves the renamed item through the app's existing rename behavior
- refreshes the displayed name after a successful rename
- id: app-test7-file-manager-select-all
prompt: |-
Add a "Select All" checkbox in the file list header and individual checkboxes for each file.
Add a "Delete Selected" button that appears when items are selected
initial: ai_evals/fixtures/frontend/app/initial/file_manager
judgeChecklist:
- adds a select-all control in the file list header
- adds per-item selection controls
- shows a delete-selected action only when there is a selection
- deleting selected items updates the visible list
- id: app-test8-inventory-tracker-create
prompt: |-
Create an inventory tracker app for a small store.
Users should be able to add items with a name, sku, quantity, and price, search items by name or sku, and delete items.
The inventory should persist between sessions.
judgeChecklist:
- includes a form to add inventory items with name, sku, quantity, and price
- shows a list or table of saved inventory items
- supports searching or filtering by name or sku
- lets users delete existing inventory items
- persists the inventory data appropriately for a raw Windmill app
- id: app-test9-recipe-book-create
prompt: |-
Create a recipe book app where users can add recipes with a name, ingredients list, and instructions.
Include a search bar to filter recipes by name and the ability to delete recipes.
Recipes should persist between sessions.
judgeChecklist:
- includes a form to add recipes with name, ingredients, and instructions
- shows saved recipes in the app
- supports searching recipes by name
- lets users delete recipes
- persists recipes appropriately for a raw Windmill app
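
The app cases above all share the same small set of fields. As a rough sketch only, the shape can be written as a TypeScript interface with a runtime guard. The names `AppCaseSketch` and `isAppCaseSketch` are hypothetical and are not taken from the repository; the real case type lives in modules that are not shown here.

```typescript
// Hypothetical shape of one app case entry. Field names mirror the YAML
// above; the actual type used by the eval runner is not shown in this diff.
interface AppCaseSketch {
  id: string;
  prompt: string;
  initial?: string; // fixture path, only present on "modify" style cases
  judgeChecklist: string[];
}

// Runtime guard: true only when every required field has the expected type.
function isAppCaseSketch(value: unknown): value is AppCaseSketch {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const optionalString = (x: unknown) => x === undefined || typeof x === "string";
  return (
    typeof v.id === "string" &&
    v.id.length > 0 &&
    typeof v.prompt === "string" &&
    optionalString(v.initial) &&
    Array.isArray(v.judgeChecklist) &&
    v.judgeChecklist.length > 0 &&
    v.judgeChecklist.every((item) => typeof item === "string")
  );
}

// Sample entry copied from the case list above (checklist shortened).
const sampleCase: unknown = {
  id: "app-test2-counter-reset",
  prompt: "Add a reset button that sets the counter back to 0",
  initial: "ai_evals/fixtures/frontend/app/initial/test1_counter_app",
  judgeChecklist: ["adds a reset control to the existing counter app"],
};

console.log(isAppCaseSketch(sampleCase)); // true
```

A guard like this would reject an entry that is missing its `judgeChecklist`, which is the kind of mistake that is easy to make when adding cases by hand.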


@@ -1,66 +0,0 @@
- id: bun-hello-script
prompt: |-
Create a Windmill Bun script at `f/evals/hello.ts`.
It should take a `name` input and return a greeting object like `{ greeting: "Hello, Alice!" }`.
expected: ai_evals/fixtures/cli/expected/bun-hello-script
judgeChecklist:
- creates the requested Bun script at f/evals/hello.ts
- takes a name input
- returns an object containing the greeting
- id: bun-hello-flow
prompt: |-
Create a Windmill flow at `f/evals/hello__flow`.
It should take a `name` input and return a greeting object like `{ greeting: "Hello, Alice!" }`.
Put the step code in `hello.ts`.
expected: ai_evals/fixtures/cli/expected/bun-hello-flow
judgeChecklist:
- creates the requested flow folder with flow.yaml and hello.ts
- wires the name input into the flow step
- returns the greeting object
- id: python-add-numbers-script
prompt: |-
Add a Windmill Python script at `f/evals/add_numbers.py`.
It should take `a` and `b` as inputs and return `{ "total": a + b }`.
expected: ai_evals/fixtures/cli/expected/python-add-numbers-script
judgeChecklist:
- creates the requested Python script at f/evals/add_numbers.py
- takes `a` and `b` as inputs
- returns an object with total equal to a plus b
- id: bun-hello-script-uppercase
prompt: |-
Update `f/evals/hello.ts` so it accepts an optional `uppercase` boolean.
Keep returning `{ greeting: ... }`, but when `uppercase` is true the greeting should be uppercased before returning it.
initial: ai_evals/fixtures/cli/initial/bun-hello-script-uppercase
expected: ai_evals/fixtures/cli/expected/bun-hello-script-uppercase
judgeChecklist:
- updates the existing hello.ts file rather than creating a new script
- accepts an optional uppercase boolean input
- keeps returning an object with greeting
- uppercases the greeting when uppercase is true
- id: bun-hello-flow-punctuation
prompt: |-
Update the existing flow in `f/evals/hello__flow` so it also accepts an optional `punctuation` input.
The greeting should use that punctuation and default to `!` when it is missing.
initial: ai_evals/fixtures/cli/initial/bun-hello-flow-punctuation
expected: ai_evals/fixtures/cli/expected/bun-hello-flow-punctuation
judgeChecklist:
- updates the existing hello flow instead of creating a new one
- adds an optional punctuation input to the flow
- updates the step code so the returned greeting uses punctuation
- defaults punctuation to an exclamation mark when omitted
- id: flow-reuse-existing-script
prompt: |-
There is already a reusable greeting script at `f/lib/format_greeting.ts`.
Create a flow at `f/evals/reuse_greeting__flow` that takes a `name` input and reuses that existing script instead of duplicating the logic inline.
initial: ai_evals/fixtures/cli/initial/flow-reuse-existing-script
expected: ai_evals/fixtures/cli/expected/flow-reuse-existing-script
judgeChecklist:
- creates the requested flow at f/evals/reuse_greeting__flow
- reuses the existing script from f/lib by path
- does not duplicate the greeting logic in a new inline script
- wires the name input into the reused script


@@ -1,246 +0,0 @@
- id: flow-test0-sum-two-numbers
prompt: |-
Create a flow that takes two numbers, `a` and `b`, and returns their sum.
Keep it simple and use a single step named `sum_numbers`.
expected: ai_evals/fixtures/frontend/flow/expected/test0_sum_two_numbers.json
judgeChecklist:
- "the flow takes `a` and `b` as inputs"
- "the main step is named `sum_numbers`"
- the flow returns the sum of the two numbers
- id: flow-test1-reuse-existing-script
prompt: |-
I need a flow that adds two numbers.
If there is already a script in the workspace that does that, reuse it instead of rewriting the logic.
The flow should take `a` and `b` as inputs and use a single step named `sum_numbers`.
initial: ai_evals/fixtures/frontend/flow/initial/test1_reuse_existing_script_initial.json
expected: ai_evals/fixtures/frontend/flow/expected/test1_reuse_existing_script.json
judgeChecklist:
- "the flow takes `a` and `b` as inputs"
- "the main step is named `sum_numbers`"
- the flow reuses the existing workspace script instead of rewriting the addition logic
- id: flow-test2-call-existing-subflow
prompt: |-
Create a parent flow that adds two numbers by reusing an existing flow in the workspace if one already exists.
The parent flow should take `a` and `b` as inputs and delegate the calculation instead of inlining it.
Use a single step named `call_add_numbers`.
initial: ai_evals/fixtures/frontend/flow/initial/test2_call_existing_subflow_initial.json
expected: ai_evals/fixtures/frontend/flow/expected/test2_call_existing_subflow.json
judgeChecklist:
- "the parent flow takes `a` and `b` as inputs"
- "the main step is named `call_add_numbers`"
- the parent flow delegates to an existing workspace subflow instead of inlining the addition logic
- id: flow-test3-branchone-routing
prompt: |-
Create a flow that routes incoming support requests based on the customer's tier.
The input should contain a string field named `tier`.
Free, pro, and enterprise requests should go to different queues, and unknown tiers should fall back to a default queue.
Name the main routing step `route_by_tier`.
expected: ai_evals/fixtures/frontend/flow/expected/test3_branchone_routing.json
judgeChecklist:
- "the input schema includes a string field named `tier`"
- "the main routing step is named `route_by_tier`"
- free requests go to a free queue
- pro requests go to a pro queue
- enterprise requests go to an enterprise queue
- unknown tiers fall back to a default queue
- id: flow-test4-order-processing-loop
prompt: |-
Build an order-processing flow.
The input should include an order with:
- an `items` array containing `name`, `price`, and `quantity`
- `customer_email`
- `shipping_address`
The flow should:
- validate that every item has a positive price and quantity
- calculate the order total with 8% tax
- check inventory for each item using placeholder availability data
- create a shipment if everything is in stock, otherwise create a backorder
- send a confirmation using placeholder email logic
- return a final order summary with the status
validate:
schemaAnyOf:
- requiredPaths:
- order
- order.items
- order.customer_email
- order.shipping_address
- requiredPaths:
- items
- customer_email
- shipping_address
resolveResultsRefs: true
judgeChecklist:
- the flow validates that every item has a positive price and quantity
- the flow calculates the order total with 8% tax
- the flow checks inventory for each item using placeholder availability data
- the flow creates a shipment if everything is in stock, otherwise a backorder
- the flow sends a confirmation using placeholder email logic
- the flow returns a final order summary with the resulting status
- id: flow-test5-parallel-data-pipeline
prompt: |-
Create a data-processing flow for three external data sources.
It should:
- load a small placeholder configuration listing the three sources
- fetch placeholder records from each source
- clean and validate each source's records
- combine everything into one dataset
- compute an overall quality score
- store the result differently depending on the score:
- 90 or above goes to the primary database
- 70 to 89 goes to a secondary database with a warning
- below 70 goes to quarantine and triggers an alert
- return a processing report with total records, quality score, and destination
judgeChecklist:
- the flow loads a placeholder configuration listing three external sources
- the flow fetches placeholder records from each source
- the flow cleans and validates each source's records
- the flow combines everything into one dataset
- the flow computes an overall quality score
- scores of 90 or above go to the primary database
- scores from 70 to 89 go to a secondary database with a warning
- scores below 70 go to quarantine and trigger an alert
- the final report includes total records, quality score, and destination
- id: flow-test6-ai-agent-tools
prompt: |-
Create a customer support flow.
The input should include `customer_id` and `query_text`.
The flow should load the customer's profile and order history, then use an AI assistant to help with the request.
The assistant should be able to:
- look up orders
- check refund eligibility
- search FAQs
- open a support ticket when needed
After that, log the interaction and return the assistant's response along with any actions it took.
judgeChecklist:
- "the input schema includes `customer_id` and `query_text`"
- the flow loads the customer's profile and order history
- the flow uses an AI assistant step
- the assistant can look up orders
- the assistant can check refund eligibility
- the assistant can search FAQs
- the assistant can open a support ticket
- the flow logs the interaction
- the final output returns the assistant response along with any actions taken or resulting support action details
- id: flow-test7-simple-modification
prompt: |-
Update this flow so it validates processed data before saving it.
After `process_data`, add a `validate_data` step that checks the data array is not empty.
If the array is empty, it should return an error object with the message `No data to save`.
If validation passes, let the save continue normally.
Update `save_results` so it handles the validation result correctly.
initial: ai_evals/fixtures/frontend/flow/initial/test5_initial.json
expected: ai_evals/fixtures/frontend/flow/expected/test5_modify_simple.json
judgeChecklist:
- the updated flow keeps the original fetch and process steps intact
- "a `validate_data` step is added after `process_data`"
- "`validate_data` checks that the processed data array is not empty"
- "empty data returns an error object with the message `No data to save`"
- "`save_results` handles the validation result correctly"
- id: flow-test8-branching-in-loop
prompt: |-
Update the order-processing logic inside `loop_orders` so different order types are handled differently.
For `express`, mark the order as priority and use a shipping cost of $15.99.
For `standard`, use a shipping cost of $5.99.
For `pickup`, mark it as no shipping required with a cost of $0.
Keep the existing processing as a fallback for unknown order types.
Each path should return the orderId, shipping cost, and shipping type.
initial: ai_evals/fixtures/frontend/flow/initial/test6_initial.json
judgeChecklist:
- "the existing `loop_orders` flow still handles per-order processing"
- exact branching topology is not required as long as `loop_orders` handles the order types correctly
- express orders are marked as priority and use a shipping cost of 15.99
- standard orders use a shipping cost of 5.99
- pickup orders use a shipping cost of 0 and are treated as no shipping required
- unknown order types still follow a fallback path
- "each processed order returns `orderId`, `shippingCost`, and `shippingType`"
- id: flow-test9-parallel-refactor
prompt: |-
Refactor this flow so the enrichment work no longer runs one step at a time.
`enrich_price`, `enrich_inventory`, and `enrich_reviews` should run independently.
Each one should return a fallback value if it fails.
Update `combine_data` so it merges the enrichment results and sets a `hasFallbacks` flag when any fallback was used.
Keep `get_item` as the first step and `return_result` as the last step.
initial: ai_evals/fixtures/frontend/flow/initial/test7_initial.json
expected: ai_evals/fixtures/frontend/flow/expected/test7_modify_complex.json
judgeChecklist:
- "the updated flow keeps `get_item` as the first step"
- "the updated flow keeps `return_result` as the last step"
- "`enrich_price`, `enrich_inventory`, and `enrich_reviews` run independently rather than sequentially"
- each enrichment path returns a fallback value if it fails
- "`combine_data` merges the enrichment results"
- "`combine_data` sets `hasFallbacks` when any fallback was used"
- id: flow-test10-while-loop-counter
prompt: |-
Create a flow that keeps incrementing a counter until it reaches a target value.
The input should include a number field named `target`.
Name the looping step `count_until_target`.
Once the target is reached, return the final counter value.
expected: ai_evals/fixtures/frontend/flow/expected/test10_while_loop_counter.json
judgeChecklist:
- "the input schema includes a number field named `target`"
- "the looping step is named `count_until_target`"
- the flow keeps incrementing a counter until the target is reached
- the final output returns the final counter value
- id: flow-test11-preprocessor-and-failure-handler
prompt: |-
Create an event-processing flow for a string payload.
Before the main processing runs, trim the payload and reject empty strings.
The main step should be named `process_event` and return a simple success object.
If anything fails, return a compact error object with the error message and the failing step id.
expected: ai_evals/fixtures/frontend/flow/expected/test11_preprocessor_failure.json
validate:
requireSpecialModules:
- preprocessor_module
- failure_module
judgeChecklist:
- the flow trims the payload before the main processing runs
- the flow rejects empty payload strings
- "the main step is named `process_event`"
- "`process_event` returns a simple success object"
- failures return a compact error object with the error message and failing step id
- id: flow-test12-approval-step
prompt: |-
Create a purchase approval flow.
The input should include `requester_email` and `amount`.
Add an approval step named `request_approval` that pauses the flow and asks the approver for a comment.
One approval should be enough to continue.
After approval, add a final step named `finalize_purchase` that returns an approved status object.
expected: ai_evals/fixtures/frontend/flow/expected/test12_approval_step.json
validate:
schemaRequiredPaths:
- requester_email
- amount
requireSuspendSteps:
- id: request_approval
requiredEvents: 1
resumeRequiredStringFieldAnyOf:
- comment
- approver_comment
judgeChecklist:
- "the flow includes an approval step named `request_approval`"
- "`request_approval` pauses the flow and asks the approver for a comment"
- one approval is enough to continue
- "the flow includes a final step named `finalize_purchase`"
- "`finalize_purchase` returns an approved status object after approval"

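Several flow cases above carry a `validate` block, for example `schemaAnyOf` with alternative `requiredPaths` lists in `flow-test4-order-processing-loop`. How the real validator interprets these keys is not shown in this diff. The sketch below assumes that `schemaAnyOf` passes when at least one alternative has all of its dotted paths declared through nested `properties` in the flow's input schema. `SchemaNode`, `hasSchemaPath`, and `matchesAnyOf` are hypothetical names.

```typescript
// Hypothetical, simplified JSON-Schema-like node: only `properties` matters here.
type SchemaNode = { properties?: Record<string, SchemaNode> };

// Walk a dotted path such as "order.items" through nested `properties`.
function hasSchemaPath(schema: SchemaNode, path: string): boolean {
  let node: SchemaNode | undefined = schema;
  for (const key of path.split(".")) {
    node = node?.properties?.[key];
    if (node === undefined) return false;
  }
  return true;
}

// Assumed reading of `schemaAnyOf`: one alternative with all paths present is enough.
function matchesAnyOf(
  schema: SchemaNode,
  alternatives: { requiredPaths: string[] }[]
): boolean {
  return alternatives.some((alt) =>
    alt.requiredPaths.every((path) => hasSchemaPath(schema, path))
  );
}

// The two alternatives listed for flow-test4 above.
const orderAlternatives = [
  { requiredPaths: ["order", "order.items", "order.customer_email", "order.shipping_address"] },
  { requiredPaths: ["items", "customer_email", "shipping_address"] },
];

// A flat input schema satisfies the second alternative.
const flatSchema: SchemaNode = {
  properties: { items: {}, customer_email: {}, shipping_address: {} },
};

console.log(matchesAnyOf(flatSchema, orderAlternatives)); // true
```

Under this reading, a generated flow may either nest the order fields under an `order` object or expose them at the top level, and both layouts are accepted.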

@@ -1,11 +0,0 @@
- id: script-test1-greet-user
prompt: |-
Update the current Bun script so it takes the existing `name` input and returns a plain greeting string like `Hello, Alice!`.
Do not wrap the result in an object or array.
Keep it simple and do not add external dependencies.
initial: ai_evals/fixtures/frontend/script/initial/test1_empty_bun.json
expected: ai_evals/fixtures/frontend/script/expected/test1_greet_user.json
judgeChecklist:
- uses the existing `name` input
- returns a plain greeting string
- does not wrap the result in an object or array


@@ -1,295 +0,0 @@
#!/usr/bin/env bun
import { Command, InvalidArgumentError } from "commander";
import { loadCases, loadSelectedCases } from "../core/cases";
import {
EVAL_MODELS,
type EvalModelSpec,
formatRunModelLabel,
getCliEvalModel,
getEvalModelHelpText,
resolveEvalModel,
} from "../core/models";
import {
appendHistoryRecord,
buildRunResult,
formatRunSummary,
resolveRunOutputPath,
writeRunArtifacts,
writeRunResult,
} from "../core/results";
import { runSuite } from "../core/runSuite";
import { EVAL_MODES, type EvalMode } from "../core/types";
import { DEFAULT_JUDGE_MODEL } from "../core/judge";
import { createCliModeRunner } from "../modes/cli";
import { runFrontendBenchmarkAdapter } from "../adapters/frontend/runtime";
async function main() {
const program = new Command()
.name("bun run cli --")
.description("Run AI eval cases against the current production prompts and guidance")
.showHelpAfterError()
.showSuggestionAfterError()
.addHelpText(
"after",
[
"",
"Examples:",
" bun run cli -- models",
" bun run cli -- cases",
" bun run cli -- cases flow",
" bun run cli -- run flow",
" bun run cli -- run flow --model 4o",
" bun run cli -- run flow --models haiku,opus,4o",
" bun run cli -- run flow flow-test0-sum-two-numbers --verbose",
" bun run cli -- run flow --record",
" bun run cli -- run flow flow-test7-simple-modification --runs 3",
" bun run cli -- run cli bun-hello-script",
"",
"Models:",
getEvalModelHelpText(),
].join("\n")
);
program
.command("models")
.description("List available model aliases")
.action(() => {
handleModels();
});
program
.command("cases")
.description("List available cases")
.argument("[mode]", "cli, flow, script, or app", parseOptionalMode)
.action(async (mode?: EvalMode) => {
await handleCases(mode);
});
program
.command("run")
.description("Run one benchmark mode")
.argument("<mode>", "cli, flow, script, or app", parseMode)
.argument("[caseIds...]", "specific case ids to run")
.option("--runs <n>", "number of attempts per case", parsePositiveInteger, 1)
.option("--output <path>", "write the result JSON to this path")
.option("--model <name>", `model alias (${EVAL_MODELS.map((entry) => entry.id).join(", ")})`)
.option("--models <names>", "comma-separated model aliases to run sequentially")
.option("--verbose", "stream assistant output during frontend runs")
.option("--record", "append a compact summary line to ai_evals/history/<mode>.jsonl")
.action(
async (
mode: EvalMode,
caseIds: string[],
options: {
runs: number;
output?: string;
model?: string;
models?: string;
verbose?: boolean;
record?: boolean;
}
) => {
await handleRun({
mode,
caseIds,
runs: options.runs,
outputPath: options.output,
model: options.model,
models: options.models,
verbose: options.verbose ?? false,
record: options.record ?? false,
});
}
);
await program.parseAsync(process.argv);
}
async function handleCases(mode?: EvalMode) {
const modes = mode ? [mode] : [...EVAL_MODES];
for (const entry of modes) {
const cases = await loadCases(entry);
process.stdout.write(`${entry} (${cases.length})\n`);
for (const evalCase of cases) {
process.stdout.write(`- ${evalCase.id}\n`);
}
process.stdout.write("\n");
}
}
function handleModels() {
process.stdout.write("Available models\n");
for (const model of EVAL_MODELS) {
const supports = [
...(model.frontend ? ["flow", "script", "app"] : []),
...(model.cli ? ["cli"] : []),
];
const aliases = [model.id, ...model.aliases.filter((alias) => alias !== model.id)];
process.stdout.write(`- ${model.id}: ${model.label}\n`);
process.stdout.write(` aliases: ${aliases.join(", ")}\n`);
process.stdout.write(` modes: ${supports.join(", ")}\n`);
}
process.stdout.write(`\nJudge model: ${DEFAULT_JUDGE_MODEL}\n`);
}
async function handleRun(input: {
mode: EvalMode;
caseIds: string[];
runs: number;
outputPath?: string;
model?: string;
models?: string;
verbose: boolean;
record: boolean;
}) {
if (input.record && input.caseIds.length > 0) {
throw new Error("--record only supports full-suite runs; omit case ids to record history");
}
if (input.model && input.models) {
throw new Error("Use either --model or --models, not both");
}
const selectedCases = await loadSelectedCases(input.mode, input.caseIds);
const models = resolveRequestedModels(input.mode, input.model, input.models);
if (input.outputPath && models.length > 1) {
throw new Error("--output only supports a single model run");
}
const summaries: Array<{ label: string; passRate: number; averageDurationMs: number }> = [];
for (const [index, model] of models.entries()) {
const runModel = formatRunModelLabel(input.mode, model);
if (models.length > 1) {
process.stdout.write(
`${index > 0 ? "\n" : ""}=== ${input.mode} ${model.id} (${runModel}) ===\n`
);
}
process.stderr.write(`Starting ${input.mode} benchmark...\n`);
const result =
input.mode === "cli"
? await runCliBenchmark(selectedCases, input.runs, getCliEvalModel(model), runModel)
: await runFrontendBenchmarkAdapter({
mode: input.mode,
caseIds: input.caseIds,
runs: input.runs,
model: model.id,
verbose: input.verbose,
});
const resolvedOutputPath =
models.length === 1
? resolveRunOutputPath(input.mode, input.outputPath)
: resolveRunOutputPath(input.mode);
const artifactsPath = await writeRunArtifacts(result, resolvedOutputPath);
const resultPath = await writeRunResult(result, resolvedOutputPath);
const historyPath = input.record ? await appendHistoryRecord(result) : null;
process.stdout.write(`${formatRunSummary(result)}\n`);
process.stdout.write(`Saved: ${resultPath}\n`);
if (artifactsPath) {
process.stdout.write(`Artifacts: ${artifactsPath}\n`);
}
if (historyPath) {
process.stdout.write(`Recorded: ${historyPath}\n`);
}
summaries.push({
label: `${model.id} (${runModel})`,
passRate: result.passRate,
averageDurationMs: result.averageDurationMs,
});
}
if (summaries.length > 1) {
process.stdout.write("\nModel summary\n");
for (const summary of summaries) {
process.stdout.write(
`- ${summary.label}: ${formatPercent(summary.passRate)} | ${Math.round(summary.averageDurationMs)}ms\n`
);
}
}
}
async function runCliBenchmark(
cases: Awaited<ReturnType<typeof loadSelectedCases>>,
runs: number,
model: ReturnType<typeof getCliEvalModel>,
runModel: string
) {
const caseResults = await runSuite({
modeRunner: createCliModeRunner(model),
cases,
runs,
runModel,
judgeModel: DEFAULT_JUDGE_MODEL,
});
return buildRunResult({
mode: "cli",
runs,
runModel,
judgeModel: DEFAULT_JUDGE_MODEL,
caseResults,
});
}
function parseMode(value: string): EvalMode {
if (EVAL_MODES.includes(value as EvalMode)) {
return value as EvalMode;
}
throw new InvalidArgumentError(`mode must be one of: ${EVAL_MODES.join(", ")}`);
}
function parseOptionalMode(value: string | undefined): EvalMode | undefined {
return value ? parseMode(value) : undefined;
}
function parsePositiveInteger(value: string): number {
const parsed = Number(value);
if (!Number.isInteger(parsed) || parsed <= 0) {
throw new InvalidArgumentError("must be a positive integer");
}
return parsed;
}
function resolveRequestedModels(
mode: EvalMode,
singleModel?: string,
multipleModels?: string
): EvalModelSpec[] {
if (!multipleModels) {
return [resolveEvalModel(mode, singleModel)];
}
const aliases = multipleModels
.split(",")
.map((value) => value.trim())
.filter(Boolean);
if (aliases.length === 0) {
throw new Error("--models requires at least one model alias");
}
const seen = new Set<string>();
const models: EvalModelSpec[] = [];
for (const alias of aliases) {
const model = resolveEvalModel(mode, alias);
if (seen.has(model.id)) {
continue;
}
seen.add(model.id);
models.push(model);
}
return models;
}
function formatPercent(value: number): string {
return `${(value * 100).toFixed(1)}%`;
}
void main().catch((error) => {
const message = error instanceof Error ? error.message : String(error);
process.stderr.write(`${message}\n`);
process.exit(1);
});
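
The `--models` handling in `resolveRequestedModels` above splits a comma-separated list, drops blanks, and keeps only the first occurrence of each canonical model id. A minimal sketch of that idea follows; `ALIAS_TO_ID` and `parseModelList` are hypothetical names used only for illustration, and the real code resolves aliases through `resolveEvalModel`, not a flat map.

```typescript
// Hypothetical alias table for illustration; the real aliases live in EVAL_MODELS.
const ALIAS_TO_ID: Record<string, string> = {
  haiku: "haiku",
  "claude-haiku": "haiku",
  opus: "opus",
  "4o": "4o",
  "gpt-4o": "4o",
};

// Split a comma-separated --models value, drop blank entries, and keep the
// first occurrence of each canonical id so the same model never runs twice.
function parseModelList(raw: string): string[] {
  const aliases = raw
    .split(",")
    .map((value) => value.trim())
    .filter(Boolean);
  if (aliases.length === 0) {
    throw new Error("--models requires at least one model alias");
  }
  const seen = new Set<string>();
  const ids: string[] = [];
  for (const alias of aliases) {
    const id = ALIAS_TO_ID[alias.toLowerCase()];
    if (id === undefined) {
      throw new Error(`Unknown model: ${alias}`);
    }
    if (!seen.has(id)) {
      seen.add(id);
      ids.push(id);
    }
  }
  return ids;
}

// Two aliases of the same model collapse into one run; order is preserved.
console.log(parseModelList("haiku, claude-haiku,,GPT-4o"));
```

Deduplicating on the canonical id instead of the raw alias is what prevents `haiku,claude-haiku` from benchmarking the same model twice.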


@@ -1,71 +0,0 @@
import { readFile } from "node:fs/promises";
import path from "node:path";
import { fileURLToPath } from "node:url";
import { parse } from "yaml";
import type { EvalCase, EvalMode, FlowValidationSpec } from "./types";
const REPO_ROOT = fileURLToPath(new URL("../../", import.meta.url));
const CASES_DIR = path.join(REPO_ROOT, "ai_evals", "cases");
interface RawEvalCase {
id: string;
prompt: string;
initial?: string;
expected?: string;
validate?: FlowValidationSpec;
judgeChecklist?: string[];
}
export function getRepoRoot(): string {
return REPO_ROOT;
}
export function getAiEvalsRoot(): string {
return path.join(REPO_ROOT, "ai_evals");
}
export async function loadCases(mode: EvalMode): Promise<EvalCase[]> {
const filePath = path.join(CASES_DIR, `${mode}.yaml`);
const raw = await readFile(filePath, "utf8");
const parsed = parse(raw);
if (!Array.isArray(parsed)) {
throw new Error(`Expected ${filePath} to contain a YAML list of cases`);
}
return parsed.map((entry) => ({
id: entry.id,
prompt: entry.prompt,
initialPath: resolveFixturePath(entry.initial),
expectedPath: resolveFixturePath(entry.expected),
validate: entry.validate,
judgeChecklist: entry.judgeChecklist,
}));
}
export async function loadSelectedCases(
mode: EvalMode,
selectedIds: string[]
): Promise<EvalCase[]> {
const allCases = await loadCases(mode);
if (selectedIds.length === 0) {
return allCases;
}
const caseMap = new Map(allCases.map((entry) => [entry.id, entry]));
const missing = selectedIds.filter((id) => !caseMap.has(id));
if (missing.length > 0) {
throw new Error(
`Unknown ${mode} case${missing.length === 1 ? "" : "s"}: ${missing.join(", ")}`
);
}
return selectedIds.map((id) => caseMap.get(id)!);
}
function resolveFixturePath(value: string | undefined): string | undefined {
if (!value) {
return undefined;
}
return path.isAbsolute(value) ? value : path.join(REPO_ROOT, value);
}
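
The selection logic in `loadSelectedCases` above can be sketched without any file I/O. The version below is an illustration under simplified assumptions: `CaseLike` and `selectCases` are hypothetical names, and the cases are passed in directly instead of being read from YAML.

```typescript
// Minimal stand-in for EvalCase; only the id matters for selection.
interface CaseLike {
  id: string;
}

// Return all cases when nothing is selected; otherwise return the requested
// cases in the order they were requested, failing on any unknown id.
function selectCases<T extends CaseLike>(all: T[], selectedIds: string[]): T[] {
  if (selectedIds.length === 0) {
    return all;
  }
  const byId = new Map(all.map((entry) => [entry.id, entry] as const));
  const missing = selectedIds.filter((id) => !byId.has(id));
  if (missing.length > 0) {
    throw new Error(`Unknown case${missing.length === 1 ? "" : "s"}: ${missing.join(", ")}`);
  }
  return selectedIds.map((id) => byId.get(id)!);
}

const sampleCases = [{ id: "a" }, { id: "b" }, { id: "c" }];
console.log(selectCases(sampleCases, ["c", "a"]).map((entry) => entry.id));
```

Because the result is built by mapping over the requested ids, the output follows the request order, and an id listed twice yields the same case twice.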


@@ -1,67 +0,0 @@
import { access, copyFile, mkdir, readdir, readFile } from "node:fs/promises";
import path from "node:path";
export async function exists(filePath: string): Promise<boolean> {
try {
await access(filePath);
return true;
} catch {
return false;
}
}
export async function readJsonFile<T>(filePath: string): Promise<T> {
const raw = await readFile(filePath, "utf8");
return JSON.parse(raw) as T;
}
export async function readDirectoryFiles(
rootDir: string,
options: {
ignore?: Set<string>;
} = {}
): Promise<Record<string, string>> {
const files: Record<string, string> = {};
await walkDirectory(rootDir, "", files, options.ignore ?? new Set());
return files;
}
export async function copyDirectory(sourceDir: string, targetDir: string): Promise<void> {
const entries = await readdir(sourceDir, { withFileTypes: true });
await mkdir(targetDir, { recursive: true });
for (const entry of entries) {
const sourcePath = path.join(sourceDir, entry.name);
const targetPath = path.join(targetDir, entry.name);
if (entry.isDirectory()) {
await copyDirectory(sourcePath, targetPath);
continue;
}
await mkdir(path.dirname(targetPath), { recursive: true });
await copyFile(sourcePath, targetPath);
}
}
async function walkDirectory(
absoluteDir: string,
relativeDir: string,
output: Record<string, string>,
ignore: Set<string>
): Promise<void> {
const entries = await readdir(absoluteDir, { withFileTypes: true });
for (const entry of entries) {
const relativePath = relativeDir ? `${relativeDir}/${entry.name}` : entry.name;
if (ignore.has(relativePath) || ignore.has(entry.name)) {
continue;
}
const absolutePath = path.join(absoluteDir, entry.name);
if (entry.isDirectory()) {
await walkDirectory(absolutePath, relativePath, output, ignore);
continue;
}
output[relativePath] = await readFile(absolutePath, "utf8");
}
}


@@ -1,149 +0,0 @@
import Anthropic from "@anthropic-ai/sdk";
import type { EvalMode, JudgeResult } from "./types";
export const DEFAULT_JUDGE_MODEL = "claude-sonnet-4-6";
const JUDGE_TOOL_NAME = "submit_judgement";
export async function judgeOutput(input: {
mode: EvalMode;
prompt: string;
checklist?: string[];
initial?: unknown;
expected?: unknown;
actual: unknown;
model?: string;
}): Promise<JudgeResult> {
const apiKey = process.env.ANTHROPIC_API_KEY;
if (!apiKey) {
return {
success: false,
score: 0,
summary: "Judge unavailable",
error: "ANTHROPIC_API_KEY is not set",
};
}
const client = new Anthropic({ apiKey });
const model = input.model ?? DEFAULT_JUDGE_MODEL;
const system = [
"You evaluate benchmark outputs for Windmill AI generation.",
"Deterministic checks already run separately. Focus on whether the final output satisfies the user request.",
"If expected state is provided, treat it as a valid example and reward semantically equivalent outputs.",
"If a checklist is provided, treat it as the explicit acceptance criteria for this case.",
"Be strict about missing requested functionality.",
"When the prompt wording is ambiguous, prefer the checklist over inferred structural requirements.",
"Do not require exact ids, exact topology, or exact field names unless the prompt, checklist, or expected state clearly requires them.",
`Always respond by calling the ${JUDGE_TOOL_NAME} tool exactly once.`,
].join("\n\n");
const user = [
`Mode: ${input.mode}`,
"",
"User prompt:",
input.prompt,
"",
"Checklist:",
formatChecklist(input.checklist),
"",
"Initial state:",
formatJsonBlock(input.initial),
"",
"Expected state:",
formatJsonBlock(input.expected),
"",
"Actual result:",
formatJsonBlock(input.actual),
].join("\n");
try {
const response = await client.messages.create({
model,
max_tokens: 1024,
temperature: 0,
system,
messages: [{ role: "user", content: user }],
tools: [
{
name: JUDGE_TOOL_NAME,
description: "Submit the benchmark judgement as structured data.",
input_schema: {
type: "object",
properties: {
score: {
type: "integer",
minimum: 0,
maximum: 100,
},
summary: {
type: "string",
},
},
required: ["score", "summary"],
},
},
],
tool_choice: {
type: "tool",
name: JUDGE_TOOL_NAME,
disable_parallel_tool_use: true,
},
});
const toolUseBlock = response.content.find(
(block): block is Anthropic.ToolUseBlock =>
block.type === "tool_use" && block.name === JUDGE_TOOL_NAME
);
if (!toolUseBlock) {
return {
success: false,
score: 0,
summary: "Judge returned no tool output",
error: "Expected structured tool output from judge",
};
}
const parsed = toolUseBlock.input as {
score: number;
summary: string;
};
return {
success: true,
score: normalizeScore(parsed.score),
summary: parsed.summary,
};
} catch (error) {
const message = error instanceof Error ? error.message : String(error);
return {
success: false,
score: 0,
summary: "Judge failed",
error: message,
};
}
}
function formatJsonBlock(value: unknown): string {
if (value === undefined) {
return "(none)";
}
return JSON.stringify(value, null, 2);
}
function formatChecklist(checklist: string[] | undefined): string {
if (!checklist || checklist.length === 0) {
return "(none)";
}
return checklist.map((item) => `- ${item}`).join("\n");
}
function normalizeScore(value: number): number {
if (!Number.isFinite(value)) {
return 0;
}
return Math.max(0, Math.min(100, Math.round(value)));
}
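
To make the tool-output handling in `judgeOutput` above concrete, here is a small sketch that finds the judge's tool call in a response and clamps its score. `ContentBlock` is a simplified stand-in for the SDK's content block types and `extractScore` is a hypothetical helper; neither appears in the original file, and no SDK call is made.

```typescript
// Simplified stand-in for the SDK's response content blocks (assumption).
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; name: string; input: unknown };

// Same clamping rule as the file above: non-finite values become 0,
// everything else is rounded and limited to the 0..100 range.
function normalizeScore(value: number): number {
  if (!Number.isFinite(value)) {
    return 0;
  }
  return Math.max(0, Math.min(100, Math.round(value)));
}

// Return the normalized score from the named tool call, or null when the
// response contains no matching tool_use block.
function extractScore(content: ContentBlock[], toolName: string): number | null {
  const block = content.find(
    (entry): entry is Extract<ContentBlock, { type: "tool_use" }> =>
      entry.type === "tool_use" && entry.name === toolName
  );
  if (!block) {
    return null;
  }
  const score = (block.input as { score?: unknown }).score;
  return typeof score === "number" ? normalizeScore(score) : 0;
}

console.log(
  extractScore(
    [
      { type: "text", text: "thinking..." },
      { type: "tool_use", name: "submit_judgement", input: { score: 104.6, summary: "ok" } },
    ],
    "submit_judgement"
  )
);
```

Note that `Infinity` normalizes to 0, not 100, because the finiteness check runs before the clamp.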


@@ -1,29 +0,0 @@
import { describe, expect, it } from "bun:test";
import { resolveEvalModel } from "./models";
describe("resolveEvalModel", () => {
it("supports Gemini aliases for frontend evals", () => {
expect(resolveEvalModel("flow", "gemini").frontend).toEqual({
provider: "googleai",
model: "gemini-2.5-flash",
});
expect(resolveEvalModel("app", "gemini-pro").frontend).toEqual({
provider: "googleai",
model: "gemini-2.5-pro",
});
expect(resolveEvalModel("script", "gemini-3-flash-preview").frontend).toEqual({
provider: "googleai",
model: "gemini-3-flash-preview",
});
expect(resolveEvalModel("flow", "gemini-3.1-pro-preview").frontend).toEqual({
provider: "googleai",
model: "gemini-3.1-pro-preview",
});
});
it("rejects Gemini aliases for cli evals", () => {
expect(() => resolveEvalModel("cli", "gemini")).toThrow(
"Model gemini-flash is not supported for cli mode"
);
});
});


@@ -1,185 +0,0 @@
import type { EvalMode } from "./types";
export interface FrontendEvalModelConfig {
provider: "anthropic" | "openai" | "googleai";
model: string;
}
export interface CliEvalModelConfig {
provider: "anthropic";
model: string;
}
export interface EvalModelSpec {
id: string;
label: string;
aliases: string[];
frontend?: FrontendEvalModelConfig;
cli?: CliEvalModelConfig;
}
export const EVAL_MODELS: EvalModelSpec[] = [
{
id: "haiku",
label: "Claude Haiku 4.5",
aliases: [
"haiku",
"haiku-4.5",
"claude-haiku",
"claude-haiku-4.5",
"claude-haiku-4-5",
"claude-haiku-4-5-20251001",
],
frontend: {
provider: "anthropic",
model: "claude-haiku-4-5-20251001",
},
cli: {
provider: "anthropic",
model: "haiku",
},
},
{
id: "sonnet",
label: "Claude Sonnet 4.5",
aliases: [
"sonnet",
"sonnet-4.5",
"claude-sonnet",
"claude-sonnet-4.5",
"claude-sonnet-4-5",
"claude-sonnet-4-5-20250929",
],
frontend: {
provider: "anthropic",
model: "claude-sonnet-4-5-20250929",
},
cli: {
provider: "anthropic",
model: "sonnet",
},
},
{
id: "opus",
label: "Claude Opus 4.6",
aliases: [
"opus",
"opus-4.6",
"claude-opus",
"claude-opus-4.6",
"claude-opus-4-6",
],
frontend: {
provider: "anthropic",
model: "claude-opus-4-6",
},
cli: {
provider: "anthropic",
model: "opus",
},
},
{
id: "4o",
label: "GPT-4o",
aliases: ["4o", "gpt-4o"],
frontend: {
provider: "openai",
model: "gpt-4o",
},
},
{
id: "gemini-flash",
label: "Gemini 2.5 Flash",
aliases: ["gemini", "gemini-flash", "gemini-2.5-flash"],
frontend: {
provider: "googleai",
model: "gemini-2.5-flash",
},
},
{
id: "gemini-pro",
label: "Gemini 2.5 Pro",
aliases: ["gemini-pro", "gemini-2.5-pro"],
frontend: {
provider: "googleai",
model: "gemini-2.5-pro",
},
},
{
id: "gemini-3-flash-preview",
label: "Gemini 3 Flash Preview",
aliases: ["gemini-3-flash-preview", "gemini-3-flash"],
frontend: {
provider: "googleai",
model: "gemini-3-flash-preview",
},
},
{
id: "gemini-3.1-pro-preview",
label: "Gemini 3.1 Pro Preview",
aliases: ["gemini-3.1-pro-preview", "gemini-3.1-pro", "gemini-3-pro-preview"],
frontend: {
provider: "googleai",
model: "gemini-3.1-pro-preview",
},
},
];
export function resolveEvalModel(mode: EvalMode, alias?: string): EvalModelSpec {
const spec = alias ? findEvalModel(alias) : getDefaultEvalModel(mode);
if (!spec) {
throw new Error(`Unknown model: ${alias}`);
}
if (mode === "cli" && !spec.cli) {
throw new Error(`Model ${spec.id} is not supported for cli mode`);
}
if (mode !== "cli" && !spec.frontend) {
throw new Error(`Model ${spec.id} is not supported for ${mode} mode`);
}
return spec;
}
export function getEvalModelHelpText(): string {
return EVAL_MODELS.map((model) => {
const modes = [
...(model.frontend ? ["flow", "script", "app"] : []),
...(model.cli ? ["cli"] : []),
];
return ` ${model.id.padEnd(8)} ${model.label} (${modes.join(", ")})`;
}).join("\n");
}
export function formatRunModelLabel(mode: EvalMode, model: EvalModelSpec): string {
if (mode === "cli") {
return `${model.cli!.provider}:${model.cli!.model}`;
}
return `${model.frontend!.provider}:${model.frontend!.model}`;
}
export function getFrontendEvalModel(model: EvalModelSpec): FrontendEvalModelConfig {
if (!model.frontend) {
throw new Error(`Model ${model.id} does not support frontend evals`);
}
return model.frontend;
}
export function getCliEvalModel(model: EvalModelSpec): CliEvalModelConfig {
if (!model.cli) {
throw new Error(`Model ${model.id} does not support cli evals`);
}
return model.cli;
}
function getDefaultEvalModel(_mode: EvalMode): EvalModelSpec {
// Every mode currently shares the same default: the first entry in EVAL_MODELS.
return EVAL_MODELS[0]!;
}
function findEvalModel(alias: string): EvalModelSpec | undefined {
const normalized = alias.trim().toLowerCase();
return EVAL_MODELS.find((model) =>
[model.id, ...model.aliases].some((candidate) => candidate.toLowerCase() === normalized)
);
}
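
The alias lookup and mode check performed by `resolveEvalModel` above can be illustrated with a reduced model table. In this sketch `ModelSpec`, `MODELS`, and `resolveModel` are hypothetical simplifications: support is modelled with booleans instead of the `frontend` and `cli` config objects used in the real file.

```typescript
type Mode = "cli" | "flow" | "script" | "app";

// Reduced model entry (assumption): booleans replace the config objects.
interface ModelSpec {
  id: string;
  aliases: string[];
  frontend: boolean;
  cli: boolean;
}

const MODELS: ModelSpec[] = [
  { id: "haiku", aliases: ["haiku", "claude-haiku"], frontend: true, cli: true },
  { id: "gemini-flash", aliases: ["gemini", "gemini-flash"], frontend: true, cli: false },
];

// Case-insensitive, whitespace-tolerant lookup over ids and aliases.
function findModel(alias: string): ModelSpec | undefined {
  const normalized = alias.trim().toLowerCase();
  return MODELS.find((model) =>
    [model.id, ...model.aliases].some((candidate) => candidate.toLowerCase() === normalized)
  );
}

// Resolve an alias (or the default when none is given), then verify that the
// model supports the requested mode.
function resolveModel(mode: Mode, alias?: string): ModelSpec {
  const spec = alias ? findModel(alias) : MODELS[0];
  if (!spec) {
    throw new Error(`Unknown model: ${alias}`);
  }
  const supported = mode === "cli" ? spec.cli : spec.frontend;
  if (!supported) {
    throw new Error(`Model ${spec.id} is not supported for ${mode} mode`);
  }
  return spec;
}

console.log(resolveModel("flow", " Gemini ").id);
```

The error message reports the canonical id, not the alias the user typed, which is why the test file above expects `gemini-flash` when `gemini` is rejected for cli mode.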


@@ -1,296 +0,0 @@
import { appendFile, mkdir, rm, writeFile } from "node:fs/promises";
import path from "node:path";
import { execFileSync } from "node:child_process";
import { getAiEvalsRoot, getRepoRoot } from "./cases";
import type {
BenchmarkArtifactFile,
BenchmarkCaseResult,
BenchmarkRunResult,
BenchmarkTokenUsage,
EvalMode,
} from "./types";
export async function writeRunResult(
result: BenchmarkRunResult,
outputPath?: string
): Promise<string> {
const targetPath = resolveRunOutputPath(result.mode, outputPath);
await mkdir(path.dirname(targetPath), { recursive: true });
await writeFile(targetPath, JSON.stringify(toSerializableRunResult(result), null, 2) + "\n", "utf8");
return targetPath;
}
export async function appendHistoryRecord(
result: BenchmarkRunResult,
historyPath = resolveHistoryPath(result.mode)
): Promise<string> {
await mkdir(path.dirname(historyPath), { recursive: true });
await appendFile(historyPath, JSON.stringify(toHistoryRecord(result)) + "\n", "utf8");
return historyPath;
}
export async function writeRunArtifacts(
result: BenchmarkRunResult,
outputPath?: string
): Promise<string | null> {
const targetPath = resolveRunOutputPath(result.mode, outputPath);
const artifactRoot = defaultArtifactsRoot(targetPath);
await rm(artifactRoot, { recursive: true, force: true });
let wroteArtifacts = false;
for (const caseResult of result.cases) {
for (const attempt of caseResult.attempts) {
const artifactFiles = attempt.artifactFiles ?? [];
if (artifactFiles.length === 0) {
attempt.artifactsPath = null;
continue;
}
const attemptDir = path.join(artifactRoot, caseResult.id, `attempt-${attempt.attempt}`);
await writeArtifactFiles(attemptDir, artifactFiles);
attempt.artifactsPath = attemptDir;
wroteArtifacts = true;
}
}
result.artifactsPath = wroteArtifacts ? artifactRoot : null;
return result.artifactsPath ?? null;
}
export function buildRunResult(input: {
mode: EvalMode;
runs: number;
runModel: string | null;
judgeModel: string | null;
caseResults: BenchmarkCaseResult[];
}): BenchmarkRunResult {
const attemptCount = input.caseResults.reduce((sum, entry) => sum + entry.attempts.length, 0);
const passedAttempts = input.caseResults.reduce(
(sum, entry) => sum + entry.attempts.filter((attempt) => attempt.passed).length,
0
);
const durationTotal = input.caseResults.reduce(
(sum, entry) => sum + entry.attempts.reduce((inner, attempt) => inner + attempt.durationMs, 0),
0
);
const tokenUsageTotal = input.caseResults.reduce<BenchmarkTokenUsage | null>(
(sum, entry) => {
for (const attempt of entry.attempts) {
if (!attempt.tokenUsage) {
continue;
}
sum ??= { prompt: 0, completion: 0, total: 0 };
sum.prompt += attempt.tokenUsage.prompt;
sum.completion += attempt.tokenUsage.completion;
sum.total += attempt.tokenUsage.total;
}
return sum;
},
null
);
return {
version: 1,
mode: input.mode,
createdAt: new Date().toISOString(),
gitSha: getGitSha(),
runs: input.runs,
runModel: input.runModel,
judgeModel: input.judgeModel,
caseCount: input.caseResults.length,
attemptCount,
passedAttempts,
passRate: attemptCount === 0 ? 0 : passedAttempts / attemptCount,
averageDurationMs: attemptCount === 0 ? 0 : durationTotal / attemptCount,
totalTokenUsage: tokenUsageTotal,
averageTokenUsagePerAttempt:
attemptCount === 0 || !tokenUsageTotal
? null
: {
prompt: tokenUsageTotal.prompt / attemptCount,
completion: tokenUsageTotal.completion / attemptCount,
total: tokenUsageTotal.total / attemptCount,
},
cases: input.caseResults,
};
}
export function formatRunSummary(result: BenchmarkRunResult): string {
const lines = [
`${result.mode} benchmark complete`,
`Pass rate: ${formatPercent(result.passRate)} (${result.passedAttempts}/${result.attemptCount})`,
`Average duration: ${Math.round(result.averageDurationMs)}ms`,
];
const failures = collectFailures(result);
if (failures.length > 0) {
lines.push("Failures:");
for (const entry of failures.slice(0, 10)) {
lines.push(`- ${entry}`);
}
}
return lines.join("\n");
}
function collectFailures(result: BenchmarkRunResult): string[] {
const failures: string[] = [];
for (const caseResult of result.cases) {
for (const attempt of caseResult.attempts) {
if (attempt.passed) {
continue;
}
const failedChecks = attempt.checks.filter((check) => !check.passed).map((check) => check.name);
failures.push(
`${caseResult.id} attempt ${attempt.attempt}: ${failedChecks.join(", ") || attempt.error || "failed"}`
);
}
}
return failures;
}
function defaultFileName(mode: EvalMode): string {
return `${new Date().toISOString().replaceAll(":", "-")}__${mode}.json`;
}
export function resolveRunOutputPath(mode: EvalMode, outputPath?: string): string {
return outputPath ?? path.join(getAiEvalsRoot(), "results", defaultFileName(mode));
}
export function resolveHistoryPath(mode: EvalMode): string {
return path.join(getAiEvalsRoot(), "history", `${mode}.jsonl`);
}
function defaultArtifactsRoot(resultPath: string): string {
return resultPath.endsWith(".json")
? resultPath.slice(0, -".json".length)
: `${resultPath}.artifacts`;
}
async function writeArtifactFiles(
rootDir: string,
files: BenchmarkArtifactFile[]
): Promise<void> {
for (const file of files) {
const relativePath = normalizeArtifactPath(file.path);
const targetPath = path.join(rootDir, relativePath);
await mkdir(path.dirname(targetPath), { recursive: true });
await writeFile(targetPath, file.content, "utf8");
}
}
function normalizeArtifactPath(filePath: string): string {
const normalized = filePath.replaceAll("\\", "/").replace(/^\/+/, "");
const parts = normalized.split("/").filter(Boolean);
if (parts.length === 0 || parts.some((part) => part === "." || part === "..")) {
throw new Error(`Invalid artifact path: ${filePath}`);
}
return parts.join("/");
}
function toSerializableRunResult(result: BenchmarkRunResult): BenchmarkRunResult {
return {
...result,
cases: result.cases.map((caseResult) => ({
...caseResult,
attempts: caseResult.attempts.map(({ artifactFiles, ...attempt }) => attempt),
})),
};
}
function toHistoryRecord(result: BenchmarkRunResult) {
const judgeScores = result.cases.flatMap((caseResult) =>
caseResult.attempts.flatMap((attempt) =>
typeof attempt.judgeScore === "number" ? [attempt.judgeScore] : []
)
);
return {
createdAt: result.createdAt,
gitSha: result.gitSha,
mode: result.mode,
runs: result.runs,
runModel: result.runModel,
judgeModel: result.judgeModel,
caseCount: result.caseCount,
attemptCount: result.attemptCount,
passedAttempts: result.passedAttempts,
passRate: result.passRate,
averageDurationMs: result.averageDurationMs,
averageJudgeScore:
judgeScores.length === 0
? null
: judgeScores.reduce((sum, score) => sum + score, 0) / judgeScores.length,
averageTokenUsagePerAttempt: result.averageTokenUsagePerAttempt ?? null,
failedCaseIds: Array.from(
new Set(
result.cases
.filter((caseResult) => caseResult.attempts.some((attempt) => !attempt.passed))
.map((caseResult) => caseResult.id)
)
),
cases: result.cases.map((caseResult) => {
const attemptCount = caseResult.attempts.length;
const passedAttempts = caseResult.attempts.filter((attempt) => attempt.passed).length;
const totalDurationMs = caseResult.attempts.reduce(
(sum, attempt) => sum + attempt.durationMs,
0
);
const judgeScores = caseResult.attempts.flatMap((attempt) =>
typeof attempt.judgeScore === "number" ? [attempt.judgeScore] : []
);
const totalTokenUsage = caseResult.attempts.reduce<BenchmarkTokenUsage | null>(
(sum, attempt) => {
if (!attempt.tokenUsage) {
return sum;
}
sum ??= { prompt: 0, completion: 0, total: 0 };
sum.prompt += attempt.tokenUsage.prompt;
sum.completion += attempt.tokenUsage.completion;
sum.total += attempt.tokenUsage.total;
return sum;
},
null
);
return {
id: caseResult.id,
attemptCount,
passedAttempts,
passRate: attemptCount === 0 ? 0 : passedAttempts / attemptCount,
averageDurationMs: attemptCount === 0 ? 0 : totalDurationMs / attemptCount,
averageJudgeScore:
judgeScores.length === 0
? null
: judgeScores.reduce((sum, score) => sum + score, 0) / judgeScores.length,
averageTokenUsagePerAttempt:
attemptCount === 0 || !totalTokenUsage
? null
: {
prompt: totalTokenUsage.prompt / attemptCount,
completion: totalTokenUsage.completion / attemptCount,
total: totalTokenUsage.total / attemptCount,
},
};
}),
};
}
function getGitSha(): string | null {
try {
return execFileSync("git", ["rev-parse", "HEAD"], {
cwd: getRepoRoot(),
encoding: "utf8",
stdio: ["ignore", "pipe", "ignore"],
}).trim();
} catch {
return null;
}
}
function formatPercent(value: number): string {
return `${(value * 100).toFixed(1)}%`;
}
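
The path guard in `normalizeArtifactPath` above is what keeps artifact files inside the attempt directory. The sketch below reproduces its logic with sample inputs; it uses a global regex in place of `replaceAll` purely so it compiles on older targets, which is a change from the original and not a claim about the project's configuration.

```typescript
// Normalize an artifact path to forward slashes, strip leading slashes, and
// reject empty paths or any "." / ".." segment.
function normalizeArtifactPath(filePath: string): string {
  const normalized = filePath.replace(/\\/g, "/").replace(/^\/+/, "");
  const parts = normalized.split("/").filter(Boolean);
  if (parts.length === 0 || parts.some((part) => part === "." || part === "..")) {
    throw new Error(`Invalid artifact path: ${filePath}`);
  }
  return parts.join("/");
}

console.log(normalizeArtifactPath("/f/flow.yaml")); // leading slash removed
console.log(normalizeArtifactPath("a\\b//c.ts")); // backslashes and empty segments collapsed

try {
  normalizeArtifactPath("../secret");
} catch (error) {
  console.log((error as Error).message);
}
```

Rejecting `.` and `..` segments outright, instead of resolving them, means a later `path.join(rootDir, relativePath)` cannot climb out of `rootDir`.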


@@ -1,264 +0,0 @@
import { judgeOutput, DEFAULT_JUDGE_MODEL } from "./judge";
import type {
BenchmarkAttemptResult,
BenchmarkCaseResult,
BenchmarkCheck,
EvalCase,
FrontendBenchmarkProgressEvent,
ModeRunner,
} from "./types";
export async function runSuite<TInitial, TExpected, TActual>(input: {
modeRunner: ModeRunner<TInitial, TExpected, TActual>;
cases: EvalCase[];
runs: number;
runModel: string | null;
judgeModel?: string | null;
concurrency?: number;
verbose?: boolean;
onProgress?: (event: FrontendBenchmarkProgressEvent) => void;
}): Promise<BenchmarkCaseResult[]> {
const judgeModel = input.judgeModel ?? DEFAULT_JUDGE_MODEL;
const concurrency = Math.max(1, input.concurrency ?? input.modeRunner.concurrency);
const results = new Array<BenchmarkCaseResult>(input.cases.length);
let cursor = 0;
if (input.modeRunner.mode !== "cli") {
input.onProgress?.({
type: "run-start",
surface: input.modeRunner.mode,
totalCases: input.cases.length,
runs: input.runs,
concurrency,
});
}
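// Each worker claims the next case index from the shared cursor. The increment runs synchronously, so no two workers pick the same case.
async function worker(): Promise<void> {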
while (true) {
const caseIndex = cursor++;
if (caseIndex >= input.cases.length) {
return;
}
const evalCase = input.cases[caseIndex];
results[caseIndex] = {
id: evalCase.id,
prompt: evalCase.prompt,
initialPath: evalCase.initialPath,
expectedPath: evalCase.expectedPath,
attempts: await runCaseAttempts({
caseIndex,
evalCase,
runs: input.runs,
judgeModel,
judgeThreshold: input.modeRunner.judgeThreshold ?? 80,
modeRunner: input.modeRunner,
totalCases: input.cases.length,
verbose: input.verbose ?? false,
onProgress: input.onProgress,
}),
};
}
}
await Promise.all(
Array.from({ length: Math.min(concurrency, input.cases.length) }, () => worker())
);
return results;
}
async function runCaseAttempts<TInitial, TExpected, TActual>(input: {
caseIndex: number;
evalCase: EvalCase;
runs: number;
judgeModel: string;
judgeThreshold: number;
modeRunner: ModeRunner<TInitial, TExpected, TActual>;
totalCases: number;
verbose: boolean;
onProgress?: (event: FrontendBenchmarkProgressEvent) => void;
}): Promise<BenchmarkAttemptResult[]> {
const attempts: BenchmarkAttemptResult[] = [];
const surface = input.modeRunner.mode === "cli" ? null : input.modeRunner.mode;
for (let attempt = 1; attempt <= input.runs; attempt += 1) {
if (surface) {
input.onProgress?.({
type: "attempt-start",
surface,
caseId: input.evalCase.id,
caseNumber: input.caseIndex + 1,
totalCases: input.totalCases,
attempt,
runs: input.runs,
});
}
const startedAt = Date.now();
try {
const initial = await input.modeRunner.loadInitial(input.evalCase.initialPath);
const expected = await input.modeRunner.loadExpected(input.evalCase.expectedPath);
const run = await input.modeRunner.run(input.evalCase.prompt, initial, {
caseId: input.evalCase.id,
caseNumber: input.caseIndex + 1,
totalCases: input.totalCases,
attempt,
runs: input.runs,
verbose: input.verbose,
onAssistantMessageStart: input.verbose && surface
? () =>
input.onProgress?.({
type: "assistant-message-start",
surface,
caseId: input.evalCase.id,
caseNumber: input.caseIndex + 1,
totalCases: input.totalCases,
attempt,
runs: input.runs,
})
: undefined,
onAssistantChunk: input.verbose && surface
? (chunk: string) =>
input.onProgress?.({
type: "assistant-chunk",
surface,
caseId: input.evalCase.id,
caseNumber: input.caseIndex + 1,
totalCases: input.totalCases,
attempt,
runs: input.runs,
chunk,
})
: undefined,
onAssistantMessageEnd: input.verbose && surface
? () =>
input.onProgress?.({
type: "assistant-message-end",
surface,
caseId: input.evalCase.id,
caseNumber: input.caseIndex + 1,
totalCases: input.totalCases,
attempt,
runs: input.runs,
})
: undefined,
});
const checks: BenchmarkCheck[] = [
buildCheck("run succeeded", run.success, run.error),
...input.modeRunner.validate({
evalCase: input.evalCase,
prompt: input.evalCase.prompt,
initial,
expected,
actual: run.actual,
run,
}),
];
let judgeScore: number | null = null;
let judgeSummary: string | null = null;
if (run.success) {
const judge = await judgeOutput({
mode: input.modeRunner.mode,
prompt: input.evalCase.prompt,
checklist: input.evalCase.judgeChecklist,
initial,
expected: input.modeRunner.mode === "cli" ? undefined : expected,
actual: run.actual,
model: input.judgeModel,
});
judgeScore = judge.success ? judge.score : null;
judgeSummary = judge.summary;
checks.push(buildCheck("judge succeeded", judge.success, judge.error));
checks.push(
buildCheck(
`judge score >= ${input.judgeThreshold}`,
(judgeScore ?? 0) >= input.judgeThreshold,
judge.success ? `score=${judgeScore}` : judge.error
)
);
}
const artifactFiles = input.modeRunner.buildArtifacts?.(run.actual) ?? [];
const attemptResult: BenchmarkAttemptResult = {
attempt,
passed: checks.every((check) => check.passed),
durationMs: Date.now() - startedAt,
assistantMessageCount: run.assistantMessageCount,
toolCallCount: run.toolCallCount,
toolsUsed: uniqueStrings(run.toolsUsed),
skillsInvoked: uniqueStrings(run.skillsInvoked),
checks,
judgeScore,
judgeSummary,
error: run.error ?? null,
tokenUsage: run.tokenUsage ?? null,
artifactsPath: null,
artifactFiles,
};
if (surface) {
input.onProgress?.({
type: "attempt-finish",
surface,
caseId: input.evalCase.id,
caseNumber: input.caseIndex + 1,
totalCases: input.totalCases,
attempt,
runs: input.runs,
passed: attemptResult.passed,
durationMs: attemptResult.durationMs,
judgeScore: attemptResult.judgeScore,
error: attemptResult.error,
});
}
attempts.push(attemptResult);
} catch (error) {
const message = error instanceof Error ? error.message : String(error);
const failedAttempt: BenchmarkAttemptResult = {
attempt,
passed: false,
durationMs: Date.now() - startedAt,
assistantMessageCount: 0,
toolCallCount: 0,
toolsUsed: [],
skillsInvoked: [],
checks: [buildCheck("run crashed", false, message)],
judgeScore: null,
judgeSummary: null,
error: message,
tokenUsage: null,
};
if (surface) {
input.onProgress?.({
type: "attempt-finish",
surface,
caseId: input.evalCase.id,
caseNumber: input.caseIndex + 1,
totalCases: input.totalCases,
attempt,
runs: input.runs,
passed: false,
durationMs: failedAttempt.durationMs,
judgeScore: null,
error: message,
});
}
attempts.push(failedAttempt);
}
}
return attempts;
}
// Attaches `details` whenever it is provided, including on passing checks (e.g. the judge score).
function buildCheck(name: string, passed: boolean, details?: string): BenchmarkCheck {
return details ? { name, passed, details } : { name, passed };
}
function uniqueStrings(values: string[]): string[] {
return [...new Set(values)];
}
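
A minimal sketch of how an attempt's verdict follows from its checks, assuming the shapes used in the runner above. `makeCheck`, `judgeChecks`, and `attemptPassed` are hypothetical names introduced here for illustration, and the scores 85 and 80 are made-up values, not thresholds taken from the original module.

```typescript
interface Check {
  name: string;
  passed: boolean;
  details?: string;
}

// Mirrors buildCheck above: details are attached only when provided.
function makeCheck(name: string, passed: boolean, details?: string): Check {
  return details ? { name, passed, details } : { name, passed };
}

// Hypothetical helper: the two judge-related checks for one attempt.
// A failed judge yields a null score, which counts as 0 against the threshold.
function judgeChecks(
  judge: { success: boolean; score: number; error?: string },
  threshold: number
): Check[] {
  const score = judge.success ? judge.score : null;
  return [
    makeCheck("judge succeeded", judge.success, judge.error),
    makeCheck(
      `judge score >= ${threshold}`,
      (score ?? 0) >= threshold,
      judge.success ? `score=${score}` : judge.error
    ),
  ];
}

// Hypothetical helper: an attempt passes only if every check passed.
function attemptPassed(checks: Check[]): boolean {
  return checks.every((c) => c.passed);
}

const ok: Check[] = [
  makeCheck("run succeeded", true),
  ...judgeChecks({ success: true, score: 85 }, 80),
];
const failed: Check[] = [
  makeCheck("run succeeded", true),
  ...judgeChecks({ success: false, score: 0, error: "timeout" }, 80),
];
console.log(attemptPassed(ok), attemptPassed(failed));
```

Because the verdict is a plain conjunction, a judge failure fails the attempt even when the run itself succeeded and every structural check passed.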

View File

@@ -1,198 +0,0 @@
export const EVAL_MODES = ["cli", "flow", "script", "app"] as const;
export type EvalMode = (typeof EVAL_MODES)[number];
export interface FlowValidationSpec {
schemaRequiredPaths?: string[];
schemaAnyOf?: Array<{
requiredPaths: string[];
}>;
resolveResultsRefs?: boolean;
requireSpecialModules?: Array<"preprocessor_module" | "failure_module">;
requireSuspendSteps?: Array<{
id: string;
requiredEvents?: number;
resumeRequiredStringFieldAnyOf?: string[];
}>;
}
export interface EvalCase {
id: string;
prompt: string;
initialPath?: string;
expectedPath?: string;
validate?: FlowValidationSpec;
judgeChecklist?: string[];
}
export interface BenchmarkCheck {
name: string;
passed: boolean;
details?: string;
}
export interface JudgeResult {
success: boolean;
score: number;
summary: string;
error?: string;
}
export interface BenchmarkArtifactFile {
path: string;
content: string;
}
export interface BenchmarkTokenUsage {
prompt: number;
completion: number;
total: number;
}
export interface ModeRunOutput<TActual> {
success: boolean;
actual: TActual;
error?: string;
assistantMessageCount: number;
toolCallCount: number;
toolsUsed: string[];
skillsInvoked: string[];
tokenUsage?: BenchmarkTokenUsage | null;
}
export interface ModeRunContext {
caseId: string;
caseNumber: number;
totalCases: number;
attempt: number;
runs: number;
verbose: boolean;
onAssistantMessageStart?: () => void;
onAssistantChunk?: (chunk: string) => void;
onAssistantMessageEnd?: () => void;
}
export interface ModeRunner<TInitial, TExpected, TActual> {
mode: EvalMode;
concurrency: number;
judgeThreshold?: number;
loadInitial(path?: string): Promise<TInitial | undefined>;
loadExpected(path?: string): Promise<TExpected | undefined>;
run(
prompt: string,
initial: TInitial | undefined,
context: ModeRunContext
): Promise<ModeRunOutput<TActual>>;
validate(input: {
evalCase: EvalCase;
prompt: string;
initial: TInitial | undefined;
expected: TExpected | undefined;
actual: TActual;
run: ModeRunOutput<TActual>;
}): BenchmarkCheck[];
buildArtifacts?(actual: TActual): BenchmarkArtifactFile[];
}
export interface BenchmarkAttemptResult {
attempt: number;
passed: boolean;
durationMs: number;
assistantMessageCount: number;
toolCallCount: number;
toolsUsed: string[];
skillsInvoked: string[];
checks: BenchmarkCheck[];
judgeScore: number | null;
judgeSummary: string | null;
error: string | null;
tokenUsage?: BenchmarkTokenUsage | null;
artifactsPath?: string | null;
artifactFiles?: BenchmarkArtifactFile[];
}
export interface BenchmarkCaseResult {
id: string;
prompt: string;
initialPath?: string;
expectedPath?: string;
attempts: BenchmarkAttemptResult[];
}
export interface BenchmarkRunResult {
version: 1;
mode: EvalMode;
createdAt: string;
gitSha: string | null;
runs: number;
runModel: string | null;
judgeModel: string | null;
caseCount: number;
attemptCount: number;
passedAttempts: number;
passRate: number;
averageDurationMs: number;
totalTokenUsage?: BenchmarkTokenUsage | null;
averageTokenUsagePerAttempt?: BenchmarkTokenUsage | null;
artifactsPath?: string | null;
cases: BenchmarkCaseResult[];
}
export type FrontendBenchmarkProgressEvent =
| {
type: "run-start";
surface: Exclude<EvalMode, "cli">;
totalCases: number;
runs: number;
concurrency: number;
}
| {
type: "attempt-start";
surface: Exclude<EvalMode, "cli">;
caseId: string;
caseNumber: number;
totalCases: number;
attempt: number;
runs: number;
}
| {
type: "attempt-finish";
surface: Exclude<EvalMode, "cli">;
caseId: string;
caseNumber: number;
totalCases: number;
attempt: number;
runs: number;
passed: boolean;
durationMs: number;
judgeScore: number | null;
error: string | null;
}
| {
type: "assistant-message-start";
surface: Exclude<EvalMode, "cli">;
caseId: string;
caseNumber: number;
totalCases: number;
attempt: number;
runs: number;
}
| {
type: "assistant-chunk";
surface: Exclude<EvalMode, "cli">;
caseId: string;
caseNumber: number;
totalCases: number;
attempt: number;
runs: number;
chunk: string;
}
| {
type: "assistant-message-end";
surface: Exclude<EvalMode, "cli">;
caseId: string;
caseNumber: number;
totalCases: number;
attempt: number;
runs: number;
};
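
A short sketch of how a consumer might handle the progress event union above, assuming a trimmed-down copy with only two of its variants. `AttemptEvent` and `describeEvent` are hypothetical names used for illustration; the original code only defines the event types.

```typescript
type Surface = "flow" | "script" | "app";

// Trimmed-down copy of two variants of the progress event union.
type AttemptEvent =
  | {
      type: "attempt-start";
      surface: Surface;
      caseId: string;
      attempt: number;
      runs: number;
    }
  | {
      type: "attempt-finish";
      surface: Surface;
      caseId: string;
      attempt: number;
      runs: number;
      passed: boolean;
      durationMs: number;
    };

// Hypothetical formatter: switching on `type` narrows the union, so
// `passed` and `durationMs` are only accessible in the finish branch.
function describeEvent(event: AttemptEvent): string {
  const prefix = `[${event.surface}] ${event.caseId} ${event.attempt}/${event.runs}`;
  switch (event.type) {
    case "attempt-start":
      return `${prefix} started`;
    case "attempt-finish":
      return `${prefix} ${event.passed ? "passed" : "failed"} in ${event.durationMs}ms`;
  }
}

console.log(
  describeEvent({ type: "attempt-start", surface: "flow", caseId: "greet", attempt: 1, runs: 3 })
);
```

Fields shared by every variant (`surface`, `caseId`, `attempt`, `runs`) can be read before narrowing, which keeps the common prefix in one place.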

View File

@@ -1,36 +0,0 @@
import { describe, expect, it } from "bun:test";
import { validateScriptState } from "./validators";
describe("validateScriptState", () => {
it("accepts semantically equivalent script implementations", () => {
const checks = validateScriptState({
actual: {
path: "f/evals/greet_user.ts",
lang: "bun",
code: "export async function main(name: string): Promise<string> {\n return `Hello, ${name}!`;\n}\n",
},
expected: {
path: "f/evals/greet_user.ts",
lang: "bun",
code: "export async function main(name: string) {\n\treturn `Hello, ${name}!`\n}\n",
},
});
expect(checks.every((check) => check.passed)).toBe(true);
});
it("still requires an exported main entrypoint", () => {
const checks = validateScriptState({
actual: {
path: "f/evals/greet_user.ts",
lang: "bun",
code: "async function main(name: string) {\n return `Hello, ${name}!`;\n}\n",
},
});
expect(checks).toContainEqual({
name: "script exports entrypoint",
passed: false,
});
});
});

View File

@@ -1,997 +0,0 @@
import path from "node:path";
import ts from "typescript";
import type { BenchmarkCheck, FlowValidationSpec } from "./types";
export interface ScriptState {
path: string;
lang: string;
args?: Record<string, unknown>;
code: string;
}
export interface FlowState {
summary?: string;
value?: {
preprocessor_module?: Record<string, unknown>;
failure_module?: Record<string, unknown>;
modules?: Array<Record<string, unknown>>;
[key: string]: unknown;
};
schema?: Record<string, unknown>;
}
export interface AppFilesState {
frontend: Record<string, string>;
backend: Record<string, AppRunnableState>;
}
export interface AppRunnableState {
type?: string;
name?: string;
path?: string;
inlineScript?: {
language?: string;
content?: string;
};
}
const TS_LIKE_LANGUAGES = new Set(["bun", "deno", "nativets", "bunnative", "ts", "typescript"]);
const CONTROL_FLOW_MODULE_TYPES = new Set(["branchone", "branchall", "forloopflow", "whileloopflow"]);
export function validateScriptState(input: {
actual: ScriptState;
initial?: ScriptState;
expected?: ScriptState;
}): BenchmarkCheck[] {
const checks: BenchmarkCheck[] = [
check("script exports entrypoint", hasSupportedEntrypoint(input.actual.code)),
check("script has no syntax errors", getScriptSyntaxErrors(input.actual.code, input.actual.lang).length === 0),
];
if (input.expected) {
checks.push(
check(
"script path matches expected",
input.actual.path === input.expected.path,
`expected ${input.expected.path}, got ${input.actual.path}`
)
);
checks.push(
check(
"script language matches expected",
input.actual.lang === input.expected.lang,
`expected ${input.expected.lang}, got ${input.actual.lang}`
)
);
}
if (input.initial) {
checks.push(
check(
"script differs from initial",
normalizeText(input.actual.code) !== normalizeText(input.initial.code)
)
);
}
return checks;
}
export function validateFlowState(input: {
actual: FlowState;
initial?: FlowState;
expected?: FlowState;
validate?: FlowValidationSpec;
}): BenchmarkCheck[] {
const actualModules = getFlowModules(input.actual);
const placeholderModuleIds = getInlineScriptPlaceholderModuleIds(input.actual);
const checks: BenchmarkCheck[] = [
check("flow has modules", actualModules.length > 0),
check(
"flow has no inline placeholder code",
placeholderModuleIds.length === 0,
placeholderModuleIds.length > 0
? `placeholder content in: ${placeholderModuleIds.join(", ")}`
: undefined
),
];
if (input.initial) {
checks.push(
check(
"flow differs from initial",
normalizeJson(input.actual) !== normalizeJson(input.initial)
)
);
}
if (input.expected) {
checks.push(...validateFlowExpectedStructure(input.actual, input.expected));
}
if (input.validate) {
checks.push(...validateFlowRequirements(input.actual, input.validate));
}
return checks;
}
export function validateAppState(input: {
actual: AppFilesState;
initial?: AppFilesState;
expected?: AppFilesState;
}): BenchmarkCheck[] {
const checks: BenchmarkCheck[] = [];
const frontendEntries = Object.entries(input.actual.frontend ?? {});
const backendEntries = Object.entries(input.actual.backend ?? {});
const frontendSyntaxProblems = getAppFrontendSyntaxProblems(input.actual.frontend);
const backendSyntaxProblems = getAppBackendSyntaxProblems(input.actual.backend);
const unresolvedBackendRefs = getUnresolvedBackendReferences(
input.actual.frontend,
input.actual.backend
);
checks.push(check("app has frontend entrypoint", Boolean(input.actual.frontend["/index.tsx"])));
checks.push(
check(
"app has non-empty frontend files",
frontendEntries.some(([, content]) => content.trim().length > 0)
)
);
checks.push(
check(
"frontend files have no syntax errors",
frontendSyntaxProblems.length === 0,
summarizeProblems(frontendSyntaxProblems)
)
);
checks.push(
check(
"backend inline scripts have entrypoints",
backendEntries.every(([, runnable]) => {
if (runnable.type !== "inline") {
return true;
}
return hasSupportedEntrypoint(runnable.inlineScript?.content ?? "");
})
)
);
checks.push(
check(
"backend inline scripts have no syntax errors",
backendSyntaxProblems.length === 0,
summarizeProblems(backendSyntaxProblems)
)
);
checks.push(
check(
"frontend backend references resolve",
unresolvedBackendRefs.length === 0,
summarizeProblems(unresolvedBackendRefs)
)
);
if (input.initial) {
checks.push(check("app differs from initial", !appStatesEqual(input.actual, input.initial)));
}
if (input.expected) {
for (const [filePath, content] of Object.entries(input.expected.frontend)) {
checks.push(
check(
`frontend includes ${filePath}`,
normalizeText(input.actual.frontend[filePath] ?? "") === normalizeText(content)
)
);
}
for (const [runnableName, runnable] of Object.entries(input.expected.backend)) {
const actualRunnable = input.actual.backend[runnableName];
checks.push(check(`backend includes ${runnableName}`, Boolean(actualRunnable)));
if (actualRunnable && runnable.inlineScript?.content) {
checks.push(
check(
`${runnableName} code matches expected`,
normalizeText(actualRunnable.inlineScript?.content ?? "") ===
normalizeText(runnable.inlineScript.content)
)
);
}
}
}
return checks;
}
export function validateCliWorkspace(input: {
actualFiles: Record<string, string>;
expectedFiles?: Record<string, string>;
initialFiles?: Record<string, string>;
}): BenchmarkCheck[] {
const checks: BenchmarkCheck[] = [];
if (input.expectedFiles) {
for (const [filePath, expectedContent] of Object.entries(input.expectedFiles)) {
const actualContent = input.actualFiles[filePath];
checks.push(check(`creates ${filePath}`, actualContent !== undefined));
if (actualContent !== undefined) {
checks.push(
check(
`${filePath} contains expected content`,
cliFileContainsExpectedContent(actualContent, expectedContent)
)
);
}
}
const expectedPaths = new Set(Object.keys(input.expectedFiles));
const unexpectedPaths = Object.keys(input.actualFiles).filter((filePath) => !expectedPaths.has(filePath));
checks.push(
check(
"workspace contains no unexpected files",
unexpectedPaths.length === 0,
summarizeProblems(unexpectedPaths)
)
);
}
if (input.initialFiles) {
checks.push(check("workspace differs from initial", !fileMapsEqual(input.actualFiles, input.initialFiles)));
}
return checks;
}
function cliFileContainsExpectedContent(actualContent: string, expectedContent: string): boolean {
const expectedSnippets = expectedContent
.replace(/\r\n/g, "\n")
.split("\n")
.map((line) => line.trim())
.filter((line) => line.length > 0);
if (expectedSnippets.length === 0) {
return true;
}
const normalizedActual = actualContent.replace(/\r\n/g, "\n");
return expectedSnippets.every((snippet) => normalizedActual.includes(snippet));
}
// Builds a check result; `details` is kept only when the check failed.
function check(name: string, passed: boolean, details?: string): BenchmarkCheck {
return !passed && details ? { name, passed, details } : { name, passed };
}
function normalizeText(value: string): string {
return value.replace(/\r\n/g, "\n").trim();
}
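// Serializes the value as-is; object key order is significant, so reordered keys count as a difference.
function normalizeJson(value: unknown): string {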
return JSON.stringify(value);
}
function summarizeProblems(problems: string[], limit = 5): string | undefined {
if (problems.length === 0) {
return undefined;
}
if (problems.length <= limit) {
return problems.join("; ");
}
return `${problems.slice(0, limit).join("; ")}; ...and ${problems.length - limit} more`;
}
// Accepts `export [async] function main(` or an anonymous `export default [async] function (`.
function hasSupportedEntrypoint(code: string): boolean {
return (
/export\s+(async\s+)?function\s+main\s*\(/.test(code) ||
/export\s+default\s+(async\s+)?function\s*\(/.test(code)
);
}
function getScriptSyntaxErrors(code: string, lang: string): string[] {
if (!TS_LIKE_LANGUAGES.has(lang)) {
return [];
}
return getTypeScriptSyntaxErrors(code, "eval.ts");
}
function getTypeScriptSyntaxErrors(code: string, fileName: string): string[] {
const result = ts.transpileModule(code, {
compilerOptions: {
target: ts.ScriptTarget.ES2022,
module: ts.ModuleKind.ESNext,
jsx: ts.JsxEmit.ReactJSX,
},
reportDiagnostics: true,
fileName,
});
return (result.diagnostics ?? []).map((diagnostic) =>
ts.flattenDiagnosticMessageText(diagnostic.messageText, "\n")
);
}
function getAppFrontendSyntaxProblems(frontend: Record<string, string>): string[] {
const problems: string[] = [];
for (const [filePath, content] of Object.entries(frontend)) {
if (!isFrontendCodeFile(filePath)) {
continue;
}
const errors = getTypeScriptSyntaxErrors(content, filePath);
for (const error of errors) {
problems.push(`${filePath}: ${error}`);
}
}
return problems;
}
function getAppBackendSyntaxProblems(backend: Record<string, AppRunnableState>): string[] {
const problems: string[] = [];
for (const [key, runnable] of Object.entries(backend)) {
if (runnable.type !== "inline") {
continue;
}
const language = runnable.inlineScript?.language ?? "";
const content = runnable.inlineScript?.content ?? "";
for (const error of getScriptSyntaxErrors(content, language)) {
problems.push(`${key}: ${error}`);
}
}
return problems;
}
function isFrontendCodeFile(filePath: string): boolean {
const extension = path.extname(filePath).toLowerCase();
return extension === ".ts" || extension === ".tsx" || extension === ".js" || extension === ".jsx";
}
function getUnresolvedBackendReferences(
frontend: Record<string, string>,
backend: Record<string, AppRunnableState>
): string[] {
const backendKeys = new Set(Object.keys(backend));
const unresolved = new Set<string>();
for (const [filePath, content] of Object.entries(frontend)) {
for (const key of extractBackendCallKeys(content)) {
if (!backendKeys.has(key)) {
unresolved.add(`${filePath} references missing backend.${key}()`);
}
}
}
return [...unresolved];
}
function extractBackendCallKeys(content: string): string[] {
const matches = content.matchAll(/\bbackend\.([A-Za-z_][A-Za-z0-9_]*)\s*\(/g);
return [...new Set([...matches].map((match) => match[1]))];
}
function getFlowModules(flow: FlowState): Array<Record<string, unknown>> {
return Array.isArray(flow.value?.modules) ? flow.value.modules : [];
}
function validateFlowExpectedStructure(
actual: FlowState,
expected: FlowState
): BenchmarkCheck[] {
const checks: BenchmarkCheck[] = [];
const expectedTopLevelModules = getFlowModules(expected);
const actualTopLevelModules = getFlowModules(actual);
const expectedSchemaFields = getTopLevelSchemaFields(expected.schema);
if (expectedSchemaFields.length > 0) {
checks.push(
check(
"flow schema includes expected top-level fields",
expectedSchemaFields.every((field) => hasSchemaPath(actual.schema, field)),
`missing one of: ${expectedSchemaFields.join(", ")}`
)
);
}
if (expectedTopLevelModules.length > 0) {
const actualIds = actualTopLevelModules
.map((module) => (typeof module.id === "string" ? module.id : null))
.filter((id): id is string => Boolean(id));
const expectedIds = expectedTopLevelModules
.map((module) => (typeof module.id === "string" ? module.id : null))
.filter((id): id is string => Boolean(id));
checks.push(
check(
"flow includes expected top-level step ids",
expectedIds.every((id) => actualIds.includes(id)),
`expected ids: ${expectedIds.join(", ")}; actual ids: ${actualIds.join(", ")}`
)
);
checks.push(
check(
"flow preserves expected top-level step order",
preservesRelativeOrder(actualIds, expectedIds),
`expected order: ${expectedIds.join(" -> ")}; actual ids: ${actualIds.join(" -> ")}`
)
);
for (const expectedModule of expectedTopLevelModules) {
const moduleId = typeof expectedModule.id === "string" ? expectedModule.id : null;
if (!moduleId) {
continue;
}
const actualModule = actualTopLevelModules.find((module) => module.id === moduleId);
if (!actualModule) {
continue;
}
const expectedType = getModuleType(expectedModule);
if (expectedType && !(hasSuspendConfig(expectedModule) || hasSuspendConfig(actualModule))) {
checks.push(
check(
`${moduleId} type matches expected`,
getModuleType(actualModule) === expectedType,
`expected ${expectedType}, got ${getModuleType(actualModule) ?? "(missing)"}`
)
);
}
const expectedPath = getModulePath(expectedModule);
if (expectedPath) {
checks.push(
check(
`${moduleId} path matches expected`,
getModulePath(actualModule) === expectedPath,
`expected ${expectedPath}, got ${getModulePath(actualModule) ?? "(missing)"}`
)
);
}
}
}
for (const specialModuleKey of ["preprocessor_module", "failure_module"] as const) {
const expectedSpecialModule = getSpecialFlowModule(expected, specialModuleKey);
if (!expectedSpecialModule) {
continue;
}
const actualSpecialModule = getSpecialFlowModule(actual, specialModuleKey);
checks.push(check(`${specialModuleKey} matches expected presence`, Boolean(actualSpecialModule)));
if (!actualSpecialModule) {
continue;
}
const expectedType = getModuleType(expectedSpecialModule);
if (expectedType) {
checks.push(
check(
`${specialModuleKey} type matches expected`,
getModuleType(actualSpecialModule) === expectedType,
`expected ${expectedType}, got ${getModuleType(actualSpecialModule) ?? "(missing)"}`
)
);
}
}
return checks;
}
function validateFlowRequirements(
flow: FlowState,
validate: FlowValidationSpec
): BenchmarkCheck[] {
const checks: BenchmarkCheck[] = [];
for (const requiredPath of validate.schemaRequiredPaths ?? []) {
checks.push(
check(
`schema includes ${requiredPath}`,
hasSchemaPath(flow.schema, requiredPath),
`missing schema path ${requiredPath}`
)
);
}
if (validate.schemaAnyOf && validate.schemaAnyOf.length > 0) {
const matchingVariant = validate.schemaAnyOf.find((variant) =>
variant.requiredPaths.every((requiredPath) => hasSchemaPath(flow.schema, requiredPath))
);
checks.push(
check(
"schema matches one accepted input shape",
Boolean(matchingVariant),
matchingVariant
? undefined
: `expected one of: ${validate.schemaAnyOf
.map((variant) => `[${variant.requiredPaths.join(", ")}]`)
.join(" or ")}`
)
);
}
if (validate.resolveResultsRefs) {
const unresolved = collectUnresolvedResultsRefs(flow);
checks.push(
check(
"results references resolve",
unresolved.length === 0,
unresolved.length > 0 ? unresolved.join("; ") : undefined
)
);
}
for (const specialModule of validate.requireSpecialModules ?? []) {
checks.push(
check(
`${specialModule} exists`,
Boolean(getSpecialFlowModule(flow, specialModule))
)
);
}
for (const suspendStep of validate.requireSuspendSteps ?? []) {
const module = findFlowModuleById(flow, suspendStep.id);
checks.push(check(`${suspendStep.id} step exists`, Boolean(module)));
if (!module) {
continue;
}
checks.push(check(`${suspendStep.id} includes suspend config`, hasSuspendConfig(module)));
if (!hasSuspendConfig(module)) {
continue;
}
if (suspendStep.requiredEvents !== undefined) {
checks.push(
check(
`${suspendStep.id} requires ${suspendStep.requiredEvents} approval event${suspendStep.requiredEvents === 1 ? "" : "s"}`,
getSuspendRequiredEvents(module) === suspendStep.requiredEvents,
`expected ${suspendStep.requiredEvents}, got ${getSuspendRequiredEvents(module) ?? "(missing)"}`
)
);
}
if (
suspendStep.resumeRequiredStringFieldAnyOf &&
suspendStep.resumeRequiredStringFieldAnyOf.length > 0
) {
const stringFields = getSuspendResumeStringFields(module);
checks.push(
check(
`${suspendStep.id} resume form includes one accepted comment field`,
suspendStep.resumeRequiredStringFieldAnyOf.some((field) =>
stringFields.includes(field)
),
`expected one of [${suspendStep.resumeRequiredStringFieldAnyOf.join(", ")}], got [${stringFields.join(", ")}]`
)
);
}
}
return checks;
}
function hasSchemaPath(schema: Record<string, unknown> | undefined, dottedPath: string): boolean {
if (!schema || typeof schema !== "object") {
return false;
}
const segments = dottedPath.split(".").filter(Boolean);
if (segments.length === 0) {
return false;
}
let current: Record<string, unknown> | undefined = schema;
for (const segment of segments) {
const properties = current?.properties;
if (!properties || typeof properties !== "object") {
return false;
}
const next = (properties as Record<string, unknown>)[segment];
if (!next || typeof next !== "object") {
return false;
}
current = next as Record<string, unknown>;
}
return true;
}
function getTopLevelSchemaFields(schema: Record<string, unknown> | undefined): string[] {
if (!schema || typeof schema !== "object") {
return [];
}
const properties = schema.properties;
if (!properties || typeof properties !== "object") {
return [];
}
return Object.keys(properties as Record<string, unknown>).filter((key) => key.length > 0);
}
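// True when expectedIds occurs in actualIds as an ordered subsequence; extra ids may be interleaved.
function preservesRelativeOrder(actualIds: string[], expectedIds: string[]): boolean {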
if (expectedIds.length === 0) {
return true;
}
let cursor = 0;
for (const actualId of actualIds) {
if (actualId === expectedIds[cursor]) {
cursor += 1;
if (cursor === expectedIds.length) {
return true;
}
}
}
return false;
}
function collectUnresolvedResultsRefs(flow: FlowState): string[] {
const unresolved = new Set<string>();
validateModuleSequence(getFlowModules(flow), new Map<string, Record<string, unknown>>(), unresolved);
return [...unresolved];
}
function validateModuleSequence(
modules: Array<Record<string, unknown>>,
parentVisibleModules: Map<string, Record<string, unknown>>,
unresolved: Set<string>
): void {
const visibleModules = new Map(parentVisibleModules);
for (const module of modules) {
validateResultsRefsInRecord(module, visibleModules, unresolved);
validateNestedModuleResultsRefs(module, visibleModules, unresolved);
if (typeof module.id === "string" && module.id.length > 0) {
visibleModules.set(module.id, module);
}
}
}
function validateNestedModuleResultsRefs(
module: Record<string, unknown>,
visibleModules: Map<string, Record<string, unknown>>,
unresolved: Set<string>
): void {
const value = isObjectRecord(module.value) ? module.value : null;
if (!value) {
return;
}
const nestedSequences: Array<Array<Record<string, unknown>>> = [];
if (Array.isArray(value.modules)) {
nestedSequences.push(asModuleArray(value.modules));
}
if (Array.isArray(value.default)) {
nestedSequences.push(asModuleArray(value.default));
}
if (Array.isArray(value.branches)) {
for (const branch of value.branches) {
if (!isObjectRecord(branch)) {
continue;
}
if (typeof branch.expr === "string") {
validateResultsRefsInExpression(
branch.expr,
`branch ${module.id ?? "(unnamed)"}`,
visibleModules,
unresolved
);
}
if (Array.isArray(branch.modules)) {
nestedSequences.push(asModuleArray(branch.modules));
}
}
}
for (const sequence of nestedSequences) {
validateModuleSequence(sequence, visibleModules, unresolved);
}
}
function validateResultsRefsInRecord(
value: unknown,
visibleModules: Map<string, Record<string, unknown>>,
unresolved: Set<string>,
context = "expression"
): void {
if (typeof value === "string") {
validateResultsRefsInExpression(value, context, visibleModules, unresolved);
return;
}
if (Array.isArray(value)) {
for (const entry of value) {
validateResultsRefsInRecord(entry, visibleModules, unresolved, context);
}
return;
}
if (!isObjectRecord(value)) {
return;
}
for (const [key, entry] of Object.entries(value)) {
if (key === "content" || key === "modules" || key === "branches" || key === "default") {
continue;
}
validateResultsRefsInRecord(entry, visibleModules, unresolved, key);
}
}
function validateResultsRefsInExpression(
expression: string,
context: string,
visibleModules: Map<string, Record<string, unknown>>,
unresolved: Set<string>
): void {
for (const ref of extractResultsRefs(expression)) {
const module = visibleModules.get(ref.root);
if (!module) {
unresolved.add(`${context} references missing results.${ref.root}`);
continue;
}
validateNestedResultsRefPath(ref.root, ref.path, module, context, unresolved);
}
}
// Collects the unique `results.<id>[.<segment>...]` references found in an expression.
function extractResultsRefs(
expression: string
): Array<{ root: string; path: string[] }> {
const matches = expression.matchAll(/\bresults\.([A-Za-z0-9_-]+)((?:\.[A-Za-z0-9_-]+)*)/g);
const refs = new Map<string, { root: string; path: string[] }>();
for (const match of matches) {
const root = match[1];
const path = match[2]
.split(".")
.filter(Boolean);
const key = `${root}:${path.join(".")}`;
refs.set(key, { root, path });
}
return [...refs.values()];
}
function validateNestedResultsRefPath(
rootId: string,
path: string[],
module: Record<string, unknown>,
context: string,
unresolved: Set<string>
): void {
if (path.length === 0) {
return;
}
const moduleType = getModuleType(module);
if (!moduleType || !CONTROL_FLOW_MODULE_TYPES.has(moduleType)) {
return;
}
const nestedIds = new Set(getImmediateNestedModuleIds(module));
const [firstSegment] = path;
if (nestedIds.has(firstSegment)) {
unresolved.add(
`${context} references nested results.${rootId}.${firstSegment} inside ${moduleType} ${rootId}`
);
}
}
function getAllFlowModules(flow: FlowState): Array<Record<string, unknown>> {
const modules: Array<Record<string, unknown>> = [];
const specialModules = ["preprocessor_module", "failure_module"] as const;
for (const key of specialModules) {
const specialModule = getSpecialFlowModule(flow, key);
if (specialModule) {
modules.push(specialModule);
modules.push(...collectNestedModules(specialModule));
}
}
for (const module of getFlowModules(flow)) {
modules.push(module);
modules.push(...collectNestedModules(module));
}
return modules;
}
function collectNestedModules(module: Record<string, unknown>): Array<Record<string, unknown>> {
const nested: Array<Record<string, unknown>> = [];
const value = isObjectRecord(module.value) ? module.value : null;
if (!value) {
return nested;
}
if (Array.isArray(value.modules)) {
for (const child of asModuleArray(value.modules)) {
nested.push(child, ...collectNestedModules(child));
}
}
if (Array.isArray(value.default)) {
for (const child of asModuleArray(value.default)) {
nested.push(child, ...collectNestedModules(child));
}
}
if (Array.isArray(value.branches)) {
for (const branch of value.branches) {
if (!isObjectRecord(branch) || !Array.isArray(branch.modules)) {
continue;
}
for (const child of asModuleArray(branch.modules)) {
nested.push(child, ...collectNestedModules(child));
}
}
}
return nested;
}
function findFlowModuleById(flow: FlowState, id: string): Record<string, unknown> | null {
for (const module of getAllFlowModules(flow)) {
if (module.id === id) {
return module;
}
}
return null;
}
function getInlineScriptPlaceholderModuleIds(flow: FlowState): string[] {
return getAllFlowModules(flow).flatMap((module) => {
const code = getModuleCode(module)?.trim();
if (!code || !/^inline_script\.[A-Za-z0-9_-]+$/.test(code)) {
return [];
}
if (typeof module.id === "string" && module.id.length > 0) {
return [module.id];
}
return ["(unnamed)"];
});
}
function getImmediateNestedModuleIds(module: Record<string, unknown>): string[] {
const ids: string[] = [];
const value = isObjectRecord(module.value) ? module.value : null;
if (!value) {
return ids;
}
if (Array.isArray(value.modules)) {
ids.push(...asModuleArray(value.modules).flatMap((child) => (typeof child.id === "string" ? [child.id] : [])));
}
if (Array.isArray(value.default)) {
ids.push(...asModuleArray(value.default).flatMap((child) => (typeof child.id === "string" ? [child.id] : [])));
}
if (Array.isArray(value.branches)) {
for (const branch of value.branches) {
if (!isObjectRecord(branch) || !Array.isArray(branch.modules)) {
continue;
}
ids.push(
...asModuleArray(branch.modules).flatMap((child) => (typeof child.id === "string" ? [child.id] : []))
);
}
}
return ids;
}
function getModuleCode(module: Record<string, unknown>): string | null {
const value = isObjectRecord(module.value) ? module.value : null;
return typeof value?.content === "string" ? value.content : null;
}
function asModuleArray(value: unknown[]): Array<Record<string, unknown>> {
return value.filter(isObjectRecord);
}
function isObjectRecord(value: unknown): value is Record<string, any> {
return typeof value === "object" && value !== null && !Array.isArray(value);
}
function getSpecialFlowModule(
flow: FlowState,
key: "preprocessor_module" | "failure_module"
): Record<string, unknown> | null {
if (!flow.value || typeof flow.value !== "object") {
return null;
}
const module = (flow.value as Record<string, unknown>)[key];
return module && typeof module === "object" ? (module as Record<string, unknown>) : null;
}
function getModuleType(module: Record<string, unknown>): string | null {
const value = module.value;
if (!value || typeof value !== "object") {
return null;
}
return typeof (value as Record<string, unknown>).type === "string"
? ((value as Record<string, string>).type)
: null;
}
function getModulePath(module: Record<string, unknown>): string | null {
const value = module.value;
if (!value || typeof value !== "object") {
return null;
}
return typeof (value as Record<string, unknown>).path === "string"
? ((value as Record<string, string>).path)
: null;
}
function hasSuspendConfig(module: Record<string, unknown>): boolean {
return typeof module.suspend === "object" && module.suspend !== null;
}
function getSuspendRequiredEvents(module: Record<string, unknown>): number | null {
const suspend = isObjectRecord(module.suspend) ? module.suspend : null;
return typeof suspend?.required_events === "number" ? suspend.required_events : null;
}
function getSuspendResumeStringFields(module: Record<string, unknown>): string[] {
const suspend = isObjectRecord(module.suspend) ? module.suspend : null;
const resumeForm = isObjectRecord(suspend?.resume_form) ? suspend.resume_form : null;
const schema = isObjectRecord(resumeForm?.schema) ? resumeForm.schema : null;
const properties = isObjectRecord(schema?.properties) ? schema.properties : null;
if (!properties) {
return [];
}
return Object.entries(properties).flatMap(([field, property]) => {
if (!isObjectRecord(property) || property.type !== "string") {
return [];
}
return [field];
});
}
function appStatesEqual(left: AppFilesState, right: AppFilesState): boolean {
return fileMapsEqual(left.frontend, right.frontend) && fileMapsEqual(stringifyBackend(left.backend), stringifyBackend(right.backend));
}
function stringifyBackend(backend: Record<string, AppRunnableState>): Record<string, string> {
const result: Record<string, string> = {};
for (const [key, value] of Object.entries(backend)) {
result[key] = JSON.stringify(value);
}
return result;
}
function fileMapsEqual(left: Record<string, string>, right: Record<string, string>): boolean {
const leftEntries = Object.entries(left).sort(([a], [b]) => a.localeCompare(b));
const rightEntries = Object.entries(right).sort(([a], [b]) => a.localeCompare(b));
if (leftEntries.length !== rightEntries.length) {
return false;
}
return leftEntries.every(([key, value], index) => {
const [otherKey, otherValue] = rightEntries[index];
return key === otherKey && normalizeText(value) === normalizeText(otherValue);
});
}
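
Reviewer note: `fileMapsEqual` above compares two path-to-content maps independently of key order, and compares contents only after `normalizeText`. The sketch below restates it in runnable form. `normalizeText` is defined outside this excerpt, so the version here (CRLF to LF, trailing whitespace trimmed) is a hypothetical stand-in, not the project's implementation.

```typescript
// Hypothetical stand-in: the real normalizeText is not visible in this diff.
// Assumption: it collapses CRLF to LF and drops trailing whitespace.
function normalizeText(text: string): string {
  return text.replace(/\r\n/g, "\n").trimEnd();
}

// Same comparison strategy as in the diff: sort entries by key, then compare pairwise.
function fileMapsEqual(left: Record<string, string>, right: Record<string, string>): boolean {
  const leftEntries = Object.entries(left).sort(([a], [b]) => a.localeCompare(b));
  const rightEntries = Object.entries(right).sort(([a], [b]) => a.localeCompare(b));
  if (leftEntries.length !== rightEntries.length) {
    return false;
  }
  return leftEntries.every(([key, value], index) => {
    const [otherKey, otherValue] = rightEntries[index];
    return key === otherKey && normalizeText(value) === normalizeText(otherValue);
  });
}

// Same files, different key order and line endings.
const first = { "/index.tsx": "export default 1;\n", "/App.tsx": "const x = 1;\r\n" };
const second = { "/App.tsx": "const x = 1;\n", "/index.tsx": "export default 1;" };
const sameFiles = fileMapsEqual(first, second);

// One file's content actually differs.
const changed = { "/App.tsx": "const x = 2;\n", "/index.tsx": "export default 1;" };
const differentFiles = fileMapsEqual(first, changed);
```

Under the stand-in normalizer, `sameFiles` is `true` and `differentFiles` is `false`. How strict the real comparison is depends entirely on what the project's `normalizeText` does.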

View File

@@ -1,2 +0,0 @@
main(name: string)
greeting: `Hello, ${name}!`

View File

@@ -1,3 +0,0 @@
export async function main(name: string) {
return { greeting: `Hello, ${name}!` };
}

View File

@@ -1,2 +0,0 @@
type: script
path: f/lib/format_greeting

View File

@@ -1,3 +0,0 @@
export async function main(name: string) {
return { greeting: `Hello, ${name}!` };
}

View File

@@ -1,2 +0,0 @@
def main(
return {"total": a + b}

View File

@@ -1,20 +0,0 @@
summary: Simple greeting flow
schema:
type: object
properties:
name:
type: string
description: Name to greet
required:
- name
value:
modules:
- id: hello_step
value:
type: rawscript
language: bun
content: !inline hello.ts
input_transforms:
name:
type: javascript
expr: flow_input.name

View File

@@ -1,3 +0,0 @@
export async function main(name: string) {
return { greeting: `Hello, ${name}!` };
}

View File

@@ -1,3 +0,0 @@
export async function main(name: string) {
return { greeting: `Hello, ${name}!` };
}

View File

@@ -1,3 +0,0 @@
export async function main(name: string) {
return { greeting: `Hello, ${name}!` };
}

View File

@@ -1,31 +0,0 @@
{
"summary": "",
"value": {
"modules": [
{
"id": "sum_numbers",
"value": {
"type": "rawscript",
"language": "bun",
"content": "export async function main(a: number, b: number) {\n return a + b;\n}",
"input_transforms": {
"a": {
"type": "javascript",
"expr": "flow_input.a"
},
"b": {
"type": "javascript",
"expr": "flow_input.b"
}
}
}
}
]
},
"schema": {
"$schema": "https://json-schema.org/draft/2020-12/schema",
"properties": {},
"required": [],
"type": "object"
}
}

View File

@@ -1,30 +0,0 @@
{
"value": {
"modules": [
{
"id": "count_until_target",
"value": {
"type": "whileloopflow"
}
},
{
"id": "return_final_count",
"value": {
"type": "rawscript"
}
}
]
},
"schema": {
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"properties": {
"target": {
"type": "number"
}
},
"required": [
"target"
]
}
}

View File

@@ -1,36 +0,0 @@
{
"value": {
"preprocessor_module": {
"id": "preprocessor",
"value": {
"type": "rawscript"
}
},
"failure_module": {
"id": "failure",
"value": {
"type": "rawscript"
}
},
"modules": [
{
"id": "process_event",
"value": {
"type": "rawscript"
}
}
]
},
"schema": {
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"properties": {
"payload": {
"type": "string"
}
},
"required": [
"payload"
]
}
}

View File

@@ -1,44 +0,0 @@
{
"value": {
"modules": [
{
"id": "request_approval",
"suspend": {
"required_events": 1,
"resume_form": {
"schema": {
"approver_comment": {
"type": "string"
}
}
}
},
"value": {
"type": "rawscript"
}
},
{
"id": "finalize_purchase",
"value": {
"type": "rawscript"
}
}
]
},
"schema": {
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"properties": {
"requester_email": {
"type": "string"
},
"amount": {
"type": "number"
}
},
"required": [
"requester_email",
"amount"
]
}
}
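
Reviewer note: the `getSuspendResumeStringFields` helper earlier in this diff reads `suspend.resume_form.schema.properties`, while the fixture above, as displayed, places `approver_comment` directly under `schema`. The sketch below is a condensed restatement of that helper, showing what it returns for each shape. `isObjectRecord` is defined outside this excerpt, so its body here is an assumption, and `asRecord` is a hypothetical convenience wrapper. Whether this fixture is ever passed through the helper is not visible in this excerpt.

```typescript
// Assumed definition: the real isObjectRecord is not shown in this diff.
function isObjectRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Hypothetical helper, used only to keep the sketch short.
function asRecord(value: unknown): Record<string, unknown> | null {
  return isObjectRecord(value) ? value : null;
}

// Condensed restatement of the helper from the diff.
function getSuspendResumeStringFields(module: Record<string, unknown>): string[] {
  const suspend = asRecord(module.suspend);
  const resumeForm = asRecord(suspend?.resume_form);
  const schema = asRecord(resumeForm?.schema);
  const properties = asRecord(schema?.properties);
  if (!properties) {
    return [];
  }
  return Object.entries(properties).flatMap(([field, property]) =>
    isObjectRecord(property) && property.type === "string" ? [field] : []
  );
}

// Shape the helper expects: string fields listed under schema.properties.
const nestedModule = {
  id: "request_approval",
  suspend: {
    required_events: 1,
    resume_form: {
      schema: {
        properties: { approver_comment: { type: "string" }, amount: { type: "number" } },
      },
    },
  },
};

// Shape shown in the fixture above: the field sits directly under schema.
const flatModule = {
  id: "request_approval",
  suspend: {
    required_events: 1,
    resume_form: { schema: { approver_comment: { type: "string" } } },
  },
};

const nestedFields = getSuspendResumeStringFields(nestedModule);
const flatFields = getSuspendResumeStringFields(flatModule);
```

With these inputs `nestedFields` contains only `approver_comment` (the numeric field is filtered out), and `flatFields` is empty because there is no `properties` object to iterate.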

View File

@@ -1,39 +0,0 @@
{
"value": {
"modules": [
{
"id": "sum_numbers",
"value": {
"type": "script",
"path": "f/evals/add_two_numbers.ts",
"input_transforms": {
"a": {
"type": "javascript",
"expr": "flow_input.a"
},
"b": {
"type": "javascript",
"expr": "flow_input.b"
}
}
}
}
]
},
"schema": {
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"properties": {
"a": {
"type": "number"
},
"b": {
"type": "number"
}
},
"required": [
"a",
"b"
]
}
}

View File

@@ -1,39 +0,0 @@
{
"value": {
"modules": [
{
"id": "call_add_numbers",
"value": {
"type": "flow",
"path": "f/evals/add_numbers_flow",
"input_transforms": {
"a": {
"type": "javascript",
"expr": "flow_input.a"
},
"b": {
"type": "javascript",
"expr": "flow_input.b"
}
}
}
}
]
},
"schema": {
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"properties": {
"a": {
"type": "number"
},
"b": {
"type": "number"
}
},
"required": [
"a",
"b"
]
}
}

View File

@@ -1,24 +0,0 @@
{
"value": {
"modules": [
{
"id": "route_by_tier",
"value": {
"type": "branchone"
}
}
]
},
"schema": {
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"properties": {
"tier": {
"type": "string"
}
},
"required": [
"tier"
]
}
}

View File

@@ -1,29 +0,0 @@
{
"workspace": {
"scripts": [
{
"path": "f/evals/add_two_numbers.ts",
"summary": "Add two numbers",
"description": "Returns the sum of two numeric inputs.",
"language": "bun",
"schema": {
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"properties": {
"a": {
"type": "number"
},
"b": {
"type": "number"
}
},
"required": [
"a",
"b"
]
},
"content": "export async function main(a: number, b: number) {\n return a + b;\n}\n"
}
]
}
}

View File

@@ -1,49 +0,0 @@
{
"workspace": {
"flows": [
{
"path": "f/evals/add_numbers_flow",
"summary": "Add two numbers in a subflow",
"description": "Takes two numeric inputs and returns their sum.",
"schema": {
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"properties": {
"a": {
"type": "number"
},
"b": {
"type": "number"
}
},
"required": [
"a",
"b"
]
},
"value": {
"modules": [
{
"id": "sum_numbers",
"value": {
"type": "rawscript",
"language": "bun",
"content": "export async function main(a: number, b: number) {\n return a + b;\n}",
"input_transforms": {
"a": {
"type": "javascript",
"expr": "flow_input.a"
},
"b": {
"type": "javascript",
"expr": "flow_input.b"
}
}
}
}
]
}
}
]
}
}

View File

@@ -1,8 +0,0 @@
{
"path": "f/evals/greet_user.ts",
"lang": "bun",
"args": {
"name": "Alice"
},
"code": "export async function main(name: string) {\n\treturn `Hello, ${name}!`\n}\n"
}

View File

@@ -1,8 +0,0 @@
{
"path": "f/evals/greet_user.ts",
"lang": "bun",
"args": {
"name": "Alice"
},
"code": "export async function main(name: string) {\n\treturn ''\n}\n"
}

View File

@@ -1,3 +0,0 @@
{"createdAt":"2026-04-10T14:24:42.248Z","gitSha":"8f8b487be517a0bdd318c36857c1d46d5ab0723a","mode":"app","runs":1,"runModel":"anthropic:claude-haiku-4-5-20251001","judgeModel":"claude-sonnet-4-6","caseCount":9,"attemptCount":9,"passedAttempts":7,"passRate":0.7777777777777778,"averageDurationMs":25680.777777777777,"averageJudgeScore":76.55555555555556,"averageTokenUsagePerAttempt":{"prompt":53989.22222222222,"completion":2629.222222222222,"total":56618.444444444445},"failedCaseIds":["app-test8-inventory-tracker-create","app-test9-recipe-book-create"],"cases":[{"id":"app-test1-counter-create","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":11071,"averageJudgeScore":100,"averageTokenUsagePerAttempt":{"prompt":17912,"completion":1079,"total":18991}},{"id":"app-test2-counter-reset","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":12121,"averageJudgeScore":97,"averageTokenUsagePerAttempt":{"prompt":19088,"completion":833,"total":19921}},{"id":"app-test3-shopping-cart-quantity","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":25852,"averageJudgeScore":98,"averageTokenUsagePerAttempt":{"prompt":58834,"completion":2446,"total":61280}},{"id":"app-test4-shopping-cart-discount","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":42350,"averageJudgeScore":97,"averageTokenUsagePerAttempt":{"prompt":90882,"completion":4984,"total":95866}},{"id":"app-test5-file-manager-search","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":29129,"averageJudgeScore":100,"averageTokenUsagePerAttempt":{"prompt":81980,"completion":2817,"total":84797}},{"id":"app-test6-file-manager-inline-rename","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":51576,"averageJudgeScore":100,"averageTokenUsagePerAttempt":{"prompt":110023,"completion":6328,"total":116351}},{"id":"app-test7-file-manager-select-all","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":39256,"averageJudgeScore":97,"averageTokenUsagePerAttempt":{"prompt":72006,"completion":4188,"total":76194}},{"id":"app-test8-inventory-tracker-create","attemptCount":1,"passedAttempts":0,"passRate":0,"averageDurationMs":10514,"averageJudgeScore":0,"averageTokenUsagePerAttempt":{"prompt":17600,"completion":511,"total":18111}},{"id":"app-test9-recipe-book-create","attemptCount":1,"passedAttempts":0,"passRate":0,"averageDurationMs":9258,"averageJudgeScore":0,"averageTokenUsagePerAttempt":{"prompt":17578,"completion":477,"total":18055}}]}
{"createdAt":"2026-04-10T14:27:49.271Z","gitSha":"8f8b487be517a0bdd318c36857c1d46d5ab0723a","mode":"app","runs":1,"runModel":"anthropic:claude-opus-4-6","judgeModel":"claude-sonnet-4-6","caseCount":9,"attemptCount":9,"passedAttempts":6,"passRate":0.6666666666666666,"averageDurationMs":57285.666666666664,"averageJudgeScore":82.55555555555556,"averageTokenUsagePerAttempt":{"prompt":54435.77777777778,"completion":3668.6666666666665,"total":58104.444444444445},"failedCaseIds":["app-test7-file-manager-select-all","app-test8-inventory-tracker-create","app-test9-recipe-book-create"],"cases":[{"id":"app-test1-counter-create","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":17930,"averageJudgeScore":100,"averageTokenUsagePerAttempt":{"prompt":17620,"completion":743,"total":18363}},{"id":"app-test2-counter-reset","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":17852,"averageJudgeScore":100,"averageTokenUsagePerAttempt":{"prompt":18887,"completion":701,"total":19588}},{"id":"app-test3-shopping-cart-quantity","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":43501,"averageJudgeScore":97,"averageTokenUsagePerAttempt":{"prompt":38855,"completion":2692,"total":41547}},{"id":"app-test4-shopping-cart-discount","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":60820,"averageJudgeScore":98,"averageTokenUsagePerAttempt":{"prompt":61707,"completion":3420,"total":65127}},{"id":"app-test5-file-manager-search","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":45253,"averageJudgeScore":100,"averageTokenUsagePerAttempt":{"prompt":67244,"completion":3031,"total":70275}},{"id":"app-test6-file-manager-inline-rename","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":104837,"averageJudgeScore":98,"averageTokenUsagePerAttempt":{"prompt":116979,"completion":6834,"total":123813}},{"id":"app-test7-file-manager-select-all","attemptCount":1,"passedAttempts":0,"passRate":0,"averageDurationMs":73325,"averageJudgeScore":78,"averageTokenUsagePerAttempt":{"prompt":76351,"completion":5239,"total":81590}},{"id":"app-test8-inventory-tracker-create","attemptCount":1,"passedAttempts":0,"passRate":0,"averageDurationMs":133705,"averageJudgeScore":72,"averageTokenUsagePerAttempt":{"prompt":68546,"completion":9881,"total":78427}},{"id":"app-test9-recipe-book-create","attemptCount":1,"passedAttempts":0,"passRate":0,"averageDurationMs":18348,"averageJudgeScore":0,"averageTokenUsagePerAttempt":{"prompt":23733,"completion":477,"total":24210}}]}
{"createdAt":"2026-04-10T14:29:28.396Z","gitSha":"8f8b487be517a0bdd318c36857c1d46d5ab0723a","mode":"app","runs":1,"runModel":"openai:gpt-4o","judgeModel":"claude-sonnet-4-6","caseCount":9,"attemptCount":9,"passedAttempts":5,"passRate":0.5555555555555556,"averageDurationMs":31682.555555555555,"averageJudgeScore":73.11111111111111,"averageTokenUsagePerAttempt":{"prompt":27221.222222222223,"completion":1564.6666666666667,"total":28785.88888888889},"failedCaseIds":["app-test6-file-manager-inline-rename","app-test7-file-manager-select-all","app-test8-inventory-tracker-create","app-test9-recipe-book-create"],"cases":[{"id":"app-test1-counter-create","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":9911,"averageJudgeScore":100,"averageTokenUsagePerAttempt":{"prompt":8116,"completion":525,"total":8641}},{"id":"app-test2-counter-reset","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":15146,"averageJudgeScore":97,"averageTokenUsagePerAttempt":{"prompt":13096,"completion":576,"total":13672}},{"id":"app-test3-shopping-cart-quantity","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":31146,"averageJudgeScore":97,"averageTokenUsagePerAttempt":{"prompt":42424,"completion":1691,"total":44115}},{"id":"app-test4-shopping-cart-discount","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":49382,"averageJudgeScore":92,"averageTokenUsagePerAttempt":{"prompt":35785,"completion":3345,"total":39130}},{"id":"app-test5-file-manager-search","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":62963,"averageJudgeScore":100,"averageTokenUsagePerAttempt":{"prompt":46902,"completion":3590,"total":50492}},{"id":"app-test6-file-manager-inline-rename","attemptCount":1,"passedAttempts":0,"passRate":0,"averageDurationMs":24203,"averageJudgeScore":100,"averageTokenUsagePerAttempt":{"prompt":33121,"completion":498,"total":33619}},{"id":"app-test7-file-manager-select-all","attemptCount":1,"passedAttempts":0,"passRate":0,"averageDurationMs":74058,"averageJudgeScore":72,"averageTokenUsagePerAttempt":{"prompt":46026,"completion":3591,"total":49617}},{"id":"app-test8-inventory-tracker-create","attemptCount":1,"passedAttempts":0,"passRate":0,"averageDurationMs":6757,"averageJudgeScore":0,"averageTokenUsagePerAttempt":{"prompt":7770,"completion":165,"total":7935}},{"id":"app-test9-recipe-book-create","attemptCount":1,"passedAttempts":0,"passRate":0,"averageDurationMs":11577,"averageJudgeScore":0,"averageTokenUsagePerAttempt":{"prompt":11751,"completion":101,"total":11852}}]}

View File

@@ -1,2 +0,0 @@
{"createdAt":"2026-04-10T14:25:39.106Z","gitSha":"8f8b487be517a0bdd318c36857c1d46d5ab0723a","mode":"cli","runs":1,"runModel":"anthropic:haiku","judgeModel":"claude-sonnet-4-6","caseCount":6,"attemptCount":6,"passedAttempts":6,"passRate":1,"averageDurationMs":21746,"averageJudgeScore":99.16666666666667,"averageTokenUsagePerAttempt":null,"failedCaseIds":[],"cases":[{"id":"bun-hello-script","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":16588,"averageJudgeScore":100,"averageTokenUsagePerAttempt":null},{"id":"bun-hello-flow","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":27642,"averageJudgeScore":100,"averageTokenUsagePerAttempt":null},{"id":"python-add-numbers-script","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":23640,"averageJudgeScore":100,"averageTokenUsagePerAttempt":null},{"id":"bun-hello-script-uppercase","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":19379,"averageJudgeScore":100,"averageTokenUsagePerAttempt":null},{"id":"bun-hello-flow-punctuation","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":21993,"averageJudgeScore":100,"averageTokenUsagePerAttempt":null},{"id":"flow-reuse-existing-script","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":21234,"averageJudgeScore":95,"averageTokenUsagePerAttempt":null}]}
{"createdAt":"2026-04-10T14:28:09.045Z","gitSha":"8f8b487be517a0bdd318c36857c1d46d5ab0723a","mode":"cli","runs":1,"runModel":"anthropic:opus","judgeModel":"claude-sonnet-4-6","caseCount":6,"attemptCount":6,"passedAttempts":6,"passRate":1,"averageDurationMs":24988.833333333332,"averageJudgeScore":99.66666666666667,"averageTokenUsagePerAttempt":null,"failedCaseIds":[],"cases":[{"id":"bun-hello-script","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":22034,"averageJudgeScore":100,"averageTokenUsagePerAttempt":null},{"id":"bun-hello-flow","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":28030,"averageJudgeScore":100,"averageTokenUsagePerAttempt":null},{"id":"python-add-numbers-script","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":16668,"averageJudgeScore":100,"averageTokenUsagePerAttempt":null},{"id":"bun-hello-script-uppercase","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":21269,"averageJudgeScore":100,"averageTokenUsagePerAttempt":null},{"id":"bun-hello-flow-punctuation","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":30126,"averageJudgeScore":100,"averageTokenUsagePerAttempt":null},{"id":"flow-reuse-existing-script","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":31806,"averageJudgeScore":98,"averageTokenUsagePerAttempt":null}]}

View File

@@ -1,3 +0,0 @@
{"createdAt":"2026-04-10T14:25:16.664Z","gitSha":"8f8b487be517a0bdd318c36857c1d46d5ab0723a","mode":"flow","runs":1,"runModel":"anthropic:claude-haiku-4-5-20251001","judgeModel":"claude-sonnet-4-6","caseCount":13,"attemptCount":13,"passedAttempts":8,"passRate":0.6153846153846154,"averageDurationMs":33424.692307692305,"averageJudgeScore":82.61538461538461,"averageTokenUsagePerAttempt":{"prompt":131901,"completion":3121.230769230769,"total":135022.23076923078},"failedCaseIds":["flow-test6-ai-agent-tools","flow-test7-simple-modification","flow-test9-parallel-refactor","flow-test10-while-loop-counter","flow-test11-preprocessor-and-failure-handler"],"cases":[{"id":"flow-test0-sum-two-numbers","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":16943,"averageJudgeScore":98,"averageTokenUsagePerAttempt":{"prompt":126615,"completion":839,"total":127454}},{"id":"flow-test1-reuse-existing-script","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":15220,"averageJudgeScore":100,"averageTokenUsagePerAttempt":{"prompt":75614,"completion":805,"total":76419}},{"id":"flow-test2-call-existing-subflow","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":15699,"averageJudgeScore":97,"averageTokenUsagePerAttempt":{"prompt":76182,"completion":887,"total":77069}},{"id":"flow-test3-branchone-routing","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":21605,"averageJudgeScore":100,"averageTokenUsagePerAttempt":{"prompt":62230,"completion":1509,"total":63739}},{"id":"flow-test4-order-processing-loop","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":47228,"averageJudgeScore":97,"averageTokenUsagePerAttempt":{"prompt":143511,"completion":5443,"total":148954}},{"id":"flow-test5-parallel-data-pipeline","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":81870,"averageJudgeScore":92,"averageTokenUsagePerAttempt":{"prompt":194542,"completion":12409,"total":206951}},{"id":"flow-test6-ai-agent-tools","attemptCount":1,"passedAttempts":0,"passRate":0,"averageDurationMs":51878,"averageJudgeScore":72,"averageTokenUsagePerAttempt":{"prompt":142071,"completion":5720,"total":147791}},{"id":"flow-test7-simple-modification","attemptCount":1,"passedAttempts":0,"passRate":0,"averageDurationMs":49113,"averageJudgeScore":42,"averageTokenUsagePerAttempt":{"prompt":318525,"completion":2702,"total":321227}},{"id":"flow-test8-branching-in-loop","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":18244,"averageJudgeScore":82,"averageTokenUsagePerAttempt":{"prompt":78441,"completion":979,"total":79420}},{"id":"flow-test9-parallel-refactor","attemptCount":1,"passedAttempts":0,"passRate":0,"averageDurationMs":49485,"averageJudgeScore":72,"averageTokenUsagePerAttempt":{"prompt":135237,"completion":5467,"total":140704}},{"id":"flow-test10-while-loop-counter","attemptCount":1,"passedAttempts":0,"passRate":0,"averageDurationMs":21210,"averageJudgeScore":90,"averageTokenUsagePerAttempt":{"prompt":127844,"completion":1179,"total":129023}},{"id":"flow-test11-preprocessor-and-failure-handler","attemptCount":1,"passedAttempts":0,"passRate":0,"averageDurationMs":25142,"averageJudgeScore":42,"averageTokenUsagePerAttempt":{"prompt":128648,"completion":1337,"total":129985}},{"id":"flow-test12-approval-step","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":20884,"averageJudgeScore":90,"averageTokenUsagePerAttempt":{"prompt":105253,"completion":1300,"total":106553}}]}
{"createdAt":"2026-04-10T14:57:17.513Z","gitSha":"2a58402cfc5c320748839e92b51a1291b937bf26","mode":"flow","runs":1,"runModel":"anthropic:claude-opus-4-6","judgeModel":"claude-sonnet-4-6","caseCount":13,"attemptCount":13,"passedAttempts":8,"passRate":0.6153846153846154,"averageDurationMs":58074.53846153846,"averageJudgeScore":87.53846153846153,"averageTokenUsagePerAttempt":{"prompt":125452.76923076923,"completion":2957.769230769231,"total":128410.53846153847},"failedCaseIds":["flow-test4-order-processing-loop","flow-test6-ai-agent-tools","flow-test7-simple-modification","flow-test10-while-loop-counter","flow-test11-preprocessor-and-failure-handler"],"cases":[{"id":"flow-test0-sum-two-numbers","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":26967,"averageJudgeScore":97,"averageTokenUsagePerAttempt":{"prompt":103796,"completion":634,"total":104430}},{"id":"flow-test1-reuse-existing-script","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":29009,"averageJudgeScore":100,"averageTokenUsagePerAttempt":{"prompt":75507,"completion":743,"total":76250}},{"id":"flow-test2-call-existing-subflow","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":26828,"averageJudgeScore":100,"averageTokenUsagePerAttempt":{"prompt":76172,"completion":807,"total":76979}},{"id":"flow-test3-branchone-routing","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":44418,"averageJudgeScore":100,"averageTokenUsagePerAttempt":{"prompt":130440,"completion":1787,"total":132227}},{"id":"flow-test4-order-processing-loop","attemptCount":1,"passedAttempts":0,"passRate":0,"averageDurationMs":82185,"averageJudgeScore":97,"averageTokenUsagePerAttempt":{"prompt":116133,"completion":4905,"total":121038}},{"id":"flow-test5-parallel-data-pipeline","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":110344,"averageJudgeScore":97,"averageTokenUsagePerAttempt":{"prompt":122092,"completion":6980,"total":129072}},{"id":"flow-test6-ai-agent-tools","attemptCount":1,"passedAttempts":0,"passRate":0,"averageDurationMs":119901,"averageJudgeScore":72,"averageTokenUsagePerAttempt":{"prompt":154916,"completion":8908,"total":163824}},{"id":"flow-test7-simple-modification","attemptCount":1,"passedAttempts":0,"passRate":0,"averageDurationMs":44333,"averageJudgeScore":72,"averageTokenUsagePerAttempt":{"prompt":109935,"completion":1536,"total":111471}},{"id":"flow-test8-branching-in-loop","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":54247,"averageJudgeScore":97,"averageTokenUsagePerAttempt":{"prompt":136872,"completion":2638,"total":139510}},{"id":"flow-test9-parallel-refactor","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":63274,"averageJudgeScore":82,"averageTokenUsagePerAttempt":{"prompt":137794,"completion":3686,"total":141480}},{"id":"flow-test10-while-loop-counter","attemptCount":1,"passedAttempts":0,"passRate":0,"averageDurationMs":38813,"averageJudgeScore":82,"averageTokenUsagePerAttempt":{"prompt":105075,"completion":1157,"total":106232}},{"id":"flow-test11-preprocessor-and-failure-handler","attemptCount":1,"passedAttempts":0,"passRate":0,"averageDurationMs":77267,"averageJudgeScore":52,"averageTokenUsagePerAttempt":{"prompt":256547,"completion":3398,"total":259945}},{"id":"flow-test12-approval-step","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":37383,"averageJudgeScore":90,"averageTokenUsagePerAttempt":{"prompt":105607,"completion":1272,"total":106879}}]}
{"createdAt":"2026-04-10T14:29:52.249Z","gitSha":"8f8b487be517a0bdd318c36857c1d46d5ab0723a","mode":"flow","runs":1,"runModel":"openai:gpt-4o","judgeModel":"claude-sonnet-4-6","caseCount":13,"attemptCount":13,"passedAttempts":6,"passRate":0.46153846153846156,"averageDurationMs":29841.53846153846,"averageJudgeScore":68.46153846153847,"averageTokenUsagePerAttempt":{"prompt":72815.92307692308,"completion":770.7692307692307,"total":73586.69230769231},"failedCaseIds":["flow-test5-parallel-data-pipeline","flow-test6-ai-agent-tools","flow-test7-simple-modification","flow-test9-parallel-refactor","flow-test10-while-loop-counter","flow-test11-preprocessor-and-failure-handler","flow-test12-approval-step"],"cases":[{"id":"flow-test0-sum-two-numbers","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":20059,"averageJudgeScore":97,"averageTokenUsagePerAttempt":{"prompt":64091,"completion":265,"total":64356}},{"id":"flow-test1-reuse-existing-script","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":20728,"averageJudgeScore":100,"averageTokenUsagePerAttempt":{"prompt":46594,"completion":270,"total":46864}},{"id":"flow-test2-call-existing-subflow","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":21533,"averageJudgeScore":98,"averageTokenUsagePerAttempt":{"prompt":46859,"completion":232,"total":47091}},{"id":"flow-test3-branchone-routing","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":29004,"averageJudgeScore":100,"averageTokenUsagePerAttempt":{"prompt":64593,"completion":568,"total":65161}},{"id":"flow-test4-order-processing-loop","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":36250,"averageJudgeScore":95,"averageTokenUsagePerAttempt":{"prompt":66346,"completion":1259,"total":67605}},{"id":"flow-test5-parallel-data-pipeline","attemptCount":1,"passedAttempts":0,"passRate":0,"averageDurationMs":46151,"averageJudgeScore":72,"averageTokenUsagePerAttempt":{"prompt":104676,"completion":1698,"total":106374}},{"id":"flow-test6-ai-agent-tools","attemptCount":1,"passedAttempts":0,"passRate":0,"averageDurationMs":72403,"averageJudgeScore":62,"averageTokenUsagePerAttempt":{"prompt":105280,"completion":2216,"total":107496}},{"id":"flow-test7-simple-modification","attemptCount":1,"passedAttempts":0,"passRate":0,"averageDurationMs":41599,"averageJudgeScore":20,"averageTokenUsagePerAttempt":{"prompt":103053,"completion":707,"total":103760}},{"id":"flow-test8-branching-in-loop","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":23352,"averageJudgeScore":82,"averageTokenUsagePerAttempt":{"prompt":97955,"completion":468,"total":98423}},{"id":"flow-test9-parallel-refactor","attemptCount":1,"passedAttempts":0,"passRate":0,"averageDurationMs":19341,"averageJudgeScore":0,"averageTokenUsagePerAttempt":{"prompt":12254,"completion":1057,"total":13311}},{"id":"flow-test10-while-loop-counter","attemptCount":1,"passedAttempts":0,"passRate":0,"averageDurationMs":16143,"averageJudgeScore":82,"averageTokenUsagePerAttempt":{"prompt":64480,"completion":445,"total":64925}},{"id":"flow-test11-preprocessor-and-failure-handler","attemptCount":1,"passedAttempts":0,"passRate":0,"averageDurationMs":24231,"averageJudgeScore":52,"averageTokenUsagePerAttempt":{"prompt":106068,"completion":472,"total":106540}},{"id":"flow-test12-approval-step","attemptCount":1,"passedAttempts":0,"passRate":0,"averageDurationMs":17146,"averageJudgeScore":30,"averageTokenUsagePerAttempt":{"prompt":64358,"completion":363,"total":64721}}]}

View File

@@ -1,3 +0,0 @@
{"createdAt":"2026-04-10T14:23:51.580Z","gitSha":"8f8b487be517a0bdd318c36857c1d46d5ab0723a","mode":"script","runs":1,"runModel":"anthropic:claude-haiku-4-5-20251001","judgeModel":"claude-sonnet-4-6","caseCount":1,"attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":12112,"averageJudgeScore":100,"averageTokenUsagePerAttempt":{"prompt":48134,"completion":452,"total":48586},"failedCaseIds":[],"cases":[{"id":"script-test1-greet-user","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":12112,"averageJudgeScore":100,"averageTokenUsagePerAttempt":{"prompt":48134,"completion":452,"total":48586}}]}
{"createdAt":"2026-04-10T14:24:18.129Z","gitSha":"8f8b487be517a0bdd318c36857c1d46d5ab0723a","mode":"script","runs":1,"runModel":"anthropic:claude-opus-4-6","judgeModel":"claude-sonnet-4-6","caseCount":1,"attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":16595,"averageJudgeScore":100,"averageTokenUsagePerAttempt":{"prompt":38264,"completion":254,"total":38518},"failedCaseIds":[],"cases":[{"id":"script-test1-greet-user","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":16595,"averageJudgeScore":100,"averageTokenUsagePerAttempt":{"prompt":38264,"completion":254,"total":38518}}]}
{"createdAt":"2026-04-10T14:24:41.534Z","gitSha":"8f8b487be517a0bdd318c36857c1d46d5ab0723a","mode":"script","runs":1,"runModel":"openai:gpt-4o","judgeModel":"claude-sonnet-4-6","caseCount":1,"attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":13643,"averageJudgeScore":97,"averageTokenUsagePerAttempt":{"prompt":28961,"completion":137,"total":29098},"failedCaseIds":[],"cases":[{"id":"script-test1-greet-user","attemptCount":1,"passedAttempts":1,"passRate":1,"averageDurationMs":13643,"averageJudgeScore":97,"averageTokenUsagePerAttempt":{"prompt":28961,"completion":137,"total":29098}}]}

View File

@@ -1,79 +0,0 @@
import { loadAppFixture } from "../adapters/frontend/core/app/appFixtureLoader";
import type { AppFiles } from "../../frontend/src/lib/components/copilot/chat/app/core";
import type { FrontendEvalModelConfig } from "../core/models";
import { validateAppState, type AppFilesState } from "../core/validators";
import type { BenchmarkArtifactFile, ModeRunner } from "../core/types";
import { runAppEval } from "../adapters/frontend/core/app/appEvalRunner";
import { DEFAULT_FRONTEND_EVAL_MODEL, getFrontendApiKey } from "./frontendCommon";
export function createAppModeRunner(
modelConfig: FrontendEvalModelConfig = DEFAULT_FRONTEND_EVAL_MODEL
): ModeRunner<AppFilesState, AppFilesState, AppFilesState> {
return {
mode: "app",
concurrency: 5,
judgeThreshold: 80,
async loadInitial(path) {
return path ? (await loadAppFixture(path)) : undefined;
},
async loadExpected(path) {
return path ? (await loadAppFixture(path)) : undefined;
},
async run(prompt, initial, context) {
const result = await runAppEval(prompt, getFrontendApiKey(modelConfig.provider), {
initialFrontend: initial?.frontend,
initialBackend: initial?.backend as AppFiles["backend"] | undefined,
provider: modelConfig.provider,
model: modelConfig.model,
runContext: context,
});
return {
success: result.success,
actual: result.files as AppFilesState,
error: result.error,
assistantMessageCount: result.assistantMessageCount,
toolCallCount: result.toolCallCount,
toolsUsed: result.toolsUsed,
skillsInvoked: [],
tokenUsage: result.tokenUsage,
};
},
validate({ actual, initial, expected }) {
return validateAppState({ actual, initial, expected });
},
buildArtifacts(actual): BenchmarkArtifactFile[] {
const artifacts: BenchmarkArtifactFile[] = [
{
path: "app.json",
content: JSON.stringify(actual, null, 2) + "\n",
},
];
for (const [filePath, content] of Object.entries(actual.frontend)) {
artifacts.push({
path: `frontend${filePath.startsWith("/") ? filePath : `/${filePath}`}`,
content,
});
}
for (const [key, runnable] of Object.entries(actual.backend)) {
artifacts.push({
path: `backend/${key}/meta.json`,
content: JSON.stringify(runnable, null, 2) + "\n",
});
const inlineContent = runnable.inlineScript?.content;
if (inlineContent) {
const extension = runnable.inlineScript?.language === "python3" ? "py" : "ts";
artifacts.push({
path: `backend/${key}/main.${extension}`,
content: inlineContent,
});
}
}
return artifacts;
},
};
}
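
Reviewer note: `buildArtifacts` above maps each frontend file to a path under `frontend/` (adding a leading slash when the key lacks one) and picks the inline script extension from the runnable's language. The sketch below isolates those two rules. The function names `frontendArtifactPath` and `backendScriptExtension` are hypothetical and do not exist in the diff; the logic is copied from the expressions in `buildArtifacts`.

```typescript
// Hypothetical helper: same expression as in buildArtifacts, extracted for illustration.
function frontendArtifactPath(filePath: string): string {
  return `frontend${filePath.startsWith("/") ? filePath : `/${filePath}`}`;
}

// Hypothetical helper: "python3" maps to "py", every other language falls back to "ts".
function backendScriptExtension(language: string | undefined): string {
  return language === "python3" ? "py" : "ts";
}

const withSlash = frontendArtifactPath("/index.tsx");
const withoutSlash = frontendArtifactPath("App.tsx");
const pythonExtension = backendScriptExtension("python3");
const bunExtension = backendScriptExtension("bun");
const unknownExtension = backendScriptExtension(undefined);
```

Both `/index.tsx` and `App.tsx` end up directly under `frontend/`. Any language other than `python3`, including a missing one, is written as `main.ts`, which may be worth revisiting if other inline languages are ever evaluated.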

View File

@@ -1,162 +0,0 @@
import { mkdtemp, readFile, rm, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import path, { join } from "node:path";
import { writeAiGuidanceFiles } from "../../cli/src/guidance/writer.ts";
import type { CliEvalModelConfig } from "../core/models";
import {
DEFAULT_CLI_EVAL_MODEL,
formatCliRunModelLabel,
getGeneratedSkillsSource,
runPromptAndCapture,
} from "../adapters/cli/runtime";
import { copyDirectory, readDirectoryFiles } from "../core/files";
import { validateCliWorkspace } from "../core/validators";
import type { BenchmarkArtifactFile, ModeRunner } from "../core/types";
const IGNORE_WORKSPACE_FILES = new Set([".claude", "AGENTS.md", "CLAUDE.md", "rt.d.ts"]);
interface CliWorkspaceFixture {
sourceDir: string;
files: Record<string, string>;
}
interface CliRunActual {
assistantOutput: string;
workspaceFiles: Record<string, string>;
}
const CLAUDE_PROJECT_PREAMBLE = [
"Follow the project instructions from AGENTS.md exactly.",
"Before creating or modifying any Windmill entity, you MUST invoke the relevant Skill tool and follow it.",
"Use the skill guidance for file layout, implementation details, and the exact next commands to tell the user.",
"Do not skip the Skill step.",
"You are running inside an automated benchmark harness, not an interactive user session.",
"Act autonomously and complete the requested file changes directly in the workspace.",
"Do not ask for confirmation, do not ask the user to save or create files manually, and do not wait for approval.",
"Do not respond with a plan when you can make the change directly.",
"Only describe what was done after you have written the files.",
].join(" ");
export function createCliModeRunner(
modelConfig: CliEvalModelConfig = DEFAULT_CLI_EVAL_MODEL
): ModeRunner<CliWorkspaceFixture, CliWorkspaceFixture, CliRunActual> {
return {
mode: "cli",
concurrency: 1,
judgeThreshold: 80,
async loadInitial(path) {
return path
? {
sourceDir: path,
files: await readDirectoryFiles(path),
}
: undefined;
},
async loadExpected(path) {
return path
? {
sourceDir: path,
files: await readDirectoryFiles(path),
}
: undefined;
},
async run(prompt, initial, _context) {
const workspaceDir = await mkdtemp(join(tmpdir(), "wmill-cli-benchmark-"));
try {
if (initial) {
await copyDirectory(initial.sourceDir, workspaceDir);
}
await writeAiGuidanceFiles({
targetDir: workspaceDir,
nonDottedPaths: true,
overwriteProjectGuidance: true,
skillsSourcePath: getGeneratedSkillsSource(),
});
await writeFile(join(workspaceDir, "rt.d.ts"), "export namespace RT {}\n", "utf8");
const renderedPrompt = await renderPrompt(prompt, workspaceDir);
const run = await runPromptAndCapture(renderedPrompt, workspaceDir, 6, modelConfig);
const workspaceFiles = await readDirectoryFiles(workspaceDir, { ignore: IGNORE_WORKSPACE_FILES });
return {
success: true,
actual: {
assistantOutput: run.output,
workspaceFiles,
},
assistantMessageCount: run.assistantMessageCount,
toolCallCount: run.toolsUsed.length,
toolsUsed: run.toolsUsed.map((entry) => entry.tool),
skillsInvoked: run.skillsInvoked,
tokenUsage: run.tokenUsage ?? null,
};
} catch (error) {
const message = error instanceof Error ? error.message : String(error);
return {
success: false,
actual: {
assistantOutput: "",
workspaceFiles: {},
},
error: message,
assistantMessageCount: 0,
toolCallCount: 0,
toolsUsed: [],
skillsInvoked: [],
tokenUsage: null,
};
} finally {
await rm(workspaceDir, { recursive: true, force: true });
}
},
validate({ actual, initial, expected }) {
return validateCliWorkspace({
actualFiles: actual.workspaceFiles,
expectedFiles: expected?.files,
initialFiles: initial?.files,
});
},
buildArtifacts(actual): BenchmarkArtifactFile[] {
const artifacts: BenchmarkArtifactFile[] = [
{
path: "assistant-output.txt",
content: `${actual.assistantOutput}\n`,
},
];
for (const [filePath, content] of Object.entries(actual.workspaceFiles)) {
artifacts.push({
path: filePath,
content,
});
}
return artifacts;
},
};
}
export function getCliRunModelLabel(
modelConfig: CliEvalModelConfig = DEFAULT_CLI_EVAL_MODEL
): string {
return formatCliRunModelLabel(modelConfig);
}
async function renderPrompt(prompt: string, workspaceDir: string): Promise<string> {
const renderedUserPrompt = prompt.replaceAll("{{workspace_root}}", workspaceDir);
const agentsInstructions = await readFile(path.join(workspaceDir, "AGENTS.md"), "utf8");
return [
"# Project Instructions",
agentsInstructions.trim(),
"",
"# Benchmark Harness",
CLAUDE_PROJECT_PREAMBLE,
"",
"# User Request",
renderedUserPrompt,
].join("\n");
}

Some files were not shown because too many files have changed in this diff.