windmill/ai_evals/cases/flow.yaml
2026-04-13 14:05:46 +02:00

- id: flow-test0-sum-two-numbers
  prompt: |-
    Create a flow that takes two numbers, `a` and `b`, and returns their sum.
    Keep it simple and use a single step named `sum_numbers`.
  expected: ai_evals/fixtures/frontend/flow/expected/test0_sum_two_numbers.json
  judgeChecklist:
    - "the flow takes `a` and `b` as inputs"
    - "the main step is named `sum_numbers`"
    - the flow returns the sum of the two numbers
- id: flow-test1-reuse-existing-script
  prompt: |-
    I need a flow that adds two numbers.
    If there is already a script in the workspace that does that, reuse it instead of rewriting the logic.
    The flow should take `a` and `b` as inputs and use a single step named `sum_numbers`.
  initial: ai_evals/fixtures/frontend/flow/initial/test1_reuse_existing_script_initial.json
  expected: ai_evals/fixtures/frontend/flow/expected/test1_reuse_existing_script.json
  judgeChecklist:
    - "the flow takes `a` and `b` as inputs"
    - "the main step is named `sum_numbers`"
    - the flow reuses the existing workspace script instead of rewriting the addition logic
- id: flow-test2-call-existing-subflow
  prompt: |-
    Create a parent flow that adds two numbers by reusing an existing flow in the workspace if one already exists.
    The parent flow should take `a` and `b` as inputs and delegate the calculation instead of inlining it.
    Use a single step named `call_add_numbers`.
  initial: ai_evals/fixtures/frontend/flow/initial/test2_call_existing_subflow_initial.json
  expected: ai_evals/fixtures/frontend/flow/expected/test2_call_existing_subflow.json
  judgeChecklist:
    - "the parent flow takes `a` and `b` as inputs"
    - "the main step is named `call_add_numbers`"
    - the parent flow delegates to an existing workspace subflow instead of inlining the addition logic
- id: flow-test3-branchone-routing
  prompt: |-
    Create a flow that routes incoming support requests based on the customer's tier.
    The input should contain a string field named `tier`.
    Free, pro, and enterprise requests should go to different queues, and unknown tiers should fall back to a default queue.
    Name the main routing step `route_by_tier`.
  expected: ai_evals/fixtures/frontend/flow/expected/test3_branchone_routing.json
  judgeChecklist:
    - "the input schema includes a string field named `tier`"
    - "the main routing step is named `route_by_tier`"
    - free requests go to a free queue
    - pro requests go to a pro queue
    - enterprise requests go to an enterprise queue
    - unknown tiers fall back to a default queue
- id: flow-test4-order-processing-loop
  prompt: |-
    Build an order-processing flow.
    The input should include an order with:
    - an `items` array containing `name`, `price`, and `quantity`
    - `customer_email`
    - `shipping_address`
    The flow should:
    - validate that every item has a positive price and quantity
    - calculate the order total with 8% tax
    - check inventory for each item using placeholder availability data
    - create a shipment if everything is in stock, otherwise create a backorder
    - send a confirmation using placeholder email logic
    - return a final order summary with the status
  validate:
    schemaAnyOf:
      - requiredPaths:
          - order
          - order.items
          - order.customer_email
          - order.shipping_address
      - requiredPaths:
          - items
          - customer_email
          - shipping_address
    resolveResultsRefs: true
  judgeChecklist:
    - the flow validates that every item has a positive price and quantity
    - the flow calculates the order total with 8% tax
    - the flow checks inventory for each item using placeholder availability data
    - the flow creates a shipment if everything is in stock, otherwise a backorder
    - the flow sends a confirmation using placeholder email logic
    - the flow returns a final order summary with the resulting status
- id: flow-test5-parallel-data-pipeline
  prompt: |-
    Create a data-processing flow for three external data sources.
    It should:
    - load a small placeholder configuration listing the three sources
    - fetch placeholder records from each source
    - clean and validate each source's records
    - combine everything into one dataset
    - compute an overall quality score
    - store the result differently depending on the score:
      - 90 or above goes to the primary database
      - 70 to 89 goes to a secondary database with a warning
      - below 70 goes to quarantine and triggers an alert
    - return a processing report with total records, quality score, and destination
  judgeChecklist:
    - the flow loads a placeholder configuration listing three external sources
    - the flow fetches placeholder records from each source
    - the flow cleans and validates each source's records
    - the flow combines everything into one dataset
    - the flow computes an overall quality score
    - scores of 90 or above go to the primary database
    - scores from 70 to 89 go to a secondary database with a warning
    - scores below 70 go to quarantine and trigger an alert
    - the final report includes total records, quality score, and destination
- id: flow-test6-ai-agent-tools
  prompt: |-
    Create a customer support flow.
    The input should include `customer_id` and `query_text`.
    The flow should load the customer's profile and order history, then use an AI assistant to help with the request.
    The assistant should be able to:
    - look up orders
    - check refund eligibility
    - search FAQs
    - open a support ticket when needed
    After that, log the interaction and return the assistant's response along with any actions it took.
  judgeChecklist:
    - "the input schema includes `customer_id` and `query_text`"
    - the flow loads the customer's profile and order history
    - the flow uses an AI assistant step
    - the assistant can look up orders
    - the assistant can check refund eligibility
    - the assistant can search FAQs
    - the assistant can open a support ticket
    - the flow logs the interaction
    - the final output returns the assistant response along with any actions taken or resulting support action details
- id: flow-test7-simple-modification
  prompt: |-
    Update this flow so it validates processed data before saving it.
    After `process_data`, add a `validate_data` step that checks the data array is not empty.
    If the array is empty, it should return an error object with the message `No data to save`.
    If validation passes, let the save continue normally.
    Update `save_results` so it handles the validation result correctly.
  initial: ai_evals/fixtures/frontend/flow/initial/test5_initial.json
  expected: ai_evals/fixtures/frontend/flow/expected/test5_modify_simple.json
  judgeChecklist:
    - the updated flow keeps the original fetch and process steps intact
    - "a `validate_data` step is added after `process_data`"
    - "`validate_data` checks that the processed data array is not empty"
    - "empty data returns an error object with the message `No data to save`"
    - "`save_results` handles the validation result correctly"
- id: flow-test8-branching-in-loop
  prompt: |-
    Update the order-processing logic inside `loop_orders` so different order types are handled differently.
    For `express`, mark the order as priority and use a shipping cost of $15.99.
    For `standard`, use a shipping cost of $5.99.
    For `pickup`, mark it as no shipping required with a cost of $0.
    Keep the existing processing as a fallback for unknown order types.
    Each path should return the orderId, shipping cost, and shipping type.
  initial: ai_evals/fixtures/frontend/flow/initial/test6_initial.json
  judgeChecklist:
    - "the existing `loop_orders` flow still handles per-order processing"
    - exact branching topology is not required as long as `loop_orders` handles the order types correctly
    - express orders are marked as priority and use a shipping cost of 15.99
    - standard orders use a shipping cost of 5.99
    - pickup orders use a shipping cost of 0 and are treated as no shipping required
    - unknown order types still follow a fallback path
    - "each processed order returns `orderId`, `shippingCost`, and `shippingType`"
- id: flow-test9-parallel-refactor
  prompt: |-
    Refactor this flow so the enrichment work no longer runs one step at a time.
    `enrich_price`, `enrich_inventory`, and `enrich_reviews` should run independently.
    Each one should return a fallback value if it fails.
    Update `combine_data` so it merges the enrichment results and sets a `hasFallbacks` flag when any fallback was used.
    Keep `get_item` as the first step and `return_result` as the last step.
  initial: ai_evals/fixtures/frontend/flow/initial/test7_initial.json
  expected: ai_evals/fixtures/frontend/flow/expected/test7_modify_complex.json
  judgeChecklist:
    - "the updated flow keeps `get_item` as the first step"
    - "the updated flow keeps `return_result` as the last step"
    - "`enrich_price`, `enrich_inventory`, and `enrich_reviews` run independently rather than sequentially"
    - each enrichment path returns a fallback value if it fails
    - "`combine_data` merges the enrichment results"
    - "`combine_data` sets `hasFallbacks` when any fallback was used"
- id: flow-test10-while-loop-counter
  prompt: |-
    Create a flow that keeps incrementing a counter until it reaches a target value.
    The input should include a number field named `target`.
    Name the looping step `count_until_target`.
    Once the target is reached, return the final counter value.
  expected: ai_evals/fixtures/frontend/flow/expected/test10_while_loop_counter.json
  judgeChecklist:
    - "the input schema includes a number field named `target`"
    - "the looping step is named `count_until_target`"
    - the flow keeps incrementing a counter until the target is reached
    - the final output returns the final counter value
- id: flow-test11-preprocessor-and-failure-handler
  prompt: |-
    Create an event-processing flow for a string payload.
    Before the main processing runs, trim the payload and reject empty strings.
    The main step should be named `process_event` and return a simple success object.
    If anything fails, return a compact error object with the error message and the failing step id.
  expected: ai_evals/fixtures/frontend/flow/expected/test11_preprocessor_failure.json
  validate:
    requireSpecialModules:
      - preprocessor_module
      - failure_module
  judgeChecklist:
    - the flow trims the payload before the main processing runs
    - the flow rejects empty payload strings
    - "the main step is named `process_event`"
    - "`process_event` returns a simple success object"
    - failures return a compact error object with the error message and failing step id
- id: flow-test12-approval-step
  prompt: |-
    Create a purchase approval flow.
    The input should include `requester_email` and `amount`.
    Add an approval step named `request_approval` that pauses the flow and asks the approver for a comment.
    One approval should be enough to continue.
    After approval, add a final step named `finalize_purchase` that returns an approved status object.
  expected: ai_evals/fixtures/frontend/flow/expected/test12_approval_step.json
  validate:
    schemaRequiredPaths:
      - requester_email
      - amount
    requireSuspendSteps:
      - id: request_approval
        requiredEvents: 1
        resumeRequiredStringFieldAnyOf:
          - comment
          - approver_comment
  judgeChecklist:
    - "the flow includes an approval step named `request_approval`"
    - "`request_approval` pauses the flow and asks the approver for a comment"
    - one approval is enough to continue
    - "the flow includes a final step named `finalize_purchase`"
    - "`finalize_purchase` returns an approved status object after approval"