diff --git a/app/en/guides/create-tools/evaluate-tools/capture-mode/page.mdx b/app/en/guides/create-tools/evaluate-tools/capture-mode/page.mdx
index 7f48a4912..6f886e9e4 100644
--- a/app/en/guides/create-tools/evaluate-tools/capture-mode/page.mdx
+++ b/app/en/guides/create-tools/evaluate-tools/capture-mode/page.mdx
@@ -7,11 +7,13 @@ import { Callout, Steps } from "nextra/components";
 # Capture mode
-Capture mode records tool calls without evaluating them. Use it to bootstrap test expectations or debug model behavior.
+Capture mode records tool calls without evaluating them, so you can bootstrap test expectations or debug model behavior.
+
+Capture mode helps you understand how models interact with your tools when you don't know what to expect. You can use it to create initial test cases, debug unexpected evaluation failures, or explore how models interpret new tools. This guide covers how to set up capture mode, review the output, and convert captured calls into proper test expectations.
   **Backward compatibility**: Capture mode works with existing evaluation
-  suites. Simply add the `--capture` flag to any `arcade evals` command. No code
+  suites. Add the `--capture` flag to any `arcade evals` command. No code
   changes needed.
@@ -63,7 +65,7 @@ async def capture_weather_suite():
     # Add cases without expected tool calls
     suite.add_case(
-        name="Simple weather query",
+        name="Basic weather query",
         user_message="What's the weather in Seattle?",
         expected_tool_calls=[],  # Empty for capture
     )
@@ -98,7 +100,7 @@ Open the JSON file to see what the model called:
   "provider": "openai",
   "captured_cases": [
     {
-      "case_name": "Simple weather query",
+      "case_name": "Basic weather query",
       "user_message": "What's the weather in Seattle?",
       "tool_calls": [
         {
@@ -118,7 +120,7 @@ If you set `--num-runs` > 1, each case also includes `runs`:
 ```json
 {
-  "case_name": "Simple weather query",
+  "case_name": "Basic weather query",
   "tool_calls": [{"name": "Weather_GetCurrent", "args": {"location": "Seattle"}}],
   "runs": [
     {"tool_calls": [{"name": "Weather_GetCurrent", "args": {"location": "Seattle"}}]},
@@ -137,7 +139,7 @@ Copy the captured calls into your evaluation suite:
 from arcade_evals import ExpectedMCPToolCall, BinaryCritic
 suite.add_case(
-    name="Simple weather query",
+    name="Basic weather query",
     user_message="What's the weather in Seattle?",
     expected_tool_calls=[
         ExpectedMCPToolCall(
@@ -214,11 +216,11 @@ Markdown format is more readable for quick review:
 ### Model: gpt-4o
-#### Case: Simple weather query
+#### Case: Basic weather query
 **Input:** What's the weather in Seattle?
 **Tool Calls:**
 - `Weather_GetCurrent`
   - location: Seattle
@@ -354,7 +356,7 @@ arcade evals . --capture \
   -o captures/models.json -o captures/models.md
 ```
-## Converting captures to tests
+## Convert captures to tests
 ### Step 1: Identify patterns
@@ -374,7 +376,7 @@ Review captured tool calls to find patterns:
 Create expected tool calls based on patterns:
 ```python
 # Default to fahrenheit for US cities
 ExpectedMCPToolCall("GetWeather", {"location": "Seattle", "units": "fahrenheit"})
 # Use celsius for international cities
 ExpectedMCPToolCall("GetWeather", {"location": "Tokyo", "units": "celsius"})
@@ -411,7 +413,7 @@ Use failures to refine:
 **Possible causes:**
 1. Model didn't call any tools
 2. Tools not properly registered
 3. System message doesn't encourage tool use
 **Solution:**
@@ -442,4 +444,4 @@ suite = EvalSuite(
 ## Next steps
 - Learn about [comparative evaluations](/guides/create-tools/evaluate-tools/comparative-evaluations) to compare tool sources
-- [Create evaluation suites](/guides/create-tools/evaluate-tools/create-evaluation-suite) with expectations
+- [Create evaluation suites](/guides/create-tools/evaluate-tools/create-evaluation-suite) with expectations
\ No newline at end of file
diff --git a/app/en/guides/create-tools/evaluate-tools/comparative-evaluations/page.mdx b/app/en/guides/create-tools/evaluate-tools/comparative-evaluations/page.mdx
index 43df3d504..61fe2b756 100644
--- a/app/en/guides/create-tools/evaluate-tools/comparative-evaluations/page.mdx
+++ b/app/en/guides/create-tools/evaluate-tools/comparative-evaluations/page.mdx
@@ -5,13 +5,17 @@ description: "Compare different tool implementations with the same test cases"
 # Comparative evaluations
+Compare different tool implementations with the same test cases using isolated tool sources.
+
 Comparative evaluations let you test how well AI models select and use tools from different, isolated tool sources. Each "track" represents a separate tool registry, allowing you to compare implementations side-by-side.
+This page explains how to create comparative evaluations, when to use them versus regular evaluations, and how to structure track-specific test cases. You'll use comparative evaluations when testing multiple implementations of the same feature, comparing different tool providers, or running A/B tests on alternative tool designs.
+
 import { Callout, Steps } from "nextra/components";
 ## What are tracks?
 **Tracks are isolated tool registries** within a single evaluation suite. Each track has its own set of tools that are **not shared** with other tracks. This isolation lets you test how models perform when given different tool options for the same task.
 **Key concept**: Comparative evaluations test tool **selection** across different tool sets. Each track provides a different context (set of tools) to the model.
@@ -187,7 +191,7 @@ suite.add_tool_definitions(
 ```
-  Tools must be registered before creating comparative cases that reference
+  You must register tools before creating comparative cases that reference
   their tracks.
@@ -318,7 +322,7 @@ async def search_comparison():
         track="DuckDuckGo",
     )
-    # Simple query
+    # Basic query
     suite.add_comparative_case(
         name="basic_search",
         user_message="Search for Python tutorials",
@@ -410,7 +414,7 @@ Case: search_with_filters
 ## Result structure
 Comparative results are organized by track:
 ```python
 {
@@ -555,7 +559,7 @@ Model: claude-sonnet-4-5-20250929
 ### Use descriptive track names
-Choose clear names that indicate what's being compared:
+Choose clear names that indicate what you're comparing:
 ```python
 # ✅ Good
@@ -689,7 +693,7 @@ Use the exact tool names from the output.
 **Symptom:** Same user message produces different scores across tracks
-**Explanation:** This is expected. Different tool implementations may work differently.
+**Explanation:** This is expected behavior. Different tool implementations may work differently.
 **Solution:** Adjust expectations and critics per track to account for implementation differences.
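+
+For example, a track-specific case can pin its own expected call and critic (a sketch; it assumes `add_comparative_case` accepts the same `expected_tool_calls` and `critics` parameters as `add_case`, and that `ExpectedMCPToolCall` and `BinaryCritic` are imported from `arcade_evals`; the tool name is illustrative):
+
+```python
+suite.add_comparative_case(
+    name="basic_search",
+    user_message="Search for Python tutorials",
+    expected_tool_calls=[
+        # Expectations reference the tool names registered for this track
+        ExpectedMCPToolCall("DuckDuckGo_Search", {"query": "Python tutorials"}),
+    ],
+    critics=[BinaryCritic(critic_field="query", weight=1.0)],
+    track="DuckDuckGo",
+)
+```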
@@ -754,4 +758,4 @@ suite.add_tool_catalog(catalog_v2, track="Python v2")
 - [Create an evaluation suite](/guides/create-tools/evaluate-tools/create-evaluation-suite) with tracks
 - Use [capture mode](/guides/create-tools/evaluate-tools/capture-mode) to discover track-specific tool calls
-- [Run evaluations](/guides/create-tools/evaluate-tools/run-evaluations) with multiple models and tracks
+- [Run evaluations](/guides/create-tools/evaluate-tools/run-evaluations) with multiple models and tracks
\ No newline at end of file
diff --git a/app/en/guides/create-tools/evaluate-tools/create-evaluation-suite/page.mdx b/app/en/guides/create-tools/evaluate-tools/create-evaluation-suite/page.mdx
index 2894029b2..a720d82b0 100644
--- a/app/en/guides/create-tools/evaluate-tools/create-evaluation-suite/page.mdx
+++ b/app/en/guides/create-tools/evaluate-tools/create-evaluation-suite/page.mdx
@@ -5,7 +5,9 @@ description: "Learn how to evaluate your tools using Arcade"
 # Create an evaluation suite
-Evaluation suites help you test whether AI models use your tools correctly. This guide shows you how to create test cases that measure tool selection and parameter accuracy.
+Learn how to create test cases that measure tool selection and parameter accuracy for your tools.
+
+Evaluation suites help you test whether AI models use your tools correctly. This guide covers creating an evaluation file, defining test cases with expected tool calls, and running evaluations across different providers. Use it when validating tool behavior before deployment or when comparing model performance. You'll work in your MCP server directory and use the Arcade CLI to execute tests. By the end, you'll have a working evaluation suite that automatically tests your tools.
 import { Steps, Tabs, Callout } from "nextra/components";
@@ -164,7 +166,7 @@ PASSED Get weather for city -- Score: 100.00%
 ## Loading tools
-You can load tools from different sources. All methods are async and must be awaited in your `@tool_eval()` decorated function.
+You can load tools from different sources. All methods are async and you must await them in your `@tool_eval()` decorated function.
 ### From MCP HTTP server
@@ -245,7 +247,7 @@ await suite.add_mcp_server("http://server2.example")
 suite.add_tool_definitions([{"name": "CustomTool", ...}])
 ```
-All tools are accumulated in the suite's registry and available to the model.
+The suite accumulates all tools in its registry and makes them available to the model.
 ## Expected tool calls
@@ -259,14 +261,14 @@ ExpectedMCPToolCall(
 ```
-  Tool names are normalized for compatibility with model tool calling. Dots
+  The system normalizes tool names for compatibility with model tool calling. Dots
   (`.`) become underscores (`_`). For example, `Weather.GetCurrent` becomes
   `Weather_GetCurrent`.
 ## Critics
 Critics validate tool call parameters. Each critic type handles different validation needs:
 | Critic | Use case | Example |
 | ------------------ | --------------- | ------------------------------------------------------------------ |
@@ -284,7 +286,7 @@ critics=[
 ]
 ```
-All weights are normalized proportionally to sum to 1.0. Use numeric values or `FuzzyWeight` (`CRITICAL`, `HIGH`, `MEDIUM`, `LOW`).
+The system normalizes all weights proportionally to sum to 1.0. Use numeric values or `FuzzyWeight` (`CRITICAL`, `HIGH`, `MEDIUM`, `LOW`).
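+
+For example, you can mix numeric values and `FuzzyWeight` members in one critic list (a sketch; it assumes `FuzzyWeight` is importable from `arcade_evals` alongside `BinaryCritic`, and the field names are illustrative):
+
+```python
+critics = [
+    # Both weights are normalized proportionally so they sum to 1.0
+    BinaryCritic(critic_field="location", weight=FuzzyWeight.HIGH),
+    BinaryCritic(critic_field="units", weight=0.5),
+]
+```
+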
 ## Multiple tool calls
@@ -314,7 +316,7 @@ suite.add_case(
     ],
     additional_messages=[
         {"role": "user", "content": "I'm planning to visit Tokyo next week."},
-        {"role": "assistant", "content": "That sounds exciting! What would you like to know about Tokyo?"},
+        {"role": "assistant", "content": "What would you like to know about Tokyo?"},
     ],
 )
 ```
@@ -341,4 +343,4 @@ If you want stricter suites, increase thresholds (for example `fail_threshold=0.
 - Learn how to [run evaluations with different providers](/guides/create-tools/evaluate-tools/run-evaluations)
 - Explore [capture mode](/guides/create-tools/evaluate-tools/capture-mode) to record tool calls
-- Compare tool sources with [comparative evaluations](/guides/create-tools/evaluate-tools/comparative-evaluations)
+- Compare tool sources with [comparative evaluations](/guides/create-tools/evaluate-tools/comparative-evaluations)
\ No newline at end of file
diff --git a/app/en/guides/create-tools/evaluate-tools/run-evaluations/page.mdx b/app/en/guides/create-tools/evaluate-tools/run-evaluations/page.mdx
index f5df4a97e..92552fb75 100644
--- a/app/en/guides/create-tools/evaluate-tools/run-evaluations/page.mdx
+++ b/app/en/guides/create-tools/evaluate-tools/run-evaluations/page.mdx
@@ -5,7 +5,9 @@ description: "Learn how to run evaluations using Arcade"
 # Run evaluations
-The `arcade evals` command discovers and executes evaluation suites with support for multiple providers, models, and output formats.
+Run evaluations using the `arcade evals` command to discover and execute evaluation suites with support for multiple providers, models, and output formats.
+
+The `arcade evals` command searches for files starting with `eval_` and ending with `.py`, then executes them against your specified models and providers. You can compare performance across different models, capture tool calls for debugging, and generate reports in multiple formats. This is essential when you need to validate tool performance, compare model capabilities, or establish baseline expectations for your tools.
 import { Callout } from "nextra/components";
@@ -34,7 +36,7 @@ arcade evals . --details
 Filter to show only failures:
 ```bash
 arcade evals . --only-failed
 ```
 ## Multi-provider support
@@ -79,7 +81,7 @@ When you specify multiple models, results show side-by-side comparisons.
 ## API keys
-API keys are resolved in the following order:
+The system resolves API keys in the following order:
 | Priority | Format |
 |----------|--------|
@@ -230,7 +232,7 @@ Arcade uses `--multi-run-pass-rule` to set the overall `status`, `passed`, and `
 | `--api-key` | `-k` | Provider API key | `-k openai:sk-...` |
 | `--capture` | - | Record without scoring | `--capture` |
 | `--details` | `-d` | Show critic feedback | `--details` |
 | `--only-failed` | `-f` | Filter failures | `--only-failed` |
 | `--output` | `-o` | Output file (repeatable) | `-o results.md` |
 | `--include-context` | - | Add messages to output | `--include-context` |
 | `--max-concurrent` | `-c` | Parallel limit | `-c 10` |
@@ -320,12 +322,12 @@ Show detailed results including critic feedback:
 arcade evals . --details
 ```
 ### `--only-failed`, `-f`
 Show only failed test cases:
 ```bash
 arcade evals . --only-failed
 ```
 ### `--max-concurrent`, `-c`
@@ -350,7 +352,7 @@ Displays detailed error traces and connection information.
 ## Understanding results
-Results are formatted based on evaluation type (regular, multi-model, or comparative) and selected flags.
+The system formats results based on evaluation type (regular, multi-model, or comparative) and selected flags.
 ### Summary format
@@ -363,7 +365,7 @@ Summary -- Total: 5 -- Passed: 4 -- Failed: 1
 **How flags affect output:**
 - `--details`: Adds per-critic breakdown for each case
 - `--only-failed`: Filters to show only failed cases (summary shows original totals)
 - `--include-context`: Includes system messages and conversation history
 - `--num-runs`: Adds per-run statistics and aggregate scores
 - Multiple models: Switches to comparison table format
@@ -435,7 +437,7 @@ This creates:
 ### Missing dependencies
 If you see `ImportError: MCP SDK is required`, install the full package:
 ```bash
 pip install 'arcade-mcp[evals]'
@@ -449,7 +451,7 @@ pip install anthropic
 ### Tool name mismatches
-Tool names are normalized (dots become underscores). Check your tool definitions if you see unexpected names.
+The system normalizes tool names (dots become underscores). Check your tool definitions if you see unexpected names.
 ### API rate limits
diff --git a/app/en/guides/create-tools/evaluate-tools/why-evaluate/page.mdx b/app/en/guides/create-tools/evaluate-tools/why-evaluate/page.mdx
index b1a2fc506..8d045fe48 100644
--- a/app/en/guides/create-tools/evaluate-tools/why-evaluate/page.mdx
+++ b/app/en/guides/create-tools/evaluate-tools/why-evaluate/page.mdx
@@ -7,6 +7,8 @@ import { Callout } from "nextra/components";
 # Why evaluate tools?
+This page explains why tool evaluation is critical for reliable AI applications.
+
 Tool evaluations ensure AI models use your tools correctly in production. Unlike traditional testing, evaluations measure two key aspects:
@@ -72,7 +74,7 @@ FAILED Wrong tool selected -- Score: 50.00%
 ## Advanced features
-Once you're comfortable with basic evaluations, explore these advanced capabilities:
+Once you're comfortable with evaluations, explore these advanced capabilities:
 ### Capture mode
@@ -84,4 +86,4 @@ Test the same cases against different tool sources (tracks) with isolated regist
 ### Output formats
-Save results in multiple formats (txt, md, html, json) for reporting and analysis. Specify output files with extensions or use no extension for all formats. [Learn more →](/guides/create-tools/evaluate-tools/run-evaluations#output-formats)
+Save results in multiple formats (txt, md, html, json) for reporting and analysis. Specify output files with extensions or use no extension for all formats. [Learn more →](/guides/create-tools/evaluate-tools/run-evaluations#output-formats)
\ No newline at end of file