26 changes: 14 additions & 12 deletions app/en/guides/create-tools/evaluate-tools/capture-mode/page.mdx
@@ -7,11 +7,13 @@ import { Callout, Steps } from "nextra/components";

# Capture mode

Capture mode records tool calls without evaluating them. Use it to bootstrap test expectations or debug model behavior.
Capture mode records tool calls without evaluating them, so you can bootstrap test expectations or debug model behavior.

Capture mode helps you understand how models interact with your tools when you don't know what to expect. You can use it to create initial test cases, debug unexpected evaluation failures, or explore how models interpret new tools. This guide covers how to set up capture mode, review the output, and convert captured calls into proper test expectations.

<Callout type="info">
**Backward compatibility**: Capture mode works with existing evaluation
suites. Simply add the `--capture` flag to any `arcade evals` command. No code
suites. Add the `--capture` flag to any `arcade evals` command. No code
changes needed.
</Callout>

@@ -63,7 +65,7 @@ async def capture_weather_suite():

# Add cases without expected tool calls
suite.add_case(
name="Simple weather query",
name="Basic weather query",
user_message="What's the weather in Seattle?",
expected_tool_calls=[], # Empty for capture
)
@@ -98,7 +100,7 @@ Open the JSON file to see what the model called:
"provider": "openai",
"captured_cases": [
{
"case_name": "Simple weather query",
"case_name": "Basic weather query",
"user_message": "What's the weather in Seattle?",
"tool_calls": [
{
@@ -118,7 +120,7 @@ If you set `--num-runs` > 1, each case also includes `runs`:

```json
{
"case_name": "Simple weather query",
"case_name": "Basic weather query",
"tool_calls": [{"name": "Weather_GetCurrent", "args": {"location": "Seattle"}}],
"runs": [
{"tool_calls": [{"name": "Weather_GetCurrent", "args": {"location": "Seattle"}}]},
@@ -137,7 +139,7 @@ Copy the captured calls into your evaluation suite:
from arcade_evals import ExpectedMCPToolCall, BinaryCritic

suite.add_case(
name="Simple weather query",
name="Basic weather query",
user_message="What's the weather in Seattle?",
expected_tool_calls=[
ExpectedMCPToolCall(
@@ -214,11 +216,11 @@ Markdown format is more readable for quick review:

### Model: gpt-4o

#### Case: Simple weather query
#### Case: Basic weather query

**Input:** What's the weather in Seattle?

**Tool Calls:**

- `Weather_GetCurrent`
- location: Seattle
@@ -354,7 +356,7 @@ arcade evals . --capture \
-o captures/models.json -o captures/models.md
```

## Converting captures to tests
## Convert captures to tests

### Step 1: Identify patterns

@@ -374,7 +376,7 @@ Review captured tool calls to find patterns:
Create expected tool calls based on patterns:

```python
# Default to fahrenheit for US cities
ExpectedMCPToolCall("GetWeather", {"location": "Seattle", "units": "fahrenheit"})

# Use celsius for international cities
@@ -411,7 +413,7 @@ Use failures to refine:
**Possible causes:**

1. Model didn't call any tools
2. Tools not properly registered
3. System message doesn't encourage tool use

**Solution:**
@@ -442,4 +444,4 @@ suite = EvalSuite(
## Next steps

- Learn about [comparative evaluations](/guides/create-tools/evaluate-tools/comparative-evaluations) to compare tool sources
- [Create evaluation suites](/guides/create-tools/evaluate-tools/create-evaluation-suite) with expectations
@@ -5,13 +5,17 @@ description: "Compare different tool implementations with the same test cases"

# Comparative evaluations

Compare different tool implementations with the same test cases using isolated tool sources.

Comparative evaluations let you test how well AI models select and use tools from different, isolated tool sources. Each "track" represents a separate tool registry, allowing you to compare implementations side-by-side.

This page explains how to create comparative evaluations, when to use them versus regular evaluations, and how to structure track-specific test cases. You'll use comparative evaluations when testing multiple implementations of the same feature, comparing different tool providers, or running A/B tests on alternative tool designs.

import { Callout, Steps } from "nextra/components";

## What are tracks?

**Tracks are isolated tool registries** within a single evaluation suite. Each track has its own set of tools that are **not shared** with other tracks. This isolation lets you test how models perform when given different tool options for the same task.
**Tracks isolate tool registries** within a single evaluation suite. Each track has its own set of tools that are **not shared** with other tracks. This isolation lets you test how models perform when given different tool options for the same task.

**Key concept**: Comparative evaluations test tool **selection** across different tool sets. Each track provides a different context (set of tools) to the model.
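
A minimal sketch of that idea, assuming `EvalSuite` is importable from `arcade_evals` and accepts a `name` argument; the catalog objects and track names are placeholders, and the full examples later on this page show the real setup:

```python
from arcade_evals import EvalSuite  # assumed import path

suite = EvalSuite(name="Search comparison")  # constructor arguments assumed

# Each track gets its own isolated registry of tools.
suite.add_tool_catalog(google_catalog, track="Google")          # placeholder catalog
suite.add_tool_catalog(duckduckgo_catalog, track="DuckDuckGo")  # placeholder catalog

# The same case runs once per track; on each run the model sees only
# that track's tools.
suite.add_comparative_case(
    name="basic_search",
    user_message="Search for Python tutorials",
)
```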

@@ -187,7 +191,7 @@ suite.add_tool_definitions(
```

<Callout type="info">
Tools must be registered before creating comparative cases that reference
You must register tools before creating comparative cases that reference
their tracks.
</Callout>

Expand Down Expand Up @@ -318,7 +322,7 @@ async def search_comparison():
track="DuckDuckGo",
)

# Simple query
# Basic query
suite.add_comparative_case(
name="basic_search",
user_message="Search for Python tutorials",
@@ -410,7 +414,7 @@ Case: search_with_filters

## Result structure

Comparative results are organized by track:
Comparative evaluations group results by track:

```python
{
@@ -555,7 +559,7 @@ Model: claude-sonnet-4-5-20250929

### Use descriptive track names

Choose clear names that indicate what's being compared:
Choose clear names that indicate what you're comparing:

```python
# ✅ Good
@@ -689,7 +693,7 @@ Use the exact tool names from the output.

**Symptom:** Same user message produces different scores across tracks

**Explanation:** This is expected. Different tool implementations may work differently.
**Explanation:** This is expected behavior. Different tool implementations may work differently.

**Solution:** Adjust expectations and critics per track to account for implementation differences.

@@ -754,4 +758,4 @@ suite.add_tool_catalog(catalog_v2, track="Python v2")

- [Create an evaluation suite](/guides/create-tools/evaluate-tools/create-evaluation-suite) with tracks
- Use [capture mode](/guides/create-tools/evaluate-tools/capture-mode) to discover track-specific tool calls
- [Run evaluations](/guides/create-tools/evaluate-tools/run-evaluations) with multiple models and tracks
@@ -5,7 +5,9 @@ description: "Learn how to evaluate your tools using Arcade"

# Create an evaluation suite

Evaluation suites help you test whether AI models use your tools correctly. This guide shows you how to create test cases that measure tool selection and parameter accuracy.
Learn how to create test cases that measure tool selection and parameter accuracy for your tools.

Evaluation suites help you test whether AI models use your tools correctly. This guide covers creating an evaluation file, defining test cases with expected tool calls, and running evaluations across different providers. You'll need this when validating tool behavior before deployment or when comparing model performance. You'll work in your MCP server directory and use the Arcade CLI to execute tests. By the end, you'll have a working evaluation suite that automatically tests your tools.

import { Steps, Tabs, Callout } from "nextra/components";

@@ -164,7 +166,7 @@ PASSED Get weather for city -- Score: 100.00%

## Loading tools

You can load tools from different sources. All methods are async and must be awaited in your `@tool_eval()` decorated function.
You can load tools from different sources. All methods are async and you must await them in your `@tool_eval()` decorated function.
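
A minimal sketch of that pattern, assuming `EvalSuite` and `tool_eval` are importable from `arcade_evals`, that the constructor takes a `name` argument, and that the decorated function returns the suite; the URL is a placeholder, and the subsections below show each source in full:

```python
from arcade_evals import EvalSuite, tool_eval  # assumed import path

@tool_eval()
async def weather_eval_suite():
    suite = EvalSuite(name="Weather tools")  # constructor arguments assumed

    # Loading methods are coroutines, so await them inside the decorated function.
    await suite.add_mcp_server("http://localhost:8000")  # placeholder URL

    # ...add cases here, as shown earlier on this page...
    return suite  # assumed return convention
```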

### From MCP HTTP server

@@ -245,7 +247,7 @@ await suite.add_mcp_server("http://server2.example")
suite.add_tool_definitions([{"name": "CustomTool", ...}])
```

All tools are accumulated in the suite's registry and available to the model.
The suite accumulates all tools in its registry and makes them available to the model.

## Expected tool calls

Expand All @@ -259,14 +261,14 @@ ExpectedMCPToolCall(
```

<Callout type="warning">
Tool names are normalized for compatibility with model tool calling. Dots
The system normalizes tool names for compatibility with model tool calling. Dots
(`.`) become underscores (`_`). For example, `Weather.GetCurrent` becomes
`Weather_GetCurrent`.
</Callout>
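
For example, a tool defined as `Weather.GetCurrent` is referenced by its normalized name in an expectation (mirroring the positional form used elsewhere in these guides):

```python
ExpectedMCPToolCall("Weather_GetCurrent", {"location": "Seattle"})
```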

## Critics

Critics validate tool call parameters. Each critic type handles different validation needs:

| Critic | Use case | Example |
| ------------------ | --------------- | ------------------------------------------------------------------ |
@@ -284,7 +286,7 @@ critics=[
]
```

All weights are normalized proportionally to sum to 1.0. Use numeric values or `FuzzyWeight` (`CRITICAL`, `HIGH`, `MEDIUM`, `LOW`).
The system normalizes all weights proportionally to sum to 1.0. Use numeric values or `FuzzyWeight` (`CRITICAL`, `HIGH`, `MEDIUM`, `LOW`).
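
A quick illustration of that normalization in plain Python, outside the evals API, just to show the arithmetic:

```python
# Raw weights assigned to two critics.
raw_weights = {"location": 3.0, "units": 1.0}

# Proportional normalization: divide each weight by the total.
total = sum(raw_weights.values())
normalized = {field: weight / total for field, weight in raw_weights.items()}

print(normalized)  # {'location': 0.75, 'units': 0.25}
```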

## Multiple tool calls

@@ -314,7 +316,7 @@ suite.add_case(
],
additional_messages=[
{"role": "user", "content": "I'm planning to visit Tokyo next week."},
        {"role": "assistant", "content": "That sounds exciting! What would you like to know about Tokyo?"},
],
)
```
@@ -341,4 +343,4 @@ If you want stricter suites, increase thresholds (for example `fail_threshold=0.

- Learn how to [run evaluations with different providers](/guides/create-tools/evaluate-tools/run-evaluations)
- Explore [capture mode](/guides/create-tools/evaluate-tools/capture-mode) to record tool calls
- Compare tool sources with [comparative evaluations](/guides/create-tools/evaluate-tools/comparative-evaluations)
22 changes: 12 additions & 10 deletions app/en/guides/create-tools/evaluate-tools/run-evaluations/page.mdx
@@ -5,7 +5,9 @@ description: "Learn how to run evaluations using Arcade"

# Run evaluations

The `arcade evals` command discovers and executes evaluation suites with support for multiple providers, models, and output formats.
Run evaluations using the `arcade evals` command to discover and execute evaluation suites with support for multiple providers, models, and output formats.

The `arcade evals` command searches for files starting with `eval_` and ending with `.py`, then executes them against your specified models and providers. You can compare performance across different models, capture tool calls for debugging, and generate reports in multiple formats. This is essential when you need to validate tool performance, compare model capabilities, or establish baseline expectations for your tools.

import { Callout } from "nextra/components";

@@ -34,7 +36,7 @@ arcade evals . --details
Filter to show only failures:

```bash
arcade evals . --only-failed
```

## Multi-provider support
@@ -79,7 +81,7 @@ When you specify multiple models, results show side-by-side comparisons.

## API keys

API keys are resolved in the following order:
The system resolves API keys in the following order:

| Priority | Format |
|----------|--------|
@@ -230,7 +232,7 @@ Arcade uses `--multi-run-pass-rule` to set the overall `status`, `passed`, and `
| `--api-key` | `-k` | Provider API key | `-k openai:sk-...` |
| `--capture` | - | Record without scoring | `--capture` |
| `--details` | `-d` | Show critic feedback | `--details` |
| `--only-failed` | `-f` | Filter failures | `--only-failed` |
| `--output` | `-o` | Output file (repeatable) | `-o results.md` |
| `--include-context` | - | Add messages to output | `--include-context` |
| `--max-concurrent` | `-c` | Parallel limit | `-c 10` |
@@ -320,12 +322,12 @@ Show detailed results including critic feedback:
arcade evals . --details
```

### `--only-failed`, `-f`

Show only failed test cases:

```bash
arcade evals . --only-failed
```

### `--max-concurrent`, `-c`
Expand All @@ -350,7 +352,7 @@ Displays detailed error traces and connection information.

## Understanding results

Results are formatted based on evaluation type (regular, multi-model, or comparative) and selected flags.
The system formats results based on evaluation type (regular, multi-model, or comparative) and selected flags.

### Summary format

Expand All @@ -363,7 +365,7 @@ Summary -- Total: 5 -- Passed: 4 -- Failed: 1
**How flags affect output:**

- `--details`: Adds per-critic breakdown for each case
- `--only-failed`: Filters to show only failed cases (summary shows original totals)
- `--include-context`: Includes system messages and conversation history
- `--num-runs`: Adds per-run statistics and aggregate scores
- Multiple models: Switches to comparison table format
@@ -435,7 +437,7 @@ This creates:

### Missing dependencies

If you see `ImportError: MCP SDK is required`, install the full package:

```bash
pip install 'arcade-mcp[evals]'
@@ -449,7 +451,7 @@ pip install anthropic

### Tool name mismatches

Tool names are normalized (dots become underscores). Check your tool definitions if you see unexpected names.
The system normalizes tool names (dots become underscores). Check your tool definitions if you see unexpected names.

### API rate limits

@@ -7,6 +7,8 @@ import { Callout } from "nextra/components";

# Why evaluate tools?

This page explains why tool evaluation is critical for reliable AI applications.

<div className="grid grid-cols-1 md:grid-cols-[4fr_3fr] gap-8">
<div>
Tool evaluations ensure AI models use your tools correctly in production. Unlike traditional testing, evaluations measure two key aspects:
@@ -72,7 +74,7 @@ FAILED Wrong tool selected -- Score: 50.00%

## Advanced features

Once you're comfortable with basic evaluations, explore these advanced capabilities:
Once you're comfortable with evaluations, explore these advanced capabilities:

### Capture mode

@@ -84,4 +86,4 @@ Test the same cases against different tool sources (tracks) with isolated regist

### Output formats

Save results in multiple formats (txt, md, html, json) for reporting and analysis. Specify output files with extensions or use no extension for all formats. [Learn more →](/guides/create-tools/evaluate-tools/run-evaluations#output-formats)