Lightweight framework for generating, running, and reviewing MCP evals.
MCP evals are lightweight, reproducible tests that measure how well LLMs use MCP servers/tools.
Evals are scored via rule checks and LLM-as-judge, with metrics like task accuracy, tool-use precision, latency, and token cost.
They test the ability of LLMs to:
- Select the right tools at the right time
- Pass appropriate arguments to those tools
- Produce correct final outcomes
Arbiter is a lightweight framework for running eval suites on your MCP servers across different models and providers.
- Define your evals in a JSON config file, e.g. my_evals.json (see the config section)
- Run the CLI: arbiter execute my_evals.json
# make new project
mkdir arbiter-demo-project
cd arbiter-demo-project
# install arbiter with uv
uv venv
uv pip install arbiter-mcp-evals
# configure claude api key
export ANTHROPIC_API_KEY=...
# run demo (will incur a small amount of api cost)
uv run arbiter genesis
uv run arbiter execute arbiter_example_evals.json

# install arbiter globally using pipx (or use uv, as demonstrated above)
pipx install arbiter-mcp-evals
# configure claude api key
export ANTHROPIC_API_KEY=...
# generate and run custom eval suite
arbiter forge --forge-model "anthropic:claude-sonnet-4-20250514" \
--num-tool-evals 15 \
--num-abstention-evals 4 \
--repeats 2
arbiter execute arbiter_forged_evals.json

Install globally using pipx:
pipx install arbiter-mcp-evals
arbiter --version

Or install inside your project:
uv init # This will create a new virtual environment for your project
uv add arbiter-mcp-evals
uv run arbiter --version

Arbiter is open-source and free to use.
Credentials are required based on the providers referenced in your config. Set env vars:
# Anthropic
export ANTHROPIC_API_KEY=...
# OpenAI
export OPENAI_API_KEY=...
# Google
export GOOGLE_API_KEY=...

- Generate an example config you can edit:
arbiter genesis

- Run an evaluation from a config file:
arbiter execute my_evals.json

The results will be saved to a timestamped JSON file in the same directory as your config file.
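Because the results are plain JSON, they are easy to review programmatically. Below is a minimal sketch (not part of the arbiter CLI) that loads the newest results file next to your config and prints its summary, using the field names documented in the output section further down:

```python
# Hypothetical helper, not part of arbiter: review the newest results file.
import json
from pathlib import Path

config_dir = Path(".")  # directory containing my_evals.json
latest = max(config_dir.glob("eval_*.json"))  # timestamped names sort chronologically

data = json.loads(latest.read_text())
print(data["summary"]["table_markdown"])  # per-model comparison table
print("overall pass rate:", data["summary"]["overall"]["pass_rate"])
```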
By default, arbiter execute shows a short confirmation preview before running:
- Suite name, models, judge model, repeats
- MCP server command and args
- Total eval items (tool-use vs abstention counts)
- Per-1K token rates for each configured model (from LiteLLM). If pricing cannot be resolved, the rate shows as "unknown" and cost is treated as 0.
To run non-interactively, pass the -y/--yes flag:
arbiter execute -y my_evals.json

Combine with verbose mode for detailed traces:
arbiter execute -y -v my_evals.json

Config files are JSON with this structure:
Arbiter is currently limited to testing one MCP server at a time.
{
"name": "Unit Converter MCP Evals Suite",
"models": [
"anthropic:claude-sonnet-4-0",
"anthropic:claude-3-5-haiku-latest",
"openai:gpt-4o-mini",
"google:gemini-2.5-pro"
],
"judge": {
"model": "google:gemini-2.5-pro",
"max_tokens": 128,
},
"repeats": 3,
"mcp_servers": {
"unit-converter": {
"command": "uvx",
"args": ["unit-converter-mcp"],
"transport": "stdio"
}
},
"tool_use_evals": [
{
"query": "convert 0 celsius to fahrenheit",
"answer": "32 Fahrenheit",
"judge_mode": "llm"
},
{
"query": "convert 100 fahrenheit to celsius",
"answer": "37.7778",
"judge_mode": "contains"
}
],
"abstention_evals": [
{
"query": "who are the temperature units named after?"
}
]
}
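The judge_mode field selects how each tool-use eval is graded: "llm" asks the judge model to compare the response against answer, while "contains" performs a case-insensitive substring check. A minimal sketch of what contains-style grading amounts to (illustrative only, not arbiter's exact implementation):

```python
def contains_grade(model_response: str, expected_answer: str) -> str:
    """Case-insensitive substring check, the idea behind judge_mode="contains"."""
    return "pass" if expected_answer.lower() in model_response.lower() else "fail"

# e.g. the second tool_use_eval above requires "37.7778" to appear in the response
print(contains_grade("100 Fahrenheit is about 37.7778 Celsius", "37.7778"))  # -> pass
```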
Requirements:

- Python 3.12+
- Provider API keys set based on the providers used in models and judge.model
- Configurable LLM models and MCP servers
- Tool usage tracking and validation
- LLM-as-a-judge evaluation with ground truth comparison or case-insensitive contains matching
- Detailed metrics including pass rates, precision, recall
- Timestamped output files with comprehensive results
- Rich console output with progress tracking
- Cost tracking (tokens and USD) for model runs and cumulative judge usage
- Note: Cost estimation only counts tokens used during evaluation turns and judge responses. It does not attempt to estimate long system/context prompts or hidden preambles.
- Costs are estimated using LiteLLM's pricing metadata. We pass models without provider prefixes (e.g., gpt-5-mini, gemini-2.5-pro, claude-3-haiku-20240307). If pricing cannot be resolved for a model, its cost is set to 0.
- Anthropic models: if you use non-dated aliases like claude-3-5-haiku-latest, LiteLLM cannot provide pricing. Use dated model IDs such as claude-3-haiku-20240307. See the Anthropic model overview for the latest model IDs.
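As an illustration of how such a pricing lookup can work against LiteLLM's public pricing table (litellm.model_cost), here is a sketch of the idea, not arbiter's exact code:

```python
# Sketch: resolve per-1K token rates from LiteLLM's pricing metadata.
import litellm

def per_1k_rates(model_id: str) -> dict:
    """Return USD per 1K input/output tokens, or "unknown" if unpriced."""
    meta = litellm.model_cost.get(model_id, {})
    inp, out = meta.get("input_cost_per_token"), meta.get("output_cost_per_token")
    if inp is None or out is None:
        return {"input": "unknown", "output": "unknown"}  # cost then treated as 0
    return {"input": inp * 1000, "output": out * 1000}

print(per_1k_rates("claude-3-haiku-20240307"))   # dated ID: rates resolve
print(per_1k_rates("claude-3-5-haiku-latest"))   # alias: typically "unknown"
```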
Testing:

- Unit tests (no LLM calls, no MCP servers):

uv run pytest

- Live integration test (will incur costs by issuing calls to LLMs):
  - Equivalent to running arbiter genesis followed by arbiter execute arbiter_example_evals.json.
  - This pytest integration is intended for CI/CD testing; prefer running those commands directly when testing manually.
export ARB_TEST_LIVE=1
export ANTHROPIC_API_KEY=...
export OPENAI_API_KEY=...
export GOOGLE_API_KEY=...
uv run pytest -m integration

Running arbiter execute my_evals.json writes two files to the same directory as your config:
- eval_YYYYMMDD_HHMMSS.json — structured results (config, per-model runs, summaries, costs)
- eval_YYYYMMDD_HHMMSS.log — human-readable run log with progress lines
{
"created_at": "2025-09-15T14:47:36.086492",
"config": {
"name": "Unit Converter MCP Evals Suite",
"models": ["anthropic:claude-3-5-haiku-latest", "openai:gpt-5-mini", "google:gemini-2.5-flash"],
"judge_model": "openai:gpt-5-mini",
"repeats": 1,
"mcp_servers": {
"unit-converter-mcp": { "command": "uvx", "args": ["unit-converter-mcp"], "transport": "stdio" }
}
},
"tool_use_evals": [
{ "query": "convert 0 celsius to fahrenheit", "answer": "32 Fahrenheit", "judge_mode": "llm" },
{ "query": "convert 8 radians to degrees", "answer": "458.366236", "judge_mode": "contains" },
...
],
"abstention_evals": [
{ "query": "who is the Pascal unit named after?" },
...
],
"results": {
"openai:gpt-5-mini": {
"model": "openai:gpt-5-mini",
"runs": [
{
"iteration": 1,
"query": "convert 0 celsius to fahrenheit",
"ground_truth": "32 Fahrenheit",
"model_raw_response": "0 °C = 32 °F ...",
"grade": "pass",
"judge_mode": "llm",
"judge_raw_response": "<thinking>...</thinking>\n<result>correct</result>",
"tool_expected": true,
"tool_used": true,
"tool_calls": ["convert_temperature"],
"latency_s": 11.913,
"tokens": { "input": 21756, "output": 138, "total": 21894 },
"cost_usd": 0.005715
},
...
],
"summary": {
"total_runs": 3,
"judged_runs": 2,
"pass_count": 2,
"pass_rate": 1.0,
"tool_use": {
"expected_total": 2,
"used_when_expected": 2,
"recall": 1.0,
"total_used": 2,
"used_when_not_expected": 0,
"precision": 1.0,
"false_positive_rate": 0.0
},
"avg_latency_s": 6.877,
"tokens": { "input": 54276, "output": 1020, "total": 55296 },
"cost_usd": 0.015609
}
},
"anthropic:claude-3-5-haiku-latest": { ... },
"google:gemini-2.5-flash": { ... }
},
"summary_table_markdown": "| metric | ... |",
"judge_cost_summary": {
"model": "openai:gpt-5-mini",
"tokens": { "input": 562, "output": 1816, "total": 2378 },
"cost_usd": 0.003773
},
"summary": {
"table_markdown": "| metric | ... |",
"judge_cost": { ... },
"overall": {
"total_runs": 9,
"judged_runs": 6,
"pass_count": 4,
"pass_rate": 0.6667,
"tool_use": {
"expected_total": 6,
"used_when_expected": 6,
"recall": 1.0,
"total_used": 6,
"used_when_not_expected": 0,
"precision": 1.0,
"false_positive_rate": 0.0
},
"avg_latency_s": 7.314,
"tokens": { "input": 142241, "output": 3627, "total": 145868 },
"cost_usd": 0.102978
},
"per_model": {
"openai:gpt-5-mini": { "pass_rate": 1.0, ... },
"anthropic:claude-3-5-haiku-latest": { ... },
"google:gemini-2.5-flash": { ... }
}
}
}

A compact example of the run log:
2025-09-15 14:47:05,986 INFO Starting MCP server 'unit-converter-mcp' and loading tools...
2025-09-15 14:47:06,281 INFO Loaded 16 tool(s) from MCP server.
2025-09-15 14:47:14,104 INFO ✅ [google:gemini-2.5-flash] convert 0 celsius to fahrenheit #1/1 | tools=True (convert_temperature) | tokens=7003 | 2.83s | $0.0024
2025-09-15 14:47:28,547 INFO ✅ [openai:gpt-5-mini] convert 8 radians to degrees #1/1 | tools=True (convert_angle) | tokens=21897 | 3.90s | $0.0057
2025-09-15 14:47:36,083 INFO === Overall Summary (All Models) ===
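For reference, here is roughly how the tool_use summary fields in the results JSON relate to the per-run records. This is a sketch of the metric definitions rather than arbiter's exact code; in particular, the false-positive denominator is an assumption:

```python
def tool_use_summary(runs: list[dict]) -> dict:
    """Recompute tool-use metrics from per-run records like those shown above."""
    expected = [r for r in runs if r["tool_expected"]]          # tool-use evals
    not_expected = [r for r in runs if not r["tool_expected"]]  # abstention evals

    used_when_expected = sum(r["tool_used"] for r in expected)
    used_when_not_expected = sum(r["tool_used"] for r in not_expected)
    total_used = used_when_expected + used_when_not_expected

    return {
        "expected_total": len(expected),
        "used_when_expected": used_when_expected,
        "recall": used_when_expected / len(expected) if expected else 0.0,
        "total_used": total_used,
        "used_when_not_expected": used_when_not_expected,
        "precision": used_when_expected / total_used if total_used else 0.0,
        # assumed: share of abstention runs where a tool was (wrongly) called
        "false_positive_rate": used_when_not_expected / len(not_expected) if not_expected else 0.0,
    }
```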
Development prerequisites:

- Python 3.12+
- uv package manager
# Clone the repository
git clone https://github.com/zazencodes/arbiter-mcp-evals
cd arbiter-mcp-evals
# Install dependencies
uv sync --extra dev
# Run tests
uv run pytest
# Run linting and formatting
uv run ruff format
uv run ruff check --fix
uv run isort --profile black .
# Type checking
uv run mypy arbiter/

# Build package
uv build
# Test installation
uv run --with dist/*.whl arbiter --help

Release checklist:

1. Update Version:
   - Increment the version number in pyproject.toml and arbiter/__init__.py.
2. Update Changelog:
   - Add a new entry in CHANGELOG.md for the release.
   - Draft notes from recent changes (e.g., via git log --oneline or a diff).
3. Create GitHub Release:
   - Draft a new release on the GitHub UI and publish it.
   - The GitHub workflow will automatically build and publish the package to PyPI.