
Arbiter

Lightweight framework for generating, running, and reviewing MCP evals.


What are MCP evals?

MCP evals are lightweight, reproducible tests that measure how well LLMs use MCP servers/tools.

Scoring evals

Evals are scored via rule checks and LLM-as-judge, with metrics like task accuracy, tool-use precision, latency, and token cost.
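The rule-check side of scoring reduces to plain counting over runs. As a minimal sketch (the field names mirror Arbiter's results JSON, but the helper itself is hypothetical, not part of Arbiter's API):

```python
def tool_use_metrics(runs):
    """Compute tool-use recall/precision from per-run records.

    Each run carries boolean `tool_expected` and `tool_used` flags,
    as in Arbiter's results JSON.
    """
    expected_total = sum(r["tool_expected"] for r in runs)
    used_when_expected = sum(r["tool_expected"] and r["tool_used"] for r in runs)
    total_used = sum(r["tool_used"] for r in runs)
    not_expected = len(runs) - expected_total
    return {
        "recall": used_when_expected / expected_total if expected_total else 0.0,
        "precision": used_when_expected / total_used if total_used else 0.0,
        "false_positive_rate": (total_used - used_when_expected) / not_expected
        if not_expected
        else 0.0,
    }
```

Recall asks "did the model call a tool when it should have?", precision asks "when it called a tool, was one actually expected?".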

Why MCP evals?

They test the ability of LLMs to:

  • Select the right tools at the right time
  • Pass appropriate arguments to those tools
  • Produce correct final outcomes

How does Arbiter do MCP evals?

Arbiter is a lightweight framework for running eval suites on your MCP servers across different models and providers.

  1. Define your evals in a JSON config file, e.g. my_evals.json (see the Configuration section)
  2. Run the CLI arbiter execute my_evals.json

Quickstart Demo

Run the example evals

# make new project
mkdir arbiter-demo-project
cd arbiter-demo-project 

# install arbiter with uv
uv venv
uv pip install arbiter-mcp-evals

# configure claude api key
export ANTHROPIC_API_KEY=...

# run demo (will incur a small amount of api cost)
uv run arbiter genesis
uv run arbiter execute arbiter_example_evals.json

Generate evals for your own MCP server

# install arbiter globally using pipx (or use uv, as demonstrated above)
pipx install arbiter-mcp-evals

# configure claude api key
export ANTHROPIC_API_KEY=...

# generate and run custom eval suite
arbiter forge --forge-model "anthropic:claude-sonnet-4-20250514" \
  --num-tool-evals 15 \
  --num-abstention-evals 4 \
  --repeats 2
arbiter execute arbiter_forged_evals.json

Installation

Global

Install globally using pipx:

pipx install arbiter-mcp-evals
arbiter --version

Project

Or install inside your project:

uv init # create a new project (uv sets up its virtual environment on first add/run)
uv add arbiter-mcp-evals
uv run arbiter --version

Credentials

Arbiter is open-source and free to use.

Credentials are required based on the providers referenced in your config. Set env vars:

# Anthropic
export ANTHROPIC_API_KEY=...

# OpenAI
export OPENAI_API_KEY=...

# Google
export GOOGLE_API_KEY=...

Usage

  • Generate an example config you can edit:
arbiter genesis
  • Run an evaluation from a config file:
arbiter execute my_evals.json

Results are saved to timestamped JSON and log files in the same directory as your config file.

Execution confirmation

By default, arbiter execute shows a short confirmation preview before running:

  • Suite name, models, judge model, repeats
  • MCP server command and args
  • Total eval items (tool-use vs abstention counts)
  • Per-1K token rates for each configured model (from LiteLLM). If pricing cannot be resolved, the rate shows as "unknown" and cost is treated as 0.
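Given those per-1K rates, the cost preview is simple arithmetic. A sketch (the function name and rate structure are illustrative, not Arbiter's internals), with unresolved pricing contributing 0:

```python
def estimate_cost_usd(tokens, rates):
    """Estimate run cost from token counts and per-1K-token USD rates.

    `rates` maps "input"/"output" to USD per 1K tokens; a None rate
    means pricing could not be resolved and is treated as 0.
    """
    cost = 0.0
    for kind in ("input", "output"):
        rate = rates.get(kind)
        if rate is not None:
            cost += tokens[kind] / 1000 * rate
    return round(cost, 6)
```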

To run non-interactively, pass the -y/--yes flag:

arbiter execute -y my_evals.json

Combine with verbose mode for detailed traces:

arbiter execute -y -v my_evals.json

Configuration

Config files are JSON with the structure shown below. Arbiter is currently limited to testing one MCP server at a time.

{
    "name": "Unit Converter MCP Evals Suite",
    "models": [
        "anthropic:claude-sonnet-4-0",
        "anthropic:claude-3-5-haiku-latest",
        "openai:gpt-4o-mini",
        "google:gemini-2.5-pro"
    ],
    "judge": {
        "model": "google:gemini-2.5-pro",
        "max_tokens": 128,
    },
    "repeats": 3,
    "mcp_servers": {
        "unit-converter": {
            "command": "uvx",
            "args": ["unit-converter-mcp"],
            "transport": "stdio"
        }
    },
    "tool_use_evals": [
        {
            "query": "convert 0 celsius to fahrenheit",
            "answer": "32 Fahrenheit",
            "judge_mode": "llm"
        },
        {
            "query": "convert 100 fahrenheit to celsius",
            "answer": "37.7778",
            "judge_mode": "contains"
        }
    ],
    "abstention_evals": [
        {
            "query": "who are the temperature units named after?"
        }
    ]
}
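The two judge_mode values behave differently: "llm" asks the judge model to compare the response against the ground-truth answer, while "contains" is a case-insensitive substring check. A sketch of the latter (the helper name is illustrative):

```python
def contains_grade(answer: str, response: str) -> str:
    """Grade by case-insensitive substring matching ("contains" mode)."""
    return "pass" if answer.lower() in response.lower() else "fail"
```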

Requirements

  • Python 3.12+
  • Provider API keys set based on the providers used in models and judge.model

Features

  • Configurable LLM models and MCP servers
  • Tool usage tracking and validation
  • LLM-as-a-judge evaluation with ground truth comparison or case-insensitive contains matching
  • Detailed metrics including pass rates, precision, recall
  • Timestamped output files with comprehensive results
  • Rich console output with progress tracking
  • Cost tracking (tokens and USD) for model runs and cumulative judge usage
    • Note: Cost estimation only counts tokens used during evaluation turns and judge responses. It does not attempt to estimate long system/context prompts or hidden preambles.

Cost configuration

  • Costs are estimated using LiteLLM's pricing metadata. We pass models without providers (e.g., gpt-5-mini, gemini-2.5-pro, claude-3-haiku-20240307). If pricing cannot be resolved for a model, it will be set to 0.
  • Anthropic models: If you use non-dated aliases like claude-3-5-haiku-latest, LiteLLM cannot resolve pricing. Use dated model IDs such as claude-3-haiku-20240307; see the Anthropic model overview for the latest model IDs.
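Stripping the provider prefix before the pricing lookup can be sketched as (an illustrative helper, not Arbiter's code):

```python
def pricing_model_id(model: str) -> str:
    """Strip a "provider:" prefix ("openai:gpt-5-mini" -> "gpt-5-mini")
    so the bare model ID can be used for the pricing lookup."""
    _, _, name = model.partition(":")
    return name or model
```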

Testing

  • Unit tests (no LLM calls, no MCP servers):
uv run pytest
  • Live integration test (will incur costs by issuing calls to LLMs):
export ARB_TEST_LIVE=1
export ANTHROPIC_API_KEY=...
export OPENAI_API_KEY=...
export GOOGLE_API_KEY=...
uv run pytest -m integration
    • This is equivalent to running:
    arbiter genesis
    arbiter execute arbiter_example_evals.json
    • The pytest integration is intended for CI/CD; prefer running the two commands above when testing manually.

Output files

Running arbiter execute my_evals.json writes two files to the same directory as your config:

  • eval_YYYYMMDD_HHMMSS.json — structured results (config, per-model runs, summaries, costs)
  • eval_YYYYMMDD_HHMMSS.log — human-readable run log with progress lines
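The naming scheme above can be sketched as (a hypothetical helper that follows the documented pattern):

```python
from datetime import datetime
from pathlib import Path

def output_paths(config_path: str):
    """Build eval_YYYYMMDD_HHMMSS.json/.log paths in the config's directory."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    parent = Path(config_path).resolve().parent
    return parent / f"eval_{stamp}.json", parent / f"eval_{stamp}.log"
```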

Results JSON example

{
  "created_at": "2025-09-15T14:47:36.086492",
  "config": {
    "name": "Unit Converter MCP Evals Suite",
    "models": ["anthropic:claude-3-5-haiku-latest", "openai:gpt-5-mini", "google:gemini-2.5-flash"],
    "judge_model": "openai:gpt-5-mini",
    "repeats": 1,
    "mcp_servers": {
      "unit-converter-mcp": { "command": "uvx", "args": ["unit-converter-mcp"], "transport": "stdio" }
    }
  },
  "tool_use_evals": [
    { "query": "convert 0 celsius to fahrenheit", "answer": "32 Fahrenheit", "judge_mode": "llm" },
    { "query": "convert 8 radians to degrees", "answer": "458.366236", "judge_mode": "contains" },
    ...
  ],
  "abstention_evals": [
    { "query": "who is the Pascal unit named after?" },
    ...
  ],
  "results": {
    "openai:gpt-5-mini": {
      "model": "openai:gpt-5-mini",
      "runs": [
        {
          "iteration": 1,
          "query": "convert 0 celsius to fahrenheit",
          "ground_truth": "32 Fahrenheit",
          "model_raw_response": "0 °C = 32 °F ...",
          "grade": "pass",
          "judge_mode": "llm",
          "judge_raw_response": "<thinking>...</thinking>\n<result>correct</result>",
          "tool_expected": true,
          "tool_used": true,
          "tool_calls": ["convert_temperature"],
          "latency_s": 11.913,
          "tokens": { "input": 21756, "output": 138, "total": 21894 },
          "cost_usd": 0.005715
        },
        ...
      ],
      "summary": {
        "total_runs": 3,
        "judged_runs": 2,
        "pass_count": 2,
        "pass_rate": 1.0,
        "tool_use": {
          "expected_total": 2,
          "used_when_expected": 2,
          "recall": 1.0,
          "total_used": 2,
          "used_when_not_expected": 0,
          "precision": 1.0,
          "false_positive_rate": 0.0
        },
        "avg_latency_s": 6.877,
        "tokens": { "input": 54276, "output": 1020, "total": 55296 },
        "cost_usd": 0.015609
      }
    },
    "anthropic:claude-3-5-haiku-latest": { ... },
    "google:gemini-2.5-flash": { ... }
  },
  "summary_table_markdown": "| metric | ... |",
  "judge_cost_summary": {
    "model": "openai:gpt-5-mini",
    "tokens": { "input": 562, "output": 1816, "total": 2378 },
    "cost_usd": 0.003773
  },
  "summary": {
    "table_markdown": "| metric | ... |",
    "judge_cost": { ... },
    "overall": {
      "total_runs": 9,
      "judged_runs": 6,
      "pass_count": 4,
      "pass_rate": 0.6667,
      "tool_use": {
        "expected_total": 6,
        "used_when_expected": 6,
        "recall": 1.0,
        "total_used": 6,
        "used_when_not_expected": 0,
        "precision": 1.0,
        "false_positive_rate": 0.0
      },
      "avg_latency_s": 7.314,
      "tokens": { "input": 142241, "output": 3627, "total": 145868 },
      "cost_usd": 0.102978
    },
    "per_model": {
      "openai:gpt-5-mini": { "pass_rate": 1.0, ... },
      "anthropic:claude-3-5-haiku-latest": { ... },
      "google:gemini-2.5-flash": { ... }
    }
  }
}
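A quick way to pull headline numbers out of such a results file, once parsed (a sketch that reads the structure shown above):

```python
def per_model_pass_rates(doc: dict) -> dict:
    """Map each model to its summary pass rate, reading a parsed
    results document shaped like the example above."""
    return {
        model: entry["summary"]["pass_rate"]
        for model, entry in doc["results"].items()
    }
```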

Log file example

A compact example of the run log:

2025-09-15 14:47:05,986 INFO Starting MCP server 'unit-converter-mcp' and loading tools...
2025-09-15 14:47:06,281 INFO Loaded 16 tool(s) from MCP server.
2025-09-15 14:47:14,104 INFO ✅ [google:gemini-2.5-flash] convert 0 celsius to fahrenheit #1/1 | tools=True (convert_temperature) | tokens=7003 | 2.83s | $0.0024
2025-09-15 14:47:28,547 INFO ✅ [openai:gpt-5-mini] convert 8 radians to degrees #1/1 | tools=True (convert_angle) | tokens=21897 | 3.90s | $0.0057
2025-09-15 14:47:36,083 INFO === Overall Summary (All Models) ===

🛠️ Development

Prerequisites

  • Python 3.12+
  • uv package manager

Setup

# Clone the repository
git clone https://github.com/zazencodes/arbiter-mcp-evals
cd arbiter-mcp-evals

# Install dependencies
uv sync --extra dev

# Run tests
uv run pytest

# Run linting and formatting
uv run ruff format
uv run ruff check --fix
uv run isort --profile black .

# Type checking
uv run mypy arbiter/

Building

# Build package
uv build

# Test installation
uv run --with dist/*.whl arbiter --help

Release Checklist

  1. Update Version:

    • Increment the version number in pyproject.toml and arbiter/__init__.py.
  2. Update Changelog:

    • Add a new entry in CHANGELOG.md for the release.
      • Draft notes from recent changes (e.g., via git log --oneline or a diff).
  3. Create GitHub Release:

    • Draft a new release on the GitHub UI and publish it.
    • The GitHub workflow will automatically build and publish the package to PyPI.

