Skip to main content

Adversarial multi-agent eval harness for Claude API/OpenAI or any LLM agent pipelines

Project description

Gauntlet โš”๏ธ

Adversarial eval harness for any LLM agent pipeline โ€” Claude, OpenAI, or your own

Python 3.10+ CI License: MIT PyPI

๐Ÿ“ฆ gauntlet-eval on PyPI โ€” pip install gauntlet-eval

Gauntlet solves a problem every AI engineer hits in production: how do you know your agent pipeline actually works before it breaks in front of a real user?

Point it at Claude, OpenAI, or any LLM agent, describe what it should do in plain English, and get back a pass rate, per-agent breakdown, adversarial findings, and concrete recommendations automatically.


Install

pip install gauntlet-eval

Add your Anthropic API key to your MCP config (recommended):

{
  "mcpServers": {
    "gauntlet": {
      "command": "python",
      "args": ["-m", "gauntlet.mcp_server"],
      "env": {
        "ANTHROPIC_API_KEY": "your-key-here"
      }
    }
  }
}

Or if using the CLI/API directly, add it to a .env file:

ANTHROPIC_API_KEY=sk-ant-...

โ†’ Full IDE setup: docs/MCP_SETUP.md


Three ways to use it

1. IDE โ€” least manual work

Connect Gauntlet as an MCP server in Cursor or Antigravity. Type find in the chat โ€” Gauntlet scans your workspace, detects agent files automatically, and runs the eval. No JSON, no terminal.

โ†’ docs/MCP_SETUP.md

2. REST API

gauntlet serve
# Interactive docs at http://localhost:8000/docs

3. CLI

gauntlet run \
  --goal "Classify a support ticket as billing, technical, or general" \
  --agent-description "Single Claude classifier" \
  --agent-api-key "sk-ant-..." \
  --system-prompt "You are a classifier. Reply with one word." \
  --mode full \
  --runs 5

How it works

Your agent
    โ”‚
    โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚          Gauntlet Runner             โ”‚
โ”‚                                      โ”‚
โ”‚  1. ScenarioAgent   โ†’ test inputs    โ”‚
โ”‚  2. AdversarialAgentโ†’ hostile inputs โ”‚
โ”‚  3. JudgeAgent      โ†’ pass/fail      โ”‚
โ”‚  4. ReportAgent     โ†’ recommendationsโ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                   โ–ผ
           gauntlet.db (SQLite)
Agent What it does
ScenarioAgent Generates realistic test inputs from your plain-English goal
AdversarialAgent Prompt injection, contradictory requirements, hallucination traps
JudgeAgent Scores each response pass/fail โ€” supports custom success criteria
ReportAgent Turns failures into prioritised, code-level recommendations

Average cost per full eval run(Approximate): ~$0.002


Single-agent eval

Point Gauntlet at any Claude or OpenAI model with a system prompt via the REST API, CLI, or Python SDK. It generates test scenarios, runs them through your agent, and returns a full report.

โ†’ See docs/ARCHITECTURE.md for SDK usage.


Multi-agent eval โ€” automatic flow tracing

Add @trace("AgentName") above each agent function. Gauntlet automatically records every call, judges each step individually, and pinpoints exactly which agent is the bottleneck.

โ†’ See docs/ARCHITECTURE.md for the full @trace example.

The report shows the complete execution flow:

Traced flow: Router โ†’ Writer โ†’ Validator

โš ๏ธ Bottleneck: Writer (43% pass rate)

| Agent     | Pass Rate | Status        |
|-----------|-----------|---------------|
| Router    | 86%       | โœ…            |
| Writer    | 43%       | โŒ bottleneck |
| Validator | 100%      | โœ…            |

Scenario s2 โ€” FAIL
  โœ… Router    โ†’ returned "billing" (120ms)
  โŒ Writer    โ†’ returned "" โ€” output was empty
  โš ๏ธ Validator โ†’ never reached โ€” upstream failure

Docs

Document What's in it
docs/MCP_SETUP.md Cursor & Antigravity setup, find command walkthrough
docs/CURSOR_PROMPT.md Ready-made prompt to paste in Cursor chat
docs/ARCHITECTURE.md System design, agent flow, data models

Contributing

pip install -e ".[dev]"
pytest tests/ -v
ruff check gauntlet/

PRs welcome.


License

MIT โ€” see LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gauntlet_eval-0.1.3.tar.gz (28.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

gauntlet_eval-0.1.3-py3-none-any.whl (30.4 kB view details)

Uploaded Python 3

File details

Details for the file gauntlet_eval-0.1.3.tar.gz.

File metadata

  • Download URL: gauntlet_eval-0.1.3.tar.gz
  • Upload date:
  • Size: 28.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for gauntlet_eval-0.1.3.tar.gz
Algorithm Hash digest
SHA256 13b6d7886ce7e9dcc8229e2c4e722c0ec0fef9e6625e5932d2211a89c5d8761c
MD5 62bcbfed113bcb3abc96bb93725528b1
BLAKE2b-256 7d48e36bb5a17ee9f07558ed8854bcee5e74db2539cf808883f53524b6091edd

See more details on using hashes here.

File details

Details for the file gauntlet_eval-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: gauntlet_eval-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 30.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for gauntlet_eval-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 404e939876729e4d33a0e09c77f745be96e05b8174de78ee68de5ff4f6112093
MD5 147d6f2623d459f350e9a5005c9ea706
BLAKE2b-256 646d54c54ac50ac3c2237f73a4f64ae62ab4b9a6556eeb2c5b18f95e3af3a753

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page