Skip to main content

Adversarial multi-agent eval harness for Claude API pipelines

Project description

Gauntlet ⚔️

Adversarial eval harness for multi-agent Claude API pipelines.

Python 3.10+ CI License: MIT PyPI

Gauntlet solves a problem every AI engineer hits in production: how do you know your agent pipeline actually works before it breaks in front of a real user?

Point it at any Claude or OpenAI agent, describe what it should do in plain English, and get back a pass rate, adversarial findings, and concrete recommendations — automatically.


Install

pip install gauntlet-eval

Add your Anthropic API key to your MCP config (recommended):

{
  "mcpServers": {
    "gauntlet": {
      "command": "python",
      "args": ["-m", "gauntlet.mcp_server"],
      "env": {
        "ANTHROPIC_API_KEY": "your-key-here"
      }
    }
  }
}

Or if using the CLI/API directly, add it to a .env file instead:

ANTHROPIC_API_KEY=sk-ant-...

→ Full IDE setup: docs/MCP_SETUP.md


Three ways to use it

1. IDE (MCP Servers) — least manual work

Connect Gauntlet as an MCP server in Cursor or Antigravity, then type find in the chat. Gauntlet scans your workspace, detects agent files, and walks you through the eval.

docs/MCP_SETUP.md

2. REST API

gauntlet serve
# Interactive docs at http://localhost:8000/docs

3. CLI

gauntlet run \
  --goal "Classify a support ticket as billing, technical, or general" \
  --agent-description "Single Claude classifier" \
  --agent-api-key "sk-ant-..." \
  --system-prompt "You are a classifier. Reply with one word." \
  --mode full \
  --runs 5

How it works

Your agent
    │
    ▼
┌─────────────────────────────────────┐
│          Gauntlet Runner             │
│                                      │
│  1. ScenarioAgent   → test inputs    │
│  2. AdversarialAgent→ hostile inputs │
│  3. JudgeAgent      → pass/fail      │
│  4. ReportAgent     → recommendations│
└──────────────────┬──────────────────┘
                   ▼
           gauntlet.db (SQLite)
Agent What it does
ScenarioAgent Generates realistic test inputs from your plain-English goal
AdversarialAgent Prompt injection, contradictory requirements, hallucination traps
JudgeAgent Scores each response pass/fail — supports custom criteria
ReportAgent Turns failures into prioritised, code-level recommendations

Python SDK

from gauntlet.core.runner import run_eval
from gauntlet.core.models import EvalRequest, EvalMode
import asyncio

request = EvalRequest(
    goal="Handle a customer refund request",
    agent_description="Claude agent with order lookup tool",
    agent_api_key="sk-ant-...",
    agent_system_prompt="You are a refund handler...",
    mode=EvalMode.full,
    runs=5,
)
report = asyncio.run(run_eval(request))
print(f"Pass rate: {report.pass_rate:.0%}")

Docs

Document What's in it
docs/MCP_SETUP.md Cursor & Antigravity setup, find command walkthrough
docs/CURSOR_PROMPT.md Ready-made prompt to paste in Cursor chat
docs/ARCHITECTURE.md System design, agent flow, data models

Contributing

pip install -e ".[dev]"
pytest tests/ -v
ruff check gauntlet/

PRs welcome.


License

MIT — see LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gauntlet_eval-0.1.2.tar.gz (28.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

gauntlet_eval-0.1.2-py3-none-any.whl (30.1 kB view details)

Uploaded Python 3

File details

Details for the file gauntlet_eval-0.1.2.tar.gz.

File metadata

  • Download URL: gauntlet_eval-0.1.2.tar.gz
  • Upload date:
  • Size: 28.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for gauntlet_eval-0.1.2.tar.gz
Algorithm Hash digest
SHA256 f3367fa6ee09e6ed50960405a7e8f63be4d40433b30b0663fd207c05a708a53a
MD5 40ade411f6d06161d55a73e1efed593c
BLAKE2b-256 0f201233dcbe7ee21c4e78d492c083fbc392a03b4017e44945942af0695eb95d

See more details on using hashes here.

File details

Details for the file gauntlet_eval-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: gauntlet_eval-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 30.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for gauntlet_eval-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 499decd7797f33e389ce9134be9f33e8164e899feb8a02162e9c2d5def1221af
MD5 6e76e4ce4416160f4d1f513642aeb79b
BLAKE2b-256 37339320cfe567fd52128443f64191ccb6ddd8d7ed167b7ee7b85fad54aff442

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page