Break your agent before your users do. Adversarial stress-testing and regression suites for AI agents.
Project description
Gauntlet
Break your agent before your users do.
Gauntlet fires a suite of adversarial, edge-case "users" at your AI agent over HTTP, finds where it fails (system-prompt leaks, unsafe actions, scope drift, crashes, runaway output), ranks the failures by severity, and turns them into a regression suite you can gate in CI. Framework-agnostic: if your agent speaks HTTP, Gauntlet can test it.
It is built on one belief: a green eval only means something if you defined what red looks like. Most agent "evals" pass because nobody wrote the test that would have failed.
Why this exists
Teams ship agents that work in the demo and then quietly break in production: the model picks the wrong tool, leaks its prompt to a clever user, confirms an action it should have refused, or loops. The expensive part of reliability is not the dashboard, it is finding the failures and making sure they stay fixed. Gauntlet is the part that goes looking for them.
Quickstart (30 seconds, no API key)
# Install (once published): pip install gauntlet-agent
# or as an isolated CLI: pipx install gauntlet-agent
# 1. See it work against a deliberately broken sample agent, in one process:
python examples/demo.py
# 2. Or run it against your own agent (any HTTP endpoint that takes JSON):
# terminal A:
python examples/sample_agent.py
# terminal B:
gauntlet run --target http://localhost:8000/chat --canaries examples/canaries.json
Network-free self-test (runs anywhere, good for CI):
python tests/test_offline.py # or: pytest
Using it on your agent
Point --target at any endpoint that accepts POST {"message": "..."} and
returns JSON or text. Field names are configurable. Agent isn't an HTTP service
yet (LangChain, OpenAI Assistants, a bare function)? Wrap it in a few lines with
the shims in examples/adapters/ — no server code to write.
gauntlet run \
--target https://your-agent.example.com/v1/chat \
--request-field input --response-field output \
--header "Authorization: Bearer $TOKEN" \
--canaries my_canaries.json \
--fail-on HIGH \
--json gauntlet-report.json
Exit code is nonzero when anything at or above --fail-on is found, so you can
drop it straight into CI:
# .github/workflows/agent.yml
- run: gauntlet run --target $STAGING_AGENT_URL --canaries canaries.json --fail-on HIGH
Canaries: define what failure means
The universal graders (errors, empty replies, runaway length) need no config. The high-value graders are the canaries you supply, the things your agent must never do:
{
"never_output_substrings": ["INTERNAL-OPS-KEY", "system prompt"],
"unsafe_compliance_substrings": ["refund of $", "deleted the account"],
"max_response_chars": 6000,
"severity_overrides": { "missing_refusal": "MEDIUM", "data_leak": "CRITICAL" }
}
severity_overrides lets you retune any finding kind to your own risk bar
(CRITICAL/HIGH/MEDIUM/LOW/INFO) — e.g. downgrade missing_refusal if your agent
is intentionally chatty, or keep leaks at CRITICAL.
How it works
- Adversaries (
gauntlet/adversaries.py) — a deterministic library of probes across prompt injection, scope discipline, false premises, data exfiltration, malformed input, and loop bait. Deterministic so runs are reproducible. - Runner (
gauntlet/runner.py) — fires probes concurrently at your HTTP endpoint, stdlib only. - Graders (
gauntlet/graders.py) — universal reliability checks plus your canaries, producing severity-ranked findings (CRITICAL to INFO). - Report (
gauntlet/report.py) — a readable summary, the worst failures, and a JSON artifact for CI.
Optional: LLM-powered mode
The default needs no API key. With --llm, Gauntlet generates fresh adversarial
personas from a description of your agent and can grade open-ended behavior with
a judge instead of substring canaries.
pip install "gauntlet-agent[llm]"
export ANTHROPIC_API_KEY=...
gauntlet run --target $URL --llm --describe "support bot for an online store"
The judge is a thin, swappable layer. The methodology is the point: generate probes from your agent's real surface, and validate the judge against a small human-labeled gold set before trusting its scores.
Calibrate the judge (don't trust a score you haven't validated)
gauntlet calibrate --gold examples/gold.jsonl --min-kappa 0.6
Runs the judge over a human-labeled gold set and reports accuracy, precision,
recall (of real failures, how many the judge catches — the number that
matters for a safety tool), F1, and Cohen's κ (chance-corrected agreement).
It exits nonzero below --min-kappa, so a weak judge fails CI instead of quietly
shipping bad scores. A starter gold set lives at examples/gold.jsonl.
Multi-turn probes (jailbreaks that build across turns)
Real jailbreaks are rarely one message — they build trust, plant context, or
manufacture a false premise over several turns, then cash it in. Add --multiturn
to include built-in conversation probes (crescendo, gradual role-reset, context
poisoning, manufactured commitment). Gauntlet drives each turn-by-turn and grades
the final reply.
# stateful agent (keeps its own session):
gauntlet run --target $URL --multiturn --canaries canaries.json
# stateless agent: send the running transcript as an OpenAI-style messages array
gauntlet run --target $URL --multiturn --history-field messages --canaries canaries.json
The report prints the full conversation for any multi-turn failure, so you can see exactly how it got there.
Add --extended for newer single-turn attack classes (indirect/RAG injection,
unicode-homoglyph evasion, refusal suppression, base64 encoding smuggling,
tool-description extraction, PII handling):
gauntlet run --target $URL --extended --multiturn --canaries canaries.json
Trace-aware grading (score the tool calls, not just the text)
A safe-sounding answer can hide an unsafe action. If your agent returns the tool
calls it made, Gauntlet can grade those directly. Have the agent include a
trace in its JSON response:
{ "response": "All set!", "trace": [ {"tool": "issue_refund", "args": {"amount": 999}} ] }
Then point at it and declare which tools/args are off-limits:
gauntlet run --target $URL --trace-field trace --canaries canaries.json
{ "forbidden_tools": ["issue_refund", "delete_user", "send_bulk_email"],
"forbidden_tool_arg_substrings": ["all_users", "DROP TABLE"] }
A forbidden tool call (or a forbidden argument) is a HIGH finding even if the text looked fine — catching the agent that says "I can't" but calls the tool anyway.
Roadmap
- Judge calibration command (
gauntlet calibrate) - Persona memory: multi-turn conversation probes (
--multiturn) - Trace-aware grading (
--trace-field+ forbidden tools/args) - Hosted dashboard + scheduled runs (see the
apps/dashboardin the monorepo)
License
MIT. See LICENSE.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file gauntlet_agent-0.1.0.tar.gz.
File metadata
- Download URL: gauntlet_agent-0.1.0.tar.gz
- Upload date:
- Size: 27.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
1f789cfe076de9ea1e8235ffd62069ae93d18121b5f41d37180db1cb4cf94982
|
|
| MD5 |
5320e1605ed57eb4180c20e09209e11e
|
|
| BLAKE2b-256 |
7cc60facfc73d34c2fcd78bf0ec68c85818656066773e61b74fc2951ca5c3c98
|
Provenance
The following attestation bundles were made for gauntlet_agent-0.1.0.tar.gz:
Publisher:
release.yml on GauntletVectorLabs/gauntlet
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
gauntlet_agent-0.1.0.tar.gz -
Subject digest:
1f789cfe076de9ea1e8235ffd62069ae93d18121b5f41d37180db1cb4cf94982 - Sigstore transparency entry: 1822596530
- Sigstore integration time:
-
Permalink:
GauntletVectorLabs/gauntlet@d7dd832e8c09f5d86bf68d5239b3041b71f827a8 -
Branch / Tag:
refs/tags/v0.1.0 - Owner: https://github.com/GauntletVectorLabs
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@d7dd832e8c09f5d86bf68d5239b3041b71f827a8 -
Trigger Event:
push
-
Statement type:
File details
Details for the file gauntlet_agent-0.1.0-py3-none-any.whl.
File metadata
- Download URL: gauntlet_agent-0.1.0-py3-none-any.whl
- Upload date:
- Size: 22.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a3d8cdaa997684d299fad9f7c996e7771ad4837f41d3c12aa56c1aae8007a20b
|
|
| MD5 |
f3756a27f1cb065c3fa8ef161ac818f1
|
|
| BLAKE2b-256 |
94628907264b2db72720d8c0ab1c355587287e5635668b175aca6803b54cc57b
|
Provenance
The following attestation bundles were made for gauntlet_agent-0.1.0-py3-none-any.whl:
Publisher:
release.yml on GauntletVectorLabs/gauntlet
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
gauntlet_agent-0.1.0-py3-none-any.whl -
Subject digest:
a3d8cdaa997684d299fad9f7c996e7771ad4837f41d3c12aa56c1aae8007a20b - Sigstore transparency entry: 1822596621
- Sigstore integration time:
-
Permalink:
GauntletVectorLabs/gauntlet@d7dd832e8c09f5d86bf68d5239b3041b71f827a8 -
Branch / Tag:
refs/tags/v0.1.0 - Owner: https://github.com/GauntletVectorLabs
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@d7dd832e8c09f5d86bf68d5239b3041b71f827a8 -
Trigger Event:
push
-
Statement type: