Chaos Engineering & Failure Diagnosis for AI Agents
AgentChaos
Chaos testing and failure diagnosis for AI agents.
AgentChaos is currently at version 0.2.0. It is a Python toolkit for repeatedly
running agent tests, injecting realistic failures, collecting trace-like spans, detecting
common failure modes, ingesting framework-shaped traces, and exporting reliability reports.
The v0.1 track is intentionally small: pytest integration, two injectors, two detectors, three metrics, JSON reports, and a CLI summary command. v0.2 is focused on framework-neutral adapter boundaries, runtime ingestion prototypes, trace-based semantic detectors, and release hygiene.
Status
Implemented today:
- ChaosTracer for agent, tool, and chat spans
- ChaosRunner for repeated callable execution
- execute_chaos_test() as the framework-neutral execution service
- Injectors: ToolTimeout, ArgSchemaMutation
- Detectors: LoopDetector, ArgSchemaViolationDetector, ToolInvocationMismatchDetector
- Metrics: step_success_rate_at_k, run_variance, recovery_rate
- JSON report exporter
- Pytest plugin: @chaos, --chaos, --chaos-report, chaos_tracer
- CLI: agentchaos summarize <report.json>
- No-API-key local demo
- v0.2 development: minimal LangGraph-like adapter prototype
- v0.2 development: LangGraph runtime stream/astream_events ingestion adapter with parent reconstruction
- v0.2 development: OpenAI Agents-like event model skeleton, without OpenAI SDK imports
- v0.2 development: CrewAI-like event model skeleton, without CrewAI SDK imports
- v0.2 development: MCP-like event model skeleton, without MCP SDK imports
Not implemented yet:
- HTML report
- CrewAI, OpenAI Agents, or MCP production runtime adapters
- Production-ready framework adapters beyond the current LangGraph runtime ingestion and skeleton prototypes
- Benchmark integrations such as tau-bench
- Production sampling or hosted dashboard
v0.2 Roadmap
- Keep adapter prototypes framework-neutral by mapping runtime-like events into TraceSpan without importing framework SDKs.
- Expand trace-based semantic detectors around tool-use reliability while keeping detectors dependent only on internal spans.
- Harden release hygiene with public preflight checks, package verification commands, and stable JSON/CLI behavior.
- Defer production runtime adapters until the adapter boundary and skeleton tests are stable.
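As an illustration of the adapter boundary described above, here is a minimal sketch of mapping plain-dict runtime events into spans and reconstructing parent links from nesting order. The TraceSpan fields, the event keys ("event", "id", "kind", "name"), and the events_to_spans helper are all assumptions made for this example; the real TraceSpan and adapters may look different.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TraceSpan:
    """Hypothetical stand-in; the real TraceSpan fields may differ."""
    span_id: str
    parent_id: Optional[str]
    kind: str  # e.g. "agent", "tool", "chat"
    name: str


def events_to_spans(events: list[dict[str, Any]]) -> list[TraceSpan]:
    """Map start/end events to spans, inferring parents from nesting order."""
    open_ids: list[str] = []  # ids of spans that are currently open
    spans: list[TraceSpan] = []
    for event in events:
        if event["event"] == "start":
            # The innermost open span, if any, becomes the parent.
            parent = open_ids[-1] if open_ids else None
            spans.append(TraceSpan(event["id"], parent, event["kind"], event["name"]))
            open_ids.append(event["id"])
        elif event["event"] == "end" and event["id"] in open_ids:
            # Close this span and any children that were left open beneath it.
            while open_ids and open_ids.pop() != event["id"]:
                pass
    return spans


# Plain dicts only: no framework SDK is imported anywhere in this module.
events = [
    {"event": "start", "id": "a1", "kind": "agent", "name": "flight-agent"},
    {"event": "start", "id": "t1", "kind": "tool", "name": "search_flights"},
    {"event": "end", "id": "t1"},
    {"event": "start", "id": "t2", "kind": "tool", "name": "book_flight"},
    {"event": "end", "id": "t2"},
    {"event": "end", "id": "a1"},
]
spans = events_to_spans(events)
```

Because the function only reads dictionaries, detectors built on the resulting spans stay independent of whichever framework produced the events.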
Quickstart
Clone the repo and install it locally:
git clone https://github.com/jeffery0929/agentchaos.git
cd agentchaos
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
Run the no-API-key demo:
pytest examples/basic --chaos --chaos-report chaos_reports/basic.json -q
Summarize the report:
agentchaos summarize chaos_reports/basic.json
Expected result:
AgentChaos report: pytest_suite
tests: 1
total runs: 3
successful runs: 3
failed runs: 0
matched detections: 3
First Test
from agent_chaos import chaos
from agent_chaos.injectors import ToolTimeout


@chaos(injectors=[ToolTimeout(p=0.2)], runs=10)
def test_agent_handles_tool_timeouts(chaos_tracer):
    # my_agent is a placeholder for the agent under test.
    with chaos_tracer.invoke_agent("flight-agent"):
        result = my_agent.run("Book a flight from SFO to NRT")
    assert result.status == "success"
Run it with:
pytest --chaos --chaos-report agentchaos-report.json
agentchaos summarize agentchaos-report.json
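To make the idea behind ToolTimeout(p=0.2) concrete, the following is a minimal sketch of probabilistic timeout injection around a tool function. It is not the library's implementation: with_timeout_injection and search_flights are hypothetical names, and the seeded random.Random is an assumption added so that repeated runs are reproducible.

```python
import random
from typing import Any, Callable


def with_timeout_injection(
    tool: Callable[..., Any], p: float, rng: random.Random
) -> Callable[..., Any]:
    """Wrap a tool so that each call raises TimeoutError with probability p."""

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        if rng.random() < p:
            raise TimeoutError(f"injected timeout in {tool.__name__}")
        return tool(*args, **kwargs)

    return wrapped


def search_flights(origin: str, dest: str) -> str:
    """Hypothetical tool used only for this example."""
    return f"{origin}->{dest}: 3 flights"


# A seeded generator makes the sequence of injected failures repeatable.
flaky = with_timeout_injection(search_flights, p=0.2, rng=random.Random(0))

outcomes = []
for _ in range(10):
    try:
        flaky("SFO", "NRT")
        outcomes.append("ok")
    except TimeoutError:
        outcomes.append("timeout")
print(outcomes)
```

An agent that retries or degrades gracefully should still reach a successful result in most of these runs, which is what the assertion in the test above checks.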
@chaos is lazy-loaded from the package root, so a plain import agent_chaos does not
pull in pytest. Pytest is only needed when using the pytest plugin.
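One common way to get this behavior in Python is a module-level __getattr__ (PEP 562). The sketch below simulates a package root with types.ModuleType so that it runs as a single file. Whether AgentChaos uses this exact mechanism is an assumption, and demo_pkg, _load_chaos, and load_log are hypothetical names for illustration.

```python
import types

load_log: list[str] = []  # records when the deferred loader actually runs


def _load_chaos():
    """Stand-in for importing the pytest-dependent module that defines the decorator."""
    load_log.append("chaos")

    def chaos(**options):
        # Placeholder decorator: returns the test function unchanged.
        return lambda test_fn: test_fn

    return chaos


pkg = types.ModuleType("demo_pkg")  # stand-in for a package root module


def _module_getattr(name: str):
    # PEP 562: called only when normal attribute lookup on the module fails.
    if name == "chaos":
        value = _load_chaos()
        setattr(pkg, name, value)  # cache, so the loader runs at most once
        return value
    raise AttributeError(f"module {pkg.__name__!r} has no attribute {name!r}")


pkg.__getattr__ = _module_getattr

print(load_log)          # nothing has been loaded at "import" time
decorator = pkg.chaos    # first access triggers the loader
print(load_log)
print(pkg.chaos is decorator)  # later lookups hit the cached attribute
```

In a real package the same function would be written as def __getattr__(name) at the top level of __init__.py.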
Report Contents
The JSON report includes:
- total, successful, and failed run counts
- pass rate
- step_success_rate_at_k
- run_variance
- recovery_rate
- per-run detector results
- optional span payloads with --chaos-include-spans
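As a rough illustration of how the run counts and pass rate relate, here is a sketch that derives them from per-run outcomes. The report shape used here (a "runs" list with a boolean "success" key) and the summarize helper are assumptions for the example; the actual AgentChaos JSON schema may use different keys.

```python
import json

# Hypothetical report shape; the real schema may differ.
raw = '{"runs": [{"success": true}, {"success": true}, {"success": false}]}'


def summarize(report: dict) -> dict:
    """Derive run counts and pass rate from per-run outcomes."""
    runs = report["runs"]
    total = len(runs)
    successful = sum(1 for run in runs if run["success"])
    return {
        "total_runs": total,
        "successful_runs": successful,
        "failed_runs": total - successful,
        # Guard against an empty report to avoid dividing by zero.
        "pass_rate": successful / total if total else 0.0,
    }


summary = summarize(json.loads(raw))
print(summary)
```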
Why This Exists
Production agent failures are often not clean assertion failures. They show up as loops, bad tool arguments, fabricated observations, retry storms, premature stops, and task drift. AgentChaos focuses on a narrow v0.1 gap:
fault injection + trace-backed failure classification + CI-friendly reports
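To show what trace-backed classification can mean for one of these failure modes, here is a minimal sketch that flags loops as the same tool call repeated several times in a row. This is not the library's LoopDetector: find_repeated_calls, the (tool, arguments) tuple shape, and the default threshold of 3 are assumptions chosen for illustration.

```python
from typing import Iterable


def find_repeated_calls(
    calls: Iterable[tuple[str, str]], threshold: int = 3
) -> list[tuple[str, str]]:
    """Return each (tool, args) pair that occurs `threshold` times consecutively."""
    flagged: list[tuple[str, str]] = []
    previous = None
    streak = 0
    for call in calls:
        # Extend the streak on an identical call, otherwise start a new one.
        streak = streak + 1 if call == previous else 1
        previous = call
        if streak == threshold:  # report each streak once, when it first qualifies
            flagged.append(call)
    return flagged


trace = [
    ("search_flights", "SFO-NRT"),
    ("search_flights", "SFO-NRT"),
    ("search_flights", "SFO-NRT"),
    ("search_flights", "SFO-NRT"),
    ("book_flight", "NRT"),
]
loops = find_repeated_calls(trace)
print(loops)
```

A test can pass its final assertion and still contain a streak like this, which is why the detection result is reported separately from the run outcome.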
Current v0.1 Scope
In scope:
- pytest-first local workflow
- deterministic local demo
- JSON report as the stable output
- CLI summary for report inspection
- small, testable core modules
Out of scope for v0.1:
- hosted UI
- SaaS dashboard
- HTML report unless the core stabilizes first
- framework-specific adapters
- public leaderboard
Development Checks
pytest -q
ruff check agent_chaos tests examples
ruff format --check agent_chaos tests examples
mypy agent_chaos tests examples
pytest examples/basic --chaos --chaos-report chaos_reports/basic.json -q
agentchaos summarize chaos_reports/basic.json
Optional Paid OpenAI Dogfood
After configuring a small API budget and adding OPENAI_API_KEY to ignored local env
files, run the manual paid smoke test:
python examples/openai_paid_dogfood/run_demo.py
agentchaos summarize chaos_reports/openai-paid-dogfood.json
The default model is gpt-5.4-nano to keep the first paid run cheap.
License
Apache 2.0. See LICENSE.
Download files
File details
Details for the file agentchaos_core-0.2.0.tar.gz.
File metadata
- Download URL: agentchaos_core-0.2.0.tar.gz
- Upload date:
- Size: 145.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cfb8b718038f8c592b4be3b12f88c5018472d76603d3595233cb7b77460ac9d3 |
| MD5 | ce6920727ed98858c31821307e64c47e |
| BLAKE2b-256 | a01d18431622f4c439c601cd7672200fed32e0f6737967b0b02611ef8384324b |
File details
Details for the file agentchaos_core-0.2.0-py3-none-any.whl.
File metadata
- Download URL: agentchaos_core-0.2.0-py3-none-any.whl
- Upload date:
- Size: 67.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0e0a0e8a5c19646db997df093f19e4d5963c7bf1da9ec5fcc886cebca0c5cc32 |
| MD5 | 86917f2c4f9abe78cddf94359eb11229 |
| BLAKE2b-256 | 2ee3cec1d59b0d512d594a04e2a1039b757b8a8308dfb9188b2c00004de7cb41 |