
Chaos Engineering & Failure Diagnosis for AI Agents


AgentChaos

Chaos testing and failure diagnosis for AI agents.

AgentChaos 0.2.0 is a Python toolkit for running agent tests repeatedly, injecting realistic failures, collecting trace-like spans, detecting common failure modes, ingesting framework-shaped traces, and exporting reliability reports.

The v0.1 track is intentionally small: pytest integration, two injectors, two detectors, three metrics, JSON reports, and a CLI summary command. v0.2 is focused on framework-neutral adapter boundaries, runtime ingestion prototypes, trace-based semantic detectors, and release hygiene.

License: Apache 2.0

Status

Implemented today:

  • ChaosTracer for agent, tool, and chat spans
  • ChaosRunner for repeated callable execution
  • execute_chaos_test() as the framework-neutral execution service
  • Injectors: ToolTimeout, ArgSchemaMutation
  • Detectors: LoopDetector, ArgSchemaViolationDetector, ToolInvocationMismatchDetector
  • Metrics: step_success_rate_at_k, run_variance, recovery_rate
  • JSON report exporter
  • Pytest plugin: @chaos, --chaos, --chaos-report, chaos_tracer
  • CLI: agentchaos summarize <report.json>
  • No-API-key local demo
  • v0.2 development: minimal LangGraph-like adapter prototype
  • v0.2 development: LangGraph runtime stream/astream_events ingestion adapter with parent reconstruction
  • v0.2 development: OpenAI Agents-like event model skeleton, without OpenAI SDK imports
  • v0.2 development: CrewAI-like event model skeleton, without CrewAI SDK imports
  • v0.2 development: MCP-like event model skeleton, without MCP SDK imports
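To make the ChaosRunner and ToolTimeout entries above more concrete, here is a minimal, self-contained sketch of the underlying idea. It wraps a tool so that it sometimes raises a simulated timeout, then runs an agent callable several times and records which runs succeeded. TimeoutInjector, SimulatedToolTimeout, run_repeatedly, search_flights, and toy_agent are hypothetical names, and treating p as a per-call timeout probability is an assumption. This is not AgentChaos's implementation or API.

```python
import random
from dataclasses import dataclass, field
from typing import Any, Callable


class SimulatedToolTimeout(TimeoutError):
    """Raised by the hypothetical injector instead of returning a tool result."""


@dataclass
class TimeoutInjector:
    """Hypothetical stand-in for an injector like ToolTimeout(p=...).

    Assumption: p is the per-call probability of simulating a timeout.
    """

    p: float
    rng: random.Random = field(default_factory=random.Random)

    def wrap(self, tool: Callable[..., Any]) -> Callable[..., Any]:
        def wrapped(*args: Any, **kwargs: Any) -> Any:
            # random() is in [0, 1), so p=0.0 never fires and p=1.0 always fires.
            if self.rng.random() < self.p:
                raise SimulatedToolTimeout(f"injected timeout in {tool.__name__}")
            return tool(*args, **kwargs)

        return wrapped


def run_repeatedly(agent_fn: Callable[[], Any], runs: int) -> list[bool]:
    """Call agent_fn `runs` times; record True for each run that raised nothing."""
    outcomes: list[bool] = []
    for _ in range(runs):
        try:
            agent_fn()
        except Exception:
            outcomes.append(False)
        else:
            outcomes.append(True)
    return outcomes


def search_flights(origin: str, dest: str) -> list[str]:
    # Toy tool used only for this sketch.
    return [f"{origin}->{dest} 09:00"]


injector = TimeoutInjector(p=0.2, rng=random.Random(0))  # seeded for repeatability
flaky_search = injector.wrap(search_flights)


def toy_agent() -> list[str]:
    # A toy "agent" with a single retry: it recovers if the second call succeeds.
    try:
        return flaky_search("SFO", "NRT")
    except SimulatedToolTimeout:
        return flaky_search("SFO", "NRT")


outcomes = run_repeatedly(toy_agent, runs=10)
print(f"{sum(outcomes)}/{len(outcomes)} runs succeeded")
```

Seeding the random generator keeps a failing run reproducible, which matters when the same chaos test has to be rerun in CI.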

Not implemented yet:

  • HTML report
  • CrewAI, OpenAI Agents, or MCP production runtime adapters
  • Production-ready framework adapters beyond the current LangGraph runtime ingestion and skeleton prototypes
  • Benchmark integrations such as tau-bench
  • Production sampling or hosted dashboard

v0.2 Roadmap

  • Keep adapter prototypes framework-neutral by mapping runtime-like events into TraceSpan without importing framework SDKs.
  • Expand trace-based semantic detectors around tool-use reliability while keeping detectors dependent only on internal spans.
  • Harden release hygiene with public preflight checks, package verification commands, and stable JSON/CLI behavior.
  • Defer production runtime adapters until the adapter boundary and skeleton tests are stable.
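The adapter boundary described above can be sketched as a pure function from plain event dicts to spans, with no framework imports. In the sketch below, TraceSpan is a hypothetical minimal span (AgentChaos's real TraceSpan may have different fields), and the start/end event shape is an assumption loosely modeled on streamed runtime events. It also illustrates one simple form of parent reconstruction: when an event does not name its parent, the innermost span that is still open is used.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class TraceSpan:
    """Hypothetical minimal span; AgentChaos's real TraceSpan may differ."""

    span_id: str
    name: str
    kind: str  # e.g. "agent", "tool", or "chat"
    parent_id: Optional[str] = None
    attributes: dict[str, Any] = field(default_factory=dict)
    ended: bool = False


def events_to_spans(events: list[dict[str, Any]]) -> list[TraceSpan]:
    """Map start/end event dicts to spans without importing a framework SDK.

    Assumed event shape (hypothetical): {"event": "start" or "end", "run_id": str,
    "name": str, "kind": str, "parent_run_id": optional str, "data": optional dict}.
    When a start event has no parent_run_id, the parent is reconstructed as the
    innermost span that is still open at that point in the stream.
    """
    spans: dict[str, TraceSpan] = {}
    open_stack: list[str] = []  # run_ids of spans that have started but not ended
    for event in events:
        run_id = event["run_id"]
        if event["event"] == "start":
            parent = event.get("parent_run_id") or (open_stack[-1] if open_stack else None)
            spans[run_id] = TraceSpan(
                span_id=run_id,
                name=event["name"],
                kind=event["kind"],
                parent_id=parent,
                attributes=dict(event.get("data", {})),
            )
            open_stack.append(run_id)
        elif event["event"] == "end" and run_id in spans:
            span = spans[run_id]
            span.attributes.update(event.get("data", {}))
            span.ended = True
            if run_id in open_stack:
                open_stack.remove(run_id)
    return list(spans.values())


# A tool call nested inside an agent run; neither start event names its parent.
events = [
    {"event": "start", "run_id": "agent-1", "name": "flight-agent", "kind": "agent"},
    {"event": "start", "run_id": "tool-1", "name": "search_flights", "kind": "tool",
     "data": {"args": {"origin": "SFO", "dest": "NRT"}}},
    {"event": "end", "run_id": "tool-1", "data": {"output": ["SFO->NRT 09:00"]}},
    {"event": "end", "run_id": "agent-1"},
]
for span in events_to_spans(events):
    print(span.name, span.kind, "parent:", span.parent_id)
```

Because detectors would only ever see TraceSpan objects, a new runtime needs just another mapping function like this one, and the detectors stay unchanged.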

Quickstart

Clone the repo and install it locally:

git clone https://github.com/jeffery0929/agentchaos.git
cd agentchaos
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

Run the no-API-key demo:

pytest examples/basic --chaos --chaos-report chaos_reports/basic.json -q

Summarize the report:

agentchaos summarize chaos_reports/basic.json

Expected result:

AgentChaos report: pytest_suite
tests: 1
total runs: 3
successful runs: 3
failed runs: 0
matched detections: 3

First Test

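from agent_chaos import chaos
from agent_chaos.injectors import ToolTimeout

# my_agent is a placeholder for your own agent under test; import or build it here.


@chaos(injectors=[ToolTimeout(p=0.2)], runs=10)
def test_agent_handles_tool_timeouts(chaos_tracer):
    # Record the agent call inside an agent span on the chaos_tracer fixture.
    with chaos_tracer.invoke_agent("flight-agent"):
        result = my_agent.run("Book a flight from SFO to NRT")

    assert result.status == "success"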

Run it with:

pytest --chaos --chaos-report agentchaos-report.json
agentchaos summarize agentchaos-report.json

@chaos is lazy-loaded from the package root, so a plain import agent_chaos does not pull in pytest. Pytest is only required when you use the pytest plugin.
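One common way to get this kind of lazy export is a module-level __getattr__ (PEP 562), which Python calls only when normal attribute lookup on a module fails. The sketch below demonstrates the pattern on a dynamically built module, with the standard-library fractions module standing in for pytest. make_lazy_module and lazy_loaded are hypothetical names, and we have not confirmed that AgentChaos uses exactly this mechanism.

```python
import importlib
import types
from typing import Any


def make_lazy_module(name: str, lazy_exports: dict[str, tuple[str, str]]) -> types.ModuleType:
    """Build a module whose listed attributes are imported only on first access.

    lazy_exports maps an exported name to (module_to_import, attribute_name).
    Relies on the module-level __getattr__ hook from PEP 562 (Python 3.7+).
    """
    module = types.ModuleType(name)
    loaded: list[str] = []  # records which exports actually triggered an import

    def __getattr__(attr: str) -> Any:
        if attr not in lazy_exports:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}")
        source_module, source_attr = lazy_exports[attr]
        value = getattr(importlib.import_module(source_module), source_attr)
        setattr(module, attr, value)  # cache it so later lookups skip __getattr__
        loaded.append(attr)
        return value

    module.__getattr__ = __getattr__  # type: ignore[attr-defined]
    module.lazy_loaded = loaded  # type: ignore[attr-defined]
    return module


# "Fraction" stands in for a heavy export such as a pytest-dependent decorator.
pkg = make_lazy_module("demo_pkg", {"Fraction": ("fractions", "Fraction")})
print(pkg.lazy_loaded)  # prints []
half = pkg.Fraction(1, 2)
print(pkg.lazy_loaded)  # prints ['Fraction']
```

In a real package the same __getattr__ function would simply live at the top level of __init__.py.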

Report Contents

The JSON report includes:

  • total, successful, and failed run counts
  • pass rate
  • step_success_rate_at_k
  • run_variance
  • recovery_rate
  • per-run detector results
  • optional span payloads with --chaos-include-spans
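The exact metric formulas are not spelled out here. As a hedged illustration, one plausible reading of pass rate, run_variance, and recovery_rate over per-run outcomes is sketched below. RunResult and all three definitions are assumptions rather than AgentChaos's implementation, and step_success_rate_at_k is left out because its definition is not given.

```python
from dataclasses import dataclass
from statistics import pvariance
from typing import Optional, Sequence


@dataclass
class RunResult:
    """Hypothetical per-run record."""

    succeeded: bool
    faults_injected: int  # how many injected faults fired during this run


def pass_rate(runs: Sequence[RunResult]) -> float:
    # Fraction of runs that succeeded.
    return sum(r.succeeded for r in runs) / len(runs)


def run_variance(runs: Sequence[RunResult]) -> float:
    # Assumed definition: population variance of the 0/1 success indicator.
    return pvariance([1.0 if r.succeeded else 0.0 for r in runs])


def recovery_rate(runs: Sequence[RunResult]) -> Optional[float]:
    # Assumed definition: share of runs that still succeeded despite at least
    # one injected fault; None when no run saw a fault.
    faulted = [r for r in runs if r.faults_injected > 0]
    if not faulted:
        return None
    return sum(r.succeeded for r in faulted) / len(faulted)


runs = [RunResult(True, 0), RunResult(True, 1), RunResult(False, 2), RunResult(True, 0)]
print(pass_rate(runs))  # 0.75
print(run_variance(runs))  # 0.1875
print(recovery_rate(runs))  # 0.5
```

Under these definitions, run_variance equals p(1 - p) for pass rate p, so it peaks at 0.25 when an agent succeeds exactly half the time.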

Why This Exists

Production agent failures are often not clean assertion failures. They show up as loops, bad tool arguments, fabricated observations, retry storms, premature stops, and task drift. AgentChaos focuses on a narrow v0.1 gap:

fault injection + trace-backed failure classification + CI-friendly reports
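Loops and bad tool arguments correspond to the LoopDetector and ArgSchemaViolationDetector listed under Status. The sketch below shows one simple way such trace-backed checks can work on a flattened list of tool calls. ToolCall, detect_loop, arg_schema_violations, the back-to-back repeat threshold, and the type-only schema are hypothetical simplifications, not the library's actual rules.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical flattened view of a tool span: tool name plus keyword arguments."""

    tool: str
    args: tuple[tuple[str, Any], ...] = ()


def detect_loop(calls: list[ToolCall], max_repeats: int = 3) -> bool:
    """Return True if an identical tool call occurs max_repeats times in a row."""
    if not calls:
        return False
    if max_repeats <= 1:
        return True
    streak = 1
    for prev, cur in zip(calls, calls[1:]):
        streak = streak + 1 if cur == prev else 1
        if streak >= max_repeats:
            return True
    return False


def arg_schema_violations(call: ToolCall, schema: dict[str, type]) -> list[str]:
    """List missing, unexpected, or wrongly typed arguments for one tool call."""
    args = dict(call.args)
    problems = [f"missing argument {name!r}" for name in schema if name not in args]
    problems += [f"unexpected argument {name!r}" for name in args if name not in schema]
    for name, expected in schema.items():
        if name in args and not isinstance(args[name], expected):
            problems.append(
                f"{name!r} should be {expected.__name__}, got {type(args[name]).__name__}"
            )
    return problems


search = ToolCall("search_flights", (("origin", "SFO"), ("dest", "NRT")))
print(detect_loop([search, search, search]))  # True
print(arg_schema_violations(ToolCall("search_flights", (("origin", 42),)),
                            {"origin": str, "dest": str}))
```

Both checks read only the recorded calls, which is what lets them classify a failure after the fact instead of relying on a single end-of-test assertion.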


Current v0.1 Scope

In scope:

  • pytest-first local workflow
  • deterministic local demo
  • JSON report as the stable output
  • CLI summary for report inspection
  • small, testable core modules

Out of scope for v0.1:

  • hosted UI
  • SaaS dashboard
  • HTML report unless the core stabilizes first
  • framework-specific adapters
  • public leaderboard

Development Checks

pytest -q
ruff check agent_chaos tests examples
ruff format --check agent_chaos tests examples
mypy agent_chaos tests examples
pytest examples/basic --chaos --chaos-report chaos_reports/basic.json -q
agentchaos summarize chaos_reports/basic.json

Optional Paid OpenAI Dogfood

After setting a small API budget and adding OPENAI_API_KEY to a local env file that is ignored by version control, run the manual paid smoke test:

python examples/openai_paid_dogfood/run_demo.py
agentchaos summarize chaos_reports/openai-paid-dogfood.json

The default model is gpt-5.4-nano to keep the first paid run cheap.

License

Apache 2.0. See LICENSE.


Download files


Source Distribution

agentchaos_core-0.2.0.tar.gz (145.0 kB)

Uploaded Source

Built Distribution


agentchaos_core-0.2.0-py3-none-any.whl (67.6 kB)

Uploaded Python 3

File details

Details for the file agentchaos_core-0.2.0.tar.gz.

File metadata

  • Download URL: agentchaos_core-0.2.0.tar.gz
  • Size: 145.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for agentchaos_core-0.2.0.tar.gz
Algorithm    Hash digest
SHA256       cfb8b718038f8c592b4be3b12f88c5018472d76603d3595233cb7b77460ac9d3
MD5          ce6920727ed98858c31821307e64c47e
BLAKE2b-256  a01d18431622f4c439c601cd7672200fed32e0f6737967b0b02611ef8384324b


File details

Details for the file agentchaos_core-0.2.0-py3-none-any.whl.

File hashes

Hashes for agentchaos_core-0.2.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       0e0a0e8a5c19646db997df093f19e4d5963c7bf1da9ec5fcc886cebca0c5cc32
MD5          86917f2c4f9abe78cddf94359eb11229
BLAKE2b-256  2ee3cec1d59b0d512d594a04e2a1039b757b8a8308dfb9188b2c00004de7cb41

