Open-source reliability testing for tool-using AI agents: record, profile cost, detect regressions before production.

AgentChaos

Open-source reliability testing for tool-using AI agents. Record agent runs, profile cost, and catch operational regressions in CI — before they hit production.

0.1.0 — v0 wedge: detect a cost/latency/tool-call regression in CI. See PROGRESS.md and the roadmap.

The problem

Tool-using agents fail in production in ways quality-focused eval tools don't catch:

- Cost silently doubles after a prompt or model change.
- A flaky tool returns 503 and the agent retries 12 times, racking up $4 in model calls.
- Retrieval returns 12 chunks instead of 5, blowing the planner's context budget.
- The "fallback when a tool fails" path was never tested, because tools never fail in dev.

Eval tools (LangSmith, Braintrust, DeepEval) score answer quality. Load tools (k6, Locust) don't understand the agent loop. AgentChaos covers the operational gap: deliberate, reproducible regression detection for tool-using agents, runnable in CI.

Quickstart

pip install agentchaos-reliability
agentchaos init my-agent-tests
cd my-agent-tests
agentchaos doctor scenarios/example.yaml      # validate + ping your agent
agentchaos run scenarios/example.yaml          # record a baseline trace
# ... change your agent (prompt, model, retrieval) ...
agentchaos run scenarios/example.yaml --baseline runs/baseline.jsonl

A scenario is plain YAML: a conversation, the tools you expect the agent to call, and operational budgets.

id: refund-agent
agent:
  type: http
  endpoint: http://127.0.0.1:8080/chat
conversation:
  - user: "I want to return my order."
  - user: "My order number is 12345."
expect:
  must_call_tools: [get_order, create_return_label]
  final_response_contains: [return label]
budgets:
  max_cost_usd: 0.05
  max_cost_regression_pct: 20          # fail if cost grows >20% vs baseline
  max_input_token_regression_pct: 30
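
To make the `expect` block concrete, here is a minimal Python sketch of how such expectations could be evaluated against a recorded run. It is not AgentChaos's implementation: the function name `check_expectations`, its arguments, and the matching rules (every listed tool must appear at least once, phrases match case-insensitively as substrings) are assumptions for illustration.

```python
def check_expectations(expect, tool_calls, final_response):
    """Return a list of expectation failures; an empty list means pass.

    Hypothetical helper: the matching rules below are assumptions,
    not the documented behavior of AgentChaos.
    """
    failures = []

    # Assumption: each expected tool must be called at least once, in any order.
    called = set(tool_calls)
    for tool in expect.get("must_call_tools", []):
        if tool not in called:
            failures.append(f"must_call_tools: {tool} was never called")

    # Assumption: phrases are matched as case-insensitive substrings.
    text = final_response.lower()
    for phrase in expect.get("final_response_contains", []):
        if phrase.lower() not in text:
            failures.append(f"final_response_contains: missing {phrase!r}")

    return failures


# Mirrors the expect block of the refund-agent scenario.
expect = {
    "must_call_tools": ["get_order", "create_return_label"],
    "final_response_contains": ["return label"],
}

ok = check_expectations(
    expect,
    ["get_order", "create_return_label"],
    "Here is your return label.",
)
bad = check_expectations(expect, ["get_order"], "Sorry, something went wrong.")

print(ok)
print(bad)
```

The first call returns no failures; the second reports both the missing tool call and the missing phrase.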

What you get

Re-run after a change and AgentChaos diffs the new trace against the baseline, gates on your budgets, and reports what likely moved cost. If a budget or expectation is violated, it exits non-zero so CI fails:

AgentChaos — refund-agent

Verdict: FAIL

Why:
  - [regression_budget] max_cost_regression_pct: cost regressed +68.2% (limit +20.0%)
  - [regression_budget] max_input_token_regression_pct: input_tokens regressed +90.8% (limit +30.0%)

Metrics:
  Metric                      Baseline       Current         Δ  Status
  --------------------------------------------------------------------
  Cost                         $0.0004       $0.0006    +68.2%  FAIL
  Input tokens                    1850          3530    +90.8%  FAIL
  Output tokens                    150           150     +0.0%  PASS
  LLM calls                          3             3     +0.0%  PASS
  Tool calls                         2             2     +0.0%  PASS

Tool sequence:
  baseline: get_order -> create_return_label
  current:  get_order -> create_return_label

Possible contributors:
  - input_tokens grew +90.8% (+1680 tokens)
  - per-model cost grew on gpt-4o-mini: $0.000368 → $0.000619 (+68.2%)
  - metadata.rag_chunks changed on model_call planner: 5 → 12  [correlates]

Exit code: 2

Cause detection is observation-first: the report leads with measured deltas, then lists possible contributors, each tagged with a confidence level (observed, correlates, or computed). Only a model swap claims a hard dollar contribution.
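
The percentages in the report follow from plain relative change. The sketch below recomputes them from the figures shown above and applies the two regression budgets from the scenario. It is an illustration, not the tool's code: `regression_pct`, `gate`, and the metric keys `cost_usd` and `input_tokens` are hypothetical names.

```python
def regression_pct(baseline, current):
    """Percent change of current relative to baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero to compute a percentage")
    return (current - baseline) / baseline * 100.0


def gate(budgets, baseline, current):
    """Return one violation message per exceeded regression budget."""
    # Hypothetical mapping from budget key to the metric it constrains.
    checks = {
        "max_cost_regression_pct": "cost_usd",
        "max_input_token_regression_pct": "input_tokens",
    }
    violations = []
    for budget_key, metric in checks.items():
        limit = budgets.get(budget_key)
        if limit is None:
            continue  # budget not configured, nothing to gate on
        delta = regression_pct(baseline[metric], current[metric])
        if delta > limit:
            violations.append(
                f"{budget_key}: {metric} regressed {delta:+.1f}% (limit {limit:+.1f}%)"
            )
    return violations


# Figures taken from the report above.
baseline = {"cost_usd": 0.000368, "input_tokens": 1850}
current = {"cost_usd": 0.000619, "input_tokens": 3530}
budgets = {"max_cost_regression_pct": 20, "max_input_token_regression_pct": 30}

violations = gate(budgets, baseline, current)
for line in violations:
    print(line)
```

Both budgets are exceeded here: cost grows by about 68.2% against a 20% limit, and input tokens grow by about 90.8% against a 30% limit, which matches the two entries under "Why" in the report.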

Try it now — the refund-agent demo reproduces the report above in under a minute.

How it works

scenario.yaml → run → trace.jsonl → metrics → compare(baseline) → terminal verdict + exit code

Your agent is any HTTP endpoint that takes a message and returns a response. AgentChaos auto-detects one of three fidelity tiers from your response shape — full per-call cost attribution, aggregate usage, or message-only — and tells you which on the first run. No SDK, no instrumentation, no framework lock-in.

See the HTTP integration contract for the exact request/response shapes and fidelity tiers.
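
For orientation, here is a minimal sketch of an agent exposed as an HTTP endpoint, using only the Python standard library. The JSON field names `message` and `response` are placeholders chosen for this example, not the shapes AgentChaos requires; the HTTP integration contract is the authority on those. The path and port simply mirror the `endpoint` in the scenario above.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_chat(payload):
    """Toy agent logic. The field names are placeholders, not the real contract."""
    message = payload.get("message", "")
    return {"response": f"You said: {message}"}


class ChatHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/chat":
            self.send_error(404)
            return

        # Read and parse the JSON request body.
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_error(400, "invalid JSON")
            return
        if not isinstance(payload, dict):
            self.send_error(400, "expected a JSON object")
            return

        # Serialize the agent's reply as JSON.
        body = json.dumps(handle_chat(payload)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8080):
    """Start the server; blocks until interrupted."""
    HTTPServer((host, port), ChatHandler).serve_forever()
```

Calling `serve()` starts a server at http://127.0.0.1:8080/chat. A real agent would replace `handle_chat` with its own model and tool calls, and shape the response according to the contract.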

Command	Purpose
`agentchaos init`	Scaffold a scenarios/ + runs/ folder
`agentchaos doctor`	Validate a scenario and probe the endpoint
`agentchaos run`	Execute a scenario, record a trace, optionally diff a baseline
`agentchaos compare`	Diff two existing traces — pure analysis, no agent calls

Exit codes: 0 pass · 1 config error · 2 budget/expectation violation · 4 endpoint unreachable.
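
If you drive the gate from a Python script rather than a shell step, the exit codes above can be mapped to readable messages. This is a sketch under assumptions: `describe_exit` and `run_gate` are hypothetical helpers, and the command line is the one shown in the Quickstart.

```python
import subprocess
import sys

# Exit codes as listed above.
EXIT_MEANINGS = {
    0: "pass",
    1: "config error",
    2: "budget/expectation violation",
    4: "endpoint unreachable",
}


def describe_exit(code):
    """Translate an exit code into a short description."""
    return EXIT_MEANINGS.get(code, f"unrecognized exit code {code}")


def run_gate(scenario, baseline):
    """Run a scenario against a baseline and return the CLI's exit code."""
    result = subprocess.run(
        ["agentchaos", "run", scenario, "--baseline", baseline]
    )
    print(f"agentchaos: {describe_exit(result.returncode)}", file=sys.stderr)
    return result.returncode
```

In a CI script, `sys.exit(run_gate("scenarios/example.yaml", "runs/baseline.jsonl"))` would propagate the verdict, so the job fails on any non-zero code.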

What it is not

- Not a generic LLM eval platform. Use DeepEval, Braintrust, or LangSmith for answer-quality scoring.
- Not a prompt management tool. Use Langfuse or git.
- Not an observability dashboard. Use Phoenix or Datadog.
- Not a load testing framework. Use k6 or Locust.

AgentChaos does one thing: catch operational failures and cost regressions in tool-using agents, in CI.

Roadmap

- v0 (now): cost/regression profiling with baseline diff. ✅
- v1: chaos injection (tool failure/latency), loop & retry-storm detection, record/replay, OTel emission, GitHub Action.
- v2: framework adapters (LangGraph, OpenAI Agents SDK), MCP proxy + contract tests.

License

Apache 2.0.

Project details

This version

0.1.0

Jun 2, 2026

Download files

Source Distribution

agentchaos_reliability-0.1.0.tar.gz (109.5 kB)

Uploaded Jun 2, 2026 Source

Built Distribution

agentchaos_reliability-0.1.0-py3-none-any.whl (33.0 kB)

Uploaded Jun 2, 2026 Python 3

File details

Details for the file agentchaos_reliability-0.1.0.tar.gz.

File metadata

Download URL: agentchaos_reliability-0.1.0.tar.gz
Upload date: Jun 2, 2026
Size: 109.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.2 (uv publish, macOS)

File hashes

Hashes for agentchaos_reliability-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`de529113b11a80c6aa470e96631ccd6b218e9a0a2efe5463b70edb9e94833b41`
MD5	`dab36e47b37e8db80d07214ea9c87633`
BLAKE2b-256	`c1125c64eca2a4519a196c25c3225cf63398f052eb4fdd58ba92a1a368c337d6`

File details

Details for the file agentchaos_reliability-0.1.0-py3-none-any.whl.

File metadata

Download URL: agentchaos_reliability-0.1.0-py3-none-any.whl
Upload date: Jun 2, 2026
Size: 33.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.2 (uv publish, macOS)

File hashes

Hashes for agentchaos_reliability-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2fdc3680a7e83e56ae8eb39690905b0d2bfced3a52bd8bd88f310b4a76e2d0bd`
MD5	`14371883e84a1b0acc62b4046cb7941b`
BLAKE2b-256	`8361c3e1ebbee9cb6c3a7986afebca702dc42e79de59b02fd280118a2e7e21d8`
