Chaos Engineering for AI Agents

agent-chaos

Chaos engineering for AI agents.

The Joker: "Introduce a little anarchy. Upset the established order, and everything becomes chaos. I'm an agent of chaos. Oh, and you know the thing about chaos? It's fair!"


Your agent works in demos. It passes evals. Then it hits production: the LLM rate-limits, the tool API returns garbage, the stream cuts mid-response. The agent fails silently, confidently returns wrong answers, or loops forever.

agent-chaos breaks your agent on purpose—before production does.


Why This Exists

AI agents have failure boundaries that didn't exist before:

Boundary          What Can Break
LLM provider      Rate limits, timeouts, server errors, stream interruptions
Tool execution    API failures, malformed responses, lies
Context/memory    Corrupted retrieval, poisoned history, token overflow

Traditional chaos engineering tools (Chaos Monkey, Gremlin, Litmus) operate at the infrastructure layer—network partitions, pod failures, CPU stress. They don't understand agent-specific failure modes. They can't corrupt a tool result or cut an LLM stream after 10 chunks.
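To make "cut an LLM stream after 10 chunks" concrete, here is a minimal, self-contained sketch of the idea. It does not use agent-chaos and is not how the library implements it; `StreamCut`, `cut_after`, and `collect` are hypothetical names invented for this illustration.

```python
from typing import Iterable, Iterator


class StreamCut(Exception):
    """Hypothetical error raised when the simulated stream is interrupted."""


def cut_after(chunks: Iterable[str], after_chunks: int) -> Iterator[str]:
    """Yield at most `after_chunks` chunks, then fail like a dropped connection."""
    for index, chunk in enumerate(chunks):
        if index >= after_chunks:
            raise StreamCut(f"stream cut after {after_chunks} chunks")
        yield chunk


def collect(stream: Iterator[str]) -> tuple[str, bool]:
    """Consume a stream and return (text received, whether it completed)."""
    received = []
    try:
        for chunk in stream:
            received.append(chunk)
    except StreamCut:
        # A resilient consumer keeps track of the fact that the text is partial
        return "".join(received), False
    return "".join(received), True
```

A consumer that ignores the second return value would treat a truncated response as a complete one, which is the kind of silent failure this fault is meant to surface.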

Evaluation tools (Galileo, DeepEval, LangSmith) tell you if your agent worked correctly. They judge past runs. They can't answer: "What happens when the weather API lies?"

agent-chaos injects failures. Eval tools judge outcomes. Use both.

                     ┌─────────────────┐
                     │  agent-chaos    │
                     │  (inject chaos) │
                     └────────┬────────┘
                              │
                              ▼
┌──────────────┐       ┌─────────────┐       ┌──────────────┐
│   CI / Test  │──────▶│  Your Agent │──────▶│  Eval Tools  │
│   Pipeline   │       │             │       │  (judge it)  │
└──────────────┘       └─────────────┘       └──────────────┘

What It Does

Inject chaos at every agent boundary:

from agent_chaos import (
    chaos_context,
    llm_rate_limit,
    llm_stream_cut,
    tool_error,
    tool_mutate,
)

def corrupt_weather(tool_name: str, result: str) -> str:
    # Return plausible lies
    return result.replace("22°C", "-50°C")

with chaos_context(
    name="resilience-test",
    chaos=[
        llm_rate_limit().after_calls(2),
        llm_stream_cut(after_chunks=10).with_probability(0.3),
        tool_error("Service unavailable").for_tool("get_weather").on_call(1),
        tool_mutate(corrupt_weather),
    ],
) as ctx:
    # my_agent is a placeholder for your own agent object
    response = my_agent.run("What's the weather in Tokyo?")

    # Confirm that at least one fault was actually injected during the run
    assert ctx.metrics.chaos_injected > 0
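Conceptually, a mutation fault such as `corrupt_weather` sits between the tool and the agent and rewrites the tool's result before the agent sees it. The sketch below illustrates that idea with a plain decorator. It is an assumption-laden stand-in, not the library's implementation: `with_mutation` and the toy `get_weather` tool are hypothetical.

```python
from functools import wraps
from typing import Callable

# A mutator receives the tool name and its result, and returns the altered result
Mutator = Callable[[str, str], str]


def with_mutation(mutator: Mutator) -> Callable[[Callable[..., str]], Callable[..., str]]:
    """Hypothetical decorator that passes every tool result through `mutator`."""
    def decorate(tool: Callable[..., str]) -> Callable[..., str]:
        @wraps(tool)
        def wrapped(*args, **kwargs) -> str:
            result = tool(*args, **kwargs)
            return mutator(tool.__name__, result)
        return wrapped
    return decorate


def corrupt_weather(tool_name: str, result: str) -> str:
    # Return a plausible-looking but wrong reading
    return result.replace("22°C", "-50°C")


@with_mutation(corrupt_weather)
def get_weather(city: str) -> str:
    # Toy tool standing in for a real weather API call
    return f"{city}: 22°C, clear"
```

The corrupted result is still well-formed, so nothing raises. Whether the agent notices is exactly what the test is probing.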

Then gate your CI:

agent-chaos run scenarios/ --artifacts-dir ./runs
# Exit code 0 = all scenarios passed
# Exit code 1 = failures detected
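If your pipeline is driven from Python rather than a shell step, the same gate can be sketched with the standard library's `subprocess` module. This is a hedged illustration: `gate_exit_code` is a hypothetical helper, and it relies only on the exit-code convention stated above (0 means all scenarios passed). Any other exit code, or a missing executable, is treated as a failure here by assumption.

```python
import subprocess
import sys
from typing import Sequence


def gate_exit_code(command: Sequence[str]) -> int:
    """Run `command` and collapse its result to 0 (pass) or 1 (fail)."""
    try:
        completed = subprocess.run(command, check=False)
    except FileNotFoundError:
        # Assumption: a missing CLI should fail the gate rather than pass silently
        print(f"command not found: {command[0]}", file=sys.stderr)
        return 1
    return 0 if completed.returncode == 0 else 1
```

A CI script could then end with `sys.exit(gate_exit_code(["agent-chaos", "run", "scenarios/", "--artifacts-dir", "./runs"]))`, reusing the command line shown above.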

The Two Questions

Question                                            Tool
"Did my agent give the right answer?"               Eval tools (Galileo, DeepEval, LangSmith)
"Did my agent survive when dependencies failed?"    agent-chaos

Evals test correctness. Chaos tests resilience. Production needs both.


What About Edge-Case Inputs?

"What if the user asks something unexpected?"

This is a fair question, but it's not chaos engineering—it's evaluation. A weird user query isn't a fault. It's just input. The user isn't failing; they're being a user.

For testing agent behavior on edge-case inputs: use eval tools with golden datasets. That's what they're built for.

agent-chaos is for failures at external dependencies—the LLM provider, tool APIs, memory systems. Things that break independently of what the user asked.

The exception: multi-agent systems. When Agent A's output becomes Agent B's input, corrupted handoffs are faults at a boundary. That's chaos territory.
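A corrupted handoff can be illustrated with two toy agents and a fault applied to the message passed between them. Everything in this sketch is hypothetical (`researcher`, `writer`, `drop_first_half`, `pipeline`, and the `FACT: ... | QUESTION: ...` handoff format); it shows the boundary, not the agent-chaos API.

```python
from typing import Callable, Optional

Fault = Callable[[str], str]


def researcher(question: str) -> str:
    """Agent A: produces a structured handoff for the next agent."""
    return f"FACT: Tokyo is 22°C | QUESTION: {question}"


def writer(handoff: str) -> str:
    """Agent B: validates the handoff before building an answer from it."""
    if not handoff.startswith("FACT: "):
        return "I could not verify the upstream result."
    fact = handoff.split(" | ")[0].removeprefix("FACT: ")
    return f"Answer: {fact}"


def drop_first_half(handoff: str) -> str:
    """Fault: simulate a truncated handoff by discarding its first half."""
    return handoff[len(handoff) // 2:]


def pipeline(question: str, fault: Optional[Fault] = None) -> str:
    """Run A then B, optionally corrupting the message at the boundary."""
    handoff = researcher(question)
    if fault is not None:
        handoff = fault(handoff)
    return writer(handoff)
```

Here Agent B checks the handoff format, so the corrupted run degrades to an explicit refusal. Remove that check and the same fault produces a confident answer built from a fragment.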


Who This Is For

Teams shipping agents to production. Not demos. Not prototypes.

If you've been burned by:

  • Silent failures from flaky tool APIs
  • Agents that loop forever on rate limits
  • Confident wrong answers from corrupted context
  • "Works on my machine" syndrome in agent behavior

This is for you.
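As one example of the behavior a rate-limit fault is meant to verify, here is a minimal sketch of a bounded retry with exponential backoff, the usual alternative to looping forever. `RateLimited` and `call_with_retries` are hypothetical names; a real provider SDK raises its own error type, which you would catch instead.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for a provider's rate-limit error."""


def call_with_retries(
    call: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call`, retrying on RateLimited at most `max_attempts` times in total."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts:
                # Give up loudly instead of retrying forever
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so that a test can record the delays without actually waiting.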


Status

Under active development. The Anthropic provider is supported; OpenAI and Gemini support is planned.
