Chaos Engineering for AI Agents

agent-chaos

Chaos engineering for AI agents.

The Joker: "Introduce a little anarchy. Upset the established order, and everything becomes chaos. I'm an agent of chaos. Oh, and you know the thing about chaos? It's fair!"


Your agent works in demos. It passes evals. Then it hits production: the LLM rate-limits, the tool API returns garbage, the stream cuts mid-response. The agent fails silently, confidently returns wrong answers, or loops forever.

agent-chaos breaks your agent on purpose—before production does.


Why This Exists

AI agents have failure boundaries that didn't exist before:

Boundary          What Can Break
LLM provider      Rate limits, timeouts, server errors, stream interruptions
Tool execution    API failures, malformed responses, lies
Context/memory    Corrupted retrieval, poisoned history, token overflow

Traditional chaos engineering tools (Chaos Monkey, Gremlin, Litmus) operate at the infrastructure layer—network partitions, pod failures, CPU stress. They don't understand agent-specific failure modes. They can't corrupt a tool result or cut an LLM stream after 10 chunks.
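
To make "cut an LLM stream after 10 chunks" concrete, here is a minimal, self-contained sketch of that fault as a plain Python generator wrapper. It does not use agent-chaos and is not how the library implements it; `StreamCutError`, `cut_stream_after`, and `collect` are hypothetical names invented for illustration.

```python
from typing import Iterable, Iterator


class StreamCutError(ConnectionError):
    """Hypothetical error raised when the simulated stream is cut."""


def cut_stream_after(chunks: Iterable[str], after_chunks: int) -> Iterator[str]:
    """Yield at most `after_chunks` chunks, then fail like a dropped connection."""
    for index, chunk in enumerate(chunks):
        if index >= after_chunks:
            raise StreamCutError(f"stream cut after {after_chunks} chunks")
        yield chunk


def collect(stream: Iterable[str]) -> tuple[str, bool]:
    """Consume a stream; return the text received and whether it completed."""
    received: list[str] = []
    try:
        for chunk in stream:
            received.append(chunk)
    except StreamCutError:
        # The caller now knows the text is partial instead of trusting it.
        return "".join(received), False
    return "".join(received), True


# A fake 25-chunk LLM response, cut after the 10th chunk.
fake_llm_chunks = [f"tok{i} " for i in range(25)]
text, completed = collect(cut_stream_after(fake_llm_chunks, after_chunks=10))
print(completed, len(text.split()))
```

An infrastructure-level tool can drop the connection, but it cannot choose the chunk boundary. A wrapper at the stream level can, which is what makes the partial-response case reproducible in a test.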

Evaluation tools (Galileo, DeepEval, LangSmith) tell you if your agent worked correctly. They judge past runs. They can't answer: "What happens when the weather API lies?"

agent-chaos injects failures. Eval tools judge outcomes. Use both.

                     ┌─────────────────┐
                     │  agent-chaos    │
                     │  (inject chaos) │
                     └────────┬────────┘
                              │
                              ▼
┌──────────────┐       ┌─────────────┐       ┌──────────────┐
│   CI / Test  │──────▶│  Your Agent │──────▶│  Eval Tools  │
│   Pipeline   │       │             │       │  (judge it)  │
└──────────────┘       └─────────────┘       └──────────────┘

What It Does

Inject chaos at every agent boundary:

from agent_chaos import (
    chaos_context,
    llm_rate_limit,
    llm_stream_cut,
    tool_error,
    tool_mutate,
)

def corrupt_weather(tool_name: str, result: str) -> str:
    # Return plausible lies
    return result.replace("22°C", "-50°C")

with chaos_context(
    name="resilience-test",
    chaos=[
        llm_rate_limit().after_calls(2),
        llm_stream_cut(after_chunks=10).with_probability(0.3),
        tool_error("Service unavailable").for_tool("get_weather").on_call(1),
        tool_mutate(corrupt_weather),
    ],
) as ctx:
    response = my_agent.run("What's the weather in Tokyo?")

    # Confirm that faults were actually injected during the run.
    # Whether the agent handled them still has to be checked on `response`.
    assert ctx.metrics.chaos_injected > 0

Then gate your CI:

agent-chaos run scenarios/ --artifacts-dir ./runs
# Exit code 0 = all scenarios passed
# Exit code 1 = failures detected
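
If your pipeline steps are driven from Python rather than a shell, the same gate can be sketched as a thin wrapper around the command above. This is an assumption-laden sketch: `gate` is a hypothetical helper, it relies only on the exit-code convention stated above, and it treats a missing executable as a failure so that a misconfigured job does not pass silently. The demo uses stand-in commands so it runs without agent-chaos installed.

```python
import subprocess
import sys
from typing import Sequence


def gate(command: Sequence[str]) -> int:
    """Return the exit code a CI step should use: 0 to pass, 1 to fail."""
    try:
        result = subprocess.run(command, check=False)
    except FileNotFoundError:
        # Fail closed: a missing tool must not look like a passing run.
        print(f"command not found: {command[0]}", file=sys.stderr)
        return 1
    return 0 if result.returncode == 0 else 1


# The command shown above; in CI you would call sys.exit(gate(CHAOS_COMMAND)).
CHAOS_COMMAND = ["agent-chaos", "run", "scenarios/", "--artifacts-dir", "./runs"]

# Stand-in commands that exit with 0 and 1, used here only for demonstration.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]
print(gate(passing), gate(failing))
```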

The Two Questions

Question                                            Tool
"Did my agent give the right answer?"               Eval tools (Galileo, DeepEval, LangSmith)
"Did my agent survive when dependencies failed?"    agent-chaos

Evals test correctness. Chaos tests resilience. Production needs both.
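
The difference between the two questions can be illustrated with a toy example. The sketch below is self-contained and does not use agent-chaos; `RateLimitError`, `call_with_retries`, and `make_flaky_llm` are hypothetical names. It shows one way an agent might survive a rate limit: retry a bounded number of times, then return an explicitly degraded result instead of looping forever.

```python
import time
from typing import Callable


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's rate-limit error."""


def call_with_retries(
    call_llm: Callable[[], str],
    max_attempts: int = 3,
    base_delay: float = 0.0,
) -> dict:
    """Call the LLM with a bounded number of retries, then degrade explicitly."""
    for attempt in range(1, max_attempts + 1):
        try:
            return {"answer": call_llm(), "degraded": False, "attempts": attempt}
        except RateLimitError:
            if attempt < max_attempts:
                # Exponential backoff; base_delay=0.0 keeps the demo instant.
                time.sleep(base_delay * 2 ** (attempt - 1))
    return {"answer": None, "degraded": True, "attempts": max_attempts}


def make_flaky_llm(failures: int) -> Callable[[], str]:
    """Build a fake LLM call that is rate-limited `failures` times, then succeeds."""
    remaining = {"count": failures}

    def call() -> str:
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise RateLimitError("429 Too Many Requests")
        return "Sunny, 22°C"

    return call


recovered = call_with_retries(make_flaky_llm(failures=2))
gave_up = call_with_retries(make_flaky_llm(failures=10))
print(recovered["attempts"], gave_up["degraded"])
```

A resilience check asserts that the call terminates and that `degraded` is set when the provider never recovers. A correctness check asserts that `answer` is right when it does. The two assertions can fail independently, which is why both kinds of testing are needed.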


What About Edge-Case Inputs?

"What if the user asks something unexpected?"

This is a fair question, but it's not chaos engineering—it's evaluation. A weird user query isn't a fault. It's just input. The user isn't failing; they're being a user.

To test agent behavior on edge-case inputs, use eval tools with golden datasets. That's what they're built for.

agent-chaos is for failures at external dependencies—the LLM provider, tool APIs, memory systems. Things that break independently of what the user asked.

The exception: multi-agent systems. When Agent A's output becomes Agent B's input, corrupted handoffs are faults at a boundary. That's chaos territory.
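
A corrupted handoff can be sketched without any framework. In the toy pipeline below, every name (`agent_a`, `agent_b`, `truncate_handoff`, `run_pipeline`) is hypothetical, and the JSON message format is an assumption made for illustration. The fault is injected at the boundary between the two agents, not in the user's input.

```python
import json
from typing import Callable, Optional


def agent_a(city: str) -> str:
    """Hypothetical upstream agent: emits a JSON handoff message."""
    return json.dumps({"city": city, "temperature_c": 22})


def truncate_handoff(message: str) -> str:
    """Fault: cut the message in half, as if the transfer was interrupted."""
    return message[: len(message) // 2]


def agent_b(message: str) -> str:
    """Hypothetical downstream agent: validates the handoff before using it."""
    try:
        payload = json.loads(message)
        city, temp = payload["city"], payload["temperature_c"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return "handoff rejected: upstream message was malformed"
    return f"It is {temp}°C in {city}."


def run_pipeline(city: str, fault: Optional[Callable[[str], str]] = None) -> str:
    """Run A then B, optionally corrupting the message between them."""
    message = agent_a(city)
    if fault is not None:
        message = fault(message)  # inject the fault at the A -> B boundary
    return agent_b(message)


print(run_pipeline("Tokyo"))
print(run_pipeline("Tokyo", fault=truncate_handoff))
```

The user's query is identical in both runs. Only the handoff differs, which is what makes this a fault rather than an edge-case input.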


Who This Is For

Teams shipping agents to production. Not demos. Not prototypes.

If you've been burned by:

  • Silent failures from flaky tool APIs
  • Agents that loop forever on rate limits
  • Confident wrong answers from corrupted context
  • "Works on my machine" syndrome in agent behavior

This is for you.


Status

Under active development. The Anthropic provider is supported; OpenAI and Gemini support is planned.
