Skip to main content

Agent reliability simulator — chaos engineering for AI agents

Project description

cascade

CI Python 3.11+ License: MIT Code style: ruff

Agent reliability simulator -- chaos engineering for AI agents.

cascade models what actually happens when multi-step agent systems fail: retries that help, retries that waste money, fallback paths that degrade quality, and corrupt intermediate outputs that poison downstream steps.


At a Glance

  • Monte Carlo simulation for multi-step agent pipelines
  • Failure injection for hallucination, refusal, tool failure, latency, and context loss
  • Strategy comparison across retry, fallback, checkpoint, parallel, human review, and adaptive control
  • Cost, latency, and reliability tradeoff analysis in one run
  • Report generation for engineering and decision-making, not just toy metrics

The Problem

Accuracy compounds catastrophically in multi-step agent pipelines:

Steps Per-Step Accuracy End-to-End Success
5 95% 77%
10 95% 60%
10 85% 20%
20 90% 12%
50 95% 8%

A 95%-accurate agent on a 50-step task succeeds 8% of the time. Netflix built Chaos Monkey to test distributed systems resilience. cascade is the equivalent for AI agents.

The Solution

cascade is a Monte Carlo simulation framework that models multi-step AI agent pipelines, injects realistic failure modes, and measures end-to-end reliability under different resilience strategies.

What you get:

  • Quantified reliability for any agent pipeline architecture
  • Strategy comparison with cost modeling (retry, fallback, parallel, checkpoint, adaptive)
  • Pareto frontier visualization: cost vs. reliability tradeoffs
  • Cascading corruption modeling -- the hardest failure mode, where bad output propagates

Quick Start

pip install cascade-agent-sim

Minimal simulation:

from cascade import Pipeline, Step, Simulator, FailureConfig
from cascade import strategies

# Define your agent pipeline
pipeline = Pipeline(steps=[
    Step(name="research", model="sonnet", tools=["web_search", "read_file"]),
    Step(name="analyze", model="sonnet", tools=["python_exec"], depends_on=["research"]),
    Step(name="draft", model="sonnet", tools=["write_file"], depends_on=["analyze"]),
    Step(name="review", model="opus", tools=["read_file"], depends_on=["draft"]),
    Step(name="revise", model="sonnet", tools=["write_file"], depends_on=["review"]),
    Step(name="publish", model="haiku", tools=["api_call"], depends_on=["revise"]),
])

# Configure failure injection
failures = FailureConfig(
    hallucination_rate=0.05,
    refusal_rate=0.02,
    tool_failure_rate=0.03,
    context_overflow_at=100_000,
    cascade_propagation=0.8,
)

# Run 10,000 simulations
sim = Simulator(pipeline, failures, n_simulations=10_000, seed=42)
results = sim.run()

# Compare resilience strategies
from cascade import Comparator
comp = Comparator(pipeline, failures, n_simulations=10_000, seed=42)
comparison = comp.compare([
    strategies.naive(),
    strategies.retry(max_attempts=3),
    strategies.parallel(n=3, vote="majority"),
    strategies.checkpoint(interval=2),
    strategies.adaptive(escalation_threshold=2),
])

comparison.print_table()
comparison.recommend()

Output:

Strategy Comparison (10,000 simulations each):
+-----------------------+----------+-----------+----------+------------+
| Strategy              | Success  | Avg Cost  | Avg Time | Failures   |
+-----------------------+----------+-----------+----------+------------+
| Naive                 |  54.0%   |  $0.0318  |   6.1s   |      4,599 |
| Retry(3)              |  99.3%   |  $0.0451  |   8.5s   |         73 |
| Parallel(3)           |  84.8%   |  $0.1146  |   7.3s   |      1,525 |
| Checkpoint(2)         |  99.9%   |  $0.0453  |   8.6s   |          8 |
| Adaptive              |  99.3%   |  $0.0451  |   8.5s   |         73 |
+-----------------------+----------+-----------+----------+------------+

Recommendation: Retry(3) (99.3% success at 1.4x baseline cost)

Architecture

graph TD
    A[Pipeline Definition] --> C[Simulation Engine]
    B[Failure Injector] --> C
    C --> D[Resilience Strategy Comparator]
    D --> E[Report Generator]

    subgraph "Failure Modes"
        B1[Hallucination]
        B2[Refusal]
        B3[Tool Failure]
        B4[Context Overflow]
        B5[Cascading Corruption]
        B6[Latency Spike]
    end

    subgraph "Strategies"
        S1[Naive]
        S2[Retry]
        S3[Fallback]
        S4[Parallel Redundancy]
        S5[Checkpoint + Rollback]
        S6[Human-in-the-Loop]
        S7[Adaptive]
    end

    B1 & B2 & B3 & B4 & B5 & B6 --> B
    S1 & S2 & S3 & S4 & S5 & S6 & S7 --> D

Failure Models

Failure Mode Description Default Rate
Hallucination Agent produces plausible but incorrect output (wrong tool args, fabricated data, incorrect reasoning, format errors) 5%
Refusal Safety filter blocks a legitimate action (false positive) 2%
Tool Failure External API returns an error, timeout, or rate limit 3%
Context Overflow Context window fills up, losing earlier information At 128K tokens
Cascading Corruption Hallucinated output propagates to downstream steps 80% propagation
Latency Spike Individual step takes 10x longer than expected 1%

Resilience Strategies

from cascade import strategies

strategies.naive()                    # No retry, fail fast
strategies.retry(max_attempts=3)      # Simple retry
strategies.fallback(models=["sonnet", "haiku"])  # Try models in order
strategies.parallel(n=3, vote="majority")  # Run N agents, majority vote
strategies.checkpoint(interval=5)     # Checkpoint every N steps, rollback on failure
strategies.human_in_loop(at_steps=[5, 10])  # Human verification at key steps
strategies.adaptive(                  # Escalate after repeated failures
    escalation_threshold=2,
    escalation_strategy="parallel",
)

What It Helps You Answer

  • How fast does reliability collapse as workflows get longer?
  • Which strategy buys the most reliability per unit cost?
  • Where do checkpoint intervals actually matter?
  • How much damage does one bad intermediate result cause downstream?
  • When is human review worth the latency?

CLI

# Run a single simulation
cascade simulate pipeline.json --strategy retry --simulations 10000

# Compare strategies
cascade compare pipeline.json --strategies naive,retry,parallel,checkpoint,adaptive

# Export results
cascade compare pipeline.json -o results.json --pareto pareto.png --heatmap heatmap.png

API Reference

Core Classes

  • Pipeline -- DAG of Steps defining the agent workflow
  • Step -- Single agent action with model, tools, and dependencies
  • FailureConfig -- Failure injection probabilities and parameters
  • Simulator -- Monte Carlo simulation engine
  • Comparator -- Multi-strategy comparison orchestrator
  • StrategyComparison -- Results container with table, plot, and recommend methods

Key Functions

  • strategies.naive() / retry() / fallback() / parallel() / checkpoint() / human_in_loop() / adaptive() -- Strategy factories
  • build_report(result) -- Build structured report from SimulationResult
  • format_report(report) -- Format report as human-readable text
  • export_json(report, path) -- Export report to JSON

Statistical Utilities

  • proportion_ci(successes, total) -- Wilson score CI for success rates
  • mean_ci(values) -- t-distribution CI for means
  • summarize(values) -- Distribution summary (mean, median, percentiles)
  • pareto_frontier(costs, rates) -- Compute Pareto-optimal strategies

Examples

See the examples/ directory:

  • research_pipeline.py -- 6-step research agent with full strategy comparison
  • coding_pipeline.py -- 10-step coding agent demonstrating the compounding problem
  • customer_support.py -- Diamond-shaped pipeline with parallel research paths

Demo

Run the offline walkthrough with:

uv run python examples/demo.py

For larger reliability studies and strategy comparisons, see examples/.

Development

git clone https://github.com/sushaan-k/cascade.git
cd cascade
pip install -e ".[dev]"
pytest -v
ruff check src/ tests/
mypy src/cascade/

Contributing

Contributions are welcome. Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Write tests for your changes
  4. Ensure all tests pass (pytest -v)
  5. Ensure code passes linting (ruff check .)
  6. Submit a pull request

License

MIT License. See LICENSE for details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

floww-0.1.0.tar.gz (135.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

floww-0.1.0-py3-none-any.whl (29.6 kB view details)

Uploaded Python 3

File details

Details for the file floww-0.1.0.tar.gz.

File metadata

  • Download URL: floww-0.1.0.tar.gz
  • Upload date:
  • Size: 135.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for floww-0.1.0.tar.gz
Algorithm Hash digest
SHA256 63d16e1404725650d72b3b7104603aa552bbbb75160a3c3bfb86960722adde51
MD5 e62a7578a8f9c6896609cd616620aa35
BLAKE2b-256 a26fcb812453869825b340609bd06770edfeb142d5e19acd2c8dca9348adfc06

See more details on using hashes here.

File details

Details for the file floww-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: floww-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 29.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for floww-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 cbd1f4989eb22812595eee5c7f3de6653436ccb7d22808519932e6d152b45407
MD5 cc4823f83a89cf4a72c8513cc4799172
BLAKE2b-256 7fade70eacb1cd79883895465e4f8babda3a836559eaf8f9d6c4760f39033bb5

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page