Agent reliability simulator — chaos engineering for AI agents
cascade
cascade models what actually happens when multi-step agent systems fail:
retries that help, retries that waste money, fallback paths that degrade
quality, and corrupt intermediate outputs that poison downstream steps.
At a Glance
- Monte Carlo simulation for multi-step agent pipelines
- Failure injection for hallucination, refusal, tool failure, latency, and context loss
- Strategy comparison across retry, fallback, checkpoint, parallel, human review, and adaptive control
- Cost, latency, and reliability tradeoff analysis in one run
- Report generation for engineering and decision-making, not just toy metrics
The Problem
Per-step errors compound in multi-step agent pipelines. If steps fail independently, end-to-end success is the per-step accuracy raised to the number of steps:
| Steps | Per-Step Accuracy | End-to-End Success |
|---|---|---|
| 5 | 95% | 77% |
| 10 | 95% | 60% |
| 10 | 85% | 20% |
| 20 | 90% | 12% |
| 50 | 95% | 8% |
An agent that is 95% accurate per step completes a 50-step task only about 8% of the time. Netflix built Chaos Monkey to test the resilience of distributed systems. cascade is the equivalent for AI agents.
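The table follows from one assumption: steps succeed independently, so end-to-end success is the per-step accuracy raised to the number of steps. A minimal sketch that reproduces the rows (the function name `end_to_end_success` is illustrative and not part of cascade):

```python
def end_to_end_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_accuracy ** steps


# (steps, per-step accuracy) pairs taken from the table above
rows = [(5, 0.95), (10, 0.95), (10, 0.85), (20, 0.90), (50, 0.95)]

for steps, accuracy in rows:
    rate = end_to_end_success(accuracy, steps)
    print(f"{steps:>2} steps at {accuracy:.0%} per step: {rate:.0%} end to end")
```

Real pipelines rarely have fully independent steps, which is why a simulation is useful beyond this closed-form estimate.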
The Solution
cascade is a Monte Carlo simulation framework that models multi-step AI agent pipelines, injects realistic failure modes, and measures end-to-end reliability under different resilience strategies.
What you get:
- Quantified reliability for any agent pipeline architecture
- Strategy comparison with cost modeling (retry, fallback, parallel, checkpoint, adaptive)
- Pareto frontier visualization: cost vs. reliability tradeoffs
- Cascading corruption modeling -- the hardest failure mode, where bad output propagates
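To make the cascading-corruption idea concrete, here is a small standalone Monte Carlo sketch. It does not use cascade and is not how the library is implemented; the function names and the simple model (a corrupt output carries over to the next step with a fixed probability) are assumptions chosen for illustration.

```python
import random


def simulate_run(n_steps, hallucination_rate, propagation, rng):
    """Return True if the final output of one simulated run is clean.

    Assumed model: a step may silently hallucinate. Once the running
    output is corrupt, each later step inherits the corruption with
    probability `propagation`; otherwise that step happens to recover.
    """
    corrupt = False
    for _ in range(n_steps):
        if corrupt:
            # Does the bad input survive into this step's output?
            corrupt = rng.random() < propagation
        if not corrupt:
            # A clean (or recovered) step can still hallucinate on its own.
            corrupt = rng.random() < hallucination_rate
    return not corrupt


def estimate_clean_rate(n_runs, n_steps, hallucination_rate, propagation, seed=42):
    """Fraction of simulated runs that end with a clean output."""
    rng = random.Random(seed)
    clean = sum(
        simulate_run(n_steps, hallucination_rate, propagation, rng)
        for _ in range(n_runs)
    )
    return clean / n_runs


if __name__ == "__main__":
    for propagation in (0.0, 0.5, 0.8):
        rate = estimate_clean_rate(10_000, 6, 0.05, propagation)
        print(f"propagation={propagation:.1f}: clean final output in {rate:.1%} of runs")
```

Raising `propagation` lowers the clean rate even though the per-step hallucination rate is unchanged, which is the effect the last bullet describes.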
Quick Start
pip install cascade-agent-sim
Minimal simulation:
from cascade import Pipeline, Step, Simulator, FailureConfig
from cascade import strategies
# Define your agent pipeline
pipeline = Pipeline(steps=[
    Step(name="research", model="sonnet", tools=["web_search", "read_file"]),
    Step(name="analyze", model="sonnet", tools=["python_exec"], depends_on=["research"]),
    Step(name="draft", model="sonnet", tools=["write_file"], depends_on=["analyze"]),
    Step(name="review", model="opus", tools=["read_file"], depends_on=["draft"]),
    Step(name="revise", model="sonnet", tools=["write_file"], depends_on=["review"]),
    Step(name="publish", model="haiku", tools=["api_call"], depends_on=["revise"]),
])
# Configure failure injection
failures = FailureConfig(
    hallucination_rate=0.05,
    refusal_rate=0.02,
    tool_failure_rate=0.03,
    context_overflow_at=100_000,
    cascade_propagation=0.8,
)
# Run 10,000 simulations
sim = Simulator(pipeline, failures, n_simulations=10_000, seed=42)
results = sim.run()
# Compare resilience strategies
from cascade import Comparator
comp = Comparator(pipeline, failures, n_simulations=10_000, seed=42)
comparison = comp.compare([
    strategies.naive(),
    strategies.retry(max_attempts=3),
    strategies.parallel(n=3, vote="majority"),
    strategies.checkpoint(interval=2),
    strategies.adaptive(escalation_threshold=2),
])
comparison.print_table()
comparison.recommend()
Output:
Strategy Comparison (10,000 simulations each):
+-----------------------+----------+-----------+----------+------------+
| Strategy | Success | Avg Cost | Avg Time | Failures |
+-----------------------+----------+-----------+----------+------------+
| Naive | 54.0% | $0.0318 | 6.1s | 4,599 |
| Retry(3) | 99.3% | $0.0451 | 8.5s | 73 |
| Parallel(3) | 84.8% | $0.1146 | 7.3s | 1,525 |
| Checkpoint(2) | 99.9% | $0.0453 | 8.6s | 8 |
| Adaptive | 99.3% | $0.0451 | 8.5s | 73 |
+-----------------------+----------+-----------+----------+------------+
Recommendation: Retry(3) (99.3% success at 1.4x baseline cost)
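The comparison table can be reduced to its cost-versus-reliability Pareto frontier by hand. The sketch below is a standalone illustration using the numbers printed above; `pareto_indices` is a hypothetical helper, not the library's own `pareto_frontier`, whose exact behavior is not documented here.

```python
def pareto_indices(costs, rates):
    """Indices of points that no other point dominates.

    Point j dominates point i when it is no more expensive and no less
    reliable, and strictly better on at least one of the two.
    """
    points = list(zip(costs, rates))
    frontier = []
    for i, (cost_i, rate_i) in enumerate(points):
        dominated = any(
            cost_j <= cost_i
            and rate_j >= rate_i
            and (cost_j < cost_i or rate_j > rate_i)
            for j, (cost_j, rate_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            frontier.append(i)
    return frontier


# Values copied from the strategy comparison output above
names = ["Naive", "Retry(3)", "Parallel(3)", "Checkpoint(2)", "Adaptive"]
costs = [0.0318, 0.0451, 0.1146, 0.0453, 0.0451]
rates = [0.540, 0.993, 0.848, 0.999, 0.993]

print([names[i] for i in pareto_indices(costs, rates)])
```

With these numbers only Parallel(3) is dominated: Retry(3) is both cheaper and more reliable. Retry(3) and Adaptive tie exactly, so neither dominates the other.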
Architecture
graph TD
A[Pipeline Definition] --> C[Simulation Engine]
B[Failure Injector] --> C
C --> D[Resilience Strategy Comparator]
D --> E[Report Generator]
subgraph "Failure Modes"
B1[Hallucination]
B2[Refusal]
B3[Tool Failure]
B4[Context Overflow]
B5[Cascading Corruption]
B6[Latency Spike]
end
subgraph "Strategies"
S1[Naive]
S2[Retry]
S3[Fallback]
S4[Parallel Redundancy]
S5[Checkpoint + Rollback]
S6[Human-in-the-Loop]
S7[Adaptive]
end
B1 & B2 & B3 & B4 & B5 & B6 --> B
S1 & S2 & S3 & S4 & S5 & S6 & S7 --> D
Failure Models
| Failure Mode | Description | Default Rate |
|---|---|---|
| Hallucination | Agent produces plausible but incorrect output (wrong tool args, fabricated data, incorrect reasoning, format errors) | 5% |
| Refusal | Safety filter blocks a legitimate action (false positive) | 2% |
| Tool Failure | External API returns an error, timeout, or rate limit | 3% |
| Context Overflow | Context window fills up, losing earlier information | At 128K tokens |
| Cascading Corruption | Hallucinated output propagates to downstream steps | 80% propagation |
| Latency Spike | Individual step takes 10x longer than expected | 1% |
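As a rough illustration of how the default rates combine, the sketch below draws each failure mode independently for one step and computes the chance that a step avoids all blocking failures. Everything here is an assumption made for the example: the dictionary, the function names, the independence of failure modes, and the choice to treat a latency spike as non-blocking. It is not cascade's internal model.

```python
import random

# Default rates copied from the table above
DEFAULT_RATES = {
    "hallucination": 0.05,
    "refusal": 0.02,
    "tool_failure": 0.03,
    "latency_spike": 0.01,
}

# Assumed: these modes fail the step; a latency spike only slows it down
BLOCKING_MODES = ("hallucination", "refusal", "tool_failure")


def sample_step_failures(rng, rates=DEFAULT_RATES):
    """Draw each failure mode independently for a single step."""
    return {mode for mode, rate in rates.items() if rng.random() < rate}


def clean_step_probability(rates=DEFAULT_RATES, blocking=BLOCKING_MODES):
    """Probability that a step triggers none of the blocking modes."""
    probability = 1.0
    for mode in blocking:
        probability *= 1.0 - rates[mode]
    return probability


if __name__ == "__main__":
    rng = random.Random(42)
    print("failures in one sampled step:", sample_step_failures(rng))
    per_step = clean_step_probability()
    print(f"clean step: {per_step:.4f}, six clean steps in a row: {per_step ** 6:.2%}")
```

Under these assumptions a six-step pipeline succeeds about 54% of the time, which is close to the Naive row in the Quick Start output, though the library's actual failure model may differ.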
Resilience Strategies
from cascade import strategies
strategies.naive() # No retry, fail fast
strategies.retry(max_attempts=3) # Simple retry
strategies.fallback(models=["sonnet", "haiku"]) # Try models in order
strategies.parallel(n=3, vote="majority") # Run N agents, majority vote
strategies.checkpoint(interval=5) # Checkpoint every N steps, rollback on failure
strategies.human_in_loop(at_steps=[5, 10]) # Human verification at key steps
strategies.adaptive(  # Escalate after repeated failures
    escalation_threshold=2,
    escalation_strategy="parallel",
)
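For intuition about why retry and parallel voting behave differently, the per-step success probability of each can be written in closed form. This is a back-of-the-envelope sketch with hypothetical function names, assuming independent attempts; it also assumes retry can detect a failed attempt, while majority voting needs more than half of the agents to be right.

```python
from math import comb


def retry_success(p: float, max_attempts: int) -> float:
    """At least one of `max_attempts` independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** max_attempts


def majority_vote_success(p: float, n: int) -> float:
    """More than half of `n` independent agents produce the right output."""
    needed = n // 2 + 1
    return sum(
        comb(n, k) * p**k * (1.0 - p) ** (n - k)
        for k in range(needed, n + 1)
    )


if __name__ == "__main__":
    p = 0.9
    print(f"single attempt:     {p:.3f}")
    print(f"retry, 3 attempts:  {retry_success(p, 3):.3f}")
    print(f"majority vote of 3: {majority_vote_success(p, 3):.3f}")
```

For p = 0.9, three attempts give 0.999 per step while a three-way majority vote gives 0.972. Retry looks stronger here only because the sketch assumes every failure is detectable; voting is the option that can catch a silent wrong answer.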
What It Helps You Answer
- How fast does reliability collapse as workflows get longer?
- Which strategy buys the most reliability per unit cost?
- Where do checkpoint intervals actually matter?
- How much damage does one bad intermediate result cause downstream?
- When is human review worth the latency?
CLI
# Simulate a single strategy (10,000 Monte Carlo runs)
cascade simulate pipeline.json --strategy retry --simulations 10000
# Compare strategies
cascade compare pipeline.json --strategies naive,retry,parallel,checkpoint,adaptive
# Export results
cascade compare pipeline.json -o results.json --pareto pareto.png --heatmap heatmap.png
API Reference
Core Classes
- Pipeline -- DAG of Steps defining the agent workflow
- Step -- Single agent action with model, tools, and dependencies
- FailureConfig -- Failure injection probabilities and parameters
- Simulator -- Monte Carlo simulation engine
- Comparator -- Multi-strategy comparison orchestrator
- StrategyComparison -- Results container with table, plot, and recommend methods
Key Functions
- strategies.naive() / retry() / fallback() / parallel() / checkpoint() / human_in_loop() / adaptive() -- Strategy factories
- build_report(result) -- Build structured report from SimulationResult
- format_report(report) -- Format report as human-readable text
- export_json(report, path) -- Export report to JSON
Statistical Utilities
- proportion_ci(successes, total) -- Wilson score CI for success rates
- mean_ci(values) -- t-distribution CI for means
- summarize(values) -- Distribution summary (mean, median, percentiles)
- pareto_frontier(costs, rates) -- Compute Pareto-optimal strategies
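For readers unfamiliar with the Wilson score interval, the sketch below shows the textbook formula applied to a simulated success count. `wilson_interval` is a hypothetical standalone function, not the library's `proportion_ci`; the fixed `z=1.96` (roughly a 95% interval) is an assumption, and the library may expose a different signature or confidence level.

```python
from math import sqrt


def wilson_interval(successes: int, total: int, z: float = 1.96):
    """Wilson score confidence interval for a binomial proportion."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # 73 failures in 10,000 runs, as in the Retry(3) row of the Quick Start output
    low, high = wilson_interval(9_927, 10_000)
    print(f"success rate 99.27%, interval {low:.2%} to {high:.2%}")
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly when the success rate is very close to 0% or 100%, which is common for strategies like retry and checkpoint.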
Examples
See the examples/ directory:
- research_pipeline.py -- 6-step research agent with full strategy comparison
- coding_pipeline.py -- 10-step coding agent demonstrating the compounding problem
- customer_support.py -- Diamond-shaped pipeline with parallel research paths
Demo
Run the offline walkthrough with:
uv run python examples/demo.py
For larger reliability studies and strategy comparisons, see examples/.
Development
git clone https://github.com/sushaan-k/cascade.git
cd cascade
pip install -e ".[dev]"
pytest -v
ruff check src/ tests/
mypy src/cascade/
Contributing
Contributions are welcome. Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Write tests for your changes
- Ensure all tests pass (pytest -v)
- Ensure code passes linting (ruff check .)
- Submit a pull request
License
MIT License. See LICENSE for details.