Skip to main content

Chaos/fault injection framework for MCP (Model Context Protocol) projects

Project description

mcp-chaos-monkey (Python)

Chaos/fault injection framework for MCP (Model Context Protocol) projects. Inject faults at the transport level so resilience wrappers (circuit breakers, retries, timeouts) exercise naturally.

Zero runtime dependencies. Optional deps for httpx, Redis, and Starlette interceptors.

Installation

pip install mcp-chaos-monkey

# With optional interceptors
pip install mcp-chaos-monkey[httpx]       # HTTP interceptor
pip install mcp-chaos-monkey[redis]       # Redis interceptor
pip install mcp-chaos-monkey[starlette]   # Admin endpoints
pip install mcp-chaos-monkey[all]         # Everything

Requires Python >= 3.11.

Quick Start

import os
os.environ["CHAOS_ENABLED"] = "true"

from mcp_chaos_monkey import ChaosController, ErrorFault

# Inject a fault
controller = ChaosController.get_instance()
fault_id = controller.inject("weather-api", ErrorFault(status_code=503, message="Service unavailable"))

# Query the fault
fault = controller.get_fault("weather-api")
# → ErrorFault(status_code=503, message='Service unavailable')

# Clean up
controller.clear(fault_id)

With MCP Python SDK

import os
os.environ["CHAOS_ENABLED"] = "true"

import httpx
from mcp_chaos_monkey import ChaosController, ErrorFault, TimeoutFault
from mcp_chaos_monkey.interceptors import create_chaos_aware_client

# Create a chaos-aware httpx client for your MCP tool
client = create_chaos_aware_client("weather-api", httpx.AsyncClient())

# Inject a fault — all requests through this client now fail with 503
controller = ChaosController.get_instance()
controller.inject("weather-api", ErrorFault(status_code=503))

response = await client.get("https://api.weather.com/forecast")
# → Response [503]

# Inject a timeout — requests hang then raise ReadTimeout
controller.clear_all()
controller.inject("weather-api", TimeoutFault(hang_ms=10_000))

response = await client.get("https://api.weather.com/forecast")
# → raises httpx.ReadTimeout after 10s

Architecture

Faults are injected at the transport level, sitting between your application code and the real service. This means resilience wrappers (circuit breakers, retries, timeouts) exercise naturally without mocking:

┌─────────────────────────────────────────────────────┐
│  Application Code                                   │
│  (MCP client, agent, tool handler)                  │
├─────────────────────────────────────────────────────┤
│  Resilience Layer                                   │
│  (circuit breaker, retry, timeout)                  │
├─────────────────────────────────────────────────────┤
│  ★ mcp-chaos-monkey ★                              │
│  (fault injection at transport level)               │
├─────────────────────────────────────────────────────┤
│  Transport                                          │
│  (httpx, redis-py, ASGI middleware)                 │
├─────────────────────────────────────────────────────┤
│  External Service                                   │
│  (API, database, auth provider)                     │
└─────────────────────────────────────────────────────┘

The key insight: faults are injected below the resilience layer so that circuit breakers, retries, and timeouts trigger organically — exactly as they would in a real outage.

Production Safety

The framework refuses to run in production. Two hard guards must both pass:

  1. ENVIRONMENT / NODE_ENV must not be "production" — throws immediately
  2. CHAOS_ENABLED must be "true" — prevents accidental activation
from mcp_chaos_monkey import assert_chaos_allowed
assert_chaos_allowed()  # raises ChaosNotAllowedError if unsafe

Fault Types

Type What It Simulates Dataclass Key Fields
latency Slow upstream LatencyFault delay_ms
error HTTP 500/502/503 ErrorFault status_code, message
timeout Upstream hangs TimeoutFault hang_ms
malformed Corrupted JSON MalformedFault corrupt_response
connection-refused Service unreachable ConnectionRefusedFault
connection-drop Mid-response failure ConnectionDropFault after_bytes
rate-limit 429 Too Many Requests RateLimitFault retry_after_seconds
schema-mismatch Missing JSON fields SchemaMismatchFault missing_fields

All fault types support an optional probability field (0.0–1.0) for probabilistic injection.

API Reference

ChaosController

from mcp_chaos_monkey import ChaosController, ErrorFault, LatencyFault

controller = ChaosController.get_instance()  # singleton

# Inject a fault — returns a unique fault ID
fault_id = controller.inject(
    "weather-api",
    LatencyFault(delay_ms=2000),
    duration_ms=30_000,  # optional: auto-expire after 30s
)

# Query faults
fault = controller.get_fault("weather-api")    # FaultConfig | None
active = controller.get_active_faults()         # list[ActiveFaultInfo]

# Clean up
controller.clear(fault_id)  # remove one
controller.clear_all()       # remove all

# Test isolation
ChaosController.reset()  # destroy singleton + clear all faults

HTTP Interceptor (httpx)

Wraps an httpx.AsyncClient or httpx.Client with a chaos-aware transport:

from mcp_chaos_monkey.interceptors import create_chaos_aware_client

# Wrap an existing async client
client = create_chaos_aware_client("weather-api", httpx.AsyncClient(base_url="..."))

# Or create a new one
client = create_chaos_aware_client("weather-api", base_url="https://api.weather.com")

# Sync variant
from mcp_chaos_monkey.interceptors.http_interceptor import create_chaos_aware_client_sync
sync_client = create_chaos_aware_client_sync("weather-api", httpx.Client())

Redis Interceptor

Monkey-patches Redis client commands (get, set, delete, hget, hset, expire, ttl, keys, mget) to inject faults. Works with both sync redis.Redis and async redis.asyncio.Redis:

import redis
from mcp_chaos_monkey import ChaosController, ErrorFault, LatencyFault, ConnectionRefusedFault
from mcp_chaos_monkey.interceptors import wrap_redis_with_chaos

# Your existing Redis client
redis_client = redis.Redis(host="localhost", port=6379)

# Wrap it with chaos — returns an unwrap function
unwrap = wrap_redis_with_chaos(redis_client, "redis")

# No fault active — commands work normally
redis_client.set("key", "value")
redis_client.get("key")  # → b"value"

# Inject a connection error — all Redis commands now raise ConnectionError
controller = ChaosController.get_instance()
controller.inject("redis", ErrorFault(status_code=500, message="connection lost"))

try:
    redis_client.get("key")
except ConnectionError as e:
    print(e)  # → "Chaos Redis error: connection lost"

# Simulate Redis being completely unreachable
controller.clear_all()
controller.inject("redis", ConnectionRefusedFault())

try:
    redis_client.set("key", "value")
except ConnectionError as e:
    print(e)  # → "Redis connection refused (chaos)"

# Add latency to simulate a slow Redis
controller.clear_all()
controller.inject("redis", LatencyFault(delay_ms=500))

redis_client.get("key")  # → takes 500ms, then returns b"value"

# Restore original Redis behavior
unwrap()
redis_client.get("key")  # → works instantly, no chaos

Async Redis works identically:

import redis.asyncio as aioredis
from mcp_chaos_monkey.interceptors import wrap_redis_with_chaos

async_redis = aioredis.Redis(host="localhost", port=6379)
unwrap = wrap_redis_with_chaos(async_redis, "redis")

# Faults apply to async commands too
await async_redis.get("key")  # → chaos fault if active

unwrap()

Auth Interceptor (ASGI Middleware)

ASGI middleware that intercepts authentication faults. Use it with Starlette, FastAPI, or any ASGI framework to simulate auth provider failures:

from starlette.applications import Starlette
from starlette.responses import JSONResponse
from starlette.routing import Route
from mcp_chaos_monkey import ChaosController, ErrorFault, LatencyFault, TimeoutFault
from mcp_chaos_monkey.interceptors import ChaosAuthMiddleware, create_chaos_auth_middleware

# Your ASGI app
async def homepage(request):
    return JSONResponse({"status": "ok"})

app = Starlette(routes=[Route("/", homepage)])

# Add chaos auth middleware — targets "oauth-token" by default
app = ChaosAuthMiddleware(app, target="oauth-token")

# No fault active — requests pass through to your app
# GET / → 200, {"status": "ok"}

# Simulate auth provider returning 401
controller = ChaosController.get_instance()
controller.inject("oauth-token", ErrorFault(status_code=401, message="Token expired"))
# GET / → 401, {"error": "token_invalid", "message": "Token expired"}

# Simulate auth provider being slow (adds 2s delay before forwarding)
controller.clear_all()
controller.inject("oauth-token", LatencyFault(delay_ms=2000))
# GET / → 200 after 2s delay, {"status": "ok"}

# Simulate auth provider hanging (request never completes)
controller.clear_all()
controller.inject("oauth-token", TimeoutFault(hang_ms=30_000))
# GET / → hangs for 30s, client times out

Custom auth target — use the factory to target different auth mechanisms:

# Protect API key validation
app = create_chaos_auth_middleware("api-key")(app)

# Now faults on "api-key" target affect this middleware
controller.inject("api-key", ErrorFault(status_code=403, message="Invalid API key"))

Multiple auth targets with FastAPI:

from fastapi import FastAPI
from mcp_chaos_monkey.interceptors import ChaosAuthMiddleware

app = FastAPI()

# Stack middleware for different auth targets
app.add_middleware(ChaosAuthMiddleware, target="oauth-token")
app.add_middleware(ChaosAuthMiddleware, target="api-key")

@app.get("/")
async def root():
    return {"status": "ok"}

Admin Endpoints

Framework-agnostic handler functions:

from mcp_chaos_monkey import handle_status, handle_inject, handle_clear, handle_clear_all

# Use with any web framework
result = handle_status()             # {"faults": [...]}
result = handle_inject(body)         # {"fault_id": "..."}
result = handle_clear(body)          # {"cleared": "fault-id"}
result = handle_clear_all()          # {"cleared": "all"}

Optional Starlette integration:

from mcp_chaos_monkey.admin_endpoint import create_starlette_routes
from starlette.applications import Starlette
from starlette.routing import Route

routes = create_starlette_routes()
app = Starlette(routes=routes)

This registers 4 routes:

Method Path Description
GET /chaos/status List all active faults
POST /chaos/inject Inject a new fault
POST /chaos/clear Clear a specific fault
POST /chaos/clear-all Clear all faults

CLI Usage

# Set required env vars
export CHAOS_ENABLED=true

# Inject faults
mcp-chaos inject <target> <type> [--status N] [--delay N] [--duration N]
mcp-chaos clear <fault_id>
mcp-chaos clear-all
mcp-chaos status

Examples:

# Inject a 503 error on weather-api
mcp-chaos inject weather-api error --status 503

# Inject latency with auto-expiry after 30s
mcp-chaos inject redis latency --delay 2000 --duration 30

# Check what's active
mcp-chaos status

# Clean up
mcp-chaos clear-all

Scenario Builder

Define reproducible chaos scenarios for structured testing:

from mcp_chaos_monkey import define_scenario, ChaosScenario, ScenarioFault
from mcp_chaos_monkey import TimeoutFault, ErrorFault, ConnectionRefusedFault

api_timeout: ChaosScenario = define_scenario(
    name="weather-api-timeout",
    description="Weather API hangs for 10s — timeout fires, retries exhaust",
    faults=[
        ScenarioFault(target="weather-api", config=TimeoutFault(hang_ms=10_000)),
    ],
    expected_behavior="Each attempt times out at 5s. Retries fire and all fail.",
    assertions=[
        "Each retry attempt hits timeout at 5s",
        "Total latency bounded by retry * timeout",
    ],
)

# Cascading failure — multiple faults
cascading = define_scenario(
    name="cascading-redis-then-api",
    description="Redis drops, then weather API starts failing",
    faults=[
        ScenarioFault(target="redis", config=ConnectionRefusedFault()),
        ScenarioFault(target="weather-api", config=ErrorFault(status_code=503)),
    ],
    expected_behavior="System degrades to flight-only without cache.",
    assertions=["System remains responsive", "Flight queries still work end-to-end"],
)

# Temporary fault with auto-expiry
recovery = define_scenario(
    name="api-recovery",
    description="API fails then recovers — circuit should close",
    faults=[
        ScenarioFault(
            target="weather-api",
            config=ErrorFault(status_code=503),
            duration_ms=35_000,
        ),
    ],
    expected_behavior="Circuit opens, fault expires, circuit closes.",
    assertions=["Circuit transitions: CLOSED -> OPEN -> HALF_OPEN -> CLOSED"],
)

Pluggable Logger

By default, the library logs via stdlib logging. To use your own logger (e.g., structlog), call configure_chaos_logger once at startup:

import structlog
from mcp_chaos_monkey import configure_chaos_logger

# structlog loggers satisfy ChaosLogger protocol directly
configure_chaos_logger(lambda name: structlog.get_logger(name))

The ChaosLogger protocol requires debug, info, warning, error methods matching Python's logging.Logger.

Cross-Language Compatibility

The Python and TypeScript implementations share the same fault types, controller API, admin endpoints, CLI commands, and scenario interface. Fault configs can be exchanged as JSON between the two — parse_fault_config() accepts both camelCase (TypeScript) and snake_case (Python) keys:

from mcp_chaos_monkey import parse_fault_config

# Both work:
fault = parse_fault_config({"type": "latency", "delayMs": 2000})     # camelCase
fault = parse_fault_config({"type": "latency", "delay_ms": 2000})    # snake_case

Original Contribution

mcp-chaos-monkey is an original open-source contribution to the MCP ecosystem. No existing tool addressed systematic chaos testing for MCP-based systems, so this framework was created as a standalone, reusable package that any MCP project can adopt to validate its resilience stack under realistic failure conditions.

For a production-grade example of using mcp-chaos-monkey, see reliable-mcp — an MCP Reliability Playbook that uses this framework for chaos testing with 21 automated fault injection scenarios.

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mcp_chaos_monkey-0.1.0.tar.gz (24.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

mcp_chaos_monkey-0.1.0-py3-none-any.whl (20.9 kB view details)

Uploaded Python 3

File details

Details for the file mcp_chaos_monkey-0.1.0.tar.gz.

File metadata

  • Download URL: mcp_chaos_monkey-0.1.0.tar.gz
  • Upload date:
  • Size: 24.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for mcp_chaos_monkey-0.1.0.tar.gz
Algorithm Hash digest
SHA256 ac144d3fdb55ef85683a20045da1757ef3357fb1526f5f95bb4254413a435bf7
MD5 423db44cd0ad9dba06d5a0fb75b8741a
BLAKE2b-256 22716d08f4303107ed008322a0d75f08cff6bc4a7d052f4cd22253339edc374e

See more details on using hashes here.

File details

Details for the file mcp_chaos_monkey-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for mcp_chaos_monkey-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e3ea2f981215bb53e2fbd4535929bda28ef732b84d078d9405305400a211dca4
MD5 78ef9f33941b53fdf2ce051aebfbc5e8
BLAKE2b-256 2da0671443721e3271d11fea4e6dc58abe681bc9e7e97679ca83e71f499b0e5a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page