Invoke mock AZ, DB, and MSK failures. Internally uses AWS FIS and AWS SSM.
Failure Invoker MCP Server
A comprehensive chaos engineering tool that enables failure injection experiments across multiple AWS services using AWS Fault Injection Simulator (FIS) and AWS Systems Manager (SSM).
Features
- Multi-Service Support: Target EC2, RDS, ECS, Lambda, ASG, ELB, EKS, and MSK
- Tag-Based Targeting: Flexible resource selection using AWS tags
- Configurable Duration: Control experiment duration with human-readable formats
- Auto-Recovery: Built-in recovery mechanisms for most services
- Comprehensive Logging: Detailed experiment tracking and status monitoring
Supported AWS Services
| Service | Action | Recovery |
|---|---|---|
| EC2 | Stop instances | Auto-restart after duration |
| RDS | Reboot/Failover | Automatic |
| ECS | Stop tasks | Service auto-recovery |
| Lambda | Error injection | Duration-based |
| ASG | Capacity errors | Duration-based |
| ELB | Unavailable state | Duration-based |
| EKS | Terminate nodes | Auto Scaling recovery |
| MSK | Restart brokers | Automatic |
Installation
MCP Configuration
{
  "mcpServers": {
    "failure-invoker": {
      "command": "uvx",
      "args": ["failure-invoker-mcp@latest"],
      "env": {
        "AWS_REGION": "us-west-2",
        "AWS_ACCESS_KEY_ID": "your-access-key",
        "AWS_SECRET_ACCESS_KEY": "your-secret-key"
      }
    }
  }
}
Strands Agent SDK
from mcp import StdioServerParameters
from mcp.client.stdio import stdio_client
from strands import Agent
from strands.tools.mcp import MCPClient

# Launch the MCP server over stdio via uvx
failure_invoker_client = MCPClient(
    lambda: stdio_client(
        StdioServerParameters(
            command="uvx",
            args=["failure-invoker-mcp@latest"],
            env={
                "AWS_REGION": "us-west-2",
                "AWS_ACCESS_KEY_ID": "your-access-key",
                "AWS_SECRET_ACCESS_KEY": "your-secret-key",
            },
        )
    )
)
failure_invoker_client.start()

# model and system_prompt must be defined by your application
agent = Agent(
    model=model,
    system_prompt=system_prompt,
    tools=failure_invoker_client.list_tools_sync(),
)
Available Tools
1. db_failure
Execute database failure experiments on RDS instances or Aurora clusters.
Parameters:
- db_identifier (required): RDS instance or cluster identifier
- failure_type (optional): "reboot" or "failover" (default: "reboot")
- region (optional): AWS region (uses AWS_REGION env var if not specified)
Examples:
# Reboot RDS instance
db_failure(db_identifier="my-database", failure_type="reboot")
# Failover Aurora cluster
db_failure(db_identifier="my-cluster", failure_type="failover", region="us-east-1")
2. az_failure
Execute availability zone failure experiments affecting all resources in a specific AZ.
Parameters:
- availability_zone (required): Target availability zone (e.g., "us-west-2a")
- region (optional): AWS region (uses AWS_REGION env var if not specified)
Examples:
# Simulate AZ failure
az_failure(availability_zone="us-west-2a")
# Target specific region
az_failure(availability_zone="eu-west-1b", region="eu-west-1")
3. msk_failure
Execute MSK (Managed Streaming for Kafka) cluster failure experiments.
Parameters:
- cluster_name (required): MSK cluster name
- region (optional): AWS region (uses AWS_REGION env var if not specified)
Examples:
# Restart MSK brokers
msk_failure(cluster_name="my-kafka-cluster")
# Target specific region
msk_failure(cluster_name="prod-kafka", region="us-east-1")
4. tag_based_failure
Execute failure experiments on all resources matching specified tags across multiple AWS services.
Parameters:
- tag_key (required): Tag key to search for
- tag_value (required): Tag value to match
- duration (optional): Duration of the failure (e.g., "60s", "10m", "2h"; default: "10m")
- region (optional): AWS region (uses AWS_REGION env var if not specified)
Examples:
# Target all resources with Environment=test tag
tag_based_failure(tag_key="Environment", tag_value="test", duration="5m")
# Target specific team's resources
tag_based_failure(tag_key="Team", tag_value="backend", duration="30s")
# Target EKS cluster nodes
tag_based_failure(
    tag_key="eks:cluster-name",
    tag_value="my-cluster",
    duration="2m"
)
# Target auto-scaling enabled resources
tag_based_failure(
    tag_key="k8s.io/cluster-autoscaler/enabled",
    tag_value="true",
    duration="1h"
)
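The duration strings used in these examples ("30s", "5m", "1h") have to reach AWS FIS as ISO 8601 durations such as "PT5M". The following is a minimal sketch of such a conversion. The function name `to_iso8601_duration` and its validation rules are assumptions made for illustration and are not taken from the server's source.

```python
import re

# Matches one "<number><unit>" pair; the units are the ones this README lists.
_UNIT_PATTERN = re.compile(r"(\d+)([hms])")


def to_iso8601_duration(text: str) -> str:
    """Convert a string such as "1h30m" to an ISO 8601 duration such as "PT1H30M"."""
    cleaned = text.strip().lower()
    parts = _UNIT_PATTERN.findall(cleaned)
    # Reject empty input and anything with leftover characters ("10 minutes").
    if not parts or "".join(n + u for n, u in parts) != cleaned:
        raise ValueError(f"unsupported duration: {text!r}")
    units = [unit for _, unit in parts]
    # Require hours, then minutes, then seconds, each at most once.
    if units != sorted(set(units), key="hms".index):
        raise ValueError(f"units must appear once, in h/m/s order: {text!r}")
    return "PT" + "".join(f"{int(n)}{unit.upper()}" for n, unit in parts)
```

For instance, `to_iso8601_duration("1h30m")` yields `"PT1H30M"`, while input such as `"10 minutes"` raises `ValueError` instead of being guessed at.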
5. get_experiment_status
Check the status of running or completed FIS experiments.
Parameters:
- experiment_id (optional): Specific experiment ID to check
- region (optional): AWS region (uses AWS_REGION env var if not specified)
Examples:
# Get all recent experiments
get_experiment_status()
# Check specific experiment
get_experiment_status(experiment_id="EXP123456789")
# Check experiments in specific region
get_experiment_status(region="eu-west-1")
Duration Format
The duration parameter accepts human-readable formats:
- "30s" - 30 seconds
- "5m" - 5 minutes
- "2h" - 2 hours
- "1h30m" - 1 hour 30 minutes
Resource Targeting Logic
Tag-Based Targeting
The tag_based_failure tool searches across all supported AWS services:
- EC2 Instances: Uses describe-instances with tag filters
- RDS: Queries all instances/clusters, then checks tags individually
- ECS: Searches services across all clusters for matching tags
- Lambda: Iterates through functions checking tags
- ASG: Examines Auto Scaling Group tags
- ELB: Checks Load Balancer tags
- EKS: Searches Node Groups across all clusters
- MSK: Not included in tag-based targeting (use msk_failure instead)
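The per-service differences above come partly from how each AWS API returns tags: EC2 supports server-side tag filters, RDS, ASG, and ELB return lists of `Key`/`Value` pairs, ECS uses lowercase `key`/`value`, and Lambda and EKS return plain dictionaries. A rough sketch of exact tag matching across those shapes might look like the following. The helper names are hypothetical and the server's actual matching code may differ.

```python
def ec2_tag_filters(tag_key, tag_value):
    """Build the Filters argument accepted by EC2 describe_instances."""
    return [{"Name": f"tag:{tag_key}", "Values": [tag_value]}]


def normalize_tags(tags):
    """Return tags as a plain {key: value} dict regardless of the API's shape."""
    if isinstance(tags, dict):  # Lambda and EKS style
        return dict(tags)
    normalized = {}
    for tag in tags or []:
        # RDS/ASG/ELB use "Key"/"Value"; ECS uses "key"/"value".
        key = tag.get("Key", tag.get("key"))
        if key is not None:
            normalized[key] = tag.get("Value", tag.get("value", ""))
    return normalized


def has_exact_tag(tags, tag_key, tag_value):
    """True only when the key exists and its value matches exactly (case-sensitive)."""
    return normalize_tags(tags).get(tag_key) == tag_value
```

Comparing values exactly, rather than by prefix or case-insensitively, keeps a tag such as `Environment=Test` from being treated as `Environment=test`.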
Failure Actions by Service
- EC2: Stop instances → Auto-restart after duration
- RDS Instances: Reboot → Automatic recovery
- RDS Clusters: Failover → Automatic recovery
- ECS: Stop tasks → Service maintains desired count
- Lambda: Inject errors → Duration-based
- ASG: Insufficient capacity errors → Duration-based
- ELB: Mark unavailable → Duration-based
- EKS: Terminate 100% of nodes → Auto Scaling recovery
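To make the EC2 entry concrete, here is a sketch of a FIS experiment template that stops tagged instances and restarts them after a set duration. It uses the documented FIS action `aws:ec2:stop-instances` and its `startInstancesAfterDuration` parameter. The function name, the target and action labels, and the idea that the server builds templates in exactly this way are assumptions.

```python
def build_ec2_stop_template(tag_key, tag_value, duration_iso, role_arn):
    """Return keyword arguments for the FIS create_experiment_template call."""
    return {
        "description": f"Stop EC2 instances tagged {tag_key}={tag_value}",
        "roleArn": role_arn,  # the FIS service role from Prerequisites
        "stopConditions": [{"source": "none"}],  # no CloudWatch alarm guard
        "targets": {
            "TaggedInstances": {  # hypothetical label
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "StopInstances": {  # hypothetical label
                "actionId": "aws:ec2:stop-instances",
                # FIS restarts the instances once this ISO 8601 duration elapses.
                "parameters": {"startInstancesAfterDuration": duration_iso},
                "targets": {"Instances": "TaggedInstances"},
            }
        },
    }
```

With boto3 installed, the dictionary could be passed as `boto3.client("fis").create_experiment_template(**template)`, and the returned template ID used to start an experiment.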
Prerequisites
- AWS Credentials: Configure via environment variables or AWS profiles
- IAM Permissions: Ensure the following permissions:
  - fis:* - For Fault Injection Simulator
  - ssm:* - For Systems Manager (MSK experiments)
  - ec2:*, rds:*, ecs:*, lambda:*, autoscaling:*, elasticloadbalancing:*, eks:*, kafka:* - For resource discovery and targeting
- FIS Service Role: Create an IAM role for FIS experiments with appropriate permissions
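For the FIS service role, the role's trust policy has to let the FIS service principal (`fis.amazonaws.com`) assume it. The sketch below only renders that trust policy as JSON. The function name is hypothetical, and the permissions policy attached to the role still has to be chosen to match the actions you run.

```python
import json


def fis_trust_policy_json() -> str:
    """Return an IAM trust policy allowing AWS FIS to assume the role."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "fis.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }
    return json.dumps(policy, indent=2)
```

The resulting document can be supplied as the assume-role policy when creating the role, for example through the IAM console or `aws iam create-role`.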
Error Handling
- Resource Not Found: Experiments skip missing resources
- Permission Denied: Clear error messages with required permissions
- Invalid Duration: Human-readable durations are converted automatically to the ISO 8601 "PT" format that AWS FIS expects (e.g., "5m" becomes "PT5M")
- Network Issues: Configurable timeouts and retries (300s read, 60s connect, 3 retries)
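The timeout and retry values quoted above map naturally onto a botocore `Config` object. The following sketch shows one way to apply them to a FIS client. Whether the server wires them up like this is an assumption, and `make_fis_client` is a hypothetical helper.

```python
# Values quoted in the Error Handling list: 300s read, 60s connect, 3 retries.
CLIENT_SETTINGS = {
    "read_timeout": 300,
    "connect_timeout": 60,
    "retries": {"max_attempts": 3},
}


def make_fis_client(region_name):
    """Create a FIS client using the settings above (requires boto3)."""
    # Imported lazily so the settings can be inspected without boto3 installed.
    import boto3
    from botocore.config import Config

    return boto3.client("fis", region_name=region_name, config=Config(**CLIENT_SETTINGS))
```

Keeping the settings in one dictionary makes it straightforward to reuse the same `Config` for the other service clients used during resource discovery.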
Safety Features
- Dry Run Mode: Preview targets before execution
- Auto Recovery: Most experiments include automatic recovery
- Resource Validation: Verify resources exist before targeting
- Region Isolation: Experiments are region-specific
- Tag Validation: Ensure exact tag matches to prevent accidental targeting
Examples
Chaos Engineering Scenarios
# Test EKS cluster resilience
tag_based_failure(
    tag_key="eks:cluster-name",
    tag_value="production-cluster",
    duration="5m"
)
# Simulate database failover
db_failure(
    db_identifier="prod-aurora-cluster",
    failure_type="failover"
)
# Test multi-AZ application resilience
az_failure(availability_zone="us-west-2a")
# Validate auto-scaling behavior
tag_based_failure(
    tag_key="Environment",
    tag_value="staging",
    duration="10m"
)
# Test Kafka cluster resilience
msk_failure(cluster_name="event-streaming-cluster")
Monitoring
Use get_experiment_status() to monitor experiment progress:
# Start experiment
result = tag_based_failure(tag_key="Team", tag_value="platform")
experiment_id = result.content[0].text # Extract experiment ID
# Monitor progress
status = get_experiment_status(experiment_id=experiment_id)
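A single status check is often not enough for longer experiments, so a polling loop can help. The sketch below waits until an experiment reaches a terminal state. It assumes a caller-supplied `fetch_state` function that returns the FIS state string (for example by parsing the text returned by get_experiment_status); that function, the helper name, and the default intervals are all hypothetical.

```python
import time

# FIS experiment states after which no further change is expected.
TERMINAL_STATES = {"completed", "stopped", "failed"}


def wait_for_experiment(fetch_state, experiment_id, poll_seconds=15.0,
                        timeout_seconds=1800.0, sleep=time.sleep,
                        clock=time.monotonic):
    """Poll fetch_state(experiment_id) until a terminal state or a timeout."""
    deadline = clock() + timeout_seconds
    while True:
        state = fetch_state(experiment_id)
        if state in TERMINAL_STATES:
            return state
        if clock() >= deadline:
            raise TimeoutError(
                f"experiment {experiment_id} still {state!r} after {timeout_seconds}s"
            )
        sleep(poll_seconds)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised in tests without real waiting.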
Contributing
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Submit a pull request
License
MIT License - see LICENSE file for details.