Invoke mock AZ, DB, and MSK failures. Uses AWS FIS and AWS SSM internally.

Failure Invoker MCP Server

A comprehensive chaos engineering tool that enables failure injection experiments across multiple AWS services using AWS Fault Injection Simulator (FIS) and AWS Systems Manager (SSM).

Features

  • Multi-Service Support: Target EC2, RDS, ECS, Lambda, ASG, ELB, EKS, and MSK
  • Tag-Based Targeting: Flexible resource selection using AWS tags
  • Configurable Duration: Control experiment duration with human-readable formats
  • Auto-Recovery: Built-in recovery mechanisms for most services
  • Comprehensive Logging: Detailed experiment tracking and status monitoring

Supported AWS Services

Service | Action              | Recovery
EC2     | Stop instances      | Auto-restart after duration
RDS     | Reboot/Failover     | Automatic
ECS     | Stop tasks          | Service auto-recovery
Lambda  | Error injection     | Duration-based
ASG     | Capacity errors     | Duration-based
ELB     | Unavailable state   | Duration-based
EKS     | Terminate nodes     | Auto Scaling recovery
MSK     | Restart brokers     | Automatic

Installation

MCP Configuration

{
  "mcpServers": {
    "failure-invoker": {
      "command": "uvx",
      "args": ["failure-invoker-mcp@latest"],
      "env": {
        "AWS_REGION": "us-west-2",
        "AWS_ACCESS_KEY_ID": "your-access-key",
        "AWS_SECRET_ACCESS_KEY": "your-secret-key"
      }
    }
  }
}

Strands Agent SDK

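from mcp import StdioServerParameters
from mcp.client.stdio import stdio_client
from strands import Agent
from strands.tools.mcp import MCPClient

# Launch the MCP server over stdio via uvx
failure_invoker_client = MCPClient(
    lambda: stdio_client(
        StdioServerParameters(
            command="uvx",
            args=["failure-invoker-mcp@latest"],
            env={
                "AWS_REGION": "us-west-2",
                "AWS_ACCESS_KEY_ID": "your-access-key",
                "AWS_SECRET_ACCESS_KEY": "your-secret-key"
            }
        )
    )
)

# Open the connection before listing tools
failure_invoker_client.start()

# `model` and `system_prompt` are defined elsewhere in your application.
# Pass them by keyword so they are not bound to the wrong positional parameter.
agent = Agent(
    model=model,
    system_prompt=system_prompt,
    tools=failure_invoker_client.list_tools_sync(),
)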

Available Tools

1. db_failure

Execute database failure experiments on RDS instances or Aurora clusters.

Parameters:

  • db_identifier (required): RDS instance or cluster identifier
  • failure_type (optional): "reboot" or "failover" (default: "reboot")
  • region (optional): AWS region (uses AWS_REGION env var if not specified)

Examples:

# Reboot RDS instance
db_failure(db_identifier="my-database", failure_type="reboot")

# Failover Aurora cluster
db_failure(db_identifier="my-cluster", failure_type="failover", region="us-east-1")

2. az_failure

Execute availability zone failure experiments affecting all resources in a specific AZ.

Parameters:

  • availability_zone (required): Target availability zone (e.g., "us-west-2a")
  • region (optional): AWS region (uses AWS_REGION env var if not specified)

Examples:

# Simulate AZ failure
az_failure(availability_zone="us-west-2a")

# Target specific region
az_failure(availability_zone="eu-west-1b", region="eu-west-1")

3. msk_failure

Execute Amazon MSK (Managed Streaming for Apache Kafka) cluster failure experiments.

Parameters:

  • cluster_name (required): MSK cluster name
  • region (optional): AWS region (uses AWS_REGION env var if not specified)

Examples:

# Restart MSK brokers
msk_failure(cluster_name="my-kafka-cluster")

# Target specific region
msk_failure(cluster_name="prod-kafka", region="us-east-1")

4. tag_based_failure

Execute failure experiments on all resources matching specified tags across multiple AWS services.

Parameters:

  • tag_key (required): Tag key to search for
  • tag_value (required): Tag value to match
  • duration (optional): Duration of the failure (e.g., "60s", "10m", "2h", default: "10m")
  • region (optional): AWS region (uses AWS_REGION env var if not specified)

Examples:

# Target all resources with Environment=test tag
tag_based_failure(tag_key="Environment", tag_value="test", duration="5m")

# Target specific team's resources
tag_based_failure(tag_key="Team", tag_value="backend", duration="30s")

# Target EKS cluster nodes
tag_based_failure(
    tag_key="eks:cluster-name", 
    tag_value="my-cluster", 
    duration="2m"
)

# Target auto-scaling enabled resources
tag_based_failure(
    tag_key="k8s.io/cluster-autoscaler/enabled", 
    tag_value="true", 
    duration="1h"
)

5. get_experiment_status

Check the status of running or completed FIS experiments.

Parameters:

  • experiment_id (optional): Specific experiment ID to check
  • region (optional): AWS region (uses AWS_REGION env var if not specified)

Examples:

# Get all recent experiments
get_experiment_status()

# Check specific experiment
get_experiment_status(experiment_id="EXP123456789")

# Check experiments in specific region
get_experiment_status(region="eu-west-1")

Duration Format

The duration parameter accepts human-readable formats:

  • "30s" - 30 seconds
  • "5m" - 5 minutes
  • "2h" - 2 hours
  • "1h30m" - 1 hour 30 minutes
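
The formats above can be turned into the ISO 8601 durations that AWS FIS uses (for example, PT10M). The following is a minimal sketch of one way to do that conversion; the function names parse_duration and to_iso8601 are hypothetical and are not taken from this package's source.

```python
import re

# Seconds per supported unit suffix
_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+)([hms])")
_WHOLE = re.compile(r"(?:\d+[hms])+")


def parse_duration(text: str) -> int:
    """Return the total number of seconds for strings like '30s' or '1h30m'."""
    cleaned = text.strip().lower()
    if not _WHOLE.fullmatch(cleaned):
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in _TOKEN.findall(cleaned))


def to_iso8601(seconds: int) -> str:
    """Format a number of seconds as an ISO 8601 duration such as 'PT1H30M'."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    parts = "".join(
        f"{value}{suffix}"
        for value, suffix in ((hours, "H"), (minutes, "M"), (secs, "S"))
        if value
    )
    return "PT" + (parts or "0S")


if __name__ == "__main__":
    for sample in ("30s", "5m", "2h", "1h30m"):
        print(sample, "->", to_iso8601(parse_duration(sample)))
```

Rejecting anything that does not fully match the pattern keeps a typo such as "10x" from silently becoming a zero-length experiment.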

Resource Targeting Logic

Tag-Based Targeting

The tag_based_failure tool searches across the supported AWS services, with the exception of MSK:

  1. EC2 Instances: Uses describe-instances with tag filters
  2. RDS: Queries all instances/clusters, then checks tags individually
  3. ECS: Searches services across all clusters for matching tags
  4. Lambda: Iterates through functions checking tags
  5. ASG: Examines Auto Scaling Group tags
  6. ELB: Checks Load Balancer tags
  7. EKS: Searches Node Groups across all clusters
  8. MSK: Not included in tag-based targeting (use msk_failure instead)
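
The two lookup styles in the list above (server-side tag filters for EC2, per-resource tag checks for RDS) can be sketched as follows. This is an illustration only: the helper names ec2_tag_filter and has_exact_tag are hypothetical, and how the server actually implements discovery is an assumption.

```python
from typing import Dict, List


def ec2_tag_filter(tag_key: str, tag_value: str) -> List[Dict[str, object]]:
    """Build a Filters argument for EC2 describe_instances (server-side match)."""
    return [{"Name": f"tag:{tag_key}", "Values": [tag_value]}]


def has_exact_tag(tags: List[Dict[str, str]], tag_key: str, tag_value: str) -> bool:
    """Check a list of {'Key': ..., 'Value': ...} tags for an exact match."""
    return any(
        tag.get("Key") == tag_key and tag.get("Value") == tag_value
        for tag in tags
    )


if __name__ == "__main__":
    print(ec2_tag_filter("Environment", "test"))
    sample_tags = [{"Key": "Environment", "Value": "test"}]
    print(has_exact_tag(sample_tags, "Environment", "test"))

# With boto3 installed and credentials configured, the helpers could be used as:
#   ec2.describe_instances(Filters=ec2_tag_filter("Environment", "test"))
#   tags = rds.list_tags_for_resource(ResourceName=db_arn)["TagList"]
#   has_exact_tag(tags, "Environment", "test")
```

The comparison is case-sensitive and requires both key and value to match, which is in line with the exact tag matching described under Safety Features.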

Failure Actions by Service

  • EC2: Stop instances → Auto-restart after duration
  • RDS Instances: Reboot → Automatic recovery
  • RDS Clusters: Failover → Automatic recovery
  • ECS: Stop tasks → Service maintains desired count
  • Lambda: Inject errors → Duration-based
  • ASG: Insufficient capacity errors → Duration-based
  • ELB: Mark unavailable → Duration-based
  • EKS: Terminate 100% of nodes → Auto Scaling recovery

Prerequisites

  1. AWS Credentials: Configure via environment variables or AWS profiles
  2. IAM Permissions: Ensure the following permissions:
    • fis:* - For Fault Injection Simulator
    • ssm:* - For Systems Manager (MSK experiments)
    • ec2:*, rds:*, ecs:*, lambda:*, autoscaling:*, elasticloadbalancing:*, eks:*, kafka:* - For resource discovery and targeting
  3. FIS Service Role: Create an IAM role for FIS experiments with appropriate permissions
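
For the FIS service role in step 3, the role must trust the FIS service principal (fis.amazonaws.com). A minimal sketch of that trust policy, generated from Python, is shown below; the function name is hypothetical, and the permissions policy attached to the role still has to be chosen for the services you target.

```python
import json
from typing import Any, Dict


def fis_trust_policy() -> Dict[str, Any]:
    """Return a trust policy that lets AWS FIS assume the experiment role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "fis.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    # Print the JSON document, e.g. to save it and pass it to `aws iam create-role`
    print(json.dumps(fis_trust_policy(), indent=2))
```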

Error Handling

  • Resource Not Found: Experiments skip missing resources
  • Permission Denied: Clear error messages with required permissions
  • Invalid Duration: Human-readable durations are converted automatically to the ISO 8601 "PT" format that AWS FIS expects (e.g., "10m" becomes "PT10M")
  • Network Issues: Configurable timeouts and retries (300s read, 60s connect, 3 retries)
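
The timeout and retry values listed above map naturally onto botocore's client configuration. The sketch below only builds the keyword arguments so that it runs without AWS dependencies; the function name is hypothetical, and whether the server configures its clients exactly this way is an assumption.

```python
from typing import Any, Dict


def build_client_config_kwargs(
    read_timeout: int = 300, connect_timeout: int = 60, max_attempts: int = 3
) -> Dict[str, Any]:
    """Return keyword arguments suitable for botocore.config.Config."""
    return {
        "read_timeout": read_timeout,        # seconds to wait for a response
        "connect_timeout": connect_timeout,  # seconds to wait for a connection
        "retries": {"max_attempts": max_attempts},
    }


if __name__ == "__main__":
    print(build_client_config_kwargs())

# With boto3 installed:
#   import boto3
#   from botocore.config import Config
#   fis = boto3.client("fis", config=Config(**build_client_config_kwargs()))
```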

Safety Features

  • Dry Run Mode: Preview targets before execution
  • Auto Recovery: Most experiments include automatic recovery
  • Resource Validation: Verify resources exist before targeting
  • Region Isolation: Experiments are region-specific
  • Tag Validation: Ensure exact tag matches to prevent accidental targeting

Examples

Chaos Engineering Scenarios

# Test EKS cluster resilience
tag_based_failure(
    tag_key="eks:cluster-name",
    tag_value="production-cluster",
    duration="5m"
)

# Simulate database failover
db_failure(
    db_identifier="prod-aurora-cluster",
    failure_type="failover"
)

# Test multi-AZ application resilience  
az_failure(availability_zone="us-west-2a")

# Validate auto-scaling behavior
tag_based_failure(
    tag_key="Environment",
    tag_value="staging", 
    duration="10m"
)

# Test Kafka cluster resilience
msk_failure(cluster_name="event-streaming-cluster")

Monitoring

Use get_experiment_status() to monitor experiment progress:

# Start experiment
result = tag_based_failure(tag_key="Team", tag_value="platform")
experiment_id = result.content[0].text  # Extract experiment ID

# Monitor progress
status = get_experiment_status(experiment_id=experiment_id)
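
For longer experiments, a single status check is rarely enough. Below is a minimal polling sketch: fetch_status stands in for whatever function returns the experiment's status string (a hypothetical wrapper around get_experiment_status), and the terminal states are the ones AWS FIS reports for finished experiments.

```python
import time
from typing import Callable

# FIS experiment states after which no further progress is expected
TERMINAL_STATES = {"completed", "stopped", "failed", "cancelled"}


def wait_for_experiment(
    fetch_status: Callable[[str], str],
    experiment_id: str,
    poll_seconds: float = 15.0,
    timeout_seconds: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the experiment reaches a terminal state, then return it."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status(experiment_id)
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"experiment {experiment_id} still {status!r} after {timeout_seconds}s"
            )
        sleep(poll_seconds)


if __name__ == "__main__":
    # Simulated status sequence in place of real calls to the MCP tool
    fake = iter(["pending", "running", "running", "completed"])
    print(wait_for_experiment(lambda _id: next(fake), "EXP123456789", sleep=lambda s: None))
```

Passing the status function and the sleep function in as arguments keeps the loop easy to test without touching AWS.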

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Submit a pull request

License

MIT License - see LICENSE file for details.
