Invokes mock AZ, DB, and MSK failures. Internally uses AWS FIS and AWS SSM.

These details have not been verified by PyPI

Failure Invoker MCP Server

A chaos engineering tool that runs failure injection experiments across multiple AWS services using AWS Fault Injection Service (FIS) and AWS Systems Manager (SSM).

Features

Multi-Service Support: Target EC2, RDS, ECS, Lambda, ASG, ELB, EKS, and MSK
Tag-Based Targeting: Flexible resource selection using AWS tags
Configurable Duration: Control experiment duration with human-readable formats
Auto-Recovery: Built-in recovery mechanisms for most services
Comprehensive Logging: Detailed experiment tracking and status monitoring

Supported AWS Services

Service	Action	Recovery
EC2	Stop instances	Auto-restart after duration
RDS	Reboot/Failover	Automatic
ECS	Stop tasks	Service auto-recovery
Lambda	Error injection	Duration-based
ASG	Capacity errors	Duration-based
ELB	Unavailable state	Duration-based
EKS	Terminate nodes	Auto Scaling recovery
MSK	Restart brokers	Automatic

Installation

MCP Configuration

{
  "mcpServers": {
    "failure-invoker": {
      "command": "uvx",
      "args": ["failure-invoker-mcp@latest"],
      "env": {
        "AWS_REGION": "us-west-2",
        "AWS_ACCESS_KEY_ID": "your-access-key",
        "AWS_SECRET_ACCESS_KEY": "your-secret-key"
      }
    }
  }
}
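
As a minimal sketch, the same configuration can be generated from variables already set in your shell instead of typing the values by hand. The helper name `build_mcp_config` is hypothetical and not part of this package; only the JSON structure comes from the example above.

```python
import os


def build_mcp_config(env=None):
    """Return the mcpServers entry shown above, filled from environment variables.

    Hypothetical helper for illustration; not part of failure-invoker-mcp.
    """
    env = os.environ if env is None else env
    # Fall back to the region used in the README example.
    server_env = {"AWS_REGION": env.get("AWS_REGION", "us-west-2")}
    # Copy credentials only when they are actually set.
    for name in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
        if name in env:
            server_env[name] = env[name]
    return {
        "mcpServers": {
            "failure-invoker": {
                "command": "uvx",
                "args": ["failure-invoker-mcp@latest"],
                "env": server_env,
            }
        }
    }
```

Serialize the result with `json.dumps(build_mcp_config(), indent=2)` and write it to your MCP client's configuration file.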

Strands Agent SDK

from mcp import StdioServerParameters
from mcp.client.stdio import stdio_client
from strands import Agent
from strands.tools.mcp import MCPClient

failure_invoker_client = MCPClient(
    lambda: stdio_client(
        StdioServerParameters(
            command="uvx",
            args=["failure-invoker-mcp@latest"],
            env={
                "AWS_REGION": "us-west-2",
                "AWS_ACCESS_KEY_ID": "your-access-key",
                "AWS_SECRET_ACCESS_KEY": "your-secret-key"
            }
        )
    )
)

# Start the MCP server subprocess and open the client session
failure_invoker_client.start()

# model and system_prompt are placeholders that you define elsewhere
agent = Agent(
    model=model,
    system_prompt=system_prompt,
    tools=failure_invoker_client.list_tools_sync(),
)

Available Tools

1. `db_failure`

Execute database failure experiments on RDS instances or Aurora clusters.

Parameters:

db_identifier (required): RDS instance or cluster identifier
failure_type (optional): "reboot" or "failover" (default: "reboot")
region (optional): AWS region (uses AWS_REGION env var if not specified)

Examples:

# Reboot RDS instance
db_failure(db_identifier="my-database", failure_type="reboot")

# Failover Aurora cluster
db_failure(db_identifier="my-cluster", failure_type="failover", region="us-east-1")
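
The calls above are shorthand for MCP tool invocations. As a sketch, a client could validate the documented parameters before sending the request. The helper `db_failure_arguments` is hypothetical; the parameter names and allowed values are the ones listed above.

```python
VALID_FAILURE_TYPES = ("reboot", "failover")


def db_failure_arguments(db_identifier, failure_type="reboot", region=None):
    """Build the argument dict for a db_failure tool call (hypothetical helper)."""
    if not db_identifier:
        raise ValueError("db_identifier is required")
    if failure_type not in VALID_FAILURE_TYPES:
        raise ValueError(f"failure_type must be one of {VALID_FAILURE_TYPES}")
    arguments = {"db_identifier": db_identifier, "failure_type": failure_type}
    # Leave region out so the server falls back to the AWS_REGION env var.
    if region is not None:
        arguments["region"] = region
    return arguments
```

With the `mcp` Python SDK, the resulting dict can be passed as the second argument of `ClientSession.call_tool("db_failure", arguments)`.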

2. `az_failure`

Execute availability zone failure experiments affecting all resources in a specific AZ.

Parameters:

availability_zone (required): Target availability zone (e.g., "us-west-2a")
region (optional): AWS region (uses AWS_REGION env var if not specified)

Examples:

# Simulate AZ failure
az_failure(availability_zone="us-west-2a")

# Target specific region
az_failure(availability_zone="eu-west-1b", region="eu-west-1")
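
Because the zone and the region must agree, it can help to check them before starting an experiment. The sketch below derives the region from a standard zone name. It is an illustration only: `region_of` is a hypothetical helper, and Local Zone or Wavelength Zone names are deliberately rejected.

```python
import re

# Standard AZ names are a region name plus one lowercase letter, e.g. "us-west-2a".
_AZ_PATTERN = re.compile(r"^([a-z]{2}(?:-[a-z]+)+-\d+)([a-z])$")


def region_of(availability_zone):
    """Return the region part of a standard AZ name (hypothetical helper)."""
    match = _AZ_PATTERN.match(availability_zone)
    if match is None:
        raise ValueError(f"not a standard availability zone name: {availability_zone!r}")
    return match.group(1)


def az_matches_region(availability_zone, region):
    """True when the zone belongs to the given region."""
    return region_of(availability_zone) == region
```

For the second example above, `az_matches_region("eu-west-1b", "eu-west-1")` is true, while pairing `"eu-west-1b"` with `"us-west-2"` is not.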

3. `msk_failure`

Execute Amazon MSK (Managed Streaming for Apache Kafka) cluster failure experiments.

Parameters:

cluster_name (required): MSK cluster name
region (optional): AWS region (uses AWS_REGION env var if not specified)

Examples:

# Restart MSK brokers
msk_failure(cluster_name="my-kafka-cluster")

# Target specific region
msk_failure(cluster_name="prod-kafka", region="us-east-1")

4. `tag_based_failure`

Execute failure experiments on all resources matching specified tags across multiple AWS services.

Parameters:

tag_key (required): Tag key to search for
tag_value (required): Tag value to match
duration (optional): Duration of the failure (e.g., "60s", "10m", "2h", default: "10m")
region (optional): AWS region (uses AWS_REGION env var if not specified)

Examples:

# Target all resources with Environment=test tag
tag_based_failure(tag_key="Environment", tag_value="test", duration="5m")

# Target specific team's resources
tag_based_failure(tag_key="Team", tag_value="backend", duration="30s")

# Target EKS cluster nodes
tag_based_failure(
    tag_key="eks:cluster-name", 
    tag_value="my-cluster", 
    duration="2m"
)

# Target auto-scaling enabled resources
tag_based_failure(
    tag_key="k8s.io/cluster-autoscaler/enabled", 
    tag_value="true", 
    duration="1h"
)

5. `get_experiment_status`

Check the status of running or completed FIS experiments.

Parameters:

experiment_id (optional): Specific experiment ID to check; when omitted, recent experiments are returned
region (optional): AWS region (uses AWS_REGION env var if not specified)

Examples:

# Get all recent experiments
get_experiment_status()

# Check specific experiment
get_experiment_status(experiment_id="EXP123456789")

# Check experiments in specific region
get_experiment_status(region="eu-west-1")

Duration Format

The duration parameter accepts human-readable formats:

"30s" - 30 seconds
"5m" - 5 minutes
"2h" - 2 hours
"1h30m" - 1 hour 30 minutes

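AWS FIS expects ISO 8601 durations such as "PT5M", so the human-readable values above have to be translated. The following is a minimal sketch of such a conversion, not the server's actual parser; the function name `to_iso8601_duration` is hypothetical.

```python
import re

_UNIT_PATTERN = re.compile(r"(\d+)([hms])")


def to_iso8601_duration(duration):
    """Convert "1h30m"-style text to an ISO 8601 duration such as "PT1H30M"."""
    text = duration.strip().lower()
    parts = _UNIT_PATTERN.findall(text)
    # Reject empty input and input with leftover characters, e.g. "5 minutes".
    if not parts or "".join(number + unit for number, unit in parts) != text:
        raise ValueError(f"unsupported duration: {duration!r}")
    units = [unit for _, unit in parts]
    # Units must appear at most once and in the order hours, minutes, seconds.
    if units != sorted(set(units), key="hms".index):
        raise ValueError(f"units out of order or repeated: {duration!r}")
    return "PT" + "".join(f"{int(number)}{unit.upper()}" for number, unit in parts)
```

Under these assumptions, `to_iso8601_duration("1h30m")` yields `"PT1H30M"` and `to_iso8601_duration("30s")` yields `"PT30S"`.
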
Resource Targeting Logic

Tag-Based Targeting

The tag_based_failure tool searches across all supported AWS services:

EC2 Instances: Uses describe-instances with tag filters
RDS: Queries all instances/clusters, then checks tags individually
ECS: Searches services across all clusters for matching tags
Lambda: Iterates through functions checking tags
ASG: Examines Auto Scaling Group tags
ELB: Checks Load Balancer tags
EKS: Searches Node Groups across all clusters
MSK: Not included in tag-based targeting (use msk_failure instead)
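
The two lookup styles described above (server-side tag filters for EC2, client-side tag checks for services such as RDS) can be sketched as follows. Both helper names are hypothetical; the filter and tag-list shapes are the ones the AWS APIs use.

```python
def ec2_tag_filters(tag_key, tag_value):
    """Filters argument for EC2 describe_instances: exact match on one tag."""
    return [{"Name": f"tag:{tag_key}", "Values": [tag_value]}]


def has_tag(tag_list, tag_key, tag_value):
    """Check a [{"Key": ..., "Value": ...}] tag list, as returned for RDS resources."""
    return any(
        tag.get("Key") == tag_key and tag.get("Value") == tag_value
        for tag in tag_list
    )
```

With boto3, the first helper would be used as `boto3.client("ec2").describe_instances(Filters=ec2_tag_filters("Environment", "test"))`, and the second applied to the `TagList` of each RDS instance or cluster.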

Failure Actions by Service

EC2: Stop instances → Auto-restart after duration
RDS Instances: Reboot → Automatic recovery
RDS Clusters: Failover → Automatic recovery
ECS: Stop tasks → Service maintains desired count
Lambda: Inject errors → Duration-based
ASG: Insufficient capacity errors → Duration-based
ELB: Mark unavailable → Duration-based
EKS: Terminate 100% of nodes → Auto Scaling recovery
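
For orientation, two of these rows can be expressed as FIS experiment template actions. This is a sketch under assumptions: the server's real templates are not shown in this README, so the choice of action IDs and the helper names are illustrative. The action IDs and parameter names themselves are standard AWS FIS ones.

```python
def ec2_stop_action(target_name, duration_iso="PT10M"):
    """FIS action that stops EC2 instances and restarts them after duration_iso."""
    return {
        "actionId": "aws:ec2:stop-instances",
        "parameters": {"startInstancesAfterDuration": duration_iso},
        "targets": {"Instances": target_name},
    }


def eks_terminate_action(target_name, percentage=100):
    """FIS action that terminates a percentage of instances in EKS node groups."""
    if not 1 <= percentage <= 100:
        raise ValueError("percentage must be between 1 and 100")
    return {
        "actionId": "aws:eks:terminate-nodegroup-instances",
        # FIS action parameters are passed as strings.
        "parameters": {"instanceTerminationPercentage": str(percentage)},
        "targets": {"Nodegroups": target_name},
    }
```

Each dict is shaped like one entry of the `actions` map in an FIS experiment template; `target_name` refers to a key in the template's `targets` map.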

Prerequisites

AWS Credentials: Configure via environment variables or AWS profiles
IAM Permissions: Grant the calling identity the following permissions:
- fis:* - For AWS Fault Injection Service
- ssm:* - For Systems Manager (MSK experiments)
- ec2:*, rds:*, ecs:*, lambda:*, autoscaling:*, elasticloadbalancing:*, eks:*, kafka:* - For resource discovery and targeting
FIS Service Role: Create an IAM role for FIS experiments with appropriate permissions
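
The FIS service role must be assumable by the FIS service itself. As a sketch, the trust policy for such a role can be built like this; the helper name is hypothetical, and the permissions policy attached to the role still depends on which actions you run.

```python
import json


def fis_trust_policy():
    """Trust policy that lets the AWS FIS service assume the experiment role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "fis.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def fis_trust_policy_json():
    """The same policy serialized for use as an AssumeRolePolicyDocument."""
    return json.dumps(fis_trust_policy())
```

The JSON string can be supplied as `AssumeRolePolicyDocument` when creating the role, for example through the IAM `create_role` API.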

Error Handling

Resource Not Found: Experiments skip missing resources
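Permission Denied: Error messages state which permissions are required
Duration Format: Human-readable durations are converted automatically to the ISO 8601 PT format that AWS FIS expects (for example, "5m" becomes "PT5M")
Network Issues: Configurable timeouts and retries (300s read timeout, 60s connect timeout, 3 retries)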
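
If you call FIS directly with boto3, the timeout and retry values listed above map onto botocore client settings. This is a sketch of equivalent settings, not the server's own code; `CLIENT_SETTINGS` and `make_fis_client` are hypothetical names.

```python
# Values taken from the list above; the keys are botocore Config options.
CLIENT_SETTINGS = {
    "read_timeout": 300,
    "connect_timeout": 60,
    "retries": {"max_attempts": 3},
}


def make_fis_client(region):
    """Create a boto3 FIS client that uses CLIENT_SETTINGS."""
    # Imported lazily so the settings can be inspected without boto3 installed.
    import boto3
    from botocore.config import Config

    return boto3.client("fis", region_name=region, config=Config(**CLIENT_SETTINGS))
```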

Safety Features

Dry Run Mode: Preview targets before execution
Auto Recovery: Most experiments include automatic recovery
Resource Validation: Verify resources exist before targeting
Region Isolation: Experiments are region-specific
Tag Validation: Ensure exact tag matches to prevent accidental targeting

Examples

Chaos Engineering Scenarios

# Test EKS cluster resilience
tag_based_failure(
    tag_key="eks:cluster-name",
    tag_value="production-cluster",
    duration="5m"
)

# Simulate database failover
db_failure(
    db_identifier="prod-aurora-cluster",
    failure_type="failover"
)

# Test multi-AZ application resilience  
az_failure(availability_zone="us-west-2a")

# Validate auto-scaling behavior
tag_based_failure(
    tag_key="Environment",
    tag_value="staging", 
    duration="10m"
)

# Test Kafka cluster resilience
msk_failure(cluster_name="event-streaming-cluster")

Monitoring

Use get_experiment_status() to monitor experiment progress:

# Start experiment
result = tag_based_failure(tag_key="Team", tag_value="platform")
experiment_id = result.content[0].text  # Extract experiment ID

# Monitor progress
status = get_experiment_status(experiment_id=experiment_id)
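
To follow an experiment until it finishes, the status check can be wrapped in a polling loop. The sketch below assumes a callable `fetch_state` that you write yourself: it should call `get_experiment_status` and return the experiment state as a lowercase string. The tool's response format is not documented here, so that extraction step is left to you.

```python
import time

# Terminal states reported by AWS FIS experiments.
TERMINAL_STATES = {"completed", "stopped", "failed", "cancelled"}


def wait_for_experiment(fetch_state, experiment_id, poll_seconds=15,
                        timeout_seconds=1800, sleep=time.sleep):
    """Poll fetch_state(experiment_id) until a terminal state or timeout."""
    waited = 0
    while True:
        state = fetch_state(experiment_id)
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(f"experiment {experiment_id} still {state!r}")
        sleep(poll_seconds)
        waited += poll_seconds
```

The `sleep` parameter exists so the loop can be exercised in tests without real delays.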

Contributing

Fork the repository
Create a feature branch
Add tests for new functionality
Submit a pull request

License

MIT License - see LICENSE file for details.

Release history

1.1.0 (this version): Sep 7, 2025
1.0.0: Sep 3, 2025

Download files

Source Distribution

failure_invoker_mcp-1.1.0.tar.gz (14.8 kB)

Uploaded Sep 7, 2025 Source

Built Distribution

failure_invoker_mcp-1.1.0-py3-none-any.whl (15.0 kB)

Uploaded Sep 7, 2025 Python 3

File details

Details for the file failure_invoker_mcp-1.1.0.tar.gz.

File metadata

Download URL: failure_invoker_mcp-1.1.0.tar.gz
Upload date: Sep 7, 2025
Size: 14.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.4

File hashes

Hashes for failure_invoker_mcp-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`07f89129221d30ca3e892eb840fed9530b710432d60453534c539c63997711f6`
MD5	`4212a9d8b7658c01940f003c2ec879dc`
BLAKE2b-256	`201b7707fb88479ff811391dd152ed212d03fe5dbe10673d3856d38051ae10f8`

File details

Details for the file failure_invoker_mcp-1.1.0-py3-none-any.whl.

File metadata

Download URL: failure_invoker_mcp-1.1.0-py3-none-any.whl
Upload date: Sep 7, 2025
Size: 15.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.4

File hashes

Hashes for failure_invoker_mcp-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d145afdf0092b153b0048aba1150a2024eb6a3d14b2fcdb955dda97e1355303e`
MD5	`854eb8302a2a78e1efe7b173416584eb`
BLAKE2b-256	`6b1e3a8a3e65b05f832c445348d0848245066c5e9644d6cd226aa411aeb42e96`
