Diagnose why your ECS tasks fail to start or keep crashing

ECS Task Doctor

Requires Python 3.9+. License: MIT.

Diagnose why your ECS tasks fail to start or keep crashing — in one command.

ECS Task Doctor aggregates information from ECS, CloudWatch, ECR, IAM, and EC2 into a single, human-readable diagnosis report. No more jumping between 7 AWS console tabs.

Installation

pip install ecs-task-doctor

Quick Start

# Diagnose a specific service
ecs-doctor diagnose --cluster my-cluster --service my-service

# Diagnose a specific task
ecs-doctor diagnose --cluster my-cluster --task arn:aws:ecs:us-east-1:123456789012:task/my-cluster/abc123

# Scan all services in a cluster for issues
ecs-doctor scan --cluster my-cluster

# Quick health check
ecs-doctor health --cluster my-cluster

What It Checks

Check            What it does
Task Status      Parses stopped reasons and container exit codes (OOM, segfault, etc.)
Service Events   Detects crash loops, placement failures, and capacity issues
CloudWatch Logs  Scans recent logs for error patterns (OOM, connection refused, etc.)
Image            Verifies ECR images exist and are pullable
IAM              Validates task execution and task roles exist
Resources        Checks CPU/memory constraints and cluster capacity
Networking       Verifies subnets have IPs, security groups allow egress
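
To make the Task Status check more concrete, here is a minimal sketch of how a container exit code can be interpreted. It is not taken from the ECS Task Doctor source: the function name `explain_exit_code` and the `SIGNAL_NAMES` table are hypothetical. It relies only on the common convention that a process killed by signal N is reported with exit code 128 + N.

```python
# Hypothetical helper for illustration; not part of ecs-task-doctor's API.
# Only a few common signals are listed here.
SIGNAL_NAMES = {6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def explain_exit_code(code: int) -> str:
    """Turn a container exit code into a short human-readable explanation."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        # Exit codes above 128 conventionally mean "killed by signal (code - 128)".
        sig = code - 128
        name = SIGNAL_NAMES.get(sig, f"signal {sig}")
        # SIGKILL is what the kernel OOM killer sends, but it has other causes too.
        hint = " (often an out-of-memory kill)" if sig == 9 else ""
        return f"killed by {name}{hint}"
    return f"application error (exit code {code})"


print(explain_exit_code(137))  # 137 = 128 + 9
print(explain_exit_code(139))  # 139 = 128 + 11
```

Exit code 137 alone does not prove an OOM kill, so it is worth cross-checking the stopped reason and the logs before raising memory limits.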

Example Output

╭─────────────────────────────────────────────────╮
│  ECS Task Doctor — Diagnosis Report             │
│  Cluster: production  Service: api-server       │
╰─────────────────────────────────────────────────╯

🔴 CRITICAL: Container keeps crashing (3 restarts in 10 min)

📋 Checks:
  ✅ Image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/api:v2.1.0 — exists and pullable
  ✅ IAM: Task execution role has required permissions
  ✅ Network: Subnets have available IPs, security groups allow egress
  ❌ Task Status: Essential container exited with code 137 (OOM Kill)
  ⚠️  Resources: Container memory limit (512MB) is close to task memory (512MB)
  ❌ Logs: Last error — "JavaScript heap out of memory"

💡 Recommendation:
  1. Increase container memory limit from 512MB to 1024MB
  2. Update task definition memory from 512 to 1024
  3. Consider adding --max-old-space-size=768 to Node.js startup

📝 Full logs: aws logs tail /ecs/api-server --since 1h
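
The "Logs" line in the report above comes from scanning recent log lines for error patterns. A rough sketch of that idea follows, assuming a simple regex list. The pattern list, the function name `last_error`, and the sample log lines are all made up for illustration; the tool's real patterns are not documented here.

```python
import re
from typing import Optional

# Illustrative patterns only (assumed, based on the error types named above).
ERROR_PATTERNS = [
    re.compile(r"out of memory", re.IGNORECASE),
    re.compile(r"connection refused", re.IGNORECASE),
]


def last_error(lines: list[str]) -> Optional[str]:
    """Return the most recent line matching any error pattern, or None."""
    # Walk backwards so the newest matching line wins.
    for line in reversed(lines):
        if any(pattern.search(line) for pattern in ERROR_PATTERNS):
            return line.strip()
    return None


# Made-up sample log lines, oldest first.
sample = [
    "server listening on :8080",
    "connect ECONNREFUSED: connection refused",
    "FATAL ERROR: JavaScript heap out of memory",
]
print(last_error(sample))
```

Scanning newest-first keeps the report focused on the error closest to the crash rather than on earlier, possibly unrelated, failures.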

Output Formats

# Rich terminal output (default)
ecs-doctor diagnose --cluster my-cluster --service my-service

# JSON (for scripting/automation)
ecs-doctor diagnose --cluster my-cluster --service my-service --format json

# Markdown (for reports/PRs)
ecs-doctor diagnose --cluster my-cluster --service my-service --format markdown
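
For automation, the JSON format can be consumed from a script. The sketch below uses only the flags shown above. The helper names `build_diagnose_command` and `run_diagnose` are hypothetical, and because the JSON schema is not documented here, the parsed result is returned as-is without assuming any keys.

```python
import json
import subprocess


def build_diagnose_command(cluster: str, service: str, fmt: str = "json") -> list[str]:
    """Build the argv list for `ecs-doctor diagnose` using the documented flags."""
    return [
        "ecs-doctor", "diagnose",
        "--cluster", cluster,
        "--service", service,
        "--format", fmt,
    ]


def run_diagnose(cluster: str, service: str):
    """Run the CLI and parse stdout as JSON.

    Assumes ecs-doctor is on PATH and AWS credentials are configured.
    check=True is deliberately not used, since the CLI's exit-code
    behavior for unhealthy services is not documented here.
    """
    result = subprocess.run(
        build_diagnose_command(cluster, service),
        capture_output=True,
        text=True,
    )
    # Raises json.JSONDecodeError if the command printed nothing parseable.
    return json.loads(result.stdout)


print(build_diagnose_command("my-cluster", "my-service"))
```

Passing the command as a list rather than a shell string avoids quoting problems when cluster or service names contain unusual characters.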

Commands

ecs-doctor diagnose

Run a full diagnosis on a service or task.

ecs-doctor diagnose --cluster CLUSTER --service SERVICE [--region REGION] [--format FORMAT]
ecs-doctor diagnose --cluster CLUSTER --task TASK_ARN [--region REGION] [--format FORMAT]

ecs-doctor scan

Scan all services in a cluster and diagnose any unhealthy ones.

ecs-doctor scan --cluster CLUSTER [--region REGION] [--format FORMAT]

ecs-doctor health

Show a quick health overview of all services in a cluster.

ecs-doctor health --cluster CLUSTER [--region REGION] [--format FORMAT]

Required AWS Permissions

ECS Task Doctor needs read-only access to several AWS services:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecs:DescribeClusters",
        "ecs:DescribeServices",
        "ecs:DescribeTasks",
        "ecs:DescribeTaskDefinition",
        "ecs:ListServices",
        "ecs:ListTasks",
        "ecs:ListContainerInstances",
        "ecs:DescribeContainerInstances",
        "logs:DescribeLogStreams",
        "logs:GetLogEvents",
        "ecr:DescribeRepositories",
        "ecr:DescribeImages",
        "iam:GetRole",
        "ec2:DescribeSubnets",
        "ec2:DescribeSecurityGroups"
      ],
      "Resource": "*"
    }
  ]
}
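
Before attaching a policy like this, it can be useful to confirm that it grants read-style actions only. The following is a heuristic sketch, not an authoritative classification: it flags any action whose operation name does not start with `Describe`, `List`, or `Get`. The function name `non_read_only_actions` and the prefix list are assumptions made for this example.

```python
# Prefix-based heuristic (assumed); AWS does not guarantee that every
# Describe/List/Get action is harmless, so treat this as a first-pass check.
READ_ONLY_PREFIXES = ("Describe", "List", "Get")

# The policy from this README, expressed as a Python dict.
POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ecs:DescribeClusters", "ecs:DescribeServices",
                "ecs:DescribeTasks", "ecs:DescribeTaskDefinition",
                "ecs:ListServices", "ecs:ListTasks",
                "ecs:ListContainerInstances", "ecs:DescribeContainerInstances",
                "logs:DescribeLogStreams", "logs:GetLogEvents",
                "ecr:DescribeRepositories", "ecr:DescribeImages",
                "iam:GetRole",
                "ec2:DescribeSubnets", "ec2:DescribeSecurityGroups",
            ],
            "Resource": "*",
        }
    ],
}


def non_read_only_actions(policy: dict) -> list[str]:
    """Return actions whose operation name lacks a read-style prefix."""
    flagged = []
    for statement in policy.get("Statement", []):
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # IAM allows a single string here
            actions = [actions]
        for action in actions:
            # "ecs:ListTasks" -> operation "ListTasks"; a bare "*" yields "".
            _, _, operation = action.partition(":")
            if not operation.startswith(READ_ONLY_PREFIXES):
                flagged.append(action)
    return flagged


print(non_read_only_actions(POLICY))
```

Wildcards such as `ecs:*` or `*` are flagged as well, since they do not start with a read-style prefix.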

Development

# Install in dev mode
pip install -e '.[dev]'

# Run tests
pytest -v

# Lint
ruff check src/ tests/

See CONTRIBUTING.md for more details.

License

MIT — see LICENSE.
