ECS Task Doctor

Requires Python 3.9+. License: MIT.

Diagnose why your ECS tasks fail to start or keep crashing — in one command.

ECS Task Doctor aggregates information from ECS, CloudWatch, ECR, IAM, and EC2 into a single, human-readable diagnosis report. No more jumping between 7 AWS console tabs.

Installation

pip install ecs-task-doctor

Quick Start

# Diagnose a specific service
ecs-doctor diagnose --cluster my-cluster --service my-service

# Diagnose a specific task
ecs-doctor diagnose --cluster my-cluster --task arn:aws:ecs:us-east-1:123:task/my-cluster/abc123

# Scan all services in a cluster for issues
ecs-doctor scan --cluster my-cluster

# Quick health check
ecs-doctor health --cluster my-cluster
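
The --task option above takes a full task ARN. As a rough illustration of how such an ARN breaks down, here is a minimal Python sketch. It is not part of ECS Task Doctor: parse_task_arn and TaskArn are hypothetical names, and the sketch assumes the long ARN layout arn:partition:ecs:region:account:task/cluster/task-id.

```python
from typing import NamedTuple


class TaskArn(NamedTuple):
    partition: str
    region: str
    account_id: str
    cluster: str
    task_id: str


def parse_task_arn(arn: str) -> TaskArn:
    """Split a long-format ECS task ARN into its parts (hypothetical helper)."""
    # General ARN layout: arn:<partition>:<service>:<region>:<account>:<resource>
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "ecs":
        raise ValueError(f"not an ECS ARN: {arn!r}")
    _, partition, _, region, account_id, resource = parts

    # Assumed long-format resource: task/<cluster-name>/<task-id>
    resource_parts = resource.split("/")
    if len(resource_parts) != 3 or resource_parts[0] != "task":
        raise ValueError(f"not a long-format task ARN: {arn!r}")
    _, cluster, task_id = resource_parts
    return TaskArn(partition, region, account_id, cluster, task_id)


if __name__ == "__main__":
    # Placeholder ARN taken from the Quick Start example
    print(parse_task_arn("arn:aws:ecs:us-east-1:123:task/my-cluster/abc123"))
```

A helper like this can be handy for checking that the cluster embedded in the ARN matches the value passed to --cluster before running a diagnosis.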

What It Checks

Check            What it does
---------------  ---------------------------------------------------------------------
Task Status      Parses stopped reasons and container exit codes (OOM, segfault, etc.)
Service Events   Detects crash loops, placement failures, and capacity issues
CloudWatch Logs  Scans recent logs for error patterns (OOM, connection refused, etc.)
Image            Verifies that ECR images exist and are pullable
IAM              Validates that the task execution role and the task role exist
Resources        Checks CPU/memory constraints and cluster capacity
Networking       Verifies that subnets have available IPs and security groups allow egress
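
The Task Status check mentions exit codes such as OOM kills and segfaults. By the usual Linux convention, an exit code above 128 means the process was terminated by a signal, with the signal number equal to the code minus 128. The sketch below illustrates that convention only; it is an assumption about how such codes are typically read, not ECS Task Doctor's own classification logic, and describe_exit_code is a hypothetical name.

```python
# Common Linux signal numbers; a small fixed table keeps the sketch portable.
SIGNALS = {6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}

# Hints for the signals most often seen when containers crash.
HINTS = {9: " (often an out-of-memory kill)", 11: " (segmentation fault)"}


def describe_exit_code(code: int) -> str:
    """Return a short human-readable reading of a container exit code."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        # Convention: exit code = 128 + signal number
        signum = code - 128
        name = SIGNALS.get(signum, f"signal {signum}")
        return f"killed by {name}{HINTS.get(signum, '')}"
    return f"application error (exit code {code})"


if __name__ == "__main__":
    for code in (0, 1, 137, 139, 143):
        print(code, "->", describe_exit_code(code))
```

Note that SIGKILL (137) is not proof of an OOM kill on its own; the stopped reason and logs are what confirm it.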

Example Output

╭─────────────────────────────────────────────────╮
│  ECS Task Doctor — Diagnosis Report             │
│  Cluster: production  Service: api-server       │
╰─────────────────────────────────────────────────╯

🔴 CRITICAL: Container keeps crashing (3 restarts in 10 min)

📋 Checks:
  ✅ Image: 123456789.dkr.ecr.us-east-1.amazonaws.com/api:v2.1.0 — exists and pullable
  ✅ IAM: Task execution role has required permissions
  ✅ Network: Subnets have available IPs, security groups allow egress
  ❌ Task Status: Essential container exited with code 137 (OOM Kill)
  ⚠️  Resources: Container memory limit (512MB) is close to task memory (512MB)
  ❌ Logs: Last error — "JavaScript heap out of memory"

💡 Recommendation:
  1. Increase container memory limit from 512MB to 1024MB
  2. Update task definition memory from 512 to 1024
  3. Consider adding --max-old-space-size=768 to Node.js startup

📝 Full logs: aws logs tail /ecs/api-server --since 1h

Output Formats

# Rich terminal output (default)
ecs-doctor diagnose --cluster my-cluster --service my-service

# JSON (for scripting/automation)
ecs-doctor diagnose --cluster my-cluster --service my-service --format json

# Markdown (for reports/PRs)
ecs-doctor diagnose --cluster my-cluster --service my-service --format markdown
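
For automation, the JSON format can be consumed from a script. The sketch below is one possible approach in Python: it shells out to the documented command and parses stdout. The structure of the JSON report is not documented here, so the sketch makes no assumptions about field names and returns the parsed value as-is. build_command and run_diagnosis are hypothetical helper names, and the exit-status behaviour for unhealthy services is unknown, which is why check=False is used.

```python
import json
import subprocess
from typing import Any, List


def build_command(cluster: str, service: str) -> List[str]:
    """Build the documented diagnose invocation with JSON output."""
    return [
        "ecs-doctor", "diagnose",
        "--cluster", cluster,
        "--service", service,
        "--format", "json",
    ]


def run_diagnosis(cluster: str, service: str) -> Any:
    """Run the CLI and return the parsed JSON report (schema not assumed)."""
    # check=False: assumption, since the CLI's exit codes are not documented.
    result = subprocess.run(
        build_command(cluster, service),
        capture_output=True,
        text=True,
        check=False,
    )
    return json.loads(result.stdout)


# Example use (requires ecs-doctor on PATH and AWS credentials):
#   report = run_diagnosis("my-cluster", "my-service")
#   print(json.dumps(report, indent=2))
```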

Commands

ecs-doctor diagnose

Run a full diagnosis on a service or task.

ecs-doctor diagnose --cluster CLUSTER --service SERVICE [--region REGION] [--format FORMAT]
ecs-doctor diagnose --cluster CLUSTER --task TASK_ARN [--region REGION] [--format FORMAT]

ecs-doctor scan

Scan all services in a cluster and diagnose any unhealthy ones.

ecs-doctor scan --cluster CLUSTER [--region REGION] [--format FORMAT]

ecs-doctor health

Show a quick health overview of all services in a cluster.

ecs-doctor health --cluster CLUSTER [--region REGION] [--format FORMAT]

Required AWS Permissions

ECS Task Doctor needs read-only access to several AWS services:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecs:DescribeClusters",
        "ecs:DescribeServices",
        "ecs:DescribeTasks",
        "ecs:DescribeTaskDefinition",
        "ecs:ListServices",
        "ecs:ListTasks",
        "ecs:ListContainerInstances",
        "ecs:DescribeContainerInstances",
        "logs:DescribeLogStreams",
        "logs:GetLogEvents",
        "ecr:DescribeRepositories",
        "ecr:DescribeImages",
        "iam:GetRole",
        "ec2:DescribeSubnets",
        "ec2:DescribeSecurityGroups"
      ],
      "Resource": "*"
    }
  ]
}
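
Before running the tool, it can help to confirm that a policy document grants every action listed above. The following Python sketch does a simplified offline check. It is an illustration under stated assumptions, not part of ECS Task Doctor: missing_actions is a hypothetical helper, and it only looks at Allow statements and Action patterns, ignoring Deny, NotAction, Resource, and Condition elements.

```python
from fnmatch import fnmatchcase
from typing import Any, Dict, Iterable, List

# Actions copied from the policy above.
REQUIRED_ACTIONS = [
    "ecs:DescribeClusters", "ecs:DescribeServices", "ecs:DescribeTasks",
    "ecs:DescribeTaskDefinition", "ecs:ListServices", "ecs:ListTasks",
    "ecs:ListContainerInstances", "ecs:DescribeContainerInstances",
    "logs:DescribeLogStreams", "logs:GetLogEvents",
    "ecr:DescribeRepositories", "ecr:DescribeImages",
    "iam:GetRole",
    "ec2:DescribeSubnets", "ec2:DescribeSecurityGroups",
]


def _as_list(value: Any) -> List[Any]:
    """IAM allows a single string/object where a list is expected."""
    return [value] if isinstance(value, (str, dict)) else list(value)


def missing_actions(
    policy: Dict[str, Any], required: Iterable[str] = REQUIRED_ACTIONS
) -> List[str]:
    """Return required actions not matched by any Allow statement.

    Simplification: Deny, NotAction, Resource, and Condition are ignored.
    """
    allowed_patterns: List[str] = []
    for statement in _as_list(policy.get("Statement", [])):
        if statement.get("Effect") == "Allow":
            allowed_patterns.extend(_as_list(statement.get("Action", [])))
    # IAM action names are case-insensitive and may contain * wildcards.
    return [
        action for action in required
        if not any(
            fnmatchcase(action.lower(), pattern.lower())
            for pattern in allowed_patterns
        )
    ]


if __name__ == "__main__":
    example = {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": ["ecs:*", "logs:GetLogEvents"], "Resource": "*"}
        ],
    }
    print(missing_actions(example))
```

An empty result means every listed action is matched by some Allow pattern; it does not guarantee access, since the ignored policy elements can still restrict it.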

Development

# Install in dev mode
pip install -e '.[dev]'

# Run tests
pytest -v

# Lint
ruff check src/ tests/

See CONTRIBUTING.md for more details.

License

MIT — see LICENSE.
