
CLI tool to diagnose why ECS tasks and services are failing

ecs-doctor

Diagnose why your ECS service is failing — in one command.

Designed and built by Praveen Rajkoilraj.


The Problem

ECS troubleshooting today means manually correlating four separate AWS data sources during every incident:

  1. ECS DescribeServices events — was there a placement failure? a deployment rollback?
  2. DescribeTasks stoppedReason + container exit codes — OOM? image pull failure? missing secret?
  3. CloudWatch Logs — what was the application printing before it crashed?
  4. ALB target health — is the load balancer even reaching the container?

You're tabbing between four AWS console screens at 2am, each one showing raw data with no correlation, trying to figure out whether it's OOM, a bad image tag, a broken health check path, or a VPC security group blocking the ALB. Every time.

There is currently no open-source tool that aggregates these four signals into a single root-cause report. The AWS CLI, boto3 scripts, and the ECS console only expose raw data per service — they do not correlate findings across signals or tell you what to fix.

ecs-doctor does that.


Why This Exists

"It's 2am. PagerDuty woke you up. DesiredCount: 3, RunningCount: 0. You open the ECS console, see 'essential container in task exited', switch to CloudWatch Logs to find the crash, switch to the target group to check health, go back to the service events to see if it's been flapping for 20 minutes or 20 seconds. Thirty minutes later you realize it was a DockerHub rate limit. You've done this exact sequence fifteen times this year."

ecs-doctor runs all four checks in parallel and tells you the most likely root cause with a confidence score and a suggested fix.
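
As a rough illustration of the "four checks in parallel" idea, the sketch below fans four stand-in diagnosers out over a thread pool and collects their findings by source name. The function names (`check_events`, `run_all`, and so on) and the string findings are hypothetical placeholders, not ecs-doctor's actual internals; real diagnosers would call the AWS APIs listed later in this document.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical stand-ins for the four diagnosers; real ones would call AWS APIs.
def check_events() -> list[str]:
    return ["task_thrash"]

def check_stop_reasons() -> list[str]:
    return ["oom_killed"]

def check_logs() -> list[str]:
    return ["log_crash_sig"]

def check_alb_health() -> list[str]:
    return []

DIAGNOSERS: dict[str, Callable[[], list[str]]] = {
    "events": check_events,
    "stop_reasons": check_stop_reasons,
    "logs": check_logs,
    "alb_health": check_alb_health,
}

def run_all(diagnosers: dict[str, Callable[[], list[str]]]) -> dict[str, list[str]]:
    """Run every diagnoser on its own thread and collect findings by source name."""
    with ThreadPoolExecutor(max_workers=len(diagnosers)) as pool:
        futures = {name: pool.submit(fn) for name, fn in diagnosers.items()}
        # result() blocks until that diagnoser has finished.
        return {name: future.result() for name, future in futures.items()}

findings = run_all(DIAGNOSERS)
```

Threads suit this workload because each check mostly waits on network calls, so the total wall-clock time approaches that of the slowest check rather than the sum of all four.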


Installation

# Recommended: install with pipx for an isolated environment
pipx install ecs-doctor

# Or with pip
pip install ecs-doctor

# Development install (includes test dependencies)
git clone https://github.com/PraveenLuke/ecs-task-doctor
cd ecs-task-doctor
pip install -e ".[dev]"

Usage

ecs-doctor diagnose --cluster my-cluster --service my-service

# Specify region explicitly
ecs-doctor diagnose --cluster my-cluster --service my-service --region us-west-2

# Machine-readable JSON output (for CI, Slack webhooks, etc.)
ecs-doctor diagnose --cluster my-cluster --service my-service --json

AWS Credentials

ecs-doctor uses the standard boto3 credential chain — no custom auth required:

  1. Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN)
  2. AWS named profiles (~/.aws/credentials)
  3. ECS task role / EC2 instance role (when running on AWS infrastructure)

Example Output

────────────────── ECS Task Doctor — prod-cluster / payments-service ──────────────────

╭─ Root Cause ────────────────────────────────────────────────────────────────────────╮
│                                                                                      │
│  Container is being OOM-killed (out of memory)                                       │
│                                                                                      │
│  Confidence: 97%                                                                     │
│                                                                                      │
│  Suggested fix:                                                                      │
│  Increase the container's memory reservation in the task definition.                  │
│  Enable CloudWatch Container Insights to track memory utilization trends.             │
│  Profile the application for memory leaks — common causes include unbounded caches,  │
│  unclosed DB connections, and JVM heap misconfiguration.                              │
│                                                                                      │
╰──────────────────────────────────────────────────────────────────────────────────────╯

╭─ Supporting Evidence ─────────────────────────────────────────────────────────────────╮
│ Source        │ Type          │ Severity │ Message                                    │
│ stop_reasons  │ oom_killed    │ CRITICAL │ Container 'app' OOM-killed (exit 137).      │
│               │               │          │ stoppedReason: Essential container in task  │
│               │               │          │ exited (3 tasks affected)                  │
│ logs          │ log_crash_sig │ CRITICAL │ [app] OOM in logs detected in logs         │
│               │               │          │ (task abc123)                              │
│ events        │ task_thrash   │ CRITICAL │ Crash loop detected: 4 start(s) and        │
│               │               │          │ 4 stop(s) in the last 20 events.           │
╰───────────────────────────────────────────────────────────────────────────────────────╯

(1 additional finding(s) not shown above — run with --json to see all.)

JSON output (--json)

{
  "cluster": "prod-cluster",
  "service": "payments-service",
  "region": "us-east-1",
  "root_cause": {
    "cause": "Container is being OOM-killed (out of memory)",
    "confidence": 0.97,
    "suggested_fix": "Increase the container's memory reservation...",
    "evidence": [...]
  },
  "all_findings": [...]
}
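
A CI job or webhook relay could consume this JSON along the following lines. This is a minimal sketch: the `summarize` helper and the 0.8 alert threshold are assumptions made for illustration, the sample report fills the elided lists with empty arrays, and tolerating a missing `root_cause` is a guess about what a healthy service might report.

```python
import json

# Sample report shaped like the JSON output above (elided lists left empty).
SAMPLE = """
{
  "cluster": "prod-cluster",
  "service": "payments-service",
  "region": "us-east-1",
  "root_cause": {
    "cause": "Container is being OOM-killed (out of memory)",
    "confidence": 0.97,
    "suggested_fix": "Increase the container's memory reservation...",
    "evidence": []
  },
  "all_findings": []
}
"""

def summarize(report_text: str, threshold: float = 0.8) -> tuple[bool, str]:
    """Return (should_alert, one-line summary) for a JSON report."""
    report = json.loads(report_text)
    # Assumption: root_cause may be absent or null when nothing is wrong.
    root = report.get("root_cause") or {}
    confidence = float(root.get("confidence", 0.0))
    cause = root.get("cause", "no root cause")
    line = f"{report['cluster']}/{report['service']}: {cause} ({confidence:.0%})"
    return confidence >= threshold, line

should_alert, line = summarize(SAMPLE)
print(line)
```

In a pipeline, the report text would come from the stdout of `ecs-doctor diagnose ... --json` rather than from a literal string.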

Diagnostic Checks

ecs-doctor runs four diagnosers and feeds their findings into a root-cause aggregator:

| Diagnoser | AWS API | What it catches |
|---|---|---|
| events | ecs:DescribeServices | Placement failures, health check failures, deployment rollbacks, crash loops |
| stop_reasons | ecs:ListTasks, ecs:DescribeTasks | OOM kills (exit 137), segfaults (exit 139), image pull failures, missing secrets (ResourceInitializationError), non-zero exits, premature exits (exit 0), SIGTERM not handled (exit 143) |
| logs | logs:GetLogEvents | Python/Java/Go/Node tracebacks, connection refused, DNS failures, TLS errors, wrong CPU arch (exec format error), missing files/binaries, DB fatal errors |
| alb_health | elasticloadbalancing:DescribeTargetHealth | Unhealthy targets: timeout, connection refused, non-2xx health check response |

Root Cause Categories

The aggregator maps findings to these root causes, ranked by confidence:

  • Container is being OOM-killed
  • ECS cannot pull the container image (registry auth, rate limit, wrong tag)
  • Task cannot initialize — secret or config resource missing or inaccessible
  • Insufficient cluster capacity (placement failure)
  • ALB targets unhealthy
  • Container/ALB health checks failing
  • Deployment failed — circuit breaker triggered
  • Application crash-looping
  • Application exiting with non-zero code
  • Container not handling SIGTERM (graceful shutdown failure)
  • Application crash signature in logs
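
One plausible way to rank causes by confidence is to give each finding type a base score and add a small boost when independent diagnosers agree. The sketch below does that under invented assumptions: the `Finding` fields, the `log_oom` type, the base scores, and the 0.05 corroboration boost are all hypothetical and do not describe the scoring ecs-doctor actually uses.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    source: str    # diagnoser that produced it, e.g. "stop_reasons"
    type: str      # finding type, e.g. "oom_killed"
    severity: str  # e.g. "CRITICAL"

# Hypothetical mapping of finding type -> (root cause, base score).
CAUSE_FOR_TYPE = {
    "oom_killed": ("Container is being OOM-killed", 0.8),
    "log_oom": ("Container is being OOM-killed", 0.6),
    "task_thrash": ("Application crash-looping", 0.5),
}

def rank_causes(findings: list[Finding]) -> list[tuple[str, float]]:
    """Score each candidate cause and sort from most to least likely."""
    base: dict[str, float] = {}
    sources: dict[str, set[str]] = {}
    for finding in findings:
        if finding.type not in CAUSE_FOR_TYPE:
            continue  # unknown finding types contribute nothing
        cause, score = CAUSE_FOR_TYPE[finding.type]
        base[cause] = max(base.get(cause, 0.0), score)
        sources.setdefault(cause, set()).add(finding.source)
    ranked = [
        # Boost by 0.05 for each additional diagnoser that corroborates the cause.
        (cause, round(min(0.99, score + 0.05 * (len(sources[cause]) - 1)), 2))
        for cause, score in base.items()
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)

ranked = rank_causes([
    Finding("stop_reasons", "oom_killed", "CRITICAL"),
    Finding("logs", "log_oom", "CRITICAL"),
    Finding("events", "task_thrash", "CRITICAL"),
])
```

Here the OOM cause ends up on top because two different diagnosers point at it, while the crash-loop finding is treated as a symptom with a lower score.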

Required IAM Permissions

Grant these permissions to the IAM role or user running ecs-doctor:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecs:DescribeServices",
        "ecs:DescribeTasks",
        "ecs:ListTasks",
        "ecs:DescribeTaskDefinition"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:GetLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:log-group:/ecs/*:*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "elasticloadbalancing:DescribeTargetHealth"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "sts:GetCallerIdentity"
      ],
      "Resource": "*"
    }
  ]
}

Permission handling: If any permission is missing, ecs-doctor catches the AccessDenied error, tells you exactly which IAM action and resource ARN to add, and continues running the remaining diagnosers — it never crashes on a missing permission.
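
The "report and continue" behaviour described above can be sketched as a loop that isolates each diagnoser's failure. To keep the example free of AWS dependencies it uses a stand-in `AccessDenied` exception; in real boto3 code the error would arrive as `botocore.exceptions.ClientError`, with the error code available under `err.response["Error"]["Code"]`. The function names and message format are assumptions made for illustration.

```python
from typing import Callable

class AccessDenied(Exception):
    """Stand-in for an AWS access-denied error (hypothetical, for illustration)."""

    def __init__(self, action: str, resource: str) -> None:
        super().__init__(f"not authorized to perform {action} on {resource}")
        self.action = action
        self.resource = resource

def denied_logs() -> list[str]:
    # Simulates a diagnoser whose IAM permission is missing.
    raise AccessDenied("logs:GetLogEvents", "arn:aws:logs:*:*:log-group:/ecs/*:*")

def ok_events() -> list[str]:
    return ["task_thrash"]

def run_tolerant(
    diagnosers: dict[str, Callable[[], list[str]]],
) -> tuple[dict[str, list[str]], list[str]]:
    """Run each diagnoser; record missing permissions instead of aborting."""
    findings: dict[str, list[str]] = {}
    missing: list[str] = []
    for name, fn in diagnosers.items():
        try:
            findings[name] = fn()
        except AccessDenied as err:
            missing.append(f"{name}: add {err.action} on {err.resource}")
    return findings, missing

findings, missing = run_tolerant({"logs": denied_logs, "events": ok_events})
```

Only the access-denied case is swallowed here; any other exception still propagates, so genuine bugs are not silently hidden.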


Roadmap

  • IAM policy auto-generator — output a ready-to-apply IAM policy statement for the exact resources diagnosed
  • Slack / webhook output: --webhook <url> to post findings to a Slack channel or incident management system
  • Multi-service batch scan: ecs-doctor scan --cluster my-cluster to check all services in a cluster
  • --watch mode — poll and re-diagnose every N seconds until the service is healthy
  • CloudWatch Container Insights integration — pull memory and CPU utilization metrics to support OOM diagnosis
  • ECS Exec integration — optionally open a shell into a failing container for live debugging
  • Cost impact report — estimate how much a crash-looping service has cost during the incident window
  • GitHub Actions output format — emit findings as GitHub annotations

Development

Requires Python 3.12+.

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

Project Structure

ecs_doctor/
├── cli.py              # Click CLI entrypoint + rich renderer
├── models.py           # Finding, RootCause dataclasses
├── aggregator.py       # Root-cause scoring and ranking
└── diagnosers/
    ├── events.py       # ECS service events parser
    ├── stop_reasons.py # Task stop reason classifier
    ├── logs.py         # CloudWatch log crash pattern matcher
    └── alb_health.py   # ALB target health checker

License

MIT — see LICENSE.
