CLI tool to diagnose why ECS tasks and services are failing

ecs-doctor

Diagnose why your ECS service is failing — in one command.

Designed and built by Praveen Rajkoilraj.

The Problem

ECS troubleshooting today means correlating four separate AWS data sources by hand, every single incident:

ECS DescribeServices events — was there a placement failure? a deployment rollback?
DescribeTasks stoppedReason + container exit codes — OOM? image pull failure? missing secret?
CloudWatch Logs — what was the application printing before it crashed?
ALB target health — is the load balancer even reaching the container?

You're tabbing between four AWS console screens at 2am, each one showing raw data with no correlation, trying to figure out whether it's OOM, a bad image tag, a broken health check path, or a VPC security group blocking the ALB. Every time.

There is currently no open-source tool that aggregates these four signals into a single root-cause report. The AWS CLI, boto3 scripts, and the ECS console expose only the raw data from each AWS API; they do not correlate findings across signals or tell you what to fix.

ecs-doctor does that.

Why This Exists

"It's 2am. PagerDuty woke you up. DesiredCount: 3, RunningCount: 0. You open the ECS console, see 'essential container in task exited', switch to CloudWatch Logs to find the crash, switch to the target group to check health, go back to the service events to see if it's been flapping for 20 minutes or 20 seconds. Thirty minutes later you realize it was a DockerHub rate limit. You've done this exact sequence fifteen times this year."

ecs-doctor runs all four checks in parallel and tells you the most likely root cause with a confidence score and a suggested fix.
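As a rough illustration of what running the checks in parallel can look like, here is a minimal sketch using a thread pool. The function names (`check_events`, `check_stop_reasons`, `check_logs`, `check_alb_health`, `run_all`) and the finding dictionaries are hypothetical stand-ins, not ecs-doctor's actual internals, and the choice of threads is an assumption (the AWS calls are I/O-bound, so threads are a natural fit).

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for the four diagnosers; real ones would call AWS APIs.
def check_events(cluster, service):
    return [{"source": "events", "type": "task_thrash"}]


def check_stop_reasons(cluster, service):
    return [{"source": "stop_reasons", "type": "oom_killed"}]


def check_logs(cluster, service):
    return []


def check_alb_health(cluster, service):
    raise RuntimeError("target group lookup failed")


DIAGNOSERS = {
    "events": check_events,
    "stop_reasons": check_stop_reasons,
    "logs": check_logs,
    "alb_health": check_alb_health,
}


def run_all(cluster, service):
    """Run every diagnoser concurrently; one failure does not stop the others."""
    findings, errors = [], {}
    with ThreadPoolExecutor(max_workers=len(DIAGNOSERS)) as pool:
        futures = {
            name: pool.submit(fn, cluster, service)
            for name, fn in DIAGNOSERS.items()
        }
        for name, future in futures.items():
            try:
                findings.extend(future.result())
            except Exception as exc:  # record the failure and keep going
                errors[name] = str(exc)
    return findings, errors


findings, errors = run_all("my-cluster", "my-service")
print(len(findings), "finding(s); failed:", sorted(errors))
```

Collecting errors per diagnoser rather than letting the first exception propagate means a single broken signal still leaves the other three available for diagnosis.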

Installation

# Recommended: install with pipx for an isolated environment
pipx install ecs-doctor

# Or with pip
pip install ecs-doctor

# Development install (includes test dependencies)
git clone https://github.com/PraveenLuke/ecs-task-doctor
cd ecs-task-doctor
pip install -e ".[dev]"

Usage

ecs-doctor diagnose --cluster my-cluster --service my-service

# Specify region explicitly
ecs-doctor diagnose --cluster my-cluster --service my-service --region us-west-2

# Machine-readable JSON output (for CI, Slack webhooks, etc.)
ecs-doctor diagnose --cluster my-cluster --service my-service --json

AWS Credentials

ecs-doctor uses the standard boto3 credential chain — no custom auth required:

Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN)
AWS named profiles (~/.aws/credentials)
ECS task role / EC2 instance role (when running on AWS infrastructure)

Example Output

────────────────── ECS Task Doctor — prod-cluster / payments-service ──────────────────

╭─ Root Cause ────────────────────────────────────────────────────────────────────────╮
│                                                                                      │
│  Container is being OOM-killed (out of memory)                                       │
│                                                                                      │
│  Confidence: 97%                                                                     │
│                                                                                      │
│  Suggested fix:                                                                      │
│  Increase the container's memory reservation in the task definition.                  │
│  Enable CloudWatch Container Insights to track memory utilization trends.             │
│  Profile the application for memory leaks — common causes include unbounded caches,  │
│  unclosed DB connections, and JVM heap misconfiguration.                              │
│                                                                                      │
╰──────────────────────────────────────────────────────────────────────────────────────╯

╭─ Supporting Evidence ─────────────────────────────────────────────────────────────────╮
│ Source        │ Type          │ Severity │ Message                                    │
│ stop_reasons  │ oom_killed    │ CRITICAL │ Container 'app' OOM-killed (exit 137).      │
│               │               │          │ stoppedReason: Essential container in task  │
│               │               │          │ exited (3 tasks affected)                  │
│ logs          │ log_crash_sig │ CRITICAL │ [app] OOM in logs detected in logs         │
│               │               │          │ (task abc123)                              │
│ events        │ task_thrash   │ CRITICAL │ Crash loop detected: 4 start(s) and        │
│               │               │          │ 4 stop(s) in the last 20 events.           │
╰───────────────────────────────────────────────────────────────────────────────────────╯

(1 additional finding(s) not shown above — run with --json to see all.)
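The crash-loop line in the evidence table ("4 start(s) and 4 stop(s) in the last 20 events") suggests a simple counting heuristic over recent service events. The sketch below shows one way such a check could work. The message patterns are modelled on typical ECS service event text, and the function name, window, and threshold are assumptions for illustration, not the values ecs-doctor uses.

```python
import re

# Message shapes modelled on typical ECS service events; treat them as illustrative.
STARTED = re.compile(r"has started \d+ tasks")
STOPPED = re.compile(r"has stopped \d+ running tasks")


def detect_crash_loop(messages, window=20, threshold=3):
    """Return (is_looping, starts, stops) for the first `window` messages.

    Assumes `messages` is ordered newest first. `threshold` is a hypothetical
    cut-off for how many starts and stops count as a loop.
    """
    recent = messages[:window]
    starts = sum(1 for m in recent if STARTED.search(m))
    stops = sum(1 for m in recent if STOPPED.search(m))
    return (starts >= threshold and stops >= threshold), starts, stops


events = [
    "(service payments-service) has started 1 tasks: (task abc123).",
    "(service payments-service) has stopped 1 running tasks: (task abc123).",
] * 4

print(detect_crash_loop(events))
```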

JSON output (`--json`)

{
  "cluster": "prod-cluster",
  "service": "payments-service",
  "region": "us-east-1",
  "root_cause": {
    "cause": "Container is being OOM-killed (out of memory)",
    "confidence": 0.97,
    "suggested_fix": "Increase the container's memory reservation...",
    "evidence": [...]
  },
  "all_findings": [...]
}
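To show how the JSON report might be consumed in CI or a chat notification, here is a small sketch that turns it into a one-line summary and a pass/fail flag. It relies only on the fields shown in the example above. The `summarize` function and the `min_confidence` threshold are hypothetical, and the sample assumes `root_cause` is always present, which may not hold for a healthy service.

```python
import json

# Sample report shaped like the --json example above; the evidence lists are left empty.
SAMPLE = """
{
  "cluster": "prod-cluster",
  "service": "payments-service",
  "region": "us-east-1",
  "root_cause": {
    "cause": "Container is being OOM-killed (out of memory)",
    "confidence": 0.97,
    "suggested_fix": "Increase the container's memory reservation...",
    "evidence": []
  },
  "all_findings": []
}
"""


def summarize(report_text, min_confidence=0.8):
    """Return (one-line summary, whether confidence clears the threshold)."""
    report = json.loads(report_text)
    root = report["root_cause"]
    line = (
        f"{report['cluster']}/{report['service']}: "
        f"{root['cause']} ({root['confidence']:.0%} confidence)"
    )
    return line, root["confidence"] >= min_confidence


line, confident = summarize(SAMPLE)
print(line)
```

In practice the report text would come from the command's standard output rather than an embedded string.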

Diagnostic Checks

ecs-doctor runs four diagnosers and feeds their findings into a root-cause aggregator:

Diagnoser	AWS API	What it catches
events	`ecs:DescribeServices`	Placement failures, health check failures, deployment rollbacks, crash loops
stop_reasons	`ecs:ListTasks`, `ecs:DescribeTasks`	OOM kills (exit 137), segmentation faults (exit 139), image pull failures, missing secrets (`ResourceInitializationError`), non-zero exits, premature exits (exit 0), SIGTERM not handled (exit 143)
logs	`logs:GetLogEvents`	Python/Java/Go/Node tracebacks, connection refused, DNS failures, TLS errors, wrong CPU arch (`exec format error`), missing files/binaries, DB fatal errors
alb_health	`elasticloadbalancing:DescribeTargetHealth`	Unhealthy targets — timeout, connection refused, non-2xx health check response
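The exit codes in the stop_reasons row follow the usual 128 + signal-number convention for processes killed by a signal. A minimal sketch of that mapping is below. The function name and label strings are hypothetical, and an exit code alone does not prove an OOM kill, so a real classifier would presumably also look at the task's stopped reason.

```python
# Exit codes above 128 conventionally mean "killed by signal (code - 128)".
SIGNAL_LABELS = {
    137: "killed by SIGKILL (128 + 9), typical of an OOM kill",
    139: "killed by SIGSEGV (128 + 11), a segmentation fault",
    143: "killed by SIGTERM (128 + 15), no graceful shutdown",
}


def classify_exit_code(code):
    """Map a container exit code to a coarse, human-readable label."""
    if code == 0:
        return "exited cleanly, which is premature for a long-running service"
    return SIGNAL_LABELS.get(code, f"non-zero exit ({code})")


for code in (0, 1, 137, 139, 143):
    print(code, "->", classify_exit_code(code))
```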

Root Cause Categories

The aggregator maps findings to these root causes, ranked by confidence:

Container is being OOM-killed
ECS cannot pull the container image (registry auth, rate limit, wrong tag)
Task cannot initialize — secret or config resource missing or inaccessible
Insufficient cluster capacity (placement failure)
ALB targets unhealthy
Container/ALB health checks failing
Deployment failed — circuit breaker triggered
Application crash-looping
Application exiting with non-zero code
Container not handling SIGTERM (graceful shutdown failure)
Application crash signature in logs

Required IAM Permissions

Grant these permissions to the IAM role or user running ecs-doctor:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecs:DescribeServices",
        "ecs:DescribeTasks",
        "ecs:ListTasks",
        "ecs:DescribeTaskDefinition"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:GetLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:log-group:/ecs/*:*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "elasticloadbalancing:DescribeTargetHealth"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "sts:GetCallerIdentity"
      ],
      "Resource": "*"
    }
  ]
}

Permission handling: If any permission is missing, ecs-doctor catches the AccessDenied error, tells you exactly which IAM action and resource ARN to add, and continues running the remaining diagnosers — it never crashes on a missing permission.
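One way to get this "report and continue" behaviour is to wrap each diagnoser call and inspect the error code on failure. The sketch below is a simplified illustration, not ecs-doctor's code: `FakeClientError` is a stand-in that mimics the `response["Error"]["Code"]` shape of botocore's `ClientError` so the example runs offline, the helper names are hypothetical, and it reports only the missing action, not the resource ARN.

```python
ACCESS_DENIED_CODES = {"AccessDenied", "AccessDeniedException", "UnauthorizedOperation"}


def is_access_denied(exc):
    """True if `exc` carries a botocore-style response with an access-denied code."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code") in ACCESS_DENIED_CODES


def run_diagnosers(diagnosers):
    """Run each (name, iam_action, fn); collect findings and missing IAM actions."""
    findings, missing = [], []
    for name, iam_action, fn in diagnosers:
        try:
            findings.extend(fn())
        except Exception as exc:
            if not is_access_denied(exc):
                raise  # unrelated errors are not swallowed
            missing.append(iam_action)
    return findings, missing


class FakeClientError(Exception):
    """Stand-in for botocore.exceptions.ClientError, used to keep the sketch offline."""

    def __init__(self, code):
        super().__init__(code)
        self.response = {"Error": {"Code": code}}


def denied_logs():
    raise FakeClientError("AccessDeniedException")


findings, missing = run_diagnosers([
    ("events", "ecs:DescribeServices", lambda: ["placement_failure"]),
    ("logs", "logs:GetLogEvents", denied_logs),
])
print(findings, missing)
```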

Roadmap

IAM policy auto-generator — output a ready-to-apply IAM policy statement for the exact resources diagnosed
Slack / webhook output — --webhook <url> to post findings to a Slack channel or incident management system
Multi-service batch scan — ecs-doctor scan --cluster my-cluster to check all services in a cluster
--watch mode — poll and re-diagnose every N seconds until the service is healthy
CloudWatch Container Insights integration — pull memory and CPU utilization metrics to support OOM diagnosis
ECS Exec integration — optionally open a shell into a failing container for live debugging
Cost impact report — estimate how much a crash-looping service has cost during the incident window
GitHub Actions output format — emit findings as GitHub annotations

Development

Requires Python 3.12+.

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

Project Structure

ecs_doctor/
├── cli.py              # Click CLI entrypoint + rich renderer
├── models.py           # Finding, RootCause dataclasses
├── aggregator.py       # Root-cause scoring and ranking
└── diagnosers/
    ├── events.py       # ECS service events parser
    ├── stop_reasons.py # Task stop reason classifier
    ├── logs.py         # CloudWatch log crash pattern matcher
    └── alb_health.py   # ALB target health checker

License

MIT — see LICENSE.

Project details

Release history

0.2.0, Jun 23, 2026
0.1.1, Jun 23, 2026 (this version)

Download files

Source Distribution

ecs_doctor-0.1.1.tar.gz (30.1 kB)

Uploaded Jun 23, 2026 Source

Built Distribution

ecs_doctor-0.1.1-py3-none-any.whl (22.3 kB)

Uploaded Jun 23, 2026 Python 3

File details

Details for the file ecs_doctor-0.1.1.tar.gz.

File metadata

Download URL: ecs_doctor-0.1.1.tar.gz
Upload date: Jun 23, 2026
Size: 30.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ecs_doctor-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`2ec7c0f87bb4d7eadbb1607f6446753432623a31252266889594806d58efc31f`
MD5	`da1fb64fc9b7fefa436a957cca3d7326`
BLAKE2b-256	`da8323de9484198387890baf3776cfa86ed6a79d002d4cd0aee5afeda08cd098`

Provenance

The following attestation bundles were made for ecs_doctor-0.1.1.tar.gz:

Publisher: release.yml on PraveenLuke/ecs-doctor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ecs_doctor-0.1.1.tar.gz
- Subject digest: 2ec7c0f87bb4d7eadbb1607f6446753432623a31252266889594806d58efc31f
- Sigstore transparency entry: 1927287625
- Sigstore integration time: Jun 23, 2026
Source repository:
- Permalink: PraveenLuke/ecs-doctor@b907d28a7e9d2d47527a787c24dfe2464b1b0d01
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/PraveenLuke
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@b907d28a7e9d2d47527a787c24dfe2464b1b0d01
- Trigger Event: push

File details

Details for the file ecs_doctor-0.1.1-py3-none-any.whl.

File metadata

Download URL: ecs_doctor-0.1.1-py3-none-any.whl
Upload date: Jun 23, 2026
Size: 22.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ecs_doctor-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0bb06313a53a9267184526c5775f7f6f7e6c1f71eedb14dd6002694c467f6b5f`
MD5	`8bd5dac84a3bb864fb5c85e69f7238aa`
BLAKE2b-256	`25803ef6e630a13b7c447e95e70b45b6e6ecb941afcd1c99e63ba8d46497c6c0`

Provenance

The following attestation bundles were made for ecs_doctor-0.1.1-py3-none-any.whl:

Publisher: release.yml on PraveenLuke/ecs-doctor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ecs_doctor-0.1.1-py3-none-any.whl
- Subject digest: 0bb06313a53a9267184526c5775f7f6f7e6c1f71eedb14dd6002694c467f6b5f
- Sigstore transparency entry: 1927287855
- Sigstore integration time: Jun 23, 2026
Source repository:
- Permalink: PraveenLuke/ecs-doctor@b907d28a7e9d2d47527a787c24dfe2464b1b0d01
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/PraveenLuke
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@b907d28a7e9d2d47527a787c24dfe2464b1b0d01
- Trigger Event: push

ecs-doctor 0.1.1
