CLI tool to diagnose why ECS tasks and services are failing
ecs-doctor
Diagnose why your ECS service is failing — in one command.
Designed and built by Praveen Rajkoilraj.
The Problem
ECS troubleshooting today means correlating four separate AWS data sources by hand during every incident:
- ECS DescribeServices events — was there a placement failure? a deployment rollback?
- DescribeTasks stoppedReason + container exit codes — OOM? image pull failure? missing secret?
- CloudWatch Logs — what was the application printing before it crashed?
- ALB target health — is the load balancer even reaching the container?
You're tabbing between four AWS console screens at 2am, each one showing raw data with no correlation, trying to figure out whether it's OOM, a bad image tag, a broken health check path, or a VPC security group blocking the ALB. Every time.
There is currently no open-source tool that aggregates these four signals into a single root-cause report. The AWS CLI, boto3 scripts, and the ECS console only expose raw data per service — they do not correlate findings across signals or tell you what to fix.
ecs-doctor does that.
Why This Exists
"It's 2am. PagerDuty woke you up.
DesiredCount: 3, RunningCount: 0. You open the ECS console, see 'essential container in task exited', switch to CloudWatch Logs to find the crash, switch to the target group to check health, go back to the service events to see if it's been flapping for 20 minutes or 20 seconds. Thirty minutes later you realize it was a DockerHub rate limit. You've done this exact sequence fifteen times this year."
ecs-doctor runs all four checks in parallel and tells you the most likely root cause with a confidence score and a suggested fix.
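As a rough illustration of running the four checks in parallel, here is a minimal sketch built on a thread pool. The diagnoser functions, their return values, and `run_all` are hypothetical stand-ins, not ecs-doctor's actual code; the real diagnosers call AWS APIs.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the four diagnosers. Each returns a list of
# finding strings; the real ones would query AWS.
def check_events(cluster, service):
    return [f"events: crash loop on {service}"]

def check_stop_reasons(cluster, service):
    return [f"stop_reasons: exit 137 in {cluster}"]

def check_logs(cluster, service):
    return []

def check_alb_health(cluster, service):
    # Simulate one diagnoser failing so the isolation is visible.
    raise RuntimeError("target group lookup failed")

DIAGNOSERS = {
    "events": check_events,
    "stop_reasons": check_stop_reasons,
    "logs": check_logs,
    "alb_health": check_alb_health,
}

def run_all(cluster, service):
    """Run every diagnoser concurrently; one failure does not hide the others."""
    findings, errors = {}, {}
    with ThreadPoolExecutor(max_workers=len(DIAGNOSERS)) as pool:
        futures = {
            name: pool.submit(fn, cluster, service)
            for name, fn in DIAGNOSERS.items()
        }
        for name, future in futures.items():
            try:
                findings[name] = future.result()
            except Exception as exc:  # keep going; record what failed
                errors[name] = str(exc)
    return findings, errors

findings, errors = run_all("prod-cluster", "payments-service")
```

Threads suit this kind of work because each check spends most of its time waiting on a network call rather than computing.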
Installation
# Recommended: install with pipx for an isolated environment
pipx install ecs-doctor
# Or with pip
pip install ecs-doctor
# Development install (includes test dependencies)
git clone https://github.com/PraveenLuke/ecs-task-doctor
cd ecs-task-doctor
pip install -e ".[dev]"
Usage
ecs-doctor diagnose --cluster my-cluster --service my-service
# Specify region explicitly
ecs-doctor diagnose --cluster my-cluster --service my-service --region us-west-2
# Machine-readable JSON output (for CI, Slack webhooks, etc.)
ecs-doctor diagnose --cluster my-cluster --service my-service --json
AWS Credentials
ecs-doctor uses the standard boto3 credential chain — no custom auth required:
- Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN)
- AWS named profiles (~/.aws/credentials)
- ECS task role / EC2 instance role (when running on AWS infrastructure)
Example Output
────────────────── ECS Task Doctor — prod-cluster / payments-service ──────────────────
╭─ Root Cause ────────────────────────────────────────────────────────────────────────╮
│ │
│ Container is being OOM-killed (out of memory) │
│ │
│ Confidence: 97% │
│ │
│ Suggested fix: │
│ Increase the container's memory reservation in the task definition. │
│ Enable CloudWatch Container Insights to track memory utilization trends. │
│ Profile the application for memory leaks — common causes include unbounded caches, │
│ unclosed DB connections, and JVM heap misconfiguration. │
│ │
╰──────────────────────────────────────────────────────────────────────────────────────╯
╭─ Supporting Evidence ─────────────────────────────────────────────────────────────────╮
│ Source │ Type │ Severity │ Message │
│ stop_reasons │ oom_killed │ CRITICAL │ Container 'app' OOM-killed (exit 137). │
│ │ │ │ stoppedReason: Essential container in task │
│ │ │ │ exited (3 tasks affected) │
│ logs │ log_crash_sig │ CRITICAL │ [app] OOM in logs detected in logs │
│ │ │ │ (task abc123) │
│ events │ task_thrash │ CRITICAL │ Crash loop detected: 4 start(s) and │
│ │ │ │ 4 stop(s) in the last 20 events. │
╰───────────────────────────────────────────────────────────────────────────────────────╯
(1 additional finding(s) not shown above — run with --json to see all.)
JSON output (--json)
{
  "cluster": "prod-cluster",
  "service": "payments-service",
  "region": "us-east-1",
  "root_cause": {
    "cause": "Container is being OOM-killed (out of memory)",
    "confidence": 0.97,
    "suggested_fix": "Increase the container's memory reservation...",
    "evidence": [...]
  },
  "all_findings": [...]
}
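For CI or webhook use, a script can read this report and decide whether to escalate. The sketch below assumes only the keys shown in the example above; `should_page` and the 0.9 threshold are hypothetical choices for illustration, not part of ecs-doctor.

```python
import json

def should_page(report_text, threshold=0.9):
    """Return (page?, one-line summary) for a report shaped like the example."""
    report = json.loads(report_text)
    root = report.get("root_cause") or {}
    confidence = float(root.get("confidence", 0.0))
    summary = (
        f"{report.get('service', '?')}: "
        f"{root.get('cause', 'no root cause found')} ({confidence:.0%})"
    )
    return confidence >= threshold, summary

# Sample input mirroring the documented fields (elided fields left out).
sample = json.dumps({
    "cluster": "prod-cluster",
    "service": "payments-service",
    "region": "us-east-1",
    "root_cause": {
        "cause": "Container is being OOM-killed (out of memory)",
        "confidence": 0.97,
    },
})

page, summary = should_page(sample)
```

In practice the input would come from the output of `ecs-doctor diagnose ... --json` rather than from an inline sample.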
Diagnostic Checks
ecs-doctor runs four diagnosers and feeds their findings into a root-cause aggregator:
| Diagnoser | AWS API | What it catches |
|---|---|---|
| events | ecs:DescribeServices | Placement failures, health check failures, deployment rollbacks, crash loops |
| stop_reasons | ecs:ListTasks, ecs:DescribeTasks | OOM (exit 137/139), image pull failures, missing secrets (ResourceInitializationError), non-zero exits, premature exits (exit 0), SIGTERM not handled (exit 143) |
| logs | logs:GetLogEvents | Python/Java/Go/Node tracebacks, connection refused, DNS failures, TLS errors, wrong CPU arch (exec format error), missing files/binaries, DB fatal errors |
| alb_health | elasticloadbalancing:DescribeTargetHealth | Unhealthy targets: timeout, connection refused, non-2xx health check response |
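To make the stop_reasons row concrete, here is a minimal sketch of classifying a stopped container from its exit code and stoppedReason text. It follows the groupings in the table; the function name and every label other than `oom_killed` are hypothetical, and the real classifier may order or weigh these checks differently. Exit codes above 128 conventionally mean 128 plus a signal number, so 137 is SIGKILL, 139 is SIGSEGV, and 143 is SIGTERM.

```python
def classify_exit(exit_code, stopped_reason=""):
    """Map an exit code and stoppedReason string to a finding type (sketch)."""
    reason = stopped_reason.lower()
    # Stop reasons that occur before the container ever runs take priority.
    if "cannotpullcontainererror" in reason:
        return "image_pull_failure"
    if "resourceinitializationerror" in reason:
        return "resource_init_failure"
    if exit_code is None:
        return "unknown"
    if exit_code in (137, 139):   # grouped as OOM, following the table above
        return "oom_killed"
    if exit_code == 143:          # 128 + SIGTERM(15)
        return "sigterm_not_handled"
    if exit_code == 0:            # a service container should not exit cleanly
        return "premature_exit"
    return "non_zero_exit"

labels = [classify_exit(code) for code in (137, 143, 0, 1)]
```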
Root Cause Categories
The aggregator maps findings to these root causes, ranked by confidence:
- Container is being OOM-killed
- ECS cannot pull the container image (registry auth, rate limit, wrong tag)
- Task cannot initialize — secret or config resource missing or inaccessible
- Insufficient cluster capacity (placement failure)
- ALB targets unhealthy
- Container/ALB health checks failing
- Deployment failed — circuit breaker triggered
- Application crash-looping
- Application exiting with non-zero code
- Container not handling SIGTERM (graceful shutdown failure)
- Application crash signature in logs
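One way to picture ranking by confidence is a sketch like the following. The `Finding` fields mirror the columns of the Supporting Evidence table, but the weights, the corroboration bonus, the `log_oom` type, and `rank_causes` itself are assumptions made for illustration; the actual scoring in aggregator.py is not documented here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    source: str    # which diagnoser produced it
    type: str      # e.g. "oom_killed"
    severity: str
    message: str

# Hypothetical mapping of finding type -> (root cause, base weight).
CAUSES = {
    "oom_killed": ("Container is being OOM-killed", 0.6),
    "log_oom": ("Container is being OOM-killed", 0.3),
    "task_thrash": ("Application crash-looping", 0.4),
}
CORROBORATION_BONUS = 0.15  # assumed boost per extra independent source

def rank_causes(findings):
    """Score each cause; agreement across diagnosers raises confidence."""
    base, sources = {}, {}
    for f in findings:
        if f.type not in CAUSES:
            continue
        cause, weight = CAUSES[f.type]
        base[cause] = max(base.get(cause, 0.0), weight)
        sources.setdefault(cause, set()).add(f.source)
    ranked = []
    for cause, weight in base.items():
        extra = len(sources[cause]) - 1
        score = min(0.99, weight + CORROBORATION_BONUS * extra)
        ranked.append((cause, round(score, 2)))
    return sorted(ranked, key=lambda item: item[1], reverse=True)

ranked = rank_causes([
    Finding("stop_reasons", "oom_killed", "CRITICAL", "exit 137"),
    Finding("logs", "log_oom", "CRITICAL", "OOM in logs"),
    Finding("events", "task_thrash", "CRITICAL", "4 starts, 4 stops"),
])
```

The point of the sketch is the shape of the idea: a cause confirmed by two diagnosers outranks one reported by a single source.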
Required IAM Permissions
Grant these permissions to the IAM role or user running ecs-doctor:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecs:DescribeServices",
        "ecs:DescribeTasks",
        "ecs:ListTasks",
        "ecs:DescribeTaskDefinition"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:GetLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:log-group:/ecs/*:*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "elasticloadbalancing:DescribeTargetHealth"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "sts:GetCallerIdentity"
      ],
      "Resource": "*"
    }
  ]
}
Permission handling: If any permission is missing, ecs-doctor catches the AccessDenied error, tells you exactly which IAM action and resource ARN to add, and continues running the remaining diagnosers — it never crashes on a missing permission.
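The continue-on-denied behavior described above can be sketched as follows. `AccessDenied` here is a hypothetical stand-in exception so the example runs without AWS access; with boto3 the equivalent signal would typically be a `botocore.exceptions.ClientError` whose error code indicates access was denied. The function and message format are assumptions, not ecs-doctor's actual output.

```python
class AccessDenied(Exception):
    """Hypothetical stand-in for an AWS access-denied error."""
    def __init__(self, action, resource):
        super().__init__(f"not authorized to perform {action} on {resource}")
        self.action = action
        self.resource = resource

def run_tolerant(diagnosers):
    """Run each diagnoser; turn permission errors into actionable hints."""
    findings, missing = [], []
    for name, fn in diagnosers.items():
        try:
            findings.extend(fn())
        except AccessDenied as exc:
            # Record what to grant, then carry on with the remaining checks.
            missing.append(f"{name}: add {exc.action} on {exc.resource}")
    return findings, missing

def events_ok():
    return ["events: crash loop"]

def logs_denied():
    raise AccessDenied("logs:GetLogEvents", "arn:aws:logs:*:*:log-group:/ecs/*:*")

findings, missing = run_tolerant({"events": events_ok, "logs": logs_denied})
```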
Roadmap
- IAM policy auto-generator: output a ready-to-apply IAM policy statement for the exact resources diagnosed
- Slack / webhook output: --webhook <url> to post findings to a Slack channel or incident management system
- Multi-service batch scan: ecs-doctor scan --cluster my-cluster to check all services in a cluster
- --watch mode: poll and re-diagnose every N seconds until the service is healthy
- CloudWatch Container Insights integration: pull memory and CPU utilization metrics to support OOM diagnosis
- ECS Exec integration: optionally open a shell into a failing container for live debugging
- Cost impact report: estimate how much a crash-looping service has cost during the incident window
- GitHub Actions output format: emit findings as GitHub annotations
Development
Requires Python 3.12+.
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
Project Structure
ecs_doctor/
├── cli.py              # Click CLI entrypoint + rich renderer
├── models.py           # Finding, RootCause dataclasses
├── aggregator.py       # Root-cause scoring and ranking
└── diagnosers/
    ├── events.py       # ECS service events parser
    ├── stop_reasons.py # Task stop reason classifier
    ├── logs.py         # CloudWatch log crash pattern matcher
    └── alb_health.py   # ALB target health checker
License
MIT — see LICENSE.