CLI tool to diagnose why ECS tasks and services are failing
ecs-doctor
Diagnose why your ECS service is failing — in one command.
Designed and built by Praveen Rajkoilraj.
The Problem
ECS troubleshooting today means correlating several AWS data sources by hand during every incident:
- ECS service events — was there a placement failure? a deployment rollback? a deadlock?
- Stopped task reasons + container exit codes — OOM? image pull failure? missing secret?
- CloudWatch Logs — what was the application printing before it crashed?
- ALB target health — is the load balancer even reaching the container?
- CloudWatch Metrics — is CPU or memory trending toward exhaustion?
- Task definition + network config — is the Fargate CPU/memory combo invalid? are security groups blocking egress?
ecs-doctor aggregates all of these into a single confidence-scored root-cause report with a suggested fix.
Installation
# Recommended: isolated install via pipx
pipx install ecs-doctor
# Or with pip
pip install ecs-doctor
# With web UI support
pip install "ecs-doctor[web]"
# With interactive browser (arrow-key cluster/service selection)
pip install "ecs-doctor[interactive]"
# Everything
pip install "ecs-doctor[web,interactive]"
Development install
git clone https://github.com/PraveenLuke/ecs-doctor
cd ecs-doctor
pip install -e ".[dev]"
pytest tests/ -v
Quick Start
# Run a full diagnosis
ecs-doctor diagnose --cluster prod-cluster --service payments-service
# Specify region
ecs-doctor diagnose --cluster prod-cluster --service payments-service --region us-west-2
# Use a named AWS profile
ecs-doctor diagnose --cluster prod-cluster --service payments-service --profile staging
# Machine-readable JSON (for CI, Slack bots, incident tooling)
ecs-doctor diagnose --cluster prod-cluster --service payments-service --json
# Faster run — skip CloudWatch metrics (no cloudwatch:GetMetricData needed)
ecs-doctor diagnose --cluster prod-cluster --service payments-service --no-metrics
# Skip task definition config panel
ecs-doctor diagnose --cluster prod-cluster --service payments-service --no-config
# Stream live logs from running tasks (Ctrl+C to stop)
ecs-doctor diagnose --cluster prod-cluster --service payments-service --stream-logs
Commands
diagnose — Run all diagnostic checks
ecs-doctor diagnose [OPTIONS]
Options:
--cluster TEXT ECS cluster name or ARN [required]
--service TEXT ECS service name [required]
--region TEXT AWS region (overrides profile/env default)
--profile TEXT AWS named profile from ~/.aws/credentials
--json Emit machine-readable JSON instead of the rich report
--stream-logs Stream live logs from running tasks (cannot combine with --json)
--no-metrics Skip CloudWatch metrics (faster, fewer permissions needed)
--no-config Skip task definition config display
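For CI jobs or incident bots, the options above can be assembled programmatically. The following is a minimal sketch, not part of ecs-doctor: `build_diagnose_argv` and `run_diagnose` are hypothetical helper names, and the sketch assumes that the JSON report is the only thing written to stdout when `--json` is passed.

```python
import json
import subprocess


def build_diagnose_argv(cluster, service, region=None, profile=None,
                        no_metrics=False, no_config=False):
    """Assemble the argument list for `ecs-doctor diagnose --json`."""
    argv = ["ecs-doctor", "diagnose",
            "--cluster", cluster, "--service", service, "--json"]
    if region:
        argv += ["--region", region]
    if profile:
        argv += ["--profile", profile]
    if no_metrics:
        argv.append("--no-metrics")
    if no_config:
        argv.append("--no-config")
    # --stream-logs is deliberately not offered: it cannot be combined with --json.
    return argv


def run_diagnose(**options):
    """Run the CLI and parse its stdout as JSON (hypothetical helper).

    The CLI's exit-code semantics are not documented here, so the return
    code is not checked; json.loads raises if stdout is not valid JSON.
    """
    completed = subprocess.run(build_diagnose_argv(**options),
                               capture_output=True, text=True)
    return json.loads(completed.stdout)
```

Calling `run_diagnose(cluster="prod-cluster", service="payments-service", no_metrics=True)` would then return the report as a dictionary, provided ecs-doctor is installed and credentials resolve.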
browse — Interactive wizard (requires [interactive] extra)
ecs-doctor browse
Launches an arrow-key wizard that guides you through:
- Choosing an authentication method — AWS Profile, Access Keys, or Default Chain
- Selecting an AWS region
- Listing all clusters in the account and selecting one
- Listing all services in that cluster and selecting one
- Choosing output format — rich terminal report or JSON
Useful when you don't know the cluster/service name, or when exploring an unfamiliar account.
serve — Web UI (requires [web] extra)
# Start the web server (default: http://0.0.0.0:8080)
ecs-doctor serve
# Custom host/port
ecs-doctor serve --host 127.0.0.1 --port 9090
# Auto-reload on code changes (dev mode)
ecs-doctor serve --reload
Opens a browser UI at http://localhost:8080 where you can enter cluster/service/region, run a diagnosis, and stream live logs — all without the CLI.
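The project structure lists a `GET /healthz` route that returns `{"status":"ok"}`. A small readiness probe for a running `ecs-doctor serve` instance might look like the sketch below. The function names are illustrative, and the base URL assumes the default port.

```python
import json
import urllib.request


def is_healthy(body: str) -> bool:
    """Return True when a /healthz response body reports status "ok"."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def check_server(base_url: str = "http://localhost:8080", timeout: float = 3.0) -> bool:
    """Fetch /healthz and report whether the web UI is up."""
    try:
        with urllib.request.urlopen(f"{base_url}/healthz", timeout=timeout) as response:
            return is_healthy(response.read().decode("utf-8"))
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False
```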
Authentication
ecs-doctor uses the standard boto3 credential chain — the same one used by the AWS CLI:
| Method | How to configure |
|---|---|
| Environment variables | AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN |
| AWS named profile | ecs-doctor diagnose --profile my-profile |
| ECS task role | Automatic when running inside Fargate/ECS |
| EC2 instance role | Automatic when running on EC2 |
| OIDC / Web Identity | Automatic via AWS_WEB_IDENTITY_TOKEN_FILE (GitHub Actions, EKS) |
If the tool cannot resolve credentials at all, it exits with a clear error message listing all supported methods.
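When credentials fail to resolve, it can help to see which of the methods in the table have any signal in the current environment. The sketch below is an illustration only, not ecs-doctor's code: it checks standard AWS SDK environment variables, makes no claim about the order in which boto3 tries them, and cannot detect an EC2 instance role, which has no environment signal.

```python
import os

# Environment variables that hint at each method from the table above.
# The mapping is an assumption made for illustration.
SIGNALS = {
    "Environment variables": ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"),
    "AWS named profile": ("AWS_PROFILE",),
    "ECS task role": ("AWS_CONTAINER_CREDENTIALS_RELATIVE_URI",),
    "OIDC / Web Identity": ("AWS_WEB_IDENTITY_TOKEN_FILE",),
}


def detect_credential_sources(env=None):
    """List the methods whose environment variables are all set and non-empty."""
    env = os.environ if env is None else env
    return [method for method, names in SIGNALS.items()
            if all(env.get(name) for name in names)]
```

An empty result does not prove that credentials are missing; a default profile in `~/.aws/credentials` or an instance role would not show up here.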
Example Output
────────────── ECS Doctor — prod-cluster / payments-service ──────────────
╭─ Root Cause ─────────────────────────────────────────────────────────╮
│ │
│ Container is being OOM-killed (out of memory) │
│ │
│ Confidence: 97% │
│ │
│ Suggested fix: │
│ Increase the container memory reservation in the task definition. │
│ Enable CloudWatch Container Insights to track memory utilization. │
│ Profile the application for memory leaks — common causes include │
│ unbounded caches, unclosed DB connections, JVM heap misconfiguration.│
│ │
╰───────────────────────────────────────────────────────────────────────╯
Source Type Severity Message
stop_reasons oom_killed CRITICAL Container 'app' OOM-killed (exit 137). (3 tasks)
logs log_crash_sig CRITICAL [app] OOM in logs detected (task abc123)
events task_thrashing CRITICAL Crash loop: 4 starts and 4 stops in last 20 events
(1 additional finding not shown — run with --json to see all.)
Metric Average Maximum
CPU Utilization 12.4% 18.1%
Memory Utilization 94.2% 99.8% ← anomaly flagged
Desired / Running / Pending: 3 / 0 / 0
Launch type: FARGATE Platform: LATEST
Deployment: min 100% / max 200% Circuit breaker: on
Container Image CPU Memory Log Group
app payments:v1.2.3 256 512 /ecs/payments
Diagnosis completed in 843ms.
JSON output (--json)
{
"request": {
"cluster": "prod-cluster",
"service": "payments-service",
"region": "us-east-1",
"account_id": "123456789012"
},
"root_cause": {
"cause": "Container is being OOM-killed (out of memory)",
"confidence": 0.97,
"suggested_fix": "Increase the container memory reservation...",
"evidence": [...]
},
"all_findings": [...],
"metrics": {
"cpu_avg_percent": 12.4,
"cpu_max_percent": 18.1,
"memory_avg_percent": 94.2,
"memory_max_percent": 99.8
},
"service_config": { ... },
"task_config": { ... },
"duration_ms": 843
}
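A consumer of this JSON, such as a Slack bot, typically needs only a one-line summary. The sketch below uses only the field names shown in the example above. The `summarize` name and the 0.8 threshold are assumptions, as is the handling of a missing or null `root_cause`.

```python
def summarize(report: dict, min_confidence: float = 0.8) -> str:
    """Condense a report into one line, ignoring low-confidence causes."""
    request = report.get("request") or {}
    target = f"{request.get('cluster', '?')}/{request.get('service', '?')}"
    root = report.get("root_cause") or {}
    confidence = root.get("confidence", 0.0)
    if not root or confidence < min_confidence:
        return f"{target}: no high-confidence root cause"
    return f"{target}: {root.get('cause', 'unknown')} ({confidence:.0%} confidence)"
```

With the example report above, this would produce `prod-cluster/payments-service: Container is being OOM-killed (out of memory) (97% confidence)`.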
What Gets Diagnosed
Diagnosers
| Diagnoser | AWS APIs used | What it catches |
|---|---|---|
| events | ecs:DescribeServices | Placement failures, health check failures, deployment rollbacks, crash loops (thrashing), deployment config deadlock |
| stop_reasons | ecs:ListTasks, ecs:DescribeTasks | OOM (exit 137/139), image pull failures, missing secrets, non-zero exits, premature exit 0, SIGTERM not handled (exit 143), Spot interruption, TaskFailedToStart |
| logs | logs:GetLogEvents | Python/Java/Go/Node/Rust/.NET/PHP/Ruby crashes, connection refused, DNS failures, TLS errors, wrong CPU architecture, missing files, DB deadlocks, OOM in logs, disk full, read-only filesystem, EFS/NFS mount failures |
| alb_health | elasticloadbalancing:DescribeTargetHealth | Unhealthy targets: health check timeout, connection refused, non-2xx response |
| metrics | cloudwatch:GetMetricData | CPU or memory utilization above 85% (last 3 hours) |
| config | ecs:DescribeTaskDefinition | Invalid Fargate CPU/memory combination, service deployment configuration |
| network | ec2:Describe* | Security groups blocking egress, no NAT Gateway in route table, ENI not attached |
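To make the "invalid Fargate CPU/memory combination" check concrete, here is a minimal sketch of such a validation. It is not the code in config.py. The table of allowed sizes reflects AWS's published Fargate task sizes as understood at the time of writing and should be verified against the current AWS documentation. Both function names are hypothetical.

```python
def valid_fargate_memory(cpu_units: int) -> list[int]:
    """Return the memory values (MiB) accepted for a Fargate CPU size."""
    gib = 1024
    allowed = {
        256: [512, 1 * gib, 2 * gib],
        512: range(1 * gib, 4 * gib + 1, gib),
        1024: range(2 * gib, 8 * gib + 1, gib),
        2048: range(4 * gib, 16 * gib + 1, gib),
        4096: range(8 * gib, 30 * gib + 1, gib),
        8192: range(16 * gib, 60 * gib + 1, 4 * gib),
        16384: range(32 * gib, 120 * gib + 1, 8 * gib),
    }
    # Unknown CPU sizes yield an empty list, so every memory value is rejected.
    return list(allowed.get(cpu_units, []))


def is_valid_fargate_size(cpu_units: int, memory_mib: int) -> bool:
    """Check a task-level cpu/memory pair against the table above."""
    return memory_mib in valid_fargate_memory(cpu_units)
```

For example, the 256 CPU / 512 MB pair from the sample output is accepted, while 256 CPU with 4096 MB is not.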
Root Cause Categories (scored by confidence)
- Container OOM-killed (memory exhaustion)
- Cannot pull container image (registry auth, rate limit, bad tag)
- Task initialization failure (missing secret, SSM parameter, config resource)
- Insufficient cluster capacity (placement failure)
- ALB targets unhealthy (timeout, connection refused, non-2xx)
- Health check failing (container or ALB level)
- Deployment failed — circuit breaker triggered rollback
- Deployment config deadlock (min=100%, max=100%, running=0)
- Application crash-looping (task thrashing)
- Application exiting with non-zero code
- Container not handling SIGTERM (graceful shutdown failure)
- Fargate Spot task interrupted by AWS
- Task failed to start before startTimeout
- Application crash signature in logs
- High CPU or memory utilization (anomaly threshold 85%)
- Network connectivity blocked (security group, no NAT, ENI)
- Disk error or EFS mount failure
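Several of these categories are inferred from container exit codes. By shell convention, an exit code above 128 means the process was terminated by signal number `code - 128`, which is why 137 maps to SIGKILL and 143 to SIGTERM. The sketch below illustrates only that decoding step and is not the stop_reasons classifier. Note that SIGKILL is a common sign of an OOM kill but does not prove one.

```python
# Signals most relevant to the categories above; others are reported by number.
SIGNAL_NAMES = {9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a short human-readable description."""
    if code == 0:
        return "exited cleanly (0)"
    if code > 128:
        signal_number = code - 128
        name = SIGNAL_NAMES.get(signal_number, f"signal {signal_number}")
        return f"killed by {name} (exit {code})"
    return f"application error (exit {code})"
```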
Crash Patterns in Logs
The log diagnoser matches against 25+ patterns:
| Pattern | Severity |
|---|---|
| Python Traceback (most recent call last) | HIGH |
| Java Exception in thread | HIGH |
| Go panic: | HIGH |
| Node.js UnhandledPromiseRejection | HIGH |
| Rust thread '...' panicked at | HIGH |
| .NET System.Exception: / Unhandled exception | HIGH |
| PHP Fatal error: | HIGH |
| exec format error (wrong CPU architecture) | CRITICAL |
| out of memory / cannot allocate memory | CRITICAL |
| connection refused | MEDIUM |
| no such host (DNS failure) | MEDIUM |
| certificate expired / SSL error | MEDIUM |
| FATAL: (DB fatal) | HIGH |
| deadlock detected | HIGH |
| no space left on device (disk full) | CRITICAL |
| read-only file system | HIGH |
| disk quota exceeded | HIGH |
| mount.nfs / nfs: server not responding (EFS failure) | CRITICAL |
| exec: ... permission denied (entrypoint not executable) | HIGH |
| no such file or directory | HIGH |
| SecretNotFound / secret not found | HIGH |
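Pattern matching of this kind can be sketched with a short list of regular expressions. The example below covers only a handful of the patterns in the table, and the expressions are guesses at reasonable matchers rather than the ones in logs.py. The first matching pattern determines the severity of a line.

```python
import re

# (compiled pattern, severity) pairs; a small subset of the table above.
PATTERNS = [
    (re.compile(r"exec format error", re.IGNORECASE), "CRITICAL"),
    (re.compile(r"out of memory|cannot allocate memory", re.IGNORECASE), "CRITICAL"),
    (re.compile(r"no space left on device", re.IGNORECASE), "CRITICAL"),
    (re.compile(r"Traceback \(most recent call last\)"), "HIGH"),
    (re.compile(r"\bpanic:"), "HIGH"),
    (re.compile(r"deadlock detected", re.IGNORECASE), "HIGH"),
    (re.compile(r"connection refused", re.IGNORECASE), "MEDIUM"),
    (re.compile(r"no such host", re.IGNORECASE), "MEDIUM"),
]


def scan_logs(lines):
    """Return (severity, line) for every line that matches a known pattern."""
    hits = []
    for line in lines:
        for pattern, severity in PATTERNS:
            if pattern.search(line):
                hits.append((severity, line.strip()))
                break  # one finding per line: the first matching pattern wins
    return hits
```

Ordering the list from most to least severe means a line that matches several patterns is reported at its highest severity.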
Running as a Container / on Fargate
The deploy/ directory contains production-ready deployment files.
Docker
# Build
docker build -f deploy/Dockerfile -t ecs-doctor .
# Run locally
docker run -p 8080:8080 \
-e AWS_DEFAULT_REGION=us-east-1 \
-v ~/.aws:/home/ecsdoctor/.aws:ro \
ecs-doctor
# Open http://localhost:8080
Deploy to Fargate
- Push the image to ECR:
  aws ecr create-repository --repository-name ecs-doctor
  docker tag ecs-doctor:latest <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/ecs-doctor:latest
  docker push <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/ecs-doctor:latest
- Edit deploy/task-definition.json: replace the ACCOUNT_ID and REGION placeholders.
- Create the IAM roles referenced in the task definition, using deploy/iam-policy.json as the task role policy.
- Register and run the task definition:
  aws ecs register-task-definition --cli-input-json file://deploy/task-definition.json
  aws ecs create-service \
    --cluster your-cluster \
    --service-name ecs-doctor \
    --task-definition ecs-doctor \
    --desired-count 1 \
    --launch-type FARGATE \
    --network-configuration "awsvpcConfiguration={subnets=[subnet-xxx],securityGroups=[sg-xxx],assignPublicIp=ENABLED}"
The web UI will be available on port 8080 of the task's public IP or via an ALB.
Required IAM Permissions
Full minimum policy (save as deploy/iam-policy.json):
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ecs:DescribeServices",
"ecs:DescribeTasks",
"ecs:DescribeTaskDefinition",
"ecs:DescribeClusters",
"ecs:ListTasks",
"ecs:ListClusters",
"ecs:ListServices"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"logs:GetLogEvents",
"logs:FilterLogEvents",
"logs:DescribeLogStreams"
],
"Resource": "arn:aws:logs:*:*:log-group:/ecs/*:*"
},
{
"Effect": "Allow",
"Action": [
"cloudwatch:GetMetricData",
"cloudwatch:GetMetricStatistics"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": ["elasticloadbalancing:DescribeTargetHealth"],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"ec2:DescribeSecurityGroups",
"ec2:DescribeSubnets",
"ec2:DescribeRouteTables",
"ec2:DescribeNatGateways",
"ec2:DescribeNetworkInterfaces"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": ["sts:GetCallerIdentity"],
"Resource": "*"
}
]
}
Minimum permissions for a fast scan: the CloudWatch and EC2 statements can be omitted from the policy. If you only have ECS, Logs, and ELB permissions, run with --no-metrics. ecs-doctor skips any check it lacks permissions for and reports exactly which IAM action and resource ARN you would need to add.
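Before running a scan, a policy document can be compared against the actions each diagnoser needs. The sketch below takes the action lists from the diagnosers table, expanding `ec2:Describe*` to the five EC2 actions in the policy above. It is a simplification that looks only at `Allow` statements and ignores `Deny` statements, resources, and conditions. The function names are hypothetical.

```python
from fnmatch import fnmatchcase

DIAGNOSER_ACTIONS = {
    "events": ["ecs:DescribeServices"],
    "stop_reasons": ["ecs:ListTasks", "ecs:DescribeTasks"],
    "logs": ["logs:GetLogEvents"],
    "alb_health": ["elasticloadbalancing:DescribeTargetHealth"],
    "metrics": ["cloudwatch:GetMetricData"],
    "config": ["ecs:DescribeTaskDefinition"],
    "network": ["ec2:DescribeSecurityGroups", "ec2:DescribeSubnets",
                "ec2:DescribeRouteTables", "ec2:DescribeNatGateways",
                "ec2:DescribeNetworkInterfaces"],
}


def allowed_patterns(policy: dict) -> list[str]:
    """Collect lower-cased action patterns from every Allow statement."""
    patterns = []
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        actions = [actions] if isinstance(actions, str) else actions
        patterns.extend(action.lower() for action in actions)
    return patterns


def uncovered_diagnosers(policy: dict) -> dict[str, list[str]]:
    """Map each diagnoser to the actions the policy does not allow."""
    patterns = allowed_patterns(policy)
    missing = {}
    for name, actions in DIAGNOSER_ACTIONS.items():
        # IAM action names are case-insensitive and may contain * wildcards.
        lacking = [action for action in actions
                   if not any(fnmatchcase(action.lower(), p) for p in patterns)]
        if lacking:
            missing[name] = lacking
    return missing
```

A policy that contains only the ECS, Logs, and ELB statements would report `metrics` and `network` as uncovered, which matches the fast-scan case described above.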
Project Structure
ecs_doctor/
├── cli.py # Click CLI — diagnose, browse, serve subcommands
├── engine.py # Shared orchestration layer (DiagnosisRequest/Result)
├── models.py # Finding, RootCause, MetricSnapshot, TaskConfig dataclasses
├── aggregator.py # Confidence scoring and root-cause ranking
├── _aws.py # ServiceDataCache, IAM error helpers
├── streaming.py # Live log streaming generator (CLI + web SSE)
├── wizard.py # Interactive questionary-based cluster/service browser
└── diagnosers/
├── events.py # ECS service events + deployment deadlock detection
├── stop_reasons.py # Task stop reason and container exit code classifier
├── logs.py # CloudWatch log crash pattern matcher (25+ patterns)
├── alb_health.py # ALB target health checker
├── metrics.py # CloudWatch CPU/memory utilization (last 3h)
├── config.py # Task definition + service config extractor + Fargate validation
└── network.py # Security group, NAT gateway, ENI attachment checks
web/
├── app.py # FastAPI application factory
├── routes/
│ ├── diagnose.py # GET / (form), POST /diagnose (HTMX), GET /api/diagnose (JSON)
│ ├── stream.py # GET /api/stream-logs (Server-Sent Events)
│ └── health.py # GET /healthz → {"status":"ok"}
├── templates/
│ ├── base.html # HTMX CDN, layout shell
│ ├── index.html # Diagnosis form
│ └── report.html # Results: root cause, metrics, config, evidence, log stream
└── static/
├── style.css # Dark theme
└── app.js # SSE EventSource for live log streaming
deploy/
├── Dockerfile # python:3.12-slim, non-root user, port 8080
├── task-definition.json # Fargate 256 CPU / 512 MB, awsvpc, /healthz health check
└── iam-policy.json # Minimum task role policy
Development
Requires Python 3.12+.
# Install with dev + all optional extras
pip install -e ".[dev,web,interactive]"
# Run all tests
pytest tests/ -v
# Run with coverage
pytest tests/ --cov=ecs_doctor --cov-report=term-missing
# Run with uv (if uv is installed)
uv run pytest tests/ -q
Running the Web UI locally
pip install -e ".[web]"
ecs-doctor serve --reload
# Open http://localhost:8080
Adding a new diagnoser
- Create ecs_doctor/diagnosers/my_check.py with a diagnose_my_check(...) function that returns list[Finding]
- Add new FindingType values to ecs_doctor/models.py
- Add hypothesis entries to ecs_doctor/aggregator.py
- Call the new function from run_diagnosis() in ecs_doctor/engine.py
- Add tests in tests/test_my_check.py
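The first step might start from a skeleton like the one below. The `Finding` dataclass here is a stand-in with fields guessed from the columns of the example report (source, type, severity, message); the real class in ecs_doctor/models.py may differ, and the check itself is a made-up example.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """Stand-in for ecs_doctor.models.Finding; the real fields may differ."""
    source: str
    type: str
    severity: str
    message: str


def diagnose_my_check(task_definition: dict) -> list[Finding]:
    """Hypothetical check: flag containers that have no log configuration."""
    findings = []
    for container in task_definition.get("containerDefinitions", []):
        if not container.get("logConfiguration"):
            name = container.get("name", "?")
            findings.append(Finding(
                source="my_check",
                type="missing_log_config",
                severity="MEDIUM",
                message=f"Container '{name}' has no logConfiguration.",
            ))
    return findings
```

A matching test in tests/test_my_check.py would pass in a small task definition dictionary and assert on the returned findings, with no AWS calls involved.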
License
MIT — see LICENSE.