CLI tool to diagnose why ECS tasks and services are failing

These details have not been verified by PyPI

Project description

ecs-doctor

Diagnose why your ECS service is failing — in one command.

Designed and built by Praveen Rajkoilraj.

The Problem

ECS troubleshooting today means manually correlating multiple AWS data sources every incident, by hand:

ECS service events — was there a placement failure? a deployment rollback? a deadlock?
Stopped task reasons + container exit codes — OOM? image pull failure? missing secret?
CloudWatch Logs — what was the application printing before it crashed?
ALB target health — is the load balancer even reaching the container?
CloudWatch Metrics — is CPU or memory trending toward exhaustion?
Task definition + network config — is the Fargate CPU/memory combo invalid? are security groups blocking egress?

ecs-doctor aggregates all of these into a single confidence-scored root-cause report with a suggested fix.

Installation

# Recommended: isolated install via pipx
pipx install ecs-doctor

# Or with pip
pip install ecs-doctor

# With web UI support
pip install "ecs-doctor[web]"

# With interactive browser (arrow-key cluster/service selection)
pip install "ecs-doctor[interactive]"

# Everything
pip install "ecs-doctor[web,interactive]"

Development install

git clone https://github.com/PraveenLuke/ecs-doctor
cd ecs-doctor
pip install -e ".[dev]"
pytest tests/ -v

Quick Start

# Run a full diagnosis
ecs-doctor diagnose --cluster prod-cluster --service payments-service

# Specify region
ecs-doctor diagnose --cluster prod-cluster --service payments-service --region us-west-2

# Use a named AWS profile
ecs-doctor diagnose --cluster prod-cluster --service payments-service --profile staging

# Machine-readable JSON (for CI, Slack bots, incident tooling)
ecs-doctor diagnose --cluster prod-cluster --service payments-service --json

# Faster run — skip CloudWatch metrics (no cloudwatch:GetMetricData needed)
ecs-doctor diagnose --cluster prod-cluster --service payments-service --no-metrics

# Skip task definition config panel
ecs-doctor diagnose --cluster prod-cluster --service payments-service --no-config

# Stream live logs from running tasks (Ctrl+C to stop)
ecs-doctor diagnose --cluster prod-cluster --service payments-service --stream-logs

Commands

`diagnose` — Run all diagnostic checks

ecs-doctor diagnose [OPTIONS]

Options:
  --cluster TEXT    ECS cluster name or ARN  [required]
  --service TEXT    ECS service name  [required]
  --region TEXT     AWS region (overrides profile/env default)
  --profile TEXT    AWS named profile from ~/.aws/credentials
  --json            Emit machine-readable JSON instead of the rich report
  --stream-logs     Stream live logs from running tasks  (cannot combine with --json)
  --no-metrics      Skip CloudWatch metrics (faster, fewer permissions needed)
  --no-config       Skip task definition config display

`browse` — Interactive wizard (requires `[interactive]` extra)

ecs-doctor browse

Launches an arrow-key wizard that guides you through:

Choosing an authentication method — AWS Profile, Access Keys, or Default Chain
Selecting an AWS region
Listing all clusters in the account and selecting one
Listing all services in that cluster and selecting one
Choosing output format — rich terminal report or JSON

Useful when you don't know the cluster/service name, or when exploring an unfamiliar account.

`serve` — Web UI (requires `[web]` extra)

# Start the web server (default: http://0.0.0.0:8080)
ecs-doctor serve

# Custom host/port
ecs-doctor serve --host 127.0.0.1 --port 9090

# Auto-reload on code changes (dev mode)
ecs-doctor serve --reload

Opens a browser UI at http://localhost:8080 where you can enter cluster/service/region, run a diagnosis, and stream live logs — all without the CLI.

Authentication

ecs-doctor uses the standard boto3 credential chain — the same one used by the AWS CLI:

Method	How to configure
Environment variables	`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN`
AWS named profile	`ecs-doctor diagnose --profile my-profile`
ECS task role	Automatic when running inside Fargate/ECS
EC2 instance role	Automatic when running on EC2
OIDC / Web Identity	Automatic via `AWS_WEB_IDENTITY_TOKEN_FILE` (GitHub Actions, EKS)

If the tool cannot resolve credentials at all, it exits with a clear error message listing all supported methods.

Example Output

────────────── ECS Doctor — prod-cluster / payments-service ──────────────

╭─ Root Cause ─────────────────────────────────────────────────────────╮
│                                                                       │
│  Container is being OOM-killed (out of memory)                        │
│                                                                       │
│  Confidence: 97%                                                      │
│                                                                       │
│  Suggested fix:                                                       │
│  Increase the container memory reservation in the task definition.    │
│  Enable CloudWatch Container Insights to track memory utilization.    │
│  Profile the application for memory leaks — common causes include     │
│  unbounded caches, unclosed DB connections, JVM heap misconfiguration.│
│                                                                       │
╰───────────────────────────────────────────────────────────────────────╯

  Source        Type            Severity   Message
  stop_reasons  oom_killed      CRITICAL   Container 'app' OOM-killed (exit 137). (3 tasks)
  logs          log_crash_sig   CRITICAL   [app] OOM in logs detected (task abc123)
  events        task_thrashing  CRITICAL   Crash loop: 4 starts and 4 stops in last 20 events

(1 additional finding not shown — run with --json to see all.)

  Metric              Average    Maximum
  CPU Utilization     12.4%      18.1%
  Memory Utilization  94.2%      99.8%     ← anomaly flagged

  Desired / Running / Pending: 3 / 0 / 0
  Launch type: FARGATE  Platform: LATEST
  Deployment: min 100% / max 200%  Circuit breaker: on

  Container   Image              CPU   Memory   Log Group
  app         payments:v1.2.3    256   512      /ecs/payments

Diagnosis completed in 843ms.

JSON output (`--json`)

{
  "request": {
    "cluster": "prod-cluster",
    "service": "payments-service",
    "region": "us-east-1",
    "account_id": "123456789012"
  },
  "root_cause": {
    "cause": "Container is being OOM-killed (out of memory)",
    "confidence": 0.97,
    "suggested_fix": "Increase the container memory reservation...",
    "evidence": [...]
  },
  "all_findings": [...],
  "metrics": {
    "cpu_avg_percent": 12.4,
    "cpu_max_percent": 18.1,
    "memory_avg_percent": 94.2,
    "memory_max_percent": 99.8
  },
  "service_config": { ... },
  "task_config": { ... },
  "duration_ms": 843
}

What Gets Diagnosed

Diagnosers

Diagnoser	AWS APIs used	What it catches
events	`ecs:DescribeServices`	Placement failures, health check failures, deployment rollbacks, crash loops (thrashing), deployment config deadlock
stop_reasons	`ecs:ListTasks`, `ecs:DescribeTasks`	OOM (exit 137/139), image pull failures, missing secrets, non-zero exits, premature exit 0, SIGTERM not handled (exit 143), Spot interruption, TaskFailedToStart
logs	`logs:GetLogEvents`	Python/Java/Go/Node/Rust/.NET/PHP/Ruby crashes, connection refused, DNS failures, TLS errors, wrong CPU architecture, missing files, DB deadlocks, OOM in logs, disk full, read-only filesystem, EFS/NFS mount failures
alb_health	`elasticloadbalancing:DescribeTargetHealth`	Unhealthy targets — health check timeout, connection refused, non-2xx response
metrics	`cloudwatch:GetMetricData`	CPU or memory utilization above 85% (last 3 hours)
config	`ecs:DescribeTaskDefinition`	Invalid Fargate CPU/memory combination, service deployment configuration
network	`ec2:Describe*`	Security groups blocking egress, no NAT Gateway in route table, ENI not attached

Root Cause Categories (scored by confidence)

Container OOM-killed (memory exhaustion)
Cannot pull container image (registry auth, rate limit, bad tag)
Task initialization failure (missing secret, SSM parameter, config resource)
Insufficient cluster capacity (placement failure)
ALB targets unhealthy (timeout, connection refused, non-2xx)
Health check failing (container or ALB level)
Deployment failed — circuit breaker triggered rollback
Deployment config deadlock (min=100%, max=100%, running=0)
Application crash-looping (task thrashing)
Application exiting with non-zero code
Container not handling SIGTERM (graceful shutdown failure)
Fargate Spot task interrupted by AWS
Task failed to start before startTimeout
Application crash signature in logs
High CPU or memory utilization (anomaly threshold 85%)
Network connectivity blocked (security group, no NAT, ENI)
Disk error or EFS mount failure

Crash Patterns in Logs

The log diagnoser matches against 25+ patterns:

Pattern	Severity
Python `Traceback (most recent call last)`	HIGH
Java `Exception in thread`	HIGH
Go `panic:`	HIGH
Node.js `UnhandledPromiseRejection`	HIGH
Rust `thread '...' panicked at`	HIGH
.NET `System.Exception:` / `Unhandled exception`	HIGH
PHP `Fatal error:`	HIGH
`exec format error` (wrong CPU architecture)	CRITICAL
`out of memory` / `cannot allocate memory`	CRITICAL
`connection refused`	MEDIUM
`no such host` (DNS failure)	MEDIUM
`certificate expired` / `SSL error`	MEDIUM
`FATAL:` (DB fatal)	HIGH
`deadlock detected`	HIGH
`no space left on device` (disk full)	CRITICAL
`read-only file system`	HIGH
`disk quota exceeded`	HIGH
`mount.nfs` / `nfs: server not responding` (EFS failure)	CRITICAL
`exec: ... permission denied` (entrypoint not executable)	HIGH
`no such file or directory`	HIGH
`SecretNotFound` / `secret not found`	HIGH

Running as a Container / on Fargate

The deploy/ directory contains production-ready deployment files.

Docker

# Build
docker build -f deploy/Dockerfile -t ecs-doctor .

# Run locally
docker run -p 8080:8080 \
  -e AWS_DEFAULT_REGION=us-east-1 \
  -v ~/.aws:/home/ecsdoctor/.aws:ro \
  ecs-doctor

# Open http://localhost:8080

Deploy to Fargate

Push the image to ECR:

aws ecr create-repository --repository-name ecs-doctor
docker tag ecs-doctor:latest <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/ecs-doctor:latest
docker push <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/ecs-doctor:latest

Edit deploy/task-definition.json — replace ACCOUNT_ID and REGION placeholders.
Create the IAM roles referenced in the task definition — use deploy/iam-policy.json as the task role policy.

aws ecs register-task-definition --cli-input-json file://deploy/task-definition.json
aws ecs create-service \
  --cluster your-cluster \
  --service-name ecs-doctor \
  --task-definition ecs-doctor \
  --desired-count 1 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-xxx],securityGroups=[sg-xxx],assignPublicIp=ENABLED}"

The web UI will be available on port 8080 of the task's public IP or via an ALB.

Required IAM Permissions

Full minimum policy (save as deploy/iam-policy.json):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecs:DescribeServices",
        "ecs:DescribeTasks",
        "ecs:DescribeTaskDefinition",
        "ecs:DescribeClusters",
        "ecs:ListTasks",
        "ecs:ListClusters",
        "ecs:ListServices"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:GetLogEvents",
        "logs:FilterLogEvents",
        "logs:DescribeLogStreams"
      ],
      "Resource": "arn:aws:logs:*:*:log-group:/ecs/*:*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:GetMetricData",
        "cloudwatch:GetMetricStatistics"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["elasticloadbalancing:DescribeTargetHealth"],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeSubnets",
        "ec2:DescribeRouteTables",
        "ec2:DescribeNatGateways",
        "ec2:DescribeNetworkInterfaces"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["sts:GetCallerIdentity"],
      "Resource": "*"
    }
  ]
}

Minimum permissions for a fast scan (skip --no-metrics not needed, just omit CloudWatch + EC2):

If you only have ECS + Logs + ELB permissions, run with --no-metrics. ecs-doctor will gracefully skip any check it lacks permissions for and report exactly which IAM action and resource ARN you'd need to add.

Project Structure

ecs_doctor/
├── cli.py                  # Click CLI — diagnose, browse, serve subcommands
├── engine.py               # Shared orchestration layer (DiagnosisRequest/Result)
├── models.py               # Finding, RootCause, MetricSnapshot, TaskConfig dataclasses
├── aggregator.py           # Confidence scoring and root-cause ranking
├── _aws.py                 # ServiceDataCache, IAM error helpers
├── streaming.py            # Live log streaming generator (CLI + web SSE)
├── wizard.py               # Interactive questionary-based cluster/service browser
└── diagnosers/
    ├── events.py           # ECS service events + deployment deadlock detection
    ├── stop_reasons.py     # Task stop reason and container exit code classifier
    ├── logs.py             # CloudWatch log crash pattern matcher (25+ patterns)
    ├── alb_health.py       # ALB target health checker
    ├── metrics.py          # CloudWatch CPU/memory utilization (last 3h)
    ├── config.py           # Task definition + service config extractor + Fargate validation
    └── network.py          # Security group, NAT gateway, ENI attachment checks

web/
├── app.py                  # FastAPI application factory
├── routes/
│   ├── diagnose.py         # GET / (form), POST /diagnose (HTMX), GET /api/diagnose (JSON)
│   ├── stream.py           # GET /api/stream-logs (Server-Sent Events)
│   └── health.py           # GET /healthz → {"status":"ok"}
├── templates/
│   ├── base.html           # HTMX CDN, layout shell
│   ├── index.html          # Diagnosis form
│   └── report.html         # Results: root cause, metrics, config, evidence, log stream
└── static/
    ├── style.css           # Dark theme
    └── app.js              # SSE EventSource for live log streaming

deploy/
├── Dockerfile              # python:3.12-slim, non-root user, port 8080
├── task-definition.json    # Fargate 256 CPU / 512 MB, awsvpc, /healthz health check
└── iam-policy.json         # Minimum task role policy

Development

Requires Python 3.12+.

# Install with dev + all optional extras
pip install -e ".[dev,web,interactive]"

# Run all tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=ecs_doctor --cov-report=term-missing

# Run with uv (if uv is installed)
uv run pytest tests/ -q

Running the Web UI locally

pip install -e ".[web]"
ecs-doctor serve --reload
# Open http://localhost:8080

Adding a new diagnoser

Create ecs_doctor/diagnosers/my_check.py with a diagnose_my_check(...) function that returns list[Finding]
Add new FindingType values to ecs_doctor/models.py
Add hypothesis entries to ecs_doctor/aggregator.py
Call the new function from ecs_doctor/engine.py run_diagnosis()
Add tests in tests/test_my_check.py

License

MIT — see LICENSE.

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

This version

0.2.0

Jun 23, 2026

0.1.1

Jun 23, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ecs_doctor-0.2.0.tar.gz (59.5 kB view details)

Uploaded Jun 23, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

ecs_doctor-0.2.0-py3-none-any.whl (47.9 kB view details)

Uploaded Jun 23, 2026 Python 3

File details

Details for the file ecs_doctor-0.2.0.tar.gz.

File metadata

Download URL: ecs_doctor-0.2.0.tar.gz
Upload date: Jun 23, 2026
Size: 59.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ecs_doctor-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`5e608917f3d2fe1728b1ed9201f03e8b0c095edaae7b7e1ece8b97cf0a796983`
MD5	`c87af86fa392a9308434d1c9d594a3c1`
BLAKE2b-256	`ade2ca0d9add7eeb32b59b65d8f8bd19217f8ba33b709a47400bd44f4d8c04b8`

See more details on using hashes here.

Provenance

The following attestation bundles were made for ecs_doctor-0.2.0.tar.gz:

Publisher: release.yml on PraveenLuke/ecs-doctor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ecs_doctor-0.2.0.tar.gz
- Subject digest: 5e608917f3d2fe1728b1ed9201f03e8b0c095edaae7b7e1ece8b97cf0a796983
- Sigstore transparency entry: 1930561490
- Sigstore integration time: Jun 23, 2026
Source repository:
- Permalink: PraveenLuke/ecs-doctor@c4275f0952ae8a4cff6d12f345e11796f9ee7b95
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/PraveenLuke
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@c4275f0952ae8a4cff6d12f345e11796f9ee7b95
- Trigger Event: push

File details

Details for the file ecs_doctor-0.2.0-py3-none-any.whl.

File metadata

Download URL: ecs_doctor-0.2.0-py3-none-any.whl
Upload date: Jun 23, 2026
Size: 47.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ecs_doctor-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d73c466e8ceffded18f6f8853be4faf9a7a49025aae00108ea0bc758ba8f2ad2`
MD5	`2a08296472a75f2c00a97d7506c57f0f`
BLAKE2b-256	`c96028c506c71d510f54546ff9500ac9f4ff99b45f034bdfc8f55e440d89bbc2`

See more details on using hashes here.

Provenance

The following attestation bundles were made for ecs_doctor-0.2.0-py3-none-any.whl:

Publisher: release.yml on PraveenLuke/ecs-doctor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ecs_doctor-0.2.0-py3-none-any.whl
- Subject digest: d73c466e8ceffded18f6f8853be4faf9a7a49025aae00108ea0bc758ba8f2ad2
- Sigstore transparency entry: 1930561662
- Sigstore integration time: Jun 23, 2026
Source repository:
- Permalink: PraveenLuke/ecs-doctor@c4275f0952ae8a4cff6d12f345e11796f9ee7b95
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/PraveenLuke
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@c4275f0952ae8a4cff6d12f345e11796f9ee7b95
- Trigger Event: push

ecs-doctor 0.2.0

Navigation

Verified details

Maintainers

Meta

Unverified details

Meta

Classifiers

Project description

ecs-doctor

The Problem

Installation

Development install

Quick Start

Commands

diagnose — Run all diagnostic checks

browse — Interactive wizard (requires [interactive] extra)

serve — Web UI (requires [web] extra)

Authentication

Example Output

JSON output (--json)

What Gets Diagnosed

Diagnosers

Root Cause Categories (scored by confidence)

Crash Patterns in Logs

Running as a Container / on Fargate

Docker

Deploy to Fargate

Required IAM Permissions

Project Structure

Development

Running the Web UI locally

Adding a new diagnoser

License

Project details

Verified details

Maintainers

Meta

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance

`diagnose` — Run all diagnostic checks

`browse` — Interactive wizard (requires `[interactive]` extra)

`serve` — Web UI (requires `[web]` extra)

JSON output (`--json`)