vllm-doctor

Diagnostic tool for vLLM inference servers

Maintainers

aminalaee

Project description

Diagnose vLLM server bottlenecks from live metrics.

vLLM Doctor reads vLLM server metrics and turns them into diagnostic findings: what looks unhealthy, why it may be happening, and which vLLM settings are worth checking first.

vllm-doctor diagnose http://localhost:8000/metrics

vLLM Doctor is not a dashboard replacement or benchmark runner. It is a fast server-side diagnostic snapshot for a single vLLM server or Prometheus target.
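
To make the idea of turning scraped metrics into findings concrete, here is a minimal Python sketch. It is not vllm-doctor's implementation: the metric names `vllm:gpu_cache_usage_perc` and `vllm:num_requests_waiting` are assumptions that should be checked against your server's /metrics output, the helper names are hypothetical, and the default thresholds (90% cache usage, 5 waiting requests) are simply the ones that appear in the example output further down.

```python
# Illustrative sketch only; this is not vllm-doctor's implementation.
from urllib.request import urlopen

# Assumed metric names; confirm them against your server's /metrics output.
CACHE_METRIC = "vllm:gpu_cache_usage_perc"    # assumed gauge in the range 0..1
WAITING_METRIC = "vllm:num_requests_waiting"  # assumed gauge


def scrape(url: str) -> str:
    """Fetch the raw Prometheus text exposition from a /metrics endpoint."""
    with urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text: str) -> dict[str, float]:
    """Map each sample (name plus labels) to its value.

    Simplification: assumes samples carry no trailing timestamp.
    """
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return samples


def values_for(samples: dict[str, float], metric: str) -> list[float]:
    """Collect the values of every labelled series of one metric."""
    return [v for k, v in samples.items()
            if k == metric or k.startswith(metric + "{")]


def diagnose(samples, cache_threshold=0.90, waiting_threshold=5) -> list[str]:
    """Apply two hypothetical threshold rules and return readable findings."""
    findings = []
    cache = max(values_for(samples, CACHE_METRIC), default=0.0)
    waiting = sum(values_for(samples, WAITING_METRIC))
    if cache > cache_threshold:
        findings.append(
            f"KV cache pressure: {cache:.0%} (threshold: {cache_threshold:.0%})")
    if waiting > waiting_threshold:
        findings.append(
            f"Queue pressure: {int(waiting)} waiting (threshold: {waiting_threshold})")
    return findings
```

Calling `diagnose(parse_metrics(scrape("http://localhost:8000/metrics")))` would run both rules against a live server. The real tool ships more rules and attaches recommendations to each finding.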

Features

Built-in diagnosis rules — queue pressure, TTFT/TPOT bottlenecks, KV cache pressure, low throughput, error rate, replica imbalance, prefix cache efficiency, preemption pressure, queue latency
Local history — persist runs with --save, review with history list and history show
Watch change-log — --save --watch only persists when state changes (health or firing rules)
Dual input — Prometheus or direct /metrics scrape
Structured output — rich text tables or JSON for automation
Configurable thresholds — per-rule tuning via TOML
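
The watch change-log behaviour described above (persist only when health or the set of firing rules changes) can be pictured with a small Python sketch. The `Snapshot` type and `persist_on_change` function are hypothetical names invented for illustration; how vllm-doctor actually stores and compares runs is not documented here.

```python
# Illustrative sketch only; names and data model are assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    health: str             # e.g. "OK", "WARNING", "CRITICAL"
    firing: frozenset[str]  # names of the rules currently firing


def persist_on_change(snapshots: list[Snapshot]) -> list[Snapshot]:
    """Return only the snapshots that differ from the last persisted one."""
    saved: list[Snapshot] = []
    last = None
    for snap in snapshots:
        if last is None or snap != last:
            saved.append(snap)  # state changed, so record it
            last = snap
    return saved


runs = [
    Snapshot("OK", frozenset()),
    Snapshot("OK", frozenset()),                       # unchanged, skipped
    Snapshot("CRITICAL", frozenset({"kv_cache"})),
    Snapshot("CRITICAL", frozenset({"kv_cache"})),     # unchanged, skipped
    Snapshot("CRITICAL", frozenset({"kv_cache", "queue"})),
]
history = persist_on_change(runs)
```

Comparing against the last persisted snapshot, rather than the previous poll, keeps the history short even when a server stays in the same state for hours.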

Why not just a dashboard?

Dashboards show metrics. vLLM Doctor explains server-side inference behavior.

Capability	Dashboards	vLLM Doctor
Shows raw metrics	✓	✓
Explains what's wrong	✗	✓
Recommends vLLM configs	✗	✓
Requires setup	✓	✗
Works on a single server	✗	✓

How does this relate to GuideLLM?

GuideLLM is a good fit for generating workloads and measuring endpoint behavior. vLLM Doctor is a good fit for explaining server-side symptoms from vLLM metrics.

Used together, GuideLLM can create or replay load while vLLM Doctor helps explain bottlenecks such as queue pressure, KV cache pressure, high TTFT, or high TPOT.

Installation

With pip:

pip install vllm-doctor

With uv:

uv tool install vllm-doctor

Quickstart

Direct scrape:

vllm-doctor diagnose http://localhost:8000/metrics

Prometheus:

vllm-doctor diagnose http://localhost:9090
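
For readers unfamiliar with the Prometheus side, the sketch below shows what an instant query against the Prometheus HTTP API (`/api/v1/query`) looks like in Python. It only illustrates the kind of data available from a Prometheus target. The queries vllm-doctor issues internally are not documented here, and the metric name and the `pod` label are assumptions.

```python
# Illustrative sketch only; metric and label names are assumptions.
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL for the given PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def vector_values(payload: dict, label: str = "pod") -> dict[str, float]:
    """Extract {label value: sample value} from an instant-vector response."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector")
    # Each result item carries its labels and a [timestamp, "value"] pair.
    return {item["metric"].get(label, ""): float(item["value"][1])
            for item in data["result"]}


url = instant_query_url("http://localhost:9090", "vllm:num_requests_waiting")
```

Fetching `url` and passing the decoded JSON body to `vector_values` would give the waiting-request count per pod, assuming the series carry a `pod` label.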

Run with Docker

A prebuilt image is published to GitHub Container Registry:

docker run --rm ghcr.io/aminalaee/vllm-doctor diagnose <url>

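<url> is your vLLM /metrics or Prometheus endpoint, the same argument the CLI takes. It must be reachable from inside the container: with Docker's default networking, localhost inside the container refers to the container itself, not the host.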

Example verbose output

─────────────────────────────────── vLLM Doctor  ·  Health: CRITICAL  ·  Since: now ────────────────────────────────────

╭─ ✖ KV cache pressure  [high confidence] ─────────────────────────────────────────────────────────────────────────────╮
│   GPU KV cache usage: 94% (threshold: 90%)  ·  Waiting requests: 7 (blocked by full cache)                           │
│                                                                                                                      │
│   → Reduce max_num_seqs to limit concurrent sequences                                                                │
│   → Reduce max_num_batched_tokens to cap memory per step                                                             │
│   → Increase gpu_memory_utilization if GPU memory headroom exists                                                    │
│   → Route long-context requests to a dedicated replica                                                               │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ ⚠ High time to first token (TTFT)  [high confidence] ───────────────────────────────────────────────────────────────╮
│   TTFT p95: 3.200s  ·  TPOT p95: 0.050s  ·  Waiting requests: 7                                                      │
│                                                                                                                      │
│   → Enable or tune chunked prefill (--enable-chunked-prefill)                                                        │
│   → Reduce max prompt length or filter long requests                                                                 │
│   → Inspect queue depth — consider adding replicas                                                                   │
│   → Separate long-context traffic to dedicated instances                                                             │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ ⚠ Replica imbalance  [high confidence] ─────────────────────────────────────────────────────────────────────────────╮
│   meta-llama/Llama-3.1-8B: running vllm-1=10 vs vllm-0=2; cache 94% vs 41%; waiting vllm-1=7 vs vllm-0=0             │
│                                                                                                                      │
│   → Check the load balancer / service routing and session affinity settings                                          │
│   → Verify readiness probes — an unready replica receives no traffic                                                 │
│   → Compare per-replica latency and restart any unhealthy replica                                                    │
│   → Confirm newly added replicas are registered with the load balancer                                               │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ ⚠ Queue pressure  [low confidence] ─────────────────────────────────────────────────────────────────────────────────╮
│   Waiting requests: 7 (threshold: 5)                                                                                 │
│                                                                                                                      │
│   → Add replicas or increase concurrency limits                                                                      │
│   → Inspect autoscaling thresholds                                                                                   │
│   → Separate long-context traffic to a dedicated replica                                                             │
│   → Reduce incoming request rate                                                                                     │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

  KV Cache Pressure          ✖ critical    [high]
  High TTFT                  ⚠ warning     [high]
  Replica Imbalance          ⚠ warning     [high]
  Queue Pressure             ⚠ warning     [low]
  Queue Latency              ✓ ok
  Preemption Pressure        ✓ ok
  Low Throughput             ✓ ok
  Error Rate                 ✓ ok
  High TPOT                  ✓ ok
  Prefix Cache Efficiency    ✓ ok

─────────────────────────────────────────────────── Observed Metrics ───────────────────────────────────────────────────

  Summary
  Requests Running                               12
  Requests Waiting                                7
  GPU Cache Usage          ███████████████████░ 94%
  Prefill Tokens/s                            390.0
  Decode Tokens/s                             252.0
  Requests Success                              114
  Requests Error                                  0
  Requests Aborted                                0
  TTFT p95 (s)                                3.200
  TPOT p95 (s)                                0.050
  Queue Time p95 (s)                          0.800
  Preemptions Total                               0
  Prefix Cache Hit Rate                         50%

─────────────────────────────────────────────── Observed Metrics per pod ───────────────────────────────────────────────

                       vllm-1    vllm-0
  Requests Running         10         2
  Requests Waiting          7         0
  GPU Cache Usage         94%       41%
  Prefill Tokens/s       80.0     310.0
  Decode Tokens/s        42.0     210.0
  Requests Success         30        84
  Requests Error            0         0
  Requests Aborted          0         0
  Preemptions Total         0         0
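
The summary and per-pod tables above are consistent with a simple aggregation: counters and rates are summed across pods, while GPU cache usage reports the worst pod. The Python sketch below reproduces the example numbers under that assumption. The aggregation rule is inferred from this one example, not taken from vllm-doctor's source, and the field and function names are hypothetical.

```python
# Illustrative sketch only; the aggregation rule is inferred from the example.
PODS = {
    "vllm-1": {"running": 10, "waiting": 7, "cache": 0.94,
               "prefill_tps": 80.0, "decode_tps": 42.0, "success": 30},
    "vllm-0": {"running": 2, "waiting": 0, "cache": 0.41,
               "prefill_tps": 310.0, "decode_tps": 210.0, "success": 84},
}


def summarize(pods: dict[str, dict]) -> dict:
    """Sum additive metrics across pods; report the highest cache usage."""
    additive = ["running", "waiting", "prefill_tps", "decode_tps", "success"]
    summary = {key: sum(p[key] for p in pods.values()) for key in additive}
    summary["cache"] = max(p["cache"] for p in pods.values())
    return summary


def extremes(pods: dict[str, dict], key: str) -> tuple[str, str]:
    """Return (pod with the highest value, pod with the lowest value)."""
    ordered = sorted(pods, key=lambda name: pods[name][key])
    return ordered[-1], ordered[0]


summary = summarize(PODS)
busiest, idlest = extremes(PODS, "running")
```

Here `summary` matches the Summary table (12 running, 7 waiting, 390.0 prefill tokens/s, 252.0 decode tokens/s, 114 successes, 94% cache), and `extremes` picks out vllm-1 and vllm-0, the pair named in the replica imbalance finding.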

Documentation

Read the full documentation: https://aminalaee.github.io/vllm-doctor

Release history

0.6.0 (this version): Jun 12, 2026
0.5.0: Jun 9, 2026
0.4.0: Jun 8, 2026
0.3.0: Jun 1, 2026
0.2.0: May 28, 2026
0.1.0: May 18, 2026

Download files

Source Distribution

vllm_doctor-0.6.0.tar.gz (28.1 kB)

Uploaded Jun 12, 2026 Source

Built Distribution

vllm_doctor-0.6.0-py3-none-any.whl (46.4 kB)

Uploaded Jun 12, 2026 Python 3

File details

Details for the file vllm_doctor-0.6.0.tar.gz.

File metadata

Download URL: vllm_doctor-0.6.0.tar.gz
Upload date: Jun 12, 2026
Size: 28.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.11.21 {"installer":{"name":"uv","version":"0.11.21","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for vllm_doctor-0.6.0.tar.gz
Algorithm	Hash digest
SHA256	`76b7d60eaeeaed064c1f369a5db0579a8b687645f4ad8152fdd9274109882963`
MD5	`69ca92bb5847f2d8d471f2776771c1e2`
BLAKE2b-256	`0d2adee3793eae9269b7464cc2df58c4f2a0836e2be803803fbb45490a99754f`

File details

Details for the file vllm_doctor-0.6.0-py3-none-any.whl.

File metadata

Download URL: vllm_doctor-0.6.0-py3-none-any.whl
Upload date: Jun 12, 2026
Size: 46.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.11.21 {"installer":{"name":"uv","version":"0.11.21","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for vllm_doctor-0.6.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`820e1bda76ce99de685249c3967430c7c1fbe3be10bc60f97aec7b47775f4d10`
MD5	`e06ecd2185f50673870bc33909028958`
BLAKE2b-256	`c617eee3364db5f09e2b3bd7bfe07513cd2d541f6bbdf157d6c9be2e5125509c`
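
The SHA256 digests listed above can be checked against a downloaded file with Python's standard library. The sketch below is one way to do it. The helper names are hypothetical, and the file paths assume the archives sit in the current directory.

```python
# Minimal sketch: verify a downloaded file against a published SHA256 digest.
import hashlib
import hmac

# Digests copied from the "File hashes" tables above.
EXPECTED_SHA256 = {
    "vllm_doctor-0.6.0.tar.gz":
        "76b7d60eaeeaed064c1f369a5db0579a8b687645f4ad8152fdd9274109882963",
    "vllm_doctor-0.6.0-py3-none-any.whl":
        "820e1bda76ce99de685249c3967430c7c1fbe3be10bc60f97aec7b47775f4d10",
}


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_hex: str) -> bool:
    """Return True when the file's SHA256 matches the expected digest."""
    return hmac.compare_digest(sha256_of(path), expected_hex.lower())
```

For example, `verify("vllm_doctor-0.6.0.tar.gz", EXPECTED_SHA256["vllm_doctor-0.6.0.tar.gz"])` should return True for an intact download. pip can enforce the same check at install time through hash-pinned requirements files.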
