htop-style terminal monitor for vLLM inference servers

vllm-htop

htop for vLLM inference servers: point it at one or more /metrics endpoints and get windowed throughput, latency percentiles, and saturation gauges in your terminal.

Zero dependencies. Single file. Python 3.8+.

vLLM DP Monitor  │  4/4 up  │  2026-05-18 14:23:01  (interval=2.0s)
──────────────────────────────────────────────────────────────────────────────────
 DP  Status   Run  Wait  Swap   KV%      in tok/s  out tok/s   TTFT-P95  TPOT-P95
──────────────────────────────────────────────────────────────────────────────────
 0   OK       12     0     0    55.0%       49793      16597       410ms     37.0ms
 1   OK       11     0     0    58.0%       47841      15947       415ms     38.0ms
 2   OK       18     6     0    91.0%       69738      23246       820ms     52.0ms
 3   OK       12     0     0    57.0%       48100      16100       420ms     38.0ms
──────────────────────────────────────────────────────────────────────────────────
 ALL          53     6     0   max91.0%      215472      71890       512ms     41.0ms

▸ Imbalance check  (across 4 replicas)
  Running req     :    11  →  18    (Δ=7)             ⚠ load-balancer skew?
  KV cache        :  55.0% → 91.0%  (Δ=36.0pp)        ⚠ uneven KV pressure
  TTFT P95        :   410ms → 820ms (2.00×)           ⚠ slow replica
  TPOT P95        :  37.0ms → 52.0ms (1.41×)

▸ Cumulative  (life = vLLM counters · sess = peaks observed since monitor uptime 12m34s)
──────────────────────────────────────────────────────────────────────────────────
 DP   life-Prompt  life-Output  life-Reqs   peak-Run  peak-Wait  peak-KV%   peak in/out tok/s
──────────────────────────────────────────────────────────────────────────────────
 0        12.3M         3.4M        10.2K       19         4     71.3%   52.1K/17.4K
 1        11.9M         3.3M         9.9K       17         2     67.8%   50.3K/16.8K
 2        13.1M         3.7M        11.2K       28        12     91.0%   72.4K/24.1K
 3        12.1M         3.4M        10.1K       18         3     68.5%   51.2K/17.1K
──────────────────────────────────────────────────────────────────────────────────
 ALL      49.4M        13.8M       41.4K

Why?

vLLM exports a rich Prometheus /metrics endpoint with everything you need to understand serving performance — TTFT/TPOT/E2E histograms, KV cache usage, queue depth, swap counts. But...

...running production Prometheus + Grafana is overkill when you just SSH'd in and want to know if a server is healthy right now
...curl /metrics | grep can't compute windowed percentiles or rates
...when you run Data Parallel replicas, you really want side-by-side comparison and imbalance detection, which the default vLLM Grafana dashboard doesn't surface at all

vllm-htop fills the gap between Grafana (always-on, persistent) and curl (one-off, raw). It complements both and replaces neither.

Install

The fastest way — no install needed (recommended):

uvx vllm-htop --url http://localhost:8000

With pip:

pip install vllm-htop
vllm-htop --url http://localhost:8000

Or just grab the single file and run it (no dependencies needed beyond Python 3.8+):

curl -O https://raw.githubusercontent.com/eyuansu62/vllm-htop/main/vllm_htop.py
python vllm_htop.py --url http://localhost:8000

Usage

Single instance — detail view

vllm-htop --url http://localhost:8000

Shows P50/P95/P99 for TTFT, TPOT, E2E, and queue time, plus current saturation gauges and lifetime cumulative counters.

DP / multiple replicas — comparison table

# Space-separated
vllm-htop --url http://h1:8000 http://h2:8000 http://h3:8000 http://h4:8000

# Comma-separated
vllm-htop --url http://h1:8000,http://h2:8000,http://h3:8000,http://h4:8000

# Shell brace expansion (most concise)
vllm-htop --url http://localhost:{8000,8001,8002,8003}

Automatically switches to compact per-replica rows + aggregate + imbalance check.

Auto-discovery — one machine, many DP replicas

If you don't pass --url, vllm-htop scans localhost:8000-8015 for vLLM-shaped /metrics endpoints and attaches to whatever it finds. So when you have multiple vllm serve processes on the same host (one per port), monitoring all of them is just:

vllm-htop

It reports what it discovered only when that is informative (≥2 endpoints found, or --auto was passed explicitly); the single-instance case stays quiet.

# Force discovery (fails loudly if nothing's found — useful in scripts)
vllm-htop --auto

# Wider range, different host
vllm-htop --auto --host 10.0.0.7 --port-range 9000-9031

Discovery runs a parallel TCP probe over the range, then HTTP-probes only the open ports for the vllm: metric-name prefix, so it stays fast (typically <100ms on a localhost scan) even on wide ranges. Non-vLLM services listening in the range are filtered out rather than mistaken for replicas.
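The two-stage probe described above could be sketched roughly as follows. This is a minimal standard-library illustration, not vllm-htop's actual code: the function names (`tcp_open`, `looks_like_vllm`, `http_is_vllm`, `discover`), the timeouts, and the worker count are all assumptions.

```python
import http.client
import socket
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import List


def tcp_open(host: str, port: int, timeout: float = 0.2) -> bool:
    """Stage 1: cheap check that something is listening on the port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def looks_like_vllm(metrics_text: str) -> bool:
    """True if any sample line uses the vllm: metric-name prefix."""
    return any(line.startswith("vllm:") for line in metrics_text.splitlines())


def http_is_vllm(host: str, port: int, timeout: float = 1.0) -> bool:
    """Stage 2: fetch /metrics and look for vLLM-shaped output."""
    url = f"http://{host}:{port}/metrics"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return looks_like_vllm(resp.read().decode("utf-8", "replace"))
    except (OSError, http.client.HTTPException, ValueError):
        return False  # not HTTP, not reachable, or not parseable


def discover(host: str, ports: List[int]) -> List[int]:
    """Return the ports in `ports` that serve vLLM metrics."""
    if not ports:
        return []
    with ThreadPoolExecutor(max_workers=min(32, len(ports))) as pool:
        open_flags = list(pool.map(lambda p: tcp_open(host, p), ports))
        candidates = [p for p, ok in zip(ports, open_flags) if ok]
        vllm_flags = list(pool.map(lambda p: http_is_vllm(host, p), candidates))
    return [p for p, ok in zip(candidates, vllm_flags) if ok]


# Example: scan the default range on this machine.
# found = discover("localhost", list(range(8000, 8016)))
```

Probing TCP first keeps the slower HTTP request off closed ports, which is why a wide range can still finish quickly.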

If discovery turns up nothing and you didn't pass --auto, the tool falls back to http://<host>:8000 and surfaces the real fetch error there — more useful than a generic "no endpoints found".

Cost estimation (optional)

Two independent pricing models; either or both can be enabled:

Token-based — explicit prices in $/1M tokens (OpenAI-style convention):

vllm-htop --cost-in 0.50 --cost-out 1.50

Compute-based — auto-detected from nvidia-smi, with a built-in GPU price-hint table:

# Just run it. If `nvidia-smi` is on PATH, vllm-htop reads the GPU model and
# count, looks up a community-market reference rate, and shows compute burn.
vllm-htop

# Or override the rate / count explicitly:
vllm-htop --gpu-cost-hour 2.99 --num-gpus 8

# Skip the auto-detect entirely:
vllm-htop --no-gpu-detect

The built-in hints cover:

Blackwell datacenter: B200, B100, GB200
Blackwell workstation/consumer: RTX PRO 6000, RTX 5090, RTX 5080
Hopper: H100, H100 NVL, H200
Ampere: A100 (40/80GB), A40, A30, A10, A10G, RTX A6000/A5000/A4000, RTX 3090
Ada Lovelace: L40S, L40, L4, RTX 6000 Ada, RTX 4090, RTX 4080
Older datacenter: V100, T4

Prices are anchored to RunPod Secure tier published rates as of 2026-05 — this is what OpenRouter-class token-API providers (Lambda, Hyperbolic, DeepInfra, …) typically pay for their compute, so it's the most representative "GPU rental cost" for someone running their own vLLM stack. Cross-provider variance:

AWS / GCP on-demand: 3-5× higher
Lambda Labs: within ±10%
RunPod Community: 20-40% lower
vast.ai community: 30-50% lower (high variance)

Treat the numbers as a ballpark (±30%) and override via --gpu-cost-hour for anything serious.

Both at once — also surfaces a Margin row (token revenue ÷ compute cost):

vllm-htop --cost-in 0.50 --cost-out 1.50 --gpu-cost-hour 2.99 --num-gpus 8

Example output:

▸ Cost  (estimated · sum across 3 replicas)
  Token-based  ($0.5/M in, $1.5/M out)
    Lifetime     :      $165.17  ($75.08 in + $90.09 out)
    This session :        $0.13  (over 2m11s)
    Current rate :        $3.86/min  ($231.55/hour at current throughput)
  Compute-based  (NVIDIA H100 80GB HBM3 × 8 @ $2.99/h — auto-detected, estimate)
    Burn rate    :       $23.92/hour  (paid whether busy or idle)
    This session :         $0.87  (over 2m11s)
  Margin (token revenue ÷ compute cost)
    At current load :       9.68×  ($231.55/h revenue vs $23.92/h compute)

The Cost section is hidden when no pricing is configured (no --cost-* flags and GPU auto-detect found nothing).
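The arithmetic behind the example output can be reproduced with a few lines. This is a sketch under assumptions: the helper names are hypothetical, and the token counts are back-derived from the dollar figures in the sample rather than taken from a real server.

```python
from typing import Optional


def token_cost(prompt_tokens: float, output_tokens: float,
               cost_in: float, cost_out: float) -> float:
    """Token-based cost; prices are USD per 1M tokens."""
    return prompt_tokens / 1e6 * cost_in + output_tokens / 1e6 * cost_out


def compute_burn_per_hour(gpu_cost_hour: float, num_gpus: int) -> float:
    """Compute-based burn rate: paid whether the GPUs are busy or idle."""
    return gpu_cost_hour * num_gpus


def session_cost(burn_per_hour: float, session_seconds: float) -> float:
    """Compute cost accrued over the monitoring session."""
    return burn_per_hour * session_seconds / 3600.0


def margin(revenue_per_hour: float, compute_per_hour: float) -> Optional[float]:
    """Token revenue divided by compute cost; None when compute cost is zero."""
    if compute_per_hour <= 0:
        return None
    return revenue_per_hour / compute_per_hour


# Figures from the example output: 8 GPUs at $2.99/h, session of 2m11s.
burn = compute_burn_per_hour(2.99, 8)
print(f"Burn rate    : ${burn:.2f}/hour")                      # $23.92/hour
print(f"This session : ${session_cost(burn, 131):.2f}")        # $0.87
print(f"Margin       : {margin(231.55, burn):.2f}x")           # 9.68x

# Hypothetical lifetime token counts that match the sample's $165.17.
print(f"Lifetime     : ${token_cost(150_160_000, 60_060_000, 0.50, 1.50):.2f}")
```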

Flags

Flag	Default	What it does
`--url URL [URL ...]`	(auto-discovery)	Explicit vLLM base URLs. Overrides auto-discovery
`--auto`	(implicit default)	Force discovery, fail loudly if nothing found. Without `--url`, discovery already runs implicitly
`--host HOST`	`localhost`	Hostname for discovery and the fallback URL
`--port-range LO-HI`	`8000-8015`	Port range for discovery (e.g. `8000-8015`, `8000:8015`)
`--interval N`	`2.0`	Refresh interval in seconds
`--timeout N`	`4.0`	Per-endpoint fetch timeout
`--once`	off	Print one snapshot and exit (good for cron / CI smoke tests)
`--table`	auto	Force compact table view
`--cost-in PRICE`	off	USD per 1M input (prompt) tokens — enables token-based Cost section
`--cost-out PRICE`	off	USD per 1M output (generation) tokens
`--gpu-cost-hour PRICE`	auto	USD per GPU-hour. Defaults to a built-in hint based on `nvidia-smi` detection
`--num-gpus N`	auto	GPU count. Defaults to `nvidia-smi` count
`--no-gpu-detect`	off	Skip `nvidia-smi` auto-detection entirely
`--currency SYM`	`$`	Currency symbol shown in the Cost section
`--detail`	auto	Force per-instance detail view

What it shows

Throughput (windowed)

Token and request rates computed from the delta between the last two polls — reflects recent behavior, not lifetime average.
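As an illustration, a windowed rate from two successive counter readings might look like this. The function name and the handling of a counter reset are assumptions made for the sketch, not a description of vllm-htop's internals.

```python
from typing import Optional


def windowed_rate(prev_value: float, prev_time: float,
                  curr_value: float, curr_time: float) -> Optional[float]:
    """Per-second rate of a monotonically increasing counter between two polls."""
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        return None  # no usable window yet
    delta = curr_value - prev_value
    if delta < 0:
        return None  # counter went backwards; assume the server restarted
    return delta / elapsed


# Two polls 2.0 s apart, matching the default --interval.
rate = windowed_rate(1_000_000, 100.0, 1_099_586, 102.0)
print(f"{rate:.0f} tok/s")  # 49793 tok/s
```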

Latency (windowed percentiles)

P50/P95/P99 for TTFT, TPOT, E2E, queue time. Percentiles come from histogram bucket deltas between polls — equivalent to Prometheus' histogram_quantile(0.95, rate(..._bucket[Δ])).
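A minimal sketch of that calculation, assuming cumulative Prometheus-style buckets keyed by their upper bound and the same linear interpolation that histogram_quantile uses. The names `bucket_delta` and `quantile` are hypothetical, and the tool's exact interpolation may differ.

```python
import math
from typing import Dict, Optional

Buckets = Dict[float, float]  # upper bound ("le") -> cumulative count


def bucket_delta(prev: Buckets, curr: Buckets) -> Buckets:
    """Observations that landed in each cumulative bucket between two polls."""
    return {le: curr[le] - prev.get(le, 0.0) for le in curr}


def quantile(q: float, buckets: Buckets) -> Optional[float]:
    """Estimate the q-quantile from cumulative buckets by linear interpolation."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    bounds = sorted(buckets)
    if not bounds or bounds[-1] != math.inf:
        return None  # a +Inf bucket is required to know the total count
    total = buckets[math.inf]
    if total <= 0:
        return None  # nothing observed in this window
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for le in bounds:
        count = buckets[le]
        if count >= rank:
            if le == math.inf:
                return lower_bound  # fall back to the highest finite bound
            span = count - lower_count
            return lower_bound + (le - lower_bound) * (rank - lower_count) / span
        lower_bound, lower_count = le, count
    return None


# Two polls of a latency histogram (seconds); the values are made up.
prev = {0.1: 100, 0.5: 180, 1.0: 198, math.inf: 200}
curr = {0.1: 140, 0.5: 270, 1.0: 297, math.inf: 300}
window = bucket_delta(prev, curr)          # {0.1: 40, 0.5: 90, 1.0: 99, inf: 100}
print(f"P50 = {quantile(0.50, window):.3f} s")  # 0.180
print(f"P95 = {quantile(0.95, window):.3f} s")  # 0.778
```

Because only the delta between polls is used, requests served before the window opened do not influence the result.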

Saturation (current gauges)

Running / waiting / swapped requests, plus KV cache usage with a colored bar (green < 65%, yellow < 85%, red ≥ 85%).

Imbalance check (DP only, ≥2 replicas)

Check	Threshold	Means
Running req	ratio > 1.5× and Δ > 3	⚠ load-balancer skew / sticky session
KV cache	Δ > 15 percentage points	⚠ uneven KV pressure (prefix-cache asymmetry?)
TTFT P95	max/min > 1.5×	⚠ slow replica (GPU thermal, NCCL, contention)
TPOT P95	max/min > 1.5×	⚠ slow decode
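The thresholds in the table translate into checks along these lines. This is a sketch with hypothetical function names; treating a minimum of zero as an infinite ratio is an assumption. The inputs are the per-replica values from the sample table at the top of this page.

```python
from typing import List


def _ratio(values: List[float]) -> float:
    """max/min, with a zero minimum treated as an infinite ratio (assumption)."""
    lo, hi = min(values), max(values)
    return hi / lo if lo > 0 else float("inf")


def running_skew(running: List[float]) -> bool:
    """Running req: ratio > 1.5x and delta > 3."""
    return _ratio(running) > 1.5 and (max(running) - min(running)) > 3


def kv_skew(kv_percent: List[float]) -> bool:
    """KV cache: spread above 15 percentage points."""
    return (max(kv_percent) - min(kv_percent)) > 15.0


def latency_skew(p95_ms: List[float]) -> bool:
    """TTFT / TPOT P95: max/min > 1.5x."""
    return _ratio(p95_ms) > 1.5


print(running_skew([12, 11, 18, 12]))            # True  (18/11 = 1.64x, delta 7)
print(kv_skew([55.0, 58.0, 91.0, 57.0]))         # True  (36 pp)
print(latency_skew([410, 415, 820, 420]))        # True  (2.00x)
print(latency_skew([37.0, 38.0, 52.0, 38.0]))    # False (1.41x)
```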

Cumulative

Two clearly-labelled sources:

life — read directly from vLLM *_total counters: prompt tokens, output tokens, successful requests since vLLM started
sess — peaks observed by the monitor since it started watching: peak running / waiting / KV% / tokens/s

swap-seen is sticky within a session: if swapping fires once, it stays red as a warning even after it recovers.
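Session peaks and the sticky swap flag can be modelled as a small in-memory accumulator. The class below is a hypothetical sketch of the idea, not the tool's data structure.

```python
from dataclasses import dataclass


@dataclass
class SessionPeaks:
    """Peaks observed by the monitor since it started; lost when it exits."""
    peak_running: int = 0
    peak_waiting: int = 0
    peak_kv_percent: float = 0.0
    swap_seen: bool = False

    def observe(self, running: int, waiting: int,
                swapped: int, kv_percent: float) -> None:
        self.peak_running = max(self.peak_running, running)
        self.peak_waiting = max(self.peak_waiting, waiting)
        self.peak_kv_percent = max(self.peak_kv_percent, kv_percent)
        # Sticky: once swapping has been seen, the flag never clears.
        self.swap_seen = self.swap_seen or swapped > 0


peaks = SessionPeaks()
peaks.observe(running=18, waiting=6, swapped=0, kv_percent=91.0)
peaks.observe(running=28, waiting=12, swapped=2, kv_percent=88.0)
peaks.observe(running=12, waiting=0, swapped=0, kv_percent=57.0)
print(peaks)
```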

Design notes

Aggregate percentiles across DP are computed by merging histogram buckets and recomputing the quantile; that is the only sound way to combine them. Averaging per-replica P95s is wrong, because a percentile is not an average and a busy replica should weigh more than an idle one.
DOWN replicas are isolated — they don't break the table, aggregate, or imbalance check. The header shows 3/4 up and the offending row stays visible with its error.
STALE status: the last fetch failed but an older snapshot is still available, which covers transient network blips.
Parallel polling via ThreadPoolExecutor — refresh time stays ≈ slowest single fetch regardless of replica count.
Metric-name matching is substring-based (time_to_first_token, cache_usage_perc) so the tool tolerates vLLM version drift between vllm:gpu_cache_usage_perc and vllm:kv_cache_usage_perc.
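To see why merging buckets matters, the sketch below compares a merged P95 with the average of two per-replica P95s. It assumes all replicas share the same bucket boundaries and uses Prometheus-style linear interpolation; the function names and the numbers are made up for illustration.

```python
import math
from typing import Dict, List

Buckets = Dict[float, float]  # upper bound ("le") -> cumulative count


def merge(histograms: List[Buckets]) -> Buckets:
    """Sum cumulative counts bucket by bucket (same boundaries assumed)."""
    merged: Buckets = {}
    for hist in histograms:
        for le, count in hist.items():
            merged[le] = merged.get(le, 0.0) + count
    return merged


def quantile(q: float, buckets: Buckets) -> float:
    """Linear-interpolated quantile; expects a non-empty histogram with +Inf."""
    rank = q * buckets[math.inf]
    lower_bound, lower_count = 0.0, 0.0
    for le in sorted(buckets):
        count = buckets[le]
        if count >= rank:
            if le == math.inf:
                return lower_bound
            span = count - lower_count
            return lower_bound + (le - lower_bound) * (rank - lower_count) / span
        lower_bound, lower_count = le, count
    return lower_bound


# A busy, fast replica and a nearly idle, slow one (latency in seconds).
fast = {0.1: 900, 0.5: 1000, 1.0: 1000, math.inf: 1000}
slow = {0.1: 0, 0.5: 0, 1.0: 10, math.inf: 10}

averaged = (quantile(0.95, fast) + quantile(0.95, slow)) / 2
merged = quantile(0.95, merge([fast, slow]))
print(f"average of P95s: {averaged:.4f} s")  # 0.6375
print(f"merged P95     : {merged:.4f} s")    # 0.3380
```

The averaged figure lets 10 slow requests count as much as 1000 fast ones; the merged figure weighs each request equally.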

Internal DP (engine labels) is auto-split

When you run vllm serve --data-parallel-size N, vLLM exposes one /metrics endpoint whose samples are tagged with engine="0".."N-1". vllm-htop detects this on first contact and expands the single URL into one virtual replica per engine — so the comparison table, imbalance check, and aggregate percentiles all work just like they do for separate-process external DP.

Naming convention in the table:

Setup	Replica names
Pure external (N URLs, no engine label)	`0`, `1`, `2`, …
Pure internal (1 URL, N engines)	`e0`, `e1`, `e2`, …
Mixed (M URLs × N engines each)	`0.e0`, `0.e1`, `1.e0`, …

No new flag — detection runs automatically on startup.
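One way the split and the naming convention could work is sketched below: group the samples of a /metrics body by their engine label, then name the resulting virtual replicas as in the table. The function names are hypothetical, and edge cases such as a single URL exposing exactly one engine label are handled by assumption.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional

ENGINE_RE = re.compile(r'engine="([^"]*)"')


def split_by_engine(metrics_text: str) -> Dict[Optional[str], List[str]]:
    """Group sample lines by engine label; None collects unlabelled samples."""
    groups: Dict[Optional[str], List[str]] = defaultdict(list)
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = ENGINE_RE.search(line)
        groups[match.group(1) if match else None].append(line)
    return dict(groups)


def replica_names(engines_per_url: List[List[str]]) -> List[str]:
    """engines_per_url[i] lists engine labels seen at URL i (empty if none)."""
    multi_url = len(engines_per_url) > 1
    names: List[str] = []
    for index, engines in enumerate(engines_per_url):
        if not engines:
            names.append(str(index))                              # pure external
        elif multi_url:
            names.extend(f"{index}.e{e}" for e in engines)        # mixed
        else:
            names.extend(f"e{e}" for e in engines)                # pure internal
    return names


sample = '''
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{engine="0"} 12.0
vllm:num_requests_running{engine="1"} 11.0
'''
engines = sorted(k for k in split_by_engine(sample) if k is not None)
print(replica_names([engines]))                 # ['e0', 'e1']
print(replica_names([engines, engines]))        # ['0.e0', '0.e1', '1.e0', '1.e1']
print(replica_names([[], [], []]))              # ['0', '1', '2']
```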

Limitations

Peaks are in-memory only — when the script exits, session peaks are lost. For long-term persistence, use Prometheus.
No alerting — this is a viewer, not a notifier. For real alerting see Andrey Krisanov's vLLM Prometheus rules as a starting point.

Acknowledgments

Thanks to the vLLM project for exposing rich metrics by default, and for shipping a reference Grafana dashboard that informed the choice of which metrics matter most.

License

MIT

Project details

Maintainers

eyuansu62

Release history

Version	Released
0.5.0	May 26, 2026
0.4.10	May 26, 2026
0.4.9	May 26, 2026
0.4.7	May 21, 2026
0.4.6	May 21, 2026
0.4.5	May 21, 2026
0.4.4	May 21, 2026
0.4.3	May 20, 2026
0.4.2	May 20, 2026
0.4.0	May 20, 2026
0.3.3	May 20, 2026
0.3.2	May 19, 2026
0.3.0	May 19, 2026
0.2.2 (this version)	May 19, 2026
0.2.1	May 19, 2026
0.2.0	May 19, 2026
0.1.1	May 18, 2026
0.1.0	May 18, 2026

Download files

Source Distribution

vllm_htop-0.2.2.tar.gz (25.7 kB)

Uploaded May 19, 2026 Source

Built Distribution

vllm_htop-0.2.2-py3-none-any.whl (26.3 kB)

Uploaded May 19, 2026 Python 3

File details

Details for the file vllm_htop-0.2.2.tar.gz.

File metadata

Download URL: vllm_htop-0.2.2.tar.gz
Upload date: May 19, 2026
Size: 25.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for vllm_htop-0.2.2.tar.gz
Algorithm	Hash digest
SHA256	`7a099b2283c423b188bbc665c5a6a1fbfc2bb9a0083a5fbbe532e616a6ed4688`
MD5	`48da00b264b6996fd56bb750a6705578`
BLAKE2b-256	`dd3cb60e01069a794450d62f99ce435921215ca42d3c408e660888f80e2d53e8`

Provenance

The following attestation bundles were made for vllm_htop-0.2.2.tar.gz:

Publisher: publish.yml on eyuansu62/vllm-htop

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vllm_htop-0.2.2.tar.gz
- Subject digest: 7a099b2283c423b188bbc665c5a6a1fbfc2bb9a0083a5fbbe532e616a6ed4688
- Sigstore transparency entry: 1572241085
- Sigstore integration time: May 19, 2026
Source repository:
- Permalink: eyuansu62/vllm-htop@01598ac0587da486dde09f7b721ea387735d2884
- Branch / Tag: refs/tags/v0.2.2
- Owner: https://github.com/eyuansu62
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@01598ac0587da486dde09f7b721ea387735d2884
- Trigger Event: push

File details

Details for the file vllm_htop-0.2.2-py3-none-any.whl.

File metadata

Download URL: vllm_htop-0.2.2-py3-none-any.whl
Upload date: May 19, 2026
Size: 26.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for vllm_htop-0.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c33ae7cbf9ab144c551c59a191219c7a6193bfa272f955fc4e4ae92f057c321a`
MD5	`b4488dde1359dd4be55ad0668de96551`
BLAKE2b-256	`253a5c0f70d4a41ab47c230e076753bf8eab94ca4a143e654362ca81ff330850`

Provenance

The following attestation bundles were made for vllm_htop-0.2.2-py3-none-any.whl:

Publisher: publish.yml on eyuansu62/vllm-htop

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vllm_htop-0.2.2-py3-none-any.whl
- Subject digest: c33ae7cbf9ab144c551c59a191219c7a6193bfa272f955fc4e4ae92f057c321a
- Sigstore transparency entry: 1572241105
- Sigstore integration time: May 19, 2026
Source repository:
- Permalink: eyuansu62/vllm-htop@01598ac0587da486dde09f7b721ea387735d2884
- Branch / Tag: refs/tags/v0.2.2
- Owner: https://github.com/eyuansu62
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@01598ac0587da486dde09f7b721ea387735d2884
- Trigger Event: push
