htop-style terminal monitor for vLLM inference servers
vllm-htop
htop for vLLM inference servers. Point it at one or more /metrics endpoints, get the right numbers, the right way, right now.
Zero dependencies. Single file. Python 3.8+.
At a glance
- Auto-discovers vLLM endpoints on the host: no `--url` needed for typical local setups
- Auto-splits internal DP: `vllm serve --data-parallel-size N` becomes N rows automatically
- Model-aware row names: `<model>.e0` instead of `0.e0`, so mixed deployments (LLM + embedding) are readable
- Windowed + long-window percentiles: P50/P95/P99 over ~2s, plus stabilized `P95@1m` for SLO reads
- Prefix cache hit rate column when vLLM exposes it
- Imbalance check that points to the bad replica by name, median-based and grouped by model
- Cost estimation: token-based, compute-based (auto-detected from `nvidia-smi`), and a margin row
- htop-style alt-screen rendering: fixed window refresh, scrollback stays clean
- JSON output mode for piping into scripts, logs, or alerting
- Trend sparklines in detail view: 60-sample rolling history per metric
- Cross-DP percentile aggregation done correctly (merged buckets, not averaged P95s)
- Fault-tolerant: DOWN / STALE replicas surface without breaking the table
Install
# Recommended: uvx (zero install, always fresh)
uvx vllm-htop@latest
# pip
pip install vllm-htop
vllm-htop
# Or grab the single file and run it
curl -O https://raw.githubusercontent.com/eyuansu62/vllm-htop/main/vllm_htop.py
python vllm_htop.py
Quick start
vllm-htop
With no flags, vllm-htop:
- Scans `localhost:8000-8015` for vLLM endpoints (parallel TCP + `/metrics` probe, <100ms)
- Detects internal DP via the `engine="N"` label and expands each URL into per-engine rows
- Picks table view if it found ≥2 replicas, detail view otherwise
- Runs `nvidia-smi` to detect local GPUs and shows compute cost when the model matches the built-in price table
- Refreshes every 2s in alt-screen mode (no scrollback pollution)
- Ctrl-C exits; original terminal contents return
Features
Multi-replica DP — comparison table
vllm-htop --url http://h1:8000 http://h2:8000 http://h3:8000 http://h4:8000
# Comma-separated
vllm-htop --url http://h1:8000,http://h2:8000,http://h3:8000,http://h4:8000
# Shell brace expansion
vllm-htop --url http://localhost:{8000,8001,8002,8003}
Compact per-replica rows + aggregate ALL row + cross-replica imbalance check.
Auto-discovery
If you don't pass --url, vllm-htop scans the configured port range for vLLM-shaped /metrics endpoints. Open ports get an HTTP probe checking for the vllm: metric prefix; non-vLLM services on the same ports are filtered out.
vllm-htop # implicit, falls back to localhost:8000 if nothing's found
vllm-htop --auto # forced; fails loudly if nothing found
vllm-htop --auto --host 10.0.0.7 # remote host
vllm-htop --auto --port-range 9000-9031 # wider range
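The two-stage probe described above (TCP connect first, then an HTTP check for the `vllm:` prefix) can be sketched roughly as follows. This is an illustration under assumptions, not the tool's actual code: the function names `looks_like_vllm`, `port_is_open`, `probe`, and `discover` and the timeout values are hypothetical.

```python
import http.client
import socket
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def looks_like_vllm(metrics_text: str) -> bool:
    """True if any sample (non-comment) line starts with the vllm: prefix."""
    return any(
        line.startswith("vllm:")
        for line in metrics_text.splitlines()
        if line and not line.startswith("#")
    )


def port_is_open(host: str, port: int, timeout: float = 0.05) -> bool:
    """Cheap TCP connect so closed ports never receive an HTTP request."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe(host: str, port: int, timeout: float = 1.0):
    """Return the base URL if host:port serves vLLM-shaped metrics, else None."""
    if not port_is_open(host, port):
        return None
    url = f"http://{host}:{port}"
    try:
        with urllib.request.urlopen(f"{url}/metrics", timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError, http.client.HTTPException):
        return None  # open port, but not an HTTP service we can read
    return url if looks_like_vllm(body) else None


def discover(host: str = "localhost", lo: int = 8000, hi: int = 8015):
    """Probe every port in [lo, hi] in parallel; keep only vLLM endpoints."""
    ports = range(lo, hi + 1)
    with ThreadPoolExecutor(max_workers=max(1, len(ports))) as pool:
        return [u for u in pool.map(lambda p: probe(host, p), ports) if u]
```

Filtering on sample lines rather than comment lines keeps a service that merely mentions `vllm:` in a `# HELP` string from being picked up.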
Internal DP — auto-split
vllm serve --data-parallel-size N exposes one /metrics endpoint with engine="0".."N-1" labels. vllm-htop detects this on first contact and expands the URL into one virtual replica per engine — the comparison table, imbalance check, and aggregate percentiles work just like for separate-process DP.
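As a rough illustration of the split, the sketch below groups Prometheus samples by their `engine` label so each engine can be treated as its own virtual replica. The helper name `split_by_engine` and the returned shape are assumptions for this example, and the regex does not handle label values that contain `}`.

```python
import re
from collections import defaultdict

# One labeled Prometheus sample line: name{labels} value
SAMPLE_RE = re.compile(r'^([A-Za-z_:][\w:]*)\{([^}]*)\}\s+(\S+)')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def split_by_engine(metrics_text: str) -> dict:
    """Group samples by their engine label.

    Returns {engine_id: {(metric_name, other_labels): value}}.
    Labeled samples without an engine label land under the key None.
    """
    engines = defaultdict(dict)
    for line in metrics_text.splitlines():
        m = SAMPLE_RE.match(line)
        if not m:
            continue  # comments, blank lines, unlabeled samples
        name, raw_labels, value = m.groups()
        labels = dict(LABEL_RE.findall(raw_labels))
        engine = labels.pop("engine", None)
        key = (name, tuple(sorted(labels.items())))
        engines[engine][key] = float(value)
    return dict(engines)
```

If the result has more than one engine key, the URL can be expanded into that many rows; a single key means the endpoint is a plain replica.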
Row naming chooses the most informative form available:
| Setup | Names |
|---|---|
| External DP (N URLs, no engine) | 0, 1, … |
| Internal DP (1 URL, N engines) | e0, e1, … |
| Mixed (M URLs × N engines each) | 0.e0, 0.e1, 1.e0, … |
| Model name extractable from labels | <model>.e0, <other-model>.e1, … |
When model_name labels are present and distinct across URLs, the names use the model — so multi-model deployments (e.g. LLM + embedding on the same host) are readable at a glance. On name collisions (two URLs serving the same model) the tool falls back to URL indices to keep rows unique.
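One way to read the naming rules in the table is sketched below. It is an interpretation, not the tool's code: `row_names` and its tuple input are hypothetical, and it assumes model names are used only when there are at least two URLs and every URL has a distinct model.

```python
def row_names(replicas):
    """Name rows from (url_index, engine, model) tuples.

    engine is None when the endpoint has no engine label; model is None
    when no model_name label could be extracted.
    """
    url_ids = sorted({u for u, _, _ in replicas})
    multi_url = len(url_ids) > 1
    model_of = {u: m for u, _, m in replicas}  # assumes one model per URL
    models = [model_of[u] for u in url_ids]
    # Use model names only if every URL has one and none collide.
    use_model = multi_url and all(models) and len(set(models)) == len(models)

    names = []
    for u, engine, _ in replicas:
        if use_model:
            prefix = model_of[u]
        elif multi_url:
            prefix = str(u)  # fall back to the URL index
        else:
            prefix = ""
        suffix = "" if engine is None else f"e{engine}"
        parts = [p for p in (prefix, suffix) if p]
        names.append(".".join(parts) or str(u))
    return names
```

With two URLs serving the same model the collision check fails, so the rows come out as `0.e0`, `1.e0` rather than two identical `<model>.e0` entries.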
Imbalance check
When ≥2 replicas serve the same model, vllm-htop runs four checks for cross-replica anomalies. Healthy state collapses to one line; warnings name the bad replica explicitly.
▸ Imbalance check (× 4 replicas) ⚠ 1/4 failed
✓ Running req range 5–8, median 6
✓ KV cache range 40.0%–46.0%
⚠ slow-replica (TTFT) <model>.e3: 979ms is 5.2× median (188ms)
✓ slow-decode (TPOT) median 38.0ms, max 52.0ms (1.4×)
| Check | Threshold | Means |
|---|---|---|
| Running req | Δ > 3 and max > 1.5× median | ⚠ load-balancer skew / sticky session |
| KV cache | Δ > 15 percentage points | ⚠ uneven KV pressure (prefix-cache asymmetry?) |
| TTFT P95 | max / median > 1.5× | ⚠ slow replica (GPU thermal, NCCL, contention) |
| TPOT P95 | max / median > 1.5× | ⚠ slow decode |
Two design choices worth noting:
- median, not min, as the baseline ratio denominator. Min would be dragged to zero by any idle replica and produce misleading 75× ratios.
- grouped by model, so a mixed LLM+embedding deployment doesn't cross-compare workloads that are fundamentally different.
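The thresholds in the table translate into a few lines of arithmetic. The sketch below is a minimal version under assumptions: the function names are hypothetical, inputs are plain `{replica_name: value}` dicts for a single model group, and the message format only imitates the sample output above.

```python
from statistics import median


def check_latency(values: dict, threshold: float = 1.5):
    """TTFT/TPOT-style check: warn if max / median exceeds threshold.

    values maps replica name -> P95 latency in ms. Returns (ok, message).
    """
    med = median(values.values())
    worst, worst_val = max(values.items(), key=lambda kv: kv[1])
    if med > 0 and worst_val / med > threshold:
        ratio = worst_val / med
        return False, f"{worst}: {worst_val:.0f}ms is {ratio:.1f}× median ({med:.0f}ms)"
    return True, f"median {med:.1f}ms, max {worst_val:.1f}ms"


def check_running(values: dict) -> bool:
    """Running-req check: OK unless spread > 3 AND max > 1.5× median."""
    lo, hi = min(values.values()), max(values.values())
    return not (hi - lo > 3 and hi > 1.5 * median(values.values()))


def check_kv(values: dict) -> bool:
    """KV-cache check: OK while the spread stays within 15 percentage points."""
    return max(values.values()) - min(values.values()) <= 15.0


ttft = {"e0": 180.0, "e1": 188.0, "e2": 188.0, "e3": 979.0}
print(check_latency(ttft))
```

Because the denominator is the median, three idle replicas with near-zero latency cannot inflate the ratio the way a min-based baseline would.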
Cost estimation
Two independent pricing models; either or both can be enabled:
# Token-based: explicit prices in $/M tokens (OpenAI-style)
vllm-htop --cost-in 0.50 --cost-out 1.50
# Compute-based: auto-detected from nvidia-smi
vllm-htop # auto
vllm-htop --gpu-cost-hour 2.99 --num-gpus 8 # explicit override
vllm-htop --no-gpu-detect # disable auto-detect
# Both — also surfaces the Margin row
vllm-htop --cost-in 0.50 --cost-out 1.50 --gpu-cost-hour 2.99 --num-gpus 8
Each model reports Lifetime (since vLLM started), This session (since vllm-htop attached), and a Current rate / Burn rate for the live read. Margin is token-revenue ÷ compute-cost — colored green ≥2× / yellow ≥1× / red <1×.
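The arithmetic behind the two models and the Margin row is small enough to sketch. The function names here are hypothetical and the example numbers are made up; only the formulas and the color thresholds come from the description above.

```python
def token_revenue(prompt_tokens, output_tokens, cost_in, cost_out):
    """Token-based value in USD; prices are per 1M tokens."""
    return prompt_tokens / 1e6 * cost_in + output_tokens / 1e6 * cost_out


def compute_cost(gpu_cost_hour, num_gpus, seconds):
    """Compute-based cost in USD: $/GPU-hour × GPUs × elapsed hours."""
    return gpu_cost_hour * num_gpus * seconds / 3600.0


def margin_color(margin):
    """green at >= 2x, yellow at >= 1x, red below 1x."""
    if margin >= 2.0:
        return "green"
    if margin >= 1.0:
        return "yellow"
    return "red"


# One hour on 8 GPUs at $2.99/h, serving 40M prompt and 15M output tokens.
revenue = token_revenue(40_000_000, 15_000_000, cost_in=0.50, cost_out=1.50)
cost = compute_cost(2.99, 8, seconds=3600)
print(f"revenue ${revenue:.2f}, compute ${cost:.2f}, "
      f"margin {revenue / cost:.2f}x ({margin_color(revenue / cost)})")
```

Feeding lifetime counters into these functions gives the Lifetime figures; feeding the deltas since attach gives This session.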
Built-in GPU price hints cover:
- Blackwell datacenter: B200, B100, GB200
- Blackwell workstation/consumer: RTX PRO 6000, RTX 5090, RTX 5080
- Hopper: H100, H100 NVL, H200
- Hopper China-market: H20-3e, H20
- Ampere: A100 (40/80GB), A40, A30, A10, A10G, RTX A6000/A5000/A4000, RTX 3090
- Ada Lovelace: L40S, L40, L20 (China), L4, RTX 6000 Ada, RTX 4090, RTX 4080
- Older datacenter: V100, T4
Prices are anchored to RunPod Secure tier published rates (2026-05) — what OpenRouter-class token-API providers (Lambda, Hyperbolic, DeepInfra, …) typically pay for their compute. Cross-provider variance:
| Reference | Vs. our hints |
|---|---|
| AWS / GCP on-demand | 3-5× higher |
| Lambda Labs | within ±10% |
| RunPod Community | 20-40% lower |
| vast.ai community | 30-50% lower |
Treat the numbers as ±30% ballpark; override --gpu-cost-hour for anything serious.
Long-window P95
In the detail view's Latency section, the standard P95 column reflects only the latest poll-to-poll delta, so it is noisy and often empty when no requests completed in those 2s. The P95@1m column shows the same percentile over the last ~60 seconds of accumulated samples: much more stable, and what you'd actually use for an SLO read.
▸ Latency (windowed percentiles)
metric P50 P95 P99 P95@1m
TTFT (ms) 100.0 916.7 3758.6 520.3
TPOT (ms) 8.3 77.8 100.0 65.1
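The difference between the two columns comes down to which pair of cumulative-histogram snapshots gets subtracted. The sketch below shows the idea; `LongWindow` and `bucket_percentile` are hypothetical names, and the in-bucket linear interpolation is an assumption borrowed from how Prometheus `histogram_quantile` works, not something confirmed for vllm-htop.

```python
from collections import deque


def bucket_percentile(bounds, cum_counts, q):
    """Percentile from cumulative bucket counts, interpolated inside the bucket."""
    total = cum_counts[-1]
    if total <= 0:
        return None  # nothing completed in this window
    target = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in zip(bounds, cum_counts):
        if count >= target:
            if bound == float("inf"):
                return prev_bound  # open-ended top bucket: report its lower edge
            span = count - prev_count
            frac = (target - prev_count) / span if span else 0.0
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound


class LongWindow:
    """Timestamped snapshots of cumulative bucket counts, pruned to a horizon."""

    def __init__(self, bounds, horizon=60.0):
        self.bounds, self.horizon = bounds, horizon
        self.snaps = deque()  # (timestamp, cumulative counts)

    def add(self, now, cum_counts):
        self.snaps.append((now, list(cum_counts)))
        while now - self.snaps[0][0] > self.horizon:
            self.snaps.popleft()

    def percentile(self, q, last_n=None):
        """last_n=1 gives the poll-to-poll delta; None spans the whole horizon."""
        if len(self.snaps) < 2:
            return None
        idx = 0 if last_n is None else max(0, len(self.snaps) - 1 - last_n)
        base = self.snaps[idx][1]
        delta = [new - old for new, old in zip(self.snaps[-1][1], base)]
        return bucket_percentile(self.bounds, delta, q)
```

A quiet 2s interval yields an all-zero delta and therefore no poll-to-poll percentile, while the horizon-wide delta still contains the requests from earlier polls.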
Prefix cache hit rate
When vLLM exposes vllm:prefix_cache_queries_total / vllm:prefix_cache_hits_total, the table view picks up a Cache% column (green ≥60% / yellow ≥30% / red <30%) and the ALL row shows a query-weighted aggregate. The detail view's Saturation block reports both window and lifetime rates.
Prefix cache hit : 78.4% window 76.1% life
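Window, lifetime, and aggregate rates all derive from the same two counters. A minimal sketch, assuming each poll yields a `(hits_total, queries_total)` pair per replica; the function names and the counter-reset handling are illustrative choices, not the tool's documented behavior.

```python
def hit_rate(hits, queries):
    """Hit rate in percent, or None when there were no queries."""
    return 100.0 * hits / queries if queries > 0 else None


def window_hit_rate(prev, curr):
    """Rate over one poll interval from two (hits_total, queries_total) pairs."""
    d_hits, d_queries = curr[0] - prev[0], curr[1] - prev[1]
    if d_hits < 0 or d_queries < 0:
        return None  # counters went backwards: vLLM probably restarted
    return hit_rate(d_hits, d_queries)


def aggregate_hit_rate(replicas):
    """Query-weighted aggregate: total hits over total queries."""
    return hit_rate(sum(h for h, _ in replicas), sum(q for _, q in replicas))


prev, curr = (6826, 9000), (7610, 10000)
print(f"window {window_hit_rate(prev, curr):.1f}%  life {hit_rate(*curr):.1f}%")
```

Weighting by queries matters when traffic is uneven: a busy replica at 0% and a nearly idle one at 90% should not average out to 45%.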
Trend sparklines (detail view)
Rolling 60-sample history for the metrics that change most:
▸ Trend (last 60 samples, newest on right)
Running : ▆▆▇▇██▅▆▆▇▇█ min 10 max 16 now 15
KV cache % : ▄▄▅▅▆▆▆▄▄▄▅▆ min 40.0% max 75.0% now 60.0%
in tok/s : █▃▆▂▁▄▄█▃█▆ min 12244 max 12411 now 12411
out tok/s : █▃▆▂▁▄▄█▃█▆ min 4898 max 4964 now 4964
TTFT P95 ms : ████████████ min 917 max 917 now 917
TPOT P95 ms : ████████████ min 77.8 max 77.8 now 77.8
Counters and KV% pin to 0 baseline so the bar height reflects absolute level; rates and latencies auto-scale so motion stays visible.
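The two scaling modes can be sketched as a single function with a `pin_zero` switch. This is an illustrative implementation; the name `sparkline`, the rounding rule, and the flat-series handling are assumptions.

```python
from collections import deque

BARS = "▁▂▃▄▅▆▇█"


def sparkline(values, pin_zero=False):
    """Render values as block characters.

    pin_zero=True scales from 0 so bar height shows absolute level;
    otherwise the series is scaled between its own min and max.
    """
    values = list(values)
    if not values:
        return ""
    lo = 0.0 if pin_zero else min(values)
    hi = max(values)
    span = hi - lo
    if span <= 0:
        # Flat series: full bars, except an all-zero pinned series.
        bar = BARS[0] if pin_zero and hi <= 0 else BARS[-1]
        return bar * len(values)
    top = len(BARS) - 1
    return "".join(
        BARS[min(top, max(0, int((v - lo) / span * top + 0.5)))] for v in values
    )


history = deque(maxlen=60)  # rolling buffer: old samples fall off the left
for running in (10, 12, 16, 15):
    history.append(running)
print(sparkline(history, pin_zero=True), sparkline(history))
```

With `pin_zero=True` a series that hovers between 10 and 16 stays in the upper half of the bar range, while the auto-scaled form stretches the same data across the full height.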
JSON output for scripting
# Pipe one snapshot to jq
vllm-htop --output json --once | jq '.aggregate.kv_pct_max'
vllm-htop --output json --once | jq '.cost.compute_based.burn_rate_per_hour'
# Stream JSONL to a log file
vllm-htop --output json --interval 5 >> /var/log/vllm-htop.jsonl
Each poll emits one JSON object on stdout. The schema covers per-replica gauges, throughput, latency (windowed + long-window), lifetime counters, session peaks, the aggregate row, and the cost section.
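For consumers that prefer Python over jq, a JSONL stream can be read line by line. The sketch below relies only on the `aggregate.kv_pct_max` path shown in the jq example; the function name `kv_alerts` and the 90% threshold are made up for illustration.

```python
import json


def kv_alerts(lines, threshold=90.0):
    """Yield (line_number, kv_pct_max) for snapshots above threshold."""
    for n, line in enumerate(lines, 1):
        line = line.strip()
        if not line:
            continue
        try:
            snap = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a partially written final line
        if not isinstance(snap, dict):
            continue
        kv = (snap.get("aggregate") or {}).get("kv_pct_max")
        if isinstance(kv, (int, float)) and kv > threshold:
            yield n, kv


sample = [
    '{"aggregate": {"kv_pct_max": 42.0}}',
    '{"aggregate": {"kv_pct_max": 95.5}}',
]
for line_no, kv in kv_alerts(sample):
    print(f"line {line_no}: KV cache at {kv:.1f}%")
```

The same generator works on an open log file or on `sys.stdin` when vllm-htop is piped into the script.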
htop-style alt-screen rendering
In interactive mode (TTY + continuous polling), vllm-htop uses the terminal's alternate screen buffer — the same mechanism as htop, vim, less. Successive refreshes overwrite a fixed window; on exit, the original terminal contents return (the vllm-htop output is not left in scrollback).
Falls back to plain printing automatically when:
- `--once` is set (one-shot snapshot, you might want to capture it)
- `--output json` is set (structured output for pipelines)
- stdout is captured (`> out.log`, `| tee`; `isatty()` returns False)
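The mechanism is a pair of standard xterm escape sequences wrapped around the render loop. The sketch below shows the general pattern with the fallback conditions from the list above; `use_alt_screen` and `run` are hypothetical names, not vllm-htop's internals.

```python
import io
import sys

ENTER_ALT = "\x1b[?1049h"     # switch to the alternate screen buffer
LEAVE_ALT = "\x1b[?1049l"     # restore the original screen contents
HOME_CLEAR = "\x1b[H\x1b[2J"  # cursor to top-left, clear the screen


def use_alt_screen(once, output_mode, stream):
    """Alt-screen only for continuous, non-JSON output on a real terminal."""
    return not once and output_mode != "json" and stream.isatty()


def run(render_frame, frames, stream=None, once=False, output_mode="auto"):
    stream = sys.stdout if stream is None else stream
    alt = use_alt_screen(once, output_mode, stream)
    if alt:
        stream.write(ENTER_ALT)
    try:
        for _ in range(frames):
            if alt:
                stream.write(HOME_CLEAR)  # overwrite the previous frame
            stream.write(render_frame())
            stream.flush()
    finally:
        if alt:
            stream.write(LEAVE_ALT)  # runs on Ctrl-C too
            stream.flush()


# A StringIO is not a TTY, so captured output carries no escape codes.
captured = io.StringIO()
run(lambda: "frame\n", frames=2, stream=captured)
print(repr(captured.getvalue()))
```

Putting the restore sequence in `finally` is what lets the original terminal contents come back even when the loop is interrupted.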
CLI reference
| Flag | Default | What it does |
|---|---|---|
| `--url URL [URL ...]` | (auto-discovery) | Explicit base URLs. Space- or comma-separated, shell brace expansion supported |
| `--auto` | (implicit) | Force discovery, fail loudly if nothing found |
| `--host HOST` | `localhost` | Hostname for `--auto` discovery and the fallback URL |
| `--port-range LO-HI` | `8000-8015` | Port range for `--auto` |
| `--interval N` | `2.0` | Refresh interval in seconds |
| `--timeout N` | `4.0` | Per-endpoint fetch timeout |
| `--once` | off | Print one snapshot and exit |
| `--output MODE` | `auto` | `auto`/`table`/`detail`/`json`. `json` emits JSONL |
| `--table` | off | Force table view (legacy; use `--output table`) |
| `--detail` | off | Force detail view (legacy; use `--output detail`) |
| `--cost-in PRICE` | off | USD per 1M input (prompt) tokens; enables token cost |
| `--cost-out PRICE` | off | USD per 1M output (generation) tokens |
| `--gpu-cost-hour PRICE` | auto | USD per GPU-hour. Defaults to `nvidia-smi` + built-in price hint |
| `--num-gpus N` | auto | GPU count. Defaults to `nvidia-smi` count |
| `--no-gpu-detect` | off | Skip `nvidia-smi` auto-detect entirely |
| `--currency SYM` | `$` | Currency symbol shown in the Cost section |
| `-V`, `--version` | - | Print version and exit |
| `-h`, `--help` | - | Help |
Concepts
Time-scale of every metric
vllm-htop mixes several time scales — each answers a different question:
| Scale | Examples | Source | Best for |
|---|---|---|---|
| Instantaneous | Run, Wait, Swap, KV% | gauge at this poll | "What's the state right now?" |
| Windowed (~2s) | in/out tok/s, TTFT-P95, Cache% | counter / histogram-bucket deltas | "What's been happening this second?" |
| Long-window (~60s) | `P95@1m` column | bucket delta over snapshots ≤60s old | "What's the SLO state?" |
| Trend (~2 min) | sparklines in detail view | rolling 60-sample buffer | "Is something trending up or down?" |
| Lifetime | life-Prompt/Output/Reqs, lifetime cost | vLLM `*_total` counters | "How much total work since vLLM started?" |
| Session | peak-Run/KV, this-session cost | tracked since vllm-htop attached | "How much during my monitoring window?" |
Aggregating percentiles across DP
The ALL row's P95 is computed by merging raw histogram buckets across replicas and then taking the percentile of the merged distribution. Averaging per-replica P95s is mathematically wrong — mean(P95) isn't P95(union). This matters most when one replica is hot and others are idle: averaging would understate the tail.
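A small worked example makes the gap concrete. The sketch below uses made-up bucket bounds and counts (one hot replica, three nearly idle ones) and a hypothetical `bucket_percentile` helper with linear in-bucket interpolation; it is meant to illustrate the principle, not to reproduce the tool's exact numbers.

```python
def bucket_percentile(bounds, cum_counts, q):
    """Percentile from cumulative bucket counts, interpolated inside the bucket."""
    total = cum_counts[-1]
    if total <= 0:
        return None
    target = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in zip(bounds, cum_counts):
        if count >= target:
            span = count - prev_count
            frac = (target - prev_count) / span if span else 0.0
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound


def merged_percentile(bounds, per_replica_counts, q):
    """Sum cumulative counts bucket by bucket, then take one percentile."""
    merged = [sum(column) for column in zip(*per_replica_counts)]
    return bucket_percentile(bounds, merged, q)


bounds = [0.1, 0.5, 1.0, 5.0]   # TTFT bucket upper bounds in seconds
hot = [0, 0, 0, 1000]           # 1000 requests, all between 1s and 5s
idle = [10, 10, 10, 10]         # 10 requests, all under 100ms
replicas = [hot, idle, idle, idle]

per_replica = [bucket_percentile(bounds, r, 0.95) for r in replicas]
averaged = sum(per_replica) / len(per_replica)
merged = merged_percentile(bounds, replicas, 0.95)
print(f"mean of P95s: {averaged:.3f}s   P95 of merged buckets: {merged:.3f}s")
```

Here the averaged figure lands near 1.27s even though about 97% of all requests went through the hot replica, whose tail sits close to 4.8s; the merged computation keeps that tail visible.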
Time-based vs token-based cost
These answer different questions, and both are useful:
- Compute-based (`$/h × N × uptime`): what's actually leaving your account
- Token-based (`tokens × $/M`): what the inference would cost (or is worth) at API prices
- Margin (`token revenue ÷ compute cost`): whether the GPU is paying for itself
Self-host LLM as an API: watch margin. Internal-only tool: compute is what matters. Researcher/benchmarker: tokens-burned is a hardware-independent yardstick.
Fault tolerance
- DOWN replicas (fetch failed and no prior snapshot) appear as a row with the error, but don't break the aggregate or imbalance check.
- STALE when the latest fetch failed but we have an older snapshot — useful through transient network blips.
- Parallel polling via `ThreadPoolExecutor`: total refresh ≈ slowest single fetch, regardless of replica count.
- Substring metric-name matching so version drift between `vllm:gpu_cache_usage_perc` and `vllm:kv_cache_usage_perc` doesn't break anything.
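The behaviors above fit together in a short polling loop. This is a sketch under assumptions: `poll_all`, `find_metric`, the tuple layout, and the injected `fetch` callable are all hypothetical, chosen so the example runs without a live server.

```python
from concurrent.futures import ThreadPoolExecutor


def poll_all(urls, fetch, last_good):
    """Fetch every URL in parallel and classify each as OK, STALE, or DOWN.

    fetch(url) returns a snapshot dict or raises. last_good maps
    url -> last successful snapshot and is updated in place.
    Returns a list of (url, status, snapshot, error) tuples in input order.
    """
    def one(url):
        try:
            snap = fetch(url)
        except Exception as exc:  # a failed replica must not break the table
            if url in last_good:
                return url, "STALE", last_good[url], str(exc)
            return url, "DOWN", None, str(exc)
        last_good[url] = snap
        return url, "OK", snap, None

    with ThreadPoolExecutor(max_workers=max(1, len(urls))) as pool:
        return list(pool.map(one, urls))


def find_metric(samples, fragment):
    """Value of the first metric whose name contains fragment, else None."""
    for name, value in samples.items():
        if fragment in name:
            return value
    return None


def fake_fetch(url):
    if url == "http://h1:8000":
        return {"vllm:kv_cache_usage_perc": 0.42}
    raise OSError("connection refused")


cache = {"http://h2:8000": {"vllm:gpu_cache_usage_perc": 0.40}}
urls = ["http://h1:8000", "http://h2:8000", "http://h3:8000"]
for url, status, snap, err in poll_all(urls, fake_fetch, cache):
    kv = find_metric(snap, "cache_usage_perc") if snap else None
    print(url, status, kv, err or "")
```

Matching on the `cache_usage_perc` fragment picks up the value under either metric name, so the same lookup works across the rename.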
Why?
vLLM exports a rich Prometheus /metrics endpoint, but:
- Production Prometheus + Grafana is overkill when you just SSH'd in and want to know if a server is healthy right now.
- `curl /metrics | grep` can't compute windowed percentiles, rates, or cross-replica aggregates.
- The default vLLM Grafana dashboard doesn't surface cross-replica imbalance, which is the most common operational failure mode for DP setups.
vllm-htop sits between Grafana (always-on, persistent) and curl (one-off, raw). Single file, SSH-friendly, zero ops setup.
Limitations
- No alerting — this is a viewer, not a notifier. For real alerting see Andrey Krisanov's vLLM Prometheus rules.
- Peaks are in-memory only — when the script exits, session peaks are lost. For long-term persistence, use Prometheus.
- GPU price hints are ballpark: RunPod-anchored medians, ±30% across providers. Pass `--gpu-cost-hour` for accuracy.
- `nvidia-smi` auto-detection only works on the host running vLLM. If you SSH'd in from your laptop and ran `vllm-htop` against `localhost`, GPU detection sees the GPUs of the box you SSH'd into (correct). If you point `--url` at a remote vLLM, the local GPU info isn't relevant; pass an explicit `--gpu-cost-hour`.
Acknowledgments
The vLLM project for exposing rich metrics by default, and the reference Grafana dashboard that informed the choice of which metrics matter most.
License
MIT