htop-style terminal monitor for vLLM inference servers

vllm-htop

htop for vLLM inference servers. Point it at one or more /metrics endpoints, get the right numbers, the right way, right now.

Zero dependencies. Single file. Python 3.8+.

[Screenshot: vllm-htop terminal showing the DP table with per-engine rows, the prefix cache hit rate column, the imbalance check identifying the slow replica, and the cost section with its margin row]

At a glance

Auto-discovers vLLM endpoints on the host — no --url needed for typical local setups
Auto-splits internal DP — vllm serve --data-parallel-size N becomes N rows automatically
Model-aware row names — <model>.e0 instead of 0.e0, so mixed deployments (LLM + embedding) are readable
Windowed + long-window percentiles — P50/P95/P99 over ~2s, plus stabilized P95@1m for SLO reads
Prefix cache hit rate column when vLLM exposes it
Imbalance check that points to the bad replica by name, median-based and grouped by model
Cost estimation — token-based, compute-based (auto-detected from nvidia-smi), and a margin row
htop-style alt-screen rendering — fixed window refresh, scrollback stays clean
JSON output mode for piping into scripts, logs, or alerting
Trend sparklines in detail view — 60-sample rolling history per metric
Cross-DP percentile aggregation done correctly (merged buckets, not averaged P95s)
Fault-tolerant — DOWN / STALE replicas surface without breaking the table

Install

# Recommended: uvx (zero install, always fresh)
uvx vllm-htop@latest

# pip
pip install vllm-htop
vllm-htop

# Or grab the single file and run it
curl -O https://raw.githubusercontent.com/eyuansu62/vllm-htop/main/vllm_htop.py
python vllm_htop.py

Quick start

vllm-htop

With no flags, vllm-htop:

Scans localhost:8000-8015 for vLLM endpoints (parallel TCP + /metrics probe, <100ms)
Detects internal DP via the engine="N" label and expands each URL into per-engine rows
Picks table view if it found ≥2 replicas, detail view otherwise
Runs nvidia-smi to detect local GPUs and shows compute cost when the model matches the built-in price table
Refreshes every 2s in alt-screen mode (no scrollback pollution)
Ctrl-C exits; original terminal contents return

Features

Multi-replica DP — comparison table

vllm-htop --url http://h1:8000 http://h2:8000 http://h3:8000 http://h4:8000

# Comma-separated
vllm-htop --url http://h1:8000,http://h2:8000,http://h3:8000,http://h4:8000

# Shell brace expansion
vllm-htop --url http://localhost:{8000,8001,8002,8003}

Compact per-replica rows + aggregate ALL row + cross-replica imbalance check.

Auto-discovery

If you don't pass --url, vllm-htop scans the configured port range for vLLM-shaped /metrics endpoints. Open ports get an HTTP probe checking for the vllm: metric prefix; non-vLLM services on the same ports are filtered out.

vllm-htop                                  # implicit, falls back to localhost:8000 if nothing's found
vllm-htop --auto                           # forced; fails loudly if nothing found
vllm-htop --auto --host 10.0.0.7           # remote host
vllm-htop --auto --port-range 9000-9031    # wider range
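
The two-stage probe described above (TCP connect, then an HTTP check for the vllm: prefix) could be sketched roughly as follows. This is an illustration under stated assumptions, not vllm-htop's actual code: the helper names parse_port_range, looks_like_vllm, and probe are hypothetical, the 0.5s timeout is arbitrary, and the sketch probes one port at a time rather than in parallel.

```python
import http.client
import socket
import urllib.request
from typing import List, Optional


def parse_port_range(spec: str) -> List[int]:
    """Turn 'LO-HI' (the --port-range format) into an inclusive list of ports."""
    lo, hi = (int(part) for part in spec.split("-", 1))
    if lo > hi:
        raise ValueError(f"empty port range: {spec}")
    return list(range(lo, hi + 1))


def looks_like_vllm(metrics_body: str) -> bool:
    """True if any non-comment line of a /metrics body starts with 'vllm:'."""
    return any(
        line.startswith("vllm:")
        for line in metrics_body.splitlines()
        if line and not line.startswith("#")
    )


def probe(host: str, port: int, timeout: float = 0.5) -> Optional[str]:
    """Return the base URL if host:port serves vLLM-shaped metrics, else None."""
    url = f"http://{host}:{port}"
    try:
        # Cheap TCP connect first so closed ports are skipped quickly.
        with socket.create_connection((host, port), timeout=timeout):
            pass
        with urllib.request.urlopen(url + "/metrics", timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException, ValueError):
        return None  # closed port, non-HTTP service, or HTTP error
    return url if looks_like_vllm(body) else None
```

Filtering on sample lines rather than on HELP/TYPE comments keeps the check independent of how a given exporter formats its metadata.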

Internal DP — auto-split

vllm serve --data-parallel-size N exposes one /metrics endpoint with engine="0".."N-1" labels. vllm-htop detects this on first contact and expands the URL into one virtual replica per engine — the comparison table, imbalance check, and aggregate percentiles work just like for separate-process DP.
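
As a rough sketch of the detection step, the sample lines of one /metrics body can be grouped by their engine label. The function names here are hypothetical, and the choice to name a single-engine URL "0" is an assumption made for the example, not something confirmed by the tool's source.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches engine="3" inside a Prometheus label set.
_ENGINE_RE = re.compile(r'\bengine="(\d+)"')


def split_by_engine(metrics_body: str) -> Dict[str, List[str]]:
    """Group sample lines by engine label; key '' holds unlabeled lines."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in metrics_body.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _ENGINE_RE.search(line)
        groups[match.group(1) if match else ""].append(line)
    return dict(groups)


def virtual_replicas(metrics_body: str) -> List[str]:
    """Row names for one URL: e0, e1, ... when several engines are present."""
    engines = sorted((k for k in split_by_engine(metrics_body) if k), key=int)
    # Assumption: with zero or one engine the URL stays a single row.
    return [f"e{e}" for e in engines] if len(engines) > 1 else ["0"]
```

Each group can then be fed through the same per-replica pipeline as a separately scraped URL.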

Row naming chooses the most informative form available:

Setup	Names
External DP (N URLs, no engine)	`0`, `1`, …
Internal DP (1 URL, N engines)	`e0`, `e1`, …
Mixed (M URLs × N engines each)	`0.e0`, `0.e1`, `1.e0`, …
Model name extractable from labels	`<model>.e0`, `<other-model>.e1`, …

When model_name labels are present and distinct across URLs, the names use the model — so multi-model deployments (e.g. LLM + embedding on the same host) are readable at a glance. On name collisions (two URLs serving the same model) the tool falls back to URL indices to keep rows unique.

Imbalance check

When ≥2 replicas serve the same model, vllm-htop runs four checks for cross-replica anomalies. Healthy state collapses to one line; warnings name the bad replica explicitly.

▸ Imbalance check  (× 4 replicas)  ⚠ 1/4 failed
  ✓ Running req          range 5–8, median 6
  ✓ KV cache             range 40.0%–46.0%
  ⚠ slow-replica (TTFT)  <model>.e3: 979ms is 5.2× median (188ms)
  ✓ slow-decode (TPOT)   median 38.0ms, max 52.0ms (1.4×)

Check	Threshold	Means
Running req	Δ > 3 and max > 1.5× median	⚠ load-balancer skew / sticky session
KV cache	Δ > 15 percentage points	⚠ uneven KV pressure (prefix-cache asymmetry?)
TTFT P95	max / median > 1.5×	⚠ slow replica (GPU thermal, NCCL, contention)
TPOT P95	max / median > 1.5×	⚠ slow decode

Two design choices worth noting:

Median, not min, as the denominator of the ratio. Min would be dragged toward zero by any idle replica and produce misleading ratios such as 75×.
Grouped by model, so a mixed LLM+embedding deployment doesn't cross-compare workloads that are fundamentally different.
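
The thresholds in the table can be expressed in a few lines. The following is a minimal sketch for a single model group; the function names are hypothetical, and grouping by model is left out for brevity.

```python
from statistics import median
from typing import Dict, Optional, Tuple


def ratio_check(values: Dict[str, float],
                threshold: float = 1.5) -> Optional[Tuple[str, float]]:
    """Return (replica, ratio) when max / median exceeds threshold, else None."""
    base = median(values.values())
    if base <= 0:
        return None  # nothing meaningful to compare against yet
    worst = max(values, key=lambda name: values[name])
    ratio = values[worst] / base
    return (worst, ratio) if ratio > threshold else None


def running_check(running: Dict[str, float]) -> bool:
    """True when load looks skewed: spread > 3 and max > 1.5x median."""
    vals = list(running.values())
    return (max(vals) - min(vals)) > 3 and max(vals) > 1.5 * median(vals)


def kv_check(kv_pct: Dict[str, float]) -> bool:
    """True when KV cache usage differs by more than 15 percentage points."""
    vals = list(kv_pct.values())
    return (max(vals) - min(vals)) > 15


# TTFT P95 per replica in ms, chosen to reproduce the sample warning above.
ttft = {"m.e0": 180.0, "m.e1": 186.0, "m.e2": 190.0, "m.e3": 979.0}
hit = ratio_check(ttft)  # flags m.e3 at roughly 5.2x the 188 ms median
```

With min as the denominator, a single idle replica reporting a near-zero value would inflate every ratio; the median is unaffected as long as fewer than half the replicas are outliers.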

Cost estimation

Two independent pricing models, either or both can be on:

# Token-based: explicit prices in $/M tokens (OpenAI-style)
vllm-htop --cost-in 0.50 --cost-out 1.50

# Compute-based: auto-detected from nvidia-smi
vllm-htop                                      # auto
vllm-htop --gpu-cost-hour 2.99 --num-gpus 8    # explicit override
vllm-htop --no-gpu-detect                      # disable auto-detect

# Both — also surfaces the Margin row
vllm-htop --cost-in 0.50 --cost-out 1.50 --gpu-cost-hour 2.99 --num-gpus 8

Each model reports Lifetime (since vLLM started), This session (since vllm-htop attached), and a Current rate / Burn rate for the live read. Margin is token-revenue ÷ compute-cost — colored green ≥2× / yellow ≥1× / red <1×.
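
The arithmetic behind the two pricing models and the Margin row is simple enough to sketch. The function names and the token counts below are made up for illustration; only the formulas and color bands come from the description above.

```python
def token_cost(prompt_tokens: int, output_tokens: int,
               cost_in: float, cost_out: float) -> float:
    """Token-based cost; prices are USD per 1M tokens (--cost-in / --cost-out)."""
    return prompt_tokens / 1e6 * cost_in + output_tokens / 1e6 * cost_out


def compute_cost(gpu_cost_hour: float, num_gpus: int, hours: float) -> float:
    """Compute-based cost: $/GPU-hour x GPU count x elapsed hours."""
    return gpu_cost_hour * num_gpus * hours


def margin_color(margin: float) -> str:
    """Color bands from the text: green >= 2x, yellow >= 1x, red below 1x."""
    if margin >= 2.0:
        return "green"
    if margin >= 1.0:
        return "yellow"
    return "red"


# Hypothetical hour of traffic: 40M prompt tokens and 20M output tokens.
revenue = token_cost(40_000_000, 20_000_000, cost_in=0.50, cost_out=1.50)
spend = compute_cost(gpu_cost_hour=2.99, num_gpus=8, hours=1.0)
margin = revenue / spend  # about 2.09, which lands in the green band
```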

Built-in GPU price hints cover:

Blackwell datacenter: B200, B100, GB200
Blackwell workstation/consumer: RTX PRO 6000, RTX 5090, RTX 5080
Hopper: H100, H100 NVL, H200
Hopper China-market: H20-3e, H20
Ampere: A100 (40/80GB), A40, A30, A10, A10G, RTX A6000/A5000/A4000, RTX 3090
Ada Lovelace: L40S, L40, L20 (China), L4, RTX 6000 Ada, RTX 4090, RTX 4080
Older datacenter: V100, T4

Prices are anchored to RunPod Secure tier published rates (2026-05) — what OpenRouter-class token-API providers (Lambda, Hyperbolic, DeepInfra, …) typically pay for their compute. Cross-provider variance:

Reference	Vs. our hints
AWS / GCP on-demand	3-5× higher
Lambda Labs	within ±10%
RunPod Community	20-40% lower
vast.ai community	30-50% lower

Treat the numbers as ±30% ballpark; override --gpu-cost-hour for anything serious.

Long-window P95

In the detail view's Latency section, the standard P95 column reflects only the latest poll-to-poll delta, so it is noisy and often has no value at all when no requests completed in those 2s. The P95@1m column shows the same percentile over the last ~60 seconds of accumulated samples: much more stable, and what you'd actually use for an SLO read.

▸ Latency  (windowed percentiles)
  metric             P50       P95       P99    P95@1m
  TTFT  (ms)       100.0     916.7    3758.6     520.3
  TPOT  (ms)         8.3      77.8     100.0      65.1
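
One way to compute such a windowed percentile from Prometheus-style cumulative histogram buckets is sketched below. It assumes linear interpolation inside a bucket (the approach Prometheus's histogram_quantile takes); whether vllm-htop interpolates the same way is not stated here, and all names are hypothetical.

```python
from typing import List, Optional, Tuple

# (upper bound `le`, cumulative count), sorted by upper bound.
Buckets = List[Tuple[float, float]]


def bucket_delta(old: Buckets, new: Buckets) -> Buckets:
    """Cumulative counts observed between two snapshots with identical bounds."""
    return [(le, n - o) for (le, n), (_, o) in zip(new, old)]


def percentile(buckets: Buckets, q: float) -> Optional[float]:
    """Estimate the q-quantile (0..1) by linear interpolation inside a bucket."""
    total = buckets[-1][1] if buckets else 0
    if total <= 0:
        return None  # no requests completed in the window
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return prev_le  # cannot interpolate into the +Inf bucket
            in_bucket = count - prev_count
            frac = (rank - prev_count) / in_bucket if in_bucket else 1.0
            return prev_le + (le - prev_le) * frac
        prev_le, prev_count = le, count
    return prev_le


def long_window_p95(history: List[Tuple[float, Buckets]], now: float,
                    horizon: float = 60.0) -> Optional[float]:
    """P95 over snapshots at most `horizon` seconds old (oldest first)."""
    recent = [snap for ts, snap in history if now - ts <= horizon]
    if len(recent) < 2:
        return None
    return percentile(bucket_delta(recent[0], recent[-1]), 0.95)
```

The short-window column is the same calculation applied to the two most recent snapshots only, which is why it has nothing to report when no request finished between polls.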

Prefix cache hit rate

When vLLM exposes vllm:prefix_cache_queries_total / vllm:prefix_cache_hits_total, the table view picks up a Cache% column (green ≥60% / yellow ≥30% / red <30%) and the ALL row shows a query-weighted aggregate. The detail view's Saturation block reports both window and lifetime rates.

  Prefix cache hit  :  78.4% window   76.1% life
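
The window, lifetime, and query-weighted aggregate rates all reduce to the same division over different inputs. A small sketch, with hypothetical function names and counter readings chosen to reproduce the sample line above:

```python
from typing import List, Optional, Tuple

# One reading of (prefix_cache_hits_total, prefix_cache_queries_total).
Reading = Tuple[float, float]


def hit_rate(hits: float, queries: float) -> Optional[float]:
    """Hit rate in percent, or None when there were no queries."""
    return 100.0 * hits / queries if queries > 0 else None


def window_rate(prev: Reading, cur: Reading) -> Optional[float]:
    """Rate over one poll window, from the deltas of both counters."""
    return hit_rate(cur[0] - prev[0], cur[1] - prev[1])


def aggregate_rate(replicas: List[Reading]) -> Optional[float]:
    """Query-weighted aggregate: sum hits and queries before dividing."""
    return hit_rate(sum(h for h, _ in replicas), sum(q for _, q in replicas))


prev, cur = (7587.0, 10000.0), (8371.0, 11000.0)
window = window_rate(prev, cur)   # 784 hits out of 1000 queries
lifetime = hit_rate(*cur)         # 8371 hits out of 11000 queries
```

Summing before dividing keeps a lightly used replica with a high hit rate from skewing the ALL row, which a plain mean of per-replica percentages would allow.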

Trend sparklines (detail view)

Rolling 60-sample history for the metrics that change most:

▸ Trend  (last 60 samples, newest on right)
  Running        :              ▆▆▇▇██▅▆▆▇▇█  min 10 max 16 now 15
  KV cache %     :              ▄▄▅▅▆▆▆▄▄▄▅▆  min 40.0% max 75.0% now 60.0%
  in tok/s       :               █▃▆▂▁▄▄█▃█▆  min 12244 max 12411 now 12411
  out tok/s      :               █▃▆▂▁▄▄█▃█▆  min  4898 max  4964 now  4964
  TTFT P95 ms    :              ████████████  min   917 max   917 now   917
  TPOT P95 ms    :              ████████████  min  77.8 max  77.8 now  77.8

Counters and KV% pin to 0 baseline so the bar height reflects absolute level; rates and latencies auto-scale so motion stays visible.
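
The two scaling modes can be illustrated with a short sparkline renderer. This is a sketch rather than the tool's own renderer: the function name and the zero_baseline flag are hypothetical, and rendering a flat auto-scaled series as full bars is inferred from the sample output above.

```python
from typing import Sequence

BLOCKS = "▁▂▃▄▅▆▇█"


def sparkline(values: Sequence[float], zero_baseline: bool = False) -> str:
    """Render values (oldest first) as block characters, newest on the right."""
    if not values:
        return ""
    lo = 0.0 if zero_baseline else min(values)
    hi = max(values)
    span = hi - lo
    if span <= 0:
        # Flat series: full bars when auto-scaled, lowest bar when pinned to 0.
        return (BLOCKS[0] if zero_baseline else BLOCKS[-1]) * len(values)
    top = len(BLOCKS) - 1
    return "".join(BLOCKS[round((v - lo) / span * top)] for v in values)
```

For example, sparkline([10, 16], zero_baseline=True) keeps the first bar at mid height because 10 is well above zero, while the auto-scaled sparkline([10, 16]) stretches the same two values across the full range.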

JSON output for scripting

# Pipe one snapshot to jq
vllm-htop --output json --once | jq '.aggregate.kv_pct_max'
vllm-htop --output json --once | jq '.cost.compute_based.burn_rate_per_hour'

# Stream JSONL to a log file
vllm-htop --output json --interval 5 >> /var/log/vllm-htop.jsonl

Each poll emits one JSON object on stdout. The schema covers per-replica gauges, throughput, latency (windowed + long-window), lifetime counters, session peaks, the aggregate row, and the cost section.
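
For consumers that prefer Python to jq, a JSONL stream can be read line by line. The sketch below relies only on the two key paths shown in the jq examples above; the helper names and the 90% threshold are hypothetical, and no other part of the schema is assumed.

```python
import json
from typing import Any, Iterable, Iterator, Optional


def dig(obj: Any, path: str) -> Optional[Any]:
    """Follow a dotted path such as 'aggregate.kv_pct_max'; None if absent."""
    for key in path.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def kv_alerts(lines: Iterable[str], limit: float = 90.0) -> Iterator[float]:
    """Yield aggregate.kv_pct_max from each JSONL line where it exceeds limit."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the stream
        value = dig(json.loads(line), "aggregate.kv_pct_max")
        if isinstance(value, (int, float)) and value > limit:
            yield float(value)
```

Passing sys.stdin as lines lets the function sit at the end of a pipe fed by vllm-htop --output json.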

htop-style alt-screen rendering

In interactive mode (TTY + continuous polling), vllm-htop uses the terminal's alternate screen buffer — the same mechanism as htop, vim, less. Successive refreshes overwrite a fixed window; on exit, the original terminal contents return (the vllm-htop output is not left in scrollback).

Falls back to plain printing automatically when:

--once is set (one-shot snapshot, you might want to capture it)
--output json (structured output for pipelines)
stdout is captured (> out.log, | tee — isatty() returns False)
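
The alternate screen buffer is toggled with a pair of standard xterm escape sequences. The sketch below shows the mechanism together with the fallback rules listed above; the function names are hypothetical and this is not taken from vllm-htop's source.

```python
import sys
from contextlib import contextmanager
from typing import Iterator, TextIO

ENTER_ALT = "\x1b[?1049h"     # switch to the alternate screen buffer
LEAVE_ALT = "\x1b[?1049l"     # restore the original screen contents
HOME_CLEAR = "\x1b[H\x1b[2J"  # cursor to top-left, then clear the screen


def wants_alt_screen(once: bool, output: str, is_tty: bool) -> bool:
    """Plain printing unless the session is interactive and continuous."""
    return is_tty and not once and output != "json"


@contextmanager
def alt_screen(stream: TextIO = sys.stdout) -> Iterator[None]:
    """Enter the alternate screen and always leave it, even on Ctrl-C."""
    stream.write(ENTER_ALT)
    stream.flush()
    try:
        yield
    finally:
        stream.write(LEAVE_ALT)
        stream.flush()
```

Inside the context, each refresh would write HOME_CLEAR followed by the new frame. The finally clause is what makes the original terminal contents return when the loop is interrupted.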

CLI reference

Flag	Default	What it does
`--url URL [URL ...]`	(auto-discovery)	Explicit base URLs. Space- or comma-separated, shell brace expansion supported
`--auto`	(implicit)	Force discovery, fail loudly if nothing found
`--host HOST`	`localhost`	Hostname for `--auto` discovery and the fallback URL
`--port-range LO-HI`	`8000-8015`	Port range for `--auto`
`--interval N`	`2.0`	Refresh interval in seconds
`--timeout N`	`4.0`	Per-endpoint fetch timeout
`--once`	off	Print one snapshot and exit
`--output MODE`	`auto`	`auto`/`table`/`detail`/`json`. `json` emits JSONL
`--table`	off	Force table view (legacy; use `--output table`)
`--detail`	off	Force detail view (legacy; use `--output detail`)
`--cost-in PRICE`	off	USD per 1M input (prompt) tokens — enables token cost
`--cost-out PRICE`	off	USD per 1M output (generation) tokens
`--gpu-cost-hour PRICE`	auto	USD per GPU-hour. Defaults to nvidia-smi + built-in price hint
`--num-gpus N`	auto	GPU count. Defaults to nvidia-smi count
`--no-gpu-detect`	off	Skip nvidia-smi auto-detect entirely
`--currency SYM`	`$`	Currency symbol shown in the Cost section
`-V`, `--version`	—	Print version and exit
`-h`, `--help`	—	Help

Concepts

Time-scale of every metric

vllm-htop mixes several time scales — each answers a different question:

Scale	Examples	Source	Best for
Instantaneous	Run, Wait, Swap, KV%	gauge at this poll	"What's the state right now?"
Windowed (~2s)	in/out tok/s, TTFT-P95, Cache%	counter / histogram-bucket deltas	"What's been happening over the last poll interval?"
Long-window (~60s)	`P95@1m` column	bucket delta over snapshots ≤60s old	"What's the SLO state?"
Trend (~2 min)	sparklines in detail view	rolling 60-sample buffer	"Is something trending up or down?"
Lifetime	life-Prompt/Output/Reqs, lifetime cost	vLLM `*_total` counters	"How much total work since vLLM started?"
Session	peak-Run/KV, this-session cost	tracked since `vllm-htop` attached	"How much during my monitoring window?"
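
As an example of the windowed scale, a tok/s figure is the difference between two readings of a `*_total` counter divided by the time between polls. A minimal sketch with a hypothetical function name; returning None when the counter goes backwards is an assumption about how a server restart might be handled.

```python
from typing import Optional


def rate_per_second(prev_total: float, cur_total: float,
                    prev_ts: float, cur_ts: float) -> Optional[float]:
    """Windowed rate from two readings of a monotonically increasing counter."""
    elapsed = cur_ts - prev_ts
    delta = cur_total - prev_total
    if elapsed <= 0 or delta < 0:
        # No time passed, or the counter went backwards (e.g. server restart).
        return None
    return delta / elapsed


# 24,822 new prompt tokens over a 2 s poll interval.
in_tok_s = rate_per_second(1_000_000, 1_024_822, prev_ts=10.0, cur_ts=12.0)
```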

Aggregating percentiles across DP

The ALL row's P95 is computed by merging raw histogram buckets across replicas and then taking the percentile of the merged distribution. Averaging per-replica P95s is mathematically wrong — mean(P95) isn't P95(union). This matters most when one replica is hot and others are idle: averaging would understate the tail.
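
A small numeric example makes the difference concrete. To stay short, this sketch reports the upper bound of the bucket containing the 95th-percentile request instead of interpolating; the bucket bounds, the counts, and the function names are made up for illustration.

```python
from typing import Dict, List, Optional

Counts = Dict[float, int]  # upper bound (seconds) -> requests in that bucket


def p95_upper_bound(counts: Counts) -> Optional[float]:
    """Upper bound of the bucket that contains the 95th-percentile request."""
    total = sum(counts.values())
    if total == 0:
        return None
    rank, seen = 0.95 * total, 0
    for le in sorted(counts):
        seen += counts[le]
        if seen >= rank:
            return le
    return max(counts)


def merge(replicas: List[Counts]) -> Counts:
    """Sum bucket counts across replicas that share the same bounds."""
    merged: Counts = {}
    for counts in replicas:
        for le, n in counts.items():
            merged[le] = merged.get(le, 0) + n
    return merged


hot = {0.1: 10, 0.5: 20, 2.0: 70}   # busy replica with a slow tail
quiet = {0.1: 95, 0.5: 5, 2.0: 0}   # lightly loaded replica, fast

averaged = (p95_upper_bound(hot) + p95_upper_bound(quiet)) / 2  # 1.05 s
merged_p95 = p95_upper_bound(merge([hot, quiet]))               # 2.0 s
```

Of the 200 requests in the merged distribution, 70 took longer than 0.5s, so the true P95 sits in the slowest bucket. The averaged figure of 1.05s describes neither replica and hides that tail.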

Time-based vs token-based cost

These answer different questions, and both are useful:

Compute-based ($/h × N × uptime) — what's actually leaving your account
Token-based (tokens × $/M) — what the inference would cost (or is worth) at API prices
Margin (token revenue ÷ compute cost) — whether the GPU is paying for itself

Self-hosting an LLM as an API: watch margin. Internal-only tool: compute is what matters. Researcher/benchmarker: tokens burned is a hardware-independent yardstick.

Fault tolerance

DOWN replicas (fetch failed and no prior snapshot) appear as a row with the error, but don't break the aggregate or imbalance check.
STALE when the latest fetch failed but we have an older snapshot — useful through transient network blips.
Parallel polling via ThreadPoolExecutor — total refresh ≈ slowest single fetch, regardless of replica count.
Substring metric-name matching so version drift between vllm:gpu_cache_usage_perc and vllm:kv_cache_usage_perc doesn't break anything.
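
The behaviours listed above could be combined along these lines. This is a sketch, not the tool's implementation: poll_all and find_metric are hypothetical names, and the fetch function is passed in so the example runs without a network.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Tuple

Result = Tuple[str, Optional[str]]  # (status, metrics body or None)


def poll_all(urls: List[str], fetch: Callable[[str], str],
             last_good: Dict[str, str]) -> Dict[str, Result]:
    """Fetch every URL in parallel and classify each as OK, STALE, or DOWN."""
    def one(url: str) -> Result:
        try:
            body = fetch(url)
        except Exception:
            # Failed fetch: reuse the older snapshot if we have one.
            return ("STALE", last_good[url]) if url in last_good else ("DOWN", None)
        last_good[url] = body  # each worker writes only its own key
        return ("OK", body)

    with ThreadPoolExecutor(max_workers=max(1, len(urls))) as pool:
        return dict(zip(urls, pool.map(one, urls)))


def find_metric(samples: Dict[str, float], fragment: str) -> Optional[float]:
    """First sample whose name contains fragment, e.g. 'cache_usage_perc'."""
    for name, value in samples.items():
        if fragment in name:
            return value
    return None
```

Because one worker handles each URL, a slow or dead replica delays the refresh by at most its own timeout, and the fragment "cache_usage_perc" matches both metric names mentioned above.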

Why?

vLLM exports a rich Prometheus /metrics endpoint, but:

Production Prometheus + Grafana is overkill when you just SSH'd in and want to know if a server is healthy right now.
curl /metrics | grep can't compute windowed percentiles, rates, or cross-replica aggregates.
The default vLLM Grafana dashboard doesn't surface cross-replica imbalance — which is the most common operational failure mode for DP setups.

vllm-htop sits between Grafana (always-on, persistent) and curl (one-off, raw). Single file, ssh-friendly, zero ops setup.

Limitations

No alerting — this is a viewer, not a notifier. For real alerting see Andrey Krisanov's vLLM Prometheus rules.
Peaks are in-memory only — when the script exits, session peaks are lost. For long-term persistence, use Prometheus.
GPU price hints are ballpark — RunPod-anchored medians, ±30% across providers. Pass --gpu-cost-hour for accuracy.
nvidia-smi auto-detection only sees the GPUs of the machine where vllm-htop runs. If you SSH into the vLLM host and run vllm-htop against localhost there, it detects that host's GPUs (correct). If you point --url at a remote vLLM, the local GPU info isn't relevant; pass an explicit --gpu-cost-hour.

Acknowledgments

The vLLM project for exposing rich metrics by default, and the reference Grafana dashboard that informed the choice of which metrics matter most.

License

MIT

Maintainers

eyuansu62

Release history

Version	Released
0.5.0	May 26, 2026
0.4.10	May 26, 2026
0.4.9	May 26, 2026
0.4.7	May 21, 2026
0.4.6	May 21, 2026
0.4.5	May 21, 2026
0.4.4	May 21, 2026
0.4.3	May 20, 2026
0.4.2	May 20, 2026
0.4.0	May 20, 2026
0.3.3	May 20, 2026
0.3.2 (this version)	May 19, 2026
0.3.0	May 19, 2026
0.2.2	May 19, 2026
0.2.1	May 19, 2026
0.2.0	May 19, 2026
0.1.1	May 18, 2026
0.1.0	May 18, 2026

Download files

Source Distribution

vllm_htop-0.3.2.tar.gz (37.3 kB)

Uploaded May 19, 2026 Source

Built Distribution

vllm_htop-0.3.2-py3-none-any.whl (37.5 kB)

Uploaded May 19, 2026 Python 3

File details

Details for the file vllm_htop-0.3.2.tar.gz.

File metadata

Download URL: vllm_htop-0.3.2.tar.gz
Upload date: May 19, 2026
Size: 37.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for vllm_htop-0.3.2.tar.gz
Algorithm	Hash digest
SHA256	`bfa3ca1b2526416e07d189e26e3934f01bc02792db1a1adab2633fd247d6f55a`
MD5	`95e827c79591ef272af43075166fb374`
BLAKE2b-256	`448ab3aecfd4ac53b2001f11cab816bd4ec03edf411d672d17e4ee980d508755`

Provenance

The following attestation bundles were made for vllm_htop-0.3.2.tar.gz:

Publisher: publish.yml on eyuansu62/vllm-htop

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vllm_htop-0.3.2.tar.gz
- Subject digest: bfa3ca1b2526416e07d189e26e3934f01bc02792db1a1adab2633fd247d6f55a
- Sigstore transparency entry: 1573518652
- Sigstore integration time: May 19, 2026
Source repository:
- Permalink: eyuansu62/vllm-htop@28ddcd2ede8fbdbdeaddeafafe552c1ef9a95c54
- Branch / Tag: refs/tags/v0.3.2
- Owner: https://github.com/eyuansu62
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@28ddcd2ede8fbdbdeaddeafafe552c1ef9a95c54
- Trigger Event: push

File details

Details for the file vllm_htop-0.3.2-py3-none-any.whl.

File metadata

Download URL: vllm_htop-0.3.2-py3-none-any.whl
Upload date: May 19, 2026
Size: 37.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for vllm_htop-0.3.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`625cb76bc99666282d94e7035a67381c6ee76578e9d6c832f6f00aad7a5ed4a0`
MD5	`ec6428034766df3be67b70e5069e4f43`
BLAKE2b-256	`a0dc479fb0796a1f5f3b120d4b7c14c901d40f300d18e72fc5304a174315ecde`

Provenance

The following attestation bundles were made for vllm_htop-0.3.2-py3-none-any.whl:

Publisher: publish.yml on eyuansu62/vllm-htop

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vllm_htop-0.3.2-py3-none-any.whl
- Subject digest: 625cb76bc99666282d94e7035a67381c6ee76578e9d6c832f6f00aad7a5ed4a0
- Sigstore transparency entry: 1573518688
- Sigstore integration time: May 19, 2026
Source repository:
- Permalink: eyuansu62/vllm-htop@28ddcd2ede8fbdbdeaddeafafe552c1ef9a95c54
- Branch / Tag: refs/tags/v0.3.2
- Owner: https://github.com/eyuansu62
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@28ddcd2ede8fbdbdeaddeafafe552c1ef9a95c54
- Trigger Event: push
