htop-style terminal monitor for vLLM inference servers

vllm-htop

htop for vLLM inference servers. Point it at one or more /metrics endpoints, get the right numbers, the right way, right now.

Zero dependencies. Single file. Python 3.8+.

[Screenshot: vllm-htop terminal showing the DP table with per-engine rows, the prefix cache hit rate column, the imbalance check identifying the slow replica, and the cost section with a margin row]

At a glance

  • Auto-discovers vLLM endpoints on the host — no --url needed for typical local setups
  • Auto-splits internal DP: vllm serve --data-parallel-size N becomes N rows automatically
  • Model-aware row names: <model>.e0 instead of 0.e0, so mixed deployments (LLM + embedding) are readable
  • Windowed + long-window percentiles — P50/P95/P99 over ~2s, plus stabilized P95@1m for SLO reads
  • Prefix cache hit rate column when vLLM exposes it
  • Imbalance check that points to the bad replica by name, median-based and grouped by model
  • Cost estimation — token-based, compute-based (auto-detected from nvidia-smi), and a margin row
  • htop-style alt-screen rendering — fixed window refresh, scrollback stays clean
  • JSON output mode for piping into scripts, logs, or alerting
  • Trend sparklines in detail view — 60-sample rolling history per metric
  • Cross-DP percentile aggregation done correctly (merged buckets, not averaged P95s)
  • Fault-tolerant — DOWN / STALE replicas surface without breaking the table

Install

# Recommended: uvx (zero install, always fresh)
uvx vllm-htop@latest

# pip
pip install vllm-htop
vllm-htop

# Or grab the single file and run it
curl -O https://raw.githubusercontent.com/eyuansu62/vllm-htop/main/vllm_htop.py
python vllm_htop.py

Quick start

vllm-htop

With no flags, vllm-htop:

  1. Scans localhost:8000-8015 for vLLM endpoints (parallel TCP + /metrics probe, <100ms)
  2. Detects internal DP via the engine="N" label and expands each URL into per-engine rows
  3. Picks table view if it found ≥2 replicas, detail view otherwise
  4. Runs nvidia-smi to detect local GPUs and shows compute cost when the model matches the built-in price table
  5. Refreshes every 2s in alt-screen mode (no scrollback pollution)
  6. Ctrl-C exits; original terminal contents return

Features

Multi-replica DP — comparison table

vllm-htop --url http://h1:8000 http://h2:8000 http://h3:8000 http://h4:8000

# Comma-separated
vllm-htop --url http://h1:8000,http://h2:8000,http://h3:8000,http://h4:8000

# Shell brace expansion
vllm-htop --url http://localhost:{8000,8001,8002,8003}

Compact per-replica rows + aggregate ALL row + cross-replica imbalance check.

Auto-discovery

If you don't pass --url, vllm-htop scans the configured port range for vLLM-shaped /metrics endpoints. Open ports get an HTTP probe checking for the vllm: metric prefix; non-vLLM services on the same ports are filtered out.

vllm-htop                                  # implicit, falls back to localhost:8000 if nothing's found
vllm-htop --auto                           # forced; fails loudly if nothing found
vllm-htop --auto --host 10.0.0.7           # remote host
vllm-htop --auto --port-range 9000-9031    # wider range
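
The two-stage probe described above can be sketched roughly as follows. This is an illustration, not the tool's actual code: the function names, the 0.2s timeout, and the one-thread-per-port pool are assumptions.

```python
import http.client
import socket
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import List


def looks_like_vllm(metrics_text: str) -> bool:
    """True if any sample line of a /metrics body starts with the vllm: prefix."""
    for line in metrics_text.splitlines():
        if line.startswith("vllm:"):  # comment lines start with '#', so they never match
            return True
    return False


def probe(host: str, port: int, timeout: float = 0.2) -> bool:
    """Stage 1: cheap TCP connect. Stage 2: GET /metrics and check the prefix."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass
    except OSError:
        return False  # closed port or unreachable host
    try:
        url = f"http://{host}:{port}/metrics"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return looks_like_vllm(resp.read().decode("utf-8", "replace"))
    except (OSError, ValueError, http.client.HTTPException):
        return False  # port is open, but it is not a readable /metrics endpoint


def discover(host: str = "localhost", lo: int = 8000, hi: int = 8015) -> List[str]:
    """Probe every port in [lo, hi] in parallel and return base URLs that look like vLLM."""
    ports = range(lo, hi + 1)
    with ThreadPoolExecutor(max_workers=max(1, len(ports))) as pool:
        hits = list(pool.map(lambda p: probe(host, p), ports))
    return [f"http://{host}:{p}" for p, ok in zip(ports, hits) if ok]
```

Doing the TCP connect first keeps the scan cheap: only ports that accept a connection pay for an HTTP round trip.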

Internal DP — auto-split

vllm serve --data-parallel-size N exposes one /metrics endpoint with engine="0".."N-1" labels. vllm-htop detects this on first contact and expands the URL into one virtual replica per engine — the comparison table, imbalance check, and aggregate percentiles work just like for separate-process DP.
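
One way to detect internal DP from a single /metrics body is to collect the distinct engine="N" label values. The sketch below assumes a simple regex over the Prometheus text format; the helper name is hypothetical and the tool's own parser may differ.

```python
import re
from typing import List

# Matches the engine="N" label inside a Prometheus sample line.
ENGINE_LABEL = re.compile(r'engine="(\d+)"')


def engine_ids(metrics_text: str) -> List[int]:
    """Return the sorted distinct engine indices found in a /metrics body."""
    ids = set()
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP / TYPE comment lines
        match = ENGINE_LABEL.search(line)
        if match:
            ids.add(int(match.group(1)))
    return sorted(ids)
```

If this returns more than one id, the URL would be expanded into that many virtual replicas, each filtered to its own engine label.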

Row naming chooses the most informative form available:

Setup                               Names
External DP (N URLs, no engine)     0, 1, …
Internal DP (1 URL, N engines)      e0, e1, …
Mixed (M URLs × N engines each)     0.e0, 0.e1, 1.e0, …
Model name extractable from labels  <model>.e0, <other-model>.e1, …

When model_name labels are present and distinct across URLs, the names use the model — so multi-model deployments (e.g. LLM + embedding on the same host) are readable at a glance. On name collisions (two URLs serving the same model) the tool falls back to URL indices to keep rows unique.
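
The naming rules above could be expressed along these lines. This is a sketch under assumptions: the (url_index, engine_index, model_name) tuple layout and the function name are invented for illustration, and edge cases in the real tool may be handled differently.

```python
from typing import List, Optional, Tuple

# One entry per virtual replica: (url_index, engine_index or None, model_name or None).
Row = Tuple[int, Optional[int], Optional[str]]


def row_names(rows: List[Row]) -> List[str]:
    url_count = len({u for u, _, _ in rows})

    # Use model names as prefixes only if every URL has one and no two URLs
    # share a model; otherwise fall back to URL indices so rows stay unique.
    model_by_url = {u: m for u, _, m in rows}
    models = list(model_by_url.values())
    use_model = all(models) and len(set(models)) == len(models)

    names = []
    for u, e, m in rows:
        prefix = m if use_model else str(u)
        if e is None:
            names.append(prefix)            # external DP: 0, 1, ...
        elif url_count == 1 and not use_model:
            names.append(f"e{e}")           # internal DP: e0, e1, ...
        else:
            names.append(f"{prefix}.e{e}")  # mixed: 0.e0, or <model>.e0
    return names
```

For example, `row_names([(0, 0, "llm"), (1, 0, "embed")])` gives `["llm.e0", "embed.e0"]`, while two URLs that both serve `"llm"` fall back to `["0.e0", "1.e0"]`.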

Imbalance check

When ≥2 replicas serve the same model, vllm-htop runs four checks for cross-replica anomalies. Healthy state collapses to one line; warnings name the bad replica explicitly.

▸ Imbalance check  (× 4 replicas)  ⚠ 1/4 failed
  ✓ Running req          range 5–8, median 6
  ✓ KV cache             range 40.0%–46.0%
  ⚠ slow-replica (TTFT)  <model>.e3: 979ms is 5.2× median (188ms)
  ✓ slow-decode (TPOT)   median 38.0ms, max 52.0ms (1.4×)

Check        Threshold                    Means
Running req  Δ > 3 and max > 1.5× median  ⚠ load-balancer skew / sticky session
KV cache     Δ > 15 percentage points     ⚠ uneven KV pressure (prefix-cache asymmetry?)
TTFT P95     max / median > 1.5×          ⚠ slow replica (GPU thermal, NCCL, contention)
TPOT P95     max / median > 1.5×          ⚠ slow decode

Two design choices worth noting:

  • median, not min, as the baseline ratio denominator. Min would be dragged to zero by any idle replica and produce misleading 75× ratios.
  • grouped by model, so a mixed LLM+embedding deployment doesn't cross-compare workloads that are fundamentally different.
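
The median-baseline ratio check can be sketched as below. The function name and return shape are assumptions; the 1.5× threshold comes from the table above. A caller would invoke it once per model group so that different workloads are never compared.

```python
import statistics
from typing import Dict, Optional, Tuple


def slow_replica(
    p95_ms: Dict[str, float], threshold: float = 1.5
) -> Optional[Tuple[str, float, float]]:
    """Return (name, ratio, median) for the worst replica if max / median > threshold."""
    if len(p95_ms) < 2:
        return None  # nothing to compare against
    med = statistics.median(p95_ms.values())
    if med <= 0:
        return None  # most replicas idle: no meaningful baseline
    name, worst = max(p95_ms.items(), key=lambda kv: kv[1])
    ratio = worst / med
    return (name, ratio, med) if ratio > threshold else None
```

With TTFT P95 values of 180, 185, 191, and 979 ms, the median is 188 ms and the last replica is flagged at about 5.2×, matching the shape of the sample output above.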

Cost estimation

Two independent pricing models, either or both can be on:

# Token-based: explicit prices in $/M tokens (OpenAI-style)
vllm-htop --cost-in 0.50 --cost-out 1.50

# Compute-based: auto-detected from nvidia-smi
vllm-htop                                      # auto
vllm-htop --gpu-cost-hour 2.99 --num-gpus 8    # explicit override
vllm-htop --no-gpu-detect                      # disable auto-detect

# Both — also surfaces the Margin row
vllm-htop --cost-in 0.50 --cost-out 1.50 --gpu-cost-hour 2.99 --num-gpus 8

Each model reports Lifetime (since vLLM started), This session (since vllm-htop attached), and a Current rate / Burn rate for the live read. Margin is token-revenue ÷ compute-cost — colored green ≥2× / yellow ≥1× / red <1×.
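
The arithmetic behind the two pricing models and the Margin row is small enough to sketch directly. Function names are hypothetical; the formulas and color thresholds follow the description above.

```python
from typing import Optional


def token_value(prompt_tokens: int, output_tokens: int,
                cost_in: float, cost_out: float) -> float:
    """Token-based value in USD; cost_in / cost_out are USD per 1M tokens."""
    return prompt_tokens / 1e6 * cost_in + output_tokens / 1e6 * cost_out


def compute_cost(gpu_cost_hour: float, num_gpus: int, seconds: float) -> float:
    """Compute-based cost in USD: $/GPU-hour x GPU count x elapsed hours."""
    return gpu_cost_hour * num_gpus * seconds / 3600.0


def margin(revenue: float, cost: float) -> Optional[float]:
    """Token revenue divided by compute cost; None when there is no cost yet."""
    return revenue / cost if cost > 0 else None


def margin_color(m: float) -> str:
    """green >= 2x, yellow >= 1x, red below 1x."""
    if m >= 2.0:
        return "green"
    if m >= 1.0:
        return "yellow"
    return "red"
```

As an example, 2M prompt tokens and 1M output tokens at 0.50 / 1.50 are worth 2.50, while 8 GPUs at 2.99 per hour cost 23.92 over the same hour, so the margin is roughly 0.10× and would render red.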

Built-in GPU price hints cover:

  • Blackwell datacenter: B200, B100, GB200
  • Blackwell workstation/consumer: RTX PRO 6000, RTX 5090, RTX 5080
  • Hopper: H100, H100 NVL, H200
  • Hopper China-market: H20-3e, H20
  • Ampere: A100 (40/80GB), A40, A30, A10, A10G, RTX A6000/A5000/A4000, RTX 3090
  • Ada Lovelace: L40S, L40, L20 (China), L4, RTX 6000 Ada, RTX 4090, RTX 4080
  • Older datacenter: V100, T4

Prices are anchored to RunPod Secure tier published rates (2026-05) — what OpenRouter-class token-API providers (Lambda, Hyperbolic, DeepInfra, …) typically pay for their compute. Cross-provider variance:

Reference            Vs. our hints
AWS / GCP on-demand  3-5× higher
Lambda Labs          within ±10%
RunPod Community     20-40% lower
vast.ai community    30-50% lower

Treat the numbers as ±30% ballpark; override --gpu-cost-hour for anything serious.

Long-window P95

In the detail view's Latency section, the standard P95 column reflects only the latest poll-to-poll delta. That makes it noisy, especially when few or no requests completed in those 2s. The P95@1m column shows the same percentile over the last ~60 seconds of accumulated samples: much more stable, and what you'd actually use for an SLO read.

▸ Latency  (windowed percentiles)
  metric             P50       P95       P99    P95@1m
  TTFT  (ms)       100.0     916.7    3758.6     520.3
  TPOT  (ms)         8.3      77.8     100.0      65.1
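
Both columns can be derived from Prometheus-style cumulative histogram buckets by differencing two snapshots and reading a percentile off the delta. The sketch below assumes linear interpolation inside the target bucket (the approach Prometheus's histogram_quantile takes); the helper names and the snapshot history layout are hypothetical.

```python
from typing import List, Optional, Tuple

# (upper bound "le", cumulative count), sorted by bound, ending with +Inf.
Buckets = List[Tuple[float, float]]


def bucket_delta(new: Buckets, old: Buckets) -> Buckets:
    """Cumulative counts observed between two snapshots of the same histogram."""
    return [(le, n - o) for (le, n), (_, o) in zip(new, old)]


def percentile(buckets: Buckets, q: float) -> Optional[float]:
    """Estimate the q-quantile (0..1) by interpolating inside the target bucket."""
    total = buckets[-1][1] if buckets else 0
    if total <= 0:
        return None  # no requests completed in this window
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return prev_le  # open-ended bucket: report its lower bound
            span = count - prev_count
            frac = (rank - prev_count) / span if span else 0.0
            return prev_le + (le - prev_le) * frac
        prev_le, prev_count = le, count
    return prev_le


def long_window(history: List[Tuple[float, Buckets]], now: float,
                horizon: float = 60.0) -> Optional[Buckets]:
    """Delta between the newest snapshot and the oldest one still within horizon seconds."""
    recent = [(t, b) for t, b in history if now - t <= horizon]
    if len(recent) < 2:
        return None
    return bucket_delta(recent[-1][1], recent[0][1])
```

The standard column would call `percentile(bucket_delta(current, previous), 0.95)`; the P95@1m column would call `percentile` on the result of `long_window`, which pools roughly 30 polls' worth of requests at a 2s interval.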

Prefix cache hit rate

When vLLM exposes vllm:prefix_cache_queries_total / vllm:prefix_cache_hits_total, the table view picks up a Cache% column (green ≥60% / yellow ≥30% / red <30%) and the ALL row shows a query-weighted aggregate. The detail view's Saturation block reports both window and lifetime rates.

  Prefix cache hit  :  78.4% window   76.1% life
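
A minimal sketch of the hit-rate arithmetic, assuming the inputs are deltas of the two counters named above (window rate) or their raw values (lifetime rate). Function names are hypothetical.

```python
from typing import List, Optional, Tuple


def hit_rate(hits: float, queries: float) -> Optional[float]:
    """Hit rate in percent, or None when no queries were observed."""
    if queries <= 0:
        return None
    return 100.0 * hits / queries


def aggregate_hit_rate(per_replica: List[Tuple[float, float]]) -> Optional[float]:
    """Query-weighted aggregate over (hits, queries) pairs: sum of hits / sum of queries."""
    total_hits = sum(h for h, _ in per_replica)
    total_queries = sum(q for _, q in per_replica)
    return hit_rate(total_hits, total_queries)
```

Weighting by queries matters when traffic is uneven: a replica at 90% on 100 queries and one at about 1% on 900 queries aggregate to 10%, not to the roughly 46% that averaging the two percentages would suggest.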

Trend sparklines (detail view)

Rolling 60-sample history for the metrics that change most:

▸ Trend  (last 60 samples, newest on right)
  Running        :              ▆▆▇▇██▅▆▆▇▇█  min 10 max 16 now 15
  KV cache %     :              ▄▄▅▅▆▆▆▄▄▄▅▆  min 40.0% max 75.0% now 60.0%
  in tok/s       :               █▃▆▂▁▄▄█▃█▆  min 12244 max 12411 now 12411
  out tok/s      :               █▃▆▂▁▄▄█▃█▆  min  4898 max  4964 now  4964
  TTFT P95 ms    :              ████████████  min   917 max   917 now   917
  TPOT P95 ms    :              ████████████  min  77.8 max  77.8 now  77.8

Counters and KV% pin to 0 baseline so the bar height reflects absolute level; rates and latencies auto-scale so motion stays visible.
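
The two scaling modes can be illustrated with a small renderer. This is a sketch, not the tool's code: the function name and the eight-level block ramp are assumptions.

```python
from typing import Sequence

BARS = "▁▂▃▄▅▆▇█"  # eight block heights, lowest to highest


def sparkline(values: Sequence[float], zero_baseline: bool) -> str:
    """Render values as block characters.

    zero_baseline=True pins the bottom of the scale to 0 (absolute level);
    zero_baseline=False scales between min and max (keeps motion visible).
    """
    if not values:
        return ""
    lo = 0.0 if zero_baseline else min(values)
    hi = max(values)
    if hi <= lo:
        # Flat series: full bars, unless everything is zero.
        return (BARS[0] if hi == 0 else BARS[-1]) * len(values)
    top = len(BARS) - 1
    out = []
    for v in values:
        idx = int((v - lo) / (hi - lo) * top)
        out.append(BARS[max(0, min(idx, top))])
    return "".join(out)
```

For the throughput pair 12244 and 12411, auto-scaling renders `▁█` so the change is visible, whereas a zero baseline renders `▇█`, showing that both values sit near the same absolute level.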

JSON output for scripting

# Pipe one snapshot to jq
vllm-htop --output json --once | jq '.aggregate.kv_pct_max'
vllm-htop --output json --once | jq '.cost.compute_based.burn_rate_per_hour'

# Stream JSONL to a log file
vllm-htop --output json --interval 5 >> /var/log/vllm-htop.jsonl

Each poll emits one JSON object on stdout. The schema covers per-replica gauges, throughput, latency (windowed + long-window), lifetime counters, session peaks, the aggregate row, and the cost section.
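
For consumers that prefer Python over jq, a JSONL stream can be filtered with a few lines. The sketch below relies only on the `aggregate.kv_pct_max` field used in the jq example above; the function name and the 90% limit are assumptions.

```python
import json
from typing import Iterable, Iterator


def kv_alerts(lines: Iterable[str], limit_pct: float = 90.0) -> Iterator[float]:
    """Yield aggregate.kv_pct_max for every snapshot line that exceeds limit_pct."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the stream
        snapshot = json.loads(line)
        kv = snapshot.get("aggregate", {}).get("kv_pct_max")
        if kv is not None and kv > limit_pct:
            yield kv
```

Passing `sys.stdin` as `lines` lets this sit at the end of a pipe such as `vllm-htop --output json --interval 5 | python your_script.py`, where the script name is a placeholder.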

htop-style alt-screen rendering

In interactive mode (TTY + continuous polling), vllm-htop uses the terminal's alternate screen buffer — the same mechanism as htop, vim, less. Successive refreshes overwrite a fixed window; on exit, the original terminal contents return (the vllm-htop output is not left in scrollback).

Falls back to plain printing automatically when:

  • --once is set (one-shot snapshot, you might want to capture it)
  • --output json (structured output for pipelines)
  • stdout is captured (> out.log, | tee), i.e. isatty() returns False
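
The alternate-screen behavior and its fallback conditions can be sketched with the standard xterm escape sequences. The `run` and `use_alt_screen` helpers are hypothetical; only the escape codes and the three fallback conditions come from established terminal behavior and the list above.

```python
import sys
from typing import Callable, TextIO

ENTER_ALT = "\x1b[?1049h"     # switch to the alternate screen buffer
LEAVE_ALT = "\x1b[?1049l"     # restore the original screen contents
HOME_CLEAR = "\x1b[H\x1b[2J"  # cursor to top-left, then clear the window


def use_alt_screen(once: bool, output: str, is_tty: bool) -> bool:
    """Alt-screen only for interactive, continuous, non-JSON runs."""
    return is_tty and not once and output != "json"


def run(render: Callable[[], str], polls: int, once: bool = False,
        output: str = "auto", stream: TextIO = sys.stdout) -> None:
    alt = use_alt_screen(once, output, stream.isatty())
    if alt:
        stream.write(ENTER_ALT)
    try:
        for _ in range(polls):
            if alt:
                stream.write(HOME_CLEAR)  # overwrite the same fixed window
            stream.write(render() + "\n")
            stream.flush()
    finally:
        if alt:
            stream.write(LEAVE_ALT)  # runs on Ctrl-C too, so the terminal is restored
            stream.flush()
```

Putting the restore sequence in `finally` is what brings the original terminal contents back even when the loop is interrupted.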

CLI reference

Flag                    Default           What it does
--url URL [URL ...]     (auto-discovery)  Explicit base URLs. Space- or comma-separated; shell brace expansion supported
--auto                  (implicit)        Force discovery; fail loudly if nothing found
--host HOST             localhost         Hostname for --auto discovery and the fallback URL
--port-range LO-HI      8000-8015         Port range for --auto
--interval N            2.0               Refresh interval in seconds
--timeout N             4.0               Per-endpoint fetch timeout
--once                  off               Print one snapshot and exit
--output MODE           auto              auto/table/detail/json. json emits JSONL
--table                 off               Force table view (legacy; use --output table)
--detail                off               Force detail view (legacy; use --output detail)
--cost-in PRICE         off               USD per 1M input (prompt) tokens; enables token cost
--cost-out PRICE        off               USD per 1M output (generation) tokens
--gpu-cost-hour PRICE   auto              USD per GPU-hour. Defaults to nvidia-smi + built-in price hint
--num-gpus N            auto              GPU count. Defaults to nvidia-smi count
--no-gpu-detect         off               Skip nvidia-smi auto-detect entirely
--currency SYM          $                 Currency symbol shown in the Cost section
-V, --version                             Print version and exit
-h, --help                                Help

Concepts

Time-scale of every metric

vllm-htop mixes several time scales — each answers a different question:

Scale               Examples                                Source                                Best for
Instantaneous       Run, Wait, Swap, KV%                    gauge at this poll                    "What's the state right now?"
Windowed (~2s)      in/out tok/s, TTFT-P95, Cache%          counter / histogram-bucket deltas     "What's been happening this second?"
Long-window (~60s)  P95@1m column                           bucket delta over snapshots ≤60s old  "What's the SLO state?"
Trend (~2 min)      sparklines in detail view               rolling 60-sample buffer              "Is something trending up or down?"
Lifetime            life-Prompt/Output/Reqs, lifetime cost  vLLM *_total counters                 "How much total work since vLLM started?"
Session             peak-Run/KV, this-session cost          tracked since vllm-htop attached      "How much during my monitoring window?"

Aggregating percentiles across DP

The ALL row's P95 is computed by merging raw histogram buckets across replicas and then taking the percentile of the merged distribution. Averaging per-replica P95s is mathematically wrong — mean(P95) isn't P95(union). This matters most when one replica is hot and others are idle: averaging would understate the tail.
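
A small numeric example makes the difference concrete. The sketch assumes every replica exposes the same bucket bounds and, for brevity, reports the bucket's upper bound instead of interpolating; the helper names are hypothetical.

```python
from typing import List, Tuple

Buckets = List[Tuple[float, float]]  # (upper bound in ms, cumulative count)


def merge(replicas: List[Buckets]) -> Buckets:
    """Sum cumulative counts bucket by bucket; assumes identical bounds everywhere."""
    return [(group[0][0], sum(c for _, c in group)) for group in zip(*replicas)]


def p95(buckets: Buckets) -> float:
    """Upper bound of the first bucket covering 95% of observations.

    Assumes at least one observation; no interpolation, to keep the example short.
    """
    rank = 0.95 * buckets[-1][1]
    for le, count in buckets:
        if count >= rank:
            return le
    return buckets[-1][0]


# One hot replica: 1000 requests, all between 500 and 1000 ms.
hot = [(100.0, 0), (500.0, 0), (1000.0, 1000)]
# Three nearly idle replicas: 10 fast requests each.
idle = [(100.0, 10), (500.0, 10), (1000.0, 10)]
replicas = [hot, idle, idle, idle]

averaged = sum(p95(r) for r in replicas) / len(replicas)  # mean of per-replica P95s
merged = p95(merge(replicas))                             # P95 of the union
```

Here `averaged` is 325 ms while `merged` is 1000 ms: nearly all traffic went through the hot replica, and averaging hides that.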

Time-based vs token-based cost

These answer different questions, and both are useful:

  • Compute-based ($/h × N × uptime) — what's actually leaving your account
  • Token-based (tokens × $/M) — what the inference would cost (or is worth) at API prices
  • Margin (token revenue ÷ compute cost) — whether the GPU is paying for itself

If you self-host an LLM as an API, watch margin. For an internal-only tool, compute cost is what matters. For research or benchmarking, tokens burned is a hardware-independent yardstick.

Fault tolerance

  • DOWN replicas (fetch failed and no prior snapshot) appear as a row with the error, but don't break the aggregate or imbalance check.
  • STALE when the latest fetch failed but we have an older snapshot — useful through transient network blips.
  • Parallel polling via ThreadPoolExecutor — total refresh ≈ slowest single fetch, regardless of replica count.
  • Substring metric-name matching so version drift between vllm:gpu_cache_usage_perc and vllm:kv_cache_usage_perc doesn't break anything.
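
The DOWN / STALE distinction, parallel polling, and substring matching can be sketched together. Everything here is illustrative: the function names, the snapshot shape (a flat name-to-value dict), and the injected `fetch` callable are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Tuple

Snapshot = Dict[str, float]
Result = Tuple[str, Optional[Snapshot]]  # (status, snapshot to display)


def poll_all(urls: List[str], fetch: Callable[[str], Snapshot],
             last_good: Dict[str, Snapshot]) -> Dict[str, Result]:
    """Fetch every URL in parallel; a failure degrades one row, not the whole table."""

    def one(url: str) -> Result:
        try:
            return ("OK", fetch(url))
        except Exception:
            old = last_good.get(url)
            # STALE: reuse the older snapshot. DOWN: nothing to show but the error state.
            return ("STALE", old) if old is not None else ("DOWN", None)

    with ThreadPoolExecutor(max_workers=max(1, len(urls))) as pool:
        results = dict(zip(urls, pool.map(one, urls)))
    for url, (status, snap) in results.items():
        if status == "OK" and snap is not None:
            last_good[url] = snap  # remember for future STALE fallbacks
    return results


def find_metric(snapshot: Snapshot, needle: str) -> Optional[float]:
    """Substring match, so either *_cache_usage_perc spelling resolves."""
    for name, value in snapshot.items():
        if needle in name:
            return value
    return None
```

Because each fetch runs in its own worker, one slow or dead endpoint delays the refresh by at most its own timeout, and `find_metric(snap, "cache_usage_perc")` works against both metric names mentioned above.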

Why?

vLLM exports a rich Prometheus /metrics endpoint, but:

  • Production Prometheus + Grafana is overkill when you just SSH'd in and want to know if a server is healthy right now.
  • curl /metrics | grep can't compute windowed percentiles, rates, or cross-replica aggregates.
  • The default vLLM Grafana dashboard doesn't surface cross-replica imbalance — which is the most common operational failure mode for DP setups.

vllm-htop sits between Grafana (always-on, persistent) and curl (one-off, raw). Single file, ssh-friendly, zero ops setup.

Limitations

  • No alerting — this is a viewer, not a notifier. For real alerting see Andrey Krisanov's vLLM Prometheus rules.
  • Peaks are in-memory only — when the script exits, session peaks are lost. For long-term persistence, use Prometheus.
  • GPU price hints are ballpark — RunPod-anchored medians, ±30% across providers. Pass --gpu-cost-hour for accuracy.
  • nvidia-smi auto-detection only sees the machine vllm-htop runs on. If you SSH into the vLLM host and run vllm-htop against localhost, it detects that host's GPUs, which is correct. If you point --url at a remote vLLM server, the locally detected GPUs are irrelevant; pass --gpu-cost-hour explicitly.

Acknowledgments

The vLLM project for exposing rich metrics by default, and the reference Grafana dashboard that informed the choice of which metrics matter most.

License

MIT
