vllmtop
A btop-style terminal UI for monitoring a running vLLM instance and its GPU in real time. Hand-rolled braille charts, a responsive curses layout, and a non-blocking background poller so the UI never stalls on network or NVML latency.
╭─┐¹ gpu ┌──────────────────────────────────────────────────────────────────╮
│ NVIDIA GeForce RTX 5090 util 86% 50°C 319/600W SM 2857MHz fan 31% │
│ ⣿⣿⣿⣿ … braille utilisation chart … │
│ VRAM 27.3GB/31.8GB ████████████████████░░░ 86% │
│ PWR ████████████░░░░░░░░░░░░░░░░░░░░░░░░░░ 53% │
╰─────────────────────────────────────────────────────────────────────────────╯
╭─┐² throughput ┌────────────╮╭─┐³ requests ┌──────────────╮
│ gen 149 tok/s ⣀⣤⣶⣿ ││ running ████░░ 1 │
│ prompt 3.5k tok/s ││ waiting ██████ 2 │
╰─┘ tok/s └──────────────────╯╰─┘ queue └───────────────────╯
╭─┐⁴ latency ┌───────────────╮╭─┐⁵ cache ┌──────────────────╮
│ TTFT 964ms TPOT 6ms … ││ KV █░░░ 3% prefix 0.0% │
╰─┘ recent avg └─────────────╯╰─────────────────────────────╯
Rounded corners, superscript panel numbers in the title tabs, and a secondary label on the bottom edge — matching btop's box style.
What it shows
- GPU (via NVML / pynvml): utilisation %, VRAM used/total, temperature, power draw vs. limit, SM clock, fan, with green/yellow/red thresholds.
- Throughput: generation tok/s and prompt tok/s (rates derived from vLLM counters), as big numbers + braille charts.
- Requests / Queue: running vs. waiting requests and preemptions.
- Latency (recent average over the last poll interval — far more useful live than the cumulative average): TTFT, inter-token (TPOT), end-to-end, queue time.
- Cache: KV-cache usage % and prefix-cache hit rate.
Data comes from vLLM's Prometheus /metrics endpoint plus in-process NVML
polling. If vLLM goes away (e.g. a container restart) the UI shows a disconnect
banner and keeps the GPU panel live, then reconnects automatically.
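The disconnect behaviour described above can be sketched roughly as follows. This is a minimal illustration, not vllmtop's actual code: the `Snapshot` and `Poller` names, their fields, and the injected `fetch` / `read_gpu` callables are all assumptions made for the example. The point is that a failed `/metrics` fetch only flips a `connected` flag, while the GPU reading is still stored and the next poll simply tries again.

```python
import threading
import time
import urllib.request
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """Latest combined sample (hypothetical shape, for illustration only)."""
    timestamp: float = 0.0
    connected: bool = False
    metrics_text: str = ""
    gpu: dict = field(default_factory=dict)


def fetch_metrics(base_url: str, timeout: float = 2.0) -> str:
    """Fetch the raw Prometheus exposition text from <base_url>/metrics."""
    url = base_url.rstrip("/") + "/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


class Poller:
    """Polls vLLM and the GPU on a background thread (illustrative sketch)."""

    def __init__(self, fetch, read_gpu, interval: float = 1.0):
        self._fetch = fetch          # callable returning metrics text; may raise
        self._read_gpu = read_gpu    # callable returning a dict of GPU readings
        self._interval = interval
        self._lock = threading.Lock()
        self._latest = Snapshot()
        self._stop = threading.Event()

    def poll_once(self) -> None:
        gpu = self._read_gpu()       # GPU data is collected even if vLLM is down
        try:
            text, connected = self._fetch(), True
        except OSError:              # URLError, timeouts, refused connections
            text, connected = "", False
        snap = Snapshot(time.monotonic(), connected, text, gpu)
        with self._lock:             # publish the snapshot atomically
            self._latest = snap

    def latest(self) -> Snapshot:
        with self._lock:
            return self._latest

    def run(self) -> None:
        # Retrying every interval is what makes reconnection automatic.
        while not self._stop.is_set():
            self.poll_once()
            self._stop.wait(self._interval)

    def start(self) -> threading.Thread:
        thread = threading.Thread(target=self.run, daemon=True)
        thread.start()
        return thread

    def stop(self) -> None:
        self._stop.set()
```

A UI loop built on this sketch would call `latest()` on each tick and draw a banner whenever `connected` is false, without ever waiting on the network itself.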
Install
Requires Python 3.10+ on Linux (curses is stdlib). A working NVIDIA driver is needed for the GPU panel.
pip install .
# or, for development:
pip install -e ".[dev]"
Dependencies: nvidia-ml-py (NVML bindings) and prometheus-client (exposition
parser). The /metrics fetch uses stdlib urllib.
Usage
vllmtop # monitor http://localhost:8000
vllmtop --url http://host:8000 # a remote vLLM server
vllmtop --interval 0.5 # poll twice a second
vllmtop --no-gpu # skip the GPU panel
python -m vllmtop # same thing, without the entry point
The server URL can also be set via the VLLMTOP_URL environment variable.
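One common way to wire an environment variable into a CLI default is shown below. It is only a sketch: the flag names and defaults are taken from the options table, but the `build_parser` helper is hypothetical, and the precedence (an explicit `--url` overrides `VLLMTOP_URL`) is an assumption rather than documented behaviour.

```python
import argparse
import os

DEFAULT_URL = "http://localhost:8000"


def build_parser(env=None) -> argparse.ArgumentParser:
    """Build a parser whose --url default comes from VLLMTOP_URL if set.

    `env` is injectable so the behaviour can be checked without touching
    the real process environment (an illustrative choice).
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="vllmtop")
    parser.add_argument(
        "--url",
        default=env.get("VLLMTOP_URL", DEFAULT_URL),
        help="vLLM base URL (env VLLMTOP_URL)",
    )
    parser.add_argument("--interval", type=float, default=1.0,
                        help="poll interval in seconds")
    parser.add_argument("--gpu-index", type=int, default=0,
                        help="NVML GPU index")
    parser.add_argument("--no-gpu", action="store_true",
                        help="disable the GPU panel")
    parser.add_argument("--dump-json", action="store_true",
                        help="print a JSON snapshot and exit")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.url)
```

With this arrangement the environment variable acts as a per-shell default, and the command-line flag remains available for one-off overrides.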
Options
| Flag | Default | Description |
|---|---|---|
| --url | http://localhost:8000 | vLLM base URL (env VLLMTOP_URL) |
| --interval | 1.0 | poll interval in seconds |
| --gpu-index | 0 | NVML GPU index |
| --no-gpu | off | disable the GPU panel |
| --dump-json | off | collect one snapshot, print JSON, exit (no TTY) |
Keybindings
| Key | Action |
|---|---|
| q / Esc | quit |
| + / - | faster / slower refresh |
| p | pause / resume polling |
| 1–5 | toggle a panel on/off (¹gpu ²throughput ³requests ⁴latency ⁵cache) |
| h / ? | toggle help overlay |
Each panel's title carries a superscript number (btop-style) showing the key that toggles it. Hiding panels reflows the rest to fill the freed space.
Headless smoke test
--dump-json collects two snapshots an interval apart (so rates are populated),
prints the result as JSON, and exits. Works without a TTY — handy for CI or
verifying connectivity:
python -m vllmtop --dump-json --url http://localhost:8000
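Two samples are needed because a rate is a difference between counter readings. The sketch below shows the arithmetic under stated assumptions: the function names are hypothetical, and returning `None` for an unusable sample (non-positive elapsed time, a counter that went backwards, or no new observations) is one reasonable policy, not necessarily the one vllmtop uses.

```python
from typing import Optional


def counter_rate(prev: float, curr: float, dt: float) -> Optional[float]:
    """Per-second rate between two counter samples taken dt seconds apart.

    Returns None when the rate is undefined: dt <= 0, or the counter
    decreased (treated here as a reset, e.g. after a server restart).
    """
    if dt <= 0:
        return None
    delta = curr - prev
    if delta < 0:
        return None
    return delta / dt


def recent_average(prev_sum: float, curr_sum: float,
                   prev_count: float, curr_count: float) -> Optional[float]:
    """Mean of the histogram observations recorded between two polls.

    Uses the histogram's cumulative sum and count series: the average over
    the interval is (change in sum) / (change in count).
    """
    d_count = curr_count - prev_count
    d_sum = curr_sum - prev_sum
    if d_count <= 0 or d_sum < 0:   # nothing new observed, or a reset
        return None
    return d_sum / d_count


if __name__ == "__main__":
    # 149 new tokens over one second
    print(counter_rate(1000, 1149, 1.0))
    # two new requests whose latencies sum to 1.928 s
    print(recent_average(10.0, 11.928, 100, 102))
```

Skipping the sample on a reset avoids drawing a large negative spike in a chart; the next pair of polls produces a valid rate again.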
How it works
- A background poller thread scrapes /metrics and polls NVML every interval seconds, storing the latest combined snapshot under a lock. This keeps all I/O latency off the render path.
- The UI loop wakes on a short tick (250 ms), reads the latest snapshot, appends derived values (rates, recent-average latencies) to per-series ring buffers, and redraws, so render cadence is independent of poll cadence.
- Counters → rates: Δvalue / Δt, guarded against Δt ≤ 0 and counter resets. Histograms → recent average: Δsum / Δcount between polls.
- Braille charts: each cell is a 2×4 Unicode braille dot matrix, giving 2w × 4h-dot resolution for the smooth btop look.
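The braille technique can be illustrated with a short sketch. It is not the project's renderer: the `braille_chart` function, its parameters, and the filled-area style are assumptions chosen for the example. What it does rely on is the standard Unicode layout, where braille patterns start at U+2800 and each of the eight dots in a 2×4 cell corresponds to one bit of the code point offset.

```python
BRAILLE_BASE = 0x2800

# Bit for each dot, indexed as DOT_BITS[row][col] with row 0 at the top
# of the cell. This follows the Unicode braille dot numbering.
DOT_BITS = (
    (0x01, 0x08),
    (0x02, 0x10),
    (0x04, 0x20),
    (0x40, 0x80),
)


def braille_chart(values, width, height, vmax=1.0):
    """Render values as a filled chart of width x height character cells.

    Each cell holds 2 dot columns and 4 dot rows, so the chart plots the
    most recent 2*width values at a vertical resolution of 4*height dots.
    Returns a list of strings, top line first.
    """
    cols = width * 2
    rows = height * 4
    recent = list(values)[-cols:]
    recent = [0.0] * (cols - len(recent)) + recent   # left-pad short series
    cells = [[0] * width for _ in range(height)]
    for x, value in enumerate(recent):
        frac = min(max(value / vmax, 0.0), 1.0) if vmax > 0 else 0.0
        filled = round(frac * rows)                  # dots lit in this column
        for k in range(filled):
            y = rows - 1 - k                         # dot row, counted from top
            cells[y // 4][x // 2] |= DOT_BITS[y % 4][x % 2]
    return ["".join(chr(BRAILLE_BASE + bits) for bits in row) for row in cells]


if __name__ == "__main__":
    series = [i / 39 for i in range(40)]             # a simple ramp
    for line in braille_chart(series, width=20, height=3):
        print(line)
```

Feeding the function the tail of a ring buffer on every redraw gives a chart that scrolls left as new samples arrive.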
Development
pytest # parser-against-fixture, rate math, braille rendering
License
MIT — see LICENSE.