vLLM Benchmark Suite
A comprehensive benchmarking suite for vLLM inference servers. Built by a vLLM user, for vLLM users.
Quick Start
pip install vllm-benchmark-suite
vllm-bench --quick
That's it. Point it at your vLLM server and get a complete performance profile in 5 minutes.
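The tool defaults to http://localhost:8000; for a remote server, pass the URL explicitly (the hostname below is just an example):
vllm-bench --quick --url http://my-gpu-box:8000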
Why This Tool?
Every vLLM operator asks the same questions: Is my setup fast enough? Where's the bottleneck? Did that config change make things better or worse?
I built this because I was tired of writing one-off scripts to answer those questions. This tool gives you a single benchmark score, plain-English diagnostics, and shareable HTML reports — everything you need to understand, optimize, and communicate your vLLM deployment's performance.
Features
vLLM Score (0-10,000)
A single composite number — like Geekbench for vLLM. Compare deployments, track improvements, share results. Weighted across throughput, latency, efficiency, energy, and consistency.
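For intuition, here is a minimal sketch of how such a weighted composite can be computed, using the category weights shown in the sample output later on this page. The tool's actual normalization and scaling are internal to the project, so treat this as illustrative only.

```python
# Illustrative sketch of a weighted composite score (not the tool's actual formula).
# Subscores are assumed to be on the same 0-10,000 scale as the final score.
WEIGHTS = {
    "throughput": 0.30,   # weights as shown in the sample output below
    "latency": 0.25,
    "efficiency": 0.20,
    "energy": 0.15,
    "consistency": 0.10,
}

def composite_score(subscores: dict) -> float:
    """Weighted average of per-category subscores."""
    return sum(WEIGHTS[name] * subscores[name] for name in WEIGHTS)

print(composite_score({
    "throughput": 7000, "latency": 8000, "efficiency": 7500,
    "energy": 6000, "consistency": 9000,
}))  # 7400.0
```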
Auto-Diagnostics
After every run, get plain-English analysis: "Your p99 latency is 5x your average at 128K context — this usually means request queuing. Try reducing max_num_seqs." Not just numbers — actionable recommendations.
True TTFT Measurement
Uses SSE streaming to measure actual Time-to-First-Token, not estimates. Know exactly how long users wait before they see output.
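To illustrate the technique (this is not the tool's internal code): against an OpenAI-compatible vLLM endpoint, TTFT can be measured by opening a streaming request and timing the arrival of the first SSE data chunk. The URL and model name below are placeholders.

```python
# Illustrative sketch of SSE-based TTFT measurement, not the tool's internals.
# Assumes an OpenAI-compatible vLLM server; URL and model name are placeholders.
import time
import requests

def measure_ttft(prompt: str, url: str = "http://localhost:8000/v1/completions") -> float:
    payload = {
        "model": "my-model",  # placeholder; use your served model's name
        "prompt": prompt,
        "max_tokens": 64,
        "stream": True,       # streaming is what makes true TTFT observable
    }
    start = time.perf_counter()
    with requests.post(url, json=payload, stream=True, timeout=120) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            # The first SSE payload line marks the first token's arrival.
            if line.startswith(b"data: ") and line != b"data: [DONE]":
                return time.perf_counter() - start
    raise RuntimeError("stream ended before any token arrived")
```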
Interactive HTML Reports
Self-contained HTML files with interactive Plotly charts. Share with your team, attach to tickets, email to your manager. Dark-themed, professional, no dependencies to view.
Regression Detection
vllm-bench --standard --compare last_week.json
Automatically flags throughput drops and latency increases after vLLM upgrades, config changes, or infrastructure moves.
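Conceptually, the comparison amounts to loading two result files and flagging deltas beyond a tolerance. A minimal sketch, assuming a hypothetical peak_throughput_tok_s key (inspect a generated benchmark_*.json for the real schema):

```python
import json

def flag_throughput_regression(baseline_path, current_path, tolerance=0.05):
    """Warn if throughput dropped more than `tolerance` vs. the baseline run."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(current_path) as f:
        current = json.load(f)
    # "peak_throughput_tok_s" is an assumed key, for illustration only.
    old = baseline["peak_throughput_tok_s"]
    new = current["peak_throughput_tok_s"]
    if new < old * (1 - tolerance):
        print(f"WARNING: throughput regressed {old:.1f} -> {new:.1f} tok/s")
```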
25+ Performance Metrics
- Throughput: tokens/sec, requests/sec, tokens/sec/user, batch scaling efficiency
- Latency: avg, min, max, P50/P90/P95/P99, TTFT, inter-token latency
- GPU: utilization, memory, temperature, power draw, clock frequencies (see the telemetry sketch after this list)
- Energy: tokens/watt, watts/token/user, total watt-hours consumed
- Cache: prefix cache hit rate, actual prefill/decode time separation
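The GPU and energy figures above are the kind of telemetry NVIDIA's NVML exposes. Here is a minimal sketch of sampling them with the nvidia-ml-py (pynvml) bindings; this shows how such metrics are commonly read, not necessarily this tool's implementation:

```python
# Sampling GPU telemetry via NVML (requires the nvidia-ml-py package).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU
util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu                      # percent
mem_used = pynvml.nvmlDeviceGetMemoryInfo(handle).used                       # bytes
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)  # deg C
power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0                    # NVML reports milliwatts
sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)       # MHz
print(f"util={util}% mem={mem_used / 2**30:.1f} GiB temp={temp}C "
      f"power={power_w:.0f}W sm_clock={sm_clock}MHz")
pynvml.nvmlShutdown()
```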
13+ Publication-Quality Charts
Throughput landscapes, latency/throughput heatmaps, TTFT with UX quality zones, inter-token latency, batch scaling efficiency, decode speed, power draw, prompt type comparisons, cache hit rate heatmaps.
4 Prompt Strategies
Test cache behavior with different prompt types:
- Classic: Deterministic cybersecurity text (high cache hits)
- Deterministic: Tokenizer-aware repetitive story (perfect cache hits)
- Madlib: Random word injection (moderate cache misses)
- Random: Fully random text (minimal cache hits)
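You can also select a single strategy directly with --prompt-type (see the CLI reference below), for example:
vllm-bench --standard --prompt-type random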
Custom Prompts
vllm-bench --prompts-file my_production_prompts.jsonl
Test with your actual production prompts for realistic performance numbers.
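For illustration only, assuming the common one-JSON-object-per-line layout with a prompt field (check the project docs for the exact schema), such a file might look like:

```
{"prompt": "Summarize this incident report for an executive audience: ..."}
{"prompt": "Refactor the following function to be thread-safe: ..."}
```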
Benchmark Presets
| Preset | Time | Context Lengths | Concurrency | Prompt Types |
|---|---|---|---|---|
| --quick | ~5 min | 32K | 1, 4 | classic |
| --standard | ~30 min | 32K, 64K, 128K | 1, 4, 8, 16 | classic, deterministic |
| --thorough | ~2 hours | 32K–512K | 1, 4, 8, 16, 32 | all 4 types |
Or configure everything manually:
vllm-bench --context-lengths 32k,64k,128k --concurrency 1,4,8,16 --output-tokens 500
CLI Reference
vllm-bench [OPTIONS]
Connection:
--url URL vLLM server URL (default: http://localhost:8000)
--model NAME Model name override (auto-detected)
Presets:
--quick Quick benchmark (~5 min)
--standard Standard benchmark (~30 min)
--thorough Thorough benchmark (~2 hours)
Test Parameters:
--context-lengths Comma-separated (e.g. 32k,64k,128k)
--concurrency Comma-separated (e.g. 1,4,8,16)
--output-tokens N Max output tokens (default: 500)
--prompt-type TYPE classic|deterministic|madlib|random|all
--prompts-file PATH Custom prompts JSONL file
Behavior:
-y, --non-interactive Skip interactive prompts
--no-warmup Skip model warmup
--no-streaming Disable streaming TTFT measurement
Output:
--output-dir DIR Output directory (default: ./outputs)
--no-html Skip HTML report
--no-charts Skip PNG charts
Comparison:
--compare FILE Compare with previous results JSON
Sample Output
┌──────────────────────────────────────┐
│ vLLM Benchmark Suite │
│ v3.0.0 │
└──────────────────────────────────────┘
vLLM Benchmark Score: 7,234 / 10,000 (Grade: A)
Throughput (30%) ████████████████████░░░░░░░░░░ 6,800
Latency (25%) ██████████████████████████░░░░ 8,500
Efficiency (20%) ███████████████████████░░░░░░░ 7,650
Energy (15%) █████████████████░░░░░░░░░░░░░ 5,800
Consistency (10%) █████████████████████████████░ 9,200
Diagnostics:
OK Excellent throughput: 1,360 tokens/sec at 32K with 16 users
OK All metrics look healthy
WARNING GPU temperature peaked at 82°C
Performance Highlights:
Peak Throughput 1,360.4 tok/s 16u @ 32K
Best Efficiency 340.1 tok/s/user 1u @ 32K
Lowest Latency 0.42s 1u @ 32K
Outputs:
JSON: ./outputs/benchmark_Llama-3-70B_20240115_143022.json
CSV: ./outputs/benchmark_Llama-3-70B_20240115_143022.csv
Charts: ./outputs/benchmark_Llama-3-70B_20240115_143022.png
HTML: ./outputs/benchmark_Llama-3-70B_20240115_143022.html
Installation
# From PyPI
pip install vllm-benchmark-suite
# Or with uv (faster)
uv pip install vllm-benchmark-suite
# From source
git clone https://github.com/notadestroyer/vllm-benchmark-suite.git
cd vllm-benchmark-suite
pip install -e ".[dev]"
Requirements
- Python 3.10+
- A running vLLM server (local or remote)
- NVIDIA GPU (for GPU metrics — benchmarking works without one, but you won't get GPU telemetry)
Output Files
Each benchmark run generates:
| File | Description |
|---|---|
| benchmark_*.json | Complete results with metadata, system info, and all metrics |
| benchmark_*.csv | Tabular results for spreadsheet analysis |
| benchmark_*.png | 13+ publication-quality matplotlib charts (300 DPI) |
| benchmark_*.html | Interactive HTML report with Plotly charts |
Contributing
Contributions welcome. Please open an issue first to discuss what you'd like to change.
git clone https://github.com/notadestroyer/vllm-benchmark-suite.git
cd vllm-benchmark-suite
pip install -e ".[dev]"
pytest tests/ -v
ruff check src/
License
MIT