The most comprehensive benchmarking suite for vLLM inference servers
Project description
vLLM Benchmark Suite
A rigorous benchmarking tool for vLLM inference servers. Async load generation, statistical confidence intervals, plain-English diagnostics, and shareable reports.
Quick Start
pip install vllm-benchmark-suite
vllm-bench --quick
Point it at your vLLM server and get a full performance profile in ~5 minutes.
What It Does
- True async load — requests run concurrently via
aiohttp, not threads. No GIL bottleneck. - Two load modes — burst (all N requests simultaneously) and sustained RPS (token-bucket rate limiter).
- True TTFT — actual Time-to-First-Token via SSE streaming, not an estimate.
- Accurate token counts — uses
AutoTokenizerfromtransformers, notlen(text) // 4. - Statistical rigor — run multiple iterations to get 95% confidence intervals, outlier detection, and CV warnings.
- Cost analysis — tokens per dollar, cost per 1M tokens, auto-detected from GPU name.
- Composite score — weighted 0–10,000 score across throughput, latency, efficiency, energy, and consistency.
- Auto-diagnostics — 10+ rule-based checks with plain-English recommendations.
- Regression detection — compare two runs and flag statistically meaningful changes.
- Shareable reports — self-contained HTML with Plotly charts, plus PNG and JSON/CSV.
Benchmark Presets
| Preset | Time | Context Lengths | Concurrency | Prompt Types |
|---|---|---|---|---|
--quick |
~5 min | 32K | 1, 4 | classic |
--standard |
~30 min | 32K, 64K, 128K | 1, 4, 8, 16 | classic, deterministic |
--thorough |
~2 hours | 32K–512K | 1, 4, 8, 16, 32 | all 4 types |
Or configure everything manually:
vllm-bench --context-lengths 32k,64k,128k --concurrency 1,4,8,16 --output-tokens 500
Load Modes
Burst (default)
Fires all N concurrent requests simultaneously. Tests peak throughput and how well vLLM handles queue pressure.
vllm-bench --standard --concurrency 1,4,8,16
Sustained RPS
Sends requests at a steady rate for a fixed duration. Tests real-world latency behaviour under continuous load.
vllm-bench --rps 10 --duration 120
Produces per-time-bucket latency tracking (avg and P99 in 10-second windows), actual vs target RPS, and steady-state detection.
Statistical Rigor
Single-run results have no confidence bounds. Use --iterations to get them:
vllm-bench --standard --iterations 5 --seed 42
With multiple iterations the tool runs each (context, concurrency, prompt_type) combination N times, then aggregates:
- 95% confidence intervals on throughput, latency, TTFT (t-distribution via SciPy, bootstrap fallback)
- IQR outlier detection with transparent fence reporting
- Coefficient of variation warnings when variance is too high to trust results
- Welch's t-test + Cohen's d for regression comparisons
Summary table shows CI bounds:
Peak Throughput 1,500.3 tok/s [1,480.1 – 1,520.5] 16u @ 32K
--seed sets random.seed and numpy.random.seed for reproducible prompt generation. The full environment fingerprint (kernel, CPU governor, GPU clocks, driver, package versions) is captured as a SHA-256 hash and printed at the end of every run.
Cost Analysis
vllm-bench --standard --cost 2.21 # explicit $/hr
vllm-bench --standard # auto-detected from GPU name
Known GPU hourly rates (cloud on-demand):
| GPU | $/hr |
|---|---|
| H100 | $4.00 |
| A100 80GB | $2.21 |
| A100 40GB | $1.80 |
| L40S | $1.50 |
| RTX 4090 | $0.74 |
| T4 | $0.53 |
Reported per test: cost per 1M tokens, total cost for that configuration.
Prompt Strategies
Four strategies let you control prefix cache behaviour:
| Type | Cache behaviour | Use case |
|---|---|---|
classic |
High cache hits | Realistic long-context workload |
deterministic |
Near-perfect cache hits | Best-case cache performance |
madlib |
Moderate cache misses | Mixed workload |
random |
Minimal cache hits | Worst-case / stress test |
Or use your own:
vllm-bench --prompts-file production_prompts.jsonl
JSONL format: one JSON object per line with a "prompt" key.
vLLM Score (0–10,000)
A single composite number for easy comparison across runs and deployments.
| Dimension | Weight | What it measures |
|---|---|---|
| Throughput | 30% | Peak tokens/sec vs GPU reference |
| Latency | 25% | Best average latency (lower = better) |
| Efficiency | 20% | Tokens/sec per concurrent user |
| Energy | 15% | Tokens per watt |
| Consistency | 10% | Latency coefficient of variation |
Grades: S (9000+) · A (7500–8999) · B (6000–7499) · C (4000–5999) · D (2000–3999) · F (<2000)
GPU-specific reference baselines are built in for H100, A100, L40S, RTX 4090, T4, and others.
Diagnostics
After every run, 10+ automated checks produce plain-English findings:
OK Peak throughput 1,360 tok/s at 32K, 16 users
WARN GPU temperature peaked at 82°C — thermal throttling risk
WARN p99 latency is 5× average at 128K — likely request queuing
Consider reducing max_num_seqs or enabling prefix caching
Checks include: request failure rate, latency variance, GPU utilisation, TTFT, batch scaling efficiency, cache effectiveness, memory pressure, temperature, and energy efficiency. When vLLM server info is available, config recommendations are included (prefix caching, tensor parallelism, quantization, max_num_seqs).
Regression Detection
vllm-bench --standard --compare baseline.json
Compares results matched by (context_length, concurrency, prompt_type). Flags changes against configurable thresholds (major: >15%, minor: 5–15%). With --iterations, uses Welch's t-test to distinguish real regressions from measurement noise.
Output Files
Each run writes to ./outputs/ (override with --output-dir):
| File | Description |
|---|---|
benchmark_*.json |
All results + metadata, system info, environment fingerprint |
benchmark_*.csv |
Tabular results for spreadsheet analysis |
benchmark_*.png |
5 publication-quality charts (300 DPI) |
benchmark_*.html |
Self-contained interactive report (Plotly, dark theme) |
result_*.json |
Standardised entry for community leaderboard (optional) |
Charts
- Throughput vs Context Length — line plot per concurrency level
- Latency Distribution — box plot with P99 overlay
- TTFT Distribution — with UX quality zones (green <200 ms, yellow <1 s, red >1 s)
- Throughput Heatmap — context × concurrency grid
- GPU Utilization & Power — dual-axis timeline
Metrics Reference
Throughput: tokens_per_second, requests_per_second, throughput_per_user
Latency: avg_latency, min_latency, max_latency, latency_p50/p90/p95/p99
TTFT: ttft_estimate, ttft_p50/p90/p95/p99
Inter-token latency: inter_token_latency, itl_p50/p90/p95/p99
GPU (nvidia-smi): avg_gpu_util, max_gpu_util, avg_mem_used, avg_temperature, avg_power, avg_gpu_clock
Energy: tokens_per_watt, watts_per_token, energy_joules
Cache (vLLM /metrics endpoint): cache_hit_rate, actual_prefill_time, actual_decode_time
Cost (when available): cost_per_hour, cost_per_1m_tokens, cost_total
Tokens: prompt_tokens, completion_tokens, total_tokens
Statistical (with --iterations > 1): *_ci_lower, *_ci_upper for throughput, latency, and TTFT metrics
CLI Reference
vllm-bench [OPTIONS]
Connection:
--url URL vLLM server URL (default: http://localhost:8000)
--model NAME Model name override (auto-detected)
Presets (mutually exclusive):
--quick ~5 min
--standard ~30 min
--thorough ~2 hours
Test Parameters:
--context-lengths Comma-separated, e.g. 32k,64k,128k or 1m
--concurrency Comma-separated, e.g. 1,4,8,16
--output-tokens N Max output tokens per request (default: 500)
--prompt-type TYPE classic|deterministic|madlib|random|all
--prompts-file PATH Custom prompts JSONL file
Load Mode:
--rps FLOAT Sustained requests-per-second mode
--duration FLOAT Duration for sustained RPS run (default: 120s)
Statistical Rigor:
--iterations N Iterations per config for confidence intervals (default: 1)
--seed INT Random seed for reproducibility
Cost:
--cost FLOAT GPU cost in USD/hr (auto-detected for known GPUs)
Behavior:
-y, --non-interactive Skip interactive prompts, use defaults
--no-warmup Skip model warmup
--no-streaming Disable streaming TTFT measurement
Output:
--output-dir DIR Output directory (default: ./outputs)
--no-html Skip HTML report
--no-charts Skip PNG charts
Traffic Simulation:
--traffic TYPE poisson|multiturn
--target-rps FLOAT Target RPS for traffic simulation (default: 2.0)
--traffic-duration S Duration in seconds (default: 60)
--turns N Turns per conversation for multiturn (default: 5)
Comparison:
--compare FILE Compare with previous results JSON
Installation
# From PyPI
pip install vllm-benchmark-suite
# With uv
uv pip install vllm-benchmark-suite
# From source
git clone https://github.com/notadestroyer/vllm-benchmark-suite.git
cd vllm-benchmark-suite
pip install -e ".[dev]"
Requirements: Python 3.10+, a running vLLM server. NVIDIA GPU optional (required for GPU metrics).
Contributing
git clone https://github.com/notadestroyer/vllm-benchmark-suite.git
cd vllm-benchmark-suite
pip install -e ".[dev]"
pytest tests/ -v
ruff check src/
Open an issue first to discuss significant changes.
License
MIT
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file vllm_benchmark_suite-3.1.1.tar.gz.
File metadata
- Download URL: vllm_benchmark_suite-3.1.1.tar.gz
- Upload date:
- Size: 75.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
e9245b825e390058ddd9324d84a7afedad3439034b5ccbf7c360335271bce762
|
|
| MD5 |
b832a7e016a57275c04a92c22fcd2df3
|
|
| BLAKE2b-256 |
aeb2f5bda29ffcfc2943eca713e767cdbbdd325d46acdc24c570adb91ddf4602
|
File details
Details for the file vllm_benchmark_suite-3.1.1-py3-none-any.whl.
File metadata
- Download URL: vllm_benchmark_suite-3.1.1-py3-none-any.whl
- Upload date:
- Size: 69.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
88892d8d31d1034fda4e65ed5c7de8f686c46e0f04592df61add3ac88f8d5bbe
|
|
| MD5 |
5cd580ba06654880cebd2197833ae1b1
|
|
| BLAKE2b-256 |
c634bf755d7af7f53d28d17f3d317c223909b6d9b87c8f452467a4eb3e67cdc8
|