The most comprehensive benchmarking suite for vLLM inference servers

vLLM Benchmark Suite

A rigorous benchmarking tool for vLLM inference servers. Async load generation, statistical confidence intervals, plain-English diagnostics, and shareable reports.

Quick Start

pip install vllm-benchmark-suite
vllm-bench --quick

Point it at your vLLM server with --url (default: http://localhost:8000) and get a full performance profile in about 5 minutes.

What It Does

True async load — requests run concurrently via aiohttp, not threads. No GIL bottleneck.
Two load modes — burst (all N requests simultaneously) and sustained RPS (token-bucket rate limiter).
True TTFT — actual Time-to-First-Token via SSE streaming, not an estimate.
Accurate token counts — uses AutoTokenizer from transformers, not len(text) // 4.
Statistical rigor — run multiple iterations to get 95% confidence intervals, outlier detection, and CV warnings.
Cost analysis — tokens per dollar, cost per 1M tokens, auto-detected from GPU name.
Composite score — weighted 0–10,000 score across throughput, latency, efficiency, energy, and consistency.
Auto-diagnostics — 10+ rule-based checks with plain-English recommendations.
Regression detection — compare two runs and flag statistically meaningful changes.
Shareable reports — self-contained HTML with Plotly charts, plus PNG and JSON/CSV.
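
To make the "True TTFT" item concrete, here is a minimal sketch of timing the first token on a server-sent-events stream. It is not the suite's implementation: `fake_sse_stream` and `measure_ttft` are hypothetical names, the stream is simulated with `asyncio.sleep` rather than a real aiohttp response, and the `data: [DONE]` terminator is assumed from the OpenAI-style streaming convention.

```python
import asyncio
import time
from typing import AsyncIterator, Optional, Tuple


async def fake_sse_stream(delay_s: float, n_tokens: int) -> AsyncIterator[str]:
    """Hypothetical stand-in for an SSE response body: yields 'data: ...' lines."""
    for i in range(n_tokens):
        await asyncio.sleep(delay_s)
        yield f'data: {{"token": "t{i}"}}'
    yield "data: [DONE]"


async def measure_ttft(lines: AsyncIterator[str]) -> Tuple[Optional[float], float, int]:
    """Return (ttft_seconds, total_seconds, chunk_count) for one stream."""
    start = time.perf_counter()
    ttft: Optional[float] = None
    chunks = 0
    async for line in lines:
        if not line.startswith("data:"):
            continue  # ignore comments and keep-alive lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        if ttft is None:
            # First content chunk: this is the measured time to first token.
            ttft = time.perf_counter() - start
        chunks += 1
    return ttft, time.perf_counter() - start, chunks


ttft, total, chunks = asyncio.run(measure_ttft(fake_sse_stream(0.01, 5)))
print(f"TTFT {ttft * 1000:.1f} ms, total {total * 1000:.1f} ms, {chunks} chunks")
```

The point of the sketch is that TTFT is read off the clock when the first content chunk arrives, instead of being derived from total latency and token count.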

Benchmark Presets

Preset	Time	Context Lengths	Concurrency	Prompt Types
`--quick`	~5 min	32K	1, 4	classic
`--standard`	~30 min	32K, 64K, 128K	1, 4, 8, 16	classic, deterministic
`--thorough`	~2 hours	32K–512K	1, 4, 8, 16, 32	all 4 types

Or configure everything manually:

vllm-bench --context-lengths 32k,64k,128k --concurrency 1,4,8,16 --output-tokens 500

Load Modes

Burst (default)

Fires all N concurrent requests simultaneously. Tests peak throughput and how well vLLM handles queue pressure.

vllm-bench --standard --concurrency 1,4,8,16
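
As a rough illustration of what a burst looks like on a single event loop, the sketch below starts N coroutines at once with `asyncio.gather`. `fake_request` is a hypothetical stand-in that sleeps instead of calling a server; the real tool issues HTTP requests via aiohttp.

```python
import asyncio
import time
from typing import List, Tuple


async def fake_request(i: int, service_time_s: float = 0.05) -> float:
    """Hypothetical stand-in for one HTTP request; returns its latency in seconds."""
    start = time.perf_counter()
    await asyncio.sleep(service_time_s)
    return time.perf_counter() - start


async def burst(n: int) -> Tuple[List[float], float]:
    """Start n requests at the same time and wait for all of them to finish."""
    start = time.perf_counter()
    latencies = await asyncio.gather(*(fake_request(i) for i in range(n)))
    return list(latencies), time.perf_counter() - start


latencies, wall = asyncio.run(burst(16))
print(f"{len(latencies)} requests finished in {wall:.3f} s wall time")
```

Because all 16 coroutines wait concurrently, the wall time stays close to one request's service time instead of growing 16-fold.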

Sustained RPS

Sends requests at a steady rate for a fixed duration. Tests real-world latency behaviour under continuous load.

vllm-bench --rps 10 --duration 120

This mode reports latency per time bucket (average and P99 in 10-second windows), actual versus target RPS, and steady-state detection.
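
A token-bucket limiter, as mentioned for sustained RPS mode, can be sketched in a few lines. The class below is an illustrative assumption, not the suite's code: the `TokenBucket` name, the injected `now` clock, and the capacity of one token are all choices made to keep the example deterministic.

```python
class TokenBucket:
    """Token bucket that refills at `rate` tokens per second up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so the first request goes out at once
        self.last = now

    def try_acquire(self, now: float) -> bool:
        """Take one token if one is available at time `now` (in seconds)."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Simulate a client polling every millisecond for 2 seconds at a 10 RPS target.
bucket = TokenBucket(rate=10.0, capacity=1.0)
sent = sum(bucket.try_acquire(now=t / 1000.0) for t in range(2000))
print(f"sent {sent} requests in 2 s (target 10 RPS)")
```

A capacity of one keeps the output evenly paced; a larger capacity would allow short bursts above the target rate after idle periods.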

Statistical Rigor

Single-run results have no confidence bounds. Use --iterations to get them:

vllm-bench --standard --iterations 5 --seed 42

With multiple iterations the tool runs each (context, concurrency, prompt_type) combination N times, then aggregates:

95% confidence intervals on throughput, latency, TTFT (t-distribution via SciPy, bootstrap fallback)
IQR outlier detection with transparent fence reporting
Coefficient of variation warnings when variance is too high to trust results
Welch's t-test + Cohen's d for regression comparisons

The summary table shows the CI bounds in brackets:

Peak Throughput    1,500.3 tok/s  [1,480.1 – 1,520.5]   16u @ 32K
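
The aggregation steps above can be approximated with the standard library alone. This sketch covers the bootstrap path (the t-distribution path needs SciPy), IQR fences, and the coefficient of variation. Function names, the 1.5×IQR fence multiplier, and the sample values are assumptions for illustration, not taken from the tool.

```python
import random
import statistics


def bootstrap_ci(samples, n_resamples=2000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(samples, k=len(samples)))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def iqr_fences(samples, k=1.5):
    """Lower and upper outlier fences at k times the interquartile range."""
    q1, _, q3 = statistics.quantiles(samples, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(samples) / statistics.fmean(samples)


# Illustrative throughput values (tok/s) from five hypothetical iterations.
throughput = [1480.1, 1495.7, 1500.3, 1508.9, 1520.5]
lo, hi = bootstrap_ci(throughput)
lower_fence, upper_fence = iqr_fences(throughput)
outliers = [x for x in throughput if x < lower_fence or x > upper_fence]
cv = coefficient_of_variation(throughput)
print(f"mean {statistics.fmean(throughput):.1f} [{lo:.1f}, {hi:.1f}], CV {cv:.2%}, outliers {outliers}")
```

With only a handful of iterations the bootstrap interval is coarse, which is one reason a t-distribution interval is usually preferred when SciPy is available.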

--seed sets random.seed and numpy.random.seed for reproducible prompt generation. The full environment fingerprint (kernel, CPU governor, GPU clocks, driver, package versions) is captured as a SHA-256 hash and printed at the end of every run.
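
One way to reduce an environment description to a single SHA-256 hash is to serialise it canonically before hashing. The sketch below is an assumption about the general approach, not the suite's code: `environment_fingerprint` is a hypothetical name, and it only collects fields available from the standard library (no CPU governor, GPU clocks, or driver).

```python
import hashlib
import json
import platform
import sys
from typing import Dict, Optional


def environment_fingerprint(extra: Optional[Dict[str, str]] = None) -> str:
    """Hash a small environment description into a hex SHA-256 digest."""
    info = {
        "kernel": platform.release(),
        "machine": platform.machine(),
        "python": sys.version.split()[0],
    }
    info.update(extra or {})
    # Sorted keys and fixed separators make the JSON, and so the hash, stable.
    canonical = json.dumps(info, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


fp = environment_fingerprint({"gpu_driver": "unknown"})
print(fp)
```

Two runs with identical fingerprints were captured under the same recorded conditions; any change to a recorded field produces a different digest.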

Cost Analysis

vllm-bench --standard --cost 2.21       # explicit $/hr
vllm-bench --standard                   # auto-detected from GPU name

Known GPU hourly rates (cloud on-demand):

GPU	$/hr
H100	$4.00
A100 80GB	$2.21
A100 40GB	$1.80
L40S	$1.50
RTX 4090	$0.74
T4	$0.53

Each test reports the cost per 1M tokens and the total cost for that configuration.
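
The arithmetic behind these figures is simple enough to check by hand. The sketch below assumes cost is derived purely from the hourly rate and measured throughput; the function names are hypothetical, and 1,500 tok/s is an illustrative value.

```python
def cost_per_million_tokens(cost_per_hour: float, tokens_per_second: float) -> float:
    """USD per 1M tokens, given an hourly GPU rate and sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600.0
    return cost_per_hour / tokens_per_hour * 1_000_000


def tokens_per_dollar(cost_per_hour: float, tokens_per_second: float) -> float:
    """Tokens generated per USD spent at the given rate and throughput."""
    return tokens_per_second * 3600.0 / cost_per_hour


# A100 80GB at the listed $2.21/hr, assuming 1,500 tok/s.
a100_cost = cost_per_million_tokens(2.21, 1500.0)
print(f"${a100_cost:.3f} per 1M tokens")
```

At 1,500 tok/s the GPU produces 5.4M tokens per hour, so $2.21/hr works out to roughly $0.41 per 1M tokens.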

Prompt Strategies

Four strategies let you control prefix cache behaviour:

Type	Cache behaviour	Use case
`classic`	High cache hits	Realistic long-context workload
`deterministic`	Near-perfect cache hits	Best-case cache performance
`madlib`	Moderate cache misses	Mixed workload
`random`	Minimal cache hits	Worst-case / stress test

Or use your own:

vllm-bench --prompts-file production_prompts.jsonl

JSONL format: one JSON object per line with a "prompt" key.
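
For reference, a file in that format can be written and read back as follows. `load_prompts` is a hypothetical helper for checking your own file before a run, not part of the tool, and the prompt texts are placeholders.

```python
import json
import os
import tempfile
from typing import List


def load_prompts(path: str) -> List[str]:
    """Read a JSONL file and return the value of each line's "prompt" key."""
    prompts = []
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if "prompt" not in record:
                raise ValueError(f"line {line_no}: missing 'prompt' key")
            prompts.append(record["prompt"])
    return prompts


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "production_prompts.jsonl")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(json.dumps({"prompt": "Summarise the attached report."}) + "\n")
        fh.write(json.dumps({"prompt": "Translate 'hello' to French."}) + "\n")
    prompts = load_prompts(path)

print(prompts)
```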

vLLM Score (0–10,000)

A single composite number for easy comparison across runs and deployments.

Dimension	Weight	What it measures
Throughput	30%	Peak tokens/sec vs GPU reference
Latency	25%	Best average latency (lower = better)
Efficiency	20%	Tokens/sec per concurrent user
Energy	15%	Tokens per watt
Consistency	10%	Latency coefficient of variation

Grades: S (9000+) · A (7500–8999) · B (6000–7499) · C (4000–5999) · D (2000–3999) · F (<2000)

GPU-specific reference baselines are built in for H100, A100, L40S, RTX 4090, T4, and others.
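
The weights and grade bands above are enough to sketch how a composite is assembled. How each dimension is normalised against the GPU reference is not documented here, so this example simply assumes every sub-score has already been scaled to the range 0 to 1; the function names are hypothetical.

```python
WEIGHTS = {
    "throughput": 0.30,
    "latency": 0.25,
    "efficiency": 0.20,
    "energy": 0.15,
    "consistency": 0.10,
}
GRADES = [(9000, "S"), (7500, "A"), (6000, "B"), (4000, "C"), (2000, "D")]


def composite_score(subscores: dict) -> int:
    """Weighted sum of sub-scores (each assumed to be in 0..1), scaled to 0..10,000."""
    total = sum(
        WEIGHTS[name] * min(max(subscores[name], 0.0), 1.0) for name in WEIGHTS
    )
    return round(total * 10_000)


def grade(score: int) -> str:
    """Map a 0..10,000 score to the letter grades listed above."""
    for threshold, letter in GRADES:
        if score >= threshold:
            return letter
    return "F"


example = {"throughput": 0.9, "latency": 0.8, "efficiency": 0.7, "energy": 0.6, "consistency": 0.5}
score = composite_score(example)
print(score, grade(score))
```

In this example the weighted sum is 0.75, which gives a score of 7,500 and lands exactly on the lower edge of grade A.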

Diagnostics

After every run, 10+ automated checks produce plain-English findings:

OK   Peak throughput 1,360 tok/s at 32K, 16 users
WARN GPU temperature peaked at 82°C — thermal throttling risk
WARN p99 latency is 5× average at 128K — likely request queuing
     Consider reducing max_num_seqs or enabling prefix caching

Checks include: request failure rate, latency variance, GPU utilisation, TTFT, batch scaling efficiency, cache effectiveness, memory pressure, temperature, and energy efficiency. When vLLM server info is available, config recommendations are included (prefix caching, tensor parallelism, quantization, max_num_seqs).
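
A rule-based check of this kind is essentially a threshold test that returns a severity and a message. The sketch below shows the shape of two such rules; the `Finding` type, the function names, and both thresholds (3× for tail latency, 80 °C for temperature) are hypothetical and are not the tool's actual rules.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    level: str    # "OK" or "WARN"
    message: str


def check_tail_latency(avg_s: float, p99_s: float, ratio_threshold: float = 3.0) -> Finding:
    """Warn when p99 latency is far above the average (threshold is an assumption)."""
    ratio = p99_s / avg_s
    if ratio >= ratio_threshold:
        return Finding("WARN", f"p99 latency is {ratio:.0f}x average: likely request queuing")
    return Finding("OK", f"p99 latency is {ratio:.1f}x average")


def check_temperature(max_temp_c: float, warn_at_c: float = 80.0) -> Finding:
    """Warn when the peak GPU temperature crosses an assumed threshold."""
    if max_temp_c >= warn_at_c:
        return Finding("WARN", f"GPU temperature peaked at {max_temp_c:.0f}°C: thermal throttling risk")
    return Finding("OK", f"GPU temperature peaked at {max_temp_c:.0f}°C")


findings = [check_tail_latency(2.0, 10.0), check_temperature(82.0)]
for f in findings:
    print(f"{f.level:<4} {f.message}")
```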

Regression Detection

vllm-bench --standard --compare baseline.json

Compares results matched by (context_length, concurrency, prompt_type). Flags changes against configurable thresholds (major: >15%, minor: 5–15%). With --iterations, uses Welch's t-test to distinguish real regressions from measurement noise.
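
To show what the comparison involves, here is a standard-library sketch of the percentage thresholds together with Welch's t statistic and Cohen's d. It stops short of a p-value, which needs the t-distribution CDF (for example from SciPy). Function names and sample values are assumptions for illustration.

```python
import math
import statistics


def classify_change(baseline: float, current: float, minor: float = 5.0, major: float = 15.0) -> str:
    """Classify a relative change using the thresholds described above."""
    pct = abs(current - baseline) / baseline * 100.0
    if pct > major:
        return "major"
    if pct >= minor:
        return "minor"
    return "none"


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = statistics.variance(a) / len(a), statistics.variance(b) / len(b)
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Effect size using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(pooled_var)


# Illustrative throughput samples (tok/s) from two hypothetical 5-iteration runs.
baseline = [1500.0, 1510.0, 1490.0, 1505.0, 1495.0]
current = [1250.0, 1260.0, 1240.0, 1255.0, 1245.0]

severity = classify_change(statistics.fmean(baseline), statistics.fmean(current))
t, df = welch_t(current, baseline)
d = cohens_d(current, baseline)
print(f"{severity} change, t = {t:.1f}, df = {df:.1f}, d = {d:.1f}")
```

The percentage threshold says how large the change is; the t statistic and effect size indicate whether a change of that size stands out from run-to-run noise.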

Output Files

Each run writes to ./outputs/ (override with --output-dir):

File	Description
`benchmark_*.json`	All results + metadata, system info, environment fingerprint
`benchmark_*.csv`	Tabular results for spreadsheet analysis
`benchmark_*.png`	5 publication-quality charts (300 DPI)
`benchmark_*.html`	Self-contained interactive report (Plotly, dark theme)
`result_*.json`	Standardised entry for community leaderboard (optional)

Charts

Throughput vs Context Length — line plot per concurrency level
Latency Distribution — box plot with P99 overlay
TTFT Distribution — with UX quality zones (green <200 ms, yellow <1 s, red >1 s)
Throughput Heatmap — context × concurrency grid
GPU Utilization & Power — dual-axis timeline

Metrics Reference

Throughput: tokens_per_second, requests_per_second, throughput_per_user

Latency: avg_latency, min_latency, max_latency, latency_p50/p90/p95/p99

TTFT: ttft_estimate, ttft_p50/p90/p95/p99

Inter-token latency: inter_token_latency, itl_p50/p90/p95/p99

GPU (nvidia-smi): avg_gpu_util, max_gpu_util, avg_mem_used, avg_temperature, avg_power, avg_gpu_clock

Energy: tokens_per_watt, watts_per_token, energy_joules

Cache (vLLM /metrics endpoint): cache_hit_rate, actual_prefill_time, actual_decode_time

Cost (when available): cost_per_hour, cost_per_1m_tokens, cost_total

Tokens: prompt_tokens, completion_tokens, total_tokens

Statistical (with --iterations > 1): *_ci_lower, *_ci_upper for throughput, latency, and TTFT metrics
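
For readers who want to reproduce the percentile and inter-token figures from raw timings, the sketch below uses the nearest-rank percentile definition and simple differences between token arrival times. The percentile method the tool uses is not stated, so nearest-rank is an assumption, as are the function names and sample data.

```python
import math
from typing import List, Sequence


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def inter_token_latencies(arrival_times: Sequence[float]) -> List[float]:
    """Gaps between consecutive token arrival timestamps (seconds)."""
    return [b - a for a, b in zip(arrival_times, arrival_times[1:])]


# Illustrative data: 100 request latencies of 1..100 seconds.
latencies = [float(i) for i in range(1, 101)]
summary = {f"latency_p{p}": percentile(latencies, p) for p in (50, 90, 95, 99)}

# Illustrative token arrival times for one streamed response.
itl = inter_token_latencies([0.10, 0.12, 0.15, 0.19])
print(summary, [round(x, 2) for x in itl])
```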

CLI Reference

vllm-bench [OPTIONS]

Connection:
  --url URL              vLLM server URL (default: http://localhost:8000)
  --model NAME           Model name override (auto-detected)

Presets (mutually exclusive):
  --quick                ~5 min
  --standard             ~30 min
  --thorough             ~2 hours

Test Parameters:
  --context-lengths      Comma-separated, e.g. 32k,64k,128k or 1m
  --concurrency          Comma-separated, e.g. 1,4,8,16
  --output-tokens N      Max output tokens per request (default: 500)
  --prompt-type TYPE     classic|deterministic|madlib|random|all
  --prompts-file PATH    Custom prompts JSONL file

Load Mode:
  --rps FLOAT            Sustained requests-per-second mode
  --duration FLOAT       Duration for sustained RPS run (default: 120s)

Statistical Rigor:
  --iterations N         Iterations per config for confidence intervals (default: 1)
  --seed INT             Random seed for reproducibility

Cost:
  --cost FLOAT           GPU cost in USD/hr (auto-detected for known GPUs)

Behavior:
  -y, --non-interactive  Skip interactive prompts, use defaults
  --no-warmup            Skip model warmup
  --no-streaming         Disable streaming TTFT measurement

Output:
  --output-dir DIR       Output directory (default: ./outputs)
  --no-html              Skip HTML report
  --no-charts            Skip PNG charts

Traffic Simulation:
  --traffic TYPE         poisson|multiturn
  --target-rps FLOAT     Target RPS for traffic simulation (default: 2.0)
  --traffic-duration S   Duration in seconds (default: 60)
  --turns N              Turns per conversation for multiturn (default: 5)

Comparison:
  --compare FILE         Compare with previous results JSON
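
The poisson traffic type presumably spaces requests with exponentially distributed gaps, which is the usual way to generate Poisson arrivals. The sketch below shows that technique with the defaults listed above (2.0 RPS, 60 seconds); `poisson_arrival_times` is a hypothetical name and the seed is arbitrary.

```python
import random
from typing import List


def poisson_arrival_times(target_rps: float, duration_s: float, seed: int = 42) -> List[float]:
    """Arrival times (seconds) of a Poisson process with the given mean rate."""
    rng = random.Random(seed)
    t = 0.0
    times: List[float] = []
    while True:
        # Exponential gaps with mean 1/target_rps yield Poisson arrivals.
        t += rng.expovariate(target_rps)
        if t >= duration_s:
            return times
        times.append(t)


arrivals = poisson_arrival_times(target_rps=2.0, duration_s=60.0)
print(f"{len(arrivals)} arrivals, achieved {len(arrivals) / 60.0:.2f} RPS")
```

Unlike a fixed-interval schedule, this produces occasional clusters of requests, so the achieved rate only matches the target on average.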

Installation

# From PyPI
pip install vllm-benchmark-suite

# With uv
uv pip install vllm-benchmark-suite

# From source
git clone https://github.com/notadestroyer/vllm-benchmark-suite.git
cd vllm-benchmark-suite
pip install -e ".[dev]"

Requirements: Python 3.10+, a running vLLM server. NVIDIA GPU optional (required for GPU metrics).

Contributing

git clone https://github.com/notadestroyer/vllm-benchmark-suite.git
cd vllm-benchmark-suite
pip install -e ".[dev]"
pytest tests/ -v
ruff check src/

Open an issue first to discuss significant changes.

License

MIT

Release history

This version

3.1.1

Apr 16, 2026

3.1.0

Apr 16, 2026

3.0.0

Apr 16, 2026

Download files

Source Distribution

vllm_benchmark_suite-3.1.1.tar.gz (75.9 kB view details)

Uploaded Apr 16, 2026 Source

Built Distribution

vllm_benchmark_suite-3.1.1-py3-none-any.whl (69.2 kB view details)

Uploaded Apr 16, 2026 Python 3

File details

Details for the file vllm_benchmark_suite-3.1.1.tar.gz.

File metadata

Download URL: vllm_benchmark_suite-3.1.1.tar.gz
Upload date: Apr 16, 2026
Size: 75.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for vllm_benchmark_suite-3.1.1.tar.gz
Algorithm	Hash digest
SHA256	`e9245b825e390058ddd9324d84a7afedad3439034b5ccbf7c360335271bce762`
MD5	`b832a7e016a57275c04a92c22fcd2df3`
BLAKE2b-256	`aeb2f5bda29ffcfc2943eca713e767cdbbdd325d46acdc24c570adb91ddf4602`

File details

Details for the file vllm_benchmark_suite-3.1.1-py3-none-any.whl.

File metadata

Download URL: vllm_benchmark_suite-3.1.1-py3-none-any.whl
Upload date: Apr 16, 2026
Size: 69.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for vllm_benchmark_suite-3.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`88892d8d31d1034fda4e65ed5c7de8f686c46e0f04592df61add3ac88f8d5bbe`
MD5	`5cd580ba06654880cebd2197833ae1b1`
BLAKE2b-256	`c634bf755d7af7f53d28d17f3d317c223909b6d9b87c8f452467a4eb3e67cdc8`
