
vLLM Benchmark Suite

The most comprehensive benchmarking tool for vLLM inference servers. Built by a vLLM user, for vLLM users.



Quick Start

pip install vllm-benchmark-suite
vllm-bench --quick

That's it. Point it at your vLLM server and get a complete performance profile in 5 minutes.
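
If the server isn't running on localhost, pass its URL with the documented --url flag (the host and port below are placeholders):

vllm-bench --quick --url http://10.0.0.5:8000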


Why This Tool?

Every vLLM operator asks the same questions: Is my setup fast enough? Where's the bottleneck? Did that config change make things better or worse?

I built this because I was tired of writing one-off scripts to answer those questions. This tool gives you a single benchmark score, plain-English diagnostics, and shareable HTML reports — everything you need to understand, optimize, and communicate your vLLM deployment's performance.


Features

vLLM Score (0-10,000)

A single composite number — like Geekbench for vLLM. Compare deployments, track improvements, share results. Weighted across throughput, latency, efficiency, energy, and consistency.
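
As a rough mental model, the composite behaves like a weighted average of per-category subscores. The sketch below uses the category weights shown in the sample output further down; the tool's real formula may add normalization or penalties, so treat this as illustrative only:

# Illustrative only: a plain weighted sum using the category weights from
# the sample output below. The actual scoring may normalize or penalize
# differently; this is a sketch of the idea, not the tool's formula.
WEIGHTS = {
    "throughput": 0.30,
    "latency": 0.25,
    "efficiency": 0.20,
    "energy": 0.15,
    "consistency": 0.10,
}

def composite_score(subscores: dict[str, float]) -> float:
    """Combine per-category subscores (each 0-10,000) into one number."""
    return sum(WEIGHTS[k] * subscores[k] for k in WEIGHTS)

print(composite_score({
    "throughput": 6800, "latency": 8500, "efficiency": 7650,
    "energy": 5800, "consistency": 9200,
}))  # 7485.0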

Auto-Diagnostics

After every run, get plain-English analysis: "Your p99 latency is 5x your average at 128K context — this usually means request queuing. Try reducing max_num_seqs." Not just numbers — actionable recommendations.

True TTFT Measurement

Uses SSE streaming to measure actual Time-to-First-Token, not estimates. Know exactly how long users wait before they see output.
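
Conceptually, measuring TTFT over SSE means timing from request submission to the first non-empty streamed chunk. Here is a minimal sketch against vLLM's OpenAI-compatible completions endpoint (it illustrates the technique, not this tool's internals):

# Minimal TTFT sketch against a vLLM server's OpenAI-compatible
# /v1/completions endpoint. Illustrative, not the suite's implementation.
import json
import time
import requests

def measure_ttft(url: str, model: str, prompt: str) -> float:
    """Return seconds from request send to the first streamed token."""
    start = time.perf_counter()
    resp = requests.post(
        f"{url}/v1/completions",
        json={"model": model, "prompt": prompt, "max_tokens": 64, "stream": True},
        stream=True,
        timeout=120,
    )
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue  # skip blank SSE separators and keep-alives
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        if chunk["choices"][0].get("text"):
            return time.perf_counter() - start  # first non-empty token arrived
    raise RuntimeError("stream ended before any token was produced")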

Interactive HTML Reports

Self-contained HTML files with interactive Plotly charts. Share with your team, attach to tickets, email to your manager. Dark-themed, professional, no dependencies to view.

Regression Detection

vllm-bench --standard --compare last_week.json

Automatically flags throughput drops and latency increases after vLLM upgrades, config changes, or infrastructure moves.
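
The idea behind --compare is straightforward: load two result files and flag any metric that moved the wrong way. A sketch of that logic (the field names here are hypothetical; inspect a real benchmark_*.json for the actual schema):

# Sketch of the comparison idea: flag regressions between two result files.
# The keys "peak_throughput_tps" and "p99_latency_s" are hypothetical.
import json

def flag_regressions(baseline_path: str, current_path: str,
                     tolerance: float = 0.05) -> list[str]:
    """Report metrics that moved the wrong way by more than `tolerance`."""
    with open(baseline_path) as f:
        base = json.load(f)
    with open(current_path) as f:
        cur = json.load(f)
    warnings = []
    if cur["peak_throughput_tps"] < base["peak_throughput_tps"] * (1 - tolerance):
        warnings.append(f"throughput dropped more than {tolerance:.0%}")
    if cur["p99_latency_s"] > base["p99_latency_s"] * (1 + tolerance):
        warnings.append(f"p99 latency rose more than {tolerance:.0%}")
    return warnings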

25+ Performance Metrics

  • Throughput: tokens/sec, requests/sec, tokens/sec/user, batch scaling efficiency
  • Latency: avg, min, max, P50/P90/P95/P99, TTFT, inter-token latency
  • GPU: utilization, memory, temperature, power draw, clock frequencies (see the telemetry sketch after this list)
  • Energy: tokens/watt, watts/token/user, total watt-hours consumed
  • Cache: prefix cache hit rate, actual prefill/decode time separation
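
GPU and energy figures like these are typically sampled through NVML. A minimal sketch using the nvidia-ml-py (pynvml) bindings, assuming an NVIDIA GPU is present; it shows the mechanism, not the suite's exact collector:

# Sampling GPU telemetry via NVML (nvidia-ml-py). Illustrative sketch only.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu       # percent
mem = pynvml.nvmlDeviceGetMemoryInfo(handle).used // 2**20    # MiB
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)  # deg C
power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0       # watts (NVML reports mW)

print(f"util={util}% mem={mem} MiB temp={temp}C power={power:.1f} W")
pynvml.nvmlShutdown()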

13+ Publication-Quality Charts

Throughput landscapes, latency/throughput heatmaps, TTFT with UX quality zones, inter-token latency, batch scaling efficiency, decode speed, power draw, prompt type comparisons, cache hit rate heatmaps.

4 Prompt Strategies

Test cache behavior with different prompt types:

  • Classic: Deterministic cybersecurity text (high cache hits)
  • Deterministic: Tokenizer-aware repetitive story (perfect cache hits)
  • Madlib: Random word injection (moderate cache misses)
  • Random: Fully random text (minimal cache hits)

Custom Prompts

vllm-bench --prompts-file my_production_prompts.jsonl

Test with your actual production prompts for realistic performance numbers.
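
A JSONL file holds one JSON object per line. The field name below is a hypothetical example, not a confirmed schema; check the tool's docs for the fields it actually accepts:

{"prompt": "Summarize the following incident report: ..."}
{"prompt": "Draft a customer reply about a delayed shipment: ..."}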


Benchmark Presets

Preset       Time       Context Lengths   Concurrency        Prompt Types
--quick      ~5 min     32K               1, 4               classic
--standard   ~30 min    32K, 64K, 128K    1, 4, 8, 16        classic, deterministic
--thorough   ~2 hours   32K–512K          1, 4, 8, 16, 32    all 4 types

Or configure everything manually:

vllm-bench --context-lengths 32k,64k,128k --concurrency 1,4,8,16 --output-tokens 500

CLI Reference

vllm-bench [OPTIONS]

Connection:
  --url URL              vLLM server URL (default: http://localhost:8000)
  --model NAME           Model name override (auto-detected)

Presets:
  --quick                Quick benchmark (~5 min)
  --standard             Standard benchmark (~30 min)
  --thorough             Thorough benchmark (~2 hours)

Test Parameters:
  --context-lengths      Comma-separated (e.g. 32k,64k,128k)
  --concurrency          Comma-separated (e.g. 1,4,8,16)
  --output-tokens N      Max output tokens (default: 500)
  --prompt-type TYPE     classic|deterministic|madlib|random|all
  --prompts-file PATH    Custom prompts JSONL file

Behavior:
  -y, --non-interactive  Skip interactive prompts
  --no-warmup            Skip model warmup
  --no-streaming         Disable streaming TTFT measurement

Output:
  --output-dir DIR       Output directory (default: ./outputs)
  --no-html              Skip HTML report
  --no-charts            Skip PNG charts

Comparison:
  --compare FILE         Compare with previous results JSON
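
For example, a non-interactive run suited to CI, comparing against a stored baseline (the paths are placeholders):

vllm-bench --standard -y --compare baselines/main.json --output-dir ./ci-results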

Sample Output

┌──────────────────────────────────────┐
│ vLLM Benchmark Suite                 │
│ v3.0.0                               │
└──────────────────────────────────────┘

  vLLM Benchmark Score: 7,234 / 10,000  (Grade: A)

  Throughput  (30%) ████████████████████░░░░░░░░░░ 6,800
  Latency     (25%) ██████████████████████████░░░░ 8,500
  Efficiency  (20%) ███████████████████████░░░░░░░ 7,650
  Energy      (15%) █████████████████░░░░░░░░░░░░░ 5,800
  Consistency (10%) █████████████████████████████░ 9,200

Diagnostics:
  OK Excellent throughput: 1,360 tokens/sec at 32K with 16 users
  OK All metrics look healthy
  WARNING GPU temperature peaked at 82°C

Performance Highlights:
  Peak Throughput    1,360.4 tok/s    16u @ 32K
  Best Efficiency      340.1 tok/s/user  1u @ 32K
  Lowest Latency        0.42s          1u @ 32K

Outputs:
  JSON: ./outputs/benchmark_Llama-3-70B_20240115_143022.json
  CSV:  ./outputs/benchmark_Llama-3-70B_20240115_143022.csv
  Charts: ./outputs/benchmark_Llama-3-70B_20240115_143022.png
  HTML: ./outputs/benchmark_Llama-3-70B_20240115_143022.html

Installation

# From PyPI
pip install vllm-benchmark-suite

# Or with uv (faster)
uv pip install vllm-benchmark-suite

# From source
git clone https://github.com/notadestroyer/vllm-benchmark-suite.git
cd vllm-benchmark-suite
pip install -e ".[dev]"

Requirements

  • Python 3.10+
  • A running vLLM server (local or remote)
  • NVIDIA GPU (for GPU metrics — benchmarking works without one, but you won't get GPU telemetry)

Output Files

Each benchmark run generates:

File              Description
benchmark_*.json  Complete results with metadata, system info, and all metrics
benchmark_*.csv   Tabular results for spreadsheet analysis
benchmark_*.png   13+ publication-quality matplotlib charts (300 DPI)
benchmark_*.html  Interactive HTML report with Plotly charts

Contributing

Contributions welcome. Please open an issue first to discuss what you'd like to change.

git clone https://github.com/notadestroyer/vllm-benchmark-suite.git
cd vllm-benchmark-suite
pip install -e ".[dev]"
pytest tests/ -v
ruff check src/

License

MIT
