A GenAI inference performance benchmarking tool.

Inference Perf

Inference Perf is a production-scale GenAI inference performance benchmarking tool that lets you benchmark and analyze the performance of inference deployments. It is model-server agnostic, so you can use it to measure performance and compare different systems on an apples-to-apples basis.

The project was founded as part of the inference benchmarking and metrics standardization effort in wg-serving, which aims to standardize the benchmark tooling and the metrics used to measure inference performance across the Kubernetes and model server communities.

🏗️ Architecture

Architecture Diagram

🌟 Key Capabilities

📊 Rich Metrics & Analysis

Comprehensive Latency Metrics: TTFT, TPOT, ITL, and Normalized TPOT.
Throughput Tracking: Input, Output, and Total tokens per second.
Goodput Measurement: Measure the rate of requests that meet your SLO constraints. See goodput.md.
Automatic Visualization: Generate charts for QPS vs Latency/Throughput/Goodput. See analysis.md.
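
To make the metric names above concrete, here is a minimal sketch of how they are commonly derived from client-side timestamps of a streamed response. The definitions, the `RequestTiming` class, and the SLO thresholds passed to `goodput` are assumptions for illustration; metrics.md and goodput.md remain the authoritative definitions for Inference Perf.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTiming:
    """Hypothetical record of client-side timestamps (seconds) for one streamed request."""
    start: float               # when the request was sent
    token_times: List[float]   # arrival time of each output token, ascending


def ttft(r: RequestTiming) -> float:
    """Time to first token: delay between sending and the first output token."""
    return r.token_times[0] - r.start


def itl(r: RequestTiming) -> List[float]:
    """Inter-token latencies: gaps between consecutive output tokens."""
    return [b - a for a, b in zip(r.token_times, r.token_times[1:])]


def tpot(r: RequestTiming) -> float:
    """Time per output token over the decode phase (first token excluded)."""
    n = len(r.token_times)
    if n < 2:
        return 0.0
    return (r.token_times[-1] - r.token_times[0]) / (n - 1)


def normalized_tpot(r: RequestTiming) -> float:
    """End-to-end latency divided by the number of output tokens."""
    return (r.token_times[-1] - r.start) / len(r.token_times)


def goodput(requests: List[RequestTiming], duration_s: float,
            max_ttft_s: float, max_tpot_s: float) -> float:
    """Requests per second that satisfy both (assumed) SLO thresholds."""
    good = sum(1 for r in requests
               if ttft(r) <= max_ttft_s and tpot(r) <= max_tpot_s)
    return good / duration_s
```

Under these definitions, a request that streams tokens quickly but starts late fails the TTFT threshold and does not count toward goodput, even though it still counts toward raw throughput.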

🧠 Smart Data Generation

Real-world Datasets: Support for ShareGPT, CNN DailyMail, Infinity Instruct, and Billsum.
Synthetic & Random: Configure exact input/output distributions.
Advanced Scenarios: Shared prefix and multi-turn chat conversations.

⏱️ Flexible Load Generation

Load Patterns: Constant rate, Poisson arrival, and concurrent user simulation.
Multi-Stage Runs: Define stages with varying rates and durations to find saturation points.
Trace Replay: Replay real-world traces (e.g., Azure dataset) or OpenTelemetry traces with agentic tree-of-thought simulation and visualization.
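
The difference between the constant-rate and Poisson load patterns, and how stages chain together, can be sketched as a schedule of request send times. This is an illustrative model only, not Inference Perf's load generator: the function names are hypothetical, and the stage dictionaries simply reuse the `rate` and `duration` keys from the CLI example. Concurrent user simulation is not covered here.

```python
import random
from typing import Callable, Dict, List


def constant_arrivals(rate: float, duration: float) -> List[float]:
    """Evenly spaced send times: one request every 1/rate seconds."""
    n = int(rate * duration)
    return [i / rate for i in range(n)]


def poisson_arrivals(rate: float, duration: float, seed: int = 0) -> List[float]:
    """Send times whose gaps are exponentially distributed with mean 1/rate."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate)
        if t >= duration:
            return times
        times.append(t)


def multi_stage(stages: List[Dict[str, float]],
                arrivals: Callable[[float, float], List[float]] = constant_arrivals
                ) -> List[float]:
    """Concatenate stages, offsetting each by the total duration before it."""
    offset, schedule = 0.0, []
    for stage in stages:
        schedule += [offset + t for t in arrivals(stage["rate"], stage["duration"])]
        offset += stage["duration"]
    return schedule
```

Both patterns target the same average rate; the Poisson schedule adds burstiness, which tends to expose queueing behavior that an evenly spaced schedule hides.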

🚀 High Scalability

10k+ QPS: Scales to very high load through an optimized multi-process architecture.
Automatic Saturation Detection: Find the limits of your system via sweeps.
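
One simple way to read a saturation point out of a rate sweep is to find the highest requested rate that the server still kept up with. The sketch below illustrates that idea only; it is not the algorithm Inference Perf uses, and the `saturation_point` function and its `tolerance` parameter are hypothetical.

```python
from typing import List, Optional, Tuple


def saturation_point(sweep: List[Tuple[float, float]],
                     tolerance: float = 0.05) -> Optional[float]:
    """Return the highest requested rate the server sustained, or None.

    sweep: (requested_qps, achieved_qps) pairs in ascending requested_qps.
    A stage counts as sustained when achieved >= requested * (1 - tolerance).
    """
    sustained = None
    for requested, achieved in sweep:
        if achieved < requested * (1 - tolerance):
            break  # the server fell behind; later stages are saturated
        sustained = requested
    return sustained


# Hypothetical sweep results: achieved throughput flattens near 27-28 QPS.
sweep = [(5, 5.0), (10, 9.9), (20, 19.5), (40, 27.0), (80, 28.0)]
print(saturation_point(sweep))
```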

🔌 Engine Agnostic

Verified support for vLLM, SGLang, and TGI, including server-side aggregate metrics and time-series metrics.
Easily extensible to any OpenAI-compatible endpoint.
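
For context on what "OpenAI-compatible" means at the wire level, the sketch below builds a streaming request for the `/v1/completions` route and decodes the server-sent-event lines such an endpoint returns. It shows the protocol shape only and is not Inference Perf's client code; the helper names and the model name in the usage note are hypothetical.

```python
import json
import urllib.request
from typing import Optional


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 64) -> urllib.request.Request:
    """Build (but do not send) a streaming POST to {base_url}/v1/completions."""
    body = {"model": model, "prompt": prompt,
            "max_tokens": max_tokens, "stream": True}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_sse_line(line: str) -> Optional[dict]:
    """Decode one server-sent-event line; return None for blanks and [DONE]."""
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None  # end-of-stream sentinel
    return json.loads(payload)
```

A request built with `build_completion_request("http://localhost:8000", "my-model", "Hello")` could be sent with `urllib.request.urlopen`; recording the arrival time of each decoded chunk is what makes latency metrics such as TTFT and ITL measurable.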

🚀 Quick Start

Run Locally

Install inference-perf:
```
pip install inference-perf
```

Run a benchmark with a simple random workload:

```
inference-perf \
  --server.type vllm \
  --server.base_url http://localhost:8000 \
  --data.type random \
  --load.type constant \
  --load.stages '[{"rate": 10, "duration": 60}]' \
  --api.streaming true
```

Alternatively, you can run using a configuration file:

```
inference-perf --config_file config.yml
```
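
If you want a starting point for `config.yml`, one option is to derive it from the flags in the command-line example. The sketch below nests the dotted flag names into a configuration tree, on the assumption that the file's keys mirror those flags; that mapping and the `nest` helper are assumptions, so check config.md for the actual schema. The output is JSON, which YAML parsers generally accept.

```python
import json
from typing import Any, Dict


def nest(flags: Dict[str, Any]) -> Dict[str, Any]:
    """Turn {"server.type": "vllm"} into {"server": {"type": "vllm"}}."""
    config: Dict[str, Any] = {}
    for dotted, value in flags.items():
        *parents, leaf = dotted.split(".")
        node = config
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return config


# Flag names and values taken from the command-line example.
flags = {
    "server.type": "vllm",
    "server.base_url": "http://localhost:8000",
    "data.type": "random",
    "load.type": "constant",
    "load.stages": [{"rate": 10, "duration": 60}],
    "api.streaming": True,
}

config = nest(flags)
config_text = json.dumps(config, indent=2)
print(config_text)
```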

Sample Output

When you run inference-perf, it displays a rich summary table in the CLI:

Metrics Summary

Run in Docker

```
docker run -it --rm -v $(pwd)/config.yml:/workspace/config.yml quay.io/inference-perf/inference-perf
```

Run in Kubernetes

Refer to the guide in /deploy.

📚 Documentation Hub

Explore detailed documentation for specific topics:

| Topic | Description | Link |
| --- | --- | --- |
| Configuration | Full YAML configuration schema and options. | config.md |
| CLI Flags | Overriding configuration via command line flags. | cli_flags.md |
| Load Generation | Detailed explanation of load patterns and multi-worker setup. | loadgen.md |
| Metrics | Definitions of TTFT, TPOT, ITL, etc. | metrics.md |
| Goodput | How to measure requests meeting SLOs. | goodput.md |
| Reports | Understanding generated JSON reports. | reports.md |
| OTel Observability | Instrument benchmark runs with OpenTelemetry tracing to export to Jaeger, Tempo, etc. | otel_instrumentation.md |
| OTel Trace Replay | Data/load type for replaying production traces with complex dependency graphs. | otel_trace_replay.md |
| Conversation Replay | Data/load type for benchmarking concurrent multi-turn agentic conversations with configurable distributions. | conversation_replay.md |
| Analysis | Visualizations and plots for performance metrics. | analysis.md |

🤝 Contributing & Community

We welcome contributions! Please join us:

Slack: #inference-perf channel in Kubernetes workspace.
Community Meeting: Weekly on Thursdays, alternating between 09:00 and 11:30 PDT.
Code of Conduct: Governed by the Kubernetes Code of Conduct.

See CONTRIBUTING.md for details on how to get started.

Release history

0.5.0 (this version): May 1, 2026
0.4.0: Feb 6, 2026
0.3.0: Nov 26, 2025
0.2.0: Sep 24, 2025
0.1.1: Aug 1, 2025

Download files

Source Distribution

inference_perf-0.5.0.tar.gz (164.3 kB)

Uploaded May 1, 2026 Source

Built Distribution

inference_perf-0.5.0-py3-none-any.whl (191.8 kB)

Uploaded May 1, 2026 Python 3

File details

Details for the file inference_perf-0.5.0.tar.gz.

File metadata

Download URL: inference_perf-0.5.0.tar.gz
Upload date: May 1, 2026
Size: 164.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for inference_perf-0.5.0.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `86aacdd19d7c55557e7d5c3ce787e27dd78acd0319143d2546bd8d3694c248f6` |
| MD5 | `3cd437024e048a9e8cc0ef2c4582cdcc` |
| BLAKE2b-256 | `3bd146bc54b0e32fbbafd41eb29e5edfb704eb2e9ed39df89a6553373c716346` |

File details

Details for the file inference_perf-0.5.0-py3-none-any.whl.

File metadata

Download URL: inference_perf-0.5.0-py3-none-any.whl
Upload date: May 1, 2026
Size: 191.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for inference_perf-0.5.0-py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `4416aafe9193ca73f56223a61af81b6b0316bd8c9f3139afe5929864752827d7` |
| MD5 | `23ba2f8d297043bc6cdbe7fbedcdf95e` |
| BLAKE2b-256 | `1c5f190828de59eb94a3ad7c86c83c1ba31a80c5e1f25fdcaa7fbabf6cf39cf8` |
