A GenAI inference performance benchmarking tool.

Inference Perf

Inference Perf is a production-scale GenAI inference performance benchmarking tool for benchmarking and analyzing the performance of inference deployments. It is model-server agnostic, so you can measure performance and compare different systems on an apples-to-apples basis.

It was founded as part of the inference benchmarking and metrics standardization effort in wg-serving, which aims to standardize the benchmark tooling and the metrics used to measure inference performance across the Kubernetes and model server communities.


🏗️ Architecture

Architecture Diagram


🌟 Key Capabilities

📊 Rich Metrics & Analysis

  • Comprehensive Latency Metrics: TTFT, TPOT, ITL, and Normalized TPOT.
  • Throughput Tracking: Input, Output, and Total tokens per second.
  • Goodput Measurement: Measure the rate of requests that meet your SLO constraints. See goodput.md.
  • Automatic Visualization: Generate charts for QPS vs Latency/Throughput/Goodput. See analysis.md.
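
To make the metric names above concrete, here is a minimal sketch that derives them from per-token timestamps. It uses commonly cited definitions, which may differ in detail from the tool's own (metrics.md and goodput.md are authoritative). `RequestTiming`, the helper functions, and the SLO parameters are hypothetical names introduced only for illustration and are not part of inference-perf.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTiming:
    """Timestamps in seconds, as a hypothetical streaming client might record them."""
    sent_at: float
    token_times: List[float]  # arrival time of each output token, ascending


def ttft(r: RequestTiming) -> float:
    # Time to first token: delay between sending the request and the first token.
    return r.token_times[0] - r.sent_at


def itl(r: RequestTiming) -> List[float]:
    # Inter-token latencies: gaps between consecutive output tokens.
    return [b - a for a, b in zip(r.token_times, r.token_times[1:])]


def tpot(r: RequestTiming) -> float:
    # Time per output token: decode time averaged over the tokens after the first.
    n = len(r.token_times)
    if n < 2:
        return 0.0
    return (r.token_times[-1] - r.token_times[0]) / (n - 1)


def normalized_tpot(r: RequestTiming) -> float:
    # End-to-end latency divided by the number of output tokens.
    return (r.token_times[-1] - r.sent_at) / len(r.token_times)


def goodput(requests: List[RequestTiming], duration_s: float,
            ttft_slo_s: float, tpot_slo_s: float) -> float:
    # Requests per second that met both (assumed) SLO thresholds.
    good = sum(1 for r in requests
               if ttft(r) <= ttft_slo_s and tpot(r) <= tpot_slo_s)
    return good / duration_s


fast = RequestTiming(sent_at=0.0, token_times=[0.2, 0.25, 0.3, 0.35])
slow = RequestTiming(sent_at=1.0, token_times=[2.0, 2.2, 2.4])
print(ttft(fast), tpot(fast), normalized_tpot(fast))
print(goodput([fast, slow], duration_s=10.0, ttft_slo_s=0.5, tpot_slo_s=0.1))
```

In this sketch a request counts toward goodput only if it meets every threshold, so a run with high raw throughput can still show low goodput when TTFT degrades under load.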

🧠 Smart Data Generation

  • Real-world Datasets: Support for ShareGPT, CNN DailyMail, Infinity Instruct and Billsum.
  • Synthetic & Random: Configure exact input/output distributions.
  • Advanced Scenarios: Shared prefix and multi-turn chat conversations.
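
As an illustration of the synthetic and shared-prefix ideas above, the following sketch samples request lengths from a clamped normal distribution and builds groups of prompts that share a common prefix. It is not the tool's data generator: the function names, the clamped-normal choice, and the use of random token IDs in place of real text are all assumptions made for the example.

```python
import random
from typing import List


def sample_length(rng: random.Random, mean: float, std_dev: float,
                  lo: int, hi: int) -> int:
    # Draw a length from a normal distribution, then clamp it to [lo, hi].
    return max(lo, min(hi, round(rng.gauss(mean, std_dev))))


def random_tokens(rng: random.Random, count: int,
                  vocab_size: int = 32000) -> List[int]:
    # Stand-in for real text: uniformly random token IDs.
    return [rng.randrange(vocab_size) for _ in range(count)]


def shared_prefix_prompts(rng: random.Random, num_groups: int,
                          prompts_per_group: int, prefix_len: int,
                          suffix_len: int) -> List[List[List[int]]]:
    # Every prompt in a group starts with the same prefix and ends with its
    # own random suffix, which lets a server exercise prefix caching.
    groups = []
    for _ in range(num_groups):
        prefix = random_tokens(rng, prefix_len)
        groups.append([prefix + random_tokens(rng, suffix_len)
                       for _ in range(prompts_per_group)])
    return groups


rng = random.Random(0)
input_lens = [sample_length(rng, mean=512, std_dev=128, lo=16, hi=1024)
              for _ in range(5)]
groups = shared_prefix_prompts(rng, num_groups=2, prompts_per_group=3,
                               prefix_len=8, suffix_len=4)
print(input_lens)
print([len(prompt) for prompt in groups[0]])
```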

⏱️ Flexible Load Generation

  • Load Patterns: Constant rate, Poisson arrival, and concurrent user simulation.
  • Multi-Stage Runs: Define stages with varying rates and durations to find saturation points.
  • Trace Replay: Replay real-world traces (e.g., Azure dataset) or OpenTelemetry traces with agentic tree-of-thought simulation and visualization.
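
The difference between constant-rate and Poisson arrivals, and the way stages chain together, can be sketched as follows. This is a simplified model of request scheduling under stated assumptions, not the tool's load generator; the stage dictionaries reuse the `rate` and `duration` keys from the `--load.stages` example in the Quick Start, while the function names are hypothetical.

```python
import random
from typing import Dict, List


def constant_arrivals(rate: float, duration: float,
                      start: float = 0.0) -> List[float]:
    # Evenly spaced send times: one request every 1/rate seconds.
    count = int(rate * duration)
    return [start + i / rate for i in range(count)]


def poisson_arrivals(rate: float, duration: float, rng: random.Random,
                     start: float = 0.0) -> List[float]:
    # Poisson process: gaps between requests are exponential with mean 1/rate.
    times = []
    t = rng.expovariate(rate)
    while t < duration:
        times.append(start + t)
        t += rng.expovariate(rate)
    return times


def schedule(stages: List[Dict[str, float]], poisson: bool = False,
             seed: int = 0) -> List[float]:
    # Run stages back to back; each stage starts when the previous one ends.
    rng = random.Random(seed)
    times: List[float] = []
    offset = 0.0
    for stage in stages:
        rate, duration = stage["rate"], stage["duration"]
        if poisson:
            times += poisson_arrivals(rate, duration, rng, offset)
        else:
            times += constant_arrivals(rate, duration, offset)
        offset += duration
    return times


stages = [{"rate": 10, "duration": 60}, {"rate": 20, "duration": 60}]
print(len(schedule(stages)))
print(len(schedule(stages, poisson=True)))
```

Both patterns target the same average rate, but Poisson arrivals produce bursts and gaps, which tends to expose queueing behavior that evenly spaced requests hide.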

🚀 High Scalability

  • 10k+ QPS: Scales to very high load thanks to an optimized multi-process architecture.
  • Automatic Saturation Detection: Find the limits of your system via sweeps.
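
This page does not describe how saturation is detected. One simple heuristic, shown here purely as an illustration, is to sweep the requested rate upward and report the first rate at which the achieved rate falls noticeably short of the requested one. The function name, the `(requested_qps, achieved_qps)` input shape, and the 5% tolerance are assumptions, not the tool's algorithm.

```python
from typing import List, Optional, Tuple


def saturation_rate(sweep: List[Tuple[float, float]],
                    tolerance: float = 0.05) -> Optional[float]:
    # sweep holds (requested_qps, achieved_qps) pairs, one per stage.
    # Return the lowest requested rate the system could not keep up with,
    # or None if every stage stayed within the tolerance.
    for requested, achieved in sorted(sweep):
        if achieved < requested * (1.0 - tolerance):
            return requested
    return None


sweep = [(5, 5.0), (10, 9.9), (20, 15.0), (40, 15.2)]
print(saturation_rate(sweep))
```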

🔌 Engine Agnostic

  • Verified support for vLLM, SGLang, and TGI with server-side aggregate metrics and time series metrics.
  • Easily extensible to any OpenAI-compatible endpoint.

🚀 Quick Start

Run Locally

  1. Install inference-perf:

    pip install inference-perf
    
  2. Run a benchmark with a simple random workload:

    inference-perf --server.type vllm --server.base_url http://localhost:8000 --data.type random --load.type constant --load.stages '[{"rate": 10, "duration": 60}]' --api.streaming true
    

Alternatively, you can run using a configuration file:

    inference-perf --config_file config.yml
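
The dotted flag names in the quick-start command suggest that each flag maps to a nested key in the configuration file. Assuming that mapping holds (this page does not confirm it, and config.md is the authoritative schema), the sketch below converts the quick-start flags into a nested structure. `flags_to_config` is a hypothetical helper, not part of inference-perf.

```python
import json
from typing import Any, Dict, List


def flags_to_config(argv: List[str]) -> Dict[str, Any]:
    # Turn ["--a.b", "value", ...] pairs into a nested dict {"a": {"b": value}}.
    config: Dict[str, Any] = {}
    for flag, raw in zip(argv[0::2], argv[1::2]):
        keys = flag.lstrip("-").split(".")
        try:
            value = json.loads(raw)  # numbers, booleans, JSON lists
        except json.JSONDecodeError:
            value = raw              # plain strings such as "vllm" or a URL
        node = config
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        node[keys[-1]] = value
    return config


quick_start = [
    "--server.type", "vllm",
    "--server.base_url", "http://localhost:8000",
    "--data.type", "random",
    "--load.type", "constant",
    "--load.stages", '[{"rate": 10, "duration": 60}]',
    "--api.streaming", "true",
]
print(json.dumps(flags_to_config(quick_start), indent=2))
```

The output is JSON, which most YAML parsers also accept, so it can serve as a starting point for a config.yml once checked against config.md.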

Sample Output

When you run inference-perf, it displays a rich summary table in the CLI:

Metrics Summary

Run in Docker

    docker run -it --rm -v "$(pwd)/config.yml":/workspace/config.yml quay.io/inference-perf/inference-perf

Run in Kubernetes

Refer to the guide in /deploy.


📚 Documentation Hub

Explore detailed documentation for specific topics:

| Topic | Description | Link |
|---|---|---|
| Configuration | Full YAML configuration schema and options. | config.md |
| CLI Flags | Overriding configuration via command line flags. | cli_flags.md |
| Load Generation | Detailed explanation of load patterns and multi-worker setup. | loadgen.md |
| Metrics | Definitions of TTFT, TPOT, ITL, etc. | metrics.md |
| Goodput | How to measure requests meeting SLOs. | goodput.md |
| Reports | Understanding generated JSON reports. | reports.md |
| OTel Observability | Instrument benchmark runs with OpenTelemetry tracing to export to Jaeger, Tempo, etc. | otel_instrumentation.md |
| OTel Trace Replay | Data/load type for replaying production traces with complex dependency graphs. | otel_trace_replay.md |
| Conversation Replay | Data/load type for benchmarking concurrent multi-turn agentic conversations with configurable distributions. | conversation_replay.md |
| Analysis | Visualizations and plots for performance metrics. | analysis.md |

🤝 Contributing & Community

We welcome contributions! See CONTRIBUTING.md for details on how to get started.

