A GenAI inference performance benchmarking tool.

Inference Perf

Inference Perf is a GenAI inference performance benchmarking tool. It came out of wg-serving and is sponsored by SIG Scalability. The original proposal can be found here.

Architecture

Architecture Diagram

Key Features

Highly scalable; supports benchmarking large production inference deployments.
Reports the key metrics needed to measure LLM performance.
Supports different real-world and synthetic datasets.
Supports different APIs and multiple model servers.
Supports specifying exact input and output length distributions to simulate different scenarios: Gaussian, fixed-length, and min-max distributions are all supported.
Generates different load patterns and can benchmark specific cases such as burst traffic, scaling to saturation, and other autoscaling / routing scenarios.

Roadmap

Accelerator metrics collection during benchmarks (GPU utilization, memory usage, power usage, etc.).
Support for more model servers.
Deployment API to help deploy different inference stacks.
Traffic splitting among different use cases or LoRA adapters.
Support for benchmarking non-LLM GenAI use cases.
Support for different datasets to simulate real world use cases.
Replaying traffic from production systems.

Getting Started

Run locally

Set up a virtual environment and install inference-perf
```
pip install .
```
Run inference-perf CLI with a configuration file
```
inference-perf --config_file config.yml
```
See /examples for more examples.

Run in a Docker container

Build the container
```
docker build -t inference-perf .
```

Run the container
```
docker run -it --rm -v $(pwd)/config.yml:/workspace/config.yml inference-perf
```

Run in a Kubernetes cluster

Refer to the guide in /deploy.

Configuration

You can configure inference-perf to run with different data generation and load generation configurations today. Please see config.yml and examples in /examples.

Refer to CONFIG.md for documentation on all supported configuration options.

Datasets

Supported datasets include the following:

ShareGPT (for a real world conversational dataset)
Synthetic (for specific input / output distributions with Sonnet data)
Random (for specific input / output distributions with random data)
SharedPrefix (for prefix caching scenarios)
mock (for testing)
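
The Synthetic and Random datasets target specific input / output length distributions. As a rough illustration of what that means, the sketch below samples token lengths from a Gaussian clamped to a min-max range, with a zero standard deviation giving the fixed-length case. The function and parameter names are hypothetical and are not inference-perf's configuration options or internals.

```python
import random


def sample_length(rng, mean, std_dev, min_len, max_len):
    """Draw one token length from a Gaussian, clamped to [min_len, max_len].

    Clamping piles out-of-range samples onto the bounds; it is a simple
    approximation, not a true truncated normal distribution.
    """
    if std_dev == 0:
        value = mean  # fixed-length case
    else:
        value = rng.gauss(mean, std_dev)
    return max(min_len, min(max_len, round(value)))


def sample_lengths(count, mean, std_dev, min_len, max_len, seed=0):
    """Sample `count` lengths reproducibly from a seeded generator."""
    rng = random.Random(seed)
    return [sample_length(rng, mean, std_dev, min_len, max_len) for _ in range(count)]


if __name__ == "__main__":
    # Gaussian prompt lengths bounded to a min-max range.
    print(sample_lengths(5, mean=512, std_dev=128, min_len=16, max_len=1024))
    # Fixed-length outputs.
    print(sample_lengths(5, mean=256, std_dev=0, min_len=256, max_len=256))
```

Seeding the generator keeps a benchmark run repeatable, which matters when comparing two deployments on the same simulated workload.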

Load Generators

Multiple load generators are supported:

Poisson / constant-time load generation to send specific QPS.
Multi-process load generation for increased concurrency and higher QPS.

Multiple load patterns can be specified:

Stages with configurable duration and QPS, along with specific timeouts between them, allow you to simulate different load patterns such as bursts in traffic or steadily increasing load until hardware saturation.
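
To make the difference between Poisson and constant-time load concrete, here is a minimal sketch that turns a list of stages into request send times. The `(rate_qps, duration_s)` tuple layout and the function name are assumptions made for illustration; they are not the tool's configuration schema, and timeouts between stages are not modeled.

```python
import random


def arrival_times(stages, poisson=True, seed=0):
    """Return request send times (seconds from start) for a list of stages.

    Each stage is a hypothetical (rate_qps, duration_s) pair.
    """
    rng = random.Random(seed)
    times = []
    stage_start = 0.0
    for rate, duration in stages:
        if poisson:
            # Poisson arrivals: gaps are exponentially distributed with mean 1/rate.
            t = rng.expovariate(rate)
            while t < duration:
                times.append(stage_start + t)
                t += rng.expovariate(rate)
        else:
            # Constant-time arrivals: one request every 1/rate seconds.
            count = int(rate * duration)
            times.extend(stage_start + i / rate for i in range(count))
        stage_start += duration
    return times


if __name__ == "__main__":
    # A ramp: 2 QPS for 5 s, then 4 QPS for 5 s.
    ramp = [(2, 5), (4, 5)]
    print(len(arrival_times(ramp, poisson=False)), "constant-time requests")
    print(len(arrival_times(ramp, poisson=True)), "Poisson requests")
```

Both modes target the same average QPS; the Poisson schedule adds the short-term burstiness typical of independent clients, while the constant-time schedule spaces requests evenly.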

API

OpenAI completion and chat completion APIs are supported. The tool can be pointed at any endpoint that supports these APIs; it is currently verified against vLLM deployments. Other APIs and model server support can be added easily.
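
For orientation, the sketch below builds (without sending) a request against the OpenAI completions API using only the standard library. The server address and model name are placeholders, and the helper name is hypothetical; this is not how inference-perf itself issues requests.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens):
    """Build (but do not send) a POST request for the OpenAI completions API."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "stream": True,  # streaming lets a client timestamp each chunk as it arrives
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder address and model name; replace with your deployment's values.
    req = build_completion_request("http://localhost:8000", "my-model", "Hello", 32)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it; the chat completions API uses the `/v1/chat/completions` path with a `messages` list in place of `prompt`.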

Metrics

Inference Perf reports latency and throughput metrics for analyzing the performance of different LLM workloads. A snippet from an example report is shown below.

```
"latency": {
    "request_latency": {
        "mean": 3.31325431142327,
        "min": 1.62129471905064,
        "p10": 1.67609986825846,
        "p50": 2.11507539497688,
        "p90": 5.94717199734878,
        "max": 6.30658466403838
    },
    "normalized_time_per_output_token": {
        "mean": 0.104340420636009,
        "min": 0.0506654599703325,
        "p10": 0.0523781208830769,
        "p50": 0.0670631669655753,
        "p90": 0.189047570470012,
        "max": 0.20343821496898
    },
    "time_per_output_token": {
        "mean": 0.0836929455635872,
        "min": 0.0517028436646797,
        "p10": 0.0530815053513894,
        "p50": 0.0611870964678625,
        "p90": 0.152292036800645,
        "max": 0.17837208439984
    },
    "time_to_first_token": {
        "mean": 0.800974442732916,
        "min": 0.0625283779809251,
        "p10": 0.072068731742911,
        "p50": 0.203539535985328,
        "p90": 2.26959549135063,
        "max": 4.46773961000145
    },
    "inter_token_latency": {
        "mean": 0.0836929455635872,
        "min": 0.000007129972800612,
        "p10": 0.0534287681337446,
        "p50": 0.0591336835059337,
        "p90": 0.084046097996179,
        "max": 0.614475268055685
    }
},
"throughput": {
    "input_tokens_per_sec": 643.576644186323,
    "output_tokens_per_sec": 32.544923821416,
    "total_tokens_per_sec": 676.121568007739,
    "requests_per_sec": 1.0238155253639
},
"prompt_len": {
    "mean": 628.606060606061,
    "min": 4,
    "p10": 11.4,
    "p50": 364,
    "p90": 2427.6,
    "max": 3836
},
"output_len": {
    "mean": 31.7878787878788,
    "min": 30,
    "p10": 31,
    "p50": 32,
    "p90": 32,
    "max": 32
}
```
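
As a rough guide to reading these numbers, the sketch below derives per-request latency metrics from token arrival timestamps and summarizes a series with the same keys as the report (mean, min, p10, p50, p90, max). The metric definitions are common conventions assumed for illustration, in particular normalized time per output token as request latency divided by output token count, and percentiles by linear interpolation. They are not confirmed to match inference-perf's implementation, and the function names are hypothetical.

```python
import statistics


def percentile(sorted_values, pct):
    """Linear-interpolation percentile of an already sorted, non-empty list."""
    rank = (len(sorted_values) - 1) * pct / 100.0
    low = int(rank)
    high = min(low + 1, len(sorted_values) - 1)
    frac = rank - low
    return sorted_values[low] + (sorted_values[high] - sorted_values[low]) * frac


def summarize(values):
    """Summarize a non-empty series with the keys used in the report snippet."""
    ordered = sorted(values)
    return {
        "mean": statistics.fmean(ordered),
        "min": ordered[0],
        "p10": percentile(ordered, 10),
        "p50": percentile(ordered, 50),
        "p90": percentile(ordered, 90),
        "max": ordered[-1],
    }


def request_metrics(send_time, token_times):
    """Per-request metrics from a send time and per-token arrival times (seconds)."""
    latency = token_times[-1] - send_time
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return {
        "request_latency": latency,
        "time_to_first_token": token_times[0] - send_time,
        # Assumed definition: end-to-end latency spread over all output tokens.
        "normalized_time_per_output_token": latency / len(token_times),
        "inter_token_latencies": gaps,
    }


if __name__ == "__main__":
    metrics = request_metrics(0.0, [0.5, 0.75, 1.0])
    print(metrics)
    print(summarize(metrics["inter_token_latencies"]))
```

In the snippet above, `total_tokens_per_sec` is the sum of `input_tokens_per_sec` and `output_tokens_per_sec`, which is a quick sanity check when reading a report.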

Reports

Reports are generated in JSON format.

Per stage reports for individual request rates.
Summary reports for the overall run.
Request logs / traces for further analysis.

Model server metrics collected from Prometheus during the run are also reported.

Model server specific metrics like queue size, batch size, latency metrics, etc.
Supports querying metrics from OSS Prometheus and Google Managed Prometheus.
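
For readers who want to inspect the same data by hand, the sketch below builds a range query URL for the Prometheus HTTP API (`/api/v1/query_range`) covering a benchmark window. The Prometheus address and the metric name are assumptions for illustration; the metrics available depend on the model server, and this is not the query inference-perf itself issues.

```python
from urllib.parse import urlencode


def query_range_url(prometheus_url, promql, start_s, end_s, step_s=15):
    """Build a Prometheus range-query URL for the window [start_s, end_s].

    start_s and end_s are Unix timestamps in seconds; step_s is the
    resolution of the returned series in seconds.
    """
    params = urlencode(
        {"query": promql, "start": start_s, "end": end_s, "step": step_s}
    )
    return prometheus_url.rstrip("/") + "/api/v1/query_range?" + params


if __name__ == "__main__":
    # Placeholder address and an example queue-size metric name (assumed).
    url = query_range_url(
        "http://localhost:9090",
        "vllm:num_requests_waiting",
        start_s=1700000000,
        end_s=1700000600,
    )
    print(url)
```

Fetching the URL with any HTTP client returns a JSON document whose `data.result` field holds the time series for the window.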

Analysis

Reports can be analyzed using the following command:

```
inference-perf --analyze <path-to-dir-with-reports>
```

This should generate the following charts (the charts below are examples only):

QPS vs Latency (TTFT, NTPOT, ITL)

qps-latency-chart

QPS vs Throughput (input tokens / sec, output tokens / sec, total tokens / sec)

qps-throughput-chart

Latency vs Throughput (output tokens / sec vs TTFT, NTPOT and ITL)

latency-throughput-chart
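
The same question these charts answer visually, namely how much load a deployment sustains before latency degrades, can also be asked of the per-stage reports directly. The sketch below assumes each stage summary is a dictionary shaped like the snippet in the Metrics section; the function name and the idea of a latency limit are illustrative and are not part of the `--analyze` command.

```python
def max_qps_within_limit(stage_summaries, metric, stat, limit_s):
    """Return the highest achieved request rate whose latency stat stays within limit_s.

    Returns None when no stage meets the limit.
    """
    passing = [
        summary["throughput"]["requests_per_sec"]
        for summary in stage_summaries
        if summary["latency"][metric][stat] <= limit_s
    ]
    return max(passing) if passing else None


if __name__ == "__main__":
    # Made-up stage summaries for illustration only.
    stages = [
        {"latency": {"time_to_first_token": {"p90": 0.2}}, "throughput": {"requests_per_sec": 1.0}},
        {"latency": {"time_to_first_token": {"p90": 0.4}}, "throughput": {"requests_per_sec": 2.0}},
        {"latency": {"time_to_first_token": {"p90": 2.3}}, "throughput": {"requests_per_sec": 3.9}},
    ]
    print(max_qps_within_limit(stages, "time_to_first_token", "p90", limit_s=0.5))
```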

Contributing

Our community meeting is held weekly on Thursdays, alternating between 09:00 and 11:30 PDT (Zoom Link, Meeting Notes, Meeting Recordings).

We currently use the #inference-perf channel in the Kubernetes Slack workspace for communication.

Contributions are welcome. Thanks for joining us!

Code of conduct

Participation in the Kubernetes community is governed by the Kubernetes Code of Conduct.

Project details

Release history

0.5.0 (May 1, 2026)
0.4.0 (Feb 6, 2026)
0.3.0 (Nov 26, 2025)
0.2.0 (Sep 24, 2025)
0.1.1 (Aug 1, 2025), this version

Download files

Source Distribution

inference_perf-0.1.1.tar.gz (49.8 kB)

Uploaded Aug 1, 2025 Source

Built Distribution

inference_perf-0.1.1-py3-none-any.whl (75.4 kB)

Uploaded Aug 1, 2025 Python 3

File details

Details for the file inference_perf-0.1.1.tar.gz.

File metadata

Download URL: inference_perf-0.1.1.tar.gz
Upload date: Aug 1, 2025
Size: 49.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.11

File hashes

Hashes for inference_perf-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`8592737cf5519600f882033b0fa40be3b1ef3bf211f5de75fc5b6d7b71306733`
MD5	`54c859f3401a0816e3f85fc30256f7d3`
BLAKE2b-256	`6fb568be0c9d5422aca1f8048e775580d13f1bf4d0401c03a94724b8425129f9`

File details

Details for the file inference_perf-0.1.1-py3-none-any.whl.

File metadata

Download URL: inference_perf-0.1.1-py3-none-any.whl
Upload date: Aug 1, 2025
Size: 75.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.11

File hashes

Hashes for inference_perf-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0f4037e0116b77911bce86eefaf16cb99092409f9c4f7fbe73d89f9ba796f688`
MD5	`4b6d95bf31582cfc2b62af03d26a92bd`
BLAKE2b-256	`dc8208b59114b5aac25a7f67d3022778573fd49e0cb84a877c3f0e353bee5dce`
