
AIPerf


AIPerf is a comprehensive benchmarking tool that measures the performance of generative AI models served by your preferred inference solution. It provides detailed metrics through a live command-line display as well as extensive benchmark performance reports.

[Screenshot: AIPerf UI dashboard]

Quick Start

This quick start guide uses Ollama via Docker Desktop.

Setting up a Local Server

Start an Ollama server and pull the granite4:350m model with the following commands:

docker run -d \
  --name ollama \
  -p 11434:11434 \
  -v ollama-data:/root/.ollama \
  ollama/ollama:latest
docker exec -it ollama ollama pull granite4:350m

Basic Usage

Create a virtual environment and install AIPerf:

python3 -m venv venv
source venv/bin/activate
pip install aiperf

To run a simple benchmark against your Ollama server:

aiperf profile \
  --model "granite4:350m" \
  --streaming \
  --endpoint-type chat \
  --tokenizer ibm-granite/granite-4.0-micro \
  --url http://localhost:11434
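
Before launching a longer run, it can help to confirm what AIPerf will send: with --endpoint-type chat it drives an OpenAI-compatible chat completions endpoint. The sketch below builds that style of request body for the model used above (the helper function is illustrative, not part of AIPerf):

```python
import json

def build_chat_request(model: str, prompt: str, stream: bool = True) -> str:
    """Build an OpenAI-compatible chat completion request body,
    mirroring the --endpoint-type chat and --streaming options above."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return json.dumps(body)

payload = build_chat_request("granite4:350m", "Say hello in one word.")
print(payload)
```

Sending this body to http://localhost:11434/v1/chat/completions with curl is a quick way to verify the server responds before benchmarking.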

Example with Custom Configuration

aiperf profile \
  --model "granite4:350m" \
  --streaming \
  --endpoint-type chat \
  --tokenizer ibm-granite/granite-4.0-micro \
  --url http://localhost:11434 \
  --concurrency 5 \
  --request-count 10

Example output:

NOTE: The example performance is reflective of a CPU-only run and does not represent an official benchmark.

                                               NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓
┃                               Metric        avg       min        max        p99        p90        p50       std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩
│             Time to First Token (ms)   7,463.28  7,125.81   9,484.24   9,295.48   7,596.62   7,240.23    677.23 │
│            Time to Second Token (ms)      68.73     32.01     102.86     102.55      99.80      67.37     24.95 │
│      Time to First Output Token (ms)   7,463.28  7,125.81   9,484.24   9,295.48   7,596.62   7,240.23    677.23 │
│                 Request Latency (ms)  13,829.40  9,029.36  27,905.46  27,237.77  21,228.48  11,338.31  5,614.32 │
│             Inter Token Latency (ms)      65.31     53.06      81.31      81.24      80.64      63.79      9.09 │
│     Output Token Throughput Per User      15.60     12.30      18.85      18.77      18.08      15.68      2.05 │
│                    (tokens/sec/user)                                                                            │
│      Output Sequence Length (tokens)      95.20     29.00     295.00     283.12     176.20      63.00     77.08 │
│       Input Sequence Length (tokens)     550.00    550.00     550.00     550.00     550.00     550.00      0.00 │
│ Output Token Throughput (tokens/sec)       6.85       N/A        N/A        N/A        N/A        N/A       N/A │
│    Request Throughput (requests/sec)       0.07       N/A        N/A        N/A        N/A        N/A       N/A │
│             Request Count (requests)      10.00       N/A        N/A        N/A        N/A        N/A       N/A │
└──────────────────────────────────────┴───────────┴──────────┴───────────┴───────────┴───────────┴───────────┴──────────┘

CLI Command: aiperf profile --model 'granite4:350m' --streaming --endpoint-type 'chat' --tokenizer 'ibm-granite/granite-4.0-micro' --url 'http://localhost:11434'
Benchmark Duration: 138.89 sec
CSV Export: /home/user/aiperf/artifacts/granite4:350m-openai-chat-concurrency1/profile_export_aiperf.csv
JSON Export: /home/user/Code/aiperf/artifacts/granite4:350m-openai-chat-concurrency1/profile_export_aiperf.json
Log File: /home/user/Code/aiperf/artifacts/granite4:350m-openai-chat-concurrency1/logs/aiperf.log
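
The streaming metrics in the table above are derived from per-token arrival times. The following sketch uses the common definitions of these metrics (an illustration, not AIPerf's internal implementation): TTFT is the gap from request start to the first token, inter-token latency is the mean gap between consecutive tokens, and per-user throughput counts tokens generated after the first over the generation window.

```python
def streaming_metrics(start: float, token_times: list[float]) -> dict[str, float]:
    """Compute common streaming latency metrics from token arrival times.

    start       -- wall-clock time the request was sent (seconds)
    token_times -- arrival time of each output token (seconds), in order
    """
    ttft = token_times[0] - start  # Time to First Token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps)    # mean Inter Token Latency
    # Per-user throughput over the generation phase (tokens after the first)
    tput = (len(token_times) - 1) / (token_times[-1] - token_times[0])
    return {"ttft_s": ttft, "itl_s": itl, "tokens_per_sec_per_user": tput}

# Five tokens arriving 100 ms apart, first token 1 s after the request
m = streaming_metrics(0.0, [1.0, 1.1, 1.2, 1.3, 1.4])
print(m)
```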

Features

  • Scalable multiprocess architecture with 9 services communicating via ZMQ
  • 3 UI modes: dashboard (real-time TUI), simple (progress bars), none (headless)
  • Multiple benchmarking modes: concurrency, request-rate, request-rate with max concurrency, trace replay
  • Extensible plugin system for endpoints, datasets, transports, and metrics
  • Public dataset support including ShareGPT and custom formats
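
The concurrency benchmarking mode listed above can be pictured as a load loop where a semaphore caps how many requests are in flight at once. This is a toy asyncio model of that idea (not AIPerf's actual scheduler; the sleep stands in for a real HTTP request):

```python
import asyncio

async def run_benchmark(request_count: int, concurrency: int) -> int:
    """Toy concurrency-limited load loop: at most `concurrency`
    requests are in flight at any moment."""
    sem = asyncio.Semaphore(concurrency)
    completed = 0

    async def one_request(i: int) -> None:
        nonlocal completed
        async with sem:
            await asyncio.sleep(0.001)  # stand-in for a real inference request
            completed += 1

    await asyncio.gather(*(one_request(i) for i in range(request_count)))
    return completed

# Mirrors the earlier example: --concurrency 5 --request-count 10
done = asyncio.run(run_benchmark(request_count=10, concurrency=5))
print(done)
```

Request-rate mode differs in that it launches requests on a fixed schedule regardless of how many are already outstanding.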

Supported APIs

  • OpenAI chat completions, completions, embeddings, audio, images
  • NIM embeddings, rankings

Tutorials and Feature Guides

  • Getting Started
  • Load Control and Timing
  • Workloads and Data
  • Endpoint Types
  • Analysis and Monitoring

Documentation

  • Architecture: Three-plane architecture, core components, credit system, data flow
  • CLI Options: Complete command and option reference
  • Metrics Reference: All metric definitions, formulas, and requirements
  • Environment Variables: All AIPERF_* configuration variables
  • Plugin System: Plugin architecture, 25+ categories, creation guide
  • Creating Plugins: Step-by-step plugin tutorial
  • Accuracy Benchmarks: Accuracy evaluation stubs and datasets
  • Benchmark Modes: Trace replay and timing modes
  • Server Metrics: Prometheus-compatible server metrics collection
  • Tokenizer Auto-Detection: Pre-flight tokenizer detection
  • Conversation Context Mode: How conversation history accumulates in multi-turn
  • Dataset Synthesis API: Synthesis module API reference
  • Code Patterns: Code examples for services, models, messages, plugins
  • Migrating from Genai-Perf: Migration guide and feature comparison
  • Design Proposals: Enhancement proposals and discussions

Contributing

See CONTRIBUTING.md for development setup, coding conventions, and contribution guidelines.

Known Issues

  • Output sequence length constraints (--output-tokens-mean) cannot be guaranteed unless you pass ignore_eos and/or min_tokens via --extra-inputs to an inference server that supports them.
  • Very high concurrency settings (typically >15,000) may lead to port exhaustion on some systems. Adjust system limits or reduce concurrency if connection failures occur.
  • Startup errors caused by invalid configuration settings can cause AIPerf to hang indefinitely. Terminate the process and check configuration settings.
  • Copying selected text may not work reliably in the dashboard UI. Use the c key to copy all logs.

Project details


No source distribution is available for this release.

Built distribution: aiperf_nightly-0.8.0.dev20260511-py3-none-any.whl (3.7 MB, Python 3)

File hashes for aiperf_nightly-0.8.0.dev20260511-py3-none-any.whl:

  SHA256       c21c0536ff402673b08419243e5fc10c4e7f178a1705a19ddcd888ff549c8ff6
  MD5          8cbb9bac76f02f1b8b1b529e1ad61b2a
  BLAKE2b-256  220c3011bd13cfaa7f7930fd5d42456d6ee80efa8faa259c24a3ec84ad86e3e7
