LLM inference benchmarking toolkit
Project description
Tokenomics
Benchmarking suite for OpenAI-compatible inference servers. Measures throughput, latency, and steady-state performance.
Install
pip install tokenomics
From source
git clone https://github.com/tugot17/tokenomics.git
cd tokenomics
uv venv --python 3.12 --seed && source .venv/bin/activate
uv pip install -e .
Completion Benchmark
Sends chat completion requests to any OpenAI-compatible server and records per-request and system-wide metrics.
By default, requests are non-streaming for maximum throughput. Use --stream to enable SSE streaming for TTFT and per-token latency metrics.
Usage
# Sustained mode — maintains constant concurrency (recommended)
tokenomics completion \
--scenario "D(1024,256)" \
--model your-model \
--max-concurrency 1,2,4,8,16,32,64,128,256,512,1024
# Burst mode — fires all requests at once
tokenomics completion \
--scenario "D(1024,256)" \
--model your-model \
--batch-sizes 1,2,4,8
# Multiple completions per request (e.g. for RL rollouts)
tokenomics completion \
--scenario "D(1024,256)" \
--model your-model \
--max-concurrency 1,2,4,8,16 \
-n 16
# Streaming mode — enables TTFT and per-token metrics
tokenomics completion \
--scenario "D(1024,256)" \
--model your-model \
--max-concurrency 1,2,4,8 \
--stream
The two execution modes (--batch-sizes and --max-concurrency) are mutually exclusive. Burst is good for peak throughput; sustained gives realistic production numbers.
Traffic Scenarios
| Pattern | Example | Description |
|---|---|---|
D(in,out) |
D(100,50) |
Fixed token counts |
N(mu,sigma)/(mu,sigma) |
N(100,50)/(50,0) |
Normal distribution |
U(min,max)/(min,max) |
U(50,150)/(20,80) |
Uniform distribution |
Datasets
The benchmark uses a bundled AIME dataset by default. You can specify a custom dataset with --dataset-config.
The benchmark concatenates random text snippets from the dataset until it reaches the input token count specified by the scenario. Snippets are picked with replacement, so even a small dataset can produce long prompts.
Dataset config format
A dataset config is a JSON file with a source section:
Local file (TXT, CSV, or JSON):
{
"source": { "type": "file", "path": "../data/prompts.txt" },
"prompt_column": "text"
}
File paths are resolved relative to the config file.
HuggingFace dataset:
{
"source": {
"type": "huggingface",
"path": "squad",
"huggingface_kwargs": { "split": "train" }
},
"prompt_column": "question"
}
AIME (built-in shortcut):
{
"source": { "type": "aime" }
}
See examples/dataset_configs/ for more examples.
Key Options
| Flag | Description |
|---|---|
--scenario |
Traffic pattern (required) |
--model |
Model name (required) |
--api-base |
Server URL (default: http://localhost:8000/v1) |
--batch-sizes |
Burst mode sweep points |
--max-concurrency |
Sustained mode sweep points |
--num-prompts |
Prompts per sweep point in sustained mode |
--num-runs |
Runs per sweep point (default: 3) |
--max-tokens |
Max output tokens (default: 4096) |
-n |
Completions per request (default: 1) |
--stream |
Enable SSE streaming for TTFT/per-token metrics |
--dataset-config |
Path to dataset config (default: bundled AIME) |
--results-dir |
Output directory (one JSON per sweep value) |
--lora-strategy |
LoRA distribution: single, uniform, zipf, mixed, all-unique |
--lora-names |
Comma-separated LoRA adapter names |
Metrics
Per-request:
- TTFT — time to first token (streaming only)
- Decode throughput — output tokens/s per request (streaming only)
- TPOT — time per output token (streaming only)
- Per-request latency — end-to-end time per request
System-wide:
- End-to-end output throughput —
total_output_tokens / wall_time - Steady-state output throughput — median tok/s across time buckets where the batch is >= 80% full (streaming only)
Plotting
# Compare multiple benchmarks
tokenomics plot-completion output.png results_dir1/ results_dir2/
Non-streaming (default) produces a 2-panel plot:
| Top | Output throughput |
|---|---|
| Bottom | Per-request latency |
Streaming (--stream) produces a 6-panel dashboard:
| Left | Right | |
|---|---|---|
| Row 1 | TTFT | Decode throughput per request |
| Row 2 | End-to-end output throughput | Latency breakdown (prefill vs decode) |
| Row 3 | Steady-state output throughput | Time-series token buckets |
Embedding Benchmark
Tests concurrent embedding throughput.
tokenomics embedding \
--model Qwen/Qwen3-Embedding-4B \
--sequence_lengths "200" \
--batch_sizes "1,8,16,32,64,128,256,512" \
--num_runs 3 \
--results-dir embedding_results/
tokenomics plot-embedding embedding_results/ embedding_plot.png
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file tokenomics-0.6.1.tar.gz.
File metadata
- Download URL: tokenomics-0.6.1.tar.gz
- Upload date:
- Size: 3.6 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
bbfa6850ba4a5bc5af3cbb2e4ca639567402e950a66ce8c6300f5e1f22a7f326
|
|
| MD5 |
26ed6e8616eef1f018a2ddc79998f282
|
|
| BLAKE2b-256 |
501db35b719ccd088deb3fa44cccfda023f512fdd06a66da12b0d8ecaf947183
|
Provenance
The following attestation bundles were made for tokenomics-0.6.1.tar.gz:
Publisher:
publish.yml on tugot17/tokenomics
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
tokenomics-0.6.1.tar.gz -
Subject digest:
bbfa6850ba4a5bc5af3cbb2e4ca639567402e950a66ce8c6300f5e1f22a7f326 - Sigstore transparency entry: 1201234525
- Sigstore integration time:
-
Permalink:
tugot17/tokenomics@09b00e18a78f8cdc23b975a8a0eb077a93f60927 -
Branch / Tag:
refs/tags/v0.6.1 - Owner: https://github.com/tugot17
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@09b00e18a78f8cdc23b975a8a0eb077a93f60927 -
Trigger Event:
release
-
Statement type:
File details
Details for the file tokenomics-0.6.1-py3-none-any.whl.
File metadata
- Download URL: tokenomics-0.6.1-py3-none-any.whl
- Upload date:
- Size: 41.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f740782599c9659fad9eccfed0795d4aefb7114815efd80f51fffb80ac395190
|
|
| MD5 |
aecdbbc047498f9d29e10aa118b9ddfe
|
|
| BLAKE2b-256 |
8ef246817a1c9887e010b63e0ac41e0a92c3455b58d4c321e408e5bdcb087851
|
Provenance
The following attestation bundles were made for tokenomics-0.6.1-py3-none-any.whl:
Publisher:
publish.yml on tugot17/tokenomics
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
tokenomics-0.6.1-py3-none-any.whl -
Subject digest:
f740782599c9659fad9eccfed0795d4aefb7114815efd80f51fffb80ac395190 - Sigstore transparency entry: 1201234526
- Sigstore integration time:
-
Permalink:
tugot17/tokenomics@09b00e18a78f8cdc23b975a8a0eb077a93f60927 -
Branch / Tag:
refs/tags/v0.6.1 - Owner: https://github.com/tugot17
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@09b00e18a78f8cdc23b975a8a0eb077a93f60927 -
Trigger Event:
release
-
Statement type: