LLM inference benchmarking toolkit

Tokenomics

Benchmarking suite for OpenAI-compatible inference servers. Measures throughput, latency, and steady-state performance.

Install

pip install tokenomics

From source

git clone https://github.com/tugot17/tokenomics.git
cd tokenomics
uv venv --python 3.12 --seed && source .venv/bin/activate
uv pip install -e .

Completion Benchmark

Sends chat completion requests to any OpenAI-compatible server and records per-request and system-wide metrics.

By default, requests are non-streaming for maximum throughput. Use --stream to enable SSE streaming for TTFT and per-token latency metrics.
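
A non-streaming response arrives as a single payload, so only end-to-end latency can be timed. With streaming, each chunk has its own arrival time, which is what makes TTFT and per-token numbers possible. The sketch below illustrates that arithmetic; it is not the tokenomics implementation. `stream_metrics` and its arguments are hypothetical names, and it assumes one output token per streamed chunk.

```python
from typing import Optional, Sequence


def stream_metrics(request_start: float, chunk_times: Sequence[float]) -> dict:
    """Derive per-request metrics from chunk arrival timestamps (seconds).

    Assumption: every streamed chunk carries exactly one output token.
    """
    if not chunk_times:
        raise ValueError("no chunks received")
    ttft = chunk_times[0] - request_start        # time to first token
    latency = chunk_times[-1] - request_start    # end-to-end request time
    n = len(chunk_times)
    # TPOT averages the gaps between tokens that follow the first one.
    tpot: Optional[float] = (
        (chunk_times[-1] - chunk_times[0]) / (n - 1) if n > 1 else None
    )
    decode_tps = 1.0 / tpot if tpot else None    # output tokens/s while decoding
    return {"ttft": ttft, "tpot": tpot, "decode_tps": decode_tps, "latency": latency}


# Five tokens: the first after 0.5 s, then one every 0.1 s.
m = stream_metrics(0.0, [0.5, 0.6, 0.7, 0.8, 0.9])
print(m)
```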

Usage

# Sustained mode — maintains constant concurrency (recommended)
tokenomics completion \
  --scenario "D(1024,256)" \
  --model your-model \
  --max-concurrency 1,2,4,8,16,32,64,128,256,512,1024

# Burst mode — fires all requests at once
tokenomics completion \
  --scenario "D(1024,256)" \
  --model your-model \
  --batch-sizes 1,2,4,8

# Multiple completions per request (e.g. for RL rollouts)
tokenomics completion \
  --scenario "D(1024,256)" \
  --model your-model \
  --max-concurrency 1,2,4,8,16 \
  -n 16

# Streaming mode — enables TTFT and per-token metrics
tokenomics completion \
  --scenario "D(1024,256)" \
  --model your-model \
  --max-concurrency 1,2,4,8 \
  --stream

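The two execution modes are mutually exclusive: pass either --batch-sizes (burst) or --max-concurrency (sustained), not both. Burst mode fires every request at once and suits peak-throughput measurements; sustained mode holds concurrency constant and gives numbers closer to production load.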
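
The difference between the two modes can be sketched with asyncio. This is an illustration under stated assumptions, not the tool's code: `fake_request`, `burst`, and `sustained` are hypothetical names, and a short sleep stands in for the server call.

```python
import asyncio


async def fake_request(i: int, active: list, peak: list) -> int:
    """Stand-in for one completion request; tracks how many are in flight."""
    active[0] += 1
    peak[0] = max(peak[0], active[0])
    await asyncio.sleep(0.01)  # simulated server latency
    active[0] -= 1
    return i


async def burst(batch_size: int) -> int:
    """Burst: launch every request at once and wait for all of them."""
    active, peak = [0], [0]
    await asyncio.gather(*(fake_request(i, active, peak) for i in range(batch_size)))
    return peak[0]


async def sustained(num_prompts: int, max_concurrency: int) -> int:
    """Sustained: a semaphore caps in-flight requests at max_concurrency;
    a new request starts as soon as a slot frees up."""
    active, peak = [0], [0]
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(i: int) -> int:
        async with sem:
            return await fake_request(i, active, peak)

    await asyncio.gather(*(guarded(i) for i in range(num_prompts)))
    return peak[0]


burst_peak = asyncio.run(burst(8))
sustained_peak = asyncio.run(sustained(32, 4))
print(burst_peak, sustained_peak)
```

In the burst run the in-flight count jumps to the batch size and then drains. In the sustained run it stays at the cap until the prompts run out, so the server is observed under a steady load.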

Traffic Scenarios

Pattern                   Example              Description
D(in,out)                 D(100,50)            Fixed token counts
N(mu,sigma)/(mu,sigma)    N(100,50)/(50,0)     Normal distribution
U(min,max)/(min,max)      U(50,150)/(20,80)    Uniform distribution
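
To make the patterns concrete, here is a minimal sketch of turning a scenario string into one (input, output) token-count draw. It is not the parser tokenomics uses. `sample_scenario` is a hypothetical name, and two things are assumptions: the first pair describes input tokens and the second output tokens (by analogy with D(in,out)), and normal draws are rounded and clamped to at least 1.

```python
import random
import re

_PAIR = r"\(\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\)"
_SCENARIO = re.compile(rf"^([DNU]){_PAIR}(?:/{_PAIR})?$")


def _draw(kind: str, x: float, y: float, rng: random.Random) -> int:
    """Draw one token count: N uses (mu, sigma), U uses (min, max)."""
    if kind == "N":
        return max(1, round(rng.gauss(x, y)))  # clamp is an assumption
    return rng.randint(int(x), int(y))


def sample_scenario(spec: str, rng: random.Random) -> tuple[int, int]:
    """Return one (input_tokens, output_tokens) draw for a scenario string."""
    match = _SCENARIO.match(spec.strip())
    if match is None:
        raise ValueError(f"unrecognised scenario: {spec!r}")
    kind = match.group(1)
    a, b = float(match.group(2)), float(match.group(3))
    if kind == "D":
        if match.group(4) is not None:
            raise ValueError("D takes a single (in,out) pair")
        return int(a), int(b)
    if match.group(4) is None:
        raise ValueError(f"{kind} needs two pairs: input/output")
    c, d = float(match.group(4)), float(match.group(5))
    return _draw(kind, a, b, rng), _draw(kind, c, d, rng)


rng = random.Random(0)
print(sample_scenario("D(1024,256)", rng))
print(sample_scenario("N(100,50)/(50,0)", rng))
print(sample_scenario("U(50,150)/(20,80)", rng))
```

Under these assumptions, N(100,50)/(50,0) varies the prompt length while the output length stays at 50, because a sigma of 0 removes the spread.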

Datasets

The benchmark uses a bundled AIME dataset by default. You can specify a custom dataset with --dataset-config.

The benchmark concatenates random text snippets from the dataset until it reaches the input token count specified by the scenario. Snippets are picked with replacement, so even a small dataset can produce long prompts.
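
The prompt-building step described above can be sketched as follows. This is a simplified illustration rather than the actual implementation: `build_prompt` is a hypothetical name, whitespace-separated words stand in for real tokenizer tokens, and trimming to the exact target is an assumption.

```python
import random
from typing import Sequence


def build_prompt(snippets: Sequence[str], target_tokens: int, rng: random.Random) -> str:
    """Concatenate randomly chosen snippets until the target length is reached.

    Assumption: a "token" here is a whitespace-separated word, not a model token.
    """
    pool = [s.split() for s in snippets if s.split()]  # skip empty snippets
    if not pool:
        raise ValueError("dataset has no usable snippets")
    words: list[str] = []
    while len(words) < target_tokens:
        # rng.choice samples with replacement, so a snippet may appear many times.
        words.extend(rng.choice(pool))
    return " ".join(words[:target_tokens])  # trim the overshoot (an assumption)


# Two tiny snippets are enough to fill a 1000-token prompt.
prompt = build_prompt(["find all integer solutions", "prove the inequality"], 1000, random.Random(0))
print(len(prompt.split()))
```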

Dataset config format

A dataset config is a JSON file with a source section:

Local file (TXT, CSV, or JSON):

{
  "source": { "type": "file", "path": "../data/prompts.txt" },
  "prompt_column": "text"
}

File paths are resolved relative to the config file.
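
As an illustration of that rule, the sketch below resolves the `path` entry against the directory containing the config file rather than the current working directory. `resolve_dataset_path` is a hypothetical helper, not part of the tokenomics API.

```python
import json
import tempfile
from pathlib import Path


def resolve_dataset_path(config_path: str) -> Path:
    """Return the dataset file path, resolved against the config file's directory."""
    config_file = Path(config_path).resolve()
    source = json.loads(config_file.read_text())["source"]
    if source.get("type") != "file":
        raise ValueError("only 'file' sources carry a local path")
    # If source["path"] were absolute, the / operator would keep it unchanged.
    return (config_file.parent / source["path"]).resolve()


# Demo: a config in <tmp>/configs/ pointing at ../data/prompts.txt
with tempfile.TemporaryDirectory() as tmp:
    cfg_dir = Path(tmp) / "configs"
    cfg_dir.mkdir()
    cfg = cfg_dir / "dataset.json"
    cfg.write_text(json.dumps({
        "source": {"type": "file", "path": "../data/prompts.txt"},
        "prompt_column": "text",
    }))
    resolved = resolve_dataset_path(str(cfg))
    expected = (Path(tmp) / "data" / "prompts.txt").resolve()
    print(resolved == expected)
```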

HuggingFace dataset:

{
  "source": {
    "type": "huggingface",
    "path": "squad",
    "huggingface_kwargs": { "split": "train" }
  },
  "prompt_column": "question"
}

AIME (built-in shortcut):

{
  "source": { "type": "aime" }
}

See examples/dataset_configs/ for more examples.

Key Options

Flag                 Description
--scenario           Traffic pattern (required)
--model              Model name (required)
--api-base           Server URL (default: http://localhost:8000/v1)
--batch-sizes        Burst mode sweep points
--max-concurrency    Sustained mode sweep points
--num-prompts        Prompts per sweep point in sustained mode
--num-runs           Runs per sweep point (default: 3)
--max-tokens         Max output tokens (default: 4096)
-n                   Completions per request (default: 1)
--stream             Enable SSE streaming for TTFT/per-token metrics
--dataset-config     Path to dataset config (default: bundled AIME)
--results-dir        Output directory (one JSON per sweep value)
--lora-strategy      LoRA distribution: single, uniform, zipf, mixed, all-unique
--lora-names         Comma-separated LoRA adapter names
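
For intuition about how a LoRA distribution shapes traffic, here is a hedged sketch of spreading requests over adapters. It covers only three of the strategy names, and their semantics here are guesses rather than documented behaviour. `assign_adapters` and `zipf_s` are hypothetical, and the Zipf weights follow the textbook form 1/rank^s.

```python
import random


def assign_adapters(names: list[str], num_requests: int, strategy: str,
                    rng: random.Random, zipf_s: float = 1.0) -> list[str]:
    """Pick one adapter name per request (assumed semantics, for illustration)."""
    if not names:
        raise ValueError("no adapter names given")
    if strategy == "single":
        return [names[0]] * num_requests          # every request hits one adapter
    if strategy == "uniform":
        weights = [1.0] * len(names)              # all adapters equally likely
    elif strategy == "zipf":
        # The first adapter is the hottest; popularity decays with rank.
        weights = [1.0 / rank ** zipf_s for rank in range(1, len(names) + 1)]
    else:
        raise ValueError(f"strategy not covered by this sketch: {strategy!r}")
    return rng.choices(names, weights=weights, k=num_requests)


picks = assign_adapters(["lora-a", "lora-b", "lora-c"], 10_000, "zipf", random.Random(0))
print({name: picks.count(name) for name in ("lora-a", "lora-b", "lora-c")})
```

A skewed distribution like this is useful for checking whether a server's adapter cache copes when a few adapters dominate the traffic.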

Metrics

Per-request:

  • TTFT — time to first token (streaming only)
  • Decode throughput — output tokens/s per request (streaming only)
  • TPOT — time per output token (streaming only)
  • Per-request latency — end-to-end time per request

System-wide:

  • End-to-end output throughput — total_output_tokens / wall_time
  • Steady-state output throughput — median tok/s across time buckets where the batch is >= 80% full (streaming only)
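
The two system-wide metrics can be sketched as below. This is a simplified reading of the definitions above, not the tool's code. The function names are hypothetical, and so are the 1-second bucket width, the rule that a request counts as in flight when it covers a bucket's midpoint, and the convention that timestamps are seconds since the benchmark started.

```python
from statistics import median

# (start, end, token arrival times), all in seconds since the benchmark began
Request = tuple[float, float, list[float]]


def end_to_end_throughput(requests: list[Request]) -> float:
    """total_output_tokens / wall_time over the whole run."""
    total_output_tokens = sum(len(tokens) for _, _, tokens in requests)
    wall_time = max(end for _, end, _ in requests) - min(start for start, _, _ in requests)
    return total_output_tokens / wall_time


def steady_state_throughput(requests: list[Request], max_concurrency: int,
                            bucket_s: float = 1.0, fill: float = 0.8) -> float:
    """Median tokens/s over buckets where in-flight requests >= fill * max_concurrency."""
    n_buckets = int(max(end for _, end, _ in requests) // bucket_s) + 1
    tokens = [0] * n_buckets
    active = [0] * n_buckets
    for start, end, token_times in requests:
        for t in token_times:
            tokens[int(t // bucket_s)] += 1
        for b in range(n_buckets):
            # Assumption: a request is "in flight" if it covers the bucket midpoint.
            if start <= (b + 0.5) * bucket_s < end:
                active[b] += 1
    rates = [tokens[b] / bucket_s for b in range(n_buckets)
             if active[b] >= fill * max_concurrency]
    if not rates:
        raise ValueError("no bucket reached the fill threshold")
    return median(rates)


requests = [
    (0.0, 3.0, [0.25, 0.75, 1.25, 1.75, 2.25, 2.75]),  # long request, 2 tok/s
    (1.0, 2.0, [1.25, 1.5, 1.75]),                      # overlaps only second 1-2
]
e2e = end_to_end_throughput(requests)
steady = steady_state_throughput(requests, max_concurrency=2)
print(e2e, steady)
```

In this toy run only the middle second has both slots occupied, so the steady-state figure comes out higher than the end-to-end one. That gap is the ramp-up and drain effect the metric is meant to filter out.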

Plotting

# Compare multiple benchmarks
tokenomics plot-completion output.png results_dir1/ results_dir2/

Non-streaming (default) produces a 2-panel plot:

Top       Output throughput
Bottom    Per-request latency

Streaming (--stream) produces a 6-panel dashboard:

         Left                              Right
Row 1    TTFT                              Decode throughput per request
Row 2    End-to-end output throughput      Latency breakdown (prefill vs decode)
Row 3    Steady-state output throughput    Time-series token buckets

Embedding Benchmark

Tests concurrent embedding throughput.

tokenomics embedding \
  --model Qwen/Qwen3-Embedding-4B \
  --sequence_lengths "200" \
  --batch_sizes "1,8,16,32,64,128,256,512" \
  --num_runs 3 \
  --results-dir embedding_results/

tokenomics plot-embedding embedding_results/ embedding_plot.png
