Benchmark any OpenAI-compatible LLM endpoint. TTFT, inter-token latency, throughput, P50-P99 — in one command.
Project description
infermark
Know how fast your LLM endpoint actually is.
infermark benchmarks any OpenAI-compatible API endpoint — vLLM, TGI, Ollama, SGLang, or anything behind /v1/chat/completions. It measures what matters: time to first token, inter-token latency, throughput under load, and tail latencies. One command, no config files, real numbers.
Both llmperf and llm-bench were archived in 2025. infermark fills the gap.
What it measures
| Metric | What it tells you |
|---|---|
| TTFT | Time to first token — how long until streaming starts |
| ITL | Inter-token latency — smoothness of the stream |
| Throughput (tok/s) | Output tokens per second across all concurrent requests |
| P50 / P95 / P99 | Tail latency distribution at each concurrency level |
| Error rate | Failed requests under load |
| RPS | Requests per second the server can sustain |
Install
pip install infermark
With the CLI (rich tables, progress):
pip install infermark[cli]
Quick start
CLI
# Benchmark a local vLLM server
infermark run http://localhost:8000/v1 --model meta-llama/Llama-3-70B -n 50
# Sweep concurrency levels
infermark run http://localhost:8000/v1 -c 1,4,8,16,32,64 -n 100
# Save results as JSON
infermark run http://localhost:8000/v1 -o results.json
# Compare multiple endpoints
infermark compare vllm.json tgi.json ollama.json
Python
from infermark import BenchmarkConfig, run_benchmark
config = BenchmarkConfig(
url="http://localhost:8000/v1",
model="meta-llama/Llama-3-70B-Instruct",
concurrency_levels=[1, 4, 8, 16, 32],
n_requests=100,
max_tokens=256,
)
report = run_benchmark(config)
# Best throughput
best = report.best_throughput()
print(f"Peak: {best.tokens_per_second:.1f} tok/s at concurrency {best.concurrency}")
# Lowest latency
low = report.lowest_latency()
print(f"Lowest P50: {low.latency.p50 * 1000:.1f} ms at concurrency {low.concurrency}")
Async
import asyncio
from infermark import BenchmarkConfig, run_benchmark_async
async def main():
config = BenchmarkConfig(url="http://localhost:8000/v1", model="llama-3")
report = await run_benchmark_async(config)
print(f"Peak throughput: {report.best_throughput().tokens_per_second:.1f} tok/s")
asyncio.run(main())
Compare endpoints
Find out whether vLLM, TGI, or Ollama is faster for your model and hardware:
# Benchmark each endpoint separately
infermark run http://gpu1:8000/v1 --model llama-3 -o vllm.json
infermark run http://gpu2:8080/v1 --model llama-3 -o tgi.json
infermark run http://gpu3:11434/v1 --model llama-3 -o ollama.json
# Side-by-side comparison
infermark compare vllm.json tgi.json ollama.json
Export formats
# JSON (for programmatic analysis)
infermark run http://localhost:8000/v1 -o report.json
# Markdown (paste into docs/PRs)
infermark run http://localhost:8000/v1 --markdown report.md
Configuration
BenchmarkConfig(
url="http://localhost:8000/v1", # Any OpenAI-compatible endpoint
model="meta-llama/Llama-3-70B", # Model name
prompt="Explain relativity.", # Prompt to send
max_tokens=256, # Max output tokens per request
concurrency_levels=[1, 4, 8, 16], # Test these concurrency levels
n_requests=100, # Requests per level
timeout=120.0, # Per-request timeout (seconds)
mode=BenchmarkMode.STREAMING, # STREAMING or NON_STREAMING
warmup=3, # Warmup requests before measurement
api_key="sk-...", # Optional API key
)
How it works
- Warmup — Sends a few requests to prime the server's KV cache and JIT compilation
- For each concurrency level — Fires N requests with M concurrent workers using
asyncio - Streaming measurement — Parses SSE chunks to measure TTFT and inter-token latency
- Statistics — Computes P50/P75/P90/P95/P99, mean, min, max, std from raw timings
- Report — Rich terminal tables, JSON, or Markdown output
Supported endpoints
Anything that speaks the OpenAI chat completions API:
- vLLM
- Text Generation Inference (TGI)
- SGLang
- Ollama (with
OLLAMA_ORIGINS=*) - llama.cpp server
- LiteLLM proxy
- OpenAI, Anthropic (via compatible proxy), Together, Fireworks, etc.
See Also
Part of the stef41 LLM toolkit — open-source tools for every stage of the LLM lifecycle:
| Project | What it does |
|---|---|
| tokonomics | Token counting & cost management for LLM APIs |
| datacrux | Training data quality — dedup, PII, contamination |
| castwright | Synthetic instruction data generation |
| datamix | Dataset mixing & curriculum optimization |
| toksight | Tokenizer analysis & comparison |
| trainpulse | Training health monitoring |
| ckpt | Checkpoint inspection, diffing & merging |
| quantbench | Quantization quality analysis |
| modeldiff | Behavioral regression testing |
| vibesafe | AI-generated code safety scanner |
| injectionguard | Prompt injection detection |
License
Apache-2.0
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file infermark-0.2.0.tar.gz.
File metadata
- Download URL: infermark-0.2.0.tar.gz
- Upload date:
- Size: 31.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
7fb8fd03963c3ece4f920aa08ff6a142977578c06074fb84bdd45e75c2351ae8
|
|
| MD5 |
d4e5864bb61af20622488bcf3a612767
|
|
| BLAKE2b-256 |
c488bfeb462a73768deb2e4e0abe8d0d67a79bd7acdded4b2715561980b6cc68
|
File details
Details for the file infermark-0.2.0-py3-none-any.whl.
File metadata
- Download URL: infermark-0.2.0-py3-none-any.whl
- Upload date:
- Size: 22.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
96b209598d76f4ea77735b2189840dcf6de2279e3355050c716e4fe0bdf40e0a
|
|
| MD5 |
ce4c00aa11cad061effa444583ee7126
|
|
| BLAKE2b-256 |
05503b9fa921fa41eb732569bd50a20490f7aa15b7d37ff7855c4e05bd367392
|