Benchmark any OpenAI-compatible LLM endpoint. TTFT, inter-token latency, throughput, P50-P99 — in one command.

These details have not been verified by PyPI

Project links

Project description

infermark

Know how fast your LLM endpoint actually is.

infermark benchmarks any OpenAI-compatible API endpoint — vLLM, TGI, Ollama, SGLang, or anything behind /v1/chat/completions. It measures what matters: time to first token, inter-token latency, throughput under load, and tail latencies. One command, no config files, real numbers.

Both llmperf and llm-bench were archived in 2025. infermark fills the gap.

Demo

What it measures

Metric	What it tells you
TTFT	Time to first token — how long until streaming starts
ITL	Inter-token latency — smoothness of the stream
Throughput (tok/s)	Output tokens per second across all concurrent requests
P50 / P95 / P99	Tail latency distribution at each concurrency level
Error rate	Failed requests under load
RPS	Requests per second the server can sustain

Install

pip install infermark

With the CLI (rich tables, progress):

pip install infermark[cli]

Quick start

CLI

# Benchmark a local vLLM server
infermark run http://localhost:8000/v1 --model meta-llama/Llama-3-70B -n 50

# Sweep concurrency levels
infermark run http://localhost:8000/v1 -c 1,4,8,16,32,64 -n 100

# Save results as JSON
infermark run http://localhost:8000/v1 -o results.json

# Compare multiple endpoints
infermark compare vllm.json tgi.json ollama.json

Python

from infermark import BenchmarkConfig, run_benchmark

config = BenchmarkConfig(
    url="http://localhost:8000/v1",
    model="meta-llama/Llama-3-70B-Instruct",
    concurrency_levels=[1, 4, 8, 16, 32],
    n_requests=100,
    max_tokens=256,
)

report = run_benchmark(config)

# Best throughput
best = report.best_throughput()
print(f"Peak: {best.tokens_per_second:.1f} tok/s at concurrency {best.concurrency}")

# Lowest latency
low = report.lowest_latency()
print(f"Lowest P50: {low.latency.p50 * 1000:.1f} ms at concurrency {low.concurrency}")

Async

import asyncio
from infermark import BenchmarkConfig, run_benchmark_async

async def main():
    config = BenchmarkConfig(url="http://localhost:8000/v1", model="llama-3")
    report = await run_benchmark_async(config)
    print(f"Peak throughput: {report.best_throughput().tokens_per_second:.1f} tok/s")

asyncio.run(main())

Compare endpoints

Find out whether vLLM, TGI, or Ollama is faster for your model and hardware:

Comparison

# Benchmark each endpoint separately
infermark run http://gpu1:8000/v1 --model llama-3 -o vllm.json
infermark run http://gpu2:8080/v1 --model llama-3 -o tgi.json
infermark run http://gpu3:11434/v1 --model llama-3 -o ollama.json

# Side-by-side comparison
infermark compare vllm.json tgi.json ollama.json

Export formats

# JSON (for programmatic analysis)
infermark run http://localhost:8000/v1 -o report.json

# Markdown (paste into docs/PRs)
infermark run http://localhost:8000/v1 --markdown report.md

Configuration

BenchmarkConfig(
    url="http://localhost:8000/v1",     # Any OpenAI-compatible endpoint
    model="meta-llama/Llama-3-70B",     # Model name
    prompt="Explain relativity.",        # Prompt to send
    max_tokens=256,                      # Max output tokens per request
    concurrency_levels=[1, 4, 8, 16],   # Test these concurrency levels
    n_requests=100,                      # Requests per level
    timeout=120.0,                       # Per-request timeout (seconds)
    mode=BenchmarkMode.STREAMING,        # STREAMING or NON_STREAMING
    warmup=3,                            # Warmup requests before measurement
    api_key="sk-...",                    # Optional API key
)

How it works

Warmup — Sends a few requests to prime the server's KV cache and JIT compilation
For each concurrency level — Fires N requests with M concurrent workers using asyncio
Streaming measurement — Parses SSE chunks to measure TTFT and inter-token latency
Statistics — Computes P50/P75/P90/P95/P99, mean, min, max, std from raw timings
Report — Rich terminal tables, JSON, or Markdown output

Supported endpoints

Anything that speaks the OpenAI chat completions API:

vLLM
Text Generation Inference (TGI)
SGLang
Ollama (with OLLAMA_ORIGINS=*)
llama.cpp server
LiteLLM proxy
OpenAI, Anthropic (via compatible proxy), Together, Fireworks, etc.

Project	What it does
tokonomics	Token counting & cost management for LLM APIs
datacrux	Training data quality — dedup, PII, contamination
castwright	Synthetic instruction data generation
datamix	Dataset mixing & curriculum optimization
toksight	Tokenizer analysis & comparison
trainpulse	Training health monitoring
ckpt	Checkpoint inspection, diffing & merging
quantbench	Quantization quality analysis
modeldiff	Behavioral regression testing
vibesafe	AI-generated code safety scanner
injectionguard	Prompt injection detection

License

Apache-2.0

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.3.0

Apr 10, 2026

This version

0.2.0

Apr 10, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

infermark-0.2.0.tar.gz (31.9 kB view details)

Uploaded Apr 10, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

infermark-0.2.0-py3-none-any.whl (22.4 kB view details)

Uploaded Apr 10, 2026 Python 3

File details

Details for the file infermark-0.2.0.tar.gz.

File metadata

Download URL: infermark-0.2.0.tar.gz
Upload date: Apr 10, 2026
Size: 31.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for infermark-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`7fb8fd03963c3ece4f920aa08ff6a142977578c06074fb84bdd45e75c2351ae8`
MD5	`d4e5864bb61af20622488bcf3a612767`
BLAKE2b-256	`c488bfeb462a73768deb2e4e0abe8d0d67a79bd7acdded4b2715561980b6cc68`

See more details on using hashes here.

File details

Details for the file infermark-0.2.0-py3-none-any.whl.

File metadata

Download URL: infermark-0.2.0-py3-none-any.whl
Upload date: Apr 10, 2026
Size: 22.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for infermark-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`96b209598d76f4ea77735b2189840dcf6de2279e3355050c716e4fe0bdf40e0a`
MD5	`ce4c00aa11cad061effa444583ee7126`
BLAKE2b-256	`05503b9fa921fa41eb732569bd50a20490f7aa15b7d37ff7855c4e05bd367392`

See more details on using hashes here.

infermark 0.2.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

infermark

What it measures

Install

Quick start

CLI

Python

Async

Compare endpoints

Export formats

Configuration

How it works

Supported endpoints

See Also

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes