Skip to main content

Benchmark any OpenAI-compatible LLM endpoint. TTFT, inter-token latency, throughput, P50-P99 — in one command.

Project description

infermark

CI Python 3.9+ License: Apache 2.0

Know how fast your LLM endpoint actually is.

infermark benchmarks any OpenAI-compatible API endpoint — vLLM, TGI, Ollama, SGLang, or anything behind /v1/chat/completions. It measures what matters: time to first token, inter-token latency, throughput under load, and tail latencies. One command, no config files, real numbers.

Both llmperf and llm-bench were archived in 2025. infermark fills the gap.

Demo

What it measures

Metric What it tells you
TTFT Time to first token — how long until streaming starts
ITL Inter-token latency — smoothness of the stream
Throughput (tok/s) Output tokens per second across all concurrent requests
P50 / P95 / P99 Tail latency distribution at each concurrency level
Error rate Failed requests under load
RPS Requests per second the server can sustain

Install

pip install infermark

With the CLI (rich tables, progress):

pip install infermark[cli]

Quick start

CLI

# Benchmark a local vLLM server
infermark run http://localhost:8000/v1 --model meta-llama/Llama-3-70B -n 50

# Sweep concurrency levels
infermark run http://localhost:8000/v1 -c 1,4,8,16,32,64 -n 100

# Save results as JSON
infermark run http://localhost:8000/v1 -o results.json

# Compare multiple endpoints
infermark compare vllm.json tgi.json ollama.json

Python

from infermark import BenchmarkConfig, run_benchmark

config = BenchmarkConfig(
    url="http://localhost:8000/v1",
    model="meta-llama/Llama-3-70B-Instruct",
    concurrency_levels=[1, 4, 8, 16, 32],
    n_requests=100,
    max_tokens=256,
)

report = run_benchmark(config)

# Best throughput
best = report.best_throughput()
print(f"Peak: {best.tokens_per_second:.1f} tok/s at concurrency {best.concurrency}")

# Lowest latency
low = report.lowest_latency()
print(f"Lowest P50: {low.latency.p50 * 1000:.1f} ms at concurrency {low.concurrency}")

Async

import asyncio
from infermark import BenchmarkConfig, run_benchmark_async

async def main():
    config = BenchmarkConfig(url="http://localhost:8000/v1", model="llama-3")
    report = await run_benchmark_async(config)
    print(f"Peak throughput: {report.best_throughput().tokens_per_second:.1f} tok/s")

asyncio.run(main())

Compare endpoints

Find out whether vLLM, TGI, or Ollama is faster for your model and hardware:

Comparison

# Benchmark each endpoint separately
infermark run http://gpu1:8000/v1 --model llama-3 -o vllm.json
infermark run http://gpu2:8080/v1 --model llama-3 -o tgi.json
infermark run http://gpu3:11434/v1 --model llama-3 -o ollama.json

# Side-by-side comparison
infermark compare vllm.json tgi.json ollama.json

Export formats

# JSON (for programmatic analysis)
infermark run http://localhost:8000/v1 -o report.json

# Markdown (paste into docs/PRs)
infermark run http://localhost:8000/v1 --markdown report.md

Configuration

BenchmarkConfig(
    url="http://localhost:8000/v1",     # Any OpenAI-compatible endpoint
    model="meta-llama/Llama-3-70B",     # Model name
    prompt="Explain relativity.",        # Prompt to send
    max_tokens=256,                      # Max output tokens per request
    concurrency_levels=[1, 4, 8, 16],   # Test these concurrency levels
    n_requests=100,                      # Requests per level
    timeout=120.0,                       # Per-request timeout (seconds)
    mode=BenchmarkMode.STREAMING,        # STREAMING or NON_STREAMING
    warmup=3,                            # Warmup requests before measurement
    api_key="sk-...",                    # Optional API key
)

How it works

  1. Warmup — Sends a few requests to prime the server's KV cache and JIT compilation
  2. For each concurrency level — Fires N requests with M concurrent workers using asyncio
  3. Streaming measurement — Parses SSE chunks to measure TTFT and inter-token latency
  4. Statistics — Computes P50/P75/P90/P95/P99, mean, min, max, std from raw timings
  5. Report — Rich terminal tables, JSON, or Markdown output

Supported endpoints

Anything that speaks the OpenAI chat completions API:

See Also

Part of the stef41 LLM toolkit — open-source tools for every stage of the LLM lifecycle:

Project What it does
tokonomics Token counting & cost management for LLM APIs
datacrux Training data quality — dedup, PII, contamination
castwright Synthetic instruction data generation
datamix Dataset mixing & curriculum optimization
toksight Tokenizer analysis & comparison
trainpulse Training health monitoring
ckpt Checkpoint inspection, diffing & merging
quantbench Quantization quality analysis
modeldiff Behavioral regression testing
vibesafe AI-generated code safety scanner
injectionguard Prompt injection detection

License

Apache-2.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

infermark-0.2.0.tar.gz (31.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

infermark-0.2.0-py3-none-any.whl (22.4 kB view details)

Uploaded Python 3

File details

Details for the file infermark-0.2.0.tar.gz.

File metadata

  • Download URL: infermark-0.2.0.tar.gz
  • Upload date:
  • Size: 31.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for infermark-0.2.0.tar.gz
Algorithm Hash digest
SHA256 7fb8fd03963c3ece4f920aa08ff6a142977578c06074fb84bdd45e75c2351ae8
MD5 d4e5864bb61af20622488bcf3a612767
BLAKE2b-256 c488bfeb462a73768deb2e4e0abe8d0d67a79bd7acdded4b2715561980b6cc68

See more details on using hashes here.

File details

Details for the file infermark-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: infermark-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 22.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for infermark-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 96b209598d76f4ea77735b2189840dcf6de2279e3355050c716e4fe0bdf40e0a
MD5 ce4c00aa11cad061effa444583ee7126
BLAKE2b-256 05503b9fa921fa41eb732569bd50a20490f7aa15b7d37ff7855c4e05bd367392

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page