Skip to main content

Benchmark any OpenAI-compatible LLM endpoint. TTFT, inter-token latency, throughput, P50-P99 — in one command.

Project description

infermark

CI Python 3.9+ License: Apache 2.0

Know how fast your LLM endpoint actually is.

infermark benchmarks any OpenAI-compatible API endpoint — vLLM, TGI, Ollama, SGLang, or anything behind /v1/chat/completions. It measures what matters: time to first token, inter-token latency, throughput under load, and tail latencies. One command, no config files, real numbers.

Both llmperf and llm-bench were archived in 2025. infermark fills the gap.

Demo

What it measures

Metric What it tells you
TTFT Time to first token — how long until streaming starts
ITL Inter-token latency — smoothness of the stream
Throughput (tok/s) Output tokens per second across all concurrent requests
P50 / P95 / P99 Tail latency distribution at each concurrency level
Error rate Failed requests under load
RPS Requests per second the server can sustain

Install

pip install infermark

With the CLI (rich tables, progress):

pip install infermark[cli]

Quick start

CLI

# Benchmark a local vLLM server
infermark run http://localhost:8000/v1 --model meta-llama/Llama-3-70B -n 50

# Sweep concurrency levels
infermark run http://localhost:8000/v1 -c 1,4,8,16,32,64 -n 100

# Save results as JSON
infermark run http://localhost:8000/v1 -o results.json

# Compare multiple endpoints
infermark compare vllm.json tgi.json ollama.json

Python

from infermark import BenchmarkConfig, run_benchmark

config = BenchmarkConfig(
    url="http://localhost:8000/v1",
    model="meta-llama/Llama-3-70B-Instruct",
    concurrency_levels=[1, 4, 8, 16, 32],
    n_requests=100,
    max_tokens=256,
)

report = run_benchmark(config)

# Best throughput
best = report.best_throughput()
print(f"Peak: {best.tokens_per_second:.1f} tok/s at concurrency {best.concurrency}")

# Lowest latency
low = report.lowest_latency()
print(f"Lowest P50: {low.latency.p50 * 1000:.1f} ms at concurrency {low.concurrency}")

Async

import asyncio
from infermark import BenchmarkConfig, run_benchmark_async

async def main():
    config = BenchmarkConfig(url="http://localhost:8000/v1", model="llama-3")
    report = await run_benchmark_async(config)
    print(f"Peak throughput: {report.best_throughput().tokens_per_second:.1f} tok/s")

asyncio.run(main())

Compare endpoints

Find out whether vLLM, TGI, or Ollama is faster for your model and hardware:

Comparison

# Benchmark each endpoint separately
infermark run http://gpu1:8000/v1 --model llama-3 -o vllm.json
infermark run http://gpu2:8080/v1 --model llama-3 -o tgi.json
infermark run http://gpu3:11434/v1 --model llama-3 -o ollama.json

# Side-by-side comparison
infermark compare vllm.json tgi.json ollama.json

Export formats

# JSON (for programmatic analysis)
infermark run http://localhost:8000/v1 -o report.json

# Markdown (paste into docs/PRs)
infermark run http://localhost:8000/v1 --markdown report.md

Configuration

BenchmarkConfig(
    url="http://localhost:8000/v1",     # Any OpenAI-compatible endpoint
    model="meta-llama/Llama-3-70B",     # Model name
    prompt="Explain relativity.",        # Prompt to send
    max_tokens=256,                      # Max output tokens per request
    concurrency_levels=[1, 4, 8, 16],   # Test these concurrency levels
    n_requests=100,                      # Requests per level
    timeout=120.0,                       # Per-request timeout (seconds)
    mode=BenchmarkMode.STREAMING,        # STREAMING or NON_STREAMING
    warmup=3,                            # Warmup requests before measurement
    api_key="sk-...",                    # Optional API key
)

How it works

  1. Warmup — Sends a few requests to prime the server's KV cache and JIT compilation
  2. For each concurrency level — Fires N requests with M concurrent workers using asyncio
  3. Streaming measurement — Parses SSE chunks to measure TTFT and inter-token latency
  4. Statistics — Computes P50/P75/P90/P95/P99, mean, min, max, std from raw timings
  5. Report — Rich terminal tables, JSON, or Markdown output

Supported endpoints

Anything that speaks the OpenAI chat completions API:

See Also

Part of the stef41 LLM toolkit — open-source tools for every stage of the LLM lifecycle:

Project What it does
tokonomics Token counting & cost management for LLM APIs
datacrux Training data quality — dedup, PII, contamination
castwright Synthetic instruction data generation
datamix Dataset mixing & curriculum optimization
toksight Tokenizer analysis & comparison
trainpulse Training health monitoring
ckpt Checkpoint inspection, diffing & merging
quantbench Quantization quality analysis
modeldiff Behavioral regression testing
vibesafe AI-generated code safety scanner
injectionguard Prompt injection detection

License

Apache-2.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

infermark-0.3.0.tar.gz (39.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

infermark-0.3.0-py3-none-any.whl (27.6 kB view details)

Uploaded Python 3

File details

Details for the file infermark-0.3.0.tar.gz.

File metadata

  • Download URL: infermark-0.3.0.tar.gz
  • Upload date:
  • Size: 39.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for infermark-0.3.0.tar.gz
Algorithm Hash digest
SHA256 b40004c620c659074b8742a28f730be318982279af60c21acd382dcf77825d40
MD5 678e0e939afd41f8e7d6dbb56532d2ee
BLAKE2b-256 2f1380599336f2879c7620cba7e33429fd07f5a356b060c8c9468987975de7ae

See more details on using hashes here.

File details

Details for the file infermark-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: infermark-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 27.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for infermark-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e71b0506f64a120f1353ec96897e90cc10f24cff83c6feec06f703836f54515f
MD5 5a7f3436bf124b2d9b7b325f9d0ae8bb
BLAKE2b-256 56d728baa9db00d464453db2bfc02c7db8a872e100eadf133c6d8f760147cba5

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page