
Dead-simple LLM benchmarking CLI. Measure TTFT, TPS, latency, cost, and quality for any OpenAI-compatible API.


๐ŸŽ๏ธ bench-my-llm

New here? Start with the Getting Started Guide.

PyPI version · Python 3.10+ · License: MIT · CI

Stop guessing which model is faster. Measure it.

Point bench-my-llm at any OpenAI-compatible API and get latency, throughput, cost, and quality metrics in seconds. Compare models side by side. Get a beautiful terminal report. Ship with confidence.

✨ Features

  • 🔥 TTFT Measurement - Time to first token via streaming (see the sketch below this list)
  • ⚡ Tokens per Second - Real throughput numbers
  • 📊 p50 / p95 / p99 Latencies - Production-grade percentiles
  • 💰 Cost Estimation - Know what you're spending
  • 🎯 Quality Scoring - Compare responses against reference answers
  • 🏁 Model Comparison - Side-by-side with winner highlights
  • 📦 Built-in Prompt Suites - Reasoning, coding, creative, factual
  • 🔌 Any OpenAI-compatible API - OpenAI, Anthropic, Ollama, vLLM, Together, and more
  • 💾 Export to JSON - Pipe into CI, dashboards, or your own tools
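
For intuition, TTFT and TPS both fall out of a single streaming request: note the clock when the first content chunk arrives, then count chunks until the stream closes. Here is a minimal sketch of that technique using the official openai Python client; it illustrates the idea only and is not bench-my-llm's actual implementation (the model, the prompt, and the chunk-per-token approximation are placeholders):

# Sketch: derive TTFT and TPS from one streaming chat completion.
# Illustrates the technique only -- not bench-my-llm's internals.
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

start = time.perf_counter()
ttft = None
chunks = 0

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        chunks += 1  # each content chunk carries roughly one token

total = time.perf_counter() - start
tps = chunks / (total - ttft)  # throughput after the first token
print(f"TTFT: {ttft * 1000:.1f} ms | ~{tps:.1f} tok/s")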

🚀 Quick Start

pip install bench-my-llm

Single Model Benchmark

bench-my-llm run --model gpt-4o --suite reasoning
┌──────────────────────────────────────────────────────────┐
│  🏎️  Benchmark Report                                    │
│  bench-my-llm results for gpt-4o                         │
│  Suite: reasoning | Prompts: 5 | Cost: $0.0043           │
└──────────────────────────────────────────────────────────┘

          Latency Summary
┌────────┬────────────┬────────────────────┐
│ Metric │ TTFT (ms)  │ Total Latency (ms) │
├────────┼────────────┼────────────────────┤
│ p50    │ 234.1      │ 1,523.4            │
│ p95    │ 312.7      │ 2,187.9            │
│ p99    │ 348.2      │ 2,401.3            │
│ Mean   │ 251.3      │ 1,687.2            │
└────────┴────────────┴────────────────────┘

       Throughput & Quality
┌───────────────────┬─────────────┐
│ Metric            │ Value       │
├───────────────────┼─────────────┤
│ Mean TPS          │ 67.3 tok/s  │
│ Median TPS        │ 64.8 tok/s  │
│ Quality Score     │ 82%         │
│ Estimated Cost    │ $0.0043     │
└───────────────────┴─────────────┘

Model Comparison

bench-my-llm compare gpt-4o gpt-4o-mini --suite reasoning
┌──────────────────────────────────────────────────────────┐
│  🏁 Model Comparison                                     │
│  gpt-4o vs gpt-4o-mini                                   │
└──────────────────────────────────────────────────────────┘

              Head-to-Head
┌────────────────────────┬─────────┬─────────────┐
│ Metric                 │ gpt-4o  │ gpt-4o-mini │
├────────────────────────┼─────────┼─────────────┤
│ TTFT p50 (ms)          │ 234.1   │ 142.3  🏆   │
│ TTFT p95 (ms)          │ 312.7   │ 198.4  🏆   │
│ Total Latency p50 (ms) │ 1523.4  │ 876.2  🏆   │
│ Mean TPS               │ 67.3 🏆 │ 54.1        │
│ Cost (USD)             │ $0.0043 │ $0.0008 🏆  │
│ Quality Score          │ 0.82 🏆 │ 0.71        │
└────────────────────────┴─────────┴─────────────┘

๐Ÿ† Winner: gpt-4o-mini (4/6 metrics)

📖 Usage

Custom Prompts

Pass your own prompts file (JSON array):

[
  {"text": "Explain quantum computing", "category": "factual", "reference": "...", "max_tokens": 256}
]
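
The file is then supplied when invoking the benchmark. The --prompts flag name below is an assumption for illustration; check bench-my-llm run --help for the exact option:

bench-my-llm run --model gpt-4o --prompts my_prompts.json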

Prompt Suites

Suite      Description                       Prompts
reasoning  Logic, math, step-by-step         5
coding     Code generation and explanation   5
creative   Writing, storytelling, metaphors  5
factual    Knowledge recall, definitions     5
all        Everything combined               20

Export Results

bench-my-llm run --model gpt-4o --suite all --output results.json
bench-my-llm report results.json
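
Because the output is plain JSON, a CI job can gate on it. The sketch below is hypothetical: the key names and layout (ttft_ms, p95, quality_score) are assumptions for illustration, since the schema is not documented in this README; inspect a real results.json before relying on them.

# Hypothetical CI gate over results.json. The key names and layout
# ("ttft_ms" -> "p95", "quality_score") are ASSUMPTIONS for
# illustration; check a real results.json for the actual schema.
import json
import sys

with open("results.json") as f:
    results = json.load(f)

p95_ttft = results["ttft_ms"]["p95"]  # assumed layout
quality = results["quality_score"]    # assumed key

if p95_ttft > 500 or quality < 0.75:
    print(f"Regression: p95 TTFT={p95_ttft:.1f} ms, quality={quality:.2f}")
    sys.exit(1)
print("Benchmark within thresholds")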

Local Models (Ollama)

bench-my-llm run --model llama3 --base-url http://localhost:11434/v1 --api-key ollama
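
The same two flags work for any other OpenAI-compatible server, for example a local vLLM instance (vLLM's OpenAI-compatible server listens on port 8000 by default; the model name must match whatever the server is actually serving, and the dummy API key mirrors the Ollama example above):

bench-my-llm run --model meta-llama/Meta-Llama-3-8B-Instruct --base-url http://localhost:8000/v1 --api-key none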

CI Integration

Add to your GitHub Actions workflow:

- name: Benchmark LLM
  run: |
    pip install bench-my-llm
    bench-my-llm run --model gpt-4o-mini --suite reasoning --output benchmark.json
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

- name: Upload results
  uses: actions/upload-artifact@v4
  with:
    name: benchmark-results
    path: benchmark.json

🛠️ Development

git clone https://github.com/manasvardhan/bench-my-llm.git
cd bench-my-llm
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest

📄 License

MIT. See LICENSE.
