🏎️ bench-my-llm

Dead-simple LLM benchmarking CLI. Measure TTFT, TPS, latency, cost, and quality for any OpenAI-compatible API.

New here? Start with the Getting Started Guide.

PyPI version Python 3.10+ License: MIT CI

Stop guessing which model is faster. Measure it.

Point bench-my-llm at any OpenAI-compatible API and get latency, throughput, cost, and quality metrics in seconds. Compare models side by side. Get a beautiful terminal report. Ship with confidence.

✨ Features

  • 🔥 TTFT Measurement - Time to first token via streaming
  • ⚡ Tokens per Second - Real throughput numbers
  • 📊 p50 / p95 / p99 Latencies - Production-grade percentiles
  • 💰 Cost Estimation - Know what you're spending
  • 🎯 Quality Scoring - Compare responses against reference answers
  • 🏁 Model Comparison - Side-by-side with winner highlights
  • 📦 Built-in Prompt Suites - Reasoning, coding, creative, factual
  • 🔌 Any OpenAI-compatible API - OpenAI, Anthropic, Ollama, vLLM, Together, and more
  • 💾 Export to JSON - Pipe into CI, dashboards, or your own tools

🚀 Quick Start

pip install bench-my-llm
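
bench-my-llm reads credentials from the environment. The CI example below sets OPENAI_API_KEY, so exporting it locally should work the same way:

export OPENAI_API_KEY="sk-..."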

Single Model Benchmark

bench-my-llm run --model gpt-4o --suite reasoning
┌──────────────────────────────────────────────────────────┐
│  🏎️  Benchmark Report                                    │
│  bench-my-llm results for gpt-4o                         │
│  Suite: reasoning | Prompts: 5 | Cost: $0.0043           │
└──────────────────────────────────────────────────────────┘

          Latency Summary
┌────────┬────────────┬────────────────────┐
│ Metric │ TTFT (ms)  │ Total Latency (ms) │
├────────┼────────────┼────────────────────┤
│ p50    │ 234.1      │ 1,523.4            │
│ p95    │ 312.7      │ 2,187.9            │
│ p99    │ 348.2      │ 2,401.3            │
│ Mean   │ 251.3      │ 1,687.2            │
└────────┴────────────┴────────────────────┘

       Throughput & Quality
┌───────────────────┬─────────────┐
│ Metric            │ Value       │
├───────────────────┼─────────────┤
│ Mean TPS          │ 67.3 tok/s  │
│ Median TPS        │ 64.8 tok/s  │
│ Quality Score     │ 82%         │
│ Estimated Cost    │ $0.0043     │
└───────────────────┴─────────────┘

Model Comparison

bench-my-llm compare gpt-4o gpt-4o-mini --suite reasoning
┌──────────────────────────────────────────────────────────┐
│  🏁 Model Comparison                                     │
│  gpt-4o vs gpt-4o-mini                                   │
└──────────────────────────────────────────────────────────┘

              Head-to-Head
┌────────────────────────┬─────────┬─────────────┐
│ Metric                 │ gpt-4o  │ gpt-4o-mini │
├────────────────────────┼─────────┼─────────────┤
│ TTFT p50 (ms)          │ 234.1   │ 142.3 🏆    │
│ TTFT p95 (ms)          │ 312.7   │ 198.4 🏆    │
│ Total Latency p50 (ms) │ 1,523.4 │ 876.2 🏆    │
│ Mean TPS               │ 67.3 🏆 │ 54.1        │
│ Cost (USD)             │ $0.0043 │ $0.0008 🏆  │
│ Quality Score          │ 0.82 🏆 │ 0.71        │
└────────────────────────┴─────────┴─────────────┘

🏆 Winner: gpt-4o-mini (4/6 metrics)

📖 Usage

Custom Prompts

Pass your own prompts file (JSON array):

[
  {"text": "Explain quantum computing", "category": "factual", "reference": "...", "max_tokens": 256}
]
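
Then point a run at the file. This page doesn't document the flag for custom prompt files, so --prompts below is an assumption; verify with bench-my-llm run --help:

# --prompts is a hypothetical flag name; confirm with `bench-my-llm run --help`
bench-my-llm run --model gpt-4o --prompts my_prompts.json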

Prompt Suites

Suite      Description                       Prompts
reasoning  Logic, math, step-by-step         5
coding     Code generation and explanation   5
creative   Writing, storytelling, metaphors  5
factual    Knowledge recall, definitions     5
all        Everything combined               20

Export Results

bench-my-llm run --model gpt-4o --suite all --output results.json
bench-my-llm report results.json
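
The export schema isn't documented on this page, so the field path below is an assumption — open results.json once to confirm the real key names. With jq, a single metric can then feed a dashboard or CI check:

# .metrics.mean_tps is a hypothetical path; adjust to your results.json
jq '.metrics.mean_tps' results.json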

Local Models (Ollama)

bench-my-llm run --model llama3 --base-url http://localhost:11434/v1 --api-key ollama
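
The same two flags work for any other OpenAI-compatible server. For example, against a local vLLM instance (vLLM serves an OpenAI-compatible API on port 8000 by default; the model name here is illustrative and must match what the server is serving):

bench-my-llm run --model meta-llama/Llama-3.1-8B-Instruct --base-url http://localhost:8000/v1 --api-key dummy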

CI Integration

Add to your GitHub Actions workflow:

- name: Benchmark LLM
  run: |
    pip install bench-my-llm
    bench-my-llm run --model gpt-4o-mini --suite reasoning --output benchmark.json
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

- name: Upload results
  uses: actions/upload-artifact@v4
  with:
    name: benchmark-results
    path: benchmark.json
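
To fail the build on a regression, a gate step can read the exported JSON. The field path and threshold here are assumptions — adjust them to match your benchmark.json:

- name: Gate on throughput
  run: |
    # Field path is an assumption about the export schema; inspect benchmark.json first.
    # jq -e exits non-zero when the expression is false, which fails the job.
    jq -e '.metrics.mean_tps >= 40' benchmark.json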

๐Ÿ› ๏ธ Development

git clone https://github.com/manasvardhan/bench-my-llm.git
cd bench-my-llm
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest

📄 License

MIT. See LICENSE.
