# 🏎️ bench-my-llm
New here? Start with the Getting Started Guide.
Stop guessing which model is faster. Measure it.
Point bench-my-llm at any OpenAI-compatible API and get latency, throughput, cost, and quality metrics in seconds. Compare models side by side. Get a beautiful terminal report. Ship with confidence.
## ✨ Features

- 🔥 TTFT Measurement - Time to first token via streaming (see the sketch after this list)
- ⚡ Tokens per Second - Real throughput numbers
- 📊 p50 / p95 / p99 Latencies - Production-grade percentiles
- 💰 Cost Estimation - Know what you're spending
- 🎯 Quality Scoring - Compare responses against reference answers
- 🏆 Model Comparison - Side-by-side with winner highlights
- 📦 Built-in Prompt Suites - Reasoning, coding, creative, factual
- 🔌 Any OpenAI-compatible API - OpenAI, Anthropic, Ollama, vLLM, Together, and more
- 💾 Export to JSON - Pipe into CI, dashboards, or your own tools
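Both streaming metrics follow the same basic recipe: open a streamed completion, timestamp the first content chunk (TTFT), then divide the chunk count by the decode time (TPS). The sketch below illustrates that technique with the `openai` Python client against any OpenAI-compatible endpoint; it is a minimal illustration, not bench-my-llm's internal implementation, and it approximates tokens by stream chunks.

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY; pass base_url=... for other providers

start = time.perf_counter()
first_token_at = None
n_chunks = 0

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
)
for chunk in stream:
    # Skip role-only or empty chunks so TTFT marks real content.
    if not chunk.choices or not chunk.choices[0].delta.content:
        continue
    if first_token_at is None:
        first_token_at = time.perf_counter()
    n_chunks += 1
end = time.perf_counter()

if first_token_at is None:
    raise SystemExit("no content received")
print(f"TTFT: {(first_token_at - start) * 1000:.1f} ms")
# Chunks roughly approximate tokens; the decode rate excludes the TTFT wait.
print(f"approx TPS: {n_chunks / (end - first_token_at):.1f}")
```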
## 🚀 Quick Start

```bash
pip install bench-my-llm
```
### Single Model Benchmark

```bash
bench-my-llm run --model gpt-4o --suite reasoning
```

```text
╭───────────────────────────────────────────────╮
│ 🏎️ Benchmark Report                           │
│ bench-my-llm results for gpt-4o               │
│ Suite: reasoning | Prompts: 5 | Cost: $0.0043 │
╰───────────────────────────────────────────────╯

Latency Summary
┌────────┬───────────┬────────────────────┐
│ Metric │ TTFT (ms) │ Total Latency (ms) │
├────────┼───────────┼────────────────────┤
│ p50    │     234.1 │            1,523.4 │
│ p95    │     312.7 │            2,187.9 │
│ p99    │     348.2 │            2,401.3 │
│ Mean   │     251.3 │            1,687.2 │
└────────┴───────────┴────────────────────┘

Throughput & Quality
┌────────────────┬────────────┐
│ Metric         │ Value      │
├────────────────┼────────────┤
│ Mean TPS       │ 67.3 tok/s │
│ Median TPS     │ 64.8 tok/s │
│ Quality Score  │ 82%        │
│ Estimated Cost │ $0.0043    │
└────────────────┴────────────┘
```
### Model Comparison

```bash
bench-my-llm compare gpt-4o gpt-4o-mini --suite reasoning
```

```text
╭───────────────────────╮
│ 🏆 Model Comparison   │
│ gpt-4o vs gpt-4o-mini │
╰───────────────────────╯

Head-to-Head
┌────────────────────────┬───────────┬─────────────┐
│ Metric                 │ gpt-4o    │ gpt-4o-mini │
├────────────────────────┼───────────┼─────────────┤
│ TTFT p50 (ms)          │ 234.1     │ 142.3 🏆    │
│ TTFT p95 (ms)          │ 312.7     │ 198.4 🏆    │
│ Total Latency p50 (ms) │ 1523.4    │ 876.2 🏆    │
│ Mean TPS               │ 67.3 🏆   │ 54.1        │
│ Cost (USD)             │ $0.0043   │ $0.0008 🏆  │
│ Quality Score          │ 0.82 🏆   │ 0.71        │
└────────────────────────┴───────────┴─────────────┘

🏆 Winner: gpt-4o-mini (4/6 metrics)
```
## 📖 Usage

### Custom Prompts
Pass your own prompts file (JSON array):
```json
[
  {"text": "Explain quantum computing", "category": "factual", "reference": "...", "max_tokens": 256}
]
```
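Then point a run at the file. The flag name below is a guess at the option, so confirm it with `bench-my-llm run --help`:

```bash
# --prompts is an assumed flag name; verify with the CLI help
bench-my-llm run --model gpt-4o --prompts my_prompts.json
```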
### Prompt Suites

| Suite | Description | Prompts |
|---|---|---|
| `reasoning` | Logic, math, step-by-step | 5 |
| `coding` | Code generation and explanation | 5 |
| `creative` | Writing, storytelling, metaphors | 5 |
| `factual` | Knowledge recall, definitions | 5 |
| `all` | Everything combined | 20 |
### Export Results

```bash
bench-my-llm run --model gpt-4o --suite all --output results.json
bench-my-llm report results.json
```
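The exported JSON is what you pipe into CI or dashboards. Its exact schema isn't documented here, so the field names in this checker sketch are assumptions; inspect a real results.json and adjust the keys:

```python
import json
import sys

# Hypothetical keys below: open an actual results.json to find the real ones.
path = sys.argv[1] if len(sys.argv) > 1 else "results.json"
with open(path) as f:
    results = json.load(f)

p95_ms = results["latency"]["p95_ms"]          # assumed key
mean_tps = results["throughput"]["mean_tps"]   # assumed key

print(f"p95 latency: {p95_ms} ms, mean TPS: {mean_tps}")
if p95_ms > 2500:
    sys.exit(f"p95 latency regression: {p95_ms} ms exceeds the 2500 ms budget")
```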
### Local Models (Ollama)

```bash
bench-my-llm run --model llama3 --base-url http://localhost:11434/v1 --api-key ollama
```
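The same `--base-url` pattern should work against any other OpenAI-compatible server. For example, a local vLLM instance serves an OpenAI-compatible API on port 8000 by default (use whatever model name your server actually loaded):

```bash
bench-my-llm run --model meta-llama/Meta-Llama-3-8B-Instruct --base-url http://localhost:8000/v1 --api-key none
```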
### CI Integration
Add to your GitHub Actions workflow:
```yaml
- name: Benchmark LLM
  run: |
    pip install bench-my-llm
    bench-my-llm run --model gpt-4o-mini --suite reasoning --output benchmark.json
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
- name: Upload results
  uses: actions/upload-artifact@v4
  with:
    name: benchmark-results
    path: benchmark.json
```
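To fail the build on a regression, add a gating step that inspects the exported JSON, for example by running a small checker like the one sketched under Export Results (the script path here is hypothetical; commit your own):

```yaml
- name: Gate on benchmark results
  run: python scripts/check_benchmark.py benchmark.json  # hypothetical checker script
```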
## 🛠️ Development

```bash
git clone https://github.com/manasvardhan/bench-my-llm.git
cd bench-my-llm
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest
```
## 📄 License
MIT. See LICENSE.