Benchmark any LLM provider against your actual prompts — latency, cost, quality
llm-bench
Benchmark any LLM against your actual prompts. Compare OpenAI, Anthropic, Gemini, Mistral, Groq — latency, cost, quality, side by side.
pip install llm-benchmarker
Quick Start
# Install with the providers you need
pip install "llm-benchmarker[openai,anthropic,gemini]"
# Set your API keys
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export GEMINI_API_KEY=AIza...
# Run a benchmark
llm-bench run \
--prompt "Classify the sentiment: 'I love this product!'" \
--models gpt-4o,claude-3-5-sonnet,gemini-2.0-flash
Output:
llm-bench results — 3 runs × 1 prompt(s)
Model              Provider   p50 (ms)  p95 (ms)  Avg tokens ↑↓  $/1k req  Errors
─────────────────────────────────────────────────────────────────────────────────
gemini-2.0-flash   gemini          312       445  15↑ 18↓        $0.003    —
gpt-4o             openai          487       623  15↑ 22↓        $0.059    —
claude-3-5-sonnet  anthropic       891      1204  15↑ 31↓        $0.121    —
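To make the columns above concrete, here is a minimal sketch of how p50/p95 latency and a cost-per-1k-requests figure could be derived from raw samples. It is not taken from the llm-bench source: the interpolation method, the helper names `percentile` and `cost_per_1k_requests`, and the per-million-token prices are all assumptions chosen for illustration.

```python
def percentile(samples, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (len(ordered) - 1) * q / 100
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def cost_per_1k_requests(avg_in, avg_out, usd_per_m_in, usd_per_m_out):
    """Cost of 1,000 requests given average token counts and per-million-token prices."""
    per_request = (avg_in * usd_per_m_in + avg_out * usd_per_m_out) / 1_000_000
    return per_request * 1000


# Hypothetical latency samples in milliseconds.
latencies_ms = [300.0, 310.0, 320.0, 450.0]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)

# Hypothetical prices (USD per million tokens); real provider prices differ.
cost = cost_per_1k_requests(avg_in=15, avg_out=22, usd_per_m_in=0.50, usd_per_m_out=1.50)

print(f"p50={p50:.1f}ms p95={p95:.1f}ms cost=${cost:.4f}/1k req")
```

With only a few runs per prompt, p95 sits close to the slowest sample, so raising the run count gives a steadier tail estimate.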
Installation
# Core only
pip install llm-benchmarker
# With specific providers
pip install "llm-benchmarker[openai]"
pip install "llm-benchmarker[anthropic]"
pip install "llm-benchmarker[gemini]"
pip install "llm-benchmarker[mistral]"
pip install "llm-benchmarker[groq]"
# All providers
pip install "llm-benchmarker[all]"
Usage
CLI
# Single prompt, multiple models
llm-bench run \
--prompt "Write a Python function to reverse a string" \
--models gpt-4o-mini,claude-3-5-haiku,gemini-2.0-flash
# Multiple prompts
llm-bench run \
--prompt "What is 2+2?" \
--prompt "Explain quantum entanglement simply." \
--models gpt-4o,claude-3-5-sonnet
# With quality scoring (LLM-as-judge)
llm-bench run \
--prompt "Summarize the French Revolution in 3 sentences." \
--models gpt-4o,claude-3-5-sonnet \
--judge gpt-4o-mini
# Save to JSON (for CI/CD)
llm-bench run \
--prompt "Hello" \
--models gpt-4o \
--json > results.json
# Use a YAML config for complex benchmarks
llm-bench run --config benchmark.yaml
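For readers unfamiliar with the LLM-as-judge pattern behind `--judge`, the sketch below shows one common shape: a second model is asked to rate each response, and the numeric reply is parsed into a score. This is an illustration only. The prompt wording, the 1 to 10 scale, and the helper names `build_judge_prompt` and `parse_score` are assumptions, not llm-bench internals.

```python
import re

# Hypothetical judge prompt; the wording and the 1-10 scale are assumptions.
JUDGE_TEMPLATE = (
    "Rate the response to the prompt on a scale from 1 to 10.\n"
    "Reply with the number only.\n\n"
    "Prompt: {prompt}\n\nResponse: {response}"
)


def build_judge_prompt(prompt, response):
    """Fill the judge template with the original prompt and the model's answer."""
    return JUDGE_TEMPLATE.format(prompt=prompt, response=response)


def parse_score(reply):
    """Extract the first number from the judge's reply; None if absent or out of range."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    score = float(match.group())
    return score if 1.0 <= score <= 10.0 else None


judge_prompt = build_judge_prompt("What is 2+2?", "4")
print(parse_score("Score: 7.5/10"))
```

Parsing defensively matters because judge models do not always follow the "number only" instruction.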
YAML Config
Generate a starter config:
llm-bench init
Or create benchmark.yaml:
models:
  - gpt-4o
  - claude-3-5-sonnet
  - gemini-2.0-flash
  - llama-3.3-70b-versatile  # Groq
prompts:
  - text: "Classify sentiment: 'Great product, fast shipping!'"
    name: positive_sentiment
  - text: "Debug this: def fib(n): return fib(n-1) + fib(n-2)"
    name: code_debug
# Optional: score response quality with a judge model
judge_model: gpt-4o-mini
n_runs: 5
temperature: 0.0
max_tokens: 512
output: results.json
Run it:
llm-bench run --config benchmark.yaml
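Before running a config like the one above, it can help to estimate how many API calls it will trigger. A rough sketch follows, based on the documented "runs per prompt per model" behaviour. The assumption that the judge is called once per benchmark response is mine, not something the README states, and `planned_requests` is a hypothetical helper.

```python
def planned_requests(n_models, n_prompts, n_runs, use_judge=False):
    """Return (benchmark_calls, judge_calls) for a benchmark configuration."""
    benchmark_calls = n_models * n_prompts * n_runs
    # Assumption: one judge call per benchmark response.
    judge_calls = benchmark_calls if use_judge else 0
    return benchmark_calls, judge_calls


# The sample benchmark.yaml: 4 models, 2 prompts, n_runs: 5, judge enabled.
calls, judge_calls = planned_requests(4, 2, 5, use_judge=True)
print(f"{calls} benchmark calls, {judge_calls} judge calls")
```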
JSON Output (CI/CD)
llm-bench run --config benchmark.yaml --json
Output schema:
{
  "timestamp": "2026-03-27T03:00:00Z",
  "duration_seconds": 12.4,
  "config": {
    "models": ["gpt-4o", "claude-3-5-sonnet"],
    "n_prompts": 2,
    "n_runs": 3
  },
  "results": {
    "gpt-4o": {
      "model": "gpt-4o",
      "provider": "openai",
      "n_success": 6,
      "n_errors": 0,
      "latency": {
        "p50_ms": 487.2,
        "p95_ms": 623.1,
        "mean_ms": 501.4
      },
      "tokens": {
        "avg_input": 15.0,
        "avg_output": 22.3
      },
      "cost_per_1k_requests_usd": 0.059,
      "quality_score": 8.4
    }
  }
}
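One way to use this output in CI is to gate a pipeline on it. The sketch below checks each model's `p95_ms` and `n_errors` against a latency budget, using only field names from the schema above. The function `check_report` and the threshold values are hypothetical, and the report is embedded inline so the example runs on its own.

```python
import json


def check_report(report, max_p95_ms):
    """Return a list of human-readable failures; an empty list means the gate passes."""
    failures = []
    for name, metrics in report["results"].items():
        if metrics["n_errors"] > 0:
            failures.append(f"{name}: {metrics['n_errors']} failed request(s)")
        p95 = metrics["latency"]["p95_ms"]
        if p95 > max_p95_ms:
            failures.append(f"{name}: p95 {p95:.0f}ms exceeds {max_p95_ms:.0f}ms")
    return failures


# Trimmed copy of the sample report shown above.
sample = json.loads("""
{
  "results": {
    "gpt-4o": {
      "n_success": 6,
      "n_errors": 0,
      "latency": {"p50_ms": 487.2, "p95_ms": 623.1, "mean_ms": 501.4}
    }
  }
}
""")

strict = check_report(sample, max_p95_ms=600)
relaxed = check_report(sample, max_p95_ms=1000)
print(strict or "ok")
```

In a real pipeline you would load `results.json` with `json.load` and call `sys.exit(1)` when the returned list is non-empty.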
Supported Models
| Provider | Models |
|---|---|
| OpenAI | gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo, o1, o1-mini, o3-mini |
| Anthropic | claude-opus-4, claude-sonnet-4, claude-3-5-sonnet, claude-3-5-haiku, claude-3-haiku |
| Gemini | gemini-2.5-pro, gemini-2.0-flash, gemini-2.0-flash-lite, gemini-1.5-pro, gemini-1.5-flash |
| Mistral | mistral-large, mistral-small, codestral, mixtral-8x7b |
| Groq | llama-3.3-70b-versatile, llama-3.1-8b-instant, mixtral-8x7b-32768, gemma2-9b-it |
See all: llm-bench list-models
Environment Variables
| Variable | Provider |
|---|---|
| OPENAI_API_KEY | OpenAI |
| ANTHROPIC_API_KEY | Anthropic |
| GEMINI_API_KEY or GOOGLE_API_KEY | Google Gemini |
| MISTRAL_API_KEY | Mistral |
| GROQ_API_KEY | Groq |
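A quick pre-flight check can catch a missing key before any requests are sent. This sketch uses the variable names from the table above. The `PROVIDER_KEYS` mapping and `missing_providers` helper are illustrative and not part of the package.

```python
import os

# Environment variables per provider, as listed in the table above.
PROVIDER_KEYS = {
    "openai": ("OPENAI_API_KEY",),
    "anthropic": ("ANTHROPIC_API_KEY",),
    "gemini": ("GEMINI_API_KEY", "GOOGLE_API_KEY"),
    "mistral": ("MISTRAL_API_KEY",),
    "groq": ("GROQ_API_KEY",),
}


def missing_providers(providers, env=None):
    """Return the providers for which none of the accepted variables is set."""
    env = os.environ if env is None else env
    return [p for p in providers if not any(env.get(k) for k in PROVIDER_KEYS[p])]


# Example with an explicit mapping instead of the real environment.
example = missing_providers(["openai", "gemini"], env={"GOOGLE_API_KEY": "AIza-example"})
print(example)
```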
Options Reference
llm-bench run [OPTIONS]
  --prompt, -p TEXT         Prompt to benchmark (repeatable)
  --models, -m TEXT         Comma-separated model list
  --config, -c PATH         YAML config file
  --runs, -n INTEGER        Runs per prompt per model [default: 3]
  --temperature, -t FLOAT   Sampling temperature [default: 0.0]
  --max-tokens INTEGER      Max output tokens [default: 1024]
  --judge TEXT              Judge model for quality scoring
  --system, -s TEXT         System prompt
  --output, -o PATH         Save results to JSON file
  --json                    Output JSON to stdout
  --concurrency INTEGER     Max concurrent requests [default: 5]
  --timeout FLOAT           Request timeout in seconds [default: 60.0]
  --verbose, -v             Show progress
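The `--concurrency` and `--timeout` options describe a familiar asyncio pattern: cap the number of in-flight requests and give each one a deadline. The sketch below shows that pattern with the standard library only. It is an assumption about the general technique, not the tool's implementation, and `run_bounded` and `fake_request` are hypothetical names.

```python
import asyncio


async def run_bounded(tasks, concurrency=5, timeout=60.0):
    """Run zero-argument coroutine factories with a concurrency cap and per-task timeout."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(make_coro):
        async with semaphore:
            try:
                # The timeout starts once a slot is acquired, not while queued.
                return await asyncio.wait_for(make_coro(), timeout)
            except asyncio.TimeoutError:
                return None  # Treat a timed-out request as a failed run.

    return await asyncio.gather(*(guarded(t) for t in tasks))


async def fake_request(delay, value):
    """Stand-in for a provider call; sleeps instead of hitting a network."""
    await asyncio.sleep(delay)
    return value


tasks = [
    lambda: fake_request(0.01, "fast"),
    lambda: fake_request(0.5, "slow"),
]
results = asyncio.run(run_bounded(tasks, concurrency=2, timeout=0.1))
print(results)
```

Lower concurrency reduces the chance that provider rate limits distort the latency numbers, at the cost of a longer benchmark.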
Python API
import asyncio

from llm_bench.benchmark import BenchmarkConfig, run_benchmark
from llm_bench.reporter import print_results_table

config = BenchmarkConfig(
    models=["gpt-4o", "claude-3-5-sonnet"],
    prompts=["What is the meaning of life?"],
    n_runs=3,
    judge_model="gpt-4o-mini",
)

result = asyncio.run(run_benchmark(config))
print_results_table(result)

# Access raw data
for model, metrics in result.metrics.items():
    print(f"{model}: p50={metrics.latency_p50_ms:.0f}ms, quality={metrics.quality_score}")
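Once the per-model metrics are available, a typical next step is picking a model programmatically. The sketch below selects the fastest model that meets a quality floor, reading only the `latency_p50_ms` and `quality_score` attributes used above. It runs against `SimpleNamespace` stand-ins with made-up numbers instead of a real benchmark result. Treating `quality_score` as possibly `None` when no judge is configured is an assumption, and `fastest_above_quality` is a hypothetical helper.

```python
from types import SimpleNamespace


def fastest_above_quality(metrics_by_model, min_quality):
    """Return the model name with the lowest p50 latency among those meeting min_quality."""
    eligible = {
        name: m
        for name, m in metrics_by_model.items()
        # Assumption: quality_score may be None when no judge model was used.
        if m.quality_score is not None and m.quality_score >= min_quality
    }
    if not eligible:
        return None
    return min(eligible, key=lambda name: eligible[name].latency_p50_ms)


# Stand-in data shaped like the metrics objects in the example above.
stand_in = {
    "model-a": SimpleNamespace(latency_p50_ms=300.0, quality_score=6.5),
    "model-b": SimpleNamespace(latency_p50_ms=500.0, quality_score=8.4),
    "model-c": SimpleNamespace(latency_p50_ms=900.0, quality_score=9.0),
}
choice = fastest_above_quality(stand_in, min_quality=8.0)
print(choice)
```

With a real run, the same function could be called as `fastest_above_quality(result.metrics, 8.0)`.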
Contributing
Contributions welcome! See CONTRIBUTING.md.
git clone https://github.com/midnightrunai/llm-bench
cd llm-bench
pip install -e ".[dev]"
pytest
License
MIT © Midnight Run