
llm-bench

Benchmark any LLM against your actual prompts. Compare OpenAI, Anthropic, Gemini, Mistral, Groq — latency, cost, quality, side by side.

pip install llm-benchmarker

Quick Start

# Install with the providers you need
pip install "llm-benchmarker[openai,anthropic,gemini]"

# Set your API keys
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export GEMINI_API_KEY=AIza...

# Run a benchmark
llm-bench run \
  --prompt "Classify the sentiment: 'I love this product!'" \
  --models gpt-4o,claude-3-5-sonnet,gemini-2.0-flash

Output:

llm-bench results — 3 runs × 1 prompt(s)

  Model                     Provider    p50 (ms)  p95 (ms)  Avg tokens ↑↓   $/1k req  Errors
  ────────────────────────────────────────────────────────────────────────────────────────────
  gemini-2.0-flash          gemini          312       445       15↑  18↓     $0.003      —
  gpt-4o                    openai          487       623       15↑  22↓     $0.059      —
  claude-3-5-sonnet         anthropic       891      1204       15↑  31↓     $0.121      —
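The "$/1k req" column is derived from average token counts and per-token pricing. A minimal sketch of that arithmetic (the helper name and the prices are illustrative, not real provider pricing):

```python
def cost_per_1k_requests(avg_in_tokens, avg_out_tokens,
                         price_in_per_m_usd, price_out_per_m_usd):
    """Cost per 1,000 requests, given average token usage and
    per-million-token prices (placeholder values, not real pricing)."""
    per_request = (avg_in_tokens * price_in_per_m_usd
                   + avg_out_tokens * price_out_per_m_usd) / 1e6
    return per_request * 1000

# e.g. 15 input / 22 output tokens at $2.50 / $10.00 per million tokens
print(f"${cost_per_1k_requests(15, 22, 2.50, 10.00):.4f}")  # $0.2575
```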

Installation

# Core only
pip install llm-benchmarker

# With specific providers
pip install "llm-benchmarker[openai]"
pip install "llm-benchmarker[anthropic]"
pip install "llm-benchmarker[gemini]"
pip install "llm-benchmarker[mistral]"
pip install "llm-benchmarker[groq]"

# All providers
pip install "llm-benchmarker[all]"

Usage

CLI

# Single prompt, multiple models
llm-bench run \
  --prompt "Write a Python function to reverse a string" \
  --models gpt-4o-mini,claude-3-5-haiku,gemini-2.0-flash

# Multiple prompts
llm-bench run \
  --prompt "What is 2+2?" \
  --prompt "Explain quantum entanglement simply." \
  --models gpt-4o,claude-3-5-sonnet

# With quality scoring (LLM-as-judge)
llm-bench run \
  --prompt "Summarize the French Revolution in 3 sentences." \
  --models gpt-4o,claude-3-5-sonnet \
  --judge gpt-4o-mini

# Save to JSON (for CI/CD)
llm-bench run \
  --prompt "Hello" \
  --models gpt-4o \
  --json > results.json

# Use a YAML config for complex benchmarks
llm-bench run --config benchmark.yaml

YAML Config

Generate a starter config:

llm-bench init

Or create benchmark.yaml:

models:
  - gpt-4o
  - claude-3-5-sonnet
  - gemini-2.0-flash
  - llama-3.3-70b-versatile  # Groq

prompts:
  - text: "Classify sentiment: 'Great product, fast shipping!'"
    name: positive_sentiment

  - text: "Debug this: def fib(n): return fib(n-1) + fib(n-2)"
    name: code_debug

# Optional: score response quality with a judge model
judge_model: gpt-4o-mini

n_runs: 5
temperature: 0.0
max_tokens: 512
output: results.json

Run it:

llm-bench run --config benchmark.yaml

JSON Output (CI/CD)

llm-bench run --config benchmark.yaml --json

Output schema:

{
  "timestamp": "2026-03-27T03:00:00Z",
  "duration_seconds": 12.4,
  "config": {
    "models": ["gpt-4o", "claude-3-5-sonnet"],
    "n_prompts": 2,
    "n_runs": 3
  },
  "results": {
    "gpt-4o": {
      "model": "gpt-4o",
      "provider": "openai",
      "n_success": 6,
      "n_errors": 0,
      "latency": {
        "p50_ms": 487.2,
        "p95_ms": 623.1,
        "mean_ms": 501.4
      },
      "tokens": {
        "avg_input": 15.0,
        "avg_output": 22.3
      },
      "cost_per_1k_requests_usd": 0.059,
      "quality_score": 8.4
    }
  }
}
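The schema above lends itself to simple CI gates. A minimal sketch that fails a build when any model's p95 latency exceeds a budget (field names are taken from the schema shown; the sample data and threshold are examples, and in CI you would `json.load` the real `results.json`):

```python
import json

# Sample document in the schema shown above; in CI, replace with:
#   results_doc = json.load(open("results.json"))
results_doc = {
    "results": {
        "gpt-4o": {
            "n_errors": 0,
            "latency": {"p50_ms": 487.2, "p95_ms": 623.1, "mean_ms": 501.4},
            "cost_per_1k_requests_usd": 0.059,
        }
    }
}

def check_latency_budget(doc, budget_ms):
    """Return the models whose p95 latency exceeds the budget."""
    return [
        model
        for model, metrics in doc["results"].items()
        if metrics["latency"]["p95_ms"] > budget_ms
    ]

violations = check_latency_budget(results_doc, budget_ms=1000.0)
print(violations)  # [] — 623.1 ms is under the 1000 ms budget
```

Exiting non-zero when `violations` is non-empty turns this into a regression gate in any CI system.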

Supported Models

  Provider    Models
  ────────────────────────────────────────────────────────────────────────────────────────────
  OpenAI      gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo, o1, o1-mini, o3-mini
  Anthropic   claude-opus-4, claude-sonnet-4, claude-3-5-sonnet, claude-3-5-haiku, claude-3-haiku
  Gemini      gemini-2.5-pro, gemini-2.0-flash, gemini-2.0-flash-lite, gemini-1.5-pro, gemini-1.5-flash
  Mistral     mistral-large, mistral-small, codestral, mixtral-8x7b
  Groq        llama-3.3-70b-versatile, llama-3.1-8b-instant, mixtral-8x7b-32768, gemma2-9b-it

See all: llm-bench list-models


Environment Variables

  Variable                          Provider
  ──────────────────────────────────────────
  OPENAI_API_KEY                    OpenAI
  ANTHROPIC_API_KEY                 Anthropic
  GEMINI_API_KEY or GOOGLE_API_KEY  Google Gemini
  MISTRAL_API_KEY                   Mistral
  GROQ_API_KEY                      Groq
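Before a run, it can be useful to check which providers are actually configured. A small sketch (the mapping mirrors the table above; the helper itself is hypothetical, not part of the package):

```python
import os

# Provider -> accepted environment variables, per the table above
PROVIDER_KEYS = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "gemini": ["GEMINI_API_KEY", "GOOGLE_API_KEY"],
    "mistral": ["MISTRAL_API_KEY"],
    "groq": ["GROQ_API_KEY"],
}

def configured_providers(env=os.environ):
    """Providers with at least one of their API keys set."""
    return [p for p, keys in PROVIDER_KEYS.items()
            if any(k in env for k in keys)]
```

For example, with only `GOOGLE_API_KEY` set, `configured_providers` reports `["gemini"]`.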

Options Reference

llm-bench run [OPTIONS]

  --prompt, -p TEXT        Prompt to benchmark (repeatable)
  --models, -m TEXT        Comma-separated model list
  --config, -c PATH        YAML config file
  --runs, -n INTEGER       Runs per prompt per model [default: 3]
  --temperature, -t FLOAT  Sampling temperature [default: 0.0]
  --max-tokens INTEGER     Max output tokens [default: 1024]
  --judge TEXT             Judge model for quality scoring
  --system, -s TEXT        System prompt
  --output, -o PATH        Save results to JSON file
  --json                   Output JSON to stdout
  --concurrency INTEGER    Max concurrent requests [default: 5]
  --timeout FLOAT          Request timeout in seconds [default: 60.0]
  --verbose, -v            Show progress

Python API

import asyncio
from llm_bench.benchmark import BenchmarkConfig, run_benchmark
from llm_bench.reporter import print_results_table

config = BenchmarkConfig(
    models=["gpt-4o", "claude-3-5-sonnet"],
    prompts=["What is the meaning of life?"],
    n_runs=3,
    judge_model="gpt-4o-mini",
)

result = asyncio.run(run_benchmark(config))
print_results_table(result)

# Access raw data
for model, metrics in result.metrics.items():
    print(f"{model}: p50={metrics.latency_p50_ms:.0f}ms, quality={metrics.quality_score}")

Contributing

Contributions welcome! See CONTRIBUTING.md.

git clone https://github.com/midnightrunai/llm-bench
cd llm-bench
pip install -e ".[dev]"
pytest

License

MIT © Midnight Run
