
llm-bench

Benchmark any LLM against your actual prompts. Compare OpenAI, Anthropic, Gemini, Mistral, Groq — latency, cost, quality, side by side.

pip install llm-benchmarker

Quick Start

# Install with the providers you need
pip install "llm-benchmarker[openai,anthropic,gemini]"

# Set your API keys
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export GEMINI_API_KEY=AIza...

# Run a benchmark
llm-bench run \
  --prompt "Classify the sentiment: 'I love this product!'" \
  --models gpt-4o,claude-3-5-sonnet,gemini-2.0-flash

Output:

llm-bench results — 3 runs × 1 prompt(s)

  Model                     Provider    p50 (ms)  p95 (ms)  Avg tokens ↑↓   $/1k req  Errors
  ────────────────────────────────────────────────────────────────────────────────────────────
  gemini-2.0-flash          gemini          312       445       15↑  18↓     $0.003      —
  gpt-4o                    openai          487       623       15↑  22↓     $0.059      —
  claude-3-5-sonnet         anthropic       891      1204       15↑  31↓     $0.121      —
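The "$/1k req" column is derived from average token counts and per-token pricing. A minimal sketch of that arithmetic (the helper name and the prices are illustrative, not real provider pricing):

```python
def cost_per_1k_requests(avg_in_tokens, avg_out_tokens,
                         price_in_per_m_usd, price_out_per_m_usd):
    """Cost per 1,000 requests, given average token usage and
    per-million-token prices (placeholder values, not real pricing)."""
    per_request = (avg_in_tokens * price_in_per_m_usd
                   + avg_out_tokens * price_out_per_m_usd) / 1e6
    return per_request * 1000

# e.g. 15 input / 22 output tokens at $2.50 / $10.00 per million tokens
print(f"${cost_per_1k_requests(15, 22, 2.50, 10.00):.4f}")  # $0.2575
```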

Installation

# Core only
pip install llm-benchmarker

# With specific providers
pip install "llm-benchmarker[openai]"
pip install "llm-benchmarker[anthropic]"
pip install "llm-benchmarker[gemini]"
pip install "llm-benchmarker[mistral]"
pip install "llm-benchmarker[groq]"

# All providers
pip install "llm-benchmarker[all]"

Usage

CLI

# Single prompt, multiple models
llm-bench run \
  --prompt "Write a Python function to reverse a string" \
  --models gpt-4o-mini,claude-3-5-haiku,gemini-2.0-flash

# Multiple prompts
llm-bench run \
  --prompt "What is 2+2?" \
  --prompt "Explain quantum entanglement simply." \
  --models gpt-4o,claude-3-5-sonnet

# With quality scoring (LLM-as-judge)
llm-bench run \
  --prompt "Summarize the French Revolution in 3 sentences." \
  --models gpt-4o,claude-3-5-sonnet \
  --judge gpt-4o-mini

# Save to JSON (for CI/CD)
llm-bench run \
  --prompt "Hello" \
  --models gpt-4o \
  --json > results.json

# Use a YAML config for complex benchmarks
llm-bench run --config benchmark.yaml

YAML Config

Generate a starter config:

llm-bench init

Or create benchmark.yaml:

models:
  - gpt-4o
  - claude-3-5-sonnet
  - gemini-2.0-flash
  - llama-3.3-70b-versatile  # Groq

prompts:
  - text: "Classify sentiment: 'Great product, fast shipping!'"
    name: positive_sentiment

  - text: "Debug this: def fib(n): return fib(n-1) + fib(n-2)"
    name: code_debug

# Optional: score response quality with a judge model
judge_model: gpt-4o-mini

n_runs: 5
temperature: 0.0
max_tokens: 512
output: results.json

Run it:

llm-bench run --config benchmark.yaml

JSON Output (CI/CD)

llm-bench run --config benchmark.yaml --json

Output schema:

{
  "timestamp": "2026-03-27T03:00:00Z",
  "duration_seconds": 12.4,
  "config": {
    "models": ["gpt-4o", "claude-3-5-sonnet"],
    "n_prompts": 2,
    "n_runs": 3
  },
  "results": {
    "gpt-4o": {
      "model": "gpt-4o",
      "provider": "openai",
      "n_success": 6,
      "n_errors": 0,
      "latency": {
        "p50_ms": 487.2,
        "p95_ms": 623.1,
        "mean_ms": 501.4
      },
      "tokens": {
        "avg_input": 15.0,
        "avg_output": 22.3
      },
      "cost_per_1k_requests_usd": 0.059,
      "quality_score": 8.4
    }
  }
}
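The schema above lends itself to simple CI gates. A minimal sketch that fails a build when any model's p95 latency exceeds a budget (field names are taken from the schema shown; the sample data and threshold are examples, and in CI you would `json.load` the real `results.json`):

```python
import json

# Sample document in the schema shown above; in CI, replace with:
#   results_doc = json.load(open("results.json"))
results_doc = {
    "results": {
        "gpt-4o": {
            "n_errors": 0,
            "latency": {"p50_ms": 487.2, "p95_ms": 623.1, "mean_ms": 501.4},
            "cost_per_1k_requests_usd": 0.059,
        }
    }
}

def check_latency_budget(doc, budget_ms):
    """Return the models whose p95 latency exceeds the budget."""
    return [
        model
        for model, metrics in doc["results"].items()
        if metrics["latency"]["p95_ms"] > budget_ms
    ]

violations = check_latency_budget(results_doc, budget_ms=1000.0)
print(violations)  # [] — 623.1 ms is under the 1000 ms budget
```

Exiting non-zero when `violations` is non-empty turns this into a regression gate in any CI system.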

Supported Models

  Provider    Models
  ────────────────────────────────────────────────────────────────────────────────────────────
  OpenAI      gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo, o1, o1-mini, o3-mini
  Anthropic   claude-opus-4, claude-sonnet-4, claude-3-5-sonnet, claude-3-5-haiku, claude-3-haiku
  Gemini      gemini-2.5-pro, gemini-2.0-flash, gemini-2.0-flash-lite, gemini-1.5-pro, gemini-1.5-flash
  Mistral     mistral-large, mistral-small, codestral, mixtral-8x7b
  Groq        llama-3.3-70b-versatile, llama-3.1-8b-instant, mixtral-8x7b-32768, gemma2-9b-it

See all: llm-bench list-models


Environment Variables

  Variable                          Provider
  ──────────────────────────────────────────
  OPENAI_API_KEY                    OpenAI
  ANTHROPIC_API_KEY                 Anthropic
  GEMINI_API_KEY or GOOGLE_API_KEY  Google Gemini
  MISTRAL_API_KEY                   Mistral
  GROQ_API_KEY                      Groq
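Before a run, it can be useful to check which providers are actually configured. A small sketch (the mapping mirrors the table above; the helper itself is hypothetical, not part of the package):

```python
import os

# Provider -> accepted environment variables, per the table above
PROVIDER_KEYS = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "gemini": ["GEMINI_API_KEY", "GOOGLE_API_KEY"],
    "mistral": ["MISTRAL_API_KEY"],
    "groq": ["GROQ_API_KEY"],
}

def configured_providers(env=os.environ):
    """Providers with at least one of their API keys set."""
    return [p for p, keys in PROVIDER_KEYS.items()
            if any(k in env for k in keys)]
```

For example, with only `GOOGLE_API_KEY` set, `configured_providers` reports `["gemini"]`.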

Options Reference

llm-bench run [OPTIONS]

  --prompt, -p TEXT        Prompt to benchmark (repeatable)
  --models, -m TEXT        Comma-separated model list
  --config, -c PATH        YAML config file
  --runs, -n INTEGER       Runs per prompt per model [default: 3]
  --temperature, -t FLOAT  Sampling temperature [default: 0.0]
  --max-tokens INTEGER     Max output tokens [default: 1024]
  --judge TEXT             Judge model for quality scoring
  --system, -s TEXT        System prompt
  --output, -o PATH        Save results to JSON file
  --json                   Output JSON to stdout
  --concurrency INTEGER    Max concurrent requests [default: 5]
  --timeout FLOAT          Request timeout in seconds [default: 60.0]
  --verbose, -v            Show progress

Python API

import asyncio
from llm_bench.benchmark import BenchmarkConfig, run_benchmark
from llm_bench.reporter import print_results_table

config = BenchmarkConfig(
    models=["gpt-4o", "claude-3-5-sonnet"],
    prompts=["What is the meaning of life?"],
    n_runs=3,
    judge_model="gpt-4o-mini",
)

result = asyncio.run(run_benchmark(config))
print_results_table(result)

# Access raw data
for model, metrics in result.metrics.items():
    print(f"{model}: p50={metrics.latency_p50_ms:.0f}ms, quality={metrics.quality_score}")

Contributing

Contributions welcome! See CONTRIBUTING.md.

git clone https://github.com/midnightrunai/llm-bench
cd llm-bench
pip install -e ".[dev]"
pytest

License

MIT © Midnight Run
