LLM-Bench
A developer-centric CLI tool to systematically evaluate and compare Large Language Models (LLMs). Define prompts, test cases, and expected outputs, run them concurrently against multiple providers, and get detailed performance metrics (Accuracy, Latency, Throughput, Cost) in your terminal.
Features
- Multi-Provider Support: OpenAI, Anthropic, Google Gemini, Groq, Mistral, Cohere, Together AI, Azure, and more (via litellm).
- Rich Terminal Output: Real-time progress bars with ETA.
- Detailed Metrics: Pass rates, P95 latency, TTFT, tokens/sec, and cost.
- Flexible Validation: Regex patterns, custom Python validators, JSON Schema, LLM judges.
- ROUGE-L Scoring: Compare outputs against reference texts.
- Concurrent Execution: Fast parallel testing with rate limiting.
- Caching: Saves money on repeated runs.
- Export: HTML, CSV, or JSON reports with interactive charts.
- Comparison: Compare benchmark results across runs.
- External Data: Load test cases from CSV or JSONL files.
- Interactive Setup: Config wizard to get started quickly.
- CI/CD Friendly: Quiet mode, no-color output, fail-fast, and dry-run support.
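As a sketch of the External Data feature, a JSONL test-case file could reuse the same field names as the inline test_cases entries shown later in this README (whether the loader accepts exactly these keys is an assumption):

```jsonl
{"input": "Write a function to calculate factorial", "regex_pattern": "def factorial"}
{"input": "What is 2+2?", "expected": "4"}
```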
Installation
pip install llm-bench-cli
Or install from source:
git clone https://github.com/abdulbb/llm-bench-cli.git
cd llm-bench-cli
pip install -e .
Setting Up API Keys
Set your API keys as environment variables before running benchmarks:
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GEMINI_API_KEY="..."
export GROQ_API_KEY="gsk_..."
export OPENROUTER_API_KEY="sk-or-..."
To make these persistent across terminal sessions, add them to your ~/.bashrc, ~/.zshrc, or use a tool like direnv.
For more detailed setup instructions, see Provider Setup.
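Before an expensive run, it can be worth checking that the relevant keys are actually exported. A minimal sketch (the `check_keys` helper is hypothetical, not part of llm-bench):

```shell
# check_keys warns about any environment variables that are unset or empty.
# Hypothetical helper -- not part of llm-bench itself.
check_keys() {
  for var in "$@"; do
    eval "val=\${$var}"          # portable indirect variable lookup
    if [ -z "$val" ]; then
      echo "warning: $var is not set"
    fi
  done
}

check_keys OPENAI_API_KEY ANTHROPIC_API_KEY GEMINI_API_KEY GROQ_API_KEY
```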
Documentation
- CLI Reference: All commands and options.
- Configuration Guide: Learn how to write bench.config.yaml.
- Real-World Examples: Start Here! Code generation, summarization, and data extraction scenarios.
- Evaluation & Validation: How results are judged (Regex, Custom Validators, JSON Schema, LLM Judge).
- Exporting Results: Details on HTML, CSV, and JSON output.
- Provider Setup: How to set up API keys for OpenAI, Gemini, Groq, etc.
- Development: How to contribute to this project.
Quick Start
1. Create a new benchmark configuration:
# Interactive wizard
llm-bench init
# Or generate a template
llm-bench init --non-interactive
2. Validate your configuration:
llm-bench validate --config bench.config.yaml
3. Preview without making API calls:
llm-bench run --config bench.config.yaml --dry-run
4. Run the benchmark:
llm-bench run --config bench.config.yaml
Usage Examples
Run a Standard Benchmark:
llm-bench run --config code_gen.yaml
Compare Models on the Fly:
llm-bench run --model openai/gpt-4o --model anthropic/claude-3-5-sonnet-20241022
Generate an Interactive Report:
llm-bench run --config summarization.yaml --export html --output report.html
Set Safety Limits:
llm-bench run --config expensive_test.yaml --max-cost 1.0
Stop on First Failure:
llm-bench run --config ci_tests.yaml --fail-fast
CI/CD Mode (quiet, no colors):
llm-bench run --config bench.config.yaml --quiet --no-color
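In a hosted CI pipeline, that mode might be wired up like the following sketch (the workflow layout and secret name are assumptions, not part of this project):

```yaml
# Sketch of a GitHub Actions step; the secret name is an assumption.
- name: Run LLM benchmark
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  run: |
    pip install llm-bench-cli
    llm-bench run --config bench.config.yaml --quiet --no-color --fail-fast --max-cost 1.0
```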
List Available Models:
llm-bench models
llm-bench models --provider openai
llm-bench models --provider groq
Manage Cache:
llm-bench cache info
llm-bench cache clear
Compare Benchmark Results:
llm-bench compare run1.json run2.json --export html --output comparison.html
Enable Shell Completions:
# Install completions for your shell
llm-bench --install-completion
# Show completion script without installing
llm-bench --show-completion
See CLI Reference for all commands and options, and Real-World Examples for full configuration files.
Example Configuration
name: "Code Generation Benchmark"
system_prompt: "You are a Python coding assistant."
models:
  - "openai/gpt-4o"
  - "anthropic/claude-3-5-sonnet-20241022"
  - "groq/llama-3.1-70b-versatile"
validators_file: "validators.py"
test_cases:
  - input: "Write a function to calculate factorial"
    regex_pattern: "def factorial"
    validator: "is_valid_python"
  - input: "What is 2+2?"
    expected: 4
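A validators.py implementing the is_valid_python check referenced above might look like this sketch; the validator signature (a function taking the raw model output and returning a bool) is an assumption about llm-bench's custom-validator contract:

```python
# validators.py -- sketch of a custom Python validator.
# Assumed contract: the validator receives the raw model output string
# and returns True (pass) or False (fail).
import ast


def is_valid_python(output: str) -> bool:
    """Return True if the output parses as Python source code."""
    try:
        ast.parse(output)
    except SyntaxError:
        return False
    return True
```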
License
MIT
File details
Details for the file llm_bench_cli-0.1.0.tar.gz.
File metadata
- Download URL: llm_bench_cli-0.1.0.tar.gz
- Size: 590.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d1065230f93495ccf78b2cf7bbcaaaadc83745c04ae42a8c0d8feed9c0124fd7 |
| MD5 | a7f1db782fff159f102513f1f213c971 |
| BLAKE2b-256 | bf84bd7abcdc60b00b16b005f77f801d566c50028e4fda383f10f89358318f07 |
Provenance
The following attestation bundles were made for llm_bench_cli-0.1.0.tar.gz:
Publisher: publish.yml on abdulbb/llm-bench-cli
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llm_bench_cli-0.1.0.tar.gz
- Subject digest: d1065230f93495ccf78b2cf7bbcaaaadc83745c04ae42a8c0d8feed9c0124fd7
- Sigstore transparency entry: 782624012
- Permalink: abdulbb/llm-bench-cli@0e3efd69499b4be79af62b9e83ef1430831fe2c2
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/abdulbb
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@0e3efd69499b4be79af62b9e83ef1430831fe2c2
- Trigger Event: release
File details
Details for the file llm_bench_cli-0.1.0-py3-none-any.whl.
File metadata
- Download URL: llm_bench_cli-0.1.0-py3-none-any.whl
- Size: 55.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 33194c20307b235f72377188c36cddd693c8bde6e245bcf5a01a39cfa04ed0dc |
| MD5 | b2d010799b28cbbc2027412f478bfac0 |
| BLAKE2b-256 | 2c7923ce0f14d462d59de8b6b280ca804fdf3942cb2ec01eec1a7f36fdd3e4bd |
Provenance
The following attestation bundles were made for llm_bench_cli-0.1.0-py3-none-any.whl:
Publisher: publish.yml on abdulbb/llm-bench-cli
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llm_bench_cli-0.1.0-py3-none-any.whl
- Subject digest: 33194c20307b235f72377188c36cddd693c8bde6e245bcf5a01a39cfa04ed0dc
- Sigstore transparency entry: 782624020
- Permalink: abdulbb/llm-bench-cli@0e3efd69499b4be79af62b9e83ef1430831fe2c2
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/abdulbb
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@0e3efd69499b4be79af62b9e83ef1430831fe2c2
- Trigger Event: release