Project description

LLM-Bench

A developer-centric CLI tool to systematically evaluate and compare Large Language Models (LLMs). Define prompts, test cases, and expected outputs, run them concurrently against multiple providers, and get detailed performance metrics (Accuracy, Latency, Throughput, Cost) in your terminal.

Preview

(Demo animation)

Features

  • Multi-Provider Support: OpenAI, Anthropic, Google Gemini, Groq, Mistral, Cohere, Together AI, Azure, and more (via litellm).
  • Rich Terminal Output: Real-time progress bars with ETA.
  • Detailed Metrics: Pass rates, P95 latency, time to first token (TTFT), tokens/sec, and cost.
  • Flexible Validation: Regex patterns, custom Python validators, JSON Schema, LLM judges.
  • ROUGE-L Scoring: Compare outputs against reference texts.
  • Concurrent Execution: Fast parallel testing with rate limiting.
  • Caching: Caches responses so repeated runs don't pay twice for identical requests.
  • Export: HTML, CSV, or JSON reports with interactive charts.
  • Comparison: Compare benchmark results across runs.
  • External Data: Load test cases from CSV or JSONL files.
  • Interactive Setup: Config wizard to get started quickly.
  • CI/CD Friendly: Quiet mode, no-color output, fail-fast, and dry-run support.

Installation

pip install llm-bench-cli

Or install from source:

git clone https://github.com/abdulbb/llm-bench-cli.git
cd llm-bench-cli
pip install -e .

Setting Up API Keys

Set your API keys as environment variables before running benchmarks:

export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GEMINI_API_KEY="..."
export GROQ_API_KEY="gsk_..."
export OPENROUTER_API_KEY="sk-or-..."

To make these persistent across terminal sessions, add them to your ~/.bashrc or ~/.zshrc, or use a tool like direnv.
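
For example, in zsh (a minimal sketch; adjust the profile file and the key name for your shell and provider):

# Append the export to your shell profile, then reload it
echo 'export OPENAI_API_KEY="sk-..."' >> ~/.zshrc
source ~/.zshrc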

For more detailed setup instructions, see Provider Setup.

Quick Start

1. Create a new benchmark configuration:

# Interactive wizard
llm-bench init

# Or generate a template
llm-bench init --non-interactive

2. Validate your configuration:

llm-bench validate --config bench.config.yaml

3. Preview without making API calls:

llm-bench run --config bench.config.yaml --dry-run

4. Run the benchmark:

llm-bench run --config bench.config.yaml

Usage Examples

Run a Standard Benchmark:

llm-bench run --config code_gen.yaml

Compare Models on the Fly:

llm-bench run --model openai/gpt-4o --model anthropic/claude-3-5-sonnet-20241022

Generate an Interactive Report:

llm-bench run --config summarization.yaml --export html --output report.html

Set Safety Limits:

llm-bench run --config expensive_test.yaml --max-cost 1.0

Stop on First Failure:

llm-bench run --config ci_tests.yaml --fail-fast

CI/CD Mode (quiet, no colors):

llm-bench run --config bench.config.yaml --quiet --no-color

List Available Models:

llm-bench models
llm-bench models --provider openai
llm-bench models --provider groq

Manage Cache:

llm-bench cache info
llm-bench cache clear

Compare Benchmark Results:

llm-bench compare run1.json run2.json --export html --output comparison.html
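
The JSON inputs here are earlier benchmark runs exported in JSON format; assuming the export flags behave as in the HTML example above, they can be produced with:

llm-bench run --config bench.config.yaml --export json --output run1.json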

Enable Shell Completions:

# Install completions for your shell
llm-bench --install-completion

# Show completion script without installing
llm-bench --show-completion

See CLI Reference for all commands and options, and Real-World Examples for full configuration files.

Example Configuration

name: "Code Generation Benchmark"
system_prompt: "You are a Python coding assistant."
models:
  - "openai/gpt-4o"
  - "anthropic/claude-3-5-sonnet-20241022"
  - "groq/llama-3.1-70b-versatile"

validators_file: "validators.py"

test_cases:
  - input: "Write a function to calculate factorial"
    regex_pattern: "def factorial"
    validator: "is_valid_python"
    
  - input: "What is 2+2?"
    expected: 4

License

MIT

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llm_bench_cli-0.1.0.tar.gz (590.6 kB)

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

llm_bench_cli-0.1.0-py3-none-any.whl (55.9 kB)

File details

Details for the file llm_bench_cli-0.1.0.tar.gz.

File metadata

  • Download URL: llm_bench_cli-0.1.0.tar.gz
  • Upload date:
  • Size: 590.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for llm_bench_cli-0.1.0.tar.gz
  • SHA256: d1065230f93495ccf78b2cf7bbcaaaadc83745c04ae42a8c0d8feed9c0124fd7
  • MD5: a7f1db782fff159f102513f1f213c971
  • BLAKE2b-256: bf84bd7abcdc60b00b16b005f77f801d566c50028e4fda383f10f89358318f07

See more details on using hashes here.

Provenance

The following attestation bundles were made for llm_bench_cli-0.1.0.tar.gz:

Publisher: publish.yml on abdulbb/llm-bench-cli

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file llm_bench_cli-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: llm_bench_cli-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 55.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for llm_bench_cli-0.1.0-py3-none-any.whl
  • SHA256: 33194c20307b235f72377188c36cddd693c8bde6e245bcf5a01a39cfa04ed0dc
  • MD5: b2d010799b28cbbc2027412f478bfac0
  • BLAKE2b-256: 2c7923ce0f14d462d59de8b6b280ca804fdf3942cb2ec01eec1a7f36fdd3e4bd

See more details on using hashes here.

Provenance

The following attestation bundles were made for llm_bench_cli-0.1.0-py3-none-any.whl:

Publisher: publish.yml on abdulbb/llm-bench-cli

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
