Project description

LLM-Bench

A developer-centric CLI tool to systematically evaluate and compare Large Language Models (LLMs). Define prompts, test cases, and expected outputs, run them concurrently against multiple providers, and get detailed performance metrics (Accuracy, Latency, Throughput, Cost) in your terminal.

Preview

(Demo animation)

Features

  • Multi-Provider Support: OpenAI, Anthropic, Google Gemini, Groq, Mistral, Cohere, Together AI, Azure, and more (via litellm).
  • Rich Terminal Output: Real-time progress bars with ETA.
  • Detailed Metrics: Pass rates, P95 latency, time to first token (TTFT), tokens/sec, and cost.
  • Flexible Validation: Regex patterns, custom Python validators, JSON Schema, LLM judges.
  • ROUGE-L Scoring: Compare outputs against reference texts.
  • Concurrent Execution: Fast parallel testing with rate limiting.
  • Caching: Caches responses so repeated runs don't pay twice for identical requests.
  • Export: HTML, CSV, or JSON reports with interactive charts.
  • Comparison: Compare benchmark results across runs.
  • External Data: Load test cases from CSV or JSONL files.
  • Interactive Setup: Config wizard to get started quickly.
  • CI/CD Friendly: Quiet mode, no-color output, fail-fast, and dry-run support.

Installation

pip install llm-bench-cli

Or install from source:

git clone https://github.com/abdulbb/llm-bench-cli.git
cd llm-bench-cli
pip install -e .

Setting Up API Keys

Set your API keys as environment variables before running benchmarks:

export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GEMINI_API_KEY="..."
export GROQ_API_KEY="gsk_..."
export OPENROUTER_API_KEY="sk-or-..."

To make these persistent across terminal sessions, add them to your ~/.bashrc or ~/.zshrc, or use a tool like direnv.
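
For example, in zsh (a minimal sketch; adjust the profile file and the key name for your shell and provider):

# Append the export to your shell profile, then reload it
echo 'export OPENAI_API_KEY="sk-..."' >> ~/.zshrc
source ~/.zshrc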

For more detailed setup instructions, see Provider Setup.

Quick Start

1. Create a new benchmark configuration:

# Interactive wizard
llm-bench init

# Or generate a template
llm-bench init --non-interactive

2. Validate your configuration:

llm-bench validate --config bench.config.yaml

3. Preview without making API calls:

llm-bench run --config bench.config.yaml --dry-run

4. Run the benchmark:

llm-bench run --config bench.config.yaml

Usage Examples

Run a Standard Benchmark:

llm-bench run --config code_gen.yaml

Compare Models on the Fly:

llm-bench run --model openai/gpt-4o --model anthropic/claude-3-5-sonnet-20241022

Generate an Interactive Report:

llm-bench run --config summarization.yaml --export html --output report.html

Set Safety Limits:

llm-bench run --config expensive_test.yaml --max-cost 1.0

Stop on First Failure:

llm-bench run --config ci_tests.yaml --fail-fast

CI/CD Mode (quiet, no colors):

llm-bench run --config bench.config.yaml --quiet --no-color

List Available Models:

llm-bench models
llm-bench models --provider openai
llm-bench models --provider groq

Manage Cache:

llm-bench cache info
llm-bench cache clear

Compare Benchmark Results:

llm-bench compare run1.json run2.json --export html --output comparison.html
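
The JSON inputs here are earlier benchmark runs exported in JSON format; assuming the export flags behave as in the HTML example above, they can be produced with:

llm-bench run --config bench.config.yaml --export json --output run1.json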

Enable Shell Completions:

# Install completions for your shell
llm-bench --install-completion

# Show completion script without installing
llm-bench --show-completion

See CLI Reference for all commands and options, and Real-World Examples for full configuration files.

Example Configuration

name: "Code Generation Benchmark"
system_prompt: "You are a Python coding assistant."
models:
  - "openai/gpt-4o"
  - "anthropic/claude-3-5-sonnet-20241022"
  - "groq/llama-3.1-70b-versatile"

validators_file: "validators.py"

test_cases:
  - input: "Write a function to calculate factorial"
    regex_pattern: "def factorial"
    validator: "is_valid_python"
    
  - input: "What is 2+2?"
    expected: 4

License

MIT

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llm_bench_cli-0.1.0.tar.gz (590.6 kB)

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

llm_bench_cli-0.1.0-py3-none-any.whl (55.9 kB)

File details

Details for the file llm_bench_cli-0.1.0.tar.gz.

File metadata

  • Download URL: llm_bench_cli-0.1.0.tar.gz
  • Upload date:
  • Size: 590.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for llm_bench_cli-0.1.0.tar.gz
  • SHA256: d1065230f93495ccf78b2cf7bbcaaaadc83745c04ae42a8c0d8feed9c0124fd7
  • MD5: a7f1db782fff159f102513f1f213c971
  • BLAKE2b-256: bf84bd7abcdc60b00b16b005f77f801d566c50028e4fda383f10f89358318f07

See more details on using hashes here.

Provenance

The following attestation bundles were made for llm_bench_cli-0.1.0.tar.gz:

Publisher: publish.yml on abdulbb/llm-bench-cli

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file llm_bench_cli-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: llm_bench_cli-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 55.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for llm_bench_cli-0.1.0-py3-none-any.whl
  • SHA256: 33194c20307b235f72377188c36cddd693c8bde6e245bcf5a01a39cfa04ed0dc
  • MD5: b2d010799b28cbbc2027412f478bfac0
  • BLAKE2b-256: 2c7923ce0f14d462d59de8b6b280ca804fdf3942cb2ec01eec1a7f36fdd3e4bd

See more details on using hashes here.

Provenance

The following attestation bundles were made for llm_bench_cli-0.1.0-py3-none-any.whl:

Publisher: publish.yml on abdulbb/llm-bench-cli

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
