
Benchmark LLMs with 10 benchmarks & 108,000+ questions. 9 providers: OpenAI, Anthropic, Google Gemini, Groq, Together, Fireworks, DeepSeek, Ollama, HuggingFace. Unified CLI + web dashboard.

Project description

🚀 LLM Benchmark Toolkit


🎯 Benchmark LLMs with 10 benchmarks & 108,000+ real questions
MMLU • TruthfulQA • HellaSwag • ARC • WinoGrande • CommonsenseQA • BoolQ • SafetyBench • Do-Not-Answer • GSM8K

Get Started • Compare Models • Python API • Academic • Contributing


⚡ One command to evaluate any LLM
Zero config • Auto-detection • Beautiful dashboard • Academic-grade results


🚀 Get Started (60 Seconds)

Install

# Full installation (everything included)
pip install llm-benchmark-toolkit

# Or with all extras (notebooks, dev tools)
pip install llm-benchmark-toolkit[all]

That's it! Everything included: Dashboard, OpenAI, Anthropic, Ollama, HuggingFace.

🩺 Check Your Setup

llm-eval doctor

This diagnoses your environment and shows what's ready to use.

๐ŸŒ Web Dashboard (Recommended!)

The easiest way to evaluate models - a beautiful web interface:

# Launch the dashboard (choose one):
llm-eval dashboard
# or
llm-dashboard
# or
python -m llm_evaluator.dashboard

Opens your browser to http://localhost:8888 where you can:

  • ๐Ÿš€ Run evaluations with real-time progress tracking
  • ๐Ÿ“Š Compare models with interactive charts
  • ๐Ÿ” Inspect scenarios - see every question & answer
  • ๐Ÿ“ˆ View history - track improvements over time
  • ๐Ÿ’พ Export results - JSON, charts, reports

Quick CLI Evaluation

# Run quick evaluation with auto-detection
llm-eval quick

# Or specify a model explicitly
llm-eval quick --model gpt-4o

# Full benchmark suite
llm-eval benchmark --model llama3.2:1b

Output:

🚀 LLM QUICK EVALUATION
==================================================
✅ Provider: openai (gpt-4o-mini)
✅ Sample size: 20

📊 RESULTS
==================================================
  🎯 MMLU:       78.5%
  🎯 TruthfulQA: 71.2%
  🎯 HellaSwag:  82.4%

  📈 Overall:    77.4%
==================================================
✨ Evaluation complete!

Auto-detection works with:

  • OPENAI_API_KEY โ†’ GPT-4o-mini
  • ANTHROPIC_API_KEY โ†’ Claude 3.5 Sonnet
  • GEMINI_API_KEY โ†’ Gemini 2.0 Flash (โš ๏ธ Free tier: 10 req/min)
  • DEEPSEEK_API_KEY โ†’ DeepSeek-V3
  • Ollama running locally โ†’ Llama 3.2

🔄 Compare Models

llm-eval compare \
  --models gpt-4o-mini,claude-3-5-sonnet \
  --sample-size 100

More examples:

# Pre-download datasets (optional, speeds up first run)
llm-eval download mmlu truthfulqa gsm8k
llm-eval download all  # Download all benchmarks

# Ollama (local models)
llm-eval quick --model llama3.2:1b

# OpenAI
llm-eval quick --model gpt-4o-mini

# Anthropic
llm-eval run --model claude-3-5-sonnet-20241022 --provider anthropic

# DeepSeek (super affordable!)
llm-eval quick --model deepseek-chat

# Google Gemini (NEW!)
llm-eval quick --model gemini-1.5-flash --provider gemini

# Run specific benchmarks (any combination!)
llm-eval benchmark --model gpt-4o --benchmarks mmlu,truthfulqa,arc,safetybench

# Run ALL benchmarks
llm-eval benchmark --model llama3.2:1b --benchmarks mmlu,truthfulqa,hellaswag,arc,winogrande,commonsenseqa,boolq,safetybench,donotanswer,gsm8k

# Full academic evaluation
llm-eval academic --model llama3.2:1b \
  --sample-size 500 \
  --output-latex results.tex

๐Ÿ–ฅ๏ธ CLI Commands Reference

| Command | Description |
|---|---|
| llm-eval quick | 🚀 Zero-config evaluation (auto-detects provider) |
| llm-eval doctor | 🩺 Diagnose your setup (dependencies, providers, API keys) |
| llm-eval download | 📥 Pre-download benchmark datasets (MMLU, TruthfulQA, etc.) |
| llm-eval run | Full evaluation on a single model |
| llm-eval benchmark | Run specific benchmarks |
| llm-eval compare | Compare multiple models side-by-side |
| llm-eval vs | 🥊 Run the same benchmark on multiple models sequentially |
| llm-eval dashboard | 🌐 Launch the web dashboard |
| llm-eval academic | 🎓 Academic evaluation with statistics |
| llm-eval export | 📤 Export results (JSON, CSV, LaTeX, BibTeX) |
| llm-eval providers | Check available provider status |
| llm-eval list-runs | 📋 List saved evaluation runs |

Key Options

# Common options for most commands
-m, --model TEXT       # Model name
-p, --provider TYPE    # ollama, openai, anthropic, huggingface, deepseek,
                       # groq, together, fireworks, gemini
-s, --sample-size INT  # Number of questions to test
-u, --base-url URL     # Custom API endpoint (vLLM, LM Studio, Azure)
--cache / --no-cache   # Enable/disable caching

# Benchmark selection
-b, --benchmarks TEXT  # Comma-separated: mmlu,truthfulqa,hellaswag,arc,
                       # winogrande,commonsenseqa,boolq,safetybench,donotanswer,gsm8k

VS Command (Model Battle)

Compare models head-to-head:

# Compare two local models
llm-eval vs llama3.2:1b mistral:7b

# Compare with specific benchmarks
llm-eval vs llama3.2:1b mistral:7b -b mmlu,arc -s 50

# Compare models from different providers
llm-eval vs gpt-4o-mini claude-3-5-sonnet -p openai,anthropic

# Ultra-fast with Groq
llm-eval quick --model llama-3.1-8b-instant --provider groq

๐Ÿ Python API

from llm_evaluator import ModelEvaluator
from llm_evaluator.providers import OpenAIProvider

provider = OpenAIProvider(model="gpt-4o-mini")
evaluator = ModelEvaluator(provider=provider)

results = evaluator.evaluate_all()
print(f"Overall: {results.overall_score:.1%}")

With caching (10x faster):

from llm_evaluator.providers import CachedProvider, OllamaProvider

provider = OllamaProvider(model="llama3.2:1b")
cached = CachedProvider(provider)  # Automatic caching!

evaluator = ModelEvaluator(provider=cached)
results = evaluator.evaluate_all()
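Conceptually, a caching wrapper like CachedProvider memoizes responses keyed by prompt, so repeated evaluations skip the API call entirely. A minimal sketch of the idea (not the toolkit's actual implementation; the `generate` method name is an assumption):

```python
# Minimal memoizing wrapper: the first call for a prompt hits the wrapped
# provider; subsequent calls for the same prompt return the stored answer.
class MemoCache:
    def __init__(self, provider):
        self.provider = provider
        self._cache = {}

    def generate(self, prompt):
        if prompt not in self._cache:
            self._cache[prompt] = self.provider.generate(prompt)
        return self._cache[prompt]
```

This is why re-running the same benchmark on a cached provider is roughly free: identical questions never reach the model twice.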

🎯 Features

| Feature | Description |
|---|---|
| 📊 10 Benchmarks | MMLU, TruthfulQA, HellaSwag, ARC, WinoGrande, CommonsenseQA, BoolQ, SafetyBench, Do-Not-Answer, GSM8K |
| 🔢 108,000+ Questions | Real academic datasets from HuggingFace |
| 🔌 9 Providers | Ollama, OpenAI, Anthropic, Google Gemini, DeepSeek, Groq, Together.ai, Fireworks, HuggingFace |
| 🐳 Docker Support | docker run llm-benchmark quick |
| 🌐 Web Dashboard | Beautiful UI with real-time progress, charts, and history |
| ⚡ Parallel Execution | 5-10x speedup with --workers 4 |
| 💾 Smart Caching | 10x faster repeated evaluations |
| 📈 Academic Rigor | 95% CI, McNemar tests, baseline comparisons |
| 📄 Paper Exports | LaTeX tables, BibTeX citations, CSV, JSON |
| 🛡️ Safety Testing | SafetyBench + Do-Not-Answer for security evaluation |
| 🔢 Math Reasoning | GSM8K (8,500 grade school math problems) |
| 🎨 Beautiful CLI | Progress bars, colored output, ETA tracking |

⚡ Parallel Execution (5-10x Speedup)

Speed up benchmarks with concurrent API calls:

# 4 parallel workers (4x faster)
llm-eval benchmark --model gpt-4o-mini --provider openai --workers 4 --sample-size 100

# Maximum parallelism for fast providers like Groq
llm-eval benchmark --model llama3-8b-8192 --provider groq --workers 8 --sample-size 500

Note: Set workers based on your provider's rate limits:

  • Groq: 8-16 workers (very high rate limits)
  • OpenAI: 4-8 workers
  • Ollama: 1-2 workers (local, CPU-bound)

🎓 Academic Use

For publication-quality evaluations:

from llm_evaluator import ModelEvaluator
from llm_evaluator.providers import OllamaProvider
from llm_evaluator.export import export_to_latex, generate_bibtex

provider = OllamaProvider(model="llama3.2:1b")
evaluator = ModelEvaluator(provider=provider)

results = evaluator.evaluate_all_academic(
    sample_size=500,
    compare_baselines=True
)

# 95% confidence intervals
print(f"MMLU: {results.mmlu_accuracy:.1%}")
print(f"95% CI: [{results.mmlu_ci[0]:.1%}, {results.mmlu_ci[1]:.1%}]")

# Compare to GPT-4, Claude, Llama baselines
for baseline, comparison in results.baseline_comparison.items():
    print(f"vs {baseline}: {comparison['difference']:+.1%}")

# Export for papers
latex = export_to_latex(results, "My Model")
bibtex = generate_bibtex()
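A 95% confidence interval on benchmark accuracy can be computed with the standard normal approximation for a binomial proportion; the toolkit's exact statistical method may differ, so treat this as an illustration of the idea:

```python
import math

# Normal-approximation 95% CI for an accuracy estimate: p ± z·sqrt(p(1-p)/n),
# clamped to [0, 1]. z = 1.96 corresponds to 95% coverage.
def accuracy_ci(correct, total, z=1.96):
    p = correct / total
    half = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - half), min(1.0, p + half)
```

For example, 78 correct out of 100 gives an interval of roughly [0.70, 0.86], which is why larger sample sizes (e.g. --sample-size 500) matter for publication-quality claims.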

🎨 Visual Output Examples

Benchmark Comparison


Interactive Dashboard


(Add screenshots to docs/images/ folder)


🔌 Check Available Providers

llm-eval providers
🔌 Available Providers:

✅ Auto-detected: openai (gpt-4o-mini)

  ✅ ollama          - Local LLMs (llama3.2, mistral, etc.)
  ✅ openai          - GPT-3.5, GPT-4, GPT-4o
  ❌ anthropic       - Claude 3/3.5 (pip install anthropic)
  ✅ deepseek        - DeepSeek-V3, DeepSeek-R1
  ❌ huggingface     - Inference API

📋 Environment Variables:
  ✅ OPENAI_API_KEY       sk-abc1...
  ❌ ANTHROPIC_API_KEY    Not set

🔬 Benchmarks Included

📚 Knowledge & Reasoning (7 benchmarks)

| Benchmark | Questions | Description |
|---|---|---|
| MMLU | 14,042 | Massive Multitask Language Understanding - 57 subjects |
| TruthfulQA | 817 | Truthfulness and avoiding misinformation |
| HellaSwag | 10,042 | Common-sense reasoning and sentence completion |
| ARC-Challenge | 2,590 | Grade-school science questions (hard subset) |
| WinoGrande | 44,000 | Pronoun resolution and commonsense reasoning |
| CommonsenseQA | 12,247 | Commonsense knowledge questions |
| BoolQ | 15,942 | Yes/no reading comprehension questions |

🔢 Math Reasoning (1 benchmark)

| Benchmark | Questions | Description |
|---|---|---|
| GSM8K | 8,500 | Grade school math word problems requiring multi-step reasoning |

๐Ÿ›ก๏ธ Safety & Security (2 benchmarks)

Benchmark Questions Description
SafetyBench 11,000 Safety evaluation across multiple risk categories
Do-Not-Answer 939 Harmful prompt detection and refusal testing

Total: 10 benchmarks, 108,000+ questions


๐Ÿค Contributing

This is open source. Make it better:

git clone https://github.com/NahuelGiudizi/llm-evaluation
cd llm-evaluation
pip install -e ".[dev]"
pytest tests/ -v

Wanted

  • Async evaluation for faster throughput
  • More benchmarks (HumanEval, GPQA, MT-Bench)
  • Batch evaluation mode
  • Custom benchmark support
  • Kubernetes deployment

Contributors welcome! 🎉


📚 Documentation

| Doc | Description |
|---|---|
| 📖 Quick Start | Get running in 5 minutes |
| 🔌 Providers Guide | Ollama, OpenAI, Anthropic, DeepSeek, HuggingFace |
| 🔬 Benchmarks | MMLU, TruthfulQA, HellaSwag details |
| 🎓 Academic Usage | Statistical methods, LaTeX export |
| 📘 API Reference | Complete Python API documentation |

๐Ÿณ Docker

Run benchmarks without installing anything:

# Build the image
docker build -t llm-benchmark .

# Quick evaluation with OpenAI
docker run -e OPENAI_API_KEY=$OPENAI_API_KEY llm-benchmark quick

# Ultra-fast with Groq
docker run -e GROQ_API_KEY=$GROQ_API_KEY llm-benchmark quick \
  --model llama-3.1-8b-instant --provider groq

# Run specific benchmarks
docker run -e OPENAI_API_KEY=$OPENAI_API_KEY llm-benchmark benchmark \
  --model gpt-4o-mini --benchmarks mmlu,truthfulqa -s 50

# Launch dashboard
docker run -p 8888:8888 -e OPENAI_API_KEY=$OPENAI_API_KEY \
  llm-benchmark dashboard --host 0.0.0.0

# With docker-compose
docker compose up dashboard

📊 Output Formats

# JSON (default)
llm-eval run --model llama3.2:1b --output results.json

# Export to multiple formats
llm-eval export results.json --format all

# Individual formats
llm-eval export results.json --format csv
llm-eval export results.json --format latex
llm-eval export results.json --format bibtex

# Academic evaluation with direct exports
llm-eval academic --model llama3.2:1b --output-latex table.tex --output-bibtex refs.bib

🧪 Provider Testing Status

  • ✅ Ollama: fully tested with multiple models (Llama, Mistral, Phi3)
  • ⚠️ Gemini: tested on the free tier - works, but rate limits are strict (10 req/min)
  • ⚠️ OpenAI, Anthropic, DeepSeek, Groq, Together, Fireworks, HuggingFace: unit tests pass and they should work with valid API keys, but they have not been tested extensively against live endpoints to avoid subscription costs

Found an issue? Report it here

For detailed provider documentation, see PROVIDERS.md.


📜 License

MIT License - see LICENSE for details.


โญ Star History

If this project helped you, please star it! โญ

Star History Chart


Made with ❤️ by Nahuel Giudizi



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llm_benchmark_toolkit-2.4.2.tar.gz (398.1 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

llm_benchmark_toolkit-2.4.2-py3-none-any.whl (365.3 kB)

Uploaded Python 3

File details

Details for the file llm_benchmark_toolkit-2.4.2.tar.gz.

File metadata

  • Download URL: llm_benchmark_toolkit-2.4.2.tar.gz
  • Upload date:
  • Size: 398.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for llm_benchmark_toolkit-2.4.2.tar.gz
Algorithm Hash digest
SHA256 6b72df5d82834bfca748a3badc45c96c470e3d2badc7f389caf60f9b96828883
MD5 e1a57cf22c9a1b77074c3f91e6d995d2
BLAKE2b-256 e69f94e53c1accb629b7506238a1a2dcddf0c2b66105d7dc27dce04b4ce30a11

See more details on using hashes here.

File details

Details for the file llm_benchmark_toolkit-2.4.2-py3-none-any.whl.

File metadata

File hashes

Hashes for llm_benchmark_toolkit-2.4.2-py3-none-any.whl
Algorithm Hash digest
SHA256 3c56d9744f0b0da156077a8d4b6e783c241fb2125bc6a6feffb78cd8c78ab27f
MD5 10547aa88a2851478e20a640fdf3e570
BLAKE2b-256 5578b1ef2ae73612640393d1fc73c52da4f80c9aa492bc19058ddd234437ebd8

See more details on using hashes here.
