Benchmark LLMs with 10 benchmarks & 108,000+ questions. 9 providers: OpenAI, Anthropic, Google Gemini, Groq, Together, Fireworks, DeepSeek, Ollama, HuggingFace. Unified CLI + Web dashboard.
Project description
LLM Benchmark Toolkit
Benchmark LLMs with 10 benchmarks & 108,000+ real questions
MMLU • TruthfulQA • HellaSwag • ARC • WinoGrande • CommonsenseQA • BoolQ • SafetyBench • Do-Not-Answer • GSM8K
Get Started • Compare Models • Python API • Academic • Contributing
One command to evaluate any LLM
Zero config • Auto-detection • Beautiful dashboard • Academic-grade results
Get Started (60 Seconds)
Install
# Full installation (everything included)
pip install llm-benchmark-toolkit
# Or with all extras (notebooks, dev tools)
pip install llm-benchmark-toolkit[all]
That's it! Everything included: Dashboard, OpenAI, Anthropic, Ollama, HuggingFace.
Check Your Setup
llm-eval doctor
This diagnoses your environment and shows what's ready to use.
Web Dashboard (Recommended!)
The easiest way to evaluate models - a beautiful web interface:
# Launch the dashboard (choose one):
llm-eval dashboard
# or
llm-dashboard
# or
python -m llm_evaluator.dashboard
Opens your browser to http://localhost:8888 where you can:
- Run evaluations with real-time progress tracking
- Compare models with interactive charts
- Inspect scenarios - see every question & answer
- View history - track improvements over time
- Export results - JSON, charts, reports
Quick CLI Evaluation
# Run quick evaluation with auto-detection
llm-eval quick
# Or specify provider
llm-eval quick --model gpt-4o
# Full benchmark suite
llm-eval benchmark --model llama3.2:1b
Output:
LLM QUICK EVALUATION
==================================================
✅ Provider: openai (gpt-4o-mini)
✅ Sample size: 20
RESULTS
==================================================
MMLU: 78.5%
TruthfulQA: 71.2%
HellaSwag: 82.4%
Overall: 77.4%
==================================================
✨ Evaluation complete!
Auto-detection works with:
- OPENAI_API_KEY → GPT-4o-mini
- ANTHROPIC_API_KEY → Claude 3.5 Sonnet
- GEMINI_API_KEY → Gemini 2.0 Flash (⚠️ free tier: 10 req/min)
- DEEPSEEK_API_KEY → DeepSeek-V3
- Ollama running locally → Llama 3.2
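Conceptually, auto-detection just checks which credentials are available and picks the first match. A minimal sketch of the idea (the names, priority order, and `detect_provider` function are illustrative, not the toolkit's actual internals):

```python
import os

# Hypothetical priority list: first provider whose API key is set wins.
PROVIDER_ENV_VARS = [
    ("openai", "OPENAI_API_KEY"),
    ("anthropic", "ANTHROPIC_API_KEY"),
    ("gemini", "GEMINI_API_KEY"),
    ("deepseek", "DEEPSEEK_API_KEY"),
]

def detect_provider(env=None):
    """Return the first provider with a configured API key, else 'ollama'."""
    env = os.environ if env is None else env
    for provider, var in PROVIDER_ENV_VARS:
        if env.get(var):
            return provider
    # Fall back to a local Ollama instance when no cloud key is present
    return "ollama"
```

This is why `llm-eval quick` needs no flags: setting an environment variable is enough to select a backend.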
Compare Models
llm-eval compare \
--models gpt-4o-mini,claude-3-5-sonnet \
--sample-size 100
More examples:
# Pre-download datasets (optional, speeds up first run)
llm-eval download mmlu truthfulqa gsm8k
llm-eval download all # Download all benchmarks
# Ollama (local models)
llm-eval quick --model llama3.2:1b
# OpenAI
llm-eval quick --model gpt-4o-mini
# Anthropic
llm-eval run --model claude-3-5-sonnet-20241022 --provider anthropic
# DeepSeek (super affordable!)
llm-eval quick --model deepseek-chat
# Google Gemini (NEW!)
llm-eval quick --model gemini-1.5-flash --provider gemini
# Run specific benchmarks (any combination!)
llm-eval benchmark --model gpt-4o --benchmarks mmlu,truthfulqa,arc,safetybench
# Run ALL benchmarks
llm-eval benchmark --model llama3.2:1b --benchmarks mmlu,truthfulqa,hellaswag,arc,winogrande,commonsenseqa,boolq,safetybench,donotanswer,gsm8k
# Full academic evaluation
llm-eval academic --model llama3.2:1b \
--sample-size 500 \
--output-latex results.tex
CLI Commands Reference
| Command | Description |
|---|---|
| `llm-eval quick` | Zero-config evaluation (auto-detects provider) |
| `llm-eval doctor` | Diagnose your setup (dependencies, providers, API keys) |
| `llm-eval download` | Pre-download benchmark datasets (MMLU, TruthfulQA, etc.) |
| `llm-eval run` | Full evaluation on a single model |
| `llm-eval benchmark` | Run specific benchmarks |
| `llm-eval compare` | Compare multiple models side-by-side |
| `llm-eval vs` | Run the same benchmark on multiple models sequentially |
| `llm-eval dashboard` | Launch web dashboard |
| `llm-eval academic` | Academic evaluation with statistics |
| `llm-eval export` | Export results (JSON, CSV, LaTeX, BibTeX) |
| `llm-eval providers` | Check available provider status |
| `llm-eval list-runs` | List saved evaluation runs |
Key Options
# Common options for most commands
-m, --model TEXT # Model name
-p, --provider TYPE # ollama, openai, anthropic, huggingface, deepseek,
# groq, together, fireworks
-s, --sample-size INT # Number of questions to test
-u, --base-url URL # Custom API endpoint (vLLM, LM Studio, Azure)
--cache / --no-cache # Enable/disable caching
# Benchmark selection
-b, --benchmarks TEXT # Comma-separated: mmlu,truthfulqa,hellaswag,arc,
# winogrande,commonsenseqa,boolq,safetybench,donotanswer
VS Command (Model Battle)
Compare models head-to-head:
# Compare two local models
llm-eval vs llama3.2:1b mistral:7b
# Compare with specific benchmarks
llm-eval vs llama3.2:1b mistral:7b -b mmlu,arc -s 50
# Compare models from different providers
llm-eval vs gpt-4o-mini claude-3.5-sonnet -p openai,anthropic
# Ultra-fast with Groq
llm-eval quick --model llama-3.1-8b-instant --provider groq
Python API
from llm_evaluator import ModelEvaluator
from llm_evaluator.providers import OpenAIProvider
provider = OpenAIProvider(model="gpt-4o-mini")
evaluator = ModelEvaluator(provider=provider)
results = evaluator.evaluate_all()
print(f"Overall: {results.overall_score:.1%}")
With caching (10x faster):
from llm_evaluator.providers import CachedProvider, OllamaProvider
provider = OllamaProvider(model="llama3.2:1b")
cached = CachedProvider(provider) # Automatic caching!
evaluator = ModelEvaluator(provider=cached)
results = evaluator.evaluate_all()
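The exact internals of `CachedProvider` aren't shown here, but a response cache keyed by prompt is the core idea. A hedged sketch (the class and its `generate` method are illustrative assumptions, not the toolkit's API):

```python
class SimpleCachedProvider:
    """Illustrative wrapper: memoize model responses keyed by prompt.

    Repeated evaluations ask many identical questions, so a cache hit
    skips the network round-trip entirely - hence the large speedup.
    """

    def __init__(self, provider):
        self.provider = provider
        self._cache = {}

    def generate(self, prompt):
        # Only hit the underlying model on a cache miss
        if prompt not in self._cache:
            self._cache[prompt] = self.provider.generate(prompt)
        return self._cache[prompt]
```

A persistent implementation would serialize the cache to disk so the speedup survives across runs.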
Features
| Feature | Description |
|---|---|
| 10 Benchmarks | MMLU, TruthfulQA, HellaSwag, ARC, WinoGrande, CommonsenseQA, BoolQ, SafetyBench, Do-Not-Answer, GSM8K |
| 108,000+ Questions | Real academic datasets from HuggingFace |
| 9 Providers | Ollama, OpenAI, Anthropic, Google Gemini, DeepSeek, Groq, Together.ai, Fireworks, HuggingFace |
| Docker Support | `docker run llm-benchmark quick` |
| Web Dashboard | Beautiful UI with real-time progress, charts, and history |
| Parallel Execution | 5-10x speedup with `--workers 4` |
| Smart Caching | 10x faster repeated evaluations |
| Academic Rigor | 95% CI, McNemar tests, baseline comparisons |
| Paper Exports | LaTeX tables, BibTeX citations, CSV, JSON |
| Safety Testing | SafetyBench + Do-Not-Answer for security evaluation |
| Math Reasoning | GSM8K (8,500 grade-school math problems) |
| Beautiful CLI | Progress bars, colored output, ETA tracking |
Parallel Execution (5-10x Speedup)
Speed up benchmarks with concurrent API calls:
# 4 parallel workers (4x faster)
llm-eval benchmark --model gpt-4o-mini --provider openai --workers 4 --sample-size 100
# Maximum parallelism for fast providers like Groq
llm-eval benchmark --model llama3-8b-8192 --provider groq --workers 8 --sample-size 500
Note: Set workers based on your provider's rate limits:
- Groq: 8-16 workers (very high rate limits)
- OpenAI: 4-8 workers
- Ollama: 1-2 workers (local, CPU-bound)
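The speedup comes from overlapping network latency: benchmark questions are independent, so API calls can run concurrently. A generic sketch of the pattern using only the standard library (not the toolkit's internals; `ask` stands for any callable that sends one question to a model):

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_parallel(ask, questions, workers=4):
    """Evaluate independent questions concurrently.

    Threads suit this workload because each call is I/O-bound:
    while one request waits on the network, others proceed.
    pool.map preserves input order, so results line up with questions.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ask, questions))
```

This is also why local CPU-bound backends like Ollama gain little from extra workers: there is no network wait to overlap.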
Academic Use
For publication-quality evaluations:
from llm_evaluator import ModelEvaluator
from llm_evaluator.providers import OllamaProvider
from llm_evaluator.export import export_to_latex, generate_bibtex
provider = OllamaProvider(model="llama3.2:1b")
evaluator = ModelEvaluator(provider=provider)
results = evaluator.evaluate_all_academic(
sample_size=500,
compare_baselines=True
)
# 95% confidence intervals
print(f"MMLU: {results.mmlu_accuracy:.1%}")
print(f"95% CI: [{results.mmlu_ci[0]:.1%}, {results.mmlu_ci[1]:.1%}]")
# Compare to GPT-4, Claude, Llama baselines
for baseline, comparison in results.baseline_comparison.items():
print(f"vs {baseline}: {comparison['difference']:+.1%}")
# Export for papers
latex = export_to_latex(results, "My Model")
bibtex = generate_bibtex()
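For intuition, a confidence interval like the one printed above can be approximated by hand. A minimal sketch using the normal approximation for a binomial proportion (the toolkit's actual method may differ, e.g. Wilson intervals; `accuracy_ci` is a hypothetical helper):

```python
import math

def accuracy_ci(correct, total, z=1.96):
    """Approximate 95% CI for an accuracy score (normal approximation)."""
    p = correct / total
    # Standard error of a proportion: sqrt(p * (1 - p) / n)
    margin = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - margin), min(1.0, p + margin)

lo, hi = accuracy_ci(392, 500)  # e.g. 392 of 500 questions correct
print(f"accuracy 78.4%, 95% CI [{lo:.1%}, {hi:.1%}]")
```

This also shows why larger sample sizes matter: the interval width shrinks with the square root of the number of questions.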
Check Available Providers
llm-eval providers
Available Providers:
✅ Auto-detected: openai (gpt-4o-mini)
✅ ollama - Local LLMs (llama3.2, mistral, etc.)
✅ openai - GPT-3.5, GPT-4, GPT-4o
❌ anthropic - Claude 3/3.5 (pip install anthropic)
✅ deepseek - DeepSeek-V3, DeepSeek-R1
❌ huggingface - Inference API
Environment Variables:
✅ OPENAI_API_KEY sk-abc1...
❌ ANTHROPIC_API_KEY Not set
Benchmarks Included
Knowledge & Reasoning (7 benchmarks)
| Benchmark | Questions | Description |
|---|---|---|
| MMLU | 14,042 | Massive Multitask Language Understanding - 57 subjects |
| TruthfulQA | 817 | Truthfulness and avoiding misinformation |
| HellaSwag | 10,042 | Common-sense reasoning and sentence completion |
| ARC-Challenge | 2,590 | Grade-school science questions (hard subset) |
| WinoGrande | 44,000 | Pronoun resolution and commonsense reasoning |
| CommonsenseQA | 12,247 | Commonsense knowledge questions |
| BoolQ | 15,942 | Yes/no reading comprehension questions |
Math Reasoning (1 benchmark)
| Benchmark | Questions | Description |
|---|---|---|
| GSM8K | 8,500 | Grade school math word problems requiring multi-step reasoning |
Safety & Security (2 benchmarks)
| Benchmark | Questions | Description |
|---|---|---|
| SafetyBench | 11,000 | Safety evaluation across multiple risk categories |
| Do-Not-Answer | 939 | Harmful prompt detection and refusal testing |
Total: 10 benchmarks, 108,000+ questions
Contributing
This is open source. Make it better:
git clone https://github.com/NahuelGiudizi/llm-evaluation
cd llm-evaluation
pip install -e ".[dev]"
pytest tests/ -v
Wanted
- Async evaluation for faster throughput
- More benchmarks (HumanEval, GPQA, MT-Bench)
- Batch evaluation mode
- Custom benchmark support
- Kubernetes deployment
Contributors welcome!
Documentation
| Doc | Description |
|---|---|
| Quick Start | Get running in 5 minutes |
| Providers Guide | Ollama, OpenAI, Anthropic, DeepSeek, HuggingFace |
| Benchmarks | MMLU, TruthfulQA, HellaSwag details |
| Academic Usage | Statistical methods, LaTeX export |
| API Reference | Complete Python API documentation |
Docker
Run benchmarks without installing anything:
# Build the image
docker build -t llm-benchmark .
# Quick evaluation with OpenAI
docker run -e OPENAI_API_KEY=$OPENAI_API_KEY llm-benchmark quick
# Ultra-fast with Groq
docker run -e GROQ_API_KEY=$GROQ_API_KEY llm-benchmark quick \
--model llama-3.1-8b-instant --provider groq
# Run specific benchmarks
docker run -e OPENAI_API_KEY=$OPENAI_API_KEY llm-benchmark benchmark \
--model gpt-4o-mini --benchmarks mmlu,truthfulqa -s 50
# Launch dashboard
docker run -p 8888:8888 -e OPENAI_API_KEY=$OPENAI_API_KEY \
llm-benchmark dashboard --host 0.0.0.0
# With docker-compose
docker compose up dashboard
Output Formats
# JSON (default)
llm-eval run --model llama3.2:1b --output results.json
# Export to multiple formats
llm-eval export results.json --format all
# Individual formats
llm-eval export results.json --format csv
llm-eval export results.json --format latex
llm-eval export results.json --format bibtex
# Academic evaluation with direct exports
llm-eval academic --model llama3.2:1b --output-latex table.tex --output-bibtex refs.bib
Provider Testing Status
- ✅ Ollama: fully tested with multiple models (Llama, Mistral, Phi3)
- ⚠️ Gemini: tested with the free tier; works, but has strict rate limits (10 req/min)
- ⚠️ OpenAI, Anthropic, DeepSeek, Groq, Together, Fireworks, HuggingFace: unit tests pass and they should work with valid API keys, but they have not been extensively tested against live endpoints (to avoid subscription costs)
Found an issue? Report it here
For detailed provider documentation, see PROVIDERS.md.
License
MIT License - see LICENSE for details.
Star History
If this project helped you, please star it! ⭐
Made with ❤️ by Nahuel Giudizi
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file llm_benchmark_toolkit-2.4.2.tar.gz.
File metadata
- Download URL: llm_benchmark_toolkit-2.4.2.tar.gz
- Upload date:
- Size: 398.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6b72df5d82834bfca748a3badc45c96c470e3d2badc7f389caf60f9b96828883 |
| MD5 | e1a57cf22c9a1b77074c3f91e6d995d2 |
| BLAKE2b-256 | e69f94e53c1accb629b7506238a1a2dcddf0c2b66105d7dc27dce04b4ce30a11 |
File details
Details for the file llm_benchmark_toolkit-2.4.2-py3-none-any.whl.
File metadata
- Download URL: llm_benchmark_toolkit-2.4.2-py3-none-any.whl
- Upload date:
- Size: 365.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3c56d9744f0b0da156077a8d4b6e783c241fb2125bc6a6feffb78cd8c78ab27f |
| MD5 | 10547aa88a2851478e20a640fdf3e570 |
| BLAKE2b-256 | 5578b1ef2ae73612640393d1fc73c52da4f80c9aa492bc19058ddd234437ebd8 |