Benchmark LLMs with 10 benchmarks & 108,000+ questions. 9 providers: OpenAI, Anthropic, Google Gemini, Groq, Together, Fireworks, DeepSeek, Ollama, HuggingFace. Unified CLI + Web dashboard.
Project description
LLM Benchmark Toolkit
Benchmark LLMs with 10 benchmarks & 108,000+ real questions
MMLU • TruthfulQA • HellaSwag • ARC • WinoGrande • CommonsenseQA • BoolQ • SafetyBench • Do-Not-Answer • GSM8K
Get Started • Compare Models • Python API • Academic • Contributing
One command to evaluate any LLM
Zero config • Auto-detection • Beautiful dashboard • Academic-grade results
Get Started (60 Seconds)
Install
# Full installation (everything included)
pip install llm-benchmark-toolkit
# Or with all extras (notebooks, dev tools)
pip install llm-benchmark-toolkit[all]
That's it! Everything included: Dashboard, OpenAI, Anthropic, Ollama, HuggingFace.
Check Your Setup
llm-eval doctor
This diagnoses your environment and shows what's ready to use.
Web Dashboard (Recommended!)
The easiest way to evaluate models - a beautiful web interface:
# Launch the dashboard (choose one):
llm-eval dashboard
# or
llm-dashboard
# or
python -m llm_evaluator.dashboard
Opens your browser to http://localhost:8888 where you can:
- Run evaluations with real-time progress tracking
- Compare models with interactive charts
- Inspect scenarios - see every question & answer
- View history - track improvements over time
- Export results - JSON, charts, reports
Quick CLI Evaluation
# Run quick evaluation with auto-detection
llm-eval quick
# Or specify provider
llm-eval quick --model gpt-4o
# Full benchmark suite
llm-eval benchmark --model llama3.2:1b
Output:
LLM QUICK EVALUATION
==================================================
✅ Provider: openai (gpt-4o-mini)
✅ Sample size: 20
RESULTS
==================================================
MMLU: 78.5%
TruthfulQA: 71.2%
HellaSwag: 82.4%
Overall: 77.4%
==================================================
✨ Evaluation complete!
Auto-detection works with:
- OPENAI_API_KEY → GPT-4o-mini
- ANTHROPIC_API_KEY → Claude 3.5 Sonnet
- GEMINI_API_KEY → Gemini 2.0 Flash (⚠️ free tier: 10 req/min)
- DEEPSEEK_API_KEY → DeepSeek-V3
- Ollama running locally → Llama 3.2
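Conceptually, auto-detection just checks which credentials are available and picks the first match. A minimal sketch of the idea (the names, priority order, and `detect_provider` function are illustrative, not the toolkit's actual internals):

```python
import os

# Hypothetical priority list: first provider whose API key is set wins.
PROVIDER_ENV_VARS = [
    ("openai", "OPENAI_API_KEY"),
    ("anthropic", "ANTHROPIC_API_KEY"),
    ("gemini", "GEMINI_API_KEY"),
    ("deepseek", "DEEPSEEK_API_KEY"),
]

def detect_provider(env=None):
    """Return the first provider with a configured API key, else 'ollama'."""
    env = os.environ if env is None else env
    for provider, var in PROVIDER_ENV_VARS:
        if env.get(var):
            return provider
    # Fall back to a local Ollama instance when no cloud key is present
    return "ollama"
```

This is why `llm-eval quick` needs no flags: setting an environment variable is enough to select a backend.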
Compare Models
llm-eval compare \
--models gpt-4o-mini,claude-3-5-sonnet \
--sample-size 100
More examples:
# Pre-download datasets (optional, speeds up first run)
llm-eval download mmlu truthfulqa gsm8k
llm-eval download all # Download all benchmarks
# Ollama (local models)
llm-eval quick --model llama3.2:1b
# OpenAI
llm-eval quick --model gpt-4o-mini
# Anthropic
llm-eval run --model claude-3-5-sonnet-20241022 --provider anthropic
# DeepSeek (super affordable!)
llm-eval quick --model deepseek-chat
# Google Gemini (NEW!)
llm-eval quick --model gemini-1.5-flash --provider gemini
# Run specific benchmarks (any combination!)
llm-eval benchmark --model gpt-4o --benchmarks mmlu,truthfulqa,arc,safetybench
# Run ALL benchmarks
llm-eval benchmark --model llama3.2:1b --benchmarks mmlu,truthfulqa,hellaswag,arc,winogrande,commonsenseqa,boolq,safetybench,donotanswer,gsm8k
# Full academic evaluation
llm-eval academic --model llama3.2:1b \
--sample-size 500 \
--output-latex results.tex
CLI Commands Reference
| Command | Description |
|---|---|
| `llm-eval quick` | Zero-config evaluation (auto-detects provider) |
| `llm-eval doctor` | Diagnose your setup (dependencies, providers, API keys) |
| `llm-eval download` | Pre-download benchmark datasets (MMLU, TruthfulQA, etc.) |
| `llm-eval run` | Full evaluation on a single model |
| `llm-eval benchmark` | Run specific benchmarks |
| `llm-eval compare` | Compare multiple models side-by-side |
| `llm-eval vs` | Run the same benchmark on multiple models sequentially |
| `llm-eval dashboard` | Launch web dashboard |
| `llm-eval academic` | Academic evaluation with statistics |
| `llm-eval export` | Export results (JSON, CSV, LaTeX, BibTeX) |
| `llm-eval providers` | Check available provider status |
| `llm-eval list-runs` | List saved evaluation runs |
Key Options
# Common options for most commands
-m, --model TEXT # Model name
-p, --provider TYPE # ollama, openai, anthropic, huggingface, deepseek,
# groq, together, fireworks
-s, --sample-size INT # Number of questions to test
-u, --base-url URL # Custom API endpoint (vLLM, LM Studio, Azure)
--cache / --no-cache # Enable/disable caching
# Benchmark selection
-b, --benchmarks TEXT # Comma-separated: mmlu,truthfulqa,hellaswag,arc,
# winogrande,commonsenseqa,boolq,safetybench,donotanswer
VS Command (Model Battle)
Compare models head-to-head:
# Compare two local models
llm-eval vs llama3.2:1b mistral:7b
# Compare with specific benchmarks
llm-eval vs llama3.2:1b mistral:7b -b mmlu,arc -s 50
# Compare models from different providers
llm-eval vs gpt-4o-mini claude-3.5-sonnet -p openai,anthropic
# Ultra-fast with Groq
llm-eval quick --model llama-3.1-8b-instant --provider groq
Python API
from llm_evaluator import ModelEvaluator
from llm_evaluator.providers import OpenAIProvider
provider = OpenAIProvider(model="gpt-4o-mini")
evaluator = ModelEvaluator(provider=provider)
results = evaluator.evaluate_all()
print(f"Overall: {results.overall_score:.1%}")
With caching (10x faster):
from llm_evaluator.providers import CachedProvider, OllamaProvider
provider = OllamaProvider(model="llama3.2:1b")
cached = CachedProvider(provider) # Automatic caching!
evaluator = ModelEvaluator(provider=cached)
results = evaluator.evaluate_all()
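The exact internals of `CachedProvider` aren't shown here, but a response cache keyed by prompt is the core idea. A hedged sketch (the class and its `generate` method are illustrative assumptions, not the toolkit's API):

```python
class SimpleCachedProvider:
    """Illustrative wrapper: memoize model responses keyed by prompt.

    Repeated evaluations ask many identical questions, so a cache hit
    skips the network round-trip entirely - hence the large speedup.
    """

    def __init__(self, provider):
        self.provider = provider
        self._cache = {}

    def generate(self, prompt):
        # Only hit the underlying model on a cache miss
        if prompt not in self._cache:
            self._cache[prompt] = self.provider.generate(prompt)
        return self._cache[prompt]
```

A persistent implementation would serialize the cache to disk so the speedup survives across runs.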
Features
| Feature | Description |
|---|---|
| 10 Benchmarks | MMLU, TruthfulQA, HellaSwag, ARC, WinoGrande, CommonsenseQA, BoolQ, SafetyBench, Do-Not-Answer, GSM8K |
| 108,000+ Questions | Real academic datasets from HuggingFace |
| 9 Providers | Ollama, OpenAI, Anthropic, Google Gemini, DeepSeek, Groq, Together.ai, Fireworks, HuggingFace |
| Docker Support | `docker run llm-benchmark quick` |
| Web Dashboard | Beautiful UI with real-time progress, charts, and history |
| Parallel Execution | 5-10x speedup with `--workers 4` |
| Smart Caching | 10x faster repeated evaluations |
| Academic Rigor | 95% CI, McNemar tests, baseline comparisons |
| Paper Exports | LaTeX tables, BibTeX citations, CSV, JSON |
| Safety Testing | SafetyBench + Do-Not-Answer for security evaluation |
| Math Reasoning | GSM8K (8,500 grade-school math problems) |
| Beautiful CLI | Progress bars, colored output, ETA tracking |
Parallel Execution (5-10x Speedup)
Speed up benchmarks with concurrent API calls:
# 4 parallel workers (4x faster)
llm-eval benchmark --model gpt-4o-mini --provider openai --workers 4 --sample-size 100
# Maximum parallelism for fast providers like Groq
llm-eval benchmark --model llama3-8b-8192 --provider groq --workers 8 --sample-size 500
Note: Set workers based on your provider's rate limits:
- Groq: 8-16 workers (very high rate limits)
- OpenAI: 4-8 workers
- Ollama: 1-2 workers (local, CPU-bound)
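The speedup comes from overlapping network latency: benchmark questions are independent, so API calls can run concurrently. A generic sketch of the pattern using only the standard library (not the toolkit's internals; `ask` stands for any callable that sends one question to a model):

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_parallel(ask, questions, workers=4):
    """Evaluate independent questions concurrently.

    Threads suit this workload because each call is I/O-bound:
    while one request waits on the network, others proceed.
    pool.map preserves input order, so results line up with questions.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ask, questions))
```

This is also why local CPU-bound backends like Ollama gain little from extra workers: there is no network wait to overlap.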
Academic Use
For publication-quality evaluations:
from llm_evaluator import ModelEvaluator
from llm_evaluator.providers import OllamaProvider
from llm_evaluator.export import export_to_latex, generate_bibtex
provider = OllamaProvider(model="llama3.2:1b")
evaluator = ModelEvaluator(provider=provider)
results = evaluator.evaluate_all_academic(
sample_size=500,
compare_baselines=True
)
# 95% confidence intervals
print(f"MMLU: {results.mmlu_accuracy:.1%}")
print(f"95% CI: [{results.mmlu_ci[0]:.1%}, {results.mmlu_ci[1]:.1%}]")
# Compare to GPT-4, Claude, Llama baselines
for baseline, comparison in results.baseline_comparison.items():
print(f"vs {baseline}: {comparison['difference']:+.1%}")
# Export for papers
latex = export_to_latex(results, "My Model")
bibtex = generate_bibtex()
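For intuition, a confidence interval like the one printed above can be approximated by hand. A minimal sketch using the normal approximation for a binomial proportion (the toolkit's actual method may differ, e.g. Wilson intervals; `accuracy_ci` is a hypothetical helper):

```python
import math

def accuracy_ci(correct, total, z=1.96):
    """Approximate 95% CI for an accuracy score (normal approximation)."""
    p = correct / total
    # Standard error of a proportion: sqrt(p * (1 - p) / n)
    margin = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - margin), min(1.0, p + margin)

lo, hi = accuracy_ci(392, 500)  # e.g. 392 of 500 questions correct
print(f"accuracy 78.4%, 95% CI [{lo:.1%}, {hi:.1%}]")
```

This also shows why larger sample sizes matter: the interval width shrinks with the square root of the number of questions.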
Check Available Providers
llm-eval providers
Available Providers:
✅ Auto-detected: openai (gpt-4o-mini)
✅ ollama - Local LLMs (llama3.2, mistral, etc.)
✅ openai - GPT-3.5, GPT-4, GPT-4o
❌ anthropic - Claude 3/3.5 (pip install anthropic)
✅ deepseek - DeepSeek-V3, DeepSeek-R1
❌ huggingface - Inference API
Environment Variables:
✅ OPENAI_API_KEY sk-abc1...
❌ ANTHROPIC_API_KEY Not set
Benchmarks Included
Knowledge & Reasoning (7 benchmarks)
| Benchmark | Questions | Description |
|---|---|---|
| MMLU | 14,042 | Massive Multitask Language Understanding - 57 subjects |
| TruthfulQA | 817 | Truthfulness and avoiding misinformation |
| HellaSwag | 10,042 | Common-sense reasoning and sentence completion |
| ARC-Challenge | 2,590 | Grade-school science questions (hard subset) |
| WinoGrande | 44,000 | Pronoun resolution and commonsense reasoning |
| CommonsenseQA | 12,247 | Commonsense knowledge questions |
| BoolQ | 15,942 | Yes/no reading comprehension questions |
Math Reasoning (1 benchmark)
| Benchmark | Questions | Description |
|---|---|---|
| GSM8K | 8,500 | Grade school math word problems requiring multi-step reasoning |
Safety & Security (2 benchmarks)
| Benchmark | Questions | Description |
|---|---|---|
| SafetyBench | 11,000 | Safety evaluation across multiple risk categories |
| Do-Not-Answer | 939 | Harmful prompt detection and refusal testing |
Total: 10 benchmarks, 108,000+ questions
Contributing
This is open source. Make it better:
git clone https://github.com/NahuelGiudizi/llm-evaluation
cd llm-evaluation
pip install -e ".[dev]"
pytest tests/ -v
Wanted
- Async evaluation for faster throughput
- More benchmarks (HumanEval, GPQA, MT-Bench)
- Batch evaluation mode
- Custom benchmark support
- Kubernetes deployment
Contributors welcome!
Documentation
| Doc | Description |
|---|---|
| Quick Start | Get running in 5 minutes |
| Providers Guide | Ollama, OpenAI, Anthropic, DeepSeek, HuggingFace |
| Benchmarks | MMLU, TruthfulQA, HellaSwag details |
| Academic Usage | Statistical methods, LaTeX export |
| API Reference | Complete Python API documentation |
Docker
Run benchmarks without installing anything:
# Build the image
docker build -t llm-benchmark .
# Quick evaluation with OpenAI
docker run -e OPENAI_API_KEY=$OPENAI_API_KEY llm-benchmark quick
# Ultra-fast with Groq
docker run -e GROQ_API_KEY=$GROQ_API_KEY llm-benchmark quick \
--model llama-3.1-8b-instant --provider groq
# Run specific benchmarks
docker run -e OPENAI_API_KEY=$OPENAI_API_KEY llm-benchmark benchmark \
--model gpt-4o-mini --benchmarks mmlu,truthfulqa -s 50
# Launch dashboard
docker run -p 8888:8888 -e OPENAI_API_KEY=$OPENAI_API_KEY \
llm-benchmark dashboard --host 0.0.0.0
# With docker-compose
docker compose up dashboard
Output Formats
# JSON (default)
llm-eval run --model llama3.2:1b --output results.json
# Export to multiple formats
llm-eval export results.json --format all
# Individual formats
llm-eval export results.json --format csv
llm-eval export results.json --format latex
llm-eval export results.json --format bibtex
# Academic evaluation with direct exports
llm-eval academic --model llama3.2:1b --output-latex table.tex --output-bibtex refs.bib
Provider Testing Status
- ✅ Ollama: fully tested with multiple models (Llama, Mistral, Phi3)
- ⚠️ Gemini: tested with the free tier; works, but has strict rate limits (10 req/min)
- ⚠️ OpenAI, Anthropic, DeepSeek, Groq, Together, Fireworks, HuggingFace: unit tests pass and they should work with valid API keys, but they have not been extensively tested against live endpoints (to avoid subscription costs)
Found an issue? Report it here
For detailed provider documentation, see PROVIDERS.md.
License
MIT License - see LICENSE for details.
Star History
If this project helped you, please star it! ⭐
Made with ❤️ by Nahuel Giudizi
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file llm_benchmark_toolkit-2.4.2.tar.gz.
File metadata
- Download URL: llm_benchmark_toolkit-2.4.2.tar.gz
- Upload date:
- Size: 398.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6b72df5d82834bfca748a3badc45c96c470e3d2badc7f389caf60f9b96828883 |
| MD5 | e1a57cf22c9a1b77074c3f91e6d995d2 |
| BLAKE2b-256 | e69f94e53c1accb629b7506238a1a2dcddf0c2b66105d7dc27dce04b4ce30a11 |
File details
Details for the file llm_benchmark_toolkit-2.4.2-py3-none-any.whl.
File metadata
- Download URL: llm_benchmark_toolkit-2.4.2-py3-none-any.whl
- Upload date:
- Size: 365.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3c56d9744f0b0da156077a8d4b6e783c241fb2125bc6a6feffb78cd8c78ab27f |
| MD5 | 10547aa88a2851478e20a640fdf3e570 |
| BLAKE2b-256 | 5578b1ef2ae73612640393d1fc73c52da4f80c9aa492bc19058ddd234437ebd8 |