LiteBench
A pip-installable benchmark runner for LLMs and agents. Five minutes to your first eval.
What is this?
- `inspect_ai` is powerful but heavy — you write `Solver` and `Scorer` classes.
- `lm-evaluation-harness` is thorough, but research-oriented and slow to set up.
- `promptfoo` tests prompts, not full agents.
LiteBench sits in the middle: an opinionated CLI for app developers who want to benchmark their model or agent on common tasks (HumanEval / GSM8K / MMLU / MATH / TruthfulQA / ARC) without having to write a framework first.
```shell
pip install litebench
litebench list
litebench run gsm8k -m deepseek/deepseek-chat -n 50
litebench run humaneval -m gpt-5 -n 20
litebench run mmlu -m claude-sonnet-4-6 --subject computer_security -n 100
litebench run math -m kimi -n 50

# Custom YAML tasks
litebench run ./my-task.yaml -m gpt-4o-mini

# Compare models
litebench runs
litebench compare <run-id-1> <run-id-2>
```
Features
- 6 built-in tasks — HumanEval, GSM8K, MMLU, MATH-500, TruthfulQA, ARC-Challenge.
- 100+ model providers via litellm — OpenAI, Anthropic, Gemini, DeepSeek, Kimi, Qwen, GLM, local Ollama, and more. Shortcuts built in: `-m opus`, `-m kimi`, `-m deepseek`.
- Streaming datasets via HuggingFace `datasets` — no manual downloads.
- Local SQLite run history — diff runs across models and days.
- Async concurrency — `--concurrency 8` by default, safely parallel.
- Custom YAML tasks — point at a YAML or JSONL file and go. Supports `number`/`mc`/`regex`/`string`/`llm-judge` scorers.
- LLM-as-judge — plug in a grader model for free-form tasks.
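The bounded-concurrency behavior described above can be sketched with an `asyncio.Semaphore` — this is an illustration of the idea, not LiteBench's internals:

```python
import asyncio

async def eval_sample(sem: asyncio.Semaphore, i: int) -> int:
    # At most `concurrency` evaluations are in flight at once.
    async with sem:
        await asyncio.sleep(0.01)  # stand-in for a model API call
        return i

async def run_all(n: int, concurrency: int = 8) -> list[int]:
    sem = asyncio.Semaphore(concurrency)
    # gather() preserves argument order, so results line up with samples.
    return await asyncio.gather(*(eval_sample(sem, i) for i in range(n)))

results = asyncio.run(run_all(20))
print(len(results))  # 20
```

With a real provider call in place of the sleep, the semaphore keeps you under rate limits while still parallelizing the run.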
Install
```shell
pip install litebench
```
Then set the API key for whatever provider you plan to hit:
```shell
export OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
export GEMINI_API_KEY=...
# etc.
```
Usage
Run a built-in task
```shell
litebench run gsm8k -m deepseek/deepseek-chat -n 100 --concurrency 8
```
Output:
```text
gsm8k · deepseek/deepseek-chat
Samples       100
Accuracy      85.0% (85/100)
Mean latency  3420 ms
Tokens        prompt=22,100 completion=58,743
Duration      57.3s
Run ID        a51819c4
```
Model shortcuts
The CLI accepts either a full litellm string or one of the shortcuts:
| Shortcut | Resolves to |
|---|---|
| `opus` | `claude-opus-4-7` |
| `sonnet` | `claude-sonnet-4-6` |
| `haiku` | `claude-haiku-4-5-20251001` |
| `gpt-5` | `gpt-5` |
| `gpt-4o` | `gpt-4o` |
| `gemini` | `gemini/gemini-2.5-pro` |
| `deepseek` | `deepseek/deepseek-chat` |
| `kimi` | `openrouter/moonshotai/kimi-k2.6` |
| `qwen` | `openrouter/qwen/qwen3.5-max` |
| `glm` | `openrouter/zhipu/glm-5` |
Custom YAML task
Create my-task.yaml:
```yaml
name: sql-questions
description: Ask for a SQL query, grade with a pattern.
scorer: regex
regex: "SELECT\\s+.*FROM\\s+users"
system_prompt: |
  Return only a SQL query, nothing else.
samples:
  - input: "Get every user's email."
    target: "SELECT email FROM users"
  - input: "Get active users."
    target: "SELECT * FROM users WHERE active = TRUE"
```
Then run it:
```shell
litebench run my-task.yaml -m gpt-4o-mini
```
Supported scorers: `number` / `mc` / `regex` / `string` (default: substring match) / `llm-judge`.
For `llm-judge`, add `judge_model: gpt-4o-mini` (or any litellm-supported model).
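The `regex` and `string` checks above amount to a pattern search or substring match against the completion. An illustrative sketch (names and case handling are my assumptions, not LiteBench's API):

```python
import re

def regex_scorer(completion: str, pattern: str) -> bool:
    # Search the completion for the YAML's `regex:` pattern.
    return re.search(pattern, completion) is not None

def string_scorer(completion: str, target: str) -> bool:
    # Default substring match, compared case-insensitively here.
    return target.lower() in completion.lower()

print(regex_scorer("SELECT email FROM users;", r"SELECT\s+.*FROM\s+users"))  # True
```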
You can also load samples from JSONL instead of inline:
```yaml
name: my-task
scorer: string
samples_jsonl: ./data.jsonl
```
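Each JSONL line then carries one sample. Assuming the same `input`/`target` keys as the inline form, `data.jsonl` would look like:

```jsonl
{"input": "Get every user's email.", "target": "SELECT email FROM users"}
{"input": "Get active users.", "target": "SELECT * FROM users WHERE active = TRUE"}
```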
Compare runs
```text
$ litebench runs
Recent runs
┏━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ Run      ┃ Task  ┃ Model       ┃ Samples ┃ Accuracy ┃ When             ┃
┡━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ 10ab7654 │ gsm8k │ gpt-4o      │ 100     │ 89.0%    │ 2026-04-23 17:38 │
│ 86d845e0 │ gsm8k │ gpt-4o-mini │ 100     │ 80.0%    │ 2026-04-23 17:37 │
└──────────┴───────┴─────────────┴─────────┴──────────┴──────────────────┘

$ litebench compare 10ab7654 86d845e0
Comparing 2 runs
┏━━━━━━━━━━━━━┳━━━━━━━┳━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Model       ┃ Task  ┃ N   ┃ Accuracy ┃ Mean latency ┃ Tokens (p/c)  ┃
┡━━━━━━━━━━━━━╇━━━━━━━╇━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ gpt-4o      │ gsm8k │ 100 │ 89.0%    │ 3710ms       │ 8,700 / 23.9k │
│ gpt-4o-mini │ gsm8k │ 100 │ 80.0%    │ 4230ms       │ 8,700 / 22.3k │
└─────────────┴───────┴─────┴──────────┴──────────────┴───────────────┘
```
Built-in tasks
| Task | Description | Dataset |
|---|---|---|
| `humaneval` | Code completion, executed against hidden tests | `openai_humaneval` |
| `gsm8k` | Grade-school word problems | `gsm8k` (main, test) |
| `mmlu` | 57-subject multiple choice; use `--subject` | `cais/mmlu` |
| `math` | Competition-level math, answer in `\boxed{…}` | `HuggingFaceH4/MATH-500` |
| `truthfulqa` | MC1 single-correct multiple choice | `truthful_qa` (multiple_choice) |
| `arc` | AI2 science exam; `--arc-easy` for Easy split | `allenai/ai2_arc` (Challenge) |
Agent mode
Pass a task that exposes tools and LiteBench runs a full multi-turn rollout instead of a single chat:
```shell
litebench run gsm8k-agent -m gpt-5 -n 50
```
The built-in gsm8k-agent task gives the model a calculator tool and a
final_answer tool, then scores whichever number it submits. The recorded
per-sample trace (tool name, arguments, result) is kept in the SQLite history
and can be dumped with --json-out:
```text
gsm8k-agent-0 | correct=True | steps=3 | final="18"
  → calculator({'expression': '16 - 3 - 4'}) = 9
  → calculator({'expression': '9 * 2'}) = 18
  → final_answer({'answer': '18'}) = 18
```
Custom agent tasks are a Python subclass (`AgentTask`) — see `src/litebench/tasks/gsm8k_agent.py`.
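The rollout loop behind a trace like the one above is conceptually simple: dispatch each tool call, record it, and stop when `final_answer` fires. A self-contained sketch of the idea with a scripted stand-in for the model — this is not LiteBench's `AgentTask` API (see the source file for the real interface), and `eval` is used only to keep the toy calculator short:

```python
# Tools the "agent" may call; final_answer ends the rollout.
TOOLS = {
    "calculator": lambda args: eval(args["expression"], {"__builtins__": {}}),
    "final_answer": lambda args: args["answer"],
}

# Scripted tool calls standing in for an LLM's function-calling turns.
scripted_calls = [
    ("calculator", {"expression": "16 - 3 - 4"}),
    ("calculator", {"expression": "9 * 2"}),
    ("final_answer", {"answer": "18"}),
]

def rollout(calls, max_steps=10):
    trace = []
    for name, args in calls[:max_steps]:
        result = TOOLS[name](args)
        trace.append((name, args, result))  # per-step record, like the history above
        if name == "final_answer":
            return result, trace
    return None, trace  # ran out of steps without a final answer

final, trace = rollout(scripted_calls)
print(final, len(trace))  # 18 3
```

In a real run, each tool result would be fed back to the model as a tool message before it chooses the next call.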
Web dashboard
```shell
pip install 'litebench[web]'
litebench serve
# → open http://127.0.0.1:8600
```
Three tabs:
- Runs — every run you've saved, clickable for full sample-by-sample breakdown (including per-sample agent tool traces).
- Compare — accuracy heatmap across (task × model), shows the latest run per pair.
- Tasks — the built-in task registry.
Pure single-file HTML + vanilla JS — no React, no build step, works offline.
Roadmap
- ✅ Phase 1 — MVP CLI, 3 tasks, SQLite history
- ✅ Phase 2 — 6 tasks, YAML custom, LLM judge, 31 regression tests
- ✅ Phase 3 — Agent mode (tool-use eval via litellm function calling), 10 more tests
- ✅ Phase 4 — Web dashboard (`litebench serve`), 5 more tests
Contributing
Issues and PRs welcome. `pytest tests/` should stay green.
License
MIT
File details
Details for the file litebench-0.3.0.tar.gz.

File metadata
- Download URL: litebench-0.3.0.tar.gz
- Size: 34.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `5468b540f20e641e48daf05ce948231c91ebbc4351d1f353c7d4012c83a94e99` |
| MD5 | `38698f3d312463b7a69d1f882c93ffd6` |
| BLAKE2b-256 | `533359a846bd9435f60abe71e1e634e645ad497d5b17ee3662a794d8a819ad81` |

Provenance
The following attestation bundles were made for litebench-0.3.0.tar.gz:

Publisher: publish.yml on he-yufeng/LiteBench
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: litebench-0.3.0.tar.gz
- Subject digest: `5468b540f20e641e48daf05ce948231c91ebbc4351d1f353c7d4012c83a94e99`
- Sigstore transparency entry: 1362881828
- Permalink: he-yufeng/LiteBench@87674a004d329438cd476e22f4c2d8577b8dc140
- Branch / Tag: refs/heads/main
- Owner: https://github.com/he-yufeng
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@87674a004d329438cd476e22f4c2d8577b8dc140
- Trigger Event: workflow_dispatch
File details
Details for the file litebench-0.3.0-py3-none-any.whl.

File metadata
- Download URL: litebench-0.3.0-py3-none-any.whl
- Size: 41.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `dcbe0ac8adc3e0c1ba0105794da5f62230ffea21720a77e5c0b9dedf3f0aeb3f` |
| MD5 | `5909084f44cb60e55c8a638c7005d44f` |
| BLAKE2b-256 | `0ebc5b0efe50ba51bfbcc3afe2755d77c52473b71cf1a7dd29c0a1e4a33f202b` |

Provenance
The following attestation bundles were made for litebench-0.3.0-py3-none-any.whl:

Publisher: publish.yml on he-yufeng/LiteBench
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: litebench-0.3.0-py3-none-any.whl
- Subject digest: `dcbe0ac8adc3e0c1ba0105794da5f62230ffea21720a77e5c0b9dedf3f0aeb3f`
- Sigstore transparency entry: 1362881887
- Permalink: he-yufeng/LiteBench@87674a004d329438cd476e22f4c2d8577b8dc140
- Branch / Tag: refs/heads/main
- Owner: https://github.com/he-yufeng
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@87674a004d329438cd476e22f4c2d8577b8dc140
- Trigger Event: workflow_dispatch