
LiteBench

A pip-installable benchmark runner for LLMs and agents. Five minutes to your first eval.


Chinese documentation

What is this?

inspect_ai is powerful but heavy — you write Solver and Scorer classes. lm-evaluation-harness is thorough but research-oriented and slow to set up. promptfoo tests prompts, not full agents.

LiteBench sits in the middle: an opinionated CLI for app developers who want to benchmark their model or agent on common tasks (HumanEval / GSM8K / MMLU / MATH / TruthfulQA / ARC) without having to write a framework first.

pip install litebench

litebench list
litebench run gsm8k -m deepseek/deepseek-chat -n 50
litebench run humaneval -m gpt-5 -n 20
litebench run mmlu -m claude-sonnet-4-6 --subject computer_security -n 100
litebench run math -m kimi -n 50

# Custom YAML tasks
litebench run ./my-task.yaml -m gpt-4o-mini

# Compare models
litebench runs
litebench compare <run-id-1> <run-id-2>

Features

  • 6 built-in tasks — HumanEval, GSM8K, MMLU, MATH-500, TruthfulQA, ARC-Challenge.
  • 100+ model providers via litellm — OpenAI, Anthropic, Gemini, DeepSeek, Kimi, Qwen, GLM, local Ollama, and more. Shortcuts built in: -m opus, -m kimi, -m deepseek.
  • Streaming datasets via HuggingFace datasets — no manual downloads.
  • Local SQLite run history — diff runs across models and days.
  • Async concurrency — --concurrency 8 by default; requests run safely in parallel.
  • Custom YAML tasks — point at a YAML or JSONL and go. Supports number / mc / regex / string / llm-judge scorers.
  • LLM-as-judge — plug a grader model in for free-form tasks.
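The bounded async concurrency mentioned above can be sketched with an asyncio.Semaphore. This is an illustrative sketch of how a --concurrency-style cap works, not LiteBench's actual internals; all names here are made up:

```python
import asyncio

async def run_sample(sem: asyncio.Semaphore, sample_id: int) -> int:
    """Run one eval sample while holding a concurrency slot."""
    async with sem:
        await asyncio.sleep(0.01)  # stand-in for a model API call
        return sample_id

async def run_all(n_samples: int, concurrency: int = 8) -> list[int]:
    # At most `concurrency` samples are in flight at any moment.
    sem = asyncio.Semaphore(concurrency)
    tasks = [run_sample(sem, i) for i in range(n_samples)]
    return await asyncio.gather(*tasks)  # preserves input order

results = asyncio.run(run_all(20))
print(len(results))  # 20
```

The semaphore pattern keeps the event loop simple: every sample is scheduled up front, but only eight awaits can be inside a provider call at once.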

Install

pip install litebench

Then set the API key for whatever provider you plan to hit:

export OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
export GEMINI_API_KEY=...
# etc.

Usage

Run a built-in task

litebench run gsm8k -m deepseek/deepseek-chat -n 100 --concurrency 8

Output:

           gsm8k · deepseek/deepseek-chat
 Samples       100
 Accuracy      85.0%  (85/100)
 Mean latency  3420 ms
 Tokens        prompt=22,100  completion=58,743
 Duration      57.3s
 Run ID        a51819c4

Model shortcuts

The CLI accepts either a full litellm string or one of the shortcuts:

Shortcut   Resolves to
opus       claude-opus-4-7
sonnet     claude-sonnet-4-6
haiku      claude-haiku-4-5-20251001
gpt-5      gpt-5
gpt-4o     gpt-4o
gemini     gemini/gemini-2.5-pro
deepseek   deepseek/deepseek-chat
kimi       openrouter/moonshotai/kimi-k2.6
qwen       openrouter/qwen/qwen3.5-max
glm        openrouter/zhipu/glm-5
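Conceptually the table above is just a name-to-model mapping; resolution can be sketched as a dict lookup that passes unknown names straight through to litellm. This is an illustration of the behavior, not the CLI's actual code:

```python
# Shortcut table transcribed from the README; anything not listed
# is treated as a full litellm model string and passed through.
SHORTCUTS = {
    "opus": "claude-opus-4-7",
    "sonnet": "claude-sonnet-4-6",
    "haiku": "claude-haiku-4-5-20251001",
    "gemini": "gemini/gemini-2.5-pro",
    "deepseek": "deepseek/deepseek-chat",
    "kimi": "openrouter/moonshotai/kimi-k2.6",
    "qwen": "openrouter/qwen/qwen3.5-max",
    "glm": "openrouter/zhipu/glm-5",
}

def resolve_model(name: str) -> str:
    """Map a CLI shortcut to a full litellm model string."""
    return SHORTCUTS.get(name, name)

print(resolve_model("deepseek"))     # deepseek/deepseek-chat
print(resolve_model("gpt-4o-mini"))  # gpt-4o-mini (pass-through)
```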

Custom YAML task

Create my-task.yaml:

name: sql-questions
description: Ask for a SQL query, grade with a pattern.
scorer: regex
regex: "SELECT\\s+.*FROM\\s+users"
system_prompt: |
  Return only a SQL query, nothing else.
samples:
  - input: "Get every user's email."
    target: "SELECT email FROM users"
  - input: "Get active users."
    target: "SELECT * FROM users WHERE active = TRUE"

Then run it:

litebench run my-task.yaml -m gpt-4o-mini

Supported scorers: number / mc / regex / string (default: substring match) / llm-judge.
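To make the scorer names concrete, here are minimal sketches of what the string, regex, and number scorers plausibly do, based only on the descriptions above (substring match, pattern match, numeric answer extraction). These are assumptions, not LiteBench's actual implementations:

```python
import re

def score_string(output: str, target: str) -> bool:
    """Default string scorer: case-insensitive substring match."""
    return target.strip().lower() in output.strip().lower()

def score_regex(output: str, pattern: str) -> bool:
    """Pass if the pattern matches anywhere in the output."""
    return re.search(pattern, output) is not None

def score_number(output: str, target: str) -> bool:
    """Compare the last number in the output against the target."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", output.replace(",", ""))
    return bool(nums) and float(nums[-1]) == float(target)

print(score_number("So the answer is 1,234.", "1234"))  # True
```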

For llm-judge, add judge_model: gpt-4o-mini (or any litellm-supported model).
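Assuming the same task-file fields as the regex example above, an llm-judge task might look like this (the task name and rubric wording are illustrative):

```yaml
name: summary-quality
description: Free-form summaries graded by an LLM judge.
scorer: llm-judge
judge_model: gpt-4o-mini
samples:
  - input: "Summarize: The mitochondria is the powerhouse of the cell."
    target: "A one-sentence summary mentioning mitochondria and energy."
```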

You can also load samples from JSONL instead of inline:

name: my-task
scorer: string
samples_jsonl: ./data.jsonl
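Assuming JSONL rows carry the same input/target fields as the inline samples (one JSON object per line), data.jsonl would look something like:

```jsonl
{"input": "Capital of France?", "target": "Paris"}
{"input": "2 + 2 = ?", "target": "4"}
```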

Compare runs

$ litebench runs
                                Recent runs
┏━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ Run       Task   Model        Samples  Accuracy  When             ┃
┡━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ 10ab7654  gsm8k  gpt-4o           100     89.0%  2026-04-23 17:38 │
│ 86d845e0  gsm8k  gpt-4o-mini      100     80.0%  2026-04-23 17:37 │
└──────────┴───────┴─────────────┴─────────┴──────────┴──────────────────┘

$ litebench compare 10ab7654 86d845e0
                              Comparing 2 runs
┏━━━━━━━━━━━━━┳━━━━━━━┳━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Model        Task   N    Accuracy  Mean latency  Tokens (p/c)  ┃
┡━━━━━━━━━━━━━╇━━━━━━━╇━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ gpt-4o       gsm8k  100     89.0%        3710ms   8,700 / 23.9k│
│ gpt-4o-mini  gsm8k  100     80.0%        4230ms   8,700 / 22.3k│
└─────────────┴───────┴─────┴──────────┴──────────────┴───────────────┘

Built-in tasks

Task        Description                                      Dataset
humaneval   Code completion, executed against hidden tests   openai_humaneval
gsm8k       Grade-school word problems                       gsm8k (main, test)
mmlu        57-subject multiple choice; use --subject        cais/mmlu
math        Competition-level math, answer in \boxed{…}      HuggingFaceH4/MATH-500
truthfulqa  MC1 single-correct multiple choice               truthful_qa (multiple_choice)
arc         AI2 science exam; --arc-easy for Easy split      allenai/ai2_arc (Challenge)

Agent mode

Pass a task that exposes tools and LiteBench runs a full multi-turn rollout instead of a single chat completion:

litebench run gsm8k-agent -m gpt-5 -n 50

The built-in gsm8k-agent task gives the model a calculator tool and a final_answer tool, then scores whichever number it submits. The recorded per-sample trace (tool name, arguments, result) is kept in the SQLite history and can be dumped with --json-out:

gsm8k-agent-0 | correct=True | steps=3 | final="18"
  → calculator({'expression': '16 - 3 - 4'}) = 9
  → calculator({'expression': '9 * 2'}) = 18
  → final_answer({'answer': '18'}) = 18

Custom agent tasks are written as Python subclasses of AgentTask; see src/litebench/tasks/gsm8k_agent.py for the built-in example.
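The roadmap describes agent mode as tool-use eval via litellm function calling, which uses the OpenAI-compatible tool schema. A calculator tool in that schema, paired with a safe local implementation, could be sketched as follows; the schema format is standard, but the names and the evaluator are illustrative, not LiteBench's actual definitions:

```python
import ast
import operator

# OpenAI-style tool schema, the shape accepted by litellm's tools= parameter.
CALCULATOR_TOOL = {
    "type": "function",
    "function": {
        "name": "calculator",
        "description": "Evaluate a basic arithmetic expression.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
}

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expression: str) -> float:
    """Safely evaluate +, -, *, / expressions without eval()."""
    def ev(node):
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)

print(calculator("16 - 3 - 4"))  # 9
print(calculator("9 * 2"))       # 18
```

A rollout loop would pass CALCULATOR_TOOL in the tools list of each model call, execute calculator() on every tool call the model emits, and feed the result back as a tool message until the model submits a final answer.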

Web dashboard

pip install 'litebench[web]'
litebench serve
# → open http://127.0.0.1:8600

Three tabs:

  • Runs — every run you've saved, clickable for full sample-by-sample breakdown (including per-sample agent tool traces).
  • Compare — accuracy heatmap across (task × model), shows the latest run per pair.
  • Tasks — the built-in task registry.

Pure single-file HTML + vanilla JS — no React, no build step, works offline.

Roadmap

  • ✅ Phase 1 — MVP CLI, 3 tasks, SQLite history
  • ✅ Phase 2 — 6 tasks, YAML custom, LLM judge, 31 regression tests
  • ✅ Phase 3 — Agent mode (tool-use eval via litellm function calling), 10 more tests
  • ✅ Phase 4 — Web dashboard (litebench serve), 5 more tests

Contributing

Issues and PRs welcome. pytest tests/ should stay green.

License

MIT
