
LiteBench

A pip-installable benchmark runner for LLMs and agents. Five minutes to your first eval.


Chinese documentation

What is this?

inspect_ai is powerful but heavy — you write Solver and Scorer classes. lm-evaluation-harness is thorough but research-oriented and slow to set up. promptfoo tests prompts, not full agents.

LiteBench sits in the middle: an opinionated CLI for app developers who want to benchmark their model or agent on common tasks (HumanEval / GSM8K / MMLU / MATH / TruthfulQA / ARC) without having to write a framework first.

pip install litebench

litebench list
litebench run gsm8k -m deepseek/deepseek-chat -n 50
litebench run humaneval -m gpt-5 -n 20
litebench run mmlu -m claude-sonnet-4-6 --subject computer_security -n 100
litebench run math -m kimi -n 50

# Custom YAML tasks
litebench run ./my-task.yaml -m gpt-4o-mini

# Compare models
litebench runs
litebench compare <run-id-1> <run-id-2>

Features

  • 6 built-in tasks — HumanEval, GSM8K, MMLU, MATH-500, TruthfulQA, ARC-Challenge.
  • 100+ model providers via litellm — OpenAI, Anthropic, Gemini, DeepSeek, Kimi, Qwen, GLM, local Ollama, and more. Shortcuts built in: -m opus, -m kimi, -m deepseek.
  • Streaming datasets via HuggingFace datasets — no manual downloads.
  • Local SQLite run history — diff runs across models and days.
  • Async concurrency — --concurrency 8 by default; requests run safely in parallel.
  • Custom YAML tasks — point at a YAML or JSONL and go. Supports number / mc / regex / string / llm-judge scorers.
  • LLM-as-judge — plug a grader model in for free-form tasks.
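The bounded async concurrency mentioned above can be sketched with an asyncio.Semaphore. This is an illustrative sketch of how a --concurrency-style cap works, not LiteBench's actual internals; all names here are made up:

```python
import asyncio

async def run_sample(sem: asyncio.Semaphore, sample_id: int) -> int:
    """Run one eval sample while holding a concurrency slot."""
    async with sem:
        await asyncio.sleep(0.01)  # stand-in for a model API call
        return sample_id

async def run_all(n_samples: int, concurrency: int = 8) -> list[int]:
    # At most `concurrency` samples are in flight at any moment.
    sem = asyncio.Semaphore(concurrency)
    tasks = [run_sample(sem, i) for i in range(n_samples)]
    return await asyncio.gather(*tasks)  # preserves input order

results = asyncio.run(run_all(20))
print(len(results))  # 20
```

The semaphore pattern keeps the event loop simple: every sample is scheduled up front, but only eight awaits can be inside a provider call at once.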

Install

pip install litebench

Then set the API key for whatever provider you plan to hit:

export OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
export GEMINI_API_KEY=...
# etc.

Usage

Run a built-in task

litebench run gsm8k -m deepseek/deepseek-chat -n 100 --concurrency 8

Output:

           gsm8k · deepseek/deepseek-chat
 Samples       100
 Accuracy      85.0%  (85/100)
 Mean latency  3420 ms
 Tokens        prompt=22,100  completion=58,743
 Duration      57.3s
 Run ID        a51819c4

Model shortcuts

The CLI accepts either a full litellm string or one of the shortcuts:

Shortcut   Resolves to
opus       claude-opus-4-7
sonnet     claude-sonnet-4-6
haiku      claude-haiku-4-5-20251001
gpt-5      gpt-5
gpt-4o     gpt-4o
gemini     gemini/gemini-2.5-pro
deepseek   deepseek/deepseek-chat
kimi       openrouter/moonshotai/kimi-k2.6
qwen       openrouter/qwen/qwen3.5-max
glm        openrouter/zhipu/glm-5
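Conceptually the table above is just a name-to-model mapping; resolution can be sketched as a dict lookup that passes unknown names straight through to litellm. This is an illustration of the behavior, not the CLI's actual code:

```python
# Shortcut table transcribed from the README; anything not listed
# is treated as a full litellm model string and passed through.
SHORTCUTS = {
    "opus": "claude-opus-4-7",
    "sonnet": "claude-sonnet-4-6",
    "haiku": "claude-haiku-4-5-20251001",
    "gemini": "gemini/gemini-2.5-pro",
    "deepseek": "deepseek/deepseek-chat",
    "kimi": "openrouter/moonshotai/kimi-k2.6",
    "qwen": "openrouter/qwen/qwen3.5-max",
    "glm": "openrouter/zhipu/glm-5",
}

def resolve_model(name: str) -> str:
    """Map a CLI shortcut to a full litellm model string."""
    return SHORTCUTS.get(name, name)

print(resolve_model("deepseek"))     # deepseek/deepseek-chat
print(resolve_model("gpt-4o-mini"))  # gpt-4o-mini (pass-through)
```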

Custom YAML task

Create my-task.yaml:

name: sql-questions
description: Ask for a SQL query, grade with a pattern.
scorer: regex
regex: "SELECT\\s+.*FROM\\s+users"
system_prompt: |
  Return only a SQL query, nothing else.
samples:
  - input: "Get every user's email."
    target: "SELECT email FROM users"
  - input: "Get active users."
    target: "SELECT * FROM users WHERE active = TRUE"

Then run it:

litebench run my-task.yaml -m gpt-4o-mini

Supported scorers: number / mc / regex / string (default: substring match) / llm-judge.
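To make the scorer names concrete, here are minimal sketches of what the string, regex, and number scorers plausibly do, based only on the descriptions above (substring match, pattern match, numeric answer extraction). These are assumptions, not LiteBench's actual implementations:

```python
import re

def score_string(output: str, target: str) -> bool:
    """Default string scorer: case-insensitive substring match."""
    return target.strip().lower() in output.strip().lower()

def score_regex(output: str, pattern: str) -> bool:
    """Pass if the pattern matches anywhere in the output."""
    return re.search(pattern, output) is not None

def score_number(output: str, target: str) -> bool:
    """Compare the last number in the output against the target."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", output.replace(",", ""))
    return bool(nums) and float(nums[-1]) == float(target)

print(score_number("So the answer is 1,234.", "1234"))  # True
```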

For llm-judge, add judge_model: gpt-4o-mini (or any litellm-supported model).
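Assuming the same task-file fields as the regex example above, an llm-judge task might look like this (the task name and rubric wording are illustrative):

```yaml
name: summary-quality
description: Free-form summaries graded by an LLM judge.
scorer: llm-judge
judge_model: gpt-4o-mini
samples:
  - input: "Summarize: The mitochondria is the powerhouse of the cell."
    target: "A one-sentence summary mentioning mitochondria and energy."
```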

You can also load samples from JSONL instead of inline:

name: my-task
scorer: string
samples_jsonl: ./data.jsonl
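Assuming JSONL rows carry the same input/target fields as the inline samples (one JSON object per line), data.jsonl would look something like:

```jsonl
{"input": "Capital of France?", "target": "Paris"}
{"input": "2 + 2 = ?", "target": "4"}
```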

Compare runs

$ litebench runs
                                Recent runs
┏━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ Run       Task   Model        Samples  Accuracy  When             ┃
┡━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ 10ab7654  gsm8k  gpt-4o           100     89.0%  2026-04-23 17:38 │
│ 86d845e0  gsm8k  gpt-4o-mini      100     80.0%  2026-04-23 17:37 │
└──────────┴───────┴─────────────┴─────────┴──────────┴──────────────────┘

$ litebench compare 10ab7654 86d845e0
                              Comparing 2 runs
┏━━━━━━━━━━━━━┳━━━━━━━┳━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Model        Task   N    Accuracy  Mean latency  Tokens (p/c)  ┃
┡━━━━━━━━━━━━━╇━━━━━━━╇━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ gpt-4o       gsm8k  100     89.0%        3710ms   8,700 / 23.9k│
│ gpt-4o-mini  gsm8k  100     80.0%        4230ms   8,700 / 22.3k│
└─────────────┴───────┴─────┴──────────┴──────────────┴───────────────┘

Built-in tasks

Task        Description                                      Dataset
humaneval   Code completion, executed against hidden tests   openai_humaneval
gsm8k       Grade-school word problems                       gsm8k (main, test)
mmlu        57-subject multiple choice; use --subject        cais/mmlu
math        Competition-level math, answer in \boxed{…}      HuggingFaceH4/MATH-500
truthfulqa  MC1 single-correct multiple choice               truthful_qa (multiple_choice)
arc         AI2 science exam; --arc-easy for Easy split      allenai/ai2_arc (Challenge)

Agent mode

Pass a task that exposes tools and LiteBench runs a full multi-turn rollout instead of a single chat completion:

litebench run gsm8k-agent -m gpt-5 -n 50

The built-in gsm8k-agent task gives the model a calculator tool and a final_answer tool, then scores whichever number it submits. The recorded per-sample trace (tool name, arguments, result) is kept in the SQLite history and can be dumped with --json-out:

gsm8k-agent-0 | correct=True | steps=3 | final="18"
  → calculator({'expression': '16 - 3 - 4'}) = 9
  → calculator({'expression': '9 * 2'}) = 18
  → final_answer({'answer': '18'}) = 18

Custom agent tasks are written as Python subclasses of AgentTask; see src/litebench/tasks/gsm8k_agent.py for the built-in example.
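The roadmap describes agent mode as tool-use eval via litellm function calling, which uses the OpenAI-compatible tool schema. A calculator tool in that schema, paired with a safe local implementation, could be sketched as follows; the schema format is standard, but the names and the evaluator are illustrative, not LiteBench's actual definitions:

```python
import ast
import operator

# OpenAI-style tool schema, the shape accepted by litellm's tools= parameter.
CALCULATOR_TOOL = {
    "type": "function",
    "function": {
        "name": "calculator",
        "description": "Evaluate a basic arithmetic expression.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
}

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expression: str) -> float:
    """Safely evaluate +, -, *, / expressions without eval()."""
    def ev(node):
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)

print(calculator("16 - 3 - 4"))  # 9
print(calculator("9 * 2"))       # 18
```

A rollout loop would pass CALCULATOR_TOOL in the tools list of each model call, execute calculator() on every tool call the model emits, and feed the result back as a tool message until the model submits a final answer.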

Web dashboard

pip install 'litebench[web]'
litebench serve
# → open http://127.0.0.1:8600

Three tabs:

  • Runs — every run you've saved, clickable for full sample-by-sample breakdown (including per-sample agent tool traces).
  • Compare — accuracy heatmap across (task × model), shows the latest run per pair.
  • Tasks — the built-in task registry.

Pure single-file HTML + vanilla JS — no React, no build step, works offline.

Roadmap

  • ✅ Phase 1 — MVP CLI, 3 tasks, SQLite history
  • ✅ Phase 2 — 6 tasks, YAML custom, LLM judge, 31 regression tests
  • ✅ Phase 3 — Agent mode (tool-use eval via litellm function calling), 10 more tests
  • ✅ Phase 4 — Web dashboard (litebench serve), 5 more tests

Contributing

Issues and PRs welcome. pytest tests/ should stay green.

License

MIT
