
LiteBench

A pip-installable benchmark runner for LLMs and agents. Five minutes to your first eval.

PyPI Python License: MIT

Chinese documentation (中文文档)

What is this?

inspect_ai is powerful but heavy — you write Solver and Scorer classes. lm-evaluation-harness is thorough but research-oriented and slow to set up. promptfoo tests prompts, not full agents.

LiteBench sits in the middle: an opinionated CLI for app developers who want to benchmark their model or agent on common tasks (HumanEval / GSM8K / MMLU / MATH / TruthfulQA / ARC) without having to write a framework first.

pip install litebench

litebench list
litebench run gsm8k -m deepseek/deepseek-chat -n 50
litebench run humaneval -m gpt-5 -n 20
litebench run mmlu -m claude-sonnet-4-6 --subject computer_security -n 100
litebench run math -m kimi -n 50

# Custom YAML tasks
litebench run ./my-task.yaml -m gpt-4o-mini

# Compare models
litebench runs
litebench compare <run-id-1> <run-id-2>

Features

  • 6 built-in tasks — HumanEval, GSM8K, MMLU, MATH-500, TruthfulQA, ARC-Challenge.
  • 100+ model providers via litellm — OpenAI, Anthropic, Gemini, DeepSeek, Kimi, Qwen, GLM, local Ollama, and more. Shortcuts built in: -m opus, -m kimi, -m deepseek.
  • Streaming datasets via HuggingFace datasets — no manual downloads.
  • Local SQLite run history — diff runs across models and days.
  • Async concurrency — --concurrency 8 by default, safely parallel.
  • Custom YAML tasks — point at a YAML or JSONL and go. Supports number / mc / regex / string / llm-judge scorers.
  • LLM-as-judge — plug a grader model in for free-form tasks.

Install

pip install litebench

Then set the API key for whatever provider you plan to hit:

export OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
export GEMINI_API_KEY=...
# etc.
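
Model strings pass straight through to litellm, so anything litellm can route works, including local models with no API key at all. As a sketch (this assumes litellm's standard ollama/ prefix and an Ollama server running on its default port):

```shell
# No API key needed when hitting a local Ollama server
litebench run gsm8k -m ollama/llama3 -n 20
```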

Usage

Run a built-in task

litebench run gsm8k -m deepseek/deepseek-chat -n 100 --concurrency 8

Output:

           gsm8k · deepseek/deepseek-chat
 Samples       100
 Accuracy      85.0%  (85/100)
 Mean latency  3420 ms
 Tokens        prompt=22,100  completion=58,743
 Duration      57.3s
 Run ID        a51819c4

Model shortcuts

The CLI accepts either a full litellm string or one of the shortcuts:

Shortcut  Resolves to
opus      claude-opus-4-7
sonnet    claude-sonnet-4-6
haiku     claude-haiku-4-5-20251001
gpt-5     gpt-5
gpt-4o    gpt-4o
gemini    gemini/gemini-2.5-pro
deepseek  deepseek/deepseek-chat
kimi      openrouter/moonshotai/kimi-k2.6
qwen      openrouter/qwen/qwen3.5-max
glm       openrouter/zhipu/glm-5

Custom YAML task

Create my-task.yaml:

name: sql-questions
description: Ask for a SQL query, grade with a pattern.
scorer: regex
regex: "SELECT\\s+.*FROM\\s+users"
system_prompt: |
  Return only a SQL query, nothing else.
samples:
  - input: "Get every user's email."
    target: "SELECT email FROM users"
  - input: "Get active users."
    target: "SELECT * FROM users WHERE active = TRUE"

Then run it:

litebench run my-task.yaml -m gpt-4o-mini

Supported scorers: number / mc / regex / string (default: substring match) / llm-judge.
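
A regex pattern like the one in my-task.yaml can be sanity-checked locally before spending any tokens. The snippet below assumes the regex scorer does a plain, case-sensitive re.search over the model output (an assumption about the scorer's matching behavior, not something the docs confirm):

```python
import re

# Pattern from my-task.yaml; re.search matches anywhere in the output
pattern = re.compile(r"SELECT\s+.*FROM\s+users")

good = "SELECT email FROM users WHERE active = TRUE"
bad = "select email from users"  # lowercase fails a case-sensitive match

print(bool(pattern.search(good)))  # True
print(bool(pattern.search(bad)))   # False
```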

For llm-judge, add judge_model: gpt-4o-mini (or any litellm-supported model).
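
As a sketch of a free-form task graded by a judge model (the task name, prompt, and samples here are invented for illustration; scorer, judge_model, system_prompt, and samples are the fields documented above):

```yaml
name: summarize-ticket
scorer: llm-judge
judge_model: gpt-4o-mini
system_prompt: |
  Summarize the support ticket in one sentence.
samples:
  - input: "App crashes on startup after the 2.3 update; rolling back fixes it."
    target: "A one-sentence summary mentioning the crash and the 2.3 update."
```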

You can also load samples from JSONL instead of inline:

name: my-task
scorer: string
samples_jsonl: ./data.jsonl
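
Assuming each JSONL line carries the same input/target keys as an inline sample (an assumption based on the YAML form above), data.jsonl would look like:

```jsonl
{"input": "Get every user's email.", "target": "SELECT email FROM users"}
{"input": "Get active users.", "target": "SELECT * FROM users WHERE active = TRUE"}
```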

Compare runs

$ litebench runs
                                Recent runs
┏━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ Run      ┃ Task  ┃ Model       ┃ Samples ┃ Accuracy ┃ When             ┃
┡━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ 10ab7654 │ gsm8k │ gpt-4o      │     100 │    89.0% │ 2026-04-23 17:38 │
│ 86d845e0 │ gsm8k │ gpt-4o-mini │     100 │    80.0% │ 2026-04-23 17:37 │
└──────────┴───────┴─────────────┴─────────┴──────────┴──────────────────┘

$ litebench compare 10ab7654 86d845e0
                              Comparing 2 runs
┏━━━━━━━━━━━━━┳━━━━━━━┳━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Model       ┃ Task  ┃ N   ┃ Accuracy ┃ Mean latency ┃ Tokens (p/c)  ┃
┡━━━━━━━━━━━━━╇━━━━━━━╇━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ gpt-4o      │ gsm8k │ 100 │    89.0% │       3710ms │ 8,700 / 23.9k │
│ gpt-4o-mini │ gsm8k │ 100 │    80.0% │       4230ms │ 8,700 / 22.3k │
└─────────────┴───────┴─────┴──────────┴──────────────┴───────────────┘

Built-in tasks

Task        Description                                     Dataset
humaneval   Code completion, executed against hidden tests  openai_humaneval
gsm8k       Grade-school word problems                      gsm8k (main, test)
mmlu        57-subject multiple choice; use --subject       cais/mmlu
math        Competition-level math, answer in \boxed{…}     HuggingFaceH4/MATH-500
truthfulqa  MC1 single-correct multiple choice              truthful_qa (multiple_choice)
arc         AI2 science exam; --arc-easy for Easy split     allenai/ai2_arc (Challenge)

Roadmap

  • ✅ Phase 1 — MVP CLI, 3 tasks, SQLite history
  • ✅ Phase 2 — 6 tasks, YAML custom, LLM judge, 31 regression tests
  • ⏳ Phase 3 — Agent mode (tool-use eval via litellm function calling)
  • ⏳ Phase 4 — Web dashboard (FastAPI + React, litebench serve)

Contributing

Issues and PRs welcome. pytest tests/ should stay green.

License

MIT

Download files

Download the file for your platform.

Source Distribution

litebench-0.1.0.tar.gz (23.2 kB)

Uploaded Source

Built Distribution


litebench-0.1.0-py3-none-any.whl (28.6 kB)

Uploaded Python 3

File details

Details for the file litebench-0.1.0.tar.gz.

File metadata

  • Download URL: litebench-0.1.0.tar.gz
  • Upload date:
  • Size: 23.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for litebench-0.1.0.tar.gz
Algorithm   Hash digest
SHA256 2179df228f5169ee22889a1c2b558e5738534c9b270dce567d92deee76eb9a3d
MD5 ecd2c03f600e2e0b6c70cfd5ecbad6a8
BLAKE2b-256 991e5cdf76c3147374387fe45af4e3d1fae974f08e7c125dedf5bae39cec398f


Provenance

The following attestation bundles were made for litebench-0.1.0.tar.gz:

Publisher: publish.yml on he-yufeng/LiteBench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file litebench-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: litebench-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 28.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for litebench-0.1.0-py3-none-any.whl
Algorithm   Hash digest
SHA256 cef1a276976072cabd33d61795d6126a92f3c290d43d2e4b69133d3421c34f26
MD5 d2cf6cee426e45e2c4f94f7c1f1ba580
BLAKE2b-256 58d7c03a8efefa26af134f9827f43384db9c92adc4825a152f9563b53da39c4d


Provenance

The following attestation bundles were made for litebench-0.1.0-py3-none-any.whl:

Publisher: publish.yml on he-yufeng/LiteBench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
