
A lightweight and configurable evaluation package

Project description



Your go-to toolkit for lightning-fast, flexible LLM evaluation, from Hugging Face's Leaderboard and Evals Team.





Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends, whether your model is being served somewhere or already loaded in memory. Dive deep into your model's performance by saving and exploring detailed, sample-by-sample results to debug and see how your models stack up.

Customization at your fingertips: browse all our existing tasks and metrics, or effortlessly create a custom task and custom metric tailored to your needs.
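To give a flavor of the second path, here's a minimal sketch of a custom task definition, assuming the LightevalTaskConfig and Doc APIs described in the docs; the dataset repo my_org/my_dataset, the column names, and my_prompt_fn are illustrative placeholders:

from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc


def my_prompt_fn(line, task_name: str = None):
    # Turn one dataset row into a Doc: the query shown to the model,
    # the candidate choices, and the index of the gold answer.
    return Doc(
        task_name=task_name,
        query=line["question"],
        choices=line["choices"],
        gold_index=line["answer"],
    )


my_task = LightevalTaskConfig(
    name="my_custom_task",
    prompt_function=my_prompt_fn,
    suite=["community"],
    hf_repo="my_org/my_dataset",  # placeholder dataset repository
    hf_subset="default",
    evaluation_splits=["test"],
    metric=[Metrics.loglikelihood_acc],
)

# Community task files expose their tasks through a TASKS_TABLE list.
TASKS_TABLE = [my_task]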

Available Tasks

Lighteval supports 1000+ evaluation tasks across multiple domains and languages. Use this space to find what you need, or start with this overview of some popular benchmarks:

📚 Knowledge

  • General Knowledge: MMLU, MMLU-Pro, MMMU, BIG-Bench
  • Question Answering: TriviaQA, Natural Questions, SimpleQA, Humanity's Last Exam (HLE)
  • Specialized: GPQA, AGIEval

🧮 Math and Code

  • Math Problems: GSM8K, GSM-Plus, MATH, MATH500
  • Competition Math: AIME24, AIME25
  • Multilingual Math: MGSM (Grade School Math in 10+ languages)
  • Coding Benchmarks: LCB (LiveCodeBench)

🎯 Chat Model Evaluation

  • Instruction Following: IFEval, IFEval-fr
  • Reasoning: MUSR, DROP (discrete reasoning)
  • Long Context: RULER
  • Dialogue: MT-Bench
  • Holistic Evaluation: HELM, BIG-Bench

🌍 Multilingual Evaluation

  • Cross-lingual: XTREME, Flores200 (200 languages), XCOPA, XQuAD
  • Language-specific:
    • Arabic: ArabicMMLU
    • Filipino: FilBench
    • French: IFEval-fr, GPQA-fr, BAC-fr
    • German: German RAG Eval
    • Serbian: Serbian LLM Benchmark, OZ Eval
    • Turkic: TUMLU (9 Turkic languages)
    • Chinese: CMMLU, CEval, AGIEval
    • Russian: RUMMLU, Russian SQuAD
    • And many more...

🧠 Core Language Understanding

  • NLU: GLUE, SuperGLUE, TriviaQA, Natural Questions
  • Commonsense: HellaSwag, WinoGrande, ProtoQA
  • Natural Language Inference: XNLI
  • Reading Comprehension: SQuAD, XQuAD, MLQA, Belebele

⚡️ Installation

Note: lighteval is currently untested on Windows, and we don't support it yet; it should be fully functional on Mac and Linux.

pip install lighteval

Lighteval supports many optional extras at install time; see here for a complete list.
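For example, to pull in the vLLM backend dependencies (the vllm extra name is assumed here; check the full list for exact names, and note the dev extra used later in this README):

pip install lighteval[vllm]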

If you want to push results to the Hugging Face Hub, authenticate with your access token:

hf auth login
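
Alternatively, you can expose the token through the HF_TOKEN environment variable, which the underlying huggingface_hub client reads:

export HF_TOKEN=<your_token>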

🚀 Quickstart

Lighteval offers the following entry points for model evaluation:

  • lighteval eval: Evaluate models using inspect-ai as a backend (preferred).
  • lighteval accelerate: Evaluate models on CPU or one or more GPUs using 🤗 Accelerate
  • lighteval nanotron: Evaluate models in distributed settings using ⚡️ Nanotron
  • lighteval vllm: Evaluate models on one or more GPUs using 🚀 vLLM
  • lighteval sglang: Evaluate models using SGLang as a backend
  • lighteval endpoint: Evaluate models using various endpoints as a backend

Didn't find what you need? You can always create your own custom model API by following this guide:

  • lighteval custom: Evaluate custom models (can be anything)

Here's a quick command to evaluate a model served through Hugging Face Inference Providers with the eval entry point (the first argument selects the backend and model, the second the task):

lighteval eval "hf-inference-providers/openai/gpt-oss-20b" gpqa:diamond

Or use the Python API to run a model already loaded in memory!

from transformers import AutoModelForCausalLM

from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.models.transformers.transformers_model import TransformersModel, TransformersModelConfig
from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters


MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
BENCHMARKS = "gsm8k"

# Where per-sample and aggregated results get written.
evaluation_tracker = EvaluationTracker(output_dir="./results")
pipeline_params = PipelineParameters(
    launcher_type=ParallelismManager.NONE,
    max_samples=2,  # cap the number of samples per task (handy for smoke tests)
)

# Load the model yourself, then wrap it for lighteval.
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")
config = TransformersModelConfig(model_name=MODEL_NAME, batch_size=1)
model = TransformersModel.from_model(model, config)

pipeline = Pipeline(
    model=model,
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    tasks=BENCHMARKS,
)

pipeline.evaluate()
pipeline.show_results()
results = pipeline.get_results()
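
To work with the scores programmatically, here's a small sketch; it assumes the dictionary returned by get_results() contains a "results" entry mapping task names to metric dictionaries:

# Print every metric for every evaluated task (assumes the structure above).
for task_name, task_metrics in results["results"].items():
    for metric_name, value in task_metrics.items():
        print(f"{task_name} | {metric_name}: {value}")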

🙏 Acknowledgements

Lighteval took inspiration from the following amazing frameworks: EleutherAI's LM Evaluation Harness and Stanford's HELM. We are grateful to their teams for their pioneering work on LLM evaluations.

We'd also like to offer our thanks to all the community members who have contributed to the library, adding new features and reporting or fixing bugs.

🌟 Contributions Welcome 💙💚💛💜🧡

Got ideas? Found a bug? Want to add a task or metric? Contributions are warmly welcomed!

If you're adding a new feature, please open an issue first.

If you open a PR, don't forget to run the styling!

pip install -e .[dev]
pre-commit install
pre-commit run --all-files

📜 Citation

@misc{lighteval,
  author = {Habib, Nathan and Fourrier, Clémentine and Kydlíček, Hynek and Wolf, Thomas and Tunstall, Lewis},
  title = {LightEval: A lightweight framework for LLM evaluation},
  year = {2023},
  version = {0.11.0},
  url = {https://github.com/huggingface/lighteval}
}



Download files

Download the file for your platform.

Source Distribution

lighteval-0.13.0.tar.gz (425.9 kB)


Built Distribution


lighteval-0.13.0-py3-none-any.whl (612.7 kB)


File details

Details for the file lighteval-0.13.0.tar.gz.

File metadata

  • Download URL: lighteval-0.13.0.tar.gz
  • Size: 425.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for lighteval-0.13.0.tar.gz:

  • SHA256: 156ae469a9ea472f2a08ac5c6b0f06a44de2a097b91149c9da3a23e8d793b345
  • MD5: 05d8bb61d2a8dadf4df244bfa7bd830d
  • BLAKE2b-256: e165c3963c69296ea622c05c131e95732a7c206db478fdaf1cb59b211f21531b


File details

Details for the file lighteval-0.13.0-py3-none-any.whl.

File metadata

  • Download URL: lighteval-0.13.0-py3-none-any.whl
  • Size: 612.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for lighteval-0.13.0-py3-none-any.whl:

  • SHA256: 2552944f09e90680394f051bc109ca951cff694002ac3365c99fd6b6bb394dcf
  • MD5: cd4f39ee16a7a2d4c1be97ab1fe9ca9b
  • BLAKE2b-256: d03d7b56fe26973d2e914bd9bd4fbf6278a10c063ed809c6da6164dd2962d4da

