
A lightweight and configurable evaluation package

Project description



Your go-to toolkit for lightning-fast, flexible LLM evaluation, from Hugging Face's Leaderboard and Evals Team.





Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends, whether your model is being served somewhere or already loaded in memory. Dive deep into your model's performance by saving and exploring detailed, sample-by-sample results to debug and see how your models stack up.

Customization at your fingertips: browse all our existing tasks and metrics, or effortlessly create a custom task and custom metric tailored to your needs.
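To give a flavor of the second path, here's a minimal sketch of a custom task definition, assuming the LightevalTaskConfig and Doc APIs described in the docs; the dataset repo my_org/my_dataset, the column names, and my_prompt_fn are illustrative placeholders:

from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc


def my_prompt_fn(line, task_name: str = None):
    # Turn one dataset row into a Doc: the query shown to the model,
    # the candidate choices, and the index of the gold answer.
    return Doc(
        task_name=task_name,
        query=line["question"],
        choices=line["choices"],
        gold_index=line["answer"],
    )


my_task = LightevalTaskConfig(
    name="my_custom_task",
    prompt_function=my_prompt_fn,
    suite=["community"],
    hf_repo="my_org/my_dataset",  # placeholder dataset repository
    hf_subset="default",
    evaluation_splits=["test"],
    metric=[Metrics.loglikelihood_acc],
)

# Community task files expose their tasks through a TASKS_TABLE list.
TASKS_TABLE = [my_task]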

Available Tasks

Lighteval supports 1000+ evaluation tasks across multiple domains and languages. Use this space to find what you need, or start with this overview of some popular benchmarks:

📚 Knowledge

  • General Knowledge: MMLU, MMLU-Pro, MMMU, BIG-Bench
  • Question Answering: TriviaQA, Natural Questions, SimpleQA, Humanity's Last Exam (HLE)
  • Specialized: GPQA, AGIEval

🧮 Math and Code

  • Math Problems: GSM8K, GSM-Plus, MATH, MATH500
  • Competition Math: AIME24, AIME25
  • Multilingual Math: MGSM (Grade School Math in 10+ languages)
  • Coding Benchmarks: LCB (LiveCodeBench)

🎯 Chat Model Evaluation

  • Instruction Following: IFEval, IFEval-fr
  • Reasoning: MUSR, DROP (discrete reasoning)
  • Long Context: RULER
  • Dialogue: MT-Bench
  • Holistic Evaluation: HELM, BIG-Bench

🌍 Multilingual Evaluation

  • Cross-lingual: XTREME, Flores200 (200 languages), XCOPA, XQuAD
  • Language-specific:
    • Arabic: ArabicMMLU
    • Filipino: FilBench
    • French: IFEval-fr, GPQA-fr, BAC-fr
    • German: German RAG Eval
    • Serbian: Serbian LLM Benchmark, OZ Eval
    • Turkic: TUMLU (9 Turkic languages)
    • Chinese: CMMLU, CEval, AGIEval
    • Russian: RUMMLU, Russian SQuAD
    • And many more...

🧠 Core Language Understanding

  • NLU: GLUE, SuperGLUE, TriviaQA, Natural Questions
  • Commonsense: HellaSwag, WinoGrande, ProtoQA
  • Natural Language Inference: XNLI
  • Reading Comprehension: SQuAD, XQuAD, MLQA, Belebele

⚡️ Installation

Note: lighteval is currently untested on Windows, and we don't support it yet; it should be fully functional on Mac and Linux.

pip install lighteval

Lighteval supports many optional extras at install time; see here for a complete list.
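For example, to pull in the vLLM backend dependencies (the vllm extra name is assumed here; check the full list for exact names, and note the dev extra used later in this README):

pip install lighteval[vllm]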

If you want to push results to the Hugging Face Hub, authenticate with your access token:

hf auth login
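
Alternatively, you can expose the token through the HF_TOKEN environment variable, which the underlying huggingface_hub client reads:

export HF_TOKEN=<your_token>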

🚀 Quickstart

Lighteval offers the following entry points for model evaluation:

  • lighteval eval: Evaluate models using inspect-ai as a backend (preferred).
  • lighteval accelerate: Evaluate models on CPU or one or more GPUs using 🤗 Accelerate
  • lighteval nanotron: Evaluate models in distributed settings using ⚡️ Nanotron
  • lighteval vllm: Evaluate models on one or more GPUs using 🚀 vLLM
  • lighteval sglang: Evaluate models using SGLang as a backend
  • lighteval endpoint: Evaluate models using various endpoints as a backend

Didn't find what you need? You can always create your own custom model API by following this guide:

  • lighteval custom: Evaluate custom models (can be anything)

Here's a quick command to evaluate a model served through Hugging Face Inference Providers with the eval entry point (the first argument selects the backend and model, the second the task):

lighteval eval "hf-inference-providers/openai/gpt-oss-20b" gpqa:diamond

Or use the Python API to run a model already loaded in memory!

from transformers import AutoModelForCausalLM

from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.models.transformers.transformers_model import TransformersModel, TransformersModelConfig
from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters


MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
BENCHMARKS = "gsm8k"

# Where per-sample and aggregated results get written.
evaluation_tracker = EvaluationTracker(output_dir="./results")
pipeline_params = PipelineParameters(
    launcher_type=ParallelismManager.NONE,
    max_samples=2,  # cap the number of samples per task (handy for smoke tests)
)

# Load the model yourself, then wrap it for lighteval.
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")
config = TransformersModelConfig(model_name=MODEL_NAME, batch_size=1)
model = TransformersModel.from_model(model, config)

pipeline = Pipeline(
    model=model,
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    tasks=BENCHMARKS,
)

pipeline.evaluate()
pipeline.show_results()
results = pipeline.get_results()
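
To work with the scores programmatically, here's a small sketch; it assumes the dictionary returned by get_results() contains a "results" entry mapping task names to metric dictionaries:

# Print every metric for every evaluated task (assumes the structure above).
for task_name, task_metrics in results["results"].items():
    for metric_name, value in task_metrics.items():
        print(f"{task_name} | {metric_name}: {value}")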

🙏 Acknowledgements

Lighteval took inspiration from the following amazing frameworks: EleutherAI's LM Evaluation Harness and Stanford's HELM. We are grateful to their teams for their pioneering work on LLM evaluations.

We'd also like to offer our thanks to all the community members who have contributed to the library, adding new features and reporting or fixing bugs.

🌟 Contributions Welcome 💙💚💛💜🧡

Got ideas? Found a bug? Want to add a task or metric? Contributions are warmly welcomed!

If you're adding a new feature, please open an issue first.

If you open a PR, don't forget to run the styling!

pip install -e .[dev]
pre-commit install
pre-commit run --all-files

📜 Citation

@misc{lighteval,
  author = {Habib, Nathan and Fourrier, Clémentine and Kydlíček, Hynek and Wolf, Thomas and Tunstall, Lewis},
  title = {LightEval: A lightweight framework for LLM evaluation},
  year = {2023},
  version = {0.11.0},
  url = {https://github.com/huggingface/lighteval}
}



Download files

Download the file for your platform.

Source Distribution

lighteval-0.13.0.tar.gz (425.9 kB)


Built Distribution


lighteval-0.13.0-py3-none-any.whl (612.7 kB)


File details

Details for the file lighteval-0.13.0.tar.gz.

File metadata

  • Download URL: lighteval-0.13.0.tar.gz
  • Size: 425.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for lighteval-0.13.0.tar.gz:

  • SHA256: 156ae469a9ea472f2a08ac5c6b0f06a44de2a097b91149c9da3a23e8d793b345
  • MD5: 05d8bb61d2a8dadf4df244bfa7bd830d
  • BLAKE2b-256: e165c3963c69296ea622c05c131e95732a7c206db478fdaf1cb59b211f21531b


File details

Details for the file lighteval-0.13.0-py3-none-any.whl.

File metadata

  • Download URL: lighteval-0.13.0-py3-none-any.whl
  • Size: 612.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for lighteval-0.13.0-py3-none-any.whl:

  • SHA256: 2552944f09e90680394f051bc109ca951cff694002ac3365c99fd6b6bb394dcf
  • MD5: cd4f39ee16a7a2d4c1be97ab1fe9ca9b
  • BLAKE2b-256: d03d7b56fe26973d2e914bd9bd4fbf6278a10c063ed809c6da6164dd2962d4da

