
Evaluation Framework


Aleph Alpha Eval-Framework

Comprehensive LLM evaluation at scale - A production-ready framework for evaluating large language models across 90+ benchmarks.


Why Choose This Framework?

  • Scalability: Built for distributed evaluation, currently with an integration for Determined AI.
  • Extensibility: Easily add custom models, benchmarks, and metrics with object-oriented base classes.
  • Comprehensive: Comes pre-loaded with over 90 tasks covering a broad and diverse range, from reasoning and coding to safety and long-context. Also comes with a comprehensive set of metrics, including LLM-as-a-judge evaluations.

Other features

  • Flexible Model Integration: Supports models loaded via HuggingFace Transformers or custom implementations using the BaseLLM class.
  • Custom Benchmarks: Easily add new benchmarks with minimal code using the BaseTask class.
  • Custom Metrics: Easily define new metrics using the BaseMetric class.
  • Perturbation Testing: Robustness analysis with configurable perturbation types and probabilities.
  • Rich Outputs: Generates JSON results, plots, and detailed analysis reports.
  • Statistical Analysis: Includes confidence intervals and significance testing for reliable comparisons.
  • Docker Support: Pre-configured Dockerfiles for local and distributed setups.
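To make the statistical-analysis feature concrete, the snippet below is a standalone sketch (not the framework's API) of how a bootstrap confidence interval over per-sample 0/1 correctness scores can be computed; the sample scores and parameter values are made up for illustration:

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """95% bootstrap confidence interval for the mean of per-sample scores."""
    rng = random.Random(seed)
    # Resample with replacement and collect the mean of each resample
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(scores) / len(scores), (lo, hi)

# Hypothetical 0/1 correctness scores for 20 evaluated samples
scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1]
mean, (low, high) = bootstrap_ci(scores)
```

Intervals like this make clear whether an observed gap between two models on a small sample is meaningful or noise.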

For full documentation, visit our Docs Page.

Quick Start

The codebase is tested and compatible with Python 3.12 and PyTorch 2.5. You will also need the appropriate CUDA dependencies and version installed on your system for GPU support. Detailed installation instructions can be found here.

The easiest way to get started is to install the library via pip and use it as an external dependency.

pip install eval_framework

There are optional extras available to unlock specific features of the library:

  • api for inference using the aleph-alpha client.
  • comet for the COMET metric.
  • determined for running jobs via Determined AI.
  • mistral for inference on Mistral models.
  • transformers for inference using the transformers library.
  • vllm for inference via vLLM.

As a shorthand, the all extra installs all of the above.

We use uv to better resolve dependencies when downloading the extras. You can install uv with:

curl -LsSf https://astral.sh/uv/install.sh | sh

or by following the uv installation docs.

Now, you can safely install the project with all optional extras:

uv sync --all-extras

or with uv's pip interface:

uv pip install "eval_framework[all]"

Tip: ensure Python is properly installed with uv:

uv python install 3.12 --reinstall

We provide custom groups to control optional extras.

  • flash_attn: Install flash_attn with correct handling of build isolation

Thus, the following will set up the project with flash_attn:

uv sync --all-extras --group flash_attn

To evaluate a single benchmark locally, you can use the following command:

eval_framework \
    --models src/eval_framework/llm/models.py \
    --llm-name Smollm135MInstruct \
    --task-name "MMLU" \
    --task-subjects "abstract_algebra" \
    --output-dir ./eval_results \
    --num-fewshot 5 \
    --num-samples 10

For more detailed CLI usage instructions, see the CLI Usage Guide.

Benchmark Coverage & Task Categories

Core Capabilities

Subset of the core-capability benchmarks covered by eval-framework:

| Reasoning | Knowledge | Math | Coding | Structured outputs | Long Context |
|---|---|---|---|---|---|
| COPA, BalancedCOPA | ARC | AIME | BigCodeBench | IFEval | InfiniteBench |
| HellaSwag | MMLU | GSM8K | HumanEval | StructEval | QuALITY |
| Winogrande | OpenBookQA | MATH-500 | MBPP | | ZeroSCROLLS |

Languages & Domains

Subset of the language-specific and domain-specific benchmarks covered by eval-framework:

| Multilingual | Specialized | Safety & Bias | Efficiency Metrics |
|---|---|---|---|
| WMT Translation | MMLU | TruthfulQA | Compression ratios |
| FLORES-200 | Legal (CaseHold) | Winogender | Runtime |
| Multilingual MMLU | Scientific (SciQ) | | |
| German/Finnish tasks | | | |

Completion

Tasks focused on logical reasoning, text distillation, instruction following, and output control. Examples include:

  • AIME 2024: Logical Reasoning (Math)
  • DUC Abstractive: Text Distillation (Extraction)
  • Custom Data: Complaint Summarization: Text Distillation (Summarization)

Loglikelihoods

Tasks emphasizing classification, reasoning, and open QA. Examples include:

  • AI2 Reasoning Challenge (ARC): Classification
  • CaseHold: Open QA

Long-Context

Tasks designed for long-context scenarios, including QA, summarization, and aggregation. Examples include:

  • InfiniteBench_CodeDebug: Programming
  • ZeroSCROLLS GovReport: QA (Government)

Metrics

Evaluation metrics include:

  • Completion Metrics: Accuracy, BLEU, F1, ROUGE
  • Loglikelihood Metrics: Accuracy Loglikelihood, Probability Mass
  • LLM Metrics: Chatbot Style Judge, Instruction Judge
  • Efficiency Metrics: Bytes per Sequence Position
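As an illustration of the loglikelihood metrics above (the framework's actual implementation may differ), loglikelihood accuracy checks whether the correct choice received the highest log-probability, while probability mass measures how much of the normalized probability over the choices lands on the correct one. The log-probability values below are made up:

```python
import math

def loglikelihood_accuracy(choice_logprobs, correct_idx):
    """1.0 if the correct choice has the highest log-probability, else 0.0."""
    best = max(range(len(choice_logprobs)), key=lambda i: choice_logprobs[i])
    return float(best == correct_idx)

def probability_mass(choice_logprobs, correct_idx):
    """Normalized probability assigned to the correct choice (softmax over choices)."""
    probs = [math.exp(lp) for lp in choice_logprobs]
    return probs[correct_idx] / sum(probs)

# Hypothetical summed log-probabilities for four answer choices
logprobs = [-4.2, -1.1, -3.0, -5.6]
acc = loglikelihood_accuracy(logprobs, correct_idx=1)
mass = probability_mass(logprobs, correct_idx=1)
```

Probability mass is the softer of the two: a model can be "right" by accuracy while still spreading most of its probability across the wrong choices.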

For the full list of tasks and metrics, see Detailed Task Table.

Getting Started

Understanding the Evaluation Framework

Eval-Framework provides a unified interface for evaluating language models across diverse benchmarks. The framework follows this interaction model:

  1. Define Your Model - Specify which model to evaluate (HuggingFace, API, or custom)
  2. Choose Your Task - Select from the 90+ available benchmarks or create custom ones
  3. Configure Evaluation - Set parameters like few-shot examples, sample count, and output format
  4. Run Evaluation - Execute locally via CLI/script or distribute via Determined AI
  5. Analyze Results - Review detailed JSON outputs, metrics, and generated reports

Core Components

  • Models: Defined via BaseLLM interface (HuggingFace, OpenAI, custom APIs)
  • Tasks: Inherit from BaseTask (completion, loglikelihood, or LLM-judge based)
  • Metrics: Automatic scoring via BaseMetric classes
  • Formatters: Handle prompt construction and model-specific formatting
  • Results: Structured outputs with sample-level details and aggregated statistics
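To make the metric abstraction concrete, here is a self-contained sketch. The framework's real BaseMetric interface lives in eval_framework, so the stub base class and the calculate method name below are hypothetical stand-ins, not the actual API:

```python
from abc import ABC, abstractmethod

# Hypothetical stand-in for the framework's BaseMetric interface;
# the real class lives in eval_framework and its method names may differ.
class BaseMetric(ABC):
    @abstractmethod
    def calculate(self, completion: str, reference: str) -> float: ...

class ExactMatch(BaseMetric):
    """Scores 1.0 when the model's completion matches the reference,
    ignoring surrounding whitespace and letter case."""
    def calculate(self, completion: str, reference: str) -> float:
        return float(completion.strip().lower() == reference.strip().lower())

metric = ExactMatch()
score = metric.calculate(" Paris ", "paris")
```

The point is the shape of the abstraction: a metric is a small class with one scoring method, which is what lets the framework aggregate any custom metric alongside the built-in ones.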

Your First Evaluation

  1. Install the framework (see Quick Start above):
pip install eval_framework[transformers]
  2. Create and run your first evaluation using a HuggingFace model:
from functools import partial
from pathlib import Path

from eval_framework.llm.huggingface import HFLLM
from eval_framework.main import main
from eval_framework.tasks.eval_config import EvalConfig
from template_formatting.formatter import HFFormatter

# Define your model
class MyHuggingFaceModel(HFLLM):
    LLM_NAME = "microsoft/DialoGPT-medium"
    DEFAULT_FORMATTER = partial(HFFormatter, "microsoft/DialoGPT-medium")

if __name__ == "__main__":
    # Initialize your model
    llm = MyHuggingFaceModel()

    # Running evaluation on MMLU abstract algebra task using 5 few-shot examples and 10 samples
    config = EvalConfig(
        output_dir=Path("./eval_results"),
        num_fewshot=5,
        num_samples=10,
        task_name="MMLU",
        task_subjects=["abstract_algebra", "astronomy"],
        llm_class=MyHuggingFaceModel,
    )

    # Run evaluation and get results
    results = main(llm=llm, config=config)
  3. Review results - Check ./eval_results/ for detailed outputs and use our results guide to interpret them
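When reviewing results, you will typically aggregate the sample-level JSON records per task or subject. The record schema below is purely hypothetical (consult the results guide for the real layout); it only illustrates the kind of aggregation you might do:

```python
import json

# Hypothetical sample-level records; the real output schema is
# documented in the results guide and may differ.
raw = """
[
  {"task": "MMLU", "subject": "abstract_algebra", "correct": true},
  {"task": "MMLU", "subject": "abstract_algebra", "correct": false},
  {"task": "MMLU", "subject": "astronomy", "correct": true}
]
"""

samples = json.loads(raw)
by_subject = {}
for s in samples:
    by_subject.setdefault(s["subject"], []).append(s["correct"])

# Per-subject accuracy from the raw correctness flags
accuracy = {subj: sum(v) / len(v) for subj, v in by_subject.items()}
```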

Next Steps

Documentation

Getting Started

Advanced Usage

Scaling & Production

Contributing

  • Contributing Guide - Guide for contributing to this project
  • Testing - Guide for running tests comparable to the CI pipelines

Citation

If you use eval-framework in your research, please cite:

@software{eval_framework,
  title={Aleph Alpha Eval Framework},
  year={2025},
  url={https://github.com/Aleph-Alpha-Research/eval-framework}
}

License

This project is licensed under the Apache License 2.0.



This project has received funding from the European Union’s Digital Europe Programme under grant agreement No. 101195233 (OpenEuroLLM).

The contents of this publication are the sole responsibility of the OpenEuroLLM consortium and do not necessarily reflect the opinion of the European Union.
