
Evaluation Framework


Aleph Alpha Eval-Framework

Comprehensive LLM evaluation at scale - A production-ready framework for evaluating large language models across 90+ benchmarks.


Why Choose This Framework?

  • Scalability: Built for distributed evaluation, currently with an integration for Determined AI.
  • Extensibility: Easily add custom models, benchmarks, and metrics with object-oriented base classes.
  • Comprehensive: Comes pre-loaded with over 90 tasks covering a broad and diverse range, from reasoning and coding to safety and long-context. Also comes with a comprehensive set of metrics, including LLM-as-a-judge evaluations.

Other features

  • Flexible Model Integration: Supports models loaded via HuggingFace Transformers or custom implementations using the BaseLLM class.
  • Custom Benchmarks: Easily add new benchmarks with minimal code using the BaseTask class.
  • Custom Metrics: Easily define new metrics using the BaseMetric class.
  • Perturbation Testing: Robustness analysis with configurable perturbation types and probabilities.
  • Rich Outputs: Generates JSON results, plots, and detailed analysis reports.
  • Statistical Analysis: Includes confidence intervals and significance testing for reliable comparisons.
  • Docker Support: Pre-configured Dockerfiles for local and distributed setups.
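To make the statistical-analysis feature concrete, the snippet below is a standalone sketch (not the framework's API) of how a bootstrap confidence interval over per-sample 0/1 correctness scores can be computed; the sample scores and parameter values are made up for illustration:

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """95% bootstrap confidence interval for the mean of per-sample scores."""
    rng = random.Random(seed)
    # Resample with replacement and collect the mean of each resample
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(scores) / len(scores), (lo, hi)

# Hypothetical 0/1 correctness scores for 20 evaluated samples
scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1]
mean, (low, high) = bootstrap_ci(scores)
```

Intervals like this make clear whether an observed gap between two models on a small sample is meaningful or noise.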

For full documentation, visit our Docs Page.

Quick Start

The codebase is tested and compatible with Python 3.12 and PyTorch 2.5. You will also need the appropriate CUDA dependencies and version installed on your system for GPU support. Detailed installation instructions can be found here.

The easiest way to get started is to install the library via pip and use it as an external dependency.

pip install eval_framework

There are optional extras available to unlock specific features of the library:

  • api for inference using the aleph-alpha client.
  • comet for the COMET metric.
  • determined for running jobs via Determined AI.
  • mistral for inference on Mistral models.
  • transformers for inference using the transformers library.
  • vllm for inference via vLLM.

As a shorthand, the all extra installs all of the above.

We use uv to better resolve dependencies when downloading the extras. You can install uv with:

curl -LsSf https://astral.sh/uv/install.sh | sh

or by following the uv installation docs.

Now, you can safely install the project with all optional extras:

uv sync --all-extras

or with uv's pip interface:

uv pip install "eval_framework[all]"

Tip: ensure Python is properly installed with uv:

uv python install 3.12 --reinstall

We provide custom groups to control optional extras.

  • flash_attn: Install flash_attn with correct handling of build isolation

Thus, the following will set up the project with flash_attn:

uv sync --all-extras --group flash_attn

To evaluate a single benchmark locally, you can use the following command:

eval_framework \
    --models src/eval_framework/llm/models.py \
    --llm-name Smollm135MInstruct \
    --task-name "MMLU" \
    --task-subjects "abstract_algebra" \
    --output-dir ./eval_results \
    --num-fewshot 5 \
    --num-samples 10

For more detailed CLI usage instructions, see the CLI Usage Guide.

Benchmark Coverage & Task Categories

Core Capabilities

Subset of the core-capability benchmarks covered by eval-framework:

| Reasoning | Knowledge | Math | Coding | Structured outputs | Long Context |
|---|---|---|---|---|---|
| COPA, BalancedCOPA | ARC | AIME | BigCodeBench | IFEval | InfiniteBench |
| HellaSwag | MMLU | GSM8K | HumanEval | StructEval | QuALITY |
| Winogrande | OpenBookQA | MATH-500 | MBPP | | ZeroSCROLLS |

Languages & Domains

Subset of the language-specific and domain-specific benchmarks covered by eval-framework:

| Multilingual | Specialized | Safety & Bias | Efficiency Metrics |
|---|---|---|---|
| WMT Translation | MMLU | TruthfulQA | Compression ratios |
| FLORES-200 | Legal (CaseHold) | Winogender | Runtime |
| Multilingual MMLU | Scientific (SciQ) | | |
| German/Finnish tasks | | | |

Completion

Tasks focused on logical reasoning, text distillation, instruction following, and output control. Examples include:

  • AIME 2024: Logical Reasoning (Math)
  • DUC Abstractive: Text Distillation (Extraction)
  • Custom Data: Complaint Summarization: Text Distillation (Summarization)

Loglikelihoods

Tasks emphasizing classification, reasoning, and open QA. Examples include:

  • AI2 Reasoning Challenge (ARC): Classification
  • CaseHold: Open QA

Long-Context

Tasks designed for long-context scenarios, including QA, summarization, and aggregation. Examples include:

  • InfiniteBench_CodeDebug: Programming
  • ZeroSCROLLS GovReport: QA (Government)

Metrics

Evaluation metrics include:

  • Completion Metrics: Accuracy, BLEU, F1, ROUGE
  • Loglikelihood Metrics: Accuracy Loglikelihood, Probability Mass
  • LLM Metrics: Chatbot Style Judge, Instruction Judge
  • Efficiency Metrics: Bytes per Sequence Position
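As an illustration of the loglikelihood metrics above (the framework's actual implementation may differ), loglikelihood accuracy checks whether the correct choice received the highest log-probability, while probability mass measures how much of the normalized probability over the choices lands on the correct one. The log-probability values below are made up:

```python
import math

def loglikelihood_accuracy(choice_logprobs, correct_idx):
    """1.0 if the correct choice has the highest log-probability, else 0.0."""
    best = max(range(len(choice_logprobs)), key=lambda i: choice_logprobs[i])
    return float(best == correct_idx)

def probability_mass(choice_logprobs, correct_idx):
    """Normalized probability assigned to the correct choice (softmax over choices)."""
    probs = [math.exp(lp) for lp in choice_logprobs]
    return probs[correct_idx] / sum(probs)

# Hypothetical summed log-probabilities for four answer choices
logprobs = [-4.2, -1.1, -3.0, -5.6]
acc = loglikelihood_accuracy(logprobs, correct_idx=1)
mass = probability_mass(logprobs, correct_idx=1)
```

Probability mass is the softer of the two: a model can be "right" by accuracy while still spreading most of its probability across the wrong choices.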

For the full list of tasks and metrics, see Detailed Task Table.

Getting Started

Understanding the Evaluation Framework

Eval-Framework provides a unified interface for evaluating language models across diverse benchmarks. The framework follows this interaction model:

  1. Define Your Model - Specify which model to evaluate (HuggingFace, API, or custom)
  2. Choose Your Task - Select from the 90+ available benchmarks or create custom ones
  3. Configure Evaluation - Set parameters like few-shot examples, sample count, and output format
  4. Run Evaluation - Execute locally via CLI/script or distribute via Determined AI
  5. Analyze Results - Review detailed JSON outputs, metrics, and generated reports

Core Components

  • Models: Defined via BaseLLM interface (HuggingFace, OpenAI, custom APIs)
  • Tasks: Inherit from BaseTask (completion, loglikelihood, or LLM-judge based)
  • Metrics: Automatic scoring via BaseMetric classes
  • Formatters: Handle prompt construction and model-specific formatting
  • Results: Structured outputs with sample-level details and aggregated statistics
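To make the metric abstraction concrete, here is a self-contained sketch. The framework's real BaseMetric interface lives in eval_framework, so the stub base class and the calculate method name below are hypothetical stand-ins, not the actual API:

```python
from abc import ABC, abstractmethod

# Hypothetical stand-in for the framework's BaseMetric interface;
# the real class lives in eval_framework and its method names may differ.
class BaseMetric(ABC):
    @abstractmethod
    def calculate(self, completion: str, reference: str) -> float: ...

class ExactMatch(BaseMetric):
    """Scores 1.0 when the model's completion matches the reference,
    ignoring surrounding whitespace and letter case."""
    def calculate(self, completion: str, reference: str) -> float:
        return float(completion.strip().lower() == reference.strip().lower())

metric = ExactMatch()
score = metric.calculate(" Paris ", "paris")
```

The point is the shape of the abstraction: a metric is a small class with one scoring method, which is what lets the framework aggregate any custom metric alongside the built-in ones.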

Your First Evaluation

  1. Install the framework (see Quick Start above):
pip install eval_framework[transformers]
  2. Create and run your first evaluation using a HuggingFace model:
from functools import partial
from pathlib import Path

from eval_framework.llm.huggingface import HFLLM
from eval_framework.main import main
from eval_framework.tasks.eval_config import EvalConfig
from template_formatting.formatter import HFFormatter

# Define your model
class MyHuggingFaceModel(HFLLM):
    LLM_NAME = "microsoft/DialoGPT-medium"
    DEFAULT_FORMATTER = partial(HFFormatter, "microsoft/DialoGPT-medium")

if __name__ == "__main__":
    # Initialize your model
    llm = MyHuggingFaceModel()

    # Running evaluation on MMLU abstract algebra task using 5 few-shot examples and 10 samples
    config = EvalConfig(
        output_dir=Path("./eval_results"),
        num_fewshot=5,
        num_samples=10,
        task_name="MMLU",
        task_subjects=["abstract_algebra", "astronomy"],
        llm_class=MyHuggingFaceModel,
    )

    # Run evaluation and get results
    results = main(llm=llm, config=config)
  3. Review results - Check ./eval_results/ for detailed outputs and use our results guide to interpret them
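When reviewing results, you will typically aggregate the sample-level JSON records per task or subject. The record schema below is purely hypothetical (consult the results guide for the real layout); it only illustrates the kind of aggregation you might do:

```python
import json

# Hypothetical sample-level records; the real output schema is
# documented in the results guide and may differ.
raw = """
[
  {"task": "MMLU", "subject": "abstract_algebra", "correct": true},
  {"task": "MMLU", "subject": "abstract_algebra", "correct": false},
  {"task": "MMLU", "subject": "astronomy", "correct": true}
]
"""

samples = json.loads(raw)
by_subject = {}
for s in samples:
    by_subject.setdefault(s["subject"], []).append(s["correct"])

# Per-subject accuracy from the raw correctness flags
accuracy = {subj: sum(v) / len(v) for subj, v in by_subject.items()}
```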

Next Steps

Documentation

Getting Started

Advanced Usage

Scaling & Production

Contributing

  • Contributing Guide - Guide for contributing to this project
  • Testing - Guide for running tests comparable to the CI pipelines

Citation

If you use eval-framework in your research, please cite:

@software{eval_framework,
  title={Aleph Alpha Eval Framework},
  year={2025},
  url={https://github.com/Aleph-Alpha-Research/eval-framework}
}

License

This project is licensed under the Apache License 2.0.



This project has received funding from the European Union’s Digital Europe Programme under grant agreement No. 101195233 (OpenEuroLLM).

The contents of this publication are the sole responsibility of the OpenEuroLLM consortium and do not necessarily reflect the opinion of the European Union.
