Dynamic Evaluation Set Generation with Large Language Models
Project description
YourBench: A Dynamic Benchmark Generation Framework
[GitHub] · [Dataset] · [Documentation] · [Paper]
YourBench is a structured data generation library for building better AI systems. Generate high-quality QA pairs, training data, and evaluation datasets from any source documents, with full control over the output format and complexity. The modular architecture lets you configure every aspect of the generation pipeline, from document parsing (with built-in converters from common formats to markdown) to chunking strategies to output schemas. Most eval frameworks force you into their structure; YourBench adapts to yours. Use it to create domain-specific benchmarks, fine-tuning datasets, or systematic model evaluations. Peer-reviewed and appearing at COLM 2025. 100% free and open source, forever.
Quick Start
You can use YourBench instantly, without installation, via uv! Simply run:
uvx yourbench --model gpt-4o-mini <YOUR_FILE_DIRECTORY_HERE>
You will see the dataset appear locally! If a valid HF_TOKEN is set, the dataset will also be pushed to your Hugging Face Hub!
Installation
YourBench is available on PyPI and requires Python 3.12+. You can install it as follows:
- Install via PyPI (stable release):

  # uv (recommended; get it here: https://docs.astral.sh/uv/getting-started/installation/)
  uv venv --python 3.12
  source .venv/bin/activate
  uv pip install yourbench

  # pip (standard support)
  pip install yourbench

  This will install the latest published version (e.g. 0.5.1).
- Install from source (development version):

  git clone https://github.com/huggingface/yourbench.git
  cd yourbench

  # uv, recommended
  uv venv --python 3.12
  source .venv/bin/activate
  uv pip install -e .

  # pip
  pip install -e .
Installing from source is recommended if you want the latest updates or to run the included example configuration.
Note: If you plan to use models that require API access (e.g. OpenAI GPT-4o or the Hugging Face Inference API), make sure you have the appropriate credentials. You'll also need a Hugging Face token if you want to upload results to the Hub. See below for how to configure these before running YourBench.
Usage
Once installed, YourBench can be run from the command line to generate a custom evaluation set. Here’s a quick example:
# 1. (Optional) If not done already, install YourBench
pip install yourbench
# 2. Prepare your API credentials (for model inference and Hub access)
# For example, create a .env file with required keys:
# echo "OPENROUTER_API_KEY=<your_openrouter_api_key>" >> .env # Example
echo "HF_TOKEN=<your_huggingface_api_token>" >> .env # Hugging Face token (for Hub datasets & inference)
# 3. Run the pipeline on the provided example config (uses sample docs and models), or, use your own config file!
yourbench example/configs/simple_example.yaml
The example configuration example/configs/simple_example.yaml (included in the repository) demonstrates a basic setup. It specifies sample documents and default models for each stage of the pipeline. In step 3 above, YourBench will automatically ingest the example documents, generate a set of Q&A pairs, and output a Hugging Face Dataset containing the evaluation questions and answers.
For your own data, you can create a YAML config pointing to your documents and preferred models. For instance, you might specify a folder of PDFs or text files under a documents field, and choose which LLM to use for question generation. YourBench is fully configurable: you can easily toggle stages on or off and swap in different models. For example, you could disable the summarization stage for very short texts, or use a powerful API model for question generation while using a faster local model for summarization. Simply adjust the YAML, and the pipeline will accommodate it. (See the usage example for all available options!)
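As a sketch, a custom config might look like the following. The field names below are illustrative assumptions, not the authoritative schema; consult example/configs/simple_example.yaml in the repository for the exact keys.

```yaml
# Hypothetical YourBench config sketch -- key names are illustrative;
# see example/configs/simple_example.yaml for the actual schema.
hf_configuration:
  hf_dataset_name: my-eval-set        # where the generated dataset is saved
model_list:
  - model_name: gpt-4o-mini           # model used for generation stages
pipeline:
  ingestion:
    source_documents_dir: docs/       # folder of PDFs / text files to ingest
  summarization:
    run: false                        # stages can be toggled on or off
  chunking:
    run: true
  question_generation:
    run: true
```

The point of the sketch is the shape: one block for Hub settings, one for models, and one per-stage block under pipeline with a toggle, so swapping models or disabling a stage is a one-line change.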
To learn more about the pipeline internals, see How YourBench Works.
Try it Online (Hugging Face Spaces)
You can try YourBench right away in your browser – no installation needed:
- YourBench Demo Space – Use our ready-to-go web demo to upload a document (or paste text) and generate a custom evaluation set with one click, complete with an instant model leaderboard. (This free demo will use a default set of models to answer the questions and show how different models perform.)
- YourBench Advanced Space – For power users, the advanced demo lets you provide a custom YAML config and plug in your own models or API endpoints. This gives you full control over the pipeline (choose specific models, adjust chunking parameters, etc.) via a convenient UI, right from the browser.
👉 Both hosted apps are available on Hugging Face Spaces under the yourbench organization. Give them a try to see how YourBench can generate benchmarks tailored to your use-case in minutes.
Contributing
Contributions are welcome!
We actively review PRs and welcome improvements or fixes from the community. For major changes, feel free to open an issue first to discuss the idea.
📜 License
This project is licensed under the Apache 2.0 License – see the LICENSE file for details. You are free to use, modify, and distribute YourBench in either commercial or academic projects under the terms of this license.
📚 Citation
If you use YourBench in your research or applications, please consider citing our paper:
@misc{shashidhar2025yourbencheasycustomevaluation,
title={YourBench: Easy Custom Evaluation Sets for Everyone},
author={Sumuk Shashidhar and Clémentine Fourrier and Alina Lozovskia and Thomas Wolf and Gokhan Tur and Dilek Hakkani-Tür},
year={2025},
eprint={2504.01833},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.01833}
}
File details
Details for the file yourbench-0.5.1.tar.gz.
File metadata
- Download URL: yourbench-0.5.1.tar.gz
- Size: 92.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 1fa3af5171aa25a2855bf3fda53723d3f71711dc084061ea57925337769f9977 |
| MD5 | c567853bf788c3cbc61e2d974cecc474 |
| BLAKE2b-256 | f65bd794dbbc75aa8dce9eebcb05ea6516af87c277e219b58f4d6fb76ccbaa93 |
Provenance
The following attestation bundles were made for yourbench-0.5.1.tar.gz:
Publisher: python-publish.yml on huggingface/yourbench
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: yourbench-0.5.1.tar.gz
- Subject digest: 1fa3af5171aa25a2855bf3fda53723d3f71711dc084061ea57925337769f9977
- Sigstore transparency entry: 352778410
- Permalink: huggingface/yourbench@5781dd1cafbdef42dd05c072985fcaf02d219152
- Branch / Tag: refs/tags/v0.5.1
- Owner: https://github.com/huggingface
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@5781dd1cafbdef42dd05c072985fcaf02d219152
- Trigger Event: release
File details
Details for the file yourbench-0.5.1-py3-none-any.whl.
File metadata
- Download URL: yourbench-0.5.1-py3-none-any.whl
- Size: 105.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | e46233b509f65edd042f4c0348aceba2a895ee66c7e721286ded82e20d60a782 |
| MD5 | 84c2d05beaf80c70c159232b0596f6ab |
| BLAKE2b-256 | 73af12290215bbfeef616c7461e983d3741babb0289737a1cb56737b0611afe8 |
Provenance
The following attestation bundles were made for yourbench-0.5.1-py3-none-any.whl:
Publisher: python-publish.yml on huggingface/yourbench
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: yourbench-0.5.1-py3-none-any.whl
- Subject digest: e46233b509f65edd042f4c0348aceba2a895ee66c7e721286ded82e20d60a782
- Sigstore transparency entry: 352778411
- Permalink: huggingface/yourbench@5781dd1cafbdef42dd05c072985fcaf02d219152
- Branch / Tag: refs/tags/v0.5.1
- Owner: https://github.com/huggingface
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@5781dd1cafbdef42dd05c072985fcaf02d219152
- Trigger Event: release