
Dynamic Evaluation Set Generation with Large Language Models

Project description

YourBench Logo

YourBench: A Dynamic Benchmark Generation Framework

[GitHub] · [Dataset] · [Documentation] · [Paper]


Watch the 3-minute demo of the YourBench pipeline on YouTube.


YourBench is an open-source framework for generating domain-specific benchmarks in a zero-shot manner. It aims to keep your large language models on their toes—even as new data sources, domains, and knowledge demands evolve.

Highlights:

  • Dynamic Benchmark Generation: Produce diverse, up-to-date questions from real-world source documents (PDF, Word, HTML, even multimedia).
  • Scalable & Structured: Seamlessly handles ingestion, summarization, and multi-hop chunking for large or specialized datasets.
  • Zero-Shot Focus: Emulates real-world usage scenarios by creating fresh tasks that guard against memorized knowledge.
  • Extensible: Out-of-the-box pipeline stages (ingestion, summarization, question generation), plus an easy plugin mechanism to accommodate custom models or domain constraints.

Quick Start (Alpha)

# 1. Clone the repo
git clone https://github.com/huggingface/yourbench.git
cd yourbench

# 2. Install the dependencies with uv
# pip install uv  # if you do not have uv already
uv venv
source .venv/bin/activate
uv sync
uv pip install -e .

# 3. Add your Hugging Face credentials to a .env file. The example config
#    uses https://openrouter.ai/ for inference, so get a key there and add
#    it as well (or write your own config with a different model!)
touch .env
echo "HF_TOKEN=<your_huggingface_token>" >> .env
echo "HF_ORGANIZATION=<your_HF_username_or_organization>" >> .env

# 4. Run the pipeline with an example config
yourbench run --config example/configs/example.yaml

Note: The above instructions are a work in progress; more comprehensive usage information will be provided soon.
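
For orientation, a pipeline config might look roughly like the sketch below. The field names here (`hf_configuration`, `model_list`, the per-stage `run` toggles) are illustrative assumptions, not the exact schema — consult `example/configs/` in the repository for authoritative examples.

```yaml
# Hypothetical sketch of a YourBench config -- field names are illustrative.
hf_configuration:
  hf_organization: <your_HF_username_or_organization>
  private: true                 # keep generated datasets off the public Hub

model_list:
  - model_name: <provider/model-id>   # placeholder inference model

pipeline:
  ingestion:
    run: true
  summarization:
    run: true
  chunking:
    run: true
  single_shot_question_generation:
    run: true
  multi_hop_question_generation:
    run: true
```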

Process Flow


Key Features

  • Automated Benchmark Generation
    Generate question-answer pairs that test LLMs on specific domains or knowledge slices, derived directly from your raw documents.

  • Flexible Pipeline
    Each stage (ingestion, summarization, chunking, multi-/single-hop QG, deduplication) can be enabled or disabled via YAML config. Fine-grained control allows minimal or comprehensive runs.

  • Robust Config System
    A single YAML config controls model roles, data paths, chunking parameters, question generation instructions, deduplication thresholds, etc.

  • Multi-Model Ensemble Support
    Use different LLMs for ingestion, summarization, question generation, or answering. This fosters broader coverage and question style diversity.

  • Deduplication & Quality Filtering
    Automatically groups near-duplicate questions so they can be pruned into a curated set.

  • Extensive Logging & Analysis
    Built-in modules measure dataset coverage, question distribution, difficulty metrics, and more.

  • Public or Private
    Optionally push ingested or generated data to the Hugging Face Hub or keep it local.

  • Extensible
    Each pipeline step is modular. Easily add custom question-generation prompts, chunking logic, or domain-specific expansions.
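
As a concrete illustration of the chunking stage mentioned above, here is a minimal length-based chunker. This is a simplified sketch, not YourBench's actual implementation (which can also combine segments by text similarity):

```python
import re


def chunk_text(text: str, max_words: int = 120) -> list[str]:
    """Split text into chunks of at most max_words words, breaking on
    sentence boundaries. A simplified sketch of length-based chunking."""
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        words = len(sentence.split())
        # Flush the current chunk if adding this sentence would exceed the cap.
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A real pipeline would layer semantic-similarity merging on top of this, but the length constraint alone already yields usable chunks for question generation.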


Core Concepts & Workflow

YourBench follows a multi-stage approach:

  1. Document Ingestion
    Convert PDFs, HTML, Word, or text into a standardized Markdown format.

  2. Summarization
    Generate a concise "global summary" for each document, using a designated summarization LLM.

  3. Chunking
    Split or chunk documents (and optionally combine multiple smaller segments) based on text similarity or length constraints.

  4. Question Generation

    • Single-Shot: Create straightforward, single-chunk questions.
    • Multi-Hop: Combine multiple chunks to produce more complex, integrative questions.
  5. Deduplication
    Remove or group near-duplicate questions across your dataset using embedding-based similarity.

  6. Analysis
    Evaluate question distribution, difficulty, coverage, or run custom analyses.

  7. Export
    The resulting question sets can be stored locally or uploaded as a new dataset on the Hugging Face Hub.
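
To make the deduplication step concrete, the sketch below keeps one representative per group of questions whose vectors exceed a cosine-similarity threshold. It substitutes toy bag-of-words vectors for the learned embeddings YourBench would use, so treat it as the shape of the algorithm rather than the implementation:

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' -- stands in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def deduplicate(questions: list[str], threshold: float = 0.85) -> list[str]:
    """Greedily keep each question only if it is not a near-duplicate
    (similarity >= threshold) of one already kept."""
    kept: list[str] = []
    kept_vecs: list[Counter] = []
    for q in questions:
        v = embed(q)
        if all(cosine(v, kv) < threshold for kv in kept_vecs):
            kept.append(q)
            kept_vecs.append(v)
    return kept
```

With real sentence embeddings the same greedy loop also catches paraphrases, not just reworded duplicates, and the threshold becomes the quality knob exposed in the config.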


🧰 Development

We use:

  • Ruff for code formatting and linting
  • pytest for testing

🚀 Try YourBench on Hugging Face

To test YourBench on your own documents:

  • Use the Demo Space to generate a dataset and leaderboard in one click – entirely free
  • Use the Advanced Space for full control over the pipeline, with custom configs and your own inference

🤝 Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Install development dependencies
  4. Make your changes
  5. Run tests and ensure code style compliance
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

📄 License

This project is licensed under the Apache-2.0 License - see the LICENSE file for details.

🙏 Acknowledgments

Citation

If YourBench is helpful to you, please cite:

@misc{shashidhar2025yourbencheasycustomevaluation,
      title={YourBench: Easy Custom Evaluation Sets for Everyone}, 
      author={Sumuk Shashidhar and Clémentine Fourrier and Alina Lozovskia and Thomas Wolf and Gokhan Tur and Dilek Hakkani-Tür},
      year={2025},
      eprint={2504.01833},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.01833}, 
}



Download files

Download the file for your platform.

Source Distribution

yourbench-0.3.0.tar.gz (57.9 kB)

Built Distribution

yourbench-0.3.0-py3-none-any.whl (64.3 kB)
File details

Details for the file yourbench-0.3.0.tar.gz.

File metadata

  • Download URL: yourbench-0.3.0.tar.gz
  • Size: 57.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

  • SHA256: 8b5485c890f33ee1d6f634d44362d5588d62d9ad542a830b7250c359fe5134f6
  • MD5: 8167fd765098c365014f256d4619dd4a
  • BLAKE2b-256: 9c5fb9ec49c747cd0b117a3d5632e0d1b47c2b98c7f44012293d300b61529dd0

Provenance

The following attestation bundles were made for yourbench-0.3.0.tar.gz:

Publisher: python-publish.yml on huggingface/yourbench


File details

Details for the file yourbench-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: yourbench-0.3.0-py3-none-any.whl
  • Size: 64.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

  • SHA256: 6c8d2083ce47982551b8639a33d651c068c6ff3d76ad06d8b0db392f305d8304
  • MD5: 7736b7544ff7abda4c5ac2aeafa0c464
  • BLAKE2b-256: 8114e65a89adb55afc37f421a2bff96288206da083f7dfafd5053d5fa7dadd37

Provenance

The following attestation bundles were made for yourbench-0.3.0-py3-none-any.whl:

Publisher: python-publish.yml on huggingface/yourbench

