Dynamic Evaluation Set Generation with LLMs

Project description

🤗 Yourbench

Dynamic Evaluation Set Generation for LLM Benchmarking [NAACL '25]

🌟 Overview

Yourbench is a powerful framework for dynamically generating evaluation sets from source documents. It addresses the limitations of static benchmarks and benchmark saturation by creating diverse, contextually-rich questions tailored to specific educational levels.

🔄 Process Flow

Process Flow

✨ Features

🔄 Dynamic Generation: Create evaluation sets on-the-fly from any source documents
📚 Semantic Chunking: Smart document splitting that maintains context and meaning
🤔 Multi-hop Questions: Generate questions that require synthesizing information across document sections
📊 Configurable Difficulty: Tailor questions to specific educational levels
🔍 Diverse Question Types: Support for 10 different question types
🤖 Model Flexibility: Works with OpenAI and Azure OpenAI models via LiteLLM
📦 Hugging Face Integration: Direct dataset publishing to Hugging Face Hub

🛠️ Requirements

Python 3.12+
LiteLLM for model inference
Sentence Transformers for semantic chunking
Hugging Face Datasets for dataset management
OpenAI API Compatible API / Azure AI. (more model types coming soon!)

📦 Installation

# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or
.\venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

🚀 Quick Start

Set up your environment:

# For OpenAI / OpenAI compatible APIs
export MODEL_BASE_URL=your_openai_url
export MODEL_API_KEY=your_openai_key

# For Azure OpenAI
export AZURE_BASE_URL=your_azure_url
export AZURE_API_KEY=your_azure_key

Create a task configuration (config.yaml). Here is some more information!. You can also look at an example task configuration
Run the example task (after setting your 🤗 username / organization in the config!):

python yourbench/main.py --task-name yourbench_y1

📚 Documentation

Detailed documentation is available in the docs directory:

Configuration Guide: Comprehensive guide to YAML configuration
Question Generation: Details about the question generation process
Chunking System: Information about the semantic chunking system

🏗️ Pipeline Components

1. Dataset Generation

Processes source documents
Creates structured datasets
Supports local files and Hugging Face datasets

2. Document Summarization

Generates document summaries
Provides context for question generation
Uses configured language model

3. Semantic Chunking

Splits documents intelligently
Maintains semantic coherence
Configurable chunk sizes and overlap

4. Multi-hop Chunk Creation

Pairs related document chunks
Enables complex reasoning questions
Smart chunk selection

5. Question Generation

Single-shot questions from individual chunks
Multi-hop questions from chunk pairs
10 different question types
Difficulty calibration
Educational level targeting

6. Dataset Management

Hugging Face integration
Local storage options
Dataset versioning

🎯 Question Types

Analytical: Break down complex ideas
Application-based: Apply concepts to scenarios
Clarification: Deep dive into specifics
Counterfactual: Explore alternatives
Conceptual: Examine theories
True-false: Verify understanding
Factual: Test recall
Open-ended: Encourage discussion
False-premise: Correct misconceptions
Edge-case: Test boundaries

⚙️ Configuration

Example configuration:

task_name: yourbench_y1
configurations:
  push_to_huggingface: true
  set_hf_repo_visibility: public
  hf_organization: your-org
  model:
    model_name: gpt-4
    model_type: openai
    max_concurrent_requests: 512

selected_choices:
  generate_dataset:
    execute: true
    files_directory: examples/data
    dataset_name: my_dataset

See Configuration Guide for detailed options.

🧰 Development

We use:

Ruff for code formatting and linting
pytest for testing

🤝 Contributing

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Install development dependencies
Make your changes
Run tests and ensure code style compliance
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

LiteLLM for model inference
Sentence Transformers for semantic embeddings
Hugging Face for dataset infrastructure

Project details

Release history Release notifications | RSS feed

0.9.0

Dec 29, 2025

0.6.0

Aug 5, 2025

0.5.3

Aug 5, 2025

0.5.2

Aug 5, 2025

0.5.1

Aug 5, 2025

0.5.0

Aug 5, 2025

0.4.3

Aug 5, 2025

0.4.1

Aug 4, 2025

0.4.0

Jul 31, 2025

0.3.1

May 16, 2025

0.3.0

May 5, 2025

0.2.0

Mar 21, 2025

This version

0.1.0

Mar 20, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

yourbench-0.1.0.tar.gz (33.7 kB view details)

Uploaded Mar 20, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

yourbench-0.1.0-py3-none-any.whl (37.7 kB view details)

Uploaded Mar 20, 2025 Python 3

File details

Details for the file yourbench-0.1.0.tar.gz.

File metadata

Download URL: yourbench-0.1.0.tar.gz
Upload date: Mar 20, 2025
Size: 33.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for yourbench-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`11d4e6d811f628fc0cc38a8899448c5926fb41af6273e091a3623d1080468021`
MD5	`101e170626c0f58f00aa6d40847e134d`
BLAKE2b-256	`804ae9ebd5460447059ebb857f6b497f264df77df222a79eb995426cecbd5ec5`

See more details on using hashes here.

Provenance

The following attestation bundles were made for yourbench-0.1.0.tar.gz:

Publisher: python-publish.yml on huggingface/yourbench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: yourbench-0.1.0.tar.gz
- Subject digest: 11d4e6d811f628fc0cc38a8899448c5926fb41af6273e091a3623d1080468021
- Sigstore transparency entry: 185575430
- Sigstore integration time: Mar 20, 2025
Source repository:
- Permalink: huggingface/yourbench@40ab6b3533a9f655e3d8953c5750425665caef92
- Branch / Tag: refs/tags/v0.0.0
- Owner: https://github.com/huggingface
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@40ab6b3533a9f655e3d8953c5750425665caef92
- Trigger Event: release

File details

Details for the file yourbench-0.1.0-py3-none-any.whl.

File metadata

Download URL: yourbench-0.1.0-py3-none-any.whl
Upload date: Mar 20, 2025
Size: 37.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for yourbench-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7eaae9555fd9e2a21b7a0caa69a43925bd78bf545f8fd7b026b94dc8ddaa378e`
MD5	`4043b46ced89cd3feae481cca696ec91`
BLAKE2b-256	`e790ce88060081b851e837e907a0e95fe20e4660a2ead2d947f0f45edc660ecf`

See more details on using hashes here.

Provenance

The following attestation bundles were made for yourbench-0.1.0-py3-none-any.whl:

Publisher: python-publish.yml on huggingface/yourbench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: yourbench-0.1.0-py3-none-any.whl
- Subject digest: 7eaae9555fd9e2a21b7a0caa69a43925bd78bf545f8fd7b026b94dc8ddaa378e
- Sigstore transparency entry: 185575433
- Sigstore integration time: Mar 20, 2025
Source repository:
- Permalink: huggingface/yourbench@40ab6b3533a9f655e3d8953c5750425665caef92
- Branch / Tag: refs/tags/v0.0.0
- Owner: https://github.com/huggingface
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@40ab6b3533a9f655e3d8953c5750425665caef92
- Trigger Event: release

yourbench 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

🤗 Yourbench

🌟 Overview

🔄 Process Flow

✨ Features

🛠️ Requirements

📦 Installation

🚀 Quick Start

📚 Documentation

🏗️ Pipeline Components

1. Dataset Generation

2. Document Summarization

3. Semantic Chunking

4. Multi-hop Chunk Creation

5. Question Generation

6. Dataset Management

🎯 Question Types

⚙️ Configuration

🧰 Development

🤝 Contributing

📄 License

🙏 Acknowledgments

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance