
Dynamic Evaluation Set Generation with LLMs


🤗 Yourbench

Dynamic Evaluation Set Generation for LLM Benchmarking [NAACL '25]

Python 3.12+ | Code style: ruff | License: MIT | 🤗 Hugging Face

🌟 Overview

Yourbench is a framework for dynamically generating evaluation sets from source documents. It addresses the limitations of static benchmarks, including benchmark saturation, by creating diverse, contextually rich questions tailored to specific educational levels.

🔄 Process Flow

[Figure: process flow diagram]

✨ Features

  • 🔄 Dynamic Generation: Create evaluation sets on-the-fly from any source documents
  • 📚 Semantic Chunking: Smart document splitting that maintains context and meaning
  • 🤔 Multi-hop Questions: Generate questions that require synthesizing information across document sections
  • 📊 Configurable Difficulty: Tailor questions to specific educational levels
  • 🔍 Diverse Question Types: Support for 10 different question types
  • 🤖 Model Flexibility: Works with OpenAI and Azure OpenAI models via LiteLLM
  • 📦 Hugging Face Integration: Direct dataset publishing to Hugging Face Hub

🛠️ Requirements

📦 Installation

# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or
.\venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

🚀 Quick Start

  1. Set up your environment:
# For OpenAI / OpenAI compatible APIs
export MODEL_BASE_URL=your_openai_url
export MODEL_API_KEY=your_openai_key

# For Azure OpenAI
export AZURE_BASE_URL=your_azure_url
export AZURE_API_KEY=your_azure_key
  2. Create a task configuration (config.yaml). More information is available in the documentation, and you can also look at an example task configuration.

  3. Run the example task (after setting your 🤗 username / organization in the config!):

python yourbench/main.py --task-name yourbench_y1
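
Before running the task, it can help to confirm that the variables from step 1 are actually set. The following is a minimal sketch and not part of Yourbench: the helper `missing_env_vars` and the grouping by provider are hypothetical, and only the variable names are taken from the steps above.

```python
import os

# Variable names come from the Quick Start; the provider grouping is an assumption.
REQUIRED_VARS = {
    "openai": ("MODEL_BASE_URL", "MODEL_API_KEY"),
    "azure": ("AZURE_BASE_URL", "AZURE_API_KEY"),
}


def missing_env_vars(provider, environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS[provider] if not env.get(name)]


if __name__ == "__main__":
    # Report the status of each provider's variables in the current shell.
    for provider in REQUIRED_VARS:
        missing = missing_env_vars(provider)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{provider}: {status}")
```

Only one of the two providers needs to be configured, so a "missing" line for the provider you do not use is expected.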

📚 Documentation

Detailed documentation is available in the docs directory.

🏗️ Pipeline Components

1. Dataset Generation

  • Processes source documents
  • Creates structured datasets
  • Supports local files and Hugging Face datasets

2. Document Summarization

  • Generates document summaries
  • Provides context for question generation
  • Uses configured language model

3. Semantic Chunking

  • Splits documents intelligently
  • Maintains semantic coherence
  • Configurable chunk sizes and overlap
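
To illustrate what "chunk size" and "overlap" mean here, the sketch below groups whole sentences into chunks and repeats the last sentence of each chunk at the start of the next. It is a simplified stand-in, not Yourbench's semantic chunker: the function name `chunk_sentences`, its parameters, and the regex-based sentence splitting are all assumptions made for illustration.

```python
import re


def chunk_sentences(text, max_words=50, overlap_sentences=1):
    """Group whole sentences into chunks of roughly max_words words.

    Consecutive chunks share their last `overlap_sentences` sentences.
    A chunk can exceed max_words when the overlap plus one sentence is long.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        words_so_far = sum(len(s.split()) for s in current)
        if current and words_so_far + len(sentence.split()) > max_words:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk into the next one.
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    for chunk in chunk_sentences("A b c. D e f. G h i. J k l.", max_words=6):
        print(chunk)
```

Splitting on sentence boundaries keeps each chunk readable on its own; a semantic chunker would additionally decide boundaries from meaning rather than from word counts.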

4. Multi-hop Chunk Creation

  • Pairs related document chunks
  • Enables complex reasoning questions
  • Smart chunk selection
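
One way to picture "pairing related chunks" is to score every pair of chunks by word overlap and keep the highest-scoring pairs. The sketch below does this with Jaccard similarity. This is an assumed technique chosen for its simplicity, not necessarily how Yourbench selects chunks, and the names `jaccard` and `pair_related_chunks` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between the lowercase word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def pair_related_chunks(chunks, top_k=2):
    """Return index pairs (i, j), i < j, for the top_k most similar chunk pairs."""
    scored = [
        (jaccard(chunks[i], chunks[j]), i, j)
        for i, j in combinations(range(len(chunks)), 2)
    ]
    # Highest similarity first; the sort is stable, so ties keep input order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(i, j) for _, i, j in scored[:top_k]]


if __name__ == "__main__":
    example = [
        "the cat sat on the mat",
        "the cat ate the fish",
        "stock prices fell sharply",
    ]
    print(pair_related_chunks(example, top_k=1))
```

Each returned pair could then be handed to a question generator as the two contexts a multi-hop question must combine.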

5. Question Generation

  • Single-shot questions from individual chunks
  • Multi-hop questions from chunk pairs
  • 10 different question types
  • Difficulty calibration
  • Educational level targeting

6. Dataset Management

  • Hugging Face integration
  • Local storage options
  • Dataset versioning

🎯 Question Types

  1. Analytical: Break down complex ideas
  2. Application-based: Apply concepts to scenarios
  3. Clarification: Deep dive into specifics
  4. Counterfactual: Explore alternatives
  5. Conceptual: Examine theories
  6. True-false: Verify understanding
  7. Factual: Test recall
  8. Open-ended: Encourage discussion
  9. False-premise: Correct misconceptions
  10. Edge-case: Test boundaries
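
When post-processing generated questions, the ten types above can be modelled as an enumeration so that typos in labels fail early. This is a sketch only: the string values below are assumptions derived from the names in the list, and Yourbench may use different identifiers internally. The helper `parse_question_type` is hypothetical.

```python
from enum import Enum


class QuestionType(Enum):
    # String values are assumed spellings based on the list above.
    ANALYTICAL = "analytical"
    APPLICATION_BASED = "application-based"
    CLARIFICATION = "clarification"
    COUNTERFACTUAL = "counterfactual"
    CONCEPTUAL = "conceptual"
    TRUE_FALSE = "true-false"
    FACTUAL = "factual"
    OPEN_ENDED = "open-ended"
    FALSE_PREMISE = "false-premise"
    EDGE_CASE = "edge-case"


def parse_question_type(label):
    """Map a free-form label to a QuestionType; raises ValueError if unknown."""
    return QuestionType(label.strip().lower())


if __name__ == "__main__":
    print(parse_question_type(" Edge-Case "))
```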

⚙️ Configuration

Example configuration:

task_name: yourbench_y1
configurations:
  push_to_huggingface: true
  set_hf_repo_visibility: public
  hf_organization: your-org
  model:
    model_name: gpt-4
    model_type: openai
    max_concurrent_requests: 512

selected_choices:
  generate_dataset:
    execute: true
    files_directory: examples/data
    dataset_name: my_dataset

See the Configuration Guide for detailed options.
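
The example configuration can be sanity-checked before a run. The sketch below mirrors the example as a Python dictionary (as a YAML parser such as PyYAML's `yaml.safe_load` would produce) and applies a few checks. The functions `enabled_stages` and `check_config` are hypothetical, and which keys count as required is an assumption made for illustration. The Configuration Guide remains the authority on the real schema.

```python
# The example configuration above, as the dictionary a YAML parser would return.
EXAMPLE_CONFIG = {
    "task_name": "yourbench_y1",
    "configurations": {
        "push_to_huggingface": True,
        "set_hf_repo_visibility": "public",
        "hf_organization": "your-org",
        "model": {
            "model_name": "gpt-4",
            "model_type": "openai",
            "max_concurrent_requests": 512,
        },
    },
    "selected_choices": {
        "generate_dataset": {
            "execute": True,
            "files_directory": "examples/data",
            "dataset_name": "my_dataset",
        },
    },
}


def enabled_stages(config):
    """Names of the stages under selected_choices whose execute flag is true."""
    stages = config.get("selected_choices", {})
    return [name for name, options in stages.items() if options.get("execute")]


def check_config(config):
    """Return a list of problems; the required keys are assumed, not official."""
    problems = []
    if not config.get("task_name"):
        problems.append("task_name is missing")
    settings = config.get("configurations", {})
    if settings.get("push_to_huggingface") and not settings.get("hf_organization"):
        problems.append("push_to_huggingface is set but hf_organization is missing")
    model = settings.get("model", {})
    for key in ("model_name", "model_type"):
        if not model.get(key):
            problems.append(f"configurations.model.{key} is missing")
    return problems


if __name__ == "__main__":
    print(enabled_stages(EXAMPLE_CONFIG), check_config(EXAMPLE_CONFIG))
```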

🧰 Development

We use:

  • Ruff for code formatting and linting
  • pytest for testing

🤝 Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Install development dependencies
  4. Make your changes
  5. Run tests and ensure code style compliance
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments
