Dynamic Evaluation Set Generation with Large Language Models

Project description

YourBench: A Dynamic Benchmark Generation Framework

[GitHub] · [Dataset] · [Documentation]

Watch our 3-minute demo of the YourBench pipeline on YouTube.


YourBench is an open-source framework for generating domain-specific benchmarks in a zero-shot manner. It aims to keep your large language models on their toes—even as new data sources, domains, and knowledge demands evolve.

Highlights:

  • Dynamic Benchmark Generation: Produce diverse, up-to-date questions from real-world source documents (PDF, Word, HTML, even multimedia).
  • Scalable & Structured: Seamlessly handles ingestion, summarization, and multi-hop chunking for large or specialized datasets.
  • Zero-Shot Focus: Emulates real-world usage scenarios by creating fresh tasks that guard against memorized knowledge.
  • Extensible: Out-of-the-box pipeline stages (ingestion, summarization, question generation), plus an easy plugin mechanism to accommodate custom models or domain constraints.

Quick Start (Alpha)

# 1. Clone the repo
git clone https://github.com/huggingface/yourbench.git
cd yourbench

# 2. Install the dependencies with uv
# pip install uv  # if you do not have uv already
uv venv
source .venv/bin/activate
uv sync
uv pip install -e .

# 3. Get a key from https://openrouter.ai/ and add it, along with your
#    Hugging Face token, to a .env file (or make your own config with a different model!)
touch .env
echo "HF_TOKEN=<your_huggingface_token>" >> .env

# 4. Run the pipeline with an example config
yourbench run --config configs/example.yaml

Note: The instructions above are a work in progress; more comprehensive usage info will be provided soon.

Process Flow

Key Features

  • Automated Benchmark Generation
    Generate question-answer pairs that test LLMs on specific domains or knowledge slices, derived directly from your raw documents.

  • Flexible Pipeline
    Each stage (ingestion, summarization, chunking, multi-/single-hop QG, deduplication) can be enabled or disabled via YAML config. Fine-grained control allows minimal or comprehensive runs.

  • Robust Config System
    A single YAML config controls model roles, data paths, chunking parameters, question generation instructions, deduplication thresholds, etc.

  • Multi-Model Ensemble Support
    Use different LLMs for ingestion, summarization, question generation, or answering. This fosters broader coverage and question style diversity.

  • Deduplication & Quality Filtering
    Automatically group near-duplicate questions, prune the redundant ones, and keep a curated set.

  • Extensive Logging & Analysis
    Built-in modules measure dataset coverage, question distribution, difficulty metrics, and more.

  • Public or Private
    Optionally push ingested or generated data to the Hugging Face Hub or keep it local.

  • Extensible
    Each pipeline step is modular. Easily add custom question-generation prompts, chunking logic, or domain-specific expansions.
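
The feature toggles above might translate into a config along these lines. This is a hypothetical sketch to illustrate the idea — every key name here is an assumption, and the authoritative schema is whatever `configs/example.yaml` in the repository defines:

```yaml
# Hypothetical sketch of a YourBench config; see configs/example.yaml for the real schema.
hf_configuration:
  token: $HF_TOKEN
  private: true                   # keep generated datasets off the public Hub

model_roles:                      # different LLMs per pipeline role
  summarization: some-org/summarizer-model
  question_generation: some-org/qg-model

pipeline:
  ingestion:
    run: true
    source_documents_dir: data/raw
  summarization:
    run: true
  chunking:
    run: true
    max_tokens: 512               # length constraint for chunks
  single_shot_question_generation:
    run: true
  multi_hop_question_generation:
    run: false                    # disable a stage by flipping its flag
  deduplication:
    run: true
    similarity_threshold: 0.85    # embedding-similarity cutoff for grouping
```

The pattern of one `run` flag per stage is what makes minimal or comprehensive runs possible from a single file.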


Core Concepts & Workflow

YourBench follows a multi-stage approach:

  1. Document Ingestion
    Convert PDFs, HTML, Word, or text into a standardized Markdown format.

  2. Summarization
    Generate a concise "global summary" for each document, using a designated summarization LLM.

  3. Chunking
    Split or chunk documents (and optionally combine multiple smaller segments) based on text similarity or length constraints.

  4. Question Generation

    • Single-Shot: Create straightforward, single-chunk questions.
    • Multi-Hop: Combine multiple chunks to produce more complex, integrative questions.
  5. Deduplication
    Remove or group near-duplicate questions across your dataset using embedding-based similarity.

  6. Analysis
    Evaluate question distribution, difficulty, coverage, or run custom analyses.

  7. Export
    The resulting question sets can be stored locally or uploaded as a new dataset on the Hugging Face Hub.
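
As an illustration of stage 5, embedding-based near-duplicate grouping can be sketched as below. This is a minimal greedy sketch, not YourBench's actual implementation: the similarity threshold and the assumption that embeddings arrive as a NumPy matrix are both stand-ins.

```python
import numpy as np


def group_near_duplicates(embeddings: np.ndarray, threshold: float = 0.9) -> list[list[int]]:
    """Greedily group row indices whose cosine similarity exceeds `threshold`.

    Each group's first index can be kept as the representative question;
    the rest are near-duplicates eligible for pruning.
    """
    # Normalize rows so a plain dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    groups: list[list[int]] = []
    assigned: set[int] = set()
    for i in range(len(normed)):
        if i in assigned:
            continue
        group = [i]
        assigned.add(i)
        for j in range(i + 1, len(normed)):
            if j not in assigned and float(normed[i] @ normed[j]) >= threshold:
                group.append(j)
                assigned.add(j)
        groups.append(group)
    return groups
```

A production version would batch the pairwise similarities instead of looping, but the grouping logic is the same: every question lands in exactly one group, and only one member per group survives into the final set.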


🧰 Development

We use:

  • Ruff for code formatting and linting
  • pytest for testing

🤝 Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Install development dependencies
  4. Make your changes
  5. Run tests and ensure code style compliance
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Download files

Download the file for your platform.

Source Distribution

yourbench-0.2.0.tar.gz (48.6 kB)

Uploaded Source

Built Distribution


yourbench-0.2.0-py3-none-any.whl (54.7 kB)

Uploaded Python 3

File details

Details for the file yourbench-0.2.0.tar.gz.

File metadata

  • Download URL: yourbench-0.2.0.tar.gz
  • Upload date:
  • Size: 48.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for yourbench-0.2.0.tar.gz:

  • SHA256: 01c1fa3fe2246a972d4b6d0d2068af455a84ccf0ec0e052a1c36313cb0fd0a66
  • MD5: 9984c30498f146acb6e15a5267ee80ae
  • BLAKE2b-256: ae3e2664d2d60ce71c80d7f950c79c45b2b235f36ec24a7ca5c15525bded6b9d

Provenance

The following attestation bundles were made for yourbench-0.2.0.tar.gz:

Publisher: python-publish.yml on huggingface/yourbench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file yourbench-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: yourbench-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 54.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for yourbench-0.2.0-py3-none-any.whl:

  • SHA256: bcb8d609a3f4ed91bc18cd3fbec8ae2487506aa686fd52cf80ad96430670d9ea
  • MD5: 15d4abefa545abdf7827ec988af77831
  • BLAKE2b-256: c5f38a1ac910a9cce27d02d7811c0035266b9227765c483ebe088bcbb5d52994

Provenance

The following attestation bundles were made for yourbench-0.2.0-py3-none-any.whl:

Publisher: python-publish.yml on huggingface/yourbench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
