RAGWorkbench

A comprehensive benchmarking framework for Retrieval-Augmented Generation (RAG) systems.


Overview

RAGWorkbench is a powerful Python framework designed to evaluate and benchmark RAG systems across multiple datasets and metrics. It provides a unified interface for loading diverse RAG benchmarks, running inference pipelines, and computing comprehensive evaluation metrics.

Key Features

  • 🎯 Multiple Benchmark Datasets: Support for 18+ RAG benchmark datasets including AIT-QA, BioASQ, HotpotQA, NarrativeQA, QASPER, and more
  • 📊 Comprehensive Metrics: Built-in evaluation metrics for context correctness (Recall@K, MRR, MAP) and answer correctness (BERT Score, Sentence-BERT, LLM-as-a-Judge)
  • 🔄 Flexible Pipeline: Modular architecture supporting custom ingest and inference pipelines
  • 💾 Smart Caching: File-system based caching for data loading, generation, and evaluation results
  • 🌐 Interactive Explorer: Web-based dataset exploration tool with advanced filtering capabilities
  • 🧪 Experiment Management: End-to-end experiment orchestration from data loading to evaluation

Installation

Requirements

  • Python 3.11 or higher

Basic Installation

pip install ragworkbench

Or, from a local checkout of the repository:

pip install .

Development Installation

git clone https://github.com/IBM/RagWorkbench.git
cd RagWorkbench
pip install -e ".[dev]"

Environment Configuration

Some evaluation metrics require environment variables to be configured. See ENVIRONMENT_SETUP.md for detailed instructions on setting up credentials for:

  • watsonx.ai LLM-as-a-Judge metrics
  • Azure OpenAI metrics

Optional Dependencies

# For documentation
pip install .[docs]

# For examples
pip install .[examples]

# Install all optional dependencies
pip install .[all]

Quick Start

Basic Usage

from ragbench import DataLoaderFactory, DatasetName, Experiment
from ragbench.api.inference import InferencePipeline, InferenceParams
from ragbench.api.ingest import IngestPipeline
from ragbench.eval import MetricDefinition

# Load a dataset
data_loader = DataLoaderFactory.create_data_loader(DatasetName.HOTPOT_QA)

# Define your custom pipelines
class MyIngestPipeline(IngestPipeline):
    def process(self, data_loader):
        # Your ingestion logic
        pass

class MyInferencePipeline(InferencePipeline):
    def __init__(self, params: InferenceParams, cache_dir=None):
        super().__init__(params, cache_dir)

    def set_ingest_artifacts(self, ingest_artifacts):
        # Set up your retrieval system
        pass

    def process_no_cache(self, benchmark_entry):
        # Your inference logic
        pass

# Define evaluation metrics
metrics = [
    MetricDefinition.from_yaml_key("unitxt.context_correctness.retrieval_at_k"),
    MetricDefinition.from_yaml_key("unitxt.answer_correctness.bert_score_recall"),
]

# Create and run experiment
experiment = Experiment(
    name="my_rag_experiment",
    data_loader=data_loader,
    ingest_pipeline=MyIngestPipeline(),
    inference_pipeline=MyInferencePipeline(InferenceParams()),
    eval_metrics=metrics,
    cache_dir="./cache"
)

results, evaluation = experiment.run()

Supported Datasets

RAGWorkbench supports 18+ benchmark datasets across various domains:

Dataset             | Domain              | Retrieval Hops | Modalities
--------------------|---------------------|----------------|-------------------
AIT-QA              | Financial           | Single         | TEXT, TABLE
BioASQ              | Biomedical          | Single         | TEXT
CLAP-NQ             | Wikipedia           | Single         | TEXT
DA-Code             | Code                | Single         | TEXT
DABStep             | Code                | Multi          | TEXT
HotpotQA            | Wikipedia           | Multi          | TEXT
KramaBench          | Wikipedia           | Single         | TEXT
Mini-Wiki           | Wikipedia           | Single         | TEXT
MLDR                | Multilingual        | Single         | TEXT
NarrativeQA         | Literature          | Single         | TEXT
OfficeQA            | Technical Docs      | Single         | TEXT
QASPER              | Scientific Papers   | Single         | TEXT
SecQue              | Policies            | Single         | TEXT
WatsonX DocsQA      | Technical Docs      | Single         | TEXT
RealMM (4 variants) | Financial/Technical | Single         | TEXT, TABLE, IMAGE

Loading Datasets

from ragbench import DataLoaderFactory, DatasetName

# Load a specific dataset
loader = DataLoaderFactory.create_data_loader(DatasetName.BIOASQ)

# Get the benchmark and corpus
benchmark = loader.get_benchmark()
corpus = loader.get_corpus()

# Access benchmark entries
for entry in benchmark.get_benchmark_entries():
    print(f"Question: {entry.question}")
    print(f"Ground truth answers: {entry.ground_truth_answers}")

Evaluation Metrics

RAGWorkbench provides comprehensive evaluation metrics through integration with Unitxt:

Context Correctness Metrics

  • Retrieval@K: Measures retrieval accuracy at different cutoffs (K=1, 3, 5, 10, 20, 40)
  • MRR (Mean Reciprocal Rank): Evaluates the rank of the first relevant document
  • MAP (Mean Average Precision): Measures precision across all relevant documents
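
As a rough illustration of what these context-correctness metrics compute, here is a minimal plain-Python sketch (not the framework's actual Unitxt-backed implementation):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant documents found in the top-k retrieved list."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant document (0.0 if none retrieved)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(retrieved, relevant):
    """Precision averaged over the ranks at which relevant documents appear."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(retrieved, relevant, 3))    # 0.5: only d1 is in the top 3
print(mrr(retrieved, relevant))               # 0.5: first hit at rank 2
print(average_precision(retrieved, relevant))
```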

Answer Correctness Metrics

  • BERT Score Recall: Semantic similarity using BERT embeddings
  • Sentence-BERT: Sentence-level semantic similarity
  • LLM-as-a-Judge: Uses LLMs (Llama, GPT-4) to evaluate answer quality
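
The metrics above compare answers in embedding space or via a judge model, which requires model weights or API access. A crude lexical analogue of recall-style answer scoring can be sketched as follows (an illustrative simplification, not the framework's BERT Score implementation):

```python
def token_recall(prediction: str, reference: str) -> float:
    """Fraction of reference tokens that also appear in the prediction:
    a lexical stand-in for embedding-based recall metrics."""
    pred_tokens = set(prediction.lower().split())
    ref_tokens = reference.lower().split()
    if not ref_tokens:
        return 0.0
    return sum(t in pred_tokens for t in ref_tokens) / len(ref_tokens)

print(token_recall("Paris is the capital of France", "the capital is Paris"))  # 1.0
```

Embedding-based metrics exist precisely because this kind of exact-token matching misses paraphrases; the structure of the computation, however, is the same.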

Using Metrics

from ragbench.eval import MetricDefinition

# Load metrics from YAML definitions
metric = MetricDefinition.from_yaml_key("unitxt.context_correctness.retrieval_at_k")

# Or create custom metrics
custom_metric = MetricDefinition(
    metric_id="custom.metric",
    metric_params={"param": "value"},
    metric_fields=["field1", "field2"],
    vendor="unitxt"
)

Caching System

RAGWorkbench includes a caching system to speed up repeated experiments:

from pathlib import Path

# Enable caching for all components
cache_dir = Path("./cache")

experiment = Experiment(
    name="cached_experiment",
    data_loader=data_loader,
    ingest_pipeline=ingest_pipeline,
    inference_pipeline=MyInferencePipeline(params, cache_dir=cache_dir),
    eval_metrics=metrics,
    cache_dir=cache_dir  # Enables evaluator caching
)

The caching system supports:

  • Data Loader Cache: Caches loaded datasets
  • Generation Cache: Caches inference results
  • Evaluator Cache: Caches evaluation results
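
The cache implementation itself is internal to the framework, but a file-system cache keyed by a content hash, as described above, can be sketched along these lines (an assumption about the general approach, not RAGWorkbench's actual code):

```python
import hashlib
import json
from pathlib import Path

class FileCache:
    """Minimal content-addressed JSON cache: each result is stored under a
    SHA-256 key derived from the request, so identical requests are reused."""

    def __init__(self, cache_dir: Path):
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, key_obj) -> Path:
        # Canonical JSON (sorted keys) makes the hash order-independent.
        digest = hashlib.sha256(
            json.dumps(key_obj, sort_keys=True).encode()
        ).hexdigest()
        return self.cache_dir / f"{digest}.json"

    def get(self, key_obj):
        path = self._path(key_obj)
        return json.loads(path.read_text()) if path.exists() else None

    def put(self, key_obj, value) -> None:
        self._path(key_obj).write_text(json.dumps(value))

cache = FileCache(Path("./cache/demo"))
query = {"question": "What is RAG?", "model": "demo"}
if cache.get(query) is None:
    cache.put(query, {"answer": "retrieval-augmented generation"})
print(cache.get(query)["answer"])
```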

Dataset Explorer

RAGWorkbench includes an interactive web-based dataset explorer:

python -m ragbench.dataset_exploration.dataset_explorer

Then open your browser to http://localhost:8080

Explorer Features

  • 📋 Browse all available datasets in a sortable table
  • 🔍 Search datasets by name or description
  • 🎨 Filter by domain, retrieval hops, modalities, and more
  • 📊 View detailed dataset statistics and metadata
  • 📋 Copy dataset names with one click

Core Components

  • DataLoader: Loads and manages benchmark datasets
  • IngestPipeline: Processes and indexes documents
  • InferencePipeline: Runs retrieval and generation
  • Evaluator: Computes evaluation metrics
  • Experiment: Orchestrates the complete workflow
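
How these components fit together can be sketched as a plain-Python control-flow loop (an illustration of the orchestration, with method names following the Quick Start example where possible; the evaluator's `evaluate` call is an assumption):

```python
def run_experiment(data_loader, ingest_pipeline, inference_pipeline, evaluator):
    """Illustrative control flow: ingest the corpus, answer each benchmark
    question, then score the results."""
    # 1. Build the index / retrieval artifacts from the corpus.
    artifacts = ingest_pipeline.process(data_loader)
    inference_pipeline.set_ingest_artifacts(artifacts)
    # 2. Run retrieval + generation over every benchmark entry.
    results = [
        inference_pipeline.process_no_cache(entry)
        for entry in data_loader.get_benchmark().get_benchmark_entries()
    ]
    # 3. Compute the configured metrics over the results.
    evaluation = evaluator.evaluate(results)
    return results, evaluation
```

`Experiment.run()` plays this role in the framework, with caching layered around each stage.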

Development

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=src --cov-report=html

# Run only unit tests
pytest tests/datasets_loader/unit

# Run only integration tests
pytest -m integration

Code Quality

# Format code
black src tests

# Lint code
ruff check src tests

# Type checking
mypy src

Pre-commit Hooks

pre-commit install
pre-commit run --all-files

Contributing

We welcome contributions! Please see our contributing guidelines for more details.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Releases

RAGWorkbench follows Semantic Versioning and uses automated releases via GitHub Actions.

Installation

Install the latest stable release from PyPI:

pip install ragworkbench

Install a specific version:

pip install ragworkbench==0.1.0

Release Process

For maintainers preparing a new release:

  1. Prepare the release:

    ./scripts/prepare_release.sh 0.2.0
    
  2. Create and push the tag:

    git tag -a v0.2.0 -m "Release version 0.2.0"
    git push origin v0.2.0
    
  3. Monitor the automated workflow at GitHub Actions

The release workflow will automatically:

  • Build the package
  • Publish to PyPI
  • Create a GitHub release with release notes

For detailed release instructions, see RELEASE.md.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Authors

Acknowledgments

Support

For questions, issues, or feature requests, please open an issue on the project's GitHub repository.

