RAGWorkbench
A comprehensive benchmarking framework for Retrieval-Augmented Generation (RAG) systems.
Overview
RAGWorkbench is a powerful Python framework designed to evaluate and benchmark RAG systems across multiple datasets and metrics. It provides a unified interface for loading diverse RAG benchmarks, running inference pipelines, and computing comprehensive evaluation metrics.
Key Features
- 🎯 Multiple Benchmark Datasets: Support for 18+ RAG benchmark datasets including AIT-QA, BioASQ, HotpotQA, NarrativeQA, QASPER, and more
- 📊 Comprehensive Metrics: Built-in evaluation metrics for context correctness (Recall@K, MRR, MAP) and answer correctness (BERT Score, Sentence-BERT, LLM-as-a-Judge)
- 🔄 Flexible Pipeline: Modular architecture supporting custom ingest and inference pipelines
- 💾 Smart Caching: File-system based caching for data loading, generation, and evaluation results
- 🌐 Interactive Explorer: Web-based dataset exploration tool with advanced filtering capabilities
- 🧪 Experiment Management: End-to-end experiment orchestration from data loading to evaluation
Installation
Requirements
- Python 3.11 or higher
Basic Installation
pip install .
Development Installation
git clone https://github.com/IBM/RagWorkbench.git
cd RagWorkbench
pip install -e ".[dev]"
Environment Configuration
Some evaluation metrics require environment variables to be configured. See ENVIRONMENT_SETUP.md for detailed instructions on setting up credentials for:
- watsonx.ai LLM-as-a-Judge metrics
- Azure OpenAI metrics
Optional Dependencies
# For documentation
pip install .[docs]
# For examples
pip install .[examples]
# Install all optional dependencies
pip install .[all]
Quick Start
Basic Usage
from ragbench import DataLoaderFactory, DatasetName, Experiment
from ragbench.api.inference import InferencePipeline, InferenceParams
from ragbench.api.ingest import IngestPipeline
from ragbench.eval import MetricDefinition
# Load a dataset
data_loader = DataLoaderFactory.create_data_loader(DatasetName.HOTPOT_QA)
# Define your custom pipelines
class MyIngestPipeline(IngestPipeline):
    def process(self, data_loader):
        # Your ingestion logic
        pass

class MyInferencePipeline(InferencePipeline):
    def __init__(self, params: InferenceParams, cache_dir=None):
        super().__init__(params, cache_dir)

    def set_ingest_artifacts(self, ingest_artifacts):
        # Set up your retrieval system
        pass

    def process_no_cache(self, benchmark_entry):
        # Your inference logic
        pass

# Define evaluation metrics
metrics = [
    MetricDefinition.from_yaml_key("unitxt.context_correctness.retrieval_at_k"),
    MetricDefinition.from_yaml_key("unitxt.answer_correctness.bert_score_recall"),
]

# Create and run experiment
experiment = Experiment(
    name="my_rag_experiment",
    data_loader=data_loader,
    ingest_pipeline=MyIngestPipeline(),
    inference_pipeline=MyInferencePipeline(InferenceParams()),
    eval_metrics=metrics,
    cache_dir="./cache",
)
results, evaluation = experiment.run()
Supported Datasets
RAGWorkbench supports 18+ benchmark datasets across various domains:
| Dataset | Domain | Retrieval Hops | Modalities |
|---|---|---|---|
| AIT-QA | Financial | Single | TEXT, TABLE |
| BioASQ | Biomedical | Single | TEXT |
| CLAP-NQ | Wikipedia | Single | TEXT |
| DA-Code | Code | Single | TEXT |
| DABStep | Code | Multi | TEXT |
| HotpotQA | Wikipedia | Multi | TEXT |
| KramaBench | Wikipedia | Single | TEXT |
| Mini-Wiki | Wikipedia | Single | TEXT |
| MLDR | Multilingual | Single | TEXT |
| NarrativeQA | Literature | Single | TEXT |
| OfficeQA | Technical Docs | Single | TEXT |
| QASPER | Scientific Papers | Single | TEXT |
| SecQue | Policies | Single | TEXT |
| WatsonX DocsQA | Technical Docs | Single | TEXT |
| RealMM (4 variants) | Financial/Technical | Single | TEXT, TABLE, IMAGE |
Loading Datasets
from ragbench import DataLoaderFactory, DatasetName
# Load a specific dataset
loader = DataLoaderFactory.create_data_loader(DatasetName.BIOASQ)
# Get the benchmark and corpus
benchmark = loader.get_benchmark()
corpus = loader.get_corpus()
# Access benchmark entries
for entry in benchmark.get_benchmark_entries():
    print(f"Question: {entry.question}")
    print(f"Ground truth answers: {entry.ground_truth_answers}")
Evaluation Metrics
RAGWorkbench provides comprehensive evaluation metrics through integration with Unitxt:
Context Correctness Metrics
- Retrieval@K: Measures retrieval accuracy at different cutoffs (K=1, 3, 5, 10, 20, 40)
- MRR (Mean Reciprocal Rank): Evaluates the rank of the first relevant document
- MAP (Mean Average Precision): Measures precision across all relevant documents
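For intuition, the first two ranking metrics can be sketched in a few lines of plain Python. This is an illustrative re-implementation of the standard IR definitions, not the Unitxt-backed metrics RAGWorkbench actually uses:

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank of the first relevant document (0.0 if none is retrieved)."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

retrieved = ["d3", "d7", "d1", "d9"]  # ranked retrieval output
relevant = {"d1", "d9"}               # gold passages for this question
print(recall_at_k(retrieved, relevant, 3))  # 0.5 -- only d1 is in the top 3
print(mrr(retrieved, relevant))             # 1/3 -- first relevant doc at rank 3
```

Each metric is computed per question and then averaged over the benchmark, which is where the "Mean" in MRR and MAP comes from.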
Answer Correctness Metrics
- BERT Score Recall: Semantic similarity using BERT embeddings
- Sentence-BERT: Sentence-level semantic similarity
- LLM-as-a-Judge: Uses LLMs (Llama, GPT-4) to evaluate answer quality
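As a rough illustration of the LLM-as-a-Judge pattern, a grading prompt might be assembled as below. This is a hypothetical template for explanation only; the actual prompts used by the watsonx.ai and Azure OpenAI judges will differ:

```python
def build_judge_prompt(question, reference_answer, candidate_answer):
    """Format a grading prompt for a judge LLM (illustrative template only)."""
    return (
        "You are grading the answer to a question.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Candidate answer: {candidate_answer}\n"
        "Rate the candidate's correctness from 1 (wrong) to 5 (fully correct). "
        "Reply with the number only."
    )
```

The judge model's numeric reply is then parsed and aggregated across the benchmark like any other per-question score.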
Using Metrics
from ragbench.eval import MetricDefinition
# Load metrics from YAML definitions
metric = MetricDefinition.from_yaml_key("unitxt.context_correctness.retrieval_at_k")
# Or create custom metrics
custom_metric = MetricDefinition(
    metric_id="custom.metric",
    metric_params={"param": "value"},
    metric_fields=["field1", "field2"],
    vendor="unitxt",
)
Caching System
RAGWorkbench includes a sophisticated caching system to speed up experiments:
from pathlib import Path
# Enable caching for all components
cache_dir = Path("./cache")
experiment = Experiment(
    name="cached_experiment",
    data_loader=data_loader,
    ingest_pipeline=ingest_pipeline,
    inference_pipeline=MyInferencePipeline(params, cache_dir=cache_dir),
    eval_metrics=metrics,
    cache_dir=cache_dir,  # Enables evaluator caching
)
The caching system supports:
- Data Loader Cache: Caches loaded datasets
- Generation Cache: Caches inference results
- Evaluator Cache: Caches evaluation results
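Conceptually, each of these caches maps a hash of its inputs to a file on disk, so a repeated run with identical inputs skips the expensive computation. A minimal sketch of the idea (illustrative only; cached_call is a hypothetical helper, not part of the RAGWorkbench API, and the library's actual cache layout may differ):

```python
import hashlib
import json
from pathlib import Path

def cached_call(cache_dir, key_obj, compute_fn):
    """Return a JSON-serializable result keyed by a hash of key_obj.

    On a cache hit the stored file is returned; on a miss, compute_fn()
    runs and its result is written to disk for next time.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(json.dumps(key_obj, sort_keys=True).encode()).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    result = compute_fn()
    path.write_text(json.dumps(result))
    return result
```

Hashing the full input (dataset name, model parameters, question text) means any change to the configuration naturally produces a fresh cache entry instead of a stale hit.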
Dataset Explorer
RAGWorkbench includes an interactive web-based dataset explorer:
python -m ragbench.dataset_exploration.dataset_explorer
Then open your browser to http://localhost:8080
Explorer Features
- 📋 Browse all available datasets in a sortable table
- 🔍 Search datasets by name or description
- 🎨 Filter by domain, retrieval hops, modalities, and more
- 📊 View detailed dataset statistics and metadata
- 📋 Copy dataset names with one click
Core Components
- DataLoader: Loads and manages benchmark datasets
- IngestPipeline: Processes and indexes documents
- InferencePipeline: Runs retrieval and generation
- Evaluator: Computes evaluation metrics
- Experiment: Orchestrates the complete workflow
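How these components fit together can be sketched as follows. This is an illustrative simplification of what Experiment.run() does, using the pipeline methods from the Quick Start; the real orchestration also handles caching and error reporting:

```python
def run_experiment(data_loader, ingest_pipeline, inference_pipeline, evaluator):
    # 1. Ingest: index the corpus and hand the artifacts to the inference stage
    artifacts = ingest_pipeline.process(data_loader)
    inference_pipeline.set_ingest_artifacts(artifacts)

    # 2. Inference: retrieve and generate for every benchmark entry
    results = [
        inference_pipeline.process_no_cache(entry)
        for entry in data_loader.get_benchmark().get_benchmark_entries()
    ]

    # 3. Evaluate: score the collected results with the configured metrics
    evaluation = evaluator.evaluate(results)
    return results, evaluation
```

Because each stage only talks to the next one through these narrow interfaces, you can swap in a custom retriever or generator without touching loading or evaluation.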
Development
Running Tests
# Run all tests
pytest
# Run with coverage
pytest --cov=src --cov-report=html
# Run only unit tests
pytest tests/datasets_loader/unit
# Run only integration tests
pytest -m integration
Code Quality
# Format code
black src tests
# Lint code
ruff check src tests
# Type checking
mypy src
Pre-commit Hooks
pre-commit install
pre-commit run --all-files
Contributing
We welcome contributions! Please see our contributing guidelines for more details.
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Releases
RAGWorkbench follows Semantic Versioning and uses automated releases via GitHub Actions.
Installation
Install the latest stable release from PyPI:
pip install ragworkbench
Install a specific version:
pip install ragworkbench==0.1.0
Release Process
For maintainers preparing a new release:
- Prepare the release:
  ./scripts/prepare_release.sh 0.2.0
- Create and push the tag:
  git tag -a v0.2.0 -m "Release version 0.2.0"
  git push origin v0.2.0
- Monitor the automated workflow at GitHub Actions
The release workflow will automatically:
- Build the package
- Publish to PyPI
- Create a GitHub release with release notes
For detailed release instructions, see RELEASE.md.
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Authors
- Matan Orbach - matano@il.ibm.com
- Assaf Toledo - assaf.toledo@ibm.com
- Benjamin Sznajder - benjams@il.ibm.com
- Odellia Boni - odelliab@il.ibm.com
Acknowledgments
- Built with Unitxt for evaluation metrics
- Uses NiceGUI for the dataset explorer
- Integrates with Hugging Face Datasets
Support
For questions, issues, or feature requests, please:
- Open an issue on GitHub Issues