RAGWorkbench
A comprehensive benchmarking framework for Retrieval-Augmented Generation (RAG) systems.
Overview
RAGWorkbench is a powerful Python framework designed to evaluate and benchmark RAG systems across multiple datasets and metrics. It provides a unified interface for loading diverse RAG benchmarks, running inference pipelines, and computing comprehensive evaluation metrics.
Key Features
- 🎯 Multiple Benchmark Datasets: Support for 18+ RAG benchmark datasets including AIT-QA, BioASQ, HotpotQA, NarrativeQA, QASPER, and more
- 📊 Comprehensive Metrics: Built-in evaluation metrics for context correctness (Recall@K, MRR, MAP) and answer correctness (BERT Score, Sentence-BERT, LLM-as-a-Judge)
- 🔄 Flexible Pipeline: Modular architecture supporting custom ingest and inference pipelines
- 💾 Smart Caching: File-system based caching for data loading, generation, and evaluation results
- 💰 Cost Tracking: Automatic API usage and cost tracking via a LiteLLM proxy, with detailed reporting in results and boards (see the Cost Tracking section below)
- 🌐 Interactive Explorer: Web-based dataset exploration tool with advanced filtering capabilities
- 🧪 Experiment Management: End-to-end experiment orchestration from data loading to evaluation
Installation
Requirements
- Python 3.11 or higher
Basic Installation
pip install .
Development Installation
git clone https://github.com/IBM/RagWorkbench.git
cd RagWorkbench
pip install -e ".[dev]"
Environment Configuration
Some evaluation metrics require environment variables to be configured. See ENVIRONMENT_SETUP.md for detailed instructions on setting up credentials for:
- watsonx.ai LLM-as-a-Judge metrics
- Azure OpenAI metrics
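For illustration, credentials are typically supplied via environment variables or a local .env file. The variable names below are placeholders only, not the framework's actual keys; ENVIRONMENT_SETUP.md is the authoritative reference:
# .env (illustrative placeholders - see ENVIRONMENT_SETUP.md for the exact variable names)
WATSONX_API_KEY=your-watsonx-api-key
WATSONX_PROJECT_ID=your-watsonx-project-id
AZURE_OPENAI_API_KEY=your-azure-openai-key
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/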
Optional Dependencies
# For documentation
pip install .[docs]
# For examples
pip install .[examples]
# Install all optional dependencies
pip install .[all]
Quick Start
Basic Usage
from ragbench import DataLoaderFactory, DatasetName, Experiment
from ragbench.api.inference import InferencePipeline, InferenceParams
from ragbench.api.ingest import IngestPipeline
from ragbench.eval import MetricDefinition
# Load a dataset
data_loader = DataLoaderFactory.create_data_loader(DatasetName.HOTPOT_QA)
# Define your custom pipelines
class MyIngestPipeline(IngestPipeline):
    def process(self, data_loader):
        # Your ingestion logic
        pass

class MyInferencePipeline(InferencePipeline):
    def __init__(self, params: InferenceParams, cache_dir=None):
        super().__init__(params, cache_dir)

    def set_ingest_artifacts(self, ingest_artifacts):
        # Set up your retrieval system
        pass

    def process_no_cache(self, benchmark_entry):
        # Your inference logic
        pass
# Define evaluation metrics
metrics = [
    MetricDefinition.from_yaml_key("unitxt.context_correctness.retrieval_at_k"),
    MetricDefinition.from_yaml_key("unitxt.answer_correctness.bert_score_recall"),
]
# Create and run experiment
experiment = Experiment(
    name="my_rag_experiment",
    data_loader=data_loader,
    ingest_pipeline=MyIngestPipeline(),
    inference_pipeline=MyInferencePipeline(InferenceParams()),
    eval_metrics=metrics,
    cache_dir="./cache",
)
results, evaluation = experiment.run()
Supported Datasets
RAGWorkbench supports 18+ benchmark datasets across various domains:
| Dataset | Domain | Retrieval Hops | Modalities |
|---|---|---|---|
| AIT-QA | Financial | Single | TEXT, TABLE |
| BioASQ | Biomedical | Single | TEXT |
| CLAP-NQ | Wikipedia | Single | TEXT |
| DA-Code | Code | Single | TEXT |
| DABStep | Code | Multi | TEXT |
| HotpotQA | Wikipedia | Multi | TEXT |
| KramaBench | Wikipedia | Single | TEXT |
| Mini-Wiki | Wikipedia | Single | TEXT |
| MLDR | Multilingual | Single | TEXT |
| NarrativeQA | Literature | Single | TEXT |
| OfficeQA | Technical Docs | Single | TEXT |
| QASPER | Scientific Papers | Single | TEXT |
| SecQue | Policies | Single | TEXT |
| WatsonX DocsQA | Technical Docs | Single | TEXT |
| RealMM (4 variants) | Financial/Technical | Single | TEXT, TABLE, IMAGE |
Loading Datasets
from ragbench import DataLoaderFactory, DatasetName
# Load a specific dataset
loader = DataLoaderFactory.create_data_loader(DatasetName.BIOASQ)
# Get the benchmark and corpus
benchmark = loader.get_benchmark()
corpus = loader.get_corpus()
# Access benchmark entries
for entry in benchmark.get_benchmark_entries():
    print(f"Question: {entry.question}")
    print(f"Ground truth answers: {entry.ground_truth_answers}")
Evaluation Metrics
RAGWorkbench provides comprehensive evaluation metrics through integration with Unitxt:
Context Correctness Metrics
- Retrieval@K: Measures retrieval accuracy at different cutoffs (K=1, 3, 5, 10, 20, 40)
- MRR (Mean Reciprocal Rank): Evaluates the rank of the first relevant document
- MAP (Mean Average Precision): Measures precision across all relevant documents
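To make the retrieval metrics above concrete, here is a minimal, self-contained sketch of how Recall@K (Retrieval@K) and MRR behave for a single query. It is an illustration only, not the Unitxt implementation RAGWorkbench uses:
# Toy example: ranked retrieval results vs. the set of gold document IDs
def recall_at_k(ranked_ids, gold_ids, k):
    retrieved = set(ranked_ids[:k])
    return len(retrieved & set(gold_ids)) / len(gold_ids)

def reciprocal_rank(ranked_ids, gold_ids):
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0

ranked = ["d7", "d2", "d9", "d4", "d1"]  # retriever output, best first
gold = {"d2", "d4"}                      # relevant documents
print(recall_at_k(ranked, gold, k=3))    # 0.5 -> only d2 appears in the top 3
print(reciprocal_rank(ranked, gold))     # 0.5 -> first relevant document is at rank 2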
Answer Correctness Metrics
- BERT Score Recall: Semantic similarity using BERT embeddings
- Sentence-BERT: Sentence-level semantic similarity
- LLM-as-a-Judge: Uses LLMs (Llama, GPT-4) to evaluate answer quality
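As a rough illustration of what sentence-level semantic similarity measures (not RAGWorkbench internals, which go through Unitxt), the sentence-transformers library can compare a generated answer against a reference directly:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedding model
prediction = "The Eiffel Tower was completed in 1889."
reference = "Construction of the Eiffel Tower finished in 1889."

embeddings = model.encode([prediction, reference], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()  # cosine similarity of the two embeddings
print(f"Semantic similarity: {similarity:.3f}")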
Using Metrics
from ragbench.eval import MetricDefinition
# Load metrics from YAML definitions
metric = MetricDefinition.from_yaml_key("unitxt.context_correctness.retrieval_at_k")
# Or create custom metrics
custom_metric = MetricDefinition(
    metric_id="custom.metric",
    metric_params={"param": "value"},
    metric_fields=["field1", "field2"],
    vendor="unitxt",
)
Caching System
RAGWorkbench includes a file-system based caching system that speeds up repeated experiment runs:
from pathlib import Path
# Enable caching for all components
cache_dir = Path("./cache")
experiment = Experiment(
    name="cached_experiment",
    data_loader=data_loader,
    ingest_pipeline=ingest_pipeline,
    inference_pipeline=MyInferencePipeline(params, cache_dir=cache_dir),
    eval_metrics=metrics,
    cache_dir=cache_dir,  # Enables evaluator caching
)
The caching system supports:
- Data Loader Cache: Caches loaded datasets
- Generation Cache: Caches inference results
- Evaluator Cache: Caches evaluation results
Cost Tracking
RAGWorkbench supports optional cost tracking for experiments using LiteLLM proxy. This feature allows you to monitor API usage and costs during experiment runs by generating unique tracking keys and querying usage statistics.
Prerequisites
Before enabling cost tracking, ensure you have:
- LiteLLM Proxy Running: A LiteLLM proxy server must be running (default: http://localhost:4000), with the inference and ingestion calls going through that proxy. The proxy should be configured to track usage by API key (see the LiteLLM documentation).
- Master Key: Set the LITELLM_MASTER_KEY environment variable to your LiteLLM proxy master key:
# .env file
LITELLM_MASTER_KEY=sk-your-master-key-here
Enabling Cost Tracking
Cost tracking is configured at the experiment level in your board.yaml file:
# Experiment-level configuration
experiment:
  usage_tracking: true  # Enable cost tracking
Viewing Cost Tracking Results
When cost tracking is enabled, usage and cost information is available in:
1. Board Results (CSV)
Cost data is included in the main results.csv file in the output/ directory with columns:
- total_cost - Total cost in USD
- total_tokens - Total tokens used (prompt + completion)
- prompt_tokens - Number of prompt tokens
- completion_tokens - Number of completion tokens
- requests - Number of API requests made
- models_used - List of models used
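To aggregate these columns programmatically, a quick sketch with pandas (assuming one row per experiment in results.csv) looks like this:
import pandas as pd

df = pd.read_csv("output/results.csv")
usage_columns = ["total_cost", "total_tokens", "prompt_tokens", "completion_tokens", "requests"]
print(df[usage_columns].sum())  # aggregate usage and cost across the board's experiments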
2. Board Markdown Report
Cost metrics can be displayed in the board's markdown report (board.md) by adding them to your report configuration:
report:
  screens:
    - title: "Performance & Cost"
      columns:
        accuracy_mean: "Accuracy"
        total_cost: "Cost ($)"
        total_tokens: "Tokens"
3. Experiment Results JSON
Detailed cost data is exported to experiment_results_<id>.json files:
{
  "cost_data": {
    "api_key": "sk-...",
    "total_cost": 0.1234,
    "total_tokens": 5000,
    "prompt_tokens": 3000,
    "completion_tokens": 2000,
    "requests": 10,
    "models_used": ["gpt-4", "gpt-3.5-turbo"]
  }
}
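To aggregate costs across several exported files, a small sketch like the following works against the JSON structure shown above (the glob pattern is an assumption about where the files are written):
import json
from pathlib import Path

total_cost = 0.0
for path in Path(".").glob("**/experiment_results_*.json"):
    with open(path) as f:
        cost = json.load(f)["cost_data"]
    total_cost += cost["total_cost"]
    print(f"{path.name}: ${cost['total_cost']:.4f} ({cost['total_tokens']} tokens, {cost['requests']} requests)")
print(f"Total across experiments: ${total_cost:.4f}")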
Dataset Explorer
RAGWorkbench includes an interactive web-based dataset explorer:
python -m ragbench.dataset_exploration.dataset_explorer
Then open your browser to http://localhost:8080
Explorer Features
- 📋 Browse all available datasets in a sortable table
- 🔍 Search datasets by name or description
- 🎨 Filter by domain, retrieval hops, modalities, and more
- 📊 View detailed dataset statistics and metadata
- 📋 Copy dataset names with one click
Core Components
- DataLoader: Loads and manages benchmark datasets
- IngestPipeline: Processes and indexes documents
- InferencePipeline: Runs retrieval and generation
- Evaluator: Computes evaluation metrics
- Experiment: Orchestrates the complete workflow
Development
Running Tests
# Run all tests
pytest
# Run with coverage
pytest --cov=src --cov-report=html
# Run only unit tests
pytest tests/datasets_loader/unit
# Run only integration tests
pytest -m integration
Code Quality
# Format code
black src tests
# Lint code
ruff check src tests
# Type checking
mypy src
Pre-commit Hooks
pre-commit install
pre-commit run --all-files
Contributing
We welcome contributions! Please see our contributing guidelines for more details.
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Releases
RAGWorkbench follows Semantic Versioning and uses automated releases via GitHub Actions.
Installation
Install the latest stable release from PyPI:
pip install ragworkbench
Install a specific version:
pip install ragworkbench==0.1.0
Release Process
For maintainers preparing a new release:
- Prepare the release:
  ./scripts/prepare_release.sh 0.2.0
- Create and push the tag:
  git tag -a v0.2.0 -m "Release version 0.2.0"
  git push origin v0.2.0
- Monitor the automated workflow at GitHub Actions
The release workflow will automatically:
- Build the package
- Publish to PyPI
- Create a GitHub release with release notes
For detailed release instructions, see RELEASE.md.
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Authors
- Matan Orbach - matano@il.ibm.com
- Assaf Toledo - assaf.toledo@ibm.com
- Benjamin Sznajder - benjams@il.ibm.com
- Odellia Boni - odelliab@il.ibm.com
Acknowledgments
- Built with Unitxt for evaluation metrics
- Uses NiceGUI for the dataset explorer
- Integrates with Hugging Face Datasets
Support
For questions, issues, or feature requests, please:
- Open an issue on GitHub Issues