RAGWorkbench
A comprehensive benchmarking framework for Retrieval-Augmented Generation (RAG) systems.
Overview
RAGWorkbench is a powerful Python framework designed to evaluate and benchmark RAG systems across multiple datasets and metrics. It provides a unified interface for loading diverse RAG benchmarks, running inference pipelines, and computing comprehensive evaluation metrics.
Key Features
- 🎯 Multiple Benchmark Datasets: Support for 18+ RAG benchmark datasets including AIT-QA, BioASQ, HotpotQA, NarrativeQA, QASPER, and more
- 📊 Comprehensive Metrics: Built-in evaluation metrics for context correctness (Recall@K, MRR, MAP) and answer correctness (BERT Score, Sentence-BERT, LLM-as-a-Judge)
- 🔄 Flexible Pipeline: Modular architecture supporting custom ingest and inference pipelines
- 💾 Smart Caching: File-system based caching for data loading, generation, and evaluation results
- 💰 Cost Tracking: Automatic API usage and cost tracking via a LiteLLM proxy, with detailed reporting in results and boards (see the Cost Tracking section below)
- 🌐 Interactive Explorer: Web-based dataset exploration tool with advanced filtering capabilities
- 🧪 Experiment Management: End-to-end experiment orchestration from data loading to evaluation
Installation
Requirements
- Python 3.11 or higher
Basic Installation
pip install .
Development Installation
git clone https://github.com/IBM/RagWorkbench.git
cd RagWorkbench
pip install -e ".[dev]"
Environment Configuration
Some evaluation metrics require environment variables to be configured. See ENVIRONMENT_SETUP.md for detailed instructions on setting up credentials for:
- watsonx.ai LLM-as-a-Judge metrics
- Azure OpenAI metrics
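For illustration, credentials are typically supplied via environment variables or a local .env file. The variable names below are placeholders only, not the framework's actual keys; ENVIRONMENT_SETUP.md is the authoritative reference:
# .env (illustrative placeholders - see ENVIRONMENT_SETUP.md for the exact variable names)
WATSONX_API_KEY=your-watsonx-api-key
WATSONX_PROJECT_ID=your-watsonx-project-id
AZURE_OPENAI_API_KEY=your-azure-openai-key
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/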
Optional Dependencies
# For documentation
pip install .[docs]
# For examples
pip install .[examples]
# Install all optional dependencies
pip install .[all]
Quick Start
Basic Usage
from ragbench import DataLoaderFactory, DatasetName, Experiment
from ragbench.api.inference import InferencePipeline, InferenceParams
from ragbench.api.ingest import IngestPipeline
from ragbench.eval import MetricDefinition
# Load a dataset
data_loader = DataLoaderFactory.create_data_loader(DatasetName.HOTPOT_QA)
# Define your custom pipelines
class MyIngestPipeline(IngestPipeline):
    def process(self, data_loader):
        # Your ingestion logic
        pass

class MyInferencePipeline(InferencePipeline):
    def __init__(self, params: InferenceParams, cache_dir=None):
        super().__init__(params, cache_dir)

    def set_ingest_artifacts(self, ingest_artifacts):
        # Set up your retrieval system
        pass

    def process_no_cache(self, benchmark_entry):
        # Your inference logic
        pass
# Define evaluation metrics
metrics = [
    MetricDefinition.from_yaml_key("unitxt.context_correctness.retrieval_at_k"),
    MetricDefinition.from_yaml_key("unitxt.answer_correctness.bert_score_recall"),
]
# Create and run experiment
experiment = Experiment(
    name="my_rag_experiment",
    data_loader=data_loader,
    ingest_pipeline=MyIngestPipeline(),
    inference_pipeline=MyInferencePipeline(InferenceParams()),
    eval_metrics=metrics,
    cache_dir="./cache",
)
results, evaluation = experiment.run()
Supported Datasets
RAGWorkbench supports 18+ benchmark datasets across various domains:
| Dataset | Domain | Retrieval Hops | Modalities |
|---|---|---|---|
| AIT-QA | Financial | Single | TEXT, TABLE |
| BioASQ | Biomedical | Single | TEXT |
| CLAP-NQ | Wikipedia | Single | TEXT |
| DA-Code | Code | Single | TEXT |
| DABStep | Code | Multi | TEXT |
| HotpotQA | Wikipedia | Multi | TEXT |
| KramaBench | Wikipedia | Single | TEXT |
| Mini-Wiki | Wikipedia | Single | TEXT |
| MLDR | Multilingual | Single | TEXT |
| NarrativeQA | Literature | Single | TEXT |
| OfficeQA | Technical Docs | Single | TEXT |
| QASPER | Scientific Papers | Single | TEXT |
| SecQue | Policies | Single | TEXT |
| WatsonX DocsQA | Technical Docs | Single | TEXT |
| RealMM (4 variants) | Financial/Technical | Single | TEXT, TABLE, IMAGE |
Loading Datasets
from ragbench import DataLoaderFactory, DatasetName
# Load a specific dataset
loader = DataLoaderFactory.create_data_loader(DatasetName.BIOASQ)
# Get the benchmark and corpus
benchmark = loader.get_benchmark()
corpus = loader.get_corpus()
# Access benchmark entries
for entry in benchmark.get_benchmark_entries():
    print(f"Question: {entry.question}")
    print(f"Ground truth answers: {entry.ground_truth_answers}")
Evaluation Metrics
RAGWorkbench provides comprehensive evaluation metrics through integration with Unitxt:
Context Correctness Metrics
- Retrieval@K: Measures retrieval accuracy at different cutoffs (K=1, 3, 5, 10, 20, 40)
- MRR (Mean Reciprocal Rank): Evaluates the rank of the first relevant document
- MAP (Mean Average Precision): Measures precision across all relevant documents
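To make the retrieval metrics above concrete, here is a minimal, self-contained sketch of how Recall@K (Retrieval@K) and MRR behave for a single query. It is an illustration only, not the Unitxt implementation RAGWorkbench uses:
# Toy example: ranked retrieval results vs. the set of gold document IDs
def recall_at_k(ranked_ids, gold_ids, k):
    retrieved = set(ranked_ids[:k])
    return len(retrieved & set(gold_ids)) / len(gold_ids)

def reciprocal_rank(ranked_ids, gold_ids):
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0

ranked = ["d7", "d2", "d9", "d4", "d1"]  # retriever output, best first
gold = {"d2", "d4"}                      # relevant documents
print(recall_at_k(ranked, gold, k=3))    # 0.5 -> only d2 appears in the top 3
print(reciprocal_rank(ranked, gold))     # 0.5 -> first relevant document is at rank 2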
Answer Correctness Metrics
- BERT Score Recall: Semantic similarity using BERT embeddings
- Sentence-BERT: Sentence-level semantic similarity
- LLM-as-a-Judge: Uses LLMs (Llama, GPT-4) to evaluate answer quality
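As a rough illustration of what sentence-level semantic similarity measures (not RAGWorkbench internals, which go through Unitxt), the sentence-transformers library can compare a generated answer against a reference directly:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedding model
prediction = "The Eiffel Tower was completed in 1889."
reference = "Construction of the Eiffel Tower finished in 1889."

embeddings = model.encode([prediction, reference], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()  # cosine similarity of the two embeddings
print(f"Semantic similarity: {similarity:.3f}")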
Using Metrics
from ragbench.eval import MetricDefinition
# Load metrics from YAML definitions
metric = MetricDefinition.from_yaml_key("unitxt.context_correctness.retrieval_at_k")
# Or create custom metrics
custom_metric = MetricDefinition(
    metric_id="custom.metric",
    metric_params={"param": "value"},
    metric_fields=["field1", "field2"],
    vendor="unitxt",
)
Caching System
RAGWorkbench includes a file-system based caching system that speeds up repeated experiment runs:
from pathlib import Path
# Enable caching for all components
cache_dir = Path("./cache")
experiment = Experiment(
    name="cached_experiment",
    data_loader=data_loader,
    ingest_pipeline=ingest_pipeline,
    inference_pipeline=MyInferencePipeline(params, cache_dir=cache_dir),
    eval_metrics=metrics,
    cache_dir=cache_dir,  # Enables evaluator caching
)
The caching system supports:
- Data Loader Cache: Caches loaded datasets
- Generation Cache: Caches inference results
- Evaluator Cache: Caches evaluation results
Cost Tracking
RAGWorkbench supports optional cost tracking for experiments using LiteLLM proxy. This feature allows you to monitor API usage and costs during experiment runs by generating unique tracking keys and querying usage statistics.
Prerequisites
Before enabling cost tracking, ensure you have:
- LiteLLM Proxy Running: A LiteLLM proxy server must be running (default: http://localhost:4000), with the inference and ingestion calls going through that proxy. The proxy should be configured to track usage by API key (see the LiteLLM documentation).
- Master Key: Set the LITELLM_MASTER_KEY environment variable to your LiteLLM proxy master key:
# .env file
LITELLM_MASTER_KEY=sk-your-master-key-here
Enabling Cost Tracking
Cost tracking is configured at the experiment level in your board.yaml file:
# Experiment-level configuration
experiment:
  usage_tracking: true  # Enable cost tracking
Viewing Cost Tracking Results
When cost tracking is enabled, usage and cost information is available in:
1. Board Results (CSV)
Cost data is included in the main results.csv file in the output/ directory with columns:
- total_cost - Total cost in USD
- total_tokens - Total tokens used (prompt + completion)
- prompt_tokens - Number of prompt tokens
- completion_tokens - Number of completion tokens
- requests - Number of API requests made
- models_used - List of models used
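To aggregate these columns programmatically, a quick sketch with pandas (assuming one row per experiment in results.csv) looks like this:
import pandas as pd

df = pd.read_csv("output/results.csv")
usage_columns = ["total_cost", "total_tokens", "prompt_tokens", "completion_tokens", "requests"]
print(df[usage_columns].sum())  # aggregate usage and cost across the board's experiments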
2. Board Markdown Report
Cost metrics can be displayed in the board's markdown report (board.md) by adding them to your report configuration:
report:
  screens:
    - title: "Performance & Cost"
      columns:
        accuracy_mean: "Accuracy"
        total_cost: "Cost ($)"
        total_tokens: "Tokens"
3. Experiment Results JSON
Detailed cost data is exported to experiment_results_<id>.json files:
{
  "cost_data": {
    "api_key": "sk-...",
    "total_cost": 0.1234,
    "total_tokens": 5000,
    "prompt_tokens": 3000,
    "completion_tokens": 2000,
    "requests": 10,
    "models_used": ["gpt-4", "gpt-3.5-turbo"]
  }
}
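To aggregate costs across several exported files, a small sketch like the following works against the JSON structure shown above (the glob pattern is an assumption about where the files are written):
import json
from pathlib import Path

total_cost = 0.0
for path in Path(".").glob("**/experiment_results_*.json"):
    with open(path) as f:
        cost = json.load(f)["cost_data"]
    total_cost += cost["total_cost"]
    print(f"{path.name}: ${cost['total_cost']:.4f} ({cost['total_tokens']} tokens, {cost['requests']} requests)")
print(f"Total across experiments: ${total_cost:.4f}")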
Dataset Explorer
RAGWorkbench includes an interactive web-based dataset explorer:
python -m ragbench.dataset_exploration.dataset_explorer
Then open your browser to http://localhost:8080
Explorer Features
- 📋 Browse all available datasets in a sortable table
- 🔍 Search datasets by name or description
- 🎨 Filter by domain, retrieval hops, modalities, and more
- 📊 View detailed dataset statistics and metadata
- 📋 Copy dataset names with one click
Core Components
- DataLoader: Loads and manages benchmark datasets
- IngestPipeline: Processes and indexes documents
- InferencePipeline: Runs retrieval and generation
- Evaluator: Computes evaluation metrics
- Experiment: Orchestrates the complete workflow
Development
Running Tests
# Run all tests
pytest
# Run with coverage
pytest --cov=src --cov-report=html
# Run only unit tests
pytest tests/datasets_loader/unit
# Run only integration tests
pytest -m integration
Code Quality
# Format code
black src tests
# Lint code
ruff check src tests
# Type checking
mypy src
Pre-commit Hooks
pre-commit install
pre-commit run --all-files
Contributing
We welcome contributions! Please see our contributing guidelines for more details.
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Releases
RAGWorkbench follows Semantic Versioning and uses automated releases via GitHub Actions.
Installation
Install the latest stable release from PyPI:
pip install ragworkbench
Install a specific version:
pip install ragworkbench==0.1.0
Release Process
For maintainers preparing a new release:
- Prepare the release:
  ./scripts/prepare_release.sh 0.2.0
- Create and push the tag:
  git tag -a v0.2.0 -m "Release version 0.2.0"
  git push origin v0.2.0
- Monitor the automated workflow at GitHub Actions
The release workflow will automatically:
- Build the package
- Publish to PyPI
- Create a GitHub release with release notes
For detailed release instructions, see RELEASE.md.
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Authors
- Matan Orbach - matano@il.ibm.com
- Assaf Toledo - assaf.toledo@ibm.com
- Benjamin Sznajder - benjams@il.ibm.com
- Odellia Boni - odelliab@il.ibm.com
Acknowledgments
- Built with Unitxt for evaluation metrics
- Uses NiceGUI for the dataset explorer
- Integrates with Hugging Face Datasets
Support
For questions, issues, or feature requests, please:
- Open an issue on GitHub Issues