
RAGWorkbench

A comprehensive benchmarking framework for Retrieval-Augmented Generation (RAG) systems.


Overview

RAGWorkbench is a powerful Python framework designed to evaluate and benchmark RAG systems across multiple datasets and metrics. It provides a unified interface for loading diverse RAG benchmarks, running inference pipelines, and computing comprehensive evaluation metrics.

Key Features

  • 🎯 Multiple Benchmark Datasets: Support for 18+ RAG benchmark datasets including AIT-QA, BioASQ, HotpotQA, NarrativeQA, QASPER, and more
  • 📊 Comprehensive Metrics: Built-in evaluation metrics for context correctness (Recall@K, MRR, MAP) and answer correctness (BERT Score, Sentence-BERT, LLM-as-a-Judge)
  • 🔄 Flexible Pipeline: Modular architecture supporting custom ingest and inference pipelines
  • 💾 Smart Caching: File-system based caching for data loading, generation, and evaluation results
  • 💰 Cost Tracking: Automatic API usage and cost tracking via LiteLLM proxy, with detailed reporting in results and boards (see the Cost Tracking section below)
  • 🌐 Interactive Explorer: Web-based dataset exploration tool with advanced filtering capabilities
  • 🧪 Experiment Management: End-to-end experiment orchestration from data loading to evaluation

Installation

Requirements

  • Python 3.11 or higher

Basic Installation

pip install .

Development Installation

git clone https://github.com/IBM/RagWorkbench.git
cd RagWorkbench
pip install -e ".[dev]"

Environment Configuration

Some evaluation metrics require environment variables to be configured. See ENVIRONMENT_SETUP.md for detailed instructions on setting up credentials for:

  • watsonx.ai LLM-as-a-Judge metrics
  • Azure OpenAI metrics
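
For example, these credentials can live in a local .env file. The variable names below are placeholders only; the authoritative names and values are listed in ENVIRONMENT_SETUP.md:

# .env file (illustrative sketch - see ENVIRONMENT_SETUP.md for the exact variable names)
WATSONX_API_KEY=your-watsonx-api-key                            # placeholder name
WATSONX_PROJECT_ID=your-watsonx-project-id                      # placeholder name
AZURE_OPENAI_API_KEY=your-azure-openai-key                      # placeholder name
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/   # placeholder name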

Optional Dependencies

# For documentation
pip install .[docs]

# For examples
pip install .[examples]

# Install all optional dependencies
pip install .[all]

Quick Start

Basic Usage

from ragbench import DataLoaderFactory, DatasetName, Experiment
from ragbench.api.inference import InferencePipeline, InferenceParams
from ragbench.api.ingest import IngestPipeline
from ragbench.eval import MetricDefinition

# Load a dataset
data_loader = DataLoaderFactory.create_data_loader(DatasetName.HOTPOT_QA)

# Define your custom pipelines
class MyIngestPipeline(IngestPipeline):
    def process(self, data_loader):
        # Your ingestion logic
        pass

class MyInferencePipeline(InferencePipeline):
    def __init__(self, params: InferenceParams, cache_dir=None):
        super().__init__(params, cache_dir)

    def set_ingest_artifacts(self, ingest_artifacts):
        # Set up your retrieval system
        pass

    def process_no_cache(self, benchmark_entry):
        # Your inference logic
        pass

# Define evaluation metrics
metrics = [
    MetricDefinition.from_yaml_key("unitxt.context_correctness.retrieval_at_k"),
    MetricDefinition.from_yaml_key("unitxt.answer_correctness.bert_score_recall"),
]

# Create and run experiment
experiment = Experiment(
    name="my_rag_experiment",
    data_loader=data_loader,
    ingest_pipeline=MyIngestPipeline(),
    inference_pipeline=MyInferencePipeline(InferenceParams()),
    eval_metrics=metrics,
    cache_dir="./cache"
)

results, evaluation = experiment.run()

Supported Datasets

RAGWorkbench supports 18+ benchmark datasets across various domains:

| Dataset | Domain | Retrieval Hops | Modalities |
|---------|--------|----------------|------------|
| AIT-QA | Financial | Single | TEXT, TABLE |
| BioASQ | Biomedical | Single | TEXT |
| CLAP-NQ | Wikipedia | Single | TEXT |
| DA-Code | Code | Single | TEXT |
| DABStep | Code | Multi | TEXT |
| HotpotQA | Wikipedia | Multi | TEXT |
| KramaBench | Wikipedia | Single | TEXT |
| Mini-Wiki | Wikipedia | Single | TEXT |
| MLDR | Multilingual | Single | TEXT |
| NarrativeQA | Literature | Single | TEXT |
| OfficeQA | Technical Docs | Single | TEXT |
| QASPER | Scientific Papers | Single | TEXT |
| SecQue | Policies | Single | TEXT |
| WatsonX DocsQA | Technical Docs | Single | TEXT |
| RealMM (4 variants) | Financial/Technical | Single | TEXT, TABLE, IMAGE |

Loading Datasets

from ragbench import DataLoaderFactory, DatasetName

# Load a specific dataset
loader = DataLoaderFactory.create_data_loader(DatasetName.BIOASQ)

# Get the benchmark and corpus
benchmark = loader.get_benchmark()
corpus = loader.get_corpus()

# Access benchmark entries
for entry in benchmark.get_benchmark_entries():
    print(f"Question: {entry.question}")
    print(f"Ground truth answers: {entry.ground_truth_answers}")

Evaluation Metrics

RAGWorkbench provides comprehensive evaluation metrics through integration with Unitxt:

Context Correctness Metrics

  • Retrieval@K: Measures retrieval accuracy at different cutoffs (K=1, 3, 5, 10, 20, 40)
  • MRR (Mean Reciprocal Rank): Evaluates the rank of the first relevant document
  • MAP (Mean Average Precision): Measures precision across all relevant documents
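
As an illustration of what these metrics measure, here is a standalone sketch of the standard Recall@K and MRR definitions on a toy ranking (not RAGWorkbench's internal implementation, which is provided through Unitxt):

# Toy example: ranked document IDs returned by a retriever vs. the relevant IDs.
retrieved = ["d7", "d2", "d9", "d4", "d1"]   # ranked, best first
relevant = {"d2", "d4"}

# Recall@K: fraction of relevant documents found in the top K results.
def recall_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & relevant) / len(relevant)

# MRR: reciprocal rank of the first relevant document (0 if none is retrieved).
def reciprocal_rank(retrieved, relevant):
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

print(recall_at_k(retrieved, relevant, 3))   # 0.5 (only d2 appears in the top 3)
print(reciprocal_rank(retrieved, relevant))  # 0.5 (first relevant doc at rank 2)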

Answer Correctness Metrics

  • BERT Score Recall: Semantic similarity using BERT embeddings
  • Sentence-BERT: Sentence-level semantic similarity
  • LLM-as-a-Judge: Uses LLMs (Llama, GPT-4) to evaluate answer quality
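
For intuition, sentence-level semantic similarity can be sketched with the sentence-transformers library. This is an illustration of the underlying idea rather than the metric implementation RAGWorkbench uses, and the model name is just a common default:

from sentence_transformers import SentenceTransformer, util

# Compare a generated answer against a ground-truth answer via cosine similarity
# of their sentence embeddings. The model choice here is illustrative.
model = SentenceTransformer("all-MiniLM-L6-v2")
prediction = "Aspirin inhibits the COX enzymes."
reference = "Aspirin works by inhibiting cyclooxygenase (COX) enzymes."

embeddings = model.encode([prediction, reference], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Sentence-BERT similarity: {similarity:.3f}")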

Using Metrics

from ragbench.eval import MetricDefinition

# Load metrics from YAML definitions
metric = MetricDefinition.from_yaml_key("unitxt.context_correctness.retrieval_at_k")

# Or create custom metrics
custom_metric = MetricDefinition(
    metric_id="custom.metric",
    metric_params={"param": "value"},
    metric_fields=["field1", "field2"],
    vendor="unitxt"
)

Caching System

RAGWorkbench includes a caching system to speed up repeated experiments:

from pathlib import Path

# Enable caching for all components
cache_dir = Path("./cache")

experiment = Experiment(
    name="cached_experiment",
    data_loader=data_loader,
    ingest_pipeline=ingest_pipeline,
    inference_pipeline=MyInferencePipeline(params, cache_dir=cache_dir),
    eval_metrics=metrics,
    cache_dir=cache_dir  # Enables evaluator caching
)

The caching system supports:

  • Data Loader Cache: Caches loaded datasets
  • Generation Cache: Caches inference results
  • Evaluator Cache: Caches evaluation results
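
Conceptually, each cache maps a deterministic key derived from its inputs to a file on disk, so repeated runs skip the expensive step. A minimal sketch of the idea (not the framework's actual cache layout or key scheme):

import hashlib
import json
from pathlib import Path

def cached_generate(cache_dir: Path, question: str, generate_fn):
    """Illustrative file-system cache: hash the input, reuse the stored result if present."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(question.encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    result = generate_fn(question)          # expensive call (e.g. retrieval + LLM)
    cache_file.write_text(json.dumps(result))
    return result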

Cost Tracking

RAGWorkbench supports optional cost tracking for experiments using LiteLLM proxy. This feature allows you to monitor API usage and costs during experiment runs by generating unique tracking keys and querying usage statistics.

Prerequisites

Before enabling cost tracking, ensure you have:

  1. LiteLLM Proxy Running: A LiteLLM proxy server must be running (default: http://localhost:4000), with the inference and ingestion calls going through that proxy. The proxy should be configured to track usage by API key (see the LiteLLM documentation).

  2. Master Key: Set the LITELLM_MASTER_KEY environment variable to your LiteLLM proxy master key

# .env file
LITELLM_MASTER_KEY=sk-your-master-key-here
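
Because the LiteLLM proxy exposes an OpenAI-compatible endpoint, a pipeline's LLM calls can be pointed at it so that usage is attributed to a tracking key. A minimal sketch follows; the model name and API key are placeholders, and RAGWorkbench generates the per-experiment tracking keys itself when usage tracking is enabled:

from openai import OpenAI

# Route chat completions through the local LiteLLM proxy so usage is tracked per key.
# The api_key below stands in for a per-experiment tracking key.
client = OpenAI(base_url="http://localhost:4000", api_key="sk-experiment-tracking-key")

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use any model configured on the proxy
    messages=[{"role": "user", "content": "What is retrieval-augmented generation?"}],
)
print(response.choices[0].message.content)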

Enabling Cost Tracking

Cost tracking is configured at the experiment level in your board.yaml file:

# Experiment-level configuration
experiment:
  usage_tracking: true  # Enable cost tracking

Viewing Cost Tracking Results

When cost tracking is enabled, usage and cost information is available in:

1. Board Results (CSV)

Cost data is included in the main results.csv file in the output/ directory with columns:

  • total_cost - Total cost in USD
  • total_tokens - Total tokens used (prompt + completion)
  • prompt_tokens - Number of prompt tokens
  • completion_tokens - Number of completion tokens
  • requests - Number of API requests made
  • models_used - List of models used

2. Board Markdown Report

Cost metrics can be displayed in the board's markdown report (board.md) by adding them to your report configuration:

report:
  screens:
    - title: "Performance & Cost"
      columns:
        accuracy_mean: "Accuracy"
        total_cost: "Cost ($)"
        total_tokens: "Tokens"

3. Experiment Results JSON

Detailed cost data is exported to experiment_results_<id>.json files:

{
  "cost_data": {
    "api_key": "sk-...",
    "total_cost": 0.1234,
    "total_tokens": 5000,
    "prompt_tokens": 3000,
    "completion_tokens": 2000,
    "requests": 10,
    "models_used": ["gpt-4", "gpt-3.5-turbo"]
  }
}
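
Because this JSON layout is stable, aggregating costs across several runs is a short script. A sketch, assuming the result files sit in the output/ directory mentioned above (adjust the glob to your layout):

import json
from pathlib import Path

# Sum total cost and token usage over all exported experiment result files.
total_cost = 0.0
total_tokens = 0
for path in Path("output").glob("experiment_results_*.json"):
    cost_data = json.loads(path.read_text()).get("cost_data", {})
    total_cost += cost_data.get("total_cost", 0.0)
    total_tokens += cost_data.get("total_tokens", 0)

print(f"Total cost: ${total_cost:.4f} across {total_tokens} tokens")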

Dataset Explorer

RAGWorkbench includes an interactive web-based dataset explorer:

python -m ragbench.dataset_exploration.dataset_explorer

Then open your browser to http://localhost:8080

Explorer Features

  • 📋 Browse all available datasets in a sortable table
  • 🔍 Search datasets by name or description
  • 🎨 Filter by domain, retrieval hops, modalities, and more
  • 📊 View detailed dataset statistics and metadata
  • 📋 Copy dataset names with one click

Core Components

  • DataLoader: Loads and manages benchmark datasets
  • IngestPipeline: Processes and indexes documents
  • InferencePipeline: Runs retrieval and generation
  • Evaluator: Computes evaluation metrics
  • Experiment: Orchestrates the complete workflow

Development

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=src --cov-report=html

# Run only unit tests
pytest tests/datasets_loader/unit

# Run only integration tests
pytest -m integration

Code Quality

# Format code
black src tests

# Lint code
ruff check src tests

# Type checking
mypy src

Pre-commit Hooks

pre-commit install
pre-commit run --all-files

Contributing

We welcome contributions! Please see our contributing guidelines for more details.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Releases

RAGWorkbench follows Semantic Versioning and uses automated releases via GitHub Actions.

Installation

Install the latest stable release from PyPI:

pip install ragworkbench

Install a specific version:

pip install ragworkbench==0.1.0

Release Process

For maintainers preparing a new release:

  1. Prepare the release:

    ./scripts/prepare_release.sh 0.2.0
    
  2. Create and push the tag:

    git tag -a v0.2.0 -m "Release version 0.2.0"
    git push origin v0.2.0
    
  3. Monitor the automated workflow at GitHub Actions

The release workflow will automatically:

  • Build the package
  • Publish to PyPI
  • Create a GitHub release with release notes

For detailed release instructions, see RELEASE.md.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Authors

Acknowledgments

Support

For questions, issues, or feature requests, please open an issue on the GitHub repository.

