A lightweight retrieval benchmarking toolkit for experimenting with retrieval pipelines.

Ragaroo: A Lightweight Retrieval Benchmarking Library for Custom Datasets

Ragaroo is a lightweight Python library for benchmarking retrieval pipelines on custom datasets. It is designed for practical RAG/retrieval experiments: load a dataset, define pipelines, run an experiment, compare quality and latency, and save reproducible artifacts.

Features

  • Strict dataset loading from corpus.jsonl, queries.jsonl, and qrels.tsv
  • BM25 retrieval with bm25s
  • Dense retrieval with Sentence Transformers and FAISS
  • Sparse retrieval with Sentence Transformers sparse encoders
  • Hybrid retrieval with reciprocal-rank fusion or average-score fusion
  • Cross-encoder reranking and retriever-as-reranker workflows
  • LLM-based query augmentation: HyDE, spelling correction, intent clarification
  • Ranking metrics, latency metrics, CSV/JSON exports, plots, and manifests
  • Local index caching for repeatable experiments

Installation

For most users, installing from PyPI is enough:

pip install ragaroo

To work directly from the GitHub repository instead, clone it first:

git clone https://github.com/marcharaoui/ragaroo.git
cd ragaroo

Then install from the local checkout in editable mode:

pip install -e .

For development, install the dev extras with uv:

uv sync --extra dev

Ragaroo requires Python 3.10+.

Dataset Format

Each dataset folder must contain:

my_dataset/
  corpus.jsonl
  queries.jsonl
  qrels.tsv

corpus.jsonl:

{"id":"d1","text":"Paris is the capital of France.","metadata":{"source":"wiki"}}
{"id":"d2","text":"Berlin is the capital of Germany."}

queries.jsonl:

{"id":"q1","text":"capital of France"}
{"id":"q2","text":"capital of Germany"}

qrels.tsv:

query-id	corpus-id	score
q1	d1	1
q2	d2	1
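
The three files above can be produced with the standard library alone. The following is a minimal sketch, not part of Ragaroo: write_dataset is a hypothetical helper, and the temporary output folder is an assumption chosen so the example has no side effects on your project.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_dataset(folder, corpus, queries, qrels):
    """Write corpus.jsonl, queries.jsonl, and qrels.tsv into `folder` (hypothetical helper)."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)

    # One JSON object per line for documents and queries.
    for name, rows in (("corpus.jsonl", corpus), ("queries.jsonl", queries)):
        with open(folder / name, "w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row, ensure_ascii=False) + "\n")

    # Tab-separated relevance judgments with the header row shown above.
    with open(folder / "qrels.tsv", "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["query-id", "corpus-id", "score"])
        writer.writerows(qrels)
    return folder


dataset_dir = write_dataset(
    Path(tempfile.mkdtemp()) / "my_dataset",
    corpus=[
        {"id": "d1", "text": "Paris is the capital of France.", "metadata": {"source": "wiki"}},
        {"id": "d2", "text": "Berlin is the capital of Germany."},
    ],
    queries=[
        {"id": "q1", "text": "capital of France"},
        {"id": "q2", "text": "capital of Germany"},
    ],
    qrels=[("q1", "d1", 1), ("q2", "d2", 1)],
)
print(sorted(p.name for p in dataset_dir.iterdir()))
```

The resulting folder path can then be passed to the dataset loader in place of a hand-written folder.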

Both id and _id are accepted as the identifier field. Ragaroo checks for missing files, malformed JSON, duplicate ids, empty text, and qrels that reference unknown queries or documents.
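
To make these checks concrete, here is a rough sketch of what such validation could look like on already-parsed rows. It is an illustration only and not Ragaroo's implementation: find_dataset_problems is a hypothetical function, and the message wording is made up for the example.

```python
def find_dataset_problems(corpus, queries, qrels):
    """Return a list of problem descriptions; an empty list means the data looks consistent."""
    problems = []

    def collect_ids(rows, kind):
        seen = set()
        for row in rows:
            # Accept either "id" or "_id" as the identifier field.
            row_id = row.get("id", row.get("_id"))
            if row_id in seen:
                problems.append(f"duplicate {kind} id: {row_id}")
            seen.add(row_id)
            # Whitespace-only text counts as empty in this sketch.
            if not str(row.get("text", "")).strip():
                problems.append(f"empty text in {kind}: {row_id}")
        return seen

    doc_ids = collect_ids(corpus, "document")
    query_ids = collect_ids(queries, "query")

    # Every qrel must point at a known query and a known document.
    for query_id, doc_id, _score in qrels:
        if query_id not in query_ids:
            problems.append(f"qrel references unknown query: {query_id}")
        if doc_id not in doc_ids:
            problems.append(f"qrel references unknown document: {doc_id}")
    return problems


problems = find_dataset_problems(
    corpus=[{"id": "d1", "text": "Paris is the capital of France."}, {"_id": "d1", "text": " "}],
    queries=[{"id": "q1", "text": "capital of France"}],
    qrels=[("q1", "d1", 1), ("q2", "d9", 1)],
)
for problem in problems:
    print(problem)
```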

Quickstart

import ragaroo as roo

dataset = roo.Dataset.from_folder("data/your_custom_dataset")
embedder = roo.SentenceTransformerEmbedder("intfloat/e5-small-v2")

pipelines = [
    roo.Pipeline(
        name="bm25",
        retriever=roo.BM25Retriever(top_k=10),
    ),
    roo.Pipeline(
        name="dense_hnsw",
        retriever=roo.DenseRetriever(
            embedder=embedder,
            top_k=10,
            index_technique="hnsw",
            distance_metric="cosine",
        ),
    ),
    roo.Pipeline(
        name="hybrid_rrf",
        retriever=roo.HybridRetriever(
            retriever_1=roo.DenseRetriever(embedder=embedder, top_k=10),
            retriever_2=roo.BM25Retriever(top_k=10),
            top_k=10,
        ),
    ),
]

report = roo.Experiment(
    dataset=dataset,
    pipelines=pipelines,
    show_progress=True,
).run()

report.summary(sort_by="mrr@10")
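
The hybrid_rrf pipeline above is named after reciprocal-rank fusion. As background, the sketch below shows the textbook form of that technique in plain Python. It is not Ragaroo's code: the function name, the constant k=60, and the alphabetical tie-breaking are assumptions, and the library's own constants may differ.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_k=10):
    """Fuse several ranked id lists; each list adds 1 / (k + rank) to a document's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_k]]


dense_ranking = ["d1", "d3", "d2"]
bm25_ranking = ["d2", "d1", "d4"]
fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking], top_k=3)
print(fused)  # ['d1', 'd2', 'd3']
```

Because only ranks are used, the two retrievers do not need comparable score scales, which is the usual reason to prefer this over averaging raw scores.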

Common Patterns

Reranking with a cross-encoder:

roo.Pipeline(
    name="dense_rerank",
    retriever=roo.DenseRetriever(embedder=embedder, top_k=50),
    reranker=roo.CrossEncoderReranker(
        model_name="cross-encoder/ms-marco-MiniLM-L-6-v2",
        top_k=10,
    ),
)

Using a retriever as a reranker:

roo.Pipeline(
    name="bm25_then_dense",
    retriever=roo.BM25Retriever(top_k=50),
    reranker=roo.DenseRetriever(embedder=embedder, top_k=10),
)
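
Conceptually, a reranking stage rescores only the candidates returned by the first stage and keeps the best few. The sketch below illustrates that idea with a toy scorer. Everything in it is hypothetical (rerank, token_overlap, and the sample candidates); a real second stage would use embedding similarity or a cross-encoder instead of token overlap.

```python
def rerank(query, candidates, score_fn, top_k):
    """Rescore first-stage candidates with `score_fn` and keep the best `top_k` ids."""
    scored = [(score_fn(query, text), doc_id) for doc_id, text in candidates]
    # Sort by descending score; the sort is stable, so ties keep first-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


def token_overlap(query, text):
    """Toy scorer: number of lowercase tokens shared by query and text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().replace(".", "").split())
    return len(query_tokens & text_tokens)


# First-stage output: (doc_id, text) pairs in first-stage order.
candidates = [
    ("d2", "Berlin is the capital of Germany."),
    ("d1", "Paris is the capital of France."),
    ("d3", "France borders Germany."),
]
reranked = rerank("capital of France", candidates, token_overlap, top_k=2)
print(reranked)  # ['d1', 'd2']
```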

HyDE query augmentation:

import os

provider = roo.OpenRouterProvider(
    api_key=os.environ["OPENROUTER_API_KEY"],
    model=os.environ.get("OPENROUTER_MODEL", "openai/gpt-4o-mini"),
)

roo.Pipeline(
    name="dense_hyde",
    retriever=roo.DenseRetriever(embedder=embedder, top_k=10),
    query_augmentation=[
        roo.HyDEQueryTransform(
            provider=provider,
            user_prompt="Write a concise support-style passage for this query.",
            system_prompt="Return only the passage.",
            temperature=0.3,
            max_tokens=220,
        )
    ],
)

If user_prompt, system_prompt, or temperature is omitted, the transform uses its default. Custom user_prompt values are prepended to the dataset query.
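
For readers new to HyDE, the idea is to ask an LLM for a hypothetical passage that answers the query and then search with that passage instead of the raw query. The sketch below shows that flow with a stand-in model so it runs offline. It is not Ragaroo's implementation: the function names, the default prompt, the blank-line separator between prompt and query, and the fallback to the original query are all assumptions.

```python
def build_hyde_prompt(query, user_prompt):
    # Assumption: the custom prompt is placed before the query, separated by a blank line.
    return f"{user_prompt}\n\n{query}"


def hyde_search_text(query, generate, user_prompt="Write a short passage that answers this query."):
    """Return the text to embed: a generated hypothetical passage rather than the raw query."""
    passage = generate(build_hyde_prompt(query, user_prompt)).strip()
    # Fall back to the original query if the model returns nothing usable.
    return passage or query


def fake_llm(prompt):
    # Stand-in for a real provider call; always returns the same passage.
    return "Paris is the capital and largest city of France."


search_text = hyde_search_text("capital of France", fake_llm)
print(search_text)
```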

By default, experiments evaluate all queries. While prototyping, pass query_limit=50 or any other integer directly to Experiment.

Metrics

Supported quality metrics:

  • recall@k
  • precision@k
  • mrr@k
  • map@k
  • hit_rate@k
  • ndcg@k

Supported latency metrics:

  • latency_ms
  • query_augmentation_latency_ms
  • retrieval_latency_ms
  • rerank_latency_ms
  • p50_latency_ms
  • p95_latency_ms
  • total_time_s
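
As an illustration of how per-query timings turn into p50 and p95 values, here is a small sketch using the nearest-rank percentile definition. The helper names are hypothetical, and Ragaroo may use a different percentile method (for example, linear interpolation), so exact numbers can differ.

```python
import math
import time


def time_queries(search, queries):
    """Run `search` once per query and return wall-clock latencies in milliseconds."""
    latencies_ms = []
    for query in queries:
        start = time.perf_counter()
        search(query)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Ten made-up latencies with two slow outliers.
latencies = [12.0, 15.0, 11.0, 40.0, 13.0, 14.0, 12.5, 13.5, 90.0, 12.2]
print(percentile(latencies, 50))  # 13.0
print(percentile(latencies, 95))  # 90.0
```

The gap between the two values shows why a tail percentile is worth reporting next to the median: a few slow queries barely move p50 but dominate p95.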

recall, precision, mrr, map, and hit_rate treat qrel scores greater than zero as relevant. ndcg uses graded qrel scores.
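
The difference between binary and graded relevance can be seen in a short sketch of three of these metrics for a single query. These are the standard textbook definitions, written under assumptions: the function names are hypothetical, ndcg uses the qrel score directly as the gain, and Ragaroo's exact formulas (for example, an exponential gain) may differ.

```python
import math


def recall_at_k(ranked_ids, qrels, k):
    """Fraction of relevant documents (score > 0) found in the top k."""
    relevant = {doc_id for doc_id, score in qrels.items() if score > 0}
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)


def mrr_at_k(ranked_ids, qrels, k):
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if qrels.get(doc_id, 0) > 0:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, qrels, k):
    """DCG of the ranking divided by the DCG of the ideal ordering, using graded scores."""
    def dcg(scores):
        return sum(score / math.log2(i + 1) for i, score in enumerate(scores, start=1))

    gains = [qrels.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal_dcg = dcg(sorted(qrels.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


qrels_q1 = {"d1": 2, "d2": 1, "d3": 0}  # graded judgments for one query
ranked = ["d3", "d1", "d4", "d2"]       # system output, best first

print(recall_at_k(ranked, qrels_q1, k=3))          # 0.5
print(mrr_at_k(ranked, qrels_q1, k=3))             # 0.5
print(round(ndcg_at_k(ranked, qrels_q1, k=3), 4))  # 0.4796
```

Here d1 and d2 both count as relevant for recall and mrr, while ndcg additionally rewards placing the score-2 document ahead of the score-1 document.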

Experiment Outputs

Each experiment saves:

  • report.json
  • report.csv
  • manifest.json
  • config.json
  • plots under plots/

With store_query_results=True, Ragaroo also saves per-query metrics and retrieved ids.

The manifest records dataset hashes, pipeline configs, pipeline hashes, dependency versions, platform metadata, git metadata when available, notes, tags, and random seed.
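
To show why dataset hashes help reproducibility, the sketch below fingerprints the three dataset files so that two runs can be checked against identical inputs. It is an assumption-laden illustration: the choice of SHA-256, the helper names, and the returned dictionary layout are not taken from Ragaroo's manifest format.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large corpora do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_dataset(folder):
    """Return a file-name to hex-digest mapping for the three dataset files."""
    folder = Path(folder)
    names = ("corpus.jsonl", "queries.jsonl", "qrels.tsv")
    return {name: sha256_of_file(folder / name) for name in names}


# Build a tiny throwaway dataset so the example is runnable as-is.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "corpus.jsonl").write_bytes(b'{"id":"d1","text":"Paris is the capital of France."}\n')
(demo_dir / "queries.jsonl").write_bytes(b'{"id":"q1","text":"capital of France"}\n')
(demo_dir / "qrels.tsv").write_bytes(b"query-id\tcorpus-id\tscore\nq1\td1\t1\n")

dataset_hashes = hash_dataset(demo_dir)
for name, digest in dataset_hashes.items():
    print(name, digest[:12])
```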

Examples

See examples/README.md.

Included examples:

  • examples/compare_retrievers.py
  • examples/compare_topk.py
  • examples/compare_models.py
  • examples/compare_rerank.py
  • examples/compare_hyde.py
  • examples/compare_multiple_datasets.py

Most examples default to data/your_custom_dataset. Change the constants at the top of each script to use another dataset, model, cache folder, or query limit.

Model Access

Use HF_TOKEN for gated Hugging Face models or higher Hub rate limits. Use OPENROUTER_API_KEY only for LLM-based query augmentation. Dataset paths, model choices, model cache folders, and query limits are regular Python arguments in scripts and examples, not environment variables.
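
A script can check these two variables up front and fail with a clear message before any model is loaded. The following is a small sketch with hypothetical helper names; only the variable names HF_TOKEN and OPENROUTER_API_KEY come from this document.

```python
import os


def read_model_credentials(env=None):
    """Collect the two optional credentials; missing variables come back as None."""
    env = os.environ if env is None else env
    return {
        "hf_token": env.get("HF_TOKEN"),
        "openrouter_api_key": env.get("OPENROUTER_API_KEY"),
    }


def require_openrouter_key(env=None):
    """Return the OpenRouter key, or raise early with an actionable message."""
    key = read_model_credentials(env)["openrouter_api_key"]
    if not key:
        raise RuntimeError("Set OPENROUTER_API_KEY before using LLM-based query augmentation.")
    return key


credentials = read_model_credentials()
print({name: value is not None for name, value in credentials.items()})
```

Passing a plain dictionary as env makes the helpers easy to exercise in tests without touching the real environment.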

To keep model downloads in a project-local folder:

import ragaroo as roo

roo.store_models("./models")

Limitations

  • Ragaroo evaluates retrieval, not answer generation.
  • It assumes the benchmark target is retrieving the right passages for each query.
  • It currently loads dataset files into memory.
  • Generated query augmentation can affect reproducibility unless the provider/model is controlled.

Development Note

Ragaroo was built with human engineering work and assistance from LLM technologies. The code is tested, but LLM-assisted projects can still contain mistakes. Please validate benchmark setup, metrics, and outputs before relying on them for important decisions.

Author

Ragaroo is created and maintained by Marc Haraoui.

Citation

@software{haraoui_ragaroo,
  author = {Marc Haraoui},
  title = {Ragaroo: A Lightweight Retrieval Benchmarking Library for Custom Datasets},
  year = {2026},
  url = {https://github.com/marcharaoui/ragaroo}
}
