Local-only RAG benchmarking CLI — measures recall, MRR, chunk overlap, latency, and BEIR IR metrics
id: HYDRAG-BENCH-README
ticket: null
category: guide
status: active
created: '2026-03-14'
updated: '2026-04-07'
author: Claude-Sonnet-4.6
summary: 'hydrag-benchmark - local RAG benchmarking CLI for retrieval quality and latency analysis; installation, suites, and command reference'
keywords:
  hydrag-benchmark: 9
  benchmark: 8
  retrieval-quality: 7
  latency: 6
  cli: 6
  beir: 5
  surrealdb: 5
  gpu-metrics: 4
  rag: 4
hydrag-benchmark
Local-only RAG benchmarking CLI for retrieval quality and latency analysis.
Installation
pip install hydrag-benchmark
Optional GPU path for multi-head dense embeddings:
pip install "hydrag-benchmark[gpu]"
Included Suites
- suites/synthetic-smoke.yaml
- suites/k8s-kep.yaml
- suites/cpython-stdlib.yaml
Quickstart
# List shipped suites
hydrag-bench list-suites --suite-dir ./suites
# Run classic strategy benchmark
hydrag-bench run suites/synthetic-smoke.yaml \
  --strategy hydrag \
  --corpus-dir ./my-codebase/src \
  --output-dir ./results
# Inspect output
python -m json.tool ./results/synthetic-smoke_hydrag.json
Commands
hydrag-bench --help
hydrag-bench --version
# 1) Classic single-strategy benchmark
hydrag-bench run <suite.yaml> --strategy <similarity|hybrid|crag|hydrag> --corpus-dir <path> [options]
# 2) List suites
hydrag-bench list-suites --suite-dir <path>
# 3) Prefill Doc2Query cache (Phase 1a)
hydrag-bench prefill --corpus-dir <path> [options]
# 4) Multi-head harness benchmark (Heads A/B/C)
hydrag-bench multihead <suite.yaml> --corpus-dir <path> [options]
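For scripted comparisons across strategies, the `run` invocation can be assembled programmatically. The sketch below uses only flags listed in this README; `build_run_command` is a hypothetical helper, not part of hydrag-benchmark, and it only builds the argument list without executing anything.

```python
# Strategy names accepted by `hydrag-bench run`, as listed in this README.
STRATEGIES = ("similarity", "hybrid", "crag", "hydrag")


def build_run_command(suite, strategy, corpus_dir, output_dir=None, n_results=None):
    """Return an argv list for `hydrag-bench run` (hypothetical helper)."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    cmd = [
        "hydrag-bench", "run", str(suite),
        "--strategy", strategy,
        "--corpus-dir", str(corpus_dir),
    ]
    # Optional flags are appended only when the caller sets them,
    # so the CLI defaults stay in effect otherwise.
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    if n_results is not None:
        cmd += ["--n-results", str(n_results)]
    return cmd


if __name__ == "__main__":
    for strategy in STRATEGIES:
        print(" ".join(build_run_command(
            "suites/synthetic-smoke.yaml", strategy,
            "./my-codebase/src", "./results")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the CLI is installed.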
run Arguments
| Flag | Required | Default | Description |
|---|---|---|---|
| `suite` | yes | - | Path to benchmark suite YAML |
| `--strategy` | yes | - | One of `similarity`, `hybrid`, `crag`, `hydrag` |
| `--corpus-dir` | yes | - | Root directory of files to index |
| `--output-dir` | no | stdout | Directory to write `<suite>_<strategy>.json` |
| `--suite-dir` | no | - | Base dir for resolving relative suite path |
| `--n-results` | no | `5` | Top-k retrieval depth |
| `--seed` | no | `42` | Seed override |
| `--embedding-model` | no | `Alibaba-NLP/gte-Qwen2-7B-instruct` | Embedding model label passed to runner |
| `--db-path` | no | temp dir | ChromaDB persistence path |
list-suites Arguments
| Flag | Required | Default | Description |
|---|---|---|---|
| `--suite-dir` | yes | - | Directory containing `.yaml` / `.yml` suites |
prefill Arguments
| Flag | Required | Default | Description |
|---|---|---|---|
| `--corpus-dir` | yes | - | Root directory to chunk and process |
| `--doc2query-model` | no | `qwen3:4b` | Doc2Query model name |
| `--doc2query-api-url` | no | `http://localhost:11434` | Doc2Query API base URL |
| `--doc2query-timeout-s` | no | `30.0` | Request timeout in seconds |
| `--doc2query-max-retries` | no | `2` | Retry attempts after first failure |
| `--doc2query-n-questions` | no | `3` | Synthetic questions per chunk |
| `--cache-dir` | no | in-memory only | Directory containing `augmentation_cache.json` |
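The `--doc2query-max-retries` description ("retry attempts after first failure") implies up to `1 + max_retries` calls per request, so the default of 2 allows three attempts in total. The sketch below illustrates that counting only. `call_with_retries` is a hypothetical helper, not the package's implementation, and the real client may add backoff or treat error types differently.

```python
def call_with_retries(request, max_retries=2):
    """Call `request()` once, then retry up to `max_retries` more times.

    Hypothetical helper: returns (result, attempts_made) on success and
    re-raises the last exception once every attempt has failed.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return request(), attempts
        except Exception:
            # `attempts - 1` retries have been used at this point.
            if attempts > max_retries:
                raise


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky():
        # Simulated endpoint: fails twice, then succeeds on the third call.
        calls["n"] += 1
        if calls["n"] < 3:
            raise TimeoutError("simulated timeout")
        return "ok"

    print(call_with_retries(flaky, max_retries=2))
```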
multihead Arguments
| Flag | Required | Default | Description |
|---|---|---|---|
| `suite` | yes | - | Path to benchmark suite YAML |
| `--corpus-dir` | yes | - | Root directory of files to index |
| `--output-dir` | no | stdout | Directory to write `<suite>_multihead.json` and sidecar |
| `--suite-dir` | no | - | Base dir for resolving relative suite path |
| `--n-results` | no | `5` | Top-k retrieval depth |
| `--seed` | no | `42` | Seed override |
| `--use-gpu` | no | `false` | Use transformers embedder (requires `[gpu]`) |
| `--doc2query-model` | no | `qwen3:4b` | Doc2Query model name |
| `--doc2query-api-url` | no | `http://localhost:11434` | Doc2Query API base URL |
| `--doc2query-timeout-s` | no | `30.0` | Request timeout in seconds |
| `--doc2query-max-retries` | no | `2` | Retry attempts after first failure |
| `--doc2query-n-questions` | no | `3` | Synthetic questions per chunk |
| `--embedding-model` | no | `Alibaba-NLP/gte-Qwen2-7B-instruct` | Dense embedding model name |
| `--alpha` | no | `0.5` | Head C rerank interpolation weight |
| `--cache-dir` | no | none | Directory for `augmentation_cache.json` persistence |
Config Variables and Runtime Inputs
- `hydrag-benchmark` does not read `HYDRAG_BENCHMARK_*` environment variables.
- Operator-facing runtime configuration is via CLI flags and suite YAML fields.
- Suite-level fields consumed by code:
  - top-level: `name`, `version`, `seed`, `description`, `cases`
  - `environment`: `strategy`, `n_results`
File Paths and Artifacts
| Path / Pattern | Producer | Meaning |
|---|---|---|
| `<output-dir>/<suite>_<strategy>.json` | `run` | Single-strategy result JSON (`schema_version: 0.1`) |
| `<output-dir>/<suite>_multihead.json` | `multihead` | Multi-head comparison matrix (`schema_version: 0.2`) |
| `<output-dir>/questions_sidecar.json` | `multihead` | Head B generated questions sidecar |
| `<cache-dir>/augmentation_cache.json` | `prefill` / `multihead` | 3-state Doc2Query cache shared across phases |
| `<db-path>` | `run` | ChromaDB persistent store location |
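The naming patterns in the table above can be captured in a small helper, which is handy when a script needs to locate results after a run. This is a sketch: `expected_artifacts` is a hypothetical function, and it assumes `<suite>` is the suite file's stem (as the Quickstart's `synthetic-smoke_hydrag.json` suggests) rather than the YAML `name` field.

```python
from pathlib import Path


def expected_artifacts(suite_path, output_dir, strategy=None):
    """Map artifact labels to the output paths described in the table.

    Assumption: <suite> is the suite file name without its extension.
    Pass a strategy for `run`; leave it as None for `multihead`.
    """
    stem = Path(suite_path).stem
    out = Path(output_dir)
    if strategy is not None:
        return {"result": out / f"{stem}_{strategy}.json"}
    return {
        "result": out / f"{stem}_multihead.json",
        "questions_sidecar": out / "questions_sidecar.json",
    }


if __name__ == "__main__":
    artifacts = expected_artifacts("suites/synthetic-smoke.yaml", "results", "hydrag")
    for label, path in artifacts.items():
        print(label, path)
```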
Output Schemas
- `run` emits schema `0.1` with per-case and aggregate metrics.
- `multihead` emits schema `0.2` with 5 config groups: `A-only`, `B-only`, `C-only`, `A+B`, `A+B+C`.
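When results from both commands end up in the same directory, the schema version is enough to tell them apart. The sketch below assumes `schema_version` is a top-level key of the result JSON, which this README implies but does not spell out; `identify_result` is a hypothetical helper.

```python
import json

# Schema versions named in this README and the command that emits each.
PRODUCERS = {"0.1": "run", "0.2": "multihead"}


def identify_result(text):
    """Return the producing command for a result JSON document, or None.

    Assumption: `schema_version` is a top-level key. It is compared as a
    string so that both "0.1" and 0.1 are accepted.
    """
    data = json.loads(text)
    return PRODUCERS.get(str(data.get("schema_version")))


if __name__ == "__main__":
    print(identify_result('{"schema_version": "0.2"}'))
```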
Frozen 0.1 Metrics
| Metric | Description |
|---|---|
| `recall_at_1` | 1.0 when top result includes a relevant phrase |
| `recall_at_k` | Fraction of relevant phrases found in top-k |
| `mrr` | Mean Reciprocal Rank of first relevant result |
| `chunk_overlap` | Token overlap between retrieved chunks and relevant phrases |
| `latency_ms.avg` | Mean latency in milliseconds |
| `latency_ms.p50` | 50th percentile latency |
| `latency_ms.p95` | 95th percentile latency |
| `latency_ms.p99` | 99th percentile latency |
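To make these definitions concrete, the sketch below scores a single case and summarizes a list of latencies. It is an illustration under stated assumptions, not the package's scoring code: a phrase counts as found when it appears as a case-insensitive substring of a retrieved chunk, percentiles use the nearest-rank method, and `score_case` and `latency_summary` are hypothetical names. `chunk_overlap` is left out because its tokenization is not specified here.

```python
import math
from statistics import mean


def score_case(chunks, relevant_phrases):
    """Score one query given ranked chunks (best first) and expected phrases."""
    phrases = [p.lower() for p in relevant_phrases]
    texts = [c.lower() for c in chunks]

    def hit(text):
        return any(p in text for p in phrases)

    # recall_at_1: 1.0 when the top chunk contains any relevant phrase.
    recall_at_1 = 1.0 if texts and hit(texts[0]) else 0.0
    # recall_at_k: fraction of phrases found anywhere in the retrieved chunks.
    found = sum(1 for p in phrases if any(p in t for t in texts))
    recall_at_k = found / len(phrases) if phrases else 0.0
    # Reciprocal rank of the first chunk that contains a relevant phrase.
    rr = next((1.0 / rank for rank, t in enumerate(texts, 1) if hit(t)), 0.0)
    return {"recall_at_1": recall_at_1, "recall_at_k": recall_at_k, "rr": rr}


def latency_summary(latencies_ms):
    """Return avg/p50/p95/p99 using nearest-rank percentiles (assumed method)."""
    ordered = sorted(latencies_ms)

    def pct(q):
        rank = max(1, math.ceil(q / 100 * len(ordered)))
        return ordered[rank - 1]

    return {"avg": mean(ordered), "p50": pct(50), "p95": pct(95), "p99": pct(99)}


if __name__ == "__main__":
    chunks = ["unrelated text", "contains the Expected Phrase in results", "more"]
    print(score_case(chunks, ["expected phrase in results", "another expected phrase"]))
    print(latency_summary([12.0, 15.0, 11.0, 40.0]))
```

Under these assumptions, a suite-level `mrr` would be the mean of the per-case `rr` values.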
Suite YAML Format
name: my-benchmark
version: "1.0"
seed: 42
description: Description of the benchmark suite.
environment:
  strategy: hydrag
  n_results: 5
cases:
  - id: case-001
    query: "search query text"
    relevant_phrases:
      - "expected phrase in results"
      - "another expected phrase"
    tags: [optional, tags]
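Before running a hand-written suite, a quick structural check can catch missing fields. The sketch below validates an already-parsed suite mapping (for example, the result of `yaml.safe_load`, assuming PyYAML is installed). `validate_suite` is a hypothetical helper, and which fields the real loader treats as mandatory is an assumption based on the example above.

```python
REQUIRED_TOP_LEVEL = ("name", "version", "cases")            # assumed mandatory
REQUIRED_CASE_FIELDS = ("id", "query", "relevant_phrases")   # assumed mandatory


def validate_suite(suite):
    """Return a list of problems; an empty list means the shape looks fine."""
    problems = [f"missing top-level field: {key}"
                for key in REQUIRED_TOP_LEVEL if key not in suite]
    seen_ids = set()
    for index, case in enumerate(suite.get("cases") or []):
        for key in REQUIRED_CASE_FIELDS:
            if not case.get(key):
                problems.append(f"case {index}: missing or empty {key}")
        case_id = case.get("id")
        # Duplicate ids would make per-case results ambiguous.
        if case_id is not None and case_id in seen_ids:
            problems.append(f"case {index}: duplicate id {case_id}")
        seen_ids.add(case_id)
    return problems


if __name__ == "__main__":
    suite = {
        "name": "my-benchmark",
        "version": "1.0",
        "seed": 42,
        "environment": {"strategy": "hydrag", "n_results": 5},
        "cases": [{"id": "case-001", "query": "search query text",
                   "relevant_phrases": ["expected phrase in results"]}],
    }
    print(validate_suite(suite) or "suite looks well-formed")
```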
Development
cd packages/hydrag-benchmark
pip install -e ".[dev]"
python -m pytest tests/ -v
License
Apache-2.0