Local-only RAG benchmarking CLI — measures recall, MRR, chunk overlap, latency, and BEIR IR metrics
hydrag-benchmark
Local-only RAG benchmarking CLI for retrieval quality and latency analysis.
Installation
pip install hydrag-benchmark
Optional GPU path for multi-head dense embeddings:
pip install "hydrag-benchmark[gpu]"
Included Suites
- `suites/synthetic-smoke.yaml`
- `suites/k8s-kep.yaml`
- `suites/cpython-stdlib.yaml`
Quickstart
# List shipped suites
hydrag-bench list-suites --suite-dir ./suites
# Run classic strategy benchmark
hydrag-bench run suites/synthetic-smoke.yaml \
--strategy hydrag \
--corpus-dir ./my-codebase/src \
--output-dir ./results
# Inspect output
python -m json.tool ./results/synthetic-smoke_hydrag.json
Commands
hydrag-bench --help
hydrag-bench --version
# 1) Classic single-strategy benchmark
hydrag-bench run <suite.yaml> --strategy <similarity|hybrid|crag|hydrag> --corpus-dir <path> [options]
# 2) List suites
hydrag-bench list-suites --suite-dir <path>
# 3) Prefill Doc2Query cache (Phase 1a)
hydrag-bench prefill --corpus-dir <path> [options]
# 4) Multi-head harness benchmark (Heads A/B/C)
hydrag-bench multihead <suite.yaml> --corpus-dir <path> [options]
# 5) BEIR benchmark harness (Heads A-E + HydRAG)
hydrag-bench beir --dataset <name> [options]
run Arguments
| Flag | Required | Default | Description |
|---|---|---|---|
| `suite` | yes | - | Path to benchmark suite YAML |
| `--strategy` | yes | - | One of `similarity`, `hybrid`, `crag`, `hydrag` |
| `--corpus-dir` | yes | - | Root directory of files to index |
| `--output-dir` | no | stdout | Directory to write `<suite>_<strategy>.json` |
| `--suite-dir` | no | - | Base dir for resolving relative suite path |
| `--n-results` | no | `5` | Top-k retrieval depth |
| `--seed` | no | `42` | Seed override |
| `--embedding-model` | no | `Alibaba-NLP/gte-Qwen2-7B-instruct` | Embedding model label passed to runner |
| `--db-path` | no | temp dir | ChromaDB persistence path |
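When several runs are scripted, for example one per strategy, it can help to assemble the `hydrag-bench run` argument list programmatically. The following is a minimal sketch that uses only the flags from the table above; `build_run_command` is a hypothetical helper written for illustration and is not part of the package.

```python
import shlex

# Strategy names accepted by --strategy, as listed in the table above.
STRATEGIES = ("similarity", "hybrid", "crag", "hydrag")


def build_run_command(suite, strategy, corpus_dir,
                      output_dir=None, n_results=None, seed=None):
    """Return an argv list for `hydrag-bench run` (hypothetical helper)."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    cmd = [
        "hydrag-bench", "run", str(suite),
        "--strategy", strategy,
        "--corpus-dir", str(corpus_dir),
    ]
    # Optional flags are added only when set, so the CLI defaults apply otherwise.
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    if n_results is not None:
        cmd += ["--n-results", str(n_results)]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd


if __name__ == "__main__":
    # Print one command line per strategy; pass a list to subprocess.run to execute it.
    for name in STRATEGIES:
        argv = build_run_command("suites/synthetic-smoke.yaml", name,
                                 "./my-codebase/src", output_dir="./results")
        print(shlex.join(argv))
```

Returning a list rather than a shell string avoids quoting problems when a corpus path contains spaces.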
list-suites Arguments
| Flag | Required | Default | Description |
|---|---|---|---|
| `--suite-dir` | yes | - | Directory containing `.yaml` / `.yml` suites |
prefill Arguments
| Flag | Required | Default | Description |
|---|---|---|---|
| `--corpus-dir` | yes | - | Root directory to chunk and process |
| `--doc2query-model` | no | `qwen3:4b` | Doc2Query model name |
| `--doc2query-api-url` | no | `http://localhost:11434` | Doc2Query API base URL |
| `--doc2query-timeout-s` | no | `30.0` | Request timeout seconds |
| `--doc2query-max-retries` | no | `2` | Retry attempts after first failure |
| `--doc2query-n-questions` | no | `3` | Synthetic questions per chunk |
| `--cache-dir` | no | in-memory only | Directory containing `augmentation_cache.json` |
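The description of `--doc2query-max-retries` ("retry attempts after first failure") suggests that the default of `2` allows up to three attempts per request. The sketch below illustrates that counting only. `call_with_retries` is a hypothetical name, and whether the package waits between attempts or what it does once retries are exhausted is not stated here.

```python
def call_with_retries(request, max_retries=2):
    """Call `request()` once, then retry up to `max_retries` more times on failure.

    Hypothetical helper: it mirrors the documented flag semantics, not the
    package's actual client code.
    """
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    last_error = None
    for _attempt in range(1 + max_retries):  # first attempt plus the retries
        try:
            return request()
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
    raise last_error


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_request():
        # Simulated endpoint that fails twice before succeeding.
        state["calls"] += 1
        if state["calls"] < 3:
            raise TimeoutError("simulated timeout")
        return ["question 1", "question 2", "question 3"]

    print(call_with_retries(flaky_request, max_retries=2), state["calls"])
```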
multihead Arguments
| Flag | Required | Default | Description |
|---|---|---|---|
| `suite` | yes | - | Path to benchmark suite YAML |
| `--corpus-dir` | yes | - | Root directory of files to index |
| `--output-dir` | no | stdout | Directory to write `<suite>_multihead.json` and sidecar |
| `--suite-dir` | no | - | Base dir for resolving relative suite path |
| `--n-results` | no | `5` | Top-k retrieval depth |
| `--seed` | no | `42` | Seed override |
| `--use-gpu` | no | `false` | Use transformers embedder (requires `[gpu]`) |
| `--doc2query-model` | no | `qwen3:4b` | Doc2Query model name |
| `--doc2query-api-url` | no | `http://localhost:11434` | Doc2Query API base URL |
| `--doc2query-timeout-s` | no | `30.0` | Request timeout seconds |
| `--doc2query-max-retries` | no | `2` | Retry attempts after first failure |
| `--doc2query-n-questions` | no | `3` | Synthetic questions per chunk |
| `--embedding-model` | no | `Alibaba-NLP/gte-Qwen2-7B-instruct` | Dense embedding model name |
| `--alpha` | no | `0.5` | Head C rerank interpolation weight |
| `--cache-dir` | no | none | Directory for `augmentation_cache.json` persistence |
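An interpolation weight such as `--alpha` is commonly used to blend two scores linearly. The sketch below shows that general technique, not Head C's actual implementation. The function names are hypothetical, and the choice that `alpha` multiplies the rerank score (rather than the first-stage score) is an assumption, since the table does not say which side it weights.

```python
def interpolate(first_stage_score, rerank_score, alpha=0.5):
    """Blend two scores linearly; alpha weights the rerank score (assumed)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * rerank_score + (1.0 - alpha) * first_stage_score


def rerank(candidates, alpha=0.5):
    """Sort (doc_id, first_stage_score, rerank_score) tuples by blended score."""
    return sorted(
        candidates,
        key=lambda c: interpolate(c[1], c[2], alpha),
        reverse=True,
    )


if __name__ == "__main__":
    docs = [("a", 0.9, 0.1), ("b", 0.5, 0.8)]
    # alpha=0.0 keeps the first-stage order; larger values favour the reranker.
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, [doc_id for doc_id, _, _ in rerank(docs, alpha)])
```

Under this convention, `alpha=0.0` ignores the reranker entirely and `alpha=1.0` ignores the first-stage score, so both scores should be on comparable scales before blending.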
beir Arguments
| Flag | Required | Default | Description |
|---|---|---|---|
| `--dataset` | no | `scifact` | BEIR dataset name |
| `--heads` | no | `head_d,head_e,head_hydrag` | Comma-separated head list |
| `--cache-dir` | no | default cache | BEIR dataset cache directory |
| `--output-dir` | no | stdout | Directory to write result JSON |
| `--max-queries` | no | `0` | Limit query count (0 = all) |
| `--ollama-model` | no | `qwen3:4b` | Ollama model for Head E enrichment |
| `--ollama-host` | no | `http://localhost:11434` | Ollama API endpoint |
| `--ollama-timeout` | no | `30.0` | Ollama request timeout seconds |
| `--use-gpu` | no | `false` | Use GPU embedder for Head B/C |
| `--doc2query-model` | no | `qwen3:4b` | Doc2Query model for Head B |
| `--doc2query-api-url` | no | `http://localhost:11434` | Doc2Query API URL |
| `--doc2query-timeout-s` | no | `30.0` | Doc2Query timeout seconds |
| `--surreal-url` | no | `ws://localhost:8000` | SurrealDB WebSocket URL |
| `--surreal-user` | no | `root` | SurrealDB username |
| `--surreal-pass` | no | `root` | SurrealDB password |
Config Variables and Runtime Inputs
- `hydrag-benchmark` does not read `HYDRAG_BENCHMARK_*` environment variables.
- Operator-facing runtime configuration is via CLI flags and suite YAML fields.
- Suite-level fields consumed by code:
  - top-level: `name`, `version`, `seed`, `description`, `cases`
  - `environment`: `strategy`, `n_results`
File Paths and Artifacts
| Path / Pattern | Producer | Meaning |
|---|---|---|
| `<output-dir>/<suite>_<strategy>.json` | `run` | Single-strategy result JSON (schema_version: 0.1) |
| `<output-dir>/<suite>_multihead.json` | `multihead` | Multi-head comparison matrix (schema_version: 0.2) |
| `<output-dir>/questions_sidecar.json` | `multihead` | Head B generated questions sidecar |
| `<cache-dir>/augmentation_cache.json` | `prefill` / `multihead` | 3-state Doc2Query cache shared across phases |
| `<db-path>` | `run` | ChromaDB persistent store location |
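Post-processing scripts often need to locate these artifacts. The sketch below derives the paths from the patterns in the table; the helper names are hypothetical. It assumes the `<suite>` part matches the suite file stem, as in the Quickstart example (`synthetic-smoke_hydrag.json`), although it may instead come from the suite's `name` field.

```python
from pathlib import Path


def run_result_path(output_dir, suite, strategy):
    """Path of the single-strategy result JSON written by `run`."""
    return Path(output_dir) / f"{suite}_{strategy}.json"


def multihead_paths(output_dir, suite):
    """Paths of the comparison matrix and the Head B questions sidecar."""
    out = Path(output_dir)
    return out / f"{suite}_multihead.json", out / "questions_sidecar.json"


if __name__ == "__main__":
    # Assumption: the suite label is the stem of the suite YAML file.
    suite = Path("suites/synthetic-smoke.yaml").stem
    print(run_result_path("results", suite, "hydrag").as_posix())
    for path in multihead_paths("results", suite):
        print(path.as_posix())
```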
Output Schemas
- `run` emits schema `0.1` with per-case and aggregate metrics.
- `multihead` emits schema `0.2` with 5 config groups: `A-only`, `B-only`, `C-only`, `A+B`, `A+B+C`.
Frozen 0.1 Metrics
| Metric | Description |
|---|---|
| `recall_at_1` | 1.0 when top result includes a relevant phrase |
| `recall_at_k` | Fraction of relevant phrases found in top-k |
| `mrr` | Mean Reciprocal Rank of first relevant result |
| `chunk_overlap` | Token overlap between retrieved chunks and relevant phrases |
| `latency_ms.avg` | Mean latency in milliseconds |
| `latency_ms.p50` | 50th percentile latency |
| `latency_ms.p95` | 95th percentile latency |
| `latency_ms.p99` | 99th percentile latency |
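To make the phrase-based definitions concrete, the sketch below computes `recall_at_1`, `recall_at_k`, and the per-case reciprocal rank that `mrr` averages. It is one reading of the table and not the package's code: treating a chunk as relevant when it contains a phrase as a case-insensitive substring is an assumption, and the function names are chosen for illustration.

```python
def _contains(chunk, phrase):
    # Assumption: relevance is a case-insensitive substring match.
    return phrase.lower() in chunk.lower()


def recall_at_1(chunks, phrases):
    """1.0 when the top-ranked chunk includes any relevant phrase, else 0.0."""
    if not chunks:
        return 0.0
    return 1.0 if any(_contains(chunks[0], p) for p in phrases) else 0.0


def recall_at_k(chunks, phrases, k):
    """Fraction of relevant phrases found somewhere in the top-k chunks."""
    if not phrases:
        return 0.0
    top = chunks[:k]
    found = sum(1 for p in phrases if any(_contains(c, p) for c in top))
    return found / len(phrases)


def reciprocal_rank(chunks, phrases):
    """1/rank of the first chunk containing a relevant phrase, or 0.0."""
    for rank, chunk in enumerate(chunks, start=1):
        if any(_contains(chunk, p) for p in phrases):
            return 1.0 / rank
    return 0.0


if __name__ == "__main__":
    retrieved = [
        "unrelated introduction",
        "text with the expected phrase in results",
        "another expected phrase appears here",
    ]
    relevant = ["expected phrase in results", "another expected phrase"]
    print(recall_at_1(retrieved, relevant))     # 0.0
    print(recall_at_k(retrieved, relevant, 2))  # 0.5
    print(reciprocal_rank(retrieved, relevant)) # 0.5
```

Averaging `reciprocal_rank` over all cases in a suite gives the mean reciprocal rank.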
Suite YAML Format
name: my-benchmark
version: "1.0"
seed: 42
description: Description of the benchmark suite.
environment:
  strategy: hydrag
  n_results: 5
cases:
  - id: case-001
    query: "search query text"
    relevant_phrases:
      - "expected phrase in results"
      - "another expected phrase"
    tags: [optional, tags]
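Before a long run it can be useful to sanity-check a suite file. The sketch below validates a suite that has already been parsed into a dictionary (for example with PyYAML's `yaml.safe_load`). `validate_suite` is a hypothetical helper, and treating `id`, `query`, and `relevant_phrases` as required with `tags` optional is an assumption drawn from the example above, not a documented schema.

```python
# Top-level fields the README lists as consumed by the code.
REQUIRED_TOP_LEVEL = ("name", "version", "seed", "description", "cases")
# Assumed per-case requirements; `tags` is treated as optional.
REQUIRED_CASE = ("id", "query", "relevant_phrases")


def validate_suite(suite):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = [
        f"missing top-level field: {key}"
        for key in REQUIRED_TOP_LEVEL
        if key not in suite
    ]
    for index, case in enumerate(suite.get("cases") or []):
        for key in REQUIRED_CASE:
            if key not in case:
                problems.append(f"cases[{index}]: missing field: {key}")
        if "relevant_phrases" in case and not case["relevant_phrases"]:
            problems.append(f"cases[{index}]: relevant_phrases is empty")
    return problems


if __name__ == "__main__":
    # Mirrors the YAML example above as an already-parsed dictionary.
    suite = {
        "name": "my-benchmark",
        "version": "1.0",
        "seed": 42,
        "description": "Description of the benchmark suite.",
        "environment": {"strategy": "hydrag", "n_results": 5},
        "cases": [
            {
                "id": "case-001",
                "query": "search query text",
                "relevant_phrases": [
                    "expected phrase in results",
                    "another expected phrase",
                ],
                "tags": ["optional", "tags"],
            }
        ],
    }
    print(validate_suite(suite))
```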
Development
git clone https://github.com/gromanchenko/hydrag-benchmark.git
cd hydrag-benchmark
pip install -e ".[dev]"
python -m pytest tests/ -v
License
Apache-2.0