Local-only RAG benchmarking CLI — measures recall, MRR, chunk overlap, latency, and BEIR IR metrics
hydrag-benchmark
Local-only RAG benchmarking CLI for retrieval quality and latency analysis.
Installation
pip install hydrag-benchmark
Optional GPU path for multi-head dense embeddings:
pip install "hydrag-benchmark[gpu]"
Included Suites
- suites/synthetic-smoke.yaml
- suites/k8s-kep.yaml
- suites/cpython-stdlib.yaml
Quickstart
# List shipped suites
hydrag-bench list-suites --suite-dir ./suites
# Run classic strategy benchmark
hydrag-bench run suites/synthetic-smoke.yaml \
--strategy hydrag \
--corpus-dir ./my-codebase/src \
--output-dir ./results
# Inspect output
python -m json.tool ./results/synthetic-smoke_hydrag.json
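If you would rather inspect the result from Python than from `json.tool`, a minimal sketch follows. The only key it relies on is `schema_version`, which is documented for both output files; the helper names `load_result` and `top_level_keys` are made up for this example, and the rest of the layout is discovered rather than assumed.

```python
import json
from pathlib import Path


def load_result(path):
    """Load a result file and return (schema_version, payload)."""
    payload = json.loads(Path(path).read_text(encoding="utf-8"))
    # str() guards against the version being stored as a number
    # rather than a string; which one the tool writes is not assumed here.
    version = str(payload.get("schema_version", "unknown"))
    return version, payload


def top_level_keys(payload):
    """List top-level keys so the layout can be explored without guessing."""
    return sorted(payload) if isinstance(payload, dict) else []


# Example (path taken from the quickstart above):
# version, payload = load_result("./results/synthetic-smoke_hydrag.json")
# print(version, top_level_keys(payload))
```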
Commands
hydrag-bench --help
hydrag-bench --version
# 1) Classic single-strategy benchmark
hydrag-bench run <suite.yaml> --strategy <similarity|hybrid|crag|hydrag> --corpus-dir <path> [options]
# 2) List suites
hydrag-bench list-suites --suite-dir <path>
# 3) Prefill Doc2Query cache (Phase 1a)
hydrag-bench prefill --corpus-dir <path> [options]
# 4) Multi-head harness benchmark (Heads A/B/C)
hydrag-bench multihead <suite.yaml> --corpus-dir <path> [options]
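To compare all four strategies on one suite, the `run` command can be driven from a small script. The sketch below only builds argument lists from the flags documented here; `build_run_command` is a hypothetical helper, not part of the package, and actually executing the commands (for example with `subprocess.run(argv, check=True)`) is left to the caller.

```python
import shlex

# Strategy names as listed for --strategy.
STRATEGIES = ("similarity", "hybrid", "crag", "hydrag")


def build_run_command(suite, strategy, corpus_dir, output_dir=None, n_results=None):
    """Build the argv list for one `hydrag-bench run` invocation."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    argv = [
        "hydrag-bench", "run", str(suite),
        "--strategy", strategy,
        "--corpus-dir", str(corpus_dir),
    ]
    # Optional flags are appended only when a value is given,
    # so the CLI defaults apply otherwise.
    if output_dir is not None:
        argv += ["--output-dir", str(output_dir)]
    if n_results is not None:
        argv += ["--n-results", str(n_results)]
    return argv


# Print one shell-quoted command per strategy.
for strategy in STRATEGIES:
    argv = build_run_command(
        "suites/synthetic-smoke.yaml", strategy, "./my-codebase/src", "./results"
    )
    print(shlex.join(argv))
```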
run Arguments
| Flag | Required | Default | Description |
|---|---|---|---|
| suite | yes | - | Path to benchmark suite YAML |
| --strategy | yes | - | One of similarity, hybrid, crag, hydrag |
| --corpus-dir | yes | - | Root directory of files to index |
| --output-dir | no | stdout | Directory to write <suite>_<strategy>.json |
| --suite-dir | no | - | Base dir for resolving relative suite path |
| --n-results | no | 5 | Top-k retrieval depth |
| --seed | no | 42 | Seed override |
| --embedding-model | no | Alibaba-NLP/gte-Qwen2-7B-instruct | Embedding model label passed to runner |
| --db-path | no | temp dir | ChromaDB persistence path |
list-suites Arguments
| Flag | Required | Default | Description |
|---|---|---|---|
| --suite-dir | yes | - | Directory containing .yaml / .yml suites |
prefill Arguments
| Flag | Required | Default | Description |
|---|---|---|---|
| --corpus-dir | yes | - | Root directory to chunk and process |
| --doc2query-model | no | qwen3:4b | Doc2Query model name |
| --doc2query-api-url | no | http://localhost:11434 | Doc2Query API base URL |
| --doc2query-timeout-s | no | 30.0 | Request timeout seconds |
| --doc2query-max-retries | no | 2 | Retry attempts after first failure |
| --doc2query-n-questions | no | 3 | Synthetic questions per chunk |
| --cache-dir | no | in-memory only | Directory containing augmentation_cache.json |
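Read literally, "Retry attempts after first failure" means the default of 2 allows at most three requests per chunk: one initial attempt plus two retries. The sketch below illustrates that counting only; it is not the package's implementation, and `call_with_retries`, `request`, and `backoff_s` are hypothetical names introduced for the example.

```python
import time


def call_with_retries(request, max_retries=2, backoff_s=0.0):
    """Call request() once, then retry up to max_retries times on failure.

    Assumption: any exception counts as a failure worth retrying.
    """
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    last_error = None
    # 1 initial attempt + max_retries retries.
    for attempt in range(1 + max_retries):
        try:
            return request()
        except Exception as exc:
            last_error = exc
            # Sleep only if another attempt will follow.
            if attempt < max_retries and backoff_s > 0:
                time.sleep(backoff_s)
    raise last_error
```

With `max_retries=2`, a request that fails twice and then succeeds still returns a result; a request that always fails raises after the third attempt.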
multihead Arguments
| Flag | Required | Default | Description |
|---|---|---|---|
| suite | yes | - | Path to benchmark suite YAML |
| --corpus-dir | yes | - | Root directory of files to index |
| --output-dir | no | stdout | Directory to write <suite>_multihead.json and sidecar |
| --suite-dir | no | - | Base dir for resolving relative suite path |
| --n-results | no | 5 | Top-k retrieval depth |
| --seed | no | 42 | Seed override |
| --use-gpu | no | false | Use transformers embedder (requires [gpu]) |
| --doc2query-model | no | qwen3:4b | Doc2Query model name |
| --doc2query-api-url | no | http://localhost:11434 | Doc2Query API base URL |
| --doc2query-timeout-s | no | 30.0 | Request timeout seconds |
| --doc2query-max-retries | no | 2 | Retry attempts after first failure |
| --doc2query-n-questions | no | 3 | Synthetic questions per chunk |
| --embedding-model | no | Alibaba-NLP/gte-Qwen2-7B-instruct | Dense embedding model name |
| --alpha | no | 0.5 | Head C rerank interpolation weight |
| --cache-dir | no | none | Directory for augmentation_cache.json persistence |
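The table describes `--alpha` only as an interpolation weight for Head C reranking. One common reading is a linear blend of a first-stage retrieval score and a rerank score, sketched below. This is an assumption made for illustration: the exact formula, which of the two scores `alpha` multiplies, and any score normalization are not documented here, and the function names are hypothetical.

```python
def interpolate(base_score, rerank_score, alpha=0.5):
    """Linear blend. ASSUMPTION: alpha weights the rerank score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * rerank_score + (1.0 - alpha) * base_score


def rerank(candidates, alpha=0.5):
    """Sort (doc_id, base_score, rerank_score) triples by blended score, best first."""
    scored = [
        (doc_id, interpolate(base, reranked, alpha))
        for doc_id, base, reranked in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Under this reading, alpha=0.0 keeps the first-stage order and
# alpha=1.0 uses the rerank score alone.
candidates = [("doc-a", 0.9, 0.1), ("doc-b", 0.2, 0.8)]
print(rerank(candidates, alpha=0.0))
print(rerank(candidates, alpha=1.0))
```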
Config Variables and Runtime Inputs
- hydrag-benchmark does not read HYDRAG_BENCHMARK_* environment variables.
- Operator-facing runtime configuration is via CLI flags and suite YAML fields.
- Suite-level fields consumed by code:
  - top-level: name, version, seed, description, cases
  - environment: strategy, n_results
File Paths and Artifacts
| Path / Pattern | Producer | Meaning |
|---|---|---|
| <output-dir>/<suite>_<strategy>.json | run | Single-strategy result JSON (schema_version: 0.1) |
| <output-dir>/<suite>_multihead.json | multihead | Multi-head comparison matrix (schema_version: 0.2) |
| <output-dir>/questions_sidecar.json | multihead | Head B generated questions sidecar |
| <cache-dir>/augmentation_cache.json | prefill / multihead | 3-state Doc2Query cache shared across phases |
| <db-path> | run | ChromaDB persistent store location |
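When scripting around the tool, the patterns above can be turned into paths up front. The helpers below are a sketch with hypothetical names. Whether `<suite>` is the suite file's stem or the suite's `name` field is not stated here, so the caller passes it explicitly; the quickstart example (`synthetic-smoke_hydrag.json`) is consistent with either.

```python
from pathlib import Path


def run_output_path(output_dir, suite, strategy):
    """<output-dir>/<suite>_<strategy>.json, written by `run`."""
    return Path(output_dir) / f"{suite}_{strategy}.json"


def multihead_output_paths(output_dir, suite):
    """(comparison matrix, questions sidecar), both written by `multihead`."""
    out = Path(output_dir)
    return out / f"{suite}_multihead.json", out / "questions_sidecar.json"


def cache_path(cache_dir):
    """<cache-dir>/augmentation_cache.json, shared by `prefill` and `multihead`."""
    return Path(cache_dir) / "augmentation_cache.json"


print(run_output_path("./results", "synthetic-smoke", "hydrag"))
```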
Output Schemas
- run emits schema 0.1 with per-case and aggregate metrics.
- multihead emits schema 0.2 with 5 config groups:
  - A-only
  - B-only
  - C-only
  - A+B
  - A+B+C
Frozen 0.1 Metrics
| Metric | Description |
|---|---|
| recall_at_1 | 1.0 when top result includes a relevant phrase |
| recall_at_k | Fraction of relevant phrases found in top-k |
| mrr | Mean Reciprocal Rank of first relevant result |
| chunk_overlap | Token overlap between retrieved chunks and relevant phrases |
| latency_ms.avg | Mean latency in milliseconds |
| latency_ms.p50 | 50th percentile latency |
| latency_ms.p95 | 95th percentile latency |
| latency_ms.p99 | 99th percentile latency |
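The definitions above can be made concrete with a short sketch. It is written from the table alone and is not the package's code. Two details are assumptions: a phrase counts as found via case-insensitive substring match, and percentiles use the nearest-rank method. `chunk_overlap` is left out because the table does not specify its tokenization.

```python
import math


def _contains(chunk, phrase):
    # ASSUMPTION: case-insensitive substring match.
    return phrase.lower() in chunk.lower()


def recall_at_1(chunks, phrases):
    """1.0 when the top-ranked chunk includes any relevant phrase."""
    if not chunks:
        return 0.0
    return 1.0 if any(_contains(chunks[0], p) for p in phrases) else 0.0


def recall_at_k(chunks, phrases, k):
    """Fraction of relevant phrases found anywhere in the top-k chunks."""
    if not phrases:
        return 0.0
    top = chunks[:k]
    found = sum(1 for p in phrases if any(_contains(c, p) for c in top))
    return found / len(phrases)


def reciprocal_rank(chunks, phrases):
    """1/rank of the first chunk containing a relevant phrase, else 0.0."""
    for rank, chunk in enumerate(chunks, start=1):
        if any(_contains(chunk, p) for p in phrases):
            return 1.0 / rank
    return 0.0


def mrr(per_case_rr):
    """Mean of per-case reciprocal ranks."""
    return sum(per_case_rr) / len(per_case_rr) if per_case_rr else 0.0


def percentile(values, pct):
    """Nearest-rank percentile (ASSUMPTION: the tool may interpolate instead)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```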
Suite YAML Format
name: my-benchmark
version: "1.0"
seed: 42
description: Description of the benchmark suite.
environment:
  strategy: hydrag
  n_results: 5
cases:
  - id: case-001
    query: "search query text"
    relevant_phrases:
      - "expected phrase in results"
      - "another expected phrase"
    tags: [optional, tags]
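Before a long run it can help to sanity-check a suite. The sketch below validates an already parsed suite (for example the dict returned by PyYAML's `yaml.safe_load`) against the fields shown above. Which fields are mandatory is an assumption: the checks require `name`, `cases`, and per-case `query` and `relevant_phrases`, and the tool itself may be stricter or more lenient. `validate_suite` is a hypothetical helper.

```python
def validate_suite(suite):
    """Return a list of human-readable problems; empty means the checks passed."""
    problems = []
    # ASSUMPTION: name and cases are treated as mandatory here.
    for key in ("name", "cases"):
        if key not in suite:
            problems.append(f"missing top-level field: {key}")

    cases = suite.get("cases", [])
    if not isinstance(cases, list):
        return problems + ["cases must be a list"]

    seen_ids = set()
    for index, case in enumerate(cases):
        if not isinstance(case, dict):
            problems.append(f"case #{index} must be a mapping")
            continue
        case_id = case.get("id", f"#{index}")
        if case_id in seen_ids:
            problems.append(f"duplicate case id: {case_id}")
        seen_ids.add(case_id)
        if not case.get("query"):
            problems.append(f"case {case_id}: missing query")
        phrases = case.get("relevant_phrases")
        if not isinstance(phrases, list) or not phrases:
            problems.append(
                f"case {case_id}: relevant_phrases must be a non-empty list"
            )
    return problems


suite = {
    "name": "my-benchmark",
    "version": "1.0",
    "seed": 42,
    "cases": [
        {
            "id": "case-001",
            "query": "search query text",
            "relevant_phrases": ["expected phrase in results"],
        }
    ],
}
print(validate_suite(suite))
```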
Development
cd packages/hydrag-benchmark
pip install -e ".[dev]"
python -m pytest tests/ -v
License
Apache-2.0