Local-only RAG benchmarking CLI — measures recall, MRR, chunk overlap, latency, and BEIR IR metrics

hydrag-benchmark

Local-only RAG benchmarking CLI for retrieval quality and latency analysis.

Installation

pip install hydrag-benchmark

Optional GPU path for multi-head dense embeddings:

pip install "hydrag-benchmark[gpu]"

Included Suites

suites/synthetic-smoke.yaml
suites/k8s-kep.yaml
suites/cpython-stdlib.yaml

Quickstart

# List shipped suites
hydrag-bench list-suites --suite-dir ./suites

# Run classic strategy benchmark
hydrag-bench run suites/synthetic-smoke.yaml \
  --strategy hydrag \
  --corpus-dir ./my-codebase/src \
  --output-dir ./results

# Inspect output
python -m json.tool ./results/synthetic-smoke_hydrag.json
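
To compare all four strategies on the same suite, the `run` command above can be driven from a short script. This is a minimal sketch, not part of hydrag-benchmark: the helper name `build_run_command` is hypothetical, and it uses only flags documented in this README.

```python
import shlex

# Strategy names as listed for the `--strategy` flag.
STRATEGIES = ["similarity", "hybrid", "crag", "hydrag"]


def build_run_command(suite, strategy, corpus_dir, output_dir):
    """Return the argv list for one `hydrag-bench run` invocation (hypothetical helper)."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    return [
        "hydrag-bench", "run", suite,
        "--strategy", strategy,
        "--corpus-dir", corpus_dir,
        "--output-dir", output_dir,
    ]


if __name__ == "__main__":
    for strategy in STRATEGIES:
        cmd = build_run_command(
            "suites/synthetic-smoke.yaml", strategy, "./my-codebase/src", "./results"
        )
        # Only prints the command; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
        print(shlex.join(cmd))
```

Building the argument list instead of a single string avoids shell-quoting problems when a corpus path contains spaces.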

Commands

hydrag-bench --help
hydrag-bench --version

# 1) Classic single-strategy benchmark
hydrag-bench run <suite.yaml> --strategy <similarity|hybrid|crag|hydrag> --corpus-dir <path> [options]

# 2) List suites
hydrag-bench list-suites --suite-dir <path>

# 3) Prefill Doc2Query cache (Phase 1a)
hydrag-bench prefill --corpus-dir <path> [options]

# 4) Multi-head harness benchmark (Heads A/B/C)
hydrag-bench multihead <suite.yaml> --corpus-dir <path> [options]

# 5) BEIR benchmark harness (Heads A-E + HydRAG)
hydrag-bench beir --dataset <name> [options]

`run` Arguments

Flag	Required	Default	Description
`suite`	yes	-	Path to benchmark suite YAML
`--strategy`	yes	-	One of `similarity`, `hybrid`, `crag`, `hydrag`
`--corpus-dir`	yes	-	Root directory of files to index
`--output-dir`	no	stdout	Directory to write `<suite>_<strategy>.json`
`--suite-dir`	no	-	Base dir for resolving relative `suite` path
`--n-results`	no	`5`	Top-k retrieval depth
`--seed`	no	`42`	Seed override
`--embedding-model`	no	`Alibaba-NLP/gte-Qwen2-7B-instruct`	Embedding model label passed to runner
`--db-path`	no	temp dir	ChromaDB persistence path

`list-suites` Arguments

Flag	Required	Default	Description
`--suite-dir`	yes	-	Directory containing `.yaml` / `.yml` suites

`prefill` Arguments

Flag	Required	Default	Description
`--corpus-dir`	yes	-	Root directory to chunk and process
`--doc2query-model`	no	`qwen3:4b`	Doc2Query model name
`--doc2query-api-url`	no	`http://localhost:11434`	Doc2Query API base URL
`--doc2query-timeout-s`	no	`30.0`	Request timeout seconds
`--doc2query-max-retries`	no	`2`	Retry attempts after first failure
`--doc2query-n-questions`	no	`3`	Synthetic questions per chunk
`--cache-dir`	no	in-memory only	Directory containing `augmentation_cache.json`
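
The description of `--doc2query-max-retries` ("retry attempts after first failure") implies that a request is attempted at most `1 + max_retries` times. The sketch below illustrates that reading only; `call_with_retries`, the `request` callable, and the `backoff_s` parameter are hypothetical and are not the tool's internals.

```python
import time


def call_with_retries(request, max_retries=2, backoff_s=0.0):
    """Call `request()` once, then retry up to `max_retries` more times on failure.

    `backoff_s` is an illustrative pause between attempts (an assumption,
    not a documented hydrag-bench option).
    """
    last_error = None
    for attempt in range(1 + max_retries):
        try:
            return request()
        except Exception as exc:  # a real client would catch narrower error types
            last_error = exc
            if attempt < max_retries and backoff_s > 0:
                time.sleep(backoff_s)
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

With the default of `2`, a chunk whose request keeps failing costs up to three timeouts, so the worst case per chunk is roughly three times `--doc2query-timeout-s`.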

`multihead` Arguments

Flag	Required	Default	Description
`suite`	yes	-	Path to benchmark suite YAML
`--corpus-dir`	yes	-	Root directory of files to index
`--output-dir`	no	stdout	Directory to write `<suite>_multihead.json` and sidecar
`--suite-dir`	no	-	Base dir for resolving relative `suite` path
`--n-results`	no	`5`	Top-k retrieval depth
`--seed`	no	`42`	Seed override
`--use-gpu`	no	`false`	Use transformers embedder (requires `[gpu]`)
`--doc2query-model`	no	`qwen3:4b`	Doc2Query model name
`--doc2query-api-url`	no	`http://localhost:11434`	Doc2Query API base URL
`--doc2query-timeout-s`	no	`30.0`	Request timeout seconds
`--doc2query-max-retries`	no	`2`	Retry attempts after first failure
`--doc2query-n-questions`	no	`3`	Synthetic questions per chunk
`--embedding-model`	no	`Alibaba-NLP/gte-Qwen2-7B-instruct`	Dense embedding model name
`--alpha`	no	`0.5`	Head C rerank interpolation weight
`--cache-dir`	no	none	Directory for `augmentation_cache.json` persistence

`beir` Arguments

Flag	Required	Default	Description
`--dataset`	no	`scifact`	BEIR dataset name
`--heads`	no	`head_d,head_e,head_hydrag`	Comma-separated head list
`--cache-dir`	no	default cache	BEIR dataset cache directory
`--output-dir`	no	stdout	Directory to write result JSON
`--max-queries`	no	`0`	Limit query count (`0` = all)
`--ollama-model`	no	`qwen3:4b`	Ollama model for Head E enrichment
`--ollama-host`	no	`http://localhost:11434`	Ollama API endpoint
`--ollama-timeout`	no	`30.0`	Ollama request timeout seconds
`--use-gpu`	no	`false`	Use GPU embedder for Head B/C
`--doc2query-model`	no	`qwen3:4b`	Doc2Query model for Head B
`--doc2query-api-url`	no	`http://localhost:11434`	Doc2Query API URL
`--doc2query-timeout-s`	no	`30.0`	Doc2Query timeout seconds
`--surreal-url`	no	`ws://localhost:8000`	SurrealDB WebSocket URL
`--surreal-user`	no	`root`	SurrealDB username
`--surreal-pass`	no	`root`	SurrealDB password
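
Two of the `beir` flags carry small conventions: `--heads` is a comma-separated list, and `--max-queries 0` means "all queries". The following sketch shows one way to interpret them. Both function names are hypothetical, and taking the first N queries for a positive limit is an assumption; the tool may select queries differently.

```python
def parse_heads(value):
    """Split a comma-separated `--heads` value into a list of head names."""
    heads = [part.strip() for part in value.split(",") if part.strip()]
    if not heads:
        raise ValueError("at least one head is required")
    return heads


def limit_queries(queries, max_queries=0):
    """Apply `--max-queries`: 0 keeps every query, N > 0 keeps the first N (assumed)."""
    if max_queries < 0:
        raise ValueError("max_queries must be >= 0")
    queries = list(queries)
    return queries if max_queries == 0 else queries[:max_queries]


if __name__ == "__main__":
    print(parse_heads("head_d,head_e,head_hydrag"))
    print(limit_queries(["q1", "q2", "q3"], max_queries=2))
```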

Config Variables and Runtime Inputs

hydrag-benchmark does not read `HYDRAG_BENCHMARK_*` environment variables.
Runtime configuration is supplied through CLI flags and suite YAML fields.
The code consumes the following suite-level fields:
- top-level: `name`, `version`, `seed`, `description`, `cases`
- `environment`: `strategy`, `n_results`

File Paths and Artifacts

Path / Pattern	Producer	Meaning
`<output-dir>/<suite>_<strategy>.json`	`run`	Single-strategy result JSON (`schema_version: 0.1`)
`<output-dir>/<suite>_multihead.json`	`multihead`	Multi-head comparison matrix (`schema_version: 0.2`)
`<output-dir>/questions_sidecar.json`	`multihead`	Head B generated questions sidecar
`<cache-dir>/augmentation_cache.json`	`prefill` / `multihead`	3-state Doc2Query cache shared across phases
`<db-path>`	`run`	ChromaDB persistent store location
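
The path patterns in the table can be turned into small helpers for scripts that collect results. This is a sketch under stated assumptions: the function names are hypothetical, and `<suite>` is taken to be the suite file's base name (as in the Quickstart, where `suites/synthetic-smoke.yaml` yields `synthetic-smoke_hydrag.json`); it may instead come from the suite's `name` field.

```python
from pathlib import Path


def run_result_path(output_dir, suite, strategy):
    """Expected location of a `run` result: <output-dir>/<suite>_<strategy>.json."""
    return Path(output_dir) / f"{suite}_{strategy}.json"


def multihead_paths(output_dir, suite):
    """Expected locations of the `multihead` matrix and its questions sidecar."""
    out = Path(output_dir)
    return {
        "matrix": out / f"{suite}_multihead.json",
        "sidecar": out / "questions_sidecar.json",
    }


def cache_path(cache_dir):
    """Expected location of the Doc2Query cache shared by `prefill` and `multihead`."""
    return Path(cache_dir) / "augmentation_cache.json"


if __name__ == "__main__":
    print(run_result_path("./results", "synthetic-smoke", "hydrag"))
    print(multihead_paths("./results", "synthetic-smoke")["sidecar"])
```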

Output Schemas

`run` emits schema 0.1 with per-case and aggregate metrics.
`multihead` emits schema 0.2 with five configuration groups:
- A-only
- B-only
- C-only
- A+B
- A+B+C

Frozen 0.1 Metrics

Metric	Description
`recall_at_1`	1.0 when top result includes a relevant phrase
`recall_at_k`	Fraction of relevant phrases found in top-k
`mrr`	Mean Reciprocal Rank of first relevant result
`chunk_overlap`	Token overlap between retrieved chunks and relevant phrases
`latency_ms.avg`	Mean latency in milliseconds
`latency_ms.p50`	50th percentile latency
`latency_ms.p95`	95th percentile latency
`latency_ms.p99`	99th percentile latency
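
To make the phrase-based metrics concrete, here is a minimal sketch of how `recall_at_1`, `recall_at_k`, the reciprocal rank behind `mrr`, and the latency percentiles could be computed for one case. The matching rule (case-insensitive substring) and the nearest-rank percentile method are assumptions chosen for illustration; the package's actual implementation may differ.

```python
import math


def phrase_hits(chunk, phrases):
    """Return the relevant phrases that occur in a chunk (case-insensitive substring match)."""
    text = chunk.lower()
    return {phrase for phrase in phrases if phrase.lower() in text}


def recall_at_1(chunks, phrases):
    """1.0 when the top-ranked chunk contains any relevant phrase, else 0.0."""
    return 1.0 if chunks and phrase_hits(chunks[0], phrases) else 0.0


def recall_at_k(chunks, phrases, k):
    """Fraction of relevant phrases found anywhere in the top-k chunks."""
    if not phrases:
        return 0.0
    found = set()
    for chunk in chunks[:k]:
        found |= phrase_hits(chunk, phrases)
    return len(found) / len(phrases)


def reciprocal_rank(chunks, phrases):
    """1 / rank of the first chunk containing a relevant phrase; 0.0 if none does."""
    for rank, chunk in enumerate(chunks, start=1):
        if phrase_hits(chunk, phrases):
            return 1.0 / rank
    return 0.0


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of latencies."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

`mrr` is then the mean of `reciprocal_rank` over all cases in the suite, and `latency_ms.avg` is the plain arithmetic mean of the per-query latencies.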

Suite YAML Format

name: my-benchmark
version: "1.0"
seed: 42
description: Description of the benchmark suite.

environment:
  strategy: hydrag
  n_results: 5

cases:
  - id: case-001
    query: "search query text"
    relevant_phrases:
      - "expected phrase in results"
      - "another expected phrase"
    tags: [optional, tags]
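
Before running a hand-written suite, it can help to check that the fields shown above are present. The sketch below validates a suite that has already been parsed into a Python mapping (for example with a YAML library). `validate_suite` is a hypothetical helper, and treating every listed field as required is an assumption rather than documented behaviour.

```python
# Top-level fields listed under "Config Variables and Runtime Inputs".
TOP_LEVEL_FIELDS = ("name", "version", "seed", "description", "cases")


def validate_suite(suite):
    """Return a list of problems found in a parsed suite mapping (empty if none)."""
    problems = []
    for key in TOP_LEVEL_FIELDS:
        if key not in suite:
            problems.append(f"missing top-level field: {key}")
    for index, case in enumerate(suite.get("cases") or []):
        label = case.get("id", f"case #{index}")
        if not case.get("query"):
            problems.append(f"{label}: missing query")
        if not case.get("relevant_phrases"):
            problems.append(f"{label}: missing relevant_phrases")
    return problems


# The same structure as the YAML example, expressed as a Python dict.
EXAMPLE_SUITE = {
    "name": "my-benchmark",
    "version": "1.0",
    "seed": 42,
    "description": "Description of the benchmark suite.",
    "environment": {"strategy": "hydrag", "n_results": 5},
    "cases": [
        {
            "id": "case-001",
            "query": "search query text",
            "relevant_phrases": ["expected phrase in results", "another expected phrase"],
            "tags": ["optional", "tags"],
        }
    ],
}

if __name__ == "__main__":
    print(validate_suite(EXAMPLE_SUITE) or "suite looks complete")
```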

Development

git clone https://github.com/gromanchenko/hydrag-benchmark.git
cd hydrag-benchmark
pip install -e ".[dev]"
python -m pytest tests/ -v

License

Apache-2.0

Release history

Version	Released
0.8.0 (this version)	Apr 26, 2026
0.7.2	Apr 8, 2026
0.6.2	Apr 7, 2026
0.6.1	Apr 7, 2026
0.6.0	Apr 6, 2026
0.5.7	Mar 27, 2026
0.5.5	Mar 21, 2026
0.5.4	Mar 18, 2026
0.5.3	Mar 18, 2026
0.5.2	Mar 18, 2026
0.5.0	Mar 18, 2026

Download files

File	Type	Size	Uploaded
`hydrag_benchmark-0.8.0.tar.gz`	Source distribution	81.5 kB	Apr 26, 2026
`hydrag_benchmark-0.8.0-py3-none-any.whl`	Built distribution (Python 3)	70.2 kB	Apr 26, 2026

Both files were uploaded via twine/6.2.0 CPython/3.13.0, without Trusted Publishing.

File hashes

File	Algorithm	Hash digest
`hydrag_benchmark-0.8.0.tar.gz`	SHA256	`8c53bd5c432b87f1bd8b4c3b75c627c79e8d948a702818aa49c7a9cc75ff114d`
`hydrag_benchmark-0.8.0.tar.gz`	MD5	`411dc0d1be72128d5fe68568def8d11a`
`hydrag_benchmark-0.8.0.tar.gz`	BLAKE2b-256	`5347e75323b20a4c3ea3637cf1a41043794d8a7c9d44348f8ef3795d5d6b91a9`
`hydrag_benchmark-0.8.0-py3-none-any.whl`	SHA256	`ecb895f2b22ed0642675a03c9a6d26a30f0f92acc9f00dd65931bf39ed14d5fb`
`hydrag_benchmark-0.8.0-py3-none-any.whl`	MD5	`f12a3be94c8e1f054d97317b1ea2dc91`
`hydrag_benchmark-0.8.0-py3-none-any.whl`	BLAKE2b-256	`f40db5fac512bd23a135b67fd8b6e0ceb715da87e0cb969a2ff18d88c7502eaf`
