Local-only RAG benchmarking CLI — measures recall, MRR, chunk overlap, latency, and BEIR IR metrics

hydrag-benchmark

Local-only RAG benchmarking CLI for retrieval quality and latency analysis.

Installation

pip install hydrag-benchmark

Optional GPU path for multi-head dense embeddings:

pip install "hydrag-benchmark[gpu]"

Included Suites

  • suites/synthetic-smoke.yaml
  • suites/k8s-kep.yaml
  • suites/cpython-stdlib.yaml

Quickstart

# List shipped suites
hydrag-bench list-suites --suite-dir ./suites

# Run classic strategy benchmark
hydrag-bench run suites/synthetic-smoke.yaml \
  --strategy hydrag \
  --corpus-dir ./my-codebase/src \
  --output-dir ./results

# Inspect output
python -m json.tool ./results/synthetic-smoke_hydrag.json

Commands

hydrag-bench --help
hydrag-bench --version

# 1) Classic single-strategy benchmark
hydrag-bench run <suite.yaml> --strategy <similarity|hybrid|crag|hydrag> --corpus-dir <path> [options]

# 2) List suites
hydrag-bench list-suites --suite-dir <path>

# 3) Prefill Doc2Query cache (Phase 1a)
hydrag-bench prefill --corpus-dir <path> [options]

# 4) Multi-head harness benchmark (Heads A/B/C)
hydrag-bench multihead <suite.yaml> --corpus-dir <path> [options]

run Arguments

| Flag | Required | Default | Description |
| --- | --- | --- | --- |
| suite | yes | - | Path to benchmark suite YAML |
| --strategy | yes | - | One of similarity, hybrid, crag, hydrag |
| --corpus-dir | yes | - | Root directory of files to index |
| --output-dir | no | stdout | Directory to write <suite>_<strategy>.json |
| --suite-dir | no | - | Base dir for resolving relative suite path |
| --n-results | no | 5 | Top-k retrieval depth |
| --seed | no | 42 | Seed override |
| --embedding-model | no | Alibaba-NLP/gte-Qwen2-7B-instruct | Embedding model label passed to runner |
| --db-path | no | temp dir | ChromaDB persistence path |
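
To compare all four strategies on one suite, the run command can be scripted. The sketch below only builds the argument lists from the flags documented above; the helper name `build_run_commands` is hypothetical and not part of hydrag-benchmark.

```python
import shlex

# The four values accepted by --strategy, as listed in the table above.
STRATEGIES = ("similarity", "hybrid", "crag", "hydrag")


def build_run_commands(suite, corpus_dir, output_dir, n_results=5, seed=42):
    """Return one `hydrag-bench run` argv list per strategy (hypothetical helper)."""
    commands = []
    for strategy in STRATEGIES:
        commands.append([
            "hydrag-bench", "run", suite,
            "--strategy", strategy,
            "--corpus-dir", corpus_dir,
            "--output-dir", output_dir,
            "--n-results", str(n_results),
            "--seed", str(seed),
        ])
    return commands


# Print the commands instead of executing them, so they can be reviewed first.
for argv in build_run_commands(
    "suites/synthetic-smoke.yaml", "./my-codebase/src", "./results"
):
    print(shlex.join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` once the printed commands look right.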

list-suites Arguments

| Flag | Required | Default | Description |
| --- | --- | --- | --- |
| --suite-dir | yes | - | Directory containing .yaml / .yml suites |

prefill Arguments

| Flag | Required | Default | Description |
| --- | --- | --- | --- |
| --corpus-dir | yes | - | Root directory to chunk and process |
| --doc2query-model | no | qwen3:4b | Doc2Query model name |
| --doc2query-api-url | no | http://localhost:11434 | Doc2Query API base URL |
| --doc2query-timeout-s | no | 30.0 | Request timeout seconds |
| --doc2query-max-retries | no | 2 | Retry attempts after first failure |
| --doc2query-n-questions | no | 3 | Synthetic questions per chunk |
| --cache-dir | no | in-memory only | Directory containing augmentation_cache.json |
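
Because --doc2query-max-retries counts attempts after the first failure, the default of 2 allows up to three calls in total. A minimal sketch of that counting follows; `call_with_retries` and `backoff_s` are illustrative names, and the package's actual retry and backoff behavior is not documented here.

```python
import time


def call_with_retries(request, max_retries=2, backoff_s=0.0):
    """Call request() once, then retry up to max_retries more times on failure."""
    last_error = None
    for attempt in range(1 + max_retries):
        try:
            return request()
        except Exception as exc:  # broad on purpose: this is only an illustration
            last_error = exc
            # Sleep between attempts, but not after the final one.
            if attempt < max_retries and backoff_s > 0:
                time.sleep(backoff_s)
    raise last_error


# A stand-in request that fails twice and then succeeds.
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


print(call_with_retries(flaky_request, max_retries=2), attempts["count"])
```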

multihead Arguments

| Flag | Required | Default | Description |
| --- | --- | --- | --- |
| suite | yes | - | Path to benchmark suite YAML |
| --corpus-dir | yes | - | Root directory of files to index |
| --output-dir | no | stdout | Directory to write <suite>_multihead.json and sidecar |
| --suite-dir | no | - | Base dir for resolving relative suite path |
| --n-results | no | 5 | Top-k retrieval depth |
| --seed | no | 42 | Seed override |
| --use-gpu | no | false | Use transformers embedder (requires [gpu]) |
| --doc2query-model | no | qwen3:4b | Doc2Query model name |
| --doc2query-api-url | no | http://localhost:11434 | Doc2Query API base URL |
| --doc2query-timeout-s | no | 30.0 | Request timeout seconds |
| --doc2query-max-retries | no | 2 | Retry attempts after first failure |
| --doc2query-n-questions | no | 3 | Synthetic questions per chunk |
| --embedding-model | no | Alibaba-NLP/gte-Qwen2-7B-instruct | Dense embedding model name |
| --alpha | no | 0.5 | Head C rerank interpolation weight |
| --cache-dir | no | none | Directory for augmentation_cache.json persistence |
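
The table does not say how --alpha is applied. One common reading of an "interpolation weight" is a linear blend of a first-stage score and a rerank score, sketched below. The function name, the score dictionaries, and the choice that alpha weights the rerank score are all assumptions, not the package's confirmed behavior.

```python
def interpolate_scores(first_stage, rerank, alpha=0.5):
    """Rank chunk ids by alpha * rerank + (1 - alpha) * first_stage.

    Both inputs map chunk id -> score. Chunks missing from `rerank`
    get a rerank score of 0.0 (an assumption made for this sketch).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be within [0, 1]")
    blended = {}
    for chunk_id, base_score in first_stage.items():
        rerank_score = rerank.get(chunk_id, 0.0)
        blended[chunk_id] = alpha * rerank_score + (1.0 - alpha) * base_score
    # Highest blended score first.
    return sorted(blended, key=blended.get, reverse=True)


first_stage = {"chunk-a": 0.9, "chunk-b": 0.5}
rerank = {"chunk-a": 0.1, "chunk-b": 0.8}
print(interpolate_scores(first_stage, rerank, alpha=0.5))
```

Under this reading, alpha = 0 keeps the first-stage order and alpha = 1 uses the rerank order alone.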

Config Variables and Runtime Inputs

  • hydrag-benchmark does not read HYDRAG_BENCHMARK_* environment variables.
  • Operator-facing runtime configuration is via CLI flags and suite YAML fields.
  • Suite-level fields consumed by code:
    • top-level: name, version, seed, description, cases
    • environment: strategy, n_results

File Paths and Artifacts

| Path / Pattern | Producer | Meaning |
| --- | --- | --- |
| <output-dir>/<suite>_<strategy>.json | run | Single-strategy result JSON (schema_version: 0.1) |
| <output-dir>/<suite>_multihead.json | multihead | Multi-head comparison matrix (schema_version: 0.2) |
| <output-dir>/questions_sidecar.json | multihead | Head B generated questions sidecar |
| <cache-dir>/augmentation_cache.json | prefill / multihead | 3-state Doc2Query cache shared across phases |
| <db-path> | run | ChromaDB persistent store location |
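
The path patterns above can be reproduced in a script that needs to locate results after a run. This sketch assumes `<suite>` is the suite file's stem, which matches the Quickstart (suites/synthetic-smoke.yaml produces synthetic-smoke_hydrag.json); it could instead be the suite's `name` field. Both helper names are hypothetical.

```python
from pathlib import Path


def run_artifact(output_dir, suite_path, strategy):
    """Expected result file for `hydrag-bench run` (suite stem is an assumption)."""
    return Path(output_dir) / f"{Path(suite_path).stem}_{strategy}.json"


def multihead_artifacts(output_dir, suite_path):
    """Expected matrix and sidecar files for `hydrag-bench multihead`."""
    out = Path(output_dir)
    matrix = out / f"{Path(suite_path).stem}_multihead.json"
    sidecar = out / "questions_sidecar.json"
    return matrix, sidecar


print(run_artifact("./results", "suites/synthetic-smoke.yaml", "hydrag"))
for path in multihead_artifacts("./results", "suites/synthetic-smoke.yaml"):
    print(path)
```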

Output Schemas

  • run emits schema 0.1 with per-case and aggregate metrics.
  • multihead emits schema 0.2 with 5 config groups:
    • A-only
    • B-only
    • C-only
    • A+B
    • A+B+C

Frozen 0.1 Metrics

| Metric | Description |
| --- | --- |
| recall_at_1 | 1.0 when top result includes a relevant phrase |
| recall_at_k | Fraction of relevant phrases found in top-k |
| mrr | Mean Reciprocal Rank of first relevant result |
| chunk_overlap | Token overlap between retrieved chunks and relevant phrases |
| latency_ms.avg | Mean latency in milliseconds |
| latency_ms.p50 | 50th percentile latency |
| latency_ms.p95 | 95th percentile latency |
| latency_ms.p99 | 99th percentile latency |
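
The definitions above can be made concrete with a small sketch. It assumes a phrase counts as found when it appears as a case-insensitive substring of a retrieved chunk, and it uses inclusive percentile interpolation; the package's exact matching and percentile rules may differ. chunk_overlap is left out because its tokenization is not specified here. All function names are illustrative.

```python
import statistics


def _hits(chunk, phrases):
    """Relevant phrases that occur in one chunk (case-insensitive substring match)."""
    text = chunk.lower()
    return {phrase for phrase in phrases if phrase.lower() in text}


def recall_at_1(chunks, phrases):
    return 1.0 if chunks and _hits(chunks[0], phrases) else 0.0


def recall_at_k(chunks, phrases, k):
    if not phrases:
        return 0.0
    found = set()
    for chunk in chunks[:k]:
        found |= _hits(chunk, phrases)
    return len(found) / len(phrases)


def reciprocal_rank(chunks, phrases):
    """1 / rank of the first chunk with any relevant phrase; MRR averages this over cases."""
    for rank, chunk in enumerate(chunks, start=1):
        if _hits(chunk, phrases):
            return 1.0 / rank
    return 0.0


def latency_summary(samples_ms):
    """Mean and percentiles; needs at least two samples."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "avg": statistics.fmean(samples_ms),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }


chunks = ["the quick brown fox", "an expected phrase in results, here", "unrelated"]
phrases = ["expected phrase in results", "another expected phrase"]
print(recall_at_1(chunks, phrases))       # 0.0: the top chunk has no relevant phrase
print(recall_at_k(chunks, phrases, k=3))  # 0.5: one of two phrases found
print(reciprocal_rank(chunks, phrases))   # 0.5: first hit is at rank 2
print(latency_summary([10.0, 20.0, 30.0, 40.0]))
```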

Suite YAML Format

name: my-benchmark
version: "1.0"
seed: 42
description: Description of the benchmark suite.

environment:
  strategy: hydrag
  n_results: 5

cases:
  - id: case-001
    query: "search query text"
    relevant_phrases:
      - "expected phrase in results"
      - "another expected phrase"
    tags: [optional, tags]
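
Before a long benchmark run, it can help to check a suite's shape. The sketch below validates an already-parsed suite (for example the result of `yaml.safe_load`, if PyYAML is installed) against the fields shown above. Which fields are strictly required is an assumption, and `validate_suite` is a hypothetical helper, not part of the CLI.

```python
# Fields listed as consumed by the code; treating them all as required is an assumption.
TOP_LEVEL_FIELDS = ("name", "version", "seed", "description", "cases")
CASE_FIELDS = ("id", "query", "relevant_phrases")
KNOWN_STRATEGIES = {"similarity", "hybrid", "crag", "hydrag"}


def validate_suite(suite):
    """Return a list of problems; an empty list means the shape looks right."""
    problems = []
    for key in TOP_LEVEL_FIELDS:
        if key not in suite:
            problems.append(f"missing top-level field: {key}")

    strategy = (suite.get("environment") or {}).get("strategy")
    if strategy is not None and strategy not in KNOWN_STRATEGIES:
        problems.append(f"unknown strategy: {strategy}")

    seen_ids = set()
    for index, case in enumerate(suite.get("cases") or []):
        for key in CASE_FIELDS:
            if not case.get(key):
                problems.append(f"case {index}: missing or empty {key}")
        case_id = case.get("id")
        if case_id is not None:
            if case_id in seen_ids:
                problems.append(f"case {index}: duplicate id {case_id}")
            seen_ids.add(case_id)
    return problems


# Same content as the YAML example above, written as a Python dict.
suite = {
    "name": "my-benchmark",
    "version": "1.0",
    "seed": 42,
    "description": "Description of the benchmark suite.",
    "environment": {"strategy": "hydrag", "n_results": 5},
    "cases": [
        {
            "id": "case-001",
            "query": "search query text",
            "relevant_phrases": ["expected phrase in results", "another expected phrase"],
            "tags": ["optional", "tags"],
        }
    ],
}
print(validate_suite(suite))
```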

Development

cd packages/hydrag-benchmark
pip install -e ".[dev]"
python -m pytest tests/ -v

License

Apache-2.0
