Skip to main content

Local-only RAG benchmarking CLI — measures recall, MRR, chunk overlap, latency, and BEIR IR metrics

Project description

hydrag-benchmark

Local-only RAG benchmarking CLI for retrieval quality and latency analysis.

Installation

pip install hydrag-benchmark

Optional GPU path for multi-head dense embeddings:

pip install "hydrag-benchmark[gpu]"

Included Suites

  • suites/synthetic-smoke.yaml
  • suites/k8s-kep.yaml
  • suites/cpython-stdlib.yaml

Quickstart

# List shipped suites
hydrag-bench list-suites --suite-dir ./suites

# Run classic strategy benchmark
hydrag-bench run suites/synthetic-smoke.yaml \
  --strategy hydrag \
  --corpus-dir ./my-codebase/src \
  --output-dir ./results

# Inspect output
python -m json.tool ./results/synthetic-smoke_hydrag.json

Commands

hydrag-bench --help
hydrag-bench --version

# 1) Classic single-strategy benchmark
hydrag-bench run <suite.yaml> --strategy <similarity|hybrid|crag|hydrag> --corpus-dir <path> [options]

# 2) List suites
hydrag-bench list-suites --suite-dir <path>

# 3) Prefill Doc2Query cache (Phase 1a)
hydrag-bench prefill --corpus-dir <path> [options]

# 4) Multi-head harness benchmark (Heads A/B/C)
hydrag-bench multihead <suite.yaml> --corpus-dir <path> [options]

run Arguments

Flag Required Default Description
suite yes - Path to benchmark suite YAML
--strategy yes - One of similarity, hybrid, crag, hydrag
--corpus-dir yes - Root directory of files to index
--output-dir no stdout Directory to write <suite>_<strategy>.json
--suite-dir no - Base dir for resolving relative suite path
--n-results no 5 Top-k retrieval depth
--seed no 42 Seed override
--embedding-model no Alibaba-NLP/gte-Qwen2-7B-instruct Embedding model label passed to runner
--db-path no temp dir ChromaDB persistence path

list-suites Arguments

Flag Required Default Description
--suite-dir yes - Directory containing .yaml / .yml suites

prefill Arguments

Flag Required Default Description
--corpus-dir yes - Root directory to chunk and process
--doc2query-model no qwen3:4b Doc2Query model name
--doc2query-api-url no http://localhost:11434 Doc2Query API base URL
--doc2query-timeout-s no 30.0 Request timeout seconds
--doc2query-max-retries no 2 Retry attempts after first failure
--doc2query-n-questions no 3 Synthetic questions per chunk
--cache-dir no in-memory only Directory containing augmentation_cache.json

multihead Arguments

Flag Required Default Description
suite yes - Path to benchmark suite YAML
--corpus-dir yes - Root directory of files to index
--output-dir no stdout Directory to write <suite>_multihead.json and sidecar
--suite-dir no - Base dir for resolving relative suite path
--n-results no 5 Top-k retrieval depth
--seed no 42 Seed override
--use-gpu no false Use transformers embedder (requires [gpu])
--doc2query-model no qwen3:4b Doc2Query model name
--doc2query-api-url no http://localhost:11434 Doc2Query API base URL
--doc2query-timeout-s no 30.0 Request timeout seconds
--doc2query-max-retries no 2 Retry attempts after first failure
--doc2query-n-questions no 3 Synthetic questions per chunk
--embedding-model no Alibaba-NLP/gte-Qwen2-7B-instruct Dense embedding model name
--alpha no 0.5 Head C rerank interpolation weight
--cache-dir no none Directory for augmentation_cache.json persistence

Config Variables and Runtime Inputs

  • hydrag-benchmark does not read HYDRAG_BENCHMARK_* environment variables.
  • Operator-facing runtime configuration is via CLI flags and suite YAML fields.
  • Suite-level fields consumed by code:
    • top-level: name, version, seed, description, cases
    • environment: strategy, n_results

File Paths and Artifacts

Path / Pattern Producer Meaning
<output-dir>/<suite>_<strategy>.json run Single-strategy result JSON (schema_version: 0.1)
<output-dir>/<suite>_multihead.json multihead Multi-head comparison matrix (schema_version: 0.2)
<output-dir>/questions_sidecar.json multihead Head B generated questions sidecar
<cache-dir>/augmentation_cache.json prefill / multihead 3-state Doc2Query cache shared across phases
<db-path> run ChromaDB persistent store location

Output Schemas

  • run emits schema 0.1 with per-case and aggregate metrics.
  • multihead emits schema 0.2 with 5 config groups:
    • A-only
    • B-only
    • C-only
    • A+B
    • A+B+C

Frozen 0.1 Metrics

Metric Description
recall_at_1 1.0 when top result includes a relevant phrase
recall_at_k Fraction of relevant phrases found in top-k
mrr Mean Reciprocal Rank of first relevant result
chunk_overlap Token overlap between retrieved chunks and relevant phrases
latency_ms.avg Mean latency in milliseconds
latency_ms.p50 50th percentile latency
latency_ms.p95 95th percentile latency
latency_ms.p99 99th percentile latency

Suite YAML Format

name: my-benchmark
version: "1.0"
seed: 42
description: Description of the benchmark suite.

environment:
  strategy: hydrag
  n_results: 5

cases:
  - id: case-001
    query: "search query text"
    relevant_phrases:
      - "expected phrase in results"
      - "another expected phrase"
    tags: [optional, tags]

Development

cd packages/hydrag-benchmark
pip install -e ".[dev]"
python -m pytest tests/ -v

License

Apache-2.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hydrag_benchmark-0.5.7.tar.gz (57.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

hydrag_benchmark-0.5.7-py3-none-any.whl (47.1 kB view details)

Uploaded Python 3

File details

Details for the file hydrag_benchmark-0.5.7.tar.gz.

File metadata

  • Download URL: hydrag_benchmark-0.5.7.tar.gz
  • Upload date:
  • Size: 57.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.0

File hashes

Hashes for hydrag_benchmark-0.5.7.tar.gz
Algorithm Hash digest
SHA256 a34315c5326b3ed532bf9a740a9da094a861cf27bd7894dac5e9287ef70f2976
MD5 6eb809019369febee408372f97be7599
BLAKE2b-256 07ab616c8c5b8b01a0b64e684f95632fde34b5f89850386cd9a289b56c9bd4a6

See more details on using hashes here.

File details

Details for the file hydrag_benchmark-0.5.7-py3-none-any.whl.

File metadata

File hashes

Hashes for hydrag_benchmark-0.5.7-py3-none-any.whl
Algorithm Hash digest
SHA256 7340a0946ff7ae31989d027bd4f736c9f84e194bac08294a5208809e918d3e1d
MD5 e7d8fce685ed4d2eec9b9df4668f954d
BLAKE2b-256 7efe8d3bf0f161e801572e5e04683e348a892da2aac85d9e22d5aca043906b41

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page