Pre-deployment domain difficulty diagnostic for RAG. Know if your benchmark transfers — before you deploy.

These details have not been verified by PyPI

Project links

Project description

ragprobe

Pre-deployment domain difficulty diagnostic for RAG. Know if your benchmark transfers — before you deploy.

The problem

RAG pipelines get benchmarked on one domain and deployed on another. A system that scores 95% recall on financial documents scores 28% on legal text — same architecture, same embedding model, same settings. The failure is invisible until production users report wrong answers.

There is no standard way to predict, before deploying, whether your benchmark results will hold on your actual domain.

The fix

Measure the domain, not the pipeline. Vocabulary specificity — how uniquely query terms identify target passages — predicts retrieval difficulty in seconds, without running a single embedding. A corpus where query terms appear in 5 passages is trivially retrievable. A corpus where query terms appear in 500 passages will defeat any embedding model.

ragprobe quantifies this gap before you deploy, not after.

pip install ragprobe

Quick start

ragprobe score --corpus ./docs --queries queries.json

Domain Difficulty Report
========================

Overall specificity:  0.177  (HARD)
Reference match:      closest to GDPR regulatory text
                      Expect NeedleCoverage@5 in 15-30% range.

                      WARNING: If your benchmarks used HotpotQA (0.95)
                      or FinanceBench (0.98), results will NOT transfer.

Per-query breakdown:
  EASY  (3 queries)   specificity > 0.7
  HARD  (17 queries)  specificity < 0.3

Top ambiguous terms (appear in 100+ passages):
  "data" (838), "processing" (412), "controller" (389),
  "subject" (301), "personal" (287)

No embeddings. No vector store. No API keys. Runs in seconds.

Python API

from ragprobe import DomainProbe

probe = DomainProbe(
    corpus=["path/to/docs/"],
    queries=["What are the controller's obligations?", "What fines apply?"],
)
report = probe.score()

print(report.specificity)          # 0.177
print(report.difficulty)           # "hard"
print(report.closest_reference)    # "GDPR regulatory text"
print(report.hardest_queries[:3])  # per-query breakdown
print(report.ambiguous_terms[:5])  # terms causing collisions

What the scores mean

Specificity	Difficulty	What to expect
> 0.8	Easy	Your benchmark probably transfers. Any decent embedder will work.
0.3 – 0.8	Medium	Benchmark may partially transfer. Build 10-20 domain-specific test queries before deploying.
< 0.3	Hard	Your benchmark is lying to you. Build domain-specific needle annotations. Don't deploy without domain-specific evaluation.

Built-in reference profiles

ragprobe ships with measured profiles from real corpora so you can see where your domain sits:

Domain	Specificity	Difficulty	Source
CaseHOLD (legal holdings)	0.985	Easy	Case names, statute numbers act as unique identifiers
HotpotQA (Wikipedia)	0.946	Easy	Named entities, dates, specific facts
Financial (SEC filings)	~0.95	Easy	Company names, dates, figures
Technical docs (product-specific)	0.70–0.90	Medium	Product terms provide moderate specificity
Medical (clinical)	0.40–0.70	Medium	Clinical terminology helps but overlaps
GDPR (regulatory)	0.177	Hard	Generic vocabulary: "data," "processing," "controller" everywhere
RFC (technical standards)	0.024	Very hard	"client," "server," "request," "response" in every passage

ragprobe score --corpus ./my-docs --queries my-queries.json --compare-references

When ragprobe is useful

Before choosing a benchmark. "Should I trust my HotpotQA results for this legal corpus?" Run ragprobe. 5 seconds. Answer: no.
Before deploying to a new domain. You built a RAG system on product docs (medium difficulty). Now the team wants to add compliance policies (hard). How much will retrieval degrade? Measure it.
In CI/CD. Gate deployment if domain difficulty exceeds a threshold without domain-specific evaluation queries.
For mixed corpora. Real knowledge bases aren't single-domain. ragprobe tells you which parts of your corpus are easy and which will fail silently.

ragprobe score --corpus ./docs --queries queries.json --ci --max-difficulty hard

When ragprobe is NOT useful

Single well-known domain. If you know you're deploying on GDPR and you've already built domain-specific evaluation, you don't need ragprobe to tell you it's hard.
Predicting exact recall numbers. ragprobe predicts a difficulty tier, not a precise metric. It tells you "this is hard" not "you'll get 23.7% recall."
Comparing retrieval architectures. ragprobe measures domain difficulty, not pipeline quality. Use ragtune for retrieval benchmarking.

How it works

Tokenize each query into non-stopword terms
Build an inverted index of the corpus (which terms appear in which passages)
Compute specificity per query: fraction of terms appearing in fewer than 5 passages
Compute IDF statistics: average and max inverse document frequency per query
Identify ambiguous terms: terms with highest document frequency
Compare against built-in reference profiles
Report difficulty tier, per-query breakdown, and actionable recommendations

The core insight: if the text is lexically ambiguous (many passages share the same vocabulary), no retrieval method — keyword, dense, or hybrid — will have an easy time. Embeddings compress text into vectors; they don't invent semantic distinctions that aren't in the text. ragprobe measures a difficulty floor that applies regardless of architecture.

CLI reference

# Score a corpus against queries
ragprobe score --corpus ./docs --queries queries.json

# JSON output for CI/CD
ragprobe score --corpus ./docs --queries queries.json --format json

# Compare against built-in reference profiles
ragprobe score --corpus ./docs --queries queries.json --compare-references

# CI mode: exit 1 if difficulty exceeds threshold without domain-specific eval
ragprobe score --corpus ./docs --queries queries.json --ci --max-difficulty hard

# Score pre-chunked text (one file per chunk)
ragprobe score --corpus ./chunks/ --queries queries.json --pre-chunked

# Read queries from a plain text file (one per line)
ragprobe score --corpus ./docs --queries questions.txt

Part of the RAG measurement ecosystem

ragprobe is one of three independent tools for RAG retrieval quality:

Tool	Layer	Question it answers
chunkweaver	Ingestion	"Are my chunks structurally coherent?"
ragtune	Evaluation	"How does my retrieval actually perform?"
ragprobe	Pre-deployment	"Will my benchmark results transfer to this domain?"

They compose through standard formats (text files, JSON), not shared dependencies:

chunkweaver legal_doc.txt --preset legal-eu --format jsonl > chunks.jsonl
ragprobe score --corpus ./chunks/ --queries queries.json --pre-chunked
ragtune ingest ./chunks/ --collection test --pre-chunked
ragtune simulate --collection test --queries queries.json

Research

Vocabulary specificity is a pre-retrieval difficulty metric rooted in Query Performance Prediction (QPP), an established area of information retrieval research. The core insight — that retrieval difficulty is predictable from corpus statistics before any embedding is computed — has been validated across TREC benchmarks since the early 2000s.

Key references:

Hauff, Hiemstra & de Jong, "A Survey of Pre-Retrieval Query Performance Predictors" (CIKM 2008) — foundational survey of pre-retrieval QPP methods
Thakur et al., "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models" (NeurIPS 2021) — demonstrates the benchmark transfer problem across 18 domains
Alexapolsky, "Your Benchmark Doesn't Generalize" — cross-domain RAG experiments showing vocabulary specificity predicts NeedleCoverage collapse from 95% to 28%

ragprobe is, to our knowledge, the first pip-installable tool that makes pre-retrieval domain difficulty metrics accessible to RAG practitioners. The QPP research community produced 20 years of validated metrics; ragprobe packages the most actionable ones for modern retrieval pipelines.

Architecture

ragprobe/
├── __init__.py      # Public API: DomainProbe
├── scorer.py        # Core: inverted index, specificity, IDF, difficulty tiers
├── models.py        # DomainReport, QueryDifficulty dataclasses
├── profiles.py      # Built-in reference profiles (GDPR, RFC, HotpotQA, etc.)
├── loaders.py       # Corpus and query loaders (files, JSON, directories)
└── cli.py           # CLI entry point

Design principles:

Zero dependencies for core — stdlib only, no heavy ML frameworks
CLI requires only click (pip install ragprobe[cli])
All scores are deterministic and reproducible
JSON output for CI/CD integration

Limitations

Lexical only. ragprobe measures word-level specificity, not semantic similarity. Two passages with identical vocabulary but different meaning (e.g., "shall erase" vs "may erase") will appear equally specific. This means ragprobe predicts a difficulty floor — actual retrieval may perform slightly better with strong embedding models.
Correlation, not causation. Vocabulary specificity correlates with retrieval difficulty (validated on GDPR, RFC, HotpotQA, CaseHOLD) but is one of several factors. Answer dispersion (how many passages contain the answer) and semantic role diversity also matter.
English-centric stopword list. The default stopword list is English. For other languages, pass a custom stopword set.

License

MIT

Author

Oleksii Alexapolsky — Senior Python & Applied AI Engineer. Building measurement tools for RAG retrieval: ragtune, chunkweaver, ragprobe. Writing about what actually works at medium.com/@TheWake.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.0

Mar 24, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ragprobe-0.1.0.tar.gz (15.6 kB view details)

Uploaded Mar 24, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

ragprobe-0.1.0-py3-none-any.whl (14.5 kB view details)

Uploaded Mar 24, 2026 Python 3

File details

Details for the file ragprobe-0.1.0.tar.gz.

File metadata

Download URL: ragprobe-0.1.0.tar.gz
Upload date: Mar 24, 2026
Size: 15.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.2

File hashes

Hashes for ragprobe-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`5329aec79d03080b8ecb572a6711e3f43030d37148768b805d47c1d6f6483f24`
MD5	`8fb70dbd77ed1f70ce5aa5e05a11140f`
BLAKE2b-256	`237bea28ad553582f2e42cd78293d5b48a441d1f89a855309cad6123f8166d6c`

See more details on using hashes here.

File details

Details for the file ragprobe-0.1.0-py3-none-any.whl.

File metadata

Download URL: ragprobe-0.1.0-py3-none-any.whl
Upload date: Mar 24, 2026
Size: 14.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.2

File hashes

Hashes for ragprobe-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`60739088567030548d088908307b24dd85f1e9470a35fe018655fdcad4f0afac`
MD5	`1733d7697b95d6c042b5cf59f6aef601`
BLAKE2b-256	`7f81ed4bccc0c6fae730189cd9b434af2378742d75d7bdcc863efd02c16d92aa`

See more details on using hashes here.

ragprobe 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

ragprobe

The problem

The fix

Quick start

Python API

What the scores mean

Built-in reference profiles

When ragprobe is useful

When ragprobe is NOT useful

How it works

CLI reference

Part of the RAG measurement ecosystem

Research

Architecture

Limitations

License

Author

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes