ragdrift

Silent regression detector for RAG pipelines

These details have not been verified by PyPI

Project links

Project description

RAG systems don't crash, they rot. A changed PDF template, a parser upgrade, a stale index rebuild. Retrieval accuracy drops from 80% to 25% and nobody notices for weeks. ragdrift catches this in hours.

Metric	Without ragdrift	With ragdrift
Detection lag	2 to 6 weeks (user complaints)	Under 48 hours
Root cause visibility	None	Chunk delta %, heading shifts, token drift, table misalignment
Retrieval accuracy measurement	Manual spot-check	Automated BM25 golden query probing
API keys required		Zero (core mode)

See it in action

ragdrift drift scan

Run with --explain for LLM-powered root-cause diagnosis and prevention tips:

ragdrift LLM explanation

Quickstart
Use with your own data
What ragdrift catches
Example output
Example --explain diagnosis
CLI reference
Golden queries
Severity rules
Where ragdrift fits
Architecture
Fingerprinting: file vs. extractor drift
Optional extras
Research
When ragdrift is NOT the right tool
Tech stack

Quickstart

pip install ragdrift
ragdrift demo --inject-drift

Two commands. Zero config. Zero API keys. Zero Docker.

What the demo does

Copies 20 sample documents (finance, healthcare, legal, tech) to a temp directory
Takes a reference snapshot: chunk counts, heading structure, token stats
Injects 5 types of drift into the corpus:
- Heading hierarchy collapse (H2 to H4)
- Missing paragraphs (content deleted)
- Hidden unicode characters inserted
- Markdown table columns misaligned
- Chunk explosion (document reformatted, 3x more chunks)
Rescans and diffs against the snapshot
Reports severity, retrieval accuracy drop, and recommended actions per document

Use with your own data

# Step 1: snapshot your corpus
ragdrift init --corpus ./docs --golden ./golden_queries.json

# Step 2: your pipeline runs, documents change, parser gets upgraded

# Step 3: scan before re-ingesting
ragdrift scan --corpus ./docs

Point it at the folder you ingest from. Run it before every re-ingestion.

ragdrift works entirely at the document layer. It does not connect to your vector store or your production BM25 index. It builds its own temporary index from your source files and measures whether the documents that should answer your golden queries still answer them as well as they did at snapshot time.

What ragdrift catches

In scope:

File content changes (corruption, accidental edits, encoding issues)
Parser or extractor upgrades that silently alter chunk boundaries
Hidden unicode characters: zero-width spaces, homoglyphs, non-breaking spaces
Markdown table column misalignment breaking structured data
Chunk explosions or collapses from reformatted documents
Retrieval degradation on golden queries, even when the right document still ranks in top-k

Out of scope:

Embedding model drift inside your vector DB
Query-time retrieval degradation in production
Anything that happens after ingestion: reranking, generation quality, hallucination

If you need inference-time observability, see When ragdrift is NOT the right tool.

Example output

This is what 3 weeks of undetected degradation looks like:

╭──────────────────────────────────────────────────────────────────╮
│                       DRIFT SCAN REPORT                          │
╰──────────────────────────────────────────────────────────────────╯
  Scan ID:    6d82d75a
  Timestamp:  2026-03-28T23:11:43+00:00
  Sampled:    20 documents
  Drifted:    6 documents
  Severity:   🔴 CRITICAL

  Retrieval Accuracy:
    Reference: 100.0%
    Current:   87.0%
    Delta:     -13.0pp ⚠️

  ────────────────────────────────────────────────────────────────
  DRIFT EVENTS
  ────────────────────────────────────────────────────────────────

  🟠 08_drug_interactions.md — HIGH [RETRIEVAL IMPACT]
     Chunks: 12 → 12 (+0.0%)
     Headings: removed: ## Pharmacokinetic Interactions
               added:   ## Pharmacokinetic Interactions  ← zero-width char
     Anomalies: zero_width_chars: 28 found, non_breaking_spaces: 8 found
     Action: re-ingest

  🔴 16_cloud_architecture.md — CRITICAL [RETRIEVAL IMPACT]
     Chunks: 11 → 18 (+63.6%)
     Headings: added: ## Resilience Patterns: Operational Best Practices
     Anomalies: token_shift=0.4264
     Action: re-ingest

  🟠 11_contract_law.md — HIGH [RETRIEVAL IMPACT]
     Chunks: 11 → 8 (-27.3%)
     Headings: removed: ## Defenses to Enforcement, ### Consideration
     Anomalies: token_shift=0.3149
     Action: re-ingest

  🟠 03_regulatory_compliance.md — HIGH [RETRIEVAL IMPACT]
     Chunks: 12 → 12 (+0.0%)
     Headings: removed: ## Anti-Money Laundering (AML) Requirements
               removed: ## Know Your Customer (KYC) Framework
     Anomalies: misaligned_columns: 1 table(s) have inconsistent pipe counts
     Action: re-ingest

  🟡 18_data_pipelines.md — MEDIUM
     Chunks: 12 → 12 (+0.0%)
     Anomalies: misaligned_columns: 4 table(s) have inconsistent pipe counts
                table_rows: 16 → 15
     Action: monitor

  🟡 15_dispute_resolution.md — MEDIUM
     Chunks: 13 → 13 (+0.0%)
     Anomalies: misaligned_columns: 2 table(s) have inconsistent pipe counts
     Action: monitor

Example `--explain` diagnosis

ragdrift demo --inject-drift --explain --provider anthropic --format pretty

  ────────────────────────────────────────────────────────────
  DRIFT SCAN REPORT
  ────────────────────────────────────────────────────────────
  scan      1c35ce7e
  time      2026-03-29T04:01:14+00:00
  sampled   20 documents
  drifted   5 documents
  severity  🔴 CRITICAL

  retrieval accuracy
    reference  100.0%
    current    88.1%
    delta      -11.9pp ⚠️

  ────────────────────────────────────────────────────────
  DRIFT EVENTS
  ────────────────────────────────────────────────────────

  🟠 08_drug_interactions.md — HIGH [RETRIEVAL IMPACT]
     chunks     12 → 12  (+0.0%)
     headings   removed: Pharmacokinetic Interactions
                removed: Pharmacodynamic Interactions
     anomalies  zero-width chars: 28 found,  non-breaking spaces: 8 found
                directional markers: 3 found
     action     re-ingest

  🔴 16_cloud_architecture.md — CRITICAL [RETRIEVAL IMPACT]
     chunks     11 → 18  (+63.6%)
     headings   added: Resilience Patterns: Advanced Considerations
                added: Resilience Patterns: Operational Best Practices
     anomalies  token shift=0.4264
     action     re-ingest

  🟡 18_data_pipelines.md — MEDIUM
     chunks     12 → 12  (+0.0%)
     anomalies  misaligned columns: 2 table(s) have inconsistent pipe counts
                table rows: 16 → 15
     action     monitor

  🟠 11_contract_law.md — HIGH [RETRIEVAL IMPACT]
     chunks     11 → 8  (-27.3%)
     headings   removed: Defenses to Enforcement
                removed: Consideration
     anomalies  token shift=0.3149
     action     re-ingest

  🟠 03_regulatory_compliance.md — HIGH [RETRIEVAL IMPACT]
     chunks     12 → 12  (+0.0%)
     headings   removed: Know Your Customer (KYC) Framework
                removed: Compliance Program Elements
     action     re-ingest

  ┌─────────────────────────────────────────────────────────┐
  │  LLM DIAGNOSIS  ·  Claude Haiku                         │
  └─────────────────────────────────────────────────────────┘

  overall   Document corpus experienced structural degradation across 5
            files through content removal, heading hierarchy collapse,
            chunk explosion, and unicode injection, causing retrieval
            accuracy to drop 11.9 percentage points.
  urgency   Critical: two documents (16_cloud_architecture.md,
            11_contract_law.md) directly broke golden queries by removing
            or fragmenting query-critical content sections, and heading
            restructuring in 03_regulatory_compliance.md eliminates BM25
            section-level ranking signals.
  risk      🔴 CRITICAL

  ────────────────────────────────────────────────────────
  Per-Document Analysis  (document 1 of 5)
  ────────────────────────────────────────────────────────

  🔴 16_cloud_architecture.md  CRITICAL  [high confidence]

  cause
    Chunk explosion: 11 → 18 chunks (+63.64%), token_shift=0.4264
    indicating new content insertion or reformatting; two new heading
    sections added ('Resilience Patterns: Advanced Considerations',
    'Resilience Patterns: Operational Best Practices').

  impact
    Query 'How do microservices communicate synchronously and
    asynchronously in cloud-native architecture?' dropped to 77%
    accuracy. Chunk fragmentation splits synchronous/asynchronous
    communication concepts across 7 additional fragments, breaking
    contextual window coherence for retrieval rankers. New sections
    dilute relevance density and push original communication patterns
    content further down chunk ordering.

  fix
    Revert to reference snapshot hash=8ef1262c... or re-ingest with
    original chunking strategy (size=512, overlap=50). If new content
    is intentional, re-chunk entire document to consolidate
    communication patterns under single coherent section with semantic
    boundaries aligned to query intent.

  effort   10 minutes

  ────────────────────────────────────────────────────────
  Per-Document Analysis  (document 2 of 5)
  ────────────────────────────────────────────────────────

  🔴 11_contract_law.md  CRITICAL  [high confidence]

  cause
    Content removal: chunk count dropped 11 → 8 (-27.27%); critical
    headings deleted: '## Defenses to Enforcement' and '###
    Consideration', token_shift=0.3149.

  impact
    Two golden queries directly failed: 'What defenses such as duress,
    unconscionability and statute of frauds can render a contract
    unenforceable?' (46% accuracy) and 'What is the role of
    consideration and why is past consideration generally insufficient
    in contract law?' (59% accuracy). Removal of 'Defenses to
    Enforcement' eliminates vocabulary match for 'duress',
    'unconscionability', 'statute of frauds'. BM25 signals completely
    absent for both query targets.

  fix
    Restore from reference snapshot hash=4a8c1860... immediately. If
    intentional deletion, re-add both sections with original structure
    before re-ingestion. Validate chunk count matches reference (11).

  effort   5 minutes

3 more documents analysed (docs 3–5)

  ────────────────────────────────────────────────────────
  Per-Document Analysis  (document 3 of 5)
  ────────────────────────────────────────────────────────

  🟠 03_regulatory_compliance.md  HIGH  [high confidence]

  cause
    Heading hierarchy collapse: 5 H2 headings demoted to H4 —
    'Know Your Customer (KYC) Framework', 'Compliance Program
    Elements', 'Capital Adequacy and Prudential Standards',
    'Anti-Money Laundering (AML) Requirements', 'Key Regulatory
    Bodies'. Chunk count unchanged (12 → 12); structural ranking
    signals corrupted.

  impact
    Demotion from H2 to H4 removes top-level section signals used
    by BM25 and semantic rankers for queries mentioning 'KYC',
    'AML', 'compliance program', 'capital adequacy'. These terms
    are now buried under deeper heading hierarchy, reducing
    section-level re-ranking boost.

  fix
    Revert headings to H2 (##) level or re-ingest with original
    structure from hash=4f1522c9....

  effort   3 minutes

  ────────────────────────────────────────────────────────
  Per-Document Analysis  (document 4 of 5)
  ────────────────────────────────────────────────────────

  🟠 08_drug_interactions.md  HIGH  [high confidence]

  cause
    Unicode injection: 28 zero-width characters (U+200B), 8
    non-breaking spaces (U+00A0), 3 directional markers
    (U+202A-U+202E) inserted into heading text. Chunk count
    unchanged (12 → 12), semantic_drift_score=0.0.

  impact
    Zero-width character in heading 'Pharmacokinetic Interactions'
    corrupts exact-match tokenization. BM25 indexing will not
    match clean query strings against the zero-width variant,
    causing vocabulary fragmentation and lowering retrieval scores
    for drug interaction queries.

  fix
    Sanitize document: strip all zero-width characters, non-
    breaking spaces, and directional markers. Validate against
    reference hash=a1784989....

  effort   7 minutes

  ────────────────────────────────────────────────────────
  Per-Document Analysis  (document 5 of 5)
  ────────────────────────────────────────────────────────

  🟡 18_data_pipelines.md  MEDIUM  [medium confidence]

  cause
    Table structural corruption: 2 tables with misaligned pipe
    counts, 1 row deletion (16 → 15 rows). Chunk count unchanged
    (12 → 12), semantic_drift_score=0.0.

  impact
    Markdown table parsing failure may cause partial or skipped
    chunk ingestion. Row deletion removes data points but does not
    break golden queries directly. Unaligned pipes create
    tokenization errors during table extraction.

  fix
    Repair pipe alignment in markdown source. Restore deleted row
    or re-ingest from reference hash=d5e686c6....

  effort   8 minutes

  ────────────────────────────────────────────────────────
  Cross-Document Patterns
  ────────────────────────────────────────────────────────

  Independent failure modes: (1) structural content removal
  (11_contract_law.md) directly broke query-critical sections;
  (2) heading hierarchy collapse (03_regulatory_compliance.md)
  and unicode injection (08_drug_interactions.md) degrade ranking
  signals without removing content; (3) chunk explosion
  (16_cloud_architecture.md) fragments context through over-
  splitting; (4) table corruption (18_data_pipelines.md)
  introduces syntax errors. Root causes span file editing,
  encoding issues, content pruning, and document expansion.
  No single systematic cause explains all five: likely manual
  edits or file corruption during sync.

  ────────────────────────────────────────────────────────
  Prevention
  ────────────────────────────────────────────────────────

  Implement pre-ingestion validation pipeline: (1) run markdown
  lint + unicode sanity check (reject documents with zero-width
  chars, non-breaking spaces, directional markers); (2) compute
  heading structure fingerprint and compare against reference
  before chunking; (3) enforce chunk count tolerance bands (warn
  if delta >±15%); (4) add golden query regression test to CI/CD
  and block deployment if score_accuracy drops >5pp for any query.

CLI reference

# Initialize: take a reference snapshot of your corpus
ragdrift init --corpus ./docs --golden ./golden_queries.json

# Scan: diff current corpus against the snapshot
ragdrift scan --corpus ./docs
ragdrift scan --corpus ./docs --format pretty
ragdrift scan --corpus ./docs --sample-rate 0.5

# Scan with LLM root-cause diagnosis
ragdrift scan --corpus ./docs --explain                       # Anthropic Claude Haiku
ragdrift scan --corpus ./docs --explain --provider ollama     # local Ollama, no API key

# Demo: one command, zero setup
ragdrift demo --inject-drift                                  # moderate drift (default)
ragdrift demo --inject-drift --level mild                     # 2 docs, minor changes
ragdrift demo --inject-drift --level catastrophic             # 10 docs, heavy corruption

# History: view past scans
ragdrift report --corpus ./docs --format pretty

Golden queries

ragdrift uses a golden query set to probe retrieval accuracy before and after drift. Create a JSON file:

[
  {
    "query": "What is consideration in contract law?",
    "expected_doc_ids": ["11_contract_law.md"],
    "domain": "legal"
  }
]

Pass it at init time: ragdrift init --corpus ./docs --golden ./golden_queries.json

Why not recall@k? Plain recall@k always returns 1.0 for topic-distinct corpora. If only one document covers contract law, it will always rank first even after content loss. ragdrift measures score-accuracy: how much the expected document's BM25 score drops relative to its reference baseline. This catches content loss even when the right document still appears in top-k results.

Severity rules

Thresholds were calibrated from production experience with a legal document corpus and validated across synthetic finance, healthcare, and tech corpora. All rules apply independently: a document can be CRITICAL on chunk delta alone while showing no heading changes.

Signal	Threshold	Severity
Chunk count delta	> 50%	🔴 critical
Chunk count delta	> 20%	🟠 high
Chunk count delta	> 10%	🟡 medium
Heading level shift	any	🟡 medium
Token distribution shift	> 0.4	🟠 high
Hidden unicode characters	any	🟡 medium
Table column misalignment	any	🟡 medium
Semantic drift (cosine)	> 0.15	🟠 high
Retrieval score drop	> 10%	🔴 critical (with structural drift)

Where ragdrift fits

ragdrift is a pre-ingestion quality gate.

Architecture

The core diff engine is pure Python. No LLM, no network, no API keys.

Structural diff: chunk count delta, heading hierarchy changes, markdown table misalignment
Lexical diff: token distribution shift, hidden unicode characters (zero-width spaces, homoglyphs)
Semantic diff (optional, ragdrift[vector]): embedding centroid cosine distance
Retrieval probe: BM25 score-accuracy on golden queries, catching content loss even when the right document still ranks in top-k

Fingerprinting: file vs. extractor drift

Every snapshot records four fingerprint fields per document:

Field	What it tracks
`file_hash`	MD5 of the source file: did the content change?
`extractor_version`	Which extraction code version was used
`chunker_config`	Chunk size and overlap settings
`parser_type`	`text` / `markdown` / `pdf`

When drift is detected, you immediately know whether the cause was a file content change or an extractor/chunker upgrade. Two very different remediation paths. No other lightweight RAG observability tool makes this distinction.

Optional extras

pip install ragdrift            # core: CLI, BM25 probing, structural + lexical diff
pip install ragdrift[vector]    # + Qdrant + sentence-transformers (semantic diff)
pip install ragdrift[explain]   # + Anthropic SDK (--explain flag)
pip install ragdrift[api]       # + FastAPI server
pip install ragdrift[ui]        # + Streamlit dashboard
pip install ragdrift[agent]     # + LangGraph agentic pipeline
pip install ragdrift[all]       # everything

Research

ragdrift's --explain flag uses LLM-as-a-judge evaluation: a methodology that published research shows outperforms BLEU/BERTScore for assessing text extraction quality at scale.

Can LLM Reasoning Be Trusted? A Comparative Study Crish Nagarkar · arXiv:2601.14479 Key finding: LLM-as-a-judge outperforms BLEU/BERTScore for measuring semantic quality (Kendall's τ = 0.536 vs. 0.022 for BERTScore).

This is why ragdrift's --explain output goes beyond "chunk count changed". It reasons about why the change is likely to hurt retrieval and what to do about it.

When ragdrift is NOT the right tool

You already have full observability (LangSmith, Arize, Galileo) with retrieval metric dashboards: ragdrift adds value as a lightweight pre-ingestion regression check, not a replacement for inference-time tracing.
Your corpus is fewer than ~50 documents: at that scale, drift is visible through manual review. ragdrift pays off at 500+ documents or when multiple people own different parts of the pipeline.
You need real-time alerting: ragdrift is a scan-based tool designed to run on a schedule (e.g. cron after each ingestion job). It does not hook into your inference path or stream events.
Your documents are scanned PDFs or images: ragdrift diffs text extraction output. It cannot detect OCR quality degradation or image rendering failures upstream of the text layer.

Tech stack

Core: Python 3.11+ · SQLite · rank-bm25 · LangGraph Optional: Qdrant · sentence-transformers · FastAPI · Streamlit · Docker

MIT License · Crish Nagarkar

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.0

Mar 30, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ragdrift-0.1.0.tar.gz (58.4 kB view details)

Uploaded Mar 30, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

ragdrift-0.1.0-py3-none-any.whl (57.1 kB view details)

Uploaded Mar 30, 2026 Python 3

File details

Details for the file ragdrift-0.1.0.tar.gz.

File metadata

Download URL: ragdrift-0.1.0.tar.gz
Upload date: Mar 30, 2026
Size: 58.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.3

File hashes

Hashes for ragdrift-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`ea825d53ec7899e51f93b6513d8f272c3d3d2f4c120e4c9ae634b7866a458b24`
MD5	`7ae9fa289b77eaa5e4f167a4eaba12db`
BLAKE2b-256	`d68e7ee30d7e8ed7892481a8b626407a2b327ea6b898908fd8d7cc38be8836dd`

See more details on using hashes here.

File details

Details for the file ragdrift-0.1.0-py3-none-any.whl.

File metadata

Download URL: ragdrift-0.1.0-py3-none-any.whl
Upload date: Mar 30, 2026
Size: 57.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.3

File hashes

Hashes for ragdrift-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`28fc82115f6be821598f483836d2f44c748fb9765967afa113009a4211e43b5f`
MD5	`2bc0ad274986c33b8ddc3c9bc06cf4fe`
BLAKE2b-256	`c017531b06fb59ea7021a42f4ed4371e105f3f627463e0564cb1d5d43346de2b`

See more details on using hashes here.

ragdrift 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

See it in action

Contents

Quickstart

What the demo does

Use with your own data

What ragdrift catches

Example output

Example --explain diagnosis

CLI reference

Golden queries

Severity rules

Where ragdrift fits

Architecture

Fingerprinting: file vs. extractor drift

Optional extras

Research

When ragdrift is NOT the right tool

Tech stack

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

Example `--explain` diagnosis