DocIngestQA

Pre-indexing QA auditor for RAG document ingestion pipelines.

DocIngestQA answers a question every RAG team must ask before indexing: are these chunks actually good enough to retrieve from? It runs 11 deterministic checks on your exported chunks — missing pages, OCR noise, duplicates, encoding corruption, poor split boundaries, and more — and produces a structured JSON/Markdown/HTML report with issue-level evidence.

How it works

flowchart TD
    IN["📄 chunks.jsonl\nExported segment-level chunks\nchunk_id · source · page · text"] --> AUD

    subgraph AUD["DocIngestQA Engine  src/docingestqa/"]
        V1["v0.1 Checks\ninput_summary · metadata_completeness\ndocument_coverage · page_coverage\nchunk_length · ocr_noise\nduplicate_chunks · source_distribution"]
        V2["v0.2 Checks\nchunk_overlap · encoding_health\nsplit_quality"]
    end

    MAN["📋 source_manifest.json\nExpected sources + page counts\n(optional but recommended)"] --> AUD

    AUD --> SUM["Executive Summary\nOverall FAIL / WARN / PASS\nCheck-level status counts"]
    SUM --> ISS["Issue List\nChunk-level evidence · severity HIGH/MEDIUM\nSource · page · text preview"]
    ISS --> OUT

    subgraph OUT["Structured Outputs"]
        JSON["result.to_json()\nMachine-readable · LLM-narratable"]
        MD["result.to_markdown()\nCheck-by-check tables"]
        HTML["result.to_html()\nSelf-contained report"]
    end

    OUT --> NOTE["⚠️ Interpretation Note — mandatory, cannot be suppressed\nIngestion quality signals only · no retrieval relevance · no answer faithfulness\nUse as pre-indexing review, not as a RAG correctness guarantee"]

Why DocIngestQA exists

RAG pipelines fail silently at ingestion. Missing pages, OCR garble, duplicate chunks, and mojibake encoding errors all make it into the vector database undetected — then show up as hallucinations or missed retrievals in production. By the time you trace the error back to a bad chunk, you have already shipped the issue.

DocIngestQA moves that quality gate to before indexing, where it is cheap to fix.


Install

pip install docingestqa

For local development:

git clone https://github.com/sidharthkriplani/docingestqa
cd docingestqa
pip install -e ".[dev]"
python examples/generate_demo_data.py     # creates examples/assets/
python examples/audit_demo.py             # writes outputs/ingestion_audit.*

Quick start

from docingestqa import AuditConfig, IngestionAuditor

auditor = IngestionAuditor(
    chunks_path="chunks.jsonl",
    documents_path="source_manifest.json",   # optional
    config=AuditConfig(),
)
report = auditor.run()
report.to_json("outputs/ingestion_audit.json")
report.to_markdown("outputs/ingestion_audit.md")
report.to_html("outputs/ingestion_audit.html")

summary = report.to_dict()["executive_summary"]
print(summary["overall_status"])   # FAIL / WARN / PASS

CLI

python -m docingestqa chunks.jsonl --manifest source_manifest.json --out outputs/
# exits with code 1 if overall status is FAIL

Chunk format

DocIngestQA reads JSONL files where each line is one chunk:

{"chunk_id": "abc123", "source": "annual_report.pdf", "page": 4, "text": "Revenue grew 34%..."}

All fields except text are optional but recommended. Chunks missing source or page are flagged by metadata_completeness. If you provide a source manifest, document_coverage and page_coverage checks also activate.
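The JSONL contract above is easy to validate before handing chunks to the auditor. The sketch below is illustrative, not part of the package: `load_chunks` is a hypothetical helper that parses records and flags the ones metadata_completeness would complain about.

```python
import json

def load_chunks(lines):
    """Parse JSONL chunk records; flag indices missing source or page."""
    chunks, flagged = [], []
    for i, line in enumerate(lines):
        rec = json.loads(line)
        if "text" not in rec:
            raise ValueError(f"line {i}: chunk has no 'text' field")
        if any(k not in rec for k in ("source", "page")):
            flagged.append(i)  # metadata_completeness would count these
        chunks.append(rec)
    return chunks, flagged

lines = [
    '{"chunk_id": "a", "source": "r.pdf", "page": 1, "text": "Revenue grew."}',
    '{"chunk_id": "b", "text": "Orphan chunk with no source or page."}',
]
chunks, flagged = load_chunks(lines)
print(len(chunks), flagged)  # 2 [1]
```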

Source manifest format

[
  {"source": "annual_report.pdf", "pages": 12},
  {"source": "onboarding_guide.pdf", "pages": 8}
]
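Conceptually, page_coverage diffs the page numbers present in the chunk set against each manifest entry's expected count. A minimal stdlib sketch of that comparison, assuming pages are numbered from 1 (`missing_pages` is an illustrative helper, not the package's API):

```python
def missing_pages(manifest, chunks):
    """Return {source: [missing page numbers]} for each manifest entry."""
    seen = {}
    for c in chunks:
        if c.get("source") is not None and c.get("page") is not None:
            seen.setdefault(c["source"], set()).add(c["page"])
    gaps = {}
    for doc in manifest:
        expected = set(range(1, doc["pages"] + 1))
        absent = sorted(expected - seen.get(doc["source"], set()))
        if absent:
            gaps[doc["source"]] = absent
    return gaps

manifest = [{"source": "onboarding_guide.pdf", "pages": 8}]
chunks = [{"source": "onboarding_guide.pdf", "page": p, "text": "..."}
          for p in (1, 2, 3, 6, 7, 8)]
print(missing_pages(manifest, chunks))  # {'onboarding_guide.pdf': [4, 5]}
```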

Real output

The following is actual output from examples/audit_demo.py on 64 chunks across 10 synthetic documents, with deliberately seeded defects.

Executive summary:

Overall status : FAIL
Chunks audited : 64
Sources        : 10
Check counts   : PASS=4  WARN=6  FAIL=1

Check results:

| Check | Status | Summary |
|-------|--------|---------|
| input_summary | PASS | 64 chunks across 10 sources |
| metadata_completeness | WARN | 2 chunks missing source or page metadata |
| document_coverage | PASS | 0 missing documents, 0 orphan sources |
| page_coverage | FAIL | 10 missing pages detected across 2 documents |
| chunk_length | WARN | 3 short chunks (possible headers/fragments) |
| ocr_noise | WARN | 2 chunks show OCR/extraction noise |
| duplicate_chunks | WARN | 1 exact duplicate chunk pair |
| source_distribution | PASS | 10 sources, largest at 14.1% |
| chunk_overlap | PASS | 1 high-overlap consecutive pair (below threshold) |
| encoding_health | WARN | 2 chunks contain mojibake sequences |
| split_quality | WARN | 4 chunks with poor split boundaries |

Interpretation note (included in every output; cannot be suppressed):

DocIngestQA reports deterministic ingestion quality signals for already-generated chunks. It does not parse documents, evaluate retrieval relevance, verify answer faithfulness, or prove that a RAG system is correct. Use these outputs as pre-indexing review signals before loading chunks into a vector database.


The 11 checks

v0.1 checks

| Check | What it detects | Status triggers |
|-------|-----------------|-----------------|
| input_summary | Empty chunk sets | FAIL if no chunks |
| metadata_completeness | Missing source or page | FAIL if >20%, WARN if >5% |
| document_coverage | Documents in manifest with no chunks; orphan sources | FAIL if missing docs |
| page_coverage | Pages expected by manifest but absent in chunk set | FAIL if any missing |
| chunk_length | Empty, very short (<80 chars), very long (>1500 chars) chunks | FAIL if >10% empty |
| ocr_noise | Replacement chars (U+FFFD), repeated junk runs, non-printable ratio | FAIL if >20% noisy |
| duplicate_chunks | Exact duplicates (SHA-1 hash) and near-duplicates (Jaccard on 5-grams) | FAIL if ≥30 pairs |
| source_distribution | One source dominating >80% of all chunks | WARN |
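The duplicate_chunks heuristic pairs an exact SHA-1 hash match with Jaccard similarity over word 5-grams. A simplified sketch of that combination (the package's actual tokenization and normalization are not documented here, so lowercased whitespace splitting is an assumption):

```python
import hashlib

def word_ngrams(text, n=5):
    """Set of word n-grams after lowercasing and whitespace splitting."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def find_duplicates(texts, threshold=0.85):
    """Return (exact-duplicate index pairs, near-duplicate index pairs)."""
    hashes = [hashlib.sha1(t.encode("utf-8")).hexdigest() for t in texts]
    grams = [word_ngrams(t) for t in texts]
    exact, near = [], []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if hashes[i] == hashes[j]:
                exact.append((i, j))
            elif jaccard(grams[i], grams[j]) >= threshold:
                near.append((i, j))
    return exact, near

texts = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",
    "an entirely different chunk about encoding health checks",
]
print(find_duplicates(texts))  # ([(0, 1)], [])
```

With the default thresholds, 5 such pairs would trip WARN and 30 would trip FAIL.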

v0.2 checks

| Check | What it detects | Status triggers |
|-------|-----------------|-----------------|
| chunk_overlap | Consecutive chunks from the same source with Jaccard ≥ 0.40 on 4-grams (sliding-window splitter artifacts) | FAIL if ≥15 high-overlap pairs, WARN if ≥3 flagged |
| encoding_health | Null bytes, BOM markers, control characters, mojibake patterns (e.g. Ã©, Â©) | FAIL if null bytes present or mojibake rate ≥20% |
| split_quality | Mid-sentence starts, mid-sentence ends, navigation fragments (bare page numbers, TOC entries) | FAIL if ≥30% flagged |
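The chunk_overlap check can be sketched as a single pass over consecutive same-source pairs, scoring each with 4-gram Jaccard. This is an illustrative reimplementation under the assumption that the 4-grams are word n-grams; `flag_overlaps` is not the package's actual function:

```python
def ngrams(text, n=4):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_overlaps(chunks, warn=0.40, fail=0.70):
    """Flag consecutive same-source pairs whose 4-gram Jaccard >= warn."""
    flagged = []
    for a, b in zip(chunks, chunks[1:]):
        if a.get("source") != b.get("source"):
            continue
        ga, gb = ngrams(a["text"]), ngrams(b["text"])
        union = ga | gb
        score = len(ga & gb) / len(union) if union else 0.0
        if score >= warn:
            severity = "HIGH" if score >= fail else "MEDIUM"
            flagged.append((a["chunk_id"], b["chunk_id"], round(score, 2), severity))
    return flagged

# A sliding-window splitter leaves heavy word overlap between neighbors:
chunks = [
    {"chunk_id": "c1", "source": "doc.pdf",
     "text": "chunking with a sliding window repeats trailing words at boundaries"},
    {"chunk_id": "c2", "source": "doc.pdf",
     "text": "a sliding window repeats trailing words at boundaries of the next"},
]
print(flag_overlaps(chunks))  # [('c1', 'c2', 0.5, 'MEDIUM')]
```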

API reference

IngestionAuditor(chunks_path, documents_path, config)

| Parameter | Type | Description |
|-----------|------|-------------|
| chunks_path | str \| Path | Path to JSONL chunk file |
| documents_path | str \| Path \| None | Optional path to JSON source manifest |
| config | AuditConfig | Threshold configuration (all fields have defaults) |

auditor.run() → IngestionAuditReport

Returns a report with:

| Method | Returns |
|--------|---------|
| .to_json(path=None) | JSON string; writes file if path given |
| .to_markdown(path=None) | Markdown string; writes file if path given |
| .to_html(path=None) | Self-contained HTML report |
| .to_dict() | Full payload dict matching the JSON schema |

Key AuditConfig thresholds

AuditConfig(
    min_chunk_chars=80,
    max_chunk_chars=1500,
    noisy_text_ratio_threshold=0.05,
    replacement_char_threshold=3,
    ngram_size=5,
    near_duplicate_jaccard_threshold=0.85,
    warn_duplicate_pair_count=5,
    fail_duplicate_pair_count=30,
    # v0.2
    overlap_ngram_size=4,
    overlap_jaccard_warn_threshold=0.40,
    overlap_jaccard_fail_threshold=0.70,
    warn_overlap_pair_count=3,
    fail_overlap_pair_count=15,
    null_byte_fail_threshold=1,
    mojibake_warn_rate=0.05,
    mojibake_fail_rate=0.20,
    warn_bad_split_rate=0.10,
    fail_bad_split_rate=0.30,
)
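Each warn/fail pair above maps a measured defect rate onto a check status. The mapping presumably reduces to a two-threshold comparison like the following (an illustrative sketch, not the package's code):

```python
def status_for_rate(rate, warn_threshold, fail_threshold):
    """Map a measured defect rate to a check status via a warn/fail pair."""
    if rate >= fail_threshold:
        return "FAIL"
    if rate >= warn_threshold:
        return "WARN"
    return "PASS"

# e.g. with mojibake_warn_rate=0.05 and mojibake_fail_rate=0.20:
print(status_for_rate(0.08, 0.05, 0.20))  # WARN
print(status_for_rate(0.25, 0.05, 0.20))  # FAIL
```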

Output schema (v0.2)

{
  "schema_version": "0.2",
  "metadata": { "docingestqa_version": "0.2.0", "generated_at": "...", "chunk_count": 64 },
  "executive_summary": {
    "overall_status": "FAIL",
    "chunk_count": 64,
    "check_counts": { "PASS": 4, "WARN": 6, "FAIL": 1 }
  },
  "checks": [
    {
      "check": "page_coverage",
      "status": "FAIL",
      "summary": "10 missing pages and 0 out-of-range pages detected.",
      "metrics": { "missing_page_total": 10, "extra_page_total": 0 },
      "issues": [
        {
          "check": "page_coverage",
          "severity": "HIGH",
          "message": "Document has pages missing from the chunk set.",
          "source": "onboarding_guide.pdf",
          "evidence": { "missing_pages": [4, 5], "expected_pages": 8 }
        }
      ],
      "recommendation": "Inspect parser logs for missing pages before indexing."
    }
  ],
  "interpretation_note": "DocIngestQA reports deterministic ingestion quality signals..."
}

What DocIngestQA is not

Not a document parser. DocIngestQA audits chunks you already generated. It does not extract text from PDFs or other documents.

Not a retrieval evaluator. It does not measure whether your chunks are semantically relevant to queries. For that, use a retrieval eval framework.

Not a RAG correctness checker. It does not verify whether answers generated from these chunks are faithful or accurate.

Not a statistical significance tester. Checks are deterministic heuristics, not hypothesis tests. Issue counts are investigation priorities, not p-values.

Not an embedding quality checker. It works on raw text, not embeddings. Embedding quality (cluster separation, isotropy) is a separate concern.


Comparison

| Capability | DocIngestQA | Manual inspection | Generic data quality tools |
|------------|-------------|-------------------|----------------------------|
| Missing page detection | Yes | No | No |
| OCR noise detection | Yes | Slow | No |
| Duplicate/near-dup chunks | Yes | No | Partial |
| Mojibake / encoding errors | Yes | Slow | No |
| Sliding-window overlap | Yes | No | No |
| Split boundary quality | Yes | Slow | No |
| Structured JSON output | Yes | No | Varies |
| Zero non-stdlib dependencies | Yes | — | No |

Roadmap

| Version | Scope |
|---------|-------|
| v0.1.0 | 8 checks, JSON/MD/HTML output |
| v0.2.0 | 3 new checks (overlap, encoding, split quality), CLI, Python 3.10 compat |
| v0.3 | Configurable severity overrides, per-source report sections |
| v1.0 | Semantic coherence check (embedding cosine within chunk), LLM-narrated summary option |

Contributing

See CONTRIBUTING.md. Issues and PRs welcome.


License

MIT © Sidharth Kriplani
