DocIngestQA

Pre-indexing QA auditor for RAG document ingestion pipelines.

DocIngestQA answers a question every RAG team must ask before indexing: are these chunks actually good enough to retrieve from? It runs 11 deterministic checks on your exported chunks — missing pages, OCR noise, duplicates, encoding corruption, poor split boundaries, and more — and produces a structured JSON/Markdown/HTML report with issue-level evidence.

How it works

flowchart TD
    IN["📄 chunks.jsonl\nExported segment-level chunks\nchunk_id · source · page · text"] --> AUD

    subgraph AUD["DocIngestQA Engine  src/docingestqa/"]
        V1["v0.1 Checks\ninput_summary · metadata_completeness\ndocument_coverage · page_coverage\nchunk_length · ocr_noise\nduplicate_chunks · source_distribution"]
        V2["v0.2 Checks\nchunk_overlap · encoding_health\nsplit_quality"]
    end

    MAN["📋 source_manifest.json\nExpected sources + page counts\n(optional but recommended)"] --> AUD

    AUD --> SUM["Executive Summary\nOverall FAIL / WARN / PASS\nCheck-level status counts"]
    SUM --> ISS["Issue List\nChunk-level evidence · severity HIGH/MEDIUM\nSource · page · text preview"]
    ISS --> OUT

    subgraph OUT["Structured Outputs"]
        JSON["result.to_json()\nMachine-readable · LLM-narratable"]
        MD["result.to_markdown()\nCheck-by-check tables"]
        HTML["result.to_html()\nSelf-contained report"]
    end

    OUT --> NOTE["⚠️ Interpretation Note — mandatory, cannot be suppressed\nIngestion quality signals only · no retrieval relevance · no answer faithfulness\nUse as pre-indexing review, not as a RAG correctness guarantee"]

Why DocIngestQA exists

RAG pipelines fail silently at ingestion. Missing pages, OCR garble, duplicate chunks, and mojibake encoding errors all make it into the vector database undetected — then show up as hallucinations or missed retrievals in production. By the time you trace the error back to a bad chunk, you have already shipped the issue.

DocIngestQA moves that quality gate to before indexing, where it is cheap to fix.


Install

pip install docingestqa

For local development:

git clone https://github.com/sidharthkriplani/docingestqa
cd docingestqa
pip install -e ".[dev]"
python examples/generate_demo_data.py     # creates examples/assets/
python examples/audit_demo.py             # writes outputs/ingestion_audit.*

Quick start

from docingestqa import AuditConfig, IngestionAuditor

auditor = IngestionAuditor(
    chunks_path="chunks.jsonl",
    documents_path="source_manifest.json",   # optional
    config=AuditConfig(),
)
report = auditor.run()
report.to_json("outputs/ingestion_audit.json")
report.to_markdown("outputs/ingestion_audit.md")
report.to_html("outputs/ingestion_audit.html")

summary = report.to_dict()["executive_summary"]
print(summary["overall_status"])   # FAIL / WARN / PASS

CLI

python -m docingestqa chunks.jsonl --manifest source_manifest.json --out outputs/
# exits with code 1 if overall status is FAIL

Chunk format

DocIngestQA reads JSONL files where each line is one chunk:

{"chunk_id": "abc123", "source": "annual_report.pdf", "page": 4, "text": "Revenue grew 34%..."}

All fields except text are optional but recommended. Chunks missing source or page are flagged by metadata_completeness. If you provide a source manifest, document_coverage and page_coverage checks also activate.
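The JSONL contract above is easy to validate before handing chunks to the auditor. The sketch below is illustrative, not part of the package: `load_chunks` is a hypothetical helper that parses records and flags the ones metadata_completeness would complain about.

```python
import json

def load_chunks(lines):
    """Parse JSONL chunk records; flag indices missing source or page."""
    chunks, flagged = [], []
    for i, line in enumerate(lines):
        rec = json.loads(line)
        if "text" not in rec:
            raise ValueError(f"line {i}: chunk has no 'text' field")
        if any(k not in rec for k in ("source", "page")):
            flagged.append(i)  # metadata_completeness would count these
        chunks.append(rec)
    return chunks, flagged

lines = [
    '{"chunk_id": "a", "source": "r.pdf", "page": 1, "text": "Revenue grew."}',
    '{"chunk_id": "b", "text": "Orphan chunk with no source or page."}',
]
chunks, flagged = load_chunks(lines)
print(len(chunks), flagged)  # 2 [1]
```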

Source manifest format

[
  {"source": "annual_report.pdf", "pages": 12},
  {"source": "onboarding_guide.pdf", "pages": 8}
]
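Conceptually, page_coverage diffs the page numbers present in the chunk set against each manifest entry's expected count. A minimal stdlib sketch of that comparison, assuming pages are numbered from 1 (`missing_pages` is an illustrative helper, not the package's API):

```python
def missing_pages(manifest, chunks):
    """Return {source: [missing page numbers]} for each manifest entry."""
    seen = {}
    for c in chunks:
        if c.get("source") is not None and c.get("page") is not None:
            seen.setdefault(c["source"], set()).add(c["page"])
    gaps = {}
    for doc in manifest:
        expected = set(range(1, doc["pages"] + 1))
        absent = sorted(expected - seen.get(doc["source"], set()))
        if absent:
            gaps[doc["source"]] = absent
    return gaps

manifest = [{"source": "onboarding_guide.pdf", "pages": 8}]
chunks = [{"source": "onboarding_guide.pdf", "page": p, "text": "..."}
          for p in (1, 2, 3, 6, 7, 8)]
print(missing_pages(manifest, chunks))  # {'onboarding_guide.pdf': [4, 5]}
```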

Real output

The following is actual output from examples/audit_demo.py on 64 chunks across 10 synthetic documents, with deliberately seeded defects.

Executive summary:

Overall status : FAIL
Chunks audited : 64
Sources        : 10
Check counts   : PASS=4  WARN=6  FAIL=1

Check results:

| Check | Status | Summary |
|-------|--------|---------|
| input_summary | PASS | 64 chunks across 10 sources |
| metadata_completeness | WARN | 2 chunks missing source or page metadata |
| document_coverage | PASS | 0 missing documents, 0 orphan sources |
| page_coverage | FAIL | 10 missing pages detected across 2 documents |
| chunk_length | WARN | 3 short chunks (possible headers/fragments) |
| ocr_noise | WARN | 2 chunks show OCR/extraction noise |
| duplicate_chunks | WARN | 1 exact duplicate chunk pair |
| source_distribution | PASS | 10 sources, largest at 14.1% |
| chunk_overlap | PASS | 1 high-overlap consecutive pair (below threshold) |
| encoding_health | WARN | 2 chunks contain mojibake sequences |
| split_quality | WARN | 4 chunks with poor split boundaries |

Interpretation note (included in every output; cannot be suppressed):

DocIngestQA reports deterministic ingestion quality signals for already-generated chunks. It does not parse documents, evaluate retrieval relevance, verify answer faithfulness, or prove that a RAG system is correct. Use these outputs as pre-indexing review signals before loading chunks into a vector database.


The 11 checks

v0.1 checks

| Check | What it detects | Status triggers |
|-------|-----------------|-----------------|
| input_summary | Empty chunk sets | FAIL if no chunks |
| metadata_completeness | Missing source or page | FAIL if >20%, WARN if >5% |
| document_coverage | Documents in manifest with no chunks; orphan sources | FAIL if missing docs |
| page_coverage | Pages expected by manifest but absent in chunk set | FAIL if any missing |
| chunk_length | Empty, very short (<80 chars), very long (>1500 chars) chunks | FAIL if >10% empty |
| ocr_noise | Replacement chars (U+FFFD), repeated junk runs, non-printable ratio | FAIL if >20% noisy |
| duplicate_chunks | Exact duplicates (SHA-1 hash) and near-duplicates (Jaccard on 5-grams) | FAIL if ≥30 pairs |
| source_distribution | One source dominating >80% of all chunks | WARN |
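The duplicate_chunks heuristic pairs an exact SHA-1 hash match with Jaccard similarity over word 5-grams. A simplified sketch of that combination (the package's actual tokenization and normalization are not documented here, so lowercased whitespace splitting is an assumption):

```python
import hashlib

def word_ngrams(text, n=5):
    """Set of word n-grams after lowercasing and whitespace splitting."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def find_duplicates(texts, threshold=0.85):
    """Return (exact-duplicate index pairs, near-duplicate index pairs)."""
    hashes = [hashlib.sha1(t.encode("utf-8")).hexdigest() for t in texts]
    grams = [word_ngrams(t) for t in texts]
    exact, near = [], []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if hashes[i] == hashes[j]:
                exact.append((i, j))
            elif jaccard(grams[i], grams[j]) >= threshold:
                near.append((i, j))
    return exact, near

texts = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",
    "an entirely different chunk about encoding health checks",
]
print(find_duplicates(texts))  # ([(0, 1)], [])
```

With the default thresholds, 5 such pairs would trip WARN and 30 would trip FAIL.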

v0.2 checks

| Check | What it detects | Status triggers |
|-------|-----------------|-----------------|
| chunk_overlap | Consecutive chunks from the same source with Jaccard ≥ 0.40 on 4-grams (sliding-window splitter artifacts) | FAIL if ≥15 high-overlap pairs, WARN if ≥3 flagged |
| encoding_health | Null bytes, BOM markers, control characters, mojibake patterns (e.g. Ã©, Â©) | FAIL if null bytes present or mojibake rate ≥20% |
| split_quality | Mid-sentence starts, mid-sentence ends, navigation fragments (bare page numbers, TOC entries) | FAIL if ≥30% flagged |
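The chunk_overlap check can be sketched as a single pass over consecutive same-source pairs, scoring each with 4-gram Jaccard. This is an illustrative reimplementation under the assumption that the 4-grams are word n-grams; `flag_overlaps` is not the package's actual function:

```python
def ngrams(text, n=4):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_overlaps(chunks, warn=0.40, fail=0.70):
    """Flag consecutive same-source pairs whose 4-gram Jaccard >= warn."""
    flagged = []
    for a, b in zip(chunks, chunks[1:]):
        if a.get("source") != b.get("source"):
            continue
        ga, gb = ngrams(a["text"]), ngrams(b["text"])
        union = ga | gb
        score = len(ga & gb) / len(union) if union else 0.0
        if score >= warn:
            severity = "HIGH" if score >= fail else "MEDIUM"
            flagged.append((a["chunk_id"], b["chunk_id"], round(score, 2), severity))
    return flagged

# A sliding-window splitter leaves heavy word overlap between neighbors:
chunks = [
    {"chunk_id": "c1", "source": "doc.pdf",
     "text": "chunking with a sliding window repeats trailing words at boundaries"},
    {"chunk_id": "c2", "source": "doc.pdf",
     "text": "a sliding window repeats trailing words at boundaries of the next"},
]
print(flag_overlaps(chunks))  # [('c1', 'c2', 0.5, 'MEDIUM')]
```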

API reference

IngestionAuditor(chunks_path, documents_path, config)

| Parameter | Type | Description |
|-----------|------|-------------|
| chunks_path | str \| Path | Path to JSONL chunk file |
| documents_path | str \| Path \| None | Optional path to JSON source manifest |
| config | AuditConfig | Threshold configuration (all fields have defaults) |

auditor.run() → IngestionAuditReport

Returns a report with:

| Method | Returns |
|--------|---------|
| .to_json(path=None) | JSON string; writes file if path given |
| .to_markdown(path=None) | Markdown string; writes file if path given |
| .to_html(path=None) | Self-contained HTML report |
| .to_dict() | Full payload dict matching the JSON schema |

Key AuditConfig thresholds

AuditConfig(
    min_chunk_chars=80,
    max_chunk_chars=1500,
    noisy_text_ratio_threshold=0.05,
    replacement_char_threshold=3,
    ngram_size=5,
    near_duplicate_jaccard_threshold=0.85,
    warn_duplicate_pair_count=5,
    fail_duplicate_pair_count=30,
    # v0.2
    overlap_ngram_size=4,
    overlap_jaccard_warn_threshold=0.40,
    overlap_jaccard_fail_threshold=0.70,
    warn_overlap_pair_count=3,
    fail_overlap_pair_count=15,
    null_byte_fail_threshold=1,
    mojibake_warn_rate=0.05,
    mojibake_fail_rate=0.20,
    warn_bad_split_rate=0.10,
    fail_bad_split_rate=0.30,
)
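Each warn/fail pair above maps a measured defect rate onto a check status. The mapping presumably reduces to a two-threshold comparison like the following (an illustrative sketch, not the package's code):

```python
def status_for_rate(rate, warn_threshold, fail_threshold):
    """Map a measured defect rate to a check status via a warn/fail pair."""
    if rate >= fail_threshold:
        return "FAIL"
    if rate >= warn_threshold:
        return "WARN"
    return "PASS"

# e.g. with mojibake_warn_rate=0.05 and mojibake_fail_rate=0.20:
print(status_for_rate(0.08, 0.05, 0.20))  # WARN
print(status_for_rate(0.25, 0.05, 0.20))  # FAIL
```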

Output schema (v0.2)

{
  "schema_version": "0.2",
  "metadata": { "docingestqa_version": "0.2.0", "generated_at": "...", "chunk_count": 64 },
  "executive_summary": {
    "overall_status": "FAIL",
    "chunk_count": 64,
    "check_counts": { "PASS": 4, "WARN": 6, "FAIL": 1 }
  },
  "checks": [
    {
      "check": "page_coverage",
      "status": "FAIL",
      "summary": "10 missing pages and 0 out-of-range pages detected.",
      "metrics": { "missing_page_total": 10, "extra_page_total": 0 },
      "issues": [
        {
          "check": "page_coverage",
          "severity": "HIGH",
          "message": "Document has pages missing from the chunk set.",
          "source": "onboarding_guide.pdf",
          "evidence": { "missing_pages": [4, 5], "expected_pages": 8 }
        }
      ],
      "recommendation": "Inspect parser logs for missing pages before indexing."
    }
  ],
  "interpretation_note": "DocIngestQA reports deterministic ingestion quality signals..."
}

What DocIngestQA is not

Not a document parser. DocIngestQA audits chunks you already generated. It does not extract text from PDFs or other documents.

Not a retrieval evaluator. It does not measure whether your chunks are semantically relevant to queries. For that, use a retrieval eval framework.

Not a RAG correctness checker. It does not verify whether answers generated from these chunks are faithful or accurate.

Not a statistical significance tester. Checks are deterministic heuristics, not hypothesis tests. Issue counts are investigation priorities, not p-values.

Not an embedding quality checker. It works on raw text, not embeddings. Embedding quality (cluster separation, isotropy) is a separate concern.


Comparison

| Capability | DocIngestQA | Manual inspection | Generic data quality tools |
|------------|-------------|-------------------|----------------------------|
| Missing page detection | Yes | No | No |
| OCR noise detection | Yes | Slow | No |
| Duplicate/near-dup chunks | Yes | No | Partial |
| Mojibake / encoding errors | Yes | Slow | No |
| Sliding-window overlap | Yes | No | No |
| Split boundary quality | Yes | Slow | No |
| Structured JSON output | Yes | No | Varies |
| Zero non-stdlib dependencies | Yes | — | No |

Roadmap

| Version | Scope |
|---------|-------|
| v0.1.0 | 8 checks, JSON/MD/HTML output |
| v0.2.0 | 3 new checks (overlap, encoding, split quality), CLI, Python 3.10 compat |
| v0.3 | Configurable severity overrides, per-source report sections |
| v1.0 | Semantic coherence check (embedding cosine within chunk), LLM-narrated summary option |

Contributing

See CONTRIBUTING.md. Issues and PRs welcome.


License

MIT © Sidharth Kriplani
