Document quality profiler for ML pipelines. Score, deduplicate, and validate your corpus before embedding. Zero mandatory dependencies.

doc-quality

Install

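pip install corpus-quality

The distribution is published on PyPI as corpus-quality; the import name is doc_quality and the command-line tool is doc-quality.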

The problem

Most RAG failures start at ingestion, when nobody checks whether the documents are actually clean before embedding them. Teams spend days tuning chunk sizes and prompt templates when the real problem is that 20% of their corpus is failed PDF extractions, near-duplicates, or walls of text that will split badly at any chunk boundary.

doc-quality runs before any of that.

Quick start

from doc_quality import CorpusProfiler

profiler = CorpusProfiler(corpus_name="my_rag_corpus")
report = profiler.profile_directory("./documents/")
print(report.summary())

Example output:

=== DOC-QUALITY -- CORPUS QUALITY REPORT ===
  Documents:      47
  Pass:           34
  Warn:           9
  Fail:           4
  Pass rate:      72%
  Avg score:      74.2/100
  Duplicates:     6 pairs
  Recommended chunking: recursive
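
The pass rate in the sample report appears to be the share of documents rated Pass. A quick check using only the counts printed above; rounding to a whole percent is an assumption, not something the report states:

```python
# Counts copied from the sample report above.
counts = {"pass": 34, "warn": 9, "fail": 4}

total = sum(counts.values())        # 34 + 9 + 4 documents
pass_rate = counts["pass"] / total  # fraction of documents rated Pass

# Assumption: the report rounds to the nearest whole percent.
print(f"Documents: {total}")
print(f"Pass rate: {pass_rate:.0%}")
```

The three counts add up to the 47 documents listed, and 34 of 47 rounds to the 72% shown.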

Single document

from doc_quality import DocumentProfiler

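profiler = DocumentProfiler()
# `text` is the document's already-extracted plain text (a str).
profile = profiler.profile(text, name="annual_report.pdf")

# The trailing comments show example values.
print(profile.quality_score)      # 83.4
print(profile.quality_level)      # QualityLevel.PASS
print(profile.boilerplate_ratio)  # 0.18
print(profile.chunk_risk_score)   # 0.22
print(profile.issues)             # issues detected in the document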
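
One way to act on a profile is to gate documents before they reach the embedding step. The sketch below is a hypothetical helper, not part of doc-quality: it relies only on the quality_score and chunk_risk_score attribute names shown above, assumes the score is on a 0-100 scale and the risk on a 0-1 scale, and uses made-up thresholds. Stand-in objects replace real profiles so it runs without the library installed.

```python
from types import SimpleNamespace


def keep_for_embedding(profile, min_score=70.0, max_chunk_risk=0.5):
    """Return True if a profiled document looks safe to embed.

    `profile` is any object exposing `quality_score` and
    `chunk_risk_score`. The default thresholds are hypothetical,
    not library defaults.
    """
    return (
        profile.quality_score >= min_score
        and profile.chunk_risk_score <= max_chunk_risk
    )


# Stand-ins for real profile objects (values are illustrative).
clean = SimpleNamespace(quality_score=83.4, chunk_risk_score=0.22)
noisy = SimpleNamespace(quality_score=41.0, chunk_risk_score=0.80)

print(keep_for_embedding(clean))  # True
print(keep_for_embedding(noisy))  # False
```

Tune the thresholds against your own corpus; a document that fails the gate is usually worth re-extracting rather than silently dropping.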

Near-duplicate detection (no embeddings needed)

from doc_quality import find_duplicates

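# text_a and text_b are the extracted texts (str) of the two documents.
docs = {"policy_v1.txt": text_a, "policy_v1_copy.txt": text_b}

pairs = find_duplicates(docs, threshold=0.85)
for pair in pairs:
    print(pair)
# Example output when the two texts are identical:
# DuplicatePair('policy_v1.txt' <-> 'policy_v1_copy.txt', EXACT)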
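
The README does not say how find_duplicates compares documents. For intuition, one common embedding-free approach is Jaccard similarity over word shingles; the sketch below illustrates that technique and is not the library's implementation. The function names, the shingle size k, and the sample texts are all hypothetical.

```python
import re
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of the lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.85, k=3):
    """Yield (name_a, name_b, similarity) for pairs at or above threshold."""
    sets = {name: shingles(text, k) for name, text in docs.items()}
    for (name_a, set_a), (name_b, set_b) in combinations(sets.items(), 2):
        score = jaccard(set_a, set_b)
        if score >= threshold:
            yield name_a, name_b, score


docs = {
    "policy_v1.txt": "Employees must badge in at the main entrance every day.",
    "policy_v1_copy.txt": "Employees must badge in at the main entrance every day.",
    "handbook.txt": "Lunch is served in the cafeteria from noon until two.",
}
for name_a, name_b, score in near_duplicates(docs):
    print(f"{name_a} <-> {name_b}: {score:.2f}")
```

Comparing every pair is quadratic in the number of documents, which is fine for small corpora; larger ones typically need MinHash with locality-sensitive hashing.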

Data card for model cards and papers

# `report` is the corpus report returned by profile_directory() in the quick start.
print(report.data_card.to_markdown())

CLI

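doc-quality profile report.txt                      # profile a single document
doc-quality corpus ./documents/ --name "my_corpus"  # profile a directory as a named corpus
doc-quality corpus ./documents/ --data-card         # corpus profile with a data card
doc-quality deduplicate ./documents/                # look for duplicates in a directory
doc-quality corpus ./documents/ --json              # corpus profile as JSON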

Quality dimensions scored

| Dimension | Weight | Detects |
| --- | --- | --- |
| Encoding | 25% | Replacement chars, mojibake |
| Text density | 20% | Whitespace-heavy extractions |
| Uniqueness | 20% | Repeated lines, template noise |
| Sentences | 15% | List dumps, failed extractions |
| Boilerplate | 10% | Headers, footers, disclaimers |
| Table integrity | 5% | Malformed table structure |
| Chunk boundary risk | 5% | Wall-of-text, no split points |
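
The weights above add up to 100%. A minimal sketch of how they could combine into a 0-100 score, assuming a plain weighted sum of per-dimension scores in [0, 1] where 1.0 is best. The README does not confirm that formula; the dictionary keys and the sample sub-scores are hypothetical.

```python
# Weights copied from the table above (percent of the total score).
WEIGHTS = {
    "encoding": 25,
    "text_density": 20,
    "uniqueness": 20,
    "sentences": 15,
    "boilerplate": 10,
    "table_integrity": 5,
    "chunk_boundary_risk": 5,
}
assert sum(WEIGHTS.values()) == 100


def overall_score(sub_scores):
    """Combine per-dimension scores in [0, 1] into a 0-100 total.

    Assumption: a plain weighted sum, with 1.0 as the best value
    for every dimension.
    """
    missing = WEIGHTS.keys() - sub_scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)


# Hypothetical sub-scores for one document.
example = {
    "encoding": 1.0,
    "text_density": 0.9,
    "uniqueness": 0.8,
    "sentences": 0.7,
    "boilerplate": 0.6,
    "table_integrity": 1.0,
    "chunk_boundary_risk": 0.5,
}
print(round(overall_score(example), 1))  # 83.0
```

Under this reading, a document with perfect encoding can still lose up to 75 points on the other dimensions, so encoding alone never decides the outcome.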

Pipeline position

doc-quality (profile) -> chunk-bench (benchmark) -> rag-eval-kit (evaluate)

Linda Oraegbunam | LinkedIn | GitHub

Download files

Source Distribution

corpus_quality-1.0.0.tar.gz (15.7 kB)

Uploaded Source

Built Distribution

corpus_quality-1.0.0-py3-none-any.whl (14.9 kB)

Uploaded Python 3

File details

Details for the file corpus_quality-1.0.0.tar.gz.

File metadata

  • Download URL: corpus_quality-1.0.0.tar.gz
  • Upload date:
  • Size: 15.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for corpus_quality-1.0.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 3ba5dac7fb0e425b0fd222c59527045b86b8c7c0f9295b88fb4740d049e13104 |
| MD5 | ab287cec99f0985c55178b672a446cc5 |
| BLAKE2b-256 | f07632c77e4d17c75b91301dfb3fdec7131f4e4408fc25287306195b77cb955b |
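
To check a downloaded file against the SHA256 digest above, something like the following sketch works with the standard library alone. The function names are hypothetical, and the local path passed to verify is wherever you saved the download.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# SHA256 listed above for corpus_quality-1.0.0.tar.gz.
EXPECTED = "3ba5dac7fb0e425b0fd222c59527045b86b8c7c0f9295b88fb4740d049e13104"


def verify(path, expected=EXPECTED):
    """Return True if the file's SHA256 digest matches `expected`."""
    return sha256_of(path) == expected.lower()
```

For example, verify("corpus_quality-1.0.0.tar.gz") should return True for an intact copy of the source distribution saved in the current directory.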

Provenance

The following attestation bundles were made for corpus_quality-1.0.0.tar.gz:

Publisher: publish.yml on obielin/doc-quality

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file corpus_quality-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: corpus_quality-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 14.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for corpus_quality-1.0.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 590c13a2ae325c89e2de4e9467b393a6a6fbc7b3db46d7bda740d94e97c54515 |
| MD5 | 50c2e7f6d6026783ade4c9aa268a827d |
| BLAKE2b-256 | 4d3a95a2a2f4e12db73cbe88ce160cfdb3d6d66b5efa786a47a6f96d245f8e1c |

Provenance

The following attestation bundles were made for corpus_quality-1.0.0-py3-none-any.whl:

Publisher: publish.yml on obielin/doc-quality
