doc-quality
Document quality profiler for ML pipelines. Score, deduplicate, and validate your corpus before embedding. Zero mandatory dependencies.
Install
pip install corpus-quality
The problem
Most RAG failures start at ingestion, when nobody checked whether the documents were clean before embedding them. Teams spend days tuning chunk sizes and prompt templates when the real problem is that 20% of their corpus is failed PDF extractions, near-duplicates, or walls of text that will split badly at any chunk boundary.
doc-quality runs before any of that.
Quick start
from doc_quality import CorpusProfiler
profiler = CorpusProfiler(corpus_name="my_rag_corpus")
report = profiler.profile_directory("./documents/")
print(report.summary())
=== DOC-QUALITY -- CORPUS QUALITY REPORT ===
Documents: 47
Pass: 34
Warn: 9
Fail: 4
Pass rate: 72%
Avg score: 74.2/100
Duplicates: 6 pairs
Recommended chunking: recursive
Single document
from doc_quality import DocumentProfiler
profiler = DocumentProfiler()
# `text` is the document's extracted contents as a str, loaded beforehand
profile = profiler.profile(text, name="annual_report.pdf")
print(profile.quality_score) # 83.4
print(profile.quality_level) # QualityLevel.PASS
print(profile.boilerplate_ratio) # 0.18
print(profile.chunk_risk_score) # 0.22
print(profile.issues)
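To give a feel for what a field like chunk_risk_score might capture, here is a minimal sketch of one possible heuristic: the share of characters that sit in paragraphs too long to split cleanly. This is not doc-quality's implementation. The function name `chunk_risk`, the `max_chars` parameter, and the blank-line paragraph rule are all assumptions made for illustration.

```python
import re


def chunk_risk(text: str, max_chars: int = 1000) -> float:
    """Hypothetical heuristic: fraction of characters in oversized paragraphs.

    A paragraph is any run of text separated by blank lines. Paragraphs
    longer than `max_chars` offer no natural split point, so a chunker
    has to cut them mid-thought.
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    total = sum(len(p) for p in paragraphs)
    if total == 0:
        return 0.0  # empty input carries no chunking risk
    risky = sum(len(p) for p in paragraphs if len(p) > max_chars)
    return risky / total
```

A document made of short paragraphs scores 0.0 under this sketch, while a single unbroken block longer than `max_chars` scores 1.0.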
Near-duplicate detection (no embeddings needed)
from doc_quality import find_duplicates
pairs = find_duplicates({"doc_a.txt": text_a, "doc_b.txt": text_b}, threshold=0.85)
for pair in pairs:
    print(pair)
# Example output: DuplicatePair('policy_v1.txt' <-> 'policy_v1_copy.txt', EXACT)
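For readers wondering how near-duplicates can be found without embeddings, one common approach is word shingling combined with Jaccard similarity. The sketch below illustrates that general technique only. It is not taken from doc-quality's source, and the helper names (`shingles`, `jaccard`, `near_duplicates`) and the shingle size `k` are hypothetical.

```python
from itertools import combinations


def shingles(text: str, k: int = 5) -> set[str]:
    """Return the set of overlapping k-word windows in the text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts become a single shingle (or none if empty).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


def near_duplicates(docs: dict[str, str], threshold: float = 0.85, k: int = 5):
    """Compare every pair of documents and keep those at or above threshold."""
    sets = {name: shingles(text, k) for name, text in docs.items()}
    pairs = []
    for (name_a, set_a), (name_b, set_b) in combinations(sets.items(), 2):
        score = jaccard(set_a, set_b)
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    return pairs
```

Comparing every pair is quadratic in the number of documents, which is fine for a few hundred files. Larger corpora usually switch to MinHash or a similar sketching scheme to avoid the full pairwise scan.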
Data card for model cards and papers
print(report.data_card.to_markdown())
CLI
doc-quality profile report.txt
doc-quality corpus ./documents/ --name "my_corpus"
doc-quality corpus ./documents/ --data-card
doc-quality deduplicate ./documents/
doc-quality corpus ./documents/ --json
Quality dimensions scored
| Dimension | Weight | Detects |
|---|---|---|
| Encoding | 25% | Replacement chars, mojibake |
| Text density | 20% | Whitespace-heavy extractions |
| Uniqueness | 20% | Repeated lines, template noise |
| Sentences | 15% | List dumps, failed extractions |
| Boilerplate | 10% | Headers, footers, disclaimers |
| Table integrity | 5% | Malformed table structure |
| Chunk boundary risk | 5% | Wall-of-text, no split points |
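The weights in the table sum to 100%, which suggests the overall score is a weighted combination of the per-dimension scores. The sketch below assumes a plain weighted sum with each dimension scored on a 0 to 100 scale. That aggregation rule, the dictionary keys, and the function name `overall_score` are assumptions for illustration, not the library's confirmed behavior.

```python
# Weights copied from the table above, expressed as fractions of 1.
WEIGHTS = {
    "encoding": 0.25,
    "text_density": 0.20,
    "uniqueness": 0.20,
    "sentences": 0.15,
    "boilerplate": 0.10,
    "table_integrity": 0.05,
    "chunk_boundary_risk": 0.05,
}


def overall_score(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension scores (0-100) into one weighted score.

    Assumes a simple linear combination; the real profiler may differ.
    """
    missing = WEIGHTS.keys() - dimension_scores.keys()
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)


example = {
    "encoding": 90,
    "text_density": 80,
    "uniqueness": 70,
    "sentences": 60,
    "boilerplate": 100,
    "table_integrity": 100,
    "chunk_boundary_risk": 50,
}
print(round(overall_score(example), 1))
```

Under these assumptions the example works out to 22.5 + 16 + 14 + 9 + 10 + 5 + 2.5 = 79. Because encoding carries a quarter of the weight, a document with heavy mojibake loses far more than one with a malformed table.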
Pipeline position
doc-quality (profile) -> chunk-bench (benchmark) -> rag-eval-kit (evaluate)