Document screening and quality reporting pipeline for RAG preprocessing, PII detection, readability, and compliance workflows
doculens
Screen documents for quality, PII, readability, and compliance — before they enter your RAG pipeline or review workflow.
What is doculens?
doculens is a document screening and quality reporting pipeline. It ingests documents (PDF, DOCX, TXT, HTML), runs them through a configurable set of screeners, and produces structured reports in JSON or HTML.
Features
- Multi-format ingestion — PDF, DOCX, TXT, HTML
- 6 built-in screeners — readability, word count, language detection, PII detection, grammar, duplicate detection
- Structured reports — JSON and HTML output with color-coded score bars
- CLI interface — single file and batch screening with Rich-formatted terminal output
- Pluggable architecture — add custom screeners to the pipeline
- Configurable thresholds — min/max word count, expected language, readability floor, and more
- PII redaction — replace detected sensitive data with labelled placeholders like [EMAIL_REDACTED]
- Auto-detection — PII and grammar screeners are automatically enabled when their dependencies are installed
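The pluggable design mentioned in the list above could look roughly like the sketch below. This is not the doculens API: `ScreenerResult`, `Screener`, `MaxLineLengthScreener`, and `run_screeners` are hypothetical names chosen for illustration. Only the result fields (`screener_name`, `score`, `passed`) follow the names used in the Python API example in this README.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class ScreenerResult:
    # Hypothetical result type; field names follow the report results shown in this README.
    screener_name: str
    score: float  # 0.0 (worst) to 1.0 (best)
    passed: bool


class Screener(Protocol):
    # Hypothetical interface: any object with a name and a screen(text) method.
    name: str

    def screen(self, text: str) -> ScreenerResult: ...


class MaxLineLengthScreener:
    """Illustrative custom screener: fails when too many lines exceed a length limit."""

    name = "max_line_length"

    def __init__(self, limit: int = 120, min_score: float = 0.9) -> None:
        self.limit = limit
        self.min_score = min_score

    def screen(self, text: str) -> ScreenerResult:
        lines = text.splitlines() or [""]
        within_limit = sum(1 for line in lines if len(line) <= self.limit)
        score = within_limit / len(lines)
        return ScreenerResult(self.name, score, score >= self.min_score)


def run_screeners(text: str, screeners: list[Screener]) -> list[ScreenerResult]:
    # Run every screener on the same text and collect the results in order.
    return [screener.screen(text) for screener in screeners]


results = run_screeners("short line\n" + "x" * 200, [MaxLineLengthScreener(limit=80)])
for result in results:
    print(result)
```

A screener in this sketch only needs a `name` and a `screen` method, so new checks can be added without touching the runner.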
Installation
pip install doculens
Optional extras
pip install doculens[pii] # PII detection (Presidio + spaCy)
pip install doculens[grammar] # Grammar checking (LanguageTool)
pip install doculens[all] # Everything
Quickstart
CLI
# Screen a single document
doculens run report.pdf
# Choose specific screeners and output format
doculens run report.pdf --screeners readability,wordcount,pii --format html -o report.html
# Screen all documents in a folder
doculens batch ./documents --recursive --format json --output-dir ./reports
# Redact PII from a document
doculens redact contract.pdf -o contract_clean.txt
# List available screeners
doculens list-screeners
Python API
from doculens import ScreeningConfig, ScreeningPipeline
# Configure and run
config = ScreeningConfig(
    screeners=["readability", "wordcount", "language"],
    min_word_count=50,
    expected_language="en",
)
pipeline = ScreeningPipeline(config)
report = pipeline.screen_file("report.pdf")
print(report.overall_passed) # True / False
print(report.summary) # {'total_screeners': 3, 'passed': 3, ...}
for result in report.results:
    print(f"{result.screener_name}: {result.score:.2f} — {'PASS' if result.passed else 'FAIL'}")
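For batch use from Python, the same calls can be wrapped in a loop. A minimal sketch, assuming only what the example above shows: a pipeline object exposing `screen_file(path)` that returns a report with an `overall_passed` attribute. The helper name `find_failing_files` and the suffix set are illustrative choices, not part of doculens.

```python
from pathlib import Path
from typing import Iterable

# Extensions taken from the supported formats listed in this README.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".txt", ".html", ".htm"}


def find_failing_files(
    pipeline, folder: str, suffixes: Iterable[str] = SUPPORTED_SUFFIXES
) -> list[Path]:
    """Screen every supported file under `folder` and return those that did not pass.

    `pipeline` is any object with a screen_file(path) method whose return value
    has an `overall_passed` attribute, e.g. a ScreeningPipeline as shown above.
    """
    wanted = {suffix.lower() for suffix in suffixes}
    failing = []
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file() and path.suffix.lower() in wanted:
            report = pipeline.screen_file(str(path))
            if not report.overall_passed:
                failing.append(path)
    return failing
```

With doculens installed, this could be called as `find_failing_files(ScreeningPipeline(config), "./documents")`.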
PII redaction
from doculens.screeners.pii import PIIScreener
screener = PIIScreener()
redacted = screener.redact("Contact John Smith at john@example.com")
print(redacted)
# "Contact [NAME_REDACTED] at [EMAIL_REDACTED]"
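To illustrate the placeholder idea without the optional dependencies, here is a simplified stand-in that redacts only email addresses with a regular expression. It is not how `PIIScreener` works (according to the extras listed under Installation, that relies on Presidio and spaCy). The pattern and the function name `redact_emails` are assumptions made for this sketch, and the pattern is deliberately loose.

```python
import re

# Deliberately simple pattern; real-world email matching has many more edge cases.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str, placeholder: str = "[EMAIL_REDACTED]") -> str:
    """Replace every substring that looks like an email address with a placeholder."""
    return EMAIL_PATTERN.sub(placeholder, text)


print(redact_emails("Contact John Smith at john@example.com"))
# Contact John Smith at [EMAIL_REDACTED]
```

Unlike the screener example above, this stand-in leaves the name untouched, because names cannot be found reliably with a fixed pattern.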
Generate reports
from doculens import HTMLReportGenerator, JSONReportGenerator
# HTML report
html = HTMLReportGenerator()
html.save(report, "report.html")
# JSON report
json_gen = JSONReportGenerator()
json_gen.save(report, "report.json")
Screeners
| Screener | Key | What it checks | Library |
|---|---|---|---|
| Readability | readability | Flesch score, grade level, Gunning Fog, SMOG | textstat |
| Word Count | wordcount | Min/max words, line count, avg word length | built-in |
| Language | language | Detects language, validates against expected | langdetect |
| PII Detection | pii | Emails, phones, names, credit cards, IPs | presidio + spaCy |
| Grammar | grammar | Spelling and grammar errors | language-tool-python |
| Duplicates | duplicate | Exact and near-duplicate paragraphs | built-in |
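The duplicate screener is described as catching exact and near-duplicate paragraphs. One common way to do that, shown here as a sketch rather than doculens's actual implementation, is to compare paragraph pairs with `difflib.SequenceMatcher` from the standard library and flag pairs whose similarity ratio meets a threshold. The 0.8 default follows the documented `--dup-threshold` default; the function name and the blank-line paragraph split are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_near_duplicates(text: str, threshold: float = 0.8) -> list[tuple[int, int, float]]:
    """Return (index_a, index_b, similarity) for paragraph pairs at or above the threshold."""
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    matches = []
    for (i, a), (j, b) in combinations(enumerate(paragraphs), 2):
        similarity = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if similarity >= threshold:
            matches.append((i, j, similarity))
    return matches


sample = (
    "Payment is due within 30 days.\n\n"
    "Delivery takes place at the buyer's site.\n\n"
    "Payment is due within 30 days."
)
print(find_near_duplicates(sample))
```

Pairwise comparison is quadratic in the number of paragraphs, which is fine for single documents but may need hashing or shingling for very long inputs.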
CLI Options
doculens run <file> [OPTIONS]
  --screeners, -s    Comma-separated screener names (auto-includes pii/grammar if installed)
  --format, -f       Report format: json or html (default: json)
  --output, -o       Save report to file
  --min-words        Minimum word count threshold (default: 50)
  --lang             Expected language code, e.g. "en"
  --dup-threshold    Similarity threshold for duplicate detection (default: 0.8)
  --verbose, -v      Show warnings in output

doculens batch <folder> [OPTIONS]
  --screeners, -s    Comma-separated screener names
  --format, -f       Report format: json or html (default: json)
  --output-dir, -o   Directory to save individual reports
  --min-words        Minimum word count threshold (default: 50)
  --recursive, -r    Scan subdirectories
  --dup-threshold    Similarity threshold for duplicate detection (default: 0.8)
  --verbose, -v      Show warnings in output

doculens redact <file> [OPTIONS]
  --output, -o       Save redacted text to file (default: print to stdout)
Sample CLI output
doculens — screening report.pdf
╭────────────┬────────┬──────────────────────┬──────────────────────────────────╮
│ Screener │ Status │ Score │ Summary │
├────────────┼────────┼──────────────────────┼──────────────────────────────────┤
│ Readability│ PASS │ ████████████ 72% │ Standard — suitable for most │
│ │ │ │ business documents │
│ Word Count │ PASS │ ████████████ 100% │ 1,243 words, 48 lines │
│ Language │ PASS │ ████████████ 100% │ English (100% confident) │
│ PII │ FAIL │ ██████ 42% │ 3 PII items: 2 emails, 1 name │
╰────────────┴────────┴──────────────────────┴──────────────────────────────────╯
╭──────────────────────────────────────────────────────────────╮
│ Overall: FAILED | Words: 1,243 | Screeners: 3/4 passed │
╰──────────────────────────────────────────────────────────────╯
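The footer in the sample output suggests that the overall verdict fails as soon as any screener fails. A small sketch of that aggregation, assuming an all-must-pass rule (the README does not spell the rule out). `summarize` and the `failed` and `overall_passed` keys are hypothetical, while `total_screeners` and `passed` follow the summary shown in the Python API example.

```python
def summarize(passed_flags: list[bool]) -> dict:
    """Aggregate per-screener pass/fail flags into a summary dict."""
    passed = sum(passed_flags)
    return {
        "total_screeners": len(passed_flags),
        "passed": passed,
        "failed": len(passed_flags) - passed,
        # Assumption: the document passes only if every screener passes.
        "overall_passed": all(passed_flags),
    }


# Flags matching the sample table: Readability, Word Count, Language pass; PII fails.
print(summarize([True, True, True, False]))
# {'total_screeners': 4, 'passed': 3, 'failed': 1, 'overall_passed': False}
```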
Supported formats
| Format | Extensions | Library |
|---|---|---|
| PDF | .pdf | pdfplumber |
| Word | .docx | python-docx |
| HTML | .html, .htm | beautifulsoup4 |
| Plain text | .txt | built-in |
Contributing
Contributions are welcome. Please open an issue first to discuss what you would like to change.
License
MIT