Document screening and quality reporting pipeline for RAG preprocessing, PII detection, readability, and compliance workflows

doculens

Screen documents for quality, PII, readability, and compliance — before they enter your RAG pipeline or review workflow.

What is doculens?

doculens is a document screening and quality reporting pipeline. It ingests documents (PDF, DOCX, TXT, HTML), runs them through a configurable set of screeners, and produces structured reports in JSON or HTML.

Features

Multi-format ingestion — PDF, DOCX, TXT, HTML
6 built-in screeners — readability, word count, language detection, PII detection, grammar, duplicate detection
Structured reports — JSON and HTML output with color-coded score bars
CLI interface — single file and batch screening with Rich-formatted terminal output
Pluggable architecture — add custom screeners to the pipeline
Configurable thresholds — min/max word count, expected language, readability floor, and more
PII redaction — replace detected sensitive data with labelled placeholders like [EMAIL_REDACTED]
Auto-detection — PII and grammar screeners are automatically enabled when their dependencies are installed
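
The pluggable design listed above can be pictured with a small, self-contained sketch. It does not use the doculens API: `ScreenerResult`, `Screener`, `MaxSentenceLengthScreener`, and `run_screeners` are hypothetical names chosen for illustration, and the real extension interface may look different.

```python
import re
from dataclasses import dataclass
from typing import Protocol


@dataclass
class ScreenerResult:
    # Hypothetical result shape, not the doculens class.
    screener_name: str
    score: float
    passed: bool


class Screener(Protocol):
    # Hypothetical interface: anything with a name and a screen() method qualifies.
    name: str

    def screen(self, text: str) -> ScreenerResult: ...


class MaxSentenceLengthScreener:
    """Illustrative custom screener: fails when any sentence exceeds a word limit."""

    name = "max_sentence_length"

    def __init__(self, max_words: int = 40) -> None:
        self.max_words = max_words

    def screen(self, text: str) -> ScreenerResult:
        # Naive sentence split on terminal punctuation; good enough for a sketch.
        sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
        longest = max((len(s.split()) for s in sentences), default=0)
        # Score is 1.0 within the limit and shrinks as the longest sentence grows.
        score = 1.0 if longest <= self.max_words else self.max_words / longest
        return ScreenerResult(self.name, score, longest <= self.max_words)


def run_screeners(text: str, screeners: list[Screener]) -> list[ScreenerResult]:
    # Run every registered screener over the same text, in order.
    return [s.screen(text) for s in screeners]
```

Keeping each screener behind one small method means a pipeline of this shape can mix built-in and custom checks without knowing their internals.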

Installation

pip install doculens

Optional extras

pip install "doculens[pii]"       # PII detection (Presidio + spaCy)
pip install "doculens[grammar]"   # Grammar checking (LanguageTool)
pip install "doculens[all]"       # Everything

Quickstart

CLI

# Screen a single document
doculens run report.pdf

# Choose specific screeners and output format
doculens run report.pdf --screeners readability,wordcount,pii --format html -o report.html

# Screen all documents in a folder
doculens batch ./documents --recursive --format json --output-dir ./reports

# Redact PII from a document
doculens redact contract.pdf -o contract_clean.txt

# List available screeners
doculens list-screeners

Python API

from doculens import ScreeningConfig, ScreeningPipeline

# Configure and run
config = ScreeningConfig(
    screeners=["readability", "wordcount", "language"],
    min_word_count=50,
    expected_language="en",
)
pipeline = ScreeningPipeline(config)
report = pipeline.screen_file("report.pdf")

print(report.overall_passed)        # True / False
print(report.summary)               # {'total_screeners': 3, 'passed': 3, ...}

for result in report.results:
    print(f"{result.screener_name}: {result.score:.2f} — {'PASS' if result.passed else 'FAIL'}")
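
As a rough mental model of how `overall_passed` and `summary` could relate to the individual results, here is a standalone sketch. It does not import doculens. The `Result` dataclass and `summarize` function are hypothetical, and the rule that every screener must pass is an assumption (it is consistent with the sample CLI output, where a single failure yields an overall failure). Only the two summary keys shown above are reproduced.

```python
from dataclasses import dataclass


@dataclass
class Result:
    # Hypothetical stand-in for a screener result.
    screener_name: str
    score: float
    passed: bool


def summarize(results: list[Result]) -> tuple[bool, dict[str, int]]:
    """Return (overall_passed, summary) under an assumed all-must-pass rule."""
    passed = sum(1 for r in results if r.passed)
    summary = {"total_screeners": len(results), "passed": passed}
    return passed == len(results), summary


results = [
    Result("readability", 0.72, True),
    Result("wordcount", 1.00, True),
    Result("pii", 0.42, False),
]
overall_passed, summary = summarize(results)
print(overall_passed, summary)
```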

PII redaction

from doculens.screeners.pii import PIIScreener

screener = PIIScreener()
redacted = screener.redact("Contact John Smith at john@example.com")
print(redacted)
# "Contact [NAME_REDACTED] at [EMAIL_REDACTED]"
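
The PII screener relies on Presidio and spaCy. As a dependency-free illustration of the labelled-placeholder idea only, the sketch below redacts emails and IPv4 addresses with regular expressions. `redact_simple` and `PATTERNS` are hypothetical, the `[IP_REDACTED]` label is an assumption modelled on `[EMAIL_REDACTED]`, and names are left untouched because detecting them needs a named-entity model rather than a pattern.

```python
import re

# Deliberately simple patterns for illustration; real PII detection needs far more care.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def redact_simple(text: str) -> str:
    """Replace each match with a placeholder of the form [LABEL_REDACTED]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}_REDACTED]", text)
    return text


print(redact_simple("Contact John Smith at john@example.com from 10.0.0.1"))
# Contact John Smith at [EMAIL_REDACTED] from [IP_REDACTED]
```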

Generate reports

from doculens import HTMLReportGenerator, JSONReportGenerator

# HTML report
html = HTMLReportGenerator()
html.save(report, "report.html")

# JSON report
json_gen = JSONReportGenerator()
json_gen.save(report, "report.json")

Screeners

Screener	Key	What it checks	Library
Readability	`readability`	Flesch score, grade level, Gunning Fog, SMOG	`textstat`
Word Count	`wordcount`	Min/max words, line count, avg word length	built-in
Language	`language`	Detects language, validates against expected	`langdetect`
PII Detection	`pii`	Emails, phones, names, credit cards, IPs	`presidio` + `spaCy`
Grammar	`grammar`	Spelling and grammar errors	`language-tool-python`
Duplicates	`duplicate`	Exact and near-duplicate paragraphs	built-in
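
The table does not say which similarity measure the duplicate screener uses. The sketch below shows one plausible approach with the standard library's `difflib.SequenceMatcher`, using 0.8 to match the documented default of `--dup-threshold`. `find_duplicate_paragraphs` is a hypothetical helper, not part of doculens.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_duplicate_paragraphs(text: str, threshold: float = 0.8) -> list[tuple[int, int, float]]:
    """Return (i, j, similarity) for paragraph pairs at or above the threshold."""
    # Paragraphs are blocks separated by blank lines; whitespace and case are normalized.
    paragraphs = [" ".join(p.split()).lower() for p in text.split("\n\n") if p.strip()]
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(paragraphs), 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            pairs.append((i, j, ratio))
    return pairs


sample = (
    "Payment is due within 30 days.\n\n"
    "Delivery takes one week.\n\n"
    "Payment is due within 30 days.\n\n"
    "Payment is due within 60 days."
)
for i, j, ratio in find_duplicate_paragraphs(sample):
    print(f"paragraphs {i} and {j}: {ratio:.2f}")
```

A ratio of 1.0 marks an exact duplicate after normalization, while values between the threshold and 1.0 flag near-duplicates. Comparing every pair is quadratic in the number of paragraphs, which is acceptable for a sketch but may be slow for very long documents.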

CLI Options

doculens run <file> [OPTIONS]
  --screeners, -s    Comma-separated screener names (auto-includes pii/grammar if installed)
  --format, -f       Report format: json or html (default: json)
  --output, -o       Save report to file
  --min-words        Minimum word count threshold (default: 50)
  --lang             Expected language code, e.g. "en"
  --dup-threshold    Similarity threshold for duplicate detection (default: 0.8)
  --verbose, -v      Show warnings in output

doculens batch <folder> [OPTIONS]
  --screeners, -s    Comma-separated screener names
  --format, -f       Report format: json or html (default: json)
  --output-dir, -o   Directory to save individual reports
  --min-words        Minimum word count threshold (default: 50)
  --recursive, -r    Scan subdirectories
  --dup-threshold    Similarity threshold for duplicate detection (default: 0.8)
  --verbose, -v      Show warnings in output

doculens redact <file> [OPTIONS]
  --output, -o       Save redacted text to file (default: print to stdout)

Sample CLI output

doculens — screening report.pdf

╭────────────┬────────┬──────────────────────┬──────────────────────────────────╮
│ Screener   │ Status │ Score                │ Summary                          │
├────────────┼────────┼──────────────────────┼──────────────────────────────────┤
│ Readability│  PASS  │ ████████████     72% │ Standard — suitable for most     │
│            │        │                      │ business documents               │
│ Word Count │  PASS  │ ████████████    100% │ 1,243 words, 48 lines            │
│ Language   │  PASS  │ ████████████    100% │ English (100% confident)         │
│ PII        │  FAIL  │ ██████           42% │ 3 PII items: 2 emails, 1 name    │
╰────────────┴────────┴──────────────────────┴──────────────────────────────────╯
╭──────────────────────────────────────────────────────────────╮
│ Overall: FAILED  |  Words: 1,243  |  Screeners: 3/4 passed  │
╰──────────────────────────────────────────────────────────────╯

Supported formats

Format	Extensions	Library
PDF	`.pdf`	`pdfplumber`
Word	`.docx`	`python-docx`
HTML	`.html`, `.htm`	`beautifulsoup4`
Plain text	`.txt`	built-in
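
The mapping in the table above can be expressed as a simple suffix lookup. This is a sketch of the idea rather than doculens's ingestion code: `EXTENSION_FORMATS` and `detect_format` are hypothetical names, and the format keys are made up for illustration.

```python
from pathlib import Path

# Extensions taken from the table above; the format keys are illustrative.
EXTENSION_FORMATS = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".html": "html",
    ".htm": "html",
    ".txt": "txt",
}


def detect_format(path: str) -> str:
    """Return a format key for a supported file, or raise ValueError otherwise."""
    suffix = Path(path).suffix.lower()
    try:
        return EXTENSION_FORMATS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None


print(detect_format("Report.PDF"))  # pdf
print(detect_format("notes.htm"))   # html
```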

Contributing

Contributions are welcome. Please open an issue first to discuss what you would like to change.

License

MIT

This version

0.1.0

May 20, 2026

Download files

Source Distribution

doculens-0.1.0.tar.gz (35.4 kB)

Uploaded May 20, 2026 Source

Built Distribution

doculens-0.1.0-py3-none-any.whl (32.3 kB)

Uploaded May 20, 2026 Python 3

File details

Details for the file doculens-0.1.0.tar.gz.

File metadata

Download URL: doculens-0.1.0.tar.gz
Upload date: May 20, 2026
Size: 35.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.0

File hashes

Hashes for doculens-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`61b6ac26a0b83a66cc57ab4dd245ec108e5192befb99db181233f29cc4651d91`
MD5	`d15428568882adac8e1af000403698ad`
BLAKE2b-256	`cd0f163300e2a5c78990cfa06d6a9146f8f8fcf92839443a09f10cda11076806`

File details

Details for the file doculens-0.1.0-py3-none-any.whl.

File metadata

Download URL: doculens-0.1.0-py3-none-any.whl
Upload date: May 20, 2026
Size: 32.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.0

File hashes

Hashes for doculens-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3925af07d6b00f6a4c3781d21d08ee34ca7edc9e5ff7770829955a7cebb9f642`
MD5	`cd8cb6fc4f48b178b00fdbdedb497205`
BLAKE2b-256	`464bcaa867c7342b603c65057af4293944457ad17a07f8c98c91c174c8f91b57`
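
A downloaded file can be checked against the published SHA256 digest with Python's standard `hashlib` module. The helpers below are a minimal sketch: `sha256_of` and `verify_download` are hypothetical names, and the file is assumed to be readable at the given path.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_sha256: str) -> bool:
    # Normalize the expected value, since hex digests may be pasted in upper case.
    return sha256_of(path) == expected_sha256.strip().lower()
```

For an intact download, calling `verify_download` with the file path and the SHA256 value listed for that file should return `True`.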
