Document screening and quality reporting pipeline for RAG preprocessing, PII detection, readability, and compliance workflows

doculens

Screen documents for quality, PII, readability, and compliance — before they enter your RAG pipeline or review workflow.

Python 3.9+ | License: MIT

What is doculens?

doculens is a document screening and quality reporting pipeline. It ingests documents (PDF, DOCX, TXT, HTML), runs them through a configurable set of screeners, and produces structured reports in JSON or HTML.

Features

  • Multi-format ingestion — PDF, DOCX, TXT, HTML
  • 6 built-in screeners — readability, word count, language detection, PII detection, grammar, duplicate detection
  • Structured reports — JSON and HTML output with color-coded score bars
  • CLI interface — single file and batch screening with Rich-formatted terminal output
  • Pluggable architecture — add custom screeners to the pipeline
  • Configurable thresholds — min/max word count, expected language, readability floor, and more
  • PII redaction — replace detected sensitive data with labelled placeholders like [EMAIL_REDACTED]
  • Auto-detection — PII and grammar screeners are automatically enabled when their dependencies are installed
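
The pluggable architecture is listed above without an example of a custom screener. The sketch below shows one plausible shape for such a plug-in. It is not doculens's actual interface: `ScreenerResult`, `Screener`, `MaxSentenceLengthScreener`, and `run_screeners` are hypothetical names, and the result fields are borrowed from the `screener_name`, `score`, and `passed` attributes that the Quickstart prints. Check the package source for the real base class before relying on any of this.

```python
import re
from dataclasses import dataclass
from typing import Protocol


@dataclass
class ScreenerResult:
    """Hypothetical result record; mirrors the attributes the Quickstart prints."""
    screener_name: str
    score: float  # 0.0 to 1.0
    passed: bool
    summary: str = ""


class Screener(Protocol):
    """Hypothetical interface: anything with a name and a screen(text) method."""
    name: str

    def screen(self, text: str) -> ScreenerResult: ...


class MaxSentenceLengthScreener:
    """Example custom check: fail when too many sentences exceed a word limit."""
    name = "sentence_length"

    def __init__(self, max_words: int = 30, min_ok_ratio: float = 0.9) -> None:
        self.max_words = max_words
        self.min_ok_ratio = min_ok_ratio

    def screen(self, text: str) -> ScreenerResult:
        # Naive sentence split on terminal punctuation; good enough for a sketch.
        sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
        if not sentences:
            return ScreenerResult(self.name, 0.0, False, "no sentences found")
        ok = sum(1 for s in sentences if len(s.split()) <= self.max_words)
        score = ok / len(sentences)
        summary = f"{ok}/{len(sentences)} sentences within {self.max_words} words"
        return ScreenerResult(self.name, score, score >= self.min_ok_ratio, summary)


def run_screeners(text: str, screeners: list) -> list:
    """Run every screener over the same text and collect the results in order."""
    return [s.screen(text) for s in screeners]
```

Keeping each screener to a single `screen(text)` method means a pipeline only needs a list of objects, which is one common way to make checks swappable.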

Installation

pip install doculens

Optional extras

pip install doculens[pii]       # PII detection (Presidio + spaCy)
pip install doculens[grammar]   # Grammar checking (LanguageTool)
pip install doculens[all]       # Everything
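
Because the PII and grammar screeners depend on these extras, it can be useful to check which ones are importable before building a screener list. The sketch below uses the standard library's `importlib.util.find_spec`. The module names are assumed to be the import names of the libraries named above (`presidio_analyzer`, `spacy`, `language_tool_python`), and the helper names are illustrative. This is not necessarily how doculens performs its own auto-detection.

```python
from importlib.util import find_spec

# Assumed import names for the optional extras; verify against your environment.
OPTIONAL_SCREENERS = {
    "pii": ("presidio_analyzer", "spacy"),
    "grammar": ("language_tool_python",),
}


def is_available(modules):
    """True when every top-level module in `modules` can be located."""
    return all(find_spec(name) is not None for name in modules)


def available_optional_screeners(requirements=OPTIONAL_SCREENERS):
    """Return the screener keys whose dependencies are all importable."""
    return [key for key, modules in requirements.items() if is_available(modules)]


if __name__ == "__main__":
    print(available_optional_screeners())
```

`find_spec` only locates a module without importing it, so the check stays cheap even for heavy packages such as spaCy.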

Quickstart

CLI

# Screen a single document
doculens run report.pdf

# Choose specific screeners and output format
doculens run report.pdf --screeners readability,wordcount,pii --format html -o report.html

# Screen all documents in a folder
doculens batch ./documents --recursive --format json --output-dir ./reports

# Redact PII from a document
doculens redact contract.pdf -o contract_clean.txt

# List available screeners
doculens list-screeners

Python API

from doculens import ScreeningConfig, ScreeningPipeline

# Configure and run
config = ScreeningConfig(
    screeners=["readability", "wordcount", "language"],
    min_word_count=50,
    expected_language="en",
)
pipeline = ScreeningPipeline(config)
report = pipeline.screen_file("report.pdf")

print(report.overall_passed)        # True / False
print(report.summary)               # {'total_screeners': 3, 'passed': 3, ...}

for result in report.results:
    print(f"{result.screener_name}: {result.score:.2f} {'PASS' if result.passed else 'FAIL'}")
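
The snippet above screens one file. To cover a folder from Python rather than through `doculens batch`, a loop along the following lines may work. It relies only on `screen_file` and `overall_passed` as shown above; `screen_folder`, `failed_files`, and `SUPPORTED` are illustrative helpers, not part of the package.

```python
from pathlib import Path

# Extensions taken from the supported formats listed in this README.
SUPPORTED = {".pdf", ".docx", ".txt", ".html", ".htm"}


def screen_folder(pipeline, folder, recursive=False):
    """Screen every supported file under `folder`; return {path: report}."""
    root = Path(folder)
    candidates = root.rglob("*") if recursive else root.glob("*")
    reports = {}
    for path in sorted(candidates):
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            reports[str(path)] = pipeline.screen_file(str(path))
    return reports


def failed_files(reports):
    """Paths whose report did not pass overall."""
    return [path for path, report in reports.items() if not report.overall_passed]
```

Passing the pipeline in as an argument keeps the helper independent of how the `ScreeningConfig` was built.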

PII redaction

from doculens.screeners.pii import PIIScreener

screener = PIIScreener()
redacted = screener.redact("Contact John Smith at john@example.com")
print(redacted)
# "Contact [NAME_REDACTED] at [EMAIL_REDACTED]"
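
To illustrate the labelled-placeholder idea without installing the `pii` extra, here is a deliberately simple regex-based sketch. It is a swapped-in technique, not what `PIIScreener` does internally: names such as "John Smith" need entity recognition (which is what Presidio and spaCy provide), so this version leaves them untouched. The `PATTERNS` table and the `[IP_REDACTED]` label are assumptions made for the example.

```python
import re

# Simple illustrative patterns; real PII detection needs far broader coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def redact(text, patterns=PATTERNS):
    """Replace each pattern match with a labelled placeholder."""
    for label, pattern in patterns.items():
        text = pattern.sub(f"[{label}_REDACTED]", text)
    return text


print(redact("Contact John Smith at john@example.com"))
# Contact John Smith at [EMAIL_REDACTED]
```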

Generate reports

from doculens import HTMLReportGenerator, JSONReportGenerator

# HTML report
html = HTMLReportGenerator()
html.save(report, "report.html")

# JSON report
json_gen = JSONReportGenerator()
json_gen.save(report, "report.json")

Screeners

| Screener | Key | What it checks | Library |
|---|---|---|---|
| Readability | readability | Flesch score, grade level, Gunning Fog, SMOG | textstat |
| Word Count | wordcount | Min/max words, line count, avg word length | built-in |
| Language | language | Detects language, validates against expected | langdetect |
| PII Detection | pii | Emails, phones, names, credit cards, IPs | presidio + spaCy |
| Grammar | grammar | Spelling and grammar errors | language-tool-python |
| Duplicates | duplicate | Exact and near-duplicate paragraphs | built-in |
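
The duplicate screener is described only as finding exact and near-duplicate paragraphs against a similarity threshold (0.8 by default, per `--dup-threshold`). The README does not say which similarity measure it uses, so the sketch below substitutes the standard library's `difflib.SequenceMatcher` to show the general idea. `find_duplicate_paragraphs` is a hypothetical helper.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_duplicate_paragraphs(text, threshold=0.8):
    """Return (i, j, ratio) for paragraph pairs at or above `threshold`."""
    # Normalise whitespace and case so trivial differences do not hide duplicates.
    paragraphs = [" ".join(p.split()).lower() for p in text.split("\n\n") if p.strip()]
    hits = []
    for (i, a), (j, b) in combinations(enumerate(paragraphs), 2):
        ratio = 1.0 if a == b else SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            hits.append((i, j, round(ratio, 2)))
    return hits
```

Comparing every pair is quadratic in the number of paragraphs, which is fine for single documents but would need hashing or shingling to scale across a large corpus.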

CLI Options

doculens run <file> [OPTIONS]
  --screeners, -s    Comma-separated screener names (auto-includes pii/grammar if installed)
  --format, -f       Report format: json or html (default: json)
  --output, -o       Save report to file
  --min-words        Minimum word count threshold (default: 50)
  --lang             Expected language code, e.g. "en"
  --dup-threshold    Similarity threshold for duplicate detection (default: 0.8)
  --verbose, -v      Show warnings in output

doculens batch <folder> [OPTIONS]
  --screeners, -s    Comma-separated screener names
  --format, -f       Report format: json or html (default: json)
  --output-dir, -o   Directory to save individual reports
  --min-words        Minimum word count threshold (default: 50)
  --recursive, -r    Scan subdirectories
  --dup-threshold    Similarity threshold for duplicate detection (default: 0.8)
  --verbose, -v      Show warnings in output

doculens redact <file> [OPTIONS]
  --output, -o       Save redacted text to file (default: print to stdout)

Sample CLI output

doculens — screening report.pdf

╭────────────┬────────┬──────────────────────┬──────────────────────────────────╮
│ Screener   │ Status │ Score                │ Summary                          │
├────────────┼────────┼──────────────────────┼──────────────────────────────────┤
│ Readability│  PASS  │ ████████████     72% │ Standard — suitable for most     │
│            │        │                      │ business documents               │
│ Word Count │  PASS  │ ████████████    100% │ 1,243 words, 48 lines            │
│ Language   │  PASS  │ ████████████    100% │ English (100% confident)         │
│ PII        │  FAIL  │ ██████           42% │ 3 PII items: 2 emails, 1 name    │
╰────────────┴────────┴──────────────────────┴──────────────────────────────────╯
╭──────────────────────────────────────────────────────────────╮
│ Overall: FAILED  |  Words: 1,243  |  Screeners: 3/4 passed  │
╰──────────────────────────────────────────────────────────────╯

Supported formats

| Format | Extensions | Library |
|---|---|---|
| PDF | .pdf | pdfplumber |
| Word | .docx | python-docx |
| HTML | .html, .htm | beautifulsoup4 |
| Plain text | .txt | built-in |

Contributing

Contributions are welcome. Please open an issue first to discuss what you would like to change.

License

MIT
