A tool for processing and redacting PDFs based on target words using OCR.

These details have not been verified by PyPI

Project links

Project description

SpectrePDF: Python PDF Redaction and Annotation Tool

Overview

This project provides a Python-based tool for processing PDF documents to detect, annotate, and redact specific text using Optical Character Recognition (OCR) and image processing techniques. It leverages libraries like PyMuPDF, Pillow (PIL), pytesseract, and img2pdf to identify target words, group them into lines, merge adjacent target word boxes, and optionally redact and replace text in the PDF. The tool is designed to be flexible, allowing users to customize the redaction process, visualize detected text with bounding boxes, and save the output as a new PDF.

Features

Text Detection: Uses Tesseract OCR to identify text and their bounding boxes in PDF pages rendered as images.
Target Word Identification: Detects specified target words (case-insensitive) in the PDF content.
Line Grouping: Groups words into lines based on their vertical proximity, using a threshold derived from the median word height.
Merged Bounding Boxes: Combines adjacent target words into a single bounding box for consistent redaction or annotation.
Redaction and Replacement: Optionally redacts target words by covering them with a white rectangle and replacing them with user-specified text from a redaction dictionary.
Bounding Box Visualization: Draws colored bounding boxes around detected text (blue for target words, red for non-target words, or black for all boxes when specified).
Font Size Estimation: Dynamically estimates the appropriate font size to fit replacement text within the redacted area.
Output Generation: Converts processed images back into a PDF file.
Customizable Parameters: Allows users to toggle redaction, choose whether to show all or only target boxes, and customize target words and replacement text.

Dependencies

PyMuPDF (pymupdf): For PDF handling and rendering pages as images.
Pillow (PIL): For image processing and drawing.
pytesseract: For OCR to extract text and bounding box data.
img2pdf: For converting processed images back to PDF.
statistics: For calculating median word height to group words into lines.

Install dependencies using:

pip install pymupdf Pillow pytesseract img2pdf

Additionally, you need to have Tesseract-OCR installed on your system and specify its path in the script (e.g., C:\Program Files\Tesseract-OCR\tesseract.exe for Windows).

Usage

Prepare Input PDF: Ensure you have an input PDF file (e.g., input.pdf) to process.
Configure Parameters:
- target_words: List of words to detect (case-insensitive).
- redaction_dict: Dictionary mapping target words to their replacement text.
- show_all_boxes: Set to True to draw black boxes around all detected words, or False to use blue (target) and red (non-target) boxes.
- only_target_boxes: Set to True to draw boxes only around target words.
- redact_targets: Set to True to redact target words and replace them with text from redaction_dict.
Run the Script:
```
python script.py
```
The script processes the input PDF, applies the specified operations, and saves the output to output_pdf (e.g., redacted.pdf).
Output: The processed PDF will contain annotated or redacted text as per the configuration.

Example

target_words = ["first_name", "last_name", "name1", "name2", "(alphanumeric)", "name3", "name4", "group", "name"]
redaction_dict = {
    "first_name": "Homer",
    "last_name": "Simpson",
    "(alphanumeric)": "(ABC123456)",
    "name1": "The King",
    "name2": "of England",
    "name3": "Yobbo",
    "name4": "Muppet",
    "group": "123",
    "name": "456",
}
process_pdf(
    input_pdf='input.pdf',
    output_pdf='redacted.pdf',
    target_words=target_words,
    redaction_dict=redaction_dict,
    show_all_boxes=False,
    only_target_boxes=False,
    redact_targets=True
)

This configuration redacts target words in input.pdf and replaces them with corresponding values from redaction_dict, saving the result to redacted.pdf.

Key Functions

estimate_font_size_for_phrase(phrase, box_width, box_height, font_path, max_iterations): Estimates the optimal font size to fit a replacement phrase within a bounding box, falling back to a default font if the specified font is unavailable.
process_pdf(input_pdf, output_pdf, target_words, redaction_dict, show_all_boxes, only_target_boxes, redact_targets): Main function to process the PDF, perform OCR, group words, merge boxes, and apply redaction or annotation.

Performance

The script measures and prints the processing time for the entire operation.
High DPI (500) is used for rendering PDF pages to ensure accurate OCR results, which may increase processing time for large documents.

Limitations

Requires Tesseract-OCR to be installed and properly configured.
Font size estimation may not always perfectly fit complex phrases due to variations in font metrics.
The tool assumes the input PDF contains text that can be accurately detected by OCR.
Only supports TrueType fonts or the default PIL font for text replacement.

Future Improvements

Add support for custom font paths and multiple font options.
Optimize OCR performance for large PDFs by processing pages in parallel.
Enhance line grouping logic to handle varied text layouts (e.g., multi-column documents).
Add support for regex-based target word matching.
Provide a GUI for easier configuration and preview of results.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.2.1

Jul 14, 2025

0.2.0

Jul 14, 2025

This version

0.1.0

Jul 14, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

spectrepdf-0.1.0.tar.gz (6.5 kB view details)

Uploaded Jul 14, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

spectrepdf-0.1.0-py3-none-any.whl (7.0 kB view details)

Uploaded Jul 14, 2025 Python 3

File details

Details for the file spectrepdf-0.1.0.tar.gz.

File metadata

Download URL: spectrepdf-0.1.0.tar.gz
Upload date: Jul 14, 2025
Size: 6.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.4

File hashes

Hashes for spectrepdf-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`9d04ef17358f6e2282c1cf85039e4ba1e0c3665b50b05c3334e06c889227b479`
MD5	`095b3d1e5340f227fd231367a3a5edf8`
BLAKE2b-256	`a89f6480b502603ec3bc0d68256ce0c1c99d3ecfafc3e1f7319576f3704ab219`

See more details on using hashes here.

File details

Details for the file spectrepdf-0.1.0-py3-none-any.whl.

File metadata

Download URL: spectrepdf-0.1.0-py3-none-any.whl
Upload date: Jul 14, 2025
Size: 7.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.4

File hashes

Hashes for spectrepdf-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`225c294227510140e0e2c7b87a6df93bba791e157f41819bd4169e850bd14881`
MD5	`421e04e8f20b42f001e11298ff79b329`
BLAKE2b-256	`cfad4e06ce164827ccc5f7c6e8d3fa71ab2f70ee0680b5c86222d678ae15b247`

See more details on using hashes here.

SpectrePDF 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

SpectrePDF: Python PDF Redaction and Annotation Tool

Overview

Features

Dependencies

Usage

Example

Key Functions

Performance

Limitations

Future Improvements

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes