A tool for processing and redacting PDFs based on target words using OCR.
Project description
SpectrePDF: Python PDF Redaction and Annotation Tool
Overview
This project provides a Python-based tool for processing PDF documents to detect, annotate, and redact specific text using Optical Character Recognition (OCR) and image processing techniques. It leverages libraries like PyMuPDF, Pillow (PIL), pytesseract, and img2pdf to identify target words, group them into lines, merge adjacent target word boxes, and optionally redact and replace text in the PDF. The tool is designed to be flexible, allowing users to customize the redaction process, visualize detected text with bounding boxes, and save the output as a new PDF.
Features
- Text Detection: Uses Tesseract OCR to identify words and their bounding boxes in PDF pages rendered as images.
- Target Word Identification: Detects specified target words (case-insensitive) in the PDF content.
- Line Grouping: Groups words into lines based on their vertical proximity, using a threshold derived from the median word height.
- Merged Bounding Boxes: Combines adjacent target words into a single bounding box for consistent redaction or annotation.
- Redaction and Replacement: Optionally redacts target words by covering them with a white rectangle and replacing them with user-specified text from a redaction dictionary.
- Bounding Box Visualization: Draws colored bounding boxes around detected text (blue for target words, red for non-target words, or black for all boxes when specified).
- Font Size Estimation: Dynamically estimates the appropriate font size to fit replacement text within the redacted area.
- Output Generation: Converts processed images back into a PDF file.
- Customizable Parameters: Allows users to toggle redaction, choose whether to show all or only target boxes, and customize target words and replacement text.
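The line grouping and box merging features above can be sketched as follows. This is a minimal illustration, not the package's actual code: the Word tuple, the function names, and the 0.5 multiplier on the median height are all assumptions chosen for the example. The field names (left, top, width, height, text) mirror the per-word data that pytesseract's image_to_data returns.

```python
from statistics import median
from typing import List, NamedTuple, Set, Tuple


class Word(NamedTuple):
    """Hypothetical container for one OCR word and its bounding box."""
    text: str
    left: int
    top: int
    width: int
    height: int


def group_into_lines(words: List[Word], factor: float = 0.5) -> List[List[Word]]:
    """Group words whose vertical centers lie within factor * median height."""
    if not words:
        return []
    threshold = median(w.height for w in words) * factor

    def center(w: Word) -> float:
        return w.top + w.height / 2

    lines: List[List[Word]] = []
    for word in sorted(words, key=center):
        # Compare against the first word of the current line (a simplification).
        if lines and abs(center(word) - center(lines[-1][0])) <= threshold:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Restore left-to-right reading order inside each line.
    return [sorted(line, key=lambda w: w.left) for line in lines]


def merge_adjacent_targets(line: List[Word], targets: Set[str]) -> List[Tuple[int, int, int, int]]:
    """Merge runs of consecutive target words into (left, top, right, bottom) boxes."""
    boxes: List[Tuple[int, int, int, int]] = []
    run: List[Word] = []
    for word in line + [None]:  # None is a sentinel that flushes the last run
        if word is not None and word.text.lower() in targets:
            run.append(word)
            continue
        if run:
            boxes.append((
                min(w.left for w in run),
                min(w.top for w in run),
                max(w.left + w.width for w in run),
                max(w.top + w.height for w in run),
            ))
            run = []
    return boxes


words = [
    Word("Dear", 10, 10, 40, 12),
    Word("first_name", 60, 11, 80, 12),
    Word("last_name", 150, 9, 70, 12),
    Word("welcome", 10, 40, 60, 12),
]
lines = group_into_lines(words)
print(merge_adjacent_targets(lines[0], {"first_name", "last_name"}))
```

Merging a run of targets into one box means a multi-word replacement such as "Homer Simpson" is drawn into a single rectangle rather than two separate ones.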
Dependencies
- PyMuPDF (pymupdf): For PDF handling and rendering pages as images.
- Pillow (PIL): For image processing and drawing.
- pytesseract: For OCR to extract text and bounding box data.
- img2pdf: For converting processed images back to PDF.
- statistics (Python standard library, no installation needed): For calculating median word height to group words into lines.
Install dependencies using:
pip install pymupdf Pillow pytesseract img2pdf
Additionally, you need to have Tesseract-OCR installed on your system and specify its path in the script (e.g., C:\Program Files\Tesseract-OCR\tesseract.exe for Windows).
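One way to locate the Tesseract executable and hand it to pytesseract is sketched below. The helper names find_tesseract and configure_pytesseract are hypothetical and are not part of SpectrePDF. The only library attribute used is pytesseract.pytesseract.tesseract_cmd, which pytesseract reads to find the binary.

```python
import shutil
from pathlib import Path
from typing import Optional

# Default install location from the note above (Windows only).
WINDOWS_DEFAULT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(explicit_path: Optional[str] = None) -> Optional[str]:
    """Return a usable path to the tesseract executable, or None if not found."""
    if explicit_path and Path(explicit_path).is_file():
        return explicit_path
    on_path = shutil.which("tesseract")  # works when Tesseract is on PATH
    if on_path:
        return on_path
    if Path(WINDOWS_DEFAULT).is_file():
        return WINDOWS_DEFAULT
    return None


def configure_pytesseract(explicit_path: Optional[str] = None) -> bool:
    """Point pytesseract at the executable; return False if none was found."""
    path = find_tesseract(explicit_path)
    if path is None:
        return False
    import pytesseract  # imported lazily so the lookup works without it installed
    pytesseract.pytesseract.tesseract_cmd = path
    return True
```

Calling configure_pytesseract() once before any OCR call gives an early, explicit signal when Tesseract is missing, instead of an error on the first page.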
Usage
- Prepare Input PDF: Ensure you have an input PDF file (e.g., input.pdf) to process.
- Configure Parameters:
  - target_words: List of words to detect (case-insensitive).
  - redaction_dict: Dictionary mapping target words to their replacement text.
  - show_all_boxes: Set to True to draw black boxes around all detected words, or False to use blue (target) and red (non-target) boxes.
  - only_target_boxes: Set to True to draw boxes only around target words.
  - redact_targets: Set to True to redact target words and replace them with text from redaction_dict.
- Run the Script:
  python script.py
  The script processes the input PDF, applies the specified operations, and saves the output to output_pdf (e.g., redacted.pdf).
- Output: The processed PDF will contain annotated or redacted text as per the configuration.
Example
target_words = ["first_name", "last_name", "name1", "name2", "(alphanumeric)", "name3", "name4", "group", "name"]
redaction_dict = {
    "first_name": "Homer",
    "last_name": "Simpson",
    "(alphanumeric)": "(ABC123456)",
    "name1": "The King",
    "name2": "of England",
    "name3": "Yobbo",
    "name4": "Muppet",
    "group": "123",
    "name": "456",
}
process_pdf(
    input_pdf='input.pdf',
    output_pdf='redacted.pdf',
    target_words=target_words,
    redaction_dict=redaction_dict,
    show_all_boxes=False,
    only_target_boxes=False,
    redact_targets=True
)
This configuration redacts target words in input.pdf and replaces them with corresponding values from redaction_dict, saving the result to redacted.pdf.
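How a detected word is mapped to its replacement is not spelled out here, so the following is only a sketch of one plausible approach. It assumes whole-token, case-insensitive matching, and the helper names build_lookup and replacement_for are hypothetical. Under that assumption "name" and "name1" are distinct keys, and a key such as "(alphanumeric)" only matches when the OCR token includes the parentheses.

```python
from typing import Dict, Optional


def build_lookup(redaction_dict: Dict[str, str]) -> Dict[str, str]:
    """Lower-case the keys once so later lookups ignore case."""
    return {key.lower(): value for key, value in redaction_dict.items()}


def replacement_for(ocr_word: str, lookup: Dict[str, str]) -> Optional[str]:
    """Return the replacement for an OCR token, or None if it is not a target."""
    return lookup.get(ocr_word.lower())


lookup = build_lookup({"first_name": "Homer", "name": "456", "name1": "The King"})
for token in ["First_Name", "NAME1", "name", "unrelated"]:
    print(token, "->", replacement_for(token, lookup))
```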
Key Functions
- estimate_font_size_for_phrase(phrase, box_width, box_height, font_path, max_iterations): Estimates the optimal font size to fit a replacement phrase within a bounding box, falling back to a default font if the specified font is unavailable.
- process_pdf(input_pdf, output_pdf, target_words, redaction_dict, show_all_boxes, only_target_boxes, redact_targets): Main function to process the PDF, perform OCR, group words, merge boxes, and apply redaction or annotation.
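Font size fitting of this kind is often done with a bounded search over candidate sizes. The sketch below uses a binary search, which is an assumption and may differ from what estimate_font_size_for_phrase actually does. The names fit_font_size and pillow_measure are hypothetical. The search takes a measuring function as an argument so the logic can be tried without a font file. The Pillow-backed measurer uses ImageFont.truetype, ImageFont.load_default, and the font's getbbox method.

```python
from typing import Callable, Tuple

Measure = Callable[[int], Tuple[int, int]]  # font size -> (text width, text height)


def fit_font_size(measure: Measure, box_width: int, box_height: int,
                  lo: int = 1, hi: int = 200, max_iterations: int = 20) -> int:
    """Binary-search the largest size in [lo, hi] whose text fits the box.

    Returns lo if no tested size fits.
    """
    best = lo
    for _ in range(max_iterations):
        if lo > hi:
            break
        mid = (lo + hi) // 2
        width, height = measure(mid)
        if width <= box_width and height <= box_height:
            best = mid      # fits: remember it and try larger sizes
            lo = mid + 1
        else:
            hi = mid - 1    # too big: try smaller sizes
    return best


def pillow_measure(phrase: str, font_path: str) -> Measure:
    """Build a measuring function backed by Pillow (requires Pillow installed)."""
    from PIL import ImageFont  # imported lazily; only needed for real fonts

    def measure(size: int) -> Tuple[int, int]:
        try:
            font = ImageFont.truetype(font_path, size)
        except OSError:
            # Font file unavailable: fall back to Pillow's default font,
            # which may ignore the requested size.
            font = ImageFont.load_default()
        left, top, right, bottom = font.getbbox(phrase)
        return right - left, bottom - top

    return measure


# Stand-in metric for illustration: 3 px of width and 1 px of height per size unit.
print(fit_font_size(lambda size: (3 * size, size), box_width=90, box_height=40))
```

With a real font, the call would look like fit_font_size(pillow_measure("Homer", "arial.ttf"), box_width, box_height), where "arial.ttf" is only a placeholder path.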
Performance
- The script measures and prints the processing time for the entire operation.
- Pages are rendered at a high resolution (500 DPI) to improve OCR accuracy, which may increase processing time for large documents.
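To see why 500 DPI is costly, it helps to work out the rendered image size. PDF page dimensions are expressed in points (1/72 inch), so a page is scaled by dpi / 72 when rasterized. The sketch below does that arithmetic and shows one way to time a call with time.perf_counter. The helper names are hypothetical, and the 3 bytes per pixel figure assumes an uncompressed RGB image.

```python
import time
from typing import Any, Callable, Tuple

POINTS_PER_INCH = 72  # PDF page sizes are given in points


def rendered_size(width_pt: float, height_pt: float, dpi: int = 500) -> Tuple[int, int]:
    """Pixel dimensions of a page of the given point size rendered at dpi."""
    return (round(width_pt * dpi / POINTS_PER_INCH),
            round(height_pt * dpi / POINTS_PER_INCH))


def timed(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# US Letter is 612 x 792 points (8.5 x 11 inches).
(width_px, height_px), elapsed = timed(rendered_size, 612, 792, dpi=500)
rgb_bytes = width_px * height_px * 3  # uncompressed RGB, 3 bytes per pixel
print(f"{width_px} x {height_px} px, about {rgb_bytes / 1e6:.1f} MB per page as raw RGB")
print(f"computed in {elapsed:.6f} s")
```

A US Letter page at 500 DPI comes to 4250 x 5500 pixels, roughly 70 MB of raw RGB data per page, which is why processing time grows quickly with page count.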
Limitations
- Requires Tesseract-OCR to be installed and properly configured.
- Font size estimation may not always perfectly fit complex phrases due to variations in font metrics.
- The tool assumes the input PDF contains text that can be accurately detected by OCR.
- Only supports TrueType fonts or the default PIL font for text replacement.
Future Improvements
- Add support for custom font paths and multiple font options.
- Optimize OCR performance for large PDFs by processing pages in parallel.
- Enhance line grouping logic to handle varied text layouts (e.g., multi-column documents).
- Add support for regex-based target word matching.
- Provide a GUI for easier configuration and preview of results.
License
This project is licensed under the MIT License. See the LICENSE file for details.
File details
Details for the file spectrepdf-0.1.0.tar.gz.
File metadata
- Download URL: spectrepdf-0.1.0.tar.gz
- Upload date:
- Size: 6.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9d04ef17358f6e2282c1cf85039e4ba1e0c3665b50b05c3334e06c889227b479 |
| MD5 | 095b3d1e5340f227fd231367a3a5edf8 |
| BLAKE2b-256 | a89f6480b502603ec3bc0d68256ce0c1c99d3ecfafc3e1f7319576f3704ab219 |
File details
Details for the file spectrepdf-0.1.0-py3-none-any.whl.
File metadata
- Download URL: spectrepdf-0.1.0-py3-none-any.whl
- Upload date:
- Size: 7.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 225c294227510140e0e2c7b87a6df93bba791e157f41819bd4169e850bd14881 |
| MD5 | 421e04e8f20b42f001e11298ff79b329 |
| BLAKE2b-256 | cfad4e06ce164827ccc5f7c6e8d3fa71ab2f70ee0680b5c86222d678ae15b247 |