A library for processing PDFs with OCR and masking sensitive information
PDF Masking Library
pdf-masking-library is a Python library that uses Optical Character Recognition (OCR) to find sensitive information in PDF files and mask it. It supports predefined patterns such as Aadhaar numbers and PAN numbers, as well as custom patterns provided by the user.
A Simple Example
import base64
from pdf_masking_library import process_pdf
base64_pdf_input = "Your base64 here"
custom_pattern = [r"\b\d{2}\b"]
psm = 6 # Default PSM is 6
lang = "eng+kan" # Default OCR language is English and Kannada
aadhar = True # Enable Aadhaar masking
pan = True # Enable PAN masking
base64_pdf_output = process_pdf(base64_pdf_input, custom_pattern=custom_pattern, psm=psm, lang=lang, aadhar=aadhar, pan=pan)
# Save the masked PDF to a file
with open("masked_output.pdf", "wb") as output_file:
    output_file.write(base64.b64decode(base64_pdf_output))
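The example above starts from a base64 string, so a PDF on disk has to be encoded first. The helpers below are a minimal sketch of that round trip using only the standard library. The names `pdf_file_to_base64` and `base64_to_pdf_file` are hypothetical and not part of pdf-masking-library, and the sketch assumes `process_pdf` accepts and returns a base64 text string, as the example suggests.

```python
import base64
from pathlib import Path


def pdf_file_to_base64(path):
    """Read a PDF from disk and return its contents as a base64 string."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")


def base64_to_pdf_file(data, path):
    """Decode a base64 string and write the resulting bytes to disk."""
    Path(path).write_bytes(base64.b64decode(data))
```

With these in place, `base64_pdf_input = pdf_file_to_base64("input.pdf")` could replace the placeholder string, and `base64_to_pdf_file(base64_pdf_output, "masked_output.pdf")` could replace the final `with` block.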
Masking Information
The library allows independent control over masking of Aadhaar numbers, PAN numbers, and custom patterns:
- Aadhaar Numbers: 12-digit Indian identification numbers (enabled using --aadhar).
- PAN Numbers: 10-character alphanumeric Permanent Account Numbers (enabled using --pan).
- Custom Patterns: User-defined patterns written as regular expressions (enabled using --custom-pattern).
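To make the three categories concrete, the sketch below shows regular expressions that match the formats described above. These are illustrative assumptions, not the patterns the library uses internally: the Aadhaar pattern assumes the 12 digits may be written in groups of four separated by a space or hyphen, and the PAN pattern assumes the standard layout of five letters, four digits, and one letter. `find_sensitive` is a hypothetical helper.

```python
import re

# Illustrative patterns only; the library's built-in patterns may differ.
AADHAAR_RE = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}\b")
PAN_RE = re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b")


def find_sensitive(text):
    """Return the Aadhaar-like and PAN-like substrings found in text."""
    return {
        "aadhaar": AADHAAR_RE.findall(text),
        "pan": PAN_RE.findall(text),
    }


sample = "Aadhaar: 1234 5678 9012, PAN: ABCDE1234F, ref 42"
matches = find_sensitive(sample)
```

A pattern written this way can be tested against sample text before it is passed to the library as a custom pattern.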
Command-Line Interface (CLI)
The library includes a CLI tool for easy integration into scripts and workflows.
- Mask Aadhaar Numbers:
  python -m pdf_masking_library input.pdf output.pdf --aadhar
- Mask PAN Numbers:
  python -m pdf_masking_library input.pdf output.pdf --pan
- Mask Using Custom Patterns:
  python -m pdf_masking_library input.pdf output.pdf --custom-pattern "\b\d{2}\b"
- Specify OCR Page Segmentation Mode (psm):
  python -m pdf_masking_library input.pdf output.pdf --psm 3
  Default is 6 if not specified.
- Specify OCR Language (lang):
  python -m pdf_masking_library input.pdf output.pdf --lang eng+kan+tel
  Default is eng+kan. Multiple languages should be separated using +.
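When the CLI is driven from a Python script, the command can be assembled as an argument list and run with `subprocess`. The sketch below uses only the flags listed above. `build_mask_command` and `run_mask` are hypothetical helpers, and the sketch assumes the flags can be combined in a single invocation and that the CLI exits with a non-zero status on failure, neither of which is confirmed here.

```python
import subprocess
import sys


def build_mask_command(input_pdf, output_pdf, aadhar=False, pan=False,
                       custom_pattern=None, psm=None, lang=None):
    """Assemble the argument list for the CLI described above."""
    cmd = [sys.executable, "-m", "pdf_masking_library", input_pdf, output_pdf]
    if aadhar:
        cmd.append("--aadhar")
    if pan:
        cmd.append("--pan")
    if custom_pattern is not None:
        cmd += ["--custom-pattern", custom_pattern]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    if lang is not None:
        cmd += ["--lang", lang]
    return cmd


def run_mask(*args, **kwargs):
    """Run the CLI; check=True raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_mask_command(*args, **kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting problems with patterns such as `\b\d{2}\b`, because no shell interprets the backslashes.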
File details
Details for the file pdf_masking_library-0.1.7-py3-none-any.whl.
File metadata
- Download URL: pdf_masking_library-0.1.7-py3-none-any.whl
- Upload date:
- Size: 10.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.13.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 04d786d7194eab310584dd91f8670680a9e30c2dad0bcb54a16d5d1edbd150d3 |
| MD5 | 57a7d108196c2ea62ac51ac637371c01 |
| BLAKE2b-256 | bc8101684bdfd2cd64b92c4cefb7b4bfd0b912b2a448c1cab926d176a343855c |