
A library for processing PDFs with OCR and masking sensitive information

PDF Masking Library

pdf-masking-library is a Python library that uses Optical Character Recognition (OCR) to find and mask sensitive information in PDF files. It supports predefined patterns such as Aadhaar numbers and PAN numbers, as well as custom patterns provided by the user.

A Simple Example

import base64
from pdf_masking_library import process_pdf

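base64_pdf_input = "Your base64 here"  # Base64-encoded contents of the input PDF
custom_pattern = [r"\b\d{2}\b"]  # List of regexes; this one matches any standalone two-digit number
psm = 6  # OCR page segmentation mode (default is 6)
lang = "eng+kan"  # OCR languages (default is English and Kannada)
aadhar = True  # Enable Aadhaar masking
pan = True  # Enable PAN masking

# process_pdf returns the masked PDF as a base64 string
base64_pdf_output = process_pdf(
    base64_pdf_input,
    custom_pattern=custom_pattern,
    psm=psm,
    lang=lang,
    aadhar=aadhar,
    pan=pan,
)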

# Save the masked PDF to a file
with open("masked_output.pdf", "wb") as output_file:
    output_file.write(base64.b64decode(base64_pdf_output))
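
The example above expects the input PDF as a base64 string. A minimal sketch of producing that string from a file on disk, using only the standard library, could look like the following. The helper names pdf_file_to_base64 and base64_to_pdf_file are hypothetical and are not part of pdf-masking-library.

```python
import base64
from pathlib import Path


def pdf_file_to_base64(path):
    """Read a PDF from disk and return its contents as a base64 string."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")


def base64_to_pdf_file(data, path):
    """Decode a base64 string and write the resulting bytes to disk."""
    Path(path).write_bytes(base64.b64decode(data))
```

With these helpers, base64_pdf_input could be set to pdf_file_to_base64("input.pdf"), and the result of process_pdf could be saved with base64_to_pdf_file(base64_pdf_output, "masked_output.pdf").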

Masking Information

The library allows independent control over masking of Aadhaar numbers, PAN numbers, and custom patterns:

  • Aadhaar Numbers: 12-digit Indian identification numbers (enabled with aadhar=True in Python or --aadhar on the command line).
  • PAN Numbers: 10-character alphanumeric Permanent Account Numbers (enabled with pan=True in Python or --pan on the command line).
  • Custom Patterns: User-defined patterns written as regular expressions (passed as custom_pattern in Python or --custom-pattern on the command line).
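
To give a sense of what these patterns look like as regular expressions, here is a rough sketch. It does not reproduce the library's internal patterns, which are not documented here. It assumes Aadhaar numbers appear as 12 digits, optionally split into groups of four by a space or hyphen, and that PAN numbers follow the standard layout of five uppercase letters, four digits, and one uppercase letter. The names AADHAAR_RE, PAN_RE, and find_sensitive are hypothetical.

```python
import re

# Illustrative patterns only; NOT the library's own definitions.
# Assumption: 12 digits, optionally grouped as 4-4-4 with a space or hyphen.
AADHAAR_RE = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}\b")
# Standard PAN layout: 5 uppercase letters, 4 digits, 1 uppercase letter.
PAN_RE = re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b")


def find_sensitive(text):
    """Return (label, matched_text) pairs, grouped by pattern."""
    found = []
    for label, pattern in (("aadhaar", AADHAAR_RE), ("pan", PAN_RE)):
        for match in pattern.finditer(text):
            found.append((label, match.group(0)))
    return found
```

A custom pattern such as r"\b\d{2}\b" works the same way: it is an ordinary regular expression applied to the text recovered by OCR.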

Command-Line Interface (CLI)

The library includes a CLI tool for easy integration into scripts and workflows.

  • Mask Aadhaar Numbers:

    python -m pdf_masking_library input.pdf output.pdf --aadhar

  • Mask PAN Numbers:

    python -m pdf_masking_library input.pdf output.pdf --pan

  • Mask Using Custom Patterns:

    python -m pdf_masking_library input.pdf output.pdf --custom-pattern "\b\d{2}\b"

  • Specify OCR Page Segmentation Mode (psm):

    python -m pdf_masking_library input.pdf output.pdf --psm 3

    Default is 6 if not specified.

  • Specify OCR Language (lang):

    python -m pdf_masking_library input.pdf output.pdf --lang eng+kan+tel

    Default is eng+kan. Multiple languages should be separated using +.
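
When calling the CLI from another Python program, building the argument list in code avoids shell-quoting problems with regular expressions. The sketch below uses only the flags documented above. It assumes that several flags can be combined in one invocation, which the examples above do not show explicitly. The function name build_mask_command is hypothetical.

```python
import sys


def build_mask_command(input_pdf, output_pdf, aadhar=False, pan=False,
                       custom_pattern=None, psm=None, lang=None):
    """Build an argument list for `python -m pdf_masking_library`."""
    cmd = [sys.executable, "-m", "pdf_masking_library", input_pdf, output_pdf]
    if aadhar:
        cmd.append("--aadhar")
    if pan:
        cmd.append("--pan")
    if custom_pattern is not None:
        # Passed as a separate list item, so no shell escaping is needed.
        cmd += ["--custom-pattern", custom_pattern]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    if lang is not None:
        cmd += ["--lang", lang]
    return cmd
```

The resulting list can be handed to subprocess.run(cmd, check=True), which raises an error if the command exits with a non-zero status.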
