A library for processing PDFs with OCR and masking sensitive information

PDF Masking Library

pdf-masking-library is a Python library that masks sensitive information in PDF files, using Optical Character Recognition (OCR) to find the text to mask. It can mask predefined patterns (Aadhaar numbers and PAN numbers) as well as custom patterns supplied by the user.

A Simple Example

import base64
from pdf_masking_library import process_pdf

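# Read the source PDF and encode it as base64 text
with open("input.pdf", "rb") as input_file:
    base64_pdf_input = base64.b64encode(input_file.read()).decode("ascii")
custom_pattern = [r"\b\d{2}\b"]  # Also mask any standalone two-digit number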
psm = 6  # Default PSM is 6
lang = "eng+kan"  # Default OCR language is English and Kannada
aadhar = True  # Enable Aadhaar masking
pan = True  # Enable PAN masking

base64_pdf_output = process_pdf(base64_pdf_input, custom_pattern=custom_pattern, psm=psm, lang=lang, aadhar=aadhar, pan=pan)

# Save the masked PDF to a file
with open("masked_output.pdf", "wb") as output_file:
    output_file.write(base64.b64decode(base64_pdf_output))
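Because a custom pattern is applied to whatever text OCR recovers, it can help to try the pattern on sample text before processing a PDF. The sketch below uses only Python's standard re module. It assumes the library matches patterns with ordinary re semantics, which this description does not confirm, and preview_matches is a hypothetical helper that is not part of pdf_masking_library.

```python
import re


def preview_matches(patterns, text):
    """Return every substring of text matched by any pattern, in pattern order."""
    matches = []
    for pattern in patterns:
        matches.extend(m.group(0) for m in re.finditer(pattern, text))
    return matches


# Hypothetical sample text standing in for OCR output
sample_text = "Age 42, PIN 560001, flat 7, block 12"
print(preview_matches([r"\b\d{2}\b"], sample_text))
```

With this sample text the pattern picks out only the standalone two-digit numbers 42 and 12. The six-digit PIN and the single digit are left alone because of the \b word boundaries.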

Masking Information

The library allows independent control over masking of Aadhaar numbers, PAN numbers, and custom patterns:

  • Aadhaar Numbers: 12-digit Indian identification numbers (enabled with --aadhar on the command line, or aadhar=True in Python).
  • PAN Numbers: 10-character alphanumeric Permanent Account Numbers (enabled with --pan on the command line, or pan=True in Python).
  • Custom Patterns: User-defined regular expressions (supplied with --custom-pattern on the command line, or custom_pattern=[...] in Python).
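To make the three switches concrete, here is a minimal text-only sketch of how independently enabled patterns could be applied. It is not the library's implementation: the library works on PDF pages through OCR, while this operates on a plain string. The two regular expressions are assumptions based on the common written forms (Aadhaar as 12 digits, optionally grouped 4-4-4; PAN as five letters, four digits, one letter), and mask_text is a hypothetical name.

```python
import re

# Assumed formats, not taken from the library's source:
# Aadhaar: 12 digits, optionally grouped 4-4-4 with spaces or hyphens.
AADHAAR_RE = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}\b")
# PAN: five uppercase letters, four digits, one uppercase letter.
PAN_RE = re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b")


def mask_text(text, aadhar=False, pan=False, custom_pattern=()):
    """Replace each enabled match with X characters of the same length."""
    active = []
    if aadhar:
        active.append(AADHAAR_RE)
    if pan:
        active.append(PAN_RE)
    active.extend(re.compile(p) for p in custom_pattern)
    for regex in active:
        text = regex.sub(lambda m: "X" * len(m.group(0)), text)
    return text


line = "Aadhaar 1234 5678 9012, PAN ABCDE1234F"
print(mask_text(line, aadhar=True))  # only the Aadhaar number is masked
print(mask_text(line, pan=True))     # only the PAN is masked
```

Each flag adds its own pattern to the active set, so enabling one kind of masking leaves the others untouched.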

Command-Line Interface (CLI)

The library includes a CLI tool for easy integration into scripts and workflows.

  • Mask Aadhaar Numbers:

    python -m pdf_masking_library input.pdf output.pdf --aadhar

  • Mask PAN Numbers:

    python -m pdf_masking_library input.pdf output.pdf --pan

  • Mask Using Custom Patterns:

    python -m pdf_masking_library input.pdf output.pdf --custom-pattern "\b\d{2}\b"

  • Specify OCR Page Segmentation Mode (psm):

    python -m pdf_masking_library input.pdf output.pdf --psm 3

    Default is 6 if not specified.

  • Specify OCR Language (lang):

    python -m pdf_masking_library input.pdf output.pdf --lang eng+kan+tel

    Default is eng+kan. Multiple languages should be separated using +.
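When calling the CLI from another Python script, building the argument list programmatically avoids shell-quoting problems with regular expressions. The sketch below uses only the flags listed above. It assumes the flags can be combined in a single invocation, which the examples do not show explicitly, and build_mask_command is a hypothetical helper rather than part of the library.

```python
import sys


def build_mask_command(input_pdf, output_pdf, aadhar=False, pan=False,
                       custom_pattern=None, psm=None, lang=None):
    """Build the argument list for `python -m pdf_masking_library`."""
    command = [sys.executable, "-m", "pdf_masking_library", input_pdf, output_pdf]
    if aadhar:
        command.append("--aadhar")
    if pan:
        command.append("--pan")
    if custom_pattern is not None:
        command += ["--custom-pattern", custom_pattern]
    if psm is not None:
        command += ["--psm", str(psm)]
    if lang is not None:
        command += ["--lang", lang]
    return command


command = build_mask_command("input.pdf", "output.pdf",
                             aadhar=True, pan=True, psm=3, lang="eng+kan+tel")
print(command[1:])
# To run it: import subprocess; subprocess.run(command, check=True)
```

Passing the list to subprocess.run without shell=True hands the pattern to the CLI exactly as written, so backslashes such as \b need no extra escaping.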

Built Distribution

pdf_masking_library-0.1.7-py3-none-any.whl (10.2 kB, Python 3)

Hashes for pdf_masking_library-0.1.7-py3-none-any.whl

  Algorithm    Hash digest
  SHA256       04d786d7194eab310584dd91f8670680a9e30c2dad0bcb54a16d5d1edbd150d3
  MD5          57a7d108196c2ea62ac51ac637371c01
  BLAKE2b-256  bc8101684bdfd2cd64b92c4cefb7b4bfd0b912b2a448c1cab926d176a343855c