Remove watermarks from PDF using Otsu threshold segmentation and OpenCV inpaint

PDF Watermark Removal Tool

A command-line tool that removes watermarks from PDF files using image processing techniques, including adaptive thresholding, color detection, and OpenCV inpainting. It features interactive watermark color selection and a Rich-formatted CLI with progress visualization.

🎯 Key Features

  • Intelligent Color Detection: Automatically classifies colors as BACKGROUND, WATERMARK, TEXT, or NOISE

    • Multi-pass analysis with confidence scoring
    • Interactive visual color picker with preview
    • Smart background protection to avoid false positives
  • Advanced Watermark Detection:

    • Adaptive Gaussian thresholding for better precision than traditional Otsu
    • Combined color and saturation analysis
    • Automatic background (white area) exclusion
    • Morphological operations for noise removal
  • Precision Inpainting:

    • OpenCV TELEA algorithm with dynamic radius adjustment
    • Coverage-based parameter optimization
    • Progressive multi-pass removal for stubborn watermarks
    • Accurate color space handling (RGB ↔ BGR conversion)
  • Production Quality CLI:

    • Beautiful Rich-formatted panels and progress bars
    • Internationalization support (English & Chinese)
    • Detailed logging and statistics
    • Robust error handling
  • Flexible Processing:

    • Batch process multiple pages
    • Select specific pages or ranges
    • Adjustable DPI for different quality needs
    • Per-page statistics and coverage reporting

Installation

Using uv (recommended)

uv tool install pdf-watermark-removal-otsu-inpaint

Using pip

pip install pdf-watermark-removal-otsu-inpaint

From local directory

cd pdf-watermark-removal-otsu-inpaint
uv tool install --editable .

Quick Start

Basic Usage (All Pages, Interactive Color Selection)

pdf-watermark-removal input.pdf output.pdf

Specify Watermark Color Explicitly

pdf-watermark-removal input.pdf output.pdf --color "200,200,200"

The color format is R,G,B with values from 0-255.
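As an illustration of the format, the following is a minimal sketch of how an R,G,B string could be validated. The function name parse_rgb is hypothetical and is not taken from the tool's source.

```python
def parse_rgb(text):
    """Parse an 'R,G,B' string into a tuple of three ints in 0..255.

    Hypothetical helper for illustration; the tool's own parser may differ.
    """
    parts = [p.strip() for p in text.split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected three comma-separated values, got {text!r}")
    try:
        rgb = tuple(int(p) for p in parts)
    except ValueError:
        raise ValueError(f"non-integer channel in {text!r}") from None
    if any(not 0 <= channel <= 255 for channel in rgb):
        raise ValueError(f"channel outside 0-255 in {text!r}")
    return rgb


color = parse_rgb("200,200,200")
```

Rejecting out-of-range values early gives a clearer error than letting them wrap or clip later in the image pipeline.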

Skip Interactive Selection

pdf-watermark-removal input.pdf output.pdf --auto-color

Process Specific Pages Only

pdf-watermark-removal input.pdf output.pdf --pages 1,3,5
pdf-watermark-removal input.pdf output.pdf --pages 1-10
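The two --pages forms above can be expanded into a page list along these lines. This is a sketch only: parse_pages is a hypothetical name, and it assumes page numbers are 1-based and that comma lists and ranges may be mixed, neither of which the documentation states explicitly.

```python
def parse_pages(spec, page_count):
    """Expand a spec such as '1,3,5' or '1-10' into sorted page numbers.

    Assumes 1-based page numbers (an assumption, not confirmed by the docs).
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"invalid page range {part!r} for {page_count} pages")
        pages.update(range(start, end + 1))
    return sorted(pages)


selected = parse_pages("1,3,5", page_count=34)
```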

With Advanced Options

pdf-watermark-removal input.pdf output.pdf \
  --color "180,180,180" \
  --kernel-size 5 \
  --inpaint-radius 3 \
  --multi-pass 2 \
  --dpi 300 \
  --verbose

Command-Line Options

INPUT_PDF                  Path to input PDF file
OUTPUT_PDF                 Path to output PDF file

OPTIONS:
  --color TEXT             Watermark color as 'R,G,B' (e.g., '128,128,128')
                          Interactive selection if not specified
  --auto-color             Skip interactive selection, use automatic detection
  --pages TEXT             Pages to process (e.g., '1,3,5' or '1-5')
                          Process all pages if not specified
  --kernel-size INTEGER    Morphological kernel size (default: 3)
  --inpaint-radius INTEGER Inpainting radius (default: 2)
  --multi-pass INTEGER     Number of removal passes (default: 1)
  --dpi INTEGER            DPI for PDF rendering (default: 150)
  -v, --verbose            Enable verbose output
  --help                   Show help message

How It Works

1. Color Detection & Selection

  • Analyzes first page to detect dominant colors
  • Shows most common non-photo colors (likely watermark/text)
  • User selects watermark color or confirms automatic selection
  • Supports coarse (3 colors) and fine (10 colors) selection modes

2. Watermark Detection (Otsu Threshold)

  • Converts each PDF page to image at specified DPI
  • Converts image to grayscale
  • Applies Otsu's automatic thresholding to create binary image
  • Uses morphological operations (open and close) to refine mask
  • Combines with color saturation analysis for better detection
  • Filters small noise components

3. Watermark Removal (Inpainting)

  • Uses detected mask to identify watermark regions
  • Applies OpenCV's TELEA inpainting method
  • Reconstructs watermarked areas using surrounding texture
  • Supports multi-pass for stubborn watermarks

4. PDF Reconstruction

  • Converts processed images back to PDF
  • Preserves document layout and quality

Algorithm Details

1. Intelligent Color Classification

The tool uses multi-dimensional analysis to classify colors:

  • BACKGROUND: Gray level 240-255 + coverage >60% → confidence 0%
  • WATERMARK: Gray level 180-240 + coverage 2-15% → dynamic confidence (20-100%)
  • TEXT: Gray level 0-80 + coverage <5% → confidence 0%
  • NOISE: All other patterns → confidence 0%

Confidence scoring formula:

confidence = (gray_factor × 0.5 + coverage_factor × 0.5) × 100
           + bonus_for_typical_range
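The classification rules and the scoring formula can be written out as follows. This is a sketch under assumptions: the function names are hypothetical, the factors are taken as inputs in the range 0 to 1 because the documentation does not define how they are computed, and clamping the score to 20-100 is one reading of "dynamic confidence (20-100%)".

```python
def classify_color(gray_level, coverage_pct):
    """Apply the documented gray-level and coverage rules to one color."""
    if 240 <= gray_level <= 255 and coverage_pct > 60:
        return "BACKGROUND"
    if 180 <= gray_level <= 240 and 2 <= coverage_pct <= 15:
        return "WATERMARK"
    if 0 <= gray_level <= 80 and coverage_pct < 5:
        return "TEXT"
    return "NOISE"


def watermark_confidence(gray_factor, coverage_factor, bonus=0.0):
    """Evaluate the documented formula for a color classified as WATERMARK.

    gray_factor and coverage_factor are assumed to lie in 0..1, and the
    clamp to 20..100 is an assumption rather than documented behavior.
    """
    score = (gray_factor * 0.5 + coverage_factor * 0.5) * 100 + bonus
    return max(20.0, min(100.0, score))


label = classify_color(gray_level=200, coverage_pct=8)
score = watermark_confidence(gray_factor=1.0, coverage_factor=0.5)
```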

2. Adaptive Watermark Detection

Combines multiple detection methods:

  • Adaptive Gaussian Thresholding: Handles varying lighting conditions
  • Color-based Detection: Uses detected watermark color to refine mask
  • Saturation Analysis: Identifies low-saturation regions (watermarks, text)
  • Background Protection: Explicitly excludes white/bright areas (>250 gray)
  • Morphological Refinement: Opens (removes small noise) then closes (fills holes)

3. TELEA Inpainting

Uses OpenCV's Fast Marching Method:

  • Dynamic Radius: Adjusted based on watermark coverage (radius = 2 + coverage × 5)
  • Color Space Accuracy: Converts RGB → BGR for processing, maintains accuracy
  • Early Termination: Skips processing if no watermark detected
  • Multi-pass Support: Progressive mask expansion for difficult watermarks
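The radius rule and the early-termination check can be sketched as below. The function names are hypothetical, and the sketch assumes coverage is a fraction between 0 and 1 (masked pixels divided by total pixels), which the documentation does not state.

```python
def dynamic_inpaint_radius(coverage):
    """Apply radius = 2 + coverage * 5, with coverage assumed to be in 0..1."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be a fraction between 0 and 1")
    return 2 + coverage * 5


def plan_inpaint(marked_pixels, total_pixels):
    """Return None when no watermark was detected, else the radius to use."""
    if total_pixels <= 0:
        raise ValueError("total_pixels must be positive")
    if marked_pixels == 0:
        return None  # early termination: nothing to inpaint
    return dynamic_inpaint_radius(marked_pixels / total_pixels)


radius = plan_inpaint(marked_pixels=1000, total_pixels=10000)
```

Under this reading, a mask covering 10% of the page gives a radius of 2.5, and the radius never exceeds 7.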

4. PDF Reconstruction

Preserves document fidelity:

  • Maintains original page layout
  • Preserves resolution based on input DPI
  • Reconstructs from processed image sequence
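Pillow is listed below as the dependency used for PDF generation, so the reconstruction step might look roughly like this. It is a sketch, not the tool's code: images_to_pdf is a hypothetical name, and passing the rendering DPI as the resolution is an assumption about how physical page size is preserved.

```python
import os
import tempfile

from PIL import Image


def images_to_pdf(pages, output_path, dpi=150):
    """Write a sequence of PIL images to a multi-page PDF, one image per page."""
    if not pages:
        raise ValueError("no pages to write")
    rgb_pages = [page.convert("RGB") for page in pages]
    # resolution maps pixels back to physical size: width_px / dpi inches.
    rgb_pages[0].save(
        output_path,
        "PDF",
        resolution=float(dpi),
        save_all=True,
        append_images=rgb_pages[1:],
    )


# Usage: two blank A4-sized pages as they would be rendered at 150 DPI.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "clean.pdf")
    blank_pages = [Image.new("RGB", (1240, 1754), "white") for _ in range(2)]
    images_to_pdf(blank_pages, out_path, dpi=150)
    with open(out_path, "rb") as handle:
        header = handle.read(5)
```

Note that a PDF rebuilt this way contains page images only, so any selectable text in the original is not carried over.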

Requirements

  • Python 3.8+
  • uv package manager (for tool installation)

Automatic Dependencies

  • OpenCV (opencv-python) - Image processing and inpainting
  • NumPy - Array operations
  • Pillow - Image I/O and PDF generation
  • PyPDF - PDF utilities
  • Click - CLI framework
  • PyMuPDF - Fast PDF rendering
  • Rich - Beautiful CLI with colors and progress bars

Examples

Example 1: Interactive Color Selection with Rich UI

$ pdf-watermark-removal contract.pdf contract_clean.pdf

┌────────────────────── Configuration ───────────────────────┐
│ PDF Watermark Removal Tool                                 │
│ Input:  contract.pdf                                       │
│ Output: contract_clean.pdf                                 │
└────────────────────────────────────────────────────────────┘

Would you like to interactively select the watermark color? [y/N]: y
Use coarse color selection (3 main colors)? [Y/n]: y

============================================================
WATERMARK COLOR DETECTION
============================================================

Analyzing 3 most common colors in the document...

Detected colors (likely watermark or text):

┌───────┬──────────────────────────┬────────────────────┬────────────┬────────────┐
│ Index │ Color Preview            │ RGB Value          │ Gray Level │ Percentage │
├───────┼──────────────────────────┼────────────────────┼────────────┼────────────┤
│ 0     │ ████████████████████░░░░ │ RGB(200, 200, 200) │ 200        │ 45.3%      │
│ 1     │ ███████████░░░░░░░░░░░░░ │ RGB(150, 150, 150) │ 150        │ 28.1%      │
│ 2     │ ████████░░░░░░░░░░░░░░░░ │ RGB(100, 100, 100) │ 100        │ 18.2%      │
└───────┴──────────────────────────┴────────────────────┴────────────┴────────────┘

Select color number (0-indexed) or 'a' for automatic [a]: 0

┌───────────────── Selected Watermark Color ─────────────────┐
│                                                            │
│                                                            │
│ RGB Value: (200, 200, 200)                                 │
│ Gray Level: 200                                            │
│ Percentage in document: 45.30%                             │
│                                                            │
└────────────────────────────────────────────────────────────┘

Step 1: Converting PDF to images...
⠋ Loading PDF ────────────────────────────────── 0%
Loaded 34 pages

Step 2: Removing watermarks...
Processing pages ████████████████████░░░░░░░░░░ 66% - 0:00:45
Watermark removal completed

Step 3: Converting images back to PDF...
⠙ Saving PDF

┌───────────────────────── Success ──────────────────────────┐
│ Watermark removal completed successfully!                  │
│ Output saved to: contract_clean.pdf                        │
└────────────────────────────────────────────────────────────┘

Example 2: Explicit Color and Multi-Pass

pdf-watermark-removal document.pdf clean.pdf \
  --color "220,220,220" \
  --multi-pass 2 \
  --verbose

Example 3: High-Quality Processing

pdf-watermark-removal thesis.pdf thesis_clean.pdf \
  --dpi 300 \
  --kernel-size 5 \
  --inpaint-radius 3 \
  --auto-color

Performance

Typical processing times on modern systems:

  • Single page: 1-2 seconds
  • 10 pages: 10-20 seconds
  • 100 pages: 2-5 minutes

Factors affecting speed:

  • PDF resolution (DPI)
  • Page complexity
  • Inpaint radius
  • Multi-pass count
  • System CPU/memory

Troubleshooting

Poor Watermark Detection

Symptoms: Watermark not fully detected

Solutions:

  1. Specify the color explicitly and try nearby values: --color "180,180,180"
  2. Increase kernel size: --kernel-size 5 or --kernel-size 7
  3. Use multi-pass: --multi-pass 2

Artifacts or Blurriness

Symptoms: Cleaned PDF has blurry or distorted areas

Solutions:

  1. Reduce inpaint radius: --inpaint-radius 1
  2. Lower DPI: --dpi 150 (default is good for most documents)
  3. Single pass: --multi-pass 1 (default)

Memory Issues

Symptoms: "Out of memory" error on large PDFs

Solutions:

  1. Lower DPI: --dpi 100
  2. Process specific pages: --pages 1-50 (then 51-100, etc.)
  3. Increase system available memory

License

MIT

Contributing

Contributions welcome! Areas for enhancement:

  • GPU acceleration for large documents
  • Additional inpainting algorithms (e.g., Criminisi)
  • Batch API interface
  • Additional language support
  • Performance benchmarking
