Remove watermarks from PDF using Otsu threshold segmentation and OpenCV inpaint

PDF Watermark Removal Tool

A command-line tool that removes watermarks from PDF files using adaptive thresholding, color classification, and OpenCV inpainting. It offers interactive watermark color selection and a Rich-based CLI with progress bars.

🎯 Key Features

  • Intelligent Color Detection: Automatically classifies colors as BACKGROUND, WATERMARK, TEXT, or NOISE

    • Multi-pass analysis with confidence scoring
    • Interactive visual color picker with preview
    • Smart background protection to avoid false positives
  • Advanced Watermark Detection:

    • Adaptive Gaussian thresholding for better precision than traditional Otsu
    • Combined color and saturation analysis
    • Automatic background (white area) exclusion
    • Morphological operations for noise removal
  • Precision Inpainting:

    • OpenCV TELEA algorithm with dynamic radius adjustment
    • Coverage-based parameter optimization
    • Progressive multi-pass removal for stubborn watermarks
    • Accurate color space handling (RGB ↔ BGR conversion)
  • Production Quality CLI:

    • Beautiful Rich-formatted panels and progress bars
    • Internationalization support (English & Chinese)
    • Detailed logging and statistics
    • Robust error handling
  • Flexible Processing:

    • Batch process multiple pages
    • Select specific pages or ranges
    • Adjustable DPI for different quality needs
    • Per-page statistics and coverage reporting

Installation

Using uv (recommended)

uv tool install pdf-watermark-removal-otsu-inpaint

Using pip

pip install pdf-watermark-removal-otsu-inpaint

From local directory

cd pdf-watermark-removal-otsu-inpaint
uv tool install --editable .

Quick Start

Basic Usage (All Pages, Interactive Color Selection)

pdf-watermark-removal input.pdf output.pdf

Specify Watermark Color Explicitly

pdf-watermark-removal input.pdf output.pdf --color "200,200,200"

The color format is R,G,B, with each value in the range 0-255.
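
As a rough illustration of the format described above, the following sketch validates an R,G,B string. The function name `parse_rgb` is hypothetical and is not taken from the tool's source; the tool's own parsing may differ.

```python
def parse_rgb(text):
    """Parse a string such as '200,200,200' into an (r, g, b) tuple of ints.

    Hypothetical helper: illustrates the documented format only.
    """
    parts = [part.strip() for part in text.split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected three comma-separated values, got {text!r}")
    try:
        rgb = tuple(int(part) for part in parts)
    except ValueError:
        raise ValueError(f"color components must be integers: {text!r}") from None
    if any(channel < 0 or channel > 255 for channel in rgb):
        raise ValueError(f"each component must be in 0-255: {text!r}")
    return rgb


print(parse_rgb("200,200,200"))
```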

Skip Interactive Selection

pdf-watermark-removal input.pdf output.pdf --auto-color

Process Specific Pages Only

pdf-watermark-removal input.pdf output.pdf --pages 1,3,5
pdf-watermark-removal input.pdf output.pdf --pages 1-10
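
A page specification like the ones above could be expanded along these lines. This is a sketch under assumptions: `parse_pages` is a hypothetical name, page numbers are assumed to be 1-based and inclusive, and mixing lists with ranges (for example `1,3-5`) is assumed to be allowed, which the tool itself may not support.

```python
def parse_pages(spec, page_count):
    """Expand '1,3,5' or '1-10' into a sorted list of 1-based page numbers.

    Hypothetical helper: the real CLI may parse --pages differently.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start or end > page_count:
            raise ValueError(
                f"invalid page range {part!r} for a {page_count}-page document"
            )
        pages.update(range(start, end + 1))
    return sorted(pages)


print(parse_pages("1,3,5", page_count=10))
print(parse_pages("1-10", page_count=10))
```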

With Advanced Options

pdf-watermark-removal input.pdf output.pdf \
  --color "180,180,180" \
  --kernel-size 5 \
  --inpaint-radius 3 \
  --multi-pass 2 \
  --dpi 300 \
  --verbose

Command-Line Options

INPUT_PDF                  Path to input PDF file
OUTPUT_PDF                 Path to output PDF file

OPTIONS:
  --color TEXT             Watermark color as 'R,G,B' (e.g., '128,128,128')
                          Interactive selection if not specified
  --auto-color             Skip interactive selection, use automatic detection
  --pages TEXT             Pages to process (e.g., '1,3,5' or '1-5')
                          Process all pages if not specified
  --kernel-size INTEGER    Morphological kernel size (default: 3)
  --inpaint-radius INTEGER Inpainting radius (default: 2)
  --multi-pass INTEGER     Number of removal passes (default: 1)
  --dpi INTEGER            DPI for PDF rendering (default: 150)
  -v, --verbose            Enable verbose output
  --help                   Show help message

How It Works

1. Color Detection & Selection

  • Analyzes first page to detect dominant colors
  • Shows most common non-photo colors (likely watermark/text)
  • User selects watermark color or confirms automatic selection
  • Supports coarse (3 colors) and fine (10 colors) selection modes
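
The idea of ranking the most common colors can be sketched with the standard library alone. Everything below is illustrative: `dominant_colors`, the quantization `step`, and the pixel list are assumptions, and the tool's actual analysis may be quite different.

```python
from collections import Counter


def dominant_colors(pixels, top_n=3, step=10):
    """Return the top_n most common quantized colors and their pixel share.

    pixels: iterable of (r, g, b) tuples. top_n=3 mirrors the coarse mode,
    top_n=10 the fine mode. The quantization step is an assumption.
    """
    def quantize(value):
        # Snap each channel to the nearest multiple of `step`, capped at 255,
        # so that near-identical shades are counted together.
        return min(255, round(value / step) * step)

    counts = Counter(tuple(quantize(c) for c in pixel) for pixel in pixels)
    total = sum(counts.values())
    return [(color, count / total) for color, count in counts.most_common(top_n)]


# Synthetic page: mostly white, some light gray, a little dark text.
sample = [(255, 255, 255)] * 70 + [(201, 199, 200)] * 20 + [(12, 10, 11)] * 10
for color, share in dominant_colors(sample):
    print(color, f"{share:.1%}")
```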

2. Watermark Detection (Otsu Threshold)

  • Converts each PDF page to image at specified DPI
  • Converts image to grayscale
  • Applies Otsu's automatic thresholding to create binary image
  • Uses morphological operations (open and close) to refine mask
  • Combines with color saturation analysis for better detection
  • Filters small noise components
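
To make the thresholding step concrete, here is a minimal pure-Python sketch of Otsu's method operating on a flat list of gray values. It is not the tool's code (which relies on OpenCV); the function names are hypothetical, and which side of the threshold is treated as watermark is left open.

```python
def otsu_threshold(gray_values):
    """Return the threshold t (0-255) that maximizes between-class variance.

    Values <= t form the low class, values > t the high class.
    """
    hist = [0] * 256
    for value in gray_values:
        hist[value] += 1

    total = len(gray_values)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_low, sum_low = 0, 0
    for t in range(256):
        weight_low += hist[t]
        sum_low += t * hist[t]
        weight_high = total - weight_low
        if weight_low == 0:
            continue  # no pixels in the low class yet
        if weight_high == 0:
            break  # every pixel is already in the low class
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        between_var = weight_low * weight_high * (mean_low - mean_high) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(gray_values, threshold):
    """Map each gray value to 1 if it is above the threshold, else 0."""
    return [1 if value > threshold else 0 for value in gray_values]


values = [50] * 40 + [200] * 60
t = otsu_threshold(values)
print(t, sum(binarize(values, t)))
```

When several thresholds separate the two classes equally well, this sketch returns the lowest one.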

3. Watermark Removal (Inpainting)

  • Uses detected mask to identify watermark regions
  • Applies OpenCV's TELEA inpainting method
  • Reconstructs watermarked areas using surrounding texture
  • Supports multi-pass for stubborn watermarks

4. PDF Reconstruction

  • Converts processed images back to PDF
  • Preserves document layout and quality

Algorithm Details

1. Intelligent Color Classification

The tool uses multi-dimensional analysis to classify colors:

  • BACKGROUND: Gray level 240-255 + coverage >60% → confidence 0%
  • WATERMARK: Gray level 180-240 + coverage 2-15% → dynamic confidence (20-100%)
  • TEXT: Gray level 0-80 + coverage <5% → confidence 0%
  • NOISE: All other patterns → confidence 0%
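
The classification rules above can be written as a small decision function. This is a sketch: `classify_color` is a hypothetical name, coverage is assumed to be a fraction between 0 and 1, and the inclusive handling of the range boundaries is a guess.

```python
def classify_color(gray, coverage):
    """Classify a color from its gray level (0-255) and coverage (0.0-1.0).

    Thresholds follow the rules listed above; boundary handling is assumed.
    """
    if gray >= 240 and coverage > 0.60:
        return "BACKGROUND"
    if 180 <= gray <= 240 and 0.02 <= coverage <= 0.15:
        return "WATERMARK"
    if gray <= 80 and coverage < 0.05:
        return "TEXT"
    return "NOISE"


for gray, coverage in [(255, 0.80), (200, 0.08), (30, 0.03), (120, 0.30)]:
    print(gray, coverage, classify_color(gray, coverage))
```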

Confidence scoring formula:

confidence = (gray_factor × 0.5 + coverage_factor × 0.5) × 100
           + bonus_for_typical_range
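
A direct transcription of this formula might look as follows. The source does not define how `gray_factor`, `coverage_factor`, or the bonus are computed, so here they are plain inputs assumed to lie in 0-1 (factors) and percentage points (bonus). The clamp to 20-100 is an assumption based on the stated confidence range for watermarks.

```python
def watermark_confidence(gray_factor, coverage_factor, bonus=0.0):
    """Combine two factors (assumed 0.0-1.0) into a confidence percentage.

    The clamp to the 20-100 range is an assumption, not confirmed behavior.
    """
    base = (gray_factor * 0.5 + coverage_factor * 0.5) * 100
    return max(20.0, min(100.0, base + bonus))


print(watermark_confidence(0.5, 1.0))
print(watermark_confidence(0.5, 1.0, bonus=10))
```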

2. Adaptive Watermark Detection

Combines multiple detection methods:

  • Adaptive Gaussian Thresholding: Handles varying lighting conditions
  • Color-based Detection: Uses detected watermark color to refine mask
  • Saturation Analysis: Identifies low-saturation regions (watermarks, text)
  • Background Protection: Explicitly excludes white/bright areas (>250 gray)
  • Morphological Refinement: Opens (removes small noise) then closes (fills holes)
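
The open-then-close refinement can be illustrated without OpenCV on a small binary mask. This is a simplified sketch with hypothetical function names: it uses a square k×k window and ignores pixels outside the image, which may not match how the tool configures its kernels.

```python
def _filter(mask, k, combine):
    """Apply `combine` (all or any) over each pixel's in-bounds k x k window."""
    h, w = len(mask), len(mask[0])
    r = k // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                mask[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            ]
            row.append(1 if combine(window) else 0)
        out.append(row)
    return out


def erode(mask, k=3):
    return _filter(mask, k, all)  # keep a pixel only if its whole window is set


def dilate(mask, k=3):
    return _filter(mask, k, any)  # set a pixel if anything in its window is set


def opening(mask, k=3):
    return dilate(erode(mask, k), k)  # removes specks smaller than the kernel


def closing(mask, k=3):
    return erode(dilate(mask, k), k)  # fills holes smaller than the kernel


def refine(mask, k=3):
    return closing(opening(mask, k), k)


# 7x7 mask: a solid 3x3 block plus one isolated noise pixel in the corner.
mask = [[1 if 2 <= y <= 4 and 2 <= x <= 4 else 0 for x in range(7)] for y in range(7)]
mask[0][6] = 1
for row in refine(mask):
    print("".join("#" if v else "." for v in row))
```

Note that opening runs first, so a region must contain at least one full k×k solid patch to survive; a larger `--kernel-size` therefore removes more noise but can also discard thin watermark strokes.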

3. TELEA Inpainting

Uses OpenCV's Fast Marching Method:

  • Dynamic Radius: Adjusted based on watermark coverage (radius = 2 + coverage × 5)
  • Color Space Accuracy: Converts RGB → BGR for processing, maintains accuracy
  • Early Termination: Skips processing if no watermark detected
  • Multi-pass Support: Progressive mask expansion for difficult watermarks
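
The radius rule and the early-termination check can be sketched as below. The names are hypothetical, coverage is assumed to be the fraction (0-1) of mask pixels flagged as watermark, and the actual call into OpenCV's inpainting is deliberately left out.

```python
def coverage_ratio(mask):
    """Fraction of pixels flagged as watermark in a mask of 0/1 rows."""
    total = sum(len(row) for row in mask)
    return sum(map(sum, mask)) / total if total else 0.0


def dynamic_radius(coverage):
    """radius = 2 + coverage x 5, with coverage assumed to be in 0.0-1.0."""
    return 2 + coverage * 5


def plan_inpaint(mask):
    """Return the inpaint radius to use, or None when there is nothing to remove."""
    coverage = coverage_ratio(mask)
    if coverage == 0:
        return None  # early termination: no watermark pixels detected
    return dynamic_radius(coverage)


def rgb_to_bgr(pixel):
    """Reverse channel order; OpenCV stores color images as B, G, R."""
    return tuple(reversed(pixel))


print(plan_inpaint([[0, 0], [0, 0]]))
print(plan_inpaint([[1, 0], [0, 0]]))
print(rgb_to_bgr((200, 180, 160)))
```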

4. PDF Reconstruction

Preserves document fidelity:

  • Maintains original page layout
  • Preserves resolution based on input DPI
  • Reconstructs from processed image sequence

Requirements

  • Python 3.8+
  • uv package manager (for tool installation)

Automatic Dependencies

  • OpenCV (opencv-python) - Image processing and inpainting
  • NumPy - Array operations
  • Pillow - Image I/O and PDF generation
  • PyPDF - PDF utilities
  • Click - CLI framework
  • PyMuPDF - Fast PDF rendering
  • Rich - Beautiful CLI with colors and progress bars

Examples

Example 1: Interactive Color Selection with Rich UI

$ pdf-watermark-removal contract.pdf contract_clean.pdf

┌────────────────────── Configuration ───────────────────────┐
│ PDF Watermark Removal Tool                                 │
│ Input:  contract.pdf                                       │
│ Output: contract_clean.pdf                                 │
└────────────────────────────────────────────────────────────┘

Would you like to interactively select the watermark color? [y/N]: y
Use coarse color selection (3 main colors)? [Y/n]: y

============================================================
WATERMARK COLOR DETECTION
============================================================

Analyzing 3 most common colors in the document...

Detected colors (likely watermark or text):

┌───────┬──────────────────────────┬────────────────────┬────────────┬────────────┐
│ Index │ Color Preview            │ RGB Value          │ Gray Level │ Percentage │
├───────┼──────────────────────────┼────────────────────┼────────────┼────────────┤
│ 0     │ ████████████████████░░░░ │ RGB(200, 200, 200) │ 200        │ 45.3%      │
│ 1     │ ███████████░░░░░░░░░░░░░ │ RGB(150, 150, 150) │ 150        │ 28.1%      │
│ 2     │ ████████░░░░░░░░░░░░░░░░ │ RGB(100, 100, 100) │ 100        │ 18.2%      │
└───────┴──────────────────────────┴────────────────────┴────────────┴────────────┘

Select color number (0-indexed) or 'a' for automatic [a]: 0

┌───────────────── Selected Watermark Color ─────────────────┐
│                                                            │
│                                                            │
│ RGB Value: (200, 200, 200)                                 │
│ Gray Level: 200                                            │
│ Percentage in document: 45.30%                             │
│                                                            │
└────────────────────────────────────────────────────────────┘

Step 1: Converting PDF to images...
⠋ Loading PDF ────────────────────────────────── 0%
Loaded 34 pages

Step 2: Removing watermarks...
Processing pages ████████████████████░░░░░░░░░░ 66% - 0:00:45
Watermark removal completed

Step 3: Converting images back to PDF...
⠙ Saving PDF

┌───────────────────────── Success ──────────────────────────┐
│ Watermark removal completed successfully!                  │
│ Output saved to: contract_clean.pdf                        │
└────────────────────────────────────────────────────────────┘

Example 2: Explicit Color and Multi-Pass

pdf-watermark-removal document.pdf clean.pdf \
  --color "220,220,220" \
  --multi-pass 2 \
  --verbose

Example 3: High-Quality Processing

pdf-watermark-removal thesis.pdf thesis_clean.pdf \
  --dpi 300 \
  --kernel-size 5 \
  --inpaint-radius 3 \
  --auto-color

Performance

Typical processing times on modern systems:

  • Single page: 1-2 seconds
  • 10 pages: 10-20 seconds
  • 100 pages: 2-5 minutes

Factors affecting speed:

  • PDF resolution (DPI)
  • Page complexity
  • Inpaint radius
  • Multi-pass count
  • System CPU/memory

Troubleshooting

Poor Watermark Detection

Symptoms: Watermark not fully detected

Solutions:

  1. Specify the watermark color explicitly and experiment with nearby values, e.g. --color "180,180,180"
  2. Increase kernel size: --kernel-size 5 or --kernel-size 7
  3. Use multi-pass: --multi-pass 2

Artifacts or Blurriness

Symptoms: Cleaned PDF has blurry or distorted areas

Solutions:

  1. Reduce inpaint radius: --inpaint-radius 1
  2. Lower the DPI: --dpi 150 (the default, which suits most documents)
  3. Single pass: --multi-pass 1 (default)

Memory Issues

Symptoms: "Out of memory" error on large PDFs

Solutions:

  1. Lower DPI: --dpi 100
  2. Process specific pages: --pages 1-50 (then 51-100, etc.)
  3. Increase the memory available to the process
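
To see why lowering the DPI or splitting the page range helps, the sketch below estimates the size of one rendered page and generates `--pages` ranges for chunked runs. It relies on the fact that PDF dimensions are given in points (1/72 inch) and assumes uncompressed 8-bit RGB rasters; real memory use will be higher because of intermediate copies. The helper names are hypothetical.

```python
def page_pixels(width_pt, height_pt, dpi):
    """Pixel dimensions of a page rendered at `dpi` (PDF points are 1/72 inch)."""
    scale = dpi / 72
    return round(width_pt * scale), round(height_pt * scale)


def page_bytes(width_pt, height_pt, dpi, channels=3):
    """Approximate size of one uncompressed 8-bit raster (assumption: RGB)."""
    width_px, height_px = page_pixels(width_pt, height_pt, dpi)
    return width_px * height_px * channels


def page_chunks(page_count, chunk_size=50):
    """Build --pages range strings such as '1-50', '51-100', ..."""
    return [
        f"{start}-{min(start + chunk_size - 1, page_count)}"
        for start in range(1, page_count + 1, chunk_size)
    ]


A4 = (595, 842)  # approximate A4 size in points
for dpi in (100, 150, 300):
    print(dpi, page_pixels(*A4, dpi), f"{page_bytes(*A4, dpi) / 1e6:.1f} MB")
print(page_chunks(120))
```

Because the pixel count grows with the square of the DPI, doubling the DPI roughly quadruples the memory needed per page.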

License

MIT

Changelog

v0.1.0 (2024)

Initial Release - Production Ready

Core Features:

  • ✅ Multi-stage watermark detection (adaptive thresholding + color analysis)
  • ✅ Intelligent color classification (BACKGROUND/WATERMARK/TEXT/NOISE)
  • ✅ OpenCV TELEA inpainting with dynamic parameters
  • ✅ Interactive color selection with confidence scoring
  • ✅ Multi-pass progressive watermark removal
  • ✅ Support for batch processing and page ranges

Algorithm Improvements:

  • ✅ Adaptive Gaussian thresholding (replaces simple Otsu)
  • ✅ Background protection (excludes white/bright areas)
  • ✅ Color space accuracy (RGB ↔ BGR proper handling)
  • ✅ Dynamic inpaint radius based on coverage
  • ✅ Morphological noise removal with connected components

Quality Assurance:

  • ✅ ruff linting compliance
  • ✅ Comprehensive error handling
  • ✅ Detailed logging and statistics
  • ✅ Verified on multiple PDF types

Contributing

Contributions welcome! Areas for enhancement:

  • GPU acceleration for large documents
  • Additional inpainting algorithms (e.g., Criminisi)
  • Batch API interface
  • Additional language support
  • Performance benchmarking
