Remove watermarks from PDF using Otsu threshold segmentation and OpenCV inpaint

PDF Watermark Removal Tool

A command-line tool that removes watermarks from PDF files using adaptive thresholding, color detection, and OpenCV inpainting. It offers interactive watermark color selection and a Rich-formatted CLI with progress bars.

🎯 Key Features

- Intelligent Color Detection: Automatically classifies colors as BACKGROUND, WATERMARK, TEXT, or NOISE
  - Multi-pass analysis with confidence scoring
  - Interactive visual color picker with preview
  - Smart background protection to avoid false positives
- Advanced Watermark Detection:
  - Adaptive Gaussian thresholding for better precision than traditional Otsu
  - Combined color and saturation analysis
  - Automatic background (white area) exclusion
  - Morphological operations for noise removal
- Precision Inpainting:
  - OpenCV TELEA algorithm with dynamic radius adjustment
  - Coverage-based parameter optimization
  - Progressive multi-pass removal for stubborn watermarks
  - Accurate color space handling (RGB ↔ BGR conversion)
- Production Quality CLI:
  - Beautiful Rich-formatted panels and progress bars
  - Internationalization support (English & Chinese)
  - Detailed logging and statistics
  - Robust error handling
- Flexible Processing:
  - Batch process multiple pages
  - Select specific pages or ranges
  - Adjustable DPI for different quality needs
  - Per-page statistics and coverage reporting

Installation

Using uv (recommended)

uv tool install pdf-watermark-removal-otsu-inpaint

Using pip

pip install pdf-watermark-removal-otsu-inpaint

From local directory

cd pdf-watermark-removal-otsu-inpaint
uv tool install --editable .

Quick Start

Basic Usage (All Pages, Interactive Color Selection)

pdf-watermark-removal input.pdf output.pdf

Specify Watermark Color Explicitly

pdf-watermark-removal input.pdf output.pdf --color "200,200,200"

The color format is R,G,B, with each value in the range 0-255.
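
The format rule above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the function name `parse_color` is hypothetical, and rejecting malformed or out-of-range input with `ValueError` is an assumption.

```python
from typing import Tuple


def parse_color(text: str) -> Tuple[int, int, int]:
    """Parse an 'R,G,B' string into three integers in the range 0-255."""
    parts = [part.strip() for part in text.split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected three comma-separated values, got {text!r}")
    # int() raises ValueError itself when a component is not a whole number.
    red, green, blue = (int(part) for part in parts)
    for component in (red, green, blue):
        if not 0 <= component <= 255:
            raise ValueError(f"color component {component} is outside 0-255")
    return (red, green, blue)


print(parse_color("200,200,200"))
```

Surrounding whitespace is tolerated in this sketch, so `"180, 180, 180"` parses the same way as `"180,180,180"`.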

Skip Interactive Selection

pdf-watermark-removal input.pdf output.pdf --auto-color

Process Specific Pages Only

pdf-watermark-removal input.pdf output.pdf --pages 1,3,5
pdf-watermark-removal input.pdf output.pdf --pages 1-10
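
As a rough sketch of how such a page specification could be expanded, the Python below turns a `--pages` value into a sorted list of page numbers. The function name `parse_pages` is hypothetical, and three behaviors are assumptions rather than documented facts: pages are 1-based, lists and ranges may be mixed (for example `1-3,7`), and out-of-range pages raise `ValueError`.

```python
from typing import List


def parse_pages(spec: str, page_count: int) -> List[int]:
    """Expand '1,3,5' or '1-10' into sorted, de-duplicated 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"invalid page range {part!r} for a {page_count}-page PDF")
        pages.update(range(start, end + 1))
    return sorted(pages)


print(parse_pages("1,3,5", page_count=34))
print(parse_pages("1-10", page_count=34))
```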

With Advanced Options

pdf-watermark-removal input.pdf output.pdf \
  --color "180,180,180" \
  --kernel-size 5 \
  --inpaint-radius 3 \
  --multi-pass 2 \
  --dpi 300 \
  --verbose

Command-Line Options

INPUT_PDF                  Path to input PDF file
OUTPUT_PDF                 Path to output PDF file

OPTIONS:
  --color TEXT             Watermark color as 'R,G,B' (e.g., '128,128,128')
                          Interactive selection if not specified
  --auto-color             Skip interactive selection, use automatic detection
  --pages TEXT             Pages to process (e.g., '1,3,5' or '1-5')
                          Process all pages if not specified
  --kernel-size INTEGER    Morphological kernel size (default: 3)
  --inpaint-radius INTEGER Inpainting radius (default: 2)
  --multi-pass INTEGER     Number of removal passes (default: 1)
  --dpi INTEGER            DPI for PDF rendering (default: 150)
  -v, --verbose            Enable verbose output
  --help                   Show help message

How It Works

1. Color Detection & Selection

Analyzes the first page to detect dominant colors
Shows the most common non-photo colors (likely watermark or text)
The user selects the watermark color or confirms the automatic selection
Supports coarse (3 colors) and fine (10 colors) selection modes

2. Watermark Detection (Otsu Threshold)

Renders each PDF page to an image at the specified DPI
Converts the image to grayscale
Applies Otsu's automatic thresholding to create a binary image
Uses morphological operations (open and close) to refine the mask
Combines the result with color saturation analysis for better detection
Filters out small noise components

3. Watermark Removal (Inpainting)

Uses detected mask to identify watermark regions
Applies OpenCV's TELEA inpainting method
Reconstructs watermarked areas using surrounding texture
Supports multi-pass for stubborn watermarks

4. PDF Reconstruction

Converts processed images back to PDF
Preserves document layout and quality

Algorithm Details

1. Intelligent Color Classification

The tool uses multi-dimensional analysis to classify colors:

BACKGROUND: Gray level 240-255 + coverage >60% → confidence 0%
WATERMARK: Gray level 180-240 + coverage 2-15% → dynamic confidence (20-100%)
TEXT: Gray level 0-80 + coverage <5% → confidence 0%
NOISE: All other patterns → confidence 0%

Confidence scoring formula (applies to WATERMARK candidates):

confidence = (gray_factor × 0.5 + coverage_factor × 0.5) × 100
           + bonus_for_typical_range
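
The rules and formula above can be written out as a short Python sketch. The function names are hypothetical. The README does not say how `gray_factor`, `coverage_factor`, or the bonus are computed, so this sketch takes them as inputs (factors assumed to lie between 0 and 1) and only assumes that the result is clamped to the stated 20-100% range. Coverage is assumed to be a fraction, so 60% is `0.60`.

```python
def classify_color(gray_level: int, coverage: float) -> str:
    """Apply the four classification rules in the order listed above."""
    # Gray level 240 appears in both the BACKGROUND and WATERMARK ranges;
    # checking BACKGROUND first is an assumption about how the tie is resolved.
    if 240 <= gray_level <= 255 and coverage > 0.60:
        return "BACKGROUND"
    if 180 <= gray_level <= 240 and 0.02 <= coverage <= 0.15:
        return "WATERMARK"
    if 0 <= gray_level <= 80 and coverage < 0.05:
        return "TEXT"
    return "NOISE"


def watermark_confidence(gray_factor: float, coverage_factor: float, bonus: float = 0.0) -> float:
    """confidence = (gray_factor * 0.5 + coverage_factor * 0.5) * 100 + bonus."""
    score = (gray_factor * 0.5 + coverage_factor * 0.5) * 100 + bonus
    # Assumption: clamp to the 20-100% range quoted for WATERMARK colors.
    return max(20.0, min(100.0, score))


print(classify_color(200, 0.05))
print(classify_color(250, 0.80))
print(watermark_confidence(1.0, 1.0, bonus=10.0))
```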

2. Adaptive Watermark Detection

Combines multiple detection methods:

Adaptive Gaussian Thresholding: Handles varying lighting conditions
Color-based Detection: Uses detected watermark color to refine mask
Saturation Analysis: Identifies low-saturation regions (watermarks, text)
Background Protection: Explicitly excludes white/bright areas (>250 gray)
Morphological Refinement: Opens (removes small noise) then closes (fills holes)

3. TELEA Inpainting

Uses OpenCV's Fast Marching Method:

Dynamic Radius: Adjusted based on watermark coverage (radius = 2 + coverage×5)
Color Space Accuracy: Converts RGB→BGR for processing, maintains accuracy
Early Termination: Skips processing if no watermark detected
Multi-pass Support: Progressive mask expansion for difficult watermarks
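
The radius rule and the early-termination check are simple enough to show directly. In this sketch the function names are hypothetical, coverage is assumed to be the fraction of masked pixels (0 to 1), and rounding to the nearest whole pixel is an assumption.

```python
from typing import Optional


def dynamic_inpaint_radius(coverage: float) -> int:
    """radius = 2 + coverage * 5, rounded to a whole number of pixels."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be a fraction between 0 and 1")
    return round(2 + coverage * 5)


def plan_inpaint(masked_pixels: int, total_pixels: int) -> Optional[int]:
    """Return the radius to use, or None when the page can be skipped."""
    if masked_pixels == 0:
        return None  # early termination: no watermark was detected
    return dynamic_inpaint_radius(masked_pixels / total_pixels)


print(plan_inpaint(0, 10_000))
print(plan_inpaint(2_000, 10_000))
print(plan_inpaint(10_000, 10_000))
```

Under these assumptions the radius ranges from 2 pixels for a barely covered page to 7 pixels for a fully covered one.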

4. PDF Reconstruction

Preserves document fidelity:

Maintains original page layout
Preserves resolution based on input DPI
Reconstructs from processed image sequence

Requirements

Python 3.8+
uv package manager (only needed for the uv-based installation; pip also works)

Automatic Dependencies

OpenCV (opencv-python) - Image processing and inpainting
NumPy - Array operations
Pillow - Image I/O and PDF generation
PyPDF - PDF utilities
Click - CLI framework
PyMuPDF - Fast PDF rendering
Rich - Beautiful CLI with colors and progress bars

Examples

Example 1: Interactive Color Selection with Rich UI

$ pdf-watermark-removal contract.pdf contract_clean.pdf

┌─────────────────────────────── Configuration ──────────────────────────────┐
│ PDF Watermark Removal Tool                                                │
│ Input:  contract.pdf                                                      │
│ Output: contract_clean.pdf                                                │
└────────────────────────────────────────────────────────────────────────────┘

Would you like to interactively select the watermark color? [y/N]: y
Use coarse color selection (3 main colors)? [Y/n]: y

============================================================
WATERMARK COLOR DETECTION
============================================================

Analyzing 3 most common colors in the document...

Detected colors (likely watermark or text):

┌───────┬──────────────────────────┬────────────────────┬────────────┬────────────┐
│ Index │ Color Preview            │ RGB Value          │ Gray Level │ Percentage │
├───────┼──────────────────────────┼────────────────────┼────────────┼────────────┤
│ 0     │ ████████████████████░░░░ │ RGB(200, 200, 200) │ 200        │ 45.3%      │
│ 1     │ ███████████░░░░░░░░░░░░░ │ RGB(150, 150, 150) │ 150        │ 28.1%      │
│ 2     │ ████████░░░░░░░░░░░░░░░░ │ RGB(100, 100, 100) │ 100        │ 18.2%      │
└───────┴──────────────────────────┴────────────────────┴────────────┴────────────┘

Select color number (0-indexed) or 'a' for automatic [a]: 0

┌───────────────────── Selected Watermark Color ─────────────────┐
│                                                                 │
│                                                                 │
│ RGB Value: (200, 200, 200)                                     │
│ Gray Level: 200                                                │
│ Percentage in document: 45.30%                                 │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Step 1: Converting PDF to images...
⠋ Loading PDF ────────────────────────────────── 0%
Loaded 34 pages

Step 2: Removing watermarks...
Processing pages ████████████████████░░░░░░░░░░ 66% - 0:00:45
Watermark removal completed

Step 3: Converting images back to PDF...
⠙ Saving PDF

┌──────────────────────────── Success ──────────────────────────┐
│ Watermark removal completed successfully!                    │
│ Output saved to: contract_clean.pdf                          │
└──────────────────────────────────────────────────────────────┘

Example 2: Explicit Color and Multi-Pass

pdf-watermark-removal document.pdf clean.pdf \
  --color "220,220,220" \
  --multi-pass 2 \
  --verbose

Example 3: High-Quality Processing

pdf-watermark-removal thesis.pdf thesis_clean.pdf \
  --dpi 300 \
  --kernel-size 5 \
  --inpaint-radius 3 \
  --auto-color

Performance

Typical processing times on modern systems:

Single page: 1-2 seconds
10 pages: 10-20 seconds
100 pages: 2-5 minutes

Factors affecting speed:

PDF resolution (DPI)
Page complexity
Inpaint radius
Multi-pass count
System CPU/memory

Troubleshooting

Poor Watermark Detection

Symptoms: Watermark not fully detected

Solutions:

Use fine color selection in the interactive picker, or pass an explicit color and try nearby values: --color "180,180,180"
Increase kernel size: --kernel-size 5 or --kernel-size 7
Use multi-pass: --multi-pass 2

Artifacts or Blurriness

Symptoms: Cleaned PDF has blurry or distorted areas

Solutions:

Reduce inpaint radius: --inpaint-radius 1
Lower DPI: --dpi 150 (default is good for most documents)
Single pass: --multi-pass 1 (default)

Memory Issues

Symptoms: "Out of memory" error on large PDFs

Solutions:

Lower DPI: --dpi 100
Process specific pages: --pages 1-50 (then 51-100, etc.)
Increase available system memory
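
To see why lowering the DPI helps, it is useful to estimate the size of a rendered page. The sketch below is a back-of-the-envelope calculation, not a measurement of the tool: it assumes uncompressed 8-bit RGB images (3 bytes per pixel) and uses the fact that one PDF point is 1/72 inch. The function names are hypothetical.

```python
from typing import Tuple


def page_pixels(width_pt: float, height_pt: float, dpi: int) -> Tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI (72 points = 1 inch)."""
    scale = dpi / 72
    return round(width_pt * scale), round(height_pt * scale)


def estimated_rgb_bytes(width_pt: float, height_pt: float, dpi: int, pages: int = 1) -> int:
    """Approximate bytes for `pages` uncompressed RGB images held in memory."""
    width_px, height_px = page_pixels(width_pt, height_pt, dpi)
    return width_px * height_px * 3 * pages


# An A4 page measures 595 x 842 points.
for dpi in (100, 150, 300):
    megabytes = estimated_rgb_bytes(595, 842, dpi) / 1_000_000
    print(f"{dpi} DPI: {page_pixels(595, 842, dpi)} pixels, about {megabytes:.1f} MB per page")
```

Memory grows with the square of the DPI, so going from 300 to 150 DPI cuts the per-page footprint to roughly a quarter.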

License

MIT

Changelog

v0.1.0 (2025)

Initial Release - Production Ready

Core Features:

✅ Multi-stage watermark detection (adaptive thresholding + color analysis)
✅ Intelligent color classification (BACKGROUND/WATERMARK/TEXT/NOISE)
✅ OpenCV TELEA inpainting with dynamic parameters
✅ Interactive color selection with confidence scoring
✅ Multi-pass progressive watermark removal
✅ Support for batch processing and page ranges

Algorithm Improvements:

✅ Adaptive Gaussian thresholding (replaces simple Otsu)
✅ Background protection (excludes white/bright areas)
✅ Color space accuracy (RGB ↔ BGR proper handling)
✅ Dynamic inpaint radius based on coverage
✅ Morphological noise removal with connected components

Quality Assurance:

✅ ruff linting compliance
✅ Comprehensive error handling
✅ Detailed logging and statistics
✅ Verified on multiple PDF types

Contributing

Contributions welcome! Areas for enhancement:

GPU acceleration for large documents
Additional inpainting algorithms (e.g., Criminisi)
Batch API interface
Additional language support
Performance benchmarking
