Extract images from PDFs and create annotated versions with watermarks

PDFImageExtractAnnotate

A Python package for extracting images from PDF documents and creating annotated versions with watermarks showing the extracted image filenames.

Features

  • Image Extraction: Extract all images from PDF documents with configurable filters
  • Page-based Organization: Images are organized by page number for easy reference
  • Watermark Annotation: Add watermarks to the original PDF showing extracted image filenames
  • Flexible Filtering: Filter images by dimensions, file size, or relative compression
  • Azure Blob Storage Support: Optional support for storing images in Azure Blob Storage
  • Customizable Watermarks: Configure font size, color, background, and text format

Installation

From PyPI

pip install pdf-image-extract-annotate

From Source

git clone https://github.com/thijshakkenbergecolab/pdf-image-extract-annotate
cd pdf-image-extract-annotate
pip install -e .

With Azure Support

# Quote the extras so shells such as zsh do not expand the brackets
pip install "pdf-image-extract-annotate[azure]"

Quick Start

Basic Image Extraction

from pathlib import Path
from pdf_image_extract_annotate import PDFImageExtractor, ExtractionConfig

# Configure extraction
config = ExtractionConfig(
    output_dir="extracted_images",
    dim_limit=50,  # Minimum dimension in pixels
    abs_size=1000  # Minimum file size in bytes
)

# Extract images
extractor = PDFImageExtractor(config)
result = extractor.extract_all_images(Path("document.pdf"))

print(f"Extracted {result['images_extracted']} images")
print(f"Saved to: {result['output_directory']}")

Extract and Watermark PDF

from pathlib import Path
from pdf_image_extract_annotate import PDFImageWatermarker, WatermarkConfig

# Configure watermark appearance
watermark_config = WatermarkConfig(
    font_size=10,
    font_color=(1.0, 0.0, 0.0),  # Red text
    background_color=(1.0, 1.0, 1.0, 0.7),  # Semi-transparent white
    text_format="filename"  # Show just the filename
)

# Process PDF
watermarker = PDFImageWatermarker(
    pdf_path=Path("document.pdf"),
    watermark_config=watermark_config
)

result = watermarker.process_pdf_with_watermarks()

# Save the annotated PDF
result.output_pdf.save("annotated_document.pdf")
result.output_pdf.close()

print(f"Extracted {result.images_extracted} images")
print(f"Watermarked {result.images_watermarked} images")

Configuration Options

ExtractionConfig

  • output_dir (str): Directory to save extracted images
  • dim_limit (int): Minimum size in pixels of the image's smaller dimension (0 = no limit)
  • rel_size (float): Minimum ratio of image file size to uncompressed size, 0.0-1.0 (0 = no limit)
  • abs_size (int): Minimum image file size in bytes (0 = no limit)
  • blob_connection_string (str, optional): Azure Blob Storage connection string
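
The three size filters can be read as independent minimum thresholds that an image has to clear. The sketch below is only an illustration of that reading, not the package's actual implementation: ImageStats and passes_filters are hypothetical names, and the way the uncompressed size is obtained is left to the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageStats:
    """Hypothetical record of the measurements a filter would need."""
    width: int              # pixels
    height: int             # pixels
    stored_size: int        # bytes as stored in the PDF
    uncompressed_size: int  # bytes if stored without compression


def passes_filters(img: ImageStats, dim_limit: int = 0,
                   rel_size: float = 0.0, abs_size: int = 0) -> bool:
    """Return True if the image clears every enabled threshold.

    A threshold of 0 disables that check, mirroring the "0 = no limit" notes.
    """
    if dim_limit and min(img.width, img.height) < dim_limit:
        return False
    if abs_size and img.stored_size < abs_size:
        return False
    if rel_size and img.uncompressed_size > 0:
        if img.stored_size / img.uncompressed_size < rel_size:
            return False
    return True


# Illustrative values only: a small icon and a large photo.
icon = ImageStats(width=32, height=32, stored_size=900, uncompressed_size=3072)
photo = ImageStats(width=800, height=600, stored_size=950_000,
                   uncompressed_size=1_440_000)

print(passes_filters(icon, dim_limit=50, abs_size=1000))                    # False
print(passes_filters(photo, dim_limit=200, rel_size=0.5, abs_size=50_000))  # True
```

Treating 0 as "disabled" keeps the default configuration permissive, so every image is extracted unless a threshold is set explicitly.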

WatermarkConfig

  • font_size (int): Font size for watermark text
  • font_color (tuple): RGB text color, each component 0.0-1.0
  • background_color (tuple): RGBA background color, each component 0.0-1.0 (the fourth value sets opacity)
  • text_format (str): Format for watermark text ("filename", "filepath", or "custom")
  • padding (int): Padding around text in pixels
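
As a rough illustration of how these options might be interpreted, the sketch below validates a color tuple and picks the watermark label for an image path. Both helpers (validate_color, watermark_text) and the custom_text parameter are hypothetical; the package may validate its settings and resolve the "custom" format differently.

```python
from pathlib import PurePosixPath


def validate_color(color: tuple, channels: int) -> tuple:
    """Check that a color has the expected length and 0.0-1.0 components."""
    if len(color) != channels:
        raise ValueError(f"expected {channels} components, got {len(color)}")
    if any(not 0.0 <= c <= 1.0 for c in color):
        raise ValueError("color components must lie between 0.0 and 1.0")
    return tuple(float(c) for c in color)


def watermark_text(image_path: str, text_format: str = "filename",
                   custom_text: str = "") -> str:
    """Pick the label for one image (custom_text is an assumed parameter)."""
    path = PurePosixPath(image_path)
    if text_format == "filename":
        return path.name
    if text_format == "filepath":
        return str(path)
    if text_format == "custom":
        return custom_text
    raise ValueError(f"unknown text_format: {text_format!r}")


print(validate_color((1.0, 1.0, 1.0, 0.7), channels=4))
print(watermark_text("images/page_1/img00001.png"))              # img00001.png
print(watermark_text("images/page_1/img00001.png", "filepath"))  # images/page_1/img00001.png
```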

Output Structure

Images are organized using a page-based structure:

output_dir/
├── images/
│   ├── page_1/
│   │   ├── img00001.png
│   │   └── img00002.jpg
│   ├── page_2/
│   │   └── img00003.png
│   └── page_N/
│       └── imgXXXXX.ext
└── annotated_pdf.pdf  (if using watermarker)
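
If you need to locate an extracted image from your own code, the layout above suggests a path of the form output_dir/images/page_N/imgXXXXX.ext. The helper below is a hypothetical sketch based only on that tree: it assumes a five-digit, zero-padded image index, and the example numbering implies the index continues across pages rather than restarting on each page.

```python
from pathlib import PurePosixPath


def expected_image_path(output_dir: str, page_number: int,
                        image_index: int, ext: str) -> PurePosixPath:
    """Build the path an image would have under the layout shown above."""
    return (
        PurePosixPath(output_dir)
        / "images"
        / f"page_{page_number}"
        / f"img{image_index:05d}.{ext}"
    )


print(expected_image_path("extracted_images", 2, 3, "png"))
# extracted_images/images/page_2/img00003.png
```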

Advanced Usage

Using with Azure Blob Storage

from pathlib import Path
from pdf_image_extract_annotate import PDFImageExtractor, ExtractionConfig

config = ExtractionConfig(
    output_dir="my-container",
    blob_connection_string="DefaultEndpointsProtocol=https;..."
)

extractor = PDFImageExtractor(config)
result = extractor.extract_all_images(Path("document.pdf"))

Custom Image Filtering

from pdf_image_extract_annotate import PDFImageExtractor, ExtractionConfig

# Only extract large, high-quality images
config = ExtractionConfig(
    output_dir="high_quality_images",
    dim_limit=200,      # At least 200px in smallest dimension
    rel_size=0.5,       # At least 50% of uncompressed size
    abs_size=50000      # At least 50KB
)
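
To get a feel for how rel_size and abs_size interact, it can help to work out the smallest file size that clears both. The calculation below is a back-of-the-envelope sketch, not the package's logic: min_stored_bytes is a hypothetical helper, and it assumes the uncompressed size is width x height x channels bytes (8 bits per channel).

```python
import math


def min_stored_bytes(width: int, height: int, channels: int,
                     rel_size: float, abs_size: int) -> int:
    """Smallest stored size in bytes that satisfies both size thresholds."""
    uncompressed = width * height * channels  # assumes 8 bits per channel
    return max(abs_size, math.ceil(uncompressed * rel_size))


# 200x200 RGB image: 120000 bytes uncompressed, so rel_size=0.5 needs 60000 bytes
print(min_stored_bytes(200, 200, 3, rel_size=0.5, abs_size=50_000))  # 60000

# 200x200 grayscale image: rel_size needs only 20000 bytes, so abs_size dominates
print(min_stored_bytes(200, 200, 1, rel_size=0.5, abs_size=50_000))  # 50000
```

Under these assumptions, whichever threshold is larger for a given image is the one that decides whether it is kept.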

Dependency Injection in Larger Projects

from pathlib import Path
from pdf_image_extract_annotate import PDFImageWatermarker, ExtractionConfig, WatermarkConfig

class DocumentProcessor:
    def __init__(self, extraction_config: ExtractionConfig, watermark_config: WatermarkConfig):
        self.extraction_config = extraction_config
        self.watermark_config = watermark_config

    def process_document(self, pdf_path: Path):
        watermarker = PDFImageWatermarker(
            pdf_path=pdf_path,
            extraction_config=self.extraction_config,
            watermark_config=self.watermark_config
        )
        return watermarker.process_pdf_with_watermarks()

API Reference

Classes

  • PDFImageExtractor: Core image extraction functionality
  • PDFImageWatermarker: Extended extractor with watermarking capabilities
  • ExtractionConfig: Configuration for image extraction
  • WatermarkConfig: Configuration for watermark appearance
  • ImageMetadata: Metadata for extracted images
  • ImageWatermarkEntry: Entry for images with watermark information
  • WatermarkResult: Result of the PDF watermarking process

Requirements

  • Python 3.11+
  • PyMuPDF >= 1.23.0
  • pydantic >= 2.0.0
  • azure-storage-blob >= 12.0.0 (optional, for Azure support)

Development

Setting up development environment

# Clone the repository
git clone https://github.com/thijshakkenbergecolab/pdf-image-extract-annotate
cd pdf-image-extract-annotate

# Install in development mode with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Format code
black pdf_image_extract_annotate tests

# Type checking
mypy pdf_image_extract_annotate

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=pdf_image_extract_annotate

# Run specific test file
pytest tests/test_extractor.py

License

MIT License - see LICENSE file for details

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Support

For issues, questions, or suggestions, please open an issue on GitHub.
