Python bindings for MicroPDF - High-performance PDF manipulation library

MicroPDF Python Bindings

High-performance PDF manipulation library for Python with native Rust FFI bindings.

Features

  • 🚀 Fast - Powered by Rust and MuPDF
  • 🐍 Pythonic - Clean, idiomatic Python API
  • 🔧 Easy to Use - Simple API for common tasks
  • 🎯 Type-Safe - Full type hints with mypy support
  • 📦 Minimal Dependencies - cffi is the only runtime dependency
  • 🔒 Memory Safe - Automatic resource management

Installation

From Source

# Build the Rust library first
cd ../micropdf-rs
cargo build --release

# Install Python package
cd ../micropdf-py
pip install -e .

Requirements

  • Python 3.8+
  • cffi >= 1.16.0
  • Compiled micropdf-rs library

Quick Start

Easy API (Recommended for Beginners)

from micropdf import EasyPDF

# Extract text from all pages
text = EasyPDF.extract_text('document.pdf')
print(text)

# Extract text from specific page
text = EasyPDF.extract_text('document.pdf', page=0)

# Render page to PNG
EasyPDF.render_to_png('document.pdf', 'output.png', page=0, dpi=300)

# Get document info
info = EasyPDF.get_info('document.pdf')
print(f"Pages: {info.page_count}")
print(f"Title: {info.title}")
print(f"Author: {info.author}")

Fluent API with Context Manager

from micropdf import EasyPDF

with EasyPDF.open('document.pdf') as pdf:
    # Get info
    print(f"Pages: {pdf.page_count()}")
    print(f"Metadata: {pdf.get_metadata()}")

    # Extract text from all pages
    all_text = pdf.extract_all_text()

    # Extract text from specific page
    page_text = pdf.extract_page_text(0)

    # Search across all pages
    results = pdf.search_all('keyword')
    for result in results:
        print(f"Found on page {result['page_num']}: {result['bbox']}")

    # Render pages
    pdf.render_page(0, 'page0.png', dpi=300)

    # Render all pages
    paths = pdf.render_all_pages('output_dir', dpi=150)
    print(f"Generated {len(paths)} images")

Low-Level API (Advanced)

from micropdf import Context, Document, Pixmap, Colorspace, Matrix

# Create context
with Context() as ctx:
    # Open document
    with Document.open(ctx, 'document.pdf') as doc:
        print(f"Pages: {doc.page_count()}")
        print(f"Title: {doc.get_metadata('Title')}")

        # Load page
        with doc.load_page(0) as page:
            # Get page bounds
            bounds = page.bounds()
            print(f"Size: {bounds.width()} x {bounds.height()}")

            # Extract text
            text = page.extract_text()
            print(text[:100])

            # Render to pixmap
            matrix = Matrix.scale(2.0, 2.0)  # 2x scale = 144 DPI
            colorspace = Colorspace.device_rgb(ctx)

            with Pixmap.from_page(ctx, page, matrix, colorspace) as pix:
                print(f"Image: {pix.width()}x{pix.height()}")
                pix.save_png('output.png')

API Reference

Easy API

Static Methods

# One-liner operations
text = EasyPDF.extract_text('file.pdf')
text = EasyPDF.extract_text('file.pdf', page=0)
EasyPDF.render_to_png('in.pdf', 'out.png', page=0, dpi=300)
count = EasyPDF.get_page_count('file.pdf')
info = EasyPDF.get_info('file.pdf')

Instance Methods

with EasyPDF.open('file.pdf') as pdf:
    # Document info
    pages = pdf.page_count()
    encrypted = pdf.is_encrypted()
    metadata = pdf.get_metadata()
    info = pdf.get_info()

    # Text extraction
    all_text = pdf.extract_all_text()
    page_text = pdf.extract_page_text(0)

    # Search
    results = pdf.search_all('keyword')

    # Rendering
    pdf.render_page(0, 'page.png', dpi=300)
    paths = pdf.render_all_pages('output_dir', dpi=150)

    # Page info
    bounds = pdf.get_page_bounds(0)

Core Classes

Context

ctx = Context()  # Default 256 MB cache
ctx = Context(max_store=512 * 1024 * 1024)  # 512 MB

with Context() as ctx:
    # ... operations ...
    pass  # Auto-cleanup

Document

# Open from file
doc = Document.open(ctx, 'file.pdf')

# Open from bytes
with open('file.pdf', 'rb') as f:
    data = f.read()
doc = Document.from_bytes(ctx, data)

# Operations
page_count = doc.page_count()
is_encrypted = doc.needs_password()
success = doc.authenticate('password')
title = doc.get_metadata('Title')
page = doc.load_page(0)
doc.save('output.pdf')
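
The `needs_password()` and `authenticate()` calls above suggest a simple unlock flow. The sketch below is one way to wrap it, assuming `authenticate()` returns a truthy value on success. `unlock`, `PasswordProtected`, and `FakeDocument` are hypothetical names introduced here for illustration; `FakeDocument` only stands in for `Document` so the example runs without a PDF or the compiled library.

```python
from typing import Iterable, Optional, Protocol


class PasswordProtected(Protocol):
    """Structural type covering the two Document methods used below."""

    def needs_password(self) -> bool: ...

    def authenticate(self, password: str) -> bool: ...


def unlock(doc: PasswordProtected, candidates: Iterable[str]) -> Optional[str]:
    """Return the password that worked, '' if none was needed, or None on failure."""
    if not doc.needs_password():
        return ""
    for password in candidates:
        # Assumption: authenticate() is truthy when the password is accepted.
        if doc.authenticate(password):
            return password
    return None


class FakeDocument:
    """Hypothetical stand-in for Document; not part of micropdf."""

    def __init__(self, secret: str) -> None:
        self._secret = secret

    def needs_password(self) -> bool:
        return bool(self._secret)

    def authenticate(self, password: str) -> bool:
        return password == self._secret


print(unlock(FakeDocument("hunter2"), ["guess", "hunter2"]))  # prints hunter2
```

With a real document, the same function could be called as `unlock(doc, ['password'])` before loading pages.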

Page

page = doc.load_page(0)

# Get bounds
bounds = page.bounds()  # Returns Rect

# Extract text
text = page.extract_text()

# Search
quads = page.search_text('keyword', max_hits=512)

Pixmap

# Create pixmap
colorspace = Colorspace.device_rgb(ctx)
pix = Pixmap.create(ctx, colorspace, 100, 100, alpha=False)

# Render from page
matrix = Matrix.scale(2.0, 2.0)
pix = Pixmap.from_page(ctx, page, matrix, colorspace)

# Operations
width = pix.width()
height = pix.height()
components = pix.components()
raw_data = pix.samples()
png_data = pix.to_png()
pix.save_png('output.png')
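
To read individual pixels out of the raw data, it helps to know how the samples are laid out. The sketch below assumes a tightly packed, row-major buffer of 8-bit samples with no row padding; the actual layout returned by `pix.samples()` is not documented here, so treat that as an assumption. `sample_offset` and `pixel_at` are hypothetical helpers, and the buffer is synthetic.

```python
from typing import Tuple


def sample_offset(x: int, y: int, width: int, components: int) -> int:
    """Byte offset of pixel (x, y) in a packed, row-major sample buffer."""
    stride = width * components  # bytes per row, assuming no padding
    return y * stride + x * components


def pixel_at(samples: bytes, x: int, y: int, width: int, components: int) -> Tuple[int, ...]:
    """Return the component values of pixel (x, y) as a tuple."""
    start = sample_offset(x, y, width, components)
    return tuple(samples[start:start + components])


# Synthetic 4x2 RGB buffer standing in for pix.samples()
width, height, components = 4, 2, 3
samples = bytes(range(width * height * components))

print(len(samples))                                # 24
print(pixel_at(samples, 1, 1, width, components))  # (15, 16, 17)
```

Under the same assumption, a 100 x 100 RGB pixmap without alpha would hold 100 * 100 * 3 = 30000 bytes.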

Geometry

from micropdf.geometry import Point, Rect, Matrix, Quad

# Point
p = Point(10, 20)
other_point = Point(13, 24)
distance = p.distance(other_point)
p2 = p.transform(Matrix.scale(2.0, 2.0))

# Rect
r = Rect(0, 0, 612, 792)  # US Letter size, in points
width = r.width()
height = r.height()
area = r.area()
contains = r.contains(p)
r1 = Rect(0, 0, 100, 100)
r2 = Rect(50, 50, 150, 150)
intersection = r1.intersect(r2)
union = r1.union(r2)

# Matrix
m = Matrix.identity()
m = Matrix.scale(2.0, 2.0)
m = Matrix.translate(10, 20)
m = Matrix.rotate(90)
m1 = Matrix.scale(2.0, 2.0)
m2 = Matrix.translate(10, 20)
m3 = m1.concat(m2)
m3 = m1 @ m2  # Matrix multiplication

# Quad (for rotated text bounding boxes)
q = Quad.from_rect(r)
r = q.to_rect()
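
For readers new to affine transforms, the pure-Python sketch below shows what scaling, translating, and concatenating matrices does to a point. `Affine` is a hypothetical stand-in written for this example, not the library's `Matrix`; it uses the six-coefficient `(a, b, c, d, e, f)` form common to PDF tooling, and the order in which micropdf's `concat` and `@` apply their operands is an assumption to verify against the library.

```python
from typing import NamedTuple, Tuple


class Affine(NamedTuple):
    """Hypothetical 2D affine transform: x' = a*x + c*y + e, y' = b*x + d*y + f."""

    a: float
    b: float
    c: float
    d: float
    e: float
    f: float

    @classmethod
    def scale(cls, sx: float, sy: float) -> "Affine":
        return cls(sx, 0.0, 0.0, sy, 0.0, 0.0)

    @classmethod
    def translate(cls, tx: float, ty: float) -> "Affine":
        return cls(1.0, 0.0, 0.0, 1.0, tx, ty)

    def concat(self, other: "Affine") -> "Affine":
        """Return the transform that applies self first, then other."""
        return Affine(
            self.a * other.a + self.b * other.c,
            self.a * other.b + self.b * other.d,
            self.c * other.a + self.d * other.c,
            self.c * other.b + self.d * other.d,
            self.e * other.a + self.f * other.c + other.e,
            self.e * other.b + self.f * other.d + other.f,
        )

    def apply(self, x: float, y: float) -> Tuple[float, float]:
        return (x * self.a + y * self.c + self.e,
                x * self.b + y * self.d + self.f)


scale_then_move = Affine.scale(2.0, 2.0).concat(Affine.translate(10, 20))
move_then_scale = Affine.translate(10, 20).concat(Affine.scale(2.0, 2.0))

print(scale_then_move.apply(1, 1))  # (12.0, 22.0)
print(move_then_scale.apply(1, 1))  # (22.0, 42.0)
```

The two results differ because concatenation is not commutative, which is why the order of `m1` and `m2` matters when building a render matrix.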

Colorspace

gray = Colorspace.device_gray(ctx)
rgb = Colorspace.device_rgb(ctx)
bgr = Colorspace.device_bgr(ctx)
cmyk = Colorspace.device_cmyk(ctx)

n = colorspace.components()  # 1, 3, or 4
name = colorspace.name()  # "DeviceRGB", etc.

Examples

Extract Text from All Pages

from micropdf import EasyPDF

text = EasyPDF.extract_text('document.pdf')
print(text)

Render All Pages to PNG

from micropdf import EasyPDF

with EasyPDF.open('document.pdf') as pdf:
    paths = pdf.render_all_pages(
        output_dir='pages',
        prefix='page',
        dpi=300
    )
    print(f"Generated {len(paths)} images")

Search and Highlight

from micropdf import EasyPDF

with EasyPDF.open('document.pdf') as pdf:
    results = pdf.search_all('Python')

    for result in results:
        page_num = result['page_num']
        bbox = result['bbox']
        print(f"Found 'Python' on page {page_num} at {bbox}")
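
Search results often need to be grouped by page before highlighting or reporting. The sketch below groups result dictionaries with the `page_num` and `bbox` keys shown above. `group_by_page` is a hypothetical helper, and the sample data, including the four-number `bbox` tuples, is made up for illustration; the real `bbox` type may differ.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def group_by_page(results: Iterable[Dict[str, Any]]) -> Dict[int, List[Any]]:
    """Collect the bounding boxes of search hits under their page number."""
    pages: Dict[int, List[Any]] = defaultdict(list)
    for result in results:
        pages[result['page_num']].append(result['bbox'])
    return dict(pages)


# Made-up results standing in for pdf.search_all('Python')
sample_results = [
    {'page_num': 0, 'bbox': (72.0, 100.0, 120.0, 112.0)},
    {'page_num': 0, 'bbox': (72.0, 300.0, 120.0, 312.0)},
    {'page_num': 3, 'bbox': (90.0, 50.0, 138.0, 62.0)},
]

grouped = group_by_page(sample_results)
hits_per_page = {page: len(boxes) for page, boxes in grouped.items()}
print(hits_per_page)  # {0: 2, 3: 1}
```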

Password-Protected PDFs

from micropdf import EasyPDF

with EasyPDF.open_with_password('secret.pdf', 'password') as pdf:
    text = pdf.extract_all_text()
    print(text)

Custom DPI Rendering

from micropdf import Context, Document, Matrix, Pixmap, Colorspace

with Context() as ctx:
    with Document.open(ctx, 'document.pdf') as doc:
        with doc.load_page(0) as page:
            # 300 DPI = 300/72 scale factor
            dpi = 300
            scale = dpi / 72.0
            matrix = Matrix.scale(scale, scale)
            colorspace = Colorspace.device_rgb(ctx)

            with Pixmap.from_page(ctx, page, matrix, colorspace) as pix:
                pix.save_png('high_res.png')
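
The DPI arithmetic above can be checked without opening a PDF. The sketch below converts a DPI value to a scale factor and estimates the output size in pixels for a page measured in points (1 point = 1/72 inch). `scale_for_dpi` and `pixel_size` are hypothetical helpers, and rounding to the nearest pixel is an assumption; the renderer may round differently.

```python
from typing import Tuple

POINTS_PER_INCH = 72.0


def scale_for_dpi(dpi: float) -> float:
    """Scale factor to pass to Matrix.scale() for a target DPI."""
    return dpi / POINTS_PER_INCH


def pixel_size(width_pt: float, height_pt: float, dpi: float) -> Tuple[int, int]:
    """Approximate rendered size in pixels, rounded to the nearest pixel."""
    return (round(width_pt * dpi / POINTS_PER_INCH),
            round(height_pt * dpi / POINTS_PER_INCH))


print(scale_for_dpi(144))         # 2.0
print(pixel_size(612, 792, 300))  # (2550, 3300) for US Letter at 300 DPI
print(pixel_size(612, 792, 150))  # (1275, 1650)
```

At 300 DPI a US Letter page is roughly 8.4 million pixels, so a lower DPI is worth considering when rendering many pages.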

Development

Setup Development Environment

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=micropdf --cov-report=html

# Type checking
mypy src/micropdf

# Linting
ruff check src/micropdf

# Formatting
black src/micropdf tests

Building Documentation

cd docs
make html
open _build/html/index.html  # macOS; on Linux use xdg-open

Architecture

  • FFI Layer (ffi.py): Low-level cffi bindings to Rust library
  • Core Classes: Pythonic wrappers around FFI handles
  • Easy API (easy.py): High-level, simplified interface
  • Automatic Cleanup: Context managers for resource management
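
The "Automatic Cleanup" layer can be pictured with a small pure-Python sketch of the context-manager pattern. This is not the library's implementation: `NativeHandle` and its `_live` registry are hypothetical and only imitate a resource owned by native code, to show that `__exit__` releases it even when the block raises.

```python
import itertools


class NativeHandle:
    """Hypothetical wrapper around a resource owned by native code."""

    _live = set()              # ids of handles that are still open
    _ids = itertools.count(1)  # source of fake handle ids

    def __init__(self) -> None:
        self._id = next(NativeHandle._ids)
        NativeHandle._live.add(self._id)

    def close(self) -> None:
        """Release the handle; calling close() twice is harmless."""
        if self._id is not None:
            NativeHandle._live.discard(self._id)
            self._id = None

    def __enter__(self) -> "NativeHandle":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.close()
        return False  # let any exception from the with-block propagate


with NativeHandle() as handle:
    print(len(NativeHandle._live))  # 1
print(len(NativeHandle._live))      # 0
```

Nesting such wrappers, as in the Low-Level API examples, releases inner resources before outer ones.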

Performance

The MicroPDF Python bindings aim for near-native performance by:

  1. Using cffi for efficient C interop
  2. Minimizing Python/Rust boundary crossings
  3. Leveraging Rust's zero-cost abstractions
  4. Direct memory access for pixel data

Comparison with Other Libraries

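| Feature    | MicroPDF | PyMuPDF    | pdfplumber | PyPDF2     |
|------------|----------|------------|------------|------------|
| Speed      | ⚡⚡⚡      | ⚡⚡⚡        |            |            |
| Memory     | ✅ Low   | ✅ Low     | ⚠️ High    | ⚠️ High    |
| Type Hints | ✅ Full  | ⚠️ Partial | ✅ Full    | ⚠️ Partial |
| Easy API   | ✅ Yes   | ❌ No      | ✅ Yes     | ❌ No      |
| Rendering  | ✅ Yes   | ✅ Yes     | ❌ No      | ❌ No      |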

License

Apache 2.0
