Skip to main content

Python bindings for MicroPDF - High-performance PDF manipulation library

Project description

MicroPDF Python Bindings

High-performance PDF manipulation library for Python with native Rust FFI bindings.

Features

  • 🚀 Fast - Powered by Rust and MuPDF
  • 🐍 Pythonic - Clean, idiomatic Python API
  • 🔧 Easy to Use - Simple API for common tasks
  • 🎯 Type-Safe - Full type hints with mypy support
  • 📦 Zero Dependencies - Only requires cffi
  • 🔒 Memory Safe - Automatic resource management

Installation

From Source

# Build the Rust library first
cd ../micropdf-rs
cargo build --release

# Install Python package
cd ../micropdf-py
pip install -e .

Requirements

  • Python 3.8+
  • cffi >= 1.16.0
  • Compiled micropdf-rs library

Quick Start

Easy API (Recommended for Beginners)

from micropdf import EasyPDF

# Extract text from all pages
text = EasyPDF.extract_text('document.pdf')
print(text)

# Extract text from specific page
text = EasyPDF.extract_text('document.pdf', page=0)

# Render page to PNG
EasyPDF.render_to_png('document.pdf', 'output.png', page=0, dpi=300)

# Get document info
info = EasyPDF.get_info('document.pdf')
print(f"Pages: {info.page_count}")
print(f"Title: {info.title}")
print(f"Author: {info.author}")

Fluent API with Context Manager

from micropdf import EasyPDF

with EasyPDF.open('document.pdf') as pdf:
    # Get info
    print(f"Pages: {pdf.page_count()}")
    print(f"Metadata: {pdf.get_metadata()}")

    # Extract text from all pages
    all_text = pdf.extract_all_text()

    # Extract text from specific page
    page_text = pdf.extract_page_text(0)

    # Search across all pages
    results = pdf.search_all('keyword')
    for result in results:
        print(f"Found on page {result['page_num']}: {result['bbox']}")

    # Render pages
    pdf.render_page(0, 'page0.png', dpi=300)

    # Render all pages
    paths = pdf.render_all_pages('output_dir', dpi=150)
    print(f"Generated {len(paths)} images")

Low-Level API (Advanced)

from micropdf import Context, Document, Pixmap, Colorspace, Matrix

# Create context
with Context() as ctx:
    # Open document
    with Document.open(ctx, 'document.pdf') as doc:
        print(f"Pages: {doc.page_count()}")
        print(f"Title: {doc.get_metadata('Title')}")

        # Load page
        with doc.load_page(0) as page:
            # Get page bounds
            bounds = page.bounds()
            print(f"Size: {bounds.width()} x {bounds.height()}")

            # Extract text
            text = page.extract_text()
            print(text[:100])

            # Render to pixmap
            matrix = Matrix.scale(2.0, 2.0)  # 2x scale = 144 DPI
            colorspace = Colorspace.device_rgb(ctx)

            with Pixmap.from_page(ctx, page, matrix, colorspace) as pix:
                print(f"Image: {pix.width()}x{pix.height()}")
                pix.save_png('output.png')

API Reference

Easy API

Static Methods

# One-liner operations
text = EasyPDF.extract_text('file.pdf')
text = EasyPDF.extract_text('file.pdf', page=0)
EasyPDF.render_to_png('in.pdf', 'out.png', page=0, dpi=300)
count = EasyPDF.get_page_count('file.pdf')
info = EasyPDF.get_info('file.pdf')

Instance Methods

with EasyPDF.open('file.pdf') as pdf:
    # Document info
    pages = pdf.page_count()
    encrypted = pdf.is_encrypted()
    metadata = pdf.get_metadata()
    info = pdf.get_info()

    # Text extraction
    all_text = pdf.extract_all_text()
    page_text = pdf.extract_page_text(0)

    # Search
    results = pdf.search_all('keyword')

    # Rendering
    pdf.render_page(0, 'page.png', dpi=300)
    paths = pdf.render_all_pages('output_dir', dpi=150)

    # Page info
    bounds = pdf.get_page_bounds(0)

Core Classes

Context

ctx = Context()  # Default 256 MB cache
ctx = Context(max_store=512 * 1024 * 1024)  # 512 MB

with Context() as ctx:
    # ... operations ...
    pass  # Auto-cleanup

Document

# Open from file
doc = Document.open(ctx, 'file.pdf')

# Open from bytes
with open('file.pdf', 'rb') as f:
    data = f.read()
doc = Document.from_bytes(ctx, data)

# Operations
page_count = doc.page_count()
is_encrypted = doc.needs_password()
success = doc.authenticate('password')
title = doc.get_metadata('Title')
page = doc.load_page(0)
doc.save('output.pdf')

Page

page = doc.load_page(0)

# Get bounds
bounds = page.bounds()  # Returns Rect

# Extract text
text = page.extract_text()

# Search
quads = page.search_text('keyword', max_hits=512)

Pixmap

# Create pixmap
colorspace = Colorspace.device_rgb(ctx)
pix = Pixmap.create(ctx, colorspace, 100, 100, alpha=False)

# Render from page
matrix = Matrix.scale(2.0, 2.0)
pix = Pixmap.from_page(ctx, page, matrix, colorspace)

# Operations
width = pix.width()
height = pix.height()
components = pix.components()
raw_data = pix.samples()
png_data = pix.to_png()
pix.save_png('output.png')

Geometry

from micropdf.geometry import Point, Rect, Matrix, Quad

# Point
p = Point(10, 20)
distance = p.distance(other_point)
p2 = p.transform(matrix)

# Rect
r = Rect(0, 0, 612, 792)  # US Letter size
width = r.width()
height = r.height()
area = r.area()
contains = r.contains(point)
intersection = r1.intersect(r2)
union = r1.union(r2)

# Matrix
m = Matrix.identity()
m = Matrix.scale(2.0, 2.0)
m = Matrix.translate(10, 20)
m = Matrix.rotate(90)
m3 = m1.concat(m2)
m3 = m1 @ m2  # Matrix multiplication

# Quad (for rotated text bounding boxes)
q = Quad.from_rect(rect)
r = q.to_rect()

Colorspace

gray = Colorspace.device_gray(ctx)
rgb = Colorspace.device_rgb(ctx)
bgr = Colorspace.device_bgr(ctx)
cmyk = Colorspace.device_cmyk(ctx)

n = colorspace.components()  # 1, 3, or 4
name = colorspace.name()  # "DeviceRGB", etc.

Examples

Extract Text from All Pages

from micropdf import EasyPDF

text = EasyPDF.extract_text('document.pdf')
print(text)

Render All Pages to PNG

from micropdf import EasyPDF

with EasyPDF.open('document.pdf') as pdf:
    paths = pdf.render_all_pages(
        output_dir='pages',
        prefix='page',
        dpi=300
    )
    print(f"Generated {len(paths)} images")

Search and Highlight

from micropdf import EasyPDF

with EasyPDF.open('document.pdf') as pdf:
    results = pdf.search_all('Python')

    for result in results:
        page_num = result['page_num']
        bbox = result['bbox']
        print(f"Found 'Python' on page {page_num} at {bbox}")

Password-Protected PDFs

from micropdf import EasyPDF

with EasyPDF.open_with_password('secret.pdf', 'password') as pdf:
    text = pdf.extract_all_text()
    print(text)

Custom DPI Rendering

from micropdf import Context, Document, Matrix, Pixmap, Colorspace

with Context() as ctx:
    with Document.open(ctx, 'document.pdf') as doc:
        with doc.load_page(0) as page:
            # 300 DPI = 300/72 scale factor
            dpi = 300
            scale = dpi / 72.0
            matrix = Matrix.scale(scale, scale)
            colorspace = Colorspace.device_rgb(ctx)

            with Pixmap.from_page(ctx, page, matrix, colorspace) as pix:
                pix.save_png('high_res.png')

Development

Setup Development Environment

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=micropdf --cov-report=html

# Type checking
mypy src/micropdf

# Linting
ruff check src/micropdf

# Formatting
black src/micropdf tests

Building Documentation

cd docs
make html
open _build/html/index.html

Architecture

  • FFI Layer (ffi.py): Low-level cffi bindings to Rust library
  • Core Classes: Pythonic wrappers around FFI handles
  • Easy API (easy.py): High-level, simplified interface
  • Automatic Cleanup: Context managers for resource management

Performance

MicroPDF Python bindings provide near-native performance by:

  1. Using cffi for efficient C interop
  2. Minimizing Python/Rust boundary crossings
  3. Leveraging Rust's zero-cost abstractions
  4. Direct memory access for pixel data

Comparison with Other Libraries

Feature MicroPDF PyMuPDF pdfplumber PyPDF2
Speed ⚡⚡⚡ ⚡⚡⚡
Memory ✅ Low ✅ Low ⚠️ High ⚠️ High
Type Hints ✅ Full ⚠️ Partial ✅ Full ⚠️ Partial
Easy API ✅ Yes ❌ No ✅ Yes ❌ No
Rendering ✅ Yes ✅ Yes ❌ No ❌ No

License

Apache 2.0

Links

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

micropdf-0.9.0.tar.gz (66.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

micropdf-0.9.0-py3-none-any.whl (70.5 kB view details)

Uploaded Python 3

File details

Details for the file micropdf-0.9.0.tar.gz.

File metadata

  • Download URL: micropdf-0.9.0.tar.gz
  • Upload date:
  • Size: 66.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for micropdf-0.9.0.tar.gz
Algorithm Hash digest
SHA256 74f609e3b6a30ce8d44104033a6a50e85b4823d51cb8912b777ced645678c0ea
MD5 ded05825b553c31ca163be3b93053609
BLAKE2b-256 bdbc340cdd2133af1bc3ef59c89ef25f77408ba7707149797a8f0e4a66e2f159

See more details on using hashes here.

File details

Details for the file micropdf-0.9.0-py3-none-any.whl.

File metadata

  • Download URL: micropdf-0.9.0-py3-none-any.whl
  • Upload date:
  • Size: 70.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for micropdf-0.9.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e25b4000247f3f8f7fe21741f86bdb0ff5af58c69ec94ba5a1213478db706b84
MD5 2953f80e0a9d40d2532df2294aff54e4
BLAKE2b-256 b606b034b59c9da312bd58ea80c54196783169e14d6ce9878d9bdd3d8cc6faa3

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page