Python bindings for MicroPDF - High-performance PDF manipulation library
MicroPDF Python Bindings
High-performance PDF manipulation library for Python with native Rust FFI bindings.
Features
- 🚀 Fast - Powered by Rust and MuPDF
- 🐍 Pythonic - Clean, idiomatic Python API
- 🔧 Easy to Use - Simple API for common tasks
- 🎯 Type-Safe - Full type hints with mypy support
- 📦 Minimal Dependencies - Only requires cffi
- 🔒 Memory Safe - Automatic resource management
Installation
From Source
# Build the Rust library first
cd ../micropdf-rs
cargo build --release
# Install Python package
cd ../micropdf-py
pip install -e .
Requirements
- Python 3.8+
- cffi >= 1.16.0
- Compiled micropdf-rs library
Quick Start
Easy API (Recommended for Beginners)
from micropdf import EasyPDF
# Extract text from all pages
text = EasyPDF.extract_text('document.pdf')
print(text)
# Extract text from specific page
text = EasyPDF.extract_text('document.pdf', page=0)
# Render page to PNG
EasyPDF.render_to_png('document.pdf', 'output.png', page=0, dpi=300)
# Get document info
info = EasyPDF.get_info('document.pdf')
print(f"Pages: {info.page_count}")
print(f"Title: {info.title}")
print(f"Author: {info.author}")
Fluent API with Context Manager
from micropdf import EasyPDF
with EasyPDF.open('document.pdf') as pdf:
    # Get info
    print(f"Pages: {pdf.page_count()}")
    print(f"Metadata: {pdf.get_metadata()}")
    # Extract text from all pages
    all_text = pdf.extract_all_text()
    # Extract text from specific page
    page_text = pdf.extract_page_text(0)
    # Search across all pages
    results = pdf.search_all('keyword')
    for result in results:
        print(f"Found on page {result['page_num']}: {result['bbox']}")
    # Render pages
    pdf.render_page(0, 'page0.png', dpi=300)
    # Render all pages
    paths = pdf.render_all_pages('output_dir', dpi=150)
    print(f"Generated {len(paths)} images")
Low-Level API (Advanced)
from micropdf import Context, Document, Pixmap, Colorspace, Matrix
# Create context
with Context() as ctx:
    # Open document
    with Document.open(ctx, 'document.pdf') as doc:
        print(f"Pages: {doc.page_count()}")
        print(f"Title: {doc.get_metadata('Title')}")
        # Load page
        with doc.load_page(0) as page:
            # Get page bounds
            bounds = page.bounds()
            print(f"Size: {bounds.width()} x {bounds.height()}")
            # Extract text
            text = page.extract_text()
            print(text[:100])
            # Render to pixmap
            matrix = Matrix.scale(2.0, 2.0)  # 2x scale = 144 DPI
            colorspace = Colorspace.device_rgb(ctx)
            with Pixmap.from_page(ctx, page, matrix, colorspace) as pix:
                print(f"Image: {pix.width()}x{pix.height()}")
                pix.save_png('output.png')
API Reference
Easy API
Static Methods
# One-liner operations
text = EasyPDF.extract_text('file.pdf')
text = EasyPDF.extract_text('file.pdf', page=0)
EasyPDF.render_to_png('in.pdf', 'out.png', page=0, dpi=300)
count = EasyPDF.get_page_count('file.pdf')
info = EasyPDF.get_info('file.pdf')
Instance Methods
with EasyPDF.open('file.pdf') as pdf:
    # Document info
    pages = pdf.page_count()
    encrypted = pdf.is_encrypted()
    metadata = pdf.get_metadata()
    info = pdf.get_info()
    # Text extraction
    all_text = pdf.extract_all_text()
    page_text = pdf.extract_page_text(0)
    # Search
    results = pdf.search_all('keyword')
    # Rendering
    pdf.render_page(0, 'page.png', dpi=300)
    paths = pdf.render_all_pages('output_dir', dpi=150)
    # Page info
    bounds = pdf.get_page_bounds(0)
Core Classes
Context
ctx = Context() # Default 256 MB cache
ctx = Context(max_store=512 * 1024 * 1024) # 512 MB
with Context() as ctx:
    # ... operations ...
    pass  # Auto-cleanup
Document
# Open from file
doc = Document.open(ctx, 'file.pdf')
# Open from bytes
with open('file.pdf', 'rb') as f:
    data = f.read()
doc = Document.from_bytes(ctx, data)
# Operations
page_count = doc.page_count()
is_encrypted = doc.needs_password()
success = doc.authenticate('password')
title = doc.get_metadata('Title')
page = doc.load_page(0)
doc.save('output.pdf')
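The `needs_password()` and `authenticate()` calls listed above suggest a check-then-unlock flow. The following is a minimal sketch of that flow, not the library's own logic: `unlock` is a hypothetical helper, `FakeDocument` is a stand-in stub so the snippet runs without MicroPDF installed, and treating the return value of `authenticate()` as truthy on success is an assumption based on the `success = ...` line.

```python
def unlock(doc, password=None):
    """Return True if the document is readable, trying `password` when one is required.

    `doc` is any object exposing needs_password() and authenticate(password),
    as the Document class above does.
    """
    if not doc.needs_password():
        return True  # Not encrypted, nothing to do
    if password is None:
        return False  # Encrypted, but no password was supplied
    return bool(doc.authenticate(password))


class FakeDocument:
    """Stand-in for a Document; exists only so this sketch is runnable on its own."""

    def __init__(self, secret=None):
        self._secret = secret

    def needs_password(self):
        return self._secret is not None

    def authenticate(self, password):
        return password == self._secret


print(unlock(FakeDocument()))                    # True
print(unlock(FakeDocument('s3cret')))            # False
print(unlock(FakeDocument('s3cret'), 's3cret'))  # True
```

With a real document, the same helper would be called right after `Document.open(...)` and before any page is loaded.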
Page
page = doc.load_page(0)
# Get bounds
bounds = page.bounds() # Returns Rect
# Extract text
text = page.extract_text()
# Search
quads = page.search_text('keyword', max_hits=512)
Pixmap
# Create pixmap
colorspace = Colorspace.device_rgb(ctx)
pix = Pixmap.create(ctx, colorspace, 100, 100, alpha=False)
# Render from page
matrix = Matrix.scale(2.0, 2.0)
pix = Pixmap.from_page(ctx, page, matrix, colorspace)
# Operations
width = pix.width()
height = pix.height()
components = pix.components()
raw_data = pix.samples()
png_data = pix.to_png()
pix.save_png('output.png')
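To make the relationship between `width()`, `height()`, `components()` and `samples()` concrete, here is a small sketch of reading one pixel from a raw buffer. It assumes the samples are tightly packed, row-major, one byte per component, with no row padding; that layout is an assumption and is not confirmed here for MicroPDF. `pixel_at` is a hypothetical helper and the buffer is synthetic.

```python
def pixel_at(samples, width, height, components, x, y):
    """Return the component tuple of pixel (x, y) from a tightly packed buffer."""
    if not (0 <= x < width and 0 <= y < height):
        raise IndexError(f"pixel ({x}, {y}) is outside {width}x{height}")
    expected = width * height * components
    if len(samples) != expected:
        # A mismatch would mean the packing assumption does not hold
        raise ValueError(f"expected {expected} bytes, got {len(samples)}")
    offset = (y * width + x) * components
    return tuple(samples[offset:offset + components])


# Synthetic 2x2 RGB image: red, green on the first row; blue, white on the second
buf = bytes([255, 0, 0,  0, 255, 0,
             0, 0, 255,  255, 255, 255])

print(pixel_at(buf, 2, 2, 3, 1, 0))  # (0, 255, 0)
print(pixel_at(buf, 2, 2, 3, 0, 1))  # (0, 0, 255)
```

If a pixmap uses padded rows or an alpha channel, the offset calculation would need the actual stride and component count instead.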
Geometry
from micropdf.geometry import Point, Rect, Matrix, Quad
# Point
p = Point(10, 20)
distance = p.distance(other_point)
p2 = p.transform(matrix)
# Rect
r = Rect(0, 0, 612, 792) # US Letter size
width = r.width()
height = r.height()
area = r.area()
contains = r.contains(point)
intersection = r1.intersect(r2)
union = r1.union(r2)
# Matrix
m = Matrix.identity()
m = Matrix.scale(2.0, 2.0)
m = Matrix.translate(10, 20)
m = Matrix.rotate(90)
m3 = m1.concat(m2)
m3 = m1 @ m2 # Matrix multiplication
# Quad (for rotated text bounding boxes)
q = Quad.from_rect(rect)
r = q.to_rect()
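The matrix operations above are 2D affine transforms. The sketch below shows the underlying arithmetic using the six-value `[a b c d e f]` form from the PDF specification, where a point maps to `(a*x + c*y + e, b*x + d*y + f)`. `Affine` is a hypothetical stand-in class, not MicroPDF's `Matrix`, and the assumption that `m1.concat(m2)` and `m1 @ m2` mean "apply `m1` first, then `m2`" should be checked against the library.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Affine:
    """2D affine transform stored as the six values [a b c d e f]."""
    a: float = 1.0
    b: float = 0.0
    c: float = 0.0
    d: float = 1.0
    e: float = 0.0
    f: float = 0.0

    @staticmethod
    def scale(sx, sy):
        return Affine(sx, 0.0, 0.0, sy, 0.0, 0.0)

    @staticmethod
    def translate(tx, ty):
        return Affine(1.0, 0.0, 0.0, 1.0, tx, ty)

    @staticmethod
    def rotate(degrees):
        # Positive angles rotate counter-clockwise when the y axis points up
        r = math.radians(degrees)
        s, c = math.sin(r), math.cos(r)
        return Affine(c, s, -s, c, 0.0, 0.0)

    def concat(self, other):
        """Return the transform that applies self first, then other."""
        return Affine(
            self.a * other.a + self.b * other.c,
            self.a * other.b + self.b * other.d,
            self.c * other.a + self.d * other.c,
            self.c * other.b + self.d * other.d,
            self.e * other.a + self.f * other.c + other.e,
            self.e * other.b + self.f * other.d + other.f,
        )

    def __matmul__(self, other):
        return self.concat(other)

    def apply(self, x, y):
        return (x * self.a + y * self.c + self.e,
                x * self.b + y * self.d + self.f)


# Order matters: scaling then translating differs from translating then scaling
print((Affine.scale(2, 2) @ Affine.translate(10, 20)).apply(1, 1))  # (12.0, 22.0)
print((Affine.translate(10, 20) @ Affine.scale(2, 2)).apply(1, 1))  # (22.0, 42.0)
```

The two printed results differ because concatenation is not commutative, which is worth keeping in mind when combining a DPI scale with a rotation or offset.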
Colorspace
gray = Colorspace.device_gray(ctx)
rgb = Colorspace.device_rgb(ctx)
bgr = Colorspace.device_bgr(ctx)
cmyk = Colorspace.device_cmyk(ctx)
n = colorspace.components() # 1, 3, or 4
name = colorspace.name() # "DeviceRGB", etc.
Examples
Extract Text from All Pages
from micropdf import EasyPDF
text = EasyPDF.extract_text('document.pdf')
print(text)
Render All Pages to PNG
from micropdf import EasyPDF
with EasyPDF.open('document.pdf') as pdf:
    paths = pdf.render_all_pages(
        output_dir='pages',
        prefix='page',
        dpi=300
    )
    print(f"Generated {len(paths)} images")
Search and Highlight
from micropdf import EasyPDF
with EasyPDF.open('document.pdf') as pdf:
    results = pdf.search_all('Python')
    for result in results:
        page_num = result['page_num']
        bbox = result['bbox']
        print(f"Found 'Python' on page {page_num} at {bbox}")
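Since each search result is a dict with `page_num` and `bbox` keys, the hits can be grouped per page before further processing such as highlighting. This is a minimal sketch on sample data; `group_hits_by_page` is a hypothetical helper, and representing `bbox` as an `(x0, y0, x1, y1)` tuple is an assumption about the result format.

```python
from collections import defaultdict


def group_hits_by_page(results):
    """Map each page number to the list of bounding boxes found on it."""
    grouped = defaultdict(list)
    for hit in results:
        grouped[hit['page_num']].append(hit['bbox'])
    # Return a plain dict ordered by page number
    return dict(sorted(grouped.items()))


# Sample data shaped like the search_all() results shown above
sample_results = [
    {'page_num': 2, 'bbox': (72.0, 100.0, 120.0, 112.0)},
    {'page_num': 0, 'bbox': (72.0, 50.0, 118.0, 62.0)},
    {'page_num': 2, 'bbox': (72.0, 300.0, 120.0, 312.0)},
]

for page_num, boxes in group_hits_by_page(sample_results).items():
    print(f"Page {page_num}: {len(boxes)} hit(s)")
```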
Password-Protected PDFs
from micropdf import EasyPDF
with EasyPDF.open_with_password('secret.pdf', 'password') as pdf:
    text = pdf.extract_all_text()
    print(text)
Custom DPI Rendering
from micropdf import Context, Document, Matrix, Pixmap, Colorspace
with Context() as ctx:
    with Document.open(ctx, 'document.pdf') as doc:
        with doc.load_page(0) as page:
            # 300 DPI = 300/72 scale factor
            dpi = 300
            scale = dpi / 72.0
            matrix = Matrix.scale(scale, scale)
            colorspace = Colorspace.device_rgb(ctx)
            with Pixmap.from_page(ctx, page, matrix, colorspace) as pix:
                pix.save_png('high_res.png')
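The scale factor in this example follows from PDF user space having 72 units per inch. The arithmetic can be sketched on its own, without MicroPDF; `dpi_to_scale` and `pixel_size` are hypothetical helper names, and rounding to the nearest pixel is an assumption, since the renderer may round differently.

```python
PDF_POINTS_PER_INCH = 72.0


def dpi_to_scale(dpi):
    """Convert a target DPI to the scale factor passed to Matrix.scale()."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return dpi / PDF_POINTS_PER_INCH


def pixel_size(width_pt, height_pt, dpi):
    """Estimate the rendered size in pixels of a page given in points."""
    return (round(width_pt * dpi / PDF_POINTS_PER_INCH),
            round(height_pt * dpi / PDF_POINTS_PER_INCH))


print(dpi_to_scale(144))          # 2.0
print(pixel_size(612, 792, 300))  # (2550, 3300) for US Letter at 300 DPI
```

Estimating the pixel size up front is a cheap way to judge memory use before rendering a large page at high resolution.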
Development
Setup Development Environment
# Install with dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run tests with coverage
pytest --cov=micropdf --cov-report=html
# Type checking
mypy src/micropdf
# Linting
ruff check src/micropdf
# Formatting
black src/micropdf tests
Building Documentation
cd docs
make html
open _build/html/index.html
Architecture
- FFI Layer (ffi.py): Low-level cffi bindings to Rust library
- Core Classes: Pythonic wrappers around FFI handles
- Easy API (easy.py): High-level, simplified interface
- Automatic Cleanup: Context managers for resource management
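The "Automatic Cleanup" item describes a common pattern for wrapping native handles. The sketch below illustrates that pattern in general terms; it is not MicroPDF's actual implementation. `NativeHandle` is a hypothetical class, and the `release` callback stands in for whatever FFI function frees the underlying resource.

```python
class NativeHandle:
    """Hypothetical wrapper that releases a native handle exactly once."""

    def __init__(self, handle, release):
        self._handle = handle
        self._release = release  # Stand-in for an FFI "drop" function

    @property
    def closed(self):
        return self._handle is None

    def close(self):
        # Idempotent: a second call is a no-op, so double-free cannot happen
        if self._handle is not None:
            self._release(self._handle)
            self._handle = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # Do not swallow exceptions raised inside the block


released = []
with NativeHandle(42, released.append) as h:
    print(h.closed)  # False

print(h.closed)   # True
print(released)   # [42]
```

Making `close()` idempotent means the handle is released once whether the caller uses `with`, calls `close()` explicitly, or does both.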
Performance
MicroPDF Python bindings provide near-native performance by:
- Using cffi for efficient C interop
- Minimizing Python/Rust boundary crossings
- Leveraging Rust's zero-cost abstractions
- Direct memory access for pixel data
Comparison with Other Libraries
| Feature | MicroPDF | PyMuPDF | pdfplumber | PyPDF2 |
|---|---|---|---|---|
| Speed | ⚡⚡⚡ | ⚡⚡⚡ | ⚡ | ⚡ |
| Memory | ✅ Low | ✅ Low | ⚠️ High | ⚠️ High |
| Type Hints | ✅ Full | ⚠️ Partial | ✅ Full | ⚠️ Partial |
| Easy API | ✅ Yes | ❌ No | ✅ Yes | ❌ No |
| Rendering | ✅ Yes | ✅ Yes | ❌ No | ❌ No |
License
Apache 2.0
Links
- Documentation: https://lexmata.github.io/micropdf/api/python/
- Repository: https://bitbucket.org/lexmata/micropdf
- Issues: https://bitbucket.org/lexmata/micropdf/issues
- Rust Core: https://docs.rs/micropdf
- Node.js: https://lexmata.github.io/micropdf/api/nodejs/
- Go: https://pkg.go.dev/bitbucket.org/lexmata/micropdf/go-micropdf