The fastest Python PDF library: 0.8ms mean, 5× faster than PyMuPDF. Text extraction, markdown conversion, PDF creation. 100% pass rate on 3,830 PDFs.

PDF Oxide - The Fastest PDF Toolkit for Python, Rust, WASM, CLI & AI

The fastest PDF library for text extraction, image extraction, and markdown conversion. Rust core with Python bindings, WASM support, CLI tool, and MCP server for AI assistants. 0.8ms mean per document, 5× faster than PyMuPDF, 15× faster than pypdf. 100% pass rate on 3,830 real-world PDFs. Dual-licensed under MIT or Apache-2.0.

Quick Start

Python

from pdf_oxide import PdfDocument

# path can be a str or a pathlib.Path
doc = PdfDocument("paper.pdf")
# or, for scoped access: with PdfDocument("paper.pdf") as doc: ...
text = doc.extract_text(0)
chars = doc.extract_chars(0)
markdown = doc.to_markdown(0, detect_headings=True)

# Install
pip install pdf_oxide
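
The Quick Start calls above each take a single page index. A minimal sketch for whole-document extraction follows, assuming only the `page_count()` and `extract_text(page_index)` methods that appear in this README. The helper name `extract_all_text` and the page separator are illustrative and not part of pdf_oxide.

```python
from typing import Any


def extract_all_text(doc: Any, separator: str = "\n\n") -> str:
    """Concatenate the text of every page in page order.

    `doc` is expected to behave like pdf_oxide.PdfDocument, i.e. to expose
    page_count() and extract_text(page_index) with 0-based page indices.
    """
    pages = [doc.extract_text(i) for i in range(doc.page_count())]
    return separator.join(pages)
```

With the library installed, this would be called as `extract_all_text(PdfDocument("paper.pdf"))`.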

Rust

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("paper.pdf")?;
let text = doc.extract_text(0)?;
let images = doc.extract_images(0)?;
let markdown = doc.to_markdown(0, Default::default())?;

# Cargo.toml
[dependencies]
pdf_oxide = "0.3"

CLI

pdf-oxide text document.pdf
pdf-oxide markdown document.pdf -o output.md
pdf-oxide search document.pdf "pattern"
pdf-oxide merge a.pdf b.pdf -o combined.pdf

# Install
brew install yfedoseev/tap/pdf-oxide

MCP Server (for AI assistants)

# Install
brew install yfedoseev/tap/pdf-oxide   # includes pdf-oxide-mcp

# Configure in Claude Desktop / Claude Code / Cursor
{
  "mcpServers": {
    "pdf-oxide": { "command": "crgx", "args": ["pdf_oxide_mcp@latest"] }
  }
}

Why pdf_oxide?

  • Fast — 0.8ms mean per document, 5× faster than PyMuPDF, 15× faster than pypdf, 29× faster than pdfplumber
  • Reliable — 100% pass rate on 3,830 test PDFs, zero panics, zero timeouts
  • Complete — Text extraction, image extraction, PDF creation, and editing in one library
  • Multi-platform — Rust, Python, JavaScript/WASM, CLI, and MCP server for AI assistants
  • Permissive license — MIT / Apache-2.0 — use freely in commercial and open-source projects

Performance

Benchmarked on 3,830 PDFs from three independent public test suites (veraPDF, Mozilla pdf.js, DARPA SafeDocs). Text extraction libraries only (no OCR). Single-thread, 60s timeout, no warm-up.
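
The tables below report a mean and a 99th percentile over per-document timings. The benchmark harness itself is not shown in this README, so the following is only a sketch of how such statistics can be computed. It assumes a nearest-rank p99 and a caller-supplied `extract` callable, and the function names are illustrative.

```python
import time
from typing import Callable, Iterable, List, Tuple


def time_each(extract: Callable[[str], object], paths: Iterable[str]) -> List[float]:
    """Time one call per path in milliseconds, single-threaded, with no warm-up run."""
    timings = []
    for path in paths:
        start = time.perf_counter()
        extract(path)
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


def mean_and_p99(timings_ms: List[float]) -> Tuple[float, float]:
    """Return (mean, p99) using the nearest-rank percentile definition."""
    if not timings_ms:
        raise ValueError("no timings recorded")
    ordered = sorted(timings_ms)
    # 1-based nearest rank: ceil(0.99 * n), computed with integers only.
    rank = (99 * len(ordered) + 99) // 100
    return sum(ordered) / len(ordered), ordered[rank - 1]
```

A per-document timeout, as used in the benchmark, would need process-level isolation and is left out of this sketch.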

Python Libraries

Library       Mean      p99     Pass Rate   License
PDF Oxide     0.8ms     9ms     100%        MIT
PyMuPDF       4.6ms     28ms    99.3%       AGPL-3.0
pypdfium2     4.1ms     42ms    99.2%       Apache-2.0
pymupdf4llm   55.5ms    280ms   99.1%       AGPL-3.0
pdftext       7.3ms     82ms    99.0%       GPL-3.0
pdfminer      16.8ms    124ms   98.8%       MIT
pdfplumber    23.2ms    189ms   98.8%       MIT
markitdown    108.8ms   378ms   98.6%       MIT
pypdf         12.1ms    97ms    98.4%       BSD-3
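
The "5×", "15×", and "29×" figures quoted in this README follow from the Mean column of the table above, truncated to whole numbers. A short worked check, using integer microseconds to keep the division exact (the function name is illustrative):

```python
# Mean latencies from the Python table, in microseconds (0.8 ms = 800 us).
MEAN_US = {
    "PDF Oxide": 800,
    "PyMuPDF": 4600,
    "pypdf": 12100,
    "pdfplumber": 23200,
}


def speedup_vs(baseline: str, means_us: dict) -> dict:
    """Return each other library's mean latency as a multiple of the baseline's."""
    base = means_us[baseline]
    return {name: value / base for name, value in means_us.items() if name != baseline}


speedups = speedup_vs("PDF Oxide", MEAN_US)
# PyMuPDF: 5.75, pypdf: 15.125, pdfplumber: 29.0
```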

Rust Libraries

Library       Mean     p99    Pass Rate   Text Extraction
PDF Oxide     0.8ms    9ms    100%        Built-in
oxidize_pdf   13.5ms   11ms   99.1%       Basic
unpdf         2.8ms    10ms   95.1%       Basic
pdf_extract   4.08ms   37ms   91.5%       Basic
lopdf         0.3ms    2ms    80.2%       No built-in extraction

Text Quality

99.5% text parity with PyMuPDF and pypdfium2 across the full corpus. On "hard" files, PDF Oxide recovers text that a given competitor misses 7–10× as often as the reverse.

Corpus

Suite                            PDFs    Pass Rate
veraPDF (PDF/A compliance)       2,907   100%
Mozilla pdf.js                   897     99.2%
SafeDocs (targeted edge cases)   26      100%
Total                            3,830   100%

100% pass rate on all valid PDFs — the 7 non-passing files across the corpus are intentionally broken test fixtures (missing PDF header, fuzz-corrupted catalogs, invalid xref streams).
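
The corpus figures can be cross-checked with a little arithmetic. The sketch below assumes that all 7 non-passing files belong to the pdf.js suite, since it is the only suite listed below 100%; the README does not state this explicitly.

```python
# Per-suite PDF counts from the Corpus table.
SUITE_SIZES = {"veraPDF": 2907, "Mozilla pdf.js": 897, "SafeDocs": 26}
NON_PASSING = 7  # intentionally broken fixtures, per the note above

total = sum(SUITE_SIZES.values())
valid = total - NON_PASSING
raw_pass_rate = 100.0 * valid / total

# Assumption: every non-passing file is in the pdf.js suite.
pdfjs_size = SUITE_SIZES["Mozilla pdf.js"]
pdfjs_pass_rate = 100.0 * (pdfjs_size - NON_PASSING) / pdfjs_size

print(total, valid, round(raw_pass_rate, 2), round(pdfjs_pass_rate, 1))
# prints: 3830 3823 99.82 99.2
```

Under that assumption the pdf.js rate matches the 99.2% in the table, and the headline 100% refers to the 3,823 valid files.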

Features

Extract         Create      Edit
Text & Layout   Documents   Annotations
Images          Tables      Form Fields
Forms           Graphics    Bookmarks
Annotations     Templates   Links
Bookmarks       Images      Content

Python API

from pdf_oxide import PdfDocument

# Path can be str or pathlib.Path; use "with PdfDocument(...) as doc" for context manager
doc = PdfDocument("report.pdf")
print(f"Pages: {doc.page_count()}")
print(f"Version: {doc.version()}")

# 1. Scoped extraction (v0.3.14)
# Extract only from a specific area: (x, y, width, height)
header = doc.within(0, (0, 700, 612, 92)).extract_text()

# 2. Word-level extraction (v0.3.14)
words = doc.extract_words(0)
for w in words:
    print(f"{w.text} at {w.bbox}")
    # Access individual characters in the word
    # print(w.chars[0].font_name)

# Optional: override the adaptive word gap threshold (in PDF points)
words = doc.extract_words(0, word_gap_threshold=2.5)

# 3. Line-level extraction (v0.3.14)
lines = doc.extract_text_lines(0)
for line in lines:
    print(f"Line: {line.text}")

# Optional: override word and/or line gap thresholds (in PDF points)
lines = doc.extract_text_lines(0, word_gap_threshold=2.5, line_gap_threshold=4.0)

# Inspect the adaptive thresholds before overriding
params = doc.page_layout_params(0)
print(f"word gap: {params.word_gap_threshold:.1f}, line gap: {params.line_gap_threshold:.1f}")

# Use a pre-tuned extraction profile for specific document types
from pdf_oxide import ExtractionProfile
words = doc.extract_words(0, profile=ExtractionProfile.form())
lines = doc.extract_text_lines(0, profile=ExtractionProfile.academic())

# 4. Table extraction (v0.3.14)
tables = doc.extract_tables(0)
for table in tables:
    print(f"Table with {table.row_count} rows")

# 5. Traditional extraction
text = doc.extract_text(0)
chars = doc.extract_chars(0)
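
The scoped-extraction rectangle `(0, 700, 612, 92)` above selects the top 92 points of a US Letter page (612 × 792 points). A small helper can derive such a rectangle from the page size. This is a sketch that assumes the default PDF coordinate system (origin at the bottom-left, y increasing upwards), which is consistent with the numbers in the example but is not confirmed by this README. `top_band` is an illustrative name, not a pdf_oxide function.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (x, y, width, height) in PDF points


def top_band(page_width: float, page_height: float, band_height: float) -> Rect:
    """Rectangle covering the top `band_height` points of a page.

    Assumes the origin is at the bottom-left corner with y increasing upwards.
    """
    if not 0 < band_height <= page_height:
        raise ValueError("band_height must be positive and no taller than the page")
    return (0, page_height - band_height, page_width, band_height)


# US Letter is 612 x 792 points; the top 92 points start at y = 700.
header_rect = top_band(612, 792, 92)
```

The result could then be passed as `doc.within(0, header_rect).extract_text()`.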

Form Fields

# Extract form fields
fields = doc.get_form_fields()
for f in fields:
    print(f"{f.name} ({f.field_type}) = {f.value}")

# Fill and save
doc.set_form_field_value("employee_name", "Jane Doe")
doc.set_form_field_value("wages", "85000.00")
doc.save("filled.pdf")
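
When filling several fields at once, it can help to check names against the form first. The sketch below uses only `get_form_fields()`, the `name` attribute, and `set_form_field_value()` as shown above. The helper `fill_form` is illustrative, and how pdf_oxide itself reacts to an unknown field name is not covered here.

```python
from typing import Any, Dict, List


def fill_form(doc: Any, values: Dict[str, str]) -> List[str]:
    """Set each value whose field name exists in the document.

    Returns the names that were skipped because the form has no such field.
    """
    known = {field.name for field in doc.get_form_fields()}
    missing = []
    for name, value in values.items():
        if name in known:
            doc.set_form_field_value(name, value)
        else:
            missing.append(name)
    return missing
```

A caller would typically inspect the returned list and then call `doc.save("filled.pdf")`.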

Rust API

use pdf_oxide::PdfDocument;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut doc = PdfDocument::open("paper.pdf")?;

    // Extract text
    let text = doc.extract_text(0)?;

    // Character-level extraction
    let chars = doc.extract_chars(0)?;

    // Extract images
    let images = doc.extract_images(0)?;

    // Vector graphics
    let paths = doc.extract_paths(0)?;

    Ok(())
}

Form Fields (Rust)

use pdf_oxide::editor::{DocumentEditor, EditableDocument, SaveOptions};
use pdf_oxide::editor::form_fields::FormFieldValue;

let mut editor = DocumentEditor::open("w2.pdf")?;
editor.set_form_field_value("employee_name", FormFieldValue::Text("Jane Doe".into()))?;
editor.save_with_options("filled.pdf", SaveOptions::incremental())?;

Installation

Python

pip install pdf_oxide

Wheels available for Linux, macOS, and Windows. Python 3.8–3.14.

Rust

[dependencies]
pdf_oxide = "0.3"

JavaScript/WASM

npm install pdf-oxide-wasm

const { WasmPdfDocument } = require("pdf-oxide-wasm");

CLI

brew install yfedoseev/tap/pdf-oxide    # Homebrew (macOS/Linux)
cargo install pdf_oxide_cli             # Cargo
cargo binstall pdf_oxide_cli            # Pre-built binary via cargo-binstall

MCP Server

brew install yfedoseev/tap/pdf-oxide    # Included with CLI in Homebrew
cargo install pdf_oxide_mcp             # Cargo

CLI

22 commands for PDF processing directly from your terminal:

pdf-oxide text report.pdf                      # Extract text
pdf-oxide markdown report.pdf -o report.md     # Convert to Markdown
pdf-oxide html report.pdf -o report.html       # Convert to HTML
pdf-oxide info report.pdf                      # Show metadata
pdf-oxide search report.pdf "neural.?network"  # Search (regex)
pdf-oxide images report.pdf -o ./images/       # Extract images
pdf-oxide merge a.pdf b.pdf -o combined.pdf    # Merge PDFs
pdf-oxide split report.pdf -o ./pages/         # Split into pages
pdf-oxide watermark doc.pdf "DRAFT"            # Add watermark
pdf-oxide forms w2.pdf --fill "name=Jane"      # Fill form fields

Run pdf-oxide with no arguments for interactive REPL mode. Use --pages 1-5 to process specific pages, --json for machine-readable output.
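
For scripted use, the CLI can be driven from Python. The sketch below only builds the argument list, using the subcommands and the `--pages` and `--json` options mentioned above. Placing the options after the input path is an assumption, and the shape of the JSON output is not documented here, so nothing is parsed.

```python
from typing import List, Optional


def build_command(subcommand: str, pdf_path: str,
                  pages: Optional[str] = None, json_output: bool = False) -> List[str]:
    """Build an argv list for the pdf-oxide CLI without running it.

    Assumption: options are accepted after the input path.
    """
    argv = ["pdf-oxide", subcommand, pdf_path]
    if pages is not None:
        argv += ["--pages", pages]
    if json_output:
        argv.append("--json")
    return argv
```

The list can be handed to `subprocess.run(argv, check=True, capture_output=True, text=True)` once the CLI is installed.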

MCP Server

pdf-oxide-mcp lets AI assistants (Claude, Cursor, etc.) extract content from PDFs locally via the Model Context Protocol.

Add to your MCP client configuration:

{
  "mcpServers": {
    "pdf-oxide": { "command": "crgx", "args": ["pdf_oxide_mcp@latest"] }
  }
}

The server exposes an extract tool that supports text, markdown, and HTML output formats with optional page ranges and image extraction. All processing runs locally — no files leave your machine.
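
If a client configuration already lists other MCP servers, the pdf-oxide entry has to be merged rather than pasted over the file. A minimal sketch of that merge follows; the `other-tool` entry is a made-up placeholder, and where the configuration file lives depends on the client.

```python
import json
from typing import Any, Dict

# Server entry copied from the configuration shown above.
PDF_OXIDE_SERVER = {"command": "crgx", "args": ["pdf_oxide_mcp@latest"]}


def add_pdf_oxide_server(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `config` with the pdf-oxide server entry added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["pdf-oxide"] = PDF_OXIDE_SERVER
    merged["mcpServers"] = servers
    return merged


# Hypothetical existing configuration with one unrelated server.
existing = {"mcpServers": {"other-tool": {"command": "other", "args": []}}}
print(json.dumps(add_pdf_oxide_server(existing), indent=2))
```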

Building from Source

# Clone and build
git clone https://github.com/yfedoseev/pdf_oxide
cd pdf_oxide
cargo build --release

# Run tests
cargo test

# Build Python bindings
maturin develop

Documentation

Use Cases

  • RAG / LLM pipelines — Convert PDFs to clean Markdown for retrieval-augmented generation with LangChain, LlamaIndex, or any framework
  • AI assistants — Give Claude, Cursor, or any MCP-compatible tool direct PDF access via the MCP server
  • Document processing at scale — Extract text, images, and metadata from thousands of PDFs in seconds
  • Data extraction — Pull structured data from forms, tables, and layouts
  • Academic research — Parse papers, extract citations, and process large corpora
  • PDF generation — Create invoices, reports, certificates, and templated documents programmatically
  • PyMuPDF alternative — MIT licensed, 5× faster, no AGPL restrictions
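
For the RAG use case, converted Markdown is usually split into chunks before indexing. The following is a generic sketch rather than part of pdf_oxide. It assumes the Markdown uses ATX headings (`#`, `##`, ...), and it does not special-case `#` lines inside fenced code blocks.

```python
import re
from typing import List

HEADING = re.compile(r"^#{1,6} ")


def split_by_headings(markdown: str) -> List[str]:
    """Split Markdown into chunks, starting a new chunk at each ATX heading."""
    chunks: List[List[str]] = []
    for line in markdown.splitlines():
        # Open a new chunk at every heading, and for the very first line.
        if HEADING.match(line) or not chunks:
            chunks.append([])
        chunks[-1].append(line)
    joined = ["\n".join(chunk).strip() for chunk in chunks]
    return [text for text in joined if text]
```

Each chunk could come from joining `doc.to_markdown(i, detect_headings=True)` across pages and would then go to whatever embedding or indexing step the pipeline uses.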

License

Dual-licensed under MIT or Apache-2.0 at your option. Unlike AGPL-licensed alternatives, pdf_oxide can be used freely in any project — commercial or open-source — with no copyleft restrictions.

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

cargo build && cargo test && cargo fmt && cargo clippy -- -D warnings

Citation

@software{pdf_oxide,
  title = {PDF Oxide: Fast PDF Toolkit for Rust and Python},
  author = {Yury Fedoseev},
  year = {2025},
  url = {https://github.com/yfedoseev/pdf_oxide}
}

Rust + Python + WASM + CLI + MCP | MIT/Apache-2.0 | 100% pass rate on 3,830 PDFs | 0.8ms mean | 5× faster than PyMuPDF
