
The fastest Python PDF library: 0.8ms mean, 5× faster than PyMuPDF. Text extraction, markdown conversion, PDF creation. 100% pass rate on 3,830 PDFs.


PDF Oxide - The Fastest PDF Toolkit for Python, Rust, Go, JS/TS, C#, WASM, CLI & AI

More language bindings coming in May 2026. Java, Ruby, PHP, Swift, and Kotlin are on the roadmap. Want another language? Open an issue and tell us.

The fastest PDF library for text extraction, image extraction, and markdown conversion. Rust core with bindings for Python, Go, JavaScript / TypeScript, C# / .NET, and WASM, plus a CLI tool and MCP server for AI assistants. 0.8ms mean per document, 5× faster than PyMuPDF, 15× faster than pypdf. 100% pass rate on 3,830 real-world PDFs. Dual-licensed under MIT or Apache-2.0.


New in v0.3.24 — now available in Go, JavaScript / TypeScript, and C# / .NET, alongside the existing Python, Rust, and WASM bindings. Same Rust core, same 0.8 ms extraction speed, same 100% pass rate. See the language guides: Python · Go · JavaScript / TypeScript · C# / .NET · WASM

Quick Start

Python

from pdf_oxide import PdfDocument

# path can be a str or pathlib.Path; use a "with" block for scoped access
doc = PdfDocument("paper.pdf")
# or: with PdfDocument("paper.pdf") as doc: ...
text = doc.extract_text(0)
chars = doc.extract_chars(0)
markdown = doc.to_markdown(0, detect_headings=True)

pip install pdf_oxide
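
The quick-start calls above work on one page at a time (page index 0). To pull text from a whole document, one option is to loop over the pages using `page_count()`, which appears in the Python API section of this README. The helper below is a sketch: `extract_all_text` is a hypothetical name, not part of pdf_oxide, and it accepts any object that exposes `page_count()` and `extract_text(index)`.

```python
def extract_all_text(doc, separator="\n\n"):
    """Concatenate the text of every page in `doc`.

    `doc` is duck-typed: anything with page_count() and extract_text(i),
    such as the PdfDocument shown above. This helper is illustrative and
    is not part of pdf_oxide.
    """
    pages = (doc.extract_text(i) for i in range(doc.page_count()))
    return separator.join(pages)


# Intended usage (requires pdf_oxide and a real file):
#   from pdf_oxide import PdfDocument
#   with PdfDocument("paper.pdf") as doc:
#       print(extract_all_text(doc))
```

Keeping the helper duck-typed means it can be exercised with a small stand-in object in tests, without a PDF on disk.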

Rust

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("paper.pdf")?;
let text = doc.extract_text(0)?;
let images = doc.extract_images(0)?;
let markdown = doc.to_markdown(0, Default::default())?;

[dependencies]
pdf_oxide = "0.3"

CLI

pdf-oxide text document.pdf
pdf-oxide markdown document.pdf -o output.md
pdf-oxide search document.pdf "pattern"
pdf-oxide merge a.pdf b.pdf -o combined.pdf

brew install yfedoseev/tap/pdf-oxide

MCP Server (for AI assistants)

# Install
brew install yfedoseev/tap/pdf-oxide   # includes pdf-oxide-mcp

# Configure in Claude Desktop / Claude Code / Cursor
{
  "mcpServers": {
    "pdf-oxide": { "command": "crgx", "args": ["pdf_oxide_mcp@latest"] }
  }
}

Why pdf_oxide?

  • Fast — 0.8ms mean per document, 5× faster than PyMuPDF, 15× faster than pypdf, 29× faster than pdfplumber
  • Reliable — 100% pass rate on 3,830 test PDFs, zero panics, zero timeouts
  • Complete — Text extraction, image extraction, PDF creation, and editing in one library
  • Multi-platform — Rust, Python, Go, JavaScript/TypeScript, C#/.NET, WASM, CLI, and MCP server for AI assistants
  • Permissive license — MIT / Apache-2.0 — use freely in commercial and open-source projects

Performance

Benchmarked on 3,830 PDFs from three independent public test suites (veraPDF, Mozilla pdf.js, DARPA SafeDocs). Text extraction libraries only (no OCR). Single-thread, 60s timeout, no warm-up.
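
The tables below report a mean and a p99 per library. The benchmark harness itself is not shown in this README, so the following is only a sketch of how those two numbers could be derived from per-document timings. `time_call` and `summarize` are hypothetical names, and the nearest-rank percentile method is an assumption, not necessarily what the benchmark used.

```python
import statistics
import time


def time_call(fn, *args):
    """Return wall-clock milliseconds for a single call to fn(*args)."""
    start = time.perf_counter()
    fn(*args)
    return (time.perf_counter() - start) * 1000.0


def summarize(timings_ms):
    """Return (mean, p99) for per-document timings in milliseconds."""
    if not timings_ms:
        raise ValueError("no timings to summarize")
    ordered = sorted(timings_ms)
    # Nearest-rank p99: the smallest sample with at least 99% of
    # samples at or below it. Integer form of ceil(0.99 * n).
    rank = (99 * len(ordered) + 99) // 100
    return statistics.fmean(ordered), ordered[rank - 1]
```

With "no warm-up", each document would be timed once with `time_call`, and the resulting list passed to `summarize`.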

Python Libraries

Library      Mean     p99    Pass Rate  License
PDF Oxide    0.8ms    9ms    100%       MIT
PyMuPDF      4.6ms    28ms   99.3%      AGPL-3.0
pypdfium2    4.1ms    42ms   99.2%      Apache-2.0
pymupdf4llm  55.5ms   280ms  99.1%      AGPL-3.0
pdftext      7.3ms    82ms   99.0%      GPL-3.0
pdfminer     16.8ms   124ms  98.8%      MIT
pdfplumber   23.2ms   189ms  98.8%      MIT
markitdown   108.8ms  378ms  98.6%      MIT
pypdf        12.1ms   97ms   98.4%      BSD-3

Rust Libraries

Library      Mean    p99   Pass Rate  Text Extraction
PDF Oxide    0.8ms   9ms   100%       Built-in
oxidize_pdf  13.5ms  11ms  99.1%      Basic
unpdf        2.8ms   10ms  95.1%      Basic
pdf_extract  4.08ms  37ms  91.5%      Basic
lopdf        0.3ms   2ms   80.2%      No built-in extraction

Text Quality

99.5% text parity vs PyMuPDF and pypdfium2 across the full corpus. On "hard" files, where one library extracts text and the other does not, PDF Oxide succeeds 7–10× more often than it fails against any competitor.

Corpus

Suite                           PDFs   Pass Rate
veraPDF (PDF/A compliance)      2,907  100%
Mozilla pdf.js                  897    99.2%
SafeDocs (targeted edge cases)  26     100%
Total                           3,830  100%

100% pass rate on all valid PDFs — the 7 non-passing files across the corpus are intentionally broken test fixtures (missing PDF header, fuzz-corrupted catalogs, invalid xref streams).
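
The corpus figures can be cross-checked with a little arithmetic. The sketch below uses only the suite sizes from the table and the 7 non-passing files mentioned above. Placing all 7 in the Mozilla pdf.js suite is an assumption, made because it is the only suite listed below 100%.

```python
# Suite sizes from the Corpus table; 7 non-passing files per the note above.
suites = {"veraPDF": 2907, "Mozilla pdf.js": 897, "SafeDocs": 26}
non_passing = 7

total = sum(suites.values())
# Raw pass rate over every file, including the intentionally broken fixtures.
raw_rate = 100 * (total - non_passing) / total
# Assumption: all 7 failures are in the pdf.js suite.
pdfjs_rate = 100 * (suites["Mozilla pdf.js"] - non_passing) / suites["Mozilla pdf.js"]

print(total, round(raw_rate, 1), round(pdfjs_rate, 1))
```

This gives 3,830 files in total, a raw pass rate of about 99.8% when the broken fixtures are counted, and about 99.2% for pdf.js, which matches the table. The headline 100% therefore refers to valid PDFs only.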

Features

Extract        Create     Edit
Text & Layout  Documents  Annotations
Images         Tables     Form Fields
Forms          Graphics   Bookmarks
Annotations    Templates  Links
Bookmarks      Images     Content

Python API

from pdf_oxide import PdfDocument

# Path can be str or pathlib.Path; use "with PdfDocument(...) as doc" for context manager
doc = PdfDocument("report.pdf")
print(f"Pages: {doc.page_count()}")
print(f"Version: {doc.version()}")

# 1. Scoped extraction (v0.3.14)
# Extract only from a specific area: (x, y, width, height)
header = doc.within(0, (0, 700, 612, 92)).extract_text()

# 2. Word-level extraction (v0.3.14)
words = doc.extract_words(0)
for w in words:
    print(f"{w.text} at {w.bbox}")
    # Access individual characters in the word
    # print(w.chars[0].font_name)

# Optional: override the adaptive word gap threshold (in PDF points)
words = doc.extract_words(0, word_gap_threshold=2.5)

# 3. Line-level extraction (v0.3.14)
lines = doc.extract_text_lines(0)
for line in lines:
    print(f"Line: {line.text}")

# Optional: override word and/or line gap thresholds (in PDF points)
lines = doc.extract_text_lines(0, word_gap_threshold=2.5, line_gap_threshold=4.0)

# Inspect the adaptive thresholds before overriding
params = doc.page_layout_params(0)
print(f"word gap: {params.word_gap_threshold:.1f}, line gap: {params.line_gap_threshold:.1f}")

# Use a pre-tuned extraction profile for specific document types
from pdf_oxide import ExtractionProfile
words = doc.extract_words(0, profile=ExtractionProfile.form())
lines = doc.extract_text_lines(0, profile=ExtractionProfile.academic())

# 4. Table extraction (v0.3.14)
tables = doc.extract_tables(0)
for table in tables:
    print(f"Table with {table.row_count} rows")

# 5. Traditional extraction
text = doc.extract_text(0)
chars = doc.extract_chars(0)
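
The rectangle passed to `within` in the scoped-extraction example is `(x, y, width, height)`. The values `(0, 700, 612, 92)` cover the top 92 points of a 612 × 792 pt (US Letter) page only if y is measured from the bottom edge, which is the usual PDF convention. That reading, and the page size, are assumptions here. `top_band` and `bottom_band` are hypothetical helpers for building such rectangles, not part of pdf_oxide.

```python
def top_band(page_width, page_height, band_height):
    """Rectangle (x, y, width, height) covering the top `band_height` points.

    Assumes the origin is at the bottom-left corner of the page.
    """
    if not 0 < band_height <= page_height:
        raise ValueError("band_height must be within the page height")
    return (0, page_height - band_height, page_width, band_height)


def bottom_band(page_width, page_height, band_height):
    """Rectangle (x, y, width, height) covering the bottom `band_height` points."""
    if not 0 < band_height <= page_height:
        raise ValueError("band_height must be within the page height")
    return (0, 0, page_width, band_height)


# Intended usage with the API shown above:
#   header = doc.within(0, top_band(612, 792, 92)).extract_text()
#   footer = doc.within(0, bottom_band(612, 792, 50)).extract_text()
```

For a US Letter page, `top_band(612, 792, 92)` reproduces the tuple used in the example.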

Form Fields

# Extract form fields
fields = doc.get_form_fields()
for f in fields:
    print(f"{f.name} ({f.field_type}) = {f.value}")

# Fill and save
doc.set_form_field_value("employee_name", "Jane Doe")
doc.set_form_field_value("wages", "85000.00")
doc.save("filled.pdf")
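
When field values come from a dictionary, it can help to check names against `get_form_fields()` before writing. This README does not say how `set_form_field_value` behaves for an unknown field name, so the sketch below avoids relying on that. `fill_form` is a hypothetical helper, and converting values with `str()` is an assumption based on the string values in the example above.

```python
def fill_form(doc, values):
    """Set each known field in `values`; return the names that were skipped.

    `doc` is duck-typed: it needs get_form_fields() (items with a .name)
    and set_form_field_value(name, value), as in the example above.
    """
    known = {field.name for field in doc.get_form_fields()}
    skipped = []
    for name, value in values.items():
        if name in known:
            doc.set_form_field_value(name, str(value))
        else:
            skipped.append(name)
    return skipped


# Intended usage:
#   skipped = fill_form(doc, {"employee_name": "Jane Doe", "wages": "85000.00"})
#   doc.save("filled.pdf")
```

Returning the skipped names makes a typo in a field name visible instead of silently dropping the value.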

Rust API

use pdf_oxide::PdfDocument;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut doc = PdfDocument::open("paper.pdf")?;

    // Extract text
    let text = doc.extract_text(0)?;

    // Character-level extraction
    let chars = doc.extract_chars(0)?;

    // Extract images
    let images = doc.extract_images(0)?;

    // Vector graphics
    let paths = doc.extract_paths(0)?;

    Ok(())
}

Form Fields (Rust)

use pdf_oxide::editor::{DocumentEditor, EditableDocument, SaveOptions};
use pdf_oxide::editor::form_fields::FormFieldValue;

let mut editor = DocumentEditor::open("w2.pdf")?;
editor.set_form_field_value("employee_name", FormFieldValue::Text("Jane Doe".into()))?;
editor.save_with_options("filled.pdf", SaveOptions::incremental())?;

Installation

Python

pip install pdf_oxide

Wheels available for Linux, macOS, and Windows. Python 3.8–3.14.

Rust

[dependencies]
pdf_oxide = "0.3"

JavaScript/WASM

npm install pdf-oxide-wasm

const { WasmPdfDocument } = require("pdf-oxide-wasm");

CLI

brew install yfedoseev/tap/pdf-oxide    # Homebrew (macOS/Linux)
cargo install pdf_oxide_cli             # Cargo
cargo binstall pdf_oxide_cli            # Pre-built binary via cargo-binstall

MCP Server

brew install yfedoseev/tap/pdf-oxide    # Included with CLI in Homebrew
cargo install pdf_oxide_mcp             # Cargo

Other languages

  • Go: go get github.com/yfedoseev/pdf_oxide/go (see go/README.md)
  • JavaScript / TypeScript (Node.js): npm install pdf-oxide (see js/README.md)
  • C# / .NET: dotnet add package PdfOxide (see csharp/README.md)

All three share the same Rust core as the Python and WASM bindings, so everything you read in this README applies to them as well — just with each language's native naming conventions.

CLI

22 commands for PDF processing directly from your terminal:

pdf-oxide text report.pdf                      # Extract text
pdf-oxide markdown report.pdf -o report.md     # Convert to Markdown
pdf-oxide html report.pdf -o report.html       # Convert to HTML
pdf-oxide info report.pdf                      # Show metadata
pdf-oxide search report.pdf "neural.?network"  # Search (regex)
pdf-oxide images report.pdf -o ./images/       # Extract images
pdf-oxide merge a.pdf b.pdf -o combined.pdf    # Merge PDFs
pdf-oxide split report.pdf -o ./pages/         # Split into pages
pdf-oxide watermark doc.pdf "DRAFT"            # Add watermark
pdf-oxide forms w2.pdf --fill "name=Jane"      # Fill form fields

Run pdf-oxide with no arguments for interactive REPL mode. Use --pages 1-5 to process specific pages, --json for machine-readable output.
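
To drive the CLI from a script, the argument list can be assembled first and then handed to a process runner. The sketch below only builds the list and does not run anything. `build_command` is a hypothetical helper, and it assumes that `--pages` and `--json` go after the input file and can be combined with the `text` subcommand, which this README does not spell out.

```python
import shlex


def build_command(subcommand, pdf_path, pages=None, as_json=False):
    """Build an argv list for the pdf-oxide CLI without running it."""
    argv = ["pdf-oxide", subcommand, str(pdf_path)]
    if pages is not None:
        argv += ["--pages", pages]  # e.g. "1-5"
    if as_json:
        argv.append("--json")
    return argv


cmd = build_command("text", "report.pdf", pages="1-5", as_json=True)
print(shlex.join(cmd))
```

The list can then be passed to `subprocess.run(cmd, capture_output=True, text=True, check=True)`. Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.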

MCP Server

pdf-oxide-mcp lets AI assistants (Claude, Cursor, etc.) extract content from PDFs locally via the Model Context Protocol.

Add to your MCP client configuration:

{
  "mcpServers": {
    "pdf-oxide": { "command": "crgx", "args": ["pdf_oxide_mcp@latest"] }
  }
}

The server exposes an extract tool that supports text, markdown, and HTML output formats with optional page ranges and image extraction. All processing runs locally — no files leave your machine.
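
If the client configuration already lists other MCP servers, the pdf-oxide entry should be merged in rather than replacing the file. The sketch below does that on an in-memory dictionary. `add_pdf_oxide_server` is a hypothetical helper, the `other` server is a placeholder, and where the configuration file lives depends on the client, so no path is assumed.

```python
import json

# Entry copied from the configuration shown above.
PDF_OXIDE_ENTRY = {"command": "crgx", "args": ["pdf_oxide_mcp@latest"]}


def add_pdf_oxide_server(config):
    """Return a copy of `config` with the pdf-oxide server entry added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["pdf-oxide"] = dict(PDF_OXIDE_ENTRY)
    merged["mcpServers"] = servers
    return merged


# Placeholder for a configuration that already has another server.
existing = {"mcpServers": {"other": {"command": "other-server", "args": []}}}
print(json.dumps(add_pdf_oxide_server(existing), indent=2))
```

The function leaves its input untouched, so the original configuration can be kept as a backup before writing the merged result.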

Building from Source

# Clone and build
git clone https://github.com/yfedoseev/pdf_oxide
cd pdf_oxide
cargo build --release

# Run tests
cargo test

# Build Python bindings
maturin develop

# Build the shared library for Go, JS/TS, and C# bindings
cargo build --release --lib
# Output: target/release/libpdf_oxide.{so,dylib} or pdf_oxide.dll

Documentation

Use Cases

  • RAG / LLM pipelines — Convert PDFs to clean Markdown for retrieval-augmented generation with LangChain, LlamaIndex, or any framework
  • AI assistants — Give Claude, Cursor, or any MCP-compatible tool direct PDF access via the MCP server
  • Document processing at scale — Extract text, images, and metadata from thousands of PDFs in seconds
  • Data extraction — Pull structured data from forms, tables, and layouts
  • Academic research — Parse papers, extract citations, and process large corpora
  • PDF generation — Create invoices, reports, certificates, and templated documents programmatically
  • PyMuPDF alternative — MIT licensed, 5× faster, no AGPL restrictions
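
For the RAG use case above, a typical first step is converting each page to Markdown and splitting the result into chunks for indexing. The sketch below uses `page_count()` and `to_markdown(page, detect_headings=True)` as shown earlier in this README. `pages_to_markdown` and `chunk_paragraphs` are hypothetical helpers, and the blank-line chunking rule is one simple choice among many, not something pdf_oxide prescribes.

```python
def pages_to_markdown(doc):
    """Yield (page_index, markdown) for every non-empty page of `doc`."""
    for i in range(doc.page_count()):
        md = doc.to_markdown(i, detect_headings=True)
        if md.strip():
            yield i, md


def chunk_paragraphs(text, max_chars=1000):
    """Group blank-line-separated paragraphs into chunks of up to max_chars.

    A single paragraph longer than max_chars becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


# Intended usage:
#   for page, md in pages_to_markdown(doc):
#       for chunk in chunk_paragraphs(md):
#           index.add(chunk, metadata={"page": page})  # `index` is hypothetical
```

Keeping the page index alongside each chunk lets a retrieval result point back to its source page.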

Why I built this

I needed PyMuPDF's speed without its AGPL license, and I needed it in more than one language. Nothing existed that ticked all three boxes — fast, MIT, multi-language — so I wrote it. The Rust core is what does the real work; the bindings for Python, Go, JS/TS, C#, and WASM are thin shells around the same code, so a bug fix in one lands in all of them. It now passes 100% of the veraPDF + Mozilla pdf.js + DARPA SafeDocs test corpora (3,830 PDFs) on every platform I've tested.

If it's useful to you, a star on GitHub genuinely helps. If something's broken or missing, open an issue — I read all of them.

— Yury

License

Dual-licensed under MIT or Apache-2.0 at your option. Unlike AGPL-licensed alternatives, pdf_oxide can be used freely in any project — commercial or open-source — with no copyleft restrictions.

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

cargo build && cargo test && cargo fmt && cargo clippy -- -D warnings

Citation

@software{pdf_oxide,
  title = {PDF Oxide: Fast PDF Toolkit for Rust, Python, Go, JavaScript, and C#},
  author = {Yury Fedoseev},
  year = {2025},
  url = {https://github.com/yfedoseev/pdf_oxide}
}

Rust + Python + Go + JS/TS + C# + WASM + CLI + MCP | MIT/Apache-2.0 | 100% pass rate on 3,830 PDFs | 0.8ms mean | 5× faster than the industry leaders
