The fastest Python PDF library: 0.8ms mean, 5× faster than PyMuPDF. Text extraction, markdown conversion, PDF creation. 100% pass rate on 3,830 PDFs.

PDF Oxide - The Fastest PDF Toolkit for Python, Rust, Go, JS/TS, C#, WASM, CLI & AI

More language bindings coming in May 2026. Java, Ruby, PHP, Swift, and Kotlin are on the roadmap. Want another language? Open an issue and tell us.

The fastest PDF library for text extraction, image extraction, and markdown conversion. Rust core with bindings for Python, Go, JavaScript / TypeScript, C# / .NET, and WASM, plus a CLI tool and MCP server for AI assistants. 0.8ms mean per document, 5× faster than PyMuPDF, 15× faster than pypdf. 100% pass rate on 3,830 real-world PDFs. Dual-licensed under MIT or Apache-2.0.

New in v0.3.24 — now available in Go, JavaScript / TypeScript, and C# / .NET, alongside the existing Python, Rust, and WASM bindings. Same Rust core, same 0.8 ms extraction speed, same 100% pass rate. See the language guides: Python · Go · JavaScript / TypeScript · C# / .NET · WASM

Quick Start

Python

from pdf_oxide import PdfDocument

# path can be a str or pathlib.Path; use a "with" block for scoped access
doc = PdfDocument("paper.pdf")
# or: with PdfDocument("paper.pdf") as doc: ...
text = doc.extract_text(0)
chars = doc.extract_chars(0)
markdown = doc.to_markdown(0, detect_headings=True)

pip install pdf_oxide

Rust

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("paper.pdf")?;
let text = doc.extract_text(0)?;
let images = doc.extract_images(0)?;
let markdown = doc.to_markdown(0, Default::default())?;

[dependencies]
pdf_oxide = "0.3"

CLI

pdf-oxide text document.pdf
pdf-oxide markdown document.pdf -o output.md
pdf-oxide search document.pdf "pattern"
pdf-oxide merge a.pdf b.pdf -o combined.pdf

brew install yfedoseev/tap/pdf-oxide

MCP Server (for AI assistants)

# Install
brew install yfedoseev/tap/pdf-oxide   # includes pdf-oxide-mcp

# Configure in Claude Desktop / Claude Code / Cursor
{
  "mcpServers": {
    "pdf-oxide": { "command": "crgx", "args": ["pdf_oxide_mcp@latest"] }
  }
}

Why pdf_oxide?

  • Fast — 0.8ms mean per document, 5× faster than PyMuPDF, 15× faster than pypdf, 29× faster than pdfplumber
  • Reliable — 100% pass rate on 3,830 test PDFs, zero panics, zero timeouts
  • Complete — Text extraction, image extraction, PDF creation, and editing in one library
  • Multi-platform — Rust, Python, Go, JavaScript/TypeScript, C#/.NET, WASM, CLI, and MCP server for AI assistants
  • Permissive license — MIT / Apache-2.0 — use freely in commercial and open-source projects

Performance

Benchmarked on 3,830 PDFs from three independent public test suites (veraPDF, Mozilla pdf.js, DARPA SafeDocs). Text extraction libraries only (no OCR). Single-thread, 60s timeout, no warm-up.
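For readers who want to reproduce summary figures of this kind, here is a minimal sketch of how a mean and a p99 can be derived from per-document timings. The function name `summarize`, the nearest-rank percentile method, and the sample data are assumptions made for illustration; the benchmark's actual harness is not shown in this README.

```python
import statistics


def summarize(timings_ms):
    """Return (mean, p99) for a list of per-document timings in milliseconds."""
    if not timings_ms:
        raise ValueError("need at least one timing")
    ordered = sorted(timings_ms)
    # Nearest-rank percentile: rank = ceil(0.99 * n), done in integer math
    # to avoid floating-point rounding surprises.
    rank = (99 * len(ordered) + 99) // 100
    return statistics.mean(ordered), ordered[rank - 1]


# Hypothetical timings: 99 fast documents and one slow outlier.
sample = [1.0] * 99 + [50.0]
mean_ms, p99_ms = summarize(sample)
print(f"mean={mean_ms:.2f}ms p99={p99_ms:.1f}ms")
```

With this sample the single outlier raises the mean while the p99 stays at the fast value, which shows that a mean can end up above a p99 when a small number of files are very slow.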

Python Libraries

| Library | Mean | p99 | Pass Rate | License |
|---|---|---|---|---|
| PDF Oxide | 0.8ms | 9ms | 100% | MIT |
| PyMuPDF | 4.6ms | 28ms | 99.3% | AGPL-3.0 |
| pypdfium2 | 4.1ms | 42ms | 99.2% | Apache-2.0 |
| pymupdf4llm | 55.5ms | 280ms | 99.1% | AGPL-3.0 |
| pdftext | 7.3ms | 82ms | 99.0% | GPL-3.0 |
| pdfminer | 16.8ms | 124ms | 98.8% | MIT |
| pdfplumber | 23.2ms | 189ms | 98.8% | MIT |
| markitdown | 108.8ms | 378ms | 98.6% | MIT |
| pypdf | 12.1ms | 97ms | 98.4% | BSD-3 |
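The headline multipliers (5×, 15×, 29×) can be checked against the mean times in the table. The sketch below only divides numbers copied from this README; the helper name `speedup` is made up for the example.

```python
# Mean extraction times (ms) copied from the Python Libraries table.
mean_ms = {
    "PDF Oxide": 0.8,
    "PyMuPDF": 4.6,
    "pypdf": 12.1,
    "pdfplumber": 23.2,
}


def speedup(library, baseline="PDF Oxide"):
    """Ratio of a library's mean time to the baseline's mean time."""
    return mean_ms[library] / mean_ms[baseline]


for name in ("PyMuPDF", "pypdf", "pdfplumber"):
    print(f"{name}: {speedup(name):.2f}x slower than PDF Oxide by mean time")
```

The ratios land between 5 and 6 for PyMuPDF, just above 15 for pypdf, and at about 29 for pdfplumber, so the README's figures appear to be these ratios rounded down.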

Rust Libraries

| Library | Mean | p99 | Pass Rate | Text Extraction |
|---|---|---|---|---|
| PDF Oxide | 0.8ms | 9ms | 100% | Built-in |
| oxidize_pdf | 13.5ms | 11ms | 99.1% | Basic |
| unpdf | 2.8ms | 10ms | 95.1% | Basic |
| pdf_extract | 4.08ms | 37ms | 91.5% | Basic |
| lopdf | 0.3ms | 2ms | 80.2% | No built-in extraction |

Text Quality

99.5% text parity vs PyMuPDF and pypdfium2 across the full corpus. On "hard" files, PDF Oxide recovers text that a given competitor misses 7–10× more often than that competitor recovers text PDF Oxide misses.

Corpus

| Suite | PDFs | Pass Rate |
|---|---|---|
| veraPDF (PDF/A compliance) | 2,907 | 100% |
| Mozilla pdf.js | 897 | 99.2% |
| SafeDocs (targeted edge cases) | 26 | 100% |
| Total | 3,830 | 100% |

100% pass rate on all valid PDFs — the 7 non-passing files across the corpus are intentionally broken test fixtures (missing PDF header, fuzz-corrupted catalogs, invalid xref streams).
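The arithmetic behind these percentages can be worked through as follows. One assumption is made and labeled in the code: all 7 broken fixtures are taken to sit in the Mozilla pdf.js suite, which is consistent with the other two suites being listed at 100%.

```python
# Per-suite PDF counts from the Corpus table.
suites = {"veraPDF": 2907, "Mozilla pdf.js": 897, "SafeDocs": 26}
broken_fixtures = 7  # intentionally broken files, per the note above

total = sum(suites.values())
passed = total - broken_fixtures

# Pass rate over every file, including the broken fixtures.
raw_rate = passed / total
# Pass rate over valid PDFs only (broken fixtures excluded from the denominator).
valid_rate = passed / (total - broken_fixtures)
# Assumption: all 7 broken fixtures belong to the pdf.js suite.
pdfjs_rate = (suites["Mozilla pdf.js"] - broken_fixtures) / suites["Mozilla pdf.js"]

print(f"total files: {total}")
print(f"pdf.js suite: {pdfjs_rate:.1%}")
print(f"all files: {raw_rate:.1%}, valid files only: {valid_rate:.0%}")
```

Under that assumption the pdf.js figure matches the 99.2% in the table, and the 100% total corresponds to the valid-files-only rate.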

Features

| Extract | Create | Edit |
|---|---|---|
| Text & Layout | Documents | Annotations |
| Images | Tables | Form Fields |
| Forms | Graphics | Bookmarks |
| Annotations | Templates | Links |
| Bookmarks | Images | Content |

Python API

from pdf_oxide import PdfDocument

# Path can be str or pathlib.Path; use "with PdfDocument(...) as doc" for context manager
doc = PdfDocument("report.pdf")
print(f"Pages: {doc.page_count()}")
print(f"Version: {doc.version()}")

# 1. Scoped extraction (v0.3.14)
# Extract only from a specific area: (x, y, width, height)
header = doc.within(0, (0, 700, 612, 92)).extract_text()

# 2. Word-level extraction (v0.3.14)
words = doc.extract_words(0)
for w in words:
    print(f"{w.text} at {w.bbox}")
    # Access individual characters in the word
    # print(w.chars[0].font_name)

# Optional: override the adaptive word gap threshold (in PDF points)
words = doc.extract_words(0, word_gap_threshold=2.5)

# 3. Line-level extraction (v0.3.14)
lines = doc.extract_text_lines(0)
for line in lines:
    print(f"Line: {line.text}")

# Optional: override word and/or line gap thresholds (in PDF points)
lines = doc.extract_text_lines(0, word_gap_threshold=2.5, line_gap_threshold=4.0)

# Inspect the adaptive thresholds before overriding
params = doc.page_layout_params(0)
print(f"word gap: {params.word_gap_threshold:.1f}, line gap: {params.line_gap_threshold:.1f}")

# Use a pre-tuned extraction profile for specific document types
from pdf_oxide import ExtractionProfile
words = doc.extract_words(0, profile=ExtractionProfile.form())
lines = doc.extract_text_lines(0, profile=ExtractionProfile.academic())

# 4. Table extraction (v0.3.14)
tables = doc.extract_tables(0)
for table in tables:
    print(f"Table with {table.row_count} rows")

# 5. Traditional extraction
text = doc.extract_text(0)
chars = doc.extract_chars(0)
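The scoped-extraction region above, (0, 700, 612, 92), reads as a full-width strip across the top of a 612 × 792 pt (US Letter) page if the origin is at the bottom-left corner, as in default PDF user space. The sketch below derives that tuple and applies it to every page. The helpers `top_strip` and `extract_headers` are hypothetical, the coordinate origin is an assumption, and only `page_count()`, `within()`, and `extract_text()` from the example above are called on the document.

```python
def top_strip(page_width, page_height, strip_height):
    """Return (x, y, width, height) for a full-width strip at the top of a page.

    Assumes the origin is at the bottom-left corner, so the strip
    starts at y = page_height - strip_height.
    """
    return (0, page_height - strip_height, page_width, strip_height)


def extract_headers(doc, page_width=612, page_height=792, strip_height=92):
    """Collect the text found in the top strip of every page of `doc`.

    `doc` is any object exposing page_count() and within(page, region),
    such as the PdfDocument shown above. Page size defaults to US Letter.
    """
    region = top_strip(page_width, page_height, strip_height)
    return [doc.within(i, region).extract_text() for i in range(doc.page_count())]


print(top_strip(612, 792, 92))
```

With a real document this would be called as extract_headers(PdfDocument("report.pdf")). Pages of a different size need their own width and height.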

Form Fields

# Extract form fields
fields = doc.get_form_fields()
for f in fields:
    print(f"{f.name} ({f.field_type}) = {f.value}")

# Fill and save
doc.set_form_field_value("employee_name", "Jane Doe")
doc.set_form_field_value("wages", "85000.00")
doc.save("filled.pdf")
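When filling several fields at once, it can help to check the names against the document first. This is a minimal sketch: the helper `fill_form` is hypothetical, and how `set_form_field_value` itself reacts to an unknown name is not stated in this README, so the check is done up front using `get_form_fields()`.

```python
def fill_form(doc, values):
    """Set each field in `values`, refusing names the document does not contain.

    `doc` is any object exposing get_form_fields() and
    set_form_field_value(name, value), as in the example above.
    """
    known = {field.name for field in doc.get_form_fields()}
    unknown = sorted(set(values) - known)
    if unknown:
        # Fail before changing anything so the document is not half-filled.
        raise KeyError(f"unknown form fields: {unknown}")
    for name, value in values.items():
        doc.set_form_field_value(name, value)
```

A call such as fill_form(doc, {"employee_name": "Jane Doe", "wages": "85000.00"}) followed by doc.save("filled.pdf") mirrors the example above.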

Rust API

use pdf_oxide::PdfDocument;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut doc = PdfDocument::open("paper.pdf")?;

    // Extract text
    let text = doc.extract_text(0)?;

    // Character-level extraction
    let chars = doc.extract_chars(0)?;

    // Extract images
    let images = doc.extract_images(0)?;

    // Vector graphics
    let paths = doc.extract_paths(0)?;

    Ok(())
}

Form Fields (Rust)

use pdf_oxide::editor::{DocumentEditor, EditableDocument, SaveOptions};
use pdf_oxide::editor::form_fields::FormFieldValue;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open the form, set one text field, and write an incremental update
    let mut editor = DocumentEditor::open("w2.pdf")?;
    editor.set_form_field_value("employee_name", FormFieldValue::Text("Jane Doe".into()))?;
    editor.save_with_options("filled.pdf", SaveOptions::incremental())?;
    Ok(())
}

Installation

Python

pip install pdf_oxide

Wheels available for Linux, macOS, and Windows. Python 3.8–3.14.

Rust

[dependencies]
pdf_oxide = "0.3"

JavaScript/WASM

npm install pdf-oxide-wasm

const { WasmPdfDocument } = require("pdf-oxide-wasm");

CLI

brew install yfedoseev/tap/pdf-oxide    # Homebrew (macOS/Linux)
cargo install pdf_oxide_cli             # Cargo
cargo binstall pdf_oxide_cli            # Pre-built binary via cargo-binstall

MCP Server

brew install yfedoseev/tap/pdf-oxide    # Included with CLI in Homebrew
cargo install pdf_oxide_mcp             # Cargo

Other languages

  • Go: go get github.com/yfedoseev/pdf_oxide/go (see go/README.md)
  • JavaScript / TypeScript (Node.js): npm install pdf-oxide (see js/README.md)
  • C# / .NET: dotnet add package PdfOxide (see csharp/README.md)

All three share the same Rust core as the Python and WASM bindings, so everything you read in this README applies to them as well — just with each language's native naming conventions.

CLI

22 commands for PDF processing directly from your terminal:

pdf-oxide text report.pdf                      # Extract text
pdf-oxide markdown report.pdf -o report.md     # Convert to Markdown
pdf-oxide html report.pdf -o report.html       # Convert to HTML
pdf-oxide info report.pdf                      # Show metadata
pdf-oxide search report.pdf "neural.?network"  # Search (regex)
pdf-oxide images report.pdf -o ./images/       # Extract images
pdf-oxide merge a.pdf b.pdf -o combined.pdf    # Merge PDFs
pdf-oxide split report.pdf -o ./pages/         # Split into pages
pdf-oxide watermark doc.pdf "DRAFT"            # Add watermark
pdf-oxide forms w2.pdf --fill "name=Jane"      # Fill form fields

Run pdf-oxide with no arguments for interactive REPL mode. Use --pages 1-5 to process specific pages, --json for machine-readable output.
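For scripting, the CLI can be driven from Python with `subprocess`. The sketch below only assembles and runs a command line. The helper names are hypothetical, placing `--pages` and `--json` after the file path is an assumption, and the structure of the JSON output is not documented here, so nothing is parsed.

```python
import shutil
import subprocess


def build_command(subcommand, pdf_path, pages=None, json_output=False):
    """Assemble a pdf-oxide argument list, e.g. for the text or markdown command."""
    cmd = ["pdf-oxide", subcommand, str(pdf_path)]
    if pages:
        cmd += ["--pages", pages]  # e.g. "1-5"
    if json_output:
        cmd.append("--json")
    return cmd


def run(cmd):
    """Run the command and return its standard output as text."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} is not on PATH")
    # check=True raises CalledProcessError on a non-zero exit status.
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


print(build_command("text", "report.pdf", pages="1-5", json_output=True))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.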

MCP Server

pdf-oxide-mcp lets AI assistants (Claude, Cursor, etc.) extract content from PDFs locally via the Model Context Protocol.

Add to your MCP client configuration:

{
  "mcpServers": {
    "pdf-oxide": { "command": "crgx", "args": ["pdf_oxide_mcp@latest"] }
  }
}

The server exposes an extract tool that supports text, markdown, and HTML output formats with optional page ranges and image extraction. All processing runs locally — no files leave your machine.
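If a client already has other MCP servers configured, the entry above needs to be merged into the existing file instead of replacing it. A minimal sketch, assuming the client stores its settings as JSON with a top-level "mcpServers" object as shown above; the function name and the "other" server entry are illustrative, and the location of the config file depends on the client.

```python
import json

# Server entry copied from the configuration snippet above.
PDF_OXIDE_SERVER = {"command": "crgx", "args": ["pdf_oxide_mcp@latest"]}


def add_pdf_oxide_server(config):
    """Return a copy of an MCP client config with the pdf-oxide entry added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["pdf-oxide"] = dict(PDF_OXIDE_SERVER)
    merged["mcpServers"] = servers
    return merged


# Hypothetical existing configuration with one unrelated server.
existing = {"mcpServers": {"other": {"command": "other-server", "args": []}}}
print(json.dumps(add_pdf_oxide_server(existing), indent=2))
```

The input dictionary is left unchanged, so the result can be reviewed before it is written back to disk.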

Building from Source

# Clone and build
git clone https://github.com/yfedoseev/pdf_oxide
cd pdf_oxide
cargo build --release

# Run tests
cargo test

# Build Python bindings
maturin develop

# Build the shared library for Go, JS/TS, and C# bindings
cargo build --release --lib
# Output: target/release/libpdf_oxide.{so,dylib} or pdf_oxide.dll
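A build script that needs to locate the shared library can pick the file name by platform. This sketch maps the three names listed in the comment above to the values returned by Python's `platform.system()`; the helper name is hypothetical, and the association of `.so` with Linux, `.dylib` with macOS, and `.dll` with Windows follows the usual platform conventions.

```python
import platform


def shared_library_name(system=None):
    """Return the expected pdf_oxide shared library file name for a platform."""
    system = system or platform.system()
    names = {
        "Linux": "libpdf_oxide.so",
        "Darwin": "libpdf_oxide.dylib",  # macOS
        "Windows": "pdf_oxide.dll",
    }
    if system not in names:
        raise ValueError(f"no known library name for platform {system!r}")
    return names[system]


print("target/release/" + shared_library_name("Linux"))
```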

Documentation

Use Cases

  • RAG / LLM pipelines — Convert PDFs to clean Markdown for retrieval-augmented generation with LangChain, LlamaIndex, or any framework
  • AI assistants — Give Claude, Cursor, or any MCP-compatible tool direct PDF access via the MCP server
  • Document processing at scale — Extract text, images, and metadata from thousands of PDFs in seconds
  • Data extraction — Pull structured data from forms, tables, and layouts
  • Academic research — Parse papers, extract citations, and process large corpora
  • PDF generation — Create invoices, reports, certificates, and templated documents programmatically
  • PyMuPDF alternative — MIT licensed, 5× faster, no AGPL restrictions
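For the RAG use case, the Markdown produced per page usually has to be split into chunks before indexing. The following is a minimal sketch using only the standard library. It assumes the Markdown uses "#"-style headings and takes a plain string as input, such as the return value of `to_markdown`; the function name and the size limit are illustrative.

```python
def chunk_markdown(markdown, max_chars=1000):
    """Split Markdown into chunks.

    A new chunk starts at each heading line and whenever adding a line
    would push the current chunk past max_chars.
    """
    chunks, current, size = [], [], 0
    for line in markdown.splitlines():
        starts_section = line.startswith("#")
        too_big = size + len(line) + 1 > max_chars
        if current and (starts_section or too_big):
            chunks.append("\n".join(current).strip())
            current, size = [], 0
        current.append(line)
        size += len(line) + 1
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that contained only blank lines.
    return [chunk for chunk in chunks if chunk]


sample = "# Intro\nFirst paragraph.\n\n## Method\nSecond paragraph."
for chunk in chunk_markdown(sample):
    print(repr(chunk))
```

A line that begins with "#" inside a code block would also be treated as a heading here, so a production pipeline would likely want a real Markdown parser.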

Why I built this

I needed PyMuPDF's speed without its AGPL license, and I needed it in more than one language. Nothing existed that ticked all three boxes — fast, MIT, multi-language — so I wrote it. The Rust core is what does the real work; the bindings for Python, Go, JS/TS, C#, and WASM are thin shells around the same code, so a bug fix in one lands in all of them. It now passes 100% of the veraPDF + Mozilla pdf.js + DARPA SafeDocs test corpora (3,830 PDFs) on every platform I've tested.

If it's useful to you, a star on GitHub genuinely helps. If something's broken or missing, open an issue — I read all of them.

— Yury

License

Dual-licensed under MIT or Apache-2.0 at your option. Unlike AGPL-licensed alternatives, pdf_oxide can be used freely in any project — commercial or open-source — with no copyleft restrictions.

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

cargo build && cargo test && cargo fmt && cargo clippy -- -D warnings

Citation

@software{pdf_oxide,
  title = {PDF Oxide: Fast PDF Toolkit for Rust, Python, Go, JavaScript, and C#},
  author = {Yury Fedoseev},
  year = {2025},
  url = {https://github.com/yfedoseev/pdf_oxide}
}

Rust + Python + Go + JS/TS + C# + WASM + CLI + MCP | MIT/Apache-2.0 | 100% pass rate on 3,830 PDFs | 0.8ms mean | 5× faster than the industry leaders
