The fastest Python PDF library: 0.8ms mean, 5× faster than PyMuPDF. Text extraction, markdown conversion, PDF creation. 100% pass rate on 3,830 PDFs.

PDF Oxide - The Fastest PDF Toolkit for Python, Rust, Go, JS/TS, C#, Java, WASM, CLI & AI

New in v0.3.54 — text-extraction fidelity pass (Hebrew / RTL visual-vs-logical detection, ToUnicode CMap fallback for bullet & ligature decode, multi-column prose reading order, reference-style two-column reading order). Java is the 8th binding (fyi.oxide:pdf-oxide:0.3.54 on Maven Central, JDK 11+, free Kotlin interop via the same JAR). Ruby, PHP, and Swift are next on the roadmap. Want another language? Open an issue and tell us.

The fastest PDF library for text extraction, image extraction, and markdown conversion. Rust core with bindings for Python, Go, JavaScript / TypeScript, C# / .NET, Java (JDK 11+, Kotlin-compatible), and WASM, plus a CLI tool and MCP server for AI assistants. 0.8ms mean per document, 5× faster than PyMuPDF, 15× faster than pypdf. 100% pass rate on 3,830 real-world PDFs. Dual-licensed under MIT or Apache-2.0.

New in v0.3.24 — now available in Go, JavaScript / TypeScript, and C# / .NET, alongside the existing Python, Rust, and WASM bindings. Same Rust core, same 0.8 ms extraction speed, same 100% pass rate. See the language guides: Python · Go · JavaScript / TypeScript · C# / .NET · Java / Kotlin · WASM

Quick Start

Python

from pdf_oxide import PdfDocument

with PdfDocument("paper.pdf") as doc:
    print(len(doc))                          # number of pages
    for page in doc:
        text = page.text                     # lazy property
        chars = page.chars                   # lazy property
        md = page.markdown(detect_headings=True)

# Direct page access by index
doc = PdfDocument("paper.pdf")
page = doc[0]
text = page.text

Install from PyPI:

pip install pdf_oxide

Rust

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("paper.pdf")?;
let text = doc.extract_text(0)?;
let images = doc.extract_images(0)?;
let markdown = doc.to_markdown(0, Default::default())?;

Add to Cargo.toml:

[dependencies]
pdf_oxide = "0.3"

CLI

pdf-oxide text document.pdf
pdf-oxide markdown document.pdf -o output.md
pdf-oxide search document.pdf "pattern"
pdf-oxide merge a.pdf b.pdf -o combined.pdf

Install with Homebrew:

brew install yfedoseev/tap/pdf-oxide

MCP Server (for AI assistants)

# Install
brew install yfedoseev/tap/pdf-oxide   # includes pdf-oxide-mcp

# Configure in Claude Desktop / Claude Code / Cursor
{
  "mcpServers": {
    "pdf-oxide": { "command": "crgx", "args": ["pdf_oxide_mcp@latest"] }
  }
}

Why pdf_oxide?

  • Fast — 0.8ms mean per document, 5× faster than PyMuPDF, 15× faster than pypdf, 29× faster than pdfplumber
  • Reliable — 100% pass rate on 3,830 test PDFs, zero panics, zero timeouts
  • Complete — Text extraction, image extraction, PDF creation, and editing in one library
  • Multi-platform — Rust, Python, Go, JavaScript/TypeScript, C#/.NET, Java/Kotlin, WASM, CLI, and MCP server for AI assistants
  • Permissive license — MIT / Apache-2.0 — use freely in commercial and open-source projects

Performance

Benchmarked on 3,830 PDFs from three independent public test suites (veraPDF, Mozilla pdf.js, DARPA SafeDocs). Text extraction libraries only (no OCR). Single-thread, 60s timeout, no warm-up.
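
The mean and p99 columns in the tables below can be derived from raw per-file timings. The following is a minimal sketch of that bookkeeping, not the project's actual harness. It assumes a nearest-rank percentile (the source does not say which percentile method was used), and `extract` is a hypothetical callable standing in for whichever library is being measured.

```python
import math
import time
from statistics import mean


def time_call_ms(extract, path):
    """Time one extraction call in milliseconds (single run, no warm-up)."""
    start = time.perf_counter()
    extract(path)
    return (time.perf_counter() - start) * 1000.0


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Return the two statistics the tables report: mean and p99."""
    return {
        "mean_ms": mean(latencies_ms),
        "p99_ms": percentile_nearest_rank(latencies_ms, 99),
    }
```

Feeding `summarize` one `time_call_ms` result per PDF gives a row comparable in shape to the ones below; the 60s timeout and pass/fail accounting are left out of this sketch.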

Python Libraries

Library       Mean      p99     Pass Rate   License
PDF Oxide     0.8ms     9ms     100%        MIT
PyMuPDF       4.6ms     28ms    99.3%       AGPL-3.0
pypdfium2     4.1ms     42ms    99.2%       Apache-2.0
pymupdf4llm   55.5ms    280ms   99.1%       AGPL-3.0
pdftext       7.3ms     82ms    99.0%       GPL-3.0
pdfminer      16.8ms    124ms   98.8%       MIT
pdfplumber    23.2ms    189ms   98.8%       MIT
markitdown    108.8ms   378ms   98.6%       MIT
pypdf         12.1ms    97ms    98.4%       BSD-3
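
The headline speedup figures (5×, 15×, 29×) follow directly from the mean column above. A small check, using only numbers from the table and expressing them in microseconds so the division is exact:

```python
# Mean extraction times from the Python table, in microseconds
# (integers avoid floating-point noise in the ratios).
MEAN_US = {
    "PDF Oxide": 800,
    "PyMuPDF": 4600,
    "pypdf": 12100,
    "pdfplumber": 23200,
}


def speedup_over(library, baseline="PDF Oxide"):
    """How many times slower `library` is than `baseline`, by mean time."""
    return MEAN_US[library] / MEAN_US[baseline]


for name in ("PyMuPDF", "pypdf", "pdfplumber"):
    print(f"{name}: {speedup_over(name):.2f}x")
```

The ratios come out at 5.75, 15.125, and 29.0, so the quoted figures are these values rounded down.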

Rust Libraries

Library       Mean     p99    Pass Rate   Text Extraction
PDF Oxide     0.8ms    9ms    100%        Built-in
oxidize_pdf   13.5ms   11ms   99.1%       Basic
unpdf         2.8ms    10ms   95.1%       Basic
pdf_extract   4.08ms   37ms   91.5%       Basic
lopdf         0.3ms    2ms    80.2%       No built-in extraction

Text Quality

99.5% text parity with PyMuPDF and pypdfium2 across the full corpus. On the "hard" files where results differ, PDF Oxide recovers text from 7–10× as many files as it misses, measured against each competitor.

Corpus

Suite                            PDFs    Pass Rate
veraPDF (PDF/A compliance)       2,907   100%
Mozilla pdf.js                   897     99.2%
SafeDocs (targeted edge cases)   26      100%
Total                            3,830   100% (of valid PDFs)

100% pass rate on all valid PDFs — the 7 non-passing files across the corpus are intentionally broken test fixtures (missing PDF header, fuzz-corrupted catalogs, invalid xref streams).
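
The per-suite and total figures can be reconciled with a few lines of arithmetic. This uses only numbers from the table and the sentence above; placing all 7 broken fixtures in the pdf.js suite is an inference from the other two suites being at 100%.

```python
# PDF counts per suite, taken from the corpus table.
SUITES = {"veraPDF": 2907, "Mozilla pdf.js": 897, "SafeDocs": 26}
BROKEN_FIXTURES = 7  # non-passing files; inferred to all sit in the pdf.js suite

total = sum(SUITES.values())
pdfjs_rate = (SUITES["Mozilla pdf.js"] - BROKEN_FIXTURES) / SUITES["Mozilla pdf.js"]
valid_total = total - BROKEN_FIXTURES

print(total)                # 3830
print(f"{pdfjs_rate:.1%}")  # 99.2%
print(valid_total)          # 3823 valid PDFs, all of which pass
```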

Features

Extract         Create      Edit
Text & Layout   Documents   Annotations
Images          Tables      Form Fields
Forms           Graphics    Bookmarks
Annotations     Templates   Links
Bookmarks       Images      Content

Python API

Page-oriented API

from pdf_oxide import PdfDocument

with PdfDocument("report.pdf") as doc:
    print(len(doc))          # page count
    print(doc.version())

    # Iterate or index pages
    for page in doc:
        text   = page.text                      # str, lazy
        chars  = page.chars                     # list[TextChar], lazy
        words  = page.words                     # list[Word], lazy
        lines  = page.lines                     # list[TextLine], lazy
        tables = page.tables                    # list[Table], lazy
        images = page.images                    # list[Image], lazy
        md     = page.markdown(detect_headings=True)
        html   = page.html()
        print(f"Page {page.index}: {page.width:.0f}×{page.height:.0f} pts")

    # Direct index access (supports negative indices)
    first = doc[0]
    last  = doc[-1]
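
For RAG-style pipelines it is often convenient to collapse the per-page Markdown into one string. The sketch below relies only on page iteration and `page.markdown(detect_headings=...)` as shown above. The helper names and the `---` page separator are illustrative choices, not part of pdf_oxide.

```python
def document_to_markdown(doc, detect_headings=True, separator="\n\n---\n\n"):
    """Join per-page Markdown into one string, skipping empty pages.

    `doc` is anything iterable over pages that expose `.markdown(...)`,
    such as the PdfDocument shown above.
    """
    parts = [page.markdown(detect_headings=detect_headings) for page in doc]
    return separator.join(part.strip() for part in parts if part.strip())


def convert_file(path):
    """Open a PDF with pdf_oxide and return its Markdown (needs `pip install pdf_oxide`)."""
    # Imported lazily so document_to_markdown stays usable without the library.
    from pdf_oxide import PdfDocument

    with PdfDocument(path) as doc:
        return document_to_markdown(doc)
```

Keeping the join logic separate from file handling makes it easy to test with stand-in page objects.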

Scoped extraction

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")

# Extract from a region: (x, y, width, height) in PDF points
header = doc.within(0, (0, 700, 612, 92)).extract_text()
region = doc.within(0, (50, 400, 500, 200))
region_words  = region.extract_words()
region_images = region.extract_images()
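
Region tuples are easier to get right when computed from the page size. The helpers below are a hypothetical convenience, not part of pdf_oxide. They assume `y` is the lower edge of the region measured up from the bottom of the page, which is how standard PDF user space works and is consistent with the header example above on a 612×792 pt page.

```python
def top_band(page_width, page_height, band_height):
    """Region (x, y, width, height) covering the top `band_height` points of a page.

    Assumes y is the region's lower edge, measured up from the page bottom.
    """
    if not 0 < band_height <= page_height:
        raise ValueError("band_height must be within the page height")
    return (0, page_height - band_height, page_width, band_height)


def bottom_band(page_width, band_height):
    """Region covering the bottom `band_height` points, e.g. a footer."""
    return (0, 0, page_width, band_height)


# US Letter is 612 x 792 pt; a 92 pt band reproduces the header region used above.
print(top_band(612, 792, 92))  # (0, 700, 612, 92)
```

With a real page, this would be called as `doc.within(0, top_band(page.width, page.height, 92))`.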

Extraction profiles

from pdf_oxide import ExtractionProfile

# Pre-tuned profiles for different document types
words = doc.extract_words(0, profile=ExtractionProfile.form())
lines = doc.extract_text_lines(0, profile=ExtractionProfile.academic())

# Override adaptive thresholds (in PDF points)
words = doc.extract_words(0, word_gap_threshold=2.5)
lines = doc.extract_text_lines(0, word_gap_threshold=2.5, line_gap_threshold=4.0)
params = doc.page_layout_params(0)
print(f"word gap: {params.word_gap_threshold:.1f}")
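
To make `word_gap_threshold` concrete, here is an illustrative sketch of gap-based word grouping. It is not pdf_oxide's implementation: the `Glyph` type and `group_into_words` function are hypothetical, and a real extractor also has to account for font size, rotation, and baselines.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    text: str
    x0: float  # left edge, in PDF points
    x1: float  # right edge, in PDF points


def group_into_words(glyphs, word_gap_threshold=2.5):
    """Split a left-to-right run of glyphs wherever the horizontal gap exceeds the threshold."""
    words, current, prev_x1 = [], "", None
    for g in sorted(glyphs, key=lambda g: g.x0):
        if prev_x1 is not None and g.x0 - prev_x1 > word_gap_threshold:
            words.append(current)
            current = ""
        current += g.text
        prev_x1 = g.x1
    if current:
        words.append(current)
    return words
```

A smaller threshold splits tightly set text into more words; a larger one merges words separated by narrow spaces, which is why per-document-type profiles can help.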

Form Fields

# Extract form fields
fields = doc.get_form_fields()
for f in fields:
    print(f"{f.name} ({f.field_type}) = {f.value}")

# Fill and save
doc.set_form_field_value("employee_name", "Jane Doe")
doc.set_form_field_value("wages", "85000.00")
doc.save("filled.pdf")
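
When filling many fields at once, it can help to validate names before writing anything. The helper below is a hypothetical wrapper that uses only the three calls shown above (`get_form_fields`, `set_form_field_value`, `save`). Converting every value with `str()` is an assumption based on the string values in the example.

```python
def fill_form(doc, values, output_path):
    """Set several form fields, then save; reject names the document does not contain."""
    known = {field.name for field in doc.get_form_fields()}
    unknown = sorted(set(values) - known)
    if unknown:
        raise KeyError(f"unknown form fields: {', '.join(unknown)}")
    for name, value in values.items():
        # Assumption: field values are passed as strings, as in the example above.
        doc.set_form_field_value(name, str(value))
    doc.save(output_path)
```

Usage would look like `fill_form(doc, {"employee_name": "Jane Doe", "wages": "85000.00"}, "filled.pdf")`.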

Rust API

use pdf_oxide::PdfDocument;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut doc = PdfDocument::open("paper.pdf")?;

    // Extract text
    let text = doc.extract_text(0)?;

    // Character-level extraction
    let chars = doc.extract_chars(0)?;

    // Extract images
    let images = doc.extract_images(0)?;

    // Vector graphics
    let paths = doc.extract_paths(0)?;

    Ok(())
}

Form Fields (Rust)

use pdf_oxide::editor::{DocumentEditor, EditableDocument, SaveOptions};
use pdf_oxide::editor::form_fields::FormFieldValue;

let mut editor = DocumentEditor::open("w2.pdf")?;
editor.set_form_field_value("employee_name", FormFieldValue::Text("Jane Doe".into()))?;
editor.save_with_options("filled.pdf", SaveOptions::incremental())?;

Installation

Python

pip install pdf_oxide

Wheels available for Linux, macOS, and Windows. Python 3.8–3.14.

Rust

[dependencies]
pdf_oxide = "0.3"

JavaScript/WASM

npm install pdf-oxide-wasm

const { WasmPdfDocument } = require("pdf-oxide-wasm");

CLI

brew install yfedoseev/tap/pdf-oxide    # Homebrew (macOS/Linux)
cargo install pdf_oxide_cli             # Cargo
cargo binstall pdf_oxide_cli            # Pre-built binary via cargo-binstall

MCP Server

brew install yfedoseev/tap/pdf-oxide    # Included with CLI in Homebrew
cargo install pdf_oxide_mcp             # Cargo

Other languages

  • Go: go get github.com/yfedoseev/pdf_oxide/go (see go/README.md)

  • JavaScript / TypeScript (Node.js): npm install pdf-oxide (see js/README.md)

  • C# / .NET: dotnet add package PdfOxide (see csharp/README.md)

  • Java / Kotlin (JDK 11+): Maven coordinates fyi.oxide:pdf-oxide:0.3.67 (see java/README.md)

    <!-- Maven (pom.xml) -->
    <dependency>
      <groupId>fyi.oxide</groupId>
      <artifactId>pdf-oxide</artifactId>
      <version>0.3.67</version>
    </dependency>

    // Gradle (Kotlin DSL)
    implementation("fyi.oxide:pdf-oxide:0.3.67")
    

All four share the same Rust core as the Python and WASM bindings, so everything you read in this README applies to them as well — just with each language's native naming conventions.

CLI

22 commands for PDF processing directly from your terminal:

pdf-oxide text report.pdf                      # Extract text
pdf-oxide markdown report.pdf -o report.md     # Convert to Markdown
pdf-oxide html report.pdf -o report.html       # Convert to HTML
pdf-oxide info report.pdf                      # Show metadata
pdf-oxide search report.pdf "neural.?network"  # Search (regex)
pdf-oxide images report.pdf -o ./images/       # Extract images
pdf-oxide merge a.pdf b.pdf -o combined.pdf    # Merge PDFs
pdf-oxide split report.pdf -o ./pages/         # Split into pages
pdf-oxide watermark doc.pdf "DRAFT"            # Add watermark
pdf-oxide forms w2.pdf --fill "name=Jane"      # Fill form fields

Run pdf-oxide with no arguments for interactive REPL mode. Use --pages 1-5 to process specific pages, --json for machine-readable output.
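
The CLI can also be driven from a script. The sketch below only assembles argument lists from the subcommands and flags mentioned above (`--pages`, `--json`, `-o`). The function names are hypothetical, and it assumes the flags may follow the positional arguments and combine with any subcommand, which the source does not spell out.

```python
import subprocess


def build_command(subcommand, pdf_path, *extra, pages=None, json_output=False, output=None):
    """Assemble a pdf-oxide argument list; nothing is executed here."""
    cmd = ["pdf-oxide", subcommand, str(pdf_path), *extra]
    if pages:
        cmd += ["--pages", pages]
    if json_output:
        cmd.append("--json")
    if output:
        cmd += ["-o", str(output)]
    return cmd


def run(cmd):
    """Run the command and return its stdout; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
```

For example, `run(build_command("text", "report.pdf", pages="1-5", json_output=True))` would invoke the installed binary, provided `pdf-oxide` is on the PATH.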

MCP Server

pdf-oxide-mcp lets AI assistants (Claude, Cursor, etc.) extract content from PDFs locally via the Model Context Protocol.

Add to your MCP client configuration:

{
  "mcpServers": {
    "pdf-oxide": { "command": "crgx", "args": ["pdf_oxide_mcp@latest"] }
  }
}

The server exposes an extract tool that supports text, markdown, and HTML output formats with optional page ranges and image extraction. All processing runs locally — no files leave your machine.
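
If the MCP client configuration already lists other servers, the pdf-oxide entry needs to be merged in rather than pasted over the file. A minimal sketch, assuming the configuration is a JSON object with an `mcpServers` key as shown above; the `other` server in the example is hypothetical, and where the client keeps its config file is left to the client's own documentation.

```python
import json

# The server entry exactly as given in the configuration above.
PDF_OXIDE_SERVER = {"command": "crgx", "args": ["pdf_oxide_mcp@latest"]}


def add_pdf_oxide_server(config):
    """Return a copy of an MCP client config with the pdf-oxide entry added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["pdf-oxide"] = PDF_OXIDE_SERVER
    merged["mcpServers"] = servers
    return merged


# Hypothetical existing configuration with one unrelated server.
existing = {"mcpServers": {"other": {"command": "other-server", "args": []}}}
print(json.dumps(add_pdf_oxide_server(existing), indent=2))
```

The function copies the input rather than mutating it, so the original configuration can be kept as a backup before writing the merged result.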

Building from Source

# Clone and build
git clone https://github.com/yfedoseev/pdf_oxide
cd pdf_oxide
cargo build --release

# Run tests
cargo test

# Build Python bindings
maturin develop

# Build the shared library for Go, JS/TS, and C# bindings
cargo build --release --lib
# Output: target/release/libpdf_oxide.{so,dylib} or pdf_oxide.dll
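
Scripts that load the shared library need to pick the right file name per platform. The helper below is a hypothetical convenience that maps a `sys.platform`-style string to the file names listed in the output comment above; treating every non-Windows, non-macOS platform as `.so` is an assumption.

```python
from pathlib import PurePosixPath


def shared_library_path(platform, target_dir="target/release"):
    """Map a sys.platform-style string to the library file produced by the build."""
    if platform.startswith("win"):
        name = "pdf_oxide.dll"
    elif platform == "darwin":
        name = "libpdf_oxide.dylib"
    else:
        # Assumption: Linux and other Unix-likes use the .so name.
        name = "libpdf_oxide.so"
    return str(PurePosixPath(target_dir) / name)
```

In practice this would be called as `shared_library_path(sys.platform)` after `cargo build --release --lib`.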

Use Cases

  • RAG / LLM pipelines — Convert PDFs to clean Markdown for retrieval-augmented generation with LangChain, LlamaIndex, or any framework
  • AI assistants — Give Claude, Cursor, or any MCP-compatible tool direct PDF access via the MCP server
  • Document processing at scale — Extract text, images, and metadata from thousands of PDFs in seconds
  • Data extraction — Pull structured data from forms, tables, and layouts
  • Academic research — Parse papers, extract citations, and process large corpora
  • PDF generation — Create invoices, reports, certificates, and templated documents programmatically
  • PyMuPDF alternative — MIT licensed, 5× faster, no AGPL restrictions

Why I built this

I needed PyMuPDF's speed without its AGPL license, and I needed it in more than one language. Nothing existed that ticked all three boxes (fast, MIT, multi-language), so I wrote it. The Rust core does the real work; the bindings for Python, Go, JS/TS, C#, Java, and WASM are thin shells around the same code, so a bug fix in the core lands in all of them. It now passes every valid PDF in the veraPDF, Mozilla pdf.js, and DARPA SafeDocs test corpora (3,830 files in total) on every platform I've tested.

If it's useful to you, a star on GitHub genuinely helps. If something's broken or missing, open an issue — I read all of them.

— Yury

License

Dual-licensed under MIT or Apache-2.0 at your option. Unlike AGPL-licensed alternatives, pdf_oxide can be used freely in any project — commercial or open-source — with no copyleft restrictions.

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

cargo build && cargo test && cargo fmt && cargo clippy -- -D warnings

Citation

@software{pdf_oxide,
  title = {PDF Oxide: Fast PDF Toolkit for Rust, Python, Go, JavaScript, and C#},
  author = {Yury Fedoseev},
  year = {2025},
  url = {https://github.com/yfedoseev/pdf_oxide}
}

Rust + Python + Go + JS/TS + C# + Java + WASM + CLI + MCP | MIT/Apache-2.0 | 100% pass rate on 3,830 PDFs | 0.8ms mean | 5× faster than PyMuPDF
