
The fastest Python PDF library: 0.8ms mean, 5× faster than PyMuPDF. Text extraction, markdown conversion, PDF creation. 100% pass rate on 3,830 PDFs.


PDFOxide - The Fastest PDF Toolkit for 20 Languages — Python, Rust, Go, JS/TS, C#, Java, Kotlin, Swift, C++ & more, plus CLI & AI

New in v0.3.69 — eleven new language bindings. PDFOxide now ships idiomatic bindings for C++, Swift, Kotlin, Dart, R, Julia, Zig, Scala, Clojure, Objective-C, and Elixir, each built over the stable C ABI with its own CI workflow, api-coverage tests, and runnable examples. That brings the toolkit to 20 languages (Rust core + 19 bindings). Want another language? Open an issue and tell us.

The fastest PDF library for text extraction, image extraction, and markdown conversion. A Rust core with bindings for 19 languages (Python, Go, JavaScript / TypeScript, C# / .NET, Java, Kotlin, Scala, Clojure, Ruby, PHP, C++, Objective-C, Swift, Dart, R, Julia, Zig, Elixir, and WASM), plus a CLI tool and an MCP server for AI assistants. 0.8ms mean per document, 5× faster than PyMuPDF, 15× faster than pypdf. 100% pass rate on 3,830 real-world PDFs. Dual-licensed under MIT or Apache-2.0.


Quick Start

Python

from pdf_oxide import PdfDocument

with PdfDocument("paper.pdf") as doc:
    print(len(doc))                          # number of pages
    for page in doc:
        text = page.text                     # lazy property
        chars = page.chars                   # lazy property
        md = page.markdown(detect_headings=True)

# Direct page access by index
doc = PdfDocument("paper.pdf")
page = doc[0]
text = page.text

# Install
pip install pdf_oxide

Rust

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("paper.pdf")?;
let text = doc.extract_text(0)?;
let images = doc.extract_images(0)?;
let markdown = doc.to_markdown(0, Default::default())?;

# Cargo.toml
[dependencies]
pdf_oxide = "0.3"

CLI

pdf-oxide text document.pdf
pdf-oxide markdown document.pdf -o output.md
pdf-oxide search document.pdf "pattern"
pdf-oxide merge a.pdf b.pdf -o combined.pdf

# Install
brew install yfedoseev/tap/pdf-oxide

MCP Server (for AI assistants)

# Install
brew install yfedoseev/tap/pdf-oxide   # includes pdf-oxide-mcp

# Configure in Claude Desktop / Claude Code / Cursor
{
  "mcpServers": {
    "pdf-oxide": { "command": "crgx", "args": ["pdf_oxide_mcp@latest"] }
  }
}

Why PDFOxide?

  • Fast — 0.8ms mean per document, 5× faster than PyMuPDF, 15× faster than pypdf, 29× faster than pdfplumber
  • Reliable — 100% pass rate on 3,830 test PDFs, zero panics, zero timeouts
  • Complete — Text extraction, image extraction, PDF creation, and editing in one library
  • Multi-platform — 20 languages (Rust core + 19 bindings: Python, Go, JS/TS, C#/.NET, Java, Kotlin, Scala, Clojure, Ruby, PHP, C++, Objective-C, Swift, Dart, R, Julia, Zig, Elixir, WASM), plus a CLI and MCP server for AI assistants
  • Permissive license — MIT / Apache-2.0 — use freely in commercial and open-source projects

Performance

Benchmarked on 3,830 PDFs from three independent public test suites (veraPDF, Mozilla pdf.js, DARPA SafeDocs). Text extraction libraries only (no OCR). Single-thread, 60s timeout, no warm-up.

Python Libraries

Library      Mean     p99    Pass Rate  License
PDFOxide     0.8ms    9ms    100%       MIT
PyMuPDF      4.6ms    28ms   99.3%      AGPL-3.0
pypdfium2    4.1ms    42ms   99.2%      Apache-2.0
pymupdf4llm  55.5ms   280ms  99.1%      AGPL-3.0
pdftext      7.3ms    82ms   99.0%      GPL-3.0
pdfminer     16.8ms   124ms  98.8%      MIT
pdfplumber   23.2ms   189ms  98.8%      MIT
markitdown   108.8ms  378ms  98.6%      MIT
pypdf        12.1ms   97ms   98.4%      BSD-3
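
The headline multipliers (5×, 15×, 29×) can be rechecked from the mean column above. The following is only a sanity-check sketch: the numbers are copied from the table, and the helper name `speedups` is made up for illustration.

```python
# Mean extraction times in milliseconds, copied from the Python Libraries table.
MEAN_MS = {
    "PDFOxide": 0.8,
    "PyMuPDF": 4.6,
    "pypdfium2": 4.1,
    "pymupdf4llm": 55.5,
    "pdftext": 7.3,
    "pdfminer": 16.8,
    "pdfplumber": 23.2,
    "markitdown": 108.8,
    "pypdf": 12.1,
}


def speedups(means, baseline="PDFOxide"):
    """Return each library's mean time divided by the baseline's mean time."""
    base = means[baseline]
    return {name: ms / base for name, ms in means.items() if name != baseline}


# Print the ratios from smallest to largest.
for name, factor in sorted(speedups(MEAN_MS).items(), key=lambda kv: kv[1]):
    print(f"{name:12s} {factor:6.1f}x")
```

The ratios for PyMuPDF, pypdf, and pdfplumber come out between 5 and 6, just above 15, and at 29, which matches the rounded-down figures quoted in this README.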

Rust Libraries

Library      Mean    p99   Pass Rate  Text Extraction
PDFOxide     0.8ms   9ms   100%       Built-in
oxidize_pdf  13.5ms  11ms  99.1%      Basic
unpdf        2.8ms   10ms  95.1%      Basic
pdf_extract  4.08ms  37ms  91.5%      Basic
lopdf        0.3ms   2ms   80.2%      No built-in extraction

Text Quality

99.5% text parity with PyMuPDF and pypdfium2 across the full corpus. Against any single competitor, the "hard" files that only PDFOxide extracts text from outnumber the ones it misses by 7–10×.

Corpus

Suite                           PDFs   Pass Rate
veraPDF (PDF/A compliance)      2,907  100%
Mozilla pdf.js                  897    99.2%
SafeDocs (targeted edge cases)  26     100%
Total                           3,830  100%

100% pass rate on all valid PDFs — the 7 non-passing files across the corpus are intentionally broken test fixtures (missing PDF header, fuzz-corrupted catalogs, invalid xref streams).
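
The arithmetic behind the corpus figures can be sketched as follows. One assumption is labeled in the code: all 7 non-passing fixtures are placed in the Mozilla pdf.js suite, which is what the per-suite rates suggest but is not stated outright.

```python
# Suite sizes copied from the Corpus table.
SUITES = {"veraPDF": 2907, "Mozilla pdf.js": 897, "SafeDocs": 26}

# Assumption: every one of the 7 non-passing fixtures is in the pdf.js suite,
# since the other two suites are listed at 100%.
FAILURES = {"veraPDF": 0, "Mozilla pdf.js": 7, "SafeDocs": 0}


def pass_rate(total, failed):
    """Percentage of files that passed."""
    return 100.0 * (total - failed) / total


total = sum(SUITES.values())
failed = sum(FAILURES.values())

per_suite = {name: round(pass_rate(n, FAILURES[name]), 1) for name, n in SUITES.items()}
raw_overall = round(pass_rate(total, failed), 1)        # counting the broken fixtures
valid_overall = round(pass_rate(total - failed, 0), 1)  # valid PDFs only

print(total, per_suite, raw_overall, valid_overall)
```

Under that assumption the pdf.js rate rounds to 99.2%, and the overall figure is 100% once the intentionally broken fixtures are excluded.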

Features

Extract        Create     Edit
Text & Layout  Documents  Annotations
Images         Tables     Form Fields
Forms          Graphics   Bookmarks
Annotations    Templates  Links
Bookmarks      Images     Content

Python API

Page-oriented API

from pdf_oxide import PdfDocument

with PdfDocument("report.pdf") as doc:
    print(len(doc))          # page count
    print(doc.version())

    # Iterate or index pages
    for page in doc:
        text   = page.text                      # str, lazy
        chars  = page.chars                     # list[TextChar], lazy
        words  = page.words                     # list[Word], lazy
        lines  = page.lines                     # list[TextLine], lazy
        tables = page.tables                    # list[Table], lazy
        images = page.images                    # list[Image], lazy
        md     = page.markdown(detect_headings=True)
        html   = page.html()
        print(f"Page {page.index}: {page.width:.0f}×{page.height:.0f} pts")

    # Direct index access (supports negative indices)
    first = doc[0]
    last  = doc[-1]

Scoped extraction

# Extract from a region: (x, y, width, height) in PDF points
header = doc.within(0, (0, 700, 612, 92)).extract_text()
region = doc.within(0, (50, 400, 500, 200))
region_words  = region.extract_words()
region_images = region.extract_images()
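
To make the region tuples concrete, here is a small geometry sketch that does not use pdf_oxide at all. It assumes the usual PDF coordinate convention (origin at the bottom-left, y growing upward) and a US Letter page of 612×792 points. Whether `within` interprets `y` exactly this way is an assumption, and the `Region` class is hypothetical.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """A rectangle given as (x, y, width, height) in PDF points."""
    x: float
    y: float
    width: float
    height: float

    @property
    def right(self):
        return self.x + self.width

    @property
    def top(self):
        return self.y + self.height

    def contains(self, px, py):
        """True if the point (px, py) lies inside or on the edge of the region."""
        return self.x <= px <= self.right and self.y <= py <= self.top


# The two regions used in the example above.
header = Region(0, 700, 612, 92)
body = Region(50, 400, 500, 200)

print(header.top)                 # upper edge of the header band
print(header.contains(300, 750))  # a point near the top of the page
print(body.contains(300, 750))    # the same point is outside the body region
```

Read this way, the header region covers the full page width from y = 700 up to y = 792, the top edge of a Letter page.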

Extraction profiles

from pdf_oxide import ExtractionProfile

# Pre-tuned profiles for different document types
words = doc.extract_words(0, profile=ExtractionProfile.form())
lines = doc.extract_text_lines(0, profile=ExtractionProfile.academic())

# Override adaptive thresholds (in PDF points)
words = doc.extract_words(0, word_gap_threshold=2.5)
lines = doc.extract_text_lines(0, word_gap_threshold=2.5, line_gap_threshold=4.0)
params = doc.page_layout_params(0)
print(f"word gap: {params.word_gap_threshold:.1f}")
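
The effect of a word gap threshold can be illustrated with a toy grouping routine. This is not PDFOxide's algorithm, only a minimal sketch of the idea: glyphs on one line are merged into a word until the horizontal gap to the next glyph exceeds the threshold. `Glyph` and `group_into_words` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x0: float  # left edge, in PDF points
    x1: float  # right edge, in PDF points


def group_into_words(glyphs, word_gap_threshold):
    """Merge left-to-right glyphs into words, splitting where the gap exceeds the threshold."""
    words, current, prev = [], "", None
    for glyph in sorted(glyphs, key=lambda g: g.x0):
        if prev is not None and glyph.x0 - prev.x1 > word_gap_threshold:
            words.append(current)
            current = ""
        current += glyph.char
        prev = glyph
    if current:
        words.append(current)
    return words


# Gaps between neighbours are 0.5, 3.0, and 0.5 points.
line = [Glyph("a", 0, 5), Glyph("b", 5.5, 10.5), Glyph("c", 13.5, 18.5), Glyph("d", 19, 24)]

print(group_into_words(line, word_gap_threshold=2.5))
print(group_into_words(line, word_gap_threshold=4.0))
```

With a threshold of 2.5 the 3-point gap splits the line into two words; raising it to 4.0 merges everything into one, which is the kind of trade-off an explicit override lets you control.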

Form Fields

# Extract form fields
fields = doc.get_form_fields()
for f in fields:
    print(f"{f.name} ({f.field_type}) = {f.value}")

# Fill and save
doc.set_form_field_value("employee_name", "Jane Doe")
doc.set_form_field_value("wages", "85000.00")
doc.save("filled.pdf")
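
When filling many fields at once, it can help to check names before writing anything. The sketch below relies only on the two calls shown above (`get_form_fields` and `set_form_field_value`). `fill_form` is a hypothetical helper, and `FakeDoc` is a stand-in so the example runs without a PDF; how the real library reacts to an unknown field name is not covered here.

```python
from types import SimpleNamespace


def fill_form(doc, values):
    """Set every field in `values`, refusing names the document does not report."""
    known = {field.name for field in doc.get_form_fields()}
    unknown = sorted(set(values) - known)
    if unknown:
        raise KeyError("unknown form field(s): " + ", ".join(unknown))
    for name, value in values.items():
        doc.set_form_field_value(name, value)


class FakeDoc:
    """Hypothetical stand-in for PdfDocument, used only to make the sketch runnable."""

    def __init__(self, names):
        self._fields = {name: "" for name in names}

    def get_form_fields(self):
        return [SimpleNamespace(name=n, value=v) for n, v in self._fields.items()]

    def set_form_field_value(self, name, value):
        self._fields[name] = value


doc = FakeDoc(["employee_name", "wages"])
fill_form(doc, {"employee_name": "Jane Doe", "wages": "85000.00"})
print({f.name: f.value for f in doc.get_form_fields()})
```

Because the check happens before the first write, a typo in one field name leaves the document untouched.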

Rust API

use pdf_oxide::PdfDocument;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut doc = PdfDocument::open("paper.pdf")?;

    // Extract text
    let text = doc.extract_text(0)?;

    // Character-level extraction
    let chars = doc.extract_chars(0)?;

    // Extract images
    let images = doc.extract_images(0)?;

    // Vector graphics
    let paths = doc.extract_paths(0)?;

    Ok(())
}

Form Fields (Rust)

use pdf_oxide::editor::{DocumentEditor, EditableDocument, SaveOptions};
use pdf_oxide::editor::form_fields::FormFieldValue;

let mut editor = DocumentEditor::open("w2.pdf")?;
editor.set_form_field_value("employee_name", FormFieldValue::Text("Jane Doe".into()))?;
editor.save_with_options("filled.pdf", SaveOptions::incremental())?;

Installation

Python

pip install pdf_oxide

Prebuilt wheels are available for Linux, macOS, and Windows, for Python 3.8–3.14.

Rust

[dependencies]
pdf_oxide = "0.3"

JavaScript/WASM

npm install pdf-oxide-wasm

const { WasmPdfDocument } = require("pdf-oxide-wasm");

CLI

brew install yfedoseev/tap/pdf-oxide    # Homebrew (macOS/Linux)
cargo install pdf_oxide_cli             # Cargo
cargo binstall pdf_oxide_cli            # Pre-built binary via cargo-binstall

MCP Server

brew install yfedoseev/tap/pdf-oxide    # Included with CLI in Homebrew
cargo install pdf_oxide_mcp             # Cargo

Other languages

Established bindings:

  • Go: go get github.com/yfedoseev/pdf_oxide/go (see go/README.md)
  • JavaScript / TypeScript (Node.js): npm install pdf-oxide (see js/README.md)
  • C# / .NET: dotnet add package PdfOxide (see csharp/README.md)
  • Java (JDK 11+): Maven coordinates fyi.oxide:pdf-oxide:0.3.69 (see java/README.md)
  • Ruby: gem install pdf_oxide (see ruby/README.md)
  • PHP: composer require oxide/pdf-oxide (see php/README.md)

New in v0.3.69 (all over the stable C ABI):

<!-- Java (Maven) -->
<dependency>
  <groupId>fyi.oxide</groupId>
  <artifactId>pdf-oxide</artifactId>
  <version>0.3.69</version>
</dependency>

// Kotlin (Gradle, Kotlin DSL)
implementation("fyi.oxide:pdf-oxide-kotlin:0.3.69")

Every binding shares the same Rust core, so a bug fix in one lands in all of them — everything you read in this README applies, just with each language's native naming conventions. Publishing details for each registry are in docs/RELEASING-bindings.md.

CLI

22 commands for PDF processing directly from your terminal:

pdf-oxide text report.pdf                      # Extract text
pdf-oxide markdown report.pdf -o report.md     # Convert to Markdown
pdf-oxide html report.pdf -o report.html       # Convert to HTML
pdf-oxide info report.pdf                      # Show metadata
pdf-oxide search report.pdf "neural.?network"  # Search (regex)
pdf-oxide images report.pdf -o ./images/       # Extract images
pdf-oxide merge a.pdf b.pdf -o combined.pdf    # Merge PDFs
pdf-oxide split report.pdf -o ./pages/         # Split into pages
pdf-oxide watermark doc.pdf "DRAFT"            # Add watermark
pdf-oxide forms w2.pdf --fill "name=Jane"      # Fill form fields

Run pdf-oxide with no arguments for interactive REPL mode. Use --pages 1-5 to process specific pages, --json for machine-readable output.
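
The search pattern in the listing above, `neural.?network`, allows at most one arbitrary character between the two words. A quick way to see what that covers is to try it with Python's `re` module. This only demonstrates the pattern itself; the CLI's regex engine and its case-sensitivity defaults are not documented here and may differ.

```python
import re

# Same pattern as in the CLI example; ".?" means "zero or one of any character".
PATTERN = re.compile(r"neural.?network")

samples = [
    "neural network",    # one space
    "neural-network",    # hyphen
    "neuralnetwork",     # nothing in between
    "neural  network",   # two spaces: one character too many
    "Neural network",    # capital N: Python's re is case-sensitive by default
]

matches = [s for s in samples if PATTERN.search(s)]
print(matches)
```

Only the first three samples match, so a pattern like this catches common spelling variants without matching arbitrary text between the words.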

MCP Server

pdf-oxide-mcp lets AI assistants (Claude, Cursor, etc.) extract content from PDFs locally via the Model Context Protocol.

Add to your MCP client configuration:

{
  "mcpServers": {
    "pdf-oxide": { "command": "crgx", "args": ["pdf_oxide_mcp@latest"] }
  }
}

The server exposes an extract tool that supports text, markdown, and HTML output formats with optional page ranges and image extraction. All processing runs locally — no files leave your machine.
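
If your MCP client already has a configuration with other servers, the entry above needs to be merged in rather than pasted over the file. A minimal sketch of that merge, working on a plain dictionary: the function name and the `other-tool` entry are made up for illustration, and the location of the config file depends on your client.

```python
import json

# Server entry copied from the configuration snippet above.
PDF_OXIDE_ENTRY = {"command": "crgx", "args": ["pdf_oxide_mcp@latest"]}


def add_pdf_oxide_server(config):
    """Return a copy of an MCP client config with the pdf-oxide server registered."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["pdf-oxide"] = dict(PDF_OXIDE_ENTRY)
    merged["mcpServers"] = servers
    return merged


# Hypothetical existing configuration with one unrelated server.
existing = {"mcpServers": {"other-tool": {"command": "other", "args": []}}}
updated = add_pdf_oxide_server(existing)
print(json.dumps(updated, indent=2))
```

The function returns a new dictionary and leaves the input as it was, so you can compare the two before writing the result back to disk.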

Building from Source

# Clone and build
git clone https://github.com/yfedoseev/pdf_oxide
cd pdf_oxide
cargo build --release

# Run tests
cargo test

# Build Python bindings
maturin develop

# Build the shared library for Go, JS/TS, and C# bindings
cargo build --release --lib
# Output: target/release/libpdf_oxide.{so,dylib} or pdf_oxide.dll

Documentation

Use Cases

  • RAG / LLM pipelines — Convert PDFs to clean Markdown for retrieval-augmented generation with LangChain, LlamaIndex, or any framework
  • AI assistants — Give Claude, Cursor, or any MCP-compatible tool direct PDF access via the MCP server
  • Document processing at scale — Extract text, images, and metadata from thousands of PDFs in seconds
  • Data extraction — Pull structured data from forms, tables, and layouts
  • Academic research — Parse papers, extract citations, and process large corpora
  • PDF generation — Create invoices, reports, certificates, and templated documents programmatically
  • PyMuPDF alternative — MIT licensed, 5× faster, no AGPL restrictions
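
For the RAG use case, a common next step after Markdown conversion is to split the text into heading-delimited chunks before embedding. The sketch below works on any Markdown string. It assumes headings are written in the `#` style, which may or may not be what `page.markdown(detect_headings=True)` emits, and `split_by_heading` is a hypothetical helper rather than part of the library.

```python
import re

# Matches "# Title" through "###### Title".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(markdown):
    """Split Markdown into (heading, body) pairs; text before the first heading gets ''."""
    chunks, heading, body = [], "", []

    def flush():
        # Keep a chunk only if it has a heading or some non-blank body text.
        if heading or any(line.strip() for line in body):
            chunks.append((heading, "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks


sample = "# Intro\nPDFs are hard.\n\n## Method\nWe parse them.\n"
for title, text in split_by_heading(sample):
    print(repr(title), repr(text))
```

Each chunk keeps its heading, which can be stored as metadata alongside the embedded text.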

Why I built this

I needed PyMuPDF's speed without its AGPL license, and I needed it in more than one language. Nothing existed that ticked all three boxes — fast, MIT, multi-language — so I wrote it. The Rust core is what does the real work; the bindings for Python, Go, JS/TS, C#, and WASM are thin shells around the same code, so a bug fix in one lands in all of them. It now passes 100% of the veraPDF + Mozilla pdf.js + DARPA SafeDocs test corpora (3,830 PDFs) on every platform I've tested.

If it's useful to you, a star on GitHub genuinely helps. If something's broken or missing, open an issue — I read all of them.

— Yury

License

Dual-licensed under MIT or Apache-2.0 at your option. Unlike AGPL-licensed alternatives, PDFOxide can be used freely in any project — commercial or open-source — with no copyleft restrictions.

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

cargo build && cargo test && cargo fmt && cargo clippy -- -D warnings

Citation

@software{pdf_oxide,
  title = {PDFOxide: Fast Multi-Language PDF Toolkit (Rust core, 19 language bindings)},
  author = {Yury Fedoseev},
  year = {2025},
  url = {https://github.com/yfedoseev/pdf_oxide}
}

20 languages (Rust + Python + Go + JS/TS + C# + Java + Kotlin + Scala + Clojure + Ruby + PHP + C++ + Objective-C + Swift + Dart + R + Julia + Zig + Elixir + WASM) + CLI + MCP | MIT/Apache-2.0 | 100% pass rate on 3,830 PDFs | 0.8ms mean | 5× faster than PyMuPDF
