ripdoc

Rust-native PDF extraction. 73x faster than pdfplumber.

ripdoc is a drop-in replacement for pdfplumber, written in Rust with Python bindings via PyO3. It extracts text, words, tables, and more from PDFs; in the benchmark below, a 200-page report takes 0.16 s.

$ python bench.py report.pdf   # 200 pages

ripdoc       0.16s  ██
pymupdf      5.97s  ████████████████████████████████████████
pdfplumber  11.63s  █████████████████████████████████████████████████████████████████████████
pdfminer    15.55s  ██████████████████████████████████████████████████████████████████████████████████████████████████
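
The headline figure can be checked against the timings above. A small sketch of the arithmetic, using only the four numbers printed by the benchmark (the variable names are illustrative):

```python
# Wall-clock timings in seconds, copied from the benchmark output above.
timings = {
    "ripdoc": 0.16,
    "pymupdf": 5.97,
    "pdfplumber": 11.63,
    "pdfminer": 15.55,
}

# Speedup of ripdoc relative to each library: other_time / ripdoc_time.
speedups = {
    name: seconds / timings["ripdoc"]
    for name, seconds in timings.items()
    if name != "ripdoc"
}

for name, factor in speedups.items():
    print(f"{name:<11}{factor:5.1f}x")
```

The pdfplumber ratio is 11.63 / 0.16 ≈ 72.7, which rounds to the quoted 73x. The figure comes from a single 200-page file, so other documents may behave differently.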

Install

pip install ripdoc

Requires Python 3.8+. Pre-built wheels are currently available for macOS (arm64) only. On other platforms, pip builds from the source distribution, which requires a Rust toolchain.

Quick start

import ripdoc

pdf = ripdoc.open("report.pdf")

for page in pdf.pages:
    # Extract text
    text = page.extract_text()

    # Extract with layout preservation
    text = page.extract_text(layout=True)

    # Extract words with bounding boxes
    words = page.extract_words()

    # Extract tables
    tables = page.extract_tables()

    # Search for text
    results = page.search("revenue")

Drop-in pdfplumber replacement

Swap one import — everything else stays the same:

# Before
import pdfplumber
pdf = pdfplumber.open("report.pdf")

# After
import ripdoc as pdfplumber
pdf = pdfplumber.open("report.pdf")

Or use the explicit compat module:

import ripdoc.compat as pdfplumber
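
For code that must also run where ripdoc is not installed (for example, on a platform without a pre-built wheel or a Rust toolchain), one option is to fall back to pdfplumber at import time. A minimal sketch; `load_backend` is a hypothetical helper and not part of either library:

```python
import importlib


def load_backend(candidates=("ripdoc", "pdfplumber")):
    """Return the first importable module from `candidates`.

    Hypothetical helper: tries each module name in order and moves on
    to the next one when the import fails.
    """
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError(f"none of {candidates!r} could be imported")
```

Call sites would then use `pdfplumber = load_backend()` followed by `pdfplumber.open("report.pdf")`, relying on the compatibility described above.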

API

`ripdoc.open(path) -> PDF`

Open a PDF file from a path. To open a PDF that is already in memory, use PDF.from_bytes(bytes).
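
The two entry points can be wrapped in one helper that dispatches on the input type. This is only a sketch: `open_pdf` is a hypothetical name, and it assumes the `PDF` class is reachable as an attribute of the imported module (for example `ripdoc.PDF`), which is not stated above.

```python
import os


def open_pdf(backend, source):
    """Open `source` with `backend`, dispatching on its type.

    `backend` is the imported module (e.g. ripdoc). Assumption: the PDF
    class is exposed as `backend.PDF`. Only `open(path)` and
    `PDF.from_bytes(bytes)` are taken from the documentation above.
    """
    if isinstance(source, (bytes, bytearray)):
        return backend.PDF.from_bytes(bytes(source))
    # os.fspath accepts str and pathlib.Path and returns a plain path.
    return backend.open(os.fspath(source))
```

With this in place, `open_pdf(ripdoc, "report.pdf")` and `open_pdf(ripdoc, response_bytes)` would go through the same call site.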

`PDF`

Property / Method	Description
`pdf.pages`	List of `Page` objects
`pdf.page_count`	Number of pages
`pdf.metadata`	Document metadata dict
`pdf.page(n)`	Get page by number (1-indexed)
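
Note the two indexing conventions: `pdf.page(n)` is 1-indexed, while `pdf.pages` is a Python list and therefore 0-indexed, so `pdf.pages[0]` should correspond to `pdf.page(1)`. The sketch below extracts a page range with the 1-indexed accessor; `extract_range` is a hypothetical helper that relies only on the members listed in the table.

```python
def extract_range(pdf, first, last):
    """Return {page_number: text} for pages first..last (1-indexed, inclusive).

    Uses only `pdf.page_count`, `pdf.page(n)` and `page.extract_text()`.
    The requested range is clamped to the pages that exist.
    """
    first = max(first, 1)
    last = min(last, pdf.page_count)
    return {n: pdf.page(n).extract_text() for n in range(first, last + 1)}
```

For example, `extract_range(pdf, 1, 10)` would return the text of the first ten pages, or fewer if the document is shorter.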

`Page`

Property / Method	Description
`page.extract_text(layout=False)`	Extract text, optionally preserving spatial layout
`page.extract_words()`	Words with bounding boxes (`x0`, `top`, `x1`, `bottom`)
`page.extract_tables()`	Tables as list of row lists
`page.extract_table()`	Largest table on the page
`page.find_tables()`	Table objects with metadata
`page.search(query)`	Find text matches with positions
`page.chars`	Individual characters with font info
`page.lines`	Line segments
`page.rects`	Rectangles
`page.edges`	Edges (used for table detection)
`page.crop(bbox)`	Crop to bounding box `(x0, top, x1, bottom)`
`page.within_bbox(bbox)`	Filter objects within bounding box
`page.width` / `page.height`	Page dimensions in points
`page.page_number`	1-indexed page number
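
The bounding boxes returned by `page.extract_words()` make simple post-processing possible in plain Python. The sketch below groups words into text lines by their `top` coordinate. It assumes each word is a dict that also carries a `text` key, as in pdfplumber; `group_words_into_lines` and its `tolerance` parameter are illustrative and not part of ripdoc.

```python
def group_words_into_lines(words, tolerance=3.0):
    """Group word dicts into text lines by vertical position.

    Each word is a dict with `x0`, `top`, `x1`, `bottom` (as listed above)
    plus a `text` key (assumed, following pdfplumber's word format).
    A word joins the current line when its `top` lies within `tolerance`
    points of the first word of that line.
    """
    lines = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order words left to right and join their text.
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    ]
```

A typical call would be `group_words_into_lines(page.extract_words())`. The tolerance is in points and may need tuning for documents with very small or very large type.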

Architecture

ripdoc
├── ripdoc-core     Pure Rust library (~5500 LOC)
│   ├── content_stream    PDF operator interpreter
│   ├── fonts/            Encoding, CMap, metrics
│   ├── geometry/         BBox, CTM, clustering
│   ├── text/             Word grouping, layout, search
│   ├── table/            Nurminen/Tabula algorithm
│   └── output/           Markdown, JSON, HTML, CSV
└── ripdoc-python   PyO3 bindings (~450 LOC)

Built on lopdf for low-level PDF structure parsing. All text extraction, table detection, and layout analysis are implemented from scratch in Rust.

Features

Text extraction — simple and layout-preserving modes
Table detection — Nurminen/Tabula algorithm with merged cell support
Search — full-text search with bounding box positions
Reading order — XY-cut algorithm + tagged PDF structure tree
Output formats — Markdown, JSON, HTML, CSV
Spatial queries — crop, within_bbox, character-level access
pdfplumber compatible — same API, same patterns
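
The reading-order feature names the XY-cut algorithm. The sketch below is a generic, simplified version of that technique for illustration only. It is not ripdoc's Rust implementation, and the function names, the `min_gap` threshold, and the tie-breaking rule are all assumptions. The idea is to split the page recursively along the widest band of whitespace, then read the resulting regions top to bottom and left to right.

```python
def widest_gap(intervals, min_gap):
    """Find the widest empty gap between 1-D intervals.

    Returns (width, midpoint) of the widest gap strictly larger than
    `min_gap`, or (min_gap, None) when no such gap exists.
    """
    best_width, best_mid = min_gap, None
    reach = None  # furthest right edge seen so far
    for lo, hi in sorted(intervals):
        if reach is not None and lo - reach > best_width:
            best_width, best_mid = lo - reach, (lo + reach) / 2
        reach = hi if reach is None else max(reach, hi)
    return best_width, best_mid


def xy_cut(boxes, min_gap=5.0):
    """Order boxes (x0, top, x1, bottom, ...) by recursive XY-cut."""
    if len(boxes) <= 1:
        return list(boxes)
    x_width, x_mid = widest_gap([(b[0], b[2]) for b in boxes], min_gap)
    y_width, y_mid = widest_gap([(b[1], b[3]) for b in boxes], min_gap)
    if x_mid is None and y_mid is None:
        # No usable whitespace: fall back to top-to-bottom, left-to-right.
        return sorted(boxes, key=lambda b: (b[1], b[0]))
    if y_mid is not None and (x_mid is None or y_width >= x_width):
        # Horizontal cut: everything above the gap is read first.
        first = [b for b in boxes if b[3] <= y_mid]
        second = [b for b in boxes if b[1] >= y_mid]
    else:
        # Vertical cut: the left region is read before the right one.
        first = [b for b in boxes if b[2] <= x_mid]
        second = [b for b in boxes if b[0] >= x_mid]
    return xy_cut(first, min_gap) + xy_cut(second, min_gap)


# A full-width title above two columns of two lines each.
boxes = [
    (310, 76, 550, 88, "R2"),
    (50, 60, 290, 72, "L1"),
    (50, 20, 550, 40, "title"),
    (310, 60, 550, 72, "R1"),
    (50, 76, 290, 88, "L2"),
]
print([b[4] for b in xy_cut(boxes)])
```

For this layout the title is separated first by the horizontal gap beneath it, then the two columns are separated by the vertical gutter, giving the order title, L1, L2, R1, R2 rather than the row-by-row order a plain sort would produce.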

Development

# Build and install locally
cd crates/ripdoc-python
maturin develop --release

# Run tests
cargo test

# Type check the visualizer frontend
cd visualizer/frontend && npx tsc --noEmit

License

MIT OR Apache-2.0

Release history

Version	Released
0.2.1 (this version)	Feb 23, 2026
0.1.2	Feb 23, 2026
0.1.1	Feb 23, 2026
0.1.0	Feb 23, 2026

Download files

Source Distribution

ripdoc-0.2.1.tar.gz (60.9 kB)

Uploaded Feb 23, 2026 Source

Built Distribution

ripdoc-0.2.1-cp38-abi3-macosx_11_0_arm64.whl (713.2 kB)

Uploaded Feb 23, 2026 CPython 3.8+, macOS 11.0+ ARM64

File details

Details for the file ripdoc-0.2.1.tar.gz.

File metadata

Download URL: ripdoc-0.2.1.tar.gz
Upload date: Feb 23, 2026
Size: 60.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.12.3

File hashes

Hashes for ripdoc-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`1575cf6ed8be998fd6efe3f9d123f3f0818fcc7cb64370c6121c3da0404b9f58`
MD5	`9fca006a48b48265c18c37284012ceec`
BLAKE2b-256	`afa4269739572c3405fd768986dc948d2957984701ea3c1c54678eacac858895`

File details

Details for the file ripdoc-0.2.1-cp38-abi3-macosx_11_0_arm64.whl.

File metadata

Download URL: ripdoc-0.2.1-cp38-abi3-macosx_11_0_arm64.whl
Upload date: Feb 23, 2026
Size: 713.2 kB
Tags: CPython 3.8+, macOS 11.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.12.3

File hashes

Hashes for ripdoc-0.2.1-cp38-abi3-macosx_11_0_arm64.whl
Algorithm	Hash digest
SHA256	`dbbceab088b74e19325a8d85e0451ec71f61be63242c88881c938f4c8f8bdc26`
MD5	`a6a786199393c418b5e781e9f593c3cd`
BLAKE2b-256	`122549baeb595dfb73ddf5d564ec23d8e5e9784b02674bf5e95ff8f031c43e4d`
