Skip to main content

Rust-native PDF extraction. 73x faster than pdfplumber.

Project description

ripdoc

Rust-native PDF extraction. 73x faster than pdfplumber.

ripdoc is a drop-in replacement for pdfplumber built entirely in Rust with Python bindings via PyO3. Extract text, tables, words, and more from PDFs at speeds that feel instant.

$ python bench.py report.pdf (200 pages)

ripdoc       0.16s  ██
pymupdf      5.97s  ████████████████████████████████████████
pdfplumber  11.63s  █████████████████████████████████████████████████████████████████████████
pdfminer    15.55s  ██████████████████████████████████████████████████████████████████████████████████████████████████

Install

pip install ripdoc

Requires Python 3.8+. Pre-built wheels for macOS (arm64). Other platforms build from source (requires Rust toolchain).

Quick start

import ripdoc

pdf = ripdoc.open("report.pdf")

for page in pdf.pages:
    # Extract text
    text = page.extract_text()

    # Extract with layout preservation
    text = page.extract_text(layout=True)

    # Extract words with bounding boxes
    words = page.extract_words()

    # Extract tables
    tables = page.extract_tables()

    # Search for text
    results = page.search("revenue")

Drop-in pdfplumber replacement

Swap one import — everything else stays the same:

# Before
import pdfplumber
pdf = pdfplumber.open("report.pdf")

# After
import ripdoc as pdfplumber
pdf = pdfplumber.open("report.pdf")

Or use the explicit compat module:

import ripdoc.compat as pdfplumber

API

ripdoc.open(path) -> PDF

Open a PDF file. Also supports PDF.from_bytes(bytes).

PDF

Property / Method Description
pdf.pages List of Page objects
pdf.page_count Number of pages
pdf.metadata Document metadata dict
pdf.page(n) Get page by number (1-indexed)

Page

Property / Method Description
page.extract_text(layout=False) Extract text, optionally preserving spatial layout
page.extract_words() Words with bounding boxes (x0, top, x1, bottom)
page.extract_tables() Tables as list of row lists
page.extract_table() Largest table on the page
page.find_tables() Table objects with metadata
page.search(query) Find text matches with positions
page.chars Individual characters with font info
page.lines Line segments
page.rects Rectangles
page.edges Edges (used for table detection)
page.crop(bbox) Crop to bounding box (x0, top, x1, bottom)
page.within_bbox(bbox) Filter objects within bounding box
page.width / page.height Page dimensions in points
page.page_number 1-indexed page number

Architecture

ripdoc
├── ripdoc-core     Pure Rust library (~5500 LOC)
│   ├── content_stream    PDF operator interpreter
│   ├── fonts/            Encoding, CMap, metrics
│   ├── geometry/         BBox, CTM, clustering
│   ├── text/             Word grouping, layout, search
│   ├── table/            Nurminen/Tabula algorithm
│   └── output/           Markdown, JSON, HTML, CSV
└── ripdoc-python   PyO3 bindings (~450 LOC)

Built on lopdf for low-level PDF structure parsing. All text extraction, table detection, and layout analysis is implemented from scratch in Rust.

Features

  • Text extraction — simple and layout-preserving modes
  • Table detection — Nurminen/Tabula algorithm with merged cell support
  • Search — full-text search with bounding box positions
  • Reading order — XY-cut algorithm + tagged PDF structure tree
  • Output formats — Markdown, JSON, HTML, CSV
  • Spatial queries — crop, within_bbox, character-level access
  • pdfplumber compatible — same API, same patterns

Development

# Build and install locally
cd crates/ripdoc-python
maturin develop --release

# Run tests
cargo test

# Type check the visualizer frontend
cd visualizer/frontend && npx tsc --noEmit

License

MIT OR Apache-2.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ripdoc-0.1.2.tar.gz (60.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ripdoc-0.1.2-cp38-abi3-macosx_11_0_arm64.whl (713.0 kB view details)

Uploaded CPython 3.8+macOS 11.0+ ARM64

File details

Details for the file ripdoc-0.1.2.tar.gz.

File metadata

  • Download URL: ripdoc-0.1.2.tar.gz
  • Upload date:
  • Size: 60.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.12.3

File hashes

Hashes for ripdoc-0.1.2.tar.gz
Algorithm Hash digest
SHA256 2210cefa2e2bfb296f268b58d9d0d3cf6c4c00567ec17a64f9295fc1121f2f60
MD5 2179e76eb9de843ff5d01a092149ede0
BLAKE2b-256 1d708a0e5deaca25bfe8a9daa9d7a04e8769fdc6b457983dff9e4301c6b4530c

See more details on using hashes here.

File details

Details for the file ripdoc-0.1.2-cp38-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for ripdoc-0.1.2-cp38-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 9754664589c5cc1da2126f5a69827d3d4611d72fc356c0b28233e70de48ee83b
MD5 8a60a8a32819ce81a5e2b2916486a9b7
BLAKE2b-256 cbb1b7bc05e38553adf1caa058dbfea7e1b70ce94c74f7010b8fb4d4d9065c8c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page