ripdoc

Rust-native PDF extraction. 73x faster than pdfplumber.

ripdoc is a drop-in replacement for pdfplumber built entirely in Rust with Python bindings via PyO3. Extract text, tables, words, and more from PDFs at speeds that feel instant.

$ python bench.py report.pdf (200 pages)

ripdoc       0.16s  ██
pymupdf      5.97s  ████████████████████████████████████████
pdfplumber  11.63s  █████████████████████████████████████████████████████████████████████████
pdfminer    15.55s  ██████████████████████████████████████████████████████████████████████████████████████████████████
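
The headline figure follows directly from the timings above: 11.63 s divided by 0.16 s is roughly 73. As a quick check, the ratios for all three tools can be recomputed from the numbers shown. The helper name `speedup_over` is made up for this sketch and is not part of ripdoc.

```python
# Wall-clock times in seconds, copied from the benchmark output above.
timings = {
    "ripdoc": 0.16,
    "pymupdf": 5.97,
    "pdfplumber": 11.63,
    "pdfminer": 15.55,
}


def speedup_over(baseline, timings):
    """Return how many times faster `baseline` is than every other tool."""
    base = timings[baseline]
    return {name: t / base for name, t in timings.items() if name != baseline}


ratios = speedup_over("ripdoc", timings)
for name, ratio in ratios.items():
    print(f"{name:<11} {ratio:5.1f}x")
```

These ratios describe the single 200-page document in the benchmark; other documents may behave differently.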

Install

pip install ripdoc

Requires Python 3.8+. Pre-built wheels are available for macOS (arm64). On other platforms, pip builds from source, which requires a Rust toolchain.

Quick start

import ripdoc

pdf = ripdoc.open("report.pdf")

for page in pdf.pages:
    # Extract text
    text = page.extract_text()

    # Extract with layout preservation
    text = page.extract_text(layout=True)

    # Extract words with bounding boxes
    words = page.extract_words()

    # Extract tables
    tables = page.extract_tables()

    # Search for text
    results = page.search("revenue")
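
The API table describes the result of extract_tables() as tables made of row lists. As a rough sketch of what can be done with that shape, the snippet below writes one such table to CSV text using only the standard library. It assumes cells are strings or None (the pdfplumber convention, not confirmed for ripdoc here). `table_to_csv` and the sample data are hypothetical.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one table (a list of rows, each a list of cells) to CSV text."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        # Assumption: empty cells may be None; write them as empty strings.
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


# Hypothetical sample standing in for one entry of page.extract_tables().
sample = [["Quarter", "Revenue"], ["Q1", "1,200"], ["Q2", None]]
csv_text = table_to_csv(sample)
print(csv_text)
```

ripdoc lists CSV among its own output formats, but that API is not documented on this page, so the sketch sticks to the standard library.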

Drop-in pdfplumber replacement

Swap one import and everything else stays the same:

# Before
import pdfplumber
pdf = pdfplumber.open("report.pdf")

# After
import ripdoc as pdfplumber
pdf = pdfplumber.open("report.pdf")

Or use the explicit compat module:

import ripdoc.compat as pdfplumber
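
Since pre-built wheels only cover some platforms, a project may want to prefer ripdoc where it is installed and fall back to pdfplumber elsewhere. One possible way to do that is sketched below. `load_backend` is a hypothetical helper, not something ripdoc ships.

```python
import importlib


def load_backend(candidates=("ripdoc", "pdfplumber")):
    """Return the first module in `candidates` that can be imported."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # Not installed here; try the next candidate.
    raise ImportError(f"none of {candidates!r} could be imported")


# Typical use (commented out so the sketch runs without either package):
# pdfplumber = load_backend()
# pdf = pdfplumber.open("report.pdf")
```

This relies on the two libraries behaving the same for the calls the project uses, which is worth verifying against real documents.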

API

ripdoc.open(path) -> PDF

Open a PDF file from a path. To load a document that is already in memory, use PDF.from_bytes(bytes).

PDF

Property / Method    Description
pdf.pages            List of Page objects
pdf.page_count       Number of pages
pdf.metadata         Document metadata dict
pdf.page(n)          Get page by number (1-indexed)
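
Page numbers are 1-indexed, while positions in a Python list start at 0. The sketch below shows the off-by-one mapping, assuming pdf.pages is ordered so that its first element is page 1 (an assumption, not stated above). `page_index` is a hypothetical helper.

```python
def page_index(page_number, page_count):
    """Map a 1-indexed page number to a 0-indexed list position."""
    if not 1 <= page_number <= page_count:
        raise ValueError(f"page {page_number} is outside 1..{page_count}")
    return page_number - 1


# For a 200-page document, page 1 is at position 0 and page 200 at 199.
first = page_index(1, 200)
last = page_index(200, 200)
print(first, last)
```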

Page

Property / Method                Description
page.extract_text(layout=False)  Extract text, optionally preserving spatial layout
page.extract_words()             Words with bounding boxes (x0, top, x1, bottom)
page.extract_tables()            Tables as list of row lists
page.extract_table()             Largest table on the page
page.find_tables()               Table objects with metadata
page.search(query)               Find text matches with positions
page.chars                       Individual characters with font info
page.lines                       Line segments
page.rects                       Rectangles
page.edges                       Edges (used for table detection)
page.crop(bbox)                  Crop to bounding box (x0, top, x1, bottom)
page.within_bbox(bbox)           Filter objects within bounding box
page.width / page.height         Page dimensions in points
page.page_number                 1-indexed page number
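
To make the (x0, top, x1, bottom) convention concrete, here is a small pure-Python sketch of a containment filter over word boxes. It assumes words are dicts with x0, top, x1 and bottom keys, that top is measured from the top edge of the page (both pdfplumber conventions), and that only objects lying entirely inside the box are kept. The helper names and sample words are hypothetical; this is not ripdoc's implementation of within_bbox.

```python
def fully_within(obj, bbox):
    """True if the object's box lies entirely inside bbox (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    return (
        obj["x0"] >= x0
        and obj["top"] >= top
        and obj["x1"] <= x1
        and obj["bottom"] <= bottom
    )


def filter_within(objs, bbox):
    """Keep only the objects that are fully contained in bbox."""
    return [o for o in objs if fully_within(o, bbox)]


# Hypothetical words on a 612 x 792 pt page.
words = [
    {"text": "Revenue", "x0": 72.0, "top": 100.0, "x1": 120.0, "bottom": 112.0},
    {"text": "2023", "x0": 300.0, "top": 100.0, "x1": 330.0, "bottom": 112.0},
    {"text": "Footer", "x0": 72.0, "top": 750.0, "x1": 110.0, "bottom": 760.0},
]
top_half = (0.0, 0.0, 612.0, 396.0)
selected = [w["text"] for w in filter_within(words, top_half)]
print(selected)
```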

Architecture

ripdoc
├── ripdoc-core     Pure Rust library (~5500 LOC)
│   ├── content_stream    PDF operator interpreter
│   ├── fonts/            Encoding, CMap, metrics
│   ├── geometry/         BBox, CTM, clustering
│   ├── text/             Word grouping, layout, search
│   ├── table/            Nurminen/Tabula algorithm
│   └── output/           Markdown, JSON, HTML, CSV
└── ripdoc-python   PyO3 bindings (~450 LOC)

Built on lopdf for low-level PDF structure parsing. All text extraction, table detection, and layout analysis are implemented from scratch in Rust.

Features

  • Text extraction — simple and layout-preserving modes
  • Table detection — Nurminen/Tabula algorithm with merged cell support
  • Search — full-text search with bounding box positions
  • Reading order — XY-cut algorithm + tagged PDF structure tree
  • Output formats — Markdown, JSON, HTML, CSV
  • Spatial queries — crop, within_bbox, character-level access
  • pdfplumber compatible — same API, same patterns
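
For readers unfamiliar with XY-cut, the idea is to split the page recursively along the widest empty gap, reading top before bottom and left before right. The sketch below is a simplified Python illustration of that general technique on (x0, top, x1, bottom) boxes. It is not ripdoc's Rust implementation, it ignores the tagged structure tree, and the names `widest_gap`, `xy_cut` and the `min_gap` threshold are assumptions made for the example.

```python
def widest_gap(boxes, lo, hi):
    """Return (gap_width, cut_position) for the widest empty gap on one axis.

    `lo` and `hi` are the tuple indices of the box's start and end on that axis.
    """
    spans = sorted((b[lo], b[hi]) for b in boxes)
    best = (0.0, None)
    reach = spans[0][1]  # Furthest end coordinate seen so far.
    for start, end in spans[1:]:
        if start - reach > best[0]:
            best = (start - reach, (reach + start) / 2)
        reach = max(reach, end)
    return best


def xy_cut(boxes, min_gap=5.0):
    """Order boxes by recursively cutting at the widest gap on either axis."""
    if len(boxes) <= 1:
        return list(boxes)
    # Candidates: horizontal cut (y axis) first so ties favor top-to-bottom.
    gap, cut, lo, hi = max(
        widest_gap(boxes, 1, 3) + (1, 3),
        widest_gap(boxes, 0, 2) + (0, 2),
        key=lambda g: g[0],
    )
    if cut is None or gap < min_gap:
        # No usable gap: fall back to top-to-bottom, then left-to-right.
        return sorted(boxes, key=lambda b: (b[1], b[0]))
    before = [b for b in boxes if b[hi] <= cut]
    after = [b for b in boxes if b[lo] >= cut]
    return xy_cut(before, min_gap) + xy_cut(after, min_gap)


# Hypothetical layout: a full-width header above two columns of two lines.
header = (50, 50, 550, 70)
left1, left2 = (50, 100, 280, 112), (50, 116, 280, 128)
right1, right2 = (320, 100, 550, 112), (320, 116, 550, 128)
order = xy_cut([right2, left1, header, right1, left2])
print(order)
```

Cutting at the widest gap first keeps the column gutter from being overridden by the much smaller spacing between lines, which is why the left column is read in full before the right one.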

Development

# Build and install locally
cd crates/ripdoc-python
maturin develop --release

# Run tests
cargo test

# Type check the visualizer frontend
cd visualizer/frontend && npx tsc --noEmit

License

MIT OR Apache-2.0
