Rust-first PDF text extraction with geometry-aware search and optional Python bindings

reap

reap is a low-level PDF parser written in Rust, with Python bindings for geometry-aware text extraction. It is designed primarily for fast spatial lookups, and also supports full-text extraction and regex queries.

Installing

uv add reap-pdf

Usage

from reap import Rectangle, TextBlockIndex

index = TextBlockIndex.from_path("my_w2.pdf")
blocks = index.search_regex("Wage and Tax Statement")
print(blocks)
#> [TextBlock("Wage and Tax Statement")]

# Search to the right of the label.
search_rect = Rectangle(
    top=blocks[0].rect.top,
    left=blocks[0].rect.right,
    bottom=blocks[0].rect.bottom,
    right=blocks[0].rect.right + 100,
)
year = index.search(search_rect, overlap=0.3)
print(year[0].text)
#> 2026
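
The README does not define what `overlap=0.3` means. One plausible reading is that a block matches when at least 30% of its area lies inside the search rectangle. The sketch below illustrates that reading in plain Python; `Rect` and `overlap_fraction` are hypothetical names that are not part of reap, and the assumption that y grows downward is also ours.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Hypothetical stand-in rectangle; assumes y grows downward."""
    top: float
    left: float
    bottom: float
    right: float

    @property
    def area(self) -> float:
        return max(0.0, self.right - self.left) * max(0.0, self.bottom - self.top)


def overlap_fraction(block: Rect, query: Rect) -> float:
    """Fraction of `block`'s area that lies inside `query` (0.0 to 1.0)."""
    width = min(block.right, query.right) - max(block.left, query.left)
    height = min(block.bottom, query.bottom) - max(block.top, query.top)
    if width <= 0 or height <= 0 or block.area == 0:
        return 0.0
    return (width * height) / block.area


# A label, and a search rectangle extending 100 points to its right.
label = Rect(top=10, left=50, bottom=22, right=150)
query = Rect(top=label.top, left=label.right, bottom=label.bottom, right=label.right + 100)

# A block that starts slightly inside the label and extends into the query.
candidate = Rect(top=10, left=140, bottom=22, right=180)
# A block far to the right, outside the query.
far_away = Rect(top=10, left=300, bottom=22, right=340)

matches = [r for r in (candidate, far_away) if overlap_fraction(r, query) >= 0.3]
```

Under this reading, a threshold below 1.0 tolerates blocks that straddle the edge of the search rectangle, which is useful when the label and its value sit close together.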

Details

  • TextBlockIndex builds an index of word-level TextBlocks across all pages.
  • TextBlockIndex.search_regex runs over the entire text corpus in reading order; the word-level TextBlocks covered by each match are merged into a single TextBlock.
  • TextBlockIndex.text returns the text in reading order.
  • All pages are merged and treated as one large page, with each page's coordinates stacked below the previous page's. Coordinate units are points (1/72 inch).
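
To make the stacked-page coordinate model concrete, here is a minimal sketch of how a per-page y coordinate could map to and from a single stacked coordinate space. It assumes a top-left origin with y increasing downward, which the README does not state; `page_offsets`, `to_global_y` and `to_page_y` are hypothetical helpers, not reap functions.

```python
from bisect import bisect_right
from itertools import accumulate


def page_offsets(page_heights):
    """Top edge of each page in the stacked space, in points."""
    return [0.0, *accumulate(page_heights)][:-1]


def to_global_y(page_index, y_on_page, page_heights):
    """Map a y coordinate on one page to the stacked coordinate space."""
    return page_offsets(page_heights)[page_index] + y_on_page


def to_page_y(global_y, page_heights):
    """Inverse mapping: return (page_index, y_on_page)."""
    if global_y < 0:
        raise ValueError("global_y must be non-negative")
    offsets = page_offsets(page_heights)
    page_index = bisect_right(offsets, global_y) - 1
    return page_index, global_y - offsets[page_index]


# Two US Letter pages (792 pt tall) followed by a shorter 612 pt page.
heights = [792.0, 792.0, 612.0]
stacked = to_global_y(1, 100.0, heights)   # 100 pt down the second page
page, local = to_page_y(stacked, heights)  # back to (page, y)
```

A mapping like this is what lets a rectangle returned from a search be traced back to the page it came from.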

Advanced use cases

  • TextBlockIndex.scoped creates a TextBlockIndex scoped to a specific region, with support for merging text blocks within the region into multi-word TextBlocks.
  • TextBlockIndex(..., include_chars=True) enables TextBlockIndex.chars, returning all TextChars for each page.
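
Merging word-level blocks into a multi-word block can be pictured as joining the words in reading order and taking the union of their rectangles. The following is only an illustration of that idea; `Word` and `merge_words` are hypothetical, and reap's actual merging rules are not described here.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """Hypothetical word-level block: text plus its bounding box in points."""
    text: str
    top: float
    left: float
    bottom: float
    right: float


def merge_words(words):
    """Merge words into one block: joined text and the union bounding box."""
    if not words:
        raise ValueError("need at least one word to merge")
    # Naive reading order: top to bottom, then left to right. Real documents
    # need tolerance for small baseline differences within a line.
    ordered = sorted(words, key=lambda w: (w.top, w.left))
    return Word(
        text=" ".join(w.text for w in ordered),
        top=min(w.top for w in ordered),
        left=min(w.left for w in ordered),
        bottom=max(w.bottom for w in ordered),
        right=max(w.right for w in ordered),
    )


words = [
    Word("Tax", 10, 90, 22, 110),
    Word("Wage", 10, 50, 22, 80),
    Word("and", 10, 82, 22, 88),
]
merged = merge_words(words)
```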

Scope

  • OCR is out of scope. reap focuses entirely on extracting visible text present in the PDF.
  • Full coverage of the PDF specification is not promised; support is added as the need arises.
  • Extraction of all visible text in PDFs is best-effort.

Comparisons

Extraction differences

PyMuPDF and pdfminer both extract hidden text, e.g. white text on a white background, which is often undesirable. reap performs visibility checks during extraction so that invisible text is excluded from the final output.
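
The README does not say which visibility checks reap performs. As a rough illustration of the kind of test involved, the sketch below treats text as invisible when it uses PDF text rendering mode 3 (neither filled nor stroked) or when its fill colour is nearly identical to the background. `Glyph`, `is_visible` and the tolerance value are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

RGB = Tuple[float, float, float]  # colour components in the range 0.0 to 1.0


@dataclass(frozen=True)
class Glyph:
    """Hypothetical piece of text with the colours needed for the check."""
    text: str
    fill: RGB
    background: RGB
    render_mode: int = 0  # PDF text rendering mode; 3 draws nothing


def is_visible(glyph: Glyph, tolerance: float = 0.02) -> bool:
    """Return False for text that a reader could not see on the page."""
    if glyph.render_mode == 3:
        return False
    # Largest per-channel difference between text colour and background.
    distance = max(abs(a - b) for a, b in zip(glyph.fill, glyph.background))
    return distance > tolerance


white = (1.0, 1.0, 1.0)
black = (0.0, 0.0, 0.0)
glyphs = [
    Glyph("Total", fill=black, background=white),
    Glyph("hidden keyword", fill=white, background=white),
    Glyph("ocr layer", fill=black, background=white, render_mode=3),
]
visible_text = [g.text for g in glyphs if is_visible(g)]
```

A real extractor would also need to determine the background colour at the glyph's position and account for clipping and occluding shapes, which this sketch leaves out.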

Speed

Only text extraction performance is compared, because neither alternative offers proper spatial query support.

Library     Text extraction (median)
reap        0.850 ms
PyMuPDF     5 ms
pdfminer    220 ms

Speed comparisons vary greatly depending on the PDF. With smaller PDFs, reap is roughly 10x faster than PyMuPDF and 500x faster than pdfminer; with larger PDFs, the advantage narrows to roughly 5x and 250x respectively. Tests were conducted on a private test suite of PDFs of varying complexity.
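
Since the test suite is private, the numbers above cannot be reproduced directly. A median timing of this kind could be collected with a small harness such as the one below; `median_ms` and `speedup` are hypothetical helpers, and the callable passed in would be whatever extraction call is being measured.

```python
import statistics
import time


def median_ms(func, repeats=25):
    """Run `func` repeatedly and return the median wall-clock time in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


# Placeholder workload; replace with the extraction call under test.
elapsed = median_ms(lambda: sum(range(10_000)))

# Ratios implied by the medians in the table above.
vs_pymupdf = speedup(5.0, 0.850)
vs_pdfminer = speedup(220.0, 0.850)
```

With the medians from the table, the ratios work out to about 5.9x against PyMuPDF and about 259x against pdfminer, close to the figures quoted for larger PDFs. The median is used rather than the mean because it is less sensitive to occasional slow runs.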

Download files

Source Distribution

  • reap_pdf-0.1.5.tar.gz (1.3 MB): Source

Built Distributions

  • reap_pdf-0.1.5-cp314-cp314-manylinux_2_34_x86_64.whl (1.3 MB): CPython 3.14, manylinux: glibc 2.34+ x86-64
  • reap_pdf-0.1.5-cp314-cp314-manylinux_2_34_aarch64.whl (1.2 MB): CPython 3.14, manylinux: glibc 2.34+ ARM64
  • reap_pdf-0.1.5-cp314-cp314-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (2.3 MB): CPython 3.14, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64
  • reap_pdf-0.1.5-cp313-cp313-manylinux_2_34_x86_64.whl (1.3 MB): CPython 3.13, manylinux: glibc 2.34+ x86-64
  • reap_pdf-0.1.5-cp313-cp313-manylinux_2_34_aarch64.whl (1.2 MB): CPython 3.13, manylinux: glibc 2.34+ ARM64
  • reap_pdf-0.1.5-cp313-cp313-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (2.3 MB): CPython 3.13, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64
  • reap_pdf-0.1.5-cp312-cp312-manylinux_2_34_x86_64.whl (1.3 MB): CPython 3.12, manylinux: glibc 2.34+ x86-64
  • reap_pdf-0.1.5-cp312-cp312-manylinux_2_34_aarch64.whl (1.2 MB): CPython 3.12, manylinux: glibc 2.34+ ARM64
  • reap_pdf-0.1.5-cp312-cp312-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (2.3 MB): CPython 3.12, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64
  • reap_pdf-0.1.5-cp311-cp311-manylinux_2_34_x86_64.whl (1.3 MB): CPython 3.11, manylinux: glibc 2.34+ x86-64
  • reap_pdf-0.1.5-cp311-cp311-manylinux_2_34_aarch64.whl (1.2 MB): CPython 3.11, manylinux: glibc 2.34+ ARM64
  • reap_pdf-0.1.5-cp311-cp311-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (2.3 MB): CPython 3.11, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64
  • reap_pdf-0.1.5-cp310-cp310-manylinux_2_34_x86_64.whl (1.3 MB): CPython 3.10, manylinux: glibc 2.34+ x86-64
  • reap_pdf-0.1.5-cp310-cp310-manylinux_2_34_aarch64.whl (1.2 MB): CPython 3.10, manylinux: glibc 2.34+ ARM64
  • reap_pdf-0.1.5-cp310-cp310-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (2.3 MB): CPython 3.10, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64

File details

Details for the file reap_pdf-0.1.5.tar.gz.

File metadata

  • Download URL: reap_pdf-0.1.5.tar.gz
  • Upload date:
  • Size: 1.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for reap_pdf-0.1.5.tar.gz
Algorithm     Hash digest
SHA256        a52a214cb43749c6a45085d4138e73a849b9287a53069ff231f7d1e68fc0f3a0
MD5           5983eef6301d7340185216ab97bbcc7a
BLAKE2b-256   ff7af1a8d2e28f56cc543958d2b2a910006314c16edb35690b8eaa13bae64c33

Provenance

The following attestation bundles were made for reap_pdf-0.1.5.tar.gz:

Publisher: release-publish.yml on kennipj/reap

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
