Rust-first PDF text extraction with geometry-aware search and optional Python bindings

reap

reap is a low-level PDF parser written in Rust, with Python bindings for geometry-aware text extraction and spatial queries. It is designed primarily for fast spatial lookups, but also supports full-text extraction and regex queries.

Installing

uv add reap-pdf

Usage

from reap import Rectangle, TextBlockIndex

index = TextBlockIndex.from_path("my_w2.pdf")
blocks = index.search_regex("Wage and Tax Statement")
print(blocks)
#> [TextBlock("Wage and Tax Statement")]

# Search to the right of the label.
search_rect = Rectangle(
    top=blocks[0].rect.top,
    left=blocks[0].rect.right,
    bottom=blocks[0].rect.bottom,
    right=blocks[0].rect.right + 100,
)
year = index.search(search_rect, overlap=0.3)
print(year[0].text)
#> 2026
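
The meaning of overlap=0.3 is not spelled out in this description. One plausible reading, shown here purely as an assumption, is that a block matches when at least 30% of its own area falls inside the search rectangle. The sketch below uses a stand-in Rect dataclass (hypothetical, not reap's Rectangle) and assumes y grows downward, so top < bottom.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Stand-in rectangle in points; assumes y grows downward (top < bottom)."""
    top: float
    left: float
    bottom: float
    right: float

    @property
    def area(self) -> float:
        return max(0.0, self.right - self.left) * max(0.0, self.bottom - self.top)


def overlap_fraction(block: Rect, query: Rect) -> float:
    """Fraction of `block`'s area that lies inside `query` (0.0 to 1.0)."""
    width = min(block.right, query.right) - max(block.left, query.left)
    height = min(block.bottom, query.bottom) - max(block.top, query.top)
    if width <= 0 or height <= 0 or block.area == 0:
        return 0.0
    return (width * height) / block.area


# A 40x10 pt word whose right half sits inside the query rectangle.
word = Rect(top=100, left=180, bottom=110, right=220)
query = Rect(top=100, left=200, bottom=110, right=300)
print(overlap_fraction(word, query))  # 0.5
```

Under this reading, the word above would be returned for overlap=0.3 but not for overlap=0.6.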

Details

  • TextBlockIndex builds an index of word-level TextBlocks across all pages.
  • TextBlockIndex.search_regex runs against the entire text corpus in reading order; the word-level TextBlocks that make up each match are merged into a single TextBlock.
  • TextBlockIndex.text returns the text in reading order.
  • All pages are merged and treated as one tall page, with each page's coordinates stacked below the previous page's. Coordinate units are points (1/72 inch).
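
Because coordinates are stacked into one tall page, a y value alone does not say which page a block came from. The sketch below maps a stacked y coordinate back to a page index. It assumes that page heights are known to the caller, that the first page starts at y = 0, and that y grows downward; the locate helper is hypothetical and not part of reap.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(y: float, page_heights: list[float]) -> tuple[int, float]:
    """Map a stacked y coordinate to (zero-based page index, y within that page)."""
    # Cumulative bottom edge of each page, e.g. [792, 1584, ...] for US Letter.
    bottoms = list(accumulate(page_heights))
    if not bottoms or not 0 <= y < bottoms[-1]:
        raise ValueError(f"y={y} is outside the stacked document")
    page = bisect_right(bottoms, y)
    page_top = bottoms[page - 1] if page > 0 else 0.0
    return page, y - page_top


heights = [792.0, 792.0, 792.0]
print(locate(100.0, heights))  # (0, 100.0)
print(locate(900.0, heights))  # (1, 108.0)
```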

Advanced use cases

  • TextBlockIndex.scoped creates a TextBlockIndex limited to a specific region, and can merge the text blocks within that region into multi-word TextBlocks.
  • TextBlock.chars returns the list of TextChar objects extracted for that block.
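
To illustrate what merging word-level blocks into multi-word blocks can look like, here is a minimal sketch for words on a single line. It is not reap's algorithm: the Word dataclass, the merge_line helper, and the max_gap threshold are all hypothetical stand-ins chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Stand-in for a word-level text block (not reap's TextBlock)."""
    text: str
    left: float
    right: float
    top: float
    bottom: float


def merge_line(words: list[Word], max_gap: float = 5.0) -> list[Word]:
    """Merge horizontally adjacent words whose gap is at most `max_gap` points."""
    merged: list[Word] = []
    for word in sorted(words, key=lambda w: w.left):
        last = merged[-1] if merged else None
        if last is not None and word.left - last.right <= max_gap:
            # Extend the previous block so it also covers this word.
            merged[-1] = Word(
                text=f"{last.text} {word.text}",
                left=last.left,
                right=max(last.right, word.right),
                top=min(last.top, word.top),
                bottom=max(last.bottom, word.bottom),
            )
        else:
            merged.append(word)
    return merged


line = [
    Word("Wage", 10, 40, 100, 110),
    Word("and", 43, 60, 100, 110),
    Word("Tax", 63, 80, 100, 110),
    Word("2026", 200, 225, 100, 110),
]
print([w.text for w in merge_line(line)])  # ['Wage and Tax', '2026']
```

The three label words are 3 pt apart and collapse into one block, while the value 120 pt to the right stays separate.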

Scope

  • OCR is out of scope. reap focuses entirely on extracting visible text already present in the PDF.
  • Not every detail of the PDF specification is guaranteed to be supported; support is added as the need arises.
  • Extraction of all visible text in a PDF is best-effort.

Comparisons

Extraction differences

PyMuPDF and pdfminer both extract hidden text, e.g. white text on a white background, which is often undesirable. reap performs visibility checks during extraction so that invisible text is not included in the final output.
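
The description does not say how reap's visibility checks work. As a rough illustration of the idea only, the sketch below flags two common cases: PDF text rendering mode 3, which paints neither fill nor stroke, and text filled in nearly the same color as its background. The function name, its parameters, and the min_contrast threshold are assumptions made for this example.

```python
def is_probably_visible(
    render_mode: int,
    fill_rgb: tuple[float, float, float],
    background_rgb: tuple[float, float, float],
    min_contrast: float = 0.05,
) -> bool:
    """Heuristic visibility test for a run of filled text."""
    # PDF text rendering mode 3 paints neither fill nor stroke.
    if render_mode == 3:
        return False
    # Largest per-channel difference between text and background color.
    contrast = max(abs(f - b) for f, b in zip(fill_rgb, background_rgb))
    return contrast >= min_contrast


white = (1.0, 1.0, 1.0)
black = (0.0, 0.0, 0.0)
print(is_probably_visible(0, black, white))  # True
print(is_probably_visible(0, white, white))  # False
print(is_probably_visible(3, black, white))  # False
```

A real check would also need to consider stroke color, clipping, and content drawn on top of the text, none of which this sketch attempts.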

Speed

Only text extraction performance is compared, since neither alternative offers proper spatial query support.

Library     Text extraction (median)
reap        0.850 ms
PyMuPDF     5 ms
pdfminer    220 ms

Speed comparisons vary greatly depending on the PDF. With smaller PDFs, reap is roughly 10x faster than PyMuPDF and 500x faster than pdfminer; with larger PDFs, the advantage narrows to roughly 5x and 250x respectively. Tests were conducted on a private test suite of PDFs of varying complexity.
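
Since the test suite is private, readers may want to measure on their own files. The sketch below shows one way to collect a median timing and turn two medians into a speedup ratio. The median_ms and speedup helpers are hypothetical, and the final lines only redo the arithmetic on the medians quoted in the table.

```python
import statistics
import time
from typing import Callable


def median_ms(func: Callable[[], object], runs: int = 20) -> float:
    """Return the median wall-clock time of `func` in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


# Ratios implied by the medians in the table above.
print(round(speedup(5.0, 0.850), 1))    # 5.9
print(round(speedup(220.0, 0.850), 1))  # 258.8
```

Pass median_ms any zero-argument callable that performs one full extraction of a given PDF, and use the same file for every library being compared.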

Download files

Source Distribution

  • reap_pdf-0.1.7.tar.gz (1.4 MB): source

Built Distributions

  • reap_pdf-0.1.7-cp314-cp314-manylinux_2_34_x86_64.whl (1.4 MB): CPython 3.14, manylinux (glibc 2.34+), x86-64
  • reap_pdf-0.1.7-cp314-cp314-manylinux_2_34_aarch64.whl (1.4 MB): CPython 3.14, manylinux (glibc 2.34+), ARM64
  • reap_pdf-0.1.7-cp314-cp314-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (2.5 MB): CPython 3.14, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64
  • reap_pdf-0.1.7-cp313-cp313-manylinux_2_34_x86_64.whl (1.4 MB): CPython 3.13, manylinux (glibc 2.34+), x86-64
  • reap_pdf-0.1.7-cp313-cp313-manylinux_2_34_aarch64.whl (1.4 MB): CPython 3.13, manylinux (glibc 2.34+), ARM64
  • reap_pdf-0.1.7-cp313-cp313-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (2.5 MB): CPython 3.13, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64
  • reap_pdf-0.1.7-cp312-cp312-manylinux_2_34_x86_64.whl (1.4 MB): CPython 3.12, manylinux (glibc 2.34+), x86-64
  • reap_pdf-0.1.7-cp312-cp312-manylinux_2_34_aarch64.whl (1.4 MB): CPython 3.12, manylinux (glibc 2.34+), ARM64
  • reap_pdf-0.1.7-cp312-cp312-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (2.5 MB): CPython 3.12, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64
  • reap_pdf-0.1.7-cp311-cp311-manylinux_2_34_x86_64.whl (1.4 MB): CPython 3.11, manylinux (glibc 2.34+), x86-64
  • reap_pdf-0.1.7-cp311-cp311-manylinux_2_34_aarch64.whl (1.4 MB): CPython 3.11, manylinux (glibc 2.34+), ARM64
  • reap_pdf-0.1.7-cp311-cp311-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (2.5 MB): CPython 3.11, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64
  • reap_pdf-0.1.7-cp310-cp310-manylinux_2_34_x86_64.whl (1.4 MB): CPython 3.10, manylinux (glibc 2.34+), x86-64
  • reap_pdf-0.1.7-cp310-cp310-manylinux_2_34_aarch64.whl (1.4 MB): CPython 3.10, manylinux (glibc 2.34+), ARM64
  • reap_pdf-0.1.7-cp310-cp310-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (2.5 MB): CPython 3.10, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64

File details

Details for the file reap_pdf-0.1.7.tar.gz.

File metadata

  • Download URL: reap_pdf-0.1.7.tar.gz
  • Size: 1.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for reap_pdf-0.1.7.tar.gz
Algorithm Hash digest
SHA256 dd08dc827a2ad9c3f503a622d77d053482b7deca62b3f5283935ccc1374e0257
MD5 1bc8763776e4e81a33523e5eb6add7f5
BLAKE2b-256 256448c673a7f22f8cf20724c12efcb4909d9d788b8d7ad1de2b66f7fdca3b9d

Provenance

The following attestation bundles were made for reap_pdf-0.1.7.tar.gz:

Publisher: release-publish.yml on kennipj/reap
