Rust-first PDF text extraction with geometry-aware search and optional Python bindings

reap

reap is a low-level PDF parser written in Rust, with Python bindings for geometry-aware text extraction and spatial queries. It is designed primarily for fast spatial lookups, and also supports full-text extraction and regex queries.

Installing

uv add reap-pdf

Usage

from reap import Rectangle, TextBlockIndex

index = TextBlockIndex.from_path("my_w2.pdf")
blocks = index.search_regex("Wage and Tax Statement")
print(blocks)
#> [TextBlock("Wage and Tax Statement")]

# Search to the right of the label.
search_rect = Rectangle(
    top=blocks[0].rect.top,
    left=blocks[0].rect.right,
    bottom=blocks[0].rect.bottom,
    right=blocks[0].rect.right + 100,
)
year = index.search(search_rect, overlap=0.3)
print(year[0].text)
#> 2026
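
The meaning of overlap=0.3 is not spelled out in this description. One plausible reading, assumed here purely for illustration, is that a block matches when at least 30% of its own area lies inside the search rectangle. The Rect class and overlap_fraction helper below are hypothetical stand-ins, not part of reap, and the y axis is assumed to grow downward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    # Hypothetical stand-in for reap's Rectangle; y is assumed to grow downward.
    top: float
    left: float
    bottom: float
    right: float

    @property
    def area(self) -> float:
        return max(0.0, self.right - self.left) * max(0.0, self.bottom - self.top)


def overlap_fraction(block: Rect, query: Rect) -> float:
    """Fraction of the block's area that falls inside the query rectangle."""
    width = min(block.right, query.right) - max(block.left, query.left)
    height = min(block.bottom, query.bottom) - max(block.top, query.top)
    if width <= 0 or height <= 0 or block.area == 0:
        return 0.0
    return (width * height) / block.area


query = Rect(top=10, left=100, bottom=20, right=200)
half_inside = Rect(top=10, left=80, bottom=20, right=120)  # right half is inside
outside = Rect(top=50, left=100, bottom=60, right=200)  # entirely below the query

print(overlap_fraction(half_inside, query) >= 0.3)
#> True
print(overlap_fraction(outside, query) >= 0.3)
#> False
```

Under this assumption, a lower threshold tolerates values that are slightly misaligned with their label, while a higher one rejects neighbouring fields that only graze the search rectangle.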

Details

  • TextBlockIndex builds an index of word-level TextBlocks across all pages.
  • TextBlockIndex.search_regex runs against the entire text corpus in reading order. The word-level TextBlocks that make up each match are merged and returned as a single TextBlock.
  • TextBlockIndex.text returns the text in reading order.
  • All pages are merged and treated as one tall page, with each page's coordinates stacked below those of the previous page. Coordinate units are points (1/72 inch).
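
Because all pages share one stacked coordinate space, a y coordinate from a search result may need to be mapped back to a page. The sketch below shows one way to do that, assuming the page heights are known, that y grows downward, and that the first page starts at 0. The locate helper and the heights list are hypothetical; nothing here is taken from reap's API.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(global_y: float, page_heights: list[float]) -> tuple[int, float]:
    """Map a y coordinate on the stacked page to (page index, y within that page)."""
    # offsets[i] is where page i starts; the last entry is the total height.
    offsets = [0.0, *accumulate(page_heights)]
    if not 0 <= global_y < offsets[-1]:
        raise ValueError("y lies outside the stacked document")
    page = bisect_right(offsets, global_y) - 1
    return page, global_y - offsets[page]


heights = [792.0, 792.0, 612.0]  # assumed page heights in points
print(locate(100.0, heights))
#> (0, 100.0)
print(locate(900.0, heights))
#> (1, 108.0)
```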

Advanced use cases

  • TextBlockIndex.scoped creates a TextBlockIndex scoped to a specific region, with support for merging text blocks within the region into multi-word TextBlocks.
  • TextBlockIndex(..., include_chars=True) enables TextBlockIndex.chars, returning all TextChars for each page.
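
To illustrate what merging word-level blocks into a multi-word block can mean, here is a minimal sketch that joins the text in (top, left) order and takes the union of the rectangles. The Word class and merge_words function are hypothetical and do not reflect how reap implements TextBlockIndex.scoped; a real implementation would also have to decide which words belong together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    # Hypothetical stand-in for a word-level TextBlock.
    text: str
    top: float
    left: float
    bottom: float
    right: float


def merge_words(words: list[Word]) -> Word:
    """Merge words into one block: text joined in (top, left) order, rectangle is the union."""
    if not words:
        raise ValueError("nothing to merge")
    ordered = sorted(words, key=lambda w: (w.top, w.left))
    return Word(
        text=" ".join(w.text for w in ordered),
        top=min(w.top for w in ordered),
        left=min(w.left for w in ordered),
        bottom=max(w.bottom for w in ordered),
        right=max(w.right for w in ordered),
    )


words = [
    Word("Statement", 10, 150, 20, 210),
    Word("Wage", 10, 40, 20, 70),
    Word("Tax", 10, 110, 20, 130),
    Word("and", 10, 75, 20, 105),
]
merged = merge_words(words)
print(merged.text)
#> Wage and Tax Statement
```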

Scope

  • OCR is not in scope. reap focuses entirely on extracting visible text already present in the PDF.
  • There is no promise that every detail of the PDF specification is supported; support is added as the need arises.
  • Extraction of all visible text in a PDF is best effort.

Comparisons

Extraction differences

PyMuPDF and pdfminer both extract hidden text, e.g. white text on a white background, which is often undesirable. reap performs visibility checks during extraction so that invisible text is excluded from the final output.

Speed

Only text extraction performance is measured, to keep the results comparable, as neither alternative offers proper spatial query support.

Library     Text extraction (median)
reap        0.850 ms
PyMuPDF     5 ms
pdfminer    220 ms

Speed comparisons vary greatly depending on the PDF. With smaller PDFs, reap is roughly 10x faster than PyMuPDF and 500x faster than pdfminer; with larger PDFs, the advantage narrows to roughly 5x and 250x respectively. Tests were conducted on a private test suite of PDFs of varying complexity.
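
For readers who want to take comparable measurements on their own PDFs, a median-based timing harness can be sketched with the standard library alone. This is not the benchmark used for the numbers above; median_ms and fake_extract are hypothetical names, and fake_extract is only a placeholder workload to be replaced with a real extraction call.

```python
import statistics
import time
from typing import Callable


def median_ms(fn: Callable[[], object], runs: int = 20) -> float:
    """Return the median wall-clock time of fn() in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def fake_extract() -> str:
    # Placeholder workload; swap in a real extraction call when benchmarking.
    return " ".join(str(i) for i in range(10_000))


print(f"{median_ms(fake_extract):.3f} ms median")
```

The median is less sensitive than the mean to one-off slow runs, such as the first call that warms up caches.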

Download files

Download the file for your platform.

Source Distribution

  • reap_pdf-0.1.2.tar.gz (1.3 MB): Source

Built Distributions

  • reap_pdf-0.1.2-cp314-cp314-manylinux_2_34_x86_64.whl (1.3 MB): CPython 3.14, manylinux (glibc 2.34+), x86-64
  • reap_pdf-0.1.2-cp314-cp314-manylinux_2_34_aarch64.whl (1.2 MB): CPython 3.14, manylinux (glibc 2.34+), ARM64
  • reap_pdf-0.1.2-cp314-cp314-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (2.3 MB): CPython 3.14, macOS 10.12+ universal2 (ARM64, x86-64)
  • reap_pdf-0.1.2-cp313-cp313-manylinux_2_34_x86_64.whl (1.3 MB): CPython 3.13, manylinux (glibc 2.34+), x86-64
  • reap_pdf-0.1.2-cp313-cp313-manylinux_2_34_aarch64.whl (1.2 MB): CPython 3.13, manylinux (glibc 2.34+), ARM64
  • reap_pdf-0.1.2-cp313-cp313-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (2.3 MB): CPython 3.13, macOS 10.12+ universal2 (ARM64, x86-64)
  • reap_pdf-0.1.2-cp312-cp312-manylinux_2_34_x86_64.whl (1.3 MB): CPython 3.12, manylinux (glibc 2.34+), x86-64
  • reap_pdf-0.1.2-cp312-cp312-manylinux_2_34_aarch64.whl (1.2 MB): CPython 3.12, manylinux (glibc 2.34+), ARM64
  • reap_pdf-0.1.2-cp312-cp312-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (2.3 MB): CPython 3.12, macOS 10.12+ universal2 (ARM64, x86-64)
  • reap_pdf-0.1.2-cp311-cp311-manylinux_2_34_x86_64.whl (1.3 MB): CPython 3.11, manylinux (glibc 2.34+), x86-64
  • reap_pdf-0.1.2-cp311-cp311-manylinux_2_34_aarch64.whl (1.2 MB): CPython 3.11, manylinux (glibc 2.34+), ARM64
  • reap_pdf-0.1.2-cp311-cp311-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (2.3 MB): CPython 3.11, macOS 10.12+ universal2 (ARM64, x86-64)
  • reap_pdf-0.1.2-cp310-cp310-manylinux_2_34_x86_64.whl (1.3 MB): CPython 3.10, manylinux (glibc 2.34+), x86-64
  • reap_pdf-0.1.2-cp310-cp310-manylinux_2_34_aarch64.whl (1.2 MB): CPython 3.10, manylinux (glibc 2.34+), ARM64
  • reap_pdf-0.1.2-cp310-cp310-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (2.3 MB): CPython 3.10, macOS 10.12+ universal2 (ARM64, x86-64)

File details

Details for the file reap_pdf-0.1.2.tar.gz.

File metadata

  • Download URL: reap_pdf-0.1.2.tar.gz
  • Upload date:
  • Size: 1.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for reap_pdf-0.1.2.tar.gz
Algorithm Hash digest
SHA256 64ee053ea2f92ccdd7737660713cdb0e852b1dd90fa2164c1417c3a4982841ac
MD5 af78e94fff10fcb150c7c56b841580c3
BLAKE2b-256 3462cea66c6725284ef9f2aae387567bf08ed12c8d615a94acc9465532cdd438

Provenance

The following attestation bundles were made for reap_pdf-0.1.2.tar.gz:

Publisher: release-publish.yml on kennipj/reap

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
