Rust-first PDF text extraction with geometry-aware search and optional Python bindings

Project description

reap

reap is a low-level PDF parser written in Rust, with Python bindings for geometry-aware text extraction and spatial queries. It is designed primarily for fast spatial lookups, and also supports full-text extraction and regex queries.

Installing

uv add reap-pdf

Usage

from reap import Rectangle, TextBlockIndex

index = TextBlockIndex.from_path("my_w2.pdf")
blocks = index.search_regex("Wage and Tax Statement")
print(blocks)
#> [TextBlock("Wage and Tax Statement")]

# Search to the right of the label.
search_rect = Rectangle(
    top=blocks[0].rect.top,
    left=blocks[0].rect.right,
    bottom=blocks[0].rect.bottom,
    right=blocks[0].rect.right + 100,
)
year = index.search(search_rect, overlap=0.3)
print(year[0].text)
#> 2026
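
The meaning of `overlap=0.3` is not spelled out on this page. One plausible reading, offered purely as an assumption, is that a block matches when at least 30% of its area lies inside the search rectangle. The sketch below illustrates that idea in plain Python; `Rect` and `overlap_ratio` are hypothetical names and are not part of reap's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Hypothetical axis-aligned rectangle; y grows downward, units are points."""
    top: float
    left: float
    bottom: float
    right: float

    @property
    def area(self) -> float:
        return max(0.0, self.right - self.left) * max(0.0, self.bottom - self.top)


def overlap_ratio(block: Rect, query: Rect) -> float:
    """Fraction of `block`'s area that falls inside `query` (0.0 to 1.0)."""
    width = min(block.right, query.right) - max(block.left, query.left)
    height = min(block.bottom, query.bottom) - max(block.top, query.top)
    if width <= 0 or height <= 0 or block.area == 0:
        return 0.0
    return (width * height) / block.area


query = Rect(top=10, left=100, bottom=20, right=200)
half_inside = Rect(top=10, left=80, bottom=20, right=120)  # right half is inside
outside = Rect(top=30, left=100, bottom=40, right=120)     # entirely below the query

print(overlap_ratio(half_inside, query))  # 0.5
print(overlap_ratio(outside, query))      # 0.0
```

Under this reading, `half_inside` would pass a 0.3 threshold and `outside` would not.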

Details

  • TextBlockIndex builds an index of word-level TextBlocks across all pages.
  • TextBlockIndex.search_regex runs over the entire text corpus in reading order, and merges the word-level TextBlocks that make up each match into a single TextBlock.
  • TextBlockIndex.text returns the text in reading order.
  • All pages are merged and treated as one tall page, with each page's coordinates stacked below those of the previous page. Coordinate units are points.
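
Because all pages share one stacked coordinate space, a y coordinate taken from the index has to be mapped back if you need the page it came from. The sketch below shows one way to do that, assuming pages are stacked top to bottom with no gap, y grows downward, and the page heights are known from elsewhere. `locate_page` is a hypothetical helper, not a reap function.

```python
from bisect import bisect_right
from itertools import accumulate


def locate_page(y: float, page_heights: list[float]) -> tuple[int, float]:
    """Map a y coordinate on the merged page to (page_index, y_within_page).

    Assumes pages are stacked top to bottom with no gap and y grows downward.
    """
    # Offsets at which each page starts: [0, h0, h0 + h1, ...].
    starts = [0.0, *accumulate(page_heights)]
    if not 0 <= y < starts[-1]:
        raise ValueError(f"y={y} is outside the merged page")
    page = bisect_right(starts, y) - 1
    return page, y - starts[page]


# Three US Letter pages, each 792 pt tall.
heights = [792.0, 792.0, 792.0]
print(locate_page(100.0, heights))  # (0, 100.0)
print(locate_page(900.0, heights))  # (1, 108.0)
```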

Advanced use cases

  • TextBlockIndex.scoped creates a TextBlockIndex scoped to a specific region, with support for merging text blocks within the region into multi-word TextBlocks.
  • TextBlockIndex(..., include_chars=True) enables TextBlockIndex.chars, returning all TextChars for each page.
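
To make the idea of merging word-level blocks into multi-word blocks concrete, here is a minimal sketch in plain Python. It is not reap's algorithm: `Word`, `merge_words` and the `max_gap` threshold are hypothetical, and the rule used (same line, small horizontal gap) is only one simple way such merging could work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical word-level block: text plus bounding box in points."""
    text: str
    top: float
    left: float
    bottom: float
    right: float


def merge_words(words: list[Word], max_gap: float = 5.0) -> list[Word]:
    """Merge words on the same line whose horizontal gap is at most `max_gap`."""
    merged: list[Word] = []
    # Visit words line by line (by top edge), then left to right.
    for word in sorted(words, key=lambda w: (w.top, w.left)):
        prev = merged[-1] if merged else None
        # Two boxes share a line when their vertical extents overlap.
        same_line = prev is not None and prev.top < word.bottom and word.top < prev.bottom
        if same_line and word.left - prev.right <= max_gap:
            merged[-1] = Word(
                text=f"{prev.text} {word.text}",
                top=min(prev.top, word.top),
                left=min(prev.left, word.left),
                bottom=max(prev.bottom, word.bottom),
                right=max(prev.right, word.right),
            )
        else:
            merged.append(word)
    return merged


words = [
    Word("Wage", 10, 10, 20, 40),
    Word("and", 10, 43, 20, 60),
    Word("Tax", 10, 63, 20, 82),
    Word("2026", 10, 200, 20, 225),  # far to the right, so it stays separate
]
print([w.text for w in merge_words(words)])  # ['Wage and Tax', '2026']
```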

Scope

  • OCR is out of scope. reap focuses entirely on extracting visible text present in the PDF.
  • There is no promise that every detail of the PDF spec is supported; support is added as the need arises.
  • Extraction of all visible text in a PDF is best effort.

Comparisons

Extraction differences

PyMuPDF and pdfminer both extract hidden text, e.g. white text on a white background, which is often undesirable. reap performs visibility checks during extraction so that invisible text is not included in the final output.
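
The page does not say which visibility checks reap performs. As an illustration only, the sketch below shows two checks that such a filter could include: rejecting glyphs drawn with PDF text rendering mode 3 (neither filled nor stroked), and rejecting glyphs whose fill colour is nearly identical to the colour beneath them. `Glyph`, `is_visible` and `min_contrast` are hypothetical names, and the colour comparison is a deliberate simplification.

```python
from dataclasses import dataclass

RGB = tuple[float, float, float]


@dataclass(frozen=True)
class Glyph:
    """Hypothetical extracted glyph with the graphics state needed for the check."""
    text: str
    render_mode: int  # PDF text rendering mode (Tr operator); 3 means invisible
    fill: RGB         # fill colour, components in 0.0..1.0
    background: RGB   # colour painted beneath the glyph


def is_visible(glyph: Glyph, min_contrast: float = 0.05) -> bool:
    """Return False for glyphs a reader could not see on the rendered page."""
    if glyph.render_mode == 3:
        return False
    # Largest per-channel difference between text colour and background colour.
    contrast = max(abs(f - b) for f, b in zip(glyph.fill, glyph.background))
    return contrast >= min_contrast


white = (1.0, 1.0, 1.0)
black = (0.0, 0.0, 0.0)
print(is_visible(Glyph("a", 0, black, white)))  # True
print(is_visible(Glyph("b", 0, white, white)))  # False
print(is_visible(Glyph("c", 3, black, white)))  # False
```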

Speed

Only text extraction performance is measured, for comparability, as neither alternative offers proper spatial query support.

Library     Text extraction
reap        0.850 ms median
PyMuPDF     5 ms median
pdfminer    220 ms median

Speed comparisons vary greatly depending on the PDF. With smaller PDFs, reap is roughly 10x faster than PyMuPDF and 500x faster than pdfminer; with larger PDFs, the advantage narrows to roughly 5x and 250x respectively. Tests were conducted on a private test suite of PDFs of varying complexity.
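
Since the test suite is private, readers who want comparable numbers for their own PDFs may find a small timing harness useful. The sketch below reports a median wall-clock time in milliseconds over repeated runs. `median_ms` and `fake_extract` are hypothetical; the stand-in workload would be replaced with a real extraction call for whichever library is being measured.

```python
import statistics
import time
from typing import Callable


def median_ms(func: Callable[[], object], runs: int = 20) -> float:
    """Call `func` `runs` times and return the median wall-clock time in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def fake_extract() -> str:
    """Stand-in workload; swap in a real extraction call when benchmarking."""
    return " ".join(str(i) for i in range(10_000))


print(f"{median_ms(fake_extract):.3f} ms median")
```

The median is used rather than the mean so that a single slow run, for example from a cold file cache, does not dominate the result.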

Download files

Download the file for your platform.

Source Distribution

reap_pdf-0.1.6.tar.gz (1.3 MB)
Tags: Source

Built Distributions

reap_pdf-0.1.6-cp314-cp314-manylinux_2_34_x86_64.whl (1.3 MB)
Tags: CPython 3.14, manylinux: glibc 2.34+ x86-64

reap_pdf-0.1.6-cp314-cp314-manylinux_2_34_aarch64.whl (1.2 MB)
Tags: CPython 3.14, manylinux: glibc 2.34+ ARM64

reap_pdf-0.1.6-cp314-cp314-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (2.3 MB)
Tags: CPython 3.14, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64

reap_pdf-0.1.6-cp313-cp313-manylinux_2_34_x86_64.whl (1.3 MB)
Tags: CPython 3.13, manylinux: glibc 2.34+ x86-64

reap_pdf-0.1.6-cp313-cp313-manylinux_2_34_aarch64.whl (1.2 MB)
Tags: CPython 3.13, manylinux: glibc 2.34+ ARM64

reap_pdf-0.1.6-cp313-cp313-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (2.3 MB)
Tags: CPython 3.13, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64

reap_pdf-0.1.6-cp312-cp312-manylinux_2_34_x86_64.whl (1.3 MB)
Tags: CPython 3.12, manylinux: glibc 2.34+ x86-64

reap_pdf-0.1.6-cp312-cp312-manylinux_2_34_aarch64.whl (1.2 MB)
Tags: CPython 3.12, manylinux: glibc 2.34+ ARM64

reap_pdf-0.1.6-cp312-cp312-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (2.3 MB)
Tags: CPython 3.12, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64

reap_pdf-0.1.6-cp311-cp311-manylinux_2_34_x86_64.whl (1.3 MB)
Tags: CPython 3.11, manylinux: glibc 2.34+ x86-64

reap_pdf-0.1.6-cp311-cp311-manylinux_2_34_aarch64.whl (1.2 MB)
Tags: CPython 3.11, manylinux: glibc 2.34+ ARM64

reap_pdf-0.1.6-cp311-cp311-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (2.3 MB)
Tags: CPython 3.11, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64

reap_pdf-0.1.6-cp310-cp310-manylinux_2_34_x86_64.whl (1.3 MB)
Tags: CPython 3.10, manylinux: glibc 2.34+ x86-64

reap_pdf-0.1.6-cp310-cp310-manylinux_2_34_aarch64.whl (1.2 MB)
Tags: CPython 3.10, manylinux: glibc 2.34+ ARM64

reap_pdf-0.1.6-cp310-cp310-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (2.3 MB)
Tags: CPython 3.10, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64

File details

Details for the file reap_pdf-0.1.6.tar.gz.

File metadata

  • Download URL: reap_pdf-0.1.6.tar.gz
  • Upload date:
  • Size: 1.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for reap_pdf-0.1.6.tar.gz
Algorithm Hash digest
SHA256 905bf76b7ed0ff776ae7ab66d8a06708cabb2e8731fad5807e07d1cd4139457b
MD5 0bd5656bb5cf49782c3c64af53adc780
BLAKE2b-256 334662db9d8d4b2e6a3844297aa81ebbd70083e1a28d5d8c6708feaabf8a2472
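
The published digests can be used to check a downloaded file before installing it. The sketch below computes a SHA256 digest with Python's standard `hashlib` and compares it to an expected value. `sha256_of` and `verify` are hypothetical helper names, and the demo runs against a throwaway file rather than a real download.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest against a published value."""
    return sha256_of(path) == expected_hex.lower()


# Demo on a throwaway file so the sketch runs without downloading anything.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"example payload")
    expected = hashlib.sha256(b"example payload").hexdigest()
    print(verify(sample, expected))  # True
    print(verify(sample, "0" * 64))  # False
```

For a real check, pass the path of the downloaded file and the SHA256 value listed for that file on this page.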

Provenance

The following attestation bundles were made for reap_pdf-0.1.6.tar.gz:

Publisher: release-publish.yml on kennipj/reap

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file reap_pdf-0.1.6-cp314-cp314-manylinux_2_34_x86_64.whl.

File hashes

Hashes for reap_pdf-0.1.6-cp314-cp314-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 7cc64713745abd49c778de012903927c085c23ee9c2cb21fe651e36ec1536b19
MD5 b6ea35fed0cbcc71c2b354fadfea4ce1
BLAKE2b-256 a9efecdca04fdc18dc421673cd9f6137ce9aed9d4f59f242e65d28cd586aac98

File details

Details for the file reap_pdf-0.1.6-cp314-cp314-manylinux_2_34_aarch64.whl.

File hashes

Hashes for reap_pdf-0.1.6-cp314-cp314-manylinux_2_34_aarch64.whl
Algorithm Hash digest
SHA256 54fdb11f62b9dc14c61ba95d6a1f94c75eb1aead40d7217215b15e6e3079dadd
MD5 1333a79ea041e1c8482860bc2e4837ec
BLAKE2b-256 e89da3faa102bd44423c48ddd279a32d32a430bd3b1650fc16ec3590781b7cb8

File details

Details for the file reap_pdf-0.1.6-cp314-cp314-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl.

File hashes

Hashes for reap_pdf-0.1.6-cp314-cp314-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl
Algorithm Hash digest
SHA256 92530584508fd613ac8a0f91456caba2a73bcfc849f54a8079c0d5e8bb08b836
MD5 9497d846be0ef77da54de3184e6f4ef9
BLAKE2b-256 17aa0f05b86f9ab7d5472a4a4bfbb7411794baebb730dc373f7173f087c777b9

File details

Details for the file reap_pdf-0.1.6-cp313-cp313-manylinux_2_34_x86_64.whl.

File hashes

Hashes for reap_pdf-0.1.6-cp313-cp313-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 251f08082e281d9d20c421057074c5bf25e1c6469f42524d7ef9af45a1a4443a
MD5 d469bc9eaba895027054cf2dbf81e3ee
BLAKE2b-256 a1b68d3a6daca4174ce2891840487e1402c0c695ba1fd218494a09133359c58d

File details

Details for the file reap_pdf-0.1.6-cp313-cp313-manylinux_2_34_aarch64.whl.

File hashes

Hashes for reap_pdf-0.1.6-cp313-cp313-manylinux_2_34_aarch64.whl
Algorithm Hash digest
SHA256 1050bf9375a29ba5b871f6fc0a661514f331ccbe077e506e4562eaa519e598f2
MD5 5c7c75ad7fc9994c240db04a23530e5c
BLAKE2b-256 0e22e9fd80b61e44dcb5e8a7f51192289ff9d90e50971a0932efac3f2697705a

File details

Details for the file reap_pdf-0.1.6-cp313-cp313-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl.

File hashes

Hashes for reap_pdf-0.1.6-cp313-cp313-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl
Algorithm Hash digest
SHA256 c2acd0a9f511c5e09220e7b6fabe4c2bbe7f16a15c6cc87c1ed6b929a34f83ac
MD5 1b370fe0301827948902966ad438721c
BLAKE2b-256 48e0035cb9169dab10c61a641c41463965c5e6001ae01ce14a5d252f3e238b53

File details

Details for the file reap_pdf-0.1.6-cp312-cp312-manylinux_2_34_x86_64.whl.

File hashes

Hashes for reap_pdf-0.1.6-cp312-cp312-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 19bbfa4cb649af128beeaddac72dc6e8c56734d8266a9b8fd5def4bb8e1732ec
MD5 efbd6982c99f7ae2c9b3c59c3c745f0c
BLAKE2b-256 40b765d36dd12592d44cfb32862dff249fe2e3624c7592b9b3edf50aa72fe6ae

File details

Details for the file reap_pdf-0.1.6-cp312-cp312-manylinux_2_34_aarch64.whl.

File hashes

Hashes for reap_pdf-0.1.6-cp312-cp312-manylinux_2_34_aarch64.whl
Algorithm Hash digest
SHA256 8501c43f57ede8351bc73184a1f7819d26914f6713c3454fc0c4f2cd44e63fd1
MD5 d6c7280a062754db47132e585b5800b2
BLAKE2b-256 96abc72432ca398b3dbc1f1d6466ccb0f2ecc2ddaa212cdb80483a3a2631a561

File details

Details for the file reap_pdf-0.1.6-cp312-cp312-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl.

File hashes

Hashes for reap_pdf-0.1.6-cp312-cp312-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl
Algorithm Hash digest
SHA256 37bb3db931e20f9ac2a4289d7e7e6b52a24d9aff37b25694c68b08a747bcb2e3
MD5 ea028945eb0981ca8a9ebd94124fc2a1
BLAKE2b-256 78e0621ce36500e8faba01029a7436f1f6f6989c63fd38c7eb3ba020527ba185

File details

Details for the file reap_pdf-0.1.6-cp311-cp311-manylinux_2_34_x86_64.whl.

File hashes

Hashes for reap_pdf-0.1.6-cp311-cp311-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 a63caf11e425eacfb716ab4833771ec96b0907727887d5591937b54a17b64f0f
MD5 c1b99022866f0e30fea54ed9454df83d
BLAKE2b-256 5396268781d25ce4c03207f16489fa177965c2eb8e79274e210c8f03128b4c05

File details

Details for the file reap_pdf-0.1.6-cp311-cp311-manylinux_2_34_aarch64.whl.

File hashes

Hashes for reap_pdf-0.1.6-cp311-cp311-manylinux_2_34_aarch64.whl
Algorithm Hash digest
SHA256 7edaaedb3c20c57534abd067a36c840197f330320fd86ef58b5d753e4217741b
MD5 05b67d270c30101344abf8af6b105bb1
BLAKE2b-256 0592eeb82001b95ecde6885d559fc9d988a4318198e2d9663e6e0f63cdad2e9f

File details

Details for the file reap_pdf-0.1.6-cp311-cp311-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl.

File hashes

Hashes for reap_pdf-0.1.6-cp311-cp311-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl
Algorithm Hash digest
SHA256 f3a016fce7f135925f69bb2d2e8da0aec419ef645839fe5b85dab87f3b824747
MD5 fb26c914e3d27c9c99d83a48bbd61297
BLAKE2b-256 dab414938b7e15e311168be500bb9739c64dff2b49e2ed2b9666c35212327bbc

File details

Details for the file reap_pdf-0.1.6-cp310-cp310-manylinux_2_34_x86_64.whl.

File hashes

Hashes for reap_pdf-0.1.6-cp310-cp310-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 ac474e9f4317dce5ee3b0651833a15ba8c65bd15a9ceab71ffd3090e87440921
MD5 441324fce2c7cf2b8be3b7d11f1f4001
BLAKE2b-256 1a178258fbdc6d953fc7b57e136a61934817c47813273de07ea2a994a1456aa2

File details

Details for the file reap_pdf-0.1.6-cp310-cp310-manylinux_2_34_aarch64.whl.

File hashes

Hashes for reap_pdf-0.1.6-cp310-cp310-manylinux_2_34_aarch64.whl
Algorithm Hash digest
SHA256 410af30e9e6ec02bfeec18e6e6771dfe26e1cb4cc52837cdb6c205bd9f088a0d
MD5 30bbfff398b30f45dd3600fc0e9c597a
BLAKE2b-256 b3a06fe4bb3ec40be77e8fe659d3a0a01bab146606175b97828af2e09d394da2

File details

Details for the file reap_pdf-0.1.6-cp310-cp310-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl.

File hashes

Hashes for reap_pdf-0.1.6-cp310-cp310-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl
Algorithm Hash digest
SHA256 093d4ce040217ac7691a6731284c5ba14cf4e78f9c0c557e788e7fcf114922c6
MD5 ee456fdbe1b70d272d67f83ddb3d11ec
BLAKE2b-256 f5338570cd54d2de9ab09213f2a88ff9da56190927acbf723566e913ab8723ed
