Rust-first PDF text extraction with geometry-aware search and optional Python bindings

Project description

reap

reap is a low-level PDF parser written in Rust, with Python bindings for geometry-aware text extraction and spatial queries. It is designed primarily for fast spatial lookups, but also supports full text extraction and regex queries.

Installing

uv add reap-pdf

Usage

from reap import Rectangle, TextBlockIndex

index = TextBlockIndex.from_path("my_w2.pdf")
blocks = index.search_regex("Wage and Tax Statement")
print(blocks)
#> [TextBlock("Wage and Tax Statement")]

# Search to the right of the label.
search_rect = Rectangle(
    top=blocks[0].rect.top,
    left=blocks[0].rect.right,
    bottom=blocks[0].rect.bottom,
    right=blocks[0].rect.right + 100,
)
year = index.search(search_rect, overlap=0.3)
print(year[0].text)
#> 2026
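
The overlap argument in the search call is not explained in this description. One plausible reading, and it is only an assumption, is that a block matches when at least that fraction of its area lies inside the search rectangle. The sketch below shows that calculation with a hypothetical Box class and overlap_fraction helper (neither is part of reap), assuming y grows downward so that top < bottom:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for a rectangle; not reap's Rectangle class."""

    top: float
    left: float
    bottom: float
    right: float

    @property
    def area(self) -> float:
        return max(0.0, self.right - self.left) * max(0.0, self.bottom - self.top)


def overlap_fraction(block: Box, query: Box) -> float:
    """Fraction of the block's area that lies inside the query box (0.0 to 1.0)."""
    width = min(block.right, query.right) - max(block.left, query.left)
    height = min(block.bottom, query.bottom) - max(block.top, query.top)
    if width <= 0 or height <= 0 or block.area == 0:
        return 0.0
    return (width * height) / block.area


query = Box(top=100, left=200, bottom=112, right=300)
# This word is 40 points wide and 30 of those points fall inside the query.
word = Box(top=100, left=190, bottom=112, right=230)
print(overlap_fraction(word, query))  # 0.75
```

Under this reading, a threshold of 0.3 would keep the word above, since 0.75 >= 0.3, and would drop blocks that only graze the edge of the search rectangle.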

Details

  • TextBlockIndex builds an index of word-level TextBlocks across all pages.
  • TextBlockIndex.search_regex searches the entire text corpus in reading order; the word-level TextBlocks that make up each match are merged into a single TextBlock.
  • TextBlockIndex.text returns text in reading order.
  • All pages are merged and treated as one tall page, with each page's coordinates stacked below those of the previous page. Coordinate units are points.
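
Because all pages share one stacked coordinate space, a y coordinate from a search result may need to be mapped back to a page. The sketch below shows one way to do that. It assumes you already know each page's height in points, that y grows downward, and that the origin is the top of the first page; the locate helper is hypothetical and not part of reap.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(y: float, page_heights: list[float]) -> tuple[int, float]:
    """Map a stacked y coordinate to (zero-based page index, y within that page)."""
    # Offsets where each page starts in the stacked space: [0, h0, h0 + h1, ...]
    starts = [0.0, *accumulate(page_heights)]
    if not 0 <= y < starts[-1]:
        raise ValueError(f"y={y} is outside the stacked document")
    page = bisect_right(starts, y) - 1
    return page, y - starts[page]


# Three US Letter pages, each 792 points tall (assumed for illustration).
heights = [792.0, 792.0, 792.0]
print(locate(1000.0, heights))  # (1, 208.0)
```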

Advanced use cases

  • TextBlockIndex.scoped creates a TextBlockIndex scoped to a specific region, with support for merging text blocks within the region into multi-word TextBlocks.
  • TextBlock.chars returns the list of TextChar objects extracted for that block.
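
To illustrate what merging word-level blocks into multi-word blocks can look like, here is a minimal sketch. It is not reap's implementation, and it does not use reap's API: the Word class, the merge_words helper, and the 4-point gap threshold are all assumptions chosen for the example, and the sort assumes words on a line share the same top coordinate.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical word box; not reap's TextBlock."""

    text: str
    top: float
    left: float
    bottom: float
    right: float


def merge_words(words: list[Word], max_gap: float = 4.0) -> list[Word]:
    """Join words that share a line and sit within max_gap points of each other."""
    merged: list[Word] = []
    # Reading order for this sketch: top to bottom, then left to right.
    for word in sorted(words, key=lambda w: (w.top, w.left)):
        prev = merged[-1] if merged else None
        same_line = prev is not None and word.top < prev.bottom and word.bottom > prev.top
        if same_line and 0 <= word.left - prev.right <= max_gap:
            # Extend the previous block to cover this word as well.
            merged[-1] = Word(
                text=f"{prev.text} {word.text}",
                top=min(prev.top, word.top),
                left=prev.left,
                bottom=max(prev.bottom, word.bottom),
                right=word.right,
            )
        else:
            merged.append(word)
    return merged


words = [
    Word("Tax", 10, 45, 20, 62),
    Word("Wage", 10, 0, 20, 28),
    Word("and", 10, 31, 20, 43),
    Word("2026", 10, 120, 20, 145),
]
print([w.text for w in merge_words(words)])  # ['Wage and Tax', '2026']
```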

Scope

  • OCR is out of scope. reap focuses entirely on extracting visible text present in the PDF.
  • Full coverage of the PDF specification is not promised; support is added as the need arises.
  • Extraction of all visible text in PDFs is best effort.

Comparisons

Extraction differences

PyMuPDF and pdfminer both extract hidden text, e.g. white text on a white background, which is often undesirable. reap performs visibility checks during extraction so that invisible text is excluded from the final output.
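
The exact checks reap performs are not described here. As a rough illustration of the idea only, the sketch below flags two common cases: text drawn in PDF text rendering mode 3 (neither filled nor stroked) and text whose fill color is nearly identical to the background. The is_visible function, its parameters, and the tolerance value are hypothetical, and stroke colors, clipping, and overlapping objects are ignored.

```python
def is_visible(fill_rgb, background_rgb, render_mode=0, tolerance=0.02):
    """Return False for text a reader could not see on the page.

    Colors are (r, g, b) tuples with components in the range 0.0 to 1.0.
    """
    # PDF text rendering mode 3 paints neither fill nor stroke.
    if render_mode == 3:
        return False
    # Treat text as hidden when every channel is within tolerance of the background.
    return any(abs(f - b) > tolerance for f, b in zip(fill_rgb, background_rgb))


white = (1.0, 1.0, 1.0)
black = (0.0, 0.0, 0.0)
print(is_visible(white, white))                 # False: white on white
print(is_visible(black, white))                 # True
print(is_visible(black, white, render_mode=3))  # False: invisible render mode
```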

Speed

Only text extraction performance is compared, as neither alternative offers proper spatial query support.

Library     Text extraction (median)
reap        0.850 ms
PyMuPDF     5 ms
pdfminer    220 ms

Speed comparisons vary greatly depending on the PDF. With smaller PDFs, reap is roughly 10x faster than PyMuPDF and 500x faster than pdfminer; with larger PDFs, the advantage narrows to roughly 5x and 250x respectively. Tests were conducted on a private test suite of PDFs of varying complexity.
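
The benchmark harness behind these numbers is not published here. For readers who want to take comparable measurements on their own PDFs, the sketch below shows one simple way to compute a median wall-clock time. The median_ms helper, the run count, and the warmup count are assumptions, and the workload is a stand-in.

```python
import statistics
import time
from typing import Callable


def median_ms(task: Callable[[], object], runs: int = 25, warmup: int = 3) -> float:
    """Median wall-clock time of task in milliseconds over the given number of runs."""
    # Warmup calls are executed but not timed.
    for _ in range(warmup):
        task()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in workload. Swap in the call you want to time, for example
# lambda: TextBlockIndex.from_path("my_w2.pdf")
print(f"{median_ms(lambda: sum(range(10_000))):.3f} ms median")
```

The median is less sensitive than the mean to occasional slow runs caused by other activity on the machine, which matters when individual runs take less than a millisecond.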


Download files

Source Distribution

  • reap_pdf-0.1.10.tar.gz (1.4 MB): Source

Built Distributions

  • reap_pdf-0.1.10-cp314-cp314-manylinux_2_34_x86_64.whl (1.4 MB): CPython 3.14, manylinux (glibc 2.34+), x86-64
  • reap_pdf-0.1.10-cp314-cp314-manylinux_2_34_aarch64.whl (1.4 MB): CPython 3.14, manylinux (glibc 2.34+), ARM64
  • reap_pdf-0.1.10-cp314-cp314-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (2.6 MB): CPython 3.14, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64
  • reap_pdf-0.1.10-cp313-cp313-manylinux_2_34_x86_64.whl (1.4 MB): CPython 3.13, manylinux (glibc 2.34+), x86-64
  • reap_pdf-0.1.10-cp313-cp313-manylinux_2_34_aarch64.whl (1.4 MB): CPython 3.13, manylinux (glibc 2.34+), ARM64
  • reap_pdf-0.1.10-cp313-cp313-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (2.6 MB): CPython 3.13, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64
  • reap_pdf-0.1.10-cp312-cp312-manylinux_2_34_x86_64.whl (1.4 MB): CPython 3.12, manylinux (glibc 2.34+), x86-64
  • reap_pdf-0.1.10-cp312-cp312-manylinux_2_34_aarch64.whl (1.4 MB): CPython 3.12, manylinux (glibc 2.34+), ARM64
  • reap_pdf-0.1.10-cp312-cp312-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (2.6 MB): CPython 3.12, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64
  • reap_pdf-0.1.10-cp311-cp311-manylinux_2_34_x86_64.whl (1.4 MB): CPython 3.11, manylinux (glibc 2.34+), x86-64
  • reap_pdf-0.1.10-cp311-cp311-manylinux_2_34_aarch64.whl (1.4 MB): CPython 3.11, manylinux (glibc 2.34+), ARM64
  • reap_pdf-0.1.10-cp311-cp311-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (2.6 MB): CPython 3.11, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64
  • reap_pdf-0.1.10-cp310-cp310-manylinux_2_34_x86_64.whl (1.4 MB): CPython 3.10, manylinux (glibc 2.34+), x86-64
  • reap_pdf-0.1.10-cp310-cp310-manylinux_2_34_aarch64.whl (1.4 MB): CPython 3.10, manylinux (glibc 2.34+), ARM64
  • reap_pdf-0.1.10-cp310-cp310-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (2.6 MB): CPython 3.10, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64

File details

Details for the file reap_pdf-0.1.10.tar.gz.

File metadata

  • Download URL: reap_pdf-0.1.10.tar.gz
  • Upload date:
  • Size: 1.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for reap_pdf-0.1.10.tar.gz
Algorithm Hash digest
SHA256 0d479e02ff0841c1e77db1b8faf3fbbef4933ac98336470c259b34bb64d9c402
MD5 10c5f770390698f5ac3139c65e48aced
BLAKE2b-256 afdd4b7fa65aff9a699f85ced1a4fe9dca3988ee897e4c25e9237df68fc513fc

See more details on using hashes here.

Provenance

The following attestation bundles were made for reap_pdf-0.1.10.tar.gz:

Publisher: release-publish.yml on kennipj/reap

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file reap_pdf-0.1.10-cp314-cp314-manylinux_2_34_x86_64.whl.

File hashes

Hashes for reap_pdf-0.1.10-cp314-cp314-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 57081903f3e4695fdd8c8d43e8eb875c08a5077529c2f71106ec41eb8d8d5a79
MD5 9d6acb1c217f9c1fdf665e03d565023b
BLAKE2b-256 115292327e73e761476dc0333fc6afb60231b4ecb85b7b3b7d046fdd57dc2cd6

File details

Details for the file reap_pdf-0.1.10-cp314-cp314-manylinux_2_34_aarch64.whl.

File hashes

Hashes for reap_pdf-0.1.10-cp314-cp314-manylinux_2_34_aarch64.whl
Algorithm Hash digest
SHA256 ecb059ebb7d1ab78a2e1f97eeb2ed8bd627400317b46efe79cd5458fcd26d84f
MD5 41da90f71ebe91f30e086389da529de7
BLAKE2b-256 0293347905bc0dbb089a662f715fa920cec5d885a05234af816d207f4c81d104

File details

Details for the file reap_pdf-0.1.10-cp314-cp314-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl.

File hashes

Hashes for reap_pdf-0.1.10-cp314-cp314-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl
Algorithm Hash digest
SHA256 d6a881542b9cd7e7f9a2e74fcbf72cce81f2746ec96b736265bdc1d2622f6cd3
MD5 c2f16eab9835f59f9f8d3af76c0bd04b
BLAKE2b-256 6bf58f0213da07f0a045c2613734e8d3a882fb454a7978cc46697181d7660483

File details

Details for the file reap_pdf-0.1.10-cp313-cp313-manylinux_2_34_x86_64.whl.

File hashes

Hashes for reap_pdf-0.1.10-cp313-cp313-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 5d16c901ac204a12de2d0028f380e3f6c290f4303162e9281b38839a4fb55906
MD5 ae540214f9a29f334bb10dedf66339fc
BLAKE2b-256 b0c20805c16a7fcf5651df1715dd74ce1858cc658b62e388340978316bc37287

File details

Details for the file reap_pdf-0.1.10-cp313-cp313-manylinux_2_34_aarch64.whl.

File hashes

Hashes for reap_pdf-0.1.10-cp313-cp313-manylinux_2_34_aarch64.whl
Algorithm Hash digest
SHA256 2f81b0d086feff0187938b963ea83eea3061c7a65a7a906098a1c06a2687c3fc
MD5 e3a8bc092353cc73b41ece0f086ab1af
BLAKE2b-256 249c3adc418f7b0f8821e10ad8828cda20f9ad12543a8b3435ab2b24f88028f3

File details

Details for the file reap_pdf-0.1.10-cp313-cp313-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl.

File hashes

Hashes for reap_pdf-0.1.10-cp313-cp313-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl
Algorithm Hash digest
SHA256 971bf9099f0f3910cdf689c8f2dd8b630c69570d8fa581f73525af839d8aebb3
MD5 fff3430fb699c69a3426037200a43d11
BLAKE2b-256 8192a7fca36fd1bf2edbe5f8bcb80baf902df8eb503ebbc809fecf25a749998e

File details

Details for the file reap_pdf-0.1.10-cp312-cp312-manylinux_2_34_x86_64.whl.

File hashes

Hashes for reap_pdf-0.1.10-cp312-cp312-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 fa7ddaaf5ace7cc71d7a3ee41ba4339449214b7772479f8bf21078cbc8ef24f5
MD5 6d8baa93dd1b6365c6f91b030676c031
BLAKE2b-256 1f6f7f53b9d6e9b770725d122d24370a7af321e2527f18293ce1d095511a3866

File details

Details for the file reap_pdf-0.1.10-cp312-cp312-manylinux_2_34_aarch64.whl.

File hashes

Hashes for reap_pdf-0.1.10-cp312-cp312-manylinux_2_34_aarch64.whl
Algorithm Hash digest
SHA256 b6e8293640890566dc8a118273339ffd5cf936f7a03a9214abe18cf010c05f19
MD5 0737ace84e6f1ba7ff625f592f47ae87
BLAKE2b-256 4066ba635215ad9f7e418ee2470e64c866e79e3704ea4e210d09fe47fd86ef85

File details

Details for the file reap_pdf-0.1.10-cp312-cp312-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl.

File hashes

Hashes for reap_pdf-0.1.10-cp312-cp312-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl
Algorithm Hash digest
SHA256 5f86ebbfa85bb1b36ed957c9165a92e59c3baad79cd8b120b032ed063d8e4b90
MD5 cd7522acf4a24762682fb9a0e1f55dc6
BLAKE2b-256 2d9a1fe512f3504fc14ed30c8c36ef2b975ad8bb19883a0ec0261c73eaa7d46e

File details

Details for the file reap_pdf-0.1.10-cp311-cp311-manylinux_2_34_x86_64.whl.

File hashes

Hashes for reap_pdf-0.1.10-cp311-cp311-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 225346bdc711bf47ca336c2195b6540801a74d9d68827c005087735a9f1f3b08
MD5 11f05117df3b9905d2d5a3583e6d7ab8
BLAKE2b-256 251f39ef3537b9a1c17a23dbc85f580d3646517ebe32f70081bd9ef6617c1083

File details

Details for the file reap_pdf-0.1.10-cp311-cp311-manylinux_2_34_aarch64.whl.

File hashes

Hashes for reap_pdf-0.1.10-cp311-cp311-manylinux_2_34_aarch64.whl
Algorithm Hash digest
SHA256 69395695f121600c0f3bb1459c3923d27dc01dcf1fa174c926040148f4eb62ae
MD5 fe4c1f18b7533175f9148f67e30d4241
BLAKE2b-256 b5dac2cc417876040c4a1956ae767cf7ca30f3f906ca6bdc3db093ecf2ea7453

File details

Details for the file reap_pdf-0.1.10-cp311-cp311-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl.

File hashes

Hashes for reap_pdf-0.1.10-cp311-cp311-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl
Algorithm Hash digest
SHA256 f303b5f3c8dfba074fa63597c514d0af096362b234fd650faccb62653a9f2228
MD5 84964381f97f52e42f66083b3f0710f1
BLAKE2b-256 15824f3f2fc0f2edc684f3e41d853256d8741289cba1e354af04d77f13ca36b9

File details

Details for the file reap_pdf-0.1.10-cp310-cp310-manylinux_2_34_x86_64.whl.

File hashes

Hashes for reap_pdf-0.1.10-cp310-cp310-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 5d5ca5f6b38065053d6d23be9e94d82a17de1352c69984d13fad2acabf4e9270
MD5 300c207557d009c42102e2a0d7a41869
BLAKE2b-256 9b35eff941a9ce5cfab0d04547b689ea6cb6d3f5f49acfe751e49d1942bc00de

File details

Details for the file reap_pdf-0.1.10-cp310-cp310-manylinux_2_34_aarch64.whl.

File hashes

Hashes for reap_pdf-0.1.10-cp310-cp310-manylinux_2_34_aarch64.whl
Algorithm Hash digest
SHA256 163581c69d44733966a69ea490152cac00a5037a9b1878ae10d1dd7420c49b9d
MD5 fbece0ccfde1b6830b53c9890bee48a4
BLAKE2b-256 d55e13147f2074faad3627373919deebe4c98acc233c71c1966161275875056f

File details

Details for the file reap_pdf-0.1.10-cp310-cp310-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl.

File hashes

Hashes for reap_pdf-0.1.10-cp310-cp310-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl
Algorithm Hash digest
SHA256 7fa35c91f0036c34b4cb22ba3f6be338bf41e7cb5a6dc8ee6cc98b9a5e5bb437
MD5 c5d85aa6299c7f5126739d450876020b
BLAKE2b-256 da5fd3b5a21809f92769c1574162ed2b38723e58092379a4d85dbf6c6b8c97e7
