Fast PDF Data Extraction library

These details have not been verified by PyPI

Project description

hotpdf

This project started as an internal project at Prestatech to parse PDF files in a fast and memory-efficient way, after we ran into difficulties parsing large PDF files with libraries such as pdfquery.

hotpdf is a wrapper around pdfminer.six focusing on text extraction and text search operations on PDFs.

hotpdf can be used to find and extract text from PDFs. Please read the docs to understand how the library can help you!

Installation

The latest version of hotpdf can be installed directly from PyPI with pip.

pip install hotpdf
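To confirm that the install landed in the environment you expect, you can query the installed distribution metadata. The sketch below uses only the standard library; the helper name installed_version is ours for illustration and is not part of hotpdf.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed.

    Hypothetical helper for illustration; it only reads distribution metadata
    and does not import the package itself.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("hotpdf")
    print(f"hotpdf version: {found}" if found else "hotpdf is not installed")
```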

Contributing

You should install the pre-commit hooks with pre-commit install. This will run the linter, mypy, and ruff formatting before each commit.

Remember to run pip install -e '.[dev]' to install the extra dependencies for development.

For more examples of how to run the full test suite please refer to the CI workflow.

We strive to keep the test coverage at 100%: if you want your contributions accepted please write tests for them :D

Some examples of running tests locally:

python3 -m pip install -e '.[dev]'    # install extra deps for testing
python3 -m pytest -n=auto tests/      # run the test suite

Documentation

We use Sphinx to generate our docs and host them on Read the Docs.

Please update or add documentation as required with your contributions.

Update the .rst files, rebuild the docs, and commit the changes along with your PRs.

cd docs
make clean
make html

This will generate the necessary documentation files. Once merged to main the docs will be updated automatically.

Usage

For more detailed usage information, please read the docs.

Basic usage is as follows:

from hotpdf import HotPdf

pdf_file_path = "test.pdf"

# Load pdf file into memory
hotpdf_document = HotPdf(pdf_file_path, transformer='Command to convert from PDF to XML')

# Alternatively, you can also pass an opened pdf stream to be loaded
with open(pdf_file_path, "rb") as f:
    hotpdf_document_2 = HotPdf(f)

# Get number of pages
print(len(hotpdf_document.pages))

# Find text
text_occurrences = hotpdf_document.find_text("foo")

# Find text and its full span
text_occurrences_full_span = hotpdf_document.find_text("foo", take_span=True)

# Extract text in region
text_in_bbox = hotpdf_document.extract_text(
    x0=0,
    y0=0,
    x1=100,
    y1=10,
    page=0,
)

# Extract spans in region
spans_in_bbox = hotpdf_document.extract_spans(
    x0=0,
    y0=0,
    x1=100,
    y1=10,
    page=0,
)

# Extract spans text in region
spans_text_in_bbox = hotpdf_document.extract_spans_text(
    x0=0,
    y0=0,
    x1=100,
    y1=10,
    page=0,
)

# Extract full page text
full_page_text = hotpdf_document.extract_page_text(page=0)
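To make the x0, y0, x1, y1 arguments above more concrete, here is a minimal, library-independent sketch of what extracting text in a region means conceptually. It is not hotpdf's implementation: the Char type, the text_in_bbox function, and the choice of inclusive bounds are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Char:
    """Hypothetical positioned character (not a hotpdf type)."""

    value: str
    x: float
    y: float


def text_in_bbox(chars: Iterable[Char], x0: float, y0: float, x1: float, y1: float) -> str:
    """Join the characters whose position falls inside the box, row by row.

    Bounds are treated as inclusive here; that is an assumption of this sketch.
    """
    inside = [c for c in chars if x0 <= c.x <= x1 and y0 <= c.y <= y1]
    # Order by row (y) first, then left to right (x) within a row.
    inside.sort(key=lambda c: (c.y, c.x))

    rows: Dict[float, List[str]] = {}
    for c in inside:
        rows.setdefault(c.y, []).append(c.value)
    return "\n".join("".join(values) for values in rows.values())
```

For example, given characters at (0, 0), (1, 0), (200, 0) and (0, 5), a box of x0=0, y0=0, x1=100, y1=10 keeps the first, second and fourth and drops the one at x=200.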

License

This project is licensed under the terms of the MIT license.

with ❤️ from the team @ Prestatech GmbH

Project details


Release history

Version	Release date
0.5.2	Feb 28, 2024
0.5.1	Feb 26, 2024
0.5.0	Feb 23, 2024
0.4.6.1	Feb 22, 2024
0.4.6	Feb 20, 2024
0.4.5.2	Feb 5, 2024
0.4.5.2.dev0 (pre-release)	Feb 7, 2024
0.4.5.1	Feb 5, 2024
0.4.5	Feb 2, 2024
0.4.4	Feb 2, 2024
0.4.3.1	Feb 2, 2024
0.4.3	Feb 1, 2024
0.4.2.4	Jan 31, 2024
0.4.2.3	Jan 31, 2024
0.4.2.2	Jan 31, 2024
0.4.2.1	Jan 31, 2024
0.4.2	Jan 30, 2024
0.4.1.3	Jan 30, 2024
0.4.1.2	Jan 30, 2024
0.4.1.1	Jan 30, 2024
0.4.1	Jan 30, 2024
0.4.0 (this version)	Jan 30, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hotpdf-0.4.0.tar.gz (18.1 kB)

Uploaded Jan 30, 2024 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.


hotpdf-0.4.0-py3-none-any.whl (15.1 kB)

Uploaded Jan 30, 2024 Python 3

File details

Details for the file hotpdf-0.4.0.tar.gz.

File metadata

Download URL: hotpdf-0.4.0.tar.gz
Upload date: Jan 30, 2024
Size: 18.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.7

File hashes

Hashes for hotpdf-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`10f38335d43ba1740b6b9fec5620fd913d9db7f495bbc650ab17057bbc273116`
MD5	`dcf65edeeeaf0ab1e4138ebfe94b07ee`
BLAKE2b-256	`e1a494f6983b714e6e5d2ae2d26aef6ac2bbb7875bbdb9da0933c26485168f97`
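
A downloaded file can be checked against the digests listed here before installing it. The following is a minimal sketch using Python's standard hashlib module; the function names file_digest and matches_digest and the 64 KiB chunk size are our own choices for illustration, not part of hotpdf or PyPI tooling.

```python
import hashlib


def file_digest(path: str, algorithm: str = "sha256", chunk_size: int = 65536) -> str:
    """Return the hex digest of the file at `path`, reading it in chunks.

    Works with "sha256" and "md5" as listed above. BLAKE2b-256 needs
    hashlib.blake2b(digest_size=32) instead of hashlib.new("blake2b"),
    which would produce a 512-bit digest.
    """
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_digest(path: str, expected_hex: str, algorithm: str = "sha256") -> bool:
    """Compare the file's digest with a published one, ignoring case and whitespace."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

For the source distribution, this would be called as matches_digest("hotpdf-0.4.0.tar.gz", "10f38335d43ba1740b6b9fec5620fd913d9db7f495bbc650ab17057bbc273116"), assuming the file sits in the current directory.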


File details

Details for the file hotpdf-0.4.0-py3-none-any.whl.

File metadata

Download URL: hotpdf-0.4.0-py3-none-any.whl
Upload date: Jan 30, 2024
Size: 15.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.7

File hashes

Hashes for hotpdf-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c1b54bbe846d9503e9203d536a04a59eae4b80413ccaaaae1b133ac032bc7791`
MD5	`801ef25d170c5c9a57b99982bd20f82c`
BLAKE2b-256	`793e84fbb7d4ccdb96ffbaaeb35adb4a688d409b65c0a075eabaa2ab0db60097`

