Parsee PDF Reader

This PDF reader is designed to overcome the common problems that arise when extracting tables from PDFs.

We initially focused on financial and numeric tables, so that is where this PDF reader works best.

This is an early release, so we are still making major changes.

Installation

Recommended install with poetry: https://python-poetry.org/docs/

poetry add parsee-pdf-reader

Alternatively:

pip install parsee-pdf-reader

To use the OCR capabilities, you also have to install Google Tesseract OCR (its documentation has additional information on installing the engine on Linux, macOS and Windows). You must be able to invoke the tesseract command as 'tesseract'. Note: in our testing we always used tesseract 5+, as that proved to be the most reliable, so on Linux you might have to build from source to get tesseract 5.

To use the PDF-to-image functionality, you need to install poppler, e.g. on macOS:

brew install poppler
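
Before running the reader, it can help to confirm that both external tools are reachable. The following is a minimal sketch and not part of parsee-pdf-reader: `missing_tools` is a hypothetical helper, and it assumes that poppler's `pdftoppm` command-line binary is the one that needs to be on the PATH.

```python
import shutil

# Executables assumed to be needed: `tesseract` for OCR, and `pdftoppm`
# (shipped with poppler) for PDF-to-image conversion.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of the tools that cannot be found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All external tools found.")
```

An empty list means both commands can be invoked by name, which is what the reader requires for tesseract.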

Extracting Tables and Paragraphs

Extracting tables and paragraphs of text can be done in one line:

from pdf_reader import get_elements_from_pdf
elements = get_elements_from_pdf("FILE_PATH")
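
To get a quick overview of what was extracted, the elements can be counted by type. This is a sketch under one assumption: that `get_elements_from_pdf` returns an iterable of Python objects. It relies only on each object's class name, because the element classes themselves are not described here. Both `count_element_types` and `summarise_pdf` are hypothetical helpers.

```python
from collections import Counter


def count_element_types(elements):
    """Count elements by their Python class name."""
    return Counter(type(element).__name__ for element in elements)


def summarise_pdf(file_path):
    """Extract the elements of a PDF and count them per class."""
    # Imported here so that count_element_types stays usable on its own.
    from pdf_reader import get_elements_from_pdf

    return count_element_types(get_elements_from_pdf(file_path))
```

Calling `summarise_pdf("FILE_PATH")` then gives a per-class count, which is a cheap way to see whether any tables were detected at all.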

If a PDF needs OCR but no elements, or only very few, are returned, you can force OCR like this (replace the path):

elements = get_elements_from_pdf("FILE_PATH", force_ocr=True)
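
The retry can also be automated. The sketch below assumes the returned elements support `len()`. The threshold of 5 is an arbitrary illustration, not a value recommended by the library, and `extract_with_ocr_fallback` is a hypothetical helper. The extraction function is passed in as an argument so the logic can be tried without a PDF.

```python
def extract_with_ocr_fallback(file_path, extract, min_elements=5):
    """Run `extract` normally and retry with OCR forced if too few elements come back.

    `extract` is expected to behave like `get_elements_from_pdf`:
    it takes a path plus an optional `force_ocr` keyword argument.
    """
    elements = extract(file_path)
    if len(elements) >= min_elements:
        return elements
    # Too few elements: the PDF may be scanned, so force OCR.
    return extract(file_path, force_ocr=True)


# Usage (requires parsee-pdf-reader to be installed):
# from pdf_reader import get_elements_from_pdf
# elements = extract_with_ocr_fallback("FILE_PATH", get_elements_from_pdf)
```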

If you want to visualise the output from the extraction, you can run the following (replace the paths):

from pdf_reader import visualise_pdf_output
visualise_pdf_output("FILE_PATH", "OUTPUT_PATH")

This will save an image of each page with the detected tables and text marked in red.

Methodology

The reader combines pdfminer, pypdf and tesseract and augments them with table elements, which are treated separately from the rest of the text. As a result, the output contains two types of elements: tables and text paragraphs. We believe this separation is important because otherwise tabular information is not extracted precisely, and concepts such as columns and rows are usually lost.
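
To illustrate why keeping rows and columns matters, here is a toy comparison. The data and the dictionary layout are invented for illustration and do not reflect the library's actual element classes or output. `lookup` is a hypothetical helper.

```python
# Toy data, not output from the library: the same small table in two forms.
flat_text = "Revenue 100 120 Costs 60 70"

table = {
    "columns": ["Item", "2022", "2023"],
    "rows": [
        ["Revenue", "100", "120"],
        ["Costs", "60", "70"],
    ],
}


def lookup(table, row_label, column_label):
    """Return the cell at the given row label and column header."""
    column_index = table["columns"].index(column_label)
    for row in table["rows"]:
        if row[0] == row_label:
            return row[column_index]
    raise KeyError(row_label)


print(lookup(table, "Costs", "2023"))
```

With the structured form, a value can be addressed by its row and column. In `flat_text`, nothing indicates which number belongs to which year.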

parsee-pdf-reader 0.1.7.0
