Parsee PDF Reader

This PDF reader was designed to overcome common problems encountered when extracting tables from PDFs.

We initially focused on financial/numeric tables, so this is what the reader works best for.

This is an early release, so we will still be making major changes.

Installation

Recommended install with poetry: https://python-poetry.org/docs/

poetry add parsee-pdf-reader

Alternatively:

pip install parsee-pdf-reader

To use the OCR capabilities you also have to install Google Tesseract OCR (the Tesseract documentation has additional information on installing the engine on Linux, macOS and Windows). You must be able to invoke the engine from the command line as 'tesseract'. Note: in our testing we always used tesseract 5+, as that proved to be the most reliable, so on Linux you might have to build from source to get tesseract 5.
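Before relying on OCR, it can help to confirm that the 'tesseract' command is reachable and which major version it reports. The following is a minimal sketch using only the Python standard library; the helper names parse_major_version and tesseract_major_version are illustrative and not part of parsee-pdf-reader, and the banner format is assumed to start with "tesseract 5.x" or "tesseract v5.x".

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_major_version(banner: str) -> Optional[int]:
    """Pull the major version out of a `tesseract --version` banner (assumed format)."""
    match = re.search(r"tesseract\s+v?(\d+)", banner)
    return int(match.group(1)) if match else None


def tesseract_major_version(command: str = "tesseract") -> Optional[int]:
    """Return the major version of the tesseract binary, or None if it cannot be run."""
    if shutil.which(command) is None:
        return None  # the command is not on PATH
    try:
        result = subprocess.run(
            [command, "--version"], capture_output=True, text=True, check=False
        )
    except OSError:
        return None
    # Some builds print the banner to stderr, so look at both streams.
    return parse_major_version(result.stdout + result.stderr)


if __name__ == "__main__":
    version = tesseract_major_version()
    if version is None:
        print("tesseract was not found on PATH")
    elif version < 5:
        print(f"tesseract {version} found; version 5+ is recommended")
    else:
        print(f"tesseract {version} found")
```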

Extracting Tables and Paragraphs

Extracting tables and paragraphs of text can be done with a single function call:

from pdf_reader import get_elements_from_pdf
elements = get_elements_from_pdf("FILE_PATH")

If you are processing a PDF that needs OCR but no elements, or only very few, are being returned, you can force OCR like this (replace the path):

elements = get_elements_from_pdf("FILE_PATH", force_ocr=True)
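The retry described above can be automated. The sketch below is one possible wrapper, not part of the library: extract_with_ocr_fallback and min_elements are hypothetical names, and it assumes the extraction function returns a list-like collection that supports len(). The extractor is passed in as an argument so the wrapper itself does not depend on how the package is imported.

```python
from typing import Any, Callable, List


def extract_with_ocr_fallback(
    file_path: str,
    extract: Callable[..., List[Any]],
    min_elements: int = 1,
) -> List[Any]:
    """Run a normal extraction first and retry with force_ocr=True if too little comes back."""
    elements = extract(file_path)
    if len(elements) < min_elements:
        # Too few elements usually means the PDF is scanned, so force OCR.
        elements = extract(file_path, force_ocr=True)
    return elements


# Intended usage (assumes parsee-pdf-reader is installed):
#   from pdf_reader import get_elements_from_pdf
#   elements = extract_with_ocr_fallback("FILE_PATH", get_elements_from_pdf, min_elements=5)
```

A suitable threshold depends on your documents; the value 5 in the usage comment is only a placeholder.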

If you want to visualise the output from the extraction, you can run the following (replace the paths):

from pdf_reader import visualise_pdf_output
visualise_pdf_output("FILE_PATH", "OUTPUT_PATH")

This will save an image of each page with the detected tables and text marked in red.

Methodology

The reader combines pdfminer, pypdf and tesseract and augments them with table elements, which are treated separately from the rest of the text. As a result, the output contains two types of elements: tables and text paragraphs. We believe this separation is important because otherwise tabular information is not extracted precisely, and concepts such as columns and rows are usually lost.
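To illustrate why keeping tables separate from paragraphs matters, here is a small self-contained sketch. The Paragraph and Table classes below are hypothetical stand-ins written for this example; they are not the classes that parsee-pdf-reader returns, and the real element types may look different.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Paragraph:
    """Hypothetical text element: a block of running text."""
    text: str


@dataclass
class Table:
    """Hypothetical table element: cells stay grouped by row."""
    rows: List[List[str]]

    def column(self, index: int) -> List[str]:
        # Column access is only possible because the row structure was preserved.
        return [row[index] for row in self.rows]


Element = Union[Paragraph, Table]


def split_elements(elements: List[Element]) -> Tuple[List[Table], List[Paragraph]]:
    """Separate a mixed list of elements into tables and paragraphs."""
    tables = [e for e in elements if isinstance(e, Table)]
    paragraphs = [e for e in elements if isinstance(e, Paragraph)]
    return tables, paragraphs


if __name__ == "__main__":
    page = [
        Paragraph("Summary of results"),
        Table(rows=[["Revenue", "100"], ["Costs", "40"]]),
    ]
    tables, paragraphs = split_elements(page)
    print(tables[0].column(1))
```

If the same table were flattened into plain text ("Revenue 100 Costs 40"), there would be no reliable way to tell which values belong to which column.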
