PDFContentConverter

A tool for converting PDF text as well as structural features into a pandas dataframe.

The PDF Content Converter is a tool for converting PDF text as well as structural features into a pandas dataframe, written natively in Python. It retrieves information about textual content, fonts, positions, character frequencies and surrounding visual PDF elements.

How-to

Pass the path of the PDF file to be converted to PDFContentConverter.
Call pdf2pandas(). The PDF content is returned as a pandas dataframe.
The media boxes of a PDF can be accessed using get_media_boxes(), the page count using get_page_count() and the document text using pdf2text().
The convert() function returns the pandas dataframe, the textual document content, the media boxes and the page count together as a dictionary.

Example call:

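pdf = "example.pdf"  # placeholder path to the PDF file to convert

converter = PDFContentConverter(pdf)

result = converter.pdf2pandas()  # pandas dataframe, one row per PDF element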

Output Format

The converted PDF data is returned as a pandas dataframe in which each PDF element is stored as one row.

The dataframe contains the following columns:

id: unique identifier of the PDF element
page: page number, starting with 0
text: text of the PDF element
x_0: left x coordinate
x_1: right x coordinate
y_0: top y coordinate
y_1: bottom y coordinate
pos_x: center x coordinate
pos_y: center y coordinate
abs_pos: tuple containing a page independent representation of (pos_x,pos_y) coordinates
original_font: font as extracted by pdfminer
font_name: name of the font extracted from original_font
code: font code as provided by pdfminer
bold: 1 if the text is bold, 0 otherwise
italic: 1 if the text is italic, 0 otherwise
font_size: size of the text in points
masked: text with numeric content replaced by #
frequency_hist: histogram of character type frequencies in a text, stored as a tuple containing percentages of textual, numerical, text symbolic and other symbols
len_text: number of characters
n_tokens: number of words
tag: tag for key-value pair extractions, indicating keys or values based on simple heuristics
box: box extracted by pdfminer Layout Analysis
in_element_ids: IDs of surrounding visual elements such as rectangles or lists, stored as a list [left, right, top, bottom]. A value of -1 indicates that there is no adjacent visual element.
in_element: indicates, based on in_element_ids, whether an element lies inside a visual rectangle (stored as “rectangle”) or not (stored as “none”).
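
To make a few of the derived columns concrete, the following sketch recomputes pos_x, pos_y, masked, frequency_hist, len_text and n_tokens for a single element. It illustrates the column descriptions above and is not the package's implementation: the helper name describe_element, the digit-by-digit masking, the whitespace tokenisation, the 0 to 100 percentage scale and the way characters are sorted into the four histogram classes are all assumptions.

```python
import re
import string


def describe_element(text, x_0, x_1, y_0, y_1):
    """Recompute some derived columns for one PDF element (illustrative only)."""
    # Assumed character classes: letters, decimal digits, punctuation, rest.
    counts = {"textual": 0, "numerical": 0, "symbolic": 0, "other": 0}
    for ch in text:
        if ch.isalpha():
            counts["textual"] += 1
        elif ch.isdecimal():
            counts["numerical"] += 1
        elif ch in string.punctuation:
            counts["symbolic"] += 1
        else:
            counts["other"] += 1

    total = len(text) or 1  # avoid division by zero for empty text
    order = ("textual", "numerical", "symbolic", "other")

    return {
        "pos_x": (x_0 + x_1) / 2,  # center between left and right edge
        "pos_y": (y_0 + y_1) / 2,  # center between top and bottom edge
        "masked": re.sub(r"\d", "#", text),  # every digit becomes '#'
        "frequency_hist": tuple(100 * counts[k] / total for k in order),
        "len_text": len(text),
        "n_tokens": len(text.split()),  # assumed: whitespace-separated words
    }


row = describe_element("Total: 420", 100.0, 200.0, 700.0, 720.0)
print(row)
```

The values actually produced by pdf2pandas() may differ in detail, for example in how whitespace or special symbols are counted.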

Additionally, a dictionary is returned containing the following entries, which can be used to transform the absolute CSV coordinates:

x0: Left x page crop box coordinate
x1: Right x page crop box coordinate
y0: Top y page crop box coordinate
y1: Bottom y page crop box coordinate
x0page: Left x page coordinate
x1page: Right x page coordinate
y0page: Top y page coordinate
y1page: Bottom y page coordinate
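
One possible use of these entries is to rescale an element's center coordinates to page-relative fractions. The sketch below assumes the media boxes are available as a flat dictionary with exactly the keys listed above; whether the package returns one such dictionary per document or per page is not stated here, and the helper name to_relative is hypothetical.

```python
def to_relative(pos_x, pos_y, media_box):
    """Map absolute center coordinates to fractions of the page size.

    media_box is assumed to be a dict with the keys x0page, x1page,
    y0page and y1page as described above.
    """
    width = media_box["x1page"] - media_box["x0page"]
    height = media_box["y1page"] - media_box["y0page"]
    if width == 0 or height == 0:
        raise ValueError("page has zero width or height")
    # Offset from the left/top page edge, divided by the page extent.
    rel_x = (pos_x - media_box["x0page"]) / width
    rel_y = (pos_y - media_box["y0page"]) / height
    return rel_x, rel_y


# Hand-built example values, not output of the package.
page = {"x0page": 0.0, "x1page": 600.0, "y0page": 0.0, "y1page": 800.0}
print(to_relative(150.0, 200.0, page))
```

The same pattern works with the crop box entries x0, x1, y0 and y1 if positions relative to the cropped page are wanted.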

When using convert(), everything is returned in a single dictionary: the dataframe is stored as “content”, the page characteristics as “media_boxes”, the textual content as “text” and the number of pages as “page_count”.
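
The shape of that dictionary can be sketched as follows. The example uses a hand-built stand-in rather than a real convert() call: the "content" entry is shown as a plain list of row dictionaries instead of a pandas dataframe so that the snippet runs without third-party packages, and the helper name unpack_result is hypothetical.

```python
EXPECTED_KEYS = ("content", "media_boxes", "text", "page_count")


def unpack_result(result):
    """Return the four documented entries of a convert()-style dictionary."""
    missing = [key for key in EXPECTED_KEYS if key not in result]
    if missing:
        raise KeyError(f"result is missing keys: {missing}")
    return tuple(result[key] for key in EXPECTED_KEYS)


# Stand-in for the dictionary described above (values are made up).
example = {
    "content": [{"id": 0, "page": 0, "text": "Total: 420"}],
    "media_boxes": {"x0page": 0.0, "x1page": 600.0,
                    "y0page": 0.0, "y1page": 800.0},
    "text": "Total: 420",
    "page_count": 1,
}

content, media_boxes, text, page_count = unpack_result(example)
print(page_count, text)
```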

Acknowledgements

This work is built on top of the pdfminer project https://github.com/euske/pdfminer.

Release history

0.7 (this version): Sep 8, 2020
0.6: Sep 8, 2020
0.5: Sep 8, 2020
0.4: Sep 8, 2020
0.3.1: Sep 8, 2020

Download files

Source Distribution

PDFContentConverter-0.7.tar.gz (5.9 kB)

Uploaded Sep 8, 2020 Source

File details

Details for the file PDFContentConverter-0.7.tar.gz.

File metadata

Download URL: PDFContentConverter-0.7.tar.gz
Upload date: Sep 8, 2020
Size: 5.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.23.0 setuptools/50.3.0 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.7.5

File hashes

Hashes for PDFContentConverter-0.7.tar.gz
Algorithm	Hash digest
SHA256	`193f44cd744b533cdb3df302aeefd75c1313f831e09fbfee91355a827f3f4f06`
MD5	`9efc434e4ff901fc8a2f00d681d9f2b2`
BLAKE2b-256	`522c170418556b449553844438960432fb8963bccffcecb4875ee6cf3477e026`
