PDFSegmenter

This library builds a graph representation of the content of PDFs. The graph is then clustered, and the resulting page segments are classified and returned. Tables are returned in a CSV-style format.


How-to

Pass the path of the PDF file to be converted (as a string) to PDFSegmenter.
Call the function segment_document().
The function get_labeled_graphs() returns page-wise document graph representations as a list of networkx graphs. The labels indicate a clustering assignment.
segments2json() returns a JSON representation of the segmented document.
segments2text() returns a textual representation of the segmented document. The output can optionally be annotated (lists, text and tables are supported); this is controlled via the boolean parameter annotate.
Media boxes of a PDF can be accessed using get_media_boxes(), and the page count via get_page_count().

Example call:

segmenter = PDFSegmenter(pdf)

segmenter.segment_document()

result = segmenter.segments2json()

text = segmenter.segments2text()

graphs = segmenter.get_labeled_graphs()

The file is the only mandatory parameter for the page segmentation.

Besides the graph conversion, the media boxes of a document can be accessed using get_media_boxes(), and the page count via get_page_count().
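
The calls described in this How-to can be wrapped in a small helper. The following is a minimal sketch and not part of the library: export_segments is a hypothetical name, it accepts any object exposing the methods described here, and it assumes that segments2json() returns either a JSON string or a JSON-serializable object and that segments2text() returns a string. The import path of PDFSegmenter is left out on purpose, since it is not stated in this description.

```python
import json
from pathlib import Path


def export_segments(segmenter, out_dir):
    """Run the segmentation and write the JSON and text output to out_dir.

    `segmenter` is any object offering the methods described in the How-to
    (segment_document, segments2json, segments2text, get_page_count).
    Returns the page count reported by the segmenter.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    segmenter.segment_document()

    result = segmenter.segments2json()
    # Assumption: the result is a JSON string or a JSON-serializable object.
    if not isinstance(result, str):
        result = json.dumps(result, indent=2)
    (out_dir / "segments.json").write_text(result, encoding="utf-8")

    # annotate is the boolean parameter mentioned in the How-to.
    text = segmenter.segments2text(annotate=True)
    (out_dir / "segments.txt").write_text(text, encoding="utf-8")

    return segmenter.get_page_count()
```

With a segmenter created as in the example call, this would be invoked as export_segments(PDFSegmenter(pdf), "out").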

JSON

tbd

Annotated text

tbd

Settings

Clustering

tbd

Merging

tbd

Classification

tbd

Graph

General parameters:

file: file name
merge_boxes: indicates whether PDF text boxes should be used as graph nodes, based on visual rectangles present in the document.
regress_parameters: indicates whether graph parameters are regressed or whether a priori optimized defaults are used.

Edge restrictions:

use_font: restricts edges between nodes of differing font size
use_width: restricts edges between nodes of differing width
use_rect: restricts edges between nodes contained in differing visual structures
use_horizontal_overlap: indicates whether horizontal edges are built based on overlap. If not, default deltas are used.
use_vertical_overlap: indicates whether vertical edges are built based on overlap. If not, default deltas are used.

Edge thresholds:

page_ratio_x: maximum relative horizontal distance between two nodes at which an edge can be created
page_ratio_y: maximum relative vertical distance between two nodes at which an edge can be created
x_eps: alignment epsilon for vertical edges, in points, if use_horizontal_overlap is not enabled
y_eps: alignment epsilon for horizontal edges, in points, if use_vertical_overlap is not enabled
font_eps_h: how much the font sizes of two nodes may differ for a horizontal edge to be built, if use_font is enabled
font_eps_v: how much the font sizes of two nodes may differ for a vertical edge to be built, if use_font is enabled
width_pct_eps: relative width difference of two nodes tolerated as a condition for vertical edges, if use_width is enabled
width_page_eps: maximum node width up to which the width acts as an edge condition, if use_width is enabled
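
To illustrate how these thresholds might interact, the following sketch checks whether a vertical edge between two nodes would be permitted when use_horizontal_overlap is disabled. It is an assumption-based illustration, not the library's implementation: the function vertical_edge_allowed, the default values, the choice of left-edge alignment, the interpretation of page_ratio_y as relative to the page height, and the exact comparisons are all hypothetical. Only the parameter names and the node attribute names (x_0, y_0, y_1, font_size) come from this document.

```python
def vertical_edge_allowed(upper, lower, page_height,
                          x_eps=5.0, page_ratio_y=0.2,
                          use_font=True, font_eps_v=1.0):
    """Decide whether a vertical edge between two nodes is permitted.

    Nodes are plain dicts with x_0, y_0, y_1 and font_size entries.
    All default values here are hypothetical placeholders.
    """
    # Alignment: the left edges must agree within x_eps points.
    if abs(upper["x_0"] - lower["x_0"]) > x_eps:
        return False

    # Distance: the vertical gap, relative to the page height (assumed),
    # must not exceed page_ratio_y.
    gap = abs(lower["y_0"] - upper["y_1"])
    if gap / page_height > page_ratio_y:
        return False

    # Font restriction: only applied when use_font is enabled.
    if use_font and abs(upper["font_size"] - lower["font_size"]) > font_eps_v:
        return False

    return True
```

Disabling use_font in this sketch simply skips the font size comparison, which mirrors the description of use_font as an optional edge restriction.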

Project Structure

tbd

Output Format

JSON

tbd

Text

tbd

Graph

As a result, a list of networkx graphs is returned.

Each graph encapsulates a structured representation of a single page.

Edges are attributed with the following features:

direction: direction of the edge, one of:
* v: Vertical edge
* h: Horizontal edge
* l: Rectangular loop. This represents a novel concept encapsulating structural characteristics of document segments by observing if two different paths end up in the same node.
length: Scaled length of an edge
lengthx_phys: Horizontal edge length
lengthy_phys: Vertical edge length
weight: Scaled total length
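
As a sketch of how these edge attributes can be consumed, the helper below groups edges by their direction attribute. It works on (u, v, attributes) triples, which is the shape networkx yields from graph.edges(data=True). The function name group_edges_by_direction and the sample values are illustrative only.

```python
from collections import defaultdict


def group_edges_by_direction(edges):
    """Group (u, v, attrs) edge triples by attrs["direction"]."""
    grouped = defaultdict(list)
    for u, v, attrs in edges:
        grouped[attrs["direction"]].append((u, v))
    return dict(grouped)


# Made-up sample edges; with a real page graph one would pass
# graph.edges(data=True) instead.
sample_edges = [
    (0, 1, {"direction": "h", "length": 0.1}),
    (0, 2, {"direction": "v", "length": 0.3}),
    (1, 3, {"direction": "v", "length": 0.3}),
]
print(group_edges_by_direction(sample_edges))
```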

All nodes contain the following content attributes:

id: unique identifier of the PDF element
page: page number, starting with 0
text: text of the PDF element
x_0: left x coordinate
x_1: right x coordinate
y_0: top y coordinate
y_1: bottom y coordinate
pos_x: center x coordinate
pos_y: center y coordinate
abs_pos: tuple containing a page independent representation of (pos_x,pos_y) coordinates
original_font: font as extracted by pdfminer
font_name: name of the font extracted from original_font
code: font code as provided by pdfminer
bold: 1 if the text is bold, 0 otherwise
italic: 1 if the text is italic, 0 otherwise
font_size: size of the text in points
masked: text with numeric content replaced by #
frequency_hist: histogram of character type frequencies in a text, stored as a tuple containing percentages of textual, numerical, text symbolic and other symbols
len_text: number of characters
n_tokens: number of words
tag: tag for key-value pair extractions, indicating keys or values based on simple heuristics
box: box extracted by pdfminer Layout Analysis
in_element_ids: contains IDs of surrounding visual elements such as rectangles or lists. They are stored as a list [left, right, top, bottom]. A value of -1 indicates that there is no adjacent visual element.
in_element: indicates based on in_element_ids whether an element is stored in a visual rectangle representation (stored as “rectangle”) or not (stored as “none”).
is_loop: indicates whether or not a node is connected via a rectangular loop
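
The derived text attributes can be approximated in a few lines. The sketch below is one reading of the descriptions of masked, frequency_hist, len_text and n_tokens; it is not the library's code. In particular, replacing each digit individually with #, treating ASCII punctuation as "text symbolic", expressing the histogram as fractions of all characters, splitting tokens on whitespace, and the helper name text_features are assumptions.

```python
import string


def text_features(text):
    """Compute masked text and simple character statistics for a string."""
    # Assumption: every single digit is replaced by one '#'.
    masked = "".join("#" if ch.isdigit() else ch for ch in text)

    counts = [0, 0, 0, 0]  # textual, numerical, text symbolic, other
    for ch in text:
        if ch.isalpha():
            counts[0] += 1
        elif ch.isdigit():
            counts[1] += 1
        elif ch in string.punctuation:
            counts[2] += 1
        else:
            counts[3] += 1

    total = len(text)
    # Shares of all characters; an empty text yields zeros.
    hist = tuple(c / total if total else 0.0 for c in counts)

    return {
        "masked": masked,
        "frequency_hist": hist,
        "len_text": total,
        "n_tokens": len(text.split()),
    }
```

For an input such as "Total: 42 EUR", this sketch masks the two digits and counts the two spaces under "other".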

The media boxes are dictionaries with the following entries:

x0: Left x page crop box coordinate
x1: Right x page crop box coordinate
y0: Top y page crop box coordinate
y1: Bottom y page crop box coordinate
x0page: Left x page coordinate
x1page: Right x page coordinate
y0page: Top y page coordinate
y1page: Bottom y page coordinate
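
A small sketch of working with one such dictionary: the helpers below derive the crop box size and express a point relative to the crop box. The function names and any sample numbers are made up for illustration; only the keys x0, x1, y0 and y1 are taken from the list above.

```python
def crop_box_size(media_box):
    """Return (width, height) of the crop box described by a media box dict."""
    width = abs(media_box["x1"] - media_box["x0"])
    height = abs(media_box["y1"] - media_box["y0"])
    return width, height


def relative_position(media_box, x, y):
    """Express a point as fractions of the crop box extent.

    The result is 0 at the left/top border and 1 at the right/bottom
    border, independent of the direction of the y axis.
    """
    rel_x = (x - media_box["x0"]) / (media_box["x1"] - media_box["x0"])
    rel_y = (y - media_box["y0"]) / (media_box["y1"] - media_box["y0"])
    return rel_x, rel_y
```

Relative positions of this kind are one way to compare node coordinates across pages of different sizes.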

Acknowledgements

Example PDFs are obtained from the ICDAR Table Recognition Challenge 2013 https://roundtrippdf.com/en/data-extraction/pdf-table-recognition-dataset/.

Authors

Michael Benedikt Aigner
Florian Preis

Project details


Version 0.1, released Sep 11, 2020

