This library builds a graph representation of the content of PDFs. The graph is then clustered, and the resulting page segments are classified and returned. Tables are returned in a CSV-style format.
How-to
Pass the path of the PDF file to be converted (as a string) to PDFSegmenter.
Call the function segment_document().
The function get_labeled_graphs() returns page-wise document graph representations as a list of networkx graphs. The labels indicate a clustering assignment.
segments2json() returns a JSON representation of the segmented document.
segments2text() returns a textual representation of the segmented document. The output can be annotated (lists, text and tables are supported) or plain, controlled via the boolean parameter annotate.
Media boxes of a PDF can be accessed using get_media_boxes(), and the page count using get_page_count().
Example call:
pdf = "document.pdf"  # placeholder: path to the PDF file, as a string
segmenter = PDFSegmenter(pdf)
segmenter.segment_document()
result = segmenter.segments2json()
text = segmenter.segments2text()
graphs = segmenter.get_labeled_graphs()
The file is the only mandatory parameter for page segmentation.
JSON
tbd
Annotated text
tbd
Settings
Clustering
tbd
Merging
tbd
Classification
tbd
Graph
General parameters:
file: file name
merge_boxes: indicates whether PDF text boxes should form graph nodes based on the visual rectangles present in the document.
regress_parameters: indicates whether graph parameters are regressed or whether a priori optimized defaults are used.
Edge restrictions:
use_font: restrict edges based on differing font sizes
use_width: restrict edges based on differing widths
use_rect: restrict edges between nodes contained in differing visual structures
use_horizontal_overlap: indicates whether horizontal edges should be built based on overlap. If not, default deltas are used.
use_vertical_overlap: indicates whether vertical edges should be built based on overlap. If not, default deltas are used.
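The difference between an overlap test and a delta-based test can be sketched as follows. This is a minimal illustration and not the library's implementation: the helper names spans_overlap and aligned_within are hypothetical, and comparing the left coordinates of two boxes for the delta test is an assumption.

```python
def spans_overlap(a_start, a_end, b_start, b_end):
    """Return True if the 1-D intervals [a_start, a_end] and [b_start, b_end] share a point."""
    lo_a, hi_a = sorted((a_start, a_end))
    lo_b, hi_b = sorted((b_start, b_end))
    return lo_a <= hi_b and lo_b <= hi_a


def aligned_within(a, b, eps):
    """Return True if two coordinates differ by at most eps (the delta-based alternative)."""
    return abs(a - b) <= eps


# Two hypothetical text boxes described by their x_0 / x_1 coordinates (in points).
box_a = {"x_0": 72.0, "x_1": 200.0}
box_b = {"x_0": 150.0, "x_1": 320.0}

overlap = spans_overlap(box_a["x_0"], box_a["x_1"], box_b["x_0"], box_b["x_1"])
aligned = aligned_within(box_a["x_0"], box_b["x_0"], eps=2.0)
print(overlap, aligned)
```

Here the two boxes overlap horizontally but their left edges are far from aligned, so the two tests disagree; which one applies depends on the overlap flags.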
Edge thresholds:
page_ratio_x: maximum relative horizontal distance between two nodes at which an edge can be created
page_ratio_y: maximum relative vertical distance between two nodes at which an edge can be created
x_eps: alignment epsilon for vertical edges in points if use_horizontal_overlap is not enabled
y_eps: alignment epsilon for horizontal edges in points if use_vertical_overlap is not enabled
font_eps_h: indicates how much font sizes of nodes are allowed to differ as a constraint for building horizontal edges when use_font is enabled
font_eps_v: indicates how much font sizes of nodes are allowed to differ as a constraint for building vertical edges when use_font is enabled
width_pct_eps: relative width difference of nodes as a condition for vertical edges if use_width is enabled
width_page_eps: maximum node width up to which the width acts as an edge condition if use_width is enabled
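How such thresholds might combine into a single edge condition is sketched below. The function may_link_horizontally is hypothetical, and so are its details: it assumes that the relative distance is measured between node centers and normalized by the page width, and that font sizes are compared as an absolute difference in points. The threshold values are chosen only for the example and are not the library's defaults.

```python
def may_link_horizontally(node_a, node_b, page_width,
                          page_ratio_x, font_eps_h, use_font=True):
    """Illustrative gate for a horizontal edge between two nodes.

    Assumptions (not taken from the library): distance is measured between
    node centers and normalized by the page width; font sizes are compared
    as an absolute difference in points.
    """
    rel_dist = abs(node_a["pos_x"] - node_b["pos_x"]) / page_width
    if rel_dist > page_ratio_x:
        return False
    if use_font and abs(node_a["font_size"] - node_b["font_size"]) > font_eps_h:
        return False
    return True


# Hypothetical nodes and threshold values, chosen only for the example.
left = {"pos_x": 100.0, "font_size": 10.0}
right = {"pos_x": 250.0, "font_size": 10.5}
far = {"pos_x": 580.0, "font_size": 10.0}

near_ok = may_link_horizontally(left, right, page_width=600.0,
                                page_ratio_x=0.5, font_eps_h=1.0)
far_ok = may_link_horizontally(left, far, page_width=600.0,
                               page_ratio_x=0.5, font_eps_h=1.0)
print(near_ok, far_ok)
```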
Project Structure
tbd
Output Format
JSON
tbd
Text
tbd
Graph
As a result, a list of networkx graphs is returned.
Each graph encapsulates a structured representation of a single page.
Edges are attributed with the following features:
direction: shows the direction of an edge.
* v: Vertical edge
* h: Horizontal edge
* l: Rectangular loop. This represents a novel concept encapsulating structural characteristics of document segments by observing if two different paths end up in the same node.
length: Scaled length of an edge
lengthx_phys: Horizontal edge length
lengthy_phys: Vertical edge length
weight: Scaled total length
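Because every edge carries a direction attribute, edges can be split by type before further processing. The sketch below works on (u, v, attributes) triples, which is the shape networkx yields from graph.edges(data=True). The sample edges and the helper name group_edges_by_direction are made up for illustration.

```python
from collections import defaultdict


def group_edges_by_direction(edges):
    """Group (u, v, attrs) triples by their 'direction' attribute ('v', 'h' or 'l')."""
    grouped = defaultdict(list)
    for u, v, attrs in edges:
        grouped[attrs.get("direction")].append((u, v))
    return dict(grouped)


# Hypothetical edges; with a real page graph one would pass graph.edges(data=True).
sample_edges = [
    (0, 1, {"direction": "h", "length": 0.12}),
    (0, 2, {"direction": "v", "length": 0.05}),
    (1, 3, {"direction": "v", "length": 0.05}),
    (2, 3, {"direction": "l", "length": 0.13}),
]

by_direction = group_edges_by_direction(sample_edges)
print(by_direction)
```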
All nodes contain the following content attributes:
id: unique identifier of the PDF element
page: page number, starting with 0
text: text of the PDF element
x_0: left x coordinate
x_1: right x coordinate
y_0: top y coordinate
y_1: bottom y coordinate
pos_x: center x coordinate
pos_y: center y coordinate
abs_pos: tuple containing a page independent representation of (pos_x,pos_y) coordinates
original_font: font as extracted by pdfminer
font_name: name of the font extracted from original_font
code: font code as provided by pdfminer
bold: 1 if the text is bold, 0 otherwise
italic: 1 if the text is italic, 0 otherwise
font_size: size of the text in points
masked: text with numeric content replaced by #
frequency_hist: histogram of character type frequencies in a text, stored as a tuple containing percentages of textual, numerical, text symbolic and other symbols
len_text: number of characters
n_tokens: number of words
tag: tag for key-value pair extractions, indicating keys or values based on simple heuristics
box: box extracted by pdfminer Layout Analysis
in_element_ids: IDs of surrounding visual elements such as rectangles or lists, stored as a list [left, right, top, bottom]. A value of -1 indicates that there is no adjacent visual element.
in_element: indicates based on in_element_ids whether an element is stored in a visual rectangle representation (stored as “rectangle”) or not (stored as “none”).
is_loop: indicates whether or not a node is connected via a rectangular loop
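To make the masked and frequency_hist attributes more concrete, here is one possible reading of them in code. Both helpers are hypothetical and not the library's implementation. They assume that masking replaces each digit individually, that "text symbolic" means ASCII punctuation, and that percentages are expressed on a 0 to 100 scale.

```python
import string


def mask_digits(text):
    """Replace every digit with '#' (one assumed reading of the 'masked' attribute)."""
    return "".join("#" if ch.isdigit() else ch for ch in text)


def char_type_percentages(text):
    """Return percentages of (letters, digits, punctuation, other) characters.

    The four buckets mirror the description of 'frequency_hist'; treating
    'text symbolic' as ASCII punctuation is an assumption.
    """
    if not text:
        return (0.0, 0.0, 0.0, 0.0)
    counts = [0, 0, 0, 0]
    for ch in text:
        if ch.isalpha():
            counts[0] += 1
        elif ch.isdigit():
            counts[1] += 1
        elif ch in string.punctuation:
            counts[2] += 1
        else:
            counts[3] += 1
    return tuple(100.0 * c / len(text) for c in counts)


sample = "Total: 42 EUR"
masked = mask_digits(sample)
hist = char_type_percentages(sample)
print(masked, hist)
```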
The media boxes are stored in a dictionary with the following entries:
x0: Left x page crop box coordinate
x1: Right x page crop box coordinate
y0: Top y page crop box coordinate
y1: Bottom y page crop box coordinate
x0page: Left x page coordinate
x1page: Right x page coordinate
y0page: Top y page coordinate
y1page: Bottom y page coordinate
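As a small usage sketch, the page dimensions can be derived from the page-level entries listed above. The helper page_size and the sample values (an A4-sized page in points) are illustrative assumptions; the exact structure returned by get_media_boxes() is not specified here, so the example works on a plain dictionary with the documented keys.

```python
def page_size(media_box):
    """Return (width, height) of the page from the page-level entries.

    Taking absolute differences avoids assuming which y value is larger.
    """
    width = abs(media_box["x1page"] - media_box["x0page"])
    height = abs(media_box["y1page"] - media_box["y0page"])
    return width, height


# Hypothetical media box for an A4-sized page in points; values are illustrative only.
example_box = {
    "x0": 0.0, "x1": 595.0, "y0": 0.0, "y1": 842.0,
    "x0page": 0.0, "x1page": 595.0, "y0page": 0.0, "y1page": 842.0,
}
width, height = page_size(example_box)
print(width, height)
```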
Acknowledgements
Example PDFs were obtained from the ICDAR Table Recognition Challenge 2013 (https://roundtrippdf.com/en/data-extraction/pdf-table-recognition-dataset/).