A tool for creating a graph representation out of the content of PDF documents.
Graph Converter
The Graph Converter is a tool for creating a graph representation out of the content of PDFs.
A graph representation can act as the basis for further document processing steps.
The graph encapsulates the geometric relationships between document elements, from which the document structure can be retrieved.
The tool works independently of the document layout.
The graph construction can be controlled via the parameter settings described below.
Furthermore, layout-based optimization without the need for parameter tweaks is supported, using a regression estimate based on document layout characteristics.
The processing of PDF documents is done using the `PDFContentConverter` library.
# How-to
Pass the path of the PDF file to be converted to `GraphConverter`.
Call the function `convert()`. The document graph representations are returned page-wise as a list of `networkx` graphs.
Media boxes of a PDF can be accessed using `get_media_boxes()`, the page count using `get_page_count()`.
Example call:
converter = GraphConverter(pdf)
result = converter.convert()
A file is the only mandatory parameter for graph construction.
Besides the graph conversion, media boxes of a document can be accessed using `get_media_boxes()` and the page count using `get_page_count()`.
General document layout characteristics are stored in a `converter.meta` object.
A more detailed example usage is also given in `Tester.py`.
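As a minimal sketch of how the page-wise result could be consumed, the helper below walks a list of graphs and reports the node and edge count per page. The name `summarize_pages` is hypothetical and not part of the library; it only relies on the `number_of_nodes()` and `number_of_edges()` methods that `networkx` graphs provide.

```python
def summarize_pages(graphs):
    """Return (page_index, node_count, edge_count) for every page graph.

    `graphs` is assumed to be the list returned by `converter.convert()`,
    i.e. one networkx graph per page, in page order.
    """
    summary = []
    for page_index, graph in enumerate(graphs):
        summary.append(
            (page_index, graph.number_of_nodes(), graph.number_of_edges())
        )
    return summary
```

Calling `summarize_pages(result)` on the `result` from the example call above would give one tuple per page.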
# Example
The following image shows a resulting document graph representation when using the `GraphConverter`.
TODO
# Settings
General parameters:
`file`: file name
`merge_boxes`: indicates whether PDF text boxes should be used as graph nodes, based on visual rectangles present in the document.
`regress_parameters`: indicates whether graph parameters are regressed or the a priori optimized defaults are used.
Edge restrictions:
`use_font`: differing font size
`use_width`: differing width
`use_rect`: nodes contained in differing visual structures
`use_horizontal_overlap`: indicating if horizontal edges should be built on overlap. If not, default deltas are used.
`use_vertical_overlap`: indicating if vertical edges should be built on overlap. If not, default deltas are used.
Edge thresholds:
`page_ratio_x`: maximal relative horizontal distance of two nodes where an edge can be created
`page_ratio_y`: maximal relative vertical distance of two nodes where an edge can be created
`x_eps`: alignment epsilon for vertical edges in points if `use_horizontal_overlap` is not enabled
`y_eps`: alignment epsilon for horizontal edges in points if `use_vertical_overlap` is not enabled
`font_eps_h`: indicates how much font sizes of nodes are allowed to differ as a constraint for building horizontal edges when `use_font` is enabled
`font_eps_v`: indicates how much font sizes of nodes are allowed to differ as a constraint for building vertical edges when `use_font` is enabled
`width_pct_eps`: relative width difference of nodes as a condition for vertical edges if `use_width` is enabled
`width_page_eps`: indicating at which maximal width of a node the width should act as an edge condition if `use_width` is enabled
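To illustrate how such thresholds might interact, here is a rough sketch of a vertical-edge check. It is an assumption about the general idea, not the library's actual implementation: the function name, the default values, and the choice to compare left edges and center distances are all hypothetical.

```python
def may_link_vertically(upper, lower, page_height, *, x_eps=5.0,
                        page_ratio_y=0.5, use_font=True, font_eps_v=1.0):
    """Decide whether two nodes qualify for a vertical edge (illustrative only).

    `upper` and `lower` are dicts with the node attributes `x_0`, `pos_y`
    and `font_size`. All default threshold values are made up.
    """
    # Alignment: left edges must lie within x_eps points of each other.
    aligned = abs(upper["x_0"] - lower["x_0"]) <= x_eps
    # Distance: vertical center distance relative to the page height.
    close = abs(upper["pos_y"] - lower["pos_y"]) / page_height <= page_ratio_y
    # Font constraint: only checked when use_font is enabled.
    font_ok = (not use_font) or (
        abs(upper["font_size"] - lower["font_size"]) <= font_eps_v
    )
    return aligned and close and font_ok
```

With `use_font=False` the font size difference is ignored, which mirrors how the restriction flags switch individual constraints on or off.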
# Project Structure
`GraphConverter.py`: contains the `GraphConverter` class for converting documents into graphs.
`util`:
`constants`:
`StorageUtil`: store/load functionalities
`Tester.py`: Python script for testing the `GraphConverter`
`pdf`: example pdf input files for tests
# Output Format
As a result, a list of `networkx` graphs is returned.
Each graph encapsulates a structured representation of a single page.
Edges are attributed with the following features:
`direction`: shows the direction of an edge.
* `v`: Vertical edge
* `h`: Horizontal edge
* `l`: Rectangular loop. This represents a novel concept encapsulating structural characteristics of document segments by observing if two different paths end up in the same node.
`length`: Scaled length of an edge
`lengthx_phys`: Horizontal edge length
`lengthy_phys`: Vertical edge length
`weight`: Scaled total length
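The edge attributes above can be inspected with a small helper such as the following sketch, which counts edges per `direction` value. The function name `count_directions` is hypothetical; it accepts any iterable of `(u, v, attrs)` triples, which is the shape that `graph.edges(data=True)` yields for a `networkx` graph.

```python
from collections import Counter


def count_directions(edges):
    """Count edges per `direction` attribute ('v', 'h' or 'l').

    `edges` is an iterable of (source, target, attribute_dict) triples,
    for example `graph.edges(data=True)`.
    """
    return Counter(attrs.get("direction") for _u, _v, attrs in edges)
```

For a page graph `g`, `count_directions(g.edges(data=True))` gives a quick overview of how many vertical edges, horizontal edges, and rectangular loops were built.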
All nodes contain the following content attributes:
`id`: unique identifier of the PDF element
`page`: page number, starting with 0
`text`: text of the PDF element
`x_0`: left x coordinate
`x_1`: right x coordinate
`y_0`: top y coordinate
`y_1`: bottom y coordinate
`pos_x`: center x coordinate
`pos_y`: center y coordinate
`abs_pos`: tuple containing a page independent representation of `(pos_x, pos_y)` coordinates
`original_font`: font as extracted by pdfminer
`font_name`: name of the font extracted from `original_font`
`code`: font code as provided by pdfminer
`bold`: 1 if the text is bold, 0 otherwise
`italic`: 1 if the text is italic, 0 otherwise
`font_size`: size of the text in points
`masked`: text with numeric content substituted as #
`frequency_hist`: histogram of character type frequencies in a text, stored as a tuple containing percentages of textual, numerical, text symbolic and other symbols
`len_text`: number of characters
`n_tokens`: number of words
`tag`: tag for key-value pair extractions, indicating keys or values based on simple heuristics
`box`: box extracted by pdfminer Layout Analysis
`in_element_ids`: contains IDs of surrounding visual elements such as rectangles or lists. They are stored as a list [left, right, top, bottom]. -1 indicates that there is no adjacent visual element.
`in_element`: indicates, based on `in_element_ids`, whether an element is stored in a visual rectangle representation (stored as “rectangle”) or not (stored as “none”).
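As an example of working with these node attributes, the sketch below collects the text of all nodes that lie inside a visual rectangle. The helper name `texts_in_rectangles` is hypothetical; it expects `(node_id, attrs)` pairs, which is what `graph.nodes(data=True)` yields for a `networkx` graph, and relies on the documented `in_element` values "rectangle" and "none".

```python
def texts_in_rectangles(nodes):
    """Return the `text` of every node whose `in_element` is "rectangle".

    `nodes` is an iterable of (node_id, attribute_dict) pairs,
    for example `graph.nodes(data=True)`.
    """
    return [
        attrs.get("text", "")
        for _node_id, attrs in nodes
        if attrs.get("in_element") == "rectangle"
    ]
```

Filtering on other attributes, such as `bold` or `font_size`, would follow the same pattern.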
The media boxes possess the following entries in a dictionary:
`x0`: Left x page crop box coordinate
`x1`: Right x page crop box coordinate
`y0`: Top y page crop box coordinate
`y1`: Bottom y page crop box coordinate
`x0page`: Left x page coordinate
`x1page`: Right x page coordinate
`y0page`: Top y page coordinate
`y1page`: Bottom y page coordinate
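A page size can be derived from these entries, as in the following sketch. It assumes that a single media box is a dictionary with the keys listed above; the helper name `page_size` is hypothetical, and absolute differences are used because the orientation of the y axis is not specified here.

```python
def page_size(media_box):
    """Return (width, height) of a page from its media box dictionary."""
    width = abs(media_box["x1page"] - media_box["x0page"])
    height = abs(media_box["y1page"] - media_box["y0page"])
    return width, height
```

Such a page size is one way to turn absolute distances in points into the relative distances that `page_ratio_x` and `page_ratio_y` refer to.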
# Future Work
The `GraphConverter` will be extended with OCR processing for images in order to support unstructured document types beyond PDFs.
# Acknowledgements
Example PDFs are obtained from the ICDAR Table Recognition Challenge 2013 https://roundtrippdf.com/en/data-extraction/pdf-table-recognition-dataset/.