pdftotree

Parse PDFs into HTML-like trees.

Project description

Fonduer has been successfully extended to perform information extraction from richly formatted data such as tables. A crucial step in this process is the construction of the hierarchical tree of context objects such as text blocks, figures, tables, etc. The system currently uses PDF to HTML conversion provided by Adobe Acrobat. However, Adobe Acrobat is not an open source tool, which may be inconvenient for Fonduer users.

This package is the result of building our own module as a replacement for Adobe Acrobat. Several open source tools are available for PDF-to-HTML conversion, but these tools do not preserve the cell structure of a table. Our goal in this project is to develop a tool that extracts text, figures, and tables from a PDF document and maintains the structure of the document using a tree data structure.

Dependencies

sudo apt-get install python3-tk

Installation

To install this package from PyPI:

pip install pdftotree

Usage

pdftotree as a Python package

import pdftotree

pdftotree.parse(pdf_file, html_path=None, model_type=None, model_path=None, favor_figures=True, visualize=False)
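As a rough illustration of how the call above might be used on several files, here is a minimal sketch. The helper name batch_parse and the output naming scheme are hypothetical. It also assumes that parse writes the HTML to html_path when one is given, which the signature suggests but this page does not state. The parse function is passed in as an argument so the helper can be tried with a stand-in.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def batch_parse(
    pdf_files: Iterable[str],
    out_dir: str,
    parse: Callable[..., object],
) -> List[Path]:
    """Call `parse` once per PDF, targeting <out_dir>/<pdf stem>.html.

    `parse` is expected to accept the keyword arguments of the
    pdftotree.parse signature shown above (hypothetical helper).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    targets = []
    for pdf_file in pdf_files:
        html_path = out / (Path(pdf_file).stem + ".html")
        # model_type=None selects the heuristics approach per the CLI help.
        parse(
            pdf_file,
            html_path=str(html_path),
            model_type=None,
            favor_figures=True,
            visualize=False,
        )
        targets.append(html_path)
    return targets


# With pdftotree installed, the real function could be passed in:
#   import pdftotree
#   batch_parse(["paper.pdf"], "out", pdftotree.parse)
```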

pdftotree

This is the primary command-line utility provided with this Python package. It takes a PDF file as input and produces an HTML-like representation of the data.

usage: pdftotree [options] pdf_file

Script to extract tree structure from PDF files. Takes a PDF as input and
outputs an HTML-like representation of the document's structure. By default,
this conversion is done using heuristics. However, a model can be provided as
a parameter to use a machine-learning-based approach.

positional arguments:
  pdf_file              PDF file name for which tree structure needs to be
                        extracted

optional arguments:
  -h, --help            show this help message and exit
  -mt {vision,ml,None}, --model_type {vision,ml,None}
                        Model type to use. None (default) for heuristics
                        approach.
  -m MODEL_PATH, --model_path MODEL_PATH
                        Pretrained model, generated by extract_tables tool
  -o OUTPUT, --output OUTPUT
                        Path where tree structure should be saved. If none,
                        HTML is printed to stdout.
  -f FAVOR_FIGURES, --favor_figures FAVOR_FIGURES
                        Whether figures must be favored over other parts such
                        as tables and section headers
  -V, --visualize       Whether to output visualization images for the tree
  -d, --dry-run         Run pdftotree, but do not save any output or print to
                        console.
  -v, --verbose         Output INFO level logging.
  -vv, --veryverbose    Output DEBUG level logging.
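When driving the tool from a script, it can help to assemble the argument list programmatically. The sketch below uses only the long options listed in the help text above. The function name build_pdftotree_command is hypothetical, and -f is left out because the help text does not say which values FAVOR_FIGURES accepts.

```python
import shlex
from typing import List, Optional


def build_pdftotree_command(
    pdf_file: str,
    output: Optional[str] = None,
    model_type: Optional[str] = None,
    model_path: Optional[str] = None,
    visualize: bool = False,
    verbose: bool = False,
) -> List[str]:
    """Build an argv list for the pdftotree command (hypothetical helper)."""
    if model_type not in (None, "vision", "ml"):
        raise ValueError("model_type must be 'vision', 'ml', or None")
    cmd = ["pdftotree"]
    if model_type is not None:
        cmd += ["--model_type", model_type]
    if model_path is not None:
        cmd += ["--model_path", model_path]
    if output is not None:
        cmd += ["--output", output]
    if visualize:
        cmd.append("--visualize")
    if verbose:
        cmd.append("--verbose")
    cmd.append(pdf_file)  # positional argument goes last
    return cmd


cmd = build_pdftotree_command("paper.pdf", output="paper.html", verbose=True)
print(shlex.join(cmd))  # pdftotree --output paper.html --verbose paper.pdf
```

The resulting list can be handed to subprocess.run(cmd, check=True) once pdftotree is installed.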

extract_tables

usage: extract_tables [-h] [--mode MODE] --model-path MODEL_PATH
                      [--train-pdf TRAIN_PDF] --test-pdf TEST_PDF
                      [--gt-train GT_TRAIN] --gt-test GT_TEST --datapath
                      DATAPATH [--iou-thresh IOU_THRESH] [-v] [-vv]

Script to extract tables bounding boxes from PDF files using machine learning.
If `model.pkl` is saved in the model-path, the pickled model will be used for
prediction. Otherwise the model will be retrained. If --mode is test (by
default), the script will create a .bbox file containing the tables for the
pdf documents listed in the file --test-pdf. If --mode is dev, the script will
also extract ground truth labels for the test data and compute statistics.

optional arguments:
  -h, --help            show this help message and exit
  --mode MODE           Usage mode dev or test, default is test
  --model-path MODEL_PATH
                        Path to the model. If the file exists, it will be
                        used. Otherwise, a new model will be trained.
  --train-pdf TRAIN_PDF
                        List of pdf file names used for training. These files
                        must be saved in the --datapath directory. Required if
                        no pretrained model is provided.
  --test-pdf TEST_PDF   List of pdf file names used for testing. These files
                        must be saved in the --datapath directory.
  --gt-train GT_TRAIN   Ground truth train tables. Required if no pretrained
                        model is provided.
  --gt-test GT_TEST     Ground truth test tables.
  --datapath DATAPATH   Path to directory containing the input documents.
  --iou-thresh IOU_THRESH
                        Intersection over union threshold to remove duplicate
                        tables
  -v                    Output INFO level logging
  -vv                   Output DEBUG level logging
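The --iou-thresh option refers to intersection over union (IoU). For readers unfamiliar with the metric, here is a minimal sketch of IoU for two axis-aligned boxes given as (top, left, bottom, right). This illustrates the metric only. How extract_tables applies the threshold when removing duplicate tables is not described here, and the function name iou is an assumption.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (top, left, bottom, right)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Edges of the overlapping rectangle, if any.
    top = max(a[0], b[0])
    left = max(a[1], b[1])
    bottom = min(a[2], b[2])
    right = min(a[3], b[3])
    intersection = max(0.0, bottom - top) * max(0.0, right - left)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Two 10x10 boxes overlapping on half their width: 50 / 150.
print(iou((0, 0, 10, 10), (0, 5, 10, 15)))
```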

PDF List Format

The PDF list is a plain-text file with a single filename on each line. For example:

1-s2.0-S000925411100369X-main.pdf
1-s2.0-S0009254115301030-main.pdf
1-s2.0-S0012821X12005717-main.pdf
1-s2.0-S0012821X15007487-main.pdf
1-s2.0-S0016699515000601-main.pdf
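A file in this format can be read with a few lines of Python. The sketch below is illustrative only: read_pdf_list is a hypothetical helper, not part of pdftotree, and skipping blank lines is an assumption about how such lists should be treated.

```python
from typing import List


def read_pdf_list(path: str) -> List[str]:
    """Return the filenames in a PDF list file, one per line.

    Blank lines are skipped (an assumption, not documented behavior).
    """
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```

The returned names are relative to the directory passed as --datapath, so a caller would typically join each one with that directory before opening the file.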

Ground Truth File Format

The ground truth is formatted to mirror the PDF list. That is, the first line of the ground truth file provides the labels for the first document in the corresponding PDF list. Labels take the form of semicolon-separated tuples containing the values (page_num, page_width, page_height, top, left, bottom, right). For example:

(10, 696, 951, 634, 366, 832, 653);(14, 696, 951, 720, 62, 819, 654);(4, 696, 951, 152, 66, 813, 654);(7, 696, 951, 415, 57, 833, 647);(8, 696, 951, 163, 370, 563, 652)
(11, 713, 951, 97, 47, 204, 676);(11, 713, 951, 261, 45, 357, 673);(3, 713, 951, 110, 44, 355, 676);(8, 713, 951, 763, 55, 903, 687)
(5, 672, 951, 88, 57, 203, 578);(5, 672, 951, 593, 60, 696, 579)
(5, 718, 951, 131, 382, 403, 677)
(13, 713, 951, 119, 56, 175, 364);(13, 713, 951, 844, 57, 902, 363);(14, 713, 951, 109, 365, 164, 671);(8, 713, 951, 663, 46, 890, 672)

One method to label these tables is to use DocumentAnnotation, which allows you to select table regions in your web browser and produces the bounding box file.
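To make the label format concrete, here is a minimal sketch that parses one ground truth line into named tuples. The names TableBox and parse_ground_truth_line are hypothetical and not part of pdftotree. Treating an empty line as "no tables" is also an assumption.

```python
import ast
from typing import List, NamedTuple


class TableBox(NamedTuple):
    page_num: int
    page_width: int
    page_height: int
    top: int
    left: int
    bottom: int
    right: int


def parse_ground_truth_line(line: str) -> List[TableBox]:
    """Parse one line of semicolon-separated 7-value tuples."""
    boxes = []
    for chunk in line.strip().split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # assumption: an empty line means no labeled tables
        values = ast.literal_eval(chunk)  # safely evaluates "(5, 672, ...)"
        if not isinstance(values, tuple) or len(values) != 7:
            raise ValueError(f"expected a 7-value tuple, got: {chunk!r}")
        boxes.append(TableBox(*values))
    return boxes


line = "(5, 672, 951, 88, 57, 203, 578);(5, 672, 951, 593, 60, 696, 579)"
boxes = parse_ground_truth_line(line)
print(len(boxes), boxes[0].page_num, boxes[1].top)  # 2 5 593
```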

Example Dataset: Paleontological Papers

A full set of documents and ground truth labels can be downloaded here. To train a machine-learning model that extracts table regions, download this dataset, extract it into a directory named data, and run the command below. Double-check that the paths in the command match the location of the downloaded data.

extract_tables --train-pdf data/paleo/ml/train.pdf.list.paleo.not.scanned --gt-train data/paleo/ml/gt.train --test-pdf data/paleo/ml/test.pdf.list.paleo.not.scanned --gt-test data/paleo/ml/gt.test --datapath data/paleo/documents/ --model-path data/model.pkl

This example command saves the resulting model as data/model.pkl.

For Developers

We are following Semantic Versioning 2.0.0 conventions. The maintainers will create a git tag for each release and increment the version number found in pdftotree/_version.py accordingly. We deploy tags to PyPI automatically using Travis-CI.

To install locally, you’ll need to install pandoc:

sudo apt-get install pandoc

which is used to create the reStructuredText file that the package expects.

Tests

To test changes in the package, install it in editable mode locally in your virtualenv by running:

make dev

Then run the tests:

make test

Release history

0.5.0 (Oct 13, 2020)
0.5.0rc1 (Oct 13, 2020, pre-release)
0.4.1 (Sep 21, 2020)
0.4.0 (Jul 26, 2018)
0.3.1 (Mar 20, 2018)
0.3.0 (Mar 16, 2018, this version)
0.2.15 (Mar 14, 2018)
0.2.14 (Mar 14, 2018)
0.2.13 (Mar 13, 2018)
0.2.12 (Mar 13, 2018)
0.2.11 (Mar 12, 2018)
0.2.10 (Mar 12, 2018)
0.2.9 (Mar 7, 2018)
0.2.8 (Mar 7, 2018)
0.2.7 (Mar 6, 2018)
0.2.6 (Mar 6, 2018)
0.2.5 (Mar 6, 2018)
0.2.4 (Mar 5, 2018)
0.2.3 (Mar 2, 2018)
0.2.2 (Mar 2, 2018)
0.2.1 (Mar 2, 2018)

Download files

Source Distribution

pdftotree-0.3.0.tar.gz (45.4 kB), uploaded Mar 16, 2018, Source

Built Distributions

pdftotree-0.3.0-py3.6.egg (113.3 kB), uploaded Mar 16, 2018, Egg
pdftotree-0.3.0-py3-none-any.whl (57.9 kB), uploaded Mar 16, 2018, Python 3

File details

Details for the file pdftotree-0.3.0.tar.gz.

File metadata

Download URL: pdftotree-0.3.0.tar.gz
Upload date: Mar 16, 2018
Size: 45.4 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for pdftotree-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`a26fa6e2420b6715f5e8b76b54566bec792b5d31a521578e1666ccb6dd05c25d`
MD5	`5f556547b9749401d444d53f83e23778`
BLAKE2b-256	`7bb71671f4172862e0e54b3353b0534b92af7dfb8b638b06f4c0abc640d424a5`
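A downloaded file can be checked against the published SHA256 digest with the standard library. This is a minimal sketch: the function names are hypothetical, and the expected value is the SHA256 digest listed above for pdftotree-0.3.0.tar.gz.

```python
import hashlib

# SHA256 digest listed above for pdftotree-0.3.0.tar.gz.
EXPECTED_SHA256 = "a26fa6e2420b6715f5e8b76b54566bec792b5d31a521578e1666ccb6dd05c25d"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```

For example, matches_published_hash("pdftotree-0.3.0.tar.gz") should return True for an intact download of the source distribution.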

File details

Details for the file pdftotree-0.3.0-py3.6.egg.

File metadata

Download URL: pdftotree-0.3.0-py3.6.egg
Upload date: Mar 16, 2018
Size: 113.3 kB
Tags: Egg
Uploaded using Trusted Publishing? No

File hashes

Hashes for pdftotree-0.3.0-py3.6.egg
Algorithm	Hash digest
SHA256	`750bf5e49033853437112332c501ded7b96b6f47ce798c87751bc8ca16544ca3`
MD5	`3b564a1cf158fa8e42d1350bbc52dc4b`
BLAKE2b-256	`e4c0c1a8ca47090c7162a1c8e97d308acf7119ce6c4dc6a06a06621e8917e723`

File details

Details for the file pdftotree-0.3.0-py3-none-any.whl.

File metadata

Download URL: pdftotree-0.3.0-py3-none-any.whl
Upload date: Mar 16, 2018
Size: 57.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No

File hashes

Hashes for pdftotree-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3edff69794ab2ec0215052070d2bb4791715489f7e43c2581483d5864347ffaa`
MD5	`70cf8bcb55bb0cc5e46ff2aaceebedd4`
BLAKE2b-256	`5f66a2698f6be59ab6e6a38cd502c68a0865cf8b57c44c30955e9ebb392a2d8d`
