
Reconstruct the original continuous text from PDFs with language models


pd3f-core

Experimental, use with care.

pd3f-core is a Python package to reconstruct the original continuous text from PDFs with language models. pd3f-core assumes your PDF is either text-based or already OCR'd. pd3f-core is at the heart of pd3f: a full Docker-based text extraction pipeline (including OCR).

pd3f-core first uses Parsr to chunk PDFs into lines and paragraphs. Then, it uses the Python package dehyphen to reconstruct the paragraphs in the most probable way. The probability is derived by calculating the perplexity with Flair's character-based language models. Unnecessary hyphens are removed, and spaces or new lines are kept or dropped depending on the surrounding words.
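To make the idea of picking the most probable join concrete, here is a minimal sketch. It does not use dehyphen or Flair: a toy character-bigram model with add-one smoothing stands in for Flair's language models, and the helper names (`train_char_bigrams`, `perplexity`, `join_candidates`) as well as the tiny training corpus are assumptions made for illustration only.

```python
import math
from collections import Counter


def train_char_bigrams(corpus):
    """Count character bigrams and how often each character starts one."""
    pairs = Counter(zip(corpus, corpus[1:]))
    contexts = Counter(corpus[:-1])
    return pairs, contexts, set(corpus)


def perplexity(text, model):
    """Per-character perplexity under the toy model (lower = more probable)."""
    pairs, contexts, vocab = model
    slots = len(vocab) + 1  # one extra slot for characters never seen in training
    log_prob = 0.0
    for left, right in zip(text, text[1:]):
        # Add-one smoothing keeps unseen bigrams from getting zero probability.
        log_prob += math.log((pairs[(left, right)] + 1) / (contexts[left] + slots))
    return math.exp(-log_prob / max(len(text) - 1, 1))


def join_candidates(first_line, second_line):
    """Hypothetical ways to join a line ending with a hyphen to the next line."""
    stem = first_line[:-1]
    return [
        stem + second_line,              # drop the hyphen
        first_line + second_line,        # keep the hyphen
        first_line + " " + second_line,  # keep the hyphen and add a space
    ]


corpus = ("the kindergarten is open today and the children play "
          "in the garden of the kindergarten")
model = train_char_bigrams(corpus)

candidates = join_candidates("children go to the kinder-", "garten every day")
best = min(candidates, key=lambda text: perplexity(text, model))
print(best)
```

The candidate with the lowest perplexity is kept. A real language model is trained on far more text, so its scores are much more reliable than those of this toy model.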

It's mainly developed for German but should work with other languages as well. The project is still in an early stage. Expect rough edges and rapid changes.

Documentation

API Documentation of pd3f-core: https://pd3f.github.io/pd3f-core/index.html

Documentation of pd3f: https://pd3f.com/docs/

Features

Dehyphenation of Lines

Check if two lines can be joined by removing hyphens ('-').

Reasonable Joining of Lines

Decide between adding a simple space (' ') or a new line ('\n') when joining lines.

Reverse Page Break (Experimental)

Check if the last paragraph of a page and the first paragraph of the following page can be joined.

Footnote to Endnotes (Experimental)

In order to join paragraphs (and reverse page breaks), detect footnotes and turn them into endnotes. For now, the footnotes are pulled to the end of a file.
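The effect of turning footnotes into endnotes can be illustrated with a small sketch. It assumes that each page is already a list of (kind, text) pairs with footnotes labeled as such. This data layout and the function name are hypothetical and are not pd3f-core's internal representation.

```python
def footnotes_to_endnotes(pages):
    """Return body paragraphs in reading order, followed by all footnotes."""
    body, endnotes = [], []
    for page in pages:
        for kind, text in page:
            # Footnotes are collected separately so they no longer sit
            # between the last paragraph of a page and the first of the next.
            (endnotes if kind == "footnote" else body).append(text)
    return body + endnotes


pages = [
    [("paragraph", "The report was published"), ("footnote", "1 See appendix.")],
    [("paragraph", "in the spring of 2020."), ("footnote", "2 Ibid.")],
]
result = footnotes_to_endnotes(pages)
print(result)
```

With the footnotes out of the way, the two paragraphs around the page break become neighbors and can be checked for joining.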

Deduplication of Page Header / Footer (Experimental)

If the header or the footer is the same on all pages, it is only displayed once. Headers are pulled to the start of the document and footers to the end. Heuristics based on the similarity of footers are used (Jaccard distance for text, and a comparison of overlapping shapes).
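As an illustration of the text part of this heuristic, the following sketch computes the Jaccard distance between footers on word sets. Tokenizing by whitespace, the function names and the threshold of 0.4 are assumptions for this example, not the values used by pd3f-core.

```python
def jaccard_distance(text_a, text_b):
    """1 - |A & B| / |A | B| over the sets of lowercase words."""
    words_a, words_b = set(text_a.lower().split()), set(text_b.lower().split())
    if not words_a and not words_b:
        return 0.0  # two empty texts are treated as identical
    return 1.0 - len(words_a & words_b) / len(words_a | words_b)


def looks_repeated(texts, max_distance=0.4):
    """True if every text is close to the first one (threshold is an assumption)."""
    return all(jaccard_distance(texts[0], other) <= max_distance for other in texts[1:])


footers = [
    "Annual Report 2020 Page 1",
    "Annual Report 2020 Page 2",
    "Annual Report 2020 Page 3",
]
print(jaccard_distance(footers[0], footers[1]))
print(looks_repeated(footers))
```

Footers that differ only in the page number share most of their words, so their distance stays small even though the strings are not identical.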

Installation

pip install pd3f

or

poetry add pd3f

Usage

Start a local Parsr instance:

docker-compose up

(You may also tunnel to a remote Parsr instance (script) or choose a remote address.)

from pd3f import extract

text, tables = extract(file_path, tables=False, experimental=False, force_gpu=False, lang="multi", fast=False, parsr_location="localhost:3001")

Explanations of the parameters are in the docs: https://pd3f.github.io/pd3f-core/export.html#pd3f.export.extract
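Since extract needs a running Parsr instance, it can help to check the connection first. The sketch below is one possible wrapper: the helper names (`split_location`, `parsr_reachable`, `extract_to_file`) are hypothetical, and it assumes that the returned text is a plain string and that parsr_location has the form "host:port".

```python
import socket
from typing import Tuple


def split_location(parsr_location: str) -> Tuple[str, int]:
    """Split "host:port" into its two parts."""
    host, _, port = parsr_location.rpartition(":")
    return host, int(port)


def parsr_reachable(parsr_location: str = "localhost:3001", timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the Parsr address can be opened."""
    try:
        with socket.create_connection(split_location(parsr_location), timeout=timeout):
            return True
    except OSError:
        return False


def extract_to_file(pdf_path: str, out_path: str, parsr_location: str = "localhost:3001") -> str:
    """Run the extraction and write the text to out_path."""
    if not parsr_reachable(parsr_location):
        raise RuntimeError(f"No Parsr instance reachable at {parsr_location}")
    # Imported here so the helpers above also work without pd3f installed.
    from pd3f import extract

    text, _tables = extract(pdf_path, parsr_location=parsr_location)
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write(text)  # assumes text is a str
    return text
```

A call such as `extract_to_file("report.pdf", "report.txt")` would then fail early with a clear message if Parsr has not been started.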

GPU Support (CUDA)

Using CUDA speeds up the evaluation with Flair, but you need an (expensive) GPU and have to set it up with CUDA. Here is a guide for Ubuntu 18.04.

  1. Install conda (via miniconda) and poetry
  2. Create a new conda environment & activate it
  3. Install PyTorch with CUDA: conda install pytorch torchvision cudatoolkit=10.2 -c pytorch (example)
  4. Install pd3f-core with poetry: poetry add pd3f

Poetry detects that it is running inside a conda environment, so it doesn't create a new virtual environment. Since setting up CUDA is hard, install it the easiest way (with conda).
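After the installation, a quick check can confirm that PyTorch actually sees the GPU. This is a minimal sketch; the function name is made up for this example, and it only relies on `torch.cuda.is_available()`.

```python
def cuda_available() -> bool:
    """Return True if PyTorch is installed and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return False
    return torch.cuda.is_available()


print("CUDA available:", cuda_available())
```

The result could, for instance, inform whether to pass force_gpu=True to extract. What that flag does exactly is described in the API documentation.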

Background

Parsr Config

At the heart of pd3f-core is the JSON output of Parsr. Here are some comments on how and why certain settings were chosen. See also Parsr's documentation about the different modules.

Parsr has several modules to classify paragraphs into certain types. It offers list detection as well as heading detection. In my experience, the accuracy of both is too low, so we don't use them right now. This also means that all the extracted (output) text is flat (no headings, no different formatting, etc.).

We enable Drawing + Image Detection because we may need to understand which paragraph follows which. This may be helpful when deciding whether to join paragraphs. But it's dropped when activating the fast setting.

The JSON output contains a field pageNumber. It comes from the page detection module, so pageNumber is derived from the header / footer of each page and may differ from the index in the page array. Don't rely on pageNumber in the JSON output.
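The difference between the two numbers can be shown with a small sketch. It assumes that the pages sit in a top-level "pages" array; the sample data is made up and mimics an excerpt whose printed page numbers start at 41.

```python
def page_positions(parsr_json):
    """Pair each page's position in the array with its reported pageNumber."""
    return [
        (index, page.get("pageNumber"))
        for index, page in enumerate(parsr_json["pages"])
    ]


# Made-up sample: the footers of this excerpt print 41, 42, 43.
parsr_json = {"pages": [{"pageNumber": 41}, {"pageNumber": 42}, {"pageNumber": 43}]}

for index, reported in page_positions(parsr_json):
    print(f"array index {index} -> pageNumber {reported}")
```

Using the array index from `enumerate` keeps the page order stable regardless of what the page detection module reports.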

words-to-line-new has to be used like this. There is no error, but the accuracy decreases if it is used otherwise.

"words-to-line-new",
[
    "reading-order-detection",

Don't do OCR with Parsr because the results are worse than those of OCRmyPDF (because the latter uses image preprocessing).

Future Work / TODO

  • make reverse page break work without requiring the experimental features

Development

Install and use poetry.

License

Affero General Public License 3.0
