Post processor for Document AI

Project description

post-processor-pi is a Python module that takes the output of Document AI (a tool available on GCP) and creates a JSON file that stores the text of a document in an organized way, so that it reflects the document's original structure. To work with different types of documents that have different structures, post-processor-pi has to be configured using the GUI that I developed.

Installation

The easiest way to install post-processor-pi is using pip:

pip install post-processor-pi

Documentation

As mentioned above, the post-processor has to be configured before it can be used. This is done with the Configuration GUI that I developed.

Configuration GUI

To start the Configuration GUI we need one example of the Document AI output for a single document. The code below starts the Configuration GUI.

from dai_post_processor import post_processor as pp

# Open one example from Document AI output
path = "your/path/to/DocumentAI/example"
doc_ai = pp.open_doc_ai(path)

# Start Configuration GUI
mywin = pp.configGUI(doc_ai)
mywin.start()

After running the code above, the GUI should open.

Configuration GUI Tuning

The Configuration GUI consists of two windows:

  1. Main: here you can find the essential instructions that are needed to configure post-processor-pi.

    • Select text of interest: in this section you can select the portion of the pages you are interested in. In the Page number field you can enter the page number of the example document that you want to open. If you press the Draw Box button a new window opens, where you can select a portion of the page by drawing a rectangle. All the text outside this rectangle will be ignored for all pages and documents. Once you have made the selection you can close the interactive window.

    • Structuring Filter: in this section you can specify the first- and second-level points that give structure to the document. In the Main Structure field you need to specify the regex rule that finds all the first-level points. If you press the Add Point button, an additional field appears where you can specify the regex rule for the second-level points.

    • Output: in this section you can specify what to include in the JSON output of the Post Processor. You can choose to include the list of all lines and paragraphs in the document and the structured text. You can also choose to ignore those lines that are recognised as Headers with the Filter Headers checkbox.

  2. Advanced: here you can perform some more advanced configurations:

    • Paragraph Multiplier: a line is appended to the previous line to form a paragraph when the vertical gap between the two lines is lower than the median vertical gap of all lines in the document multiplied by the Paragraph Multiplier.
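
The Paragraph Multiplier and Structuring Filter rules described above can be illustrated with a small standalone sketch. This is not the code used inside post-processor-pi: the function names (merge_paragraphs, structure), the (text, top_y) input format, the output dictionary layout and the example regex rules are all assumptions made for illustration only.

```python
import re
from statistics import median


def merge_paragraphs(lines, multiplier):
    """Group lines into paragraphs (hypothetical helper, not the package API).

    `lines` is an assumed format: (text, top_y) tuples sorted top to bottom.
    A line is appended to the previous one when its vertical gap is lower
    than median(all gaps) * multiplier.
    """
    if not lines:
        return []
    gaps = [cur[1] - prev[1] for prev, cur in zip(lines, lines[1:])]
    if not gaps:
        return [lines[0][0]]
    threshold = median(gaps) * multiplier
    paragraphs = [lines[0][0]]
    for (text, _), gap in zip(lines[1:], gaps):
        if gap < threshold:
            paragraphs[-1] += " " + text   # same paragraph
        else:
            paragraphs.append(text)        # large gap: new paragraph
    return paragraphs


def structure(paragraphs, first_level, second_level=None):
    """Nest paragraphs under first- and second-level points (hypothetical).

    `first_level` and `second_level` are regex rules like the ones entered
    in the Main Structure and Add Point fields. Text that appears before
    the first matching point is dropped in this sketch.
    """
    first = re.compile(first_level)
    second = re.compile(second_level) if second_level else None
    sections = []
    for p in paragraphs:
        if second and sections and second.match(p):
            sections[-1]["points"].append({"title": p, "text": []})
        elif first.match(p):
            sections.append({"title": p, "text": [], "points": []})
        elif sections:
            points = sections[-1]["points"]
            target = points[-1] if points else sections[-1]
            target["text"].append(p)
    return sections


# Example with made-up line positions and made-up regex rules.
lines = [("a", 0), ("b", 10), ("c", 20), ("d", 45), ("e", 55)]
merged = merge_paragraphs(lines, multiplier=1.5)

doc = ["1. Intro", "Some text", "1.1 Scope", "Scope text", "2. Methods"]
sections = structure(doc, r"^\d+\.\s", r"^\d+\.\d+\s")
```

In this example the median gap is 10, so with a multiplier of 1.5 only the gap of 25 starts a new paragraph. A larger multiplier merges more lines into each paragraph, and a smaller one splits the text more aggressively.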

