Amazon Textract package for easier access to data through geometric information

Textract-Pipeline-GeoFinder

Provides functions that use geometric information to extract data from Amazon Textract responses.

Use cases include:

  • Give context to key/value pairs from the Amazon Textract AnalyzeDocument API for FORMS
  • Find values in specific areas

Install

> python -m pip install amazon-textract-geofinder

Make sure your environment is set up with AWS credentials through configuration files, environment variables, or an attached role (https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html).
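
As a quick sanity check before calling Textract, the sketch below looks for the usual credential sources using only the standard library. The function name find_aws_credential_hints is hypothetical and not part of this package. It only inspects well-known environment variables and the default shared credentials file, so it cannot detect an attached role and should be read as a rough hint rather than a full credential check.

```python
import os
from pathlib import Path


def find_aws_credential_hints(env=None, home=None):
    """Return a list of likely credential sources (hypothetical helper).

    Only checks environment variables and the shared credentials file.
    An attached IAM role is NOT detectable this way.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    hints = []

    # Static keys passed through environment variables
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        hints.append("environment variables")

    # A named profile selected through AWS_PROFILE
    if env.get("AWS_PROFILE"):
        hints.append("named profile")

    # Shared credentials file, default location ~/.aws/credentials
    cred_file = Path(env.get("AWS_SHARED_CREDENTIALS_FILE",
                             home / ".aws" / "credentials"))
    if cred_file.is_file():
        hints.append("shared credentials file")

    return hints


if __name__ == "__main__":
    found = find_aws_credential_hints()
    print("Possible credential sources:", found or "none found (an attached role may still apply)")
```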

Concept

The main advantage of this library over hard-coding the x,y coordinates where a value is expected is the concept of an area.

An area is ultimately defined by a box with x_min, y_min, x_max, y_max coordinates, but those coordinates can be derived by finding words or phrases in the document and using their positions to create the area.

From there, functions that parse the content of the area help to extract the information. For example, by defining the area based on a question like 'Did you feel fever or feverish lately?' we can associate the answers with it and create a new key/value pair specific to this question.
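
The idea of an area can be illustrated without the library. The following is a minimal sketch in plain Python, assuming a 1000 x 1000 page and made-up positions. Box and area_between are hypothetical names for illustration only, not the library's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box; y grows downwards as in image coordinates."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, other: "Box") -> bool:
        # True when `other` lies completely inside this box
        return (self.x_min <= other.x_min and self.y_min <= other.y_min
                and self.x_max >= other.x_max and self.y_max >= other.y_max)


def area_between(upper_anchor: Box, lower_anchor: Box, page_width: float) -> Box:
    """Full-width area that starts below one phrase and ends above another."""
    return Box(0, upper_anchor.y_max, page_width, lower_anchor.y_min)


# Hypothetical positions of two phrases found on the page
header = Box(40, 100, 400, 130)
next_header = Box(40, 420, 400, 450)

# Hypothetical positions of two form fields
field_inside = Box(60, 200, 300, 220)
field_outside = Box(60, 500, 300, 520)

area = area_between(header, next_header, page_width=1000)
print(area.contains(field_inside))   # True
print(area.contains(field_outside))  # False
```

The area is derived from where the two phrases were found rather than from fixed coordinates, so it keeps working when the layout shifts slightly between scans.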

Samples

Get context for key value pairs

The response for the sample image (./tests/data/patient_intake_form_sample.jpg) from the Amazon Textract AnalyzeDocument API with the FORMS feature includes the following keys:

|----------------------------------------------|----------------|
| Key                                          | Value          |
| First Name:                                  | ALEJANDRO      |
| First Name:                                  | CARLOS         |
| Relationship to Patient:                     | BROTHER        |
| First Name:                                  | JANE           |
| Marital Status:                              | MARRIED        |
| Phone:                                       | 646-555-0111   |
| Last Name:                                   | SALAZAR        |
| Phone:                                       | 212-555-0150   |
| Relationship to Patient:                     | FRIEND         |
| Last Name:                                   | ROSALEZ        |
| City:                                        | ANYTOWN        |
| Phone:                                       | 650-555-0123   |
| Address:                                     | 123 ANY STREET |
| Yes                                          | SELECTED       |
| Yes                                          | NOT_SELECTED   |
| Date of Birth:                               | 10/10/1982     |
| Last Name:                                   | DOE            |
| Sex:                                         | M              |
| Yes                                          | NOT_SELECTED   |
| Yes                                          | NOT_SELECTED   |
| Yes                                          | NOT_SELECTED   |
| State:                                       | CA             |
| Zip Code:                                    | 12345          |
| Email Address:                               |                |
| No                                           | NOT_SELECTED   |
| No                                           | SELECTED       |
| No                                           | NOT_SELECTED   |
| Yes                                          | SELECTED       |
| No                                           | SELECTED       |
| No                                           | SELECTED       |
| No                                           | SELECTED       |

However, it is not obvious which section of the document each key belongs to. Most keys appear multiple times, and we want to give them context by associating them with the 'Patient', 'Emergency Contact 1', 'Emergency Contact 2' or specific questions.
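
To make the goal concrete, here is a small library-independent sketch of how repeated keys can be disambiguated by vertical position. The function add_section_context, the y-coordinates, and the placeholder values CONTACT_A and CONTACT_B are all assumptions for illustration. Only the patient's first name is taken from the sample output.

```python
from bisect import bisect_right


def add_section_context(sections, fields):
    """Prefix each key with the name of the section it falls under.

    sections: list of (name, y_top) tuples sorted by y_top
    fields:   list of (key, value, y) tuples
    """
    tops = [y_top for _, y_top in sections]
    result = []
    for key, value, y in fields:
        # Index of the last section header that starts at or above this field
        idx = bisect_right(tops, y) - 1
        if idx < 0:
            continue  # field sits above the first section header
        result.append((f"{sections[idx][0]}_{key}", value))
    return result


# Hypothetical section headers and their y-positions on a 1000-unit page
sections = [("PATIENT", 100), ("EMERGENCY_CONTACT_1", 400), ("EMERGENCY_CONTACT_2", 550)]

# The same key three times; positions and contact values are made up
fields = [
    ("First Name:", "ALEJANDRO", 150),
    ("First Name:", "CONTACT_A", 430),
    ("First Name:", "CONTACT_B", 580),
]

for key, value in add_section_context(sections, fields):
    print(key, "->", value)
```

With these inputs each 'First Name:' key receives a different prefix, which is the effect the sample in this section achieves with areas built from phrases found on the page.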

Here is a Jupyter notebook that walks through the sample: sample notebook

# Install the dependencies first (run in a shell, not in Python):
#   python -m pip install amazon-textract-helper amazon-textract-geofinder
from textractgeofinder.ocrdb import AreaSelection
from textractgeofinder.tgeofinder import KeyValue, TGeoFinder, SelectionElement
from textractprettyprinter.t_pretty_print import get_forms_string
from textractcaller import call_textract
from textractcaller.t_call import Textract_Features

import trp.trp2 as t2

image_filename='./tests/data/patient_intake_form_sample.jpg'

j = call_textract(input_document=image_filename, features=[Textract_Features.FORMS])


t_document = t2.TDocumentSchema().load(j)
doc_height = 1000
doc_width = 1000
geofinder_doc = TGeoFinder(j, doc_height=doc_height, doc_width=doc_width)

def set_hierarchy_kv(list_kv: list[KeyValue], t_document: t2.TDocument, page_block: t2.TBlock, prefix="BORROWER"):
    for x in list_kv:
        t_document.add_virtual_key_for_existing_key(key_name=f"{prefix}_{x.key.text}",
                                                    existing_key=t_document.get_block_by_id(x.key.id),
                                                    page_block=page_block)
# patient information
patient_information = geofinder_doc.find_phrase_on_page("patient information")[0]
emergency_contact_1 = geofinder_doc.find_phrase_on_page("emergency contact 1:", min_textdistance=0.99)[0]
top_left = t2.TPoint(y=patient_information.ymax, x=0)
lower_right = t2.TPoint(y=emergency_contact_1.ymin, x=doc_width)
form_fields = geofinder_doc.get_form_fields_in_area(
    area_selection=AreaSelection(top_left=top_left, lower_right=lower_right))
set_hierarchy_kv(list_kv=form_fields, t_document=t_document, prefix='PATIENT', page_block=t_document.pages[0])

print(get_forms_string(t2.TDocumentSchema().dump(t_document)))

|----------------------------------------------|----------------|
| Key                                          | Value          |
| PATIENT_first name:                          | ALEJANDRO      |
| PATIENT_address:                             | 123 ANY STREET |
| PATIENT_sex:                                 | M              |
| PATIENT_state:                               | CA             |
| PATIENT_zip code:                            | 12345          |
| PATIENT_marital status:                      | MARRIED        |
| PATIENT_last name:                           | ROSALEZ        |
| PATIENT_phone:                               | 646-555-0111   |
| PATIENT_email address:                       |                |
| PATIENT_city:                                | ANYTOWN        |
| PATIENT_date of birth:                       | 10/10/1982     |

Using the Amazon Textract Helper command line tool with the sample

This will show the full result, like the notebook.

> python -m pip install amazon-textract-helper amazon-textract-geofinder
> cat tests/data/patient_intake_form_sample.json | bin/amazon-textract-geofinder | amazon-textract --stdin --pretty-print FORMS
