Amazon Textract Overlay tools

Textract-Overlayer

amazon-textract-overlayer provides functions to help overlay bounding boxes on documents.

Install

> python -m pip install amazon-textract-overlayer

Make sure your environment is set up with AWS credentials, either through configuration files, environment variables, or an attached role (https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html).
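As a quick sanity check before calling Textract, the sketch below looks for the two most common local credential sources. It is an illustration only: `find_credential_sources` is a hypothetical helper, it cannot detect an attached role, and it ignores overrides such as `AWS_SHARED_CREDENTIALS_FILE`.

```python
import os
from pathlib import Path


def find_credential_sources(env=None, home=None):
    """Return a list of locally visible AWS credential sources (hypothetical helper)."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    sources = []
    # Static keys exported in the environment
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        sources.append("environment variables")
    # A named profile selected through the environment
    if env.get("AWS_PROFILE"):
        sources.append("named profile")
    # Default locations of the shared credentials and config files
    for name in ("credentials", "config"):
        if (home / ".aws" / name).is_file():
            sources.append(f"~/.aws/{name}")
    return sources


if __name__ == "__main__":
    print(find_credential_sources() or "no local credential source found")
```

An empty result does not necessarily mean the call will fail, since a role attached to the compute environment is resolved at request time.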

Samples

The primary function provided is get_bounding_boxes, which returns bounding boxes for the Textract_Types passed in. Its logic is mostly taken from the amazon-textract command in the amazon-textract-helper package.

The following example returns the bounding boxes for the WORD and CELL block types.

from textractoverlayer.t_overlay import DocumentDimensions, get_bounding_boxes
from textractcaller.t_call import Textract_Features, Textract_Types, call_textract

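# input_document is the path or S3 URI of the document to analyze
# CELL blocks are only returned when the TABLES feature is requested
features = [Textract_Features.TABLES]
doc = call_textract(input_document=input_document, features=features)
# image is a PIL.Image.Image of the same document in this case
document_dimension: DocumentDimensions = DocumentDimensions(doc_width=image.size[0], doc_height=image.size[1])
overlay = [Textract_Types.WORD, Textract_Types.CELL]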

bounding_box_list = get_bounding_boxes(textract_json=doc, document_dimensions=document_dimension, overlay_features=overlay)
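Textract reports each block's BoundingBox as Left, Top, Width, and Height values expressed as ratios of the page size, which is why the document dimensions are needed. The sketch below illustrates that scaling under the assumption that coordinates are simply rounded to whole pixels; `PixelBox` and `to_pixel_box` are hypothetical names and not the library's implementation.

```python
from typing import NamedTuple


class PixelBox(NamedTuple):
    xmin: int
    ymin: int
    xmax: int
    ymax: int


def to_pixel_box(bounding_box: dict, doc_width: int, doc_height: int) -> PixelBox:
    """Scale a ratio-based Textract BoundingBox to absolute coordinates."""
    left, top = bounding_box["Left"], bounding_box["Top"]
    right = left + bounding_box["Width"]
    bottom = top + bounding_box["Height"]
    # Rounding to whole pixels is an assumption made for this sketch
    return PixelBox(
        xmin=round(left * doc_width),
        ymin=round(top * doc_height),
        xmax=round(right * doc_width),
        ymax=round(bottom * doc_height),
    )


sample = {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.25}
box = to_pixel_box(sample, doc_width=1000, doc_height=800)
print(box)  # PixelBox(xmin=250, ymin=400, xmax=750, ymax=600)
```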

The actual drawing of bounding boxes on images is implemented in the amazon-textract command from the amazon-textract-helper package and looks like this:

from PIL import Image, ImageDraw

image = Image.open(input_document)
rgb_im = image.convert('RGB')
draw = ImageDraw.Draw(rgb_im)

# check the impl in amazon-textract-helper for ways to associate different colors to types
for bbox in bounding_box_list:
    draw.rectangle(xy=[bbox.xmin, bbox.ymin, bbox.xmax, bbox.ymax], outline=(128, 128, 0), width=2)

rgb_im.show()
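One simple way to give each type its own color is a lookup table with a fallback. The sketch below is not the amazon-textract-helper implementation: `Box` is a stand-in for the returned bounding boxes, the `box_type` attribute name and the palette are assumptions, and types are represented as plain strings for illustration.

```python
from typing import Iterable, List, NamedTuple, Tuple

Color = Tuple[int, int, int]

# Hypothetical palette keyed by block type name
TYPE_COLORS = {"WORD": (128, 128, 0), "CELL": (0, 128, 128)}
DEFAULT_COLOR: Color = (255, 0, 0)


class Box(NamedTuple):
    """Stand-in for a bounding box; box_type is an assumed attribute name."""
    box_type: str
    xmin: int
    ymin: int
    xmax: int
    ymax: int


def rectangles_with_colors(boxes: Iterable[Box]) -> List[Tuple[List[int], Color]]:
    """Pair each box's coordinates with the outline color for its type."""
    return [
        ([b.xmin, b.ymin, b.xmax, b.ymax], TYPE_COLORS.get(b.box_type, DEFAULT_COLOR))
        for b in boxes
    ]


demo = [Box("WORD", 10, 10, 50, 20), Box("CELL", 0, 0, 100, 40)]
for xy, color in rectangles_with_colors(demo):
    print(xy, color)
```

Each resulting pair can be passed to `draw.rectangle(xy=xy, outline=color, width=2)` in place of the fixed outline color.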

To draw bounding boxes within PDF documents, the following code can be used:

import fitz

# for local stored files
file_path = "<<replace with the local path to your pdf file>>"
doc = fitz.open(file_path)
# for files stored in S3 the streaming object can be used
# doc = fitz.open(stream="<<replace with stream_object_variable>>", filetype="pdf")

# draw boxes
# page numbers in bounding_box_list start at 1
for p, page in enumerate(doc, start=1):
    for bbox in bounding_box_list:
        if bbox.page_number == p:
            page.draw_rect(
                [bbox.xmin, bbox.ymin, bbox.xmax, bbox.ymax], color=(0, 1, 0), width=2
            )

# save file locally 
doc.save("<<local path for output file>>")
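For documents with many pages, scanning the full bounding box list once per page can become slow. A possible alternative, sketched here with a hypothetical `group_by_page` helper and a stand-in `Box` type, is to group the boxes by `page_number` once and look them up per page.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Box(NamedTuple):
    """Stand-in for a bounding box with the attributes used in the PDF example."""
    page_number: int
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def group_by_page(boxes: Iterable[Box]) -> Dict[int, List[Box]]:
    """Index boxes by their 1-based page number, preserving input order."""
    pages: Dict[int, List[Box]] = defaultdict(list)
    for box in boxes:
        pages[box.page_number].append(box)
    return dict(pages)


demo = [Box(1, 0, 0, 10, 10), Box(2, 5, 5, 20, 20), Box(1, 30, 30, 40, 40)]
boxes_by_page = group_by_page(demo)
print(sorted(boxes_by_page))  # [1, 2]
```

Inside the page loop, `boxes_by_page.get(p, [])` then yields only the boxes for page `p`, so the `bbox.page_number == p` check is no longer needed.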
