Amazon Textract Overlay tools
Textract-Overlayer
amazon-textract-overlayer provides functions to help overlay bounding boxes on documents.
Install
> python -m pip install amazon-textract-overlayer
Make sure your environment is set up with AWS credentials through configuration files, environment variables, or an attached role (https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html).
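As a rough pre-flight check before calling Textract, the sketch below looks for the two most common credential sources: environment variables and the shared credentials file. The helper name find_credential_hints is hypothetical and not part of amazon-textract-overlayer. An empty result does not prove that credentials are missing, because an attached role cannot be detected this way.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def find_credential_hints(
    env: Optional[Mapping[str, str]] = None, home: Optional[Path] = None
) -> List[str]:
    """Return hints about where AWS credentials may come from.

    Hypothetical helper for illustration only. It inspects environment
    variables and the shared credentials file; attached roles are not detected.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else home
    hints = []
    # Static credentials passed through the environment
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        hints.append("environment variables")
    # A named profile selected through the environment
    if env.get("AWS_PROFILE"):
        hints.append("profile " + env["AWS_PROFILE"])
    # Shared credentials file, at its default or overridden location
    credentials_file = Path(
        env.get("AWS_SHARED_CREDENTIALS_FILE", home / ".aws" / "credentials")
    )
    if credentials_file.is_file():
        hints.append("credentials file " + str(credentials_file))
    return hints
```

Calling find_credential_hints() without arguments inspects the current process environment and home directory.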
Samples
The primary method provided is get_bounding_boxes, which returns bounding boxes based on the Textract_Types values passed in.
It is mostly taken from the amazon-textract command from the package amazon-textract-helper.
The following sample returns the bounding boxes for the WORD and CELL data types.
from PIL import Image
from textractcaller.t_call import Textract_Features, Textract_Types, call_textract
from textractoverlayer.t_overlay import DocumentDimensions, get_bounding_boxes

input_document = "<<replace with the local path to your image file>>"
# example value: TABLES is requested so that CELL blocks are part of the response
features = [Textract_Features.TABLES]
doc = call_textract(input_document=input_document, features=features)
# image is a PIL.Image.Image in this case
image = Image.open(input_document)
document_dimension: DocumentDimensions = DocumentDimensions(doc_width=image.size[0], doc_height=image.size[1])
overlay = [Textract_Types.WORD, Textract_Types.CELL]
bounding_box_list = get_bounding_boxes(textract_json=doc, document_dimensions=document_dimension, overlay_features=overlay)
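The document dimensions matter because Textract reports each BoundingBox as ratios of the page size (Left, Top, Width, Height between 0 and 1). The sketch below shows the kind of scaling that turns such ratios into absolute coordinates. PixelBox and scale_box are illustrative names and are not the library's own classes or implementation.

```python
from dataclasses import dataclass


@dataclass
class PixelBox:
    """Illustrative stand-in for a box in absolute coordinates."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def scale_box(geometry: dict, doc_width: float, doc_height: float) -> PixelBox:
    """Scale a Textract BoundingBox dict (ratios of page size) to absolute coordinates."""
    left = geometry["Left"] * doc_width
    top = geometry["Top"] * doc_height
    return PixelBox(
        xmin=left,
        ymin=top,
        xmax=left + geometry["Width"] * doc_width,
        ymax=top + geometry["Height"] * doc_height,
    )


# A box covering the middle half of the width, starting halfway down the page
box = scale_box(
    {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.25},
    doc_width=1000,
    doc_height=800,
)
print(box)  # PixelBox(xmin=250.0, ymin=400.0, xmax=750.0, ymax=600.0)
```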
The actual overlay drawing of bounding boxes for images is in the amazon-textract command from the package amazon-textract-helper and looks like this:
from PIL import Image, ImageDraw
image = Image.open(input_document)
rgb_im = image.convert('RGB')
draw = ImageDraw.Draw(rgb_im)
# check the impl in amazon-textract-helper for ways to associate different colors to types
for bbox in bounding_box_list:
    draw.rectangle(xy=[bbox.xmin, bbox.ymin, bbox.xmax, bbox.ymax], outline=(128, 128, 0), width=2)
rgb_im.show()
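One possible way to associate different colors with types is a small lookup table with a fallback color. This is a minimal sketch and not the amazon-textract-helper implementation. The palette values and the names TYPE_COLORS, DEFAULT_COLOR and color_for are made up for illustration.

```python
from typing import Dict, Tuple

RGB = Tuple[int, int, int]

# Example palette keyed by block type name; the colors are arbitrary choices
TYPE_COLORS: Dict[str, RGB] = {
    "WORD": (128, 128, 0),
    "CELL": (0, 128, 128),
}
# Used for any type that has no entry in the palette
DEFAULT_COLOR: RGB = (255, 0, 0)


def color_for(type_name: str) -> RGB:
    """Return the outline color for a block type name, case-insensitively."""
    return TYPE_COLORS.get(type_name.upper(), DEFAULT_COLOR)
```

In the drawing loop the outline could then be chosen per box, for example with outline=color_for(bbox.box_type.name). This assumes each bounding box exposes its Textract_Types member as box_type, which should be checked against the installed version.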
To draw bounding boxes within PDF documents, the following code can be used:
import fitz
# for local stored files
file_path = "<<replace with the local path to your pdf file>>"
doc = fitz.open(file_path)
# for files stored in S3 the streaming object can be used
# doc = fitz.open(stream="<<replace with stream_object_variable>>", filetype="pdf")
# draw boxes
# page numbers in the bounding boxes start at 1
for p, page in enumerate(doc, start=1):
    for bbox in bounding_box_list:
        if bbox.page_number == p:
            page.draw_rect(
                [bbox.xmin, bbox.ymin, bbox.xmax, bbox.ymax], color=(0, 1, 0), width=2
            )
# save file locally
doc.save("<<local path for output file>>")
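For documents with many pages, scanning the full bounding box list once per page can become slow. A minimal sketch of an alternative is to group the boxes by page number first. The helper name group_by_page is hypothetical, and SimpleNamespace objects stand in for the real bounding boxes, of which only the page_number attribute is used.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def group_by_page(boxes: Iterable[Any]) -> Dict[int, List[Any]]:
    """Group objects that have a page_number attribute by that page number."""
    grouped = defaultdict(list)
    for box in boxes:
        grouped[box.page_number].append(box)
    return dict(grouped)


# Stand-in boxes for illustration; real ones come from get_bounding_boxes
sample_boxes = [
    SimpleNamespace(page_number=1, text="first"),
    SimpleNamespace(page_number=2, text="second"),
    SimpleNamespace(page_number=1, text="third"),
]
pages = group_by_page(sample_boxes)
print({page: [box.text for box in boxes] for page, boxes in pages.items()})
# {1: ['first', 'third'], 2: ['second']}
```

The inner loop of the PDF example could then iterate over pages.get(p, []) instead of the whole list.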
Hashes for amazon-textract-overlayer-0.0.11.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 0c2c77697a1c1ef6aed1da3b4e2432e8feec7ff4d89eeeac340b7e7f86806d1f
MD5 | 5d061d7a05e6a47466b7967288c2eb0c
BLAKE2b-256 | dd832048f94cc6351f9966b590a2ec1352bd33bc54eec72784fe64c670e89077
Hashes for amazon_textract_overlayer-0.0.11-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | c6b7c80a665ff66d1fbc891e2d54ac8f6cc1b97c71d10ad44e0dddea8003c98b
MD5 | 4e55f7351fe972a1840a6e16f6085dc5
BLAKE2b-256 | 637bc6796ead68bef4be08d25c5d4e82920818fd90df3e322e9d6097c59ea9c7