Amazon Textract Overlay tools
Textract-Overlayer
amazon-textract-overlayer provides functions to help overlay bounding boxes on documents.
Install
> python -m pip install amazon-textract-overlayer
Make sure your environment is set up with AWS credentials, provided through configuration files, environment variables, or an attached role (https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html).
Samples
The primary method provided is get_bounding_boxes, which returns bounding boxes for the Textract_Types passed in.
The sample below is mostly taken from the amazon-textract command in the amazon-textract-helper package.
It returns the bounding boxes for the WORD and CELL types.
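from PIL import Image
from textractoverlayer.t_overlay import DocumentDimensions, get_bounding_boxes
from textractcaller.t_call import Textract_Features, Textract_Types, call_textract

# placeholder: point this at your own image file
input_document = "<<replace with the local path to your image file>>"
# assumption: TABLES is requested because CELL blocks only appear in table analysis
features = [Textract_Features.TABLES]
doc = call_textract(input_document=input_document, features=features)
# image is a PIL.Image.Image in this case
image = Image.open(input_document)
document_dimension: DocumentDimensions = DocumentDimensions(doc_width=image.size[0], doc_height=image.size[1])
overlay = [Textract_Types.WORD, Textract_Types.CELL]
bounding_box_list = get_bounding_boxes(textract_json=doc, document_dimensions=document_dimension, overlay_features=overlay)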
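Textract reports each block's bounding box as ratios of the page size (Left, Top, Width, Height, each between 0 and 1), which is why the document dimensions are passed in. The sketch below shows that scaling in isolation. PixelBox and scale_bounding_box are hypothetical names used for illustration only; they are not part of amazon-textract-overlayer, and the library's own conversion may differ in detail.

```python
from typing import NamedTuple


class PixelBox(NamedTuple):
    """Hypothetical container for a box in absolute document units."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def scale_bounding_box(geometry: dict, doc_width: float, doc_height: float) -> PixelBox:
    """Scale a Textract-style relative BoundingBox to absolute coordinates."""
    xmin = geometry["Left"] * doc_width
    ymin = geometry["Top"] * doc_height
    xmax = xmin + geometry["Width"] * doc_width
    ymax = ymin + geometry["Height"] * doc_height
    return PixelBox(xmin, ymin, xmax, ymax)


# A block covering the central half of a 1000 x 800 image.
relative_box = {"Left": 0.25, "Top": 0.25, "Width": 0.5, "Height": 0.5}
pixel_box = scale_bounding_box(relative_box, doc_width=1000, doc_height=800)
print(pixel_box)
```

The resulting xmin, ymin, xmax and ymax values are in the same units as the dimensions supplied, so image pixels for a PIL image.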
The actual overlay drawing of bounding boxes for images is in the amazon-textract command from the package amazon-textract-helper and looks like this:
from PIL import Image, ImageDraw
image = Image.open(input_document)
rgb_im = image.convert('RGB')
draw = ImageDraw.Draw(rgb_im)
# check the impl in amazon-textract-helper for ways to associate different colors to types
for bbox in bounding_box_list:
    draw.rectangle(xy=[bbox.xmin, bbox.ymin, bbox.xmax, bbox.ymax], outline=(128, 128, 0), width=2)
rgb_im.show()
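One possible way to give each type its own color is a lookup table keyed by type, sketched below. This is not the amazon-textract-helper implementation. The Box class is a stand-in, the box_type attribute name is an assumption that should be checked against the library, and RecordingDraw only replaces PIL's ImageDraw.Draw so the example runs without an image.

```python
from dataclasses import dataclass

# Hypothetical type-to-color table; RGB tuples as accepted by PIL's ImageDraw.
TYPE_COLORS = {
    "WORD": (128, 128, 0),
    "CELL": (0, 128, 128),
}
DEFAULT_COLOR = (255, 0, 0)  # used for any type missing from the table


@dataclass
class Box:
    """Stand-in for a bounding box; `box_type` is an assumed attribute name."""
    box_type: str
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def draw_boxes(draw, boxes, width=2):
    """Outline each box on `draw` (e.g. an ImageDraw.Draw) in its type's color."""
    for box in boxes:
        color = TYPE_COLORS.get(box.box_type, DEFAULT_COLOR)
        draw.rectangle(xy=[box.xmin, box.ymin, box.xmax, box.ymax], outline=color, width=width)


class RecordingDraw:
    """Records rectangle calls instead of drawing, so no image is needed."""

    def __init__(self):
        self.calls = []

    def rectangle(self, xy, outline, width):
        self.calls.append((xy, outline, width))


recorder = RecordingDraw()
draw_boxes(recorder, [Box("WORD", 10, 10, 50, 20), Box("LINE", 0, 0, 5, 5)])
```

In real use, the ImageDraw.Draw object from the snippet above would be passed in place of the recorder.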
To draw bounding boxes within PDF documents, the following code can be used:
import fitz  # provided by the PyMuPDF package

# for locally stored files
file_path = "<<replace with the local path to your pdf file>>"
# named pdf_doc so it does not overwrite the Textract response stored in doc
pdf_doc = fitz.open(file_path)
# for files stored in S3 the streaming object can be used
# pdf_doc = fitz.open(stream="<<replace with stream_object_variable>>", filetype="pdf")
# draw boxes; page numbers in the bounding boxes start at 1
for p, page in enumerate(pdf_doc, start=1):
    for bbox in bounding_box_list:
        if bbox.page_number == p:
            page.draw_rect(
                [bbox.xmin, bbox.ymin, bbox.xmax, bbox.ymax], color=(0, 1, 0), width=2
            )
# save file locally
pdf_doc.save("<<local path for output file>>")
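The loop above scans the full bounding box list once per page. For long documents, one option is to group the boxes by page number first, as in the sketch below. group_by_page is a hypothetical helper, not part of amazon-textract-overlayer, and the SimpleNamespace objects only stand in for the boxes returned by get_bounding_boxes.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_page(boxes):
    """Return a dict mapping 1-based page number to the boxes on that page."""
    pages = defaultdict(list)
    for box in boxes:
        pages[box.page_number].append(box)
    return dict(pages)


# Stand-in boxes; real ones would come from get_bounding_boxes.
boxes = [
    SimpleNamespace(page_number=1, xmin=0, ymin=0, xmax=10, ymax=10),
    SimpleNamespace(page_number=2, xmin=5, ymin=5, xmax=15, ymax=15),
    SimpleNamespace(page_number=1, xmin=20, ymin=20, xmax=30, ymax=30),
]
boxes_by_page = group_by_page(boxes)

# With PyMuPDF, where `pdf` stands for the opened document, the drawing loop
# could then look like this:
# for page_number, page in enumerate(pdf, start=1):
#     for bbox in boxes_by_page.get(page_number, []):
#         page.draw_rect([bbox.xmin, bbox.ymin, bbox.xmax, bbox.ymax], color=(0, 1, 0), width=2)
```

Each box is then visited once, and pages without any boxes are skipped by the empty default list.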
Download files
Source Distribution
Built Distribution
File details
Details for the file amazon-textract-overlayer-0.0.12.tar.gz.
File metadata
- Download URL: amazon-textract-overlayer-0.0.12.tar.gz
- Upload date:
- Size: 9.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f6b7f87381d62a84aa8f159c218600f7e6742771a58e6126515b1849a105e288 |
| MD5 | af47810e9f5d286af3dc34e654ccfca9 |
| BLAKE2b-256 | be41cdfc5dcab9eaf3c2b3aedc7d49bfa18cecae06d0f87e2732bf39ce2f5aa7 |
File details
Details for the file amazon_textract_overlayer-0.0.12-py2.py3-none-any.whl.
File metadata
- Download URL: amazon_textract_overlayer-0.0.12-py2.py3-none-any.whl
- Upload date:
- Size: 9.4 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 68ac82fbee1fa8080a79cb2cba304d94e07862b856fbbaebe50fc2f23195926c |
| MD5 | 4c23cdcda519fe9683c52969617490ef |
| BLAKE2b-256 | 7bd665dd95f8807c7bba6f6ace217ae00c505504b09ca39d2c7559a2f4edff18 |