Amazon Textract Pipeline Component to add page dimensions to page block types
Textract-Pipeline-PageDimensions
Provides functions that add page dimensions (doc_width and doc_height) to the PAGE blocks of the Textract JSON schema. The values are stored under the custom attribute, e.g.:
{'PageDimension': {'doc_width': 1549.0, 'doc_height': 370.0} }
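To make the location of the attribute concrete, here is a minimal sketch that reads it back from a Textract response held as a plain Python dict. It assumes the serialized JSON stores the attribute under a "Custom" key on each PAGE block, as the jq filter in the command line sample suggests. The helper name page_dimensions and the sample data are made up for illustration and are not part of this package.

```python
from typing import Any, Dict, List, Optional


def page_dimensions(textract_json: Dict[str, Any]) -> List[Optional[Dict[str, float]]]:
    """Return the PageDimension entry of every PAGE block, in document order.

    Hypothetical helper, not part of the package. Pages that carry no
    PageDimension entry yield None.
    """
    dims: List[Optional[Dict[str, float]]] = []
    for block in textract_json.get("Blocks", []):
        if block.get("BlockType") != "PAGE":
            continue  # only PAGE blocks carry the attribute
        custom = block.get("Custom") or {}
        dims.append(custom.get("PageDimension"))
    return dims


# Hand-written stand-in for a Textract response, for illustration only.
sample = {
    "Blocks": [
        {
            "BlockType": "PAGE",
            "Id": "page-1",
            "Custom": {"PageDimension": {"doc_width": 1549.0, "doc_height": 370.0}},
        },
        {"BlockType": "LINE", "Id": "line-1", "Text": "Hello"},
        {"BlockType": "PAGE", "Id": "page-2"},  # no dimensions added yet
    ]
}

print(page_dimensions(sample))
```

The second PAGE block has no Custom entry, so the sketch reports None for it instead of raising an error.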
Install
> python -m pip install amazon-textract-pipeline-pagedimensions
Make sure your environment is set up with AWS credentials, either through configuration files, environment variables, or an attached role (https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html).
Samples
Add Page dimensions for a local file
This sample uses amazon-textract-caller together with amazon-textract-pipeline-pagedimensions.
> python -m pip install amazon-textract-caller
from textractpagedimensions.t_pagedimensions import add_page_dimensions
from textractcaller.t_call import call_textract
from trp.trp2 import TDocument, TDocumentSchema
# The same file is passed to Textract and to add_page_dimensions
input_file = '<path to some image file>'
j = call_textract(input_document=input_file)
t_document: TDocument = TDocumentSchema().load(j)
add_page_dimensions(t_document=t_document, input_document=input_file)
print(t_document.pages[0].custom['PageDimension'])
# output will be something like this:
# {
# 'doc_width': 1544,
# 'doc_height': 1065
# }
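One common reason to want page dimensions is that Textract reports block geometry as ratios of the page size (BoundingBox values Left, Top, Width and Height between 0 and 1). The sketch below shows how such a box could be scaled to absolute coordinates in the same units as doc_width and doc_height. The helper to_absolute_box and the sample values are hypothetical and not part of this package.

```python
from typing import Dict, Tuple


def to_absolute_box(
    bounding_box: Dict[str, float], page_dimension: Dict[str, float]
) -> Tuple[float, float, float, float]:
    """Scale a relative Textract BoundingBox to (left, top, right, bottom).

    Hypothetical helper: multiplies the ratio values by the page width
    and height stored under PageDimension.
    """
    width = page_dimension["doc_width"]
    height = page_dimension["doc_height"]
    left = bounding_box["Left"] * width
    top = bounding_box["Top"] * height
    right = left + bounding_box["Width"] * width
    bottom = top + bounding_box["Height"] * height
    return (left, top, right, bottom)


# Illustrative values: dimensions as in the sample output, box made up.
page_dimension = {"doc_width": 1544, "doc_height": 1065}
bounding_box = {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.25}

print(to_absolute_box(bounding_box, page_dimension))
```

With real data, page_dimension would come from t_document.pages[0].custom['PageDimension'] and bounding_box from the geometry of a block on that page.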
Using the Amazon Textract Helper command line tool with PageDimensions
Together with the Amazon Textract Helper and the Amazon Textract Response Parser, this component can be chained into a command line pipeline. The short demonstration below adds PageDimension and Orientation information to the PAGE blocks of the Textract JSON and uses jq to print only the added attributes.
> python -m pip install amazon-textract-helper amazon-textract-response-parser amazon-textract-pipeline-pagedimensions
> amazon-textract --input-document "s3://amazon-textract-public-content/blogs/2-pager-different-dimensions.pdf" | amazon-textract-pipeline-pagedimensions --input-document "s3://amazon-textract-public-content/blogs/2-pager-different-dimensions.pdf" | amazon-textract-pipeline --components add_page_orientation | jq '.Blocks[] | select(.BlockType=="PAGE") | .Custom'
{
"PageDimension": {
"doc_width": 1549,
"doc_height": 370
},
"Orientation": 0
}
{
"PageDimension": {
"doc_width": 1079,
"doc_height": 505
},
"Orientation": 0
}
Hashes for amazon-textract-pipeline-pagedimensions-0.0.6.tar.gz
Algorithm | Hash digest
---|---
SHA256 | b63c4f4a216624e41907ab6d77f8ace1373e546023c5e087d25a2fdcca88e693
MD5 | c0dd77ae06e161fd35ff285f149bc044
BLAKE2b-256 | 040b69f55fef7504f162ff519b7e578b0a248a81401101dea1f17f8b699f9f30
Hashes for amazon_textract_pipeline_pagedimensions-0.0.6-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 169d7119a0bae242c4cb8f57f538fd83c5c2a5053b69302260748c8fc3c9fc18
MD5 | ee559c89b65bb7c28d18b2d6ba889afc
BLAKE2b-256 | 9c1e7178bf7e82d3c09bffbe6d3b92c32120850abd581a458c7b8878c6228fc2