Amazon Textract Pipeline Component to add page dimensions to page block types

Textract-Pipeline-PageDimensions

Provides functions that add page dimensions (doc_width and doc_height) to the PAGE blocks of the Textract JSON schema. The values are stored under the block's custom attribute, for example:

{'PageDimension': {'doc_width': 1549.0, 'doc_height': 370.0} }
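
To make the resulting structure concrete, here is a minimal sketch that works on a plain dictionary shaped like a Textract response. The helper `attach_page_dimensions` is hypothetical and is not the package's implementation; it only illustrates where the attribute ends up (the raw JSON key is `Custom`, as used in the jq filter later in this document).

```python
from typing import Any, Dict, List, Tuple


def attach_page_dimensions(
    textract_json: Dict[str, Any],
    dimensions: List[Tuple[float, float]],
) -> Dict[str, Any]:
    """Hypothetical helper: store (width, height) on each PAGE block, in page order."""
    page_blocks = [
        block
        for block in textract_json.get("Blocks", [])
        if block.get("BlockType") == "PAGE"
    ]
    if len(page_blocks) != len(dimensions):
        raise ValueError("expected one (width, height) pair per PAGE block")
    for block, (width, height) in zip(page_blocks, dimensions):
        # Keep any custom attributes that other pipeline components already added.
        custom = block.setdefault("Custom", {})
        custom["PageDimension"] = {
            "doc_width": float(width),
            "doc_height": float(height),
        }
    return textract_json


# A tiny stand-in for a Textract response (assumption: real responses have many more fields).
response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "page-1"},
        {"BlockType": "LINE", "Id": "line-1", "Text": "hello"},
    ]
}
attach_page_dimensions(response, [(1549, 370)])
print(response["Blocks"][0]["Custom"])
```

Only PAGE blocks are touched; other block types are left unchanged.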

Install

> python -m pip install amazon-textract-pipeline-pagedimensions

Make sure your environment is set up with AWS credentials, either through configuration files, environment variables, or an attached role (https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html).

Samples

Add Page dimensions for a local file

This sample uses amazon-textract-caller and amazon-textract-pipeline-pagedimensions.

> python -m pip install amazon-textract-caller

from textractpagedimensions.t_pagedimensions import add_page_dimensions
from textractcaller.t_call import call_textract
from trp.trp2 import TDocument, TDocumentSchema

# the same file is sent to Textract and used to read the page dimensions
input_file = '<path to some image file>'
j = call_textract(input_document=input_file)
t_document: TDocument = TDocumentSchema().load(j)
# adds the PageDimension custom attribute to every PAGE block
add_page_dimensions(t_document=t_document, input_document=input_file)
print(t_document.pages[0].custom['PageDimension'])
# output will be something like this:
# {
#     'doc_width': 1544,
#     'doc_height': 1065
# }
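
One common reason to record page dimensions is to turn Textract's ratio-based geometry into absolute coordinates. Textract reports each BoundingBox as Left, Top, Width, and Height ratios of the page. The sketch below, using the hypothetical helper `bbox_to_absolute` and made-up box values, scales such a box by doc_width and doc_height. The result is in whatever unit the dimensions are reported in; that unit is an assumption here, since it is not stated above.

```python
from typing import Dict, Tuple


def bbox_to_absolute(
    bounding_box: Dict[str, float],
    page_dimension: Dict[str, float],
) -> Tuple[int, int, int, int]:
    """Hypothetical helper: scale a ratio-based box to (left, top, right, bottom)."""
    width = page_dimension["doc_width"]
    height = page_dimension["doc_height"]
    left = round(bounding_box["Left"] * width)
    top = round(bounding_box["Top"] * height)
    right = round((bounding_box["Left"] + bounding_box["Width"]) * width)
    bottom = round((bounding_box["Top"] + bounding_box["Height"]) * height)
    return left, top, right, bottom


# Dimensions as printed in the sample output above; the box values are illustrative.
page_dimension = {"doc_width": 1544, "doc_height": 1065}
word_box = {"Left": 0.25, "Top": 0.2, "Width": 0.125, "Height": 0.2}

absolute_box = bbox_to_absolute(word_box, page_dimension)
print(absolute_box)
```

The returned tuple uses the (left, top, right, bottom) order, which is the order most imaging libraries expect for crop or draw operations.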

Using the Amazon Textract Helper command line tool with PageDimensions

Combined with the Amazon Textract Helper and the Amazon Textract Response Parser, this component can be used in a command-line pipeline. The following short demonstration adds both PageDimension and Orientation to the PAGE blocks of the Textract JSON and uses jq to print only those custom attributes.

> python -m pip install amazon-textract-helper amazon-textract-response-parser amazon-textract-pipeline-pagedimensions
> amazon-textract --input-document "s3://amazon-textract-public-content/blogs/2-pager-different-dimensions.pdf" | amazon-textract-pipeline-pagedimensions --input-document "s3://amazon-textract-public-content/blogs/2-pager-different-dimensions.pdf"  | amazon-textract-pipeline --components add_page_orientation | jq '.Blocks[] | select(.BlockType=="PAGE") | .Custom'

{
  "PageDimension": {
    "doc_width": 1549,
    "doc_height": 370
  },
  "Orientation": 0
}
{
  "PageDimension": {
    "doc_width": 1079,
    "doc_height": 505
  },
  "Orientation": 0
}
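
If jq is not available, the same filtering step can be done in Python with the standard library. This is a minimal sketch: the function name `page_custom_attributes` is hypothetical, and the sample JSON is a trimmed stand-in that carries only the fields shown in the output above.

```python
import json
from typing import Any, Dict, List, Optional


def page_custom_attributes(textract_json_text: str) -> List[Optional[Dict[str, Any]]]:
    """Return the Custom attribute of every PAGE block, like the jq filter above."""
    document = json.loads(textract_json_text)
    return [
        block.get("Custom")  # None when a PAGE block has no Custom attribute
        for block in document.get("Blocks", [])
        if block.get("BlockType") == "PAGE"
    ]


# Trimmed stand-in for the pipeline output (assumption: real output has many more blocks).
sample_output = json.dumps(
    {
        "Blocks": [
            {
                "BlockType": "PAGE",
                "Custom": {
                    "PageDimension": {"doc_width": 1549, "doc_height": 370},
                    "Orientation": 0,
                },
            },
            {"BlockType": "LINE", "Text": "some text"},
            {
                "BlockType": "PAGE",
                "Custom": {
                    "PageDimension": {"doc_width": 1079, "doc_height": 505},
                    "Orientation": 0,
                },
            },
        ]
    }
)

customs = page_custom_attributes(sample_output)
for page_number, custom in enumerate(customs, start=1):
    print(page_number, json.dumps(custom))
```

In a real pipeline the JSON text would come from standard input (for example `sys.stdin.read()`) instead of the inline sample.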

File details

amazon-textract-pipeline-pagedimensions-0.0.5.tar.gz (source distribution)

  • Size: 8.6 kB
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.6.3 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.0 CPython/3.9.6
  • SHA256: 6a6b481832ea9a9da2f3b59b82443a0b3acb4f1b732660191119c8ad55cf444f
  • MD5: 8da245fc1e11fff22c3c2b5f29b31b92
  • BLAKE2b-256: b7917368d986cedb0cb3e3529978966cb406b4d50f403b21444112e63a7cc84b

amazon_textract_pipeline_pagedimensions-0.0.5-py2.py3-none-any.whl (built distribution)

  • SHA256: a30359a79616f24562d9be7a8665c6c8710dffafabb82e9b349526fd5a9d108c
  • MD5: b1b6ae6c924bc4b399775083bc493363
  • BLAKE2b-256: e435f601f1f9c9bb476b9f04fd2e4dde75da68061dc7c9023d7b5e49b3013ec0