Skip to main content

Amazon Textract Caller tools

Project description

Textract-Caller

amazon-textract-caller provides a collection of ready to use functions and sample implementations to speed up the evaluation and development for any project using Amazon Textract.

Making it easy to call Amazon Textract regardless of file type and location.

Install

> python -m pip install amazon-textract-caller

Functions

from textractcaller import call_textract
def call_textract(input_document: Union[str, bytes],
                  features: Optional[List[Textract_Features]] = None,
                  queries_config: Optional[QueriesConfig] = None,
                  output_config: Optional[OutputConfig] = None,
                  kms_key_id: str = "",
                  job_tag: str = "",
                  notification_channel: Optional[NotificationChannel] = None,
                  client_request_token: str = "",
                  return_job_id: bool = False,
                  force_async_api: bool = False,
                  call_mode: Textract_Call_Mode = Textract_Call_Mode.DEFAULT,
                  boto3_textract_client=None,
                  job_done_polling_interval=1) -> dict:

Also useful when receiving the JSON response from an asynchronous job (start_document_text_detection or start_document_analysis)

from textractcaller import get_full_json
def get_full_json(job_id: str = None,
                  textract_api: Textract_API = Textract_API.DETECT,
                  boto3_textract_client=None)->dict:

And when receiving the JSON from the OutputConfig location, this method is useful as well.

from textractcaller import get_full_json_from_output_config
def get_full_json_from_output_config(output_config: OutputConfig = None,
                                     job_id: str = None,
                                     s3_client = None)->dict:

Samples

Calling with file from local filesystem only with detect_text

textract_json = call_textract(input_document="/folder/local-filesystem-file.png")

Calling with file from local filesystem only detect_text and using in Textract Response Parser

(needs trp dependency through python -m pip install amazon-textract-response-parser)

import json
from trp import Document
from textractcaller import call_textract

textract_json = call_textract(input_document="/folder/local-filesystem-file.png")
d = Document(textract_json)

Calling with Queries for a multi-page document and extract the Answers

sample also uses the amazon-textract-response-parser

python -m pip install amazon-textract-caller amazon-textract-response-parser
import textractcaller as tc
import trp.trp2 as t2
import boto3

textract = boto3.client('textract', region_name="us-east-2")
q1 = tc.Query(text="What is the employee SSN?", alias="SSN", pages=["1"])
q2 = tc.Query(text="What is YTD gross pay?", alias="GROSS_PAY", pages=["2"])
textract_json = tc.call_textract(
    input_document="s3://amazon-textract-public-content/blogs/2-pager.pdf",
    queries_config=tc.QueriesConfig(queries=[q1, q2]),
    features=[tc.Textract_Features.QUERIES],
    force_async_api=True,
    boto3_textract_client=textract)
t_doc: t2.TDocument = t2.TDocumentSchema().load(textract_json)  # type: ignore
for page in t_doc.pages:
    query_answers = t_doc.get_query_answers(page=page)
    for x in query_answers:
        print(f"{x[1]},{x[2]}")

Calling with file from local filesystem with TABLES features

from textractcaller import call_textract, Textract_Features
features = [Textract_Features.TABLES]
response = call_textract(
    input_document="/folder/local-filesystem-file.png", features=features)

Call with images located on S3 but force asynchronous API

from textractcaller import call_textract
response = call_textract(input_document="s3://some-bucket/w2-example.png", force_async_api=True)

Call with OutputConfig, Customer-Managed-Key

from textractcaller import call_textract
from textractcaller import OutputConfig, Textract_Features
output_config = OutputConfig(s3_bucket="somebucket-encrypted", s3_prefix="output/")
response = call_textract(input_document="s3://someprefix/somefile.png",
                          force_async_api=True,
                          output_config=output_config,
                          kms_key_id="arn:aws:kms:us-east-1:12345678901:key/some-key-id-ref-erence",
                          return_job_id=False,
                          job_tag="sometag",
                          client_request_token="sometoken")

Call with PDF located on S3 and force return of JobId instead of JSON response

from textractcaller import call_textract
response = call_textract(input_document="s3://some-bucket/some-document.pdf", return_job_id=True)
job_id = response['JobId']

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

amazon-textract-caller-0.0.26.tar.gz (12.5 kB view details)

Uploaded Source

Built Distribution

amazon_textract_caller-0.0.26-py2.py3-none-any.whl (12.6 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file amazon-textract-caller-0.0.26.tar.gz.

File metadata

File hashes

Hashes for amazon-textract-caller-0.0.26.tar.gz
Algorithm Hash digest
SHA256 d3de36991cc2431b1e3079459f495e4dab6dda41f1138f168f8acb20f574a899
MD5 976b1ea9ba5aa80ef0245b60b3487953
BLAKE2b-256 ae9fc4526b9ff6d8646f58e476e9734345a0fa0306a8757fee2451d1b4248652

See more details on using hashes here.

File details

Details for the file amazon_textract_caller-0.0.26-py2.py3-none-any.whl.

File metadata

File hashes

Hashes for amazon_textract_caller-0.0.26-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 2b7af0489df34e1a7876d7c0cef0faa7372d2a76768c8f84ef2306bcfbc9c6c0
MD5 2a6d2682b9a2d1bb687b1b1cedb62f21
BLAKE2b-256 4ae786b241916171f9a2730793a3352c8d69dbee93070f852f967d96ee6f4176

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page