Amazon Textract Caller tools

Textract-Caller

amazon-textract-caller provides a collection of ready-to-use functions and sample implementations that speed up evaluation and development for any project using Amazon Textract.

It makes it easy to call Amazon Textract regardless of file type and location.

Install

> python -m pip install amazon-textract-caller

Functions

from textractcaller import call_textract
def call_textract(input_document: Union[str, bytes],
                  features: Optional[List[Textract_Features]] = None,
                  queries_config: Optional[QueriesConfig] = None,
                  output_config: Optional[OutputConfig] = None,
                  adapters_config: Optional[AdaptersConfig] = None,
                  kms_key_id: str = "",
                  job_tag: str = "",
                  notification_channel: Optional[NotificationChannel] = None,
                  client_request_token: str = "",
                  return_job_id: bool = False,
                  force_async_api: bool = False,
                  call_mode: Textract_Call_Mode = Textract_Call_Mode.DEFAULT,
                  boto3_textract_client=None,
                  job_done_polling_interval=1) -> dict:

get_full_json is also useful for retrieving the complete JSON response of an asynchronous job (one started with start_document_text_detection or start_document_analysis).

from textractcaller import get_full_json
def get_full_json(job_id: str = None,
                  textract_api: Textract_API = Textract_API.DETECT,
                  boto3_textract_client=None)->dict:
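
Asynchronous Textract results are returned in pages linked by a NextToken, so a caller usually loops until no token comes back. The following is a minimal sketch of that loop against boto3's get_document_text_detection. It is not the library's implementation; collect_async_blocks and FakeTextractClient are hypothetical names, and the stub exists only so the example runs without AWS credentials.

```python
from typing import Any, Dict, List, Optional


def collect_async_blocks(client: Any, job_id: str) -> Dict[str, Any]:
    """Merge every result page of a text-detection job into one dict."""
    merged: Optional[Dict[str, Any]] = None
    blocks: List[Dict[str, Any]] = []
    next_token: Optional[str] = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        page = client.get_document_text_detection(**kwargs)
        if merged is None:
            # Keep top-level metadata (for example JobStatus) from the first page.
            merged = {k: v for k, v in page.items()
                      if k not in ("Blocks", "NextToken")}
        blocks.extend(page.get("Blocks", []))
        next_token = page.get("NextToken")
        if not next_token:
            break
    merged["Blocks"] = blocks
    return merged


class FakeTextractClient:
    """Hypothetical stub serving two canned pages; not part of boto3."""

    def get_document_text_detection(self, JobId: str,
                                    NextToken: Optional[str] = None):
        if NextToken is None:
            return {"JobStatus": "SUCCEEDED",
                    "Blocks": [{"BlockType": "LINE", "Text": "page one"}],
                    "NextToken": "t1"}
        return {"JobStatus": "SUCCEEDED",
                "Blocks": [{"BlockType": "LINE", "Text": "page two"}]}


full = collect_async_blocks(FakeTextractClient(), job_id="example-job-id")
print(len(full["Blocks"]), full["JobStatus"])
```

With a real client, pass boto3.client("textract") in place of the stub.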

To retrieve the JSON from the OutputConfig location instead, use get_full_json_from_output_config.

from textractcaller import get_full_json_from_output_config
def get_full_json_from_output_config(output_config: OutputConfig = None,
                                     job_id: str = None,
                                     s3_client = None)->dict:
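
As an illustration of what reading from an OutputConfig location involves, the sketch below merges result parts stored in S3. It assumes that Textract writes numbered part objects (1, 2, ...) under <prefix>/<job_id>/ next to a non-numeric access-check file, and that the prefix is non-empty; verify that layout for your account before relying on it. merge_output_config_results and FakeS3Client are hypothetical names, not part of the library or of boto3.

```python
import io
import json
from typing import Any, Dict, List


def merge_output_config_results(s3_client: Any, bucket: str, prefix: str,
                                job_id: str) -> Dict[str, Any]:
    """Concatenate the Blocks of all numbered part files of one job."""
    # Note: list_objects_v2 returns at most 1000 keys per call; a real
    # implementation would follow ContinuationToken for larger jobs.
    listing = s3_client.list_objects_v2(
        Bucket=bucket, Prefix=f"{prefix.rstrip('/')}/{job_id}/")
    # Assumption: part files are the objects whose last path segment is a number.
    part_keys = [o["Key"] for o in listing.get("Contents", [])
                 if o["Key"].rsplit("/", 1)[-1].isdigit()]
    # Sort numerically so that part 10 comes after part 2.
    part_keys.sort(key=lambda k: int(k.rsplit("/", 1)[-1]))
    blocks: List[Dict[str, Any]] = []
    for key in part_keys:
        body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        blocks.extend(json.loads(body).get("Blocks", []))
    return {"Blocks": blocks}


class FakeS3Client:
    """Hypothetical in-memory stand-in for a boto3 S3 client."""

    def __init__(self, objects: Dict[str, bytes]):
        self._objects = objects

    def list_objects_v2(self, Bucket: str, Prefix: str):
        keys = sorted(k for k in self._objects if k.startswith(Prefix))
        return {"Contents": [{"Key": k} for k in keys]}

    def get_object(self, Bucket: str, Key: str):
        return {"Body": io.BytesIO(self._objects[Key])}


def _part(text: str) -> bytes:
    return json.dumps({"Blocks": [{"BlockType": "LINE", "Text": text}]}).encode()


fake_s3 = FakeS3Client({
    "output/job-1/1": _part("first"),
    "output/job-1/2": _part("second"),
    "output/job-1/10": _part("tenth"),
    "output/job-1/.s3_access_check": b"",
})
result = merge_output_config_results(fake_s3, "somebucket-encrypted", "output/", "job-1")
texts = [b["Text"] for b in result["Blocks"]]
print(texts)
```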

Samples

Calling with a file from the local filesystem, using only detect_text

textract_json = call_textract(input_document="/folder/local-filesystem-file.png")
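
The returned dict follows the Textract response format, in which detected text lives in a flat list of Blocks. A minimal sketch of pulling the LINE text out of such a dict is shown here; extract_lines is a hypothetical helper, and sample_response is a hand-written stand-in for a real response (real blocks carry more fields such as Id, Geometry and Confidence).

```python
from typing import Any, Dict, List


def extract_lines(textract_json: Dict[str, Any]) -> List[str]:
    """Return the text of every LINE block, in response order."""
    return [
        block["Text"]
        for block in textract_json.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


# Hand-written stand-in for the dict returned by a text-detection call.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Employee Name"},
        {"BlockType": "WORD", "Text": "Employee"},
        {"BlockType": "WORD", "Text": "Name"},
        {"BlockType": "LINE", "Text": "Jane Doe"},
    ]
}

lines = extract_lines(sample_response)
print(lines)
```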

Calling with a file from the local filesystem using only detect_text, then loading the result into the Textract Response Parser

(requires the trp dependency, installed with python -m pip install amazon-textract-response-parser)

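from trp import Document
from textractcaller import call_textract

# No features are passed, so only text detection is requested.
textract_json = call_textract(input_document="/folder/local-filesystem-file.png")
# Wrap the raw JSON in the response parser's Document for easier traversal.
d = Document(textract_json)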

Calling with Queries for a multi-page document and extracting the Answers

This sample also uses the amazon-textract-response-parser.

python -m pip install amazon-textract-caller amazon-textract-response-parser

import textractcaller as tc
import trp.trp2 as t2
import boto3

textract = boto3.client('textract', region_name="us-east-2")
q1 = tc.Query(text="What is the employee SSN?", alias="SSN", pages=["1"])
q2 = tc.Query(text="What is YTD gross pay?", alias="GROSS_PAY", pages=["2"])
textract_json = tc.call_textract(
    input_document="s3://amazon-textract-public-content/blogs/2-pager.pdf",
    queries_config=tc.QueriesConfig(queries=[q1, q2]),
    features=[tc.Textract_Features.QUERIES],
    force_async_api=True,
    boto3_textract_client=textract)
t_doc: t2.TDocument = t2.TDocumentSchema().load(textract_json)  # type: ignore
for page in t_doc.pages:
    query_answers = t_doc.get_query_answers(page=page)
    for x in query_answers:
        print(f"{x[1]},{x[2]}")
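
For readers who want to see what sits underneath the parser, the sketch below walks the raw response directly. In the Textract response format, a QUERY block points to its QUERY_RESULT blocks through a relationship of type ANSWER. query_answers is a hypothetical helper and sample_response is hand-written with placeholder values; this is an illustration, not the response parser's implementation.

```python
from typing import Any, Dict, List, Tuple


def query_answers(textract_json: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """Return (query text, alias, answer text) for every answered QUERY block."""
    blocks = textract_json.get("Blocks", [])
    by_id = {b["Id"]: b for b in blocks if "Id" in b}
    results: List[Tuple[str, str, str]] = []
    for block in blocks:
        if block.get("BlockType") != "QUERY":
            continue
        query = block.get("Query", {})
        for rel in block.get("Relationships", []):
            if rel.get("Type") != "ANSWER":
                continue
            for answer_id in rel.get("Ids", []):
                answer = by_id.get(answer_id)
                if answer and answer.get("BlockType") == "QUERY_RESULT":
                    results.append((query.get("Text", ""),
                                    query.get("Alias", ""),
                                    answer.get("Text", "")))
    return results


# Hand-written stand-in for a QUERIES response; values are placeholders.
sample_response = {
    "Blocks": [
        {"Id": "q1", "BlockType": "QUERY",
         "Query": {"Text": "What is the employee SSN?", "Alias": "SSN"},
         "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}]},
        {"Id": "a1", "BlockType": "QUERY_RESULT", "Text": "000-00-0000"},
        # A query without an ANSWER relationship yields no result.
        {"Id": "q2", "BlockType": "QUERY",
         "Query": {"Text": "What is YTD gross pay?", "Alias": "GROSS_PAY"}},
    ]
}

answers = query_answers(sample_response)
for text, alias, answer in answers:
    print(f"{alias},{answer}")
```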

Calling with Custom Queries for a multi-page document using an adapter

This sample also uses the amazon-textract-response-parser.

python -m pip install amazon-textract-caller amazon-textract-response-parser

import textractcaller as tc
import trp.trp2 as t2
import boto3

textract = boto3.client('textract', region_name="us-east-2")
q1 = tc.Query(text="What is the employee SSN?", alias="SSN", pages=["1"])
q2 = tc.Query(text="What is YTD gross pay?", alias="GROSS_PAY", pages=["2"])
adapter1 = tc.Adapter(adapter_id="2e9bf1c4aa31", version="1", pages=["1"])
textract_json = tc.call_textract(
    input_document="s3://amazon-textract-public-content/blogs/2-pager.pdf",
    queries_config=tc.QueriesConfig(queries=[q1, q2]),
    adapters_config=tc.AdaptersConfig(adapters=[adapter1]),
    features=[tc.Textract_Features.QUERIES],
    force_async_api=True,
    boto3_textract_client=textract)
t_doc: t2.TDocument = t2.TDocumentSchema().load(textract_json)  # type: ignore
for page in t_doc.pages:
    query_answers = t_doc.get_query_answers(page=page)
    for x in query_answers:
        print(f"{x[1]},{x[2]}")
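
To make the request shape concrete, the sketch below builds the QueriesConfig and AdaptersConfig structures that the Textract analysis APIs in boto3 accept. build_analysis_params and the lowercase input keys are hypothetical and chosen for illustration only; how the library assembles the request internally is not shown in this README.

```python
from typing import Any, Dict, List, Optional


def build_analysis_params(
        queries: List[Dict[str, Any]],
        adapters: Optional[List[Dict[str, Any]]] = None) -> Dict[str, Any]:
    """Build keyword arguments in the shape of the Textract analysis request."""
    params: Dict[str, Any] = {
        "FeatureTypes": ["QUERIES"],
        "QueriesConfig": {
            "Queries": [
                {"Text": q["text"], "Alias": q["alias"], "Pages": q["pages"]}
                for q in queries
            ]
        },
    }
    if adapters:
        # Adapters are optional; include the key only when any are given.
        params["AdaptersConfig"] = {
            "Adapters": [
                {"AdapterId": a["adapter_id"], "Version": a["version"],
                 "Pages": a["pages"]}
                for a in adapters
            ]
        }
    return params


params = build_analysis_params(
    queries=[
        {"text": "What is the employee SSN?", "alias": "SSN", "pages": ["1"]},
        {"text": "What is YTD gross pay?", "alias": "GROSS_PAY", "pages": ["2"]},
    ],
    adapters=[{"adapter_id": "2e9bf1c4aa31", "version": "1", "pages": ["1"]}],
)
print(params["AdaptersConfig"])
```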

Calling with a file from the local filesystem with the TABLES feature

from textractcaller import call_textract, Textract_Features
features = [Textract_Features.TABLES]
response = call_textract(
    input_document="/folder/local-filesystem-file.png", features=features)
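
With TABLES enabled, the response contains TABLE blocks whose CHILD relationships point to CELL blocks, each carrying a RowIndex and ColumnIndex. The sketch below rebuilds a simple grid from that structure. table_rows and child_ids are hypothetical helpers, the sample is hand-written, and merged cells and selection elements are deliberately ignored.

```python
from typing import Any, Dict, List


def child_ids(block: Dict[str, Any]) -> List[str]:
    """Collect the Ids of all CHILD relationships of a block."""
    ids: List[str] = []
    for rel in block.get("Relationships", []):
        if rel.get("Type") == "CHILD":
            ids.extend(rel.get("Ids", []))
    return ids


def table_rows(textract_json: Dict[str, Any]) -> List[List[List[str]]]:
    """Return each table as a list of rows, each row a list of cell strings."""
    blocks = textract_json.get("Blocks", [])
    by_id = {b["Id"]: b for b in blocks}
    tables = []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        cells = {}
        for cell_id in child_ids(table):
            cell = by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [by_id[w]["Text"] for w in child_ids(cell)
                     if by_id[w]["BlockType"] == "WORD"]
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        if not cells:
            continue
        n_rows = max(r for r, _ in cells)
        n_cols = max(c for _, c in cells)
        # RowIndex and ColumnIndex are 1-based in the response.
        tables.append([[cells.get((r, c), "") for c in range(1, n_cols + 1)]
                       for r in range(1, n_rows + 1)])
    return tables


def _cell(cid: str, row: int, col: int, word_id: str) -> Dict[str, Any]:
    return {"Id": cid, "BlockType": "CELL", "RowIndex": row, "ColumnIndex": col,
            "Relationships": [{"Type": "CHILD", "Ids": [word_id]}]}


sample_response = {"Blocks": [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    _cell("c1", 1, 1, "w1"), _cell("c2", 1, 2, "w2"),
    _cell("c3", 2, 1, "w3"), _cell("c4", 2, 2, "w4"),
    {"Id": "w1", "BlockType": "WORD", "Text": "Name"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Pay"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Jane"},
    {"Id": "w4", "BlockType": "WORD", "Text": "100"},
]}

tables = table_rows(sample_response)
print(tables)
```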

Call with an image located on S3 and force the asynchronous API

from textractcaller import call_textract
response = call_textract(input_document="s3://some-bucket/w2-example.png", force_async_api=True)
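
An s3:// input string has to be split into a bucket and an object key before it can be sent to Textract. A minimal sketch of that split using only the standard library follows; split_s3_url is a hypothetical helper, not the library's own parser, and keys containing ? or # would need extra handling because urlparse treats them as query and fragment separators.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_url(url: str) -> Tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key), rejecting anything else."""
    parsed = urlparse(url)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"not an s3://bucket/key URL: {url!r}")
    return parsed.netloc, key


bucket, key = split_s3_url("s3://some-bucket/w2-example.png")
print(bucket, key)
```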

Call with OutputConfig, Customer-Managed-Key

from textractcaller import call_textract
from textractcaller import OutputConfig, Textract_Features
output_config = OutputConfig(s3_bucket="somebucket-encrypted", s3_prefix="output/")
response = call_textract(input_document="s3://someprefix/somefile.png",
                          force_async_api=True,
                          output_config=output_config,
                          kms_key_id="arn:aws:kms:us-east-1:12345678901:key/some-key-id-ref-erence",
                          return_job_id=False,
                          job_tag="sometag",
                          client_request_token="sometoken")
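
The arguments in this sample correspond to parameters of the asynchronous Textract start calls in boto3 (DocumentLocation, OutputConfig, KMSKeyId, JobTag, ClientRequestToken). The sketch below shows one way such a request could be assembled, leaving out optional values that are empty. build_start_job_params is a hypothetical helper and "example-key-id" is a placeholder; the library's internal request building may differ.

```python
from typing import Any, Dict


def build_start_job_params(bucket: str, key: str,
                           output_bucket: str, output_prefix: str,
                           kms_key_id: str = "", job_tag: str = "",
                           client_request_token: str = "") -> Dict[str, Any]:
    """Build keyword arguments in the shape of a Textract start-job request."""
    params: Dict[str, Any] = {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        "OutputConfig": {"S3Bucket": output_bucket, "S3Prefix": output_prefix},
    }
    optional = {
        "KMSKeyId": kms_key_id,
        "JobTag": job_tag,
        "ClientRequestToken": client_request_token,
    }
    # Empty strings mean "not set", so they are left out of the request.
    params.update({name: value for name, value in optional.items() if value})
    return params


params = build_start_job_params(
    bucket="someprefix", key="somefile.png",
    output_bucket="somebucket-encrypted", output_prefix="output/",
    kms_key_id="example-key-id",  # placeholder, not a real key
    job_tag="sometag",
)
print(sorted(params))
```

With a real client, the resulting dict could be passed as keyword arguments, for example textract.start_document_text_detection(**params).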

Call with PDF located on S3 and force return of JobId instead of JSON response

from textractcaller import call_textract
response = call_textract(input_document="s3://some-bucket/some-document.pdf", return_job_id=True)
job_id = response['JobId']
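
Once only a JobId is held, the caller is responsible for checking when the job finishes. The sketch below polls boto3's get_document_text_detection until the job reports SUCCEEDED, PARTIAL_SUCCESS or FAILED. wait_for_job, its max_polls limit and FakeTextractClient are hypothetical and exist so the example runs without AWS credentials; the response returned is only the first result page.

```python
import time
from typing import Any, Dict, List


def wait_for_job(client: Any, job_id: str, polling_interval: float = 1.0,
                 max_polls: int = 600) -> Dict[str, Any]:
    """Poll a text-detection job until it finishes, then return the response."""
    for _ in range(max_polls):
        response = client.get_document_text_detection(JobId=job_id)
        status = response.get("JobStatus")
        if status in ("SUCCEEDED", "PARTIAL_SUCCESS"):
            return response
        if status == "FAILED":
            message = response.get("StatusMessage", "no message")
            raise RuntimeError(f"Textract job {job_id} failed: {message}")
        # Still IN_PROGRESS: wait before asking again.
        time.sleep(polling_interval)
    raise TimeoutError(f"Textract job {job_id} did not finish in {max_polls} polls")


class FakeTextractClient:
    """Hypothetical stub that replays a fixed sequence of job statuses."""

    def __init__(self, statuses: List[str]):
        self._statuses = iter(statuses)
        self.calls = 0

    def get_document_text_detection(self, JobId: str) -> Dict[str, Any]:
        self.calls += 1
        return {"JobStatus": next(self._statuses)}


fake = FakeTextractClient(["IN_PROGRESS", "IN_PROGRESS", "SUCCEEDED"])
result = wait_for_job(fake, job_id="example-job-id", polling_interval=0)
print(result["JobStatus"], fake.calls)
```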

Release history

0.2.4: Jun 20, 2024 (this version)
0.2.3: Apr 3, 2024
0.2.2: Feb 14, 2024
0.2.1: Oct 23, 2023
0.2.0: Oct 19, 2023
0.1.0: Oct 16, 2023
0.0.29: May 9, 2023
0.0.28: Feb 10, 2023
0.0.27: Dec 9, 2022
0.0.26: Dec 5, 2022
0.0.25: Sep 12, 2022
0.0.24: Jun 9, 2022
0.0.23: May 24, 2022
0.0.22: May 6, 2022
0.0.21: May 6, 2022
0.0.20: May 4, 2022
0.0.19: May 4, 2022
0.0.18: Nov 30, 2021
0.0.17: Nov 15, 2021
0.0.16: Oct 28, 2021
0.0.15: Oct 11, 2021
0.0.14: Oct 8, 2021
0.0.13: Jun 14, 2021
0.0.12: May 10, 2021
0.0.11: Apr 28, 2021
0.0.10: Apr 23, 2021
0.0.3: Apr 12, 2021
0.0.2: Apr 8, 2021
0.0.1: Apr 7, 2021

Download files

Source distribution: amazon-textract-caller-0.2.4.tar.gz (13.2 kB), uploaded Jun 20, 2024 via twine/4.0.2 CPython/3.11.3, without Trusted Publishing.

Algorithm	Hash digest
SHA256	`ac9848322fba92bee8a2f5dc9f9f7f208a181e2754312ccf02f97e6126de7059`
MD5	`2aefe8313f29ff01cd08e2b6344e12e9`
BLAKE2b-256	`fe6282eada03a5bbedff817090e3365d883c354a9c10cf66f4d8f15af145828f`

Built distribution: amazon_textract_caller-0.2.4-py2.py3-none-any.whl (13.7 kB, tags Python 2 and Python 3), uploaded Jun 20, 2024 via twine/4.0.2 CPython/3.11.3, without Trusted Publishing.

Algorithm	Hash digest
SHA256	`ec7dc3517f1cc9b37b41a74b2b5ea040d67be91e8559a8150f44af75bf7f5590`
MD5	`e217e836d624b9ce1fb513695373362d`
BLAKE2b-256	`06521712e298e0afbd8824a8e521ac8c39db2b9ad0e26e51a48e5a7c77487537`
