Skip to main content

amazon-textract-idp-cdk-constructs

Project description

Amazon Textract IDP CDK Constructs

---

Stability: Experimental

All classes are under active development and subject to non-backward compatible changes or removal in any future version. These are not subject to the Semantic Versioning model. This means that while you may use them, you may need to update your source code when upgrading to a newer version of this package.


Context

This CDK Construct can be used as Step Function task and call Textract in Asynchonous mode for DetectText and AnalyzeDocument APIs.

For samples on usage, look at Amazon Textact IDP CDK Stack Samples

Input

Expects a Manifest JSON at 'Payload'. Manifest description: https://pypi.org/project/schadem-tidp-manifest/

Example call in Python

        textract_async_task = t_async.TextractGenericAsyncSfnTask(
            self,
            "textract-async-task",
            s3_output_bucket=s3_output_bucket,
            s3_temp_output_prefix=s3_temp_output_prefix,
            integration_pattern=sfn.IntegrationPattern.WAIT_FOR_TASK_TOKEN,
            lambda_log_level="DEBUG",
            timeout=Duration.hours(24),
            input=sfn.TaskInput.from_object({
                "Token":
                sfn.JsonPath.task_token,
                "ExecutionId":
                sfn.JsonPath.string_at('$$.Execution.Id'),
                "Payload":
                sfn.JsonPath.entire_payload,
            }),
            result_path="$.textract_result")

Output

Adds the "TextractTempOutputJsonPath" to the Step Function ResultPath. At this location the Textract output is stored as individual JSON files. Use the CDK Construct schadem-cdk-construct-sfn-textract-output-config-to-json to combine them to one single JSON file.

example with ResultPath = textract_result (like configured above):

"textract_result": {
    "TextractTempOutputJsonPath": "s3://schademcdkstackpaystuban-schademcdkidpstackpaystu-bt0j5wq0zftu/textract-temp-output/c6e141e8f4e93f68321c17dcbc6bf7291d0c8cdaeb4869758604c387ce91a480"
  }

Spacy Classification

Expect a Spacy textcat model at the root of the directory. Call the script <TO_INSERT) to copy a public one which classifies Paystub and W2.

aws s3 cp s3://amazon-textract-public-content/constructs/en_textcat_demo-0.0.0.tar.gz .

How to use Workmail Integration

In order to demonstrate this functionality, I have used below architecture where once the inbound email is delivered to your Amazon workmail inbox and if the pattern/s matches, it will invoke the rule action which is inovocation of a lambda function in this case. You can use my sample code to fetch the inbound email message body and parse it properly as text.

architecture

Prerequisites

  1. As I have used Python 3.6 as my Lambda function runtime hence some knowledge of python 3 version is required.

Steps

  1. First setup an Amazon workmail site, setup an organization and create a user access by following steps mentioned in 'Getting Started' document here. Once above setup process is done, you will have access to https://your Organization.awsapps.com/mail webmail url and you can login using your created user's username / password to access your emails.
  2. Now we will create a lambda function which will be invoked once inbound email reaches the inbox and email flow rule pattern is matched (more on this in below steps). You can use the sample lambda python(3.6) code ( lambda_function.py) provided in the 'code' folder for the same. It will fetch the inbound email message body and then parse it properly to get the message body as text. Once you get it as text you can perform various operations on it.
  3. Inbound email flow rules, also called rule actions, automatically apply to all email messages sent to anyone inside of the Amazon WorkMail organization. This differs from email rules for individual mailboxes. Now we will set up email flow rules to handle email flows based on email addresses or domains. Email flow rules are based on both the sender's and recipient's email addresses or domains.

To create an email flow rule, we need to specify a rule action to apply to an email when a specified pattern is matched. Follow the documenttion link here to create email flow rule for your organization which you created in step #1 above. you have to select Action=Run Lambda for your rule. Below is the email flow rule created by me:

Email Flow Rule

you can now follow documentation link here to create pattern/s which need to be satisfied first in order to invoke the rule action (in this case it will invoke our lambda function). For this sample code functionality I have used my email address as pattern in 'origns' and my domain as pattern in 'destinations'. so in this case the lambda function will only be invoke if inbound email sender is my email address and destination is my domain only but you can set patterns as per your requirements. Below screen shots depicts my patterns:

Origin pattern

Destnation pattern

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

Built Distribution

File details

Details for the file amazon-textract-idp-cdk-constructs-0.0.18.tar.gz.

File metadata

File hashes

Hashes for amazon-textract-idp-cdk-constructs-0.0.18.tar.gz
Algorithm Hash digest
SHA256 a9509f553ef86a9bbfcfa7f2ff18ec80cab5d95498a51efadf3fa3445b50768d
MD5 91eab1614ca06f4a5416ffea40c50206
BLAKE2b-256 074e69e6831352b33fcd835c56fc741190b978cfe6743aa86688428f34660e78

See more details on using hashes here.

File details

Details for the file amazon_textract_idp_cdk_constructs-0.0.18-py3-none-any.whl.

File metadata

File hashes

Hashes for amazon_textract_idp_cdk_constructs-0.0.18-py3-none-any.whl
Algorithm Hash digest
SHA256 0437529896676c3854e223a950b7fed63da182648577120f89a3ce20be90e276
MD5 057a46b508dd3f05ed31549cdb6346f5
BLAKE2b-256 a2468348b781870b5d65355533b5385afebb1f7dd8d39b31da97a53bd5d550bd

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page