Skip to main content

amazon-textract-idp-cdk-constructs

Project description

Amazon Textract IDP CDK Constructs

---

Stability: Experimental

All classes are under active development and subject to non-backward compatible changes or removal in any future version. These are not subject to the Semantic Versioning model. This means that while you may use them, you may need to update your source code when upgrading to a newer version of this package.


Context

This CDK Construct can be used as Step Function task and call Textract in Asynchonous mode for DetectText and AnalyzeDocument APIs.

For samples on usage, look at Amazon Textact IDP CDK Stack Samples

Input

Expects a Manifest JSON at 'Payload'. Manifest description: https://pypi.org/project/schadem-tidp-manifest/

Example call in Python

        textract_async_task = t_async.TextractGenericAsyncSfnTask(
            self,
            "textract-async-task",
            s3_output_bucket=s3_output_bucket,
            s3_temp_output_prefix=s3_temp_output_prefix,
            integration_pattern=sfn.IntegrationPattern.WAIT_FOR_TASK_TOKEN,
            lambda_log_level="DEBUG",
            timeout=Duration.hours(24),
            input=sfn.TaskInput.from_object({
                "Token":
                sfn.JsonPath.task_token,
                "ExecutionId":
                sfn.JsonPath.string_at('$$.Execution.Id'),
                "Payload":
                sfn.JsonPath.entire_payload,
            }),
            result_path="$.textract_result")

Output

Adds the "TextractTempOutputJsonPath" to the Step Function ResultPath. At this location the Textract output is stored as individual JSON files. Use the CDK Construct schadem-cdk-construct-sfn-textract-output-config-to-json to combine them to one single JSON file.

example with ResultPath = textract_result (like configured above):

"textract_result": {
    "TextractTempOutputJsonPath": "s3://schademcdkstackpaystuban-schademcdkidpstackpaystu-bt0j5wq0zftu/textract-temp-output/c6e141e8f4e93f68321c17dcbc6bf7291d0c8cdaeb4869758604c387ce91a480"
  }

Spacy Classification

Expect a Spacy textcat model at the root of the directory. Call the script <TO_INSERT) to copy a public one which classifies Paystub and W2.

aws s3 cp s3://amazon-textract-public-content/constructs/en_textcat_demo-0.0.0.tar.gz .

How to use Workmail Integration

In order to demonstrate this functionality, I have used below architecture where once the inbound email is delivered to your Amazon workmail inbox and if the pattern/s matches, it will invoke the rule action which is inovocation of a lambda function in this case. You can use my sample code to fetch the inbound email message body and parse it properly as text.

architecture

Prerequisites

  1. As I have used Python 3.6 as my Lambda function runtime hence some knowledge of python 3 version is required.

Steps

  1. First setup an Amazon workmail site, setup an organization and create a user access by following steps mentioned in 'Getting Started' document here. Once above setup process is done, you will have access to https://your Organization.awsapps.com/mail webmail url and you can login using your created user's username / password to access your emails.
  2. Now we will create a lambda function which will be invoked once inbound email reaches the inbox and email flow rule pattern is matched (more on this in below steps). You can use the sample lambda python(3.6) code ( lambda_function.py) provided in the 'code' folder for the same. It will fetch the inbound email message body and then parse it properly to get the message body as text. Once you get it as text you can perform various operations on it.
  3. Inbound email flow rules, also called rule actions, automatically apply to all email messages sent to anyone inside of the Amazon WorkMail organization. This differs from email rules for individual mailboxes. Now we will set up email flow rules to handle email flows based on email addresses or domains. Email flow rules are based on both the sender's and recipient's email addresses or domains.

To create an email flow rule, we need to specify a rule action to apply to an email when a specified pattern is matched. Follow the documenttion link here to create email flow rule for your organization which you created in step #1 above. you have to select Action=Run Lambda for your rule. Below is the email flow rule created by me:

Email Flow Rule

you can now follow documentation link here to create pattern/s which need to be satisfied first in order to invoke the rule action (in this case it will invoke our lambda function). For this sample code functionality I have used my email address as pattern in 'origns' and my domain as pattern in 'destinations'. so in this case the lambda function will only be invoke if inbound email sender is my email address and destination is my domain only but you can set patterns as per your requirements. Below screen shots depicts my patterns:

Origin pattern

Destnation pattern

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

Built Distribution

File details

Details for the file amazon-textract-idp-cdk-constructs-0.0.23.tar.gz.

File metadata

File hashes

Hashes for amazon-textract-idp-cdk-constructs-0.0.23.tar.gz
Algorithm Hash digest
SHA256 2d1cbd0f0c943529b8242f6062cf7d964cede19c4f75f6332d6d334172c48adc
MD5 ce8c313d4e8bfb080b803170e0774a33
BLAKE2b-256 235b59c1e6fa3ca87c2f96b213ba037bb60b27c1bb9b2af4789f574be3d0bec1

See more details on using hashes here.

File details

Details for the file amazon_textract_idp_cdk_constructs-0.0.23-py3-none-any.whl.

File metadata

File hashes

Hashes for amazon_textract_idp_cdk_constructs-0.0.23-py3-none-any.whl
Algorithm Hash digest
SHA256 fccbb64e90845f45c2d832e767b1ae4ab908e16424dc30ab21fd8ee4db1234b2
MD5 51bdcf7cede745a1829d672a10df3744
BLAKE2b-256 f9d3766336c42a4b038603aaa6083d2541aaaa36f3645d1fed701a36906695b3

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page