some description

Project description

TL;DR:

This Construct allows calling Amazon Textract as part of an AWS Step Functions workflow. It is still a work in progress, so interfaces may (and most likely will) change. It should still be a good starting point, and feedback is welcome.

Built using projen (https://projen.io/). GitHub: https://github.com/projen/projen

Usage

Currently published as an npm package for use in TypeScript and as a PyPI package for use in Python-based CDK stacks.

The Step Function flow expects a message with information about the location of the file to process:

{
  "s3_bucket": "<somebucket>",
  "s3_key": "someprefix/someobject.somesuffix"
}

The object can either be a supported document type (PDF, PNG, JPEG, TIFF) or a manifest file (described below). For a supported document type, the default flow calls DetectDocumentText for single-page documents or the corresponding API for multi-page documents (see https://docs.aws.amazon.com/textract/latest/dg/API_Operations.html).
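
As a minimal sketch, the input message could be built with a small helper before an execution is started. The names `build_workflow_input`, `is_supported_document`, and `SUPPORTED_SUFFIXES` are hypothetical and not part of the Construct; the suffix list is an assumption based on the usual file extensions for the document types listed above.

```python
import json
from pathlib import PurePosixPath

# Assumption: common file suffixes for the document types listed above.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def build_workflow_input(s3_bucket: str, s3_key: str) -> str:
    """Return the JSON message the flow expects (hypothetical helper)."""
    if not s3_bucket or not s3_key:
        raise ValueError("s3_bucket and s3_key must both be non-empty")
    return json.dumps({"s3_bucket": s3_bucket, "s3_key": s3_key})


def is_supported_document(s3_key: str) -> bool:
    """Check the key's suffix against the assumed suffix list."""
    return PurePosixPath(s3_key).suffix.lower() in SUPPORTED_SUFFIXES
```

The returned string can then be used as the execution input, for example as the `input` argument of the boto3 Step Functions `start_execution` call.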

The output is in the following format:

{
  "TextractOutputJsonPath": "s3://somebucket/someoutputprefix/randomuuid/inputfilename.json"
}

and includes the full JSON returned from Textract. It already combines paginated results into one JSON file.
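
A downstream step usually needs the bucket and key rather than the full URI. The following is a minimal sketch of splitting `TextractOutputJsonPath`; the helper names `split_s3_uri` and `output_location` are hypothetical and only illustrate the idea.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(s3_uri: str) -> Tuple[str, str]:
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    # The path starts with "/", which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


def output_location(workflow_output: dict) -> Tuple[str, str]:
    """Return (bucket, key) of the combined Textract JSON file."""
    return split_s3_uri(workflow_output["TextractOutputJsonPath"])
```

The resulting bucket and key can be passed to any S3 client to download the combined JSON.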

To call other functionality, such as Forms, Tables, Queries, AnalyzeID, or AnalyzeExpense, there are two ways:

  1. Pass in "EXPENSE" or "IDENTITY" as the defaultClassification when creating the TextractStepFunctionsStartExecution, which will then change the default from the Detect API to the Expense or Identity one.
  2. Upload a manifest file

The format of the manifest file is defined as:

{
    "S3Path": "s3://sdx-textract-us-east-1/employeeapp20210510.png",
    "TextractFeatures": [
        "FORMS",
        "TABLES",
        "QUERIES"
    ],
    "QueriesConfig": [{
        "Text": "What is the applicant full name?",
        "Alias": "FULL_NAME",
        "Pages": ["*"]
    }],
    "Classification":"SOMECLASSIFICATION"
}

For the AnalyzeDocument API, the S3Path and at least one TextractFeature are required. To execute the Expense API, the Classification has to be set to "EXPENSE".
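
The rules above can be sketched as a small manifest builder. `build_manifest` is a hypothetical helper, not part of the Construct. It assumes that the only feature names are the three shown in the example manifest and that an "EXPENSE" manifest does not need any TextractFeatures; neither assumption is confirmed by the text above.

```python
from typing import List, Optional

# Feature names taken from the example manifest above (assumed complete).
KNOWN_FEATURES = {"FORMS", "TABLES", "QUERIES"}


def build_manifest(
    s3_path: str,
    features: Optional[List[str]] = None,
    queries: Optional[List[dict]] = None,
    classification: Optional[str] = None,
) -> dict:
    """Build a manifest dict in the format shown above (hypothetical helper)."""
    features = list(features or [])
    if not s3_path.startswith("s3://"):
        raise ValueError("S3Path must be an s3:// URI")
    unknown = set(features) - KNOWN_FEATURES
    if unknown:
        raise ValueError(f"unknown TextractFeatures: {sorted(unknown)}")
    # Assumption: only the AnalyzeDocument path needs at least one feature.
    if classification != "EXPENSE" and not features:
        raise ValueError("at least one TextractFeature is required")
    manifest: dict = {"S3Path": s3_path}
    if features:
        manifest["TextractFeatures"] = features
    if queries:
        manifest["QueriesConfig"] = queries
    if classification:
        manifest["Classification"] = classification
    return manifest
```

Serializing the result with `json.dumps` gives the manifest file content to upload.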

For the Identity API the following format is required:

{
    "DocumentPages": ["s3://sdx-textract-us-east-1/driverlicense.png"],
    "Classification":"IDENTITY"
}

For Identity, 2 document pages can be passed in.
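
A minimal sketch of building the Identity manifest follows. `build_identity_manifest` is a hypothetical helper, and it reads the sentence above as "one or two pages are allowed", which is an assumption.

```python
from typing import List


def build_identity_manifest(document_pages: List[str]) -> dict:
    """Build an Identity manifest in the format shown above (hypothetical helper)."""
    # Assumption: one or two pages are accepted.
    if not 1 <= len(document_pages) <= 2:
        raise ValueError("expected one or two document pages")
    for page in document_pages:
        if not page.startswith("s3://"):
            raise ValueError(f"not an S3 URI: {page!r}")
    return {"DocumentPages": list(document_pages), "Classification": "IDENTITY"}
```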

TypeScript sample

Sample GitHub project that uses the Textract Construct: https://github.com/schadem/schadem-cdk-stack-test

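To use, add the package as a dependency:

  "dependencies": {
    "schadem-cdk-construct-sfn-test": "0.0.12"
  },

Then add the task to a state machine inside a Stack:

// Imports are an assumption: CDK v2 and the npm package name above.
import { Duration } from 'aws-cdk-lib';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tstep from 'schadem-cdk-construct-sfn-test';

// Textract task that writes to the given bucket and prefixes.
const textract_task = new tstep.TextractStepFunctionsStartExecution(this, 'textract-task', {
    s3OutputBucket: documentBucket.bucketName,
    s3OutputPrefix: textractS3OutputPrefix,
    s3TempOutputPrefix: textractTemporaryS3OutputPrefix,
});
// The task is the only step in this workflow.
const workflow_chain = sfn.Chain.start(textract_task);
const stateMachine = new sfn.StateMachine(this, 'IDPWorkflow', {
    definition: workflow_chain,
    timeout: Duration.minutes(240),
});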

Python sample

Sample GitHub project with a Stack that uses the Textract Construct: https://github.com/schadem/schadem-cdk-idp-stack-python-sample

Package name: schadem-cdk-construct-sfn-test

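# Imports are an assumption: CDK v2, and a module name derived from the PyPI package name.
import aws_cdk.aws_stepfunctions as sfn
import schadem_cdk_construct_sfn_test as sfctc

# Textract task that writes to the given bucket and prefixes.
textract_task = sfctc.TextractStepFunctionsStartExecution(
    self,
    "textract-task",
    s3_output_bucket=document_bucket.bucket_name,
    s3_temp_output_prefix=s3_temp_prefix,
    s3_output_prefix=s3_output_prefix)
# The task is the only step in this workflow.
workflow_chain = sfn.Chain.start(textract_task)

state_machine = sfn.StateMachine(self,
                                 'IDPWorkflowPython',
                                 definition=workflow_chain)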

The Construct extends sfn.TaskStateBase, similar to StepFunctionsStartExecution, and can therefore be used as part of a Step Functions workflow. See the sample stack for a usage example.

Development

At the moment, just run

npx projen build

to generate the packages.

Pushing or merging to the mainline branch on GitHub kicks off a pipeline that increments the version number and publishes the packages to PyPI and npm (NuGet and Maven can be added).

I reference that package in a script in the stack (install_construct_and_deploy.s), which currently has hardcoded references to the package locations on my local system. That will change once the packages are published.
