TL;DR:
This Construct allows calling Amazon Textract as part of an AWS Step Functions workflow. It is still a work in progress, so interfaces may (and most likely will) change, but it is a good starting point imho and I'm happy to hear feedback.
Built using https://projen.io/ - GitHub at: https://github.com/projen/projen
Usage
Currently published as an npm package for use in TypeScript and as a PyPI package for use in Python-based CDK stacks.
The Step Function flow expects a message with information about the location of the file to process:
{
  "s3_bucket": "<somebucket>",
  "s3_key": "someprefix/someobject.somesuffix"
}
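As a minimal sketch of how a caller might assemble that input message, the helper below serializes the two keys shown above. The function name `build_workflow_input` is hypothetical and not part of the Construct.

```python
import json


def build_workflow_input(s3_bucket: str, s3_key: str) -> str:
    """Return the JSON input message for the workflow (hypothetical helper)."""
    if not s3_bucket or not s3_key:
        raise ValueError("both s3_bucket and s3_key are required")
    # Key names match the message format shown above.
    return json.dumps({"s3_bucket": s3_bucket, "s3_key": s3_key})


if __name__ == "__main__":
    print(build_workflow_input("somebucket", "someprefix/someobject.pdf"))
```

The resulting string could, for example, be passed as the `input` argument of the boto3 Step Functions client's `start_execution` call, together with the ARN of the deployed state machine.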
The object can be either a manifest file (described below) or a supported document type (PDF, PNG, JPEG, TIFF). A supported document type triggers the default flow, which calls DetectDocumentText (https://docs.aws.amazon.com/textract/latest/dg/API_Operations.html) for single-page documents or StartDocumentTextDetection (https://docs.aws.amazon.com/textract/latest/dg/API_Operations.html) for multi-page documents.
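A caller may want to check the object key before starting the workflow. The sketch below does this by file suffix. Both the suffix list and the helper name `is_supported_document` are assumptions made for illustration; the Construct itself may identify document types differently.

```python
from pathlib import PurePosixPath

# Assumed mapping of file suffixes to the document types listed above
# (PDF, PNG, JPEG, TIFF). This is not taken from the Construct's source.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def is_supported_document(s3_key: str) -> bool:
    """Return True if the key's suffix looks like a supported document type."""
    return PurePosixPath(s3_key).suffix.lower() in SUPPORTED_SUFFIXES
```

For example, `is_supported_document("someprefix/scan.PDF")` returns `True`, while a key without a suffix returns `False`.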
The output has the following format:
{
  "TextractOutputJsonPath": "s3://somebucket/someoutputprefix/randomuuid/inputfilename.json"
}
The file at that path contains the full JSON returned from Textract, with paginated results already combined into one JSON file.
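To download the result, a consumer first needs the bucket and key from the TextractOutputJsonPath value. A minimal sketch using only the standard library follows; `split_s3_uri` is a hypothetical helper name.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The path starts with "/", which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


if __name__ == "__main__":
    output = {
        "TextractOutputJsonPath": "s3://somebucket/someoutputprefix/randomuuid/inputfilename.json"
    }
    print(split_s3_uri(output["TextractOutputJsonPath"]))
```

The returned bucket and key can then be handed to whichever S3 client the consumer uses to fetch the combined JSON.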
There are two ways to call other functionality such as Forms, Tables, Queries, AnalyzeID or AnalyzeExpense:
- Pass in "EXPENSE" or "IDENTITY" as the defaultClassification when creating the TextractStepFunctionsStartExecution, which changes the default from the Detect API to the Expense or Identity API.
- Upload a manifest file
The format of the manifest file is defined as:
{
  "S3Path": "s3://sdx-textract-us-east-1/employeeapp20210510.png",
  "TextractFeatures": [
    "FORMS",
    "TABLES",
    "QUERIES"
  ],
  "QueriesConfig": [{
    "Text": "What is the applicant full name?",
    "Alias": "FULL_NAME",
    "Pages": ["*"]
  }],
  "Classification": "SOMECLASSIFICATION"
}
For the AnalyzeDocument API, the S3Path and at least one TextractFeature are required. To execute the Expense API, the Classification has to be set to "EXPENSE".
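The rules above can be sketched as a small manifest builder. The helper name `build_manifest` is hypothetical, and the sketch assumes that an "EXPENSE" manifest does not need any TextractFeatures, which the text implies but does not state outright.

```python
from typing import Dict, List, Optional


def build_manifest(
    s3_path: str,
    features: Optional[List[str]] = None,
    queries: Optional[List[Dict]] = None,
    classification: Optional[str] = None,
) -> Dict:
    """Build a manifest dict in the format shown above (hypothetical helper)."""
    if not s3_path.startswith("s3://"):
        raise ValueError("S3Path must be an s3:// URI")
    features = list(features or [])
    # Assumption: only the Expense flow may omit TextractFeatures.
    if classification != "EXPENSE" and not features:
        raise ValueError("AnalyzeDocument needs at least one TextractFeature")
    manifest: Dict = {"S3Path": s3_path}
    if features:
        manifest["TextractFeatures"] = features
    if queries:
        manifest["QueriesConfig"] = queries
    if classification:
        manifest["Classification"] = classification
    return manifest
```

Serializing the returned dict with `json.dumps` and uploading it to S3 would give a manifest file whose key can be passed as `s3_key` in the workflow input.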
For the Identity API the following format is required:
{
  "DocumentPages": ["s3://sdx-textract-us-east-1/driverlicense.png"],
  "Classification": "IDENTITY"
}
For Identity, up to 2 document pages can be passed in.
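A minimal sketch of building the Identity manifest follows. It reads the page limit as "one or two pages", which is an interpretation of the sentence above, and `build_identity_manifest` is a hypothetical helper name.

```python
from typing import Dict, Iterable


def build_identity_manifest(document_pages: Iterable[str]) -> Dict:
    """Build an Identity manifest in the format shown above (hypothetical helper)."""
    pages = list(document_pages)
    # Assumption: at least one and at most two pages are accepted.
    if not 1 <= len(pages) <= 2:
        raise ValueError("Identity manifests take one or two document pages")
    for page in pages:
        if not page.startswith("s3://"):
            raise ValueError(f"not an S3 URI: {page!r}")
    return {"DocumentPages": pages, "Classification": "IDENTITY"}
```

A two-page call would typically pass the front and back of the same identity document.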
TypeScript sample
Sample GitHub project that uses the Textract Construct: https://github.com/schadem/schadem-cdk-stack-test
To use it, add the package as a dependency in package.json:
"dependencies": {
  "schadem-cdk-construct-sfn-test": "0.0.12"
},
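// Import paths assume AWS CDK v2; adjust them if the stack uses CDK v1.
import { Duration } from 'aws-cdk-lib';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tstep from 'schadem-cdk-construct-sfn-test';

// Inside a Stack constructor; documentBucket and the prefix variables are defined elsewhere in the stack.
const textract_task = new tstep.TextractStepFunctionsStartExecution(this, 'textract-task', {
  s3OutputBucket: documentBucket.bucketName,
  s3OutputPrefix: textractS3OutputPrefix,
  s3TempOutputPrefix: textractTemporaryS3OutputPrefix,
});
// The Textract task is the only state in this sample workflow.
const workflow_chain = sfn.Chain.start(textract_task);
const stateMachine = new sfn.StateMachine(this, 'IDPWorkflow', {
  definition: workflow_chain,
  timeout: Duration.minutes(240),
});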
Python sample
Sample GitHub project with a Stack that uses the Textract Construct: https://github.com/schadem/schadem-cdk-idp-stack-python-sample
Package name: schadem-cdk-construct-sfn-test
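# The sfctc module name is an assumption derived from the package name.
import aws_cdk.aws_stepfunctions as sfn
import schadem_cdk_construct_sfn_test as sfctc

# Inside a Stack's __init__; document_bucket and the prefix variables are defined elsewhere.
textract_task = sfctc.TextractStepFunctionsStartExecution(
    self,
    "textract-task",
    s3_output_bucket=document_bucket.bucket_name,
    s3_temp_output_prefix=s3_temp_prefix,
    s3_output_prefix=s3_output_prefix)
# The Textract task is the only state in this sample workflow.
workflow_chain = sfn.Chain.start(textract_task)
state_machine = sfn.StateMachine(self,
                                 'IDPWorkflowPython',
                                 definition=workflow_chain)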
The Construct extends sfn.TaskStateBase, similar to StepFunctionsStartExecution, and can therefore be used as a task within a Step Functions workflow. See the sample stack for usage.
Development
At the moment, essentially just run
npx projen build
to generate the packages.
Pushing or merging to the mainline branch on GitHub kicks off a pipeline that increments the version number and publishes the packages to PyPI and npm (NuGet and Maven can be added).
I reference that package in a script in the stack (install_construct_and_deploy.s), which currently has hardcoded references to the locations of the packages on my local system. Obviously that will change when we push the packages out.