Amazon Textract Helper tools

Textractor-Textract-Helper

amazon-textract-helper provides a collection of ready-to-use functions and sample implementations to speed up evaluation and development for any project using Amazon Textract. It installs a command line tool called amazon-textract.

Install

> python -m pip install amazon-textract-helper

Make sure your environment is set up with AWS credentials through configuration files, environment variables, or an attached role (https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html).

Test

> amazon-textract --help
usage: amazon-textract [-h] (--input-document INPUT_DOCUMENT | --example | --stdin) [--features {FORMS,TABLES} [{FORMS,TABLES} ...]]
                       [--pretty-print {WORDS,LINES,FORMS,TABLES} [{WORDS,LINES,FORMS,TABLES} ...]]
                       [--pretty-print-table-format {csv,plain,simple,github,grid,fancy_grid,pipe,orgtbl,jira,presto,pretty,psql,rst,mediawiki,moinmoin,youtrack,html,unsafehtml,latex,latex_raw,latex_booktabs,latex_longtable,textile,tsv}]
                       [--overlay {WORD,LINE,FORM,KEY,VALUE,TABLE,CELL} [{WORD,LINE,FORM,KEY,VALUE,TABLE,CELL} ...]]
                       [--pop-up-overlay-output] [--overlay-output-folder OVERLAY_OUTPUT_FOLDER] [--version] [--no-stdout] [-v | -vv]

optional arguments:
  -h, --help            show this help message and exit
  --input-document INPUT_DOCUMENT
                        s3 object (s3://) or file from local filesystem
  --example             using the example document to call Textract
  --stdin               receive JSON from stdin
  --features {FORMS,TABLES} [{FORMS,TABLES} ...]
                        features to call Textract with. Will trigger call to AnalyzeDocument instead of DetectDocumentText
  --pretty-print {WORDS,LINES,FORMS,TABLES} [{WORDS,LINES,FORMS,TABLES} ...]
  --pretty-print-table-format {csv,plain,simple,github,grid,fancy_grid,pipe,orgtbl,jira,presto,pretty,psql,rst,mediawiki,moinmoin,youtrack,html,unsafehtml,latex,latex_raw,latex_booktabs,latex_longtable,textile,tsv}
                        which format to output the pretty print information to. Only effects FORMS and TABLES
  --overlay {WORD,LINE,FORM,KEY,VALUE,TABLE,CELL} [{WORD,LINE,FORM,KEY,VALUE,TABLE,CELL} ...]
                        defines what bounding boxes to draw on the output
  --pop-up-overlay-output
                        shows image with overlay
  --overlay-text        shows image with WORD or LINE text overlay. When both WORD and LINE overlay are specified, WORD text will be overlayed
  --overlay-confidence  shows image with confidence overlay
  --overlay-output-folder OVERLAY_OUTPUT_FOLDER
                        output with bounding boxes to folder
  --version             print version information
  --no-stdout           no output to stdout
  -v                    >=INFO level logging output to stderr
  -vv                   >=DEBUG level logging output to stderr

Sample Commands

Easy Start

> amazon-textract --example

This runs the example document using the DetectDocumentText API. The output is printed to stdout and looks similar to this:

{"DocumentMetadata": {"Pages": 1}, "Blocks": [{"BlockType": "PAGE", "Geometry": {"BoundingBox": {"Width": 1.0, "Height": 1.0, "Left": 0.0, "Top": 0.0}, "Polygon": [{"X": 9.33321120033382e-17, "Y": 0.0}, {"X": 1.0, "Y": 1.6069064689339292e-16}, {"X": 1.0, "Y": 1.0}],
"HTTPHeaders": {"x-amzn-requestid": "12345678-1234-1234-1234-123456789012", "content-type": "application/x-amz-json-1.1", "content-length": "48177", "date": "Thu, 01 Apr 2021 21:50:29 GMT"}, "RetryAttempts": 0}}

It is working.
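To check what a response contains, it can help to tally the block types. The following is a minimal sketch, assuming the JSON follows the standard Textract response shape with a top-level "Blocks" list. The function name count_block_types and the tiny sample response are illustrative assumptions, not part of this package.

```python
import json
from collections import Counter


def count_block_types(response: dict) -> Counter:
    """Count how many blocks of each BlockType a Textract response holds."""
    return Counter(block["BlockType"] for block in response.get("Blocks", []))


# Hand-written stand-in for a real response (an assumption, not real output).
sample = json.loads(
    '{"DocumentMetadata": {"Pages": 1}, "Blocks": ['
    '{"BlockType": "PAGE"}, {"BlockType": "LINE"}, '
    '{"BlockType": "WORD"}, {"BlockType": "WORD"}]}'
)

for block_type, count in sorted(count_block_types(sample).items()):
    print(block_type, count)
```

For real output, replace the sample with the JSON captured from the command above, for example by loading a saved file with json.load.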

Call with document on S3

> amazon-textract --input-document "s3://somebucket/someprefix/someobjectname.png"

The output is similar to Easy Start.

Call with document on local file system

> amazon-textract --input-document "./somepath/somefilename.png"

The output is similar to Easy Start.

We will continue to use the --example parameter to keep things simple and easy to reproduce. S3 objects and local files work the same way: use --input-document with the location instead of --example.
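The two kinds of location can be told apart by their scheme. The sketch below shows one way to do that with the standard library. It is not taken from the tool's source, and the function name classify_input is a hypothetical helper introduced only for illustration.

```python
from urllib.parse import urlparse


def classify_input(location: str) -> tuple:
    """Return ("s3", bucket, key) for s3:// URIs, else ("local", path)."""
    parsed = urlparse(location)
    if parsed.scheme == "s3":
        # The bucket is the network location; the key is the path without
        # its leading slash.
        return ("s3", parsed.netloc, parsed.path.lstrip("/"))
    return ("local", location)


print(classify_input("s3://somebucket/someprefix/someobjectname.png"))
print(classify_input("./somepath/somefilename.png"))
```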

Call with STDIN

# first create JSON
amazon-textract --example > example.json
# now use the stored JSON with the amazon-textract command
cat example.json | amazon-textract --stdin --pretty-print LINES
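The stored JSON can also be processed directly in Python. The following sketch pulls the text of every LINE block out of a response, assuming the standard Textract response shape. It approximates what a LINES listing contains and is not the tool's own implementation; extract_lines and the sample data are illustrative assumptions.

```python
from typing import List


def extract_lines(response: dict) -> List[str]:
    """Return the text of every LINE block, in the order it appears."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


# Hand-written stand-in for the contents of example.json (an assumption).
sample = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Employment Application"},
        {"BlockType": "WORD", "Text": "Employment"},
        {"BlockType": "LINE", "Text": "Full Name: Jane Doe"},
    ]
}

for line in extract_lines(sample):
    print(line)
```

To run it on the stored file, load example.json with json.load and pass the resulting dictionary to extract_lines.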

Call with FORMS and TABLES

> amazon-textract --example --features FORMS TABLES

This calls the AnalyzeDocument API (https://docs.aws.amazon.com/textract/latest/dg/API_AnalyzeDocument.html). The output looks similar to "Easy Start" but also includes FORMS and TABLES information.
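In an AnalyzeDocument response, form fields arrive as KEY_VALUE_SET blocks that are linked to each other and to WORD blocks through "Relationships". The sketch below walks those links to build key/value pairs. It assumes the standard response shape, and the helper names and sample data are illustrative, not part of this package.

```python
def child_text(block: dict, by_id: dict) -> str:
    """Join the text of all WORD blocks referenced as CHILD of a block."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = by_id[child_id]
                if child.get("BlockType") == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def extract_key_values(response: dict) -> dict:
    """Map each form KEY's text to the text of its linked VALUE block(s)."""
    by_id = {block["Id"]: block for block in response.get("Blocks", [])}
    pairs = {}
    for block in by_id.values():
        is_key = (block.get("BlockType") == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        values = [
            child_text(by_id[value_id], by_id)
            for rel in block.get("Relationships", [])
            if rel["Type"] == "VALUE"
            for value_id in rel["Ids"]
        ]
        pairs[child_text(block, by_id)] = " ".join(values)
    return pairs


# Hand-written stand-in for a real response (an assumption, not real output).
sample = {"Blocks": [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Full"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Name:"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Jane"},
    {"Id": "w4", "BlockType": "WORD", "Text": "Doe"},
]}

print(extract_key_values(sample))
```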

Pretty print the output

Pretty print outputs nicely formatted information for words, lines, forms or tables.

For example, to print the tables identified by Amazon Textract to stdout, use:

> amazon-textract --example --features TABLES --pretty-print TABLES

Output will look like this:

|------------|-----------|---------------------|-----------------|-----------------------|
|            |           | Previous Employment | History         |                       |
| Start Date | End Date  | Employer Name       | Position Held   | Reason for leaving    |
| 1/15/2009  | 6/30/2011 | Any Company         | Assistant Baker | Family relocated      |
| 7/1/2011   | 8/10/2013 | Best Corp.          | Baker           | Better opportunity    |
| 8/15/2013  | present   | Example Corp.       | Head Baker      | N/A, current employer |
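The grid above can be rebuilt from the raw JSON as well. In an AnalyzeDocument response, each TABLE block references CELL blocks that carry 1-based RowIndex and ColumnIndex values. The following is a rough sketch of that reconstruction under the standard response shape. It is not the tool's implementation, and table_rows and the sample data are hypothetical.

```python
def cell_text(cell: dict, by_id: dict) -> str:
    """Join the WORD children of a CELL block into one string."""
    words = []
    for rel in cell.get("Relationships", []):
        if rel["Type"] == "CHILD":
            words.extend(by_id[i]["Text"] for i in rel["Ids"]
                         if by_id[i].get("BlockType") == "WORD")
    return " ".join(words)


def table_rows(response: dict) -> list:
    """Return one list of rows (lists of cell strings) per TABLE block."""
    by_id = {block["Id"]: block for block in response.get("Blocks", [])}
    tables = []
    for block in by_id.values():
        if block.get("BlockType") != "TABLE":
            continue
        cells = {}
        for rel in block.get("Relationships", []):
            if rel["Type"] != "CHILD":
                continue
            for cell_id in rel["Ids"]:
                cell = by_id[cell_id]
                if cell.get("BlockType") == "CELL":
                    position = (cell["RowIndex"], cell["ColumnIndex"])
                    cells[position] = cell_text(cell, by_id)
        if not cells:
            continue
        n_rows = max(row for row, _ in cells)
        n_cols = max(col for _, col in cells)
        # Indexes are 1-based; cells without text become empty strings.
        tables.append([[cells.get((r, c), "") for c in range(1, n_cols + 1)]
                       for r in range(1, n_rows + 1)])
    return tables


# Hand-written stand-in for a real response (an assumption, not real output).
sample = {"Blocks": [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w4"]}]},
    {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2},
    {"Id": "w1", "BlockType": "WORD", "Text": "Start"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Date"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Employer"},
    {"Id": "w4", "BlockType": "WORD", "Text": "1/15/2009"},
]}

for row in table_rows(sample)[0]:
    print(" | ".join(row))
```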

To pretty print both FORMS and TABLES:

> amazon-textract --example --features FORMS TABLES --pretty-print FORMS TABLES

This outputs:

Phone Number:: 555-0100
Home Address:: 123 Any Street, Any Town, USA
Full Name:: Jane Doe
Mailing Address:: same as home address
|------------|-----------|---------------------|-----------------|-----------------------|
|            |           | Previous Employment | History         |                       |
| Start Date | End Date  | Employer Name       | Position Held   | Reason for leaving    |
| 1/15/2009  | 6/30/2011 | Any Company         | Assistant Baker | Family relocated      |
| 7/1/2011   | 8/10/2013 | Best Corp.          | Baker           | Better opportunity    |
| 8/15/2013  | present   | Example Corp.       | Head Baker      | N/A, current employer |

Overlay

At the moment, overlay only works with images. We will add support for PDF soon.

The following command runs DetectDocumentText, pretty prints the WORDS in the document to stdout, and draws a bounding box around each WORD. It displays the result in a popup window and stores it in a folder called 'overlay-output-folder-name'.

amazon-textract --example --pretty-print WORDS --overlay WORD --pop-up-overlay-output --overlay-output-folder overlay-output-folder-name
Sample overlay WORD
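Textract reports each bounding box as ratios of the page width and height, so drawing an overlay means scaling those ratios to pixel coordinates first. The following is a minimal sketch of that conversion, assuming the standard "Geometry.BoundingBox" fields (Left, Top, Width, Height). The function name to_pixel_box is hypothetical, and the tool's own drawing code may differ.

```python
def to_pixel_box(bounding_box: dict, image_width: int, image_height: int) -> tuple:
    """Convert a ratio-based Textract BoundingBox to (left, top, right, bottom) pixels."""
    left = round(bounding_box["Left"] * image_width)
    top = round(bounding_box["Top"] * image_height)
    right = round((bounding_box["Left"] + bounding_box["Width"]) * image_width)
    bottom = round((bounding_box["Top"] + bounding_box["Height"]) * image_height)
    return (left, top, right, bottom)


# Illustrative values: a box covering the middle half of a 1000x800 image row.
box = {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.25}
print(to_pixel_box(box, 1000, 800))
```

The resulting tuple has the (left, top, right, bottom) layout that common imaging libraries accept when drawing rectangles.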

The following command runs AnalyzeDocument for FORMS and TABLES, pretty prints FORMS and TABLES to stdout, and draws bounding boxes around each table CELL and each FORM KEY/VALUE. It displays the result in a popup window and stores it in the folder '../mywonderfuloutputfolderfordocs/'.

> amazon-textract --example --features TABLES FORMS --pretty-print FORMS TABLES --overlay FORM CELL --pop-up-overlay-output --overlay-output-folder ../mywonderfuloutputfolderfordocs/
Sample overlay FORM CELL

The following command draws bounding boxes around each WORD and overlays the detected WORD text. It displays the result in a popup window and stores it in a folder called 'overlay-output-folder-name'.

> amazon-textract --example --overlay WORD --overlay-text --pop-up-overlay-output --overlay-output-folder overlay-output-folder-name
Sample overlay WORD with overlay text

The following command draws bounding boxes around each LINE, overlays LINE text along with percentage confidence of the detected LINE text, and displays the result in a popup window and stores it to a folder called 'overlay-output-folder-name'.

> amazon-textract --example --overlay LINE --overlay-text --overlay-confidence --pop-up-overlay-output --overlay-output-folder overlay-output-folder-name
Sample overlay LINE with overlay text and confidence percentage
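Textract attaches a "Confidence" score between 0 and 100 to each detected block. The sketch below shows one way to combine a LINE's text with that score into a label. The label format and the function name overlay_label are assumptions made for illustration and may not match what the tool draws.

```python
def overlay_label(block: dict, show_confidence: bool = True) -> str:
    """Build a label from a block's Text and, optionally, its Confidence."""
    text = block.get("Text", "")
    if show_confidence and "Confidence" in block:
        # Confidence is already a percentage, so only rounding is needed.
        return f"{text} ({block['Confidence']:.1f}%)"
    return text


# Hand-written LINE block (an assumption, not real output).
line = {"BlockType": "LINE", "Text": "Full Name: Jane Doe", "Confidence": 99.53}
print(overlay_label(line))
print(overlay_label(line, show_confidence=False))
```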
