Parser for Amazon Textract results.

Amazon Textract Results Parser - textract-trp

The Amazon Textract Results Parser (the trp module), packaged and improved for ease of use.

TL;DR

pip install textract-trp

Requires Python 3.6 or newer.

Usage

import boto3
import trp

textract_client = boto3.client('textract')
results = textract_client.analyze_document(... your file and other params ...)
doc = trp.Document(results)
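The call above leaves the request parameters elided. As a minimal sketch, assuming the document is a local image file small enough for Textract's synchronous API, the request could be built as follows. The helper names build_request and analyze_local_file and the default feature types are illustrative assumptions and are not part of trp.

```python
from pathlib import Path


def build_request(path, feature_types=("TABLES", "FORMS")):
    """Build keyword arguments for Textract's analyze_document call.

    Hypothetical helper: reads a local file and passes its raw bytes.
    """
    return {
        "Document": {"Bytes": Path(path).read_bytes()},
        "FeatureTypes": list(feature_types),
    }


def analyze_local_file(path):
    """Send a local file to Textract and wrap the response in trp.Document."""
    # Imported here so build_request() can be used without the AWS libraries.
    import boto3
    import trp

    client = boto3.client("textract")
    response = client.analyze_document(**build_request(path))
    return trp.Document(response)
```

With AWS credentials configured, doc = analyze_local_file("invoice.png") would then give a document to explore; the file name here is only a placeholder.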

Now you can examine doc.pages. For example, print all the text detected on the first page:

print(doc.pages[0].text)

Or print out the first detected table on the page in CSV format:

for row in doc.pages[0].tables[0].rows:
    # Join the cells with commas so that no trailing comma is printed
    print(",".join(cell.text.strip() for cell in row.cells))
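Separating cell text with plain commas produces invalid CSV as soon as a cell itself contains a comma or a quote. A minimal sketch using the standard csv module handles the quoting. The function name table_to_csv is an assumption for illustration; it relies only on the rows, cells, and text attributes used in the loop above.

```python
import csv
import io


def table_to_csv(table):
    """Render a table as CSV text with proper quoting.

    Assumes `table` exposes .rows, each row exposes .cells,
    and each cell exposes .text (hypothetical helper).
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table.rows:
        # Strip the whitespace around each cell's text before writing it.
        writer.writerow([cell.text.strip() for cell in row.cells])
    return buffer.getvalue()
```

For example, print(table_to_csv(doc.pages[0].tables[0]), end="") would print the first table of the first page.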

Or retrieve text from a given position on the page. To do that, create a BoundingBox with the required coordinates, expressed relative to the page.

# Coordinates are from top-left corner [0,0] to bottom-right [1,1]
bbox = trp.BoundingBox(width=0.220, height=0.085, left=0.734, top=0.140)
lines = doc.pages[0].getLinesInBoundingBox(bbox)

# Print only the lines contained in the Bounding Box
for line in lines:
    print(line.text)
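Because the coordinates are relative to the page, a region measured in pixels on the source image has to be divided by the page dimensions first. The following sketch shows that conversion; relative_box is a hypothetical helper, and the pixel sizes are assumed to come from the image you sent to Textract.

```python
def relative_box(left_px, top_px, width_px, height_px, page_width_px, page_height_px):
    """Convert a pixel rectangle to page-relative coordinates in [0, 1].

    Hypothetical helper: returns keyword arguments named like the
    BoundingBox parameters used above.
    """
    if page_width_px <= 0 or page_height_px <= 0:
        raise ValueError("page dimensions must be positive")
    return {
        "width": width_px / page_width_px,
        "height": height_px / page_height_px,
        "left": left_px / page_width_px,
        "top": top_px / page_height_px,
    }
```

The result can be unpacked directly, for example bbox = trp.BoundingBox(**relative_box(500, 250, 250, 125, 1000, 1000)).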

Refer to the Textract blog post and to the amazon-textract-code-samples GitHub repository for more details.

Background

The Amazon blog post about Textract refers to a Python module, trp.py, which used to be quite hard to find. There are many posts on the internet from people looking for the module, often confused by the "other trp module" that has nothing to do with Textract.

Hence I decided to package and publish the trp.py module from the aws-samples/amazon-textract-code-samples repository. Fortunately, its MIT license permits that.

Over time I have made some improvements to the module for ease of use.

Maintainer

Michael Ludvig
