
A package to use AWS Textract services.


Textractor


Textractor is a Python package created to work seamlessly with Amazon Textract, a document intelligence service offering text recognition, table extraction, form processing, and much more. Whether you are writing a one-off script or a complex distributed document processing pipeline, Textractor makes it easy to use Textract.

If you are looking for the other amazon-textract-* packages, you can find them using the links below:

Installation

Textractor is available on PyPI and can be installed with pip install amazon-textract-textractor. By default this installs the minimal version of Textractor, which is suitable for AWS Lambda execution. The following extras add features:

  • pandas (pip install "amazon-textract-textractor[pandas]") installs pandas, which enables DataFrame and CSV exports.
  • pdf (pip install "amazon-textract-textractor[pdf]") includes pdf2image and enables PDF rasterization in Textractor. Note that this is not necessary to call Textract with a PDF file.
  • torch (pip install "amazon-textract-textractor[torch]") includes sentence_transformers for better word search and matching. This works on CPU but is noticeably slower than the approaches that do not use machine learning.
  • dev (pip install "amazon-textract-textractor[dev]") includes all the dependencies above and everything else needed to test the code.

You can pick several extras by separating the labels with commas, like this: pip install "amazon-textract-textractor[pdf,torch]".
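
Because the extras are optional, a script may want to check at runtime which of them are present before using a feature that depends on one. A minimal sketch using only the standard library follows; the helper name available_extras is hypothetical, and the import names for the pdf and torch extras (pdf2image, sentence_transformers) are taken from the descriptions above.

```python
from importlib.util import find_spec

# Import name that each extra is described as providing (per the list above).
EXTRA_MODULES = {
    "pandas": "pandas",
    "pdf": "pdf2image",
    "torch": "sentence_transformers",
}


def available_extras() -> dict[str, bool]:
    """Report which optional dependencies can be imported in this environment."""
    return {
        extra: find_spec(module) is not None
        for extra, module in EXTRA_MODULES.items()
    }


if __name__ == "__main__":
    for extra, present in available_extras().items():
        print(f"{extra}: {'installed' if present else 'missing'}")
```

The check only looks for importable modules; it does not verify versions.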

Documentation

Generated documentation for the latest released version can be accessed here: aws-samples.github.io/amazon-textract-textractor/

Examples

The examples presented here are deliberately simple; the documentation has a much larger collection, including specific case studies, that will help you get started.

Setup

These two lines are all you need to use Textract. The same Textractor instance can be reused across multiple requests, both synchronous and asynchronous.

from textractor import Textractor

extractor = Textractor(profile_name="default")
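
Reusing one instance across requests could look like the sketch below. The helper recognize_all is hypothetical; it assumes only an object exposing detect_document_text(file_source=...), as the Textractor instance above does, so it can be exercised with a stand-in object without calling AWS.

```python
from typing import Any, Iterable


def recognize_all(extractor: Any, sources: Iterable[str]) -> dict[str, Any]:
    """Run text detection on each source with one shared extractor instance."""
    results = {}
    for source in sources:
        # One synchronous call per source; the same instance is reused each time.
        results[source] = extractor.detect_document_text(file_source=source)
    return results


# Usage with the instance created above (makes real, billable Textract calls):
# documents = recognize_all(extractor, ["tests/fixtures/single-page-1.png"])
```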

Text recognition

# file_source can be an image, list of images, bytes or S3 path
document = extractor.detect_document_text(file_source="tests/fixtures/single-page-1.png")
print(document.lines)
#[Textractor Test, Document, Page (1), Key - Values, Name of package: Textractor, Date : 08/14/2022, Table 1, Cell 1, Cell 2, Cell 4, Cell 5, Cell 6, Cell 7, Cell 8, Cell 9, Cell 10, Cell 11, Cell 12, Cell 13, Cell 14, Cell 15, Selection Element, Selected Checkbox, Un-Selected Checkbox]

Table extraction

from textractor.data.constants import TextractFeatures

document = extractor.analyze_document(
	file_source="tests/fixtures/form.png",
	features=[TextractFeatures.TABLES]
)
# Save the first table to an Excel file for further processing
document.tables[0].to_excel("output.xlsx")
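
The example above saves only the first table. To export every detected table, a loop over document.tables is enough. This is a sketch: export_tables and the file-naming pattern are hypothetical, and it relies only on the tables list and the to_excel(path) call shown above.

```python
from pathlib import Path
from typing import Any


def export_tables(document: Any, out_dir: str) -> list[Path]:
    """Write each table in the document to its own Excel file; return the paths."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, table in enumerate(document.tables, start=1):
        path = target / f"table_{index}.xlsx"
        table.to_excel(str(path))
        paths.append(path)
    return paths


# Usage with the document returned by analyze_document above:
# export_tables(document, "tables")
```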

Form extraction

from textractor.data.constants import TextractFeatures

document = extractor.analyze_document(
	file_source="tests/fixtures/form.png",
	features=[TextractFeatures.FORMS]
)
# Use document.get() to search for a key with fuzzy matching
document.get("email")
# [E-mail Address : johndoe@gmail.com]
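
To give a feel for what fuzzy key matching means, the sketch below scores a query against candidate keys with the standard library's difflib. It is an illustration only and not Textractor's implementation; best_key, the sample keys, and the 0.5 threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional


def best_key(query: str, keys: Iterable[str], threshold: float = 0.5) -> Optional[str]:
    """Return the key most similar to the query, or None if none reaches the threshold."""
    best, best_score = None, 0.0
    for key in keys:
        # Compare case-insensitively; ratio() is 1.0 for identical strings.
        score = SequenceMatcher(None, query.lower(), key.lower()).ratio()
        if score > best_score:
            best, best_score = key, score
    return best if best_score >= threshold else None


print(best_key("email", ["Name", "Date", "E-mail Address"]))
```

A threshold keeps unrelated keys from being returned when nothing in the form resembles the query.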

Analyze ID

document = extractor.analyze_id(file_source="tests/fixtures/fake_id.png")
print(document.identity_documents[0].get("FIRST_NAME"))
# 'MARIA'

Receipt processing (Analyze Expense)

document = extractor.analyze_expense(file_source="tests/fixtures/receipt.jpg")
print(document.expense_documents[0].summary_fields.get("TOTAL")[0].text)
# '$1810.46'
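
The total comes back as text, so arithmetic needs a conversion step. A minimal sketch, assuming amounts shaped like the '$1810.46' shown above (an optional currency symbol, optional thousands separators, and a dot as the decimal mark); parse_amount is a hypothetical helper, not part of Textractor.

```python
import re
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Convert a string such as '$1810.46' or '1,810.46' to a Decimal."""
    # Drop everything except digits, the decimal point, and a minus sign.
    cleaned = re.sub(r"[^0-9.\-]", "", text)
    if not cleaned:
        raise ValueError(f"no numeric amount found in {text!r}")
    return Decimal(cleaned)


total = parse_amount("$1810.46")
print(total)  # 1810.46
```

Decimal is used instead of float so that currency values are not subject to binary rounding.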

If your use case was not covered here or if you are looking for asynchronous usage examples, see our collection of examples.

CLI

Textractor also comes with the textractor script, which supports calling Textract, printing the results, and overlaying them on the input document directly from the terminal.

textractor analyze-document tests/fixtures/amzn_q2.png output.json --features TABLES --overlay TABLES


See the documentation for more examples.
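
The same command can also be driven from a Python script, for example in a batch job. The sketch below assembles the argument list from the invocation shown above and runs it only when the textractor executable is on the PATH; build_command and run_if_available are hypothetical helpers, and no flags beyond --features and --overlay are assumed.

```python
import shutil
import subprocess


def build_command(source: str, output: str, feature: str, overlay: str) -> list[str]:
    """Assemble the analyze-document invocation as an argument list."""
    return [
        "textractor", "analyze-document", source, output,
        "--features", feature,
        "--overlay", overlay,
    ]


def run_if_available(command: list[str]) -> bool:
    """Run the command if its executable is installed; return whether it ran."""
    if shutil.which(command[0]) is None:
        return False
    # This calls the real Textract API, so the request is billable.
    subprocess.run(command, check=True)
    return True


command = build_command("tests/fixtures/amzn_q2.png", "output.json", "TABLES", "TABLES")
# run_if_available(command)
```

Passing the arguments as a list avoids shell quoting issues with file paths.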

Tests

The package comes with tests that call the production Textract APIs. Running the tests will incur charges to your AWS account.

Acknowledgements

This library was made possible by the work of Srividhya Radhakrishna (@srividh-r).

Contributing

See CONTRIBUTING.md

License

This library is licensed under the Apache 2.0 License.

Excavator image by macrovector on Freepik

