amazon-textract-textractor

A package to use AWS Textract services.

Textractor

Textractor is a Python package created to work seamlessly with Amazon Textract, a document intelligence service offering text recognition, table extraction, form processing, and much more. Whether you are writing a one-off script or a complex distributed document processing pipeline, Textractor makes it easy to use Textract.

If you are looking for the other amazon-textract-* packages, you can find them using the links below:

amazon-textract-caller (to simplify calling Amazon Textract without additional dependencies)
amazon-textract-response-parser (to parse the JSON response returned by Textract APIs)
amazon-textract-overlayer (to draw bounding boxes around the document entities on the document image)
amazon-textract-prettyprinter (to convert an Amazon Textract response to CSV, text, Markdown, ...)
amazon-textract-geofinder (to extract specific information from a document with methods that help navigate the document using geometry and relations, e.g. hierarchical key/value pairs)

Installation

Textractor is available on PyPI and can be installed with pip install amazon-textract-textractor. By default, this installs the minimal version of Textractor, which is suitable for AWS Lambda execution. The following extras can be used to add features:

pandas (pip install "amazon-textract-textractor[pandas]") installs pandas which is used to enable DataFrame and CSV exports.
pdfium (pip install "amazon-textract-textractor[pdfium]") includes pypdfium2 and is the recommended way to enable PDF rasterization in Textractor. Note that this is not necessary to call Textract with a PDF file.
pdf (pip install "amazon-textract-textractor[pdf]") includes pdf2image and is an additional way to enable PDF rasterization in Textractor. Note that this is not necessary to call Textract with a PDF file.
torch (pip install "amazon-textract-textractor[torch]") includes sentence_transformers for better word search and matching. This will work on CPU but will be noticeably slower than approaches that do not rely on machine learning.
dev (pip install "amazon-textract-textractor[dev]") includes all the dependencies above and everything else needed to test the code.

You can pick several extras by separating the labels with commas, like this: pip install "amazon-textract-textractor[pdf,torch]".
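To see which of the extras listed above are present in a given environment, one option is to probe for the modules they pull in. The sketch below is not part of Textractor: `installed_extras` is a hypothetical helper, and the module names are assumed from the package names mentioned above (pandas, pypdfium2, pdf2image, sentence_transformers).

```python
from importlib.util import find_spec

# Extra name -> top-level module it is assumed to install.
EXTRA_MODULES = {
    "pandas": "pandas",
    "pdfium": "pypdfium2",
    "pdf": "pdf2image",
    "torch": "sentence_transformers",
}


def installed_extras() -> dict:
    """Report, per extra, whether its module can be found in this environment."""
    return {
        extra: find_spec(module) is not None
        for extra, module in EXTRA_MODULES.items()
    }


for extra, present in installed_extras().items():
    print(f"{extra}: {'installed' if present else 'missing'}")
```

This only checks that the modules are discoverable; it does not import them or validate their versions.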

Documentation

Generated documentation for the latest released version can be accessed here: aws-samples.github.io/amazon-textract-textractor/

Examples

The examples presented here are deliberately simple. The documentation has a much larger collection, including specific case studies, that will help you get started.

Setup

These two lines are all you need to use Textract. The Textractor instance can be reused across multiple requests for both synchronous and asynchronous requests.

from textractor import Textractor

extractor = Textractor(profile_name="default")

Text recognition

# file_source can be an image, list of images, bytes or S3 path
document = extractor.detect_document_text(file_source="tests/fixtures/single-page-1.png")
print(document.lines)
#[Textractor Test, Document, Page (1), Key - Values, Name of package: Textractor, Date : 08/14/2022, Table 1, Cell 1, Cell 2, Cell 4, Cell 5, Cell 6, Cell 7, Cell 8, Cell 9, Cell 10, Cell 11, Cell 12, Cell 13, Cell 14, Cell 15, Selection Element, Selected Checkbox, Un-Selected Checkbox]
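Because one extractor can serve many requests, a small loop is often enough to process a batch of files. The sketch below assumes each entry of `document.lines` exposes a `text` attribute; `pages_to_text` and `FakeExtractor` are hypothetical names, and the fake stands in for a real `Textractor` instance so the example runs without AWS credentials.

```python
from types import SimpleNamespace


def pages_to_text(extractor, sources):
    """Detect text in each source with one shared extractor.

    Returns a dict mapping each source to its lines joined by newlines.
    """
    results = {}
    for source in sources:
        document = extractor.detect_document_text(file_source=source)
        results[source] = "\n".join(line.text for line in document.lines)
    return results


class FakeExtractor:
    """Stand-in that mimics the shape of a detection result (illustration only)."""

    def detect_document_text(self, file_source):
        lines = [SimpleNamespace(text="Textractor Test"), SimpleNamespace(text="Document")]
        return SimpleNamespace(lines=lines)


print(pages_to_text(FakeExtractor(), ["single-page-1.png"]))
```

With a real extractor, the call would look like `pages_to_text(extractor, ["tests/fixtures/single-page-1.png"])`, and each request would be billed as usual.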

Table extraction

from textractor.data.constants import TextractFeatures

document = extractor.analyze_document(
	file_source="tests/fixtures/form.png",
	features=[TextractFeatures.TABLES]
)
# Save the first table to an Excel file for further processing
document.tables[0].to_excel("output.xlsx")
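Indexing `document.tables[0]` assumes that at least one table was found. When a page may contain zero or several tables, iterating is safer. The following sketch reuses only the `to_excel` call shown above; `export_tables` and the `table_<n>.xlsx` naming scheme are hypothetical choices.

```python
from pathlib import Path


def export_tables(tables, out_dir):
    """Write each table to its own .xlsx file and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, table in enumerate(tables):
        path = out_dir / f"table_{index}.xlsx"
        table.to_excel(str(path))  # same call as in the example above
        paths.append(path)
    return paths


# Usage with a real result (needs AWS credentials, so left commented out):
# written = export_tables(document.tables, "tables")
```

An empty `document.tables` simply yields an empty list instead of raising an `IndexError`.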

Form extraction

from textractor.data.constants import TextractFeatures

document = extractor.analyze_document(
	file_source="tests/fixtures/form.png",
	features=[TextractFeatures.FORMS]
)
# Use document.get() to search for a key with fuzzy matching
document.get("email")
# [E-mail Address : johndoe@gmail.com]
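The comment above mentions fuzzy matching. As a rough illustration of the idea only, the sketch below uses the standard library's `difflib` to match a query against a list of form keys. This is not Textractor's implementation, which this page does not describe; `fuzzy_find`, the sample keys, and the 0.5 cutoff are all assumptions made for the example.

```python
from difflib import get_close_matches


def fuzzy_find(query, keys, cutoff=0.5):
    """Return the keys whose lowercased text is similar enough to the query."""
    by_lower = {key.lower(): key for key in keys}
    hits = get_close_matches(query.lower(), list(by_lower), n=3, cutoff=cutoff)
    return [by_lower[hit] for hit in hits]


# "email" has no exact counterpart, but it is close to "E-mail Address".
print(fuzzy_find("email", ["E-mail Address", "Phone Number", "Name"]))
```

The point is that a query does not need to reproduce the printed label exactly, which matters when OCR output or form wording varies.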

Analyze ID

document = extractor.analyze_id(file_source="tests/fixtures/fake_id.png")
print(document.identity_documents[0].get("FIRST_NAME"))
# 'MARIA'
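Several fields can be read in one pass with the same `get` call. In this sketch, `collect_id_fields` is a hypothetical helper, and `LAST_NAME` and `DATE_OF_BIRTH` are assumed to follow the same naming as `FIRST_NAME`. It also assumes that `get` returns an empty or `None` value for fields that were not found.

```python
def collect_id_fields(id_document, field_names=("FIRST_NAME", "LAST_NAME", "DATE_OF_BIRTH")):
    """Look up several fields at once, skipping those that come back empty."""
    found = {}
    for name in field_names:
        value = id_document.get(name)
        if value:  # skips None and empty strings
            found[name] = value
    return found


# A plain dict stands in for an identity document here, since it also has .get().
print(collect_id_fields({"FIRST_NAME": "MARIA", "LAST_NAME": ""}))
```

With a real result, the call would be `collect_id_fields(document.identity_documents[0])`.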

Receipt processing (Analyze Expense)

document = extractor.analyze_expense(file_source="tests/fixtures/receipt.jpg")
print(document.expense_documents[0].summary_fields.get("TOTAL")[0].text)
# '$1810.46'
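The total comes back as text, currency symbol included, so it usually needs converting before any arithmetic. A minimal sketch, assuming amounts use `.` as the decimal separator and `,` only as a thousands separator; `parse_amount` is a hypothetical helper, not part of Textractor.

```python
import re
from decimal import Decimal, InvalidOperation


def parse_amount(text):
    """Convert a printed amount such as '$1810.46' to a Decimal.

    Returns None when no number can be recovered from the text.
    """
    # Keep digits, the decimal point and a minus sign; drop everything else.
    cleaned = re.sub(r"[^\d.\-]", "", text)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


print(parse_amount("$1810.46"))
```

`Decimal` is used rather than `float` so that monetary values are not subject to binary rounding.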

If your use case was not covered here or if you are looking for asynchronous usage examples, see our collection of examples.

CLI

Textractor also comes with the textractor script, which supports calling, printing and overlaying directly in the terminal.

textractor analyze-document tests/fixtures/amzn_q2.png output.json --features TABLES --overlay TABLES

[Image: overlay example]

See the documentation for more examples.
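The JSON file written by the command can be inspected with a few lines of Python. This sketch assumes that `output.json` holds the raw Textract response, that is, a dictionary with a `Blocks` list whose items carry a `BlockType`; `count_block_types` is a hypothetical helper, and the sample data is made up.

```python
import json
from collections import Counter


def count_block_types(response):
    """Count Textract blocks by BlockType in a parsed response dictionary."""
    return Counter(
        block.get("BlockType", "UNKNOWN") for block in response.get("Blocks", [])
    )


# Usage with the CLI output (left commented out because the file may not exist):
# with open("output.json") as handle:
#     print(count_block_types(json.load(handle)))

# Small made-up response so the sketch runs on its own.
sample = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "TABLE"},
        {"BlockType": "CELL"},
        {"BlockType": "CELL"},
    ]
}
print(count_block_types(sample))
```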

Tests

The package comes with tests that call the production Textract APIs. Running the tests will incur charges to your AWS account.

Acknowledgements

This library was made possible by the work of Srividhya Radhakrishna (@srividh-r).

Contributing

See CONTRIBUTING.md

Citing

Textractor can be cited using:

@software{amazontextractor,
  author = {Belval, Edouard and Delteil, Thomas and Schade, Martin and Radhakrishna, Srividhya},
  title = {{Amazon Textractor}},
  url = {https://github.com/aws-samples/amazon-textract-textractor},
  version = {1.9.2},
  year = {2025}
}

Or using the CITATION.cff file.

License

This library is licensed under the Apache 2.0 License.

Excavator image by macrovector on Freepik

Release history

Version	Released	Notes
1.9.2	Apr 24, 2025	This version
1.9.1	Mar 27, 2025
1.9.0	Mar 11, 2025
1.8.5	Nov 13, 2024
1.8.4	Nov 6, 2024
1.8.3	Aug 21, 2024
1.8.2	Jun 25, 2024
1.8.1	Jun 24, 2024
1.8.0	Jun 21, 2024
1.7.12	May 23, 2024
1.7.11	May 10, 2024
1.7.10	Apr 19, 2024
1.7.9	Mar 22, 2024
1.7.8	Mar 21, 2024
1.7.7	Mar 20, 2024
1.7.6	Mar 15, 2024
1.7.5	Mar 7, 2024
1.7.4	Feb 26, 2024
1.7.3	Feb 26, 2024
1.7.2	Feb 9, 2024
1.7.1	Jan 31, 2024
1.7.0	Jan 31, 2024
1.6.1	Dec 19, 2023
1.6.0	Dec 19, 2023	Yanked. Reason: Table to markdown bug.
1.5.1	Jan 10, 2024
1.5.0	Dec 12, 2023
1.4.5	Nov 2, 2023
1.4.4	Nov 1, 2023
1.4.3	Oct 30, 2023
1.4.2	Oct 23, 2023
1.4.1	Oct 19, 2023
1.4.0	Oct 19, 2023
1.3.7	Oct 4, 2023
1.3.6	Sep 29, 2023
1.3.5	Sep 15, 2023
1.3.4	Sep 8, 2023
1.3.3	Sep 5, 2023
1.3.2	Jul 5, 2023
1.3.1	May 24, 2023
1.3.0	Apr 18, 2023
1.2.0	Apr 12, 2023
1.1.1	Mar 2, 2023
1.1.0	Mar 1, 2023
1.0.24	Feb 10, 2023
1.0.23	Jan 13, 2023
1.0.22	Dec 19, 2022
1.0.21	Dec 15, 2022
1.0.20	Dec 15, 2022
1.0.18	Nov 4, 2022
1.0.17	Nov 2, 2022
1.0.16	Oct 26, 2022
1.0.15	Oct 21, 2022
1.0.0	Sep 28, 2022	Yanked. Reason: Wrong version number

Download files

Source Distribution

amazon-textract-textractor-1.9.2.tar.gz (304.5 kB)

Uploaded Apr 24, 2025 Source

Built Distribution

amazon_textract_textractor-1.9.2-py3-none-any.whl (311.0 kB)

Uploaded Apr 24, 2025 Python 3

File details

Details for the file amazon-textract-textractor-1.9.2.tar.gz.

File metadata

Download URL: amazon-textract-textractor-1.9.2.tar.gz
Upload date: Apr 24, 2025
Size: 304.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for amazon-textract-textractor-1.9.2.tar.gz
Algorithm	Hash digest
SHA256	`ed69f88b33a6b131454fefc7823c8eb8cf18ef29a02faef87c0023ab425faa02`
MD5	`403e45b4d37807fa439fc92a5acc8222`
BLAKE2b-256	`96bdd8009776fbb8055e8c73b87a914c138f4c60a42885a8bc50d99153e939a8`

File details

Details for the file amazon_textract_textractor-1.9.2-py3-none-any.whl.

File metadata

Download URL: amazon_textract_textractor-1.9.2-py3-none-any.whl
Upload date: Apr 24, 2025
Size: 311.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for amazon_textract_textractor-1.9.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7faf3a991588e8aad2527c93fa79115f569002d6dc0cada4cf4e57df07fb4410`
MD5	`cd38fd9d3aa7fdd0b51c30fcddd22ad3`
BLAKE2b-256	`761cf60a35c1ba5781944f5f50bb4320da8e3723387f11cb9b2bb56dde9c2961`
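
The SHA256 digests listed for the two files can be checked against a downloaded copy with the standard library. A minimal sketch: `sha256_of` and `verify_download` are hypothetical helper names, and the local path passed in is whatever location the file was saved to.

```python
import hashlib

# SHA256 digests copied from the file details above.
EXPECTED_SHA256 = {
    "amazon-textract-textractor-1.9.2.tar.gz":
        "ed69f88b33a6b131454fefc7823c8eb8cf18ef29a02faef87c0023ab425faa02",
    "amazon_textract_textractor-1.9.2-py3-none-any.whl":
        "7faf3a991588e8aad2527c93fa79115f569002d6dc0cada4cf4e57df07fb4410",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, filename):
    """Return True when the file at `path` matches the digest listed for `filename`."""
    return sha256_of(path) == EXPECTED_SHA256[filename]


# Usage, once the archive has been downloaded to the current directory:
# verify_download("amazon-textract-textractor-1.9.2.tar.gz",
#                 "amazon-textract-textractor-1.9.2.tar.gz")
```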
