Haystack integration for AWS Textract document text extraction and analysis
amazon-textract-haystack
Overview
A Haystack integration for AWS Textract that extracts text and structured data from documents using OCR.
The AmazonTextractConverter component converts images and single-page PDFs into Haystack Document objects using the AWS Textract synchronous API.
Supported file formats: JPEG, PNG, TIFF, BMP, and single-page PDF (up to 10 MB).
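The format and size limits above can be checked locally before any request is sent. The following is a minimal sketch, not part of the integration: `check_source`, `SUPPORTED_SUFFIXES`, and `MAX_BYTES` are hypothetical names, and the 10 MB limit is interpreted as 10 MiB, which is an assumption. It does not verify that a PDF has a single page.

```python
from pathlib import Path

# Limits taken from the list of supported formats above; the names are illustrative.
SUPPORTED_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp", ".pdf"}
MAX_BYTES = 10 * 1024 * 1024  # 10 MB, read here as 10 MiB (an assumption)


def check_source(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    p = Path(path)
    problems = []
    if p.suffix.lower() not in SUPPORTED_SUFFIXES:
        problems.append(f"unsupported extension: {p.suffix or '(none)'}")
    # The size check only applies when the file exists on disk.
    if p.is_file() and p.stat().st_size > MAX_BYTES:
        problems.append(f"file is larger than {MAX_BYTES} bytes")
    return problems
```

Sources that return an empty list can then be passed to `converter.run(sources=[...])`.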
Installation
pip install amazon-textract-haystack
Usage
Basic text extraction
Extract plain text from a document using DetectDocumentText:
from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter
converter = AmazonTextractConverter()
results = converter.run(sources=["document.png"])
documents = results["documents"]
print(documents[0].content)
Table and form analysis
Use AnalyzeDocument to detect tables and forms by setting feature_types:
converter = AmazonTextractConverter(feature_types=["TABLES", "FORMS"])
results = converter.run(sources=["invoice.png"])
documents = results["documents"]
raw_responses = results["raw_textract_response"]
Valid feature_types values: "TABLES", "FORMS", "SIGNATURES", "LAYOUT".
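When "TABLES" is requested, the raw Textract response describes each table as a TABLE block whose CELL children carry RowIndex and ColumnIndex, and each CELL links to its WORD blocks. A minimal sketch of rebuilding rows from that structure follows. It assumes each entry in raw_textract_response is the response dictionary as returned by Textract (with a "Blocks" list); `tables_from_response` is a hypothetical helper, not part of the integration, and it ignores merged cells.

```python
def tables_from_response(response: dict) -> list[list[list[str]]]:
    """Rebuild each TABLE block as a list of rows, each row a list of cell strings."""
    blocks = {b["Id"]: b for b in response.get("Blocks", [])}

    def child_ids(block: dict) -> list[str]:
        # Collect the IDs from every CHILD relationship of a block.
        return [
            i
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for i in rel["Ids"]
        ]

    tables = []
    for block in blocks.values():
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for cell_id in child_ids(block):
            cell = blocks[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                blocks[w]["Text"]
                for w in child_ids(cell)
                if blocks[w]["BlockType"] == "WORD"
            ]
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        # Textract row and column indices start at 1.
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        tables.append(
            [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
             for r in range(1, n_rows + 1)]
        )
    return tables
```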
Natural-language queries
Ask questions about a document and get extracted answers. The QUERIES feature type
is enabled automatically when you pass the queries parameter at runtime:
converter = AmazonTextractConverter()
results = converter.run(
    sources=["medical_form.png"],
    queries=["What is the patient name?", "What is the date of birth?"],
)
documents = results["documents"]
raw_responses = results["raw_textract_response"]
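In a raw Textract response, each question appears as a QUERY block that points to its QUERY_RESULT blocks through an ANSWER relationship. The sketch below pairs questions with answers from one such response. It assumes each entry in raw_textract_response is the response dictionary as returned by Textract; `answers_from_response` is a hypothetical helper, not part of the integration.

```python
def answers_from_response(response: dict) -> dict[str, list[str]]:
    """Map each query text to the answer texts Textract linked to it."""
    blocks = {b["Id"]: b for b in response.get("Blocks", [])}
    answers = {}
    for block in blocks.values():
        if block["BlockType"] != "QUERY":
            continue
        question = block["Query"]["Text"]
        # A query with no ANSWER relationship maps to an empty list.
        answers[question] = [
            blocks[i].get("Text", "")
            for rel in block.get("Relationships", [])
            if rel["Type"] == "ANSWER"
            for i in rel["Ids"]
            if i in blocks
        ]
    return answers
```

A question that Textract could not answer maps to an empty list, which makes missing fields easy to spot.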
Queries can be combined with feature_types for both structural and question-based extraction:
converter = AmazonTextractConverter(feature_types=["TABLES", "FORMS"])
results = converter.run(
    sources=["invoice.png"],
    queries=["What is the total amount due?"],
)
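To illustrate how queries and feature_types combine, here is a rough sketch of the keyword arguments that an AnalyzeDocument request takes in boto3 (`Document`, `FeatureTypes`, `QueriesConfig`). This is not the component's actual implementation; `build_analyze_request` is a hypothetical helper that only mirrors the behavior described above, where "QUERIES" is added when queries are present.

```python
def build_analyze_request(image_bytes: bytes, feature_types=None, queries=None) -> dict:
    """Assemble keyword arguments for a Textract AnalyzeDocument call."""
    features = list(feature_types or [])
    request = {"Document": {"Bytes": image_bytes}}
    if queries:
        # Queries require the QUERIES feature type alongside any others.
        if "QUERIES" not in features:
            features.append("QUERIES")
        request["QueriesConfig"] = {"Queries": [{"Text": q} for q in queries]}
    request["FeatureTypes"] = features
    return request
```

The resulting dictionary could be passed as `client.analyze_document(**request)` on a boto3 Textract client. With no feature types and no queries, the feature list is empty, which corresponds to the plain-text case handled by DetectDocumentText instead.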
In a Haystack pipeline
from haystack import Pipeline
from haystack.components.preprocessors import DocumentCleaner
from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter
pipeline = Pipeline()
pipeline.add_component("converter", AmazonTextractConverter())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.connect("converter.documents", "cleaner.documents")
result = pipeline.run({"converter": {"sources": ["scan.png"]}})
AWS Credentials
The component uses the standard boto3 credential chain. You can configure credentials in any of these ways:
- Environment variables (default): Set AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION.
- AWS credentials file: Configure via ~/.aws/credentials and ~/.aws/config.
- IAM role: When running on AWS infrastructure (EC2, Lambda, ECS).
- Explicit parameters:
from haystack.utils import Secret
converter = AmazonTextractConverter(
    aws_access_key_id=Secret.from_env_var("MY_AWS_KEY"),
    aws_secret_access_key=Secret.from_env_var("MY_AWS_SECRET"),
    aws_region_name=Secret.from_token("us-east-1"),
)
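If you rely on the environment-variable route, a quick check can report which variables are unset before the converter is created. This is a small sketch using only the standard library; `missing_aws_env` is a hypothetical helper, and an empty result is not required when credentials come from a credentials file or an IAM role.

```python
import os

# The three variables named in the environment-variable option above.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")


def missing_aws_env(environ=None) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_aws_env()` with no arguments inspects the current process environment.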
Running Tests
Unit tests (no AWS credentials needed):
cd integrations/amazon_textract
hatch run test:unit
Integration tests (require AWS credentials and a test image at tests/test_files/sample_text.png):
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=us-east-1
hatch run test:integration
Contributing
Refer to the general Contribution Guidelines.
File details
Details for the file amazon_textract_haystack-1.0.0.tar.gz.
File metadata
- Download URL: amazon_textract_haystack-1.0.0.tar.gz
- Upload date:
- Size: 37.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8fea83f38c71b7d391c5a1a806041adbfb1958d3d57fbcc75f9099ff29842b0a |
| MD5 | 35f6c274363e46f7d22cf079e496625f |
| BLAKE2b-256 | 2478ba6dcf2e79cbb1392f5a2922593864009f3b223f5dfcc141c160e312408d |
Provenance
The following attestation bundles were made for amazon_textract_haystack-1.0.0.tar.gz:
Publisher: CI_pypi_release.yml on deepset-ai/haystack-core-integrations
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: amazon_textract_haystack-1.0.0.tar.gz
- Subject digest: 8fea83f38c71b7d391c5a1a806041adbfb1958d3d57fbcc75f9099ff29842b0a
- Sigstore transparency entry: 1602068951
- Sigstore integration time:
- Permalink: deepset-ai/haystack-core-integrations@4582a3259de652e45b2410b05c252914ce8edaed
- Branch / Tag: refs/tags/integrations/amazon_textract-v1.0.0
- Owner: https://github.com/deepset-ai
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: CI_pypi_release.yml@4582a3259de652e45b2410b05c252914ce8edaed
- Trigger Event: push
File details
Details for the file amazon_textract_haystack-1.0.0-py3-none-any.whl.
File metadata
- Download URL: amazon_textract_haystack-1.0.0-py3-none-any.whl
- Upload date:
- Size: 11.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4e05b20d2021ad5ad1801f7ff7f4048b086bc6e39d55eb5318737ad2ef9dfb40 |
| MD5 | b41b3167e2b395e77b6fbd74f7202ce7 |
| BLAKE2b-256 | e119becb95335694fdd4fee40ed73210bcc90f44385c6862edcba91ea99ebb56 |
Provenance
The following attestation bundles were made for amazon_textract_haystack-1.0.0-py3-none-any.whl:
Publisher: CI_pypi_release.yml on deepset-ai/haystack-core-integrations
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: amazon_textract_haystack-1.0.0-py3-none-any.whl
- Subject digest: 4e05b20d2021ad5ad1801f7ff7f4048b086bc6e39d55eb5318737ad2ef9dfb40
- Sigstore transparency entry: 1602068955
- Sigstore integration time:
- Permalink: deepset-ai/haystack-core-integrations@4582a3259de652e45b2410b05c252914ce8edaed
- Branch / Tag: refs/tags/integrations/amazon_textract-v1.0.0
- Owner: https://github.com/deepset-ai
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: CI_pypi_release.yml@4582a3259de652e45b2410b05c252914ce8edaed
- Trigger Event: push