Textract Form Parser
A Python library for parsing AWS Textract form output. This library helps process and analyze form data extracted by AWS Textract, generating structured outputs and reports.
Features
- Parse AWS Textract JSON output
- Extract form fields and tables
- Generate HTML reports
- Create concise and verbose outputs
- Command-line interface
- Logging and debugging support
Installation
pip install textract-form-parser
Usage
As a Library
import json

from textract_parser import analyze_document, generate_html_report, create_concise_results

# Load your Textract JSON
with open("notebook.json", "r") as f:
    textract_json = json.load(f)
# Analyze document
analysis_results = analyze_document(textract_json)
# Generate HTML report
generate_html_report(analysis_results, "report.html")
# Get concise results
concise_results = create_concise_results(analysis_results)
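For background on the input, the sketch below shows how form key/value pairs are linked inside a Textract AnalyzeDocument (FORMS) response: KEY_VALUE_SET blocks point to their text through CHILD relationships and to their value block through a VALUE relationship. It is not how textract_parser is implemented; the function name extract_key_values and the tiny sample payload are hypothetical and only illustrate the data layout.

```python
from typing import Any, Dict, List


def extract_key_values(textract_json: Dict[str, Any]) -> Dict[str, str]:
    """Pair KEY and VALUE blocks from a Textract FORMS response (illustrative only)."""
    # Index every block by Id so relationships can be resolved quickly.
    blocks = {b["Id"]: b for b in textract_json.get("Blocks", [])}

    def related_ids(block: Dict[str, Any], rel_type: str) -> List[str]:
        ids: List[str] = []
        for rel in block.get("Relationships") or []:
            if rel["Type"] == rel_type:
                ids.extend(rel["Ids"])
        return ids

    def text_of(block: Dict[str, Any]) -> str:
        # A key or value block carries its text in WORD children;
        # checkboxes appear as SELECTION_ELEMENT children.
        parts: List[str] = []
        for child_id in related_ids(block, "CHILD"):
            child = blocks.get(child_id, {})
            if child.get("BlockType") == "WORD":
                parts.append(child["Text"])
            elif child.get("BlockType") == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])
        return " ".join(parts)

    fields: Dict[str, str] = {}
    for block in blocks.values():
        if block.get("BlockType") != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        value_ids = [v for v in related_ids(block, "VALUE") if v in blocks]
        fields[text_of(block)] = " ".join(text_of(blocks[v]) for v in value_ids)
    return fields


# Hypothetical, heavily trimmed payload; real responses carry many more attributes.
sample = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Jane"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Doe"},
    ]
}

fields = extract_key_values(sample)
print(fields)
```

In practice you would pass the loaded JSON to analyze_document as shown above; this sketch is only meant to make the structure of the input easier to inspect.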
Command Line Interface
# Basic usage
textract-parser input.json -o output
# With verbose logging
textract-parser input.json -o output -v
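If you have several Textract JSON files, the CLI can be driven from a short Python script. This is a minimal sketch that uses only the flags shown above (-o and -v); the helper names build_command and run_batch are hypothetical, and treating the -o value as a per-file output name is an assumption, since its exact meaning is not documented here.

```python
import subprocess
from pathlib import Path
from typing import List


def build_command(input_json: Path, output: str, verbose: bool = False) -> List[str]:
    """Build the argument list for one textract-parser invocation."""
    cmd = ["textract-parser", str(input_json), "-o", output]
    if verbose:
        cmd.append("-v")  # verbose logging, as in the CLI example
    return cmd


def run_batch(input_dir: Path, output_root: Path, verbose: bool = False) -> None:
    """Run the CLI once per *.json file found in input_dir."""
    for input_json in sorted(input_dir.glob("*.json")):
        # Assumption: one output name per input file, derived from the file stem.
        output = output_root / input_json.stem
        # check=True raises CalledProcessError if the CLI exits with a non-zero status.
        subprocess.run(build_command(input_json, str(output), verbose), check=True)


print(build_command(Path("input.json"), "output", verbose=True))
```

Calling run_batch(Path("scans"), Path("out")) would then process every JSON file in a scans directory, provided textract-parser is installed and on your PATH.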
Development
Setup
- Clone the repository:
git clone https://github.com/yogeshvar/text-extractor.git
cd text-extractor
- Create and activate virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install development dependencies:
pip install -e ".[dev]"
- Install pre-commit hooks:
pre-commit install
Code Formatting
Format code using the provided script:
./scripts/format.sh
This will:
- Fix end of files
- Fix trailing whitespace
- Run Black formatter
- Sort imports with isort
- Stage formatted files
Testing
Run tests with coverage:
pytest --cov=textract_parser \
--cov-report=term-missing \
--html=test-results/report.html \
--self-contained-html \
-v
Commit Guidelines
We use conventional commits. Format:
<type>: <description>
[optional body]
[optional footer(s)]
Types:
- feat: New feature
- fix: Bug fix
- enhance: Enhancement
- docs: Documentation
- style: Code style
- refactor: Code refactoring
- test: Testing
- chore: Maintenance
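The format and type list above can be checked mechanically. The following is a minimal sketch of such a check, not the project's actual pre-commit hook; the names ALLOWED_TYPES, HEADER_RE, and is_valid_commit are hypothetical. It validates only the first line and, following the format shown here, does not handle an optional scope such as feat(parser).

```python
import re

# Commit types listed in the guidelines above.
ALLOWED_TYPES = ("feat", "fix", "enhance", "docs", "style", "refactor", "test", "chore")

# "<type>: <description>" with a single space after the colon.
HEADER_RE = re.compile(r"^(?P<type>[a-z]+): (?P<description>\S.*)$")


def is_valid_commit(message: str) -> bool:
    """Return True if the first line of the message follows '<type>: <description>'."""
    lines = message.splitlines()
    if not lines:
        return False
    match = HEADER_RE.match(lines[0])
    return bool(match) and match.group("type") in ALLOWED_TYPES


print(is_valid_commit("feat: add amazing feature"))
print(is_valid_commit("feature: add amazing feature"))
```

The first call accepts the message because feat is a listed type; the second rejects it because feature is not.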
Release Process
- Create a PR from your feature branch to master
- Ensure all tests pass
- Update version in:
  - setup.py
  - textract_parser/__init__.py
  - pyproject.toml
- Merge PR to master
- GitHub Actions will automatically:
  - Run tests
  - Create a new tag
  - Generate release notes
  - Create GitHub release
  - Publish to PyPI
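Because the version has to be updated in three files, it is easy for them to drift apart. Below is a minimal sketch of a consistency check you could run before merging. It assumes each file declares the version on its own line as version = "..." or __version__ = "...", which may not match the real files; VERSION_RE, extract_version, and versions_match are hypothetical names, and the demo uses in-memory strings with an example version rather than reading the repository.

```python
import re
from typing import Dict, Optional

# Assumption: the version appears at the start of a line as
#   version="X", version = "X", or __version__ = "X"
VERSION_RE = re.compile(
    r"""^\s*(?:__version__|version)\s*=\s*["']([^"']+)["']""",
    re.MULTILINE,
)


def extract_version(text: str) -> Optional[str]:
    """Return the first version string found in the text, or None."""
    match = VERSION_RE.search(text)
    return match.group(1) if match else None


def versions_match(contents: Dict[str, str]) -> bool:
    """True if every file declares a version and all declared versions agree."""
    versions = {extract_version(text) for text in contents.values()}
    return len(versions) == 1 and None not in versions


# In-memory stand-ins for the three files named in the release steps.
demo = {
    "setup.py": 'setup(\n    name="textract-form-parser",\n    version="1.2.3",\n)\n',
    "textract_parser/__init__.py": '__version__ = "1.2.3"\n',
    "pyproject.toml": '[project]\nversion = "1.2.3"\n',
}

print(versions_match(demo))
```

To use it against a checkout, you could build the same dictionary with pathlib.Path(name).read_text() for each of the three files.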
License
MIT License - see LICENSE for details
Contributing
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Format code (./scripts/format.sh)
- Commit changes (git commit -m 'feat: add amazing feature')
- Push to branch (git push origin feature/amazing-feature)
- Open a Pull Request
Authors
- Yogeshvar Senthilkumar - yogeshvar@icloud.com