A Python library for parsing AWS Textract form output
Textract Form Parser
A Python library for parsing AWS Textract form output. This library helps process and analyze form data extracted by AWS Textract, generating structured outputs and reports.
Features
- Parse AWS Textract JSON output
- Extract form fields and tables
- Generate HTML reports
- Create concise and verbose outputs
- Command-line interface
- Logging and debugging support
Installation
pip install textract-form-parser
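After installing, a quick way to confirm that the distribution is visible to the current interpreter is to query its metadata. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of this package.

```python
from importlib import metadata


def installed_version(dist_name="textract-form-parser"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the version, or a hint when the package is missing from this environment.
    print(installed_version() or "textract-form-parser is not installed")
```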
Usage
As a Library
import json
from textract_parser import analyze_document, generate_html_report, create_concise_results
# Load your Textract JSON
with open("notebook.json", "r") as f:
    textract_json = json.load(f)
# Analyze document
analysis_results = analyze_document(textract_json)
# Generate HTML report
generate_html_report(analysis_results, "report.html")
# Get concise results
concise_results = create_concise_results(analysis_results)
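To give a feel for the input that `analyze_document` receives, the sketch below walks a Textract response by hand and pairs form keys with their values. It is not this library's implementation, only an illustration of the `Blocks` / `KEY_VALUE_SET` / `Relationships` layout that Textract form output uses. The function names and the tiny `sample` document are made up for the example.

```python
def _text_of(block, blocks_by_id):
    """Join the text of a block's CHILD blocks (checkboxes yield their status)."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])
    return " ".join(parts)


def extract_key_values(textract_json):
    """Map each KEY block's text to the text of the VALUE block(s) it points to."""
    blocks_by_id = {b["Id"]: b for b in textract_json.get("Blocks", [])}
    pairs = {}
    for block in blocks_by_id.values():
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        value_parts = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    value_parts.append(_text_of(blocks_by_id[value_id], blocks_by_id))
        pairs[_text_of(block, blocks_by_id)] = " ".join(value_parts)
    return pairs


# Hand-written stand-in for a Textract response (illustrative only).
sample = {"Blocks": [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Jane"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Doe"},
]}

print(extract_key_values(sample))  # {'Name:': 'Jane Doe'}
```

The library functions above presumably handle this linking, along with tables and reporting, so a hand-rolled walk like this is mainly useful for debugging unexpected input.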
Command Line Interface
# Basic usage
textract-parser input.json -o output
# With verbose logging
textract-parser input.json -o output -v
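When several Textract JSON files need processing, the CLI can be driven from a small Python script. The sketch below uses only the `-o` and `-v` options shown above. The helper names are hypothetical, and it assumes that `-o` accepts a per-file output location, which the usage above does not spell out.

```python
import subprocess
from pathlib import Path


def build_command(input_json, output, verbose=False):
    """Build the argument list for one textract-parser invocation."""
    cmd = ["textract-parser", str(input_json), "-o", str(output)]
    if verbose:
        cmd.append("-v")
    return cmd


def run_batch(input_dir, output_root, verbose=False):
    """Run the CLI once per *.json file in input_dir (assumes the CLI is on PATH)."""
    for path in sorted(Path(input_dir).glob("*.json")):
        # Assumption: one output location per input file, named after the file stem.
        target = Path(output_root) / path.stem
        subprocess.run(build_command(path, target, verbose), check=True)


if __name__ == "__main__":
    print(build_command("input.json", "output", verbose=True))
```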
Development
Setup
- Clone the repository:
git clone https://github.com/yogeshvar/text-extractor.git
cd text-extractor
- Create and activate virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install development dependencies:
pip install -e ".[dev]"
- Install pre-commit hooks:
pre-commit install
Code Formatting
Format code using the provided script:
./scripts/format.sh
This will:
- Fix end of files
- Fix trailing whitespace
- Run Black formatter
- Sort imports with isort
- Stage formatted files
Testing
Run tests with coverage:
pytest --cov=textract_parser \
--cov-report=term-missing \
--html=test-results/report.html \
--self-contained-html \
-v
Commit Guidelines
We use conventional commits. Format:
<type>: <description>
[optional body]
[optional footer(s)]
Types:
- feat: New feature
- fix: Bug fix
- enhance: Enhancement
- docs: Documentation
- style: Code style
- refactor: Code refactoring
- test: Testing
- chore: Maintenance
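The header format and the type list above can be checked mechanically. The following is a minimal sketch of such a check, not the project's actual pre-commit configuration. The names `COMMIT_TYPES`, `HEADER_RE` and `is_valid_commit` are hypothetical, and scoped headers such as `feat(parser): ...` are rejected because the format above does not show them.

```python
import re

# Types taken from the list above.
COMMIT_TYPES = ("feat", "fix", "enhance", "docs", "style", "refactor", "test", "chore")

# "<type>: <description>" with exactly one space after the colon
# and a description that starts with a non-space character.
HEADER_RE = re.compile(r"^(?P<type>%s): (?P<description>\S.*)$" % "|".join(COMMIT_TYPES))


def is_valid_commit(message):
    """Return True if the first line of the message matches '<type>: <description>'."""
    lines = message.splitlines()
    header = lines[0] if lines else ""
    return HEADER_RE.match(header) is not None


if __name__ == "__main__":
    for msg in ("feat: add amazing feature", "feature: add amazing feature"):
        print(repr(msg), is_valid_commit(msg))
```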
Release Process
- Create a PR from your feature branch to master
- Ensure all tests pass
- Update version in:
  - setup.py
  - textract_parser/__init__.py
  - pyproject.toml
- Merge PR to master
- GitHub Actions will automatically:
  - Run tests
  - Create a new tag
  - Generate release notes
  - Create GitHub release
  - Publish to PyPI
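Because the version has to be updated in three files, a small consistency check before merging can catch a missed one. The sketch below is an illustration only. It assumes each file contains a line of the form `version = "..."` or `__version__ = "..."`, which may not match how this repository actually declares its version, and all helper names are hypothetical.

```python
import re
from pathlib import Path

# Files named in the release steps above.
VERSION_FILES = ("setup.py", "textract_parser/__init__.py", "pyproject.toml")

# Assumption: the version appears as `version = "x.y.z"` or `__version__ = "x.y.z"`
# at the start of a line (leading whitespace allowed).
VERSION_RE = re.compile(
    r"""^\s*(?:__version__|version)\s*=\s*["']([^"']+)["']""", re.MULTILINE
)


def find_version(text):
    """Return the first version string found in text, or None."""
    match = VERSION_RE.search(text)
    return match.group(1) if match else None


def check_versions(root="."):
    """Map each version file to the version it declares (None if missing or not found)."""
    found = {}
    for name in VERSION_FILES:
        path = Path(root) / name
        found[name] = find_version(path.read_text(encoding="utf-8")) if path.exists() else None
    return found


def versions_agree(found):
    """True only if every file declares a version and all of them are identical."""
    values = set(found.values())
    return len(values) == 1 and None not in values


if __name__ == "__main__":
    result = check_versions()
    print(result, "OK" if versions_agree(result) else "MISMATCH")
```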
License
MIT License - see LICENSE for details
Contributing
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Format code (./scripts/format.sh)
- Commit changes (git commit -m 'feat: add amazing feature')
- Push to branch (git push origin feature/amazing-feature)
- Open a Pull Request
Authors
- Yogeshvar Senthilkumar - yogeshvar@icloud.com
File details
Details for the file textract_form_parser-0.1.6.tar.gz.
File metadata
- Download URL: textract_form_parser-0.1.6.tar.gz
- Upload date:
- Size: 548.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9cab0c380c2af01248871add81d1d069ccf5a37f72f9203a0cba4608e1dc94a1 |
| MD5 | 37cd554c4f794263dd0ccae280be06c2 |
| BLAKE2b-256 | efeff7f652ec626008f1b736ed79b153e5f09e17e77a463c397edf1a9e78591d |
File details
Details for the file textract_form_parser-0.1.6-py3-none-any.whl.
File metadata
- Download URL: textract_form_parser-0.1.6-py3-none-any.whl
- Upload date:
- Size: 26.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4b1fbba3e9352129d454a71b5d8f826129c83e9388dabf39bdff0694d08eed7b |
| MD5 | 29b70909fa868e663a113f8b09e84473 |
| BLAKE2b-256 | 509ea9fb358d3dbf7a50dbe4f4b13de4fc1bf6c49ce833765e303b9771793b77 |
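The SHA256 digests listed for each file can be used to verify a manual download. A minimal sketch with the standard library follows. The helper names are hypothetical, and the local file path in the usage lines is an assumption about where the file was saved.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage, assuming the sdist was downloaded into the current directory:
# matches_published_digest(
#     "textract_form_parser-0.1.6.tar.gz",
#     "9cab0c380c2af01248871add81d1d069ccf5a37f72f9203a0cba4608e1dc94a1",
# )
```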