A Python library for parsing AWS Textract form output

Textract Form Parser

A Python library for parsing AWS Textract form output. This library helps process and analyze form data extracted by AWS Textract, generating structured outputs and reports.

Features

  • Parse AWS Textract JSON output
  • Extract form fields and tables
  • Generate HTML reports
  • Create concise and verbose outputs
  • Command-line interface
  • Logging and debugging support

Installation

pip install textract-form-parser

Usage

As a Library

import json

from textract_parser import analyze_document, generate_html_report, create_concise_results

# Load your Textract JSON
with open("notebook.json", "r") as f:
    textract_json = json.load(f)

# Analyze document
analysis_results = analyze_document(textract_json)

# Generate HTML report
generate_html_report(analysis_results, "report.html")

# Get concise results
concise_results = create_concise_results(analysis_results)
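
For context on the input: a Textract AnalyzeDocument response is a dict with a "Blocks" list, in which form fields appear as KEY_VALUE_SET blocks linked to their values and words through "Relationships". The sketch below walks a tiny hand-written sample to show that linkage. It is an illustration of the input format only, not this library's implementation; the helper names block_text and extract_key_values are hypothetical.

```python
def block_text(block, blocks_by_id):
    """Join the text of the WORD blocks listed as CHILD of this block."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def extract_key_values(textract_json):
    """Map each KEY block's text to the text of its linked VALUE blocks."""
    blocks_by_id = {b["Id"]: b for b in textract_json["Blocks"]}
    fields = {}
    for block in textract_json["Blocks"]:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        key = block_text(block, blocks_by_id)
        value_parts = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    value_parts.append(block_text(blocks_by_id[value_id], blocks_by_id))
        fields[key] = " ".join(value_parts)
    return fields


# Hand-written sample in the shape of a Textract response (not real output).
sample = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Jane"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Doe"},
    ]
}

print(extract_key_values(sample))  # {'Name:': 'Jane Doe'}
```

A file that lacks a top-level "Blocks" list is probably not a raw Textract response, which is worth checking before calling analyze_document.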

Command Line Interface

# Basic usage
textract-parser input.json -o output

# With verbose logging
textract-parser input.json -o output -v

Development

Setup

  1. Clone the repository:
git clone https://github.com/yogeshvar/text-extractor.git
cd text-extractor
  2. Create and activate virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install development dependencies:
pip install -e ".[dev]"
  4. Install pre-commit hooks:
pre-commit install

Code Formatting

Format code using the provided script:

./scripts/format.sh

This will:

  • Fix end of files
  • Fix trailing whitespace
  • Run Black formatter
  • Sort imports with isort
  • Stage formatted files

Testing

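Run tests with coverage (the --cov options come from the pytest-cov plugin and the --html options from pytest-html, both presumably installed with the development dependencies):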

pytest --cov=textract_parser \
      --cov-report=term-missing \
      --html=test-results/report.html \
      --self-contained-html \
      -v

Commit Guidelines

We use conventional commits. Format:

<type>: <description>

[optional body]
[optional footer(s)]

Types:

  • feat: New feature
  • fix: Bug fix
  • enhance: Enhancement
  • docs: Documentation
  • style: Code style
  • refactor: Code refactoring
  • test: Testing
  • chore: Maintenance
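
The header format above can be checked mechanically. The following is a minimal sketch of such a check, not a hook shipped with this repository; the function name is_valid_commit_message is hypothetical, and the sketch only accepts the plain "<type>: <description>" form shown here (no scopes such as "feat(parser): ...").

```python
import re

# Commit types taken from the list above.
COMMIT_TYPES = ("feat", "fix", "enhance", "docs", "style", "refactor", "test", "chore")

# Header must be "<type>: <description>" with a non-empty description.
HEADER_RE = re.compile(r"^(?:%s): \S.*$" % "|".join(COMMIT_TYPES))


def is_valid_commit_message(message):
    """Return True if the first line of the message matches the header format."""
    lines = message.splitlines()
    if not lines:
        return False
    return HEADER_RE.match(lines[0]) is not None


print(is_valid_commit_message("feat: add amazing feature"))  # True
print(is_valid_commit_message("feature: add amazing feature"))  # False
```

Only the first line is validated; the optional body and footers are left unchecked.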

Release Process

  1. Create a PR from your feature branch to master
  2. Ensure all tests pass
  3. Update version in:
    • setup.py
    • textract_parser/__init__.py
    • pyproject.toml
  4. Merge PR to master
  5. GitHub Actions will automatically:
    • Run tests
    • Create a new tag
    • Generate release notes
    • Create GitHub release
    • Publish to PyPI
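
Because the version lives in three files, it is easy for them to drift apart. The sketch below shows one way to compare them before merging. It assumes each file declares the version as version = "..." or __version__ = "...", which may not match how this project actually writes it; find_version and versions_agree are hypothetical helpers, not part of the library.

```python
import re

# Matches version = "x.y.z" or __version__ = "x.y.z" (single or double quotes).
# The lookbehind skips unrelated keys such as target-version or minversion.
VERSION_RE = re.compile(
    r"""(?<![\w-])(?:__version__|version)\s*=\s*["']([^"']+)["']"""
)


def find_version(text):
    """Return the first version string found in the text, or None."""
    match = VERSION_RE.search(text)
    return match.group(1) if match else None


def versions_agree(file_texts):
    """Given {filename: contents}, report whether all versions match."""
    versions = {name: find_version(text) for name, text in file_texts.items()}
    distinct = set(versions.values())
    ok = len(distinct) == 1 and None not in distinct
    return ok, versions


# Inline stand-ins for the three files; real use would read them from disk.
ok, versions = versions_agree({
    "setup.py": 'setup(name="textract-form-parser", version="0.1.6")',
    "textract_parser/__init__.py": '__version__ = "0.1.6"',
    "pyproject.toml": '[project]\nname = "textract-form-parser"\nversion = "0.1.6"\n',
})
print(ok, versions)
```

In practice the contents could be loaded with pathlib.Path(name).read_text() for each of the three paths listed in step 3.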

License

MIT License - see LICENSE for details

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Format code (./scripts/format.sh)
  4. Commit changes (git commit -m 'feat: add amazing feature')
  5. Push to branch (git push origin feature/amazing-feature)
  6. Open a Pull Request
