Convert AWS Textract JSON output to hOCR format

textract-hocr

Convert AWS Textract JSON output to hOCR format for use with document processing tools.

License: MIT | Python 3.7+

Based on amazon-textract-hocr-output by AWS Samples.

Features

  • ✅ Convert Textract JSON to hOCR HTML format
  • ✅ hOCR 1.2 compliant output
  • ✅ Support for single and multi-page documents
  • ✅ Basic table extraction with full line/word structure
  • ✅ Block grouping based on vertical overlap
  • ✅ Extract specific pages or page ranges from multi-page documents
  • ✅ Automatic dimension detection from source images (PNG, JPEG, TIFF)
  • ✅ PDF dimension extraction support
  • ✅ Force custom dimensions (override auto-detection)
  • ✅ Fallback to Textract's default 1000x1000 dimensions
  • ✅ Command-line interface and Python library
  • ✅ Preserves text confidence scores and bounding boxes

Installation

From PyPI

pip install textract-hocr

From source

git clone https://github.com/BlueBox-WorldWide/textract-hocr.git
cd textract-hocr
pip install -e .

Development installation

git clone https://github.com/BlueBox-WorldWide/textract-hocr.git
cd textract-hocr
pip install -e ".[dev]"

Usage

Command Line

Convert entire document:

textract-to-hocr input.json output.html

Convert with source image for accurate dimensions:

textract-to-hocr input.json output.html --source image.png

Convert specific page only:

textract-to-hocr input.json output.html --first-page 2 --last-page 2

Convert page range:

textract-to-hocr input.json output.html --first-page 2 --last-page 5

Convert from page 3 to end:

textract-to-hocr input.json output.html --first-page 3

Force specific dimensions (override auto-detection):

textract-to-hocr input.json output.html --width 2550 --height 3300

Python Library

Convert entire document

from textract_hocr import textract_to_hocr
import json

# Load Textract JSON output
with open('textract_output.json', 'r') as f:
    textract_result = json.load(f)

# Convert to hOCR
hocr_html = textract_to_hocr(textract_result)

# Save to file
with open('output.html', 'w', encoding='utf-8') as f:
    f.write(hocr_html)

Convert with source image for accurate dimensions

from textract_hocr import textract_to_hocr
import json

with open('textract_output.json', 'r') as f:
    textract_result = json.load(f)

# Provide source image path
hocr_html = textract_to_hocr(textract_result, source_file='scan.png')

with open('output.html', 'w', encoding='utf-8') as f:
    f.write(hocr_html)

Convert specific page

from textract_hocr import textract_to_hocr
import json

with open('textract_output.json', 'r') as f:
    textract_result = json.load(f)

# Extract page 2 only
hocr_html = textract_to_hocr(
    textract_result,
    first_page=2,
    last_page=2,
    source_file='document.pdf'
)

with open('page2.html', 'w', encoding='utf-8') as f:
    f.write(hocr_html)

Convert page range

from textract_hocr import textract_to_hocr
import json

with open('textract_output.json', 'r') as f:
    textract_result = json.load(f)

# Extract pages 3-5
hocr_html = textract_to_hocr(
    textract_result,
    first_page=3,
    last_page=5,
    source_file='document.pdf'
)

with open('pages_3_5.html', 'w', encoding='utf-8') as f:
    f.write(hocr_html)
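
Before choosing first_page and last_page, it can help to know how many pages the Textract result covers. The following is a minimal sketch, assuming the JSON has Textract's usual response shape: a top-level Blocks list whose entries may carry a 1-based Page field, plus an optional DocumentMetadata.Pages count. count_pages is a hypothetical helper for illustration and is not part of textract-hocr.

```python
def count_pages(textract_result):
    """Return the number of pages in a Textract response dict.

    Hypothetical helper for illustration; not part of textract-hocr.
    """
    # Prefer the count Textract reports in DocumentMetadata, when present.
    pages = textract_result.get("DocumentMetadata", {}).get("Pages")
    if isinstance(pages, int) and pages > 0:
        return pages
    # Otherwise use the highest Page value seen on any block.
    # Single-page responses may omit Page, so treat a missing value as 1.
    blocks = textract_result.get("Blocks", [])
    return max((block.get("Page", 1) for block in blocks), default=0)


sample = {
    "DocumentMetadata": {"Pages": 3},
    "Blocks": [{"BlockType": "PAGE", "Page": 1}],
}
print(count_pages(sample))  # 3
```

The result can then bound the page arguments, for example last_page=min(5, count_pages(textract_result)).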

Force custom dimensions

from textract_hocr import textract_to_hocr
import json

with open('textract_output.json', 'r') as f:
    textract_result = json.load(f)

# Override dimension detection
hocr_html = textract_to_hocr(
    textract_result,
    dimensions={'width': 2550, 'height': 3300}
)

with open('output.html', 'w', encoding='utf-8') as f:
    f.write(hocr_html)

Get document dimensions

from textract_hocr import get_document_dimensions

# From image
dims = get_document_dimensions('image.png')
print(f"Width: {dims['width']}, Height: {dims['height']}")

# From PDF (specific page)
dims = get_document_dimensions('document.pdf', page_number=3)

# Force specific dimensions
dims = get_document_dimensions(dimensions={'width': 2550, 'height': 3300})

# Fallback to Textract defaults
dims = get_document_dimensions()  # Returns {'width': 1000, 'height': 1000}

What is hOCR?

hOCR is an open standard for representing OCR results in HTML format. It embeds text content along with layout information (bounding boxes, confidence scores, etc.) that can be used by document processing tools.

The hOCR format is widely supported by:

  • Tesseract OCR
  • OCRopus
  • ABBYY FineReader
  • Document analysis tools
  • PDF overlay generators

Dimension Handling

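The converter resolves document dimensions in the following priority order. Explicitly forced dimensions (--width/--height on the command line, or the dimensions argument in Python) override all of these: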

  1. Image files (PNG, JPEG, TIFF, etc.): Extracts actual pixel dimensions
  2. PDF files: Extracts page dimensions from PDF mediabox
  3. Fallback: Uses Textract's default 1000×1000 normalized dimensions

Textract returns normalized coordinates (0-1 range). This tool converts them to pixel coordinates using the actual document dimensions for accuracy.
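
The scaling step can be sketched as follows. This is an illustration rather than the library's actual code: it assumes Textract's Geometry.BoundingBox layout (Left, Top, Width, Height as 0-1 ratios) and rounds to the nearest pixel, which may differ from how textract-hocr rounds. to_pixel_bbox and DEFAULT_DIMENSIONS are hypothetical names.

```python
# Fallback used when no source file or forced dimensions are available.
DEFAULT_DIMENSIONS = {"width": 1000, "height": 1000}


def to_pixel_bbox(bounding_box, dimensions=None):
    """Scale a Textract BoundingBox (0-1 ratios) to pixel corner coordinates.

    Hypothetical helper for illustration; not part of textract-hocr.
    """
    dims = dimensions or DEFAULT_DIMENSIONS
    left = bounding_box["Left"] * dims["width"]
    top = bounding_box["Top"] * dims["height"]
    right = (bounding_box["Left"] + bounding_box["Width"]) * dims["width"]
    bottom = (bounding_box["Top"] + bounding_box["Height"]) * dims["height"]
    # hOCR expects integer values in the order: left top right bottom.
    return tuple(int(round(value)) for value in (left, top, right, bottom))


box = {"Left": 0.1, "Top": 0.2, "Width": 0.5, "Height": 0.05}
print(to_pixel_bbox(box, {"width": 2550, "height": 3300}))  # (255, 660, 1530, 825)
print(to_pixel_bbox(box))                                   # (100, 200, 600, 250)
```

The second call shows why the fallback is less accurate: with 1000×1000 dimensions the boxes keep their relative positions but no longer match the pixel grid of the source image.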

Output Format

The generated hOCR HTML includes:

  • hOCR 1.2 compliant structure with proper DOCTYPE and metadata
  • ocr_page divs with page dimensions
  • ocr_block divs grouping lines with overlapping vertical positions
  • ocr_table divs for tables with complete line and word structure
  • ocr_line spans for text lines
  • ocrx_word spans for individual words
  • Bounding boxes in bbox left top right bottom format
  • Confidence scores in x_wconf property
  • Proper baseline information for line elements
  • Content ordered by vertical position (top to bottom on page)
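
To check the structure listed above, the generated file can be read back with Python's standard html.parser module. The sketch below is only an illustration: the sample markup is hand-written, not real converter output, and WordExtractor and parse_title are hypothetical names. It assumes each ocrx_word span carries a title of the form "bbox left top right bottom; x_wconf N" and contains no nested spans.

```python
from html.parser import HTMLParser


def parse_title(title):
    """Split an hOCR title such as 'bbox 10 20 110 45; x_wconf 98'."""
    props = {}
    for part in title.split(";"):
        fields = part.split()
        if not fields:
            continue
        if fields[0] == "bbox" and len(fields) == 5:
            props["bbox"] = tuple(int(value) for value in fields[1:])
        elif fields[0] == "x_wconf" and len(fields) == 2:
            props["x_wconf"] = float(fields[1])
    return props


class WordExtractor(HTMLParser):
    """Collect text, bbox, and x_wconf from ocrx_word spans."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word being read, or None outside a word span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._current = {"text": "", **parse_title(attrs.get("title") or "")}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None


# Hand-written sample markup (an assumption, not real converter output).
sample = (
    '<span class="ocr_line" title="bbox 10 20 300 45">'
    '<span class="ocrx_word" title="bbox 10 20 110 45; x_wconf 98">Hello</span> '
    '<span class="ocrx_word" title="bbox 120 20 300 45; x_wconf 91">world</span>'
    '</span>'
)
parser = WordExtractor()
parser.feed(sample)
for word in parser.words:
    print(word["text"], word["bbox"], word["x_wconf"])
```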

Block Grouping

Lines are grouped into ocr_block divs based on vertical overlap:

  • Lines with overlapping Y-axis positions are grouped together
  • Creates natural paragraph-like blocks without explicit paragraph detection
  • Blocks use synthetic IDs (e.g., block_1_1, block_2_1)
  • Each block's bounding box encompasses all contained lines
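
One way to read the grouping rule is sketched below. It is an interpretation of the description above, not the library's actual implementation: lines are sorted top to bottom, and a line joins the current block when it starts above that block's bottom edge. The function name and the (left, top, right, bottom) tuple layout are assumptions made for the example.

```python
def group_lines_by_vertical_overlap(lines):
    """Group (left, top, right, bottom) line boxes whose Y ranges overlap.

    Hypothetical helper for illustration; not part of textract-hocr.
    """
    blocks = []
    for box in sorted(lines, key=lambda b: (b[1], b[0])):
        if blocks and box[1] < blocks[-1]["bbox"][3]:
            # The line starts above the current block's bottom edge,
            # so it overlaps vertically and joins that block.
            block = blocks[-1]
            left, top, right, bottom = block["bbox"]
            # Grow the block's bounding box to cover the new line.
            block["bbox"] = (
                min(left, box[0]),
                min(top, box[1]),
                max(right, box[2]),
                max(bottom, box[3]),
            )
            block["lines"].append(box)
        else:
            blocks.append({"bbox": box, "lines": [box]})
    return blocks


lines = [
    (100, 100, 400, 130),  # left column, first row
    (500, 105, 900, 135),  # right column, overlaps the first row vertically
    (100, 200, 400, 230),  # next row, no overlap with the rows above
]
for block in group_lines_by_vertical_overlap(lines):
    print(block["bbox"], len(block["lines"]))
# (100, 100, 900, 135) 2
# (100, 200, 400, 230) 1
```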

Table Support

Tables detected by Textract are converted to <div> elements with the ocr_table class, which hOCR classifies as a float element:

  • Each table is rendered as a single ocr_table <div> (no HTML <table> structure)
  • Each cell's content rendered as ocr_line spans containing ocrx_word spans
  • Cell content in reading order (row by row, left to right)
  • Bounding box and confidence score for the table region
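
The reading order described above can be derived from Textract's own table data. In a Textract response, a TABLE block lists its cells through a CHILD relationship, and each CELL block carries 1-based RowIndex and ColumnIndex values. The sketch below sorts on those two fields. cells_in_reading_order is a hypothetical helper, and the converter's real ordering logic may differ.

```python
def cells_in_reading_order(table_block, blocks_by_id):
    """Return a table's CELL blocks sorted row by row, left to right.

    Hypothetical helper for illustration; not part of textract-hocr.
    """
    # Collect the ids of all direct children of the table.
    child_ids = [
        child_id
        for relationship in table_block.get("Relationships", [])
        if relationship["Type"] == "CHILD"
        for child_id in relationship["Ids"]
    ]
    # Keep only CELL blocks that are present in the lookup table.
    cells = [
        blocks_by_id[child_id]
        for child_id in child_ids
        if child_id in blocks_by_id
        and blocks_by_id[child_id]["BlockType"] == "CELL"
    ]
    return sorted(cells, key=lambda cell: (cell["RowIndex"], cell["ColumnIndex"]))


blocks = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c3", "c1", "c2"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2},
    {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1},
]
blocks_by_id = {block["Id"]: block for block in blocks}
ordered = cells_in_reading_order(blocks[0], blocks_by_id)
print([cell["Id"] for cell in ordered])  # ['c1', 'c2', 'c3']
```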

Requirements

  • Python 3.7+
  • yattag >= 1.14.0
  • Pillow >= 9.0.0 (for image dimension extraction)
  • PyPDF2 >= 3.0.0 (for PDF dimension extraction)

License

MIT License - see LICENSE file for details.

Based on amazon-textract-hocr-output by AWS Samples.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Support

If you encounter any issues or have questions:

  1. Check existing GitHub Issues
  2. Create a new issue with:
    • Your Python version
    • The error message or unexpected behavior
    • Sample input (if possible)
    • Steps to reproduce

Changelog

0.1.0 (2026-01-03)

  • Initial release
  • Support for single and multi-page conversion
  • Image and PDF dimension extraction
  • Command-line interface
  • Python library API
  • Textract default dimension fallback
