Convert AWS Textract JSON output to hOCR format

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

bluebox_steven

These details have not been verified by PyPI

Project links

Original AWS Samples

Project description

textract-hocr

Convert AWS Textract JSON output to hOCR format for use with document processing tools.

Based on amazon-textract-hocr-output by AWS Samples.

Features

✅ Convert Textract JSON to hOCR HTML format
✅ hOCR 1.2 compliant output
✅ Support for single and multi-page documents
✅ Basic table extraction with full line/word structure
✅ Block grouping based on vertical overlap
✅ Extract specific pages or page ranges from multi-page documents
✅ Automatic dimension detection from source images (PNG, JPEG, TIFF)
✅ Explicit dimension specification for PDFs (required)
✅ Force custom dimensions (override auto-detection)
✅ Fallback to Textract's default 1000x1000 dimensions
✅ Command-line interface and Python library
✅ Preserves text confidence scores and bounding boxes
✅ Configurable logging levels (info, warning, error)

Installation

From PyPI

pip install textract-hocr

From source

git clone https://github.com/BlueBox-WorldWide/textract-hocr.git
cd textract-hocr
pip install -e .

Development installation

git clone https://github.com/BlueBox-WorldWide/textract-hocr.git
cd textract-hocr
pip install -e ".[dev]"

Usage

Command Line

Convert entire document:

textract-to-hocr input.json output.html

Convert with source image for automatic dimension detection:

textract-to-hocr input.json output.html --source image.png

Convert PDF with explicit dimensions (required for PDFs):

# For A4 at 300 DPI (8.27" x 11.69")
textract-to-hocr input.json output.html --width 2480 --height 3507

Convert specific page only:

textract-to-hocr input.json output.html --first-page 2 --last-page 2

Convert page range:

textract-to-hocr input.json output.html --first-page 2 --last-page 5

Convert from page 3 to end:

textract-to-hocr input.json output.html --first-page 3

Force specific dimensions (override auto-detection):

textract-to-hocr input.json output.html --width 2550 --height 3300

Control logging verbosity:

# Verbose output (info level)
textract-to-hocr input.json output.html --log-level info

# Default (warnings only)
textract-to-hocr input.json output.html --log-level warning

# Quiet (errors only)
textract-to-hocr input.json output.html --log-level error

Python Library

Convert entire document

from textract_hocr import textract_to_hocr
import json

# Load Textract JSON output
with open('textract_output.json', 'r') as f:
    textract_result = json.load(f)

# Convert to hOCR
hocr_html = textract_to_hocr(textract_result)

# Save to file
with open('output.html', 'w', encoding='utf-8') as f:
    f.write(hocr_html)

Convert with source image for automatic dimension detection

from textract_hocr import textract_to_hocr
import json

with open('textract_output.json', 'r') as f:
    textract_result = json.load(f)

# Provide source image path for auto-detection
hocr_html = textract_to_hocr(textract_result, source_file='scan.png')

with open('output.html', 'w', encoding='utf-8') as f:
    f.write(hocr_html)

Convert PDF with explicit dimensions (required)

from textract_hocr import textract_to_hocr
import json

with open('textract_output.json', 'r') as f:
    textract_result = json.load(f)

# For PDFs, you MUST provide explicit dimensions matching Textract's rasterization
# Example: A4 at 300 DPI (8.27" x 11.69")
hocr_html = textract_to_hocr(
    textract_result,
    dimensions={'width': 2480, 'height': 3507}
)

with open('output.html', 'w', encoding='utf-8') as f:
    f.write(hocr_html)

Convert specific page

from textract_hocr import textract_to_hocr
import json

with open('textract_output.json', 'r') as f:
    textract_result = json.load(f)

# Extract page 2 only (with explicit dimensions for PDF)
hocr_html = textract_to_hocr(
    textract_result,
    first_page=2,
    last_page=2,
    dimensions={'width': 2480, 'height': 3507}  # Required for PDFs
)

with open('page2.html', 'w', encoding='utf-8') as f:
    f.write(hocr_html)

Convert page range

from textract_hocr import textract_to_hocr
import json

with open('textract_output.json', 'r') as f:
    textract_result = json.load(f)

# Extract pages 3-5 (with explicit dimensions for PDF)
hocr_html = textract_to_hocr(
    textract_result,
    first_page=3,
    last_page=5,
    dimensions={'width': 2550, 'height': 3300}  # Letter at 300 DPI
)

with open('pages_3_5.html', 'w', encoding='utf-8') as f:
    f.write(hocr_html)
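
The page selection above can be pictured as a filter over the Textract blocks. The following is a minimal sketch, not the library's implementation: it assumes each block carries Textract's 1-based Page field (treated as page 1 when the field is absent), and blocks_in_page_range is a hypothetical helper name that is not part of the textract-hocr API.

```python
def blocks_in_page_range(blocks, first_page=1, last_page=None):
    """Keep blocks whose page number falls within [first_page, last_page].

    Hypothetical helper for illustration; last_page=None means "to the end".
    """
    selected = []
    for block in blocks:
        # Assumption: a missing Page field means a single-page document.
        page = block.get("Page", 1)
        if page >= first_page and (last_page is None or page <= last_page):
            selected.append(block)
    return selected


sample_blocks = [
    {"Id": "a", "BlockType": "LINE", "Page": 1},
    {"Id": "b", "BlockType": "LINE", "Page": 3},
    {"Id": "c", "BlockType": "LINE", "Page": 5},
    {"Id": "d", "BlockType": "LINE", "Page": 6},
]
pages_3_to_5 = blocks_in_page_range(sample_blocks, first_page=3, last_page=5)
from_page_3 = blocks_in_page_range(sample_blocks, first_page=3)
print([b["Id"] for b in pages_3_to_5])  # ['b', 'c']
print([b["Id"] for b in from_page_3])   # ['b', 'c', 'd']
```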

Force custom dimensions

from textract_hocr import textract_to_hocr
import json

with open('textract_output.json', 'r') as f:
    textract_result = json.load(f)

# Override dimension detection
hocr_html = textract_to_hocr(
    textract_result,
    dimensions={'width': 2550, 'height': 3300}
)

with open('output.html', 'w', encoding='utf-8') as f:
    f.write(hocr_html)

Get document dimensions

from textract_hocr import get_document_dimensions

# From image (auto-detected)
dims = get_document_dimensions('image.png')
print(f"Width: {dims['width']}, Height: {dims['height']}")

# For PDFs, you MUST provide explicit dimensions
# This will raise ValueError:
# dims = get_document_dimensions('document.pdf')  # ERROR!

# Instead, provide dimensions explicitly:
dims = get_document_dimensions(
    'document.pdf',
    dimensions={'width': 2480, 'height': 3507}
)

# Or use dimensions parameter alone
dims = get_document_dimensions(dimensions={'width': 2550, 'height': 3300})

# Fallback to Textract defaults
dims = get_document_dimensions()  # Returns {'width': 1000, 'height': 1000}

What is hOCR?

hOCR is an open standard for representing OCR results in HTML format. It embeds text content along with layout information (bounding boxes, confidence scores, etc.) that can be used by document processing tools.

The hOCR format is widely supported by:

Tesseract OCR
OCRopus
ABBYY FineReader
Document analysis tools
PDF overlay generators

Dimension Handling

The converter handles document dimensions in the following priority order:

Explicit dimensions (via dimensions parameter): Uses provided width/height
Image files (PNG, JPEG, TIFF, etc.): Auto-extracts actual pixel dimensions
PDF files: dimensions cannot be auto-extracted; you must provide the explicit dimensions parameter (a ValueError is raised otherwise)
Fallback: Uses Textract's default 1000×1000 normalized dimensions

Why PDFs Require Explicit Dimensions

Textract rasterizes PDFs at a specific DPI (typically 200-300) before processing. The original PDF dimensions don't reliably indicate the resolution Textract used. Therefore, you must provide the dimensions matching Textract's rasterization:

A4 at 300 DPI: {'width': 2480, 'height': 3507} (8.27" × 11.69")
Letter at 300 DPI: {'width': 2550, 'height': 3300} (8.5" × 11")
A4 at 200 DPI: {'width': 1654, 'height': 2339}
Letter at 200 DPI: {'width': 1700, 'height': 2200}
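
The Letter values above follow from multiplying the page size in inches by the DPI. A small sketch of that arithmetic; pixel_dimensions is a hypothetical helper for illustration, not a function provided by textract-hocr, and the DPI Textract actually used for your PDF still has to be known or assumed.

```python
def pixel_dimensions(width_in: float, height_in: float, dpi: int) -> dict:
    """Return the pixel size of a page rasterized at the given DPI."""
    return {"width": round(width_in * dpi), "height": round(height_in * dpi)}


# US Letter is 8.5" x 11"
letter_300 = pixel_dimensions(8.5, 11, 300)
letter_200 = pixel_dimensions(8.5, 11, 200)
print(letter_300)  # {'width': 2550, 'height': 3300}
print(letter_200)  # {'width': 1700, 'height': 2200}
```

The resulting dict has the same shape as the dimensions parameter shown in the usage examples.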

Textract returns normalized coordinates (0-1 range). This tool converts them to pixel coordinates using the actual document dimensions for accuracy.
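
To make the conversion concrete, here is a minimal sketch under stated assumptions. It uses the Left, Top, Width and Height keys of a Textract Geometry.BoundingBox; to_pixel_bbox is a hypothetical name, and the rounding shown here is an assumption that may differ from what the library does internally.

```python
def to_pixel_bbox(bounding_box: dict, page_width: int, page_height: int) -> tuple:
    """Convert a normalized Textract bounding box to (left, top, right, bottom) pixels."""
    left = round(bounding_box["Left"] * page_width)
    top = round(bounding_box["Top"] * page_height)
    right = round((bounding_box["Left"] + bounding_box["Width"]) * page_width)
    bottom = round((bounding_box["Top"] + bounding_box["Height"]) * page_height)
    return left, top, right, bottom


# A box starting 10% from the left and 20% from the top of a Letter page at 300 DPI
box = {"Left": 0.1, "Top": 0.2, "Width": 0.5, "Height": 0.05}
pixel_box = to_pixel_bbox(box, page_width=2550, page_height=3300)
print(pixel_box)  # (255, 660, 1530, 825)
```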

Output Format

The generated hOCR HTML includes:

hOCR 1.2 compliant structure with proper DOCTYPE and metadata
ocr_page divs with page dimensions
ocr_block divs grouping lines with overlapping vertical positions
ocr_table divs for tables with complete line and word structure
ocr_line spans for text lines
ocrx_word spans for individual words
Bounding boxes in bbox left top right bottom format
Confidence scores in x_wconf property
Proper baseline information for line elements
Content ordered by vertical position (top to bottom on page)
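
As an illustration of the bbox and x_wconf properties listed above, the sketch below builds a single ocrx_word span. The helper name word_span and the id value word_1_1 are made up for this example, and the exact markup the library emits may differ.

```python
from html import escape


def word_span(word_id: str, text: str, bbox: tuple, confidence: float) -> str:
    """Build one hOCR word span with bbox and x_wconf in the title attribute."""
    left, top, right, bottom = bbox
    # hOCR properties are separated by semicolons inside the title attribute.
    title = f"bbox {left} {top} {right} {bottom}; x_wconf {round(confidence)}"
    return f"<span class='ocrx_word' id='{word_id}' title='{title}'>{escape(text)}</span>"


span = word_span("word_1_1", "R&D", (255, 660, 400, 700), 99.2)
print(span)
# <span class='ocrx_word' id='word_1_1' title='bbox 255 660 400 700; x_wconf 99'>R&amp;D</span>
```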

Block Grouping

Lines are grouped into ocr_block divs based on vertical overlap:

Lines with overlapping Y-axis positions are grouped together
Creates natural paragraph-like blocks without explicit paragraph detection
Blocks use synthetic IDs (e.g., block_1_1, block_2_1)
Each block's bounding box encompasses all contained lines
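
One way to picture this grouping is the sketch below. It is an illustration under assumptions, not the library's code: lines are given as (left, top, right, bottom) pixel boxes, and group_by_vertical_overlap is a hypothetical name.

```python
def group_by_vertical_overlap(lines):
    """Group line boxes whose vertical extents overlap into blocks.

    Each box is a (left, top, right, bottom) tuple in pixels.
    """
    blocks = []
    # Process lines top to bottom so each line is compared with the latest block.
    for box in sorted(lines, key=lambda b: b[1]):
        if blocks and box[1] < blocks[-1]["bbox"][3]:
            # The line starts above the current block's bottom edge: merge it in
            # and grow the block's bounding box to cover the new line.
            block = blocks[-1]
            left, top, right, bottom = block["bbox"]
            block["bbox"] = (
                min(left, box[0]),
                min(top, box[1]),
                max(right, box[2]),
                max(bottom, box[3]),
            )
            block["lines"].append(box)
        else:
            blocks.append({"bbox": box, "lines": [box]})
    return blocks


line_boxes = [
    (100, 100, 500, 130),  # left column, first row
    (520, 110, 900, 140),  # right column, vertically overlaps the first line
    (100, 200, 500, 230),  # separate line further down the page
]
grouped = group_by_vertical_overlap(line_boxes)
print([b["bbox"] for b in grouped])
# [(100, 100, 900, 140), (100, 200, 500, 230)]
```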

Table Support

Tables detected by Textract are converted to float div elements with ocr_table class:

ocr_table rendered as <div> float elements (no HTML table structure)
Each cell's content rendered as ocr_line spans containing ocrx_word spans
Cell content in reading order (row by row, left to right)
Bounding box and confidence score for the table region
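
The reading order described above can be sketched as a sort over Textract CELL blocks by their RowIndex and ColumnIndex fields. The helper cells_in_reading_order is a hypothetical name, and the converter's own traversal may differ.

```python
def cells_in_reading_order(blocks):
    """Return CELL blocks sorted row by row, left to right."""
    cells = [b for b in blocks if b.get("BlockType") == "CELL"]
    return sorted(cells, key=lambda c: (c["RowIndex"], c["ColumnIndex"]))


table_blocks = [
    {"Id": "c22", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2},
    {"Id": "c11", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1},
    {"Id": "w1", "BlockType": "WORD"},
    {"Id": "c21", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1},
    {"Id": "c12", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2},
]
ordered_ids = [c["Id"] for c in cells_in_reading_order(table_blocks)]
print(ordered_ids)  # ['c11', 'c12', 'c21', 'c22']
```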

Requirements

Python 3.8+
yattag >= 1.14.0
Pillow >= 9.0.0 (for image dimension extraction)

License

MIT License - see LICENSE file for details.

Based on amazon-textract-hocr-output by AWS Samples.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Related Projects

aws-samples/amazon-textract-hocr-output - Original implementation
AWS Textract - AWS OCR service
hOCR 1.2 Spec - hOCR 1.2 spec documentation
Tesseract OCR - Popular open-source OCR engine with hOCR support

Support

If you encounter any issues or have questions:

Check existing GitHub Issues
Create a new issue with:
- Your Python version
- The error message or unexpected behavior
- Sample input (if possible)
- Steps to reproduce

Changelog

0.1.1 (2026-01-04)

Breaking Changes:

PDF dimension handling changed: PDFs now require explicit dimensions parameter. Auto-extraction from PDF files has been removed due to reliability issues with determining Textract's rasterization DPI.
Attempting to process a PDF without providing dimensions will now raise a ValueError with clear instructions.

Improvements:

Added comprehensive logging throughout the conversion process
Better error messages with actionable guidance for PDF dimension requirements
Improved documentation with detailed examples for PDF processing at different DPIs
Clearer function docstrings with examples for both image and PDF workflows

Dependency Changes:

Removed PyPDF2 dependency (no longer needed)

Migration Guide: If you were using PDFs with auto-detection:

# Old (v0.1.0) - no longer works
hocr = textract_to_hocr(data, source_file='document.pdf')

# New (v0.1.1) - provide explicit dimensions
hocr = textract_to_hocr(
    data,
    dimensions={'width': 2480, 'height': 3507}  # A4 at 300 DPI
)

0.1.0 (2026-01-04)

Initial release
Support for single and multi-page conversion
Image dimension auto-detection (PNG, JPEG, TIFF)
PDF dimension extraction (removed in 0.1.1)
Command-line interface
Python library API
Textract default dimension fallback
Block grouping based on vertical overlap

Project details

Release history

0.1.3 (Jan 8, 2026)
0.1.2 (Jan 5, 2026)
0.1.1 (Jan 4, 2026), this version
0.1.0 (Jan 4, 2026)

Download files

Source Distribution

textract_hocr-0.1.1.tar.gz (23.6 kB)

Uploaded Jan 4, 2026 Source

Built Distribution

textract_hocr-0.1.1-py3-none-any.whl (15.8 kB)

Uploaded Jan 4, 2026 Python 3

File details

Details for the file textract_hocr-0.1.1.tar.gz.

File metadata

Download URL: textract_hocr-0.1.1.tar.gz
Upload date: Jan 4, 2026
Size: 23.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for textract_hocr-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`44c7455c63534aaf341027f4b3fe1eccc1c944188711a9ab046999c032fb472b`
MD5	`cb83e9b5e153732af85b242589bf92d5`
BLAKE2b-256	`90c51b41d363d54e3988da18b84490fe0944c818435b4a81110eb7a55fa35d4b`

Provenance

The following attestation bundles were made for textract_hocr-0.1.1.tar.gz:

Publisher: publish.yml on BlueBox-WorldWide/textract-hocr

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: textract_hocr-0.1.1.tar.gz
- Subject digest: 44c7455c63534aaf341027f4b3fe1eccc1c944188711a9ab046999c032fb472b
- Sigstore transparency entry: 790356572
- Sigstore integration time: Jan 4, 2026
Source repository:
- Permalink: BlueBox-WorldWide/textract-hocr@e5064ba926a021cddeb7d12311df961b61dbe331
- Branch / Tag: refs/heads/main
- Owner: https://github.com/BlueBox-WorldWide
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@e5064ba926a021cddeb7d12311df961b61dbe331
- Trigger Event: workflow_dispatch

File details

Details for the file textract_hocr-0.1.1-py3-none-any.whl.

File metadata

Download URL: textract_hocr-0.1.1-py3-none-any.whl
Upload date: Jan 4, 2026
Size: 15.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for textract_hocr-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a3a8be5c5827cb2a800f6bb911db5125fa59448f7294ff783b22b2546fa9095c`
MD5	`c44c27998d669d816fdccfb4bc2b0175`
BLAKE2b-256	`0f93b5fa2080d97fd54fb00329fdd003e3f0c35c29987af1ce6486cae2d3b130`

Provenance

The following attestation bundles were made for textract_hocr-0.1.1-py3-none-any.whl:

Publisher: publish.yml on BlueBox-WorldWide/textract-hocr

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: textract_hocr-0.1.1-py3-none-any.whl
- Subject digest: a3a8be5c5827cb2a800f6bb911db5125fa59448f7294ff783b22b2546fa9095c
- Sigstore transparency entry: 790356575
- Sigstore integration time: Jan 4, 2026
Source repository:
- Permalink: BlueBox-WorldWide/textract-hocr@e5064ba926a021cddeb7d12311df961b61dbe331
- Branch / Tag: refs/heads/main
- Owner: https://github.com/BlueBox-WorldWide
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@e5064ba926a021cddeb7d12311df961b61dbe331
- Trigger Event: workflow_dispatch
