Convert AWS Textract JSON output to hOCR format
textract-hocr
Convert AWS Textract JSON output to hOCR format for use with document processing tools.
Based on amazon-textract-hocr-output by AWS Samples.
Features
- ✅ Convert Textract JSON to hOCR HTML format
- ✅ hOCR 1.2 compliant* output
- ✅ Support for single and multi-page documents
- ✅ Basic table extraction with full line/word structure
- ✅ Block grouping based on vertical/horizontal overlap (limited to left-to-right, top-to-bottom reading order)
- ✅ Extract specific pages or page ranges from multi-page documents
- ✅ Automatic dimension detection from source images (PNG, JPEG, TIFF)
- ✅ Explicit dimension specification for PDFs (required)
- ✅ Force custom dimensions (override auto-detection)
- ✅ Fallback to Textract's default 1000x1000 dimensions
- ✅ Command-line interface and Python library
- ✅ Preserves text confidence scores and bounding boxes
- ✅ Configurable logging levels (info, warning, error)
* Note: The hOCR spec is fairly loose in its requirements, so OCR and PDF engines interpret and use the hOCR type classes in different ways. This tool outputs elements similar to Tesseract's, except that it uses ocr_block instead of ocr_carea.
Installation
From PyPI
pip install textract-hocr
From source
git clone https://github.com/BlueBox-WorldWide/textract-hocr.git
cd textract-hocr
pip install -e .
Development installation
git clone https://github.com/BlueBox-WorldWide/textract-hocr.git
cd textract-hocr
pip install -e ".[dev]"
Usage
Command Line
Convert entire document:
textract-to-hocr input.json output.html
Convert with source image for automatic dimension detection:
textract-to-hocr input.json output.html --source image.png
Convert PDF with explicit dimensions (required for PDFs):
# For A4 at 300 DPI (8.27" x 11.69")
textract-to-hocr input.json output.html --width 2480 --height 3507
Convert specific page only:
textract-to-hocr input.json output.html --first-page 2 --last-page 2
Convert page range:
textract-to-hocr input.json output.html --first-page 2 --last-page 5
Convert from page 3 to end:
textract-to-hocr input.json output.html --first-page 3
Force specific dimensions (override auto-detection):
textract-to-hocr input.json output.html --width 2550 --height 3300
Control logging verbosity:
# Verbose output (info level)
textract-to-hocr input.json output.html --log-level info
# Default (warnings only)
textract-to-hocr input.json output.html --log-level warning
# Quiet (errors only)
textract-to-hocr input.json output.html --log-level error
Python Library
Convert entire document
from textract_hocr import textract_to_hocr
import json
# Load Textract JSON output
with open('textract_output.json', 'r') as f:
    textract_result = json.load(f)
# Convert to hOCR
hocr_html = textract_to_hocr(textract_result)
# Save to file
with open('output.html', 'w', encoding='utf-8') as f:
    f.write(hocr_html)
Convert with source image for automatic dimension detection
from textract_hocr import textract_to_hocr
import json
with open('textract_output.json', 'r') as f:
    textract_result = json.load(f)
# Provide source image path for auto-detection
hocr_html = textract_to_hocr(textract_result, source_file='scan.png')
with open('output.html', 'w', encoding='utf-8') as f:
    f.write(hocr_html)
Convert PDF with explicit dimensions (required)
from textract_hocr import textract_to_hocr
import json
with open('textract_output.json', 'r') as f:
    textract_result = json.load(f)
# For PDFs, you MUST provide explicit dimensions matching Textract's rasterization
# Example: A4 at 300 DPI (8.27" x 11.69")
hocr_html = textract_to_hocr(
    textract_result,
    dimensions={'width': 2480, 'height': 3507}
)
with open('output.html', 'w', encoding='utf-8') as f:
    f.write(hocr_html)
Convert specific page
from textract_hocr import textract_to_hocr
import json
with open('textract_output.json', 'r') as f:
    textract_result = json.load(f)
# Extract page 2 only (with explicit dimensions for PDF)
hocr_html = textract_to_hocr(
    textract_result,
    first_page=2,
    last_page=2,
    dimensions={'width': 2480, 'height': 3507}  # Required for PDFs
)
with open('page2.html', 'w', encoding='utf-8') as f:
    f.write(hocr_html)
Convert page range
from textract_hocr import textract_to_hocr
import json
with open('textract_output.json', 'r') as f:
    textract_result = json.load(f)
# Extract pages 3-5 (with explicit dimensions for PDF)
hocr_html = textract_to_hocr(
    textract_result,
    first_page=3,
    last_page=5,
    dimensions={'width': 2550, 'height': 3300}  # Letter at 300 DPI
)
with open('pages_3_5.html', 'w', encoding='utf-8') as f:
    f.write(hocr_html)
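Page selection can be pictured as a filter over the Textract block list. The sketch below is not the library's code: `select_pages` is a hypothetical helper, and it assumes each block carries Textract's 1-based `Page` field, treating a missing field as page 1 (an assumption for single-page responses).

```python
from typing import Any, Dict, List, Optional


def select_pages(
    blocks: List[Dict[str, Any]],
    first_page: int = 1,
    last_page: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Keep blocks whose page number lies in [first_page, last_page]."""
    selected = []
    for block in blocks:
        page = block.get("Page", 1)  # assumption: a missing Page means page 1
        if page < first_page:
            continue
        if last_page is not None and page > last_page:
            continue
        selected.append(block)
    return selected


# Illustrative blocks only; real Textract blocks carry many more fields.
blocks = [
    {"BlockType": "LINE", "Text": "cover", "Page": 1},
    {"BlockType": "LINE", "Text": "intro", "Page": 3},
    {"BlockType": "LINE", "Text": "end", "Page": 6},
]
print([b["Text"] for b in select_pages(blocks, first_page=3, last_page=5)])
print([b["Text"] for b in select_pages(blocks, first_page=3)])
```

Leaving `last_page` as `None` mirrors the "from page 3 to end" case shown for the command line.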
Force custom dimensions
from textract_hocr import textract_to_hocr
import json
with open('textract_output.json', 'r') as f:
    textract_result = json.load(f)
# Override dimension detection
hocr_html = textract_to_hocr(
    textract_result,
    dimensions={'width': 2550, 'height': 3300}
)
with open('output.html', 'w', encoding='utf-8') as f:
    f.write(hocr_html)
Get document dimensions
from textract_hocr import get_document_dimensions
# From image (auto-detected)
dims = get_document_dimensions('image.png')
print(f"Width: {dims['width']}, Height: {dims['height']}")
# For PDFs, you MUST provide explicit dimensions
# This will raise ValueError:
# dims = get_document_dimensions('document.pdf') # ERROR!
# Instead, provide dimensions explicitly:
dims = get_document_dimensions(
    'document.pdf',
    dimensions={'width': 2480, 'height': 3507}
)
# Or use dimensions parameter alone
dims = get_document_dimensions(dimensions={'width': 2550, 'height': 3300})
# Fallback to Textract defaults
dims = get_document_dimensions() # Returns {'width': 1000, 'height': 1000}
What is hOCR?
hOCR is an open standard for representing OCR results in HTML format. It embeds text content along with layout information (bounding boxes, confidence scores, etc.) that can be used by document processing tools.
The hOCR format is widely supported by:
- Tesseract OCR
- OCRopus
- ABBYY FineReader
- Document analysis tools
- PDF overlay generators
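To make the "layout information" concrete: hOCR stores properties such as bbox and x_wconf in an element's title attribute, separated by semicolons. The following is a minimal sketch of reading such a string back; `parse_hocr_title` is a hypothetical helper and not part of this package.

```python
import re
from typing import Dict, List

NUMBER = re.compile(r"-?\d+(\.\d+)?")


def parse_hocr_title(title: str) -> Dict[str, List[float]]:
    """Split an hOCR title attribute into {property name: numeric values}."""
    props: Dict[str, List[float]] = {}
    for part in title.split(";"):
        tokens = part.split()
        if not tokens:
            continue
        name, values = tokens[0], tokens[1:]
        # Non-numeric values (for example a quoted image path) are skipped,
        # so such properties map to an empty list.
        props[name] = [float(v) for v in values if NUMBER.fullmatch(v)]
    return props


props = parse_hocr_title("bbox 100 200 300 240; x_wconf 98")
print(props["bbox"])     # [100.0, 200.0, 300.0, 240.0]
print(props["x_wconf"])  # [98.0]
```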
Dimension Handling
The converter handles document dimensions in the following priority order:
- Explicit dimensions (via dimensions parameter): Uses provided width/height
- Image files (PNG, JPEG, TIFF, etc.): Auto-extracts actual pixel dimensions
- PDF files: CANNOT auto-extract - you MUST provide explicit dimensions parameter
- Fallback: Uses Textract's default 1000×1000 normalized dimensions
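The priority order above can be sketched as a small resolver. This is an illustration under stated assumptions, not the package's implementation: `resolve_dimensions` and the `read_image_size` callback are hypothetical names, and the callback stands in for whatever reads the image size (for instance Pillow's `Image.open(path).size`).

```python
import os
from typing import Callable, Dict, Optional

Dims = Dict[str, int]
TEXTRACT_DEFAULT: Dims = {"width": 1000, "height": 1000}


def resolve_dimensions(
    source_file: Optional[str] = None,
    dimensions: Optional[Dims] = None,
    read_image_size: Optional[Callable[[str], Dims]] = None,
) -> Dims:
    """Pick page dimensions following the documented priority order."""
    # Explicit dimensions always win, even when a source file is given.
    if dimensions is not None:
        return dict(dimensions)
    if source_file is not None:
        extension = os.path.splitext(source_file)[1].lower()
        # PDFs cannot be measured reliably, so refuse instead of guessing.
        if extension == ".pdf":
            raise ValueError("PDF input requires explicit dimensions")
        # Image files: delegate to the (hypothetical) size reader if provided.
        if read_image_size is not None:
            return read_image_size(source_file)
    # Nothing usable was supplied: fall back to Textract's 1000x1000 default.
    return dict(TEXTRACT_DEFAULT)


print(resolve_dimensions())  # {'width': 1000, 'height': 1000}
print(resolve_dimensions("document.pdf", dimensions={"width": 2480, "height": 3507}))
```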
Why PDFs Require Explicit Dimensions
Textract rasterizes PDFs at a specific DPI (typically 200-300) before processing. The original PDF dimensions don't reliably indicate the resolution Textract used. Therefore, you must provide the dimensions matching Textract's rasterization:
- A4 at 300 DPI: {'width': 2480, 'height': 3507} (8.27" × 11.69")
- Letter at 300 DPI: {'width': 2550, 'height': 3300} (8.5" × 11")
- A4 at 200 DPI: {'width': 1654, 'height': 2339}
- Letter at 200 DPI: {'width': 1700, 'height': 2200}
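Values like these follow from multiplying the physical page size in inches by the DPI. The helper below is a hypothetical sketch for other page sizes. For sizes that are not whole pixel counts, such as A4, the result can differ by a pixel from the values listed here depending on rounding, so treat the output as a starting point.

```python
from typing import Dict


def page_pixels(width_in: float, height_in: float, dpi: int) -> Dict[str, int]:
    """Pixel size of a page with the given size in inches, rendered at dpi."""
    return {"width": round(width_in * dpi), "height": round(height_in * dpi)}


print(page_pixels(8.5, 11, 300))  # {'width': 2550, 'height': 3300}
print(page_pixels(8.5, 11, 200))  # {'width': 1700, 'height': 2200}
```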
Textract returns normalized coordinates (0-1 range). This tool converts them to pixel coordinates using the actual document dimensions for accuracy.
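As a rough illustration of that conversion: a Textract BoundingBox holds normalized Left, Top, Width, and Height values, while hOCR expects left, top, right, bottom in pixels. The function name and the choice to round to the nearest pixel are assumptions here; the package may round differently.

```python
from typing import Dict, Tuple


def to_pixel_bbox(
    bounding_box: Dict[str, float], page_width: int, page_height: int
) -> Tuple[int, int, int, int]:
    """Convert a Textract BoundingBox to (left, top, right, bottom) pixels."""
    left = bounding_box["Left"] * page_width
    top = bounding_box["Top"] * page_height
    right = (bounding_box["Left"] + bounding_box["Width"]) * page_width
    bottom = (bounding_box["Top"] + bounding_box["Height"]) * page_height
    # Rounding to the nearest pixel is an assumption of this sketch.
    return (round(left), round(top), round(right), round(bottom))


box = {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125}
print(to_pixel_bbox(box, 1000, 1000))  # (250, 500, 750, 625)
```

With the 1000×1000 fallback the numbers are simply the normalized values scaled by 1000, which is why real page dimensions are needed for a bbox that lines up with the source image.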
Output Format
The generated hOCR HTML includes:
- hOCR 1.2 compliant structure with proper DOCTYPE and metadata
- ocr_page divs with page dimensions
- ocr_block divs grouping lines with overlapping vertical positions
- ocr_table divs for tables with complete line and word structure
- ocr_line spans for text lines
- ocrx_word spans for individual words
- Bounding boxes in bbox left top right bottom format
- Confidence scores in x_wconf property
- Content ordered by vertical position (top to bottom on page)
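For orientation, a single word element with these properties might be rendered as follows. This is a sketch only: `word_span` is hypothetical, the id `word_1_1` is an invented example, and the package itself builds its markup with yattag rather than string formatting.

```python
from html import escape
from typing import Tuple


def word_span(
    word_id: str, text: str, bbox: Tuple[int, int, int, int], confidence: float
) -> str:
    """Render one ocrx_word span with bbox and x_wconf in its title."""
    left, top, right, bottom = bbox
    title = f"bbox {left} {top} {right} {bottom}; x_wconf {round(confidence)}"
    # escape() keeps characters such as & and < from breaking the markup.
    return (
        f'<span class="ocrx_word" id="{escape(word_id)}" '
        f'title="{escape(title)}">{escape(text)}</span>'
    )


print(word_span("word_1_1", "R&D", (100, 200, 300, 240), 98.6))
```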
Block Grouping
Lines are grouped into ocr_block divs based on vertical overlap:
- Lines with overlapping Y-axis positions are grouped together
- Creates natural paragraph-like blocks without explicit paragraph detection
- Blocks use synthetic IDs (e.g., block_1_1, block_2_1)
- Each block's bounding box encompasses all contained lines
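One way to picture the grouping is the sketch below. It is an approximation written for this README, not the package's actual algorithm: lines are plain dicts with pixel coordinates, and all names are hypothetical.

```python
from typing import Dict, List, Tuple

Line = Dict[str, float]  # keys: "left", "top", "right", "bottom" (pixels)


def group_by_vertical_overlap(lines: List[Line]) -> List[List[Line]]:
    """Group lines whose Y ranges overlap; order each group left to right."""
    groups: List[List[Line]] = []
    for line in sorted(lines, key=lambda item: item["top"]):
        if groups:
            current = groups[-1]
            group_bottom = max(item["bottom"] for item in current)
            # The line overlaps the group if it starts above the group's bottom.
            if line["top"] < group_bottom:
                current.append(line)
                continue
        groups.append([line])
    return [sorted(group, key=lambda item: item["left"]) for group in groups]


def block_bbox(group: List[Line]) -> Tuple[float, float, float, float]:
    """Bounding box that encompasses every line in the group."""
    return (
        min(item["left"] for item in group),
        min(item["top"] for item in group),
        max(item["right"] for item in group),
        max(item["bottom"] for item in group),
    )


lines = [
    {"left": 200, "top": 10, "right": 380, "bottom": 30},
    {"left": 20, "top": 12, "right": 180, "bottom": 28},
    {"left": 20, "top": 50, "right": 300, "bottom": 70},
]
groups = group_by_vertical_overlap(lines)
print([len(group) for group in groups])  # [2, 1]
print(block_bbox(groups[0]))             # (20, 10, 380, 30)
```

Sorting each group by its left edge is what limits this approach to left-to-right reading order.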
Table Support
Tables detected by Textract are converted to float div elements with ocr_table class:
- ocr_table rendered as <div> float elements (no HTML table structure)
- Each cell's content rendered as ocr_line spans containing ocrx_word spans
- Cell content in reading order (row by row, left to right)
- Bounding box and confidence score for the table region
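The row-by-row, left-to-right cell order can be sketched with the RowIndex and ColumnIndex fields that Textract attaches to CELL blocks. The helper name is hypothetical and the package may order cells by other means.

```python
from typing import Any, Dict, List


def cells_in_reading_order(cells: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Sort Textract CELL blocks row by row, then left to right."""
    return sorted(cells, key=lambda cell: (cell["RowIndex"], cell["ColumnIndex"]))


# Illustrative cells only; real CELL blocks also carry geometry and child IDs.
cells = [
    {"Id": "c", "RowIndex": 2, "ColumnIndex": 1},
    {"Id": "b", "RowIndex": 1, "ColumnIndex": 2},
    {"Id": "a", "RowIndex": 1, "ColumnIndex": 1},
]
print([cell["Id"] for cell in cells_in_reading_order(cells)])  # ['a', 'b', 'c']
```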
Requirements
- Python 3.8+
- yattag >= 1.14.0
- Pillow >= 9.0.0 (for image dimension extraction)
License
MIT License - see LICENSE file for details.
Based on amazon-textract-hocr-output by AWS Samples.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Related Projects
- aws-samples/amazon-textract-hocr-output - Original implementation
- AWS Textract - AWS OCR service
- hOCR 1.2 Spec - hOCR 1.2 spec documentation
- Tesseract OCR - Popular open-source OCR engine with hOCR support
Support
If you encounter any issues or have questions:
- Check existing GitHub Issues
- Create a new issue with:
- Your Python version
- The error message or unexpected behavior
- Sample input (if possible)
- Steps to reproduce
Changelog
0.1.3 (2026-01-08)
Bug Fixes:
- Fixed reading order for lines on the same visual line
- Lines are now correctly grouped by vertical overlap and sorted left-to-right within each group
- Prevents incorrect grouping of non-overlapping lines
0.1.2 (2026-01-05)
Improvements:
- Added paragraph grouping (ocr_par)
- Improved intersection calculations (limited to LTR languages)
0.1.1 (2026-01-04)
Breaking Changes:
- PDF dimension handling changed: PDFs now require explicit dimensions parameter. Auto-extraction from PDF files has been removed due to reliability issues with determining Textract's rasterization DPI.
- Attempting to process a PDF without providing dimensions will now raise a ValueError with clear instructions.
Improvements:
- Added comprehensive logging throughout the conversion process
- Better error messages with actionable guidance for PDF dimension requirements
- Improved documentation with detailed examples for PDF processing at different DPIs
- Clearer function docstrings with examples for both image and PDF workflows
Dependency Changes:
- Removed PyPDF2 dependency (no longer needed)
Migration Guide: If you were using PDFs with auto-detection:
# Old (v0.1.0) - no longer works
hocr = textract_to_hocr(data, source_file='document.pdf')
# New (v0.1.1) - provide explicit dimensions
hocr = textract_to_hocr(
    data,
    dimensions={'width': 2480, 'height': 3507}  # A4 at 300 DPI
)
0.1.0 (2026-01-04)
- Initial release
- Support for single and multi-page conversion
- Image dimension auto-detection (PNG, JPEG, TIFF)
- PDF dimension extraction (removed in 0.1.1)
- Command-line interface
- Python library API
- Textract default dimension fallback
- Block grouping based on vertical overlap