Reconstruct Azure Document Intelligence JSON output into readable spatial text layouts
Project description
azure-di-reconstruct
Reconstruct Azure Document Intelligence JSON output into readable spatial text layouts.
azure-di-reconstruct takes the JSON returned by the Azure Document Intelligence prebuilt-read model and reproduces the original document's two-dimensional layout as a monospace text grid — no image file, no external dependencies.
Features
- Zero runtime dependencies — pure Python 3.10+
- Language agnostic — works with any language Azure DI supports (Tamil, Hindi, English, Arabic, Chinese, etc.)
- Multi-page support — reconstruct any page by index
- Two output modes — pipe-bordered boxes or plain spatial text
- Tunable layout — four hyperparameters control grouping and grid resolution
- Lightweight — single function call, no setup required
How It Works
Azure DI returns paragraph polygons in inch coordinates. azure-di-reconstruct:
- Extracts paragraph bounding boxes from the JSON
- Groups blocks into rows using configurable height and width overlap thresholds
- Maps inch coordinates to character columns proportionally
- Renders a monospace grid with each paragraph in its original spatial position
+------------------------------+ +--------------------+
| கிரையம௠கொடà¯à®ªà¯à®ªà®µà®°à¯ பேவர௠| | கிரையம௠வாஙà¯à®•à¯à®ªà®µà®°à¯ |
+------------------------------+ +--------------------+
+--------------------------------------+
| INDIA NON JUDICIAL |
+--------------------------------------+
+----------------------------+ +------------------+
| 91 நெ. சமிடà¯à®Ÿà®¿à®±à¯à®•à¯à®®à¯ | | வடகà¯à®•à¯, |
+----------------------------+ +------------------+
Installation
pip install azure-di-reconstruct
Quick Start
import json
from azure_di_reconstruct import reconstruct
with open("document.json", encoding="utf-8") as f:
data = json.load(f)
# Pipe-bordered layout (default)
print(reconstruct(data))
# Plain spatial text
print(reconstruct(data, borders=False))
# Second page, wider grid
print(reconstruct(data, page=1, total_cols=160))
API Reference
reconstruct(json_data, *, page, height_threshold, width_threshold, total_cols, borders)
| Parameter | Type | Default | Description |
|---|---|---|---|
json_data |
dict |
— | Parsed Azure DI JSON with analyzeResult key |
page |
int |
0 |
Zero-based page index to reconstruct |
height_threshold |
float |
0.8 |
Minimum Y-overlap ratio for blocks to share a row |
width_threshold |
float |
0.3 |
Maximum X-overlap ratio before blocks are placed in separate rows |
total_cols |
int |
120 |
Output grid width in characters |
borders |
bool |
True |
Wrap blocks in +---+ / | | box characters |
Returns str — monospace text grid.
Raises ValueError if page index exceeds the document's page count.
Parameter Guide
height_threshold
Controls whether two blocks are on the same row or separate rows.
- Higher (e.g. 0.9) — stricter; blocks must nearly perfectly align vertically to share a row. Best for clean printed documents.
- Lower (e.g. 0.5) — looser; allows blocks with rough vertical alignment to share a row. Best for handwritten or skewed scans.
width_threshold
Controls column separation within a row.
- Lower (e.g. 0.1) — even small X overlaps force blocks into separate rows (strict column separation).
- Higher (e.g. 0.6) — blocks need heavy X overlap before being separated (permissive).
total_cols
Maps the page width to a fixed number of character columns.
- Fewer columns (60–80) — more compressed, fits narrow terminals.
- More columns (140–200) — more spatial detail, better column separation.
borders
True—+---+/| |box characters around each block (default, best for verification)False— plain text with spatial positioning only (best for copy-paste)
Examples
Multi-page document
pages = data["analyzeResult"]["pages"]
for i in range(len(pages)):
print(f"\n{'='*60}")
print(f" Page {i + 1}")
print('='*60)
print(reconstruct(data, page=i))
Save reconstruction to file
with open("reconstruction.txt", "w", encoding="utf-8") as f:
f.write(reconstruct(data, borders=False, total_cols=120))
Compare pages
page_1 = reconstruct(data, page=0, total_cols=100)
page_2 = reconstruct(data, page=1, total_cols=100)
Input Format
azure-di-reconstruct expects the standard Azure DI REST API response structure:
{
"analyzeResult": {
"pages": [
{ "width": 8.5, "height": 11.0, "pageNumber": 1 }
],
"paragraphs": [
{
"content": "Sample text",
"boundingRegions": [
{
"pageNumber": 1,
"polygon": [1.0, 1.0, 4.0, 1.0, 4.0, 1.5, 1.0, 1.5]
}
]
}
]
}
}
If using the Azure Python SDK, wrap the result:
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest
result = client.begin_analyze_document("prebuilt-read", body=request).result()
data = {"analyzeResult": result.as_dict()} # wrap before passing to reconstruct()
Supported Models
| Azure DI Model | Supported |
|---|---|
prebuilt-read (OCR) |
✅ Full support |
prebuilt-layout |
âš ï¸ Paragraphs extracted; table cell grouping may be inaccurate |
prebuilt-document |
âš ï¸ Paragraph-level extraction only |
Note: The
prebuilt-readmodel produces the most accurate spatial reconstruction because its paragraph boundaries align closely with the visual layout.
Limitations
- Character alignment — Tamil, Devanagari, Arabic, and CJK characters may not be monospace-width in all terminals, which can affect column alignment in the text grid.
- Rotated pages — heavily rotated page scans may require pre-processing before Azure DI analysis.
- Complex tables — table cells are treated as individual paragraphs; explicit table structure is not preserved.
License
MIT © 2026 Gopi Pitchai. See LICENSE for details.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file azure_di_reconstruct-0.1.0.tar.gz.
File metadata
- Download URL: azure_di_reconstruct-0.1.0.tar.gz
- Upload date:
- Size: 12.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
1f5c9c241240c8d1ad3de585287933ccc7613a88bd4a675409bae93edf174af1
|
|
| MD5 |
2efccebc9ef568203abfb41397355b7f
|
|
| BLAKE2b-256 |
e530989db61d5f538583f93678aa912ee02bfa9ed2e57a2b09de1b9117f6c5f1
|
File details
Details for the file azure_di_reconstruct-0.1.0-py3-none-any.whl.
File metadata
- Download URL: azure_di_reconstruct-0.1.0-py3-none-any.whl
- Upload date:
- Size: 10.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2475c7fbf4e9f54a7c2136d2adeea8e67f4a56ed16ca7d76b7178e216cf91bae
|
|
| MD5 |
d511b9c20e9230923bd4dcbbaa966a19
|
|
| BLAKE2b-256 |
b65c490ae053abf0726088ed36682098b87629cf8acbd332e2020d972147d88a
|