Reconstruct Azure Document Intelligence JSON output into readable spatial text layouts

azure-di-reconstruct

azure-di-reconstruct takes the JSON returned by the Azure Document Intelligence prebuilt-read model and reproduces the original document's two-dimensional layout as a monospace text grid, with no image file and no external dependencies.

Features

Zero runtime dependencies: pure Python 3.10+
Language agnostic: works with any language Azure DI supports (Tamil, Hindi, English, Arabic, Chinese, etc.)
Multi-page support: reconstruct any page by index
Two output modes: pipe-bordered boxes or plain spatial text
Tunable layout: four parameters control grouping, grid resolution, and borders
Lightweight: single function call, no setup required

How It Works

Azure DI returns paragraph polygons in inch coordinates. azure-di-reconstruct:

Extracts paragraph bounding boxes from the JSON
Groups blocks into rows using configurable height and width overlap thresholds
Maps inch coordinates to character columns proportionally
Renders a monospace grid with each paragraph in its original spatial position

+------------------------------+          +--------------------+
| கிரையம் கொடுப்பவர் பேவர்    |          | வாங்குபவர் |
+------------------------------+          +--------------------+

         +--------------------------------------+
         | INDIA NON JUDICIAL                   |
         +--------------------------------------+

+----------------------------+             +------------------+
| 91 நெ. சமிட்டிற்கும்      |             | வடக்கு,          |
+----------------------------+             +------------------+
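
The first and third steps above (extracting bounding boxes and mapping inches to character columns) can be sketched in a few lines. This is an illustrative sketch only, not the package's actual implementation: `paragraph_columns` is a hypothetical helper, and the proportional mapping used here (x-coordinate times `total_cols` divided by page width, truncated to an integer) is an assumption about how such a mapping might work.

```python
def paragraph_columns(data, page=0, total_cols=120):
    """Return (start_col, end_col, text) for each paragraph on a zero-based page.

    Hypothetical helper for illustration; not part of azure-di-reconstruct.
    """
    result = data["analyzeResult"]
    page_width = result["pages"][page]["width"]  # page width in inches
    placed = []
    for para in result.get("paragraphs", []):
        for region in para.get("boundingRegions", []):
            if region["pageNumber"] != page + 1:  # pageNumber is 1-based
                continue
            xs = region["polygon"][0::2]  # polygon is [x1, y1, x2, y2, ...]
            # Assumed proportional mapping from inches to character columns.
            start = int(min(xs) * total_cols / page_width)
            end = int(max(xs) * total_cols / page_width)
            placed.append((start, end, para["content"]))
    return placed


sample = {
    "analyzeResult": {
        "pages": [{"width": 8.5, "height": 11.0, "pageNumber": 1}],
        "paragraphs": [
            {
                "content": "Sample text",
                "boundingRegions": [
                    {"pageNumber": 1,
                     "polygon": [1.0, 1.0, 4.0, 1.0, 4.0, 1.5, 1.0, 1.5]}
                ],
            }
        ],
    }
}

print(paragraph_columns(sample))  # [(14, 56, 'Sample text')]
```

On an 8.5-inch page rendered at 120 columns, a paragraph spanning 1.0 to 4.0 inches would occupy roughly columns 14 to 56 under this assumed mapping.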

Installation

pip install azure-di-reconstruct

Quick Start

import json
from azure_di_reconstruct import reconstruct

with open("document.json", encoding="utf-8") as f:
    data = json.load(f)

# Pipe-bordered layout (default)
print(reconstruct(data))

# Plain spatial text
print(reconstruct(data, borders=False))

# Second page, wider grid
print(reconstruct(data, page=1, total_cols=160))

API Reference

`reconstruct(json_data, *, page, height_threshold, width_threshold, total_cols, borders)`

Parameter	Type	Default	Description
`json_data`	`dict`	(required)	Parsed Azure DI JSON with `analyzeResult` key
`page`	`int`	`0`	Zero-based page index to reconstruct
`height_threshold`	`float`	`0.8`	Minimum Y-overlap ratio for blocks to share a row
`width_threshold`	`float`	`0.3`	Maximum X-overlap ratio before blocks are placed in separate rows
`total_cols`	`int`	`120`	Output grid width in characters
`borders`	`bool`	`True`	Wrap blocks in `+---+` / `| |` box characters

Returns str: the monospace text grid.

Raises ValueError if page index exceeds the document's page count.

Parameter Guide

`height_threshold`

Controls whether two blocks are on the same row or separate rows.

Higher (e.g. 0.9): stricter; blocks must align almost perfectly in the vertical direction to share a row. Best for clean printed documents.
Lower (e.g. 0.5): looser; blocks with only rough vertical alignment can share a row. Best for handwritten or skewed scans.

`width_threshold`

Controls how much horizontal overlap two blocks may have before they are moved to separate rows.

Lower (e.g. 0.1): even small X overlaps force blocks into separate rows (strict column separation).
Higher (e.g. 0.6): blocks need heavy X overlap before being separated (permissive).
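
To make the interaction of `height_threshold` and `width_threshold` concrete, here is a minimal sketch of a same-row decision. It is not taken from the package source: `overlap_ratio` and `share_row` are hypothetical names, and measuring overlap as a fraction of the shorter of the two extents is an assumption; the library may define its ratios differently.

```python
def overlap_ratio(a_min, a_max, b_min, b_max):
    """Overlap of two 1-D intervals as a fraction of the shorter interval."""
    shorter = min(a_max - a_min, b_max - b_min)
    if shorter <= 0:
        return 0.0
    overlap = min(a_max, b_max) - max(a_min, b_min)
    return max(0.0, overlap) / shorter


def share_row(a, b, height_threshold=0.8, width_threshold=0.3):
    """Decide whether boxes a and b, given as (x0, y0, x1, y1) in inches, share a row.

    Assumed rule: enough vertical overlap AND not too much horizontal overlap.
    """
    y_ratio = overlap_ratio(a[1], a[3], b[1], b[3])
    x_ratio = overlap_ratio(a[0], a[2], b[0], b[2])
    return y_ratio >= height_threshold and x_ratio <= width_threshold


left = (1.0, 1.0, 3.0, 1.5)
right = (5.0, 1.2, 7.0, 1.7)      # same band but shifted down, as in a skewed scan
stacked = (1.0, 1.05, 3.0, 1.55)  # nearly the same band AND the same columns

print(share_row(left, right))                        # False: Y overlap is about 0.6
print(share_row(left, right, height_threshold=0.5))  # True: looser vertical match
print(share_row(left, stacked))                      # False: X overlap is 1.0
```

Under these assumptions, lowering `height_threshold` lets the skewed pair share a row, while the stacked pair is split regardless because its horizontal overlap exceeds `width_threshold`.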

`total_cols`

Maps the page width to a fixed number of character columns.

Fewer columns (60–80): more compressed; fits narrow terminals.
More columns (140–200): more spatial detail and better column separation.

`borders`

True: +---+ / | | box characters around each block (default; best for verification)
False: plain text with spatial positioning only (best for copy-paste)
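
As a rough illustration of the bordered mode, the sketch below wraps a single line of text in the same box characters. `boxed` is a hypothetical helper written for this example; how the package sizes and pads its boxes internally is not documented here.

```python
def boxed(text):
    """Wrap one line of text in +---+ / | | box characters (illustrative only)."""
    edge = "+" + "-" * (len(text) + 2) + "+"
    return "\n".join([edge, "| " + text + " |", edge])


print(boxed("INDIA NON JUDICIAL"))
# +--------------------+
# | INDIA NON JUDICIAL |
# +--------------------+
```

The border adds two rows and four columns per block, which is one reason the plain mode is more convenient for copy-paste.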

Examples

Multi-page document

pages = data["analyzeResult"]["pages"]

for i in range(len(pages)):
    print(f"\n{'='*60}")
    print(f"  Page {i + 1}")
    print('='*60)
    print(reconstruct(data, page=i))

Save reconstruction to file

with open("reconstruction.txt", "w", encoding="utf-8") as f:
    f.write(reconstruct(data, borders=False, total_cols=120))

Compare pages

page_1 = reconstruct(data, page=0, total_cols=100)
page_2 = reconstruct(data, page=1, total_cols=100)
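
Once two reconstructions are available as strings, one possible way to compare them is a line-based diff with the standard library's `difflib`. In this sketch the two short strings are placeholders standing in for the output of `reconstruct()`, and `diff_layouts` is a hypothetical helper, not part of the package.

```python
import difflib


def diff_layouts(a, b, name_a="page_1", name_b="page_2"):
    """Return a unified diff of two text layouts as a single string."""
    return "\n".join(
        difflib.unified_diff(
            a.splitlines(), b.splitlines(),
            fromfile=name_a, tofile=name_b, lineterm="",
        )
    )


# Placeholder layouts; in practice these would come from reconstruct().
layout_a = "INVOICE\n  Total: 100"
layout_b = "INVOICE\n  Total: 250"

print(diff_layouts(layout_a, layout_b))
```

Using the same `total_cols` for both pages, as in the snippet above, keeps column positions comparable between the two layouts.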

Input Format

azure-di-reconstruct expects the standard Azure DI REST API response structure:

{
  "analyzeResult": {
    "pages": [
      { "width": 8.5, "height": 11.0, "pageNumber": 1 }
    ],
    "paragraphs": [
      {
        "content": "Sample text",
        "boundingRegions": [
          {
            "pageNumber": 1,
            "polygon": [1.0, 1.0, 4.0, 1.0, 4.0, 1.5, 1.0, 1.5]
          }
        ]
      }
    ]
  }
}

If using the Azure Python SDK, wrap the result:

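from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest
from azure.core.credentials import AzureKeyCredential
# endpoint, key, and document_url are placeholders for your own values
client = DocumentIntelligenceClient(endpoint, AzureKeyCredential(key))
request = AnalyzeDocumentRequest(url_source=document_url)
result = client.begin_analyze_document("prebuilt-read", body=request).result()
data = {"analyzeResult": result.as_dict()}  # wrap before passing to reconstruct()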
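
However the dictionary is produced, it can help to confirm that it has the shape described in this section before calling `reconstruct()`. The following is a minimal sketch under stated assumptions: `check_input` is a hypothetical helper, not part of the package, and it only checks the keys shown in the example JSON above.

```python
def check_input(data, page=0):
    """Validate the expected top-level structure and return the page count.

    Hypothetical pre-flight check; raises ValueError with a readable message.
    """
    result = data.get("analyzeResult")
    if not isinstance(result, dict):
        raise ValueError("missing top-level 'analyzeResult' key")
    pages = result.get("pages", [])
    if not 0 <= page < len(pages):
        raise ValueError(
            f"page {page} is out of range; document has {len(pages)} page(s)"
        )
    return len(pages)


doc = {"analyzeResult": {"pages": [{"width": 8.5, "height": 11.0, "pageNumber": 1}]}}
print(check_input(doc))  # 1

try:
    check_input({"pages": []})  # SDK result passed without wrapping
except ValueError as exc:
    print(exc)  # missing top-level 'analyzeResult' key
```

A check like this catches the common case of passing an unwrapped SDK result before any layout work starts.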

Supported Models

Azure DI Model	Supported
`prebuilt-read` (OCR)	✅ Full support
`prebuilt-layout`	⚠️ Paragraphs extracted; table cell grouping may be inaccurate
`prebuilt-document`	⚠️ Paragraph-level extraction only

Note: The prebuilt-read model produces the most accurate spatial reconstruction because its paragraph boundaries align closely with the visual layout.

Limitations

Character alignment: Tamil, Devanagari, Arabic, and CJK characters may not occupy exactly one monospace cell in every terminal or font, which can throw off column alignment in the text grid.
Rotated pages: heavily rotated page scans may require pre-processing before Azure DI analysis.
Complex tables: table cells are treated as individual paragraphs; explicit table structure is not preserved.
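
The character alignment limitation can be seen by comparing a string's length with an estimate of its display width. The sketch below uses only the standard library's `unicodedata` module; `display_width` is a hypothetical helper, and its rules (wide and fullwidth characters count as two cells, non-spacing marks and format characters as zero) are a rough approximation, since actual rendering depends on the terminal and font.

```python
import unicodedata


def display_width(text):
    """Estimate how many terminal cells a string occupies (approximation)."""
    width = 0
    for ch in text:
        # Non-spacing marks, enclosing marks, and format characters take no cell.
        if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # East Asian wide and fullwidth characters usually take two cells.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


for sample_text in ("INDIA", "中文", "e\u0301"):
    print(repr(sample_text), len(sample_text), display_width(sample_text))
```

Where `len()` and the estimated width differ, blocks positioned by character count will likely drift out of alignment when displayed.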

License

Release history

0.1.1: May 16, 2026
0.1.0: May 16, 2026 (this version)

Download files

Source Distribution

azure_di_reconstruct-0.1.0.tar.gz (12.3 kB)

Uploaded May 16, 2026 Source

Built Distribution

azure_di_reconstruct-0.1.0-py3-none-any.whl (10.8 kB)

Uploaded May 16, 2026 Python 3

File details

Details for the file azure_di_reconstruct-0.1.0.tar.gz.

File metadata

Download URL: azure_di_reconstruct-0.1.0.tar.gz
Upload date: May 16, 2026
Size: 12.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for azure_di_reconstruct-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`1f5c9c241240c8d1ad3de585287933ccc7613a88bd4a675409bae93edf174af1`
MD5	`2efccebc9ef568203abfb41397355b7f`
BLAKE2b-256	`e530989db61d5f538583f93678aa912ee02bfa9ed2e57a2b09de1b9117f6c5f1`

File details

Details for the file azure_di_reconstruct-0.1.0-py3-none-any.whl.

File metadata

Download URL: azure_di_reconstruct-0.1.0-py3-none-any.whl
Upload date: May 16, 2026
Size: 10.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for azure_di_reconstruct-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2475c7fbf4e9f54a7c2136d2adeea8e67f4a56ed16ca7d76b7178e216cf91bae`
MD5	`d511b9c20e9230923bd4dcbbaa966a19`
BLAKE2b-256	`b65c490ae053abf0726088ed36682098b87629cf8acbd332e2020d972147d88a`
