Reconstruct Azure Document Intelligence JSON output into readable spatial text layouts

azure-di-reconstruct

azure-di-reconstruct takes the JSON returned by the Azure Document Intelligence prebuilt-read model and reproduces the original document's two-dimensional layout as a monospace text grid — no image file, no external dependencies.


Features

  • Zero runtime dependencies — pure Python 3.10+
  • Language agnostic — works with any language Azure DI supports (Tamil, Hindi, English, Arabic, Chinese, etc.)
  • Multi-page support — reconstruct any page by index
  • Two output modes — pipe-bordered boxes or plain spatial text
  • Tunable layout — four hyperparameters control grouping and grid resolution
  • Lightweight — single function call, no setup required

How It Works

Azure DI returns paragraph polygons in inch coordinates. azure-di-reconstruct:

  1. Extracts paragraph bounding boxes from the JSON
  2. Groups blocks into rows using configurable height and width overlap thresholds
  3. Maps inch coordinates to character columns proportionally
  4. Renders a monospace grid with each paragraph in its original spatial position

Example output with borders enabled:

+------------------------------+          +--------------------+
| கிரையம் கொடுப்பவர் பேவர்    |          | கிரையம் வாங்குபவர் |
+------------------------------+          +--------------------+

         +--------------------------------------+
         | INDIA NON JUDICIAL                   |
         +--------------------------------------+

+----------------------------+             +------------------+
| 91 நெ. சமிட்டிற்கும்      |             | வடக்கு,          |
+----------------------------+             +------------------+
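
Step 1 of the pipeline above can be sketched as follows. This is a minimal illustration rather than the package's own code: the helper name paragraph_boxes is hypothetical, and it assumes each polygon is a flat list of x, y pairs in inches, as in the standard Azure DI response.

```python
def paragraph_boxes(data, page=0):
    """Return (content, x_min, y_min, x_max, y_max) for each paragraph on a page.

    `page` is zero-based, while Azure DI's `pageNumber` is one-based.
    """
    boxes = []
    for para in data["analyzeResult"].get("paragraphs", []):
        for region in para.get("boundingRegions", []):
            if region["pageNumber"] != page + 1:
                continue
            poly = region["polygon"]  # flat list: x1, y1, x2, y2, ...
            xs, ys = poly[0::2], poly[1::2]
            boxes.append((para["content"], min(xs), min(ys), max(xs), max(ys)))
    return boxes


sample = {
    "analyzeResult": {
        "pages": [{"width": 8.5, "height": 11.0, "pageNumber": 1}],
        "paragraphs": [
            {
                "content": "Sample text",
                "boundingRegions": [
                    {"pageNumber": 1,
                     "polygon": [1.0, 1.0, 4.0, 1.0, 4.0, 1.5, 1.0, 1.5]}
                ],
            }
        ],
    }
}

print(paragraph_boxes(sample))
```

Reducing each polygon to its axis-aligned bounding box keeps the later grouping and mapping steps simple, at the cost of ignoring any slight rotation in the original quadrilateral.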

Installation

pip install azure-di-reconstruct

Quick Start

import json
from azure_di_reconstruct import reconstruct

with open("document.json", encoding="utf-8") as f:
    data = json.load(f)

# Pipe-bordered layout (default)
print(reconstruct(data))

# Plain spatial text
print(reconstruct(data, borders=False))

# Second page, wider grid
print(reconstruct(data, page=1, total_cols=160))

API Reference

reconstruct(json_data, *, page=0, height_threshold=0.8, width_threshold=0.3, total_cols=120, borders=True)

Parameter        | Type  | Default    | Description
-----------------|-------|------------|-------------------------------------------------------------------
json_data        | dict  | (required) | Parsed Azure DI JSON with analyzeResult key
page             | int   | 0          | Zero-based page index to reconstruct
height_threshold | float | 0.8        | Minimum Y-overlap ratio for blocks to share a row
width_threshold  | float | 0.3        | Maximum X-overlap ratio before blocks are placed in separate rows
total_cols       | int   | 120        | Output grid width in characters
borders          | bool  | True       | Wrap blocks in +---+ / | | box characters

Returns str — monospace text grid.

Raises ValueError if page index exceeds the document's page count.
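
Because an out-of-range page raises ValueError, batch scripts may want to catch it rather than stop. The sketch below is one way to do that. The helper try_reconstruct is hypothetical, and fake_reconstruct is only a stand-in that mimics the documented behaviour so the example runs without the package installed; in real use, pass the imported reconstruct function instead.

```python
def try_reconstruct(reconstruct_fn, data, page, **options):
    """Call reconstruct_fn, returning None instead of raising for a bad page index."""
    try:
        return reconstruct_fn(data, page=page, **options)
    except ValueError as exc:
        print(f"skipping page {page}: {exc}")
        return None


# Stand-in (assumption): mimics the documented ValueError for demonstration only.
def fake_reconstruct(data, *, page=0, **_):
    pages = data["analyzeResult"]["pages"]
    if page >= len(pages):
        raise ValueError(f"page {page} out of range for {len(pages)} page(s)")
    return f"<page {page}>"


doc = {"analyzeResult": {"pages": [{"pageNumber": 1}], "paragraphs": []}}
print(try_reconstruct(fake_reconstruct, doc, 0))
print(try_reconstruct(fake_reconstruct, doc, 5))
```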


Parameter Guide

height_threshold

Controls whether two blocks are on the same row or separate rows.

  • Higher (e.g. 0.9) — stricter; blocks must nearly perfectly align vertically to share a row. Best for clean printed documents.
  • Lower (e.g. 0.5) — looser; allows blocks with rough vertical alignment to share a row. Best for handwritten or skewed scans.
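
To make the effect of height_threshold concrete, here is a small sketch of a vertical overlap ratio. The exact formula the package uses is not documented here, so this assumes the ratio is the overlapping height divided by the height of the shorter block; overlap_ratio is a hypothetical helper.

```python
def overlap_ratio(a_min, a_max, b_min, b_max):
    """Overlap of two 1-D intervals as a fraction of the shorter interval."""
    overlap = min(a_max, b_max) - max(a_min, b_min)
    shorter = min(a_max - a_min, b_max - b_min)
    if overlap <= 0 or shorter <= 0:
        return 0.0
    return overlap / shorter


# Two blocks on a slightly skewed scan; values are y-ranges in inches.
left_y = (1.00, 1.50)
right_y = (1.15, 1.65)
ratio = overlap_ratio(*left_y, *right_y)

for threshold in (0.5, 0.8, 0.9):
    verdict = "same row" if ratio >= threshold else "separate rows"
    print(f"height_threshold={threshold}: ratio={ratio:.2f} -> {verdict}")
```

Under this assumed formula the two blocks overlap by about 70% of their height, so they share a row at 0.5 but not at the default 0.8.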

width_threshold

Controls column separation within a row.

  • Lower (e.g. 0.1) — even small X overlaps force blocks into separate rows (strict column separation).
  • Higher (e.g. 0.6) — blocks need heavy X overlap before being separated (permissive).
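
The two thresholds work together when deciding whether blocks share a row. The following sketch shows one plausible reading of the parameter descriptions; shares_row is a hypothetical helper, and the overlap formula (overlap divided by the shorter extent) is an assumption, not the package's confirmed logic.

```python
def _overlap(a_min, a_max, b_min, b_max):
    """Overlap of two 1-D intervals as a fraction of the shorter one (assumed formula)."""
    overlap = min(a_max, b_max) - max(a_min, b_min)
    shorter = min(a_max - a_min, b_max - b_min)
    return overlap / shorter if overlap > 0 and shorter > 0 else 0.0


def shares_row(a, b, height_threshold=0.8, width_threshold=0.3):
    """Decide whether two (x_min, y_min, x_max, y_max) boxes belong on one row."""
    y_ratio = _overlap(a[1], a[3], b[1], b[3])
    x_ratio = _overlap(a[0], a[2], b[0], b[2])
    # Enough vertical overlap, and not too much horizontal overlap.
    return y_ratio >= height_threshold and x_ratio <= width_threshold


seller = (0.5, 1.0, 3.0, 1.4)      # left column
buyer = (4.5, 1.0, 7.0, 1.4)       # right column, same height band
colliding = (0.6, 1.05, 3.2, 1.45)  # nearly the same place as `seller`

print(shares_row(seller, buyer))
print(shares_row(seller, colliding))
print(shares_row(seller, colliding, width_threshold=0.99))
```

Side-by-side blocks pass both checks, while the colliding pair is split into separate rows unless width_threshold is raised to a very permissive value.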

total_cols

Maps the page width to a fixed number of character columns.

  • Fewer columns (60–80) — more compressed, fits narrow terminals.
  • More columns (140–200) — more spatial detail, better column separation.
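
The proportional mapping behind total_cols is simple arithmetic. As a rough sketch (to_column is a hypothetical helper, and the truncation and clamping choices are assumptions), an x-coordinate in inches maps to a character column like this:

```python
def to_column(x_inches, page_width, total_cols=120):
    """Map an x-coordinate in inches to a zero-based character column."""
    col = int(x_inches / page_width * total_cols)
    return min(max(col, 0), total_cols - 1)  # clamp to the grid


# A block spanning 1.0 to 4.0 inches on an 8.5-inch-wide page.
for cols in (80, 120, 160):
    start, end = to_column(1.0, 8.5, cols), to_column(4.0, 8.5, cols)
    print(f"total_cols={cols}: columns {start}..{end} ({end - start} wide)")
```

A wider grid gives each inch more columns, which is why nearby blocks separate more cleanly at 160 columns than at 80.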

borders

  • True — +---+ / | | box characters around each block (default, best for verification)
  • False — plain text with spatial positioning only (best for copy-paste)
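
The difference between the two modes can be illustrated with a single block. This is a sketch only: render_block is a hypothetical helper, and it sizes the box with len(), which counts code points rather than terminal cells.

```python
def render_block(text, col, borders=True):
    """Return the output lines for one block, indented to start at column `col`."""
    pad = " " * col
    if not borders:
        return [pad + text]
    rule = "+" + "-" * (len(text) + 2) + "+"
    return [pad + rule, pad + "| " + text + " |", pad + rule]


print("\n".join(render_block("INDIA NON JUDICIAL", 9)))
print("\n".join(render_block("INDIA NON JUDICIAL", 9, borders=False)))
```

Bordered output uses three lines per block, so plain mode is noticeably more compact on dense pages.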

Examples

Multi-page document

pages = data["analyzeResult"]["pages"]

for i in range(len(pages)):
    print(f"\n{'='*60}")
    print(f"  Page {i + 1}")
    print('='*60)
    print(reconstruct(data, page=i))

Save reconstruction to file

with open("reconstruction.txt", "w", encoding="utf-8") as f:
    f.write(reconstruct(data, borders=False, total_cols=120))

Compare pages

page_1 = reconstruct(data, page=0, total_cols=100)
page_2 = reconstruct(data, page=1, total_cols=100)
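
Since each reconstruction is a plain string, the standard library's difflib can show where two pages differ. The sketch below uses short stand-in strings so it runs on its own; diff_pages is a hypothetical helper, and in practice you would pass page_1 and page_2.

```python
import difflib


def diff_pages(page_a, page_b, name_a="page 1", name_b="page 2"):
    """Return a unified diff between two reconstructed pages as one string."""
    return "\n".join(
        difflib.unified_diff(
            page_a.splitlines(),
            page_b.splitlines(),
            fromfile=name_a,
            tofile=name_b,
            lineterm="",
        )
    )


# Stand-in strings for illustration only.
a = "| INDIA NON JUDICIAL |\n| Seller |"
b = "| INDIA NON JUDICIAL |\n| Buyer |"
print(diff_pages(a, b))
```

Using the same total_cols for both pages, as in the example above, keeps the diff from being dominated by spacing differences.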

Input Format

azure-di-reconstruct expects the standard Azure DI REST API response structure:

{
  "analyzeResult": {
    "pages": [
      { "width": 8.5, "height": 11.0, "pageNumber": 1 }
    ],
    "paragraphs": [
      {
        "content": "Sample text",
        "boundingRegions": [
          {
            "pageNumber": 1,
            "polygon": [1.0, 1.0, 4.0, 1.0, 4.0, 1.5, 1.0, 1.5]
          }
        ]
      }
    ]
  }
}
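
Before calling reconstruct, it can help to check that a loaded file has this shape. The following is an illustrative sketch, not part of the package: validate_di_json is a hypothetical helper, and the specific checks are assumptions based on the structure shown above.

```python
def validate_di_json(data):
    """Return a list of problems found; an empty list means the shape looks right."""
    result = data.get("analyzeResult")
    if not isinstance(result, dict):
        return ["missing 'analyzeResult' key (SDK results need wrapping)"]
    problems = []
    if not result.get("pages"):
        problems.append("'pages' is missing or empty")
    for i, para in enumerate(result.get("paragraphs", [])):
        for region in para.get("boundingRegions", []):
            polygon = region.get("polygon", [])
            # A quadrilateral needs 4 points, stored as 8 flat numbers.
            if len(polygon) < 8 or len(polygon) % 2:
                problems.append(f"paragraph {i}: polygon should hold at least 4 x,y pairs")
    return problems


good = {
    "analyzeResult": {
        "pages": [{"width": 8.5, "height": 11.0, "pageNumber": 1}],
        "paragraphs": [
            {"content": "Sample text",
             "boundingRegions": [{"pageNumber": 1,
                                  "polygon": [1.0, 1.0, 4.0, 1.0, 4.0, 1.5, 1.0, 1.5]}]}
        ],
    }
}
print(validate_di_json(good))
print(validate_di_json({"pages": []}))
```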

If using the Azure Python SDK, wrap the result:

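from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest
from azure.core.credentials import AzureKeyCredential

client = DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))  # placeholders: endpoint, key
request = AnalyzeDocumentRequest(url_source=document_url)  # placeholder: document_url
result = client.begin_analyze_document("prebuilt-read", body=request).result()
data = {"analyzeResult": result.as_dict()}  # wrap before passing to reconstruct()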

Supported Models

Azure DI Model      | Supported
--------------------|--------------------------------------------------------------
prebuilt-read (OCR) | ✅ Full support
prebuilt-layout     | ⚠️ Paragraphs extracted; table cell grouping may be inaccurate
prebuilt-document   | ⚠️ Paragraph-level extraction only

Note: The prebuilt-read model produces the most accurate spatial reconstruction because its paragraph boundaries align closely with the visual layout.


Limitations

  • Character alignment — Tamil, Devanagari, Arabic, and CJK characters may not be monospace-width in all terminals, which can affect column alignment in the text grid.
  • Rotated pages — heavily rotated page scans may require pre-processing before Azure DI analysis.
  • Complex tables — table cells are treated as individual paragraphs; explicit table structure is not preserved.
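
The character alignment limitation comes from the gap between a string's length and the number of terminal cells it occupies. As a rough way to estimate that gap, the standard library's unicodedata module can be used; display_width below is a hypothetical helper, and its rules (wide and fullwidth characters count as 2, non-spacing marks as 0) are an approximation that will not match every terminal, especially for Indic scripts.

```python
import unicodedata


def display_width(text):
    """Estimate terminal cells: wide/fullwidth = 2, non-spacing marks = 0, else 1."""
    width = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me"):
            continue  # non-spacing and enclosing marks take no cell of their own
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


for sample in ("INDIA", "中文", "e\u0301"):
    print(repr(sample), "len:", len(sample), "cells:", display_width(sample))
```

Where len() and the estimated cell count disagree, expect box borders and columns to drift in the rendered grid.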

License

MIT © 2026 Gopi Pitchai. See LICENSE for details.
