Production-grade Document Ingestion & Canonicalization Engine

Document IR - Production Document Ingestion Engine

An IR-first, extensible document compiler for AI systems.

This is NOT a PDF-to-Markdown script. It is a production-grade document ingestion and canonicalization engine built around a compiler-like architecture: Input → IR → Backends.

Architecture

Design Philosophy

Think like a compiler engineer:

Input Layer: Format-specific parsers (currently PDF via Docling)
AST/IR: Canonical intermediate representation with a strict schema
Backends: Multiple export formats (Markdown, Text, Parquet)

Layer Separation (Non-Negotiable)

flowchart TD
    A[Input Adapter Layer<br/>Format-specific parsing only]
    B[Extraction Layer<br/>Extract raw structural elements]
    C[Normalization Layer<br/>Convert to canonical IR with hashing]
    D[Canonical IR Layer<br/>Typed schema, stable IDs, relationships]
    E[Export Layer<br/>Markdown, Text, Parquet, Assets]

    A --> B
    B --> C
    C --> D
    D --> E

Key Features

✅ Deterministic & Idempotent

Hash-based stable IDs (document, block, table, image, chunk)
Running the pipeline twice on the same input produces identical output
No UUIDs, no randomness
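
As a rough illustration of how hash-based stable IDs can be derived, here is a minimal sketch. The function name `stable_id`, the choice of SHA-256, the fields that feed the hash, and the 16-character truncation are all assumptions made for this example; they are not taken from layoutir's `utils/hashing.py`.

```python
import hashlib
import json


def stable_id(prefix: str, *parts: object) -> str:
    """Derive an ID purely from its inputs: no clock, no randomness, no UUIDs."""
    # Serializing with fixed separators gives the same bytes on every run.
    payload = json.dumps(list(parts), separators=(",", ":"), ensure_ascii=False)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    # Truncation length is an arbitrary choice for this sketch.
    return f"{prefix}_{digest[:16]}"


# Hypothetical usage: the document ID depends on the file bytes, and each
# block ID depends on the document ID plus the block's position and content.
file_bytes = b"%PDF-1.7 example bytes"
doc_id = stable_id("doc", hashlib.sha256(file_bytes).hexdigest())
block_id = stable_id("blk", doc_id, 0, "heading", "Introduction")
```

Because every input to the hash is derived from the document itself, re-running the same code on the same bytes yields the same IDs, which is what makes the pipeline idempotent.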

✅ Canonical IR Schema

Document
├── document_id: str (hash-based)
├── schema_version: str
├── parser_version: str
├── metadata: DocumentMetadata
├── blocks: List[Block]
│   ├── block_id: str (deterministic)
│   ├── type: BlockType (heading, paragraph, table, image, etc.)
│   ├── content: str
│   ├── page_number: int
│   ├── bbox: BoundingBox
│   └── metadata: dict
└── relationships: List[Relationship]
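
The tree above could be modelled roughly as follows. The project structure says the real schema uses Pydantic; this sketch uses standard-library dataclasses so it runs without extra dependencies. The `BoundingBox` coordinate names, the plain-dict stand-ins for `DocumentMetadata` and `Relationship`, and all sample values are assumptions for illustration only.

```python
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional


class BlockType(str, Enum):
    HEADING = "heading"
    PARAGRAPH = "paragraph"
    TABLE = "table"
    IMAGE = "image"


@dataclass(frozen=True)
class BoundingBox:
    # Coordinate field names are assumed; the source only names the type.
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class Block:
    block_id: str
    type: BlockType
    content: str
    page_number: int
    bbox: Optional[BoundingBox] = None
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Document:
    document_id: str
    schema_version: str
    parser_version: str
    # Plain dicts stand in for DocumentMetadata and Relationship here.
    metadata: Dict[str, Any] = field(default_factory=dict)
    blocks: List[Block] = field(default_factory=list)
    relationships: List[Dict[str, str]] = field(default_factory=list)


doc = Document(
    document_id="doc_example",
    schema_version="1.0.0",
    parser_version="example-parser",
    blocks=[
        Block("blk_0", BlockType.HEADING, "Introduction", page_number=1,
              bbox=BoundingBox(72.0, 700.0, 540.0, 720.0)),
    ],
)
ir_dict = asdict(doc)  # nested dict form, ready for serialization
```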

✅ Pluggable Chunking

SemanticSectionChunker: Section-based (headings)
TokenWindowChunker: Fixed token windows with overlap
LayoutAwareChunker: Layout-aware (stub)

All chunking operates on IR, not raw text.
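
To make "operates on IR, not raw text" concrete, here is a minimal sketch of fixed token windows with overlap over a list of blocks. It is not the `TokenWindowChunker` implementation: the function name `token_windows`, the `(block_id, text)` tuple shape, and whitespace tokenization are simplifying assumptions.

```python
from typing import Dict, List, Tuple


def token_windows(
    blocks: List[Tuple[str, str]], chunk_size: int, overlap: int
) -> List[Dict[str, object]]:
    """Slide a fixed-size window over tokens while tracking source block IDs."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    # Flatten blocks into (token, block_id) pairs so each chunk can cite
    # the IR blocks it was built from.
    tokens = [(tok, block_id) for block_id, text in blocks for tok in text.split()]

    chunks: List[Dict[str, object]] = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append({
            "text": " ".join(tok for tok, _ in window),
            # dict.fromkeys de-duplicates while preserving first-seen order.
            "block_ids": list(dict.fromkeys(bid for _, bid in window)),
        })
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


chunks = token_windows(
    [("blk_0", "a b c"), ("blk_1", "d e f g")], chunk_size=4, overlap=1
)
```

Keeping block IDs on each chunk is the practical payoff of chunking over the IR: a chunk can always be traced back to its page, bounding box, and block type.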

✅ Multiple Export Formats

Markdown: Human-readable with formatting
Plain Text: Simple text extraction
Parquet: Efficient structured storage for tables/blocks
Assets: Extracted images (PNG) and tables (CSV)

✅ Structured Output

/<document_id>/
    manifest.json       # Processing metadata
    ir.json            # Canonical IR
    chunks.json        # Chunk definitions
    /assets/
        /images/       # Extracted images
        /tables/       # Tables as CSV
    /exports/
        /markdown/     # Markdown output
        /text/         # Plain text output
        /parquet/      # Parquet datasets
    /logs/             # Processing logs
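
A minimal sketch of how this layout might be created on disk, using only the standard library. The helper name `prepare_output_tree` and the manifest contents are placeholders for illustration; the real manifest fields are not documented here.

```python
import json
import tempfile
from pathlib import Path

# Subdirectories taken from the layout above.
SUBDIRS = [
    "assets/images",
    "assets/tables",
    "exports/markdown",
    "exports/text",
    "exports/parquet",
    "logs",
]


def prepare_output_tree(root: Path, document_id: str) -> Path:
    """Create the per-document directory tree; safe to call repeatedly."""
    doc_dir = root / document_id
    for sub in SUBDIRS:
        (doc_dir / sub).mkdir(parents=True, exist_ok=True)
    return doc_dir


with tempfile.TemporaryDirectory() as tmp:
    doc_dir = prepare_output_tree(Path(tmp), "doc_example")
    # Placeholder manifest; sort_keys keeps the file byte-identical across runs.
    manifest = {"document_id": "doc_example"}
    (doc_dir / "manifest.json").write_text(
        json.dumps(manifest, sort_keys=True, indent=2), encoding="utf-8"
    )
    created = sorted(p.relative_to(doc_dir).as_posix() for p in doc_dir.rglob("*"))
```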

Installation

# Install from PyPI
pip install layoutir

# Or install from source
git clone https://github.com/RahulPatnaik/layoutir.git
cd layoutir
pip install -e .

Usage

Basic Usage

# Using the CLI
layoutir --input file.pdf --output ./out

# Or using Python directly
python -m layoutir.cli --input file.pdf --output ./out

Advanced Options

# Semantic chunking (default)
layoutir --input file.pdf --output ./out --chunk-strategy semantic

# Token-based chunking with custom size
layoutir --input file.pdf --output ./out \
  --chunk-strategy token \
  --chunk-size 1024 \
  --chunk-overlap 128

# Enable GPU acceleration
layoutir --input file.pdf --output ./out --use-gpu

# Debug mode with structured logging
layoutir --input file.pdf --output ./out \
  --log-level DEBUG \
  --structured-logs

Python API

from pathlib import Path
from layoutir import Pipeline
from layoutir.adapters import DoclingAdapter
from layoutir.chunking import SemanticSectionChunker

# Create pipeline
adapter = DoclingAdapter(use_gpu=True)
chunker = SemanticSectionChunker(max_heading_level=2)
pipeline = Pipeline(adapter=adapter, chunk_strategy=chunker)

# Process document
document = pipeline.process(
    input_path=Path("document.pdf"),
    output_dir=Path("./output")
)

# Access results
print(f"Extracted {len(document.blocks)} blocks")
print(f"Document ID: {document.document_id}")

Project Structure

src/layoutir/
├── schema.py              # Canonical IR schema (Pydantic)
├── pipeline.py            # Main orchestrator
│
├── adapters/              # Input adapters
│   ├── base.py           # Abstract interface
│   └── docling_adapter.py # PDF via Docling
│
├── extraction/            # Raw element extraction
│   └── docling_extractor.py
│
├── normalization/         # IR normalization
│   └── normalizer.py
│
├── chunking/              # Chunking strategies
│   └── strategies.py
│
├── exporters/             # Export backends
│   ├── markdown_exporter.py
│   ├── text_exporter.py
│   ├── parquet_exporter.py
│   └── asset_writer.py
│
└── utils/
    ├── hashing.py        # Deterministic ID generation
    └── logging_config.py  # Structured logging

ingest.py                  # CLI entrypoint
benchmark.py               # Performance benchmark
test_pipeline.py           # Integration test

Design Constraints

✅ What We DO

Strict layer separation
Deterministic processing
Schema validation
Pluggable strategies
Observability/timing
Efficient storage (Parquet)

❌ What We DON'T DO

Mix business logic into adapters
Hardcode paths or configurations
Use non-deterministic IDs (UUIDs)
Combine IR and export logic
Skip schema validation
Load entire files into memory unnecessarily

Extensibility

Adding New Input Formats

Implement the InputAdapter interface:

class DocxAdapter(InputAdapter):
    def parse(self, file_path: Path) -> Any: ...
    def supports_format(self, file_path: Path) -> bool: ...
    def get_parser_version(self) -> str: ...

Then implement the corresponding extractor and update the pipeline to use the new adapter.
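
For a sense of how `supports_format` could drive adapter selection, here is a self-contained sketch. The `InputAdapter` class below is a stand-in that only mirrors the three method names from the snippet above; `PlainTextAdapter` and `select_adapter` are hypothetical and not part of layoutir.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, List


class InputAdapter(ABC):
    """Stand-in for the abstract interface, mirroring the three methods above."""

    @abstractmethod
    def parse(self, file_path: Path) -> Any: ...

    @abstractmethod
    def supports_format(self, file_path: Path) -> bool: ...

    @abstractmethod
    def get_parser_version(self) -> str: ...


class PlainTextAdapter(InputAdapter):
    """Toy adapter: format-specific parsing only, no business logic."""

    def parse(self, file_path: Path) -> Any:
        return file_path.read_text(encoding="utf-8").splitlines()

    def supports_format(self, file_path: Path) -> bool:
        return file_path.suffix.lower() == ".txt"

    def get_parser_version(self) -> str:
        return "plaintext-sketch-0.1"


def select_adapter(adapters: List[InputAdapter], file_path: Path) -> InputAdapter:
    """Return the first registered adapter that claims the file's format."""
    for adapter in adapters:
        if adapter.supports_format(file_path):
            return adapter
    raise ValueError(f"No adapter supports {file_path.suffix!r}")


chosen = select_adapter([PlainTextAdapter()], Path("notes.TXT"))
```

Recording `get_parser_version()` alongside the output matters for determinism: the same input parsed by a different parser version may legitimately produce a different IR.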

Adding New Chunk Strategies

class CustomChunker(ChunkStrategy):
    def chunk(self, document: Document) -> List[Chunk]:
        # Operate on IR blocks
        ...

Adding New Export Formats

class JsonExporter(Exporter):
    def export(self, document: Document, output_dir: Path, chunks: List[Chunk]):
        # Export from canonical IR
        ...

Performance

Designed to handle 200+ page PDFs efficiently:

Streaming processing where possible
Lazy loading of heavy dependencies
GPU acceleration support
Parallel export operations
Efficient Parquet storage for tables

Observability

Structured JSON logging
Stage-level timing metrics
Extraction statistics
Deterministic output for debugging
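
Stage-level timing with structured JSON logging can be sketched with a context manager. The helper name `timed_stage`, the logger name, and the log field names are assumptions for this example rather than layoutir's actual logging format.

```python
import json
import logging
import time
from contextlib import contextmanager
from typing import Dict, Iterator

logger = logging.getLogger("pipeline_sketch")


@contextmanager
def timed_stage(name: str, timings: Dict[str, float]) -> Iterator[None]:
    """Record wall-clock time for one stage and emit it as a JSON log line."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the stage raises, so failed stages are still timed.
        elapsed = time.perf_counter() - start
        timings[name] = elapsed
        logger.info(json.dumps(
            {"event": "stage_complete", "stage": name, "seconds": round(elapsed, 6)}
        ))


timings: Dict[str, float] = {}
with timed_stage("extraction", timings):
    sum(range(1000))  # placeholder for real work
with timed_stage("normalization", timings):
    sorted(range(1000), reverse=True)  # placeholder for real work
```

Timings are kept out of the IR itself in this sketch, since wall-clock values would otherwise break byte-identical output across runs.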

Schema Versioning

Current schema version: 1.0.0

Future schema changes will be tracked via semantic versioning:

Major: Breaking changes to IR structure
Minor: Backwards-compatible additions
Patch: Bug fixes
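
Under these semantic versioning rules, a reader could decide whether it can load a stored IR file roughly as follows. The function names `parse_version` and `can_read` are hypothetical, and the policy shown (same major version, reader minor at least the document's minor) is one reasonable reading of the rules above, not a documented layoutir behavior.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into integers; raises ValueError if malformed."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def can_read(reader_version: str, document_version: str) -> bool:
    """Same major is required; the reader must know every minor-level addition."""
    reader = parse_version(reader_version)
    doc = parse_version(document_version)
    # Patch versions are ignored because they do not change the IR structure.
    return reader[0] == doc[0] and reader[1] >= doc[1]


same_version_ok = can_read("1.0.0", "1.0.0")
older_reader_ok = can_read("1.0.0", "1.1.0")
```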

Future Enhancements

DOCX input adapter
HTML input adapter
Advanced layout-aware chunking
Parallel page processing
Incremental updates (only reprocess changed pages)
Vector embeddings export
OCR fallback for scanned PDFs

License

See project root for license information.

Contributing

This project is in a research/prototype phase. See the main project README for contribution guidelines.

Release history

1.0.4: Feb 19, 2026
1.0.3: Feb 15, 2026
1.0.2: Feb 15, 2026
1.0.1: Feb 15, 2026
1.0.0: Feb 15, 2026 (this version)

Download files

Source Distribution

layoutir-1.0.0.tar.gz (28.0 kB)

Uploaded Feb 15, 2026 Source

Built Distribution

layoutir-1.0.0-py3-none-any.whl (32.1 kB)

Uploaded Feb 15, 2026 Python 3

File details

Details for the file layoutir-1.0.0.tar.gz.

File metadata

Download URL: layoutir-1.0.0.tar.gz
Upload date: Feb 15, 2026
Size: 28.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for layoutir-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`603b26928faaeb68ba9932d202715ff65a177babdbb5b7f55af158c8587741b8`
MD5	`4f49993067ee7a59d04cc1d553cef244`
BLAKE2b-256	`d08fb9e63ffe07a398d70b71f21925868a827f09889855e7cd64cab7a8961b06`

File details

Details for the file layoutir-1.0.0-py3-none-any.whl.

File metadata

Download URL: layoutir-1.0.0-py3-none-any.whl
Upload date: Feb 15, 2026
Size: 32.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for layoutir-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cb7ac06517a1447430d51a043903d5e3167c564d08073abd2606c76c18baafa2`
MD5	`ad550690f075134eaefa53b58c42728d`
BLAKE2b-256	`e9c4b393cc362ddcdaa3c4abe093ee04dd4bc4bef56060926bfdea3932339f12`
