
A structured OCR pipeline designed for layout-aware text extraction from complex documents, combining preprocessing, column detection, region classification, PaddleOCR, and ordered OCR assembly.


Part of the Abstract Media Intelligence Platform

This module provides layout-aware OCR as part of a larger media processing system.

abstract_ocr focuses on extraction:

  • multi-engine OCR (Tesseract / EasyOCR / PaddleOCR)
  • column detection and region segmentation
  • structured, position-aware text output

Full system: https://github.com/AbstractEndeavors/abstract-media-intelligence


abstract_ocr / layout_ocr — Layout-Aware OCR Pipeline

A structured OCR pipeline designed for layout-aware text extraction from complex documents, combining preprocessing, column detection, region classification, and ordered OCR assembly.

Built to handle:

  • multi-column PDFs
  • mixed-content layouts (text, figures, captions)
  • noisy or scanned documents
  • large-scale document ingestion pipelines

🔹 What This System Is

This is not a simple OCR wrapper. It is a typed, multi-stage processing pipeline that:

  • transforms raw images into structured page representations
  • detects document layout (columns, headers, regions)
  • classifies content blocks (text, figures, captions)
  • applies OCR at the region level
  • reconstructs output in correct reading order

The system is designed for deterministic, reproducible extraction rather than heuristic text scraping.


Pipeline Overview

PDF Input
    ↓
Slice / Decompose (images + text per page)
    ↓
OCR + Text Extraction (layout-aware engines)
    ↓
Metadata Generation
    ├─ summaries
    ├─ keywords
    └─ descriptions
    ↓
Manifest Creation (per-page + per-document)
    ↓
HTML Generation
    ├─ PDF viewer pages
    └─ gallery index pages
    ↓
Static Site Output (SEO-ready)

flowchart TD
    A[Input Image / Page Image]
    B[Preprocess\nDenoise + Binarize]
    C[Layout Detection\nColumns + Header Cutoff]
    D[Region Classification\nText / Figure / Caption]
    E[Region OCR\nCrop + Tesseract]
    F[Fallback OCR\nColumn-level OCR]
    G[Reading Order Assembly]
    H[Structured OCRResult\nBlocks + Raw Text + Layout]

    A --> B --> C --> D --> E --> G --> H
    D -->|No usable regions| F --> G

🔹 Core Capabilities

  • Layout Detection

    • Column detection via vertical projection valleys
    • Header segmentation via density scanning
    • Multi-column classification (single / dual / mixed)
  • Region Classification

    • Connected-component analysis
    • Density-based classification (text vs figure vs caption)
    • Column-aware region assignment
  • Region-Level OCR

    • OCR applied per detected block (not full-page)
    • Adaptive Tesseract configuration by region type
    • Automatic fallback to column-level OCR when detection fails
  • Reading Order Reconstruction

    • Column-aware ordering
    • Top-to-bottom sequencing within columns
    • Header/body/caption prioritization
  • Typed Pipeline Execution

    • All steps validated via explicit input/output types
    • Registry-driven execution model
    • No implicit coupling between pipeline stages
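The column-detection bullet above ("vertical projection valleys") can be sketched in a few lines. This is a minimal pure-Python illustration of the general technique, not the package's implementation: the function name `find_column_spans`, the `min_gap` and `max_ink_ratio` parameters, and the binary-grid input format are all assumptions made for the example.

```python
from typing import List, Optional, Tuple


def find_column_spans(
    binary: List[List[int]],
    min_gap: int = 2,
    max_ink_ratio: float = 0.0,
) -> List[Tuple[int, int]]:
    """Return (start, end) x-ranges of text columns, end exclusive.

    `binary` is a row-major grid where 1 marks an ink pixel (assumed format).
    An x position is a "valley" when its ink ratio is <= max_ink_ratio.
    Valley runs shorter than min_gap are treated as ordinary spacing and
    do not split a column.
    """
    if not binary or not binary[0]:
        return []
    height, width = len(binary), len(binary[0])

    # Vertical projection: number of ink pixels at each x position.
    profile = [sum(row[x] for row in binary) for x in range(width)]
    has_ink = [count / height > max_ink_ratio for count in profile]

    spans: List[Tuple[int, int]] = []
    start: Optional[int] = None  # start of the column being tracked
    gap = 0                      # length of the current valley run
    for x, ink in enumerate(has_ink):
        if ink:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # Valley is wide enough: close the column where the valley began.
                spans.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:
        # Close the last column, trimming any trailing narrow valley.
        spans.append((start, width - gap))
    return spans


# Toy page: ink at x=0..3, a 3-pixel valley, ink at x=7..9.
page = [[1, 1, 1, 1, 0, 0, 0, 1, 1, 1] for _ in range(4)]
print(find_column_spans(page))  # [(0, 4), (7, 10)]
```

On real scans the valley threshold would normally tolerate some noise (a small positive `max_ink_ratio`) and `min_gap` would scale with page width; both are the kind of explicit, config-driven thresholds the README describes.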

🔹 Architecture

The pipeline is built around a step registry and a type-safe execution chain:

  • Each step declares:

    • input type
    • output type
  • The pipeline validates compatibility before execution

  • Execution is explicit, deterministic, and observable

Example chain:

["preprocess", "detect_layout", "ocr_regions"]

Each step is independently replaceable and composable.
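To make the registry idea concrete, here is a minimal sketch of how a typed step registry with pre-execution validation might look. It is an illustration only: `STEP_REGISTRY`, `register_step`, `validate_chain`, and `run_chain` are hypothetical names, the step bodies are stubs, and the mapping of types to steps is an assumption (a real `ocr_regions` step would presumably also need the image, not just the layout).

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


# Placeholder types named after the README; the fields are assumptions.
@dataclass
class PageImage:
    pixels: List[List[int]]

@dataclass
class PreprocessedImage:
    pixels: List[List[int]]

@dataclass
class LayoutDetection:
    columns: List[tuple]

@dataclass
class OCRResult:
    raw_text: str


@dataclass(frozen=True)
class Step:
    name: str
    input_type: type
    output_type: type
    fn: Callable[[Any], Any]


STEP_REGISTRY: Dict[str, Step] = {}  # hypothetical registry


def register_step(name: str, input_type: type, output_type: type):
    """Decorator that records a step together with its declared types."""
    def decorator(fn):
        STEP_REGISTRY[name] = Step(name, input_type, output_type, fn)
        return fn
    return decorator


@register_step("preprocess", PageImage, PreprocessedImage)
def preprocess(page: PageImage) -> PreprocessedImage:
    return PreprocessedImage(pixels=page.pixels)  # stub: no real denoising

@register_step("detect_layout", PreprocessedImage, LayoutDetection)
def detect_layout(image: PreprocessedImage) -> LayoutDetection:
    width = len(image.pixels[0]) if image.pixels else 0
    return LayoutDetection(columns=[(0, width)])  # stub: one full-width column

@register_step("ocr_regions", LayoutDetection, OCRResult)
def ocr_regions(layout: LayoutDetection) -> OCRResult:
    return OCRResult(raw_text=f"{len(layout.columns)} column(s) detected")  # stub


def validate_chain(names: List[str]) -> List[Step]:
    """Check that each step's output type matches the next step's input type."""
    steps = [STEP_REGISTRY[name] for name in names]  # KeyError on unknown step
    for prev, nxt in zip(steps, steps[1:]):
        if prev.output_type is not nxt.input_type:
            raise TypeError(
                f"{prev.name} outputs {prev.output_type.__name__}, "
                f"but {nxt.name} expects {nxt.input_type.__name__}"
            )
    return steps


def run_chain(names: List[str], value: Any) -> Any:
    """Validate the whole chain first, then run the steps in order."""
    for step in validate_chain(names):
        value = step.fn(value)
    return value


result = run_chain(
    ["preprocess", "detect_layout", "ocr_regions"],
    PageImage(pixels=[[0, 1], [1, 0]]),
)
```

Because validation happens before any step runs, a mis-ordered chain such as `["preprocess", "ocr_regions"]` fails immediately with a `TypeError` instead of partway through a page.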


🔹 Key Design Decisions

Typed Data Flow

All intermediate results are structured dataclasses:

  • PageImage
  • PreprocessedImage
  • LayoutDetection
  • OCRResult

Avoiding ad-hoc dictionaries ensures:

  • traceability
  • consistency
  • debuggability
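As a rough picture of what structured results buy over dictionaries, the sketch below models an `OCRResult` that carries blocks, raw text, and layout. Only the class names `LayoutDetection` and `OCRResult` come from the list above; `OCRBlock`, every field name, and the bounding-box convention are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1); assumed convention


@dataclass(frozen=True)
class OCRBlock:  # hypothetical block type
    bbox: BBox
    kind: str    # "text", "figure", or "caption"
    column: int  # index of the column the block was assigned to
    text: str


@dataclass(frozen=True)
class LayoutDetection:
    columns: Tuple[Tuple[int, int], ...]  # x-ranges of detected columns
    header_cutoff_y: int                  # y below which body content starts


@dataclass(frozen=True)
class OCRResult:
    layout: LayoutDetection
    blocks: Tuple[OCRBlock, ...]

    @property
    def raw_text(self) -> str:
        """Plain text in block order, skipping blocks with no text."""
        return "\n".join(block.text for block in self.blocks if block.text)


layout = LayoutDetection(columns=((0, 300), (300, 600)), header_cutoff_y=80)
result = OCRResult(
    layout=layout,
    blocks=(
        OCRBlock((0, 0, 600, 60), "text", 0, "Title"),
        OCRBlock((0, 100, 300, 400), "figure", 0, ""),
        OCRBlock((0, 410, 300, 440), "caption", 0, "Fig. 1"),
    ),
)
print(result.raw_text)
```

Freezing the dataclasses means a later stage cannot silently mutate an earlier stage's output, and a misspelled field name raises an error at the point of use rather than producing a missing dictionary key downstream.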

Layout-First OCR

OCR is applied after structure is understood, not before.

This prevents:

  • column interleaving
  • incorrect reading order
  • misclassification of content
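Column interleaving is easy to reproduce on a toy example. The sketch below compares a naive top-to-bottom sort with a column-aware sort; it assumes column x-ranges are already known, and the `Block` tuple, `assign_column`, and `reading_order` are hypothetical names rather than the package's API.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Block(NamedTuple):
    x0: int    # left edge of the block
    y0: int    # top edge of the block
    text: str


def assign_column(block: Block, column_spans: Sequence[Tuple[int, int]]) -> int:
    """Return the index of the column whose x-range contains the block's left edge."""
    if not column_spans:
        return 0  # no layout information: treat the page as one column
    for index, (start, end) in enumerate(column_spans):
        if start <= block.x0 < end:
            return index
    # Block starts outside every span: pick the column with the nearest start.
    return min(
        range(len(column_spans)),
        key=lambda i: abs(column_spans[i][0] - block.x0),
    )


def reading_order(
    blocks: Sequence[Block], column_spans: Sequence[Tuple[int, int]]
) -> List[Block]:
    """Order blocks column by column, top to bottom within each column."""
    return sorted(
        blocks,
        key=lambda b: (assign_column(b, column_spans), b.y0, b.x0),
    )


blocks = [
    Block(10, 0, "A1"), Block(310, 5, "B1"),
    Block(10, 50, "A2"), Block(310, 55, "B2"),
]
spans = [(0, 300), (300, 600)]

naive = [b.text for b in sorted(blocks, key=lambda b: b.y0)]
ordered = [b.text for b in reading_order(blocks, spans)]
print(naive)    # ['A1', 'B1', 'A2', 'B2']
print(ordered)  # ['A1', 'A2', 'B1', 'B2']
```

The naive sort alternates between the left and right columns, which is the interleaving that layout-first processing is meant to avoid.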

Fallback Over Failure

If region detection yields no usable regions:

  • the system falls back to column-level OCR
  • the output therefore remains usable
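The fallback rule can be expressed as a small function. This is a sketch under stated assumptions: `ocr_with_fallback` is a hypothetical name, "usable" is taken to mean a box with positive width and height, and the `ocr` argument is a stand-in callable for whichever engine is configured rather than a real engine call.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1); assumed convention


def ocr_with_fallback(
    regions: Sequence[Box],
    columns: Sequence[Box],
    ocr: Callable[[Box], str],
) -> Tuple[str, List[str]]:
    """OCR detected regions, or whole columns when no region is usable.

    Returns the granularity that was used ("regions" or "columns") along
    with the recognized texts, so callers can see when the fallback fired.
    """
    # Assumed definition of "usable": the box has positive width and height.
    usable = [r for r in regions if r[2] > r[0] and r[3] > r[1]]
    if usable:
        return "regions", [ocr(region) for region in usable]
    return "columns", [ocr(column) for column in columns]


def fake_ocr(box: Box) -> str:
    """Stand-in for a real OCR call; it just echoes the box it was given."""
    return f"text@{box}"


columns = [(0, 0, 300, 800), (300, 0, 600, 800)]
mode_a, texts_a = ocr_with_fallback([(10, 10, 290, 60)], columns, fake_ocr)
mode_b, texts_b = ocr_with_fallback([(10, 10, 10, 60)], columns, fake_ocr)
print(mode_a, len(texts_a))  # regions 1
print(mode_b, len(texts_b))  # columns 2
```

Returning the mode alongside the text keeps the fallback observable, so downstream consumers can tell a region-level result from a coarser column-level one.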

Determinism Over Heuristics

  • explicit thresholds (config-driven)
  • no hidden behavior
  • reproducible results across runs
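One way to keep thresholds explicit is to gather them in an immutable config object that every decision function receives. The sketch below applies that idea to a density-based region classifier. The `LayoutThresholds` class, its field names, the rule order, and the numeric defaults are placeholders invented for this example, not values taken from the package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutThresholds:  # hypothetical config object
    figure_min_density: float = 0.45  # placeholder value
    caption_max_height: int = 40      # placeholder value, in pixels


def classify_region(
    density: float,
    height: int,
    cfg: LayoutThresholds = LayoutThresholds(),
) -> str:
    """Classify a region from its ink density and height.

    The result depends only on the arguments and the thresholds in `cfg`,
    so identical inputs always produce identical labels.
    """
    if density >= cfg.figure_min_density:
        return "figure"
    if height <= cfg.caption_max_height:
        return "caption"
    return "text"


strict = LayoutThresholds(figure_min_density=0.7)
print(classify_region(0.6, 200))          # figure
print(classify_region(0.6, 200, strict))  # text
print(classify_region(0.2, 30))           # caption
```

Changing behavior then means changing a named value in one place, and the config used for a run can be logged next to its output for later reproduction.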

🔹 Why This Exists

Traditional OCR pipelines often:

  • ignore layout
  • operate on full pages
  • produce inconsistent reading order
  • fail silently on complex documents

This system:

  • understands document structure
  • isolates regions before OCR
  • enforces reading order
  • produces structured outputs suitable for downstream systems

🔹 Example Use Cases

  • PDF → structured text extraction
  • research document ingestion pipelines
  • financial filings parsing
  • multi-column article extraction
  • preprocessing for NLP / LLM pipelines
  • search indexing and document analysis

🔹 Integration Context

This module is designed to plug into:

  • document ingestion systems
  • OCR + NLP pipelines (e.g. abstract_hugpy)
  • search and indexing systems
  • large-scale document processing workflows

🔹 Design Philosophy

  • Structure before extraction
  • Determinism over convenience
  • Typed pipelines over implicit flows
  • Fallback over failure

