abstract-ocr

A structured OCR pipeline designed for layout-aware text extraction from complex documents, combining preprocessing, column detection, region classification, PaddleOCR, and ordered OCR assembly.

Part of the Abstract Media Intelligence Platform

This module provides layout-aware OCR as part of a larger media processing system.

abstract_ocr focuses on extraction:

multi-engine OCR (Tesseract / EasyOCR / PaddleOCR)
column detection and region segmentation
structured, position-aware text output

Full system: https://github.com/AbstractEndeavors/abstract-media-intelligence

abstract_ocr / layout_ocr — Layout-Aware OCR Pipeline

A structured OCR pipeline designed for layout-aware text extraction from complex documents, combining preprocessing, column detection, region classification, and ordered OCR assembly.

Built to handle:

multi-column PDFs
mixed-content layouts (text, figures, captions)
noisy or scanned documents
large-scale document ingestion pipelines

🔹 What This System Is

This is not a simple OCR wrapper — it is a typed, multi-stage processing pipeline:

transforms raw images into structured page representations
detects document layout (columns, headers, regions)
classifies content blocks (text, figures, captions)
applies OCR at the region level
reconstructs output in correct reading order

The system is designed for deterministic, reproducible extraction rather than heuristic text scraping.

Pipeline Overview

PDF Input
    ↓
Slice / Decompose (images + text per page)
    ↓
OCR + Text Extraction (layout-aware engines)
    ↓
Metadata Generation
    ├─ summaries
    ├─ keywords
    └─ descriptions
    ↓
Manifest Creation (per-page + per-document)
    ↓
HTML Generation
    ├─ PDF viewer pages
    └─ gallery index pages
    ↓
Static Site Output (SEO-ready)

flowchart TD
    A[Input Image / Page Image]
    B[Preprocess\nDenoise + Binarize]
    C[Layout Detection\nColumns + Header Cutoff]
    D[Region Classification\nText / Figure / Caption]
    E[Region OCR\nCrop + Tesseract]
    F[Fallback OCR\nColumn-level OCR]
    G[Reading Order Assembly]
    H[Structured OCRResult\nBlocks + Raw Text + Layout]

    A --> B --> C --> D --> E --> G --> H
    D -->|No usable regions| F --> G

🔹 Core Capabilities

Layout Detection
- Column detection via vertical projection valleys
- Header segmentation via density scanning
- Multi-column classification (single / dual / mixed)

Region Classification
- Connected-component analysis
- Density-based classification (text vs figure vs caption)
- Column-aware region assignment

Region-Level OCR
- OCR applied per detected block (not full-page)
- Adaptive Tesseract configuration by region type
- Automatic fallback to column-level OCR when detection fails

Reading Order Reconstruction
- Column-aware ordering
- Top-to-bottom sequencing within columns
- Header/body/caption prioritization

Typed Pipeline Execution
- All steps validated via explicit input/output types
- Registry-driven execution model
- No implicit coupling between pipeline stages
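
The column-detection idea listed above can be sketched in a few lines. This is a minimal illustration, not the package's implementation: the function names `vertical_projection` and `find_columns`, the `min_gap` and `ink_threshold` parameters, and the list-of-rows binary image (1 = ink, 0 = background) are all assumptions made for the example.

```python
from typing import List, Tuple


def vertical_projection(binary: List[List[int]]) -> List[int]:
    """Count ink pixels (value 1) in every pixel column of a binary image."""
    if not binary:
        return []
    width = len(binary[0])
    return [sum(row[x] for row in binary) for x in range(width)]


def find_columns(
    binary: List[List[int]], min_gap: int = 2, ink_threshold: int = 0
) -> List[Tuple[int, int]]:
    """Return (start, end) x-ranges of text columns, end exclusive.

    A valley is a run of at least `min_gap` consecutive pixel columns whose
    ink count is <= `ink_threshold`. Shorter runs are treated as ordinary
    spacing and do not split a column. Both parameters are hypothetical.
    """
    profile = vertical_projection(binary)
    columns: List[Tuple[int, int]] = []
    start = None  # x position where the current text column began
    gap = 0       # length of the current run of empty pixel columns
    for x, ink in enumerate(profile):
        if ink > ink_threshold:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The empty run began at x - gap + 1, which closes the column.
                columns.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:
        # Close the last column, dropping any short trailing empty run.
        columns.append((start, len(profile) - gap))
    return columns


# Toy 3x10 "page": two ink areas separated by a 3-pixel-wide valley.
page = [
    [1, 1, 1, 0, 0, 0, 1, 1, 1, 1],
    [1, 0, 1, 0, 0, 0, 1, 1, 0, 1],
    [1, 1, 0, 0, 0, 0, 0, 1, 1, 1],
]
columns = find_columns(page, min_gap=2)
print(columns)  # [(0, 3), (6, 10)]
```

On a real scan the profile would normally be smoothed and `min_gap` expressed relative to page width, but those choices depend on the configured thresholds, which are not documented here.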

🔹 Architecture

The pipeline is built around a step registry + type-safe execution chain:

Each step declares:
- input type
- output type

The pipeline validates compatibility before execution. Execution is explicit, deterministic, and observable.

Example chain:

["preprocess", "detect_layout", "ocr_regions"]

Each step is independently replaceable and composable.
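
A step registry with a type-checked chain might look roughly like the sketch below. It is an assumption-laden illustration rather than the package's actual API: `Step`, `REGISTRY`, `register`, `validate_chain`, and `run_chain` are hypothetical names, and the four data classes are stripped-down stand-ins with made-up fields.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Step:
    name: str
    input_type: type
    output_type: type
    fn: Callable[[Any], Any]


REGISTRY: Dict[str, Step] = {}  # hypothetical global step registry


def register(name: str, input_type: type, output_type: type):
    """Decorator that records a step together with its declared types."""
    def decorator(fn):
        REGISTRY[name] = Step(name, input_type, output_type, fn)
        return fn
    return decorator


def validate_chain(names: List[str]) -> List[Step]:
    """Resolve step names and check each output type feeds the next input."""
    steps = []
    for name in names:
        if name not in REGISTRY:
            raise KeyError(f"unknown step: {name}")
        steps.append(REGISTRY[name])
    for prev, nxt in zip(steps, steps[1:]):
        if not issubclass(prev.output_type, nxt.input_type):
            raise TypeError(
                f"{prev.name} outputs {prev.output_type.__name__}, "
                f"but {nxt.name} expects {nxt.input_type.__name__}"
            )
    return steps


def run_chain(names: List[str], value: Any) -> Any:
    steps = validate_chain(names)  # fails before any step has run
    for step in steps:
        if not isinstance(value, step.input_type):
            raise TypeError(f"{step.name} received {type(value).__name__}")
        value = step.fn(value)
    return value


# Stand-in data types; the fields are invented for this example.
@dataclass
class PageImage:
    pixels: list

@dataclass
class PreprocessedImage:
    pixels: list

@dataclass
class LayoutDetection:
    column_count: int

@dataclass
class OCRResult:
    raw_text: str


@register("preprocess", PageImage, PreprocessedImage)
def preprocess(page: PageImage) -> PreprocessedImage:
    return PreprocessedImage(page.pixels)

@register("detect_layout", PreprocessedImage, LayoutDetection)
def detect_layout(image: PreprocessedImage) -> LayoutDetection:
    return LayoutDetection(column_count=1)

@register("ocr_regions", LayoutDetection, OCRResult)
def ocr_regions(layout: LayoutDetection) -> OCRResult:
    return OCRResult(raw_text=f"{layout.column_count} column(s)")


result = run_chain(["preprocess", "detect_layout", "ocr_regions"], PageImage([]))
print(result.raw_text)  # 1 column(s)
```

Validating the whole chain up front means an incompatible ordering such as `["preprocess", "ocr_regions"]` is rejected before any image work is done.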

🔹 Key Design Decisions

Typed Data Flow

All intermediate results are structured dataclasses:

PageImage
PreprocessedImage
LayoutDetection
OCRResult

Avoiding ad-hoc dictionaries ensures:

traceability
consistency
debuggability
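
To make the idea concrete, here is one way the four result types could be modelled as frozen dataclasses. Only the class names come from this README; every field, the `TextBlock` helper, and the `(x0, y0, x1, y1)` box convention are assumptions chosen for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1); hypothetical convention


@dataclass(frozen=True)
class PageImage:
    page_number: int
    width: int
    height: int


@dataclass(frozen=True)
class PreprocessedImage:
    source: PageImage  # keeps a link back to the input for traceability
    denoised: bool
    binarized: bool


@dataclass(frozen=True)
class LayoutDetection:
    columns: Tuple[Tuple[int, int], ...]  # (x_start, x_end) per column
    header_cutoff: int                    # y below which the body starts


@dataclass(frozen=True)
class TextBlock:  # hypothetical helper type, not named in the README
    box: Box
    kind: str     # "text", "figure", or "caption"
    text: str


@dataclass(frozen=True)
class OCRResult:
    blocks: Tuple[TextBlock, ...]
    layout: LayoutDetection

    @property
    def raw_text(self) -> str:
        """Concatenate block text in the stored (reading) order."""
        return "\n".join(block.text for block in self.blocks)


layout = LayoutDetection(columns=((0, 300), (320, 620)), header_cutoff=80)
result = OCRResult(
    blocks=(
        TextBlock((0, 100, 300, 140), "text", "Left column"),
        TextBlock((320, 100, 620, 140), "text", "Right column"),
    ),
    layout=layout,
)
print(result.raw_text)
print(asdict(result)["layout"])  # plain dict, e.g. for logging or JSON
```

Frozen instances cannot be mutated by a later stage, and `asdict` gives a uniform serialised form when a stage's output needs to be inspected.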

Layout-First OCR

OCR is applied after structure is understood, not before.

This prevents:

column interleaving
incorrect reading order
misclassification of content
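
The interleaving problem is easy to reproduce with a toy example. The sketch below compares a naive top-to-bottom sort with a column-aware sort; `Block`, `column_index`, and `reading_order` are hypothetical names, and assigning a block to a column by its left edge is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Block:
    x: int     # left edge of the block
    y: int     # top edge of the block
    text: str


def column_index(x: int, columns: Sequence[Tuple[int, int]]) -> int:
    """Index of the column whose (start, end) x-range contains x."""
    if not columns:
        return 0  # no layout information: treat the page as one column
    for i, (start, end) in enumerate(columns):
        if start <= x < end:
            return i
    # Block sits in a gutter or margin: pick the nearest column start.
    return min(range(len(columns)), key=lambda i: abs(columns[i][0] - x))


def reading_order(
    blocks: Sequence[Block], columns: Sequence[Tuple[int, int]]
) -> List[Block]:
    """Order blocks column by column, top to bottom inside each column."""
    return sorted(blocks, key=lambda b: (column_index(b.x, columns), b.y, b.x))


columns = [(0, 300), (320, 620)]
blocks = [
    Block(330, 200, "B2"),
    Block(10, 100, "A1"),
    Block(330, 100, "B1"),
    Block(10, 200, "A2"),
]

naive = sorted(blocks, key=lambda b: (b.y, b.x))
ordered = reading_order(blocks, columns)
print([b.text for b in naive])    # ['A1', 'B1', 'A2', 'B2'] (interleaved)
print([b.text for b in ordered])  # ['A1', 'A2', 'B1', 'B2']
```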

Fallback Over Failure

If region detection fails, the system falls back to column-level OCR, so the output is still usable.
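
A fallback of this kind can be expressed as a small branch that also records which path was taken. The sketch assumes the OCR engine is passed in as a plain callable; `OCRPass`, `ocr_with_fallback`, and `fake_ocr` are hypothetical names, and no real engine is invoked.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1); hypothetical convention


@dataclass(frozen=True)
class OCRPass:
    mode: str          # "region" or "column_fallback"
    texts: List[str]


def ocr_with_fallback(
    regions: Sequence[Box],
    columns: Sequence[Box],
    ocr_box: Callable[[Box], str],
) -> OCRPass:
    """OCR detected regions, or whole columns when no region was found."""
    if regions:
        return OCRPass("region", [ocr_box(region) for region in regions])
    # No usable regions: degrade to column-level OCR instead of failing.
    return OCRPass("column_fallback", [ocr_box(column) for column in columns])


def fake_ocr(box: Box) -> str:
    """Stand-in for a real engine call; returns a label for the crop."""
    return f"text@{box[0]},{box[1]}"


columns = [(0, 0, 300, 800), (320, 0, 620, 800)]
ok = ocr_with_fallback([(0, 100, 300, 140)], columns, fake_ocr)
degraded = ocr_with_fallback([], columns, fake_ocr)
print(ok.mode, degraded.mode)  # region column_fallback
```

Returning the mode alongside the text lets downstream consumers tell a degraded page from a fully segmented one.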

Determinism Over Heuristics

explicit thresholds (config-driven)
no hidden behavior
reproducible results across runs
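
One way to keep thresholds explicit is to gather them in an immutable config object that every decision function receives as an argument. The example below is a sketch under invented assumptions: `ClassifierConfig`, its two fields and default values, and the classification rules themselves are illustrative and do not describe the package's real criteria.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ClassifierConfig:
    # Hypothetical thresholds; real values would come from project config.
    figure_min_density: float = 0.45  # ink ratio at or above which a region is a figure
    caption_max_height: int = 40      # max pixel height of a caption


def ink_density(region: List[List[int]]) -> float:
    """Fraction of pixels that are ink (1) in a binary region."""
    total = sum(len(row) for row in region)
    return sum(map(sum, region)) / total if total else 0.0


def classify_region(region: List[List[int]], config: ClassifierConfig) -> str:
    """Label a region using only the thresholds carried by `config`."""
    if ink_density(region) >= config.figure_min_density:
        return "figure"
    if len(region) <= config.caption_max_height:
        return "caption"
    return "text"


config = ClassifierConfig(figure_min_density=0.5, caption_max_height=2)
dense = [[1, 1], [1, 0]]                # density 0.75
short = [[1, 0, 0, 0], [0, 0, 0, 1]]    # density 0.25, height 2
tall = [[1, 0, 0, 0]] * 3               # density 0.25, height 3

print(classify_region(dense, config))  # figure
print(classify_region(short, config))  # caption
print(classify_region(tall, config))   # text
```

Because the function reads nothing but its arguments, the same region and the same config always yield the same label.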

🔹 Why This Exists

Traditional OCR pipelines:

ignore layout
operate on full pages
produce inconsistent reading order
fail silently on complex documents

This system:

understands document structure
isolates regions before OCR
enforces reading order
produces structured outputs suitable for downstream systems

🔹 Example Use Cases

PDF → structured text extraction
research document ingestion pipelines
financial filings parsing
multi-column article extraction
preprocessing for NLP / LLM pipelines
search indexing and document analysis

🔹 Integration Context

This module is designed to plug into:

document ingestion systems
OCR + NLP pipelines (e.g. abstract_hugpy)
search and indexing systems
large-scale document processing workflows

🔹 Design Philosophy

Structure before extraction
Determinism over convenience
Typed pipelines over implicit flows
Fallback over failure
