Modular OCR pipeline for historical newspaper scans

Maintainers

nealcaren

newspaper-ocr

Modular OCR pipeline for historical newspaper scans. Three-phase architecture with swappable backends at every stage.

Pipeline

                    Phase 1              Phase 2           Phase 3
                    LAYOUT               OCR               POST-PROCESSING

 Image ──→ ┌─────────────────┐   ┌──────────────┐   ┌─────────────────┐
           │ Detection       │   │ Recognition  │   │ Text Cleaning   │
 JP2       │ (AS YOLO or     │──→│ (Tesseract,  │──→│ (dehyphenation, │──→ Output
 JPG       │  PP-DocLayout)  │   │  tesserocr,  │   │  line joining)  │    text
 PNG       │                 │   │  EffOCR)     │   │                 │    json
           │ Layout Proc.    │   │              │   │ Spell Check     │    hOCR
           │ (reading order, │   │              │   │ (SymSpell)      │
           │  dedup, merge)  │   │              │   │                 │
           └─────────────────┘   └──────────────┘   └─────────────────┘

Phase 1 — Layout: Detect regions (articles, headlines, ads) and text lines. Reorder into newspaper reading order (columns left-to-right, top-to-bottom). Deduplicate overlapping detections, fill gaps.

Phase 2 — OCR: Recognize text in each detected line or region. Swappable backends with different speed/accuracy tradeoffs.

Phase 3 — Post-Processing: Reconstruct continuous text from OCR'd lines. Rejoin hyphenated words across line breaks. Join continuation lines into paragraphs. Optional spell correction.

Installation

pip install newspaper-ocr

# Tesseract (requires system install):
#   macOS: brew install tesseract
#   Ubuntu: apt install tesseract-ocr

# Optional backends:
pip install "newspaper-ocr[glm-ocr]"     # GLM-OCR vision-language model
pip install "newspaper-ocr[paddlex]"      # PP-DocLayout detector

# EfficientOCR (installed separately from fork):
pip install git+https://github.com/nealcaren/efficient_ocr.git

Quick Start

Python

from newspaper_ocr import Pipeline

# Defaults: AS YOLO detection + Tesseract recognition
pipe = Pipeline()
text = pipe.ocr("page.jp2")

# Fast mode (tesserocr C API, ~4x faster)
pipe = Pipeline(recognizer="tesserocr")

# With spell correction
pipe = Pipeline(recognizer="tesserocr", spell_check=True)

# JSON output with bounding boxes and confidence scores
pipe = Pipeline(output="json")
result = pipe.ocr("page.jp2")

# Fine-tuned Tesseract model
from newspaper_ocr.recognizers.tesseract import TesseractRecognizer
rec = TesseractRecognizer(model="news_gold_v2", tessdata_dir="/path/to/models")
pipe = Pipeline(recognizer=rec)

# Disable layout post-processing (for non-newspaper documents)
pipe = Pipeline(layout_processing=False)

# Batch processing
results = pipe.ocr_batch(["page1.jp2", "page2.jp2", "page3.jp2"])

Command Line

# Basic OCR
newspaper-ocr page.jp2

# Fast mode with JSON output
newspaper-ocr page.jp2 --backend tesserocr --output json

# With spell correction
newspaper-ocr page.jp2 --backend tesserocr --spell-check

# Batch processing to files
newspaper-ocr *.jp2 --outdir results/ --output text

# Fine-tuned model
newspaper-ocr page.jp2 --model news_gold_v2.traineddata

# Disable post-processing
newspaper-ocr page.jp2 --no-layout-processing --no-text-cleaning

Phase 1: Layout

Two detection backends, followed by newspaper-specific layout post-processing ported from a production pipeline.

Detectors

Detector	What it finds	Speed	Best for
`as_yolo` (default)	Regions + lines	~8s/page	Line-level OCR (Tesseract, EffOCR)
`paddlex`	Regions only (20 categories)	varies	Region-level OCR, detailed layout analysis

Layout Processing

Ported from the Dangerous Press production pipeline. Applied automatically after detection:

Filter low-confidence detections
Rescue missed regions in gaps between accepted detections
Deduplicate overlapping regions (three-pass: contained, title-text, near-duplicate)
Fill column gaps using geometric column detection
Reading order — column-aware sorting (full-width headers first, then column-by-column)
Merge vertically adjacent blocks into coherent regions

Disable with layout_processing=False.
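
To make the deduplication and reading-order steps concrete, here is a minimal sketch. It is not the package's implementation: the `Box` type, the 0.9 containment threshold, the fixed column count, and the "wider than 60% of the page counts as full-width" rule are all assumptions chosen for illustration. It also ignores full-width blocks that appear mid-page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical region box in pixel coordinates (not the package's type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes (0 if they do not overlap)."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    return max(0.0, w) * max(0.0, h)


def drop_contained(boxes: list[Box], threshold: float = 0.9) -> list[Box]:
    """Drop any box that lies almost entirely inside a larger kept box."""
    kept: list[Box] = []
    # Visit large boxes first so smaller duplicates are compared against them.
    for box in sorted(boxes, key=lambda b: b.area, reverse=True):
        if box.area == 0:
            continue
        if any(overlap_area(box, k) / box.area >= threshold for k in kept):
            continue
        kept.append(box)
    return kept


def reading_order(boxes: list[Box], page_width: float, n_columns: int,
                  full_width_ratio: float = 0.6) -> list[Box]:
    """Full-width boxes first (top to bottom), then column by column."""
    col_width = page_width / n_columns
    headers = [b for b in boxes if (b.x1 - b.x0) >= full_width_ratio * page_width]
    body = [b for b in boxes if b not in headers]

    def column_of(b: Box) -> int:
        # Assign a box to the column that contains its horizontal center.
        center = (b.x0 + b.x1) / 2
        return min(n_columns - 1, int(center // col_width))

    headers.sort(key=lambda b: b.y0)
    body.sort(key=lambda b: (column_of(b), b.y0))
    return headers + body
```

A real pipeline would estimate the column boundaries from the detections themselves rather than take `n_columns` as an argument.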

Phase 2: OCR

Three recognition backends with different speed/accuracy tradeoffs.

Backend	Mode	Speed	CER*	How it works
`tesseract`	line	~106s	3.2%	Subprocess per line, LSTM sequence model
`tesseract`	region	~38s	—	Subprocess per region, Tesseract's own line segmentation
`tesserocr`	line	~26s	3.2%	C API bindings, no subprocess overhead
`tesserocr`	region	~25s	—	C API, region-level
`effocr`	line	~50s	11.2%	Contrastive char/word matching, ONNX

*CER (character error rate) is measured against LLM-verified gold-standard labels, using the fine-tuned news_gold_v2 model. Baseline Tesseract (eng) scores roughly 8-11% CER. Times are for a single newspaper page (~1,100 lines).
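
For readers unfamiliar with the metric, CER is conventionally the character-level edit distance between the OCR output and the reference, divided by the reference length. The sketch below shows that standard definition; the function names are illustrative, and the project's own evaluation may normalize whitespace or case differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# "Tbe" needs 1 edit and "brovvn" needs 2, so this is 3 edits over 19 characters.
print(f"{cer('The quick brown fox', 'Tbe quick brovvn fox'):.3f}")
```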

Fine-Tuned Models

The pipeline includes infrastructure for fine-tuning Tesseract on historical newspaper text using LLM-verified gold-standard labels. See dangerouspress-ocr-finetune for the training pipeline.

Phase 3: Post-Processing

Text Cleaning

Reconstructs continuous text from OCR'd lines:

Dehyphenation: "com-" + "plete" → "complete" (when next line starts lowercase)
Line joining: Continuation lines joined with spaces
Paragraph breaks: Detected via vertical gaps, column shifts, or terminal punctuation + uppercase
Semantic dashes preserved: Em-dashes and spaced dashes kept intact

Disable with text_cleaning=False or --no-text-cleaning.
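
The dehyphenation and line-joining rules above can be sketched in a few lines. This is an illustration under stated assumptions, not the package's code: `join_lines` is a hypothetical helper, it handles a single paragraph, and it leaves paragraph-break detection out entirely.

```python
def ends_with_line_break_hyphen(text: str) -> bool:
    """True for 'com-' but not for a spaced dash ('word -') or an em-dash."""
    return len(text) >= 2 and text[-1] == "-" and text[-2].isalpha()


def join_lines(lines: list[str]) -> str:
    """Join OCR'd lines into one paragraph, rejoining hyphen-split words."""
    out = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if not out:
            out = line
        elif ends_with_line_break_hyphen(out) and line[0].islower():
            # "com-" + "plete" -> "complete"
            out = out[:-1] + line
        else:
            # Ordinary continuation line: join with a single space.
            out = out + " " + line
    return out


print(join_lines(["The work is com-", "plete and ready"]))
```

The lowercase check is a heuristic: a hyphen followed by a capitalized line is left alone, which keeps the hyphen but may leave a stray space in compounds split before a proper noun.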

Spell Correction

Optional SymSpell-based correction (spell_check=True):

Corrects words not found in dictionary (edit distance ≤ 2)
Preserves capitalization, punctuation, numbers, abbreviations
Supports custom frequency dictionaries for corpus-specific vocabulary
Logs all corrections for review

pipe = Pipeline(spell_check=True)

# With corpus-specific dictionary
from newspaper_ocr.spell_checker import SpellChecker
checker = SpellChecker(dictionary_path="my_newspaper_words.txt")
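
The behavior listed above (dictionary lookup within edit distance 2, capitalization preserved, corrections logged) can be sketched with the standard library alone. Note the substitutions: a brute-force scan of a toy frequency dictionary stands in for SymSpell's precomputed index, and `correct_text`, the `FREQ` table, and the "skip words shorter than four letters" rule are assumptions made for this example, not part of the package.

```python
import re

# Toy frequency dictionary; a real one would hold many thousands of entries.
FREQ = {"the": 1000, "governor": 120, "signed": 90, "bill": 80, "today": 70}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def correct_word(word: str, freq: dict[str, int], max_distance: int = 2) -> str:
    """Return the closest dictionary word, keeping the original capitalization."""
    lower = word.lower()
    if lower in freq or len(lower) < 4:
        return word  # known word, or too short to correct safely
    # Prefer the smallest distance, then the most frequent candidate.
    dist, _, best = min((edit_distance(lower, cand), -count, cand)
                        for cand, count in freq.items())
    if dist > max_distance:
        return word
    if word.isupper():
        return best.upper()
    if word[0].isupper():
        return best.capitalize()
    return best


def correct_text(text: str, freq: dict[str, int],
                 max_distance: int = 2) -> tuple[str, list[tuple[str, str]]]:
    """Correct alphabetic tokens only; return the text and a correction log."""
    log: list[tuple[str, str]] = []

    def repl(match: re.Match) -> str:
        original = match.group(0)
        fixed = correct_word(original, freq, max_distance)
        if fixed != original:
            log.append((original, fixed))
        return fixed

    # Digits and punctuation never match the pattern, so they pass through.
    return re.sub(r"[A-Za-z]+", repl, text), log


fixed, log = correct_text("The Govemor signcd the bill today, 1923.", FREQ)
print(fixed)
print(log)
```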

Output Formats

Format	Flag	Content
`text`	`--output text`	Plain text, paragraphs separated by blank lines
`json`	`--output json`	Structured: regions, lines, bounding boxes, confidence
`hocr`	`--output hocr`	HTML with spatial coordinates (for text overlay on images)
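
For context on the `hocr` format: hOCR is ordinary HTML in which each element carries its coordinates in a `title` attribute, such as `bbox x0 y0 x1 y1` on an `ocr_line` span. The sketch below builds one such line by hand. `line_to_hocr` is a hypothetical helper, and the exact markup this package emits may differ.

```python
from html import escape


def line_to_hocr(text: str, bbox: tuple[int, int, int, int], line_id: str) -> str:
    """Render one recognized line as an hOCR 'ocr_line' span."""
    x0, y0, x1, y1 = bbox
    return (
        f"<span class='ocr_line' id='{escape(line_id)}' "
        f"title='bbox {x0} {y0} {x1} {y1}'>{escape(text)}</span>"
    )


print(line_to_hocr("Smith & Sons", (10, 20, 300, 45), "line_1"))
```

Escaping the text matters because OCR output routinely contains `&`, `<`, and quote characters.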

Architecture

Every stage is a swappable component behind an abstract interface. Adding a new backend takes one file and one registry entry.

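from __future__ import annotations  # keeps the return annotations below unevaluated

from newspaper_ocr import Pipeline

# Custom detector
from newspaper_ocr.detectors.base import Detector
class MyDetector(Detector):
    # PageLayout is the package's layout type (import path not shown here)
    def detect(self, image) -> PageLayout: ...

# Custom recognizer
from newspaper_ocr.recognizers.base import LineRecognizer
class MyRecognizer(LineRecognizer):
    # Line is the package's line type (import path not shown here)
    def recognize(self, line) -> Line: ...

# Plug into pipeline
pipe = Pipeline(detector=MyDetector(), recognizer=MyRecognizer())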
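
The "one file + one registry entry" idea can be illustrated with a generic registry pattern. This is a sketch of the general technique, not the package's internals: `BaseRecognizer`, `RECOGNIZERS`, `register`, and `make_recognizer` are hypothetical names. It shows how a pipeline could accept either a backend name or a ready-made instance.

```python
from abc import ABC, abstractmethod


class BaseRecognizer(ABC):
    """Hypothetical abstract interface that every backend implements."""

    @abstractmethod
    def recognize(self, line_image) -> str: ...


# Name -> class lookup table; adding a backend means adding one entry.
RECOGNIZERS: dict[str, type[BaseRecognizer]] = {}


def register(name: str):
    """Class decorator that records a backend under a short name."""
    def decorator(cls: type[BaseRecognizer]) -> type[BaseRecognizer]:
        RECOGNIZERS[name] = cls
        return cls
    return decorator


@register("upper")
class ShoutingRecognizer(BaseRecognizer):
    """Stand-in backend: 'recognizes' text by upper-casing its input."""

    def recognize(self, line_image) -> str:
        return str(line_image).upper()


def make_recognizer(spec) -> BaseRecognizer:
    """Accept a registered name or an already-constructed instance."""
    if isinstance(spec, BaseRecognizer):
        return spec
    try:
        return RECOGNIZERS[spec]()
    except KeyError:
        raise ValueError(
            f"unknown recognizer {spec!r}; choose from {sorted(RECOGNIZERS)}"
        ) from None


print(make_recognizer("upper").recognize("front page"))
```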

License

MIT

Release history

0.1.0 (Mar 25, 2026)

Download files

Source Distribution

newspaper_ocr-0.1.0.tar.gz (39.8 kB view details)

Uploaded Mar 25, 2026 Source

Built Distribution

newspaper_ocr-0.1.0-py3-none-any.whl (35.2 kB view details)

Uploaded Mar 25, 2026 Python 3

File details

Details for the file newspaper_ocr-0.1.0.tar.gz.

File metadata

Download URL: newspaper_ocr-0.1.0.tar.gz
Upload date: Mar 25, 2026
Size: 39.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for newspaper_ocr-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`8ba26c353ddcf42e0365a79057fb3396f8b5643d9a20513fbce19961b766dc5b`
MD5	`ffe22d8832b856a62a115e7740461370`
BLAKE2b-256	`0535d36a8e1e1de45fa4fbace785cbbadbe21f67138d2ec79d03c18ef14ac2cb`

Provenance

The following attestation bundles were made for newspaper_ocr-0.1.0.tar.gz:

Publisher: publish.yml on nealcaren/newspaper-ocr

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: newspaper_ocr-0.1.0.tar.gz
- Subject digest: 8ba26c353ddcf42e0365a79057fb3396f8b5643d9a20513fbce19961b766dc5b
- Sigstore transparency entry: 1181159401
- Sigstore integration time: Mar 25, 2026
Source repository:
- Permalink: nealcaren/newspaper-ocr@ed5cdd9e0a1f6b0c91eb4d714040cb745b4b40b5
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/nealcaren
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ed5cdd9e0a1f6b0c91eb4d714040cb745b4b40b5
- Trigger Event: release

File details

Details for the file newspaper_ocr-0.1.0-py3-none-any.whl.

File metadata

Download URL: newspaper_ocr-0.1.0-py3-none-any.whl
Upload date: Mar 25, 2026
Size: 35.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for newspaper_ocr-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1a76413a557f90fdc2ce692c50ef17abb28c8990f792ac1de20a5045b9698542`
MD5	`a3bb1d603252ba22d0e22651f05fe402`
BLAKE2b-256	`ef112e79423c06fa484c84d3bb13b6471c7978259139c9755eb7f8f630df43a9`

Provenance

The following attestation bundles were made for newspaper_ocr-0.1.0-py3-none-any.whl:

Publisher: publish.yml on nealcaren/newspaper-ocr

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: newspaper_ocr-0.1.0-py3-none-any.whl
- Subject digest: 1a76413a557f90fdc2ce692c50ef17abb28c8990f792ac1de20a5045b9698542
- Sigstore transparency entry: 1181159416
- Sigstore integration time: Mar 25, 2026
Source repository:
- Permalink: nealcaren/newspaper-ocr@ed5cdd9e0a1f6b0c91eb4d714040cb745b4b40b5
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/nealcaren
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ed5cdd9e0a1f6b0c91eb4d714040cb745b4b40b5
- Trigger Event: release
