High-Precision OCR Extraction for LLMs and RAG Systems: PDFs, Scanned PDFs, and Images

RostaingOCR

Production-Grade Layout-Aware OCR for LLMs and RAG Systems

RostaingOCR is a high-performance Python library that extracts text from PDFs, scanned PDFs, and images while preserving complex layouts. Unlike standard OCR tools that output a "soup" of words, it uses deep learning and geometric reconstruction to preserve tables, columns, and document structure.

It is specifically optimized for Retrieval-Augmented Generation (RAG) pipelines where maintaining the visual structure of data (like invoice tables) is critical for LLM comprehension.

Key Features

Layout-Aware: Uses geometric clustering to reconstruct tables and columns. Data stays on the correct line, visually aligned.
🧹 Noise Filtering: Automatically detects and removes low-confidence text such as messy handwriting, signatures, and stamps to keep the output clean.
⚡ Local Processing: Runs 100% locally (CPU or GPU). No external APIs, no data leaving your server.
📄 Universal Input: Handles PDFs (digital & scanned) and common image formats by first converting every input to Base64-encoded page images.
🔒 Privacy Focused: Temporary files are handled securely and deleted immediately after extraction.

Installation

pip install rostaing-ocr

(Note: the first run automatically downloads the required OCR models, about 300 MB.)

Usage

1. Basic Usage (Default Behavior)

By default, the extractor prints the result to the console and saves it to output.txt.

from rostaing_ocr import ocr_extractor

# This immediately runs the extraction using DocTR
extractor = ocr_extractor("documents/invoice.pdf")

# The extracted text is now in 'output.txt'
print(extractor) # Prints status summary (Time taken, pages processed)

2. Custom Output File

You can specify a different filename. The file will be created or overwritten automatically.

from rostaing_ocr import ocr_extractor

extractor = ocr_extractor(
    "data/report.png",
    output_file="results/report_analysis.txt"
)

3. Silent Mode (Background Processing)

Useful for batch processing or server backends where you don't want console logs.

from rostaing_ocr import ocr_extractor

extractor = ocr_extractor(
    "financial_statement.pdf",
    print_to_console=False,
    save_file=True
)
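
For batch jobs, silent mode can be wrapped in a small loop. The sketch below is one possible arrangement, not part of the library: batch_extract and run_rostaing are hypothetical helper names, and writing each result to "<input>.txt" is an assumption made here so that files do not overwrite a shared output.txt. Only the ocr_extractor arguments and attributes shown in this README are used.

```python
from typing import Callable, Dict, Iterable, Optional


def run_rostaing(path: str) -> Optional[str]:
    """Hypothetical helper: run RostaingOCR silently on one file.

    Returns the extracted text, or None when the status is not "Success".
    """
    # Imported lazily so the rest of this sketch works without the library.
    from rostaing_ocr import ocr_extractor

    extractor = ocr_extractor(
        path,
        print_to_console=False,
        output_file=f"{path}.txt",  # assumption: one output file per input
    )
    return extractor.extracted_text if extractor.status == "Success" else None


def batch_extract(
    paths: Iterable[str],
    run_ocr: Callable[[str], Optional[str]] = run_rostaing,
) -> Dict[str, Optional[str]]:
    """Map each input path to its extracted text, or None if extraction failed."""
    results: Dict[str, Optional[str]] = {}
    for path in paths:
        try:
            results[path] = run_ocr(path)
        except Exception:
            # One unreadable file should not abort the whole batch.
            results[path] = None
    return results
```

Passing run_ocr as a parameter keeps the loop testable with a stub and makes it easy to swap in a different extractor configuration.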

4. Direct Integration (RAG Pipelines)

Read the extracted text directly from the extractor's extracted_text attribute instead of re-opening the output file.

from rostaing_ocr import ocr_extractor

extractor = ocr_extractor("scan.jpg", print_to_console=False)

if extractor.status == "Success":
    clean_text = extractor.extracted_text
    # Send 'clean_text' to GPT, Mistral, Gemini, Claude, Grok, Groq, Llama... or your Vector DB
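
Before the text goes to a vector database it usually has to be split into chunks. The following is a minimal sketch of one way to do that without breaking the reconstructed layout: it splits only at line boundaries, so a table row is never cut in half. chunk_by_lines and max_chars are hypothetical names chosen for this illustration, not part of RostaingOCR.

```python
from typing import List


def chunk_by_lines(text: str, max_chars: int = 1000) -> List[str]:
    """Split text into chunks of at most max_chars, cutting only between lines.

    A single line longer than max_chars is kept whole in its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for line in text.splitlines():
        line_len = len(line) + 1  # +1 accounts for the newline separator
        if current and size + line_len > max_chars:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += line_len
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Each chunk can then be embedded and stored; a character budget is used here for simplicity, and a token-based budget would follow the same pattern.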

How It Works (Architecture)

1. Input Normalization: Converts PDF pages or Images into High-Res Base64 streams.
2. Deep Learning Inference: DBNet for detection + CRNN for recognition.
3. Noise Filtering: Scans confidence scores. Text with low confidence (e.g., < 0.4), such as signatures or stamps, is discarded.
4. Geometric Reconstruction:
   - Flattens the document hierarchy.
   - Clusters words into visual lines based on Y-axis alignment.
   - Calculates horizontal gaps to insert dynamic spacing (tabs vs spaces) to simulate columns.
5. Output: Returns a clean, structured string that looks like the original document.
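
The noise filtering and geometric reconstruction steps above can be illustrated with a small standalone sketch. This is not the library's actual implementation: the Word class, the function names, the normalized coordinates, and the y_tolerance and column_gap thresholds are all assumptions made for illustration. Only the 0.4 confidence cutoff comes from the description above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """One recognized word with a confidence score and a normalized box (0..1)."""
    text: str
    confidence: float
    x_min: float
    x_max: float
    y_center: float


def filter_noise(words: List[Word], min_confidence: float = 0.4) -> List[Word]:
    """Drop low-confidence words (e.g. signatures, stamps)."""
    return [w for w in words if w.confidence >= min_confidence]


def cluster_lines(words: List[Word], y_tolerance: float = 0.01) -> List[List[Word]]:
    """Group words into visual lines by Y proximity, then order each line by X."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w.y_center):
        # Join the current line if the word sits close to the previous word's Y.
        if lines and abs(word.y_center - lines[-1][-1].y_center) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [sorted(line, key=lambda w: w.x_min) for line in lines]


def render_line(line: List[Word], column_gap: float = 0.05) -> str:
    """Join words with a space, or a tab when the horizontal gap suggests a column."""
    parts = [line[0].text]
    for prev, cur in zip(line, line[1:]):
        gap = cur.x_min - prev.x_max
        parts.append("\t" if gap > column_gap else " ")
        parts.append(cur.text)
    return "".join(parts)


def reconstruct(words: List[Word], min_confidence: float = 0.4) -> str:
    """Filter noise, cluster into lines, and render a layout-preserving string."""
    kept = filter_noise(words, min_confidence)
    return "\n".join(render_line(line) for line in cluster_lines(kept))


# Toy input: a two-column table plus a low-confidence "signature".
sample_words = [
    Word("Item", 0.99, 0.05, 0.12, 0.100),
    Word("Price", 0.98, 0.60, 0.70, 0.101),
    Word("Widget", 0.97, 0.05, 0.15, 0.150),
    Word("A", 0.95, 0.16, 0.18, 0.151),
    Word("9.99", 0.96, 0.60, 0.68, 0.149),
    Word("~sig~", 0.20, 0.40, 0.55, 0.150),
]
print(reconstruct(sample_words))
```

In this toy input the signature is dropped by the confidence filter, "Widget A" stays together because the gap between the two words is small, and the price lands after a tab because its gap exceeds column_gap.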

License

MIT License

Release history

1.2.2 (this version): Jan 21, 2026
1.2.1: Jan 16, 2026
1.2.0: Jan 15, 2026
1.1.1: Jan 13, 2026
1.1.0: Jan 13, 2026
0.4.2: Oct 16, 2025
0.4.1: Oct 15, 2025
0.4.0: Oct 15, 2025
0.3.1: Jul 11, 2025
0.3.0: Jul 10, 2025
0.2.1: Jul 2, 2025
0.2.0: Jun 30, 2025
0.1.1: Jun 29, 2025
0.1.0: Jun 29, 2025

Download files

Source Distribution

rostaing_ocr-1.2.2.tar.gz (7.7 kB)

Uploaded Jan 21, 2026 Source

Built Distribution

rostaing_ocr-1.2.2-py3-none-any.whl (7.8 kB)

Uploaded Jan 21, 2026 Python 3

File details

Details for the file rostaing_ocr-1.2.2.tar.gz.

File metadata

Download URL: rostaing_ocr-1.2.2.tar.gz
Upload date: Jan 21, 2026
Size: 7.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for rostaing_ocr-1.2.2.tar.gz
Algorithm	Hash digest
SHA256	`66f0b94cfcd6b880fc2d3247e3add72a580674fe1a6e4f7a6449d5d1e71b076e`
MD5	`dccf043036ae69f2fd99e73bc533a667`
BLAKE2b-256	`567ce28b4e26e89a25f68e72b63b0a2165172e0d9a13509ccbb037f7f7bc78b6`

See more details on using hashes here.
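
A downloaded file can be checked against the SHA256 digest listed above using Python's standard hashlib module. The sketch below is one way to do it; sha256_of and matches_published_hash are hypothetical helper names, and the file is assumed to be in the current directory in the usage note.

```python
import hashlib


def sha256_of(path: str, block_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes at EOF.
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, this would be called as matches_published_hash("rostaing_ocr-1.2.2.tar.gz", "66f0b94cfcd6b880fc2d3247e3add72a580674fe1a6e4f7a6449d5d1e71b076e").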

File details

Details for the file rostaing_ocr-1.2.2-py3-none-any.whl.

File metadata

Download URL: rostaing_ocr-1.2.2-py3-none-any.whl
Upload date: Jan 21, 2026
Size: 7.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for rostaing_ocr-1.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ad4158427621c62743c34c0732555c92b76e2c83e495532d1eabd26cfdd4f33c`
MD5	`71073ea4a24d4a28a02a6389df008377`
BLAKE2b-256	`70b1ec54de77c361997503ae72110eb9a81dac433455d3465d0e522cd7c8a574`

See more details on using hashes here.
