High-Precision OCR Extraction for LLMs and RAG Systems: PDFs, Scanned PDFs, and Images
RostaingOCR
Production-Grade Layout-Aware OCR for LLMs and RAG Systems
RostaingOCR is a Python library that extracts text from PDFs, scanned PDFs, and images while preserving complex layouts. Unlike standard OCR tools that output a "soup" of words, it combines deep learning with geometric reconstruction to keep tables, columns, and document structure intact.
It is specifically optimized for Retrieval-Augmented Generation (RAG) pipelines where maintaining the visual structure of data (like invoice tables) is critical for LLM comprehension.
Key Features
- Layout-Aware: Uses geometric clustering to reconstruct tables and columns. Data stays on the correct line, visually aligned.
- 🧹 Noise Filtering: Automatically detects and removes low-confidence text such as messy handwriting, signatures, and stamps to keep the output clean.
- ⚡ Local Processing: Runs 100% locally (CPU or GPU). No external APIs, no data leaving your server.
- 📄 Universal Input: Handles PDFs (digital & scanned) and common image formats via a robust Base64 architecture.
- 🔒 Privacy Focused: Temporary files are handled securely and deleted immediately after extraction.
Installation
pip install rostaing-ocr
(Note: the first run automatically downloads the required OCR models, about 300 MB.)
Usage
1. Basic Usage (Default Behavior)
By default, the extractor prints the result to the console and saves it to output.txt.
from rostaing_ocr import ocr_extractor
# This immediately runs the extraction using DocTR
extractor = ocr_extractor("documents/invoice.pdf")
# The extracted text is now in 'output.txt'
print(extractor) # Prints status summary (Time taken, pages processed)
2. Custom Output File
You can specify a different filename. The file will be created or overwritten automatically.
from rostaing_ocr import ocr_extractor
extractor = ocr_extractor(
    "data/report.png",
    output_file="results/report_analysis.txt"
)
3. Silent Mode (Background Processing)
Useful for batch processing or server backends where you don't want console logs.
from rostaing_ocr import ocr_extractor
extractor = ocr_extractor(
    "financial_statement.pdf",
    print_to_console=False,
    save_file=True
)
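For batch jobs, the silent call above can be wrapped in a small loop. The following is only a sketch: `batch_extract` and its `out_dir` parameter are hypothetical helpers, not part of the library, and the extractor is passed in as an argument so the loop itself does not depend on how the package is installed. It also assumes that `output_file`, `print_to_console`, and `save_file` can be combined in a single call, which the examples here show only separately.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable


def batch_extract(paths: Iterable[str],
                  extract: Callable[..., object],
                  out_dir: str = "ocr_out") -> Dict[str, str]:
    """Run `extract` silently on each path and return {input path: status}."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    statuses: Dict[str, str] = {}
    for path in paths:
        # One output file per input, named after the input's stem.
        target = Path(out_dir) / (Path(path).stem + ".txt")
        try:
            result = extract(
                path,
                output_file=str(target),
                print_to_console=False,
                save_file=True,
            )
            # `status` is the attribute used in the integration example;
            # fall back to "Unknown" if the returned object lacks it.
            statuses[path] = getattr(result, "status", "Unknown")
        except Exception as exc:  # keep the batch going if one file fails
            statuses[path] = f"Error: {exc}"
    return statuses
```

With the library installed, this would be called as `batch_extract(["a.pdf", "b.png"], ocr_extractor)`. Inputs that share a stem (for example `a.pdf` and `a.png`) would write to the same output file, so a real pipeline may want a different naming scheme.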
4. Direct Integration (RAG Pipelines)
Read the extracted text directly from the extractor object instead of re-reading the output file.
from rostaing_ocr import ocr_extractor
extractor = ocr_extractor("scan.jpg", print_to_console=False)
if extractor.status == "Success":
    clean_text = extractor.extracted_text
    # Send 'clean_text' to GPT, Mistral, Gemini, Claude, Grok, Groq, Llama... or your Vector DB
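Before the text reaches a vector database it usually has to be split into chunks. Since the point of the layout-aware output is that each table row stays on one line, one reasonable approach is to split only at line boundaries. The helper below is a minimal sketch of that idea; `chunk_by_lines` and `max_chars` are illustrative names, not part of RostaingOCR.

```python
from typing import List


def chunk_by_lines(text: str, max_chars: int = 1000) -> List[str]:
    """Split text into chunks of roughly max_chars, breaking only at line ends."""
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for line in text.splitlines():
        line_len = len(line) + 1  # +1 for the newline used when rejoining
        # Flush the current chunk if adding this line would exceed the budget.
        if current and size + line_len > max_chars:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += line_len
    if current:
        chunks.append("\n".join(current))
    return chunks
```

A single line longer than `max_chars` is kept whole as its own chunk rather than being cut mid-row. In the snippet above, `chunk_by_lines(clean_text)` would produce the pieces to embed.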
How It Works (Architecture)
- Input Normalization: Converts PDF pages or images into high-resolution Base64 streams.
- Deep Learning Inference: DBNet for detection + CRNN for recognition.
- Noise Filtering: Scans confidence scores. Text with low confidence (e.g., < 0.4), such as signatures or stamps, is discarded.
- Geometric Reconstruction:
  - Flattens the document hierarchy.
  - Clusters words into visual lines based on Y-axis alignment.
  - Calculates horizontal gaps to insert dynamic spacing (tabs vs spaces) to simulate columns.
- Output: Returns a clean, structured string that looks like the original document.
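The noise-filtering and geometric-reconstruction steps can be illustrated with a small standalone sketch. This is not the library's actual implementation: the `Word` record, the `reconstruct` function, and the `y_tol` and `col_gap` thresholds are assumptions made for the example, and the coordinates are in arbitrary units. Only the 0.4 confidence cutoff comes from the description above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    x0: float    # left edge
    x1: float    # right edge
    y: float     # vertical centre
    conf: float  # recognition confidence in [0, 1]


def reconstruct(words: List[Word], min_conf: float = 0.4,
                y_tol: float = 5.0, col_gap: float = 40.0) -> str:
    # 1. Noise filtering: discard words below the confidence cutoff.
    kept = [w for w in words if w.conf >= min_conf]

    # 2. Line clustering: walk the words top to bottom and start a new
    #    line whenever the vertical jump exceeds y_tol.
    lines: List[List[Word]] = []
    for w in sorted(kept, key=lambda w: w.y):
        if lines and abs(w.y - lines[-1][-1].y) <= y_tol:
            lines[-1].append(w)
        else:
            lines.append([w])

    # 3. Spacing: order each line left to right; a wide horizontal gap
    #    becomes a tab (column break), a narrow one a single space.
    rendered = []
    for line in lines:
        line.sort(key=lambda w: w.x0)
        parts = [line[0].text]
        for prev, cur in zip(line, line[1:]):
            sep = "\t" if cur.x0 - prev.x1 > col_gap else " "
            parts.append(sep + cur.text)
        rendered.append("".join(parts))
    return "\n".join(rendered)


words = [
    Word("Total", 10, 50, 100, 0.95),
    Word("42.00", 200, 240, 101, 0.90),
    Word("Invoice", 10, 60, 20, 0.99),
    Word("No.", 65, 85, 21, 0.97),
    Word("~sig~", 100, 180, 150, 0.20),  # low confidence, filtered out
]
print(reconstruct(words))
```

Here the low-confidence word is dropped, "Invoice" and "No." are joined with a space, and "Total" and "42.00" are separated by a tab because of the wide gap between them.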
License
MIT License
File details
Details for the file rostaing_ocr-1.2.2.tar.gz.
File metadata
- Download URL: rostaing_ocr-1.2.2.tar.gz
- Upload date:
- Size: 7.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 66f0b94cfcd6b880fc2d3247e3add72a580674fe1a6e4f7a6449d5d1e71b076e |
| MD5 | dccf043036ae69f2fd99e73bc533a667 |
| BLAKE2b-256 | 567ce28b4e26e89a25f68e72b63b0a2165172e0d9a13509ccbb037f7f7bc78b6 |
File details
Details for the file rostaing_ocr-1.2.2-py3-none-any.whl.
File metadata
- Download URL: rostaing_ocr-1.2.2-py3-none-any.whl
- Upload date:
- Size: 7.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ad4158427621c62743c34c0732555c92b76e2c83e495532d1eabd26cfdd4f33c |
| MD5 | 71073ea4a24d4a28a02a6389df008377 |
| BLAKE2b-256 | 70b1ec54de77c361997503ae72110eb9a81dac433455d3465d0e522cd7c8a574 |