High-Precision OCR Extraction for LLMs and RAG Systems: PDFs, Scanned PDFs, and Images

RostaingOCR

Production-Grade Layout-Aware OCR for LLMs and RAG Systems

RostaingOCR is a Python library that extracts text from digital PDFs, scanned PDFs, and images while preserving complex layouts. Unlike standard OCR tools that output a "soup" of words, it combines deep-learning OCR with geometric reconstruction to keep tables, columns, and document structure intact.

It is optimized for Retrieval-Augmented Generation (RAG) pipelines, where preserving the visual structure of data (such as invoice tables) is critical for LLM comprehension.

Key Features

  • Layout-Aware: Uses geometric clustering to reconstruct tables and columns. Data stays on the correct line, visually aligned.
  • 🧹 Noise Filtering: Automatically detects and removes low-confidence text such as messy handwriting, signatures, and stamps to keep the output clean.
  • ⚡ Local Processing: Runs 100% locally (CPU or GPU). No external APIs, no data leaving your server.
  • 📄 Universal Input: Handles PDFs (digital and scanned) and common image formats by converting every input to Base64-encoded page images.
  • 🔒 Privacy Focused: Temporary files are handled securely and deleted immediately after extraction.

Installation

pip install rostaing-ocr

(Note: The first run automatically downloads the required OCR models, about 300 MB.)

Usage

1. Basic Usage (Default Behavior)

By default, the extractor prints the result to the console and saves it to output.txt.

from rostaing_ocr import ocr_extractor

# Creating the extractor runs the extraction immediately (docTR backend)
extractor = ocr_extractor("documents/invoice.pdf")

# The extracted text is now in 'output.txt'
print(extractor) # Prints status summary (Time taken, pages processed)

2. Custom Output File

You can specify a different filename. The file will be created or overwritten automatically.

from rostaing_ocr import ocr_extractor

extractor = ocr_extractor(
    "data/report.png",
    output_file="results/report_analysis.txt"
)

3. Silent Mode (Background Processing)

Useful for batch processing or server backends where you don't want console logs.

from rostaing_ocr import ocr_extractor

extractor = ocr_extractor(
    "financial_statement.pdf",
    print_to_console=False,
    save_file=True
)
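For batch jobs, a loop along the following lines may be useful. This is a minimal sketch, not part of the library: `batch_extract` and its `extract` hook are hypothetical names introduced here, and the sketch assumes that `ocr_extractor` accepts `output_file`, `print_to_console`, and `save_file` together and exposes a `status` attribute, as the examples in this README suggest. Each document gets its own output file so that the default output.txt is not overwritten on every call.

```python
from pathlib import Path

SUPPORTED = {".pdf", ".png", ".jpg", ".jpeg"}  # assumed subset of accepted formats


def batch_extract(input_dir, output_dir, extract=None):
    """Run silent extraction on every supported file in input_dir.

    `extract` is a hypothetical hook so the loop can be exercised without
    the OCR models; by default it falls back to rostaing_ocr.ocr_extractor.
    """
    if extract is None:
        from rostaing_ocr import ocr_extractor  # imported lazily
        extract = ocr_extractor

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    results = {}
    for path in sorted(Path(input_dir).iterdir()):
        if path.suffix.lower() not in SUPPORTED:
            continue
        # One output file per input, e.g. invoice.pdf -> invoice.pdf.txt
        target = out / (path.name + ".txt")
        extractor = extract(
            str(path),
            output_file=str(target),
            print_to_console=False,
            save_file=True,
        )
        results[path.name] = extractor.status
    return results
```

Calling `batch_extract("inbox", "results")` would return a dictionary mapping each processed filename to its reported status, which makes failed documents easy to spot afterwards.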

4. Direct Integration (RAG Pipelines)

Read the extracted text directly from the extractor object instead of from the output file.

from rostaing_ocr import ocr_extractor

extractor = ocr_extractor("scan.jpg", print_to_console=False)

if extractor.status == "Success":
    clean_text = extractor.extracted_text
    # Send 'clean_text' to GPT, Mistral, Gemini, Claude, Grok, Groq, Llama... or your Vector DB
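Before sending the text to a vector database, it usually needs to be split into chunks. The sketch below is one possible approach, not something the library provides: `chunk_by_lines` is a hypothetical helper that splits only at line boundaries, on the assumption that each reconstructed line corresponds to one visual row, so a table row is never cut in half.

```python
def chunk_by_lines(text, max_chars=1000):
    """Split text into chunks of at most max_chars, breaking only between lines.

    A single line longer than max_chars is kept whole as its own chunk.
    """
    chunks, current, size = [], [], 0
    for line in text.splitlines():
        # +1 accounts for the newline that joins this line to the chunk
        if current and size + len(line) + 1 > max_chars:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line) + 1
    if current:
        chunks.append("\n".join(current))
    return chunks
```

With the example above, `chunk_by_lines(clean_text)` would yield a list of strings ready to be embedded one by one.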

How It Works (Architecture)

  1. Input Normalization: Converts PDF pages or images into high-resolution Base64 streams.
  2. Deep Learning Inference: Uses DBNet for text detection and CRNN for text recognition.
  3. Noise Filtering: Checks the confidence score of each recognized word. Low-confidence text (e.g., < 0.4), such as signatures or stamps, is discarded.
  4. Geometric Reconstruction:
    • Flattens the document hierarchy.
    • Clusters words into visual lines based on Y-axis alignment.
    • Calculates horizontal gaps to insert dynamic spacing (tabs vs spaces) to simulate columns.
  5. Output: Returns a clean, structured string that looks like the original document.
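Steps 3 and 4 can be illustrated with a small, self-contained sketch. This is not the library's actual implementation: the `Word` record, the `reconstruct` function, and the `y_tol` and `col_gap` thresholds are assumptions made for illustration, with coordinates taken to be normalized to the 0..1 range. Only the 0.4 confidence cutoff comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x0: float    # left edge (normalized 0..1)
    x1: float    # right edge (normalized 0..1)
    y: float     # vertical center (normalized 0..1)
    conf: float  # recognition confidence (0..1)


def reconstruct(words, min_conf=0.4, y_tol=0.01, col_gap=0.05):
    # Step 3: discard low-confidence words (signatures, stamps, scrawls).
    kept = sorted((w for w in words if w.conf >= min_conf), key=lambda w: w.y)

    # Step 4a: group words into visual lines. A word joins the current
    # line when its Y center is within y_tol of the previous word's.
    lines = []
    for w in kept:
        if lines and abs(w.y - lines[-1][-1].y) <= y_tol:
            lines[-1].append(w)
        else:
            lines.append([w])

    # Step 4b: order each line left to right and pick the separator from
    # the horizontal gap: a tab for column-sized gaps, a space otherwise.
    rendered = []
    for line in lines:
        line.sort(key=lambda w: w.x0)
        parts = [line[0].text]
        for prev, cur in zip(line, line[1:]):
            gap = cur.x0 - prev.x1
            parts.append(("\t" if gap > col_gap else " ") + cur.text)
        rendered.append("".join(parts))
    return "\n".join(rendered)
```

In this sketch, two words that sit close together on the same row are joined by a space, while a wide gap between a description and its price becomes a tab, which is what keeps column values aligned in the plain-text output.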

License

MIT License
