An OCR package that uses large language models (LLMs) to extract text from images
docling_ocr
A powerful Python package for extracting text from images and documents using advanced LLM-based models.
Overview
docling_ocr builds on language models designed for document understanding tasks. Unlike traditional OCR engines, which rely solely on character recognition, docling_ocr uses models that take document context and layout into account and can handle a range of document formats.
Built on top of models like SmolDocling, this package provides a simple, intuitive interface for document text extraction tasks.
Features
- LLM-powered extraction: Uses advanced language models trained specifically for document understanding
- Context-aware recognition: Understands document layouts and context for improved accuracy
- Multi-format support: Works with scanned documents, forms, receipts, and other text-heavy images
- Simple API: Easy-to-use interface with both file and image object inputs
- Batch processing: Process entire directories of documents efficiently
- Flexible output options: Return text or save directly to files
- Extensible architecture: Abstract base class makes it easy to add new models
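To illustrate the extensibility point above, here is a minimal sketch of how an abstract base class can make new models pluggable. The names `BaseExtractor` and `EchoExtractor` are hypothetical and chosen only for this example; this README does not show the package's actual base class, so its name and methods may differ.

```python
from abc import ABC, abstractmethod


class BaseExtractor(ABC):
    """Hypothetical base class; the package's real class may differ."""

    @abstractmethod
    def extract_text(self, image_path: str) -> str:
        """Return the text found in the file at image_path."""

    def __call__(self, image_path: str) -> str:
        # Shorthand so an instance can be called like a function.
        return self.extract_text(image_path)


class EchoExtractor(BaseExtractor):
    """Toy subclass that returns a placeholder instead of running a model."""

    def extract_text(self, image_path: str) -> str:
        return f"<text of {image_path}>"


extractor = EchoExtractor()
print(extractor("scan.png"))
```

With this pattern, a new model only needs to implement `extract_text`; the callable shorthand is inherited from the base class.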
Installation
pip install docling_ocr
Requirements
- Python 3.7+
- PyTorch 1.10.0+
- Transformers 4.15.0+
- Pillow 8.0.0+
Quick Start
Basic Usage
from docling_ocr import SmolDoclingExtractor
# Initialize the extractor
extractor = SmolDoclingExtractor()
# Extract text from an image file
text = extractor.extract_text("path/to/document.jpg")
print(text)
# Or use the shorthand callable interface
text = extractor("path/to/document.jpg")
Using with PIL Images
from docling_ocr import SmolDoclingExtractor
from PIL import Image
# Initialize the extractor
extractor = SmolDoclingExtractor()
# Open image with PIL
image = Image.open("path/to/document.jpg")
# Extract text
text = extractor.extract_text_from_image(image)
print(text)
Batch Processing
from docling_ocr import SmolDoclingExtractor
from docling_ocr.utils import batch_process
# Initialize extractor
extractor = SmolDoclingExtractor()
# Process all images in a directory
results = batch_process(
    extractor,
    image_dir="path/to/documents/",
    output_dir="path/to/output/",
    extensions=['.jpg', '.png', '.pdf']  # Optional: specify file extensions
)
# results is a dictionary mapping filenames to extracted text
for filename, text in results.items():
    print(f"File: {filename}")
    print(f"Text: {text[:100]}...")  # Print first 100 chars
    print("-" * 50)
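For readers curious about what a directory-level batch run involves, the sketch below shows one way to do it with only the standard library. It is not the implementation of `batch_process`; the function name `batch_extract` and its parameters are assumptions made for illustration. It accepts any callable that maps a file path to text.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional


def batch_extract(
    extract: Callable[[str], str],
    image_dir: str,
    output_dir: Optional[str] = None,
    extensions: Iterable[str] = (".jpg", ".png"),
) -> Dict[str, str]:
    """Hypothetical helper: run `extract` on every matching file in image_dir."""
    wanted = {ext.lower() for ext in extensions}
    out = Path(output_dir) if output_dir else None
    if out is not None:
        out.mkdir(parents=True, exist_ok=True)

    results: Dict[str, str] = {}
    for path in sorted(Path(image_dir).iterdir()):
        # Skip subdirectories and files with other extensions.
        if not path.is_file() or path.suffix.lower() not in wanted:
            continue
        try:
            text = extract(str(path))
        except Exception as exc:  # keep going if one file fails
            print(f"Skipping {path.name}: {exc}")
            continue
        results[path.name] = text
        if out is not None:
            # Write one .txt file per input, named after the input file.
            (out / f"{path.stem}.txt").write_text(text, encoding="utf-8")
    return results
```

Because the extractor is callable, an instance could be passed directly as the `extract` argument, for example `batch_extract(extractor, "path/to/documents/")`.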
Advanced Usage
GPU Acceleration
By default, the extractor will use CUDA if available. You can explicitly specify the device:
# Use CPU explicitly
extractor = SmolDoclingExtractor(device="cpu")
# Use specific GPU
extractor = SmolDoclingExtractor(device="cuda:0")
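If you prefer to make the fallback explicit in your own code, a small helper like the one below can choose the device string before constructing the extractor. This is a sketch, not part of the package: `pick_device` is a hypothetical name, and the only PyTorch call it relies on is `torch.cuda.is_available()`.

```python
def pick_device(preferred: str = "cuda") -> str:
    """Return `preferred` when CUDA is usable, otherwise fall back to "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch installed there is no GPU backend to use.
        return "cpu"
    if preferred.startswith("cuda") and torch.cuda.is_available():
        return preferred
    return "cpu"


device = pick_device("cuda:0")
print(device)
```

The result can then be passed through the `device` argument shown above, for example `SmolDoclingExtractor(device=pick_device())`.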
Custom Model Configuration
You can specify a different model from the same family:
# Use a different model variant
extractor = SmolDoclingExtractor(model_name="ds4sd/SmolDocling-512M")
Adjusting Generated Text Length
For longer documents, you may want to increase the maximum generated text length:
# Extract with a longer maximum length for complex documents
text = extractor.extract_text("complex_document.pdf", max_length=1024)
Performance Considerations
- Processing time depends on the image size, complexity, and hardware
- GPU acceleration is recommended for batch processing
- The first initialization loads the model, which may take some time
- Subsequent calls are much faster because the model remains in memory
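The last two points suggest creating the extractor once and reusing it. One common way to do that in application code is to cache the constructor call, sketched below with `functools.lru_cache`. `SlowModel` is a stand-in class invented for this example so the snippet runs without downloading a model; in practice the factory would return a `SmolDoclingExtractor` instead.

```python
import time
from functools import lru_cache


class SlowModel:
    """Stand-in for a heavy extractor; sleeps briefly to mimic model loading."""

    def __init__(self) -> None:
        time.sleep(0.05)  # pretend to load weights

    def __call__(self, path: str) -> str:
        return f"text from {path}"


@lru_cache(maxsize=1)
def get_extractor() -> SlowModel:
    """Build the extractor once; later calls return the cached instance."""
    return SlowModel()


start = time.perf_counter()
get_extractor()
first = time.perf_counter() - start

start = time.perf_counter()
get_extractor()
second = time.perf_counter() - start

print(f"first call: {first:.3f}s, second call: {second:.6f}s")
```

Only the first call pays the loading cost; every later call to `get_extractor()` returns the same object.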
Comparison with Traditional OCR
docling_ocr differs from traditional OCR engines in several key ways:
| Feature | Traditional OCR | docling_ocr |
|---|---|---|
| Text Recognition | Character/word based | Context-aware language understanding |
| Layout Understanding | Limited/separate process | Integrated understanding |
| Language Understanding | Limited | Leverages LLM language capabilities |
| Format Flexibility | Engine-specific | Adaptable to various formats |
| Context Retention | Limited | Maintains document context |
Examples
Forms and Structured Documents
from docling_ocr import SmolDoclingExtractor
extractor = SmolDoclingExtractor()
form_text = extractor("tax_form.jpg")
print(form_text)
Tables and Spreadsheets
spreadsheet_text = extractor("financial_data.jpg")
print(spreadsheet_text)
Receipts and Invoices
receipt_text = extractor("receipt.jpg")
print(receipt_text)
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Future Roadmap
- Support for PDF documents with multi-page handling
- Additional LLM-based extraction models
- Fine-tuning options for specific document types
- Structured data extraction (JSON output)
- Layout-preserving extraction options
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Built on the amazing work of the SmolDocling team for the SmolDocling-256M-preview model.
- Inspired by the growing field of document AI
- Thanks to the HuggingFace team for making transformers accessible
Citation
If you use this package in your research, please cite:
@software{docling_ocr,
author = {Adhing'a Fredrick},
title = {docling_ocr: LLM-based Document Text Extraction},
year = {2025},
url = {https://github.com/FREDERICO23/docling_ocr}
}
Contact
For questions and support, please open an issue on the GitHub repository or contact adhingafredrick@gmail.com.