Docling plugin for refinire-rag to read various document formats
Refinire-RAG Docling Plugin
A powerful document processing plugin for refinire-rag that leverages IBM's Docling library to read and process various document formats including PDF, DOCX, XLSX, HTML, and images.
Features
- Multi-format Support: PDF, DOCX, XLSX, HTML, PNG, JPG, JPEG
- Advanced PDF Processing: Page layout analysis, reading order, table structure, code, formulas
- Unified Output: Consistent document representation across all formats
- Flexible Export: Markdown, plain text, and JSON output formats
- Local Processing: Secure document processing without external API calls
- OCR Support: Built-in OCR for scanned documents and images
- Batch Processing: Efficient handling of multiple documents
- Chunking: Automatic text chunking for RAG applications
Installation
# Clone the repository
git clone <repository-url>
cd refinire-rag-docling
# Install with uv (recommended)
uv add refinire-rag-docling
# Or install with pip
pip install refinire-rag-docling
Quick Start
Basic Usage
from refinire_rag_docling import DoclingLoader
# Create loader with default settings
loader = DoclingLoader()
# Load a single document
documents = loader.load("path/to/document.pdf")
# Access processed content
for doc in documents:
    print(doc["content"])
    print(doc["metadata"])
Custom Configuration
from refinire_rag_docling import DoclingLoader, ConversionConfig, ExportFormat
# Configure processing options
config = ConversionConfig(
    export_format=ExportFormat.MARKDOWN,
    chunk_size=1024,
    ocr_enabled=True,
    table_structure=True
)
loader = DoclingLoader(config)
documents = loader.load("document.pdf")
Factory Methods
# Quick setup for markdown output
loader = DoclingLoader.create_with_markdown_output(chunk_size=512)
# Quick setup for text output
loader = DoclingLoader.create_with_text_output(chunk_size=2048)
Batch Processing
file_paths = ["doc1.pdf", "doc2.docx", "doc3.xlsx"]
documents = loader.load_batch(file_paths)
print(f"Processed {len(documents)} documents")
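The README does not say what happens to the rest of a batch when one file fails. If you need per-file isolation, one option is to call the single-file loader in a loop and collect failures separately. The sketch below is only an illustration: `load_many` and `fake_load` are hypothetical names, and `fake_load` stands in for `loader.load` so the example runs without the plugin installed.

```python
from typing import Callable, Iterable


def load_many(load_fn: Callable[[str], list], paths: Iterable[str]):
    """Call load_fn on each path, keeping results and failures apart."""
    documents, failures = [], {}
    for path in paths:
        try:
            documents.extend(load_fn(path))
        except Exception as exc:  # in real code, narrow this to the plugin's exceptions
            failures[path] = exc
    return documents, failures


# Hypothetical stand-in for loader.load, used only to make the sketch runnable.
def fake_load(path: str) -> list:
    if path.endswith(".txt"):
        raise ValueError(f"unsupported format: {path}")
    return [{"content": f"text of {path}", "metadata": {"source": path}}]


docs, failed = load_many(fake_load, ["doc1.pdf", "notes.txt", "doc3.xlsx"])
print(f"Loaded {len(docs)} documents, {len(failed)} failed")
```

With the real plugin you would pass `loader.load` as `load_fn`, assuming it returns a list of documents as shown in Basic Usage.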
Supported Formats
| Format | Extension | Features |
|---|---|---|
| PDF | .pdf | Layout analysis, OCR, table extraction |
| Word | .docx | Text, formatting, metadata |
| Excel | .xlsx | Spreadsheet data, multiple sheets |
| HTML | .html | Web content, structure |
| Images | .png, .jpg, .jpeg | OCR text extraction |
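Before handing a mixed directory listing to the loader, it can help to separate files whose extensions appear in the table above from everything else. The following is a minimal sketch using only the standard library. `SUPPORTED_EXTENSIONS` and `split_by_support` are hypothetical names, not part of the plugin, and the extension set is copied from the table rather than queried from the library.

```python
from pathlib import Path

# Extensions taken from the Supported Formats table (assumed to be complete).
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".html", ".png", ".jpg", ".jpeg"}


def split_by_support(paths):
    """Partition paths into (supported, unsupported) by file extension."""
    supported, unsupported = [], []
    for p in paths:
        # Compare case-insensitively so "REPORT.PDF" is treated like "report.pdf".
        if Path(p).suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(p)
        else:
            unsupported.append(p)
    return supported, unsupported


ok, skipped = split_by_support(["a.PDF", "b.docx", "c.txt", "d.jpeg"])
print(ok)       # ['a.PDF', 'b.docx', 'd.jpeg']
print(skipped)  # ['c.txt']
```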
Configuration Options
ConversionConfig
config = ConversionConfig(
    export_format=ExportFormat.MARKDOWN,  # MARKDOWN, TEXT, JSON
    chunk_size=512,                       # 100-4096 characters
    ocr_enabled=True,                     # Enable OCR for images
    table_structure=True,                 # Preserve table structure
    options={}                            # Additional options
)
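To make the `chunk_size` bounds concrete, here is a rough sketch of fixed-size character chunking with the same 100-4096 range check. This is an assumption for illustration only: the README does not describe the plugin's actual chunking strategy (it may split on sentences, tokens, or document structure), and `chunk_text` is a hypothetical helper.

```python
def chunk_text(text: str, chunk_size: int = 512) -> list[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    # Mirror the documented range for chunk_size.
    if not 100 <= chunk_size <= 4096:
        raise ValueError("chunk_size must be between 100 and 4096")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


sample = "x" * 1200
chunks = chunk_text(sample, chunk_size=512)
print([len(c) for c in chunks])  # [512, 512, 176]
```

Smaller chunks give finer-grained retrieval at the cost of more index entries, while larger chunks keep more context per entry.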
Export Formats
- MARKDOWN: Rich text with formatting, tables, and structure
- TEXT: Plain text content only
- JSON: Structured data with metadata
Document Structure
Each processed document is returned as a dictionary with the following keys:
{
    "content": "Extracted text content...",
    "metadata": {
        "source": "/path/to/file",
        "format": "pdf",
        "file_size": 1024,
        "page_count": 5,
        "processing_time": 2.3
    },
    "chunks": ["chunk1", "chunk2", ...],  # If chunking enabled
}
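For indexing in a RAG pipeline, it is often convenient to flatten this structure into one record per chunk that carries its source along. The sketch below assumes only the keys shown above. `to_chunk_records` and the record field names are hypothetical, and the sample documents are made up for the example.

```python
def to_chunk_records(documents):
    """Flatten documents into one record per chunk, keeping the source path."""
    records = []
    for doc in documents:
        # Fall back to the whole content when "chunks" is missing or empty.
        pieces = doc.get("chunks") or [doc["content"]]
        for index, piece in enumerate(pieces):
            records.append({
                "text": piece,
                "source": doc["metadata"]["source"],
                "chunk_index": index,
            })
    return records


# Made-up sample data following the documented shape.
sample_docs = [
    {
        "content": "alpha beta",
        "metadata": {"source": "/tmp/a.pdf", "format": "pdf"},
        "chunks": ["alpha", "beta"],
    },
    {
        "content": "gamma",
        "metadata": {"source": "/tmp/b.docx", "format": "docx"},
    },
]

records = to_chunk_records(sample_docs)
print(len(records))  # 3
```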
Development
Setup Development Environment
# Clone and setup
git clone <repository-url>
cd refinire-rag-docling
# Install dependencies
uv add --dev pytest pytest-cov
# Run tests
pytest tests/ -v
# Run with coverage
pytest tests/ --cov=src --cov-report=term-missing
Project Structure
refinire-rag-docling/
    src/
        refinire_rag_docling/
            __init__.py
            loader.py      # Main DoclingLoader class
            models.py      # Data models and types
            services.py    # Document processing logic
    tests/
        unit/              # Unit tests
        e2e/               # Integration tests
    examples/              # Usage examples
    docs/                  # Documentation
    pyproject.toml
Running Tests
# All tests
pytest tests/
# Unit tests only
pytest tests/unit/
# With coverage
pytest tests/ --cov=src --cov-report=html
Error Handling
The plugin defines dedicated exception classes, so callers can handle each failure mode separately:
from refinire_rag_docling import (
    DoclingLoader,
    DoclingLoaderError,
    FileFormatNotSupportedError,
    DocumentProcessingError,
    ConfigurationError
)

loader = DoclingLoader()

try:
    documents = loader.load("document.pdf")
except FileFormatNotSupportedError:
    # The file extension is not one of the supported formats
    print("Unsupported file format")
except DocumentProcessingError as e:
    # The format is supported, but conversion failed
    print(f"Processing failed: {e}")
Performance Tips
- Batch Processing: Use load_batch() for multiple files
- Chunk Size: Optimize chunk size for your RAG system
- OCR Settings: Disable OCR for text-based documents
- Format Selection: Choose an appropriate export format
Contributing
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Dependencies
- Docling: Document processing engine
- Python 3.10+
Acknowledgments
- IBM DS4SD team for the Docling library
- The refinire-rag ecosystem