Docling plugin for refinire-rag to read various document formats

Refinire-RAG Docling Plugin

A powerful document processing plugin for refinire-rag that leverages IBM's Docling library to read and process various document formats including PDF, DOCX, XLSX, HTML, and images.

Features

Multi-format Support: PDF, DOCX, XLSX, HTML, PNG, JPG, JPEG
Advanced PDF Processing: Page layout analysis, reading order, table structure, code, formulas
Unified Output: Consistent document representation across all formats
Flexible Export: Markdown, plain text, and JSON output formats
Local Processing: Secure document processing without external API calls
OCR Support: Built-in OCR for scanned documents and images
Batch Processing: Efficient handling of multiple documents
Chunking: Automatic text chunking for RAG applications

Installation

# Clone the repository (optional; only needed to work on the source)
git clone <repository-url>
cd refinire-rag-docling

# Install with uv (recommended)
uv add refinire-rag-docling

# Or install with pip
pip install refinire-rag-docling

Quick Start

Basic Usage

from refinire_rag_docling import DoclingLoader

# Create loader with default settings
loader = DoclingLoader()

# Load a single document
documents = loader.load("path/to/document.pdf")

# Access processed content
for doc in documents:
    print(doc["content"])
    print(doc["metadata"])

Custom Configuration

from refinire_rag_docling import DoclingLoader, ConversionConfig, ExportFormat

# Configure processing options
config = ConversionConfig(
    export_format=ExportFormat.MARKDOWN,
    chunk_size=1024,
    ocr_enabled=True,
    table_structure=True
)

loader = DoclingLoader(config)
documents = loader.load("document.pdf")

Factory Methods

from refinire_rag_docling import DoclingLoader

# Quick setup for markdown output
loader = DoclingLoader.create_with_markdown_output(chunk_size=512)

# Quick setup for text output
loader = DoclingLoader.create_with_text_output(chunk_size=2048)

Batch Processing

from refinire_rag_docling import DoclingLoader

loader = DoclingLoader()
file_paths = ["doc1.pdf", "doc2.docx", "doc3.xlsx"]
documents = loader.load_batch(file_paths)

print(f"Processed {len(documents)} documents")

Supported Formats

Format	Extension	Features
PDF	`.pdf`	Layout analysis, OCR, table extraction
Word	`.docx`	Text, formatting, metadata
Excel	`.xlsx`	Spreadsheet data, multiple sheets
HTML	`.html`	Web content, structure
Images	`.png`, `.jpg`, `.jpeg`	OCR text extraction
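
The table above can be turned into a pre-flight check that separates supported files from the rest before they reach the loader. The following is a minimal sketch using only the standard library; `split_by_support` and `SUPPORTED_EXTENSIONS` are hypothetical names, not part of the plugin, and the case-insensitive matching is an assumption about how extensions should be compared.

```python
from pathlib import Path

# Extensions copied from the Supported Formats table.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".html", ".png", ".jpg", ".jpeg"}


def split_by_support(paths):
    """Partition paths into (supported, unsupported) by file extension."""
    supported, unsupported = [], []
    for raw in paths:
        path = Path(raw)
        # Assumption: compare case-insensitively so "SCAN.JPG" is accepted.
        if path.suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(str(path))
        else:
            unsupported.append(str(path))
    return supported, unsupported


supported, unsupported = split_by_support(["doc1.pdf", "notes.txt", "scan.JPG"])
print(supported)    # ['doc1.pdf', 'scan.JPG']
print(unsupported)  # ['notes.txt']
```

The `supported` list could then be passed to `loader.load_batch()`, while `unsupported` is logged or reported to the user.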

Configuration Options

ConversionConfig

from refinire_rag_docling import ConversionConfig, ExportFormat

config = ConversionConfig(
    export_format=ExportFormat.MARKDOWN,  # MARKDOWN, TEXT, JSON
    chunk_size=512,                       # 100-4096 characters
    ocr_enabled=True,                     # Enable OCR for scanned documents and images
    table_structure=True,                 # Preserve table structure
    options={}                            # Additional options
)
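
To make the `chunk_size` setting concrete, here is a minimal sketch of fixed-size character chunking that enforces the 100-4096 range noted in the comment above. It is an illustration only: `chunk_text` is a hypothetical helper, and the plugin's own chunker may behave differently (for example by respecting sentence boundaries or overlapping chunks).

```python
MIN_CHUNK_SIZE = 100    # lower bound from the ConversionConfig comment
MAX_CHUNK_SIZE = 4096   # upper bound from the ConversionConfig comment


def chunk_text(text: str, chunk_size: int = 512) -> list[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    if not MIN_CHUNK_SIZE <= chunk_size <= MAX_CHUNK_SIZE:
        raise ValueError(
            f"chunk_size must be between {MIN_CHUNK_SIZE} and "
            f"{MAX_CHUNK_SIZE}, got {chunk_size}"
        )
    # Step through the text in strides of chunk_size; the last piece may be shorter.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


chunks = chunk_text("a" * 1200, chunk_size=512)
print([len(c) for c in chunks])  # [512, 512, 176]
```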

Export Formats

MARKDOWN: Rich text with formatting, tables, and structure
TEXT: Plain text content only
JSON: Structured data with metadata

Document Structure

Each processed document is returned as a dictionary with the following structure:

{
    "content": "Extracted text content...",
    "metadata": {
        "source": "/path/to/file",
        "format": "pdf",
        "file_size": 1024,
        "page_count": 5,
        "processing_time": 2.3
    },
    "chunks": ["chunk1", "chunk2", ...],  # If chunking enabled
}
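
For type-checked code, the structure above can be mirrored with `TypedDict`. This is a sketch under assumptions: the class names and the `summarize` helper are hypothetical and not exported by the plugin, and every key is treated as optional because the source only states that `chunks` depends on chunking being enabled.

```python
from typing import TypedDict


class DocumentMetadata(TypedDict, total=False):
    source: str
    format: str
    file_size: int
    page_count: int
    processing_time: float


class ProcessedDocument(TypedDict, total=False):
    content: str
    metadata: DocumentMetadata
    chunks: list[str]  # present only when chunking is enabled


def summarize(documents: list[ProcessedDocument]) -> dict:
    """Aggregate page counts, chunk counts, and processing time."""
    return {
        "documents": len(documents),
        "pages": sum(d["metadata"].get("page_count", 0) for d in documents),
        "chunks": sum(len(d.get("chunks", [])) for d in documents),
        "processing_time": round(
            sum(d["metadata"].get("processing_time", 0.0) for d in documents), 2
        ),
    }


# Sample record built from the example values shown above.
sample: list[ProcessedDocument] = [
    {
        "content": "Extracted text content...",
        "metadata": {
            "source": "/path/to/file",
            "format": "pdf",
            "file_size": 1024,
            "page_count": 5,
            "processing_time": 2.3,
        },
        "chunks": ["chunk1", "chunk2"],
    }
]
print(summarize(sample))
# {'documents': 1, 'pages': 5, 'chunks': 2, 'processing_time': 2.3}
```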

Development

Setup Development Environment

# Clone and setup
git clone <repository-url>
cd refinire-rag-docling

# Install dependencies
uv add --dev pytest pytest-cov

# Run tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=src --cov-report=term-missing

Project Structure

refinire-rag-docling/
   src/
      refinire_rag_docling/
          __init__.py
          loader.py          # Main DoclingLoader class
          models.py          # Data models and types
          services.py        # Document processing logic
   tests/
      unit/                  # Unit tests
      e2e/                   # Integration tests
   examples/                  # Usage examples
   docs/                      # Documentation
   pyproject.toml

Running Tests

# All tests
pytest tests/

# Unit tests only
pytest tests/unit/

# With coverage
pytest tests/ --cov=src --cov-report=html

Error Handling

The plugin raises dedicated exception types that you can catch individually:

from refinire_rag_docling import (
    DoclingLoader,
    DoclingLoaderError,
    FileFormatNotSupportedError,
    DocumentProcessingError,
    ConfigurationError
)

loader = DoclingLoader()

try:
    documents = loader.load("document.pdf")
except FileFormatNotSupportedError:
    print("Unsupported file format")
except DocumentProcessingError as e:
    print(f"Processing failed: {e}")
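
When processing many files, it can help to isolate failures per file so that one bad document does not stop the rest. The sketch below shows that pattern with a stand-in loader so it runs without the plugin installed; `load_each` and `fake_load` are hypothetical names, and whether `load_batch()` itself aborts on the first error is not stated here.

```python
from typing import Callable


def load_each(paths: list[str], load: Callable[[str], list], errors: tuple = (Exception,)):
    """Load files one at a time, collecting failures instead of aborting."""
    loaded, failed = [], {}
    for path in paths:
        try:
            loaded.extend(load(path))
        except errors as exc:
            # Remember which file failed and why, then continue with the next one.
            failed[path] = exc
    return loaded, failed


# Stand-in for loader.load, used only to demonstrate the pattern.
def fake_load(path: str) -> list:
    if not path.endswith(".pdf"):
        raise ValueError(f"unsupported: {path}")
    return [{"content": f"text of {path}", "metadata": {"source": path}}]


loaded, failed = load_each(["a.pdf", "b.xyz"], fake_load, errors=(ValueError,))
print(len(loaded), list(failed))  # 1 ['b.xyz']
```

With the plugin, the same helper could be called as `load_each(paths, loader.load, errors=(FileFormatNotSupportedError, DocumentProcessingError))`.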

Performance Tips

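Batch Processing: Use load_batch() to process multiple files in one call
Chunk Size: Tune chunk_size (100-4096 characters) for your RAG system
OCR Settings: Set ocr_enabled=False for text-based documents that do not need OCR
Format Selection: Choose the export format that fits downstream use (MARKDOWN keeps structure, TEXT is plain content, JSON carries metadata)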
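
The OCR tip can be applied per file with a small heuristic based on the extension. This is a sketch under assumptions: `ocr_recommended` is a hypothetical helper, and because an extension alone cannot tell a scanned PDF from a text-based one, the caller supplies that information through the `pdf_is_scanned` flag.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}          # text is only reachable via OCR
NATIVE_TEXT_EXTENSIONS = {".docx", ".xlsx", ".html"}  # text is stored in the file itself


def ocr_recommended(path: str, pdf_is_scanned: bool = True) -> bool:
    """Return True when OCR is likely needed for the given file."""
    suffix = Path(path).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return True
    if suffix in NATIVE_TEXT_EXTENSIONS:
        return False
    if suffix == ".pdf":
        # PDFs may be scanned or text-based; default to OCR unless told otherwise.
        return pdf_is_scanned
    raise ValueError(f"unsupported extension: {suffix!r}")


print(ocr_recommended("scan.png"))                            # True
print(ocr_recommended("report.docx"))                         # False
print(ocr_recommended("export.pdf", pdf_is_scanned=False))    # False
```

The result could be passed straight into the configuration, for example `ConversionConfig(ocr_enabled=ocr_recommended(path))`.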

Contributing

Fork the repository
Create a feature branch
Add tests for new functionality
Ensure all tests pass
Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Dependencies

Docling: Document processing engine
Python 3.10+

Acknowledgments

IBM DS4SD team for the Docling library
The refinire-rag ecosystem

This version: 0.0.1 (Jun 11, 2025)

