Docling plugin for refinire-rag to read various document formats

Refinire-RAG Docling Plugin

A document processing plugin for refinire-rag that uses IBM's Docling library to read and process PDF, DOCX, XLSX, HTML, and image files.

Features

  • Multi-format Support: PDF, DOCX, XLSX, HTML, PNG, JPG, JPEG
  • Advanced PDF Processing: Page layout analysis, reading order, table structure, code, formulas
  • Unified Output: Consistent document representation across all formats
  • Flexible Export: Markdown, plain text, and JSON output formats
  • Local Processing: Secure document processing without external API calls
  • OCR Support: Built-in OCR for scanned documents and images
  • Batch Processing: Efficient handling of multiple documents
  • Chunking: Automatic text chunking for RAG applications

Installation

# Clone the repository (optional; only needed to work on the source)
git clone <repository-url>
cd refinire-rag-docling

# Install with uv (recommended)
uv add refinire-rag-docling

# Or install with pip
pip install refinire-rag-docling

Quick Start

Basic Usage

from refinire_rag_docling import DoclingLoader

# Create loader with default settings
loader = DoclingLoader()

# Load a single document
documents = loader.load("path/to/document.pdf")

# Access processed content
for doc in documents:
    print(doc["content"])
    print(doc["metadata"])

Custom Configuration

from refinire_rag_docling import DoclingLoader, ConversionConfig, ExportFormat

# Configure processing options
config = ConversionConfig(
    export_format=ExportFormat.MARKDOWN,
    chunk_size=1024,
    ocr_enabled=True,
    table_structure=True
)

loader = DoclingLoader(config)
documents = loader.load("document.pdf")

Factory Methods

# Quick setup for markdown output
loader = DoclingLoader.create_with_markdown_output(chunk_size=512)

# Quick setup for text output
loader = DoclingLoader.create_with_text_output(chunk_size=2048)

Batch Processing

file_paths = ["doc1.pdf", "doc2.docx", "doc3.xlsx"]
documents = loader.load_batch(file_paths)

print(f"Processed {len(documents)} documents")

Supported Formats

Format   Extension            Features
PDF      .pdf                 Layout analysis, OCR, table extraction
Word     .docx                Text, formatting, metadata
Excel    .xlsx                Spreadsheet data, multiple sheets
HTML     .html                Web content, structure
Images   .png, .jpg, .jpeg    OCR text extraction
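
The extensions listed above can be used to pre-filter a directory listing before handing files to the loader. The following is a minimal sketch using only the standard library; the names SUPPORTED_EXTENSIONS and split_by_support are hypothetical helpers, not part of the plugin, and the case-insensitive match is an assumption.

```python
from pathlib import Path

# Extensions copied from the Supported Formats table.
# SUPPORTED_EXTENSIONS is a hypothetical name, not a plugin constant.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".html", ".png", ".jpg", ".jpeg"}


def split_by_support(paths):
    """Partition paths into (supported, unsupported) by file extension."""
    supported, unsupported = [], []
    for path in paths:
        # Assumption: extensions are compared case-insensitively.
        if Path(path).suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            unsupported.append(path)
    return supported, unsupported


supported, unsupported = split_by_support(["a.PDF", "b.docx", "notes.txt"])
print(supported)
print(unsupported)
```

The supported list could then be passed to load_batch(), while the unsupported list can be logged or reported to the user.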

Configuration Options

ConversionConfig

config = ConversionConfig(
    export_format=ExportFormat.MARKDOWN,  # MARKDOWN, TEXT, JSON
    chunk_size=512,                       # 100-4096 characters
    ocr_enabled=True,                     # Enable OCR for images
    table_structure=True,                 # Preserve table structure
    options={}                            # Additional options
)
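
The comment above gives a range of 100-4096 characters for chunk_size, but how the plugin reacts to values outside that range is not documented here. A small guard like the one below can catch bad values before a config is built. It is only a sketch: check_chunk_size is a hypothetical helper, and treating both bounds as inclusive is an assumption.

```python
# Bounds taken from the "100-4096 characters" comment in the config example.
MIN_CHUNK_SIZE = 100
MAX_CHUNK_SIZE = 4096


def check_chunk_size(chunk_size: int) -> int:
    """Return chunk_size unchanged, or raise ValueError if it is out of range.

    Assumption: both bounds are inclusive.
    """
    if not MIN_CHUNK_SIZE <= chunk_size <= MAX_CHUNK_SIZE:
        raise ValueError(
            f"chunk_size must be between {MIN_CHUNK_SIZE} and "
            f"{MAX_CHUNK_SIZE}, got {chunk_size}"
        )
    return chunk_size


print(check_chunk_size(512))
```

The validated value can then be passed as chunk_size when constructing ConversionConfig.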

Export Formats

  • MARKDOWN: Rich text with formatting, tables, and structure
  • TEXT: Plain text content only
  • JSON: Structured data with metadata

Document Structure

Each processed document returns a dictionary with:

{
    "content": "Extracted text content...",
    "metadata": {
        "source": "/path/to/file",
        "format": "pdf",
        "file_size": 1024,
        "page_count": 5,
        "processing_time": 2.3
    },
    "chunks": ["chunk1", "chunk2", ...],  # If chunking enabled
}
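
For indexing in a RAG pipeline it is often convenient to flatten these dictionaries into one record per chunk. The sketch below assumes only the keys shown above; iter_chunk_records and the record field names are hypothetical, and the sample data is hand-written rather than real plugin output.

```python
def iter_chunk_records(documents):
    """Yield one flat record per chunk, falling back to the full content."""
    for doc in documents:
        # "chunks" is only present when chunking is enabled.
        chunks = doc.get("chunks") or [doc["content"]]
        for index, chunk in enumerate(chunks):
            yield {
                "text": chunk,
                "source": doc["metadata"]["source"],
                "chunk_index": index,
            }


# Hand-written sample data in the documented shape, not real plugin output.
sample_documents = [
    {
        "content": "alpha beta",
        "metadata": {"source": "/tmp/a.pdf", "format": "pdf"},
        "chunks": ["alpha", "beta"],
    },
    {
        "content": "gamma",
        "metadata": {"source": "/tmp/b.docx", "format": "docx"},
    },
]

records = list(iter_chunk_records(sample_documents))
print(len(records))
```

Each record keeps a pointer to its source file, so a retrieved chunk can be traced back to the document it came from.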

Development

Setup Development Environment

# Clone and setup
git clone <repository-url>
cd refinire-rag-docling

# Install dependencies
uv add --dev pytest pytest-cov

# Run tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=src --cov-report=term-missing

Project Structure

refinire-rag-docling/
   src/
      refinire_rag_docling/
          __init__.py
          loader.py          # Main DoclingLoader class
          models.py          # Data models and types
          services.py        # Document processing logic
   tests/
      unit/                  # Unit tests
      e2e/                   # Integration tests
   examples/                  # Usage examples
   docs/                      # Documentation
   pyproject.toml

Running Tests

# All tests
pytest tests/

# Unit tests only
pytest tests/unit/

# With coverage
pytest tests/ --cov=src --cov-report=html

Error Handling

The plugin exports dedicated exception classes that can be caught individually:

from refinire_rag_docling import (
    DoclingLoader,
    DoclingLoaderError,
    FileFormatNotSupportedError,
    DocumentProcessingError,
    ConfigurationError
)

# Create a loader with default settings
loader = DoclingLoader()

try:
    documents = loader.load("document.pdf")
except FileFormatNotSupportedError:
    print("Unsupported file format")
except DocumentProcessingError as e:
    print(f"Processing failed: {e}")
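
When many files are processed, one failing file should usually not discard the rest. Whether load_batch() aborts on the first error is not stated here, so the sketch below calls a single-file loader per path and collects failures separately. load_each is a hypothetical helper, and fake_load is a stand-in for loader.load so the example runs without the plugin installed.

```python
def load_each(load_one, paths, handled=(Exception,)):
    """Call load_one(path) per path; return (documents, failures).

    failures maps each failing path to the exception it raised.
    """
    documents, failures = [], {}
    for path in paths:
        try:
            documents.extend(load_one(path))
        except handled as exc:
            failures[path] = exc
    return documents, failures


def fake_load(path):
    """Stand-in for loader.load, used only to make this sketch runnable."""
    if path.endswith(".txt"):
        raise ValueError(f"unsupported: {path}")
    return [{"content": "text", "metadata": {"source": path}}]


documents, failures = load_each(fake_load, ["a.pdf", "b.txt"], handled=(ValueError,))
print(len(documents), list(failures))
```

With the real loader, load_one would be loader.load and handled would presumably be (FileFormatNotSupportedError, DocumentProcessingError).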

Performance Tips

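  1. Batch Processing: Use load_batch() to process multiple files in one call
  2. Chunk Size: Tune chunk_size (100-4096 characters) to match your retriever and embedding model
  3. OCR Settings: Set ocr_enabled=False for text-based documents that do not need OCR
  4. Format Selection: Choose the export format your pipeline consumes: TEXT for plain content, MARKDOWN to keep tables and structure, JSON for structured data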
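
To see whether a change such as disabling OCR actually helps, the processing_time and format fields from the document metadata can be aggregated per format. This is a sketch under assumptions: time_by_format is a hypothetical helper, the unit of processing_time is not stated in this README, and the sample values are invented for illustration.

```python
from collections import defaultdict


def time_by_format(documents):
    """Sum metadata["processing_time"] per metadata["format"]."""
    totals = defaultdict(float)
    for doc in documents:
        meta = doc["metadata"]
        totals[meta["format"]] += meta["processing_time"]
    return dict(totals)


# Invented sample metadata in the documented shape, not measured results.
sample_documents = [
    {"content": "a", "metadata": {"format": "pdf", "processing_time": 2.5}},
    {"content": "b", "metadata": {"format": "pdf", "processing_time": 1.5}},
    {"content": "c", "metadata": {"format": "docx", "processing_time": 0.5}},
]

print(time_by_format(sample_documents))
```

Running the same files with two configurations and comparing the totals gives a rough picture of where time is spent.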

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Dependencies

  • Docling: Document processing engine
  • Python 3.10+

Acknowledgments

  • IBM DS4SD team for the Docling library
  • The refinire-rag ecosystem
