
Docling PureCPP integration

Docling PureCPP Wrapper

This document describes how to use the Python wrapper for the Docling PureCPP project.

Description

The docling-purecpp project loads and processes documents using a C++ backend exposed through Python bindings. This wrapper lets you integrate Docling's document conversion into your Python applications.

Installation

Because this project relies on a C++ backend, that backend must be compiled before the wrapper can be used. Follow the build instructions in the main project's README to generate the necessary library files.

Usage

Basic Usage

Basic usage of DoclingLoader looks as follows:

from purecpp_libs import DoclingLoader

# Can be a local file path, or a list of paths
FILE_PATH = ["/path/to/your/document.pdf", "/path/to/another/document.docx"]

loader = DoclingLoader(file_path=FILE_PATH)

# The lazy_load() method returns a generator
docs_generator = loader.lazy_load()
for doc in docs_generator:
    print(f"Content: {doc.page_content}")
    print(f"Metadata: {doc.metadata}")
    print("---")
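Since lazy_load() returns a generator, documents are handed over one at a time, so a loop can stop early instead of collecting everything first. The following is a minimal sketch of previewing the first few results. To keep it runnable without the compiled backend it uses fake_lazy_load, a hypothetical stand-in that only mimics the shape of the loader's output; first_n is likewise an illustrative helper, not part of purecpp_libs.

```python
from itertools import islice
from types import SimpleNamespace


def fake_lazy_load(n_docs):
    """Hypothetical stand-in for loader.lazy_load(): yields objects with
    page_content and metadata attributes, one at a time."""
    for i in range(n_docs):
        yield SimpleNamespace(page_content=f"chunk {i}", metadata={"index": i})


def first_n(docs, n):
    """Return at most n items from any iterable without consuming the rest."""
    return list(islice(docs, n))


# Only three items are pulled from the generator, not all 1000.
preview = first_n(fake_lazy_load(1000), 3)
for doc in preview:
    print(doc.page_content, doc.metadata)
```

With the real loader, first_n(loader.lazy_load(), 3) would be the equivalent call.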

Advanced Usage

When initializing a DoclingLoader, you can use the following parameters:

  • file_path: A string or a list of strings representing the path(s) to the source file(s).
  • converter (optional): A specific Docling DocumentConverter instance to use.
  • convert_kwargs (optional): A dictionary of keyword arguments for the conversion process.
  • export_type (optional): The export mode to use. Can be ExportType.DOC_CHUNKS (default) or ExportType.MARKDOWN.
  • md_export_kwargs (optional): A dictionary of keyword arguments for Markdown exporting.
  • chunker (optional): A specific Docling BaseChunker instance to use (for document-chunk mode).
  • meta_extractor (optional): A specific metadata extractor to use.
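The description above does not say how the loader reacts to a path that does not exist, so it can be useful to check paths before constructing it. The sketch below accepts the same two shapes as file_path (a single string or a list of strings) and keeps only existing files. existing_files is a hypothetical helper written for this example, not a function provided by purecpp_libs.

```python
import os
import tempfile


def existing_files(file_path):
    """Accept a single path or an iterable of paths and return the list of
    those that point to existing regular files, in the original order."""
    if isinstance(file_path, (str, os.PathLike)):
        paths = [file_path]
    else:
        paths = list(file_path)
    return [os.fspath(p) for p in paths if os.path.isfile(p)]


# Demonstration with one real temporary file and one missing file.
with tempfile.TemporaryDirectory() as tmp:
    real = os.path.join(tmp, "report.pdf")
    with open(real, "wb") as fh:
        fh.write(b"%PDF-1.4\n")
    missing = os.path.join(tmp, "missing.docx")

    checked = existing_files([real, missing])
    single = existing_files(real)
    print(checked)
```

The returned list can then be passed as file_path when creating the loader.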

Example with Custom Configuration

from purecpp_libs import DoclingLoader, ExportType
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

# Custom converter configuration
converter = DocumentConverter()
convert_kwargs = {"do_ocr": True, "do_table_structure": True}

# Custom chunker for document chunks
chunker = HybridChunker()

# Custom markdown export settings
md_export_kwargs = {"image_placeholder": "[IMAGE]"}

loader = DoclingLoader(
    file_path="/path/to/document.pdf",
    converter=converter,
    convert_kwargs=convert_kwargs,
    export_type=ExportType.DOC_CHUNKS,  # or ExportType.MARKDOWN
    md_export_kwargs=md_export_kwargs,
    chunker=chunker
)

for doc in loader.lazy_load():
    print(f"Content: {doc.page_content}")
    print(f"Metadata: {doc.metadata}")

Export Types

The loader supports two export types:

  1. ExportType.DOC_CHUNKS (default): Exports document content as chunks produced by the specified chunker.
  2. ExportType.MARKDOWN: Exports document content as Markdown.

Custom Metadata Extraction

You can provide a custom metadata extractor by implementing the BaseMetadataExtractor interface:

import json

from purecpp_libs import BaseMetadataExtractor

class CustomMetadataExtractor(BaseMetadataExtractor):
    def extract_chunk_meta(self, file_path, chunk):
        # Metadata attached to each chunk (document-chunk mode)
        return {
            "source": file_path,
            "chunk_type": getattr(chunk.meta, "type", "unknown"),
            "dl_meta": json.dumps(chunk.meta.export_json_dict())
        }

    def extract_dl_doc_meta(self, file_path, dl_doc):
        # Metadata attached to a whole converted document
        return {
            "source": file_path,
            "document_title": getattr(dl_doc, "title", "Unknown")
        }

loader = DoclingLoader(
    file_path="/path/to/document.pdf",
    meta_extractor=CustomMetadataExtractor()
)
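An extractor of this kind can be exercised on its own, before wiring it into a loader. The sketch below is one possible variant that records basic file information. FileInfoExtractor and its _file_info helper are hypothetical names, and the try/except fallback class exists only so the example runs where the compiled package is not installed; it assumes the two method names shown above are all the interface requires.

```python
from pathlib import Path
from types import SimpleNamespace

try:
    from purecpp_libs import BaseMetadataExtractor
except ImportError:
    # Fallback so the sketch runs without the compiled package installed.
    class BaseMetadataExtractor:
        pass


class FileInfoExtractor(BaseMetadataExtractor):
    """Hypothetical extractor that records basic file information."""

    def _file_info(self, file_path):
        path = Path(file_path)
        return {
            "source": str(file_path),
            "file_name": path.name,
            "extension": path.suffix.lower(),
        }

    def extract_chunk_meta(self, file_path, chunk):
        # The chunk itself is not inspected in this simple variant.
        return self._file_info(file_path)

    def extract_dl_doc_meta(self, file_path, dl_doc):
        # The converted document is not inspected either.
        return self._file_info(file_path)


extractor = FileInfoExtractor()
meta = extractor.extract_chunk_meta("/data/Report.PDF", chunk=SimpleNamespace())
print(meta)
```

Calling the methods directly with placeholder objects like this is a quick way to check the returned dictionary before running a full conversion.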

API Reference

DoclingLoader

The main class for loading and processing documents.

Parameters:

  • file_path (Union[str, Iterable[str]]): Path(s) to the source file(s)
  • converter (Optional[DocumentConverter]): Custom document converter instance
  • convert_kwargs (Optional[Dict[str, Any]]): Arguments for the conversion process
  • export_type (ExportType): Export format (DOC_CHUNKS or MARKDOWN)
  • md_export_kwargs (Optional[Dict[str, Any]]): Arguments for markdown export
  • chunker (Optional[BaseChunker]): Custom chunker for document chunks mode
  • meta_extractor (Optional[BaseMetadataExtractor]): Custom metadata extractor

Methods:

  • lazy_load(): Returns a generator that yields RAGDocument objects
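Downstream steps such as embedding or indexing often work on groups of documents rather than one at a time. As a sketch of one way to bridge the two, the helper below groups the items of any generator into fixed-size lists while staying lazy. batched is an illustrative function defined here, not part of the DoclingLoader API, and range(7) stands in for the loader's output.

```python
from itertools import islice


def batched(items, size):
    """Yield lists of up to `size` items from any iterable, lazily."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# range(7) stands in for loader.lazy_load() in this demonstration.
batches = list(batched(range(7), 3))
print(batches)
```

With a real loader the call would look like batched(loader.lazy_load(), 32), where 32 is an arbitrary batch size.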

RAGDocument

Represents a processed document with content and metadata.

Attributes:

  • page_content (str): The processed content of the document/chunk
  • metadata (Dict): Metadata associated with the document/chunk
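Because a RAGDocument exposes just these two attributes, saving results for later use is straightforward. The sketch below writes documents as JSON Lines, one object per document. StandInDocument is a hypothetical stand-in that only mirrors the two documented attributes so the example runs without the package, and to_jsonl is an illustrative helper; the sketch also assumes every metadata value is JSON-serialisable.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class StandInDocument:
    """Hypothetical stand-in with the two attributes documented above."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def to_jsonl(docs):
    """Serialise documents to JSON Lines: one JSON object per document."""
    return "\n".join(
        json.dumps(
            {"page_content": d.page_content, "metadata": d.metadata},
            ensure_ascii=False,
        )
        for d in docs
    )


docs = [
    StandInDocument("First chunk", {"source": "a.pdf"}),
    StandInDocument("Second chunk", {"source": "a.pdf"}),
]
jsonl = to_jsonl(docs)
print(jsonl)
```

The same function should accept the output of loader.lazy_load() directly, since it only reads page_content and metadata.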
