Docling PureCPP integration

Docling PureCPP Wrapper

This document describes how to use the Python wrapper for the Docling PureCPP project.

Description

The docling-purecpp project loads and processes documents through a C++ backend exposed via Python bindings. This wrapper lets you call Docling from your Python applications.

Installation

As this project involves a C++ backend, it needs to be compiled first. Follow the build instructions in the main project's README to generate the necessary library files.

Usage

Basic Usage

Basic usage of DoclingLoader looks as follows:

from purecpp_libs import DoclingLoader

# file_path accepts a single local path or a list of paths
FILE_PATH = ["/path/to/your/document.pdf", "/path/to/another/document.docx"]

loader = DoclingLoader(file_path=FILE_PATH)

# The lazy_load() method returns a generator
docs_generator = loader.lazy_load()
for doc in docs_generator:
    print(f"Content: {doc.page_content}")
    print(f"Metadata: {doc.metadata}")
    print("---")
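
Since lazy_load() returns a generator, it can be consumed incrementally, for example to preview only the first few results. The sketch below shows that pattern with itertools.islice. Note that fake_lazy_load and first_n are hypothetical names introduced here: fake_lazy_load is a stand-in for loader.lazy_load() so the snippet runs without the compiled backend, and it yields plain dicts rather than RAGDocument objects.

```python
from itertools import islice
from typing import Any, Dict, Iterator, List


def fake_lazy_load(paths: List[str]) -> Iterator[Dict[str, Any]]:
    """Hypothetical stand-in for loader.lazy_load(); yields one item per path."""
    for path in paths:
        yield {"page_content": f"text from {path}", "metadata": {"source": path}}


def first_n(docs: Iterator[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Consume at most n items from a generator and return them as a list."""
    return list(islice(docs, n))


preview = first_n(fake_lazy_load(["a.pdf", "b.docx", "c.pdf"]), 2)
for doc in preview:
    print(doc["metadata"]["source"])
```

With a real loader, the equivalent call would be first_n(loader.lazy_load(), 2).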

Advanced Usage

When initializing a DoclingLoader, you can use the following parameters:

file_path: A string or a list of strings representing the path(s) to the source file(s).
converter (optional): A specific Docling DocumentConverter instance to use.
convert_kwargs (optional): A dictionary of keyword arguments for the conversion process.
export_type (optional): The export mode to use. Can be ExportType.DOC_CHUNKS (default) or ExportType.MARKDOWN.
md_export_kwargs (optional): A dictionary of keyword arguments for Markdown exporting.
chunker (optional): A specific Docling BaseChunker instance to use (for document-chunk mode).
meta_extractor (optional): A specific metadata extractor to use.
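
Because file_path accepts either a string or a list of strings, caller code that builds the argument dynamically may want to normalize it and check the paths before constructing the loader. The following is a caller-side sketch only, assuming local filesystem paths; normalize_paths and missing_paths are hypothetical helpers and do not describe how DoclingLoader handles the argument internally.

```python
from pathlib import Path
from typing import Iterable, List, Union


def normalize_paths(file_path: Union[str, Iterable[str]]) -> List[str]:
    """Wrap a single string in a list; copy any other iterable into a list."""
    if isinstance(file_path, str):
        # A bare string must not be iterated character by character.
        return [file_path]
    return list(file_path)


def missing_paths(file_path: Union[str, Iterable[str]]) -> List[str]:
    """Return the entries that do not exist on the local filesystem."""
    return [p for p in normalize_paths(file_path) if not Path(p).exists()]


print(normalize_paths("/path/to/document.pdf"))
print(normalize_paths(("a.pdf", "b.docx")))
```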

Example with Custom Configuration

from purecpp_libs import DoclingLoader, ExportType
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

# Custom converter, plus keyword arguments for the conversion process
converter = DocumentConverter()
convert_kwargs = {"do_ocr": True, "do_table_structure": True}

# Custom chunker, used in document-chunk mode (ExportType.DOC_CHUNKS)
chunker = HybridChunker()

# Keyword arguments for Markdown exporting (ExportType.MARKDOWN)
md_export_kwargs = {"image_placeholder": "[IMAGE]"}

loader = DoclingLoader(
    file_path="/path/to/document.pdf",
    converter=converter,
    convert_kwargs=convert_kwargs,
    export_type=ExportType.DOC_CHUNKS,  # or ExportType.MARKDOWN
    md_export_kwargs=md_export_kwargs,
    chunker=chunker
)

for doc in loader.lazy_load():
    print(f"Content: {doc.page_content}")
    print(f"Metadata: {doc.metadata}")

Export Types

The loader supports two export types:

ExportType.DOC_CHUNKS (default): Exports documents as chunked content using the specified chunker
ExportType.MARKDOWN: Exports documents in Markdown format

Custom Metadata Extraction

You can provide a custom metadata extractor by implementing the BaseMetadataExtractor interface:

from purecpp_libs import BaseMetadataExtractor
import json

class CustomMetadataExtractor(BaseMetadataExtractor):
    def extract_chunk_meta(self, file_path, chunk):
        # Build the metadata dict for a single chunk
        return {
            "source": file_path,
            "chunk_type": getattr(chunk.meta, "type", "unknown"),
            "dl_meta": json.dumps(chunk.meta.export_json_dict()),
        }

    def extract_dl_doc_meta(self, file_path, dl_doc):
        # Build the metadata dict for a whole document
        return {
            "source": file_path,
            "document_title": getattr(dl_doc, "title", "Unknown"),
        }

loader = DoclingLoader(
    file_path="/path/to/document.pdf",
    meta_extractor=CustomMetadataExtractor()
)
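
The example extractor above stores the chunk metadata as a JSON string under the "dl_meta" key, so downstream code has to decode it before reading individual fields. A minimal sketch of that step follows; decode_dl_meta is a hypothetical helper, and the inner fields of the sample value are made up for illustration.

```python
import json
from typing import Any, Dict


def decode_dl_meta(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of metadata with the JSON string under 'dl_meta' parsed."""
    decoded = dict(metadata)
    raw = decoded.get("dl_meta")
    if isinstance(raw, str):
        decoded["dl_meta"] = json.loads(raw)
    return decoded


# Sample shaped like the extractor's return value; the inner fields are invented.
sample = {
    "source": "/path/to/document.pdf",
    "chunk_type": "unknown",
    "dl_meta": json.dumps({"headings": ["Introduction"]}),
}
print(decode_dl_meta(sample)["dl_meta"]["headings"])
```

The helper returns a copy, so the original metadata dict keeps its serialized form.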

API Reference

DoclingLoader

The main class for loading and processing documents.

Parameters:

file_path (Union[str, Iterable[str]]): Path(s) to the source file(s)
converter (Optional[DocumentConverter]): Custom document converter instance
convert_kwargs (Optional[Dict[str, Any]]): Arguments for the conversion process
export_type (ExportType): Export format (DOC_CHUNKS or MARKDOWN)
md_export_kwargs (Optional[Dict[str, Any]]): Arguments for markdown export
chunker (Optional[BaseChunker]): Custom chunker for document chunks mode
meta_extractor (Optional[BaseMetadataExtractor]): Custom metadata extractor

Methods:

lazy_load(): Returns a generator that yields RAGDocument objects

RAGDocument

Represents a processed document with content and metadata.

Attributes:

page_content (str): The processed content of the document/chunk
metadata (Dict): Metadata associated with the document/chunk
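
To illustrate how these two attributes are typically consumed, the sketch below groups chunk text by source file. StubDocument is a hypothetical stand-in that mirrors only the attributes listed above so the snippet runs without the compiled backend, and the "source" metadata key is an assumption borrowed from the custom extractor example rather than a documented default.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List


@dataclass
class StubDocument:
    """Hypothetical stand-in exposing the two documented RAGDocument attributes."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def group_by_source(docs: Iterable[StubDocument]) -> Dict[str, List[str]]:
    """Collect page_content strings per metadata['source'] (assumed key)."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for doc in docs:
        grouped[doc.metadata.get("source", "unknown")].append(doc.page_content)
    return dict(grouped)


docs = [
    StubDocument("intro", {"source": "a.pdf"}),
    StubDocument("table", {"source": "a.pdf"}),
    StubDocument("summary", {"source": "b.docx"}),
]
print(group_by_source(docs))
```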

Project details

Version 1.0.0, released Jul 29, 2025

Download files

Source Distribution

purecpp_docling-1.0.0.tar.gz (6.1 kB)

Uploaded Jul 29, 2025 Source

Built Distribution

purecpp_docling-1.0.0-py3-none-any.whl (5.2 kB)

Uploaded Jul 29, 2025 Python 3

File details

Details for the file purecpp_docling-1.0.0.tar.gz.

File metadata

Download URL: purecpp_docling-1.0.0.tar.gz
Upload date: Jul 29, 2025
Size: 6.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.12

File hashes

Hashes for purecpp_docling-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`be5abe3b817bb4bdcc51b01985097efbd34351b483179564effbfbb276ba57c6`
MD5	`d35b64ac7b291ea69246b28364ce166a`
BLAKE2b-256	`1c2c6e85928c93b65b3d36c28db83a030851079ed5b6ec35d56744975270d3a7`
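
A downloaded archive can be checked against the SHA256 digest listed above using Python's standard hashlib module. This is a minimal sketch: the helper names sha256_of_file and matches_expected are introduced here for illustration, and the archive location in the current directory is an assumption.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for purecpp_docling-1.0.0.tar.gz
EXPECTED_SHA256 = "be5abe3b817bb4bdcc51b01985097efbd34351b483179564effbfbb276ba57c6"


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in fixed-size blocks so large files are not read at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_expected(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Compare the file's SHA256 hex digest with the expected value."""
    return sha256_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("purecpp_docling-1.0.0.tar.gz")  # assumed download location
    if archive.exists():
        print("OK" if matches_expected(archive) else "MISMATCH")
```

The same approach works for the wheel by substituting its file name and its own SHA256 digest.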

File details

Details for the file purecpp_docling-1.0.0-py3-none-any.whl.

File metadata

Download URL: purecpp_docling-1.0.0-py3-none-any.whl
Upload date: Jul 29, 2025
Size: 5.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.12

File hashes

Hashes for purecpp_docling-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`646fe171ebcffacfa8a5fabf7b2bb9dddc4ed679bdace4ac9444bba046d083b6`
MD5	`176da1b6849a09406a1dcdac53e4b13e`
BLAKE2b-256	`5415a5064ce90f601ae6dfc40d1a3176c1d4006729d7ea33f05c55a27e6660da`
