Docling PureCPP integration
Project description
Docling PureCPP Wrapper
This document describes how to use the Python wrapper for the Docling PureCPP project.
Description
The docling-purecpp project provides a high-performance document loading and processing solution by leveraging a C++ backend with Python bindings. This wrapper allows you to easily integrate the power of Docling into your Python applications.
Installation
As this project involves a C++ backend, it needs to be compiled first. Follow the build instructions in the main project's README to generate the necessary library files.
Usage
Basic Usage
Basic usage of DoclingLoader looks as follows:
from purecpp_libs import DoclingLoader, RAGDocument
# Can be a local file path, or a list of paths
FILE_PATH = ["/path/to/your/document.pdf", "/path/to/another/document.docx"]
loader = DoclingLoader(file_path=FILE_PATH)
# The lazy_load() method returns a generator
docs_generator = loader.lazy_load()
for doc in docs_generator:
print(f"Content: {doc.page_content}")
print(f"Metadata: {doc.metadata}")
print("---")
Advanced Usage
When initializing a DoclingLoader, you can use the following parameters:
file_path: A string or a list of strings representing the path(s) to the source file(s).converter(optional): A specific DoclingDocumentConverterinstance to use.convert_kwargs(optional): A dictionary of keyword arguments for the conversion process.export_type(optional): The export mode to use. Can beExportType.DOC_CHUNKS(default) orExportType.MARKDOWN.md_export_kwargs(optional): A dictionary of keyword arguments for Markdown exporting.chunker(optional): A specific DoclingBaseChunkerinstance to use (for document-chunk mode).meta_extractor(optional): A specific metadata extractor to use.
Example with Custom Configuration
from purecpp_libs import DoclingLoader, ExportType
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker
# Custom converter configuration
converter = DocumentConverter()
convert_kwargs = {"do_ocr": True, "do_table_structure": True}
# Custom chunker for document chunks
chunker = HybridChunker()
# Custom markdown export settings
md_export_kwargs = {"image_placeholder": "[IMAGE]"}
loader = DoclingLoader(
file_path="/path/to/document.pdf",
converter=converter,
convert_kwargs=convert_kwargs,
export_type=ExportType.DOC_CHUNKS, # or ExportType.MARKDOWN
md_export_kwargs=md_export_kwargs,
chunker=chunker
)
for doc in loader.lazy_load():
print(f"Content: {doc.page_content}")
print(f"Metadata: {doc.metadata}")
Export Types
The loader supports two export types:
ExportType.DOC_CHUNKS(default): Exports documents as chunked content using the specified chunkerExportType.MARKDOWN: Exports documents as markdown format
Custom Metadata Extraction
You can provide a custom metadata extractor by implementing the BaseMetadataExtractor interface:
from purecpp_libs import BaseMetadataExtractor
import json
class CustomMetadataExtractor(BaseMetadataExtractor):
def extract_chunk_meta(self, file_path, chunk):
return {
"source": file_path,
"chunk_type": chunk.meta.type if hasattr(chunk.meta, 'type') else "unknown",
"dl_meta": json.dumps(chunk.meta.export_json_dict())
}
def extract_dl_doc_meta(self, file_path, dl_doc):
return {
"source": file_path,
"document_title": dl_doc.title if hasattr(dl_doc, 'title') else "Unknown"
}
loader = DoclingLoader(
file_path="/path/to/document.pdf",
meta_extractor=CustomMetadataExtractor()
)
API Reference
DoclingLoader
The main class for loading and processing documents.
Parameters:
file_path(Union[str, Iterable[str]]): Path(s) to the source file(s)converter(Optional[DocumentConverter]): Custom document converter instanceconvert_kwargs(Optional[Dict[str, Any]]): Arguments for the conversion processexport_type(ExportType): Export format (DOC_CHUNKS or MARKDOWN)md_export_kwargs(Optional[Dict[str, Any]]): Arguments for markdown exportchunker(Optional[BaseChunker]): Custom chunker for document chunks modemeta_extractor(Optional[BaseMetadataExtractor]): Custom metadata extractor
Methods:
lazy_load(): Returns a generator that yieldsRAGDocumentobjects
RAGDocument
Represents a processed document with content and metadata.
Attributes:
page_content(str): The processed content of the document/chunkmetadata(Dict): Metadata associated with the document/chunk
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file purecpp_docling-1.0.0.tar.gz.
File metadata
- Download URL: purecpp_docling-1.0.0.tar.gz
- Upload date:
- Size: 6.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
be5abe3b817bb4bdcc51b01985097efbd34351b483179564effbfbb276ba57c6
|
|
| MD5 |
d35b64ac7b291ea69246b28364ce166a
|
|
| BLAKE2b-256 |
1c2c6e85928c93b65b3d36c28db83a030851079ed5b6ec35d56744975270d3a7
|
File details
Details for the file purecpp_docling-1.0.0-py3-none-any.whl.
File metadata
- Download URL: purecpp_docling-1.0.0-py3-none-any.whl
- Upload date:
- Size: 5.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
646fe171ebcffacfa8a5fabf7b2bb9dddc4ed679bdace4ac9444bba046d083b6
|
|
| MD5 |
176da1b6849a09406a1dcdac53e4b13e
|
|
| BLAKE2b-256 |
5415a5064ce90f601ae6dfc40d1a3176c1d4006729d7ea33f05c55a27e6660da
|