llama-index readers docling integration
Docling Reader
Overview
Docling Reader uses Docling to extract PDF documents quickly and export them either to Markdown or to the JSON-serialized Docling format, for use in LlamaIndex pipelines such as RAG and question answering.
Installation
pip install llama-index-readers-docling
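To confirm that the install worked before building a pipeline, one option is to check that the module used in the examples can be located. This is a minimal sketch using only the standard library; the helper name `is_importable` is hypothetical and not part of the package.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if the module can be located without fully importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ImportError:
        # Raised when a parent package (e.g. llama_index) is missing entirely.
        return False


if __name__ == "__main__":
    name = "llama_index.readers.docling"
    status = "found" if is_importable(name) else "not found"
    print(f"{name}: {status}")
```

If the module is reported as not found, re-run the pip command in the same environment that runs your script.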
Usage
Markdown export
By default, Docling Reader exports to Markdown. Basic usage looks like this:
from llama_index.readers.docling import DoclingReader
reader = DoclingReader()
docs = reader.load_data(file_path="https://arxiv.org/pdf/2408.09869")
print(f"{docs[0].text[409:462]}...")
# > ## Abstract
# >
# > This technical report introduces Docling...
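The character offsets in the slice above are specific to that one document, so they may not carry over to other files. A minimal sketch of an alternative is to look up a section by its Markdown heading. The helper name `extract_section` and the sample string are hypothetical, introduced only for illustration; in practice the input would be `docs[0].text`.

```python
def extract_section(markdown: str, heading: str) -> str:
    """Return the text under a '## <heading>' line, up to the next '## ' heading."""
    body = []
    in_section = False
    for line in markdown.splitlines():
        if line.startswith("## "):
            if in_section:
                break  # reached the heading that follows the wanted section
            in_section = line[3:].strip() == heading
            continue
        if in_section:
            body.append(line)
    return "\n".join(body).strip()


# Hand-written stand-in for Markdown returned by the reader.
sample = (
    "## Title\n\nIntro\n\n"
    "## Abstract\n\nThis technical report introduces Docling...\n\n"
    "## Next\n\nMore"
)
print(extract_section(sample, "Abstract"))
```

The helper only matches second-level headings and returns an empty string when the heading is absent, which keeps the calling code simple.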
JSON export
Docling Reader can also export Docling's native format to JSON:
from llama_index.readers.docling import DoclingReader
reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
docs = reader.load_data(file_path="https://arxiv.org/pdf/2408.09869")
print(f"{docs[0].text[:50]}...")
# > {"_name":"","type":"pdf-document","description":{"...
Important: when using JSON export, make sure to use a Docling Node Parser in your pipeline so that Docling's native format is parsed appropriately.
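For a quick sanity check of what JSON export returns, the document text can be parsed with the standard `json` module. The sketch below is an assumption-laden illustration: `sample_text` is a hand-written stand-in modeled on the output prefix shown above, not real reader output, and `top_level_summary` is a hypothetical helper. It does not replace the Docling Node Parser in a pipeline.

```python
import json

# Hand-written stand-in for docs[0].text; real output is much larger and
# its exact schema depends on the installed Docling version.
sample_text = '{"_name": "", "type": "pdf-document", "description": {}}'


def top_level_summary(json_text: str) -> dict:
    """Map each top-level key to the Python type name of its value."""
    data = json.loads(json_text)
    return {key: type(value).__name__ for key, value in data.items()}


print(top_level_summary(sample_text))
```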
With Simple Directory Reader
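Docling Reader can also be passed to Simple Directory Reader as the extractor for PDF files, for example:
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.docling import DoclingReader

reader = DoclingReader()  # Markdown export by default
dir_reader = SimpleDirectoryReader(
    input_dir="/path/to/docs",
    file_extractor={".pdf": reader},  # route PDF files to Docling Reader
)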
docs = dir_reader.load_data()
print(docs[0].metadata)
# > {'file_path': '/path/to/docs/2408.09869v3.pdf',
# > 'file_name': '2408.09869v3.pdf',
# > 'file_type': 'application/pdf',
# > 'file_size': 5566574,
# > 'creation_date': '2024-10-06',
# > 'last_modified_date': '2024-10-03',
# > 'dl_doc_hash': '556ad9e...'}
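Conceptually, `file_extractor` is a mapping from file extension to the reader that should handle files with that extension. The sketch below illustrates that idea with plain functions. It is a simplified model, not SimpleDirectoryReader's actual implementation: `route_files` and `fake_pdf_reader` are hypothetical names, and files with unregistered extensions are simply skipped here.

```python
from pathlib import Path
from typing import Callable, Dict, List


def route_files(
    paths: List[str],
    file_extractor: Dict[str, Callable[[str], str]],
) -> Dict[str, str]:
    """Apply the extractor registered for each file's suffix; skip the rest."""
    results = {}
    for path in paths:
        suffix = Path(path).suffix.lower()
        extractor = file_extractor.get(suffix)
        if extractor is not None:
            results[path] = extractor(path)
    return results


def fake_pdf_reader(path: str) -> str:
    """Hypothetical stand-in for a reader; returns a label instead of parsing."""
    return f"parsed {Path(path).name} as PDF"


routed = route_files(
    ["/path/to/docs/a.pdf", "/path/to/docs/notes.txt"],
    {".pdf": fake_pdf_reader},
)
print(routed)
```

Keying on the extension is what lets a single directory mix PDFs handled by Docling Reader with other file types handled elsewhere.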