llama-index readers docling integration
Docling Reader
Overview
Docling Reader uses Docling to extract PDF, DOCX, HTML, and other document types quickly and easily into Markdown or JSON-serialized Docling format, for use in LlamaIndex pipelines such as RAG and QA.
Installation
pip install llama-index-readers-docling
Usage
Markdown export
By default, Docling Reader exports to Markdown. Basic usage looks like this:
from llama_index.readers.docling import DoclingReader
reader = DoclingReader()
docs = reader.load_data(file_path="https://arxiv.org/pdf/2408.09869")
print(f"{docs[0].text[389:442]}...")
# > ## Abstract
# >
# > This technical report introduces Docling...
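The fixed character offsets in the example above depend on the exact conversion output. A minimal sketch of a more robust alternative is to locate a section by its Markdown heading. The helper `extract_section` and the sample string are hypothetical stand-ins and not part of Docling Reader; in practice the input would be `docs[0].text`.

```python
import re


def extract_section(markdown_text: str, heading: str) -> str:
    """Return the body under the first heading whose title matches `heading`.

    The section ends at the next heading of the same or higher level,
    or at the end of the text. Returns "" if the heading is not found.
    """
    lines = markdown_text.splitlines()
    heading_re = re.compile(r"^(#{1,6})\s+(.*?)\s*$")
    start = None
    level = 0
    for i, line in enumerate(lines):
        m = heading_re.match(line)
        if not m:
            continue
        if start is None:
            # Still searching for the requested heading (case-insensitive).
            if m.group(2).strip().lower() == heading.lower():
                start, level = i + 1, len(m.group(1))
        elif len(m.group(1)) <= level:
            # A heading of the same or higher level closes the section.
            return "\n".join(lines[start:i]).strip()
    if start is None:
        return ""
    return "\n".join(lines[start:]).strip()


# Hypothetical stand-in for docs[0].text from the Markdown export.
sample = (
    "# Title\n\n## Abstract\n\n"
    "This technical report introduces Docling.\n\n"
    "## 1 Introduction\n\nMore text."
)
abstract = extract_section(sample, "Abstract")
print(abstract)
```

This sketch treats any line starting with `#` as a heading, so Markdown that contains fenced code with `#` comments would need extra handling.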
JSON export
Docling Reader can also export Docling's native format to JSON:
from llama_index.readers.docling import DoclingReader
reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
docs = reader.load_data(file_path="https://arxiv.org/pdf/2408.09869")
print(f"{docs[0].text[:53]}...")
# > {"schema_name": "DoclingDocument", "version": "1.0.0"...
[!IMPORTANT] When using JSON export, make sure to use a Docling Node Parser in your pipeline so that Docling's native format is parsed appropriately.
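Because JSON export yields Docling's native format rather than Markdown, a pipeline that handles both export types may need to tell them apart before choosing a parser. The sketch below checks the `schema_name` field shown in the example output above. The helper name `is_docling_json` and the sample strings are assumptions for illustration, not part of the library.

```python
import json


def is_docling_json(text: str) -> bool:
    """Heuristically detect a JSON-serialized Docling document.

    Assumes the top-level object carries "schema_name": "DoclingDocument",
    as in the example output above.
    """
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("schema_name") == "DoclingDocument"


# Hypothetical stand-ins for docs[0].text under each export type.
json_text = '{"schema_name": "DoclingDocument", "version": "1.0.0"}'
markdown_text = "## Abstract\n\nThis technical report introduces Docling..."

print(is_docling_json(json_text))      # True
print(is_docling_json(markdown_text))  # False
```

Documents for which this check returns `True` are the ones that should be routed to a Docling Node Parser.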
With Simple Directory Reader
The Docling Reader can also be used directly in combination with Simple Directory Reader, for example:
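from llama_index.core import SimpleDirectoryReader
from llama_index.readers.docling import DoclingReader

# Route every PDF in the directory through Docling Reader.
reader = DoclingReader()
dir_reader = SimpleDirectoryReader(
    input_dir="/path/to/docs",
    file_extractor={".pdf": reader},
)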
docs = dir_reader.load_data()
print(docs[0].metadata)
# > {'file_path': '/path/to/docs/2408.09869v3.pdf',
# > 'file_name': '2408.09869v3.pdf',
# > 'file_type': 'application/pdf',
# > 'file_size': 5566574,
# > 'creation_date': '2024-10-06',
# > 'last_modified_date': '2024-10-03'}
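Since Docling handles several document types, the same reader instance can be registered for more than one extension. A minimal sketch, assuming `file_extractor` accepts a plain dict keyed by dot-prefixed extensions as in the example above; the helper `build_file_extractor` and the placeholder reader object are hypothetical.

```python
from typing import Any, Dict, Iterable


def build_file_extractor(reader: Any, extensions: Iterable[str]) -> Dict[str, Any]:
    """Map each extension (lowercased, with a leading dot) to `reader`."""
    mapping: Dict[str, Any] = {}
    for ext in extensions:
        ext = ext.strip().lower()
        if not ext:
            continue  # skip empty entries
        if not ext.startswith("."):
            ext = "." + ext
        mapping[ext] = reader
    return mapping


# Placeholder standing in for a DoclingReader instance.
reader_stub = object()
file_extractor = build_file_extractor(reader_stub, [".pdf", "DOCX", "html"])
print(sorted(file_extractor))  # ['.docx', '.html', '.pdf']
```

The resulting dict could then be passed as `file_extractor=` to `SimpleDirectoryReader`. Which formats are actually supported depends on the installed Docling version.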