llama-index readers docling integration

Docling Reader

Overview

Docling Reader uses Docling to extract PDF, DOCX, HTML, and other document formats into Markdown or JSON-serialized Docling format, for use in LlamaIndex pipelines such as RAG and question answering.

Installation

pip install llama-index-readers-docling

Usage

Markdown export

By default, Docling Reader exports to Markdown. Basic usage looks like this:

from llama_index.readers.docling import DoclingReader

reader = DoclingReader()
docs = reader.load_data(file_path="https://arxiv.org/pdf/2408.09869")
print(f"{docs[0].text[389:442]}...")
# > ## Abstract
# >
# > This technical report introduces Docling...
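The slice indices in the example above are specific to that one paper. As a minimal sketch, assuming the Markdown export marks sections with `#`-style headings as in the output shown, a section can be located by its heading instead. The helper `extract_section` is hypothetical and not part of Docling Reader. It works on plain strings, so `sample` stands in for `docs[0].text`.

```python
def extract_section(markdown: str, heading: str) -> str:
    """Return the heading line plus everything up to the next heading
    of the same or a higher level. Returns "" if the heading is absent."""
    lines = markdown.splitlines()
    level = len(heading) - len(heading.lstrip("#"))

    # Find the first line that matches the requested heading exactly.
    start = None
    for i, line in enumerate(lines):
        if line.strip() == heading:
            start = i
            break
    if start is None:
        return ""

    # Stop at the next heading whose level is the same or higher.
    end = len(lines)
    for j in range(start + 1, len(lines)):
        stripped = lines[j].lstrip()
        hashes = len(stripped) - len(stripped.lstrip("#"))
        if 0 < hashes <= level and stripped[hashes:hashes + 1] == " ":
            end = j
            break
    return "\n".join(lines[start:end]).strip()


# Stand-in for docs[0].text; the real export is much longer.
sample = (
    "# Title\n\nIntro text.\n\n"
    "## Abstract\n\nThis technical report introduces Docling...\n\n"
    "## 1 Introduction\n\nMore text.\n"
)
abstract = extract_section(sample, "## Abstract")
print(abstract)
```

This simple scan does not skip fenced code inside the Markdown, so a `# comment` line in a code sample could be mistaken for a heading.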

JSON export

Docling Reader can also export documents in Docling's native format, serialized as JSON:

from llama_index.readers.docling import DoclingReader

reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
docs = reader.load_data(file_path="https://arxiv.org/pdf/2408.09869")
print(f"{docs[0].text[:53]}...")
# > {"schema_name": "DoclingDocument", "version": "1.0.0"...

Important: when using JSON export, make sure your pipeline includes a Docling Node Parser so that Docling's native format is parsed correctly.
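Because the two export types need different downstream handling, a pipeline may want to check which kind of text a document carries before choosing a node parser. The following is a minimal sketch, assuming the JSON export is a top-level object with a `schema_name` of `DoclingDocument`, as in the sample output above. The helper `detect_export_type` is hypothetical and not part of Docling Reader or LlamaIndex.

```python
import json


def detect_export_type(text: str) -> str:
    """Guess whether a document's text is Docling JSON or Markdown.

    Assumption: the JSON export is an object whose "schema_name"
    is "DoclingDocument"; anything else is treated as Markdown.
    """
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return "markdown"
    if isinstance(payload, dict) and payload.get("schema_name") == "DoclingDocument":
        return "json"
    return "markdown"


# Stand-ins for docs[0].text under each export type.
json_text = json.dumps({"schema_name": "DoclingDocument", "version": "1.0.0"})
md_text = "## Abstract\n\nThis technical report introduces Docling..."

print(detect_export_type(json_text))
print(detect_export_type(md_text))
```

A check like this parses the whole text, so for large documents it may be cheaper to record the export type when the reader is configured.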

With Simple Directory Reader

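The Docling Reader can also be combined with Simple Directory Reader, for example:

from llama_index.core import SimpleDirectoryReader
from llama_index.readers.docling import DoclingReader

# Reuse one Docling Reader (Markdown export by default) for every PDF in the directory
reader = DoclingReader()
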
dir_reader = SimpleDirectoryReader(
    input_dir="/path/to/docs",
    file_extractor={".pdf": reader},
)
docs = dir_reader.load_data()
print(docs[0].metadata)
# > {'file_path': '/path/to/docs/2408.09869v3.pdf',
# >  'file_name': '2408.09869v3.pdf',
# >  'file_type': 'application/pdf',
# >  'file_size': 5566574,
# >  'creation_date': '2024-10-06',
# >  'last_modified_date': '2024-10-03'}
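The example above routes only `.pdf` files to the Docling Reader. Since the overview lists DOCX and HTML as well, one reader could be registered for several extensions. The following is a minimal sketch. The helper `build_file_extractor` is hypothetical, and a plain placeholder object stands in for `DoclingReader()` so that the snippet runs without the package installed.

```python
def build_file_extractor(reader, extensions):
    """Map each extension, normalized to lowercase with a leading dot,
    to the same reader instance."""
    mapping = {}
    for ext in extensions:
        ext = ext.strip().lower()
        if not ext.startswith("."):
            ext = "." + ext
        mapping[ext] = reader
    return mapping


# Placeholder only; in real use this would be DoclingReader().
placeholder_reader = object()
file_extractor = build_file_extractor(placeholder_reader, ["pdf", ".DOCX", ".html"])
print(sorted(file_extractor))
```

The resulting dictionary has the same shape as the `file_extractor` argument shown above, so it could be passed to `SimpleDirectoryReader` in place of the single-entry mapping.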
