llama-index readers docling integration

Docling Reader

Overview

Docling Reader uses Docling to extract PDF, DOCX, HTML, and other document types quickly and easily into Markdown or the JSON-serialized Docling format, for use in LlamaIndex pipelines such as RAG and QA.

Installation

pip install llama-index-readers-docling

Usage

Markdown export

By default, Docling Reader exports to Markdown. Basic usage looks like this:

from llama_index.readers.docling import DoclingReader

reader = DoclingReader()
docs = reader.load_data(file_path="https://arxiv.org/pdf/2408.09869")
print(f"{docs[0].text[389:442]}...")
# > ## Abstract
# >
# > This technical report introduces Docling...
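The character offsets in the slice above are specific to that one paper and will differ for other documents. As a minimal sketch, a section can instead be located by its heading in the exported Markdown. The helper name `extract_section` and the sample string are hypothetical and not part of Docling Reader; the sketch also assumes ATX-style `#` headings and does not handle `#` lines inside code blocks.

```python
def extract_section(markdown: str, heading: str) -> str:
    """Return the text under the first heading titled `heading` (case-insensitive).

    Hypothetical helper for illustration; not part of Docling Reader.
    """
    body = []
    level = None  # heading level of the matched section; None until found
    for line in markdown.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("#"):
            hashes = len(stripped) - len(stripped.lstrip("#"))
            title = stripped[hashes:].strip()
            if level is None:
                # Still searching: remember the level once the title matches
                if title.lower() == heading.lower():
                    level = hashes
                continue
            if hashes <= level:
                # A heading of the same or a higher level ends the section
                break
        if level is not None:
            body.append(line)
    return "\n".join(body).strip()


# Stand-in for docs[0].text, so the sketch runs without downloading anything
sample = (
    "# Title\n\nintro\n\n"
    "## Abstract\n\nThis technical report introduces Docling.\n\n"
    "## 1 Introduction\n\nMore text."
)
print(extract_section(sample, "Abstract"))
```

With a real document, the same call would be `extract_section(docs[0].text, "Abstract")`, assuming the heading is present in the export.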

JSON export

Docling Reader can also export to Docling's native format, serialized as JSON:

from llama_index.readers.docling import DoclingReader

reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
docs = reader.load_data(file_path="https://arxiv.org/pdf/2408.09869")
print(f"{docs[0].text[:53]}...")
# > {"schema_name": "DoclingDocument", "version": "1.0.0"...

Important: When using JSON export, make sure to use a Docling Node Parser in your pipeline so that Docling's native format is parsed correctly.
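The following is a rough sketch of how that pairing might look. It assumes the node parser is the `DoclingNodeParser` class from the separate `llama-index-node-parser-docling` package, which this page does not name explicitly. The helper names `is_docling_json` and `parse_docling_documents` are hypothetical, and the check relies only on the `schema_name` field shown in the sample output above.

```python
import json


def is_docling_json(text: str) -> bool:
    """Return True if `text` looks like a JSON-serialized DoclingDocument."""
    try:
        payload = json.loads(text)
    except (TypeError, ValueError):
        # Not JSON at all, e.g. a Markdown export
        return False
    return isinstance(payload, dict) and payload.get("schema_name") == "DoclingDocument"


def parse_docling_documents(docs):
    """Turn JSON-exported documents into nodes (hypothetical helper).

    Assumption: DoclingNodeParser is provided by llama-index-node-parser-docling.
    The import is done lazily so the check above works without that package.
    """
    from llama_index.node_parser.docling import DoclingNodeParser

    not_json = [doc for doc in docs if not is_docling_json(doc.text)]
    if not_json:
        raise ValueError(
            f"{len(not_json)} document(s) are not Docling JSON; "
            "create the reader with export_type=DoclingReader.ExportType.JSON"
        )
    return DoclingNodeParser().get_nodes_from_documents(docs)


print(is_docling_json('{"schema_name": "DoclingDocument", "version": "1.0.0"}'))
print(is_docling_json("## Abstract"))
```

Checking the export type up front is a design choice in this sketch: it surfaces a mismatch between reader and node parser as a clear error rather than a parsing failure further down the pipeline.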

With Simple Directory Reader

Docling Reader can also be used directly with Simple Directory Reader, for example:

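from llama_index.core import SimpleDirectoryReader
from llama_index.readers.docling import DoclingReader

# One Docling Reader instance handles every PDF found in the directory
reader = DoclingReader()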

dir_reader = SimpleDirectoryReader(
    input_dir="/path/to/docs",
    file_extractor={".pdf": reader},
)
docs = dir_reader.load_data()
print(docs[0].metadata)
# > {'file_path': '/path/to/docs/2408.09869v3.pdf',
# >  'file_name': '2408.09869v3.pdf',
# >  'file_type': 'application/pdf',
# >  'file_size': 5566574,
# >  'creation_date': '2024-10-06',
# >  'last_modified_date': '2024-10-03'}
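Since Docling handles several document types, the same reader can be registered for more than one extension. The sketch below builds such a mapping for the `file_extractor` argument; the helper name `build_file_extractor` is hypothetical, and the extension list is an assumption based on the formats named in the overview.

```python
def build_file_extractor(reader, extensions):
    """Map each extension to the same reader, normalizing case and the leading dot.

    Hypothetical helper for illustration; not part of LlamaIndex or Docling Reader.
    """
    extractor = {}
    for ext in extensions:
        ext = ext.strip().lower()
        if not ext:
            continue  # ignore empty entries
        if not ext.startswith("."):
            ext = "." + ext
        extractor[ext] = reader
    return extractor


# A placeholder stands in for a DoclingReader instance so the sketch runs on its own
placeholder_reader = object()
mapping = build_file_extractor(placeholder_reader, ["pdf", ".DOCX", " html "])
print(sorted(mapping))
```

In a real pipeline the result would be passed as `SimpleDirectoryReader(input_dir="/path/to/docs", file_extractor=mapping)`, with a `DoclingReader()` instance in place of the placeholder.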
