Docling PDF conversion package

Docling

Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.

Features

  • ⚡ Converts any PDF document to JSON or Markdown format, stable and lightning fast
  • 📑 Understands detailed page layout, reading order and recovers table structures
  • 📝 Extracts metadata from the document, such as title, authors, references and language
  • 🔍 Includes OCR support for scanned PDFs
  • 🤖 Integrates easily with LLM app / RAG frameworks like 🦙 LlamaIndex and 🦜🔗 LangChain
  • 💻 Provides a simple and convenient CLI

Installation

To use Docling, simply install docling from your package manager, e.g. pip:

pip install docling

Docling works on macOS, Linux and Windows, on both x86_64 and arm64 architectures.

Alternative PyTorch distributions

The Docling models depend on the PyTorch library. Depending on your architecture, you might want to use a different distribution of torch, for example one with support for a different accelerator or a CPU-only build. All installation options for torch are listed on the PyTorch website, https://pytorch.org/.

One common situation is installing on Linux systems with CPU-only support. In this case, we suggest installing Docling with the following options:

# Example for installing on the Linux cpu-only version
pip install docling --extra-index-url https://download.pytorch.org/whl/cpu

Alternative OCR engines

Docling supports multiple OCR engines for processing scanned documents. The current version provides the following engines.

Engine        | Installation                                                       | Usage
EasyOCR       | Default in Docling or via pip install easyocr.                     | EasyOcrOptions
Tesseract     | System dependency. See description for Tesseract and Tesserocr below. | TesseractOcrOptions
Tesseract CLI | System dependency. See description below.                          | TesseractCliOcrOptions

The Docling DocumentConverter lets you choose the OCR engine through the ocr_options setting. For example:

from docling.datamodel.pipeline_options import PipelineOptions, EasyOcrOptions, TesseractOcrOptions
from docling.document_converter import DocumentConverter

pipeline_options = PipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = TesseractOcrOptions()  # Use Tesseract; EasyOcrOptions() selects EasyOCR instead

doc_converter = DocumentConverter(
    pipeline_options=pipeline_options,
)

Tesseract installation

Tesseract is a popular OCR engine which is available on most operating systems. To use this engine with Docling, Tesseract must be installed on your system, using the packaging tool of your choice. Example commands are provided below. After installing Tesseract, you are expected to provide the path to its language files using the TESSDATA_PREFIX environment variable (note that it must terminate with a slash /).

For macOS, we recommend using Homebrew.

brew install tesseract leptonica pkg-config
TESSDATA_PREFIX=/opt/homebrew/share/tessdata/
echo "Set TESSDATA_PREFIX=${TESSDATA_PREFIX}"

For Debian-based systems:

apt-get install tesseract-ocr tesseract-ocr-eng libtesseract-dev libleptonica-dev pkg-config
TESSDATA_PREFIX=$(dpkg -L tesseract-ocr-eng | grep tessdata$)
echo "Set TESSDATA_PREFIX=${TESSDATA_PREFIX}"

For RHEL systems:

dnf install tesseract tesseract-devel tesseract-langpack-eng leptonica-devel
TESSDATA_PREFIX=/usr/share/tesseract/tessdata/
echo "Set TESSDATA_PREFIX=${TESSDATA_PREFIX}"
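
The trailing-slash requirement for TESSDATA_PREFIX is easy to miss, so it can help to check the variable before starting a conversion. The following is a minimal sketch using only the Python standard library; the helper name check_tessdata_prefix is hypothetical and not part of Docling.

```python
import os


def check_tessdata_prefix(env=None):
    """Return TESSDATA_PREFIX if it is set and ends with a slash, else raise ValueError.

    Hypothetical helper for illustration, not part of Docling.
    """
    env = os.environ if env is None else env
    prefix = env.get("TESSDATA_PREFIX", "")
    if not prefix:
        raise ValueError("TESSDATA_PREFIX is not set")
    if not prefix.endswith("/"):
        raise ValueError(f"TESSDATA_PREFIX must end with a slash: {prefix!r}")
    return prefix


# Example with an explicit mapping instead of the real environment.
print(check_tessdata_prefix({"TESSDATA_PREFIX": "/usr/share/tesseract/tessdata/"}))
```

Calling check_tessdata_prefix() without arguments inspects the real environment of the current process.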

Linking to Tesseract

The most efficient way to use the Tesseract library is via linking. Docling uses the Tesserocr package for this.

If you run into problems installing Tesserocr, we suggest the following installation options:

pip uninstall tesserocr
pip install --no-binary :all: tesserocr

Docling development setup

To develop for Docling (features, bugfixes etc.), install as follows from your local clone's root dir:

poetry install --all-extras

Getting started

Convert a single document

To convert individual PDF documents, use convert_single(), for example:

from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"  # PDF path or URL
converter = DocumentConverter()
result = converter.convert_single(source)
print(result.render_as_markdown())  # output: "## Docling Technical Report[...]"
print(result.render_as_doctags())  # output: "<document><title><page_1><loc_20>..."

Convert a batch of documents

For an example of batch-converting documents, see batch_convert.py.

From a local repo clone, you can run it with:

python examples/batch_convert.py

The output of the above command will be written to ./scratch.
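
When preparing your own batch, the first step is usually to gather the input files. The sketch below is not the content of batch_convert.py; it only shows one way to collect PDF paths with the standard library. The helper name collect_pdf_paths is hypothetical. The resulting list could then be handed to DocumentConversionInput.from_paths, which appears later in this document.

```python
import tempfile
from pathlib import Path


def collect_pdf_paths(root):
    """Return all PDF files below `root`, sorted for a reproducible order."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Small self-contained demonstration on a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.pdf").write_bytes(b"%PDF-1.7\n")
    (Path(tmp) / "notes.txt").write_text("not a PDF")
    found = [p.name for p in collect_pdf_paths(tmp)]

print(found)  # ['a.pdf']
```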

CLI

You can also use Docling directly from your command line to convert individual files (local or by URL) or whole directories.

A simple example would look like this:

docling https://arxiv.org/pdf/2206.01062

To see all available options (export formats etc.) run docling --help.

CLI reference

Here are the available options as of this writing (for an up-to-date listing, run docling --help):

$ docling --help

Usage: docling [OPTIONS] source

╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    input_sources      source  PDF files to convert. Can be local file / directory paths or URL. [default: None] [required] │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --json       --no-json                            If enabled the document is exported as JSON. [default: no-json]            │
│ --md         --no-md                              If enabled the document is exported as Markdown. [default: md]             │
│ --txt        --no-txt                             If enabled the document is exported as Text. [default: no-txt]             │
│ --doctags    --no-doctags                         If enabled the document is exported as Doc Tags. [default: no-doctags]     │
│ --ocr        --no-ocr                             If enabled, the bitmap content will be processed using OCR. [default: ocr] │
│ --backend                    [pypdfium2|docling]  The PDF backend to use. [default: docling]                                 │
│ --output                     PATH                 Output directory where results are saved. [default: .]                     │
│ --version                                         Show version information.                                                  │
│ --help                                            Show this message and exit.                                                │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
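
For scripted use, the options listed in the help output can be assembled into a command line programmatically. The sketch below uses only flags that appear in that listing; the function names build_docling_command and run_docling are hypothetical helpers, and actually running the command assumes that docling is installed and on your PATH.

```python
import subprocess


def build_docling_command(source, output_dir=".", as_json=False, as_markdown=True, ocr=True):
    """Build the argument list for the docling CLI (hypothetical helper)."""
    cmd = ["docling"]
    cmd.append("--json" if as_json else "--no-json")
    cmd.append("--md" if as_markdown else "--no-md")
    cmd.append("--ocr" if ocr else "--no-ocr")
    # Options come first, the source last, matching "docling [OPTIONS] source".
    cmd += ["--output", str(output_dir), str(source)]
    return cmd


def run_docling(source, **kwargs):
    """Run the CLI and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_docling_command(source, **kwargs), check=True)


cmd = build_docling_command("https://arxiv.org/pdf/2206.01062", output_dir="out", as_json=True)
print(" ".join(cmd))
# docling --json --md --ocr --output out https://arxiv.org/pdf/2206.01062
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems with paths that contain spaces.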

RAG

Check out the following examples showcasing RAG using Docling with standard LLM application frameworks:

Advanced features

Adjust pipeline features

The example file custom_convert.py shows several ways to adjust the conversion pipeline and its features.

Control pipeline options

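You can control whether table structure recognition or OCR is performed through arguments passed to DocumentConverter:

from docling.datamodel.pipeline_options import PipelineOptions
from docling.document_converter import DocumentConverter

# artifacts_path is assumed to be defined beforehand and to point to the model artifacts
doc_converter = DocumentConverter(
    artifacts_path=artifacts_path,
    pipeline_options=PipelineOptions(
        do_table_structure=False,  # controls if table structure is recovered
        do_ocr=True,  # controls if OCR is applied (ignores programmatic content)
    ),
)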

Control table extraction options

You can control whether table structure recognition maps the recognized structure back to PDF cells (default) or uses text cells from the structure prediction itself. The latter can improve output quality if you find that multiple columns in extracted tables are erroneously merged into one.

from docling.datamodel.pipeline_options import PipelineOptions
from docling.document_converter import DocumentConverter

pipeline_options = PipelineOptions(do_table_structure=True)
pipeline_options.table_structure_options.do_cell_matching = False  # uses text cells predicted from table structure model

doc_converter = DocumentConverter(
    artifacts_path=artifacts_path,
    pipeline_options=pipeline_options,
)

Since Docling 1.16.0, you can control which TableFormer mode to use. Choose between TableFormerMode.FAST (default) and TableFormerMode.ACCURATE, which is slower but gives better results on difficult table structures.

from docling.datamodel.pipeline_options import PipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter

pipeline_options = PipelineOptions(do_table_structure=True)
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE  # use more accurate TableFormer model

doc_converter = DocumentConverter(
    artifacts_path=artifacts_path,
    pipeline_options=pipeline_options,
)

Impose limits on the document size

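You can limit the file size and the number of pages that Docling is allowed to process per document:

from pathlib import Path

from docling.datamodel.document import DocumentConversionInput
from docling.datamodel.settings import DocumentLimits  # import path assumed for docling 1.x

conv_input = DocumentConversionInput.from_paths(
    paths=[Path("./test/data/2206.01062.pdf")],
    limits=DocumentLimits(max_num_pages=100, max_file_size=20971520),  # 20971520 bytes = 20 MiB
)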
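
The max_file_size value in the snippet above is given in bytes: 20971520 equals 20 × 1024 × 1024, that is 20 MiB. As a minimal sketch of the same arithmetic, the following standard-library check tests a file against such a limit before conversion. The helper exceeds_size_limit is hypothetical and says nothing about how Docling enforces DocumentLimits internally.

```python
import tempfile
from pathlib import Path

MAX_FILE_SIZE = 20 * 1024 * 1024  # 20 MiB = 20971520 bytes


def exceeds_size_limit(path, max_file_size=MAX_FILE_SIZE):
    """Return True if the file at `path` is larger than `max_file_size` bytes."""
    return Path(path).stat().st_size > max_file_size


# Demonstration with a small temporary file (about 1 kB).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.pdf"
    sample.write_bytes(b"%PDF-1.7\n" + b"0" * 1000)
    too_big_for_default = exceeds_size_limit(sample)
    too_big_for_tiny = exceeds_size_limit(sample, max_file_size=100)

print(MAX_FILE_SIZE, too_big_for_default, too_big_for_tiny)  # 20971520 False True
```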

Convert from binary PDF streams

You can convert PDFs from a binary stream instead of from the filesystem as follows:

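from io import BytesIO

from docling.datamodel.base_models import DocumentStream
from docling.datamodel.document import DocumentConversionInput

# your_binary_stream holds the raw PDF bytes; doc_converter is a DocumentConverter instance
buf = BytesIO(your_binary_stream)
docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
conv_input = DocumentConversionInput.from_streams(docs)
results = doc_converter.convert(conv_input)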
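
Before wrapping bytes in a stream, it can be useful to confirm that they look like a PDF at all. The sketch below performs a strict check for the %PDF- header and returns a BytesIO that could be passed as the stream argument shown above. The helper name pdf_bytes_to_stream is hypothetical, and the check is a simple heuristic rather than a full validation.

```python
from io import BytesIO


def pdf_bytes_to_stream(data: bytes) -> BytesIO:
    """Wrap raw PDF bytes in a BytesIO, rejecting input without a PDF header."""
    if not data.startswith(b"%PDF-"):
        raise ValueError("data does not start with the %PDF- header")
    return BytesIO(data)


buf = pdf_bytes_to_stream(b"%PDF-1.7\n(rest of the file)")
header = buf.read(8)
buf.seek(0)  # rewind so a later consumer reads from the start
print(header)  # b'%PDF-1.7'
```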

Limit resource usage

You can limit the number of CPU threads used by Docling by setting the environment variable OMP_NUM_THREADS accordingly. By default, Docling uses 4 CPU threads.
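
If you prefer to set the variable from Python rather than from the shell, a minimal sketch looks as follows. It assumes that the variable has to be in place before Docling and its numerical dependencies are imported, since thread pools are typically configured when those libraries initialize. The value 2 is only an example.

```python
import os

# Set the limit first, so that libraries imported afterwards can pick it up.
os.environ["OMP_NUM_THREADS"] = "2"  # example value; environment values must be strings

# Import Docling only after the variable is set, for example:
# from docling.document_converter import DocumentConverter

print(os.environ["OMP_NUM_THREADS"])  # 2
```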

Chunking

You can perform a hierarchy-aware chunking of a Docling document as follows:

from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker import HierarchicalChunker

doc = DocumentConverter().convert_single("https://arxiv.org/pdf/2206.01062").output
chunks = list(HierarchicalChunker().chunk(doc))
print(chunks[0])
# ChunkWithMetadata(
#     path='#/main-text/1',
#     text='DocLayNet: A Large Human-Annotated Dataset [...]',
#     page=1,
#     bbox=[107.30, 672.38, 505.19, 709.08],
#     [...]
# )
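
Each chunk carries metadata such as the page number, which makes simple post-processing possible. As an illustrative sketch, the code below groups chunk texts by page. The Chunk dataclass is a stand-in that mirrors only some of the fields in the printed output above; it is not the docling_core class, and the second and third sample texts are placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Chunk:
    """Stand-in with a few of the fields shown above (not the docling_core class)."""
    path: str
    text: str
    page: int


def group_text_by_page(chunks):
    """Return a dict mapping each page number to the chunk texts found on it."""
    pages = defaultdict(list)
    for chunk in chunks:
        pages[chunk.page].append(chunk.text)
    return dict(pages)


sample_chunks = [
    Chunk(path="#/main-text/1", text="DocLayNet: A Large Human-Annotated Dataset [...]", page=1),
    Chunk(path="#/main-text/2", text="placeholder text A", page=1),
    Chunk(path="#/main-text/3", text="placeholder text B", page=2),
]
by_page = group_text_by_page(sample_chunks)
print(sorted(by_page))  # [1, 2]
```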

Technical report

For more details on Docling's inner workings, check out the Docling Technical Report.

Contributing

Please read Contributing to Docling for details.

References

If you use Docling in your projects, please consider citing the following:

@techreport{Docling,
  author = {Deep Search Team},
  month = {8},
  title = {Docling Technical Report},
  url = {https://arxiv.org/abs/2408.09869},
  eprint = {2408.09869},
  doi = {10.48550/arXiv.2408.09869},
  version = {1.0.0},
  year = {2024}
}

License

The Docling codebase is released under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.

