
Docling PDF conversion package

Project description

Docling

Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.

Features

  • ⚡ Converts any PDF document to JSON or Markdown format, stable and lightning fast
  • 📑 Understands detailed page layout, reading order and recovers table structures
  • 📝 Extracts metadata from the document, such as title, authors, references and language
  • 🔍 Optionally applies OCR (use with scanned PDFs)

Setup

You need Python 3.11 and poetry. Install poetry following the official instructions at https://python-poetry.org/docs/.

Once you have poetry installed, create an environment and install the package:

poetry env use $(which python3.11)
poetry shell
poetry install

Notes:

  • Works on macOS and Linux environments. Windows platforms are currently not tested.

Usage

For basic usage, see the convert.py example module. Run with:

python examples/convert.py

The output of the above command will be written to ./scratch.
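
As a rough end-to-end sketch of what the example does, here is a minimal conversion script. It assumes the model-download and Markdown-export helpers used by examples/convert.py (download_models_hf, render_as_markdown); exact import paths and method names may differ between Docling versions.

from pathlib import Path

# Import paths as used in the bundled examples; they may vary between versions.
from docling.datamodel.document import DocumentConversionInput
from docling.document_converter import DocumentConverter

# Assumption: download_models_hf() fetches the layout/table model weights and
# returns the local artifacts path, as done in examples/convert.py.
artifacts_path = DocumentConverter.download_models_hf()
doc_converter = DocumentConverter(artifacts_path=artifacts_path)

# Convert a single PDF (path taken from the test data used elsewhere in this README).
input_docs = DocumentConversionInput.from_paths([Path("./test/data/2206.01062.pdf")])

output_dir = Path("./scratch")
output_dir.mkdir(parents=True, exist_ok=True)
for doc in doc_converter.convert(input_docs):
    # Assumption: render_as_markdown() serializes the converted document to Markdown.
    (output_dir / "2206.01062.md").write_text(doc.render_as_markdown())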

Enable or disable pipeline features

You can control whether table structure recognition or OCR is performed via arguments passed to DocumentConverter:

doc_converter = DocumentConverter(
    artifacts_path=artifacts_path,
    pipeline_options=PipelineOptions(
        do_table_structure=False,  # Controls if table structure is recovered.
        do_ocr=True,  # Controls if OCR is applied (ignores programmatic content).
    ),
)

Impose limits on the document size

You can limit the file size and the number of pages that are allowed to be processed per document:

from pathlib import Path

paths = [Path("./test/data/2206.01062.pdf")]

input = DocumentConversionInput.from_paths(
    paths,
    limits=DocumentLimits(max_num_pages=100, max_file_size=20971520),  # 20 MiB file-size cap, in bytes
)

Convert from binary PDF streams

You can convert PDFs from a binary stream instead of from the filesystem as follows:

from io import BytesIO

buf = BytesIO(your_binary_stream)
docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
input = DocumentConversionInput.from_streams(docs)
converted_docs = doc_converter.convert(input)

Limit resource usage

You can limit the number of CPU threads used by docling by setting the environment variable OMP_NUM_THREADS. The default is 4 CPU threads.
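
For example, to run the conversion example with two threads:

OMP_NUM_THREADS=2 python examples/convert.py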

Contributing

Please read Contributing to Docling for details.

References

If you use Docling in your projects, please consider citing the following:

@software{Docling,
  author = {Deep Search Team},
  month = {7},
  title = {{Docling}},
  url = {https://github.com/DS4SD/docling},
  version = {main},
  year = {2024}
}

License

The Docling codebase is under MIT license. For individual model usage, please refer to the model licenses found in the original packages.

Download files

Download the file for your platform.

Source Distribution

docling-0.1.1.tar.gz (27.5 kB)

Built Distribution

docling-0.1.1-py3-none-any.whl (32.4 kB)

File details

Details for the file docling-0.1.1.tar.gz.

File metadata

  • Download URL: docling-0.1.1.tar.gz
  • Upload date:
  • Size: 27.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.12.4 Darwin/23.5.0

File hashes

Hashes for docling-0.1.1.tar.gz:

  • SHA256: 29dddc094f34adda20fe7e250463cc4ba45aa2b60a1a2b5184de309d639361b1
  • MD5: bcc7da2a7ef9cc2df05be37f42a6119c
  • BLAKE2b-256: 9af6ba9e768c565ddeaad3008b90af077d358b21c6c865ee32e73cc985459932


File details

Details for the file docling-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: docling-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 32.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.12.4 Darwin/23.5.0

File hashes

Hashes for docling-0.1.1-py3-none-any.whl:

  • SHA256: 6e0e0f85ac7c8d33df64cf38e20b0b45ab7836a205ad71bb604bedd021764fe7
  • MD5: b8e8075f71e4b5d27696590a07357be6
  • BLAKE2b-256: 9f2fc93456754ec3e67375578bec7b2c18c08dce1107091636b3d7afd86473a3

