
Docling PDF conversion package

Project description

Docling

Docling bundles PDF document conversion to JSON and Markdown in an easy, self-contained package.

Features

  • ⚡ Converts any PDF document to JSON or Markdown format, stable and lightning fast
  • 📑 Understands detailed page layout, reading order and recovers table structures
  • 📝 Extracts metadata from the document, such as title, authors, references and language
  • 🔍 Optionally applies OCR (use with scanned PDFs)

Setup

You need Python 3.11 and Poetry. Install Poetry by following its official installation instructions.

Once you have poetry installed, create an environment and install the package:

poetry env use $(which python3.11)
poetry shell
poetry install

Notes:

  • Works on macOS and Linux environments. Windows platforms are currently not tested.

Usage

For basic usage, see the convert.py example module. Run with:

python examples/convert.py

The output of the above command will be written to ./scratch.
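
At its core, the example builds a DocumentConverter, wraps the input PDFs, and runs the conversion. The sketch below outlines that flow; the import paths and the model artifacts location are assumptions based on the 0.1.x package layout, and exporting the results to JSON and Markdown is shown in examples/convert.py.

from pathlib import Path

# Import paths assume the docling 0.1.x package layout.
from docling.datamodel.document import DocumentConversionInput
from docling.document_converter import DocumentConverter

# Hypothetical location of the model artifacts used by the converter.
artifacts_path = Path("./model_artifacts")

doc_converter = DocumentConverter(artifacts_path=artifacts_path)

# Wrap one or more PDF paths and run the conversion pipeline;
# examples/convert.py shows how to export the results to JSON and Markdown.
input = DocumentConversionInput.from_paths([Path("./test/data/2206.01062.pdf")])
converted_docs = doc_converter.convert(input)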

Enable or disable pipeline features

You can control whether table structure recognition or OCR should be performed by passing arguments to DocumentConverter:

from docling.datamodel.base_models import PipelineOptions  # 0.1.x import paths assumed
from docling.document_converter import DocumentConverter

doc_converter = DocumentConverter(
    artifacts_path=artifacts_path,  # directory holding the model artifacts
    pipeline_options=PipelineOptions(
        do_table_structure=False,  # controls whether table structure is recovered
        do_ocr=True,  # controls whether OCR is applied (OCR ignores programmatic content)
    ),
)

Impose limits on the document size

You can limit the file size and the number of pages that are allowed to be processed per document:

from pathlib import Path

# Import paths assume the docling 0.1.x package layout.
from docling.datamodel.document import DocumentConversionInput
from docling.datamodel.settings import DocumentLimits

paths = [Path("./test/data/2206.01062.pdf")]

input = DocumentConversionInput.from_paths(
    paths, limits=DocumentLimits(max_num_pages=100, max_file_size=20971520)  # 20 MiB
)

Convert from binary PDF streams

You can convert PDFs from a binary stream instead of from the filesystem as follows:

from io import BytesIO

from docling.datamodel.base_models import DocumentStream  # 0.1.x import path assumed

buf = BytesIO(your_binary_stream)  # your_binary_stream contains the raw PDF bytes
docs = [DocumentStream(filename="my_doc.pdf", stream=buf)]
input = DocumentConversionInput.from_streams(docs)
converted_docs = doc_converter.convert(input)

Limit resource usage

You can limit the number of CPU threads used by Docling by setting the environment variable OMP_NUM_THREADS accordingly. The default is 4 CPU threads.
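
For example, to run the basic example with two threads:

OMP_NUM_THREADS=2 python examples/convert.py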

Contributing

Please read Contributing to Docling for details.

References

If you use Docling in your projects, please consider citing the following:

@software{Docling,
  author = {Deep Search Team},
  month = {7},
  title = {{Docling}},
  url = {https://github.com/DS4SD/docling},
  version = {main},
  year = {2024}
}

License

The Docling codebase is under MIT license. For individual model usage, please refer to the model licenses found in the original packages.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

docling-0.1.2.tar.gz (27.5 kB)

Built Distribution

docling-0.1.2-py3-none-any.whl (32.4 kB)

File details

Details for the file docling-0.1.2.tar.gz.

File metadata

  • Download URL: docling-0.1.2.tar.gz
  • Upload date:
  • Size: 27.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.12.4 Darwin/23.5.0

File hashes

Hashes for docling-0.1.2.tar.gz:

  • SHA256: c65a45e3a622899e2d38837f86b00e4bc2098f0aa2658b96a26e78bffa07d247
  • MD5: 8eade75c144d3cee22e0affad8f54fa2
  • BLAKE2b-256: 50f38d00b1771cd9e45c7b37ee0fc612c452205351095de1f7a258824660f674

File details

Details for the file docling-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: docling-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 32.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.12.4 Darwin/23.5.0

File hashes

Hashes for docling-0.1.2-py3-none-any.whl:

  • SHA256: daeb8fe6e77b4aeab95b5cc0fc395bfd41ba45e620e96330f4ca958cce0028d3
  • MD5: aee0146c6b4bc5e107837b4ef412e722
  • BLAKE2b-256: 4218904aae2476301e4b4348f813d1f702e89ab65941d26094d760c493dc807b
