Simple package to extract text with coordinates from programmatic PDFs

Docling Parse

Simple package to extract text, paths and bitmap images with coordinates from programmatic PDFs. This package is used in the Docling PDF conversion. Below, we show a few outputs of the latest parser, with char-, word- and line-level output for text, in addition to the extracted paths and bitmap resources.

To generate the visualizations yourself, run the following command (replace word with char or line):

uv run python ./docling_parse/visualize.py -i <path-to-pdf-file> -c word --interactive

[Screenshot table omitted: each row shows the original page next to the char-, word- and line-level output.]

Quick start

Install the package from PyPI:

pip install docling-parse

Convert a PDF (see visualize.py for a more detailed example):

from docling_core.types.doc.page import TextCellUnit
from docling_parse.pdf_parser import DoclingPdfParser, PdfDocument

parser = DoclingPdfParser()

pdf_doc: PdfDocument = parser.load(
    path_or_stream="<path-to-pdf>"
)

# PdfDocument.iterate_pages() will automatically populate pages as they are yielded.
for page_no, pred_page in pdf_doc.iterate_pages():

    # iterate over the word-cells
    for word in pred_page.iterate_cells(unit_type=TextCellUnit.WORD):
        print(word.rect, ": ", word.text)

    # create a PIL image with the char cells
    img = pred_page.render_as_image(cell_unit=TextCellUnit.CHAR)
    img.show()
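
Once word cells carry coordinates, a common next step is to group them into visual lines. The sketch below does this on plain (text, x0, y0, x1, y1) tuples rather than on the parser's own cell objects. How to unpack word.rect into those four numbers is not shown here and is left as an assumption. The Word type, group_into_lines and the tolerance value are hypothetical names chosen for illustration, and a top-left origin is assumed.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    """Hypothetical plain-data stand-in for a word cell (not a docling-parse type)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def group_into_lines(words: List[Word], y_tol: float = 2.0) -> List[str]:
    """Group words whose vertical centers lie within y_tol of the first
    word of the current line, then order each line left to right."""
    lines: List[List[Word]] = []
    anchors: List[float] = []  # vertical center of the first word in each line

    # Assumption: top-left origin, so a smaller y means higher on the page.
    for word in sorted(words, key=lambda w: (w.y0 + w.y1) / 2):
        center = (word.y0 + word.y1) / 2
        if lines and abs(center - anchors[-1]) <= y_tol:
            lines[-1].append(word)
        else:
            lines.append([word])
            anchors.append(center)

    return [
        " ".join(w.text for w in sorted(line, key=lambda w: w.x0))
        for line in lines
    ]


words = [
    Word("world", 40, 10, 70, 20),
    Word("Hello", 5, 10.5, 35, 20.5),
    Word("Second", 5, 30, 45, 40),
    Word("line", 50, 30, 70, 40),
]
print(group_into_lines(words))  # ['Hello world', 'Second line']
```

A fixed tolerance is the simplest possible choice; pages with mixed font sizes would likely need a tolerance derived from the cell heights instead.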

Parallel parsing (multi-threaded)

Parse pages from one or more PDFs in parallel using a thread pool with backpressure:

from docling_parse.pdf_parser import (
    DoclingThreadedPdfParser,
    ThreadedPdfParserConfig,
)
from docling_parse.pdf_parsers import DecodePageConfig  # type: ignore[import]

parser_config = ThreadedPdfParserConfig(
    loglevel="fatal",
    threads=4,                # worker threads
    max_concurrent_results=32,  # cap buffered results to limit memory
)
decode_config = DecodePageConfig()

parser = DoclingThreadedPdfParser(
    parser_config=parser_config,
    decode_config=decode_config,
)

# load one or more documents
for source in ["doc_a.pdf", "doc_b.pdf"]:
    doc_key = parser.load(source)
    print(doc_key, parser.page_count(doc_key))

# consume decoded pages as they become available
for result in parser.iterate_results():
    if result.success:
        seg_page = result.get_page()
        timings = result.get_timings()
        print(f"{result.doc_key} p{result.page_number}: "
              f"{len(seg_page.word_cells)} words in {timings.total():.3f}s")
    else:
        print(f"error on {result.doc_key} p{result.page_number}: {result.error_message}")
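
The max_concurrent_results setting caps how many decoded pages may be buffered before the consumer picks them up. The sketch below illustrates that general idea (worker threads feeding a bounded queue) using only the Python standard library. It is not a description of how docling-parse is implemented internally, and fake_decode, run_with_backpressure and the (doc_key, page_number) task tuples are hypothetical stand-ins.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator, Tuple

Task = Tuple[str, int]  # hypothetical (doc_key, page_number) pair


def fake_decode(task: Task) -> str:
    """Stand-in for real page decoding; just formats the task."""
    doc_key, page_no = task
    return f"{doc_key} p{page_no}"


def run_with_backpressure(
    tasks: Iterable[Task],
    decode: Callable[[Task], str],
    threads: int = 4,
    max_concurrent_results: int = 2,
) -> Iterator[str]:
    """Yield results as they complete, in no particular order."""
    task_q: "queue.Queue[Task]" = queue.Queue()
    # Bounded queue: producers block once this many results are waiting.
    result_q: "queue.Queue[object]" = queue.Queue(maxsize=max_concurrent_results)
    done = object()  # sentinel, one per worker

    for task in tasks:
        task_q.put(task)

    def worker() -> None:
        while True:
            try:
                task = task_q.get_nowait()
            except queue.Empty:
                break
            result_q.put(decode(task))  # blocks while the buffer is full
        result_q.put(done)

    for _ in range(threads):
        threading.Thread(target=worker, daemon=True).start()

    finished = 0
    while finished < threads:
        item = result_q.get()
        if item is done:
            finished += 1
        else:
            yield item  # type: ignore[misc]


tasks = [("doc_a.pdf", n) for n in range(1, 6)]
results = sorted(run_with_backpressure(tasks, fake_decode))
print(results)
```

Because workers block on a full result queue, a slow consumer throttles decoding instead of letting memory grow with the number of pages.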

Use the CLI

$ docling-parse -h
usage: docling-parse [-h] -p PDF

Process a PDF file.

options:
  -h, --help         show this help message and exit
  -p PDF, --pdf PDF  Path to the PDF file
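
For scripting, the CLI can be driven from Python. The sketch below only assembles the command from the single -p option shown in the help text above and leaves execution to the caller. build_cli_command is a hypothetical helper, not part of the package, and the PDF path is a placeholder.

```python
import shlex
from pathlib import Path
from typing import List, Union


def build_cli_command(pdf: Union[str, Path]) -> List[str]:
    """Return the argument list for `docling-parse -p <pdf>`."""
    return ["docling-parse", "-p", str(pdf)]


cmd = build_cli_command("reports/annual report.pdf")  # placeholder path
print(shlex.join(cmd))  # shell-quoted form, safe to copy into a terminal

# To actually run it (requires docling-parse to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems with paths that contain spaces.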

Performance Benchmarks

Coming soon - benchmarks will be updated for the current parser version.

For historical V1 vs V2 benchmarks, see legacy_performance_benchmarks.md.

Development

CXX

To build the parser, run the following command in the root folder:

rm -rf build; cmake -B ./build; cd build; make

You can run the parser from your build folder:

% ./parse.exe -h
program to process PDF files or configuration files
Usage:
  PDFProcessor [OPTION...]

  -i, --input arg          Input PDF file
  -c, --config arg         Config file
      --create-config arg  Create config file
  -p, --page arg           Pages to process (default: -1 for all) (default:
                           -1)
      --password arg       Password for accessing encrypted, password-protected files
  -o, --output arg         Output file
  -l, --loglevel arg       loglevel [error;warning;success;info]
  -h, --help               Print usage

If you don't have an input file, a template input file will be printed on the terminal.

Python

To build the package, run the following (make sure uv is installed):

uv sync

This only works after a clean git clone. If you are developing and updating the C++ code, use the following instead:

# uv pip install --force-reinstall --no-deps -e .
rm -rf .venv; uv venv; uv pip install --force-reinstall --no-deps -e ".[perf-tools]"

To test the package, run:

uv run pytest ./tests -v -s

Contributing

Please read Contributing to Docling Parse for details.

References

If you use Docling in your projects, please consider citing the following:

@techreport{Docling,
  author = {Docling Team},
  month = {8},
  title = {Docling Technical Report},
  url = {https://arxiv.org/abs/2408.09869},
  eprint = {2408.09869},
  doi = {10.48550/arXiv.2408.09869},
  version = {1.0.0},
  year = {2024}
}

License

The Docling Parse codebase is released under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.

LF AI & Data

Docling (and also docling-parse) is hosted as a project in the LF AI & Data Foundation.

IBM ❤️ Open Source AI

The project was started by the AI for knowledge team at IBM Research Zurich.
