Simple package to extract text with coordinates from programmatic PDFs

Docling Parse

Simple package to extract text, paths and bitmap images with coordinates from programmatic PDFs. This package is used in the Docling PDF conversion. Below, we show a few outputs of the latest parser with char-, word- and line-level output for text, in addition to the extracted paths and bitmap resources.

To produce the visualizations yourself, run the following command (replace word with char or line):

uv run python ./docling_parse/visualize.py -i <path-to-pdf-file> -c word --interactive

[Screenshots: five sample pages, each shown as the original next to its char-, word- and line-level output]

Quick start

Install the package from PyPI:

pip install docling-parse

Convert a PDF (see visualize.py for a more detailed example):

from docling_core.types.doc.page import TextCellUnit
from docling_parse.pdf_parser import DoclingPdfParser, PdfDocument

parser = DoclingPdfParser()

pdf_doc: PdfDocument = parser.load(
    path_or_stream="<path-to-pdf>"
)

# PdfDocument.iterate_pages() will automatically populate pages as they are yielded.
for page_no, pred_page in pdf_doc.iterate_pages():

    # iterate over the word-cells
    for word in pred_page.iterate_cells(unit_type=TextCellUnit.WORD):
        print(word.rect, ": ", word.text)

    # create a PIL image with the char cells
    img = pred_page.render_as_image(cell_unit=TextCellUnit.CHAR)
    img.show()
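
The loop above prints each word cell as it is yielded. If you would rather keep the cells for later processing, the sketch below copies them into flat records. It relies only on the `.text` and `.rect` attributes used above; `WordRecord`, `collect_words` and `page_text` are hypothetical helper names and are not part of docling-parse.

```python
from dataclasses import dataclass
from typing import Any, Iterable, List


@dataclass
class WordRecord:
    """Hypothetical flat record; not a docling-parse type."""

    page_no: int
    text: str
    rect: Any  # kept opaque: whatever the cell exposes as `.rect`


def collect_words(page_no: int, cells: Iterable[Any]) -> List[WordRecord]:
    """Copy `.text` and `.rect` from each cell, skipping whitespace-only cells."""
    return [
        WordRecord(page_no=page_no, text=cell.text, rect=cell.rect)
        for cell in cells
        if cell.text.strip()
    ]


def page_text(records: Iterable[WordRecord]) -> str:
    """Join the word texts with single spaces, in the order they were yielded."""
    return " ".join(record.text for record in records)
```

Inside the page loop this could be called as `collect_words(page_no, pred_page.iterate_cells(unit_type=TextCellUnit.WORD))`. Joining in iteration order is an assumption of this sketch; it does not attempt any reading-order reconstruction.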

Parallel parsing (multi-threaded)

Parse pages from one or more PDFs in parallel using a thread pool with backpressure:

from docling_parse.pdf_parser import (
    DoclingThreadedPdfParser,
    ThreadedPdfParserConfig,
)
from docling_parse.pdf_parsers import DecodePageConfig  # type: ignore[import]

parser_config = ThreadedPdfParserConfig(
    loglevel="fatal",
    threads=4,                # worker threads
    max_concurrent_results=32 # cap buffered results to limit memory
)
decode_config = DecodePageConfig()

parser = DoclingThreadedPdfParser(
    parser_config=parser_config,
    decode_config=decode_config,
)

# load one or more documents
for source in ["doc_a.pdf", "doc_b.pdf"]:
    parser.load(source)

# consume decoded pages as they become available
while parser.has_tasks():
    task = parser.get_task()

    if task.success:
        page_decoder, timings = task.get()
        print(f"{task.doc_key} p{task.page_number}: "
              f"{len(list(page_decoder.get_word_cells()))} words")
    else:
        print(f"error on {task.doc_key} p{task.page_number}: {task.error()}")
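
To illustrate what "a thread pool with backpressure" means in general, here is a minimal sketch that uses only the Python standard library. It is not how docling-parse implements its threaded parser. The function name `parallel_map_with_backpressure` is hypothetical, and its `threads` and `max_concurrent_results` parameters mirror the config names above only by analogy. Workers block when the bounded result queue is full, so a slow consumer caps how many results sit in memory.

```python
import queue
import threading
from typing import Any, Callable, Iterable, Iterator, Optional, Tuple

Outcome = Tuple[Any, Any, Optional[Exception]]


def parallel_map_with_backpressure(
    func: Callable[[Any], Any],
    items: Iterable[Any],
    threads: int = 4,
    max_concurrent_results: int = 32,
) -> Iterator[Outcome]:
    """Yield (item, result, error) tuples in completion order."""
    tasks: "queue.Queue[Any]" = queue.Queue()
    for item in items:
        tasks.put(item)
    total = tasks.qsize()  # exact here: no worker has started yet

    # Bounded buffer: put() blocks once this many results are waiting.
    results: "queue.Queue[Outcome]" = queue.Queue(maxsize=max_concurrent_results)

    def worker() -> None:
        while True:
            try:
                item = tasks.get_nowait()
            except queue.Empty:
                return  # no work left
            try:
                outcome: Outcome = (item, func(item), None)
            except Exception as exc:  # report failures instead of hanging
                outcome = (item, None, exc)
            results.put(outcome)  # backpressure: waits for the consumer

    pool = [threading.Thread(target=worker, daemon=True) for _ in range(threads)]
    for thread in pool:
        thread.start()
    for _ in range(total):
        yield results.get()
    for thread in pool:
        thread.join()
```

Results arrive in completion order rather than submission order, which is why each tuple carries its input item, much as each task above carries `doc_key` and `page_number`.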

Use the CLI

$ docling-parse -h
usage: docling-parse [-h] -p PDF

Process a PDF file.

options:
  -h, --help         show this help message and exit
  -p PDF, --pdf PDF  Path to the PDF file
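
If you want to call the CLI from a script, a thin wrapper such as the one below may be enough. It uses only the `-p` flag listed in the help text above. `build_command` and `run_cli` are hypothetical helper names, and the sketch makes no assumption about what the command prints: it simply returns the completed process so the caller can inspect `returncode`, `stdout` and `stderr`.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_command(pdf_path: str) -> List[str]:
    """Return the argv for `docling-parse -p <pdf>`."""
    return ["docling-parse", "-p", str(Path(pdf_path))]


def run_cli(pdf_path: str) -> "subprocess.CompletedProcess[str]":
    """Run the CLI and capture its output without raising on a non-zero exit."""
    if shutil.which("docling-parse") is None:
        raise FileNotFoundError("docling-parse was not found on PATH")
    return subprocess.run(
        build_command(pdf_path),
        capture_output=True,
        text=True,
        check=False,
    )
```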

Performance Benchmarks

Coming soon - benchmarks will be updated for the current parser version.

For historical V1 vs V2 benchmarks, see legacy_performance_benchmarks.md.

Development

CXX

To build the parser, run the following command in the root folder:

rm -rf build; cmake -B ./build; cd build; make

You can run the parser from your build folder:

% ./parse.exe -h
program to process PDF files or configuration files
Usage:
  PDFProcessor [OPTION...]

  -i, --input arg          Input PDF file
  -c, --config arg         Config file
      --create-config arg  Create config file
  -p, --page arg           Pages to process (default: -1 for all) (default:
                           -1)
      --password arg       Password for accessing encrypted, password-protected files
  -o, --output arg         Output file
  -l, --loglevel arg       loglevel [error;warning;success;info]
  -h, --help               Print usage

If you don't have an input file, a template input file will be printed on the terminal.

Python

To build the package, make sure uv is installed and run:

uv sync

This command only works after a clean git clone. If you are developing and updating the C++ code, use the following instead:

# uv pip install --force-reinstall --no-deps -e .
rm -rf .venv; uv venv; uv pip install --force-reinstall --no-deps -e ".[perf-tools]"

To test the package, run:

uv run pytest ./tests -v -s

Contributing

Please read Contributing to Docling Parse for details.

References

If you use Docling in your projects, please consider citing the following:

@techreport{Docling,
  author = {Docling Team},
  month = {8},
  title = {Docling Technical Report},
  url = {https://arxiv.org/abs/2408.09869},
  eprint = {2408.09869},
  doi = {10.48550/arXiv.2408.09869},
  version = {1.0.0},
  year = {2024}
}

License

The Docling Parse codebase is released under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.

LF AI & Data

Docling (and also docling-parse) is hosted as a project in the LF AI & Data Foundation.

IBM ❤️ Open Source AI

The project was started by the AI for knowledge team at IBM Research Zurich.
