Simple package to extract text with coordinates from programmatic PDFs

Docling Parse

Simple package to extract text, paths and bitmap images with coordinates from programmatic PDFs. This package is used in the Docling PDF conversion. Below, we show a few outputs of the latest parser with char-, word- and line-level output for text, in addition to the extracted paths and bitmap resources.

To produce the visualizations yourself, run the following command (replace word with char or line):

uv run python ./docling_parse/visualize.py -i <path-to-pdf-file> -c word --interactive

[Screenshots: five sample pages, each shown as the original next to its char-, word- and line-level visualization]

Quick start

Install the package from PyPI:

pip install docling-parse

Convert a PDF (see visualize.py for a more detailed example):

from docling_core.types.doc.page import TextCellUnit
from docling_parse.pdf_parser import DoclingPdfParser, PdfDocument

parser = DoclingPdfParser()

pdf_doc: PdfDocument = parser.load(
    path_or_stream="<path-to-pdf>"
)

# PdfDocument.iterate_pages() will automatically populate pages as they are yielded.
for page_no, pred_page in pdf_doc.iterate_pages():

    # iterate over the word-cells
    for word in pred_page.iterate_cells(unit_type=TextCellUnit.WORD):
        print(word.rect, ": ", word.text)

    # create a PIL image with the char cells
    img = pred_page.render_as_image(cell_unit=TextCellUnit.CHAR)
    img.show()
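
If you want the extracted words as plain data rather than printed output, the loop above can be wrapped in a small generator. The following is a minimal sketch that uses only the calls shown in the quick start; WordRecord, iter_word_records and page_texts are hypothetical helper names, not part of docling-parse.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple


class WordRecord(NamedTuple):
    """Hypothetical container for one word cell (not a docling-parse type)."""
    page_no: int
    text: str


def iter_word_records(pdf_path: str) -> Iterator[WordRecord]:
    """Yield one WordRecord per word cell, page by page."""
    # Imported lazily so that page_texts() below works without the package.
    from docling_core.types.doc.page import TextCellUnit
    from docling_parse.pdf_parser import DoclingPdfParser

    parser = DoclingPdfParser()
    pdf_doc = parser.load(path_or_stream=pdf_path)
    for page_no, page in pdf_doc.iterate_pages():
        for word in page.iterate_cells(unit_type=TextCellUnit.WORD):
            yield WordRecord(page_no, word.text)


def page_texts(records: Iterable[WordRecord]) -> Dict[int, str]:
    """Join the words of each page with single spaces, in the order received."""
    pages: Dict[int, List[str]] = {}
    for rec in records:
        pages.setdefault(rec.page_no, []).append(rec.text)
    return {page_no: " ".join(words) for page_no, words in pages.items()}
```

With the package installed, page_texts(iter_word_records("<path-to-pdf>")) would give one string per page. Joining with spaces ignores the layout; use word.rect if you need positions.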

Parallel parsing (multi-threaded)

Parse pages from one or more PDFs in parallel using a thread pool with backpressure:

from docling_parse.pdf_parser import (
    DoclingThreadedPdfParser,
    ThreadedPdfParserConfig,
)
from docling_parse.pdf_parsers import DecodePageConfig  # type: ignore[import]

parser_config = ThreadedPdfParserConfig(
    loglevel="fatal",
    threads=4,                # worker threads
    max_concurrent_results=32 # cap buffered results to limit memory
)
decode_config = DecodePageConfig()

parser = DoclingThreadedPdfParser(
    parser_config=parser_config,
    decode_config=decode_config,
)

# load one or more documents
for source in ["doc_a.pdf", "doc_b.pdf"]:
    parser.load(source)

# consume decoded pages as they become available
while parser.has_tasks():
    task = parser.get_task()

    if task.success:
        page_decoder, timings = task.get()
        print(f"{task.doc_key} p{task.page_number}: "
              f"{len(list(page_decoder.get_word_cells()))} words")
    else:
        print(f"error on {task.doc_key} p{task.page_number}: {task.error()}")
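
To make the term "backpressure" concrete, here is a standard-library sketch of the general idea: workers block once a bounded result buffer is full, so memory use stays capped until the consumer catches up. This is not how docling-parse implements it; run_with_backpressure is a hypothetical function whose parameter names merely mirror the config fields above.

```python
import queue
import threading
from typing import Any, Callable, Iterator, List, Tuple


def run_with_backpressure(
    items: List[Any],
    work: Callable[[Any], Any],
    threads: int = 4,
    max_concurrent_results: int = 32,
) -> Iterator[Tuple[Any, Any]]:
    """Yield (item, result) pairs as workers finish, buffering at most
    max_concurrent_results finished results at any time."""
    tasks: "queue.Queue[Any]" = queue.Queue()
    results: "queue.Queue[Tuple[Any, Any]]" = queue.Queue(maxsize=max_concurrent_results)
    for item in items:
        tasks.put(item)

    def worker() -> None:
        while True:
            try:
                item = tasks.get_nowait()
            except queue.Empty:
                return  # no work left
            try:
                outcome = work(item)
            except Exception as exc:  # hand errors to the consumer instead of hanging
                outcome = exc
            # put() blocks while the buffer is full: this is the backpressure.
            results.put((item, outcome))

    pool = [threading.Thread(target=worker, daemon=True) for _ in range(threads)]
    for t in pool:
        t.start()
    for _ in range(len(items)):
        yield results.get()
    for t in pool:
        t.join()
```

A small buffer keeps memory low at the cost of idle workers when the consumer is slow; a large one does the opposite.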

Use the CLI

$ docling-parse -h
usage: docling-parse [-h] -p PDF

Process a PDF file.

options:
  -h, --help         show this help message and exit
  -p PDF, --pdf PDF  Path to the PDF file
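
Since the CLI takes a single PDF per invocation, processing a folder means one call per file. The sketch below builds those calls using only the -p flag shown in the help text; docling_parse_commands and run_all are hypothetical helper names, and what the CLI writes to stdout is not specified here.

```python
import subprocess
from pathlib import Path
from typing import List


def docling_parse_commands(pdf_dir: str) -> List[List[str]]:
    """Return one `docling-parse -p <file>` argument list per PDF in pdf_dir."""
    pdfs = sorted(p for p in Path(pdf_dir).iterdir() if p.suffix.lower() == ".pdf")
    return [["docling-parse", "-p", str(p)] for p in pdfs]


def run_all(pdf_dir: str) -> None:
    """Run the CLI on every PDF and report each exit code."""
    for cmd in docling_parse_commands(pdf_dir):
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(f"{cmd[-1]}: exit code {result.returncode}")
```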

Performance Benchmarks

Coming soon - benchmarks will be updated for the current parser version.

For historical V1 vs V2 benchmarks, see legacy_performance_benchmarks.md.

Development

CXX

To build the parser, simply run the following command in the root folder,

rm -rf build; cmake -B ./build; cd build; make

You can run the parser from your build folder:

% ./parse.exe -h
program to process PDF files or configuration files
Usage:
  PDFProcessor [OPTION...]

  -i, --input arg          Input PDF file
  -c, --config arg         Config file
      --create-config arg  Create config file
  -p, --page arg           Pages to process (default: -1 for all) (default:
                           -1)
      --password arg       Password for accessing encrypted, password-protected files
  -o, --output arg         Output file
  -l, --loglevel arg       loglevel [error;warning;success;info]
  -h, --help               Print usage

If you don't have an input file, a template input file will be printed on the terminal.
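
When scripting the built binary, it can help to assemble the argument list in one place. The following is a minimal sketch based only on the options listed in the help output above; parse_exe_args is a hypothetical helper, and the default path ./parse.exe assumes you run it from the build folder.

```python
from typing import List, Optional

# Log levels as listed in the help text above.
LOGLEVELS = ("error", "warning", "success", "info")


def parse_exe_args(
    input_pdf: str,
    output: Optional[str] = None,
    page: Optional[int] = None,
    loglevel: Optional[str] = None,
    exe: str = "./parse.exe",
) -> List[str]:
    """Build an argument list for parse.exe; omitted options keep their defaults."""
    args = [exe, "-i", str(input_pdf)]
    if page is not None:
        args += ["-p", str(page)]
    if output is not None:
        args += ["-o", str(output)]
    if loglevel is not None:
        if loglevel not in LOGLEVELS:
            raise ValueError(f"loglevel must be one of {LOGLEVELS}, got {loglevel!r}")
        args += ["-l", loglevel]
    return args
```

The resulting list can be passed directly to subprocess.run.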

Python

To build the package, simply run (make sure uv is installed),

uv sync

This only works after a clean git clone. If you are developing and updating the C++ code, use the following instead:

# uv pip install --force-reinstall --no-deps -e .
rm -rf .venv; uv venv; uv pip install --force-reinstall --no-deps -e ".[perf-tools]"

To test the package, run:

uv run pytest ./tests -v -s

Contributing

Please read Contributing to Docling Parse for details.

References

If you use Docling in your projects, please consider citing the following:

@techreport{Docling,
  author = {Docling Team},
  month = {8},
  title = {Docling Technical Report},
  url = {https://arxiv.org/abs/2408.09869},
  eprint = {2408.09869},
  doi = {10.48550/arXiv.2408.09869},
  version = {1.0.0},
  year = {2024}
}

License

The Docling Parse codebase is released under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.

LF AI & Data

Docling (and also docling-parse) is hosted as a project in the LF AI & Data Foundation.

IBM ❤️ Open Source AI

The project was started by the AI for knowledge team at IBM Research Zurich.

Download files

Source distribution

  • docling_parse-5.5.0.tar.gz (57.5 MB)

Built distributions

CPython 3.14

  • docling_parse-5.5.0-cp314-cp314-win_amd64.whl (9.6 MB): Windows x86-64
  • docling_parse-5.5.0-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (8.3 MB): manylinux, glibc 2.27+ and 2.28+, x86-64
  • docling_parse-5.5.0-cp314-cp314-manylinux_2_26_aarch64.manylinux_2_28_aarch64.whl (8.2 MB): manylinux, glibc 2.26+ and 2.28+, ARM64
  • docling_parse-5.5.0-cp314-cp314-macosx_14_0_arm64.whl (7.8 MB): macOS 14.0+ ARM64

CPython 3.13

  • docling_parse-5.5.0-cp313-cp313-win_amd64.whl (9.2 MB): Windows x86-64
  • docling_parse-5.5.0-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (8.3 MB): manylinux, glibc 2.27+ and 2.28+, x86-64
  • docling_parse-5.5.0-cp313-cp313-manylinux_2_26_aarch64.manylinux_2_28_aarch64.whl (8.2 MB): manylinux, glibc 2.26+ and 2.28+, ARM64
  • docling_parse-5.5.0-cp313-cp313-macosx_14_0_arm64.whl (7.8 MB): macOS 14.0+ ARM64

CPython 3.12

  • docling_parse-5.5.0-cp312-cp312-win_amd64.whl (9.2 MB): Windows x86-64
  • docling_parse-5.5.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (8.3 MB): manylinux, glibc 2.27+ and 2.28+, x86-64
  • docling_parse-5.5.0-cp312-cp312-manylinux_2_26_aarch64.manylinux_2_28_aarch64.whl (8.2 MB): manylinux, glibc 2.26+ and 2.28+, ARM64
  • docling_parse-5.5.0-cp312-cp312-macosx_14_0_arm64.whl (7.8 MB): macOS 14.0+ ARM64

CPython 3.11

  • docling_parse-5.5.0-cp311-cp311-win_amd64.whl (9.2 MB): Windows x86-64
  • docling_parse-5.5.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (8.3 MB): manylinux, glibc 2.27+ and 2.28+, x86-64
  • docling_parse-5.5.0-cp311-cp311-manylinux_2_26_aarch64.manylinux_2_28_aarch64.whl (8.2 MB): manylinux, glibc 2.26+ and 2.28+, ARM64
  • docling_parse-5.5.0-cp311-cp311-macosx_14_0_arm64.whl (7.8 MB): macOS 14.0+ ARM64

CPython 3.10

  • docling_parse-5.5.0-cp310-cp310-win_amd64.whl (9.2 MB): Windows x86-64
  • docling_parse-5.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.3 MB): manylinux, glibc 2.17+, x86-64
  • docling_parse-5.5.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (8.2 MB): manylinux, glibc 2.17+, ARM64
  • docling_parse-5.5.0-cp310-cp310-macosx_14_0_arm64.whl (7.8 MB): macOS 14.0+ ARM64
