Simple package to extract text with coordinates from programmatic PDFs

Docling Parse

Simple package to extract text, paths and bitmap images with coordinates from programmatic PDFs. This package is used in the Docling PDF conversion. Below, we show a few outputs of the latest parser, with char-, word- and line-level output for text, in addition to the extracted paths and bitmap resources.

To generate these visualizations yourself, run the following command (replace word with char or line):

uv run python ./docling_parse/visualize.py -i <path-to-pdf-file> -c word --interactive
[Screenshots: five example pages, each shown in four columns: original, char, word, line]

Quick start

Install the package from PyPI:

pip install docling-parse

Convert a PDF (see visualize.py for a more detailed example):

from docling_core.types.doc.page import TextCellUnit
from docling_parse.pdf_parser import DoclingPdfParser, PdfDocument

parser = DoclingPdfParser()

pdf_doc: PdfDocument = parser.load(
    path_or_stream="<path-to-pdf>"
)

# PdfDocument.iterate_pages() will automatically populate pages as they are yielded.
for page_no, pred_page in pdf_doc.iterate_pages():

    # iterate over the word-cells
    for word in pred_page.iterate_cells(unit_type=TextCellUnit.WORD):
        print(word.rect, ": ", word.text)

    # create a PIL image with the char cells
    img = pred_page.render_as_image(cell_unit=TextCellUnit.CHAR)
    img.show()
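
Once word cells are available, a common next step is to put them back into reading order. The sketch below is a minimal, library-independent illustration of that idea and is not part of docling-parse. It assumes each word has already been reduced to a flat record with a top-left origin; `Word`, `group_into_lines` and the tolerance `y_tol` are hypothetical names introduced here, and converting `word.rect` into such a record is left to the reader.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical flat record for one word cell (top-left origin assumed)."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def group_into_lines(words: list[Word], y_tol: float = 2.0) -> list[str]:
    """Group words whose vertical centres lie within y_tol of the line's first word."""
    lines: list[list[Word]] = []
    centres: list[float] = []  # reference vertical centre of each line
    for word in sorted(words, key=lambda w: (w.y0 + w.y1) / 2):
        centre = (word.y0 + word.y1) / 2
        if lines and abs(centre - centres[-1]) <= y_tol:
            lines[-1].append(word)
        else:
            lines.append([word])
            centres.append(centre)
    # within a line, order words left to right
    return [
        " ".join(w.text for w in sorted(line, key=lambda w: w.x0))
        for line in lines
    ]


words = [
    Word(60, 10, 90, 20, "world"),
    Word(10, 11, 50, 21, "Hello"),
    Word(10, 30, 40, 40, "next"),
    Word(45, 30, 80, 40, "line"),
]
print(group_into_lines(words))  # ['Hello world', 'next line']
```

The tolerance is a judgment call: too small and slightly misaligned words split into separate lines, too large and adjacent lines merge. When line-level output is all you need, iterating with the line unit described above avoids this step entirely.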

Parallel parsing (multi-threaded)

Parse pages from one or more PDFs in parallel using a thread pool with backpressure:

from docling_parse.pdf_parser import (
    DoclingThreadedPdfParser,
    ThreadedPdfParserConfig,
)
from docling_parse.pdf_parsers import DecodePageConfig  # type: ignore[import]

parser_config = ThreadedPdfParserConfig(
    loglevel="fatal",
    threads=4,                # worker threads
    max_concurrent_results=32 # cap buffered results to limit memory
)
decode_config = DecodePageConfig()

parser = DoclingThreadedPdfParser(
    parser_config=parser_config,
    decode_config=decode_config,
)

# load one or more documents
for source in ["doc_a.pdf", "doc_b.pdf"]:
    parser.load(source)

# consume decoded pages as they become available
while parser.has_tasks():
    task = parser.get_task()

    if task.success:
        page_decoder, timings = task.get()
        print(f"{task.doc_key} p{task.page_number}: "
              f"{len(list(page_decoder.get_word_cells()))} words")
    else:
        print(f"error on {task.doc_key} p{task.page_number}: {task.error()}")
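
To make the term "backpressure" concrete, here is a minimal standard-library sketch of the general pattern: workers block when a bounded result queue is full, so memory stays capped until the consumer catches up. This is an illustration of the concept only, not how DoclingThreadedPdfParser is implemented; `parse_page`, `worker`, `run` and `max_results` are hypothetical names.

```python
import queue
import threading


def parse_page(doc: str, page_no: int) -> str:
    """Stand-in for real page decoding (hypothetical)."""
    return f"{doc} p{page_no}"


def worker(tasks: queue.Queue, results: queue.Queue) -> None:
    while True:
        item = tasks.get()
        if item is None:  # sentinel: no more work for this worker
            break
        # put() blocks while the results queue is full: this is the backpressure
        results.put(parse_page(*item))


def run(pages: list[tuple[str, int]], threads: int = 4, max_results: int = 2) -> list[str]:
    tasks: queue.Queue = queue.Queue()
    results: queue.Queue = queue.Queue(maxsize=max_results)
    for page in pages:
        tasks.put(page)
    for _ in range(threads):
        tasks.put(None)  # one sentinel per worker
    pool = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(threads)]
    for t in pool:
        t.start()
    # the consumer drains results, which unblocks any waiting workers
    collected = [results.get() for _ in pages]
    for t in pool:
        t.join()
    return collected


pages = [(doc, n) for doc in ("doc_a.pdf", "doc_b.pdf") for n in range(1, 4)]
out = run(pages)
print(sorted(out))
```

Results arrive in completion order, not submission order, which is why the example above identifies each result by document key and page number.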

Use the CLI

$ docling-parse -h
usage: docling-parse [-h] -p PDF

Process a PDF file.

options:
  -h, --help         show this help message and exit
  -p PDF, --pdf PDF  Path to the PDF file
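
If you want to call the CLI from a script, a thin wrapper around `subprocess` is usually enough. The sketch below relies only on the `--pdf` option shown in the help text above; `build_command` and `run_cli` are hypothetical helper names, and the format of the command's output is not assumed.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Optional


def build_command(pdf_path: str) -> list[str]:
    """Build the argument list for the docling-parse CLI."""
    return ["docling-parse", "--pdf", str(Path(pdf_path))]


def run_cli(pdf_path: str) -> Optional[subprocess.CompletedProcess]:
    """Run the CLI if it is on PATH; return None when it is not installed."""
    if shutil.which("docling-parse") is None:
        return None
    # check=False: the caller inspects returncode and stderr instead of catching an exception
    return subprocess.run(
        build_command(pdf_path), capture_output=True, text=True, check=False
    )


print(build_command("example.pdf"))
```

Passing the arguments as a list, instead of a single shell string, avoids quoting problems with paths that contain spaces.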

Performance Benchmarks

Coming soon - benchmarks will be updated for the current parser version.

For historical V1 vs V2 benchmarks, see legacy_performance_benchmarks.md.

Development

CXX

To build the parser, run the following command in the root folder:

rm -rf build; cmake -B ./build; cd build; make

You can run the parser from your build folder:

% ./parse.exe -h
program to process PDF files or configuration files
Usage:
  PDFProcessor [OPTION...]

  -i, --input arg          Input PDF file
  -c, --config arg         Config file
      --create-config arg  Create config file
  -p, --page arg           Pages to process (default: -1 for all) (default:
                           -1)
      --password arg       Password for accessing encrypted, password-protected files
  -o, --output arg         Output file
  -l, --loglevel arg       loglevel [error;warning;success;info]
  -h, --help               Print usage

If you don't have an input file, a template input file will be printed on the terminal.

Python

To build the package, simply run (make sure uv is installed),

uv sync

This will only work after a clean git clone. If you are developing and updating the C++ code, use:

# uv pip install --force-reinstall --no-deps -e .
rm -rf .venv; uv venv; uv pip install --force-reinstall --no-deps -e ".[perf-tools]"

To test the package, run:

uv run pytest ./tests -v -s

Contributing

Please read Contributing to Docling Parse for details.

References

If you use Docling in your projects, please consider citing the following:

@techreport{Docling,
  author = {Docling Team},
  month = {8},
  title = {Docling Technical Report},
  url = {https://arxiv.org/abs/2408.09869},
  eprint = {2408.09869},
  doi = {10.48550/arXiv.2408.09869},
  version = {1.0.0},
  year = {2024}
}

License

The Docling Parse codebase is under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.

LF AI & Data

Docling (and also docling-parse) is hosted as a project in the LF AI & Data Foundation.

IBM ❤️ Open Source AI

The project was started by the AI for knowledge team at IBM Research Zurich.
