Simple package to extract text with coordinates from programmatic PDFs

Docling Parse

Simple package to extract text, paths and bitmap images with coordinates from programmatic PDFs. This package is used in the Docling PDF conversion. Below, we show a few outputs of the latest parser with char-, word- and line-level output for text, in addition to the extracted paths and bitmap resources.

To generate the visualizations yourself, run the following command (replace word with char or line):

uv run python ./docling_parse/visualize.py -i <path-to-pdf-file> -c word --interactive

[Figure: example pages shown in four columns: original, char, word, line.]

Quick start

Install the package from PyPI:

pip install docling-parse

Convert a PDF (see visualize.py for a more detailed example):

from docling_core.types.doc.page import TextCellUnit
from docling_parse.pdf_parser import DoclingPdfParser, PdfDocument

parser = DoclingPdfParser()

pdf_doc: PdfDocument = parser.load(
    path_or_stream="<path-to-pdf>"
)

# PdfDocument.iterate_pages() will automatically populate pages as they are yielded.
for page_no, pred_page in pdf_doc.iterate_pages():

    # iterate over the word-cells
    for word in pred_page.iterate_cells(unit_type=TextCellUnit.WORD):
        print(word.rect, ": ", word.text)

    # create a PIL image with the char cells
    img = pred_page.render_as_image(cell_unit=TextCellUnit.CHAR)
    img.show()
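
Once words and their coordinates are available, a common next step is to select the text that falls inside a region of the page. The sketch below illustrates that idea only. The `Word` tuple and `words_in_region` are hypothetical stand-ins and not part of docling-parse; it assumes each word cell has already been reduced to an axis-aligned box with `x0 < x1` and `y0 < y1`.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Word(NamedTuple):
    # Hypothetical stand-in for a word cell: an axis-aligned box plus its text.
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def words_in_region(
    words: Iterable[Word], region: Tuple[float, float, float, float]
) -> List[Word]:
    """Return the words whose boxes overlap region = (x0, y0, x1, y1)."""
    rx0, ry0, rx1, ry1 = region
    return [
        w
        for w in words
        # Two boxes overlap when they overlap on both the x and the y axis.
        if w.x0 < rx1 and w.x1 > rx0 and w.y0 < ry1 and w.y1 > ry0
    ]


words = [
    Word(10, 10, 40, 20, "Docling"),
    Word(45, 10, 70, 20, "Parse"),
    Word(10, 300, 60, 310, "footer"),
]
top_of_page = words_in_region(words, (0, 0, 100, 50))
print(" ".join(w.text for w in top_of_page))
```

In real use, the tuples would be built from the cells yielded by the page iterator, after mapping their coordinates to whichever origin convention the region uses.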

Parallel parsing (multi-threaded)

Parse pages from one or more PDFs in parallel using a thread pool with backpressure:

from docling_parse.pdf_parser import (
    DoclingThreadedPdfParser,
    ThreadedPdfParserConfig,
)
from docling_parse.pdf_parsers import DecodePageConfig  # type: ignore[import]

parser_config = ThreadedPdfParserConfig(
    loglevel="fatal",
    threads=4,                # worker threads
    max_concurrent_results=32 # cap buffered results to limit memory
)
decode_config = DecodePageConfig()

parser = DoclingThreadedPdfParser(
    parser_config=parser_config,
    decode_config=decode_config,
)

# load one or more documents
for source in ["doc_a.pdf", "doc_b.pdf"]:
    parser.load(source)

# consume decoded pages as they become available
while parser.has_tasks():
    task = parser.get_task()

    if task.success:
        page_decoder, timings = task.get()
        print(f"{task.doc_key} p{task.page_number}: "
              f"{len(list(page_decoder.get_word_cells()))} words")
    else:
        print(f"error on {task.doc_key} p{task.page_number}: {task.error()}")
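
To make "a thread pool with backpressure" concrete, here is a minimal, standard-library sketch of the general pattern. It is not the implementation used by DoclingThreadedPdfParser. The function `run_with_backpressure` and its `work_fn` argument are hypothetical; the `threads` and `max_concurrent_results` parameters merely mirror the names in the configuration above.

```python
import queue
import threading


def run_with_backpressure(items, work_fn, threads=4, max_concurrent_results=32):
    """Yield (item, result, error) tuples while capping buffered results."""
    items = list(items)
    pending = queue.Queue()
    for item in items:
        pending.put(item)

    # A bounded queue provides the backpressure: put() blocks while it is full,
    # so workers pause until the consumer has taken results out.
    results = queue.Queue(maxsize=max_concurrent_results)

    def worker():
        while True:
            try:
                item = pending.get_nowait()
            except queue.Empty:
                return  # no work left for this thread
            try:
                outcome = (item, work_fn(item), None)
            except Exception as exc:  # report failures instead of killing the thread
                outcome = (item, None, exc)
            results.put(outcome)

    workers = [threading.Thread(target=worker, daemon=True) for _ in range(threads)]
    for thread in workers:
        thread.start()

    # Exactly one outcome is produced per item, in completion order.
    for _ in items:
        yield results.get()

    for thread in workers:
        thread.join()


for page, result, error in run_with_backpressure(range(5), lambda n: n * n, threads=2):
    status = f"failed: {error}" if error else f"ok: {result}"
    print(f"page {page} {status}")
```

A small `max_concurrent_results` keeps memory bounded when the consumer is slower than the workers, at the cost of workers idling while the buffer is full.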

Use the CLI

$ docling-parse -h
usage: docling-parse [-h] -p PDF

Process a PDF file.

options:
  -h, --help         show this help message and exit
  -p PDF, --pdf PDF  Path to the PDF file
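
For readers who want to wrap the parser in their own script, a command-line interface of the same shape can be sketched with `argparse`. This mirrors the help text above but is an illustrative assumption, not the package's actual entry point; `build_arg_parser` is a hypothetical name.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    """Build a parser with the same options as the help output shown above."""
    parser = argparse.ArgumentParser(
        prog="docling-parse",
        description="Process a PDF file.",
    )
    # -p/--pdf is mandatory, matching "usage: docling-parse [-h] -p PDF".
    parser.add_argument("-p", "--pdf", required=True, help="Path to the PDF file")
    return parser


args = build_arg_parser().parse_args(["--pdf", "example.pdf"])
print(args.pdf)
```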

Performance Benchmarks

Coming soon - benchmarks will be updated for the current parser version.

For historical V1 vs V2 benchmarks, see legacy_performance_benchmarks.md.

Development

CXX

To build the parser, simply run the following command in the root folder,

rm -rf build; cmake -B ./build; cd build; make

You can run the parser from your build folder:

% ./parse.exe -h
program to process PDF files or configuration files
Usage:
  PDFProcessor [OPTION...]

  -i, --input arg          Input PDF file
  -c, --config arg         Config file
      --create-config arg  Create config file
  -p, --page arg           Pages to process (default: -1 for all) (default:
                           -1)
      --password arg       Password for accessing encrypted, password-protected files
  -o, --output arg         Output file
  -l, --loglevel arg       loglevel [error;warning;success;info]
  -h, --help               Print usage

If you don't have an input file, a template input file will be printed on the terminal.

Python

To build the package, simply run (make sure uv is installed),

uv sync

This will only work after a clean git clone. If you are developing and updating the C++ code, please use:

# uv pip install --force-reinstall --no-deps -e .
rm -rf .venv; uv venv; uv pip install --force-reinstall --no-deps -e ".[perf-tools]"

To test the package, run:

uv run pytest ./tests -v -s

Contributing

Please read Contributing to Docling Parse for details.

References

If you use Docling in your projects, please consider citing the following:

@techreport{Docling,
  author = {Docling Team},
  month = {8},
  title = {Docling Technical Report},
  url = {https://arxiv.org/abs/2408.09869},
  eprint = {2408.09869},
  doi = {10.48550/arXiv.2408.09869},
  version = {1.0.0},
  year = {2024}
}

License

The Docling Parse codebase is released under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.

LF AI & Data

Docling (and also docling-parse) is hosted as a project in the LF AI & Data Foundation.

IBM ❤️ Open Source AI

The project was started by the AI for knowledge team at IBM Research Zurich.
