Simple package to extract text with coordinates from programmatic PDFs

Docling Parse

Simple package to extract text, paths and bitmap images with coordinates from programmatic PDFs. This package is used in the Docling PDF conversion. Below, we show a few outputs of the latest parser, with char-, word- and line-level output for text, in addition to the extracted paths and bitmap resources.

To produce the visualizations yourself, run the following command (change word to char or line):

uv run python ./docling_parse/visualize.py -i <path-to-pdf-file> -c word --interactive

Example outputs (screenshots omitted): five sample pages, each shown as the original next to its char-, word- and line-level output.

Quick start

Install the package from PyPI:

pip install docling-parse

Convert a PDF (see visualize.py for a more detailed example):

from docling_core.types.doc.page import TextCellUnit
from docling_parse.pdf_parser import DoclingPdfParser, PdfDocument

parser = DoclingPdfParser()

pdf_doc: PdfDocument = parser.load(
    path_or_stream="<path-to-pdf>"
)

# PdfDocument.iterate_pages() will automatically populate pages as they are yielded.
for page_no, pred_page in pdf_doc.iterate_pages():

    # iterate over the word-cells
    for word in pred_page.iterate_cells(unit_type=TextCellUnit.WORD):
        print(word.rect, ": ", word.text)

    # create a PIL image with the char cells
    img = pred_page.render_as_image(cell_unit=TextCellUnit.CHAR)
    img.show()
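
Once words come with coordinates, a common follow-up is to keep only the text that falls inside a region of the page. The sketch below is a hypothetical helper, not part of docling-parse: it assumes you have already copied each word's text and an axis-aligned bounding box into plain tuples. `Box`, `normalized` and `words_in_region` are illustrative names, and how you derive the four numbers from `word.rect` is left to you.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned box given by two opposite corners (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float


def normalized(box: Box) -> Box:
    # Order the corners so the result does not depend on whether the
    # page origin is at the top-left or the bottom-left.
    return Box(
        min(box.x0, box.x1),
        min(box.y0, box.y1),
        max(box.x0, box.x1),
        max(box.y0, box.y1),
    )


def words_in_region(words: Iterable[Tuple[str, Box]], region: Box) -> List[str]:
    """Return the texts whose boxes lie entirely inside `region`."""
    region = normalized(region)
    hits = []
    for text, box in words:
        b = normalized(box)
        inside_x = region.x0 <= b.x0 and b.x1 <= region.x1
        inside_y = region.y0 <= b.y0 and b.y1 <= region.y1
        if inside_x and inside_y:
            hits.append(text)
    return hits
```

Requiring full containment keeps the filter strict; an overlap test may be a better fit if words are allowed to straddle the region border.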

Parallel parsing (multi-threaded)

Parse pages from one or more PDFs in parallel using a thread pool with backpressure:

from docling_parse.pdf_parser import (
    DoclingThreadedPdfParser,
    ThreadedPdfParserConfig,
)
from docling_parse.pdf_parsers import DecodePageConfig  # type: ignore[import]

parser_config = ThreadedPdfParserConfig(
    loglevel="fatal",
    threads=4,                # worker threads
    max_concurrent_results=32 # cap buffered results to limit memory
)
decode_config = DecodePageConfig()

parser = DoclingThreadedPdfParser(
    parser_config=parser_config,
    decode_config=decode_config,
)

# load one or more documents
for source in ["doc_a.pdf", "doc_b.pdf"]:
    parser.load(source)

# consume decoded pages as they become available
while parser.has_tasks():
    task = parser.get_task()

    if task.success:
        page_decoder, timings = task.get()
        print(f"{task.doc_key} p{task.page_number}: "
              f"{len(list(page_decoder.get_word_cells()))} words")
    else:
        print(f"error on {task.doc_key} p{task.page_number}: {task.error()}")
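
To make "a thread pool with backpressure" concrete, here is a minimal standard-library sketch of the general idea. It is not the implementation used by `DoclingThreadedPdfParser`. The function `parse_pages_bounded` and its `work` callback are hypothetical; the `threads` and `max_concurrent_results` parameter names are borrowed from the config above only for readability.

```python
import queue
import threading


def parse_pages_bounded(pages, work, threads=4, max_concurrent_results=32):
    """Yield (page, result, error) tuples as worker threads finish pages.

    At most `max_concurrent_results` finished results are buffered; workers
    block on the full queue until the consumer catches up (backpressure).
    """
    pages = list(pages)
    tasks = queue.Queue()
    for page in pages:
        tasks.put(page)

    # A bounded queue: put() blocks while the buffer is full.
    results = queue.Queue(maxsize=max_concurrent_results)

    def worker():
        while True:
            try:
                page = tasks.get_nowait()
            except queue.Empty:
                return  # no work left
            try:
                outcome = (page, work(page), None)
            except Exception as exc:  # report the failure instead of dying
                outcome = (page, None, exc)
            results.put(outcome)

    workers = [threading.Thread(target=worker, daemon=True) for _ in range(threads)]
    for w in workers:
        w.start()

    # Exactly one outcome is produced per page, in completion order.
    for _ in range(len(pages)):
        yield results.get()

    for w in workers:
        w.join()
```

A consumer would loop over `parse_pages_bounded(pages, work)` and check the third tuple element for an error, much like the `task.success` branch above. The cap bounds memory use when parsing is faster than consumption.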

Use the CLI

$ docling-parse -h
usage: docling-parse [-h] -p PDF

Process a PDF file.

options:
  -h, --help         show this help message and exit
  -p PDF, --pdf PDF  Path to the PDF file
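
If you want to call the CLI from a script, a thin wrapper along these lines may be enough. It relies only on the `--pdf` option listed in the help text above. The helper names `build_cli_command` and `run_cli` are hypothetical, and the sketch makes no assumption about what the command prints.

```python
import subprocess
from pathlib import Path


def build_cli_command(pdf_path):
    """Build the argument list for `docling-parse --pdf <path>`."""
    return ["docling-parse", "--pdf", str(Path(pdf_path))]


def run_cli(pdf_path):
    """Run the CLI and return the CompletedProcess without raising on failure."""
    return subprocess.run(
        build_cli_command(pdf_path),
        capture_output=True,  # keep stdout/stderr for the caller to inspect
        text=True,
        check=False,
    )
```

Passing the arguments as a list avoids shell quoting issues with paths that contain spaces.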

Performance Benchmarks

Coming soon - benchmarks will be updated for the current parser version.

For historical V1 vs V2 benchmarks, see legacy_performance_benchmarks.md.

Development

CXX

To build the parser, run the following command in the root folder:

rm -rf build; cmake -B ./build; cd build; make

You can run the parser from your build folder:

% ./parse.exe -h
program to process PDF files or configuration files
Usage:
  PDFProcessor [OPTION...]

  -i, --input arg          Input PDF file
  -c, --config arg         Config file
      --create-config arg  Create config file
  -p, --page arg           Pages to process (default: -1 for all) (default:
                           -1)
      --password arg       Password for accessing encrypted, password-protected files
  -o, --output arg         Output file
  -l, --loglevel arg       loglevel [error;warning;success;info]
  -h, --help               Print usage

If you don't have an input file, a template input file will be printed on the terminal.

Python

To build the package, run the following (make sure uv is installed):

uv sync

The command above only works after a clean git clone. If you are developing and updating C++ code, use the following instead:

# uv pip install --force-reinstall --no-deps -e .
rm -rf .venv; uv venv; uv pip install --force-reinstall --no-deps -e ".[perf-tools]"

To test the package, run:

uv run pytest ./tests -v -s

Contributing

Please read Contributing to Docling Parse for details.

References

If you use Docling in your projects, please consider citing the following:

@techreport{Docling,
  author = {Docling Team},
  month = {8},
  title = {Docling Technical Report},
  url = {https://arxiv.org/abs/2408.09869},
  eprint = {2408.09869},
  doi = {10.48550/arXiv.2408.09869},
  version = {1.0.0},
  year = {2024}
}

License

The Docling Parse codebase is released under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.

LF AI & Data

Docling (and also docling-parse) is hosted as a project in the LF AI & Data Foundation.

IBM ❤️ Open Source AI

The project was started by the AI for knowledge team at IBM Research Zurich.
