
Simple package to extract text with coordinates from programmatic PDFs


Docling Parse


Simple package to extract text, paths and bitmap images with coordinates from programmatic PDFs. This package is used in the Docling PDF conversion. Below, we show a few outputs of the latest parser with char-, word- and line-level output for text, in addition to the extracted paths and bitmap resources.

To produce the visualizations yourself, run the following command (replace word with char or line):

uv run python ./docling_parse/visualize.py -i <path-to-pdf-file> -c word --interactive

[Screenshots: five example pages, each shown in four columns: original, char, word and line]

Quick start

Install the package from PyPI:

pip install docling-parse

Convert a PDF (see visualize.py for a more detailed example):

from docling_core.types.doc.page import TextCellUnit
from docling_parse.pdf_parser import DoclingPdfParser, PdfDocument

parser = DoclingPdfParser()

pdf_doc: PdfDocument = parser.load(
    path_or_stream="<path-to-pdf>"
)

# PdfDocument.iterate_pages() will automatically populate pages as they are yielded.
for page_no, pred_page in pdf_doc.iterate_pages():

    # iterate over the word-cells
    for word in pred_page.iterate_cells(unit_type=TextCellUnit.WORD):
        print(word.rect, ": ", word.text)

    # create a PIL image with the char cells
    img = pred_page.render_as_image(cell_unit=TextCellUnit.CHAR)
    img.show()
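
Once word cells and their coordinates are available, a common next step is to arrange them in reading order. The following is a minimal sketch that does not call docling-parse at all: `Word` is a hypothetical stand-in for a word cell (text plus an axis-aligned box), `group_into_lines` and `y_tol` are illustrative names, and the y axis is assumed to grow downward. Words are merged into one line when their vertical centers are within `y_tol` of the first word of that line.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical stand-in for a word cell: text plus an axis-aligned box.
    text: str
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def y_center(self) -> float:
        return (self.y0 + self.y1) / 2.0


def group_into_lines(words: list[Word], y_tol: float = 3.0) -> list[str]:
    """Group words into text lines, top to bottom, left to right."""
    lines: list[list[Word]] = []
    # Visit words from top to bottom (assumes y grows downward).
    for w in sorted(words, key=lambda w: (w.y_center, w.x0)):
        # Compare against the first word of the current line.
        if lines and abs(lines[-1][0].y_center - w.y_center) <= y_tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    # Within each line, order words by their left edge.
    return [
        " ".join(w.text for w in sorted(line, key=lambda w: w.x0))
        for line in lines
    ]


words = [
    Word("world", 60, 10, 100, 20),
    Word("Hello", 10, 11, 50, 21),
    Word("Second", 10, 30, 60, 40),
    Word("line", 65, 30, 95, 40),
]
print(group_into_lines(words))  # ['Hello world', 'Second line']
```

If the coordinate origin is at the bottom-left of the page instead, the sort key on `y_center` would need to be negated.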

Parallel parsing (multi-threaded)

Parse pages from one or more PDFs in parallel using a thread pool with backpressure:

from docling_parse.pdf_parser import (
    DoclingThreadedPdfParser,
    ThreadedPdfParserConfig,
)
from docling_parse.pdf_parsers import DecodePageConfig  # type: ignore[import]

parser_config = ThreadedPdfParserConfig(
    loglevel="fatal",
    threads=4,                # worker threads
    max_concurrent_results=32 # cap buffered results to limit memory
)
decode_config = DecodePageConfig()

parser = DoclingThreadedPdfParser(
    parser_config=parser_config,
    decode_config=decode_config,
)

# load one or more documents
for source in ["doc_a.pdf", "doc_b.pdf"]:
    parser.load(source)

# consume decoded pages as they become available
while parser.has_tasks():
    task = parser.get_task()

    if task.success:
        page_decoder, timings = task.get()
        print(f"{task.doc_key} p{task.page_number}: "
              f"{len(list(page_decoder.get_word_cells()))} words")
    else:
        print(f"error on {task.doc_key} p{task.page_number}: {task.error()}")
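
The idea of a thread pool with backpressure can be illustrated with the standard library alone. The sketch below is not how docling-parse implements it; `fake_decode` and `run_pool` are hypothetical names, and the parameter names merely mirror the configuration above. Worker threads push results into a bounded `queue.Queue`; once it holds `max_concurrent_results` items, `put` blocks, so the workers cannot run ahead of the consumer and memory use stays capped.

```python
import queue
import threading


def fake_decode(doc_key: str, page_number: int) -> str:
    # Hypothetical stand-in for decoding one page.
    return f"{doc_key} p{page_number}"


def run_pool(tasks: list, threads: int = 4, max_concurrent_results: int = 2):
    """Decode tasks on worker threads; yield results as they become available."""
    todo: queue.Queue = queue.Queue()
    for t in tasks:
        todo.put(t)
    # Bounded queue: put() blocks when full, which throttles the workers.
    done: queue.Queue = queue.Queue(maxsize=max_concurrent_results)

    def worker() -> None:
        while True:
            try:
                doc_key, page_number = todo.get_nowait()
            except queue.Empty:
                return  # no work left
            done.put(fake_decode(doc_key, page_number))

    workers = [threading.Thread(target=worker, daemon=True) for _ in range(threads)]
    for w in workers:
        w.start()
    # Exactly one result is produced per task.
    for _ in range(len(tasks)):
        yield done.get()
    for w in workers:
        w.join()


tasks = [("doc_a.pdf", n) for n in range(1, 6)]
results = sorted(run_pool(tasks))
print(results)
```

Results arrive in completion order rather than submission order, which is why the example sorts them before printing.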

Use the CLI

$ docling-parse -h
usage: docling-parse [-h] -p PDF

Process a PDF file.

options:
  -h, --help         show this help message and exit
  -p PDF, --pdf PDF  Path to the PDF file
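
To call the CLI from a script, one option is to build the argument list and pass it to `subprocess.run`. The sketch below relies only on the `-p` option shown in the help output; `build_cli_args` is a hypothetical helper, and the actual invocation is left commented out because it needs the package installed and an existing PDF.

```python
import subprocess
from pathlib import Path


def build_cli_args(pdf_path: str) -> list[str]:
    # Only the -p option from the help output is used here.
    return ["docling-parse", "-p", str(Path(pdf_path))]


args = build_cli_args("doc_a.pdf")
print(args)

# Uncomment to run; requires docling-parse on PATH and an existing PDF.
# completed = subprocess.run(args, capture_output=True, text=True, check=False)
# print(completed.returncode)
```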

Performance Benchmarks

Coming soon - benchmarks will be updated for the current parser version.

For historical V1 vs V2 benchmarks, see legacy_performance_benchmarks.md.

Development

CXX

To build the parser, run the following command in the root folder:

rm -rf build; cmake -B ./build; cd build; make

You can run the parser from your build folder:

% ./parse.exe -h
program to process PDF files or configuration files
Usage:
  PDFProcessor [OPTION...]

  -i, --input arg          Input PDF file
  -c, --config arg         Config file
      --create-config arg  Create config file
  -p, --page arg           Pages to process (default: -1 for all) (default:
                           -1)
      --password arg       Password for accessing encrypted, password-protected files
  -o, --output arg         Output file
  -l, --loglevel arg       loglevel [error;warning;success;info]
  -h, --help               Print usage

If you don't have an input file, a template input file will be printed on the terminal.

Python

To build the package, run the following (make sure uv is installed):

uv sync

This only works after a clean git clone. If you are developing and updating the C++ code, use the following instead:

# uv pip install --force-reinstall --no-deps -e .
rm -rf .venv; uv venv; uv pip install --force-reinstall --no-deps -e ".[perf-tools]"

To test the package, run:

uv run pytest ./tests -v -s

Contributing

Please read Contributing to Docling Parse for details.

References

If you use Docling in your projects, please consider citing the following:

@techreport{Docling,
  author = {Docling Team},
  month = {8},
  title = {Docling Technical Report},
  url = {https://arxiv.org/abs/2408.09869},
  eprint = {2408.09869},
  doi = {10.48550/arXiv.2408.09869},
  version = {1.0.0},
  year = {2024}
}

License

The Docling Parse codebase is under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.

LF AI & Data

Docling (and also docling-parse) is hosted as a project in the LF AI & Data Foundation.

IBM ❤️ Open Source AI

The project was started by the AI for knowledge team at IBM Research Zurich.
