Simple package to extract text with coordinates from programmatic PDFs

Docling Parse

Simple package to extract text, paths and bitmap images with coordinates from programmatic PDFs. This package is used in the Docling PDF conversion. Below, we show a few outputs of the latest parser with char-, word- and line-level output for text, in addition to the extracted paths and bitmap resources.

To produce these visualizations yourself, run the following command (replace word with char or line):

uv run python ./docling_parse/visualize.py -i <path-to-pdf-file> -c word --interactive

[Screenshots: five sample pages, each shown in four columns: original, char, word, line.]

Quick start

Install the package from PyPI:

pip install docling-parse

Convert a PDF (see visualize.py for a more detailed example):

from docling_core.types.doc.page import TextCellUnit
from docling_parse.pdf_parser import DoclingPdfParser, PdfDocument

parser = DoclingPdfParser()

pdf_doc: PdfDocument = parser.load(
    path_or_stream="<path-to-pdf>"
)

# PdfDocument.iterate_pages() will automatically populate pages as they are yielded.
for page_no, pred_page in pdf_doc.iterate_pages():

    # iterate over the word-cells
    for word in pred_page.iterate_cells(unit_type=TextCellUnit.WORD):
        print(word.rect, ": ", word.text)

    # create a PIL image with the char cells
    img = pred_page.render_as_image(cell_unit=TextCellUnit.CHAR)
    img.show()
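
If you want the extracted text as plain data rather than printed output, the same loop can be wrapped in a small helper. The sketch below is only an illustration: `collect_text_by_page` is a hypothetical name, not part of docling-parse, and it relies solely on the `iterate_pages()`, `iterate_cells(unit_type=...)` and `.text` calls shown above.

```python
def collect_text_by_page(pdf_doc, unit_type):
    """Return {page_no: [cell text, ...]} for every page of a loaded document.

    Hypothetical helper: it only uses the calls from the quick-start snippet
    (iterate_pages, iterate_cells, cell.text) and makes no other assumptions
    about the objects it receives.
    """
    return {
        page_no: [cell.text for cell in page.iterate_cells(unit_type=unit_type)]
        for page_no, page in pdf_doc.iterate_pages()
    }
```

With the objects from the snippet above, this would be called as `collect_text_by_page(pdf_doc, TextCellUnit.WORD)`.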

Parallel parsing (multi-threaded)

Parse pages from one or more PDFs in parallel using a thread pool with backpressure:

from docling_parse.pdf_parser import (
    DoclingThreadedPdfParser,
    ThreadedPdfParserConfig,
)
from docling_parse.pdf_parsers import DecodePageConfig  # type: ignore[import]

parser_config = ThreadedPdfParserConfig(
    loglevel="fatal",
    threads=4,                # worker threads
    max_concurrent_results=32 # cap buffered results to limit memory
)
decode_config = DecodePageConfig()

parser = DoclingThreadedPdfParser(
    parser_config=parser_config,
    decode_config=decode_config,
)

# load one or more documents
for source in ["doc_a.pdf", "doc_b.pdf"]:
    parser.load(source)

# consume decoded pages as they become available
while parser.has_tasks():
    task = parser.get_task()

    if task.success:
        page_decoder, timings = task.get()
        print(f"{task.doc_key} p{task.page_number}: "
              f"{len(list(page_decoder.get_word_cells()))} words")
    else:
        print(f"error on {task.doc_key} p{task.page_number}: {task.error()}")
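
To make the term "backpressure" concrete: capping the number of buffered results means workers must wait once the buffer is full, so memory stays bounded when the consumer is slower than the parser. The sketch below illustrates that idea with the Python standard library only. It is not how docling-parse is implemented; `parse_with_backpressure` and `decode` are hypothetical names, and only `threads` and `max_concurrent_results` echo the configuration shown above.

```python
import queue
import threading


def parse_with_backpressure(pages, decode, threads=4, max_concurrent_results=32):
    """Yield (page, result, error) tuples as workers finish them.

    Illustrative only: `decode` stands in for whatever per-page work is done.
    The results queue is bounded, so a worker blocks in put() when the
    consumer falls behind.
    """
    pages = list(pages)
    tasks = queue.Queue()
    for page in pages:
        tasks.put(page)

    # Bounded buffer: this is the backpressure mechanism.
    results = queue.Queue(maxsize=max_concurrent_results)

    def worker():
        while True:
            try:
                page = tasks.get_nowait()
            except queue.Empty:
                return  # no work left
            try:
                outcome = (page, decode(page), None)
            except Exception as exc:  # report the failure instead of dying
                outcome = (page, None, exc)
            results.put(outcome)  # blocks while the buffer is full

    workers = [threading.Thread(target=worker, daemon=True) for _ in range(threads)]
    for w in workers:
        w.start()

    # Exactly one outcome is produced per page, in completion order.
    for _ in range(len(pages)):
        yield results.get()

    for w in workers:
        w.join()
```

Results arrive in completion order, not submission order, which is why each tuple carries its page along with the result or the error.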

Use the CLI

$ docling-parse -h
usage: docling-parse [-h] -p PDF

Process a PDF file.

options:
  -h, --help         show this help message and exit
  -p PDF, --pdf PDF  Path to the PDF file
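
If you need to call the CLI from a script, a thin wrapper around `subprocess` is usually enough. The sketch below uses only the `-p` flag from the help text above; `build_cli_command` and `run_cli` are hypothetical helper names, and what the command writes to stdout is not documented here, so the output is returned as-is.

```python
import subprocess


def build_cli_command(pdf_path):
    """Build the argument list for `docling-parse -p <pdf>`."""
    return ["docling-parse", "-p", str(pdf_path)]


def run_cli(pdf_path):
    """Run the CLI and return its raw stdout (assumes docling-parse is on PATH).

    Raises subprocess.CalledProcessError on a non-zero exit code.
    """
    completed = subprocess.run(
        build_cli_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```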

Performance Benchmarks

Coming soon - benchmarks will be updated for the current parser version.

For historical V1 vs V2 benchmarks, see legacy_performance_benchmarks.md.

Development

CXX

To build the parser, run the following command in the root folder:

rm -rf build; cmake -B ./build; cd build; make

You can run the parser from your build folder:

% ./parse.exe -h
program to process PDF files or configuration files
Usage:
  PDFProcessor [OPTION...]

  -i, --input arg          Input PDF file
  -c, --config arg         Config file
      --create-config arg  Create config file
  -p, --page arg           Pages to process (default: -1 for all) (default:
                           -1)
      --password arg       Password for accessing encrypted, password-protected files
  -o, --output arg         Output file
  -l, --loglevel arg       loglevel [error;warning;success;info]
  -h, --help               Print usage

If you don't have an input file, a template input file will be printed on the terminal.

Python

To build the package, run the following (make sure uv is installed):

uv sync

This only works after a clean git clone. If you are developing and updating the C++ code, use the following instead:

# uv pip install --force-reinstall --no-deps -e .
rm -rf .venv; uv venv; uv pip install --force-reinstall --no-deps -e ".[perf-tools]"

To test the package, run:

uv run pytest ./tests -v -s

Contributing

Please read Contributing to Docling Parse for details.

References

If you use Docling in your projects, please consider citing the following:

@techreport{Docling,
  author = {Docling Team},
  month = {8},
  title = {Docling Technical Report},
  url = {https://arxiv.org/abs/2408.09869},
  eprint = {2408.09869},
  doi = {10.48550/arXiv.2408.09869},
  version = {1.0.0},
  year = {2024}
}

License

The Docling Parse codebase is released under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.

LF AI & Data

Docling (and also docling-parse) is hosted as a project in the LF AI & Data Foundation.

IBM ❤️ Open Source AI

The project was started by the AI for knowledge team at IBM Research Zurich.
