Simple package to extract text with coordinates from programmatic PDFs

Docling Parse

Simple package to extract text, paths and bitmap images with coordinates from programmatic PDFs. This package is used in the Docling PDF conversion. Below, we show a few outputs of the latest parser, with char-, word- and line-level output for text, in addition to the extracted paths and bitmap resources.

To reproduce the visualizations yourself, run the following command (replace word with char or line to change the cell level):

uv run python ./docling_parse/visualize.py -i <path-to-pdf-file> -c word --interactive

[Screenshots: five sample pages, each shown as original, char, word and line views.]

Quick start

Install the package from PyPI:

pip install docling-parse

Convert a PDF (see visualize.py for a more detailed example):

from docling_core.types.doc.page import TextCellUnit
from docling_parse.pdf_parser import DoclingPdfParser, PdfDocument

parser = DoclingPdfParser()

pdf_doc: PdfDocument = parser.load(
    path_or_stream="<path-to-pdf>"
)

# PdfDocument.iterate_pages() will automatically populate pages as they are yielded.
for page_no, pred_page in pdf_doc.iterate_pages():

    # iterate over the word-cells
    for word in pred_page.iterate_cells(unit_type=TextCellUnit.WORD):
        print(word.rect, ": ", word.text)

    # create a PIL image with the char cells
    img = pred_page.render_as_image(cell_unit=TextCellUnit.CHAR)
    img.show()
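
The same loop can be used to pull plain text out of each page. The sketch below is a minimal example that relies only on the calls shown above plus TextCellUnit.LINE; the helper names join_cell_text and iter_page_text are hypothetical, and it assumes that line cells are yielded in a usable order, which may not match reading order on complex layouts.

```python
import sys
from typing import Iterable, Iterator, Tuple


def join_cell_text(cells: Iterable, sep: str = "\n") -> str:
    """Join the `.text` of each cell in iteration order, skipping blank cells."""
    return sep.join(c.text for c in cells if c.text and c.text.strip())


def iter_page_text(pdf_path: str) -> Iterator[Tuple[int, str]]:
    """Yield (page_no, text) per page, built from the line-level cells."""
    # Imported lazily so join_cell_text stays usable without the parser installed.
    from docling_core.types.doc.page import TextCellUnit
    from docling_parse.pdf_parser import DoclingPdfParser

    pdf_doc = DoclingPdfParser().load(path_or_stream=pdf_path)
    for page_no, page in pdf_doc.iterate_pages():
        lines = page.iterate_cells(unit_type=TextCellUnit.LINE)
        yield page_no, join_cell_text(lines)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python page_text.py <path-to-pdf>
    for page_no, text in iter_page_text(sys.argv[1]):
        print(f"--- page {page_no} ---")
        print(text)
```

Keeping the joining logic in its own function makes it easy to swap the separator or switch to word cells without touching the parsing code.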

Parallel parsing (multi-threaded)

Parse pages from one or more PDFs in parallel using a thread pool with backpressure:

from docling_parse.pdf_parser import (
    DoclingThreadedPdfParser,
    ThreadedPdfParserConfig,
)
from docling_parse.pdf_parsers import DecodePageConfig  # type: ignore[import]

parser_config = ThreadedPdfParserConfig(
    loglevel="fatal",
    threads=4,                # worker threads
    max_concurrent_results=32 # cap buffered results to limit memory
)
decode_config = DecodePageConfig()

parser = DoclingThreadedPdfParser(
    parser_config=parser_config,
    decode_config=decode_config,
)

# load one or more documents
for source in ["doc_a.pdf", "doc_b.pdf"]:
    parser.load(source)

# consume decoded pages as they become available
while parser.has_tasks():
    task = parser.get_task()

    if task.success:
        page_decoder, timings = task.get()
        print(f"{task.doc_key} p{task.page_number}: "
              f"{len(list(page_decoder.get_word_cells()))} words")
    else:
        print(f"error on {task.doc_key} p{task.page_number}: {task.error()}")
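
When results need to be aggregated rather than printed, the consume loop can be wrapped in a small helper. This is a sketch only: drain_word_counts is a hypothetical name, and it uses nothing beyond the parser and task calls shown above (has_tasks, get_task, success, get, error, doc_key, page_number, get_word_cells).

```python
from collections import Counter
from typing import Dict, List, Tuple


def drain_word_counts(parser) -> Tuple[Dict[str, int], List[str]]:
    """Consume all pending tasks; return (word count per document, error messages)."""
    words_per_doc: Counter = Counter()
    errors: List[str] = []

    while parser.has_tasks():
        task = parser.get_task()
        if task.success:
            # task.get() returns the decoded page and its timings, as in the loop above.
            page_decoder, _timings = task.get()
            words_per_doc[task.doc_key] += len(list(page_decoder.get_word_cells()))
        else:
            errors.append(f"{task.doc_key} p{task.page_number}: {task.error()}")

    return dict(words_per_doc), errors
```

Call it after loading the documents, for example counts, errors = drain_word_counts(parser). Failed pages are collected instead of aborting the run, so one broken page does not discard the results of the others.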

Use the CLI

$ docling-parse -h
usage: docling-parse [-h] -p PDF

Process a PDF file.

options:
  -h, --help         show this help message and exit
  -p PDF, --pdf PDF  Path to the PDF file
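
The CLI can also be driven from a script. The sketch below assumes only the -p flag listed in the help text above; build_cli_args and run_cli are hypothetical helper names, and the format of the CLI's output is not documented here, so it is returned unparsed.

```python
import shutil
import subprocess
from typing import List


def build_cli_args(pdf_path: str) -> List[str]:
    """Build the argument list for `docling-parse -p <pdf_path>`."""
    return ["docling-parse", "-p", pdf_path]


def run_cli(pdf_path: str) -> subprocess.CompletedProcess:
    """Run the CLI on one PDF and return the completed process (stdout/stderr as text)."""
    if shutil.which("docling-parse") is None:
        raise FileNotFoundError("docling-parse is not on PATH; install the package first")
    # check=False: let the caller inspect returncode and stderr instead of raising.
    return subprocess.run(
        build_cli_args(pdf_path), capture_output=True, text=True, check=False
    )
```

Passing the arguments as a list avoids shell quoting issues with paths that contain spaces.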

Performance Benchmarks

Coming soon - benchmarks will be updated for the current parser version.

For historical V1 vs V2 benchmarks, see legacy_performance_benchmarks.md.

Development

CXX

To build the parser, run the following command in the root folder:

rm -rf build; cmake -B ./build; cd build; make

You can run the parser from your build folder:

% ./parse.exe -h
program to process PDF files or configuration files
Usage:
  PDFProcessor [OPTION...]

  -i, --input arg          Input PDF file
  -c, --config arg         Config file
      --create-config arg  Create config file
  -p, --page arg           Pages to process (default: -1 for all) (default:
                           -1)
      --password arg       Password for accessing encrypted, password-protected files
  -o, --output arg         Output file
  -l, --loglevel arg       loglevel [error;warning;success;info]
  -h, --help               Print usage

If you don't have an input file, a template input file will be printed on the terminal.

Python

To build the package, run the following (make sure uv is installed):

uv sync

This only works after a clean git clone. If you are developing and updating the C++ code, use the following instead:

# uv pip install --force-reinstall --no-deps -e .
rm -rf .venv; uv venv; uv pip install --force-reinstall --no-deps -e ".[perf-tools]"

To test the package, run:

uv run pytest ./tests -v -s

Contributing

Please read Contributing to Docling Parse for details.

References

If you use Docling in your projects, please consider citing the following:

@techreport{Docling,
  author = {Docling Team},
  month = {8},
  title = {Docling Technical Report},
  url = {https://arxiv.org/abs/2408.09869},
  eprint = {2408.09869},
  doi = {10.48550/arXiv.2408.09869},
  version = {1.0.0},
  year = {2024}
}

License

The Docling Parse codebase is released under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.

LF AI & Data

Docling (and also docling-parse) is hosted as a project in the LF AI & Data Foundation.

IBM ❤️ Open Source AI

The project was started by the AI for knowledge team at IBM Research Zurich.

Download files

Download the file for your platform.

Source Distribution

docling_parse-5.10.1.tar.gz (6.7 MB): Source

Built Distributions

docling_parse-5.10.1-cp314-cp314-win_amd64.whl (11.3 MB): CPython 3.14, Windows x86-64
docling_parse-5.10.1-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (10.2 MB): CPython 3.14, manylinux glibc 2.27+ / 2.28+ x86-64
docling_parse-5.10.1-cp314-cp314-manylinux_2_26_aarch64.manylinux_2_28_aarch64.whl (9.8 MB): CPython 3.14, manylinux glibc 2.26+ / 2.28+ ARM64
docling_parse-5.10.1-cp314-cp314-macosx_14_0_arm64.whl (9.1 MB): CPython 3.14, macOS 14.0+ ARM64
docling_parse-5.10.1-cp313-cp313-win_amd64.whl (10.9 MB): CPython 3.13, Windows x86-64
docling_parse-5.10.1-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (10.2 MB): CPython 3.13, manylinux glibc 2.27+ / 2.28+ x86-64
docling_parse-5.10.1-cp313-cp313-manylinux_2_26_aarch64.manylinux_2_28_aarch64.whl (9.8 MB): CPython 3.13, manylinux glibc 2.26+ / 2.28+ ARM64
docling_parse-5.10.1-cp313-cp313-macosx_14_0_arm64.whl (9.1 MB): CPython 3.13, macOS 14.0+ ARM64
docling_parse-5.10.1-cp312-cp312-win_amd64.whl (10.9 MB): CPython 3.12, Windows x86-64
docling_parse-5.10.1-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (10.2 MB): CPython 3.12, manylinux glibc 2.27+ / 2.28+ x86-64
docling_parse-5.10.1-cp312-cp312-manylinux_2_26_aarch64.manylinux_2_28_aarch64.whl (9.8 MB): CPython 3.12, manylinux glibc 2.26+ / 2.28+ ARM64
docling_parse-5.10.1-cp312-cp312-macosx_14_0_arm64.whl (9.1 MB): CPython 3.12, macOS 14.0+ ARM64
docling_parse-5.10.1-cp311-cp311-win_amd64.whl (10.9 MB): CPython 3.11, Windows x86-64
docling_parse-5.10.1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (10.2 MB): CPython 3.11, manylinux glibc 2.27+ / 2.28+ x86-64
docling_parse-5.10.1-cp311-cp311-manylinux_2_26_aarch64.manylinux_2_28_aarch64.whl (9.8 MB): CPython 3.11, manylinux glibc 2.26+ / 2.28+ ARM64
docling_parse-5.10.1-cp311-cp311-macosx_14_0_arm64.whl (9.1 MB): CPython 3.11, macOS 14.0+ ARM64
docling_parse-5.10.1-cp310-cp310-win_amd64.whl (10.9 MB): CPython 3.10, Windows x86-64
docling_parse-5.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (10.1 MB): CPython 3.10, manylinux glibc 2.17+ x86-64
docling_parse-5.10.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (9.8 MB): CPython 3.10, manylinux glibc 2.17+ ARM64
docling_parse-5.10.1-cp310-cp310-macosx_14_0_arm64.whl (9.1 MB): CPython 3.10, macOS 14.0+ ARM64

