Simple package to extract text with coordinates from programmatic PDFs

Docling Parse

Simple package to extract text, paths and bitmap images with coordinates from programmatic PDFs. This package is used in the Docling PDF conversion. Below, we show a few outputs of the latest parser, with char-, word- and line-level output for text, in addition to the extracted paths and bitmap resources.

To reproduce the visualizations yourself, run the following command (replace word with char or line):

uv run python ./docling_parse/visualize.py -i <path-to-pdf-file> -c word --interactive

[Screenshots: each row shows the original page next to the char-, word- and line-level parser output.]

Quick start

Install the package from PyPI:

pip install docling-parse

Convert a PDF (see visualize.py for a more detailed example):

from docling_core.types.doc.page import TextCellUnit
from docling_parse.pdf_parser import DoclingPdfParser, PdfDocument

parser = DoclingPdfParser()

pdf_doc: PdfDocument = parser.load(
    path_or_stream="<path-to-pdf>"
)

# PdfDocument.iterate_pages() will automatically populate pages as they are yielded.
for page_no, pred_page in pdf_doc.iterate_pages():

    # iterate over the word-cells
    for word in pred_page.iterate_cells(unit_type=TextCellUnit.WORD):
        print(word.rect, ": ", word.text)

    # create a PIL image with the char cells
    img = pred_page.render_as_image(cell_unit=TextCellUnit.CHAR)
    img.show()
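
Because every cell carries coordinates, a typical follow-up is to keep only the words that fall inside a region of interest. The sketch below illustrates that idea without calling docling-parse: `Box`, `words_in_region` and the sample data are hypothetical stand-ins, and the real `word.rect` object may expose its corners under different attribute names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned box; NOT the rect type returned by docling-parse."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, other: "Box") -> bool:
        """True if `other` lies entirely inside this box (edges inclusive)."""
        return (
            self.x0 <= other.x0 and self.y0 <= other.y0
            and other.x1 <= self.x1 and other.y1 <= self.y1
        )


def words_in_region(words: list[tuple[Box, str]], region: Box) -> list[str]:
    """Return the text of every word whose box is fully inside `region`, in input order."""
    return [text for box, text in words if region.contains(box)]


# Made-up sample data standing in for the (rect, text) pairs printed above.
words = [
    (Box(10, 10, 40, 20), "Invoice"),
    (Box(45, 10, 70, 20), "No."),
    (Box(10, 200, 60, 210), "Total"),
]
header = Box(0, 0, 100, 50)
print(words_in_region(words, header))
```

With real parser output, the same filter would be applied to boxes derived from `word.rect`, after checking which coordinate origin the page uses.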

Parallel parsing (multi-threaded)

Parse pages from one or more PDFs in parallel using a thread pool with backpressure:

from docling_parse.pdf_parser import (
    DoclingThreadedPdfParser,
    ThreadedPdfParserConfig,
)
from docling_parse.pdf_parsers import DecodePageConfig  # type: ignore[import]

parser_config = ThreadedPdfParserConfig(
    loglevel="fatal",
    threads=4,                # worker threads
    max_concurrent_results=32 # cap buffered results to limit memory
)
decode_config = DecodePageConfig()

parser = DoclingThreadedPdfParser(
    parser_config=parser_config,
    decode_config=decode_config,
)

# load one or more documents
for source in ["doc_a.pdf", "doc_b.pdf"]:
    doc_key = parser.load(source)
    print(doc_key, parser.page_count(doc_key))

# consume decoded pages as they become available
for result in parser.iterate_results():
    if result.success:
        seg_page = result.get_page()
        timings = result.get_timings()
        print(f"{result.doc_key} p{result.page_number}: "
              f"{len(seg_page.word_cells)} words in {timings.total():.3f}s")
    else:
        print(f"error on {result.doc_key} p{result.page_number}: {result.error_message}")
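
To make the term "backpressure" concrete, here is a minimal standard-library sketch of the general pattern: worker threads push results into a bounded queue, so they block once `max_concurrent_results` items are waiting for the consumer. This is an illustration only, not the implementation used by `DoclingThreadedPdfParser`; `parse_page` and this `iterate_results` function are hypothetical stand-ins.

```python
import queue
import threading

_DONE = object()  # sentinel each worker emits when it runs out of tasks


def parse_page(doc_key: str, page_number: int) -> str:
    """Hypothetical stand-in for the real work of decoding one page."""
    return f"{doc_key} p{page_number}"


def iterate_results(tasks, threads=4, max_concurrent_results=32):
    """Yield results as workers produce them, buffering at most max_concurrent_results."""
    task_q = queue.Queue()
    for task in tasks:
        task_q.put(task)
    # Bounded queue: put() blocks while it is full, which throttles the workers.
    result_q = queue.Queue(maxsize=max_concurrent_results)

    def worker():
        while True:
            try:
                doc_key, page_number = task_q.get_nowait()
            except queue.Empty:
                break
            result_q.put(parse_page(doc_key, page_number))
        result_q.put(_DONE)

    for _ in range(threads):
        threading.Thread(target=worker, daemon=True).start()

    finished = 0
    while finished < threads:
        item = result_q.get()
        if item is _DONE:
            finished += 1
        else:
            yield item


# Results arrive in completion order, which is not necessarily submission order.
pages = [("doc_a.pdf", n) for n in range(1, 6)]
for line in iterate_results(pages, threads=2, max_concurrent_results=2):
    print(line)
```

A small buffer keeps memory bounded when pages are decoded faster than they are consumed, at the cost of workers idling while the consumer catches up.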

Use the CLI

$ docling-parse -h
usage: docling-parse [-h] -p PDF

Process a PDF file.

options:
  -h, --help         show this help message and exit
  -p PDF, --pdf PDF  Path to the PDF file
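
If the CLI needs to be driven from a script, it can be wrapped with `subprocess`. The helper names below (`build_command`, `run_cli`) are hypothetical; only the `--pdf` option comes from the help text above, and the format of what the command prints is not assumed here.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path) -> list[str]:
    """Build the argument list for the docling-parse CLI shown in the help text."""
    return ["docling-parse", "--pdf", str(Path(pdf_path))]


def run_cli(pdf_path) -> str:
    """Run the CLI on one PDF and return whatever it writes to stdout."""
    if shutil.which("docling-parse") is None:
        raise FileNotFoundError("docling-parse is not on PATH; install it first")
    completed = subprocess.run(
        build_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout


print(build_command("doc_a.pdf"))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.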

Performance Benchmarks

Coming soon - benchmarks will be updated for the current parser version.

For historical V1 vs V2 benchmarks, see legacy_performance_benchmarks.md.

Development

CXX

To build the parser, run the following command in the root folder:

rm -rf build; cmake -B ./build; cd build; make

You can run the parser from your build folder:

% ./parse.exe -h
program to process PDF files or configuration files
Usage:
  PDFProcessor [OPTION...]

  -i, --input arg          Input PDF file
  -c, --config arg         Config file
      --create-config arg  Create config file
  -p, --page arg           Pages to process (default: -1 for all) (default:
                           -1)
      --password arg       Password for accessing encrypted, password-protected files
  -o, --output arg         Output file
  -l, --loglevel arg       loglevel [error;warning;success;info]
  -h, --help               Print usage

If you don't have an input file, a template input file will be printed on the terminal.

Python

To build the package, run the following (make sure uv is installed):

uv sync

This only works after a clean git clone. If you are developing and updating the C++ code, use the following instead:

# uv pip install --force-reinstall --no-deps -e .
rm -rf .venv; uv venv; uv pip install --force-reinstall --no-deps -e ".[perf-tools]"

To test the package, run:

uv run pytest ./tests -v -s

Contributing

Please read Contributing to Docling Parse for details.

References

If you use Docling in your projects, please consider citing the following:

@techreport{Docling,
  author = {Docling Team},
  month = {8},
  title = {Docling Technical Report},
  url = {https://arxiv.org/abs/2408.09869},
  eprint = {2408.09869},
  doi = {10.48550/arXiv.2408.09869},
  version = {1.0.0},
  year = {2024}
}

License

The Docling Parse codebase is released under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.

LF AI & Data

Docling (and also docling-parse) is hosted as a project in the LF AI & Data Foundation.

IBM ❤️ Open Source AI

The project was started by the AI for knowledge team at IBM Research Zurich.
