Simple package to extract text with coordinates from programmatic PDFs

Docling Parse

Simple package to extract text, paths and bitmap images with coordinates from programmatic PDFs. This package is used by Docling for PDF conversion. Below, we show a few outputs of the latest parser with char, word and line level output for text, in addition to the extracted paths and bitmap resources.

To produce the visualizations yourself, run the following command (replace word with char or line):

uv run python ./docling_parse/visualize.py -i <path-to-pdf-file> -c word --interactive

[Figure: grid of screenshots comparing each original page with its char, word and line level output]

Quick start

Install the package from PyPI:

pip install docling-parse

Convert a PDF (see visualize.py for a more detailed example):

from docling_core.types.doc.page import TextCellUnit
from docling_parse.pdf_parser import DoclingPdfParser, PdfDocument

parser = DoclingPdfParser()

pdf_doc: PdfDocument = parser.load(
    path_or_stream="<path-to-pdf>"
)

# PdfDocument.iterate_pages() will automatically populate pages as they are yielded.
for page_no, pred_page in pdf_doc.iterate_pages():

    # iterate over the word-cells
    for word in pred_page.iterate_cells(unit_type=TextCellUnit.WORD):
        print(word.rect, ": ", word.text)

    # create a PIL image with the char cells
    img = pred_page.render_as_image(cell_unit=TextCellUnit.CHAR)
    img.show()
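
If you want to keep the words rather than print them, a minimal sketch along the following lines may help. It relies only on the .rect and .text attributes used in the loop above; collect_words is a hypothetical helper (not part of docling-parse), and the stand-in cells exist only so the sketch runs without a PDF.

```python
import json
from types import SimpleNamespace
from typing import Any, Iterable


def collect_words(page_no: int, cells: Iterable[Any]) -> list[dict]:
    """Turn word cells into plain dicts (hypothetical helper, not a docling-parse API)."""
    return [
        {"page": page_no, "text": cell.text, "rect": str(cell.rect)}
        for cell in cells
    ]


# Stand-in cells (assumption: made-up values) so the sketch runs without a PDF.
# With the real parser you would pass
# pred_page.iterate_cells(unit_type=TextCellUnit.WORD) instead.
fake_cells = [
    SimpleNamespace(rect="(10, 20, 50, 32)", text="Hello"),
    SimpleNamespace(rect="(55, 20, 90, 32)", text="world"),
]

records = collect_words(1, fake_cells)
print(json.dumps(records, indent=2))
```

The rectangle is stored as its string form here because that is all the quick start shows; check the docling-core types if you need numeric coordinates.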

Parallel parsing (multi-threaded)

Parse pages from one or more PDFs in parallel using a thread pool with backpressure:

from docling_parse.pdf_parser import (
    DecodeConfig,
    DoclingThreadedPdfParser,
    ThreadedPdfParserConfig,
)

parser_config = ThreadedPdfParserConfig(
    loglevel="fatal",
    threads=4,                # worker threads
    max_concurrent_results=32 # cap buffered results to limit memory
)
decode_config = DecodeConfig()

parser = DoclingThreadedPdfParser(
    parser_config=parser_config,
    decode_config=decode_config,
)

# load one or more documents
for source in ["doc_a.pdf", "doc_b.pdf"]:
    doc_key = parser.load(source)
    print(doc_key, parser.page_count(doc_key))

# consume decoded pages as they become available
for result in parser.iterate_results():
    if result.success:
        seg_page = result.get_page()
        timings = result.get_timings()
        print(f"{result.doc_key} p{result.page_number}: "
              f"{len(seg_page.word_cells)} words in {timings.total():.3f}s")
    else:
        print(f"error on {result.doc_key} p{result.page_number}: {result.error_message}")
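
To illustrate what "backpressure" means here, the following is a generic standard-library sketch, not docling-parse's actual implementation. Workers push results into a bounded queue and block when it is full, which is the role max_concurrent_results appears to play above. The names run_with_backpressure, work and max_buffered are all hypothetical.

```python
import queue
import threading


def run_with_backpressure(items, work, threads=4, max_buffered=2):
    """Process items on worker threads and yield (item, value, error) tuples.

    Generic illustration only; not how docling-parse is implemented.
    """
    tasks = queue.Queue()
    results = queue.Queue(maxsize=max_buffered)  # bounded: put() blocks when full
    for item in items:
        tasks.put(item)
    total = tasks.qsize()

    def worker():
        while True:
            try:
                item = tasks.get_nowait()
            except queue.Empty:
                return  # no work left
            try:
                outcome = (item, work(item), None)
            except Exception as exc:  # report failures instead of losing them
                outcome = (item, None, str(exc))
            results.put(outcome)  # waits here until the consumer makes room

    pool = [threading.Thread(target=worker, daemon=True) for _ in range(threads)]
    for t in pool:
        t.start()
    for _ in range(total):
        yield results.get()
    for t in pool:
        t.join()


squares = sorted(
    value
    for _, value, err in run_with_backpressure(range(10), lambda n: n * n)
    if err is None
)
print(squares)
```

Because the result queue is bounded, a slow consumer slows the workers down instead of letting finished results pile up in memory.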

Use the CLI

$ docling-parse -h
usage: docling-parse [-h] -p PDF

Process a PDF file.

options:
  -h, --help         show this help message and exit
  -p PDF, --pdf PDF  Path to the PDF file
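
The CLI can also be driven from a script. The sketch below uses only the -p flag from the help text above; build_command and run_cli are hypothetical helper names, and what the CLI prints is not documented here, so the output is returned unparsed.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path: str) -> list[str]:
    """Build the argument list for the docling-parse CLI (only -p is assumed)."""
    return ["docling-parse", "-p", str(Path(pdf_path))]


def run_cli(pdf_path: str) -> subprocess.CompletedProcess:
    """Run the CLI and capture stdout/stderr as text."""
    if shutil.which("docling-parse") is None:
        raise FileNotFoundError("docling-parse is not on PATH; is the package installed?")
    return subprocess.run(
        build_command(pdf_path), capture_output=True, text=True, check=False
    )


# "report.pdf" is a placeholder file name.
print(build_command("report.pdf"))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.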

Performance Benchmarks

Coming soon - benchmarks will be updated for the current parser version.

For historical V1 vs V2 benchmarks, see legacy_performance_benchmarks.md.

Development

CXX

To build the parser, run the following command in the root folder:

rm -rf build; cmake -B ./build; cd build; make

You can run the parser from your build folder:

% ./parse.exe -h
program to process PDF files or configuration files
Usage:
  PDFProcessor [OPTION...]

  -i, --input arg          Input PDF file
  -c, --config arg         Config file
      --create-config arg  Create config file
  -p, --page arg           Pages to process (default: -1 for all) (default:
                           -1)
      --password arg       Password for accessing encrypted, password-protected files
  -o, --output arg         Output file
  -l, --loglevel arg       loglevel [error;warning;success;info]
  -h, --help               Print usage

If you don't have an input file, a template input file will be printed on the terminal.

Python

To build the package, run the following (make sure uv is installed):

uv sync

This only works after a clean git clone. If you are developing and updating the C++ code, use the following instead:

# uv pip install --force-reinstall --no-deps -e .
rm -rf .venv; uv venv; uv pip install --force-reinstall --no-deps -e ".[perf-tools]"

or

BUILD_THREADS=12 uv pip install --force-reinstall --no-deps -e ".[perf]"

To test the package, run:

uv run pytest ./tests -v -s

Contributing

Please read Contributing to Docling Parse for details.

References

If you use Docling in your projects, please consider citing the following:

@techreport{Docling,
  author = {Docling Team},
  month = {8},
  title = {Docling Technical Report},
  url = {https://arxiv.org/abs/2408.09869},
  eprint = {2408.09869},
  doi = {10.48550/arXiv.2408.09869},
  version = {1.0.0},
  year = {2024}
}

License

The Docling Parse codebase is released under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.

LF AI & Data

Docling (and also docling-parse) is hosted as a project in the LF AI & Data Foundation.

IBM ❤️ Open Source AI

The project was started by the AI for knowledge team at IBM Research Zurich.
