Simple package to extract text with coordinates from programmatic PDFs

Docling Parse

Simple package to extract text, paths and bitmap images with coordinates from programmatic PDFs. This package is used in the Docling PDF conversion. Below, we show a few outputs of the latest parser, with char-, word- and line-level output for text, in addition to the extracted paths and bitmap resources.

To produce the visualizations yourself, run the following command (change word to char or line):

uv run python ./docling_parse/visualize.py -i <path-to-pdf-file> -c word --interactive

[Screenshots: example pages shown in four columns: original, char, word, line]

Quick start

Install the package from PyPI

pip install docling-parse

Convert a PDF (see visualize.py for a more detailed example)

from docling_core.types.doc.page import TextCellUnit
from docling_parse.pdf_parser import DoclingPdfParser, PdfDocument

parser = DoclingPdfParser()

pdf_doc: PdfDocument = parser.load(
    path_or_stream="<path-to-pdf>"
)

# PdfDocument.iterate_pages() will automatically populate pages as they are yielded.
for page_no, pred_page in pdf_doc.iterate_pages():

    # iterate over the word-cells
    for word in pred_page.iterate_cells(unit_type=TextCellUnit.WORD):
        print(word.rect, ": ", word.text)

    # create a PIL image with the char cells
    img = pred_page.render_as_image(cell_unit=TextCellUnit.CHAR)
    img.show()
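
Once word cells are available, a common follow-up is to keep only those that fall inside a region of interest, for example a table area. The sketch below is a minimal illustration and does not use docling-parse types: Box, Word and words_in_region are hypothetical stand-ins, and it assumes axis-aligned boxes with a top-left origin. Check the actual fields and coordinate origin of word.rect before adapting it.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Box:
    """Axis-aligned box; the top-left origin is an assumption of this sketch."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, other: "Box") -> bool:
        # True when `other` lies entirely inside this box.
        return (
            self.left <= other.left
            and other.right <= self.right
            and self.top <= other.top
            and other.bottom <= self.bottom
        )


@dataclass(frozen=True)
class Word:
    """Hypothetical stand-in for a word cell (text plus bounding box)."""
    text: str
    box: Box


def words_in_region(words: Iterable[Word], region: Box) -> List[str]:
    # Keep the input order and return only the text of the enclosed words.
    return [w.text for w in words if region.contains(w.box)]


words = [
    Word("Total", Box(10, 100, 40, 110)),
    Word("42.00", Box(50, 100, 80, 110)),
    Word("Footer", Box(10, 700, 60, 710)),
]
print(words_in_region(words, Box(0, 90, 200, 120)))  # ['Total', '42.00']
```

Requiring full containment is a deliberate choice here; switching to an overlap test would also keep words that straddle the region border.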

Parallel parsing (multi-threaded)

Parse pages from one or more PDFs in parallel using a thread pool with backpressure:

from docling_parse.pdf_parser import (
    DoclingThreadedPdfParser,
    ThreadedPdfParserConfig,
)
from docling_parse.pdf_parsers import DecodePageConfig  # type: ignore[import]

parser_config = ThreadedPdfParserConfig(
    loglevel="fatal",
    threads=4,                # worker threads
    max_concurrent_results=32 # cap buffered results to limit memory
)
decode_config = DecodePageConfig()

parser = DoclingThreadedPdfParser(
    parser_config=parser_config,
    decode_config=decode_config,
)

# load one or more documents
for source in ["doc_a.pdf", "doc_b.pdf"]:
    doc_key = parser.load(source)
    print(doc_key, parser.page_count(doc_key))

# consume decoded pages as they become available
for result in parser.iterate_results():
    if result.success:
        seg_page = result.get_page()
        timings = result.get_timings()
        print(f"{result.doc_key} p{result.page_number}: "
              f"{len(seg_page.word_cells)} words in {timings.total():.3f}s")
    else:
        print(f"error on {result.doc_key} p{result.page_number}: {result.error_message}")
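
The max_concurrent_results setting above caps how many decoded pages can be buffered at once. The general idea of backpressure can be illustrated with a bounded queue from the Python standard library: workers block when the buffer is full and resume once the consumer catches up. This is a conceptual sketch only, not how docling-parse is implemented; the parse_page and iterate_results functions below are hypothetical stand-ins.

```python
import queue
import threading
from typing import Iterator, List, Tuple


def parse_page(page_no: int) -> str:
    """Stand-in for real page decoding work (hypothetical)."""
    return f"page-{page_no}"


def iterate_results(
    page_numbers: List[int],
    threads: int = 4,
    max_concurrent_results: int = 8,
) -> Iterator[Tuple[int, str]]:
    tasks: "queue.Queue[int]" = queue.Queue()
    # Bounded queue: put() blocks once max_concurrent_results items are waiting.
    results: "queue.Queue[Tuple[int, str]]" = queue.Queue(maxsize=max_concurrent_results)

    for n in page_numbers:
        tasks.put(n)
    total = len(page_numbers)

    def worker() -> None:
        while True:
            try:
                n = tasks.get_nowait()
            except queue.Empty:
                return  # no work left
            results.put((n, parse_page(n)))  # blocks while the buffer is full

    workers = [threading.Thread(target=worker, daemon=True) for _ in range(threads)]
    for t in workers:
        t.start()

    # Yield results in completion order, which may differ from page order.
    for _ in range(total):
        yield results.get()

    for t in workers:
        t.join()


pages = sorted(iterate_results(list(range(1, 21)), threads=3, max_concurrent_results=2))
print(len(pages), pages[0])
```

A small buffer keeps memory bounded when the consumer is slower than the workers, at the cost of workers idling while they wait for free slots.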

Use the CLI

$ docling-parse -h
usage: docling-parse [-h] -p PDF

Process a PDF file.

options:
  -h, --help         show this help message and exit
  -p PDF, --pdf PDF  Path to the PDF file
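
To call the CLI from a script, one option is to wrap it with subprocess. The sketch below uses only the --pdf flag shown in the help text above; build_command and run_cli are hypothetical helper names, and nothing is assumed about what the command prints.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_command(pdf: str) -> List[str]:
    # Only the documented --pdf option is used.
    return ["docling-parse", "--pdf", str(Path(pdf))]


def run_cli(pdf: str) -> subprocess.CompletedProcess:
    # Fail early with a clear message if the executable is not on PATH.
    if shutil.which("docling-parse") is None:
        raise FileNotFoundError("docling-parse not found; is the package installed?")
    # check=False leaves return-code handling to the caller.
    return subprocess.run(build_command(pdf), capture_output=True, text=True, check=False)


print(build_command("report.pdf"))  # ['docling-parse', '--pdf', 'report.pdf']
```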

Performance Benchmarks

Coming soon - benchmarks will be updated for the current parser version.

For historical V1 vs V2 benchmarks, see legacy_performance_benchmarks.md.

Development

CXX

To build the parser, run the following command in the root folder:

rm -rf build; cmake -B ./build; cd build; make

You can run the parser from your build folder:

% ./parse.exe -h
program to process PDF files or configuration files
Usage:
  PDFProcessor [OPTION...]

  -i, --input arg          Input PDF file
  -c, --config arg         Config file
      --create-config arg  Create config file
  -p, --page arg           Pages to process (default: -1 for all) (default:
                           -1)
      --password arg       Password for accessing encrypted, password-protected files
  -o, --output arg         Output file
  -l, --loglevel arg       loglevel [error;warning;success;info]
  -h, --help               Print usage

If you don't have an input file, a template input file will be printed on the terminal.

Python

To build the package, run the following (make sure uv is installed):

uv sync

This only works after a clean git clone. If you are developing and updating the C++ code, use the following instead:

# uv pip install --force-reinstall --no-deps -e .
rm -rf .venv; uv venv; uv pip install --force-reinstall --no-deps -e ".[perf-tools]"

To test the package, run:

uv run pytest ./tests -v -s

Contributing

Please read Contributing to Docling Parse for details.

References

If you use Docling in your projects, please consider citing the following:

@techreport{Docling,
  author = {Docling Team},
  month = {8},
  title = {Docling Technical Report},
  url = {https://arxiv.org/abs/2408.09869},
  eprint = {2408.09869},
  doi = {10.48550/arXiv.2408.09869},
  version = {1.0.0},
  year = {2024}
}

License

The Docling Parse codebase is under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.

LF AI & Data

Docling (and also docling-parse) is hosted as a project in the LF AI & Data Foundation.

IBM ❤️ Open Source AI

The project was started by the AI for knowledge team at IBM Research Zurich.
