Simple package to extract text with coordinates from programmatic PDFs

Docling Parse

Simple package to extract text, paths and bitmap images with coordinates from programmatic PDFs. This package is used in the Docling PDF conversion. Below, we show a few outputs of the latest parser, with char-, word- and line-level output for text, in addition to the extracted paths and bitmap resources.

To produce the visualizations yourself, run the following command (replace word with char or line):

uv run python ./docling_parse/visualize.py -i <path-to-pdf-file> -c word --interactive

[Screenshots: five sample pages, each shown as the original page next to its char-, word- and line-level visualizations]

Quick start

Install the package from PyPI:

pip install docling-parse

Convert a PDF (see visualize.py for a more detailed example):

from docling_core.types.doc.page import TextCellUnit
from docling_parse.pdf_parser import DoclingPdfParser, PdfDocument

parser = DoclingPdfParser()

pdf_doc: PdfDocument = parser.load(
    path_or_stream="<path-to-pdf>"
)

# PdfDocument.iterate_pages() will automatically populate pages as they are yielded.
for page_no, pred_page in pdf_doc.iterate_pages():

    # iterate over the word-cells
    for word in pred_page.iterate_cells(unit_type=TextCellUnit.WORD):
        print(word.rect, ": ", word.text)

    # create a PIL image with the char cells
    img = pred_page.render_as_image(cell_unit=TextCellUnit.CHAR)
    img.show()
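
If you want to keep the cells instead of printing them, the loop above can be wrapped in a small helper. The following is a minimal sketch, not part of the package: the names WordRecord, collect_cells and main are hypothetical, and the helper relies only on the calls shown in the quick start (iterate_pages, iterate_cells, cell.text and cell.rect).

```python
from dataclasses import dataclass
from typing import Any, Iterable, Tuple


@dataclass
class WordRecord:
    """Hypothetical container for one extracted cell."""
    page_no: int
    text: str
    rect: Any  # whatever object the parser exposes as `cell.rect`


def collect_cells(pages: Iterable[Tuple[int, Any]], unit: Any) -> list[WordRecord]:
    """Flatten (page_no, page) pairs into a list of records.

    `pages` is expected to behave like PdfDocument.iterate_pages(),
    and `unit` is passed through to page.iterate_cells(unit_type=...).
    """
    records = []
    for page_no, page in pages:
        for cell in page.iterate_cells(unit_type=unit):
            records.append(WordRecord(page_no=page_no, text=cell.text, rect=cell.rect))
    return records


def main(pdf_path: str) -> None:
    # Imported here so collect_cells can be used and tested without the package.
    from docling_core.types.doc.page import TextCellUnit
    from docling_parse.pdf_parser import DoclingPdfParser

    parser = DoclingPdfParser()
    pdf_doc = parser.load(path_or_stream=pdf_path)
    for rec in collect_cells(pdf_doc.iterate_pages(), TextCellUnit.WORD):
        print(rec.page_no, rec.rect, ": ", rec.text)
```

Calling main("<path-to-pdf>") prints the same information as the quick start, with the page number added. The list returned by collect_cells can also be filtered or serialized as needed.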

Use the CLI

$ docling-parse -h
usage: docling-parse [-h] -p PDF

Process a PDF file.

options:
  -h, --help         show this help message and exit
  -p PDF, --pdf PDF  Path to the PDF file
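
To run the CLI over several files, one option is to drive it from a short Python script. The sketch below assumes that docling-parse is on your PATH and uses only the -p flag from the help text above. The function names commands_for_directory and run_all are hypothetical, and the helper does not interpret what the CLI prints.

```python
import subprocess
from pathlib import Path


def commands_for_directory(pdf_dir: str) -> list[list[str]]:
    """Build one `docling-parse -p <file>` command per PDF in a directory."""
    pdfs = sorted(Path(pdf_dir).glob("*.pdf"))
    return [["docling-parse", "-p", str(pdf)] for pdf in pdfs]


def run_all(pdf_dir: str) -> dict[str, int]:
    """Run the CLI on every PDF and map each file path to its exit code."""
    results = {}
    for cmd in commands_for_directory(pdf_dir):
        # check=False: keep going if one file fails, and record the exit code.
        completed = subprocess.run(cmd, check=False)
        results[cmd[-1]] = completed.returncode
    return results
```

Keeping the command construction separate from the subprocess call makes it easy to inspect the commands before anything is executed.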

Performance Benchmarks

Coming soon - benchmarks will be updated for the current parser version.

For historical V1 vs V2 benchmarks, see legacy_performance_benchmarks.md.

Development

CXX

To build the parser, run the following command in the root folder:

rm -rf build; cmake -B ./build; cd build; make

You can run the parser from your build folder:

% ./parse.exe -h
program to process PDF files or configuration files
Usage:
  PDFProcessor [OPTION...]

  -i, --input arg          Input PDF file
  -c, --config arg         Config file
      --create-config arg  Create config file
  -p, --page arg           Pages to process (default: -1 for all) (default:
                           -1)
      --password arg       Password for accessing encrypted, password-protected files
  -o, --output arg         Output file
  -l, --loglevel arg       loglevel [error;warning;success;info]
  -h, --help               Print usage

If you don't have an input file, a template input file will be printed on the terminal.
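
If you call the compiled binary from scripts, a small helper can assemble the argument list from the options listed above. This is a sketch under assumptions: the function name build_parse_command is hypothetical, the binary path defaults to ./parse.exe as in the example above, and only flags that appear in the help text are used. The help text does not say whether page numbers start at 0 or 1, nor what format the output file has, so the helper passes these values through unchanged.

```python
from typing import Optional

# Log levels as listed in the help text of the binary.
LOG_LEVELS = ("error", "warning", "success", "info")


def build_parse_command(
    input_pdf: str,
    binary: str = "./parse.exe",
    page: Optional[int] = None,
    output: Optional[str] = None,
    loglevel: Optional[str] = None,
    password: Optional[str] = None,
) -> list[str]:
    """Assemble an argument list for the parser binary.

    Optional flags are only added when a value is given, so the
    defaults of the binary (for example, all pages) stay in effect.
    """
    if loglevel is not None and loglevel not in LOG_LEVELS:
        raise ValueError(f"loglevel must be one of {LOG_LEVELS}, got {loglevel!r}")

    cmd = [binary, "-i", input_pdf]
    if page is not None:
        cmd += ["-p", str(page)]
    if password is not None:
        cmd += ["--password", password]
    if output is not None:
        cmd += ["-o", output]
    if loglevel is not None:
        cmd += ["-l", loglevel]
    return cmd
```

The resulting list can be passed to subprocess.run from the build folder, for example subprocess.run(build_parse_command("<path-to-pdf>", loglevel="info")).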

Python

To build the package, run the following (make sure uv is installed):

uv sync

This command only works after a clean git clone. If you are developing and updating the C++ code, use the following instead:

# uv pip install --force-reinstall --no-deps -e .
rm -rf .venv; uv venv; uv pip install --force-reinstall --no-deps -e ".[perf-tools]"

To test the package, run:

uv run pytest ./tests -v -s

Contributing

Please read Contributing to Docling Parse for details.

References

If you use Docling in your projects, please consider citing the following:

@techreport{Docling,
  author = {Docling Team},
  month = {8},
  title = {Docling Technical Report},
  url = {https://arxiv.org/abs/2408.09869},
  eprint = {2408.09869},
  doi = {10.48550/arXiv.2408.09869},
  version = {1.0.0},
  year = {2024}
}

License

The Docling Parse codebase is released under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.

LF AI & Data

Docling (and also docling-parse) is hosted as a project in the LF AI & Data Foundation.

IBM ❤️ Open Source AI

The project was started by the AI for knowledge team at IBM Research Zurich.
