Docling Parse
A simple package to extract text with coordinates from programmatic PDFs. This package is part of the Docling conversion pipeline.
Quick start
Install the package from PyPI:
pip install docling-parse
Convert a PDF
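from docling_parse.docling_parse import pdf_parser

# Create the parser and extract the text cells of every page
parser = pdf_parser()
doc = parser.find_cells("mydoc.pdf")

# Print the page index, the cell index, and the text of each cell
for i, page in enumerate(doc["pages"]):
    for j, cell in enumerate(page["cells"]):
        print(i, "\t", j, "\t", cell["content"]["rnormalized"])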
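The loop above can be wrapped in a small helper that gathers the text of each page. This is a minimal sketch: `page_texts` is a hypothetical helper name, and it assumes only the dictionary layout used in the example above (`pages`, then `cells`, then `content`, then `rnormalized`). The sample dictionary is hand-made for illustration and is not real parser output.

```python
def page_texts(doc):
    """Return one string per page, joining the text of that page's cells.

    Assumes the layout used in the example above:
    doc["pages"][i]["cells"][j]["content"]["rnormalized"].
    """
    texts = []
    for page in doc["pages"]:
        cells = [cell["content"]["rnormalized"] for cell in page["cells"]]
        texts.append(" ".join(cells))
    return texts


# Hand-made stand-in for parser output, for illustration only.
sample_doc = {
    "pages": [
        {
            "cells": [
                {"content": {"rnormalized": "Hello"}},
                {"content": {"rnormalized": "world"}},
            ]
        },
        {"cells": []},  # a page without any text cells
    ]
}

print(page_texts(sample_doc))
```

With the real parser, the helper would presumably be called as `page_texts(parser.find_cells("mydoc.pdf"))`.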
Use the CLI
$ docling-parse -h
usage: docling-parse [-h] -p PDF

Process a PDF file.

options:
  -h, --help         show this help message and exit
  -p PDF, --pdf PDF  Path to the PDF file
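To make the help text above concrete, here is a minimal sketch of an `argparse` parser that accepts the same options. It is an illustration only and not the package's actual CLI code; `build_parser` is a hypothetical name, and the only details taken from the help output are the program name, the description, and the required `-p`/`--pdf` option.

```python
import argparse


def build_parser():
    # Illustrative parser mirroring the help text above; the real CLI may differ.
    parser = argparse.ArgumentParser(
        prog="docling-parse", description="Process a PDF file."
    )
    # A required option is shown without brackets in the usage line.
    parser.add_argument("-p", "--pdf", required=True, help="Path to the PDF file")
    return parser


# Parse an example command line equivalent to: docling-parse -p mydoc.pdf
args = build_parser().parse_args(["-p", "mydoc.pdf"])
print(args.pdf)
```

Because `-p` is required, running the command without it exits with a usage error rather than parsing anything.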
Development
CXX
To build the parser, run the following command in the root folder:
rm -rf build; cmake -B ./build; cd build; make
You can run the parser from your build folder with:
./parse.exe <input-file> <optional-logging:true>
If you don't have an input file, a template input file will be printed to the terminal.
Python
To build the package, make sure Poetry is installed and run:
poetry build
To test the package, run:
poetry run pytest ./tests/test_parse.py
Contributing
Please read Contributing to Docling Parse for details.
References
If you use Docling in your projects, please consider citing the following:
@software{Docling,
  author = {Deep Search Team},
  month = {7},
  title = {{Docling}},
  url = {https://github.com/DS4SD/docling},
  version = {main},
  year = {2024}
}
License
The Docling Parse codebase is released under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.
Download files
Built Distributions
Hashes for docling_parse-0.3.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | ca5cfcd27bbb0afac1def449e152ee80147c3343c46871b8dea3eac574ac7e36
MD5 | a2c59ee358e2f408fddf4e2f8e8e2bf8
BLAKE2b-256 | 054c6bc8ba11d5e84d01dc24307ebb159cc400a4e9068ae831a287f9c1e3103e
Hashes for docling_parse-0.3.0-cp312-cp312-macosx_14_0_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 59c4f018e80bc130ae8ee9820ff305e05f8407c33e0b745648694fe570e720ad
MD5 | e8da775f86159d562efbf4c28a9b4aaa
BLAKE2b-256 | 91cfacc4935fa046264bf357b62700b17b0417c4aa818e0d9e87905cf865757f
Hashes for docling_parse-0.3.0-cp312-cp312-macosx_14_0_arm64.whl
Algorithm | Hash digest
---|---
SHA256 | 9c44490a4260521f10f446c98c04e579a46d6f6afb252bcf851298b4a2f6b017
MD5 | 974661803cfd7e42e24a4e8158a59618
BLAKE2b-256 | 2d85abee8197d21c7c5aa53717c715cfcad60f92b8ad9c0cb76522474fb6c23c
Hashes for docling_parse-0.3.0-cp312-cp312-macosx_13_6_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | d5545b206a9ab1ad81490924e120dc31e36b01be7d8ada954a7097d1d31e2a1d
MD5 | d861f2e08f252ddabf536672806522db
BLAKE2b-256 | ce9608f11af276351ec47f82006e80da0817b70e06ff952cf2b9653a1cdf33d6
Hashes for docling_parse-0.3.0-cp312-cp312-macosx_13_6_arm64.whl
Algorithm | Hash digest
---|---
SHA256 | d4d9930296f8908812902511056b7c01fe79c408513995c8871c0d4fd690920d
MD5 | b58a98a429498cc08e8a6ddba41fb0d8
BLAKE2b-256 | ab74498def364e7020d50f533534b07185fc835ad3a9e3218dc95fed2ed8090c
Hashes for docling_parse-0.3.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 2a539512581171ecc2f7feffcfff9c62d17c4d1895efc78df0e3816bdee6c292
MD5 | 2d9f393510e83d8dfa68e0723fed1be8
BLAKE2b-256 | 770f1bd47487b4e0548501b174585001094406871099ee6f6e094c72e9043155
Hashes for docling_parse-0.3.0-cp311-cp311-macosx_14_0_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 11ea4e21d774e97f63d12c0bc83a8ca792c7d957b06dc4035618372a2bd10967
MD5 | c2e39ac020410a19500b7f13e8b72211
BLAKE2b-256 | 4f71f086873be9ec05dd852a2b27786c2e8115a8b0dbcb9c0f013cf6461ba477
Hashes for docling_parse-0.3.0-cp311-cp311-macosx_14_0_arm64.whl
Algorithm | Hash digest
---|---
SHA256 | ba1b81fdc2e1e15409ea236bb653e33145fce9b45dca807d2e9911cce57a253c
MD5 | 0d844e40ccfaa2e29f8b626a969fe7ae
BLAKE2b-256 | d7108b7bfdd1831900d0e14602ecc97b0340dcbaf8923189f50bc2e5d90fa892
Hashes for docling_parse-0.3.0-cp311-cp311-macosx_13_6_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | a44c94208399456cba9cb06bcea8356b040a24e5e74f37e78832188717acdf25
MD5 | c715d7b5b9c53941ca336469c10c761c
BLAKE2b-256 | 0dd7a7bb94dd75ace64afec79e98592f43a58b64157d5ad33824c5e45bd140be
Hashes for docling_parse-0.3.0-cp311-cp311-macosx_13_6_arm64.whl
Algorithm | Hash digest
---|---
SHA256 | d0262f7aa420db2e746cdaa38d714e3e9b5b1d905ec7ffc741546645c98fcff3
MD5 | f345cdb968f4378d78c27ba2a09c7012
BLAKE2b-256 | e9c09803497b9418599097b682f0225a4fbecaaa36d2ee78eb494dc989ee2812
Hashes for docling_parse-0.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 6cf132931b6fb1a688b3c60a5ac29dc5f17c92002bf664f0291b98c5006db84a
MD5 | 029f16f29b999011f1e47ec9294e3499
BLAKE2b-256 | 99893a748a3d96c4c52083ab71a0b8c31bc396589dcd9e8fcf119992f18a58d4
Hashes for docling_parse-0.3.0-cp310-cp310-macosx_14_0_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 54b08b7ea23ca65a9ce0069599804fe72b4c106d053221f3d4513d9bc0960659
MD5 | 0d1121abdd1b28617e07c6ca98eb8c1d
BLAKE2b-256 | 3234f02075a97792a2aaaac66c28e4341a0cdf137bb80989a274eb5d313b0b24
Hashes for docling_parse-0.3.0-cp310-cp310-macosx_14_0_arm64.whl
Algorithm | Hash digest
---|---
SHA256 | a67dbb5d619c6aa167fb481c93cf3a276807b7e0f7809683b868ce4888837ce6
MD5 | e0e098995a7a6fdcfccb528909ec2058
BLAKE2b-256 | c6da3d75cf1c8a37f2453ff3b14b855596255c35bc21b19555155f35b3512444
Hashes for docling_parse-0.3.0-cp310-cp310-macosx_13_6_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 4432e33a77649667845fb2f23a8f7439fcf674c435ae9c8bab101507417b9063
MD5 | 1e3e27437de387ae5b52fb09db8b69be
BLAKE2b-256 | 22504d5b53427283246ffa1ce616d812f563b9a6b061e394d669dd5419ccbf17
Hashes for docling_parse-0.3.0-cp310-cp310-macosx_13_6_arm64.whl
Algorithm | Hash digest
---|---
SHA256 | c4df96d8aada8ca5d5477c6ecbf2ae47036b54c64d13f288a7700371d04a6736
MD5 | f37b03f908a05a5425c107d16c02444a
BLAKE2b-256 | de18ebde2ab84d46754561441eac26479d10b58cda71a4cddd61cc85cefccfbe
Hashes for docling_parse-0.3.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 0b760b68e6d5bf945835aa110c15ec270b04949f59c0c657db87c1c63d6937dd
MD5 | 0d050f6e0b4f5f81012469220d1ecc27
BLAKE2b-256 | d6dacf92645d53c6096ea7b68911508543c842f5f02daadc163f84a0501b709c
Hashes for docling_parse-0.3.0-cp39-cp39-macosx_14_0_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 37c3c060d0f720e5e2e95ebbe731fa918e21094518647703c79ccb88491c6440
MD5 | 7a5e7f1f4c7f7d4404ead49790f428ae
BLAKE2b-256 | 87a8d3798233e39086324fb41c4e4a2596ce6d62c12f8721da6bae51b48fdea2
Hashes for docling_parse-0.3.0-cp39-cp39-macosx_14_0_arm64.whl
Algorithm | Hash digest
---|---
SHA256 | 7271aea1720c5fe4bb3e8a247f5663368de26e587d0d2ea2eadb683eede69554
MD5 | 5e5f7a31406c33974ec1d61a4be22ade
BLAKE2b-256 | 341fa298ae0d601ea467b63d8b85726c43dee83bc67e670e0994cbc88eeee8b2
Hashes for docling_parse-0.3.0-cp39-cp39-macosx_13_6_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | af21952fc0356715f81b11c7ef8413cd51273e83a2c09be28b3743baec52691d
MD5 | b41281a2e536b722247b134ad7bbd5fd
BLAKE2b-256 | 3d8a323908d8860a8c92e8c15dcda1f83ef701d85e3fdddf2d2006a1240cf7e8
Hashes for docling_parse-0.3.0-cp39-cp39-macosx_13_6_arm64.whl
Algorithm | Hash digest
---|---
SHA256 | fcbdb6445c8cbcabeb9ed17f33624a25b7d8d2c10e8e8a44b7339e844dc09d95
MD5 | 781458a44a28525723aa9107b146e092
BLAKE2b-256 | cd440e3f4f20a1801346c71e2cb424fe677c7464fa5ee921b4ca081f11376bb6
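The digests listed above can be used to check a downloaded wheel. The following is a minimal sketch using Python's standard `hashlib` module; `file_sha256` and `matches_published_hash` are hypothetical helper names, and the demonstration runs on a throwaway temporary file rather than a real wheel.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return file_sha256(path) == expected_hex.strip().lower()


# Demonstration on a throwaway file. For a real check, pass the path of the
# downloaded wheel and the SHA256 value listed for that file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example bytes")
    demo_path = tmp.name

expected = hashlib.sha256(b"example bytes").hexdigest()
ok = matches_published_hash(demo_path, expected)
print(ok)
os.remove(demo_path)
```

Reading in chunks keeps memory use flat for large files. Alternatively, pip's hash-checking mode can enforce the same SHA256 values at install time through `--hash` entries in a requirements file.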