Docling Parse
Simple package to extract text with coordinates from programmatic PDFs. This package is part of the Docling conversion.
Quick start
Install the package from PyPI:
pip install docling-parse
Convert a PDF
from docling_parse.docling_parse import pdf_parser
# Do this only once to load fonts (avoid initialising it many times)
parser = pdf_parser()
# parser.set_loglevel(1) # 1=error, 2=warning, 3=success, 4=info
doc_file = "my-doc.pdf" # filename
doc_key = f"key={doc_file}" # unique document key (e.g. hash, UUID, etc.)
# Load the document from file using filename doc_file. This only loads
# the QPDF document, but no extracted data
success = parser.load_document(doc_key, doc_file)
# Alternatively, open the file in binary mode and read its contents
# (this variant needs `import io`)
# with open(doc_file, "rb") as file:
#     file_content = file.read()
# Create a BytesIO object and write the file contents to it
# bytes_io = io.BytesIO(file_content)
# success = parser.load_document_from_bytesio(doc_key, bytes_io)
# Parse the entire document in one go. This is easier, but could require
# a lot more memory than parsing page by page
# json_doc = parser.parse_pdf_from_key(doc_key)
# Get number of pages
num_pages = parser.number_of_pages(doc_key)
# Parse page by page to minimize memory footprint
for page in range(0, num_pages):
    # Internal memory for page is auto-deleted after this call.
    # No need to unload a specific page
    json_doc = parser.parse_pdf_from_key_on_page(doc_key, page)
    if "pages" not in json_doc:  # page could not get parsed
        continue
    # parsed page is the first one!
    json_page = json_doc["pages"][0]
    page_dimensions = [json_page["dimensions"]["width"], json_page["dimensions"]["height"]]
    # find text cells
    cells = []
    for cell_id, cell in enumerate(json_page["cells"]):
        cells.append([page,
                      cell_id,
                      cell["content"]["rnormalized"],  # text
                      cell["box"]["device"][0],  # x0 (lower left x)
                      cell["box"]["device"][1],  # y0 (lower left y)
                      cell["box"]["device"][2],  # x1 (upper right x)
                      cell["box"]["device"][3],  # y1 (upper right y)
                      ])
    # find bitmap images
    images = []
    for image_id, image in enumerate(json_page["images"]):
        images.append([page,
                       image_id,
                       image["box"][0],  # x0 (lower left x)
                       image["box"][1],  # y0 (lower left y)
                       image["box"][2],  # x1 (upper right x)
                       image["box"][3],  # y1 (upper right y)
                       ])
    # find paths
    paths = []
    for path_id, path in enumerate(json_page["paths"]):
        paths.append([page,
                      path_id,
                      path["x-values"],  # array of x values
                      path["y-values"],  # array of y values
                      ])
# Unload the (QPDF) document and buffers
parser.unload_document(doc_key)
# Unloads everything at once
# parser.unload_documents()
Use the CLI
$ docling-parse -h
usage: docling-parse [-h] -p PDF

Process a PDF file.

options:
  -h, --help         show this help message and exit
  -p PDF, --pdf PDF  Path to the PDF file
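The CLI can also be driven from a script. The sketch below is an illustration only: `build_command` and `run_cli` are hypothetical helpers, and the only flag used is the `-p` option listed in the help text. What the command writes to standard output is not described in the help text, so the captured output is returned as-is.

```python
import subprocess
from pathlib import Path
from typing import List


def build_command(pdf_path: str, executable: str = "docling-parse") -> List[str]:
    """Return the argument list for `docling-parse -p <pdf_path>`."""
    return [executable, "-p", str(Path(pdf_path))]


def run_cli(pdf_path: str) -> subprocess.CompletedProcess:
    """Run the CLI on one PDF and return the completed process."""
    if not Path(pdf_path).is_file():
        raise FileNotFoundError(f"no such PDF: {pdf_path}")
    # check=True raises CalledProcessError on a non-zero exit status.
    return subprocess.run(
        build_command(pdf_path), capture_output=True, text=True, check=True
    )


print(build_command("my-doc.pdf"))
```

Calling `run_cli("my-doc.pdf")` requires the `docling-parse` executable to be on the `PATH`, which is the case after installing the package with pip.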
Development
CXX
To build the parser, run the following command in the root folder:
rm -rf build; cmake -B ./build; cd build; make
You can run the parser from your build folder with
./parse.exe <input-file> <optional-logging:true>
If you don't have an input file, a template input file is printed to the terminal.
Python
To build the package, run the following (make sure Poetry is installed):
poetry build
To test the package, run:
poetry run pytest ./tests/test_parse.py
Contributing
Please read Contributing to Docling Parse for details.
References
If you use Docling in your projects, please consider citing the following:
@software{Docling,
  author = {Deep Search Team},
  month = {7},
  title = {{Docling}},
  url = {https://github.com/DS4SD/docling},
  version = {main},
  year = {2024}
}
License
The Docling Parse codebase is released under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.