Docling Parse

Simple package to extract text, paths and bitmap images with coordinates from programmatic PDFs. This package is used in the Docling PDF conversion.

| Version | Original | Word-level | Snippet-level | Performance |
|---|---|---|---|---|
| V1 | [image: screenshot] | Not supported | [image: v1 snippet] | ~0.250 sec/page |
| V2 | | [image: v1 word] | [image: v2 snippet] | ~0.050 sec/page (~5-10X faster than v1) |
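
To make the comparison concrete, the two figures can be read as seconds per page (the only reading under which v2 comes out faster) and converted to throughput. The short calculation below is plain arithmetic on the two quoted numbers; the constant and function names are illustrative only.

```python
# Approximate per-page timings quoted in the table, read as seconds per page.
V1_SEC_PER_PAGE = 0.250
V2_SEC_PER_PAGE = 0.050


def pages_per_second(sec_per_page: float) -> float:
    """Convert a per-page latency into throughput."""
    return 1.0 / sec_per_page


v1_throughput = pages_per_second(V1_SEC_PER_PAGE)  # roughly 4 pages/sec
v2_throughput = pages_per_second(V2_SEC_PER_PAGE)  # roughly 20 pages/sec
speedup = V1_SEC_PER_PAGE / V2_SEC_PER_PAGE        # roughly 5x

print(f"v1: {v1_throughput:.1f} pages/sec")
print(f"v2: {v2_throughput:.1f} pages/sec")
print(f"speedup: {speedup:.1f}x")
```

The ratio of the two quoted averages sits at the lower end of the stated 5-10X range, so the upper end presumably reflects documents on which v2 does comparatively better.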

Quick start

Install the package from PyPI:

pip install docling-parse

Convert a PDF (see visualise.py for a more detailed example):

from docling_parse.docling_parse import pdf_parser_v2

# Do this only once to load fonts (avoid initialising it many times)
parser = pdf_parser_v2()

# parser.set_loglevel(1) # 1=error, 2=warning, 3=success, 4=info

doc_file = "my-doc.pdf" # filename
doc_key = f"key={doc_file}" # unique document key (eg hash, UUID, etc)

# Load the document from file using filename doc_file. This only loads
# the QPDF document, but no extracted data
success = parser.load_document(doc_key, doc_file)

# Alternatively, load the document from an in-memory buffer (requires `import io`).
# Open the file in binary mode and read its contents
# with open(doc_file, "rb") as file:
#     file_content = file.read()

# Create a BytesIO object and write the file contents to it
# bytes_io = io.BytesIO(file_content)
# success = parser.load_document_from_bytesio(doc_key, bytes_io)

# Parse the entire document in one go. This is easier, but can require
# a lot more memory than parsing page by page
# json_doc = parser.parse_pdf_from_key(doc_key)

# Get number of pages
num_pages = parser.number_of_pages(doc_key)

# Parse page by page to minimize memory footprint
for page in range(0, num_pages):

    # Internal memory for page is auto-deleted after this call.
    # No need to unload a specific page
    json_doc = parser.parse_pdf_from_key_on_page(doc_key, page)

    if "pages" not in json_doc:  # page could not get parsed
        continue

    # parsed page is the first one!
    json_page = json_doc["pages"][0]

    # <Insert your own code>

# Unload the (QPDF) document and buffers
parser.unload_document(doc_key)

# Unloads everything at once
# parser.unload_documents()
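
The load / parse / unload sequence above can be wrapped in a small generator so that the document is always unloaded, even if processing a page raises. This is a sketch, not part of the package: `iter_parsed_pages` is a hypothetical helper that relies only on the parser methods shown in the quick start, and it assumes `load_document` returns a truthy value on success, as the `success` variable suggests.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_parsed_pages(
    parser: Any, doc_key: str, doc_file: str
) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Yield (page_index, page_json) for every page that parses.

    `parser` is expected to expose the methods used in the quick start:
    load_document, number_of_pages, parse_pdf_from_key_on_page and
    unload_document. This helper is hypothetical, not part of docling-parse.
    """
    # Assumption: a falsy return value means the document could not be loaded.
    if not parser.load_document(doc_key, doc_file):
        raise RuntimeError(f"could not load {doc_file!r}")
    try:
        for page in range(parser.number_of_pages(doc_key)):
            json_doc = parser.parse_pdf_from_key_on_page(doc_key, page)
            if "pages" not in json_doc:  # page could not be parsed
                continue
            # A single-page parse returns that page as the first entry.
            yield page, json_doc["pages"][0]
    finally:
        # Runs on normal exhaustion, on errors, and when the generator is closed.
        parser.unload_document(doc_key)
```

With a `parser` created as in the quick start, usage would look like `for page_no, json_page in iter_parsed_pages(parser, doc_key, doc_file): ...`. Only one page of JSON is held by the caller at a time, which keeps the memory benefit of page-by-page parsing.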

Use the CLI

$ docling-parse -h
usage: docling-parse [-h] -p PDF

Process a PDF file.

options:
  -h, --help         show this help message and exit
  -p PDF, --pdf PDF  Path to the PDF file
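
The same CLI can be driven from a script. The sketch below uses only the `-p`/`--pdf` option shown in the help text; `build_cli_command` and `run_cli` are hypothetical helper names, and since the tool's output format is not described here, the output is returned as opaque text.

```python
import shutil
import subprocess
from typing import List


def build_cli_command(pdf_path: str) -> List[str]:
    """Build the argv for `docling-parse -p PDF` (the only documented option)."""
    return ["docling-parse", "-p", pdf_path]


def run_cli(pdf_path: str) -> str:
    """Run the CLI on one PDF and return its standard output as text."""
    if shutil.which("docling-parse") is None:
        raise FileNotFoundError(
            "docling-parse is not on PATH; is the package installed?"
        )
    result = subprocess.run(
        build_cli_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.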

Performance Benchmarks

We ran the v1 and v2 parsers on DocLayNet and found the following overall behavior:

[Figure: parser-performance]

Development

CXX

To build the parser, simply run the following command in the root folder:

rm -rf build; cmake -B ./build; cd build; make

You can run the parser from your build folder. Example from parse_v1,

% ./parse_v1.exe -h
A program to process PDF files or configuration files
Usage:
  PDFProcessor [OPTION...]

  -i, --input arg          Input PDF file
  -c, --config arg         Config file
      --create-config arg  Create config file
  -o, --output arg         Output file
  -l, --loglevel arg       loglevel [error;warning;success;info]
  -h, --help               Print usage

Example from parse_v2,

% ./parse_v2.exe -h
program to process PDF files or configuration files
Usage:
  PDFProcessor [OPTION...]

  -i, --input arg          Input PDF file
  -c, --config arg         Config file
      --create-config arg  Create config file
  -p, --page arg           Pages to process (default: -1 for all) (default:
                           -1)
  -o, --output arg         Output file
  -l, --loglevel arg       loglevel [error;warning;success;info]
  -h, --help               Print usage

If you don't have an input file, a template input file will be printed to the terminal.
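
When scripting the compiled parser, the options listed in the `parse_v2` help text map directly onto an argument list. The following is a sketch using only those flags. `build_parse_v2_command` is a hypothetical helper, the default binary path `./parse_v2.exe` is taken from the example above, and passing the log level by name is an assumption based on the `[error;warning;success;info]` hint.

```python
from typing import List

# Log levels as listed in the help text.
LOG_LEVELS = ("error", "warning", "success", "info")


def build_parse_v2_command(
    input_pdf: str,
    output_file: str,
    page: int = -1,  # -1 means "all pages", per the help text
    loglevel: str = "error",
    binary: str = "./parse_v2.exe",
) -> List[str]:
    """Assemble an argv list for the parse_v2 executable."""
    if loglevel not in LOG_LEVELS:
        raise ValueError(f"loglevel must be one of {LOG_LEVELS}, got {loglevel!r}")
    if page < -1:
        raise ValueError("page must be -1 (all pages) or a non-negative index")
    return [
        binary,
        "-i", input_pdf,
        "-p", str(page),
        "-o", output_file,
        "-l", loglevel,
    ]
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from inside the build folder.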

Python

To build the package, simply run (make sure poetry is installed),

poetry build

To test the package, run:

poetry run pytest ./tests -v -s

Contributing

Please read Contributing to Docling Parse for details.

References

If you use Docling in your projects, please consider citing the following:

@techreport{Docling,
  author = {Deep Search Team},
  month = {8},
  title = {Docling Technical Report},
  url = {https://arxiv.org/abs/2408.09869},
  eprint = {2408.09869},
  doi = {10.48550/arXiv.2408.09869},
  version = {1.0.0},
  year = {2024}
}

License

The Docling Parse codebase is under MIT license. For individual model usage, please refer to the model licenses found in the original packages.

