Simple package to extract text with coordinates from programmatic PDFs

Docling Parse

Simple package to extract text, paths and bitmap images with coordinates from programmatic PDFs. This package is used in the Docling PDF conversion.

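Version | Original     | Word-level       | Snippet-level       | Performance
--------|--------------|------------------|---------------------|------------------------------------------
V1      | [screenshot] | Not supported    | [image: v1 snippet] | ~0.250 sec/page
V2      | (as above)   | [image: v1 word] | [image: v2 snippet] | ~0.050 sec/page (~5-10X faster than v1)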
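
The two timings in the table can be compared with a little arithmetic. The sketch below assumes the figures are seconds per page, which is the reading under which v2 comes out faster than v1. The function names are illustrative and not part of docling-parse.

```python
def pages_per_second(seconds_per_page: float) -> float:
    """Convert a per-page latency into a throughput."""
    return 1.0 / seconds_per_page


def speedup(old_seconds_per_page: float, new_seconds_per_page: float) -> float:
    """Return how many times faster the new parser is than the old one."""
    return old_seconds_per_page / new_seconds_per_page


V1_SEC_PER_PAGE = 0.250  # approximate v1 figure from the table
V2_SEC_PER_PAGE = 0.050  # approximate v2 figure from the table

if __name__ == "__main__":
    print(f"v1: {pages_per_second(V1_SEC_PER_PAGE):.0f} pages/sec")
    print(f"v2: {pages_per_second(V2_SEC_PER_PAGE):.0f} pages/sec")
    print(f"speedup: {speedup(V1_SEC_PER_PAGE, V2_SEC_PER_PAGE):.1f}x")
```

With these approximate figures the ratio is about 5x, the lower end of the quoted 5-10X range.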

Quick start

Install the package from PyPI:

pip install docling-parse

Convert a PDF (see visualise.py for a more detailed example):

from docling_parse.docling_parse import pdf_parser_v2

# Do this only once to load fonts (avoid initialising it many times)
parser = pdf_parser_v2()

# parser.set_loglevel(1) # 1=error, 2=warning, 3=success, 4=info

doc_file = "my-doc.pdf" # filename
doc_key = f"key={doc_file}" # unique document key (eg hash, UUID, etc)

# Load the document from file using filename doc_file. This only loads
# the QPDF document, but no extracted data
success = parser.load_document(doc_key, doc_file)

# Alternatively, load the document from memory (requires `import io`).
# Open the file in binary mode and read its contents
# with open(doc_file, "rb") as file:
#      file_content = file.read()

# Create a BytesIO object from the file contents
# bytes_io = io.BytesIO(file_content)
# success = parser.load_document_from_bytesio(doc_key, bytes_io)

# Parse the entire document in one go. This is easier, but can require
# a lot more memory than parsing page by page
# json_doc = parser.parse_pdf_from_key(doc_key)

# Get number of pages
num_pages = parser.number_of_pages(doc_key)

# Parse page by page to minimize memory footprint
for page in range(0, num_pages):

    # Internal memory for the page is freed automatically after this call,
    # so there is no need to unload a specific page.
    json_doc = parser.parse_pdf_from_key_on_page(doc_key, page)

    if "pages" not in json_doc:  # page could not be parsed
        continue

    # The parsed page is the first entry.
    json_page = json_doc["pages"][0]

    # <Insert your own code>

# Unload the (QPDF) document and buffers
parser.unload_document(doc_key)

# Unloads everything at once
# parser.unload_documents()

Use the CLI

$ docling-parse -h
usage: docling-parse [-h] -p PDF

Process a PDF file.

options:
  -h, --help         show this help message and exit
  -p PDF, --pdf PDF  Path to the PDF file
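
The CLI can also be driven from a script. The following sketch uses only the `-p` flag shown in the help text; the function names are hypothetical, and since the format of the CLI's output is not described here, the output is simply captured as text.

```python
import subprocess
from pathlib import Path
from typing import List, Union


def docling_parse_command(pdf_path: Union[str, Path]) -> List[str]:
    """Build the argument list for `docling-parse -p PDF`."""
    return ["docling-parse", "-p", str(pdf_path)]


def run_docling_parse(pdf_path: Union[str, Path]) -> subprocess.CompletedProcess:
    """Run the CLI and capture its output.

    Raises subprocess.CalledProcessError if the command exits with a
    non-zero status, and FileNotFoundError if docling-parse is not on PATH.
    """
    return subprocess.run(
        docling_parse_command(pdf_path),
        check=True,
        capture_output=True,
        text=True,
    )


# Usage (requires docling-parse to be installed):
# result = run_docling_parse("my-doc.pdf")
# print(result.stdout)
```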

Performance Benchmarks

We ran the v1 and v2 parsers on DocLayNet and found the following overall behavior:

[Figure: parser-performance]

Development

CXX

To build the parser, run the following command in the root folder:

rm -rf build; cmake -B ./build; cd build; make

You can run the parser from your build folder. Example from parse_v1,

% ./parse_v1.exe -h
A program to process PDF files or configuration files
Usage:
  PDFProcessor [OPTION...]

  -i, --input arg          Input PDF file
  -c, --config arg         Config file
      --create-config arg  Create config file
  -o, --output arg         Output file
  -l, --loglevel arg       loglevel [error;warning;success;info]
  -h, --help               Print usage

Example from parse_v2,

% ./parse_v2.exe -h
program to process PDF files or configuration files
Usage:
  PDFProcessor [OPTION...]

  -i, --input arg          Input PDF file
  -c, --config arg         Config file
      --create-config arg  Create config file
  -p, --page arg           Pages to process (default: -1 for all) (default:
                           -1)
  -o, --output arg         Output file
  -l, --loglevel arg       loglevel [error;warning;success;info]
  -h, --help               Print usage

If you don't have an input file, a template input file is printed to the terminal.
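
For scripted runs, the parse_v2 flags listed in the help text can be assembled into an argument list. This is a sketch under assumptions: `parse_v2_command` is a hypothetical helper, the default executable path assumes you run it from the build folder, and `-p` is only passed when a specific page is requested so that the documented default (-1, all pages) applies otherwise.

```python
from typing import List, Optional

# Log levels as listed in the help text.
LOG_LEVELS = ("error", "warning", "success", "info")


def parse_v2_command(
    input_pdf: str,
    output_file: Optional[str] = None,
    page: int = -1,
    loglevel: Optional[str] = None,
    executable: str = "./parse_v2.exe",
) -> List[str]:
    """Build an argument list for parse_v2 from the documented flags."""
    if loglevel is not None and loglevel not in LOG_LEVELS:
        raise ValueError(f"loglevel must be one of {LOG_LEVELS}, got {loglevel!r}")

    cmd = [executable, "-i", input_pdf]
    if page >= 0:  # omit -p to keep the default of all pages
        cmd += ["-p", str(page)]
    if output_file is not None:
        cmd += ["-o", output_file]
    if loglevel is not None:
        cmd += ["-l", loglevel]
    return cmd
```

The resulting list can be passed to `subprocess.run` from the build folder.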

Python

To build the package, simply run (make sure poetry is installed),

poetry build

To test the package, run:

poetry run pytest ./tests -v -s

Contributing

Please read Contributing to Docling Parse for details.

References

If you use Docling in your projects, please consider citing the following:

@techreport{Docling,
  author = {Deep Search Team},
  month = {8},
  title = {Docling Technical Report},
  url = {https://arxiv.org/abs/2408.09869},
  eprint = {2408.09869},
  doi = {10.48550/arXiv.2408.09869},
  version = {1.0.0},
  year = {2024}
}

License

The Docling Parse codebase is under MIT license. For individual model usage, please refer to the model licenses found in the original packages.

