This package enables inference of header hierarchy in the docling PDF parsing pipeline.

docling-hierarchical-pdf

What it does:

Docling currently does not support the extraction of header hierarchies from PDF documents. This package attempts to infer the hierarchy of headings from a few simple rules and then corrects the hierarchy of the docling document accordingly.

Inference

The rules are:

  • Numbering-based: Attempt to infer the hierarchy from the heading numbering. Arabic and Roman numbering are supported, as well as outline numbering using letters.
  • Style-based: If the above fails, try to infer the hierarchy from the font size and style (bold / italic) of the headings.
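
The two rules can be illustrated with a minimal sketch. It is not the package's implementation: the names `infer_level_from_numbering`, `StyledHeading` and `infer_levels_from_style` are hypothetical, the numbering rule covers only dotted Arabic numbering (Roman and letter numbering are left out), and the style ranking (larger font first, then bold, then italic) is an assumption about one plausible ordering.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical pattern: dotted Arabic numbering at the start of a heading,
# e.g. "1", "1.3" or "1.3.2", optionally followed by a trailing dot.
_NUMBERING = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+\S")


def infer_level_from_numbering(heading: str) -> Optional[int]:
    """Return the depth implied by the numbering, or None if there is none."""
    match = _NUMBERING.match(heading)
    if match is None:
        return None
    return len(match.group(1).split("."))


@dataclass(frozen=True)
class StyledHeading:
    """Illustrative container for the style attributes of a heading."""
    text: str
    font_size: float
    bold: bool = False
    italic: bool = False


def infer_levels_from_style(headings: List[StyledHeading]) -> List[int]:
    """Assign level 1 to the most prominent style, level 2 to the next, etc."""
    # Assumed ordering: larger font first, then bold, then italic.
    styles = sorted(
        {(h.font_size, h.bold, h.italic) for h in headings}, reverse=True
    )
    rank = {style: i + 1 for i, style in enumerate(styles)}
    return [rank[(h.font_size, h.bold, h.italic)] for h in headings]


print(infer_level_from_numbering("1.3.2  Allgemeines"))     # 3
print(infer_level_from_numbering("Abkürzungsverzeichnis"))  # None
```

In this sketch a heading without numbering yields `None`, which is the point at which a style-based fallback would take over.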

Results are as follows:

Header hierarchy before reconstruction:

Richtlinie 10-00
Einfuhrzollveranlagungsverfahren
Abkürzungsverzeichnis
1  Veranlagungsschritte im Zollveranlagungsverfahren
Ablaufschema Zollveranlagungsverfahren:
1.1  Zuführen
1.2  Zollüberwachung und Zollprüfung
1.3  Gestellen und summarisches Anmelden
1.3.1  Allgemeines
1.3.2  Form der summarischen Anmeldung
1.3.3  Manipulationen
...

After reconstruction:

  Richtlinie 10-00
  Einfuhrzollveranlagungsverfahren
  Abkürzungsverzeichnis
  1  Veranlagungsschritte im Zollveranlagungsverfahren
    Ablaufschema Zollveranlagungsverfahren:
    1.1  Zuführen
    1.2  Zollüberwachung und Zollprüfung
    1.3  Gestellen und summarisches Anmelden
      1.3.1  Allgemeines
      1.3.2  Form der summarischen Anmeldung
      1.3.3  Manipulationen
      ...

Applying the hierarchy

The current solution reorders the hierarchy tree of document items according to the inference results:

  • Headings are arranged into the parent/child relationships inferred from the heading hierarchy.
  • Each heading is assigned its inferred heading level (the level attribute of SectionHeaderItem).
  • Any items (except for furniture) that follow a heading become children of that heading.
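
The steps above can be sketched with a small stack-based tree builder. This is an illustration only and does not use docling's data model: `Node`, `build_tree` and `outline` are hypothetical names, headings are given as `(level, text)` pairs with levels starting at 1, and non-heading items are marked with a level of `None`.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Illustrative tree node; level is None for non-heading items."""
    text: str
    level: Optional[int]
    children: List["Node"] = field(default_factory=list)


def build_tree(items: List[Tuple[Optional[int], str]]) -> Node:
    """Nest items in reading order under the heading they belong to."""
    root = Node(text="<root>", level=0)
    stack = [root]  # path from the root to the most recent heading
    for level, text in items:
        if level is None:
            # Non-heading item: becomes a child of the last heading.
            stack[-1].children.append(Node(text=text, level=None))
            continue
        # Pop until the top of the stack is a shallower heading (the parent).
        while len(stack) > 1 and stack[-1].level >= level:
            stack.pop()
        node = Node(text=text, level=level)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def outline(node: Node, depth: int = 0) -> List[str]:
    """Render the tree as indented lines, two spaces per nesting depth."""
    lines = []
    for child in node.children:
        lines.append("  " * depth + child.text)
        lines.extend(outline(child, depth + 1))
    return lines


items = [
    (1, "1  Veranlagungsschritte im Zollveranlagungsverfahren"),
    (2, "1.3  Gestellen und summarisches Anmelden"),
    (3, "1.3.1  Allgemeines"),
    (None, "(paragraph text)"),
]
print("\n".join(outline(build_tree(items))))
```

The stack always holds the chain of open headings, so each new heading only needs to find the nearest shallower one to know its parent.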

Verification

The current solution has been tested on 60+ text-based PDF documents using the docling DocumentConverter with default parameters and gave satisfactory results. To test performance on a public dataset, 20+ documents from the HRDoc dataset were also tested. This dataset is image-based, so the default VLM pipeline of docling was used. Performance was worse than on text-based PDFs and was limited by the quality of docling's VLM parsing.

Limitations

  • The proposed solution uses the ConversionResult object rather than the DoclingDocument it produces, because the DoclingDocument does not contain information on the font style of text-based PDFs, which is present in the ConversionResult. The more information is available, the better the inference result.
  • The solution relies entirely on docling parsing: if docling does not identify a header, this postprocessing cannot recover it. Docling does, however, perform well on text-based PDFs.
  • The proposed solution does not yet take TOC bookmarks into account, but I am planning to integrate that soon.
  • The proposed solution has not yet been evaluated on the full HRDoc dataset, but I am planning to do this soon.

How to use it:

Install it:

pip install XXXX

Use it:

from docling.document_converter import DocumentConverter
from hierarchical.postprocessor import ResultPostprocessor

source = "my_file.pdf"  # local path or URL of the document
converter = DocumentConverter()
result = converter.convert(source)
# The postprocessor modifies result.document in place.
ResultPostprocessor(result).process()

# Export the document with the reconstructed heading hierarchy.
print(result.document.export_to_markdown())

Citation

If you use this software in your project, please cite Docling as well as the following:

@software{docling_hierarchical,
  author = {Roman, Kayan},
  month = {09},
  title = {{docling-hierarchical-pdf}},
  url = {https://github.com/krrome/docling-hierarchical-pdf},
  version = {0.0.1},
  year = {2025}
}

Repository initiated with fpgmaas/cookiecutter-uv.
