docling-hierarchical-pdf

This package enables inference of header hierarchy in the docling PDF parsing pipeline.

GitHub repository: https://github.com/krrome/docling-hierarchical-pdf/
Documentation: https://krrome.github.io/docling-hierarchical-pdf/

What it does:

Docling currently does not support extracting header hierarchies from PDF documents. This package infers the heading hierarchy from a few simple rules and then corrects the hierarchy of the docling Document accordingly.

Import from bookmarks (PDF-metadata)

This package uses pymupdf to try to extract the table of contents (TOC) from the PDF bookmarks. If that succeeds, the headings and texts in a Docling result are corrected to match the structure in the PDF metadata. The code therefore not only corrects the hierarchy levels of section headings that docling parsed correctly; it also makes a best-effort attempt to convert headings that docling missed into headings, and text that docling wrongly marked as a heading back into text.
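To illustrate the kind of structure that bookmarks provide: PyMuPDF's Document.get_toc() returns a flat list of [level, title, page] entries. The sketch below nests such entries into a tree. It is not this package's implementation; TocNode and build_toc_tree are hypothetical names, and the sample entries and page numbers are made up.

```python
from dataclasses import dataclass, field


@dataclass
class TocNode:
    title: str
    page: int
    children: list["TocNode"] = field(default_factory=list)


def build_toc_tree(entries):
    """Nest flat (level, title, page) entries; levels start at 1."""
    root = TocNode(title="<root>", page=0)
    stack = [root]  # stack[i] is the most recent node at depth i
    for level, title, page in entries:
        node = TocNode(title, page)
        # Cut the stack back to the parent depth. The clamp makes an entry
        # that skips a level attach to the deepest node that is available.
        del stack[max(1, min(level, len(stack))):]
        stack[-1].children.append(node)
        stack.append(node)
    return root


# Made-up entries in the shape returned by get_toc().
toc = [(1, "1 Intro", 1), (2, "1.1 Scope", 2), (2, "1.2 Terms", 3), (1, "2 Method", 5)]
tree = build_toc_tree(toc)
```

With real data, the entries would come from something like pymupdf.open(path).get_toc() instead of the hard-coded list.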

Stylistic inference

The rules are:

Numbering-based: Attempt to infer the hierarchy from the heading numbering. Arabic and Roman numbering are supported, as well as outline numbering using letters.
Style-based: If the above fails, try to infer the hierarchy of the headings from font size and style (bold / italic).
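The two rules can be sketched roughly as follows. This is a simplified illustration, not the package's actual code: level_from_numbering and levels_from_font_sizes are hypothetical helpers, only dotted Arabic numbering is handled (Roman numerals and letters are left out), and the style rule looks at font size only.

```python
import re

# A dotted Arabic number such as "1", "1.3" or "1.3.1" followed by the heading text.
_NUMBERED = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+\S")


def level_from_numbering(heading):
    """Return the depth of the number prefix, or None if there is none."""
    match = _NUMBERED.match(heading)
    if match is None:
        return None
    return match.group(1).count(".") + 1


def levels_from_font_sizes(sizes):
    """Fallback: rank the distinct font sizes, the largest becomes level 1."""
    ranked = sorted(set(sizes), reverse=True)
    return [ranked.index(size) + 1 for size in sizes]


numbered_levels = [
    level_from_numbering(text)
    for text in ["1  Veranlagungsschritte", "1.3  Gestellen", "1.3.1  Allgemeines"]
]  # [1, 2, 3]
style_levels = levels_from_font_sizes([18.0, 14.0, 14.0, 12.0])  # [1, 2, 2, 3]
```

A prefix-based rule like this also matches headings that merely start with a number, which is one reason a style-based fallback is useful.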

Results are as follows:

Header hierarchy before reconstruction:

Richtlinie 10-00
Einfuhrzollveranlagungsverfahren
Abkürzungsverzeichnis
1  Veranlagungsschritte im Zollveranlagungsverfahren
Ablaufschema Zollveranlagungsverfahren:
1.1  Zuführen
1.2  Zollüberwachung und Zollprüfung
1.3  Gestellen und summarisches Anmelden
1.3.1  Allgemeines
1.3.2  Form der summarischen Anmeldung
1.3.3  Manipulationen
...

After reconstruction:

  Richtlinie 10-00
  Einfuhrzollveranlagungsverfahren
  Abkürzungsverzeichnis
  1  Veranlagungsschritte im Zollveranlagungsverfahren
    Ablaufschema Zollveranlagungsverfahren:
    1.1  Zuführen
    1.2  Zollüberwachung und Zollprüfung
    1.3  Gestellen und summarisches Anmelden
      1.3.1  Allgemeines
      1.3.2  Form der summarischen Anmeldung
      1.3.3  Manipulationen
      ...

Applying the hierarchy

The current solution reorders the hierarchy tree of document items according to the inference results:

Headings are sorted into parent/child relationships as inferred from the heading hierarchy.
Headings are assigned the inferred heading level (the level attribute of SectionHeaderItem).
Any items (except for furniture) that follow a heading become children of the most recent heading.
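The three steps above can be sketched with a stack of open headings. This is a minimal illustration on plain tuples, not the package's implementation and not the docling data model; assign_parents and the (kind, level, text) item shape are assumptions made for the example.

```python
def assign_parents(items):
    """Return, for each (kind, level, text) item, the index of its parent or None."""
    parents = []
    open_headings = []  # indices of headings still open, outermost first
    for index, (kind, level, _text) in enumerate(items):
        if kind == "furniture":
            parents.append(None)  # furniture stays outside the heading tree
            continue
        if kind == "heading":
            # A new heading closes every open heading at the same or a deeper level.
            while open_headings and items[open_headings[-1]][1] >= level:
                open_headings.pop()
        parents.append(open_headings[-1] if open_headings else None)
        if kind == "heading":
            open_headings.append(index)
    return parents


items = [
    ("heading", 1, "1  Veranlagungsschritte im Zollveranlagungsverfahren"),
    ("text", None, "Ablaufschema Zollveranlagungsverfahren:"),
    ("heading", 2, "1.1  Zuführen"),
    ("text", None, "body text"),
    ("heading", 2, "1.2  Zollüberwachung und Zollprüfung"),
    ("heading", 1, "2  hypothetical next chapter"),
]
parents = assign_parents(items)  # [None, 0, 0, 2, 0, None]
```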

Verification

The current solution has been tested on 60+ text-based PDF documents using the docling DocumentConverter with default parameters and gave satisfactory results. To test the performance on a public dataset, 20+ documents from the HRDoc dataset were also tested. Because this dataset is image-based, the default VLM pipeline of docling was used. Performance was inferior to that on text-based PDFs and was limited by the quality of docling's VLM parsing.

Limitations

The proposed solution uses the ConversionResult object rather than the DoclingDocument it produces, because DoclingDocument does not contain information on the font style of text-based PDFs, which is present in the ConversionResult. The more information is available, the better the inference result.
The solution relies entirely on docling parsing: if docling does not identify a header, this postprocessing cannot recover it. Docling does pretty well for text-based PDFs, though.
The proposed solution has not yet been evaluated on the full HRDoc dataset.

How to use it:

Install it:

pip install docling-hierarchical-pdf

Use it:

from docling.document_converter import DocumentConverter
from hierarchical.postprocessor import ResultPostprocessor

source = "my_file.pdf"  # local path or URL of the document
converter = DocumentConverter()
result = converter.convert(source)
# the postprocessor modifies the result.document in place.
ResultPostprocessor(result).process()

# enjoy the reordered document - for example convert it to markdown
result.document.export_to_markdown()

# or use a chunker on it...

or for the VLM-pipeline:

from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline
from hierarchical.postprocessor import ResultPostprocessor

source = "my_scanned.pdf"  # local path or URL of the document

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
        ),
    }
)
result = converter.convert(source=source)
ResultPostprocessor(result).process()

# enjoy the reordered document - for example convert it to markdown
result.document.export_to_markdown()

# or use a chunker on it...

FAQ

Working with DocumentStream sources / PDFFileNotFoundException:

If you run into a PDFFileNotFoundException, the source you passed to DocumentConverter().convert(source=source) was of type str or DocumentStream, and the Docling conversion result unfortunately no longer holds a valid reference to the source file. The postprocessor therefore needs your help: if source was a string, pass source=source when instantiating ResultPostprocessor. Full example:

from docling.document_converter import DocumentConverter
from hierarchical.postprocessor import ResultPostprocessor

source = "my_file.pdf"  # local path or URL of the document
converter = DocumentConverter()
result = converter.convert(source)
# the postprocessor modifies the result.document in place.
ResultPostprocessor(result, source=source).process()
# ...

If you used a DocumentStream object as source, you unfortunately have to pass one of the following as the source argument to ResultPostprocessor: a valid Path to the PDF, or a new, open BytesIO stream or DocumentStream object. The reason is that docling closes the source stream when it is finished, so that stream can no longer be read.
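One way to have a new, open stream at hand is to keep the raw PDF bytes and wrap them in a fresh BytesIO when needed. The sketch below uses only the standard library to show why the original stream cannot be reused. The placeholder bytes stand in for a real PDF, and the close() call merely simulates what docling does after conversion.

```python
from io import BytesIO

pdf_bytes = b"%PDF-1.4 placeholder"  # stand-in for the real file content

first_stream = BytesIO(pdf_bytes)
first_stream.read()
first_stream.close()  # simulates docling closing the stream after conversion

try:
    first_stream.read()
    reread_ok = True
except ValueError:
    # Reading from a closed stream raises ValueError.
    reread_ok = False

# A fresh stream built from the retained bytes can be read again.
# It could then be passed on, e.g. ResultPostprocessor(result, source=fresh_stream).
fresh_stream = BytesIO(pdf_bytes)
```

Keeping the bytes costs memory for large files; passing a Path to the PDF avoids that when the file exists on disk.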

Exception handling for ToC extraction from metadata:

If you want to handle exceptions regarding file I/O or streams yourself, set raise_on_error to True when instantiating ResultPostprocessor.

Citation

If you use this software for your project please cite Docling as well as the following:

@software{docling_hierarchical,
  author = {Roman, Kayan},
  month = {09},
  title = {{docling-hierarchical-pdf}},
  url = {https://github.com/krrome/docling-hierarchical-pdf},
  version = {0.0.1},
  year = {2025}
}

Repository initiated with fpgmaas/cookiecutter-uv.
