Open-source document indexing library for building hierarchical trees from PDF and Markdown.

doctr

Deterministic document indexing library with PageIndex-style tree output:

title
node_id
start_index
end_index
summary
nodes
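
To make the node shape concrete, here is a minimal sketch that builds a small tree by hand from the six fields above and prints an indented outline. It does not import doctr. All sample values are invented, and reading start_index/end_index as an inclusive page range is an assumption rather than something this README states.

```python
from typing import Any

# Hand-written sample in the shape listed above. Every value is invented for
# illustration; start_index/end_index are ASSUMED to be inclusive page numbers.
sample_tree: dict[str, Any] = {
    "title": "Annual Report",
    "node_id": "0000",
    "start_index": 1,
    "end_index": 40,
    "summary": "Whole document.",
    "nodes": [
        {
            "title": "Overview",
            "node_id": "0001",
            "start_index": 1,
            "end_index": 9,
            "summary": "Introduction and scope.",
            "nodes": [],
        },
        {
            "title": "Risks",
            "node_id": "0002",
            "start_index": 10,
            "end_index": 40,
            "summary": "Risk discussion.",
            "nodes": [
                {
                    "title": "Market Risk",
                    "node_id": "0003",
                    "start_index": 10,
                    "end_index": 22,
                    "summary": "Exposure to price moves.",
                    "nodes": [],
                },
            ],
        },
    ],
}


def outline(node: dict[str, Any], depth: int = 0) -> list[str]:
    """Return one indented line per node, parents before children."""
    line = (
        f"{'  ' * depth}[{node['node_id']}] {node['title']} "
        f"(pp. {node['start_index']}-{node['end_index']})"
    )
    lines = [line]
    for child in node.get("nodes", []):
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(sample_tree)))
```

Because every node carries the same six keys, a single recursive function is enough to traverse the whole tree.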

Works for pdf, docx, xlsx/xlsm, md/markdown, txt, msg, plus optional embedded-file recursion.

Package Name

PyPI distribution name: doctr-index
Python import name: doctr

Compatibility:

Legacy imports from pdfindexing are still supported via shim modules.

Install

Core (editable install from a source checkout, with the dev extra):

pip install -e '.[dev]'

From PyPI:

pip install doctr-index

With Office (docx, xlsx, xlsm):

pip install -e '.[office]'

With Docling:

pip install -e '.[docling]'

With OCR provider adapter:

pip install -e '.[ocr]'

Architecture (Separate Methods)

Document input (pdf/docx/xlsx/...)
Docling conversion (layout + OCR + reading order + tables)
Tree index builder
Retrieval/chat layer

Use DocumentPipeline when you want these stages explicitly separated.

Quick Start

from doctr import index_document

idx = index_document(
    "/path/to/report.pdf",
    prefer_toc_hierarchy=True,
    include_embedded=True,
    max_embedded_depth=2,
)

tree = idx.to_pageindex_dict(include_empty_nodes=False)
print(tree["nodes"][0])

Full Public API (Every Export) with Usage

1) `index_document(...)`

Single entrypoint for supported local files.

from doctr import index_document

idx = index_document(
    "/path/to/file.pdf",
    max_toc_pages=20,
    prefer_toc_hierarchy=False,
    summary_max_chars=None,   # full summaries
    include_embedded=False,
    max_embedded_depth=2,
)

2) `index_pdf_file(path, ...)`

PDF-specific indexing.

from doctr import index_pdf_file

idx = index_pdf_file("/path/to/file.pdf", prefer_toc_hierarchy=True)

3) `index_docx_file(path)`

DOCX indexing from heading styles and paragraph sections.

from doctr import index_docx_file

idx = index_docx_file("/path/to/file.docx")

4) `index_xlsx_file(path)`

XLSX/XLSM indexing by sheets and row chunks.

from doctr import index_xlsx_file

idx = index_xlsx_file("/path/to/file.xlsx")

5) `index_markdown_file(path)`

Markdown indexing from a file on disk.

from doctr import index_markdown_file

idx = index_markdown_file("/path/to/file.md")

6) `index_markdown_text(markdown)`

Markdown indexing from an in-memory string.

from doctr import index_markdown_text

idx = index_markdown_text("# Root\n## Child\n")
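
For intuition about how a heading-based tree can be derived from Markdown, the sketch below uses a stack of open headings. It is an illustration only and is not doctr's implementation: the function name headings_to_tree and the node keys other than title and nodes are hypothetical, and it recognizes only ATX headings (lines starting with #), ignoring code fences and Setext headings.

```python
import re

# ATX heading: 1-6 '#' characters, whitespace, then the title text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def headings_to_tree(markdown: str) -> list[dict]:
    """Nest Markdown headings by level (hypothetical helper, not doctr API)."""
    roots: list[dict] = []
    # Each stack entry is (heading level, node) for a still-open ancestor.
    stack: list[tuple[int, dict]] = []
    for line_no, line in enumerate(markdown.splitlines(), start=1):
        match = HEADING.match(line)
        if not match:
            continue
        level = len(match.group(1))
        node = {"title": match.group(2), "line": line_no, "nodes": []}
        # Close every open heading at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        if stack:
            stack[-1][1]["nodes"].append(node)  # child of nearest shallower heading
        else:
            roots.append(node)  # no open ancestor: top-level node
        stack.append((level, node))
    return roots


tree = headings_to_tree("# Root\n## Child\n")
print(tree[0]["title"], "->", tree[0]["nodes"][0]["title"])
```

A skipped level (for example a ### directly under a #) is attached to the nearest shallower heading instead of being rejected.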

7) `DocumentIndexer`

Class API with normal indexing + OCR indexing methods.

from doctr import DocumentIndexer

indexer = DocumentIndexer()
idx = indexer.index_document("/path/to/file.pdf", include_embedded=True)

8) `DocumentIndexer.index_with_ocr(...)`

Index from an OCR payload passed directly, or from a configured OCR provider.

from doctr import DocumentIndexer

ocr_payload = {
    "doc_id": "pi-abc123",
    "status": "completed",
    "result": [
        {
            "title": "Financial Stability",
            "page_index": 21,
            "text": "Section text...",
            "nodes": []
        }
    ]
}

idx = DocumentIndexer().index_with_ocr("/path/to/scanned.pdf", ocr_payload=ocr_payload)
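
Before handing a payload to the indexer, it can help to inspect its nesting. The sketch below walks the result list of a payload shaped like the example above and yields one tuple per node. It relies only on the keys shown in that example (title, page_index, nodes); the child node "Liquidity" is invented to demonstrate nesting, and iter_ocr_nodes is a hypothetical helper, not part of doctr.

```python
from typing import Any, Iterator


def iter_ocr_nodes(
    nodes: list[dict[str, Any]], depth: int = 0
) -> Iterator[tuple[int, str, int]]:
    """Yield (depth, title, page_index) for each node, parents first."""
    for node in nodes:
        yield depth, node["title"], node["page_index"]
        yield from iter_ocr_nodes(node.get("nodes", []), depth + 1)


# Same shape as the README example; the nested child is invented for illustration.
ocr_payload = {
    "doc_id": "pi-abc123",
    "status": "completed",
    "result": [
        {
            "title": "Financial Stability",
            "page_index": 21,
            "text": "Section text...",
            "nodes": [
                {
                    "title": "Liquidity",
                    "page_index": 22,
                    "text": "Subsection text...",
                    "nodes": [],
                }
            ],
        }
    ],
}

for depth, title, page in iter_ocr_nodes(ocr_payload["result"]):
    print(f"{'  ' * depth}{title} (page_index={page})")
```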

9) `document_index_from_ocr_payload(payload, source=...)`

Standalone converter from OCR node payload to DocumentIndex.

from doctr import document_index_from_ocr_payload

# ocr_payload is the dict defined in the index_with_ocr example above.
idx = document_index_from_ocr_payload(ocr_payload, source="/path/to/scanned.pdf")

10) `PageIndexOCRProvider`

Optional remote OCR adapter.

from doctr import PageIndexOCRProvider, DocumentIndexer

provider = PageIndexOCRProvider(api_key="YOUR_KEY", base_url="https://api.pageindex.ai")
indexer = DocumentIndexer(ocr_provider=provider)
idx = indexer.index_with_ocr("/path/to/scanned.pdf")

11) `DocumentPipeline`

Stage-based API.

from doctr import DocumentPipeline

pipeline = DocumentPipeline()

# Stage 1
inp = pipeline.document_input("/path/to/report.pdf")

# Stage 2
converted = pipeline.docling_conversion("/path/to/report.pdf")

# Stage 3
idx = pipeline.build_tree_index(converted=converted)

# Stage 4
context = pipeline.retrieve_for_chat(idx, "What are major risks?", top_k=6)

12) `DoclingConverterAdapter`

Direct Docling conversion wrapper.

from doctr import DoclingConverterAdapter

adapter = DoclingConverterAdapter()
converted = adapter.convert("/path/to/file.pdf")
print(converted.markdown[:200])

13) `ConvertedDocument`

Return type from DoclingConverterAdapter.convert.

from doctr import ConvertedDocument

obj = ConvertedDocument(
    source_path="/tmp/a.pdf",
    markdown="# Parsed document",
    metadata={"converter": "docling"},
)

14) `retrieve_context(index_payload, question, top_k=6)`

Context retrieval helper for chat prompts.

from doctr import retrieve_context, index_document

idx = index_document("/path/to/file.pdf")
payload = idx.to_pageindex_dict(include_empty_nodes=False)
ctx = retrieve_context(payload, "What changed in monetary policy?", top_k=8)
print(ctx)
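
This README does not describe how retrieve_context ranks nodes. As a stand-in for the general idea of selecting top_k nodes for a question, the sketch below scores each node by word overlap between the question and the node's title plus summary. The technique, the rank_nodes name, and the sample tree are all assumptions for illustration and should not be read as the library's algorithm.

```python
import re
from typing import Any, Iterator


def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric word set; no stop-word removal or stemming."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def flatten(node: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield a node and all of its descendants in document order."""
    yield node
    for child in node.get("nodes", []):
        yield from flatten(child)


def rank_nodes(tree: dict[str, Any], question: str, top_k: int = 6) -> list[dict[str, Any]]:
    """Return up to top_k nodes sharing at least one word with the question."""
    query_terms = tokenize(question)
    scored = []
    for node in flatten(tree):
        terms = tokenize(f"{node.get('title', '')} {node.get('summary', '')}")
        score = len(query_terms & terms)
        if score:
            scored.append((score, node))
    # list.sort is stable, so equal scores keep document order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [node for _, node in scored[:top_k]]


# Invented sample tree with a top-level "nodes" list.
sample = {
    "nodes": [
        {"title": "Monetary Policy", "summary": "Rate changes in 2024.", "nodes": []},
        {
            "title": "Staffing",
            "summary": "Hiring plans.",
            "nodes": [{"title": "Policy on leave", "summary": "", "nodes": []}],
        },
    ]
}

hits = rank_nodes(sample, "What changed in monetary policy?", top_k=2)
print([node["title"] for node in hits])
```

Plain word overlap also counts common words such as "in", so a real ranking would likely need stop-word handling or a stronger scoring model.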

15) `DocumentIndex` model

Primary output object from indexers.

from doctr import index_document

idx = index_document("/path/to/file.pdf")
print(idx.to_dict())
print(idx.to_pageindex_dict(include_empty_nodes=False))

16) `SectionNode` model

Useful for custom node construction.

from doctr import SectionNode

n = SectionNode(title="Section A", start_page=10, end_page=12, summary="...")

17) `IndexEnricher` protocol

Type contract for custom post-processing.

from doctr import index_document

def enrich(idx):
    for node in idx.nodes:
        if node.summary:
            node.summary = node.summary[:300]
    return idx

idx = index_document("/path/to/file.pdf", enricher=enrich)
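
The README does not show the definition of IndexEnricher. One plausible reading of "type contract", given that the example above passes a plain function that takes an index and returns it, is a callable protocol. The sketch below expresses that idea with typing.Protocol and stand-in classes; StubNode, StubIndex, EnricherLike, and apply_enricher are hypothetical names, not doctr exports.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class StubNode:
    """Stand-in for a tree node; only the fields used below."""
    title: str
    summary: str = ""


@dataclass
class StubIndex:
    """Stand-in for the index object passed to an enricher."""
    nodes: list[StubNode] = field(default_factory=list)


class EnricherLike(Protocol):
    """ASSUMED contract: a callable that receives an index and returns one."""

    def __call__(self, idx: StubIndex) -> StubIndex: ...


def truncate_summaries(idx: StubIndex) -> StubIndex:
    """Same logic as the README example: cap each summary at 300 characters."""
    for node in idx.nodes:
        if node.summary:
            node.summary = node.summary[:300]
    return idx


def apply_enricher(idx: StubIndex, enricher: EnricherLike) -> StubIndex:
    """Run a post-processing step; any matching callable satisfies the protocol."""
    return enricher(idx)


index = StubIndex([StubNode("A", "x" * 500), StubNode("B")])
result = apply_enricher(index, truncate_summaries)
print(len(result.nodes[0].summary))
```

With a structural protocol, plain functions and objects defining __call__ are both accepted by a type checker without inheriting from a base class.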

Embedded Files (Files within Files)

Supported best-effort extraction:

Office embeddings (*/embeddings/*) in docx/xlsx/xlsm/pptx
PDF file attachments

Enable it:

from doctr import index_document

idx = index_document(
    "/path/to/container.docx",
    include_embedded=True,
    max_embedded_depth=2,
)

Results:

Adds an Embedded Files branch to the tree
Writes extraction and indexing status to metadata["embedded_files"]
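
Office files (docx, xlsx, xlsm, pptx) are ZIP archives, and embedded objects sit under an embeddings directory such as word/embeddings/ or xl/embeddings/. The sketch below lists those members with the standard zipfile module. It shows only the */embeddings/* matching idea and is not doctr's extraction code; list_office_embeddings is a hypothetical helper, and the demo archive is a fake container built in memory rather than a valid Word document.

```python
import io
import zipfile
from pathlib import PurePosixPath


def list_office_embeddings(container) -> list[str]:
    """Return archive members that live inside an 'embeddings' directory.

    `container` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(container) as archive:
        return [
            name
            for name in archive.namelist()
            if not name.endswith("/")  # skip directory entries
            and "embeddings" in PurePosixPath(name).parts[:-1]
        ]


# Build a fake container in memory so the example runs without real files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<document/>")
    archive.writestr("word/embeddings/Microsoft_Excel_Sheet1.xlsx", b"placeholder")
buffer.seek(0)

print(list_office_embeddings(buffer))
```

Recursing into each extracted member, with a depth limit like max_embedded_depth, would be the natural next step for nested containers.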

CLI

doctr /path/to/file.pdf \
  --prefer-toc-hierarchy \
  --include-embedded \
  --max-embedded-depth 2 \
  --format pageindex \
  --output output_index.json

Examples

Chat over index + Sonar: chat_with_document.py
Local OCR from scratch (no API keys, no doctr import): local_ocr_tree_indexer.py
Docling 4-stage demo: docling_pipeline_demo.py

Notes

Normal indexing is local and does not require API keys.
OCR provider mode needs provider credentials.
nodes: [] on a node means it is a valid leaf node.
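
As a small illustration of the leaf rule in the last note, the sketch below collects every node whose nodes list is empty. The sample tree is invented and iter_leaves is a hypothetical helper that works on plain dicts, not on doctr objects.

```python
from typing import Any, Iterator


def iter_leaves(node: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield nodes with an empty (or missing) 'nodes' list, in document order."""
    children = node.get("nodes") or []
    if not children:
        yield node
        return
    for child in children:
        yield from iter_leaves(child)


# Invented sample: "A" and "B.1" are leaves, "Doc" and "B" are branches.
doc = {
    "title": "Doc",
    "nodes": [
        {"title": "A", "nodes": []},
        {"title": "B", "nodes": [{"title": "B.1", "nodes": []}]},
    ],
}

print([leaf["title"] for leaf in iter_leaves(doc)])
```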

Development

pytest -q

License

MIT

Release history

0.1.3: Mar 9, 2026
0.1.2: Mar 7, 2026
0.1.1: Mar 7, 2026 (this version)

