Open-source document indexing library for building hierarchical trees from PDF and Markdown.

These details have not been verified by PyPI

doctr

Deterministic multi-format document tree indexing for RAG and agent workflows.

Overview

doctr turns real-world documents into PageIndex-style tree nodes:

title
node_id
start_index
end_index
summary
nodes
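
The field list above can be pictured as a small recursive record. The sketch below is a stand-in written with only the standard library, not doctr's own class: the `TreeNode` name, the `walk` helper, and the field types (strings for `title`, `node_id`, and `summary`, integers for `start_index` and `end_index`) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class TreeNode:
    """Hypothetical stand-in for one PageIndex-style node (types are assumed)."""
    title: str
    node_id: str
    start_index: int
    end_index: int
    summary: str = ""
    nodes: List["TreeNode"] = field(default_factory=list)


def walk(node: TreeNode, depth: int = 0) -> Iterator[Tuple[int, TreeNode]]:
    """Yield (depth, node) pairs in document order (pre-order traversal)."""
    yield depth, node
    for child in node.nodes:
        yield from walk(child, depth + 1)


# Sample data invented for the example.
root = TreeNode(
    title="Annual Report",
    node_id="0000",
    start_index=1,
    end_index=40,
    nodes=[
        TreeNode(title="Risks", node_id="0001", start_index=5, end_index=12),
        TreeNode(title="Outlook", node_id="0002", start_index=13, end_index=40),
    ],
)

outline = [(depth, node.title) for depth, node in walk(root)]
for depth, title in outline:
    print("  " * depth + title)
```

A node whose `nodes` list is empty is a leaf, so the traversal needs no special case for it.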

Supported input formats:

pdf
docx
xlsx / xlsm
md / markdown
txt
msg (text fallback)
embedded files inside supported containers (best effort)
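
To make the format list concrete, here is a minimal sketch of extension-based format detection. It is not how doctr necessarily decides which indexer to use; the `EXTENSION_TO_FORMAT` table and `detect_format` helper are hypothetical and simply mirror the list above.

```python
from pathlib import Path
from typing import Optional

# Hypothetical lookup table built from the supported-format list above.
EXTENSION_TO_FORMAT = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".xlsx": "xlsx",
    ".xlsm": "xlsx",
    ".md": "markdown",
    ".markdown": "markdown",
    ".txt": "txt",
    ".msg": "msg",
}


def detect_format(path: str) -> Optional[str]:
    """Return a format label for a path, or None if the extension is not listed."""
    return EXTENSION_TO_FORMAT.get(Path(path).suffix.lower())


print(detect_format("/path/to/Report.PDF"))
print(detect_format("/path/to/image.png"))
```

Lower-casing the suffix keeps the lookup insensitive to how the file was named.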

Package Names

PyPI distribution: doctr-index
Python import: doctr
Legacy compatibility import: pdfindexing (shim)
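
Because the distribution name and the import names differ, it can help to check which import name is actually available in an environment. The sketch below uses only `importlib` from the standard library; the `pick_import_name` helper is hypothetical and not part of doctr.

```python
import importlib.util
from typing import Optional, Sequence


def pick_import_name(candidates: Sequence[str] = ("doctr", "pdfindexing")) -> Optional[str]:
    """Return the first top-level module name that is importable, else None."""
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


name = pick_import_name()
print(name if name else "neither doctr nor pdfindexing is installed")
```

Checking with `find_spec` avoids importing the package just to learn whether it is present.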

Installation

pip install doctr-index

Optional extras:

pip install 'doctr-index[office]'   # docx/xlsx
pip install 'doctr-index[docling]'  # docling adapter
pip install 'doctr-index[ocr]'      # OCR provider adapter

Quick Start

from doctr import index_document

idx = index_document(
    "/path/to/report.pdf",
    prefer_toc_hierarchy=True,
    include_embedded=True,
    max_embedded_depth=2,
)

tree = idx.to_pageindex_dict(include_empty_nodes=False)
print(tree["nodes"][0])
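
Once the tree is a plain dictionary, it can be traversed with ordinary Python. The sketch below uses a hand-written `sample_tree` in place of real `to_pageindex_dict` output; the nested keys follow the field list in the Overview, the values are invented, and the `outline` helper is hypothetical.

```python
from typing import Any, Dict, List

# Hand-written stand-in for the dictionary used in the Quick Start.
# Real output may carry additional keys; only the listed fields are used here.
sample_tree: Dict[str, Any] = {
    "nodes": [
        {
            "title": "1 Introduction",
            "node_id": "0001",
            "start_index": 1,
            "end_index": 3,
            "summary": "Scope of the report.",
            "nodes": [
                {"title": "1.1 Background", "node_id": "0002",
                 "start_index": 1, "end_index": 2, "summary": "", "nodes": []},
            ],
        },
        {"title": "2 Findings", "node_id": "0003",
         "start_index": 4, "end_index": 9, "summary": "", "nodes": []},
    ]
}


def outline(nodes: List[Dict[str, Any]], depth: int = 0) -> List[str]:
    """Return one indented line per node, in document order."""
    lines: List[str] = []
    for node in nodes:
        span = f'[{node["start_index"]}-{node["end_index"]}]'
        lines.append("  " * depth + f'{node["title"]} {span}')
        lines.extend(outline(node.get("nodes", []), depth + 1))
    return lines


print("\n".join(outline(sample_tree["nodes"])))
```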

4-Stage Pipeline

from doctr import DocumentPipeline

pipeline = DocumentPipeline()

# 1) Document input
info = pipeline.document_input("/path/to/report.pdf")

# 2) Docling conversion
converted = pipeline.docling_conversion("/path/to/report.pdf")

# 3) Tree index builder
index = pipeline.build_tree_index(converted=converted)

# 4) Retrieval/chat context
context = pipeline.retrieve_for_chat(index, "What are the key risks?", top_k=6)
print(context)

Python Function Reference

Core indexing

from doctr import (
    index_document,
    index_pdf_file,
    index_docx_file,
    index_xlsx_file,
    index_markdown_file,
    index_markdown_text,
)

idx1 = index_document("/path/to/file.pdf")
idx2 = index_pdf_file("/path/to/file.pdf")
idx3 = index_docx_file("/path/to/file.docx")
idx4 = index_xlsx_file("/path/to/file.xlsx")
idx5 = index_markdown_file("/path/to/file.md")
idx6 = index_markdown_text("# Root\n## Child")

Class Usage

from doctr import DocumentIndexer

indexer = DocumentIndexer()
idx = indexer.index_document("/path/to/file.pdf", include_embedded=True)

OCR payload mapping (no API key required)

from doctr import DocumentIndexer, document_index_from_ocr_payload

payload = {
    "doc_id": "pi-abc123",
    "status": "completed",
    "result": [
        {
            "title": "Financial Stability",
            "page_index": 21,
            "text": "...",
            "nodes": []
        }
    ]
}

idx = DocumentIndexer().index_with_ocr("/path/to/scanned.pdf", ocr_payload=payload)
idx2 = document_index_from_ocr_payload(payload, source="/path/to/scanned.pdf")
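
To illustrate how a payload of this shape could line up with the tree fields from the Overview, here is a standard-library sketch. It is not doctr's actual mapping: the `payload_to_nodes` helper, the use of `page_index` for both `start_index` and `end_index`, the zero-padded `node_id` counter, and the child entry in the sample data are all assumptions.

```python
from itertools import count
from typing import Any, Dict, Iterator, List, Optional


def payload_to_nodes(entries: List[Dict[str, Any]],
                     ids: Optional[Iterator[int]] = None) -> List[Dict[str, Any]]:
    """Map OCR result entries to tree-node dicts, numbering nodes in pre-order."""
    ids = ids if ids is not None else count(1)
    out: List[Dict[str, Any]] = []
    for entry in entries:
        node_id = f"{next(ids):04d}"       # assigned before visiting children
        page = entry.get("page_index")
        out.append({
            "title": entry.get("title", ""),
            "node_id": node_id,
            "start_index": page,            # assumption: one page per entry
            "end_index": page,
            "summary": "",
            "nodes": payload_to_nodes(entry.get("nodes", []), ids),
        })
    return out


def nodes_from_payload(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Refuse unfinished payloads, then map the 'result' list."""
    if payload.get("status") != "completed":
        raise ValueError(f"OCR payload not ready: status={payload.get('status')!r}")
    return payload_to_nodes(payload.get("result", []))


sample_payload = {
    "doc_id": "pi-abc123",
    "status": "completed",
    "result": [
        {"title": "Financial Stability", "page_index": 21, "text": "...",
         "nodes": [{"title": "Liquidity", "page_index": 22, "text": "...", "nodes": []}]},
    ],
}

mapped = nodes_from_payload(sample_payload)
print(mapped[0]["node_id"], mapped[0]["title"])
```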

OCR provider adapter (API key required)

from doctr import PageIndexOCRProvider, DocumentIndexer

provider = PageIndexOCRProvider(api_key="YOUR_KEY", base_url="https://api.pageindex.ai")
idx = DocumentIndexer(ocr_provider=provider).index_with_ocr("/path/to/scanned.pdf")

Retrieval helper

from doctr import index_document, retrieve_context

idx = index_document("/path/to/report.pdf")
payload = idx.to_pageindex_dict(include_empty_nodes=False)
ctx = retrieve_context(payload, "What changed in supervision?", top_k=8)
print(ctx)
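
For intuition about what retrieval over a tree payload can involve, the sketch below ranks nodes by keyword overlap between the query and each node's title and summary. This is a deliberately simple stand-in, not the algorithm behind `retrieve_context`; the `rank_nodes` helper and the sample data are hypothetical.

```python
import re
from typing import Any, Dict, Iterator, List, Set, Tuple


def iter_nodes(nodes: List[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield every node in the tree in document order."""
    for node in nodes:
        yield node
        yield from iter_nodes(node.get("nodes", []))


def tokenize(text: str) -> Set[str]:
    """Lower-case the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_nodes(tree: Dict[str, Any], query: str, top_k: int = 8) -> List[Tuple[int, str]]:
    """Return up to top_k (score, title) pairs, highest overlap first."""
    query_tokens = tokenize(query)
    scored: List[Tuple[int, str]] = []
    for node in iter_nodes(tree.get("nodes", [])):
        text = f'{node.get("title", "")} {node.get("summary", "")}'
        score = len(query_tokens & tokenize(text))
        if score > 0:
            scored.append((score, node.get("title", "")))
    # sorted() is stable, so ties keep document order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


sample = {
    "nodes": [
        {"title": "Supervision overview", "summary": "", "nodes": []},
        {"title": "Supervision changes",
         "summary": "New supervision rules changed reporting.", "nodes": []},
        {"title": "Appendix", "summary": "Tables", "nodes": []},
    ]
}

ranked = rank_nodes(sample, "What changed in supervision?", top_k=8)
print(ranked)
```

Keyword overlap ignores synonyms and word forms, so it serves only as a baseline for reasoning about `top_k` and tree traversal.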

Custom enricher

from doctr import index_document

def enrich(idx):
    for node in idx.nodes:
        if node.summary:
            node.summary = node.summary[:300]
    return idx

idx = index_document("/path/to/file.pdf", enricher=enrich)
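
If summaries on nested nodes should be trimmed as well, the enricher can recurse into children. The sketch below assumes each node exposes `summary` and `nodes` attributes, as the enricher above suggests for the top level; whether child nodes carry a `nodes` attribute is an assumption, and `SimpleNamespace` objects stand in for the real index so the example runs without doctr installed.

```python
from types import SimpleNamespace


def truncate_summaries(nodes, limit=300):
    """Trim every non-empty summary in the tree to at most `limit` characters."""
    for node in nodes:
        if node.summary:
            node.summary = node.summary[:limit]
        # Assumption: children live in a `nodes` attribute; missing means leaf.
        truncate_summaries(getattr(node, "nodes", []), limit)


def enrich(idx):
    """Enricher with the same shape as above: take the index, return the index."""
    truncate_summaries(idx.nodes)
    return idx


# Stand-in objects invented for the example.
child = SimpleNamespace(summary="y" * 500, nodes=[])
parent = SimpleNamespace(summary="x" * 400, nodes=[child])
fake_idx = SimpleNamespace(nodes=[parent])

result = enrich(fake_idx)
print(len(parent.summary), len(child.summary))
```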

CLI

doctr /path/to/file.pdf \
  --prefer-toc-hierarchy \
  --include-embedded \
  --max-embedded-depth 2 \
  --format pageindex \
  --output output_index.json
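
The JSON file written by the CLI can be inspected with the standard library. The sketch below assumes the `pageindex` output has a top-level `nodes` list, as the Quick Start dictionary does; it writes a small invented sample to a temporary `output_index.json` so that it runs on its own, and the `count_nodes` helper is hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Tuple


def count_nodes(nodes: List[Dict[str, Any]]) -> Tuple[int, int]:
    """Return (total nodes, leaf nodes); a leaf has an empty `nodes` list."""
    total = leaves = 0
    for node in nodes:
        children = node.get("nodes", [])
        total += 1
        if not children:
            leaves += 1
        sub_total, sub_leaves = count_nodes(children)
        total += sub_total
        leaves += sub_leaves
    return total, leaves


# Invented sample standing in for real CLI output.
sample = {
    "nodes": [
        {"title": "A", "nodes": [{"title": "A.1", "nodes": []}]},
        {"title": "B", "nodes": []},
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "output_index.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    loaded = json.loads(path.read_text(encoding="utf-8"))

total, leaves = count_nodes(loaded["nodes"])
print(f"{total} nodes, {leaves} leaves")
```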

Embedded Files

Enable recursive embedded indexing:

from doctr import index_document

idx = index_document("/path/to/container.docx", include_embedded=True, max_embedded_depth=2)

Output includes:

Embedded Files branch in tree
metadata["embedded_files"] manifest with indexed/skipped status
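
A manifest like this is easy to summarize after indexing. The sketch below assumes `metadata["embedded_files"]` is a list of dictionaries with `name` and `status` keys; only the indexed/skipped status is stated above, so the exact shape, the sample entries, and the `summarize_manifest` helper are hypothetical.

```python
from collections import Counter
from typing import Any, Dict

# Invented manifest; the real structure may differ.
metadata: Dict[str, Any] = {
    "embedded_files": [
        {"name": "budget.xlsx", "status": "indexed"},
        {"name": "logo.png", "status": "skipped"},
        {"name": "notes.txt", "status": "indexed"},
    ]
}


def summarize_manifest(meta: Dict[str, Any]) -> Dict[str, int]:
    """Count manifest entries per status; entries without a status count as 'unknown'."""
    entries = meta.get("embedded_files", [])
    return dict(Counter(entry.get("status", "unknown") for entry in entries))


summary = summarize_manifest(metadata)
print(summary)
```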

Example Scripts

examples/chat_with_document.py
examples/docling_pipeline_demo.py
examples/local_ocr_tree_indexer.py

Notes

Normal indexing is local and does not require API keys.
OCR provider mode requires credentials only when a remote OCR service is used.
An empty nodes list (nodes: []) marks a valid leaf node.

Development

pytest -q

License

MIT

Release history

0.1.3 (Mar 9, 2026, this version)
0.1.2 (Mar 7, 2026)
0.1.1 (Mar 7, 2026)

Download files

Source Distribution

doctr_index-0.1.3.tar.gz (25.2 MB)
Uploaded Mar 9, 2026, Source

Built Distribution

doctr_index-0.1.3-py3-none-any.whl (24.9 MB)
Uploaded Mar 9, 2026, Python 3

File details

Details for the file doctr_index-0.1.3.tar.gz.

File metadata

Download URL: doctr_index-0.1.3.tar.gz
Upload date: Mar 9, 2026
Size: 25.2 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.8

File hashes

Hashes for doctr_index-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`21dc09b12c8eec4195352a0b49607e705294fc3390d47e7fe7064001e98e8dfa`
MD5	`6ca7e94e04ebafaab5d6a323ddef9d2e`
BLAKE2b-256	`83c6c97eefc60af32452439af0806e7c4cf675071efd61d0ad5565c00d10cb12`
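
A downloaded archive can be checked against the SHA256 digest listed above with Python's `hashlib`. The sketch below is self-contained: it hashes a small temporary file rather than the real archive, and the `sha256_of` and `matches` helpers are hypothetical names. For an actual download, pass the path of the downloaded file and the published digest.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare the file's SHA256 with an expected hex digest, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demonstration on a throwaway file instead of the real archive.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"hello")
    expected = hashlib.sha256(b"hello").hexdigest()
    ok = matches(sample, expected)
    bad = matches(sample, "0" * 64)

print(ok, bad)
```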


File details

Details for the file doctr_index-0.1.3-py3-none-any.whl.

File metadata

Download URL: doctr_index-0.1.3-py3-none-any.whl
Upload date: Mar 9, 2026
Size: 24.9 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.8

File hashes

Hashes for doctr_index-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3bb54b8894687011d15b5be5078309f3525b05d9afadb0b2db211c148c925b48`
MD5	`8ebaf20bd67eefa8c3fcd25bad0004f7`
BLAKE2b-256	`98a8980a3daafb49be8089af2021df81d426508407f411f2353a137951cedd2e`

