
Open-source document indexing library for building hierarchical trees from PDF, DOCX, XLSX, Markdown, and plain-text documents.

doctr

Deterministic multi-format document tree indexing for RAG and agent workflows.

Overview

doctr turns real-world documents into trees of PageIndex-style nodes. Each node carries these fields:

  • title
  • node_id
  • start_index
  • end_index
  • summary
  • nodes
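
To make the node shape concrete, here is a minimal sketch of one such tree as plain Python dicts, with a recursive walk over it. Only the field names come from the list above. The sample values, the walk helper, and the assumption that start_index and end_index are integers are illustrative and not taken from doctr itself.

```python
from typing import Any, Dict, Iterator, Tuple

# Hypothetical sample tree; only the field names come from the list above.
sample_tree: Dict[str, Any] = {
    "title": "Annual Report",
    "node_id": "0000",
    "start_index": 1,
    "end_index": 40,
    "summary": "Top-level summary.",
    "nodes": [
        {
            "title": "Financial Stability",
            "node_id": "0001",
            "start_index": 1,
            "end_index": 20,
            "summary": "Risks and outlook.",
            "nodes": [],
        },
        {
            "title": "Supervision",
            "node_id": "0002",
            "start_index": 21,
            "end_index": 40,
            "summary": "Changes in supervision.",
            "nodes": [],
        },
    ],
}


def walk(node: Dict[str, Any], depth: int = 0) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Yield (depth, node) pairs in depth-first pre-order."""
    yield depth, node
    for child in node.get("nodes", []):
        yield from walk(child, depth + 1)


# Build an indented outline, two spaces per nesting level.
outline = [f"{'  ' * depth}{node['title']}" for depth, node in walk(sample_tree)]
print("\n".join(outline))
```

The same walk can be reused for any per-node task, such as collecting node_id values or checking index ranges.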

Supported input formats:

  • pdf
  • docx
  • xlsx / xlsm
  • md / markdown
  • txt
  • msg (text fallback)
  • embedded files inside supported containers (best effort)
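
When indexing a folder of mixed files, it can help to filter by extension before calling the indexer. The sketch below is a hypothetical pre-filter, not part of doctr: the extension set is copied from the list above, and doctr itself may detect formats differently (for example by content rather than by file name).

```python
from pathlib import Path

# Extensions copied from the supported-formats list; the helper is hypothetical.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".xlsx", ".xlsm", ".md", ".markdown", ".txt", ".msg",
}


def is_supported(path: str) -> bool:
    """Return True if the file extension appears in the supported list."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


candidates = ["report.PDF", "notes.md", "slides.pptx", "mail.msg"]
supported = [p for p in candidates if is_supported(p)]
print(supported)
```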

Package Names

  • PyPI distribution: doctr-index
  • Python import: doctr
  • Legacy compatibility import: pdfindexing (shim)

Installation

pip install doctr-index

Optional extras:

pip install 'doctr-index[office]'   # docx/xlsx
pip install 'doctr-index[docling]'  # docling adapter
pip install 'doctr-index[ocr]'      # OCR provider adapter

Quick Start

from doctr import index_document

idx = index_document(
    "/path/to/report.pdf",
    prefer_toc_hierarchy=True,
    include_embedded=True,
    max_embedded_depth=2,
)

tree = idx.to_pageindex_dict(include_empty_nodes=False)
print(tree["nodes"][0])

4-Stage Pipeline

from doctr import DocumentPipeline

pipeline = DocumentPipeline()

# 1) Document input
info = pipeline.document_input("/path/to/report.pdf")

# 2) Docling conversion
converted = pipeline.docling_conversion("/path/to/report.pdf")

# 3) Tree index builder
index = pipeline.build_tree_index(converted=converted)

# 4) Retrieval/chat context
context = pipeline.retrieve_for_chat(index, "What are the key risks?", top_k=6)
print(context)

Python Function Reference

Core indexing

from doctr import (
    index_document,
    index_pdf_file,
    index_docx_file,
    index_xlsx_file,
    index_markdown_file,
    index_markdown_text,
)

idx1 = index_document("/path/to/file.pdf")
idx2 = index_pdf_file("/path/to/file.pdf")
idx3 = index_docx_file("/path/to/file.docx")
idx4 = index_xlsx_file("/path/to/file.xlsx")
idx5 = index_markdown_file("/path/to/file.md")
idx6 = index_markdown_text("# Root\n## Child")

Class Usage

from doctr import DocumentIndexer

indexer = DocumentIndexer()
idx = indexer.index_document("/path/to/file.pdf", include_embedded=True)

OCR payload mapping (no API key required)

from doctr import DocumentIndexer, document_index_from_ocr_payload

payload = {
    "doc_id": "pi-abc123",
    "status": "completed",
    "result": [
        {
            "title": "Financial Stability",
            "page_index": 21,
            "text": "...",
            "nodes": []
        }
    ]
}

idx = DocumentIndexer().index_with_ocr("/path/to/scanned.pdf", ocr_payload=payload)
idx2 = document_index_from_ocr_payload(payload, source="/path/to/scanned.pdf")
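
Because the payload is a nested structure, a quick shape check before mapping it can catch incomplete OCR results early. The following is a hypothetical validator, not a doctr function. The required keys and the expectation that status equals "completed" are inferred only from the example payload above and are not an official schema.

```python
from typing import Any, Dict, List

# Keys inferred from the example payload above; not an official schema.
REQUIRED_ENTRY_KEYS = ("title", "page_index", "text", "nodes")


def find_payload_problems(payload: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the shape looks fine."""
    problems: List[str] = []
    # Assumption: a usable payload reports status "completed".
    if payload.get("status") != "completed":
        problems.append(f"status is {payload.get('status')!r}, expected 'completed'")

    def check(entries: Any, path: str) -> None:
        if not isinstance(entries, list):
            problems.append(f"{path} is not a list")
            return
        for i, entry in enumerate(entries):
            here = f"{path}[{i}]"
            if not isinstance(entry, dict):
                problems.append(f"{here} is not a dict")
                continue
            for key in REQUIRED_ENTRY_KEYS:
                if key not in entry:
                    problems.append(f"{here} is missing {key!r}")
            # Recurse into child entries when present.
            if "nodes" in entry:
                check(entry["nodes"], f"{here}.nodes")

    check(payload.get("result"), "result")
    return problems


good = {
    "doc_id": "pi-abc123",
    "status": "completed",
    "result": [
        {"title": "Financial Stability", "page_index": 21, "text": "...", "nodes": []}
    ],
}
bad = {"doc_id": "pi-abc123", "status": "processing", "result": [{"title": "Untitled"}]}

print(find_payload_problems(good))
print(find_payload_problems(bad))
```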

OCR provider adapter (API key required)

from doctr import PageIndexOCRProvider, DocumentIndexer

provider = PageIndexOCRProvider(api_key="YOUR_KEY", base_url="https://api.pageindex.ai")
idx = DocumentIndexer(ocr_provider=provider).index_with_ocr("/path/to/scanned.pdf")

Retrieval helper

from doctr import index_document, retrieve_context

idx = index_document("/path/to/report.pdf")
payload = idx.to_pageindex_dict(include_empty_nodes=False)
ctx = retrieve_context(payload, "What changed in supervision?", top_k=8)
print(ctx)
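
For intuition about what retrieval over a tree payload involves, here is a deliberately naive term-overlap ranker. It is not how retrieve_context works internally (the source does not describe that); it is a stand-in technique that scores each node by how many query words appear in its title and summary. The naive_retrieve and tokenize names and the sample payload are hypothetical.

```python
import re
from typing import Any, Dict, List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def naive_retrieve(tree: Dict[str, Any], query: str, top_k: int = 3) -> List[Dict[str, Any]]:
    """Rank nodes by the number of query terms found in title + summary."""
    query_terms = tokenize(query)
    scored: List[Tuple[int, int, Dict[str, Any]]] = []

    def visit(node: Dict[str, Any]) -> None:
        text = f"{node.get('title', '')} {node.get('summary', '')}"
        score = len(query_terms & tokenize(text))
        if score > 0:
            # The second element records insertion order for stable tie-breaking.
            scored.append((score, len(scored), node))
        for child in node.get("nodes", []):
            visit(child)

    for node in tree.get("nodes", []):
        visit(node)

    # Highest score first; ties keep document order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [node for _, _, node in scored[:top_k]]


# Hypothetical payload with the same top-level "nodes" key used in Quick Start.
sample_payload = {
    "nodes": [
        {
            "title": "Financial Stability",
            "summary": "Key risks to banks.",
            "nodes": [
                {
                    "title": "Supervision",
                    "summary": "What changed in supervision this year.",
                    "nodes": [],
                }
            ],
        },
        {"title": "Appendix", "summary": "Tables.", "nodes": []},
    ]
}

hits = naive_retrieve(sample_payload, "What changed in supervision?", top_k=8)
hit_titles = [node["title"] for node in hits]
print(hit_titles)
```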

Custom enricher

from doctr import index_document

def enrich(idx):
    for node in idx.nodes:
        if node.summary:
            node.summary = node.summary[:300]
    return idx

idx = index_document("/path/to/file.pdf", enricher=enrich)
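
The enricher above only visits the top-level entries of idx.nodes. If child nodes are exposed through a nodes attribute on each node (an assumption based on the field list, not confirmed by the source), a variant that reaches every depth might look like the sketch below. It uses SimpleNamespace stand-ins so it runs without doctr installed; truncate_summaries is a hypothetical name.

```python
from types import SimpleNamespace


def truncate_summaries(idx, limit: int = 300):
    """Enricher sketch: cap every summary at `limit` characters, at any depth."""
    stack = list(idx.nodes)
    while stack:
        node = stack.pop()
        if node.summary:
            node.summary = node.summary[:limit]
        # Assumption: children live on a `nodes` attribute; tolerate its absence.
        stack.extend(getattr(node, "nodes", None) or [])
    return idx


# Stand-in objects that mimic the attributes the enricher touches.
child = SimpleNamespace(title="Child", summary="y" * 500, nodes=[])
root = SimpleNamespace(title="Root", summary="x" * 500, nodes=[child])
fake_idx = SimpleNamespace(nodes=[root])

truncate_summaries(fake_idx, limit=300)
print(len(root.summary), len(child.summary))
```

If those assumptions hold for real index objects, the function could be passed as enricher=truncate_summaries in the same way as enrich above.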

CLI

doctr /path/to/file.pdf \
  --prefer-toc-hierarchy \
  --include-embedded \
  --max-embedded-depth 2 \
  --format pageindex \
  --output output_index.json
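
Once the CLI has written output_index.json, the file can be inspected from Python. The sketch below assumes the file is JSON in the same shape as to_pageindex_dict(), with a top-level "nodes" list; that shape is an assumption, so a stand-in file is written first to keep the example runnable on its own.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def load_index(path: Path) -> Dict[str, Any]:
    """Read an index file written with --output."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def count_nodes(nodes: List[Dict[str, Any]]) -> int:
    """Count every node in the tree, including nested children."""
    return sum(1 + count_nodes(node.get("nodes", [])) for node in nodes)


# Stand-in for real CLI output; the structure is assumed, not confirmed.
sample = {
    "nodes": [
        {"title": "Intro", "nodes": [{"title": "Scope", "nodes": []}]},
        {"title": "Results", "nodes": []},
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "output_index.json"
    out_path.write_text(json.dumps(sample), encoding="utf-8")
    loaded = load_index(out_path)

total = count_nodes(loaded["nodes"])
print(total)
```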

Embedded Files

Enable recursive embedded indexing:

from doctr import index_document

idx = index_document("/path/to/container.docx", include_embedded=True, max_embedded_depth=2)

Output includes:

  • Embedded Files branch in tree
  • metadata["embedded_files"] manifest with indexed/skipped status
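
The manifest makes it possible to see which embedded files were skipped. The source does not document the manifest's exact structure, so the sketch below assumes a list of dicts with hypothetical "name" and "status" keys, where status is "indexed" or "skipped".

```python
from collections import Counter
from typing import Any, Dict, List

# Hypothetical manifest; the key names and layout are assumptions.
manifest: List[Dict[str, Any]] = [
    {"name": "budget.xlsx", "status": "indexed"},
    {"name": "logo.png", "status": "skipped"},
    {"name": "appendix.pdf", "status": "indexed"},
]


def summarize_manifest(entries: List[Dict[str, Any]]) -> Dict[str, int]:
    """Count manifest entries per status; missing statuses count as 'unknown'."""
    return dict(Counter(entry.get("status", "unknown") for entry in entries))


status_counts = summarize_manifest(manifest)
skipped = [entry["name"] for entry in manifest if entry.get("status") == "skipped"]
print(status_counts)
print(skipped)
```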

Example Scripts

  • examples/chat_with_document.py
  • examples/docling_pipeline_demo.py
  • examples/local_ocr_tree_indexer.py

Notes

  • Normal indexing runs locally and does not require API keys.
  • OCR provider mode requires credentials only when it calls a remote OCR service.
  • An empty child list (nodes: []) marks a valid leaf node.

Development

pytest -q

License

MIT
