Open-source document indexing library for building hierarchical trees from PDF and Markdown.

doctr

Deterministic multi-format document tree indexing for RAG and agent workflows.

Overview

doctr turns real-world documents into PageIndex-style tree nodes. Each node has these fields:

  • title
  • node_id
  • start_index
  • end_index
  • summary
  • nodes
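
To make the node shape concrete, here is a minimal sketch of one such tree as a plain Python dict, with a recursive walk that prints an outline. The field names come from the list above. The sample values, the reading of start_index and end_index as page numbers, and the helper name outline are assumptions for illustration, not part of doctr.

```python
# Illustrative only: a hand-written tree using the six node fields listed above.
sample_tree = {
    "title": "Annual Report",
    "node_id": "0000",
    "start_index": 1,   # assumed to be a page number
    "end_index": 40,    # assumed to be a page number
    "summary": "Top-level summary.",
    "nodes": [
        {
            "title": "Financial Stability",
            "node_id": "0001",
            "start_index": 21,
            "end_index": 28,
            "summary": "Risks to the banking sector.",
            "nodes": [],  # empty list: a leaf node
        },
    ],
}


def outline(node, depth=0):
    """Return one indented line per node, depth-first."""
    line = f"{'  ' * depth}{node['title']} [{node['start_index']}-{node['end_index']}]"
    lines = [line]
    for child in node["nodes"]:
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(sample_tree)))
```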

Supported input formats:

  • pdf
  • docx
  • xlsx / xlsm
  • md / markdown
  • txt
  • msg (text fallback)
  • embedded files inside supported containers (best effort)
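
A caller may want to reject unsupported files before indexing. The sketch below checks a path against the extensions listed above. The set SUPPORTED_EXTENSIONS and the function is_supported are hypothetical helpers, not doctr API, and doctr may detect formats differently (for example by content rather than extension).

```python
from pathlib import Path

# Extensions copied from the supported-formats list; this set is not exported by doctr.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".xlsm", ".md", ".markdown", ".txt", ".msg"}


def is_supported(path):
    """Return True if the file extension appears in the supported-formats list."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported("/path/to/report.PDF"))   # True (comparison is case-insensitive)
print(is_supported("/path/to/slides.pptx"))  # False
```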

Package Names

  • PyPI distribution: doctr-index
  • Python import: doctr
  • Legacy compatibility import: pdfindexing (shim)

Installation

pip install doctr-index

Optional extras:

pip install 'doctr-index[office]'   # docx/xlsx
pip install 'doctr-index[docling]'  # docling adapter
pip install 'doctr-index[ocr]'      # OCR provider adapter

Quick Start

from doctr import index_document

idx = index_document(
    "/path/to/report.pdf",
    prefer_toc_hierarchy=True,
    include_embedded=True,
    max_embedded_depth=2,
)

tree = idx.to_pageindex_dict(include_empty_nodes=False)
print(tree["nodes"][0])

4-Stage Pipeline

from doctr import DocumentPipeline

pipeline = DocumentPipeline()

# 1) Document input
info = pipeline.document_input("/path/to/report.pdf")

# 2) Docling conversion
converted = pipeline.docling_conversion("/path/to/report.pdf")

# 3) Tree index builder
index = pipeline.build_tree_index(converted=converted)

# 4) Retrieval/chat context
context = pipeline.retrieve_for_chat(index, "What are the key risks?", top_k=6)
print(context)

API Usage Reference

Core indexing

from doctr import (
    index_document,
    index_pdf_file,
    index_docx_file,
    index_xlsx_file,
    index_markdown_file,
    index_markdown_text,
)

idx1 = index_document("/path/to/file.pdf")
idx2 = index_pdf_file("/path/to/file.pdf")
idx3 = index_docx_file("/path/to/file.docx")
idx4 = index_xlsx_file("/path/to/file.xlsx")
idx5 = index_markdown_file("/path/to/file.md")
idx6 = index_markdown_text("# Root\n## Child")
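
How index_markdown_text maps headings to tree nodes is not described here. As a rough intuition only, the sketch below shows one common way to nest Markdown ATX headings by level using a stack. It is not doctr's implementation: headings_to_tree is a hypothetical function, and its output keeps only title and nodes.

```python
import re


def headings_to_tree(text):
    """Nest Markdown ATX headings (#, ##, ...) into title/nodes dicts."""
    root = {"title": None, "nodes": []}
    stack = [(0, root)]  # (heading level, node); the root sits at level 0
    for line in text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*\S)\s*$", line)
        if not match:
            continue  # body text is ignored in this sketch
        level = len(match.group(1))
        node = {"title": match.group(2), "nodes": []}
        # Pop back to the nearest heading that is shallower than this one.
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1]["nodes"].append(node)
        stack.append((level, node))
    return root["nodes"]


print(headings_to_tree("# Root\n## Child"))
```

The stack always ends with the most recent heading at each open level, so a new heading attaches to the closest shallower heading above it.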

Class API

from doctr import DocumentIndexer

indexer = DocumentIndexer()
idx = indexer.index_document("/path/to/file.pdf", include_embedded=True)

OCR payload mapping (no API key required)

from doctr import DocumentIndexer, document_index_from_ocr_payload

payload = {
    "doc_id": "pi-abc123",
    "status": "completed",
    "result": [
        {
            "title": "Financial Stability",
            "page_index": 21,
            "text": "...",
            "nodes": []
        }
    ]
}

idx = DocumentIndexer().index_with_ocr("/path/to/scanned.pdf", ocr_payload=payload)
idx2 = document_index_from_ocr_payload(payload, source="/path/to/scanned.pdf")
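
Before handing a payload to either function, it can help to sanity-check its shape. The sketch below inspects only the keys visible in the example above (status, result, title, page_index, nodes). The function check_ocr_payload is hypothetical, and whether doctr requires or validates these keys itself is an assumption.

```python
def check_ocr_payload(payload):
    """Return a list of problems found in an OCR payload shaped like the example."""
    problems = []
    if payload.get("status") != "completed":
        problems.append(f"status is {payload.get('status')!r}, expected 'completed'")

    def visit(entries, path):
        # Walk result entries and their nested nodes, recording missing keys.
        for i, entry in enumerate(entries):
            here = f"{path}[{i}]"
            for key in ("title", "page_index"):
                if key not in entry:
                    problems.append(f"{here} is missing {key!r}")
            visit(entry.get("nodes", []), f"{here}.nodes")

    visit(payload.get("result", []), "result")
    return problems


payload = {
    "doc_id": "pi-abc123",
    "status": "completed",
    "result": [{"title": "Financial Stability", "page_index": 21, "text": "...", "nodes": []}],
}
print(check_ocr_payload(payload))  # []
```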

OCR provider adapter (API key required)

from doctr import PageIndexOCRProvider, DocumentIndexer

provider = PageIndexOCRProvider(api_key="YOUR_KEY", base_url="https://api.pageindex.ai")
idx = DocumentIndexer(ocr_provider=provider).index_with_ocr("/path/to/scanned.pdf")

Retrieval helper

from doctr import index_document, retrieve_context

idx = index_document("/path/to/report.pdf")
payload = idx.to_pageindex_dict(include_empty_nodes=False)
ctx = retrieve_context(payload, "What changed in supervision?", top_k=8)
print(ctx)
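
The scoring that retrieve_context uses is not documented here. Purely to illustrate what selecting top_k nodes from a tree payload can look like, the sketch below ranks nodes by word overlap between the query and each node's title and summary. The function keyword_retrieve and the overlap scoring are assumptions, not the library's algorithm.

```python
import re


def _words(text):
    """Lowercase a string and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", (text or "").lower()))


def _flatten(nodes):
    """Yield every node in the tree, depth-first."""
    for node in nodes:
        yield node
        yield from _flatten(node.get("nodes", []))


def keyword_retrieve(payload, query, top_k=8):
    """Return up to top_k nodes ranked by query-word overlap (ties keep tree order)."""
    query_words = _words(query)
    scored = []
    for node in _flatten(payload.get("nodes", [])):
        text = f"{node.get('title') or ''} {node.get('summary') or ''}"
        score = len(query_words & _words(text))
        if score > 0:
            scored.append((score, node))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort
    return [node for _, node in scored[:top_k]]


payload = {
    "nodes": [
        {"title": "Supervision", "summary": "What changed in supervision policy.", "nodes": []},
        {"title": "Appendix", "summary": "Glossary of terms.", "nodes": []},
    ]
}
top = keyword_retrieve(payload, "What changed in supervision?", top_k=1)
print([n["title"] for n in top])  # ['Supervision']
```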

Custom enricher

from doctr import index_document

def enrich(idx):
    for node in idx.nodes:
        if node.summary:
            node.summary = node.summary[:300]
    return idx

idx = index_document("/path/to/file.pdf", enricher=enrich)
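
If idx.nodes holds only the top-level nodes (an assumption), the enricher above leaves nested summaries untouched. A recursive variant might look like the sketch below. It assumes each node object exposes its children as node.nodes, mirroring the nodes field in the tree output. That attribute name and the SimpleNamespace stand-ins are illustrative, not confirmed doctr API.

```python
from types import SimpleNamespace


def truncate_summaries(nodes, limit=300):
    """Trim every summary in the tree to at most `limit` characters."""
    for node in nodes:
        if node.summary:
            node.summary = node.summary[:limit]
        # Assumed attribute: children live on node.nodes.
        truncate_summaries(getattr(node, "nodes", None) or [], limit)


def enrich(idx):
    truncate_summaries(idx.nodes)
    return idx


# Stand-in objects so the sketch runs without doctr installed.
child = SimpleNamespace(summary="c" * 500, nodes=[])
parent = SimpleNamespace(summary="p" * 500, nodes=[child])
fake_idx = SimpleNamespace(nodes=[parent])

enrich(fake_idx)
print(len(parent.summary), len(child.summary))  # 300 300
```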

CLI

doctr /path/to/file.pdf \
  --prefer-toc-hierarchy \
  --include-embedded \
  --max-embedded-depth 2 \
  --format pageindex \
  --output output_index.json
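
Once the CLI has written output_index.json, the file can be inspected with the standard library. The sketch assumes that --format pageindex writes the same structure as to_pageindex_dict, that is, a JSON object with a top-level nodes list. count_nodes is a hypothetical helper, and the file written here is a stand-in so the example runs on its own.

```python
import json
import os
import tempfile


def count_nodes(nodes):
    """Count all nodes in a tree, including nested children."""
    return sum(1 + count_nodes(node.get("nodes", [])) for node in nodes)


# Stand-in for a file produced by the CLI.
sample = {"nodes": [{"title": "Root", "nodes": [{"title": "Child", "nodes": []}]}]}
path = os.path.join(tempfile.mkdtemp(), "output_index.json")
with open(path, "w", encoding="utf-8") as fh:
    json.dump(sample, fh)

with open(path, encoding="utf-8") as fh:
    tree = json.load(fh)

total = count_nodes(tree["nodes"])
print(total)  # 2
```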

Embedded Files

Enable recursive embedded indexing:

from doctr import index_document

idx = index_document("/path/to/container.docx", include_embedded=True, max_embedded_depth=2)

Output includes:

  • Embedded Files branch in tree
  • metadata["embedded_files"] manifest with indexed/skipped status
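
The manifest can then be scanned for files that were skipped. The entry layout is not specified here, so the sketch assumes each entry is a dict with hypothetical name and status keys, where status is either indexed or skipped. Check the actual manifest before relying on these names.

```python
def skipped_files(metadata):
    """Return names of embedded files whose status is 'skipped'."""
    return [
        entry.get("name")
        for entry in metadata.get("embedded_files", [])
        if entry.get("status") == "skipped"
    ]


# Hand-written stand-in for the index metadata; real entries may carry different keys.
metadata = {
    "embedded_files": [
        {"name": "budget.xlsx", "status": "indexed"},
        {"name": "photo.png", "status": "skipped"},
    ]
}
print(skipped_files(metadata))  # ['photo.png']
```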

Example Scripts

  • examples/chat_with_document.py
  • examples/docling_pipeline_demo.py
  • examples/local_ocr_tree_indexer.py

Notes

  • Normal indexing is local and does not require API keys.
  • OCR provider mode requires credentials only when it calls a remote OCR service.
  • An empty nodes list (nodes: []) marks a valid leaf node.

Development

pytest -q

License

MIT

Download files

Source Distribution

doctr_index-0.1.2.tar.gz (146.0 kB)

Uploaded Source

Built Distribution

doctr_index-0.1.2-py3-none-any.whl (22.4 kB)

Uploaded Python 3

File details

Details for the file doctr_index-0.1.2.tar.gz.

File metadata

  • Download URL: doctr_index-0.1.2.tar.gz
  • Size: 146.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for doctr_index-0.1.2.tar.gz
Algorithm Hash digest
SHA256 898ac07181d0e3e3650f4b5864dce6cf9cf91add750b4bea84e7abc9e82af338
MD5 307af91a6bb371edae336d55e25c7d52
BLAKE2b-256 0df8c4e643509fa9b1442321028f6a3c31928992434cec430429a0dbf3ebd828

Provenance

The following attestation bundles were made for doctr_index-0.1.2.tar.gz:

Publisher: publish.yml on Meet2147/doctr

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file doctr_index-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: doctr_index-0.1.2-py3-none-any.whl
  • Size: 22.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for doctr_index-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 4c057e6b08fd9b6ff2f4b24d5f1291bcde53b71cf93013f86e987258fa4d22db
MD5 40fd8a84125353f51b9bb785591da3ab
BLAKE2b-256 f8fd35e56c56c144a4860add0a31d2f3d7db214e78bc710484d2c084b671e524

Provenance

The following attestation bundles were made for doctr_index-0.1.2-py3-none-any.whl:

Publisher: publish.yml on Meet2147/doctr
