Open-source document indexing library for building hierarchical trees from PDF and Markdown.
Overview
doctr turns real-world documents into PageIndex-style tree nodes:
- title
- node_id
- start_index
- end_index
- summary
- nodes
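As a rough illustration of that node shape, the sketch below builds a small tree by hand and prints an indented outline. The field names come from the list above; the value types (strings for title, node_id, and summary, integers for start_index and end_index) and all sample values are assumptions for illustration, not output captured from doctr.

```python
# Hand-written example tree. Field names follow the list above;
# value types and contents are assumed for illustration only.
sample_tree = {
    "title": "Annual Report",
    "node_id": "0000",
    "start_index": 1,
    "end_index": 40,
    "summary": "Top-level summary.",
    "nodes": [
        {
            "title": "Financial Stability",
            "node_id": "0001",
            "start_index": 21,
            "end_index": 30,
            "summary": "",
            "nodes": [],
        },
    ],
}


def outline(node, depth=0):
    """Return one indented line per node, visiting children depth-first."""
    line = f"{'  ' * depth}{node['title']} [{node['start_index']}-{node['end_index']}]"
    lines = [line]
    for child in node.get("nodes", []):
        lines.extend(outline(child, depth + 1))
    return lines


outline_lines = outline(sample_tree)
print("\n".join(outline_lines))
```

The helper `outline` is hypothetical and only reads plain dictionaries, so it can be tried without installing the package.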
Supported input formats:
- pdf
- docx
- xlsx / xlsm
- md / markdown
- txt
- msg (text fallback)
- embedded files inside supported containers (best effort)
Package Names
- PyPI distribution: doctr-index
- Python import: doctr
- Legacy compatibility import: pdfindexing (shim)
Installation
pip install doctr-index
Optional extras:
pip install 'doctr-index[office]' # docx/xlsx
pip install 'doctr-index[docling]' # docling adapter
pip install 'doctr-index[ocr]' # OCR provider adapter
Quick Start
from doctr import index_document
idx = index_document(
    "/path/to/report.pdf",
    prefer_toc_hierarchy=True,
    include_embedded=True,
    max_embedded_depth=2,
)
tree = idx.to_pageindex_dict(include_empty_nodes=False)
print(tree["nodes"][0])
4-Stage Pipeline
from doctr import DocumentPipeline
pipeline = DocumentPipeline()
# 1) Document input
info = pipeline.document_input("/path/to/report.pdf")
# 2) Docling conversion
converted = pipeline.docling_conversion("/path/to/report.pdf")
# 3) Tree index builder
index = pipeline.build_tree_index(converted=converted)
# 4) Retrieval/chat context
context = pipeline.retrieve_for_chat(index, "What are the key risks?", top_k=6)
print(context)
Python Function Reference
Core indexing
from doctr import (
    index_document,
    index_pdf_file,
    index_docx_file,
    index_xlsx_file,
    index_markdown_file,
    index_markdown_text,
)
idx1 = index_document("/path/to/file.pdf")
idx2 = index_pdf_file("/path/to/file.pdf")
idx3 = index_docx_file("/path/to/file.docx")
idx4 = index_xlsx_file("/path/to/file.xlsx")
idx5 = index_markdown_file("/path/to/file.md")
idx6 = index_markdown_text("# Root\n## Child")
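To make the idea of a heading hierarchy concrete, here is a minimal sketch of how Markdown headings such as "# Root" and "## Child" can be nested into a tree with a stack. This is not doctr's implementation; the function name `headings_to_tree` and the `level` key are hypothetical, and the sketch ignores body text and does not special-case `#` lines inside fenced code.

```python
import re

# ATX heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def headings_to_tree(text):
    """Nest ATX headings by level using a stack of open ancestors."""
    root = {"title": "<root>", "level": 0, "nodes": []}
    stack = [root]
    for line in text.splitlines():
        match = HEADING.match(line)
        if not match:
            continue  # body text is ignored in this sketch
        level = len(match.group(1))
        node = {"title": match.group(2), "level": level, "nodes": []}
        # Close every open heading at the same or a deeper level.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["nodes"].append(node)
        stack.append(node)
    return root["nodes"]


tree = headings_to_tree("# Root\n## Child")
print(tree)
```

The synthetic root at level 0 is never popped, so the stack always has a parent to attach the next heading to.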
Class Usage
from doctr import DocumentIndexer
indexer = DocumentIndexer()
idx = indexer.index_document("/path/to/file.pdf", include_embedded=True)
OCR payload mapping (no API key required)
from doctr import DocumentIndexer, document_index_from_ocr_payload
payload = {
    "doc_id": "pi-abc123",
    "status": "completed",
    "result": [
        {
            "title": "Financial Stability",
            "page_index": 21,
            "text": "...",
            "nodes": []
        }
    ]
}
idx = DocumentIndexer().index_with_ocr("/path/to/scanned.pdf", ocr_payload=payload)
idx2 = document_index_from_ocr_payload(payload, source="/path/to/scanned.pdf")
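Before passing a payload to the library, it can help to inspect what it contains. The sketch below walks the payload shape shown above and lists each entry with its depth and page. `iter_payload_nodes` is a hypothetical helper, not part of doctr, and it assumes nested entries under "nodes" use the same keys as the top-level entries in "result".

```python
def iter_payload_nodes(payload):
    """Yield (depth, title, page_index) for every entry under payload['result']."""

    def walk(nodes, depth):
        for node in nodes:
            yield depth, node.get("title"), node.get("page_index")
            # Assumed: children repeat the same structure under "nodes".
            yield from walk(node.get("nodes", []), depth + 1)

    yield from walk(payload.get("result", []), 0)


payload = {
    "doc_id": "pi-abc123",
    "status": "completed",
    "result": [
        {"title": "Financial Stability", "page_index": 21, "text": "...", "nodes": []}
    ],
}

rows = list(iter_payload_nodes(payload))
print(rows)
```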
OCR provider adapter (API key required)
from doctr import PageIndexOCRProvider, DocumentIndexer
provider = PageIndexOCRProvider(api_key="YOUR_KEY", base_url="https://api.pageindex.ai")
idx = DocumentIndexer(ocr_provider=provider).index_with_ocr("/path/to/scanned.pdf")
Retrieval helper
from doctr import index_document, retrieve_context
idx = index_document("/path/to/report.pdf")
payload = idx.to_pageindex_dict(include_empty_nodes=False)
ctx = retrieve_context(payload, "What changed in supervision?", top_k=8)
print(ctx)
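The source does not describe how retrieve_context scores nodes. As a stand-in to show the general idea of picking the top_k most relevant nodes from a tree payload, the sketch below ranks nodes by simple keyword overlap between the query and each node's title and summary. The function `rank_nodes`, the scoring rule, and the sample tree are all assumptions for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def flatten(nodes):
    """Yield every node in the tree, parents before children."""
    for node in nodes:
        yield node
        yield from flatten(node.get("nodes", []))


def rank_nodes(tree, query, top_k=3):
    """Rank nodes by how many query tokens appear in title + summary."""
    query_tokens = tokenize(query)
    scored = []
    for node in flatten(tree.get("nodes", [])):
        text = f"{node.get('title', '')} {node.get('summary', '')}"
        score = len(query_tokens & tokenize(text))
        if score:
            scored.append((score, node.get("title", "")))
    # Highest score first; ties keep document order because the sort is stable.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:top_k]]


sample = {
    "nodes": [
        {"title": "Introduction", "summary": "Scope of the report.", "nodes": []},
        {
            "title": "Supervision",
            "summary": "What changed in supervision this year.",
            "nodes": [
                {"title": "Risk Outlook", "summary": "Key risks for banks.", "nodes": []}
            ],
        },
    ]
}

hits = rank_nodes(sample, "What changed in supervision?", top_k=8)
print(hits)
```

Keyword overlap is deliberately naive; it is used here only because it needs no external dependencies.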
Custom enricher
from doctr import index_document
def enrich(idx):
    for node in idx.nodes:
        if node.summary:
            node.summary = node.summary[:300]
    return idx
idx = index_document("/path/to/file.pdf", enricher=enrich)
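Because an enricher is just a callable that takes an index and returns it, several small enrichers can be combined into one. The sketch below shows one way to do that; `chain` and `strip_titles` are hypothetical helpers, and the stand-in objects assume nodes expose `title` and `summary` attributes (only `summary` is confirmed by the example above).

```python
from types import SimpleNamespace


def chain(*enrichers):
    """Combine several enrichers into one callable applied left to right."""

    def combined(idx):
        for enricher in enrichers:
            idx = enricher(idx)
        return idx

    return combined


def truncate_summaries(idx):
    for node in idx.nodes:
        if node.summary:
            node.summary = node.summary[:300]
    return idx


def strip_titles(idx):
    # Assumes nodes have a `title` attribute; not confirmed by the source.
    for node in idx.nodes:
        node.title = node.title.strip()
    return idx


# Stand-in object with only the attributes these enrichers touch;
# a real index returned by the library may expose more than this.
fake_idx = SimpleNamespace(
    nodes=[SimpleNamespace(title="  Intro ", summary="x" * 500)]
)
result = chain(truncate_summaries, strip_titles)(fake_idx)
print(result.nodes[0].title, len(result.nodes[0].summary))
```

The combined callable could then be passed wherever a single enricher is accepted.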
CLI
doctr /path/to/file.pdf \
  --prefer-toc-hierarchy \
  --include-embedded \
  --max-embedded-depth 2 \
  --format pageindex \
  --output output_index.json
Embedded Files
Enable recursive embedded indexing:
from doctr import index_document
idx = index_document("/path/to/container.docx", include_embedded=True, max_embedded_depth=2)
Output includes:
- Embedded Files branch in tree
- metadata["embedded_files"] manifest with indexed/skipped status
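To illustrate how a depth limit and an indexed/skipped manifest can fit together, here is a minimal sketch of depth-limited recursion over nested containers. It is not doctr's code: the `embedded` key, the manifest fields, and the choice to count files directly inside the top-level container as depth 1 are all assumptions.

```python
def index_embedded(container, max_depth, depth=1, manifest=None):
    """Record each embedded file as indexed or skipped, up to max_depth levels."""
    if manifest is None:
        manifest = []
    for child in container.get("embedded", []):
        if depth > max_depth:
            # Too deep: note the file but do not descend into it.
            manifest.append({"name": child["name"], "depth": depth, "status": "skipped"})
            continue
        manifest.append({"name": child["name"], "depth": depth, "status": "indexed"})
        index_embedded(child, max_depth, depth + 1, manifest)
    return manifest


# Hypothetical container: a docx holding a spreadsheet, which holds a text
# file, which in turn holds a PDF.
container = {
    "name": "container.docx",
    "embedded": [
        {
            "name": "table.xlsx",
            "embedded": [
                {
                    "name": "note.txt",
                    "embedded": [{"name": "deep.pdf", "embedded": []}],
                }
            ],
        }
    ],
}

manifest = index_embedded(container, max_depth=2)
for entry in manifest:
    print(entry)
```

With a limit of 2, the first two levels are recorded as indexed and the third as skipped, which mirrors the kind of status listing a manifest is meant to provide.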
Example Scripts
- examples/chat_with_document.py
- examples/docling_pipeline_demo.py
- examples/local_ocr_tree_indexer.py
Notes
- Normal indexing is local and does not require API keys.
- OCR provider mode requires credentials only if using remote OCR.
- nodes: [] means a valid leaf node.
Development
pytest -q
License
MIT