Open-source document indexing library for building hierarchical trees from PDF and Markdown.
doctr
Deterministic document indexing library with PageIndex-style tree output:
- title
- node_id
- start_index
- end_index
- summary
- nodes
Works for pdf, docx, xlsx/xlsm, md/markdown, txt, msg, plus optional embedded-file recursion.
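As a rough illustration of the output shape, the sketch below builds a small tree by hand using the six fields listed above and flattens it into an outline. The sample values, the node_id format, and the outline helper are assumptions made for illustration; none of them come from doctr itself.

```python
from typing import Any, Dict, Iterator, Tuple

# Hand-written sample; values and node_id format are illustrative assumptions.
sample_tree: Dict[str, Any] = {
    "title": "Annual Report",
    "node_id": "0000",
    "start_index": 1,
    "end_index": 40,
    "summary": "Top-level summary.",
    "nodes": [
        {
            "title": "Overview",
            "node_id": "0001",
            "start_index": 1,
            "end_index": 5,
            "summary": "Scope and highlights.",
            "nodes": [],
        },
        {
            "title": "Risks",
            "node_id": "0002",
            "start_index": 6,
            "end_index": 40,
            "summary": "Risk factors.",
            "nodes": [
                {
                    "title": "Market Risk",
                    "node_id": "0003",
                    "start_index": 6,
                    "end_index": 20,
                    "summary": "Exposure to rates.",
                    "nodes": [],
                }
            ],
        },
    ],
}


def outline(node: Dict[str, Any], depth: int = 0) -> Iterator[Tuple[int, str, int, int]]:
    """Yield (depth, title, start_index, end_index) in document order."""
    yield depth, node["title"], node["start_index"], node["end_index"]
    for child in node.get("nodes", []):
        yield from outline(child, depth + 1)


if __name__ == "__main__":
    for depth, title, start, end in outline(sample_tree):
        print(f"{'  ' * depth}{title} [{start}-{end}]")
```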
Package Name
PyPI distribution name: doctr-index
Python import name: doctr
Compatibility:
- Legacy imports from pdfindexing are still supported via shim modules.
Install
Core:
pip install -e '.[dev]'
From PyPI:
pip install doctr-index
With Office (docx, xlsx, xlsm):
pip install -e '.[office]'
With Docling:
pip install -e '.[docling]'
With OCR provider adapter:
pip install -e '.[ocr]'
Architecture (Separate Methods)
- Document input (pdf/docx/xlsx/...)
- Docling conversion (layout + OCR + reading order + tables)
- Tree index builder
- Retrieval/chat layer
Use DocumentPipeline when you want these stages explicitly separated.
Quick Start
from doctr import index_document
idx = index_document(
    "/path/to/report.pdf",
    prefer_toc_hierarchy=True,
    include_embedded=True,
    max_embedded_depth=2,
)
tree = idx.to_pageindex_dict(include_empty_nodes=False)
print(tree["nodes"][0])
Full Public API (Every Export) with Usage
1) index_document(...)
Single entrypoint for supported local files.
from doctr import index_document
idx = index_document(
    "/path/to/file.pdf",
    max_toc_pages=20,
    prefer_toc_hierarchy=False,
    summary_max_chars=None,  # full summaries
    include_embedded=False,
    max_embedded_depth=2,
)
2) index_pdf_file(path, ...)
PDF-specific indexing.
from doctr import index_pdf_file
idx = index_pdf_file("/path/to/file.pdf", prefer_toc_hierarchy=True)
3) index_docx_file(path)
DOCX indexing from heading styles and paragraph sections.
from doctr import index_docx_file
idx = index_docx_file("/path/to/file.docx")
4) index_xlsx_file(path)
XLSX/XLSM indexing by sheets and row chunks.
from doctr import index_xlsx_file
idx = index_xlsx_file("/path/to/file.xlsx")
5) index_markdown_file(path)
from doctr import index_markdown_file
idx = index_markdown_file("/path/to/file.md")
6) index_markdown_text(markdown)
from doctr import index_markdown_text
idx = index_markdown_text("# Root\n## Child\n")
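To make the heading-to-tree idea concrete, here is a minimal sketch of one common way to nest Markdown headings by level using a stack. It is not doctr's implementation: the headings_to_tree name and the reduced node shape (title and nodes only) are assumptions, and the sketch ignores body text and does not skip headings inside fenced code.

```python
import re
from typing import Any, Dict, List, Tuple

# ATX headings only: one to six '#' characters followed by a title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def headings_to_tree(markdown: str) -> List[Dict[str, Any]]:
    """Nest ATX headings by level using a stack of open ancestors."""
    roots: List[Dict[str, Any]] = []
    stack: List[Tuple[int, Dict[str, Any]]] = []  # (level, node) pairs
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if not match:
            continue  # body text is ignored in this sketch
        level = len(match.group(1))
        node: Dict[str, Any] = {"title": match.group(2), "nodes": []}
        # Close headings at the same or a deeper level before attaching.
        while stack and stack[-1][0] >= level:
            stack.pop()
        if stack:
            stack[-1][1]["nodes"].append(node)
        else:
            roots.append(node)
        stack.append((level, node))
    return roots


if __name__ == "__main__":
    print(headings_to_tree("# Root\n## Child\n"))
```

A stack keeps the pass linear in the number of lines, and a heading that skips a level (for example # followed directly by ###) still attaches to the nearest shallower ancestor.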
7) DocumentIndexer
Class API with normal indexing + OCR indexing methods.
from doctr import DocumentIndexer
indexer = DocumentIndexer()
idx = indexer.index_document("/path/to/file.pdf", include_embedded=True)
8) DocumentIndexer.index_with_ocr(...)
Pass an OCR payload directly, or let a configured OCR provider supply it.
from doctr import DocumentIndexer
ocr_payload = {
    "doc_id": "pi-abc123",
    "status": "completed",
    "result": [
        {
            "title": "Financial Stability",
            "page_index": 21,
            "text": "Section text...",
            "nodes": []
        }
    ]
}
idx = DocumentIndexer().index_with_ocr("/path/to/scanned.pdf", ocr_payload=ocr_payload)
9) document_index_from_ocr_payload(payload, source=...)
Standalone converter from OCR node payload to DocumentIndex.
from doctr import document_index_from_ocr_payload
idx = document_index_from_ocr_payload(ocr_payload, source="/path/to/scanned.pdf")
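Before handing a payload to the converter it can help to inspect it. The sketch below walks the payload shape shown in the index_with_ocr example and yields each section's title and page_index. The iter_ocr_sections helper is hypothetical, and treating any status other than "completed" as an error is an assumption, since only that value appears in this README.

```python
from typing import Any, Dict, Iterator, List, Tuple


def iter_ocr_sections(payload: Dict[str, Any]) -> Iterator[Tuple[str, int]]:
    """Yield (title, page_index) for every node in an OCR payload, depth-first."""
    # Assumption: only "completed" payloads carry usable results.
    if payload.get("status") != "completed":
        raise ValueError(f"OCR payload not ready: status={payload.get('status')!r}")

    def walk(nodes: List[Dict[str, Any]]) -> Iterator[Tuple[str, int]]:
        for node in nodes:
            yield node["title"], node["page_index"]
            yield from walk(node.get("nodes", []))

    yield from walk(payload.get("result", []))


if __name__ == "__main__":
    demo = {
        "doc_id": "pi-abc123",
        "status": "completed",
        "result": [
            {"title": "Financial Stability", "page_index": 21,
             "text": "Section text...", "nodes": []}
        ],
    }
    for title, page in iter_ocr_sections(demo):
        print(page, title)
```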
10) PageIndexOCRProvider
Optional remote OCR adapter.
from doctr import PageIndexOCRProvider, DocumentIndexer
provider = PageIndexOCRProvider(api_key="YOUR_KEY", base_url="https://api.pageindex.ai")
indexer = DocumentIndexer(ocr_provider=provider)
idx = indexer.index_with_ocr("/path/to/scanned.pdf")
11) DocumentPipeline
Stage-based API.
from doctr import DocumentPipeline
pipeline = DocumentPipeline()
# Stage 1
inp = pipeline.document_input("/path/to/report.pdf")
# Stage 2
converted = pipeline.docling_conversion("/path/to/report.pdf")
# Stage 3
idx = pipeline.build_tree_index(converted=converted)
# Stage 4
context = pipeline.retrieve_for_chat(idx, "What are major risks?", top_k=6)
12) DoclingConverterAdapter
Direct Docling conversion wrapper.
from doctr import DoclingConverterAdapter
adapter = DoclingConverterAdapter()
converted = adapter.convert("/path/to/file.pdf")
print(converted.markdown[:200])
13) ConvertedDocument
Return type from DoclingConverterAdapter.convert.
from doctr import ConvertedDocument
obj = ConvertedDocument(
    source_path="/tmp/a.pdf",
    markdown="# Parsed document",
    metadata={"converter": "docling"},
)
14) retrieve_context(index_payload, question, top_k=6)
Context retrieval helper for chat prompts.
from doctr import retrieve_context, index_document
idx = index_document("/path/to/file.pdf")
payload = idx.to_pageindex_dict(include_empty_nodes=False)
ctx = retrieve_context(payload, "What changed in monetary policy?", top_k=8)
print(ctx)
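This README does not describe how retrieve_context ranks nodes. For intuition only, the sketch below shows a naive keyword-overlap ranking over a PageIndex-style payload. It is a stand-in technique, not the library's method; naive_retrieve and its scoring rule are hypothetical.

```python
import re
from typing import Any, Dict, List, Set, Tuple


def _tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def naive_retrieve(payload: Dict[str, Any], question: str, top_k: int = 6) -> List[Dict[str, Any]]:
    """Rank nodes by how many question words appear in title + summary."""
    query = _tokens(question)
    scored: List[Tuple[int, int, Dict[str, Any]]] = []

    def walk(nodes: List[Dict[str, Any]]) -> None:
        for node in nodes:
            text = f"{node.get('title', '')} {node.get('summary', '')}"
            score = len(query & _tokens(text))
            if score:
                # Second element records discovery order for stable ties.
                scored.append((score, len(scored), node))
            walk(node.get("nodes", []))

    walk(payload.get("nodes", []))
    # Highest score first; ties keep document order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [node for _, _, node in scored[:top_k]]
```

Plain word overlap ignores synonyms and word frequency, so it is only a baseline for seeing which nodes a question might touch.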
15) DocumentIndex model
Primary output object from indexers.
from doctr import index_document
idx = index_document("/path/to/file.pdf")
print(idx.to_dict())
print(idx.to_pageindex_dict(include_empty_nodes=False))
16) SectionNode model
Useful for custom node construction.
from doctr import SectionNode
n = SectionNode(title="Section A", start_page=10, end_page=12, summary="...")
17) IndexEnricher protocol
Type contract for custom post-processing.
from doctr import index_document
def enrich(idx):
    for node in idx.nodes:
        if node.summary:
            node.summary = node.summary[:300]
    return idx
idx = index_document("/path/to/file.pdf", enricher=enrich)
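The example above suggests the contract is a callable that takes an index and returns one. Under that assumption, the sketch below expresses it with typing.Protocol and builds a configurable enricher. StubNode, StubIndex, and Enricher are stand-ins defined here so the snippet runs without doctr; the real IndexEnricher signature may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Protocol


@dataclass
class StubNode:  # stand-in for a tree node, not doctr.SectionNode
    title: str
    summary: Optional[str] = None


@dataclass
class StubIndex:  # stand-in for doctr.DocumentIndex
    nodes: List[StubNode] = field(default_factory=list)


class Enricher(Protocol):  # assumed shape of the IndexEnricher contract
    def __call__(self, idx: StubIndex) -> StubIndex: ...


def truncate_summaries(limit: int) -> Enricher:
    """Build an enricher that caps every summary at `limit` characters."""

    def enrich(idx: StubIndex) -> StubIndex:
        for node in idx.nodes:
            if node.summary:
                node.summary = node.summary[:limit]
        return idx

    return enrich


if __name__ == "__main__":
    index = StubIndex([StubNode("A", "x" * 500), StubNode("B")])
    print(len(truncate_summaries(300)(index).nodes[0].summary))
```

Like the README example, this enricher only touches the top-level idx.nodes and mutates them in place.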
Embedded Files (Files within Files)
Supported best-effort extraction:
- Office embeddings (*/embeddings/*) in docx/xlsx/xlsm/pptx
- PDF file attachments
Enable it:
from doctr import index_document
idx = index_document(
    "/path/to/container.docx",
    include_embedded=True,
    max_embedded_depth=2,
)
Results:
- Adds Embedded Files branch in tree
- Writes extraction/index status in metadata["embedded_files"]
CLI
doctr /path/to/file.pdf \
--prefer-toc-hierarchy \
--include-embedded \
--max-embedded-depth 2 \
--format pageindex \
--output output_index.json
Examples
- Chat over index + Sonar: chat_with_document.py
- Local OCR from scratch (no API keys, no doctr import): local_ocr_tree_indexer.py
- Docling 4-stage demo: docling_pipeline_demo.py
Notes
- Normal indexing is local and does not require API keys.
- OCR provider mode needs provider credentials.
- nodes: [] on a node means it is a valid leaf node.
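The leaf rule can be applied when post-processing a tree. A minimal sketch, assuming a PageIndex-style dict: iter_leaves is a hypothetical helper, and treating a missing nodes key the same as an empty list is an assumption beyond what this README states.

```python
from typing import Any, Dict, Iterator


def iter_leaves(node: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield every node whose `nodes` list is empty (or, by assumption, missing)."""
    children = node.get("nodes") or []
    if not children:
        yield node
        return
    for child in children:
        yield from iter_leaves(child)


if __name__ == "__main__":
    tree = {
        "title": "Doc",
        "nodes": [
            {"title": "Intro", "nodes": []},
            {"title": "Body", "nodes": [{"title": "Details", "nodes": []}]},
        ],
    }
    print([leaf["title"] for leaf in iter_leaves(tree)])
```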
Development
pytest -q
License
MIT