Open-source document indexing library for building hierarchical trees from PDF and Markdown.
Overview
doctr turns real-world documents into PageIndex-style tree nodes:
- title
- node_id
- start_index
- end_index
- summary
- nodes
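As a rough illustration, a node carrying the fields listed above could be modelled as a typed dictionary. The field types (string identifiers, integer indexes, an optional summary) are assumptions made for this sketch, since the source does not document them, and `PageIndexNode` and `is_leaf` are hypothetical names rather than part of doctr.

```python
from typing import List, Optional, TypedDict


class PageIndexNode(TypedDict):
    """Hypothetical shape of one tree node; field types are assumed."""
    title: str
    node_id: str
    start_index: int
    end_index: int
    summary: Optional[str]
    nodes: List["PageIndexNode"]


# A hand-written sample; real values would come from the indexer.
sample: PageIndexNode = {
    "title": "Root",
    "node_id": "0001",
    "start_index": 1,
    "end_index": 4,
    "summary": None,
    "nodes": [
        {"title": "Child", "node_id": "0002", "start_index": 2,
         "end_index": 4, "summary": None, "nodes": []},
    ],
}


def is_leaf(node: PageIndexNode) -> bool:
    """A node with an empty `nodes` list has no children."""
    return not node["nodes"]
```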
Supported input formats:
- pdf
- docx
- xlsx/xlsm
- md/markdown
- txt
- msg (text fallback)
- embedded files inside supported containers (best effort)
Package Names
- PyPI distribution: doctr-index
- Python import: doctr
- Legacy compatibility import: pdfindexing (shim)
Installation
pip install doctr-index
Optional extras:
pip install 'doctr-index[office]' # docx/xlsx
pip install 'doctr-index[docling]' # docling adapter
pip install 'doctr-index[ocr]' # OCR provider adapter
Quick Start
from doctr import index_document
idx = index_document(
    "/path/to/report.pdf",
    prefer_toc_hierarchy=True,
    include_embedded=True,
    max_embedded_depth=2,
)
tree = idx.to_pageindex_dict(include_empty_nodes=False)
print(tree["nodes"][0])
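To inspect more than the first node, the exported dictionary can be walked recursively. The sketch below assumes only what the Quick Start shows: a top-level `"nodes"` list whose entries carry `title` and nested `nodes`. The `outline` helper is hypothetical and not part of doctr, and the hand-written `tree` stands in for the real export.

```python
from typing import Any, Dict, List


def outline(nodes: List[Dict[str, Any]], depth: int = 0) -> List[str]:
    """Return one indented line per node, visiting the tree depth-first."""
    lines: List[str] = []
    for node in nodes:
        title = node.get("title", "(untitled)")
        lines.append("  " * depth + f"- {title}")
        lines.extend(outline(node.get("nodes", []), depth + 1))
    return lines


# Stand-in for idx.to_pageindex_dict(...); the real dict may carry more keys.
tree = {"nodes": [{"title": "Report", "nodes": [{"title": "Risks", "nodes": []}]}]}
print("\n".join(outline(tree["nodes"])))
```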
4-Stage Pipeline
from doctr import DocumentPipeline
pipeline = DocumentPipeline()
# 1) Document input
info = pipeline.document_input("/path/to/report.pdf")
# 2) Docling conversion
converted = pipeline.docling_conversion("/path/to/report.pdf")
# 3) Tree index builder
index = pipeline.build_tree_index(converted=converted)
# 4) Retrieval/chat context
context = pipeline.retrieve_for_chat(index, "What are the key risks?", top_k=6)
print(context)
API Usage Reference
Core indexing
from doctr import (
    index_document,
    index_pdf_file,
    index_docx_file,
    index_xlsx_file,
    index_markdown_file,
    index_markdown_text,
)
idx1 = index_document("/path/to/file.pdf")
idx2 = index_pdf_file("/path/to/file.pdf")
idx3 = index_docx_file("/path/to/file.docx")
idx4 = index_xlsx_file("/path/to/file.xlsx")
idx5 = index_markdown_file("/path/to/file.md")
idx6 = index_markdown_text("# Root\n## Child")
Class API
from doctr import DocumentIndexer
indexer = DocumentIndexer()
idx = indexer.index_document("/path/to/file.pdf", include_embedded=True)
OCR payload mapping (no API key required)
from doctr import DocumentIndexer, document_index_from_ocr_payload
payload = {
    "doc_id": "pi-abc123",
    "status": "completed",
    "result": [
        {
            "title": "Financial Stability",
            "page_index": 21,
            "text": "...",
            "nodes": [],
        }
    ],
}
idx = DocumentIndexer().index_with_ocr("/path/to/scanned.pdf", ocr_payload=payload)
idx2 = document_index_from_ocr_payload(payload, source="/path/to/scanned.pdf")
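Before handing a payload to the indexer, it can help to sanity-check its shape. The sketch below relies only on the keys visible in the example payload (`status`, `result`, nested `nodes`). Treating `"completed"` as the only acceptable status is an assumption, and `check_payload` and `count_entries` are hypothetical helpers, not doctr functions.

```python
from typing import Any, Dict, List


def count_entries(entries: List[Dict[str, Any]]) -> int:
    """Count entries recursively through their nested `nodes` lists."""
    return sum(1 + count_entries(e.get("nodes", [])) for e in entries)


def check_payload(payload: Dict[str, Any]) -> int:
    """Raise if the payload looks unfinished; otherwise return the entry count."""
    if payload.get("status") != "completed":
        raise ValueError(f"OCR payload not completed: {payload.get('status')!r}")
    result = payload.get("result")
    if not isinstance(result, list):
        raise ValueError("OCR payload has no 'result' list")
    return count_entries(result)


# Same shape as the payload shown above.
sample_payload = {
    "doc_id": "pi-abc123",
    "status": "completed",
    "result": [
        {"title": "Financial Stability", "page_index": 21, "text": "...", "nodes": []}
    ],
}
print(check_payload(sample_payload))
```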
OCR provider adapter (API key required)
from doctr import PageIndexOCRProvider, DocumentIndexer
provider = PageIndexOCRProvider(api_key="YOUR_KEY", base_url="https://api.pageindex.ai")
idx = DocumentIndexer(ocr_provider=provider).index_with_ocr("/path/to/scanned.pdf")
Retrieval helper
from doctr import index_document, retrieve_context
idx = index_document("/path/to/report.pdf")
payload = idx.to_pageindex_dict(include_empty_nodes=False)
ctx = retrieve_context(payload, "What changed in supervision?", top_k=8)
print(ctx)
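The source does not describe how `retrieve_context` ranks nodes. For intuition only, the sketch below shows one simple way such a helper could work: scoring each node by keyword overlap between the query and the node's title and summary. It is not doctr's algorithm, and `rank_nodes` and `tokenize` are hypothetical names.

```python
import re
from typing import Any, Dict, List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_nodes(payload: Dict[str, Any], query: str, top_k: int = 3) -> List[Tuple[int, str]]:
    """Return up to top_k (score, title) pairs, best keyword overlap first."""
    query_terms = tokenize(query)
    scored: List[Tuple[int, str]] = []
    stack = list(payload.get("nodes", []))
    while stack:
        node = stack.pop()
        text = f"{node.get('title', '')} {node.get('summary') or ''}"
        score = len(query_terms & tokenize(text))
        if score:
            scored.append((score, node.get("title", "")))
        stack.extend(node.get("nodes", []))
    # Highest score first; ties are broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_k]


# Hand-written stand-in for idx.to_pageindex_dict(...).
demo = {
    "nodes": [
        {"title": "Supervision changes",
         "summary": "What changed in supervision rules",
         "nodes": [{"title": "Appendix", "summary": None, "nodes": []}]},
        {"title": "Market risks", "summary": "Supervision of liquidity", "nodes": []},
    ]
}
print(rank_nodes(demo, "What changed in supervision?"))
```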
Custom enricher
from doctr import index_document
def enrich(idx):
    for node in idx.nodes:
        if node.summary:
            node.summary = node.summary[:300]
    return idx
idx = index_document("/path/to/file.pdf", enricher=enrich)
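A fixed character slice can cut a summary mid-word. As a variation on the enricher above, the sketch below trims at a word boundary and tries the function on a stub object first. It assumes only what the example shows (an index exposing `nodes`, each with a mutable `summary`). `truncate_at_word` is a hypothetical helper, and the `SimpleNamespace` stub is a stand-in for a real index.

```python
from types import SimpleNamespace


def truncate_at_word(text: str, limit: int) -> str:
    """Cut text to at most `limit` characters without splitting a word."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # Drop the trailing partial word only if the cut landed mid-word.
    if not text[limit].isspace() and " " in cut:
        cut = cut.rsplit(" ", 1)[0]
    return cut.rstrip()


def enrich(idx, limit: int = 300):
    """Shorten every non-empty summary in place and return the index."""
    for node in idx.nodes:
        if node.summary:
            node.summary = truncate_at_word(node.summary, limit)
    return idx


# Stub index for a dry run; a real index would come from index_document.
stub = SimpleNamespace(nodes=[SimpleNamespace(summary="alpha beta gamma")])
enrich(stub, limit=12)
print(stub.nodes[0].summary)  # prints "alpha beta"
```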
CLI
doctr /path/to/file.pdf \
  --prefer-toc-hierarchy \
  --include-embedded \
  --max-embedded-depth 2 \
  --format pageindex \
  --output output_index.json
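The file written by the CLI can be inspected with the standard library. The sketch below assumes the output is JSON with the same top-level `"nodes"` list as `to_pageindex_dict`, which the source implies but does not state. `tree_stats` and `summarize_index_file` are hypothetical helpers.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Tuple


def tree_stats(nodes: List[Dict[str, Any]], depth: int = 1) -> Tuple[int, int]:
    """Return (node_count, max_depth) for a nested list of nodes."""
    count, deepest = 0, 0
    for node in nodes:
        child_count, child_depth = tree_stats(node.get("nodes", []), depth + 1)
        count += 1 + child_count
        deepest = max(deepest, depth, child_depth)
    return count, deepest


def summarize_index_file(path: str) -> Tuple[int, int]:
    """Load a JSON index file and report its node count and depth."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return tree_stats(data.get("nodes", []))
```

Calling `summarize_index_file("output_index.json")` after the command above would then give a quick size check of the generated tree.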
Embedded Files
Enable recursive embedded indexing:
from doctr import index_document
idx = index_document("/path/to/container.docx", include_embedded=True, max_embedded_depth=2)
Output includes:
- Embedded Files branch in tree
- metadata["embedded_files"] manifest with indexed/skipped status
Example Scripts
- examples/chat_with_document.py
- examples/docling_pipeline_demo.py
- examples/local_ocr_tree_indexer.py
Notes
- Normal indexing is local and does not require API keys.
- OCR provider mode requires credentials only if using remote OCR.
- nodes: [] means a valid leaf node.
Development
pytest -q
License
MIT