Open-source document indexing library for building hierarchical trees from PDF and Markdown.

doctr

Deterministic document indexing library that produces a PageIndex-style tree. Each node in the output carries:

  • title
  • node_id
  • start_index
  • end_index
  • summary
  • nodes

Supported inputs: pdf, docx, xlsx/xlsm, md/markdown, txt, and msg, with optional recursion into embedded files.
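
To make the node shape concrete, here is a minimal sketch that walks a PageIndex-style dict and prints an indented outline. The field names come from the list above; the sample values and the helper names `walk` and `outline` are invented for illustration and are not part of the doctr API.

```python
from typing import Any, Dict, Iterator, List, Tuple

# Hypothetical sample tree: keys follow the field list above, values are made up.
sample_tree: Dict[str, Any] = {
    "title": "Annual Report",
    "node_id": "0000",
    "start_index": 1,
    "end_index": 40,
    "summary": "Top-level overview.",
    "nodes": [
        {
            "title": "Risk Factors",
            "node_id": "0001",
            "start_index": 5,
            "end_index": 12,
            "summary": "Main risks.",
            "nodes": [],  # empty list marks a leaf
        },
    ],
}


def walk(node: Dict[str, Any], depth: int = 0) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Yield (depth, node) pairs in document order (pre-order traversal)."""
    yield depth, node
    for child in node.get("nodes", []):
        yield from walk(child, depth + 1)


def outline(node: Dict[str, Any]) -> List[str]:
    """Render one indented line per node with its index range."""
    return [
        f"{'  ' * depth}{n['title']} [{n['start_index']}-{n['end_index']}]"
        for depth, n in walk(node)
    ]


if __name__ == "__main__":
    print("\n".join(outline(sample_tree)))
```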

Package Name

PyPI distribution name: doctr-index
Python import name: doctr

Compatibility:

  • Legacy imports from pdfindexing are still supported via shim modules.

Install

From a source checkout (editable install with dev extras):

pip install -e '.[dev]'

From PyPI:

pip install doctr-index

With Office (docx, xlsx, xlsm):

pip install -e '.[office]'

With Docling:

pip install -e '.[docling]'

With OCR provider adapter:

pip install -e '.[ocr]'

Architecture (Separate Methods)

  1. Document input (pdf/docx/xlsx/...)
  2. Docling conversion (layout + OCR + reading order + tables)
  3. Tree index builder
  4. Retrieval/chat layer

Use DocumentPipeline when you want these stages explicitly separated.

Quick Start

from doctr import index_document

idx = index_document(
    "/path/to/report.pdf",
    prefer_toc_hierarchy=True,
    include_embedded=True,
    max_embedded_depth=2,
)

tree = idx.to_pageindex_dict(include_empty_nodes=False)
print(tree["nodes"][0])

Full Public API (Every Export) with Usage

1) index_document(...)

Single entrypoint for supported local files.

from doctr import index_document

idx = index_document(
    "/path/to/file.pdf",
    max_toc_pages=20,
    prefer_toc_hierarchy=False,
    summary_max_chars=None,   # full summaries
    include_embedded=False,
    max_embedded_depth=2,
)

2) index_pdf_file(path, ...)

PDF-specific indexing.

from doctr import index_pdf_file

idx = index_pdf_file("/path/to/file.pdf", prefer_toc_hierarchy=True)

3) index_docx_file(path)

DOCX indexing from heading styles and paragraph sections.

from doctr import index_docx_file

idx = index_docx_file("/path/to/file.docx")

4) index_xlsx_file(path)

XLSX/XLSM indexing by sheets and row chunks.

from doctr import index_xlsx_file

idx = index_xlsx_file("/path/to/file.xlsx")

5) index_markdown_file(path)

from doctr import index_markdown_file

idx = index_markdown_file("/path/to/file.md")

6) index_markdown_text(markdown)

from doctr import index_markdown_text

idx = index_markdown_text("# Root\n## Child\n")
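
As an illustration of how heading levels can map to a nested tree, the sketch below builds parent/child nodes from ATX headings with a stack. It is a simplified stand-in and not doctr's implementation: the function name `headings_to_tree` is hypothetical, it emits only `title` and `nodes`, and it does not skip headings inside fenced code blocks.

```python
import re
from typing import Any, Dict, List, Tuple

# Matches ATX headings such as "# Title" or "### Sub title".
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def headings_to_tree(markdown: str) -> List[Dict[str, Any]]:
    """Nest headings by level: a deeper heading becomes a child of the
    nearest preceding heading with a smaller level."""
    roots: List[Dict[str, Any]] = []
    stack: List[Tuple[int, Dict[str, Any]]] = []  # (level, node) of open ancestors
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if not match:
            continue
        level = len(match.group(1))
        node: Dict[str, Any] = {"title": match.group(2), "nodes": []}
        # Close every open heading at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        if stack:
            stack[-1][1]["nodes"].append(node)
        else:
            roots.append(node)
        stack.append((level, node))
    return roots


if __name__ == "__main__":
    print(headings_to_tree("# Root\n## Child\n"))
```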

7) DocumentIndexer

Class API that exposes both the standard indexing methods and the OCR-based indexing methods.

from doctr import DocumentIndexer

indexer = DocumentIndexer()
idx = indexer.index_document("/path/to/file.pdf", include_embedded=True)

8) DocumentIndexer.index_with_ocr(...)

Pass a precomputed OCR payload directly, or let a configured provider supply it.

from doctr import DocumentIndexer

ocr_payload = {
    "doc_id": "pi-abc123",
    "status": "completed",
    "result": [
        {
            "title": "Financial Stability",
            "page_index": 21,
            "text": "Section text...",
            "nodes": []
        }
    ]
}

idx = DocumentIndexer().index_with_ocr("/path/to/scanned.pdf", ocr_payload=ocr_payload)

9) document_index_from_ocr_payload(payload, source=...)

Standalone converter from OCR node payload to DocumentIndex.

from doctr import document_index_from_ocr_payload

# ocr_payload is the same dict shown in section 8
idx = document_index_from_ocr_payload(ocr_payload, source="/path/to/scanned.pdf")
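
Before handing a payload to a converter, it can help to check its status and see which nodes it contains. The sketch below assumes the payload shape shown in section 8 and assumes that "completed" is the only ready status. The helper `flatten_ocr_nodes` is hypothetical and is not a doctr export.

```python
from typing import Any, Dict, List

# Same shape as the payload in section 8.
sample_payload: Dict[str, Any] = {
    "doc_id": "pi-abc123",
    "status": "completed",
    "result": [
        {
            "title": "Financial Stability",
            "page_index": 21,
            "text": "Section text...",
            "nodes": [],
        }
    ],
}


def flatten_ocr_nodes(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return every node in the payload in document order.

    Raises ValueError if the payload is not marked as completed
    (assumption: any other status means the result is not usable yet).
    """
    status = payload.get("status")
    if status != "completed":
        raise ValueError(f"OCR payload not ready: status={status!r}")

    flat: List[Dict[str, Any]] = []

    def visit(nodes: List[Dict[str, Any]]) -> None:
        for node in nodes:
            flat.append(node)
            visit(node.get("nodes", []))

    visit(payload.get("result", []))
    return flat


if __name__ == "__main__":
    for node in flatten_ocr_nodes(sample_payload):
        print(node["page_index"], node["title"])
```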

10) PageIndexOCRProvider

Optional remote OCR adapter.

from doctr import PageIndexOCRProvider, DocumentIndexer

provider = PageIndexOCRProvider(api_key="YOUR_KEY", base_url="https://api.pageindex.ai")
indexer = DocumentIndexer(ocr_provider=provider)
idx = indexer.index_with_ocr("/path/to/scanned.pdf")

11) DocumentPipeline

Stage-based API.

from doctr import DocumentPipeline

pipeline = DocumentPipeline()

# Stage 1
inp = pipeline.document_input("/path/to/report.pdf")

# Stage 2
converted = pipeline.docling_conversion("/path/to/report.pdf")

# Stage 3
idx = pipeline.build_tree_index(converted=converted)

# Stage 4
context = pipeline.retrieve_for_chat(idx, "What are major risks?", top_k=6)

12) DoclingConverterAdapter

Direct Docling conversion wrapper.

from doctr import DoclingConverterAdapter

adapter = DoclingConverterAdapter()
converted = adapter.convert("/path/to/file.pdf")
print(converted.markdown[:200])

13) ConvertedDocument

Return type of DoclingConverterAdapter.convert.

from doctr import ConvertedDocument

obj = ConvertedDocument(
    source_path="/tmp/a.pdf",
    markdown="# Parsed document",
    metadata={"converter": "docling"},
)

14) retrieve_context(index_payload, question, top_k=6)

Context retrieval helper for chat prompts.

from doctr import retrieve_context, index_document

idx = index_document("/path/to/file.pdf")
payload = idx.to_pageindex_dict(include_empty_nodes=False)
ctx = retrieve_context(payload, "What changed in monetary policy?", top_k=8)
print(ctx)
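
The README does not describe how retrieve_context scores nodes. As a rough mental model only, the sketch below ranks nodes of a PageIndex-style dict by keyword overlap between the question and each node's title and summary. The function `rank_nodes`, the scoring rule, and the sample tree are assumptions made for illustration, not the library's algorithm.

```python
import re
from typing import Any, Dict, List, Set, Tuple


def _tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_nodes(tree: Dict[str, Any], question: str, top_k: int = 6) -> List[Tuple[int, str]]:
    """Return up to top_k (score, title) pairs, highest overlap first.

    Score = number of distinct question tokens that also appear in the
    node's title or summary. Nodes with score 0 are dropped.
    """
    wanted = _tokens(question)
    scored: List[Tuple[int, str]] = []

    def visit(node: Dict[str, Any]) -> None:
        text = f"{node.get('title', '')} {node.get('summary', '')}"
        score = len(wanted & _tokens(text))
        if score:
            scored.append((score, node.get("title", "")))
        for child in node.get("nodes", []):
            visit(child)

    visit(tree)
    # sort is stable, so ties keep document order
    scored.sort(key=lambda pair: -pair[0])
    return scored[:top_k]


if __name__ == "__main__":
    demo = {
        "title": "Report",
        "summary": "",
        "nodes": [
            {"title": "Monetary Policy", "summary": "Rate decisions.", "nodes": []},
            {"title": "Appendix", "summary": "Tables.", "nodes": []},
        ],
    }
    print(rank_nodes(demo, "What changed in monetary policy?"))
```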

15) DocumentIndex model

Primary output object from indexers.

from doctr import index_document

idx = index_document("/path/to/file.pdf")
print(idx.to_dict())
print(idx.to_pageindex_dict(include_empty_nodes=False))

16) SectionNode model

Useful for custom node construction.

from doctr import SectionNode

n = SectionNode(title="Section A", start_page=10, end_page=12, summary="...")

17) IndexEnricher protocol

Type contract for custom post-processing.

from doctr import index_document

def enrich(idx):
    # Cap each top-level node's summary at 300 characters
    for node in idx.nodes:
        if node.summary:
            node.summary = node.summary[:300]
    return idx

idx = index_document("/path/to/file.pdf", enricher=enrich)
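
The README does not show the IndexEnricher definition itself. One plausible reading, sketched below with typing.Protocol, is "a callable that takes an index and returns an index". The classes `StubNode`, `StubIndex`, and `Enricher` are stand-ins invented so the example runs without doctr installed; the real protocol may differ. Unlike the example above, this enricher also recurses into child nodes.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Protocol


@dataclass
class StubNode:
    """Stand-in for a section node (hypothetical, not doctr.SectionNode)."""
    title: str
    summary: Optional[str] = None
    nodes: List["StubNode"] = field(default_factory=list)


@dataclass
class StubIndex:
    """Stand-in for a document index (hypothetical, not doctr.DocumentIndex)."""
    nodes: List[StubNode] = field(default_factory=list)


class Enricher(Protocol):
    """Assumed contract: take an index, return the (possibly modified) index."""

    def __call__(self, idx: StubIndex) -> StubIndex: ...


def truncate_summaries(idx: StubIndex, limit: int = 300) -> StubIndex:
    """Cap every summary in the tree at `limit` characters, at any depth."""

    def visit(node: StubNode) -> None:
        if node.summary:
            node.summary = node.summary[:limit]
        for child in node.nodes:
            visit(child)

    for node in idx.nodes:
        visit(node)
    return idx


# Static type checkers accept this assignment because the signature matches.
enricher: Enricher = truncate_summaries
```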

Embedded Files (Files within Files)

Embedded files are extracted on a best-effort basis from:

  • Office embeddings (*/embeddings/*) in docx/xlsx/xlsm/pptx
  • PDF file attachments

Enable it:

from doctr import index_document

idx = index_document(
    "/path/to/container.docx",
    include_embedded=True,
    max_embedded_depth=2,
)

Results:

  • Adds an "Embedded Files" branch to the tree
  • Records extraction and indexing status in metadata["embedded_files"]
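
For background on the Office case: docx, xlsx, xlsm, and pptx files are ZIP containers, so members matching */embeddings/* can be listed with the standard library alone. The sketch below shows that idea only. The helper `list_office_embeddings` is hypothetical and is not how doctr is known to implement extraction.

```python
import io
import zipfile
from typing import List


def list_office_embeddings(data: bytes) -> List[str]:
    """Return archive member names that sit under an 'embeddings/' folder.

    `data` is the raw bytes of an Office Open XML file (a ZIP archive).
    Directory entries are skipped.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return [
            name
            for name in archive.namelist()
            if "/embeddings/" in name and not name.endswith("/")
        ]


if __name__ == "__main__":
    # Build a tiny in-memory archive that mimics a docx layout.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<w:document/>")
        archive.writestr("word/embeddings/report.xlsx", b"placeholder")
    print(list_office_embeddings(buffer.getvalue()))
```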

CLI

doctr /path/to/file.pdf \
  --prefer-toc-hierarchy \
  --include-embedded \
  --max-embedded-depth 2 \
  --format pageindex \
  --output output_index.json

Notes

  • Normal indexing is local and does not require API keys.
  • OCR provider mode needs provider credentials.
  • nodes: [] on a node means it is a valid leaf node.
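
The leaf rule in the last note can be applied directly when post-processing the exported dict. A minimal sketch, assuming the PageIndex-style shape described earlier. The helper `leaf_titles` is hypothetical, and it also treats a missing `nodes` key as a leaf, which is an assumption beyond what the note states.

```python
from typing import Any, Dict, List


def leaf_titles(node: Dict[str, Any]) -> List[str]:
    """Collect titles of all leaf nodes (nodes == []) in document order."""
    children = node.get("nodes") or []
    if not children:
        return [node.get("title", "")]
    titles: List[str] = []
    for child in children:
        titles.extend(leaf_titles(child))
    return titles


if __name__ == "__main__":
    tree = {
        "title": "Report",
        "nodes": [
            {"title": "Intro", "nodes": []},
            {"title": "Body", "nodes": [{"title": "Details", "nodes": []}]},
        ],
    }
    print(leaf_titles(tree))
```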

Development

pytest -q

License

MIT
