Hierarchical document indexing and tree-based retrieval for RAG pipelines, written in Rust

pageindex-rs

A Rust-powered Python library for structured document retrieval in RAG pipelines.

I kept running into the same problem building LLM agents over financial documents and technical manuals: chunk-based RAG is terrible at it. You split a 10-K into 512-token chunks, embed them, and at query time you get back three chunks from different sections that happen to share vocabulary, none of which actually answers the question. The retrieval is noisy, the context window fills up with irrelevant text, and you end up paying for tokens that actively hurt the answer.

The fix is obvious once you see it — structured documents already tell you how they're organized. Every heading is a natural retrieval boundary. pageindex-rs just respects that structure. It parses a markdown document into a tree of nodes, one per heading, and at retrieval time you hand the outline to your LLM and ask it which section to look in. One node, exactly the text you need, no embeddings required.
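To make the idea concrete, here is a minimal Python sketch of heading-based indexing. It is not the library's Rust parser: `Node`, `build_outline`, and the numbering rules are assumptions made for illustration, and it only handles ATX headings (`#` through `######`) outside fenced code blocks.

```python
import re
from dataclasses import dataclass, field

# ATX heading: 1-6 '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Node:
    node_id: str   # dot notation, e.g. "2.1" (hypothetical, mirrors the README's IDs)
    title: str
    depth: int     # heading level: 1 for '#', 2 for '##', ...
    lines: list = field(default_factory=list)

    @property
    def text(self):
        return "\n".join(self.lines).strip()


def build_outline(markdown):
    """Split markdown into one Node per heading, numbered in dot notation."""
    nodes = []
    counters = []   # counters[i] is the running number at nesting level i
    levels = []     # heading levels of the currently open ancestors
    current = None
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match is None:
            # Body text belongs to the most recent heading, if any.
            if current is not None:
                current.lines.append(line)
            continue
        level = len(match.group(1))
        # Close every open heading at the same or a deeper level.
        while levels and levels[-1] >= level:
            levels.pop()
        levels.append(level)
        nesting = len(levels)
        del counters[nesting:]          # reset numbering below this level
        if len(counters) < nesting:
            counters.append(0)
        counters[nesting - 1] += 1
        current = Node(".".join(map(str, counters)), match.group(2), level)
        nodes.append(current)
    return nodes
```

Each node's `text` holds only the body directly under its heading, so a parent and its children never overlap. That is the property that lets a single node ID stand for exactly one slice of the document.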

This is a Rust reimplementation of the original PageIndex library with Python bindings via PyO3. On documents of a few hundred KB and up, the Rust version builds indexes and looks up nodes faster, and its tail latency on large documents is far more stable; details are in the benchmarks below.

Installation

pip install pageindex-rs

Usage

import pageindex_rs

# Build an index from a markdown file
index = pageindex_rs.PageIndex.from_file("annual_report", "report.md")

# Or directly from a string
index = pageindex_rs.PageIndex.from_markdown("annual_report", markdown_string)

# Get the outline — this is what you send to your LLM
print(index.outline())
# [1] Executive Summary
# [2] Financial Results
#   [2.1] Revenue
#   [2.2] Expenses
#   [2.3] Net Income
# [3] Risk Factors
#   [3.1] Market Risk
#   [3.2] Regulatory Risk

# Fetch the node your LLM picked
node = index.get_node("3.2")
print(node.title)       # Regulatory Risk
print(node.text)        # New AI regulations in the EU...
print(node.breadcrumb)  # ['Risk Factors', 'Regulatory Risk']

# Need the full section including subsections?
section = index.get_node_with_children("2")
print(section.text)     # Revenue + Expenses + Net Income combined

# Peek at what's inside a section before going deeper
children = index.get_children("2")
# [('2.1', 'Revenue'), ('2.2', 'Expenses'), ('2.3', 'Net Income')]

# Full tree as JSON if you need it
print(index.to_json())

How retrieval works

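# `llm` stands in for your own model-call wrapper and `user_query` for the
# user's question; neither is provided by pageindex-rs.
outline = index.outline()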

response = llm(f"""
Document outline:
{outline}

Question: {user_query}

Return only the node_id of the most relevant section. Nothing else.
""")

node_id = response.strip()
result = index.get_node(node_id)
# Pass result.text to your LLM to generate the final answer

The dot-notation node IDs (1.2.3) give the LLM a natural sense of document structure — it can see that 2.3 is a subsection of 2 without any extra explanation. This turns out to matter for accuracy.
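Models do not always return a bare ID, even when asked to. A defensive way to handle that is to pull the first dot-notation ID out of the response and check it against the IDs the index actually contains. The helper below is a hypothetical sketch, not part of pageindex-rs; it assumes `index.node_ids()` yields plain strings such as `"3.2"`.

```python
import re

# One or more digit groups separated by dots: "3", "3.2", "1.2.3".
NODE_ID = re.compile(r"\d+(?:\.\d+)*")


def pick_node_id(response, valid_ids):
    """Return the first ID found in `response` that is in `valid_ids`, else None."""
    valid = set(valid_ids)
    for candidate in NODE_ID.findall(response):
        if candidate in valid:
            return candidate
    return None
```

In the retrieval snippet above this would replace `response.strip()` with something like `pick_node_id(response, index.node_ids())`, with a fallback (retry, or a broader section) when it returns `None`.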

API

PageIndex

Method	Description
`PageIndex.from_markdown(doc_id, markdown)`	Build from a markdown string
`PageIndex.from_file(doc_id, path)`	Build from a file path
`index.title()`	Document title (first H1)
`index.outline()`	Compact tree for LLM prompts
`index.node_ids()`	All node IDs in the tree
`index.get_node(node_id)`	Single node lookup
`index.get_node_with_children(node_id)`	Node with all descendant text merged
`index.get_children(node_id)`	Direct children as `(node_id, title)` pairs
`index.to_json()`	Full tree as JSON

NodeResult

Attribute	Type	Description
`node_id`	str	Dot-separated ID, e.g. `"2.1"`
`title`	str	Heading text
`text`	str	Body text of this node
`depth`	int	Heading level (1 = `#`, 2 = `##`, etc.)
`breadcrumb`	list[str]	Path from root to this node
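For very large outlines it can help to walk the tree one level at a time instead of sending the whole outline in one prompt. The sketch below uses only `get_children` from the table above. `drill_down` and the `choose` callback are hypothetical names, and it assumes `get_children` returns an empty list for a leaf node, which the documentation above does not state.

```python
def drill_down(index, choose, top_level):
    """Walk the tree level by level, letting `choose` pick a branch each time.

    `top_level` is a list of (node_id, title) pairs to start from.
    `choose(options)` receives such a list and returns one node_id from it,
    or None to stop at the node picked so far.
    """
    options = list(top_level)
    picked = None
    while options:
        choice = choose(options)
        offered = {node_id for node_id, _ in options}
        if choice is None or choice not in offered:
            break  # stop on "none of these" or on an ID that was not offered
        picked = choice
        options = list(index.get_children(picked))
    return picked
```

Here `choose` would typically wrap an LLM call that is shown only the titles at the current level. The top-level pairs have to come from the caller, for example by taking the IDs from `index.node_ids()` that contain no dot and reading each title with `index.get_node`.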

Benchmarks

Benchmarked against the original Python PageIndex library. 500 iterations per build test, 1000 random lookups per retrieval test. Run the full benchmark yourself: tests/pageindex_rs_benchmark.ipynb.

Index build speed

Document size	Rust mean	Python mean	Speedup
42 KB	0.207 ms	0.153 ms	0.74x ❌
395 KB	0.873 ms	1.369 ms	1.57x
1055 KB	2.549 ms	4.278 ms	1.68x

Below ~200 KB, PyO3 FFI overhead cancels out the parsing speedup. At realistic document sizes (several hundred KB and above) Rust pulls ahead. On large documents the more important number is consistency:

Document size	Rust stdev	Python stdev	Rust p99	Python p99
42 KB	0.835 ms	0.014 ms	1.335 ms	0.206 ms
395 KB	0.060 ms	0.053 ms	1.129 ms	1.511 ms
1055 KB	0.104 ms	2.782 ms	2.781 ms	20.993 ms

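At 1055 KB, Python's p99 is about 21 ms and its max is 42 ms. Rust's p99 is 2.8 ms and its max is 3.7 ms. In a pipeline handling hundreds of documents, those Python spikes add up. The small-document case goes the other way: at 42 KB, Rust's stdev and p99 are both higher than Python's.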
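The summary statistics in these tables are straightforward to reproduce for your own documents. The sketch below is an assumption about how such a harness could look, not a copy of the benchmark notebook: `time_calls` and `summarize` are hypothetical helpers, and p99 uses the nearest-rank method, which may differ from what the notebook does.

```python
import statistics
import time


def time_calls(fn, iterations=500):
    """Call `fn` repeatedly and return per-call wall-clock times in milliseconds."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples_ms):
    """Mean, sample stdev, nearest-rank p99, and max of a list of timings."""
    ordered = sorted(samples_ms)
    # Nearest rank: ceil(0.99 * n), computed with integers to avoid float rounding.
    rank = max(1, -(-99 * len(ordered) // 100))
    return {
        "mean": statistics.fmean(ordered),
        "stdev": statistics.stdev(ordered),  # needs at least two samples
        "p99": ordered[rank - 1],
        "max": ordered[-1],
    }
```

A build benchmark would then be something like `summarize(time_calls(lambda: pageindex_rs.PageIndex.from_markdown("doc", text)))`. Reporting p99 and max alongside the mean is what exposes the occasional slow call that an average hides.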

Node retrieval speed

Rust uses a HashMap, so lookups are O(1) on average. Python does a linear scan, so lookup time grows with the number of nodes.

Document size	Nodes	Rust mean	Python mean	Speedup
42 KB	28	0.0072 ms	0.0060 ms	0.83x
395 KB	261	0.0119 ms	0.0272 ms	2.29x
1055 KB	765	0.0216 ms	0.0686 ms	3.18x

The gap keeps widening. At 765 nodes Rust is 3.18x faster on average. For large technical manuals or combined document corpora this becomes meaningful.
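The difference between the two lookup strategies is easy to see in plain Python, where a `dict` plays the role of Rust's `HashMap`. This is an illustration of the two approaches only; the function names and the node layout are made up for the example and are not taken from either library.

```python
def find_linear(nodes, node_id):
    """O(n): walk the list until the ID matches."""
    for node in nodes:
        if node["node_id"] == node_id:
            return node
    return None


def build_lookup(nodes):
    """One O(n) pass up front, then average O(1) per lookup."""
    return {node["node_id"]: node for node in nodes}
```

With a few dozen nodes the scan is cheap enough that the two are hard to tell apart. With several hundred nodes, every lookup near the end of the list pays for walking everything before it.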

Answer accuracy

Tested on 10 financial questions against a ~3MB document corpus:

Library	Correct
pageindex-rs	9 / 10
PageIndex (Python)	7 / 10

The accuracy difference comes down to node IDs. `1.2.3` is self-explanatory to an LLM: it signals hierarchy directly. A flat ID like `0012` is just a number with no structural meaning, so the LLM occasionally picks the wrong node.

Roadmap

PDF support (the big one)
Cross-document retrieval across a corpus
PageRank-style importance scoring on the tree

License

MIT

Release history

0.1.3 (Feb 25, 2026)
0.1.2 (Feb 25, 2026)
0.1.0 (Feb 25, 2026), this version
