Skip to main content

Hierarchical document indexing and tree-based retrieval for RAG pipelines, written in Rust

Project description

pageindex-rs

A Rust-powered Python library for structured document retrieval in RAG pipelines.


I kept running into the same problem building LLM agents over financial documents and technical manuals: chunk-based RAG is terrible at it. You embed a 10-K, split it into 512-token chunks, and at query time you get three chunks from different sections that happen to share vocabulary, none of which actually answers the question. The retrieval is noisy, the context window fills up with irrelevant text, and you end up paying for tokens that actively hurt the answer.

The fix is obvious once you see it — structured documents already tell you how they're organized. Every heading is a natural retrieval boundary. pageindex-rs just respects that structure. It parses a markdown document into a tree of nodes, one per heading, and at retrieval time you hand the outline to your LLM and ask it which section to look in. One node, exactly the text you need, no embeddings required.

This is a Rust reimplementation of the original PageIndex library with Python bindings via PyO3. The Rust version is faster at scale and more consistent under load — details in the benchmarks below.

Installation

pip install pageindex-rs

Usage

import pageindex_rs

# Build an index from a markdown file
index = pageindex_rs.PageIndex.from_file("annual_report", "report.md")

# Or directly from a string
index = pageindex_rs.PageIndex.from_markdown("annual_report", markdown_string)

# Get the outline — this is what you send to your LLM
print(index.outline())
# [1] Executive Summary
# [2] Financial Results
#   [2.1] Revenue
#   [2.2] Expenses
#   [2.3] Net Income
# [3] Risk Factors
#   [3.1] Market Risk
#   [3.2] Regulatory Risk

# Fetch the node your LLM picked
node = index.get_node("3.2")
print(node.title)       # Regulatory Risk
print(node.text)        # New AI regulations in the EU...
print(node.breadcrumb)  # ['Risk Factors', 'Regulatory Risk']

# Need the full section including subsections?
section = index.get_node_with_children("2")
print(section.text)     # Revenue + Expenses + Net Income combined

# Peek at what's inside a section before going deeper
children = index.get_children("2")
# [('2.1', 'Revenue'), ('2.2', 'Expenses'), ('2.3', 'Net Income')]

# Full tree as JSON if you need it
print(index.to_json())

How retrieval works

outline = index.outline()

response = llm(f"""
Document outline:
{outline}

Question: {user_query}

Return only the node_id of the most relevant section. Nothing else.
""")

node_id = response.strip()
result = index.get_node(node_id)
# Pass result.text to your LLM to generate the final answer

The dot-notation node IDs (1.2.3) give the LLM a natural sense of document structure — it can see that 2.3 is a subsection of 2 without any extra explanation. This turns out to matter for accuracy.

API

PageIndex

Method Description
PageIndex.from_markdown(doc_id, markdown) Build from a markdown string
PageIndex.from_file(doc_id, path) Build from a file path
index.title() Document title (first H1)
index.outline() Compact tree for LLM prompts
index.node_ids() All node IDs in the tree
index.get_node(node_id) Single node lookup
index.get_node_with_children(node_id) Node with all descendant text merged
index.get_children(node_id) Direct children as (node_id, title) pairs
index.to_json() Full tree as JSON

NodeResult

Attribute Type Description
node_id str Dot-separated ID, e.g. "2.1"
title str Heading text
text str Body text of this node
depth int Heading level (1 = #, 2 = ##, etc.)
breadcrumb list[str] Path from root to this node

Benchmarks

Benchmarked against the original Python PageIndex library. 500 iterations per build test, 1000 random lookups per retrieval test. Run the full benchmark yourself: tests/pageindex_rs_benchmark.ipynb.

Index build speed

Document size Rust mean Python mean Speedup
42 KB 0.207 ms 0.153 ms 0.74x ❌
395 KB 0.873 ms 1.369 ms 1.57x
1055 KB 2.549 ms 4.278 ms 1.68x

Below ~200KB, PyO3 FFI overhead cancels out the parsing speedup. At realistic document sizes (several hundred KB and above) Rust pulls ahead. The more important number is consistency:

Document size Rust stdev Python stdev Rust p99 Python p99
42 KB 0.835 ms 0.014 ms 1.335 ms 0.206 ms
395 KB 0.060 ms 0.053 ms 1.129 ms 1.511 ms
1055 KB 0.104 ms 2.782 ms 2.781 ms 20.993 ms

At 1055 KB, Python's p99 is 20ms and its max is 42ms. Rust's p99 is 2.8ms and max is 3.7ms. In a pipeline handling hundreds of documents, those Python spikes add up.

Node retrieval speed

Rust uses a HashMap so lookups are O(1). Python does a linear scan, so performance degrades as the tree grows.

Document size Nodes Rust mean Python mean Speedup
42 KB 28 0.0072 ms 0.0060 ms 0.83x
395 KB 261 0.0119 ms 0.0272 ms 2.29x
1055 KB 765 0.0216 ms 0.0686 ms 3.18x

The gap keeps widening. At 765 nodes Rust is 3.18x faster on average. For large technical manuals or combined document corpora this becomes meaningful.

Answer accuracy

Tested on 10 financial questions against a ~3MB document corpus:

Correct
pageindex-rs 9 / 10
PageIndex (Python) 7 / 10

The accuracy difference comes down to node IDs. 1.2.3 is self-explanatory to an LLM — it signals hierarchy directly. 0012 is just a number with no structural meaning, so the LLM occasionally picks the wrong node.

Roadmap

  • PDF support (the big one)
  • Cross-document retrieval across a corpus
  • PageRank-style importance scoring on the tree

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pageindex_rs-0.1.0.tar.gz (36.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pageindex_rs-0.1.0-cp312-cp312-macosx_11_0_arm64.whl (216.4 kB view details)

Uploaded CPython 3.12macOS 11.0+ ARM64

File details

Details for the file pageindex_rs-0.1.0.tar.gz.

File metadata

  • Download URL: pageindex_rs-0.1.0.tar.gz
  • Upload date:
  • Size: 36.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.12.4

File hashes

Hashes for pageindex_rs-0.1.0.tar.gz
Algorithm Hash digest
SHA256 552a6a779f85461434ddcefb568aa0b56afeb356367cd634db94bf382ce1f1bf
MD5 553eb6a630ca1087742f5cdcbaf67cee
BLAKE2b-256 1cebb69c64a4605a7a433188b310c59be1fe2b86b4c8718cc171d87264ae1f0e

See more details on using hashes here.

File details

Details for the file pageindex_rs-0.1.0-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for pageindex_rs-0.1.0-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 6acec84c8d58ea7b36a631e7b88983b1710acf0a7ed700f9e59effea04481ec2
MD5 8844eebd568acb8498ae29319b786756
BLAKE2b-256 1238f24df4dce154093e87019401df696812327dca0b25327293f57848de6b7e

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page