Hierarchical document indexing and tree-based retrieval for RAG pipelines, written in Rust

pageindex-rs

A Rust-powered Python library for structured document retrieval in RAG pipelines.


I kept running into the same problem building LLM agents over financial documents and technical manuals: chunk-based RAG is terrible at it. You embed a 10-K, split it into 512-token chunks, and at query time you get three chunks from different sections that happen to share vocabulary, none of which actually answers the question. The retrieval is noisy, the context window fills up with irrelevant text, and you end up paying for tokens that actively hurt the answer.

The fix is obvious once you see it — structured documents already tell you how they're organized. Every heading is a natural retrieval boundary. pageindex-rs just respects that structure. It parses a markdown document into a tree of nodes, one per heading, and at retrieval time you hand the outline to your LLM and ask it which section to look in. One node, exactly the text you need, no embeddings required.
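To make the idea concrete, here is a minimal pure-Python sketch of turning markdown headings into a tree of dot-numbered nodes. It illustrates the technique only and is not the library's Rust parser: the function name `build_nodes`, the node dictionary layout, and the choice to treat every heading (including `#`) as a node are assumptions, and it ignores edge cases such as `#` lines inside fenced code.

```python
import re

# One to six '#' characters, whitespace, then the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")

def build_nodes(markdown: str) -> dict[str, dict]:
    """Map dot-notation IDs to nodes, one node per heading, in document order."""
    nodes = {}         # node_id -> {"title", "depth", "text"}
    stack = []         # (heading level, node_id) for the current ancestor chain
    child_counts = {}  # parent node_id ("" for the root) -> children seen so far
    current = None
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match is None:
            # Body text belongs to the most recent heading, if there is one.
            if current is not None:
                nodes[current]["text"] += line + "\n"
            continue
        level, title = len(match.group(1)), match.group(2)
        # Close every open section at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        parent = stack[-1][1] if stack else ""
        child_counts[parent] = child_counts.get(parent, 0) + 1
        number = str(child_counts[parent])
        current = f"{parent}.{number}" if parent else number
        nodes[current] = {"title": title, "depth": level, "text": ""}
        stack.append((level, current))
    for node in nodes.values():
        node["text"] = node["text"].strip()
    return nodes

SAMPLE = """\
# Financial Results
Summary of the year.
## Revenue
Revenue grew.
## Expenses
Costs fell.
# Risk Factors
## Market Risk
Rates may rise.
"""

nodes = build_nodes(SAMPLE)
for node_id, node in nodes.items():
    print("  " * node_id.count(".") + f"[{node_id}] {node['title']}")
```

Each heading closes any open section at the same or deeper level, so a node's ID is its parent's ID plus its position among its siblings.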

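This is a Rust reimplementation of the original PageIndex library with Python bindings via PyO3. The Rust version is faster on documents of several hundred KB and up, and has much lower tail latency on large documents. On small documents (around 42 KB) it is slightly slower. Details are in the benchmarks below.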

Installation

pip install pageindex-rs

Usage

import pageindex_rs

# Build an index from a markdown file
index = pageindex_rs.PageIndex.from_file("annual_report", "report.md")

# Or directly from a string
index = pageindex_rs.PageIndex.from_markdown("annual_report", markdown_string)

# Get the outline — this is what you send to your LLM
print(index.outline())
# [1] Executive Summary
# [2] Financial Results
#   [2.1] Revenue
#   [2.2] Expenses
#   [2.3] Net Income
# [3] Risk Factors
#   [3.1] Market Risk
#   [3.2] Regulatory Risk

# Fetch the node your LLM picked
node = index.get_node("3.2")
print(node.title)       # Regulatory Risk
print(node.text)        # New AI regulations in the EU...
print(node.breadcrumb)  # ['Risk Factors', 'Regulatory Risk']

# Need the full section including subsections?
section = index.get_node_with_children("2")
print(section.text)     # Revenue + Expenses + Net Income combined

# Peek at what's inside a section before going deeper
children = index.get_children("2")
# [('2.1', 'Revenue'), ('2.2', 'Expenses'), ('2.3', 'Net Income')]

# Full tree as JSON if you need it
print(index.to_json())

How retrieval works

# llm() and user_query are placeholders for your own model call and the user's question
outline = index.outline()

response = llm(f"""
Document outline:
{outline}

Question: {user_query}

Return only the node_id of the most relevant section. Nothing else.
""")

node_id = response.strip()
result = index.get_node(node_id)
# Pass result.text to your LLM to generate the final answer

The dot-notation node IDs (1.2.3) give the LLM a natural sense of document structure: it can see that 2.3 is a subsection of 2 without any extra explanation. This matters for accuracy, as the Answer accuracy results under Benchmarks show.
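In practice an LLM does not always return a bare ID, so it may help to validate the reply before looking the node up. The sketch below is one possible guard, not part of pageindex-rs: `extract_node_id` is a hypothetical helper, and the list of valid IDs is hard-coded here where a real pipeline would presumably take it from `index.node_ids()` (assuming that method returns strings in the same dotted form).

```python
import re
from typing import Iterable, Optional

# A dotted ID is one or more digit groups separated by dots, e.g. "3" or "2.1.4".
NODE_ID = re.compile(r"\d+(?:\.\d+)*")

def extract_node_id(response: str, valid_ids: Iterable[str]) -> Optional[str]:
    """Return the first dotted ID in the reply that exists in the index, else None."""
    known = set(valid_ids)
    for candidate in NODE_ID.findall(response):
        if candidate in known:
            return candidate
    return None

# Hypothetical IDs matching the outline in the Usage section.
valid_ids = ["1", "2", "2.1", "2.2", "2.3", "3", "3.1", "3.2"]

print(extract_node_id("3.2", valid_ids))
print(extract_node_id("The best match is [2.3].", valid_ids))
print(extract_node_id("Section 7.4", valid_ids))
```

When the helper returns None, the caller can re-prompt or fall back to a broader section instead of passing an unknown ID to `get_node`.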

API

PageIndex

| Method | Description |
| --- | --- |
| PageIndex.from_markdown(doc_id, markdown) | Build from a markdown string |
| PageIndex.from_file(doc_id, path) | Build from a file path |
| index.title() | Document title (first H1) |
| index.outline() | Compact tree for LLM prompts |
| index.node_ids() | All node IDs in the tree |
| index.get_node(node_id) | Single node lookup |
| index.get_node_with_children(node_id) | Node with all descendant text merged |
| index.get_children(node_id) | Direct children as (node_id, title) pairs |
| index.to_json() | Full tree as JSON |
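The difference between "direct children" and "all descendant text merged" falls out of the dotted IDs themselves. The following is a rough pure-Python sketch of both operations over a plain dictionary. It is an illustration under assumed data structures, not how the Rust code is implemented, and the names `direct_children` and `merged_text` are hypothetical.

```python
def direct_children(nodes: dict[str, dict], node_id: str) -> list[tuple[str, str]]:
    """Children are IDs that extend node_id by exactly one dotted component."""
    prefix = node_id + "."
    return [
        (child_id, node["title"])
        for child_id, node in nodes.items()
        if child_id.startswith(prefix) and "." not in child_id[len(prefix):]
    ]

def merged_text(nodes: dict[str, dict], node_id: str) -> str:
    """Concatenate the node's own text with every descendant's, in document order."""
    prefix = node_id + "."
    parts = [nodes[node_id]["text"]]
    parts += [node["text"] for child_id, node in nodes.items()
              if child_id.startswith(prefix)]
    return "\n\n".join(part for part in parts if part)

# Made-up nodes, stored in document order (dicts preserve insertion order).
nodes = {
    "2": {"title": "Financial Results", "text": ""},
    "2.1": {"title": "Revenue", "text": "Revenue grew."},
    "2.1.1": {"title": "By Region", "text": "Europe led."},
    "2.2": {"title": "Expenses", "text": "Costs fell."},
    "3": {"title": "Risk Factors", "text": "Risks follow."},
}

print(direct_children(nodes, "2"))
print(merged_text(nodes, "2"))
```

Matching on the prefix `"2."` rather than `"2"` keeps an ID such as `20` from being mistaken for a descendant of `2`.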

NodeResult

| Attribute | Type | Description |
| --- | --- | --- |
| node_id | str | Dot-separated ID, e.g. "2.1" |
| title | str | Heading text |
| text | str | Body text of this node |
| depth | int | Heading level (1 = #, 2 = ##, etc.) |
| breadcrumb | list[str] | Path from root to this node |
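A breadcrumb can be derived from a dotted ID alone, because every prefix of the ID names an ancestor. The sketch below shows that derivation with hypothetical helpers (`ancestor_ids`, `breadcrumb`) and a hand-written title map. How the library actually stores breadcrumbs is not stated here.

```python
def ancestor_ids(node_id: str) -> list[str]:
    """Return every ID on the path from the top-level section down to node_id."""
    parts = node_id.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]

def breadcrumb(titles: dict[str, str], node_id: str) -> list[str]:
    """Look up the title of each ancestor, ending with the node itself."""
    return [titles[ancestor] for ancestor in ancestor_ids(node_id)]

# Titles taken from the outline in the Usage section.
titles = {"3": "Risk Factors", "3.1": "Market Risk", "3.2": "Regulatory Risk"}

print(ancestor_ids("3.2"))        # ['3', '3.2']
print(breadcrumb(titles, "3.2"))  # ['Risk Factors', 'Regulatory Risk']
```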

Benchmarks

Benchmarked against the original Python PageIndex library. 500 iterations per build test, 1000 random lookups per retrieval test. Run the full benchmark yourself: tests/pageindex_rs_benchmark.ipynb.

Index build speed

| Document size | Rust mean | Python mean | Speedup |
| --- | --- | --- | --- |
| 42 KB | 0.207 ms | 0.153 ms | 0.74x ❌ |
| 395 KB | 0.873 ms | 1.369 ms | 1.57x |
| 1055 KB | 2.549 ms | 4.278 ms | 1.68x |

Below ~200KB, PyO3 FFI overhead cancels out the parsing speedup. At realistic document sizes (several hundred KB and above) Rust pulls ahead. The more important number is consistency:

| Document size | Rust stdev | Python stdev | Rust p99 | Python p99 |
| --- | --- | --- | --- | --- |
| 42 KB | 0.835 ms | 0.014 ms | 1.335 ms | 0.206 ms |
| 395 KB | 0.060 ms | 0.053 ms | 1.129 ms | 1.511 ms |
| 1055 KB | 0.104 ms | 2.782 ms | 2.781 ms | 20.993 ms |

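At 1055 KB, Python's p99 is about 21 ms and its max is 42 ms, while Rust's p99 is 2.8 ms and its max is 3.7 ms. In a pipeline handling hundreds of documents, those Python spikes add up. The 42 KB row goes the other way: Rust's stdev (0.835 ms) and p99 (1.335 ms) are both higher than Python's at that size.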
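For readers who want to reproduce this kind of summary from their own timings, here is a small sketch of computing mean, stdev, p99, and max from a list of samples. The `summarize` helper and the nearest-rank percentile definition are assumptions (the benchmark notebook may compute p99 differently), and the two sample lists are synthetic, made up to show why a few spikes barely move the mean but dominate p99.

```python
import statistics

def summarize(samples_ms: list[float]) -> dict[str, float]:
    """Summarize latency samples in milliseconds; p99 uses the nearest-rank method."""
    ordered = sorted(samples_ms)
    # Nearest rank: ceil(0.99 * n), computed with integers to avoid float rounding.
    rank = (99 * len(ordered) + 99) // 100
    return {
        "mean": statistics.mean(ordered),
        "stdev": statistics.stdev(ordered),
        "p99": ordered[rank - 1],
        "max": ordered[-1],
    }

# Synthetic data, not measurements: 500 steady runs vs. 500 runs with 10 spikes.
steady = summarize([1.0] * 500)
spiky = summarize([1.0] * 490 + [20.0] * 10)

for name, stats in (("steady", steady), ("spiky", spiky)):
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

In the spiky series only 2% of runs are slow, yet p99 lands on a slow run while the mean stays close to the typical value.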

Node retrieval speed

Rust stores nodes in a HashMap, so a lookup is O(1) on average. Python does a linear scan, which is O(n) in the number of nodes, so performance degrades as the tree grows.

| Document size | Nodes | Rust mean | Python mean | Speedup |
| --- | --- | --- | --- | --- |
| 42 KB | 28 | 0.0072 ms | 0.0060 ms | 0.83x |
| 395 KB | 261 | 0.0119 ms | 0.0272 ms | 2.29x |
| 1055 KB | 765 | 0.0216 ms | 0.0686 ms | 3.18x |

The gap widens with tree size: 0.83x at 28 nodes, 2.29x at 261 nodes, and 3.18x at 765 nodes. For large technical manuals or combined document corpora this becomes meaningful.
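The hash-map versus linear-scan difference can be felt in plain Python. The sketch below compares a list scan with a dict lookup over 765 made-up node IDs. It does not benchmark either library: both sides run in the Python interpreter, and all names in it are illustrative.

```python
import timeit

# 51 sections x 15 subsections = 765 made-up IDs, the node count of the largest test doc.
node_ids = [f"{i}.{j}" for i in range(1, 52) for j in range(1, 16)]
nodes_list = [(node_id, f"Section {node_id}") for node_id in node_ids]
nodes_dict = dict(nodes_list)

def scan_lookup(node_id):
    """Linear scan: compares against every entry until one matches, O(n)."""
    for candidate, title in nodes_list:
        if candidate == node_id:
            return title
    return None

def hash_lookup(node_id):
    """Hash lookup: O(1) on average, regardless of how many nodes exist."""
    return nodes_dict.get(node_id)

target = node_ids[-1]  # worst case for the scan: the last node
scan_s = timeit.timeit(lambda: scan_lookup(target), number=2000)
hash_s = timeit.timeit(lambda: hash_lookup(target), number=2000)
print(f"scan: {scan_s * 1000:.2f} ms, dict: {hash_s * 1000:.2f} ms for 2000 lookups")
```

Timings depend on the machine, so none are quoted here. The scan's cost grows with the number of nodes while the dict's stays roughly flat.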

Answer accuracy

Tested on 10 financial questions against a ~3MB document corpus:

| Library | Correct |
| --- | --- |
| pageindex-rs | 9 / 10 |
| PageIndex (Python) | 7 / 10 |

The accuracy difference comes down to node IDs. An ID like 1.2.3 is self-explanatory to an LLM because it signals hierarchy directly. A flat ID like 0012, the style the original library uses, is just a number with no structural meaning, so the LLM occasionally picks the wrong node.

Roadmap

  • PDF support (the big one)
  • Cross-document retrieval across a corpus

Credits

Original PageIndex library: https://github.com/VectifyAI/PageIndex?tab=readme-ov-file

Medium article which inspired this: https://agentnativedev.medium.com/vectorless-rag-for-agents-pageindex-is-why-their-demo-works-and-yours-needs-context-6a9219dcc20e

License

MIT
