Truly open PageIndex implementation
pageindex-open
Truly open pageindex RAG package
This package was inspired by PageIndex. I took the concepts it outlines and wrote my own implementation, because I was not satisfied with the original package: its examples focus on the SaaS side of things.
This package works by converting your PDFs into a tree; the most relevant section is then selected and used to answer the query. This contrasts with chunking, where relevance is decided by comparing embedding similarity.
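The tree-building step described above can be pictured with a short sketch. This is not the package's actual code: `Node`, `build_tree`, and the heading regex are hypothetical names, and the sketch assumes the PDF has already been converted to Markdown with `#`-style headings.

```python
import re
from dataclasses import dataclass, field

# Hypothetical node type: it stores line offsets into the Markdown, not the text.
@dataclass
class Node:
    title: str
    level: int
    start: int            # index of the heading line
    end: int = 0          # index one past the last line of the section
    children: list["Node"] = field(default_factory=list)

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def build_tree(markdown: str) -> Node:
    """Turn Markdown headings into a nested section tree."""
    lines = markdown.splitlines()
    root = Node(title="ROOT", level=0, start=0, end=len(lines))
    stack = [root]
    for i, line in enumerate(lines):
        match = HEADING.match(line)
        if not match:
            continue
        level = len(match.group(1))
        # A new heading closes every open section at the same or a deeper level.
        while stack[-1].level >= level:
            stack.pop().end = i
        node = Node(title=match.group(2).strip(), level=level, start=i)
        stack[-1].children.append(node)
        stack.append(node)
    # Sections still open at the end of the file run to the last line.
    while len(stack) > 1:
        stack.pop().end = len(lines)
    return root
```

The sketch treats any line starting with `#` as a heading, so it would misread such lines inside code blocks; a real implementation would need to handle that.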
Why?
- 🧠 Reasoning-backed: AI routes and answers using structured context, not just similarity.
- ⚡ Contrast to RAG: Traditional RAG retrieves random chunks by embedding similarity: here, relevance is hierarchical and precise.
- 🌳 Tree-structured: Sections, subsections, and headings preserved: your document is understood, not just searched.
- 🔢 Top-K retrieval: Combine multiple relevant sections for richer answers, avoiding “partial context” problems.
- ✂️ Text-on-demand: Only the node text is used, no bloated storage or duplication.
- 💾 Persistent cache: Markdown + tree saved separately: queries can be re-run without touching the PDF.
- 📄 Markdown source: Human-readable, diffable, and editable: not a black-box blob of vectors.
- 🔄 Reusable & update-friendly: Swap LLMs, add PDFs, or refresh sections without breaking the index.
- 📦 Clean Python API: build_index(), query(), load_index(): intuitive for devs.
- 💪 Production-ready design: Modular, maintainable, and scalable for large document QA workflows.
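To illustrate the top-K and text-on-demand points in the list above, here is a minimal sketch. It is an assumption about how such a step could work, not the package's implementation: the node dictionaries, `node_text`, `keyword_score`, and `top_k_context` are all hypothetical, and the keyword scorer is only a stand-in for the LLM that does the routing in the real package.

```python
def node_text(markdown_lines: list[str], node: dict) -> str:
    """Slice a node's text out of the Markdown only when it is needed."""
    return "\n".join(markdown_lines[node["start"]:node["end"]])

def keyword_score(query: str, node: dict) -> float:
    """Toy relevance score: fraction of query words found in the node title."""
    words = set(query.lower().split())
    title = set(node["title"].lower().split())
    return len(words & title) / len(words) if words else 0.0

def top_k_context(query, nodes, markdown_lines, k=2, score=keyword_score):
    """Pick the k best-scoring nodes and join their text in document order."""
    ranked = sorted(nodes, key=lambda n: score(query, n), reverse=True)
    chosen = sorted(ranked[:k], key=lambda n: n["start"])
    return "\n\n".join(node_text(markdown_lines, n) for n in chosen)
```

Because nodes hold only line offsets, the section text lives in one place (the Markdown) and is never duplicated inside the tree.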
Quickstart
For one document, the example is as follows:
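# Set your API key first (shell): export GEMINI_API_KEY=AI...
# The package uses litellm under the hood.
from pageindex_open import PIO

PDF_FILE = "/path/to/file/2023-annual-report-truncated.pdf"
QUERY = "what about financial stability?"

pio = PIO(PDF_FILE)
pio.build_index()  # convert the PDF to Markdown and build the section tree
answer = pio.query(QUERY, top_k=2)  # answer from the 2 most relevant sections
print(answer)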
Application
This works for structured documents and applies to sectors like finance and legal.
API
Specify the model and LLM client
pio = PIO(PDF_FILE, model_name="modelprovider/model-name", llm_client=litellm_client_if_any)
Load index
from pageindex_open import PIO

PDF_FILE = "/path/to/file/2023-annual-report-truncated.pdf"
QUERY = "what about financial stability?"

pio = PIO(PDF_FILE)
# Reuse the files that build_index() created earlier
pio.load_index("/path/mdfile.md", "/path/file.tree.json")
answer = pio.query(QUERY, top_k=2)
print(answer)
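The idea of keeping the Markdown and the tree in two separate files can be sketched as a simple round trip. The on-disk format that build_index() actually writes is not documented here, so everything below is an assumption: `save_cached`, `load_cached`, the file names, and the tree layout are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def save_cached(markdown: str, tree: dict, md_path: Path, tree_path: Path) -> None:
    """Write the Markdown and the tree to two separate files."""
    md_path.write_text(markdown, encoding="utf-8")
    tree_path.write_text(json.dumps(tree, indent=2), encoding="utf-8")

def load_cached(md_path: Path, tree_path: Path) -> tuple[str, dict]:
    """Read both files back; the PDF is not needed at this point."""
    markdown = md_path.read_text(encoding="utf-8")
    tree = json.loads(tree_path.read_text(encoding="utf-8"))
    return markdown, tree

# Hypothetical data: the tree stores titles and line offsets only.
markdown = "# Report\nintro\n## Risk\nrisk text"
tree = {"title": "Report", "start": 0, "end": 4,
        "children": [{"title": "Risk", "start": 2, "end": 4, "children": []}]}

with tempfile.TemporaryDirectory() as tmp:
    md_path = Path(tmp) / "report.md"
    tree_path = Path(tmp) / "report.tree.json"
    save_cached(markdown, tree, md_path, tree_path)
    loaded_markdown, loaded_tree = load_cached(md_path, tree_path)
```

Keeping the Markdown as a plain file means it can be read, diffed, and edited by hand, while the JSON tree stays small because it holds offsets rather than text.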
Roadmap
- Multi-document
- Document processing backend
- Save config options
- Add chat with docs feature