
Truly open PageIndex implementation

pageindex-open

Truly open pageindex RAG package

This package was inspired by PageIndex. I took the concepts it outlines and wrote my own implementation, because I was not satisfied with the original package: its examples focus on the SaaS offering.

This package converts your PDFs into a tree of sections, then decides which sections are most relevant to the query and uses them to answer. This contrasts with chunking, where relevance is determined by comparing embedding similarity.
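To make the tree idea concrete, here is a minimal sketch of nesting Markdown headings into a tree. It is an illustration only, not the package's actual implementation: the `Node` class, `build_tree`, `outline`, and the sample text are all hypothetical names and data introduced for this example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """One heading in the document plus the text directly under it (hypothetical)."""
    title: str
    level: int
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def build_tree(markdown: str) -> Node:
    """Nest Markdown headings into a tree using a stack of open sections."""
    root = Node(title="root", level=0)
    stack = [root]
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            node = Node(title=match.group(2).strip(), level=len(match.group(1)))
            # Close sections at the same or a deeper level before attaching.
            while stack[-1].level >= node.level:
                stack.pop()
            stack[-1].children.append(node)
            stack.append(node)
        elif line.strip():
            # Body text belongs to the most recently opened section.
            stack[-1].text += line.strip() + "\n"
    return root


def outline(node: Node, depth: int = 0) -> list[str]:
    """Return indented section titles, the kind of summary a router could read."""
    lines = [] if node.level == 0 else ["  " * (depth - 1) + node.title]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


# Made-up sample document for illustration.
SAMPLE = """# Annual Report
## Financial Review
Revenue grew during the year.
### Liquidity
Cash reserves remained stable.
## Governance
The board met quarterly.
"""

tree = build_tree(SAMPLE)
print("\n".join(outline(tree)))
```

An outline like this is small enough to hand to a model so that it can pick sections by title before any section text is loaded.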

Why?

  • 🧠 Reasoning-backed: AI routes and answers using structured context, not just similarity.
  • Contrast to RAG: Traditional RAG retrieves chunks by embedding similarity alone; here, relevance is decided along the document hierarchy.
  • 🌳 Tree-structured: Sections, subsections, and headings preserved: your document is understood, not just searched.
  • 🔢 Top-K retrieval: Combine multiple relevant sections for richer answers, avoiding “partial context” problems.
  • ✂️ Text-on-demand: Only the node text is used, no bloated storage or duplication.
  • 💾 Persistent cache: Markdown + tree saved separately: queries can be re-run without touching the PDF.
  • 📄 Markdown source: Human-readable, diffable, and editable — not a black-box blob of vectors.
  • 🔄 Reusable & update-friendly: Swap LLMs, add PDFs, or refresh sections without breaking the index.
  • 🛠 Clean Python API: build_index(), query(), load_index() — intuitive for devs.
  • 😍 Production-ready design: Modular, maintainable, and scalable for large document QA workflows.
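The top-K point can be sketched as follows. This is an assumption-laden illustration: the package routes with an LLM, whereas this sketch swaps in a plain keyword-overlap score so it runs without an API key. `SECTIONS`, `keyword_score`, `select_top_k`, and `build_context` are hypothetical names, and the section texts are made up.

```python
import re

# Hypothetical flattened tree: section title -> section text.
SECTIONS = {
    "Financial Review": "Revenue grew and financial stability improved.",
    "Liquidity": "Cash reserves support financial stability in downturns.",
    "Governance": "The board met quarterly to review strategy.",
}


def tokens(text: str) -> set[str]:
    """Lowercase word tokens of a string."""
    return set(re.findall(r"\w+", text.lower()))


def keyword_score(query: str, title: str, text: str) -> int:
    """Stand-in relevance score (NOT the LLM routing): shared word count."""
    return len(tokens(query) & tokens(f"{title} {text}"))


def select_top_k(query: str, sections: dict[str, str], top_k: int = 2) -> list[str]:
    """Return the titles of the top_k highest-scoring sections."""
    ranked = sorted(
        sections.items(),
        key=lambda item: keyword_score(query, item[0], item[1]),
        reverse=True,
    )
    return [title for title, _ in ranked[:top_k]]


def build_context(titles: list[str], sections: dict[str, str]) -> str:
    """Join the chosen sections into one context string for the answer step."""
    return "\n\n".join(f"## {title}\n{sections[title]}" for title in titles)


chosen = select_top_k("what about financial stability?", SECTIONS, top_k=2)
context = build_context(chosen, SECTIONS)
print(context)
```

Combining two sections rather than one means the answer step sees both the review and the liquidity text, which is the "partial context" problem the bullet refers to.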

Quickstart

For a single document:

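# Set your API key first, e.g.: export GEMINI_API_KEY=AI...
# (the package uses litellm under the hood)
from pageindex_open import PIO

PDF_FILE = "/path/to/file/2023-annual-report-truncated.pdf"
QUERY = "what about financial stability?"

pio = PIO(PDF_FILE)
pio.build_index()  # build the Markdown + tree index from the PDF

# top_k=2 combines the two most relevant sections for the answer
answer = pio.query(QUERY, top_k=2)
print(answer)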

Application

This works for structured documents, so it applies to sectors like finance and legal.

API

Specify the model and LLM client

pio = PIO(PDF_FILE, model_name="modelprovider/model-name", llm_client=litellm_client_if_any)

Load index

from pageindex_open import PIO

PDF_FILE = "/path/to/file/2023-annual-report-truncated.pdf"
QUERY = "what about financial stability?"

pio = PIO(PDF_FILE)
# Reuse the files that were created earlier by build_index()
pio.load_index("/path/mdfile.md", "/path/file.tree.json")

# Query the loaded index without rebuilding it
answer = pio.query(QUERY, top_k=2)
print(answer)
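Why a separately saved Markdown file and tree file allow re-running queries without the PDF can be shown with a small round trip. This is only a sketch under stated assumptions: the package's real on-disk format is not documented here, so the JSON layout, the `save_index` and `load_saved_index` helpers, and the sample data are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_index(markdown: str, tree: dict, directory: Path, stem: str) -> tuple[Path, Path]:
    """Write the Markdown and the tree to two separate files (hypothetical layout)."""
    md_path = directory / f"{stem}.md"
    tree_path = directory / f"{stem}.tree.json"
    md_path.write_text(markdown, encoding="utf-8")
    tree_path.write_text(json.dumps(tree, indent=2), encoding="utf-8")
    return md_path, tree_path


def load_saved_index(md_path: Path, tree_path: Path) -> tuple[str, dict]:
    """Read both files back; the original PDF is not needed at this point."""
    markdown = md_path.read_text(encoding="utf-8")
    tree = json.loads(tree_path.read_text(encoding="utf-8"))
    return markdown, tree


# Made-up sample data for illustration.
markdown = "# Annual Report\n## Liquidity\nCash reserves remained stable.\n"
tree = {
    "title": "Annual Report",
    "children": [{"title": "Liquidity", "children": []}],
}

with tempfile.TemporaryDirectory() as tmp:
    md_path, tree_path = save_index(markdown, tree, Path(tmp), "report")
    loaded_markdown, loaded_tree = load_saved_index(md_path, tree_path)

print(loaded_tree["children"][0]["title"])
```

Keeping the Markdown as a plain file also means it can be read, diffed, or edited by hand between runs.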

Roadmap

  • Multi-document
  • Document processing backend
  • Save config options
  • Add chat with docs feature
