Truly open PageIndex implementation

pageindex-open

Truly open pageindex RAG package

This package was inspired by PageIndex. I took the concepts it outlines and wrote my own implementation, because I was not satisfied with the original package: its examples focus on the SaaS offering.

This package converts your PDFs into a tree of sections, then selects the most relevant section and uses it to answer the query. This contrasts with chunking, where relevance is decided by comparing embedding similarity.
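To make the tree idea concrete, here is a minimal sketch. It is not the package's actual implementation. It assumes the PDF has already been converted to Markdown, nests the `#` headings into a tree, and picks a section with a simple word-overlap score that stands in for the LLM's routing decision. The names `Node`, `build_tree`, `walk`, and `route` are hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    title: str
    level: int
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def build_tree(markdown: str) -> Node:
    """Nest Markdown headings into a tree; body lines attach to the latest heading."""
    root = Node(title="root", level=0)
    stack = [root]
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            node = Node(title=match.group(2).strip(), level=len(match.group(1)))
            # Pop until the top of the stack is a shallower heading (the parent).
            while stack[-1].level >= node.level:
                stack.pop()
            stack[-1].children.append(node)
            stack.append(node)
        else:
            stack[-1].text += line + "\n"
    return root


def walk(node: Node):
    """Yield every section below `node`, depth first."""
    for child in node.children:
        yield child
        yield from walk(child)


def words(s: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def route(root: Node, query: str) -> Node:
    """Pick the section sharing the most words with the query.

    Assumption: a real system would ask an LLM to choose; word overlap
    is only a stand-in so the sketch runs offline.
    """
    q = words(query)
    return max(walk(root), key=lambda n: len(q & words(n.title + " " + n.text)))


SAMPLE = """# Annual Report
Intro text.
## Financial Stability
Capital ratios remained strong and stability improved.
## Governance
Board composition changed.
"""

root = build_tree(SAMPLE)
print(route(root, "what about financial stability?").title)  # Financial Stability
```

Note that `route` raises `ValueError` on a document with no headings, since there is nothing to choose from.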

Why?

  • 🧠 Reasoning-backed: AI routes and answers using structured context, not just similarity.
  • Contrast to chunk-based RAG: Traditional RAG retrieves chunks by embedding similarity alone: here, relevance is decided over the document's hierarchy.
  • 🌳 Tree-structured: Sections, subsections, and headings preserved: your document is understood, not just searched.
  • 🔢 Top-K retrieval: Combine multiple relevant sections for richer answers, avoiding “partial context” problems.
  • ✂️ Text-on-demand: Only the node text is used, no bloated storage or duplication.
  • 💾 Persistent cache: Markdown + tree saved separately: queries can be re-run without touching the PDF.
  • 📄 Markdown source: Human-readable, diffable, and editable: not a black-box blob of vectors.
  • 🔄 Reusable & update-friendly: Swap LLMs, add PDFs, or refresh sections without breaking the index.
  • 📦 Clean Python API: build_index(), query(), load_index(): intuitive for devs.
  • 💪 Production-ready design: Modular, maintainable, and scalable for large document QA workflows.
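The top-K and text-on-demand points above can be illustrated with a short sketch. This is an illustration under stated assumptions, not the package's code: each tree node is assumed to store only a line span into the cached Markdown, the relevance scores are assumed to come from an LLM (here they are passed in as a plain dict), and `node_text` and `top_k_context` are hypothetical names.

```python
import heapq

# Cached Markdown, split into lines (toy data for illustration).
MARKDOWN_LINES = [
    "# Risk",
    "Credit risk fell.",
    "# Liquidity",
    "Liquidity buffers grew.",
    "# Governance",
    "Board changes.",
]

# Nodes keep only a title and a [start, end) line span, never the text itself.
NODES = [
    {"title": "Risk", "start": 0, "end": 2},
    {"title": "Liquidity", "start": 2, "end": 4},
    {"title": "Governance", "start": 4, "end": 6},
]


def node_text(node: dict, lines: list[str]) -> str:
    """Slice a node's text out of the Markdown only when it is needed."""
    return "\n".join(lines[node["start"]:node["end"]])


def top_k_context(nodes: list[dict], scores: dict[str, float],
                  lines: list[str], k: int) -> str:
    """Join the text of the k best-scoring nodes, in document order."""
    best = heapq.nlargest(k, nodes, key=lambda n: scores[n["title"]])
    best.sort(key=lambda n: n["start"])  # restore reading order
    return "\n\n".join(node_text(n, lines) for n in best)


# Hypothetical relevance scores, e.g. as returned by an LLM router.
SCORES = {"Risk": 0.7, "Liquidity": 0.9, "Governance": 0.1}
print(top_k_context(NODES, SCORES, MARKDOWN_LINES, k=2))
```

Sorting the selected nodes back into document order keeps the combined context readable when the chosen sections are neighbours.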

Quickstart

For a single document:

# Set your API key first: export GEMINI_API_KEY=AI...
# LLM calls go through litellm under the hood
from pageindex_open import PIO

PDF_FILE = "/path/to/file/2023-annual-report-truncated.pdf"
QUERY = "what about financial stability?"


pio = PIO(PDF_FILE)
pio.build_index() 

answer = pio.query(QUERY, top_k=2)
print(answer)

Application

This works for structured documents and applies to sectors such as finance and legal.

API

Specify more options

pio = PIO(PDF_FILE, model_name="modelprovider/model-name", llm_client=litellm_client_if_any)

Load index

from pageindex_open import PIO

PDF_FILE = "/path/to/file/2023-annual-report-truncated.pdf"
QUERY = "what about financial stability?"


pio = PIO(PDF_FILE)
# Reuse the files that an earlier build_index() call created
pio.load_index("/path/mdfile.md", "/path/file.tree.json")

# Query the cached index without re-processing the PDF
answer = pio.query(QUERY, top_k=2)
print(answer)
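When deciding between `build_index()` and `load_index()`, a small pre-check can help. The sketch below is a hypothetical helper, not part of the package: `cache_ready` simply confirms that both cache files exist and that the tree file parses as JSON (an assumption based on the `.tree.json` name). Where `build_index()` writes its files is not documented here, so the paths in the commented usage are placeholders.

```python
import json
from pathlib import Path


def cache_ready(md_path: str, tree_path: str) -> bool:
    """Return True if both cache files exist and the tree file is valid JSON."""
    md, tree = Path(md_path), Path(tree_path)
    if not (md.is_file() and tree.is_file()):
        return False
    try:
        json.loads(tree.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, UnicodeDecodeError):
        return False
    return True


# Hypothetical usage with the package (paths are placeholders):
# pio = PIO(PDF_FILE)
# if cache_ready("/path/mdfile.md", "/path/file.tree.json"):
#     pio.load_index("/path/mdfile.md", "/path/file.tree.json")
# else:
#     pio.build_index()
```

The check only validates that the tree file is well-formed JSON; it does not verify that the cache matches the current PDF.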

Roadmap

  • Multi-document
  • Document processing backend
  • Save config options
  • Add chat with docs feature
