
Local-only Python SDK mirroring PageIndex Cloud API — vectorless RAG, no cloud required.


local-pageindex


A local-only Python SDK that mirrors the PageIndex Cloud API as importable Python methods — no cloud account, no API key, no external service.

All document processing, retrieval, and chat happen on your machine. The only external calls are to your configured OpenAI-compatible LLM endpoint (for indexing, retrieval reasoning, and chat).


Why this exists

PageIndex is an open-source vectorless RAG system that builds hierarchical tree indexes from documents and uses LLM reasoning to navigate them. It reports 98.7% accuracy on FinanceBench without a vector database.

This SDK wraps that library with:

  • Local storage — all indexes, trees, and retrieval results saved under storage_path
  • Full API parity — local equivalents for every PageIndex Cloud REST endpoint
  • Streaming chat — Python generator interface
  • Tenant/workspace/workflow isolation — multi-tenant scoping and boundary enforcement
  • Batch ingestion — folder and file-list ingestion with success/failure summary
  • Citation support — every retrieval and chat response includes source references

Comparison with PageIndex Cloud

| Feature | PageIndex Cloud | local-pageindex |
| --- | --- | --- |
| Document tree building | Cloud-hosted | Local, via open-source pageindex library |
| Storage | PageIndex servers | Your storage_path directory |
| Authentication | PAGEINDEX_API_KEY | Your OpenAI-compatible api_key |
| PDF text extraction | Cloud OCR | PyMuPDF / PyPDF2 |
| Retrieval | Cloud-hosted vectorless RAG | Local LLM tree navigation |
| Chat | Cloud API | Local LLM with retrieved context |
| Streaming | SSE over HTTPS | Python generator |
| Multi-tenant isolation | Managed | Metadata-based filtering |
| Cost | PageIndex pricing | LLM API calls only |

Installation

pip install local-pageindex

For better PDF extraction accuracy, also install PyMuPDF:

pip install "local-pageindex[pdf]"

Important: PDF and Markdown tree building requires the open-source pageindex library, which is not on PyPI. Install it separately:

pip install git+https://github.com/VectifyAI/PageIndex.git

Without pageindex, text file ingestion and all retrieval/chat methods work normally. PDF and Markdown ingestion will raise LLMProviderError at runtime.
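Because PDF and Markdown ingestion depend on a separately installed library, it can help to check for it up front instead of hitting an error mid-ingestion. The following is a minimal sketch using only the standard library; `module_available` is a hypothetical helper name, and the import name `pageindex` is assumed from the text above.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module called `name` can be imported."""
    return importlib.util.find_spec(name) is not None


# Assumption: the open-source library is importable as "pageindex".
if not module_available("pageindex"):
    print("pageindex not installed: PDF and Markdown ingestion will fail")
```

A check like this could run once at application start-up so that missing dependencies are reported before any documents are queued.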


Quick start

import os
from local_pageindex import LocalPageIndexClient

client = LocalPageIndexClient(
    storage_path="./my_index",
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ.get("OPENAI_BASE_URL"),  # optional
    model="gpt-4.1",
    reasoning_model="gpt-5.1",
)

# Ingest a document
result = client.ingest_document("report.pdf")
doc_id = result["doc_id"]

# Ask a question
answer = client.ask("What are the key findings?", doc_id=doc_id)
print(answer)

Document ingestion

PDF

result = client.ingest_document(
    "annual_report.pdf",
    document_id="report-2024",         # optional; UUID generated if omitted
    metadata={
        "tenant_id": "acme",
        "workspace_id": "ws-finance",
        "source_type": "workspace",
    },
)
# {"doc_id": "report-2024", "status": "completed", "retrieval_ready": True}

Plain text

result = client.ingest_document("notes.txt", document_id="notes-001")

Markdown

result = client.ingest_markdown(
    "README.md",
    options={
        "add_node_summary": True,
        "add_doc_description": True,
        # PageIndex-style strings also accepted: "if_add_node_summary": "yes"
    },
)
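The comment above mentions that PageIndex-style string options are also accepted. As an illustration of how the two spellings relate, here is a small sketch that converts boolean options into the string form. `to_pageindex_style` is a hypothetical helper, not part of the SDK, and the `"no"` value for `False` is an assumption extrapolated from the single `"yes"` example.

```python
def to_pageindex_style(options):
    """Convert {"add_node_summary": True} into {"if_add_node_summary": "yes"}.

    Assumption: every boolean option maps to an "if_"-prefixed key whose
    value is "yes" or "no".
    """
    return {f"if_{key}": "yes" if value else "no" for key, value in options.items()}


converted = to_pageindex_style({"add_node_summary": True, "add_doc_description": False})
print(converted)
```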

Folder ingestion

summary = client.ingest_folder(
    "./documents/",
    metadata_defaults={"tenant_id": "acme", "workspace_id": "ws-1"},
    recursive=True,
)
# {"succeeded": [...], "failed": [...], "skipped": [...], "success_count": N}

Supported types: .pdf, .md, .markdown, .txt, .text. Unsupported types appear in skipped — no exception is raised.
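To see in advance which files a folder run would pick up, the extension list above can be applied with `pathlib`. This is a sketch only: `preview_folder` is a hypothetical helper, and the case-insensitive match is an assumption, since the README does not say how the SDK treats upper-case extensions.

```python
from pathlib import Path

# Extensions listed in this README as supported by ingest_folder().
SUPPORTED_EXTENSIONS = {".pdf", ".md", ".markdown", ".txt", ".text"}


def preview_folder(paths):
    """Split file paths into (supported, skipped) name lists by extension."""
    supported, skipped = [], []
    for path in map(Path, paths):
        # Assumption: extensions are compared case-insensitively.
        target = supported if path.suffix.lower() in SUPPORTED_EXTENSIONS else skipped
        target.append(path.name)
    return supported, skipped


supported, skipped = preview_folder(["a.pdf", "notes.TXT", "image.png", "README.md"])
print("supported:", supported)
print("skipped:", skipped)
```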


Get document data

# Hierarchical tree (sections / subsections)
tree = client.get_document_tree("report-2024")
tree_with_summaries = client.get_document_tree("report-2024", summary=True)

# Extracted text — page format  [{"page_index": 1, "text": "..."}]
client.get_document_ocr("report-2024", format="page")

# Extracted text — node format  [{"node_id": "0001", "title": "...", "text": "..."}]
client.get_document_ocr("report-2024", format="node")

# Extracted text — raw string
client.get_document_ocr("report-2024", format="raw")

# Metadata  {"id", "name", "description", "status", "createdAt", "pageNum"}
client.get_document_metadata("report-2024")

# List  {"documents": [...], "total": N, "limit": 50, "offset": 0}
client.list_documents(limit=20, offset=0)

# Delete
client.delete_document("report-2024")
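Since list_documents() is paginated with limit and offset, collecting every document means walking the pages. The sketch below takes any callable with that signature and relies only on the "documents" and "total" keys shown above; `iter_documents` and `fake_list_documents` are hypothetical names, and the stand-in lister exists only so the example runs without the SDK.

```python
def iter_documents(list_page, page_size=20):
    """Yield every document by requesting successive limit/offset pages."""
    offset = 0
    while True:
        page = list_page(limit=page_size, offset=offset)
        docs = page["documents"]
        yield from docs
        offset += len(docs)
        # Stop on an empty page or once the reported total has been reached.
        if not docs or offset >= page["total"]:
            break


def fake_list_documents(limit=50, offset=0):
    """Stand-in for client.list_documents, returning the documented shape."""
    ids = [f"doc-{i}" for i in range(45)]
    window = ids[offset:offset + limit]
    return {"documents": [{"id": d} for d in window], "total": len(ids),
            "limit": limit, "offset": offset}


all_docs = list(iter_documents(fake_list_documents))
print(len(all_docs))
```

With a real client, the same generator would be called as `iter_documents(client.list_documents)`.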

Retrieval

result = client.retrieve(
    document_id="report-2024",
    query="What are the key risk factors?",
    thinking=False,           # True: use reasoning_model for deeper analysis
    max_results=5,
    max_context_tokens=4000,
)
# {
#   "retrieval_id": "...",
#   "doc_id": "report-2024",
#   "status": "completed",
#   "query": "...",
#   "retrieved_nodes": [
#     {
#       "title": "Risk Factors",
#       "node_id": "0005",
#       "relevant_contents": [
#         {"page_index": 12, "relevant_content": "The primary risk..."}
#       ]
#     }
#   ]
# }
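The nested result shown above is often easier to work with as flat rows, for example when building a prompt or a table of evidence. A minimal sketch, assuming only the keys that appear in the example result; `flatten_retrieval` is a hypothetical helper.

```python
def flatten_retrieval(result):
    """Return (node title, page index, text) tuples from a retrieval result."""
    rows = []
    for node in result.get("retrieved_nodes", []):
        for item in node.get("relevant_contents", []):
            rows.append((node["title"], item["page_index"], item["relevant_content"]))
    return rows


# Sample data copied from the result shape documented above.
sample = {
    "doc_id": "report-2024",
    "retrieved_nodes": [
        {
            "title": "Risk Factors",
            "node_id": "0005",
            "relevant_contents": [
                {"page_index": 12, "relevant_content": "The primary risk..."}
            ],
        }
    ],
}
rows = flatten_retrieval(sample)
for title, page, text in rows:
    print(f"{title} (p. {page}): {text}")
```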

Task-style retrieval (PageIndex Legacy API compatibility):

task = client.create_retrieval_task("report-2024", "What are the risks?")
result = client.get_retrieval_result(task["retrieval_id"])

Chat

result = client.chat_completion(
    messages=[{"role": "user", "content": "Summarise the key findings."}],
    doc_id="report-2024",
    enable_citations=True,
)
# {
#   "id": "chatcmpl-...",
#   "choices": [{"message": {"role": "assistant", "content": "..."}, "finish_reason": "end_turn"}],
#   "usage": {"prompt_tokens": N, "completion_tokens": N, "total_tokens": N},
#   "citations": [{"document_id": "...", "section_title": "...", "page_number": N, ...}]
# }

# Multi-document
result = client.chat_completion(
    messages=[{"role": "user", "content": "Compare the two reports."}],
    document_ids=["report-2024", "report-2023"],
)

# Convenience method — returns plain string
answer = client.ask("What is the revenue?", doc_id="report-2024")
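To display an answer together with its sources, the response shape shown in the comments above can be turned into plain text. This sketch reads only the documented keys (choices, citations, document_id, section_title, page_number); `format_answer` is a hypothetical helper and the output layout is an arbitrary choice.

```python
def format_answer(response):
    """Render the assistant message followed by a numbered citation list."""
    lines = [response["choices"][0]["message"]["content"]]
    for number, citation in enumerate(response.get("citations", []), start=1):
        lines.append(
            f"[{number}] {citation.get('document_id', '?')}, "
            f"{citation.get('section_title', '?')}, "
            f"p. {citation.get('page_number', '?')}"
        )
    return "\n".join(lines)


# Sample response following the documented shape; the values are made up.
sample_response = {
    "choices": [{"message": {"role": "assistant", "content": "Revenue grew."}}],
    "citations": [
        {"document_id": "report-2024", "section_title": "Financial Highlights",
         "page_number": 3}
    ],
}
formatted = format_answer(sample_response)
print(formatted)
```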

Streaming chat

for chunk in client.stream_chat(
    messages=[{"role": "user", "content": "Summarise the findings."}],
    doc_id="report-2024",
):
    if chunk["type"] == "content":
        print(chunk["choices"][0]["delta"]["content"], end="", flush=True)
    elif chunk["type"] == "done":
        print()
        print("Citations:", chunk.get("citations", []))

Chunk types emitted:

| Type | Description |
| --- | --- |
| text_block_start | Stream begins |
| content | Text delta in choices[0].delta.content |
| text_stop | Text stream complete |
| done | Final chunk; includes citations list |
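When the full text is needed rather than incremental printing, the chunk types above can be folded into a single string plus the final citations. A minimal sketch: `collect_stream` is a hypothetical helper, and `fake_stream` only imitates the documented chunk shapes so the example runs without an LLM endpoint.

```python
def collect_stream(chunks):
    """Join all content deltas and return (text, citations)."""
    parts, citations = [], []
    for chunk in chunks:
        if chunk["type"] == "content":
            parts.append(chunk["choices"][0]["delta"]["content"])
        elif chunk["type"] == "done":
            citations = chunk.get("citations", [])
        # text_block_start and text_stop carry no text and are ignored.
    return "".join(parts), citations


def fake_stream():
    """Stand-in generator that emits the four documented chunk types."""
    yield {"type": "text_block_start"}
    yield {"type": "content", "choices": [{"delta": {"content": "Hello, "}}]}
    yield {"type": "content", "choices": [{"delta": {"content": "world."}}]}
    yield {"type": "text_stop"}
    yield {"type": "done", "citations": [{"document_id": "report-2024"}]}


text, citations = collect_stream(fake_stream())
print(text)
print(citations)
```

With a real client, the argument would be the generator returned by `client.stream_chat(...)`.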

Metadata filtering

Every document stores arbitrary metadata for later filtering:

client.ingest_document(
    "contract.pdf",
    metadata={
        "tenant_id": "acme",
        "workspace_id": "ws-legal",
        "workflow_id": "wf-onboarding",
        "user_id": "user-42",
        "source_type": "workflow_upload",   # "workspace" | "workflow_upload"
    },
)

docs = client.list_documents(filters={"workspace_id": "ws-legal"})

Workspace and workflow isolation

search() and scoped_chat() enforce strict isolation boundaries:

results = client.search(
    query="What are the obligations?",
    tenant_id="acme",
    workspace_id="ws-legal",
    include_workspace_context=True,
    include_uploaded_documents=True,
)

response = client.scoped_chat(
    messages=[{"role": "user", "content": "Analyse this contract."}],
    tenant_id="acme",
    workspace_id="ws-legal",
    workflow_id="wf-onboarding",
    include_workspace_context=True,
    include_uploaded_documents=True,
)

Boundary guarantees:

  • Never searches across tenants
  • Never searches across workspaces
  • Workflow-uploaded documents are only visible within that workflow
  • include_workspace_context=False → workspace docs excluded
  • include_uploaded_documents=False → workflow-uploaded docs excluded
  • Both False → empty context, no LLM call
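The guarantees above can be read as a predicate over each document's metadata. The sketch below is one possible interpretation, not the SDK's actual implementation: `in_scope` is a hypothetical function, documents whose source_type is not "workflow_upload" are assumed to be workspace documents, and workflow uploads are assumed to be hidden when no workflow_id is given.

```python
def in_scope(meta, tenant_id, workspace_id, workflow_id=None,
             include_workspace_context=True, include_uploaded_documents=True):
    """Return True if a document with metadata `meta` is visible in this scope."""
    # Never cross tenant or workspace boundaries.
    if meta.get("tenant_id") != tenant_id or meta.get("workspace_id") != workspace_id:
        return False
    if meta.get("source_type") == "workflow_upload":
        # Workflow uploads are visible only inside their own workflow.
        return (include_uploaded_documents
                and workflow_id is not None
                and meta.get("workflow_id") == workflow_id)
    # Assumption: everything else counts as a workspace document.
    return include_workspace_context


docs = [
    {"id": "a", "tenant_id": "acme", "workspace_id": "ws-legal", "source_type": "workspace"},
    {"id": "b", "tenant_id": "acme", "workspace_id": "ws-legal",
     "source_type": "workflow_upload", "workflow_id": "wf-onboarding"},
    {"id": "c", "tenant_id": "acme", "workspace_id": "ws-legal",
     "source_type": "workflow_upload", "workflow_id": "wf-other"},
    {"id": "d", "tenant_id": "other", "workspace_id": "ws-legal", "source_type": "workspace"},
]
visible = [d["id"] for d in docs if in_scope(d, "acme", "ws-legal", "wf-onboarding")]
print(visible)
```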

Data locality

All data is stored under your storage_path:

storage_path/
  documents/
    {document_id}/
      metadata.json
      tree.json
      extracted_text.json
      pages.json
      nodes.json
      retrievals/
      chats/
  manifests/
    manifest.json
  tasks/
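Because everything lives in ordinary files, the layout above can be inspected without the SDK. The sketch below lists the document IDs found on disk; `stored_document_ids` is a hypothetical helper, and it assumes that each document directory is named after its ID and contains a metadata.json, as the tree suggests. The demo builds a throwaway directory rather than touching a real index.

```python
import json
import tempfile
from pathlib import Path


def stored_document_ids(storage_path):
    """Return sorted IDs of documents that have a metadata.json on disk."""
    docs_dir = Path(storage_path) / "documents"
    if not docs_dir.is_dir():
        return []
    return sorted(p.parent.name for p in docs_dir.glob("*/metadata.json"))


# Demo against a temporary directory that imitates the layout above.
with tempfile.TemporaryDirectory() as tmp:
    for doc_id in ("report-2024", "notes-001"):
        doc_dir = Path(tmp) / "documents" / doc_id
        doc_dir.mkdir(parents=True)
        (doc_dir / "metadata.json").write_text(json.dumps({"id": doc_id}))
    ids = stored_document_ids(tmp)

print(ids)
```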

No data is sent to PageIndex Cloud. The only external calls are to your configured LLM endpoint for:

  • Tree building and node summarisation (during ingestion)
  • Tree navigation and answer generation (during retrieval)
  • Chat responses

API parity table

| Cloud Endpoint | Local SDK Method | Supported |
| --- | --- | --- |
| POST /doc/ | ingest_document() / ingest_pdf() | Yes |
| GET /doc/{id}/?type=tree | get_document_tree() | Yes |
| GET /doc/{id}/?type=tree&summary=true | get_document_tree(summary=True) | Yes |
| GET /doc/{id}/?type=ocr&format=page | get_document_ocr(format="page") | Yes |
| GET /doc/{id}/?type=ocr&format=node | get_document_ocr(format="node") | Yes |
| GET /doc/{id}/?type=ocr&format=raw | get_document_ocr(format="raw") | Yes |
| GET /doc/{id}/metadata | get_document_metadata() | Yes |
| GET /docs?limit=&offset= | list_documents(limit, offset) | Yes |
| DELETE /doc/{id}/ | delete_document() | Yes |
| POST /markdown/ | ingest_markdown() / convert_markdown_to_tree() | Yes |
| POST /chat/completions | chat_completion() / chat() / ask() | Yes |
| POST /chat/completions (stream) | stream_chat() | Yes (Python generator) |
| POST /retrieval/ | retrieve() / create_retrieval_task() | Yes |
| GET /retrieval/{id}/ | get_retrieval_result() | Yes |
| Folder ingestion | ingest_folder() | Local-only |
| Batch ingestion | ingest_documents() / batch_ingest() | Local-only |
| Workspace isolation | search() / scoped_chat() | Local-only |

Known limitations

| Cloud Feature | Local Approximation |
| --- | --- |
| Enhanced cloud OCR | PyMuPDF / PyPDF2 text extraction (may be less accurate for scanned PDFs) |
| Hosted MCP tooling | Not implemented; the local SDK uses direct LLM calls |
| MCP streaming events | Approximated as text_block_start / content / done chunks |
| Async processing queue | Synchronous; create_retrieval_task() runs inline and stores the result |

Attribution

This package wraps the open-source PageIndex library by VectifyAI (MIT License). No PageIndex source code is incorporated — pageindex is used as a library dependency. See ATTRIBUTION.md for details.

local-pageindex is not affiliated with VectifyAI or PageIndex Cloud.


License

MIT
