Local-only Python SDK mirroring PageIndex Cloud API — vectorless RAG, no cloud required.

local-pageindex

A local-only Python SDK that mirrors the PageIndex Cloud API as importable Python methods. It needs no PageIndex Cloud account, no PageIndex API key, and no PageIndex-hosted service.

All document processing, retrieval, and chat happen on your machine. The only external calls are to your configured OpenAI-compatible LLM endpoint (for indexing, retrieval reasoning, and chat).

Why this exists

PageIndex is an open-source vectorless RAG system that builds hierarchical tree indexes from documents and uses LLM reasoning to navigate them, reporting 98.7% accuracy on FinanceBench without a vector database.

This SDK wraps that library with:

Local storage — all indexes, trees, and retrieval results saved under storage_path
Full API parity — local equivalents for every PageIndex Cloud REST endpoint
Streaming chat — Python generator interface
Tenant/workspace/workflow isolation — multi-tenant scoping and boundary enforcement
Batch ingestion — folder and file-list ingestion with success/failure summary
Citation support — every retrieval and chat response includes source references

Comparison with PageIndex Cloud

| Feature | PageIndex Cloud | local-pageindex |
| --- | --- | --- |
| Document tree building | Cloud-hosted | Local, via open-source `pageindex` library |
| Storage | PageIndex servers | Your `storage_path` directory |
| Authentication | `PAGEINDEX_API_KEY` | Your OpenAI-compatible `api_key` |
| PDF text extraction | Cloud OCR | PyMuPDF / PyPDF2 |
| Retrieval | Cloud-hosted vectorless RAG | Local LLM tree navigation |
| Chat | Cloud API | Local LLM with retrieved context |
| Streaming | SSE over HTTPS | Python generator |
| Multi-tenant isolation | Managed | Metadata-based filtering |
| Cost | PageIndex pricing | LLM API calls only |

Installation

pip install local-pageindex

For better PDF extraction accuracy, also install PyMuPDF:

pip install "local-pageindex[pdf]"

Important: PDF and Markdown tree building requires the open-source pageindex library, which is not on PyPI. Install it separately:

pip install git+https://github.com/VectifyAI/PageIndex.git

Without pageindex, text file ingestion and all retrieval/chat methods work normally. PDF and Markdown ingestion will raise LLMProviderError at runtime.
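
Because `pageindex` is installed separately, it can help to check for it before attempting PDF or Markdown ingestion. The following is a minimal sketch using only the standard library; the helper name `pageindex_available` is hypothetical and not part of the SDK.

```python
import importlib.util


def pageindex_available() -> bool:
    """Return True if the optional `pageindex` package can be imported."""
    # find_spec returns None when a top-level module is not installed.
    return importlib.util.find_spec("pageindex") is not None


if pageindex_available():
    print("pageindex found: PDF and Markdown ingestion should work")
else:
    print("pageindex missing: only plain-text ingestion is available")
```

A check like this lets an application fall back to text-only ingestion instead of catching LLMProviderError at runtime.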

Quick start

import os
from local_pageindex import LocalPageIndexClient

client = LocalPageIndexClient(
    storage_path="./my_index",
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ.get("OPENAI_BASE_URL"),  # optional
    model="gpt-4.1",
    reasoning_model="gpt-5.1",
)

# Ingest a document
result = client.ingest_document("report.pdf")
doc_id = result["doc_id"]

# Ask a question
answer = client.ask("What are the key findings?", doc_id=doc_id)
print(answer)

Document ingestion

PDF

result = client.ingest_document(
    "annual_report.pdf",
    document_id="report-2024",         # optional; UUID generated if omitted
    metadata={
        "tenant_id": "acme",
        "workspace_id": "ws-finance",
        "source_type": "workspace",
    },
)
# {"doc_id": "report-2024", "status": "completed", "retrieval_ready": True}

Plain text

result = client.ingest_document("notes.txt", document_id="notes-001")

Markdown

result = client.ingest_markdown(
    "README.md",
    options={
        "add_node_summary": True,
        "add_doc_description": True,
        # PageIndex-style strings also accepted: "if_add_node_summary": "yes"
    },
)
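
The comment in the example above says PageIndex-style string options are accepted alongside booleans. One way such normalization could work is sketched below. The function `normalize_options`, the `if_` prefix handling, and the rule that only `"yes"` maps to `True` are all assumptions for illustration, not the SDK's actual code.

```python
def normalize_options(options: dict) -> dict:
    """Map PageIndex-style keys ("if_add_x": "yes") to boolean keys ("add_x": True)."""
    normalized = {}
    for key, value in options.items():
        # Assumed convention: PageIndex-style names carry an "if_" prefix.
        name = key[3:] if key.startswith("if_") else key
        if isinstance(value, str):
            # Assumed convention: "yes" (any case) means enabled.
            value = value.strip().lower() == "yes"
        normalized[name] = bool(value)
    return normalized


print(normalize_options({"if_add_node_summary": "yes", "add_doc_description": False}))
# {'add_node_summary': True, 'add_doc_description': False}
```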

Folder ingestion

summary = client.ingest_folder(
    "./documents/",
    metadata_defaults={"tenant_id": "acme", "workspace_id": "ws-1"},
    recursive=True,
)
# {"succeeded": [...], "failed": [...], "skipped": [...], "success_count": N}

Supported types: .pdf, .md, .markdown, .txt, .text. Unsupported types appear in skipped — no exception is raised.
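
To preview which files a folder ingestion would pick up, the supported extensions listed above can be applied directly. This is a sketch, not the SDK's implementation: `partition_by_type` is a hypothetical helper, and the case-insensitive match is an assumption.

```python
from pathlib import Path

# Extensions taken from the supported-types list above.
SUPPORTED_SUFFIXES = {".pdf", ".md", ".markdown", ".txt", ".text"}


def partition_by_type(paths):
    """Split paths into (supported, skipped) file names by extension."""
    supported, skipped = [], []
    for path in map(Path, paths):
        # Assumption: extensions are compared case-insensitively.
        if path.suffix.lower() in SUPPORTED_SUFFIXES:
            supported.append(path.name)
        else:
            skipped.append(path.name)
    return supported, skipped


supported, skipped = partition_by_type(["a.pdf", "notes.TXT", "image.png", "README.md"])
print(supported)  # ['a.pdf', 'notes.TXT', 'README.md']
print(skipped)    # ['image.png']
```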

Get document data

# Hierarchical tree (sections / subsections)
tree = client.get_document_tree("report-2024")
tree_with_summaries = client.get_document_tree("report-2024", summary=True)

# Extracted text — page format  [{"page_index": 1, "text": "..."}]
client.get_document_ocr("report-2024", format="page")

# Extracted text — node format  [{"node_id": "0001", "title": "...", "text": "..."}]
client.get_document_ocr("report-2024", format="node")

# Extracted text — raw string
client.get_document_ocr("report-2024", format="raw")

# Metadata  {"id", "name", "description", "status", "createdAt", "pageNum"}
client.get_document_metadata("report-2024")

# List  {"documents": [...], "total": N, "limit": 20, "offset": 0}
client.list_documents(limit=20, offset=0)

# Delete
client.delete_document("report-2024")
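
The `limit`, `offset`, and `total` fields in the list response are enough to walk every document page by page. A minimal sketch follows; `iter_documents` is a hypothetical helper, and `fake_list_documents` stands in for `client.list_documents` so the example runs without the SDK.

```python
def iter_documents(list_page, page_size=20):
    """Yield every document by walking limit/offset pages until `total` is reached."""
    offset = 0
    while True:
        page = list_page(limit=page_size, offset=offset)
        docs = page["documents"]
        yield from docs
        offset += len(docs)
        # Stop on an empty page or once all documents have been seen.
        if not docs or offset >= page["total"]:
            break


# Stand-in data source mimicking the documented response shape.
_FAKE = [{"id": f"doc-{i}"} for i in range(45)]


def fake_list_documents(limit=50, offset=0):
    return {"documents": _FAKE[offset:offset + limit], "total": len(_FAKE),
            "limit": limit, "offset": offset}


all_docs = list(iter_documents(fake_list_documents, page_size=20))
print(len(all_docs))  # 45
```

With a real client, `iter_documents(client.list_documents)` should behave the same way, assuming the response shape shown above.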

Retrieval

result = client.retrieve(
    document_id="report-2024",
    query="What are the key risk factors?",
    thinking=False,           # True: use reasoning_model for deeper analysis
    max_results=5,
    max_context_tokens=4000,
)
# {
#   "retrieval_id": "...",
#   "doc_id": "report-2024",
#   "status": "completed",
#   "query": "...",
#   "retrieved_nodes": [
#     {
#       "title": "Risk Factors",
#       "node_id": "0005",
#       "relevant_contents": [
#         {"page_index": 12, "relevant_content": "The primary risk..."}
#       ]
#     }
#   ]
# }

Task-style retrieval (PageIndex Legacy API compatibility):

task = client.create_retrieval_task("report-2024", "What are the risks?")
result = client.get_retrieval_result(task["retrieval_id"])
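
The nested retrieval result is often easier to work with once flattened into one row per passage. The sketch below assumes only the response shape shown above; `flatten_nodes` is a hypothetical helper, not an SDK method.

```python
def flatten_nodes(result: dict) -> list[tuple[str, int, str]]:
    """Return (section title, page index, text) for each relevant passage."""
    rows = []
    for node in result.get("retrieved_nodes", []):
        for item in node.get("relevant_contents", []):
            rows.append((node["title"], item["page_index"], item["relevant_content"]))
    return rows


# Sample copied from the response shape documented above.
sample = {
    "retrieved_nodes": [
        {
            "title": "Risk Factors",
            "node_id": "0005",
            "relevant_contents": [
                {"page_index": 12, "relevant_content": "The primary risk..."}
            ],
        }
    ]
}

for title, page, text in flatten_nodes(sample):
    print(f"[{title}, p.{page}] {text}")
# [Risk Factors, p.12] The primary risk...
```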

Chat

result = client.chat_completion(
    messages=[{"role": "user", "content": "Summarise the key findings."}],
    doc_id="report-2024",
    enable_citations=True,
)
# {
#   "id": "chatcmpl-...",
#   "choices": [{"message": {"role": "assistant", "content": "..."}, "finish_reason": "end_turn"}],
#   "usage": {"prompt_tokens": N, "completion_tokens": N, "total_tokens": N},
#   "citations": [{"document_id": "...", "section_title": "...", "page_number": N, ...}]
# }

# Multi-document
result = client.chat_completion(
    messages=[{"role": "user", "content": "Compare the two reports."}],
    document_ids=["report-2024", "report-2023"],
)

# Convenience method — returns plain string
answer = client.ask("What is the revenue?", doc_id="report-2024")
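
A chat response carries the answer under `choices` and the sources under `citations`. The sketch below renders both for display. `format_citations` is a hypothetical helper, and the sample values (section title, page number) are illustrative only; real responses may include further citation fields.

```python
def format_citations(response: dict) -> list[str]:
    """Render each citation as '[n] document: section (p. N)'."""
    lines = []
    for n, cite in enumerate(response.get("citations", []), start=1):
        doc = cite.get("document_id", "?")
        section = cite.get("section_title", "untitled")
        page = cite.get("page_number", "?")
        lines.append(f"[{n}] {doc}: {section} (p. {page})")
    return lines


# Illustrative response following the documented shape.
sample = {
    "choices": [{"message": {"role": "assistant", "content": "Revenue rose."}}],
    "citations": [
        {"document_id": "report-2024", "section_title": "Revenue", "page_number": 7}
    ],
}

print(sample["choices"][0]["message"]["content"])
print("\n".join(format_citations(sample)))
```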

Streaming chat

for chunk in client.stream_chat(
    messages=[{"role": "user", "content": "Summarise the findings."}],
    doc_id="report-2024",
):
    if chunk["type"] == "content":
        print(chunk["choices"][0]["delta"]["content"], end="", flush=True)
    elif chunk["type"] == "done":
        print()
        print("Citations:", chunk.get("citations", []))

Chunk types emitted:

| Type | Description |
| --- | --- |
| `text_block_start` | Stream begins |
| `content` | Text delta in `choices[0].delta.content` |
| `text_stop` | Text stream complete |
| `done` | Final chunk; includes `citations` list |
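
When the full answer is needed rather than incremental output, the chunks can be folded into a single string plus the final citations. This is a sketch under assumptions: `collect_stream` is a hypothetical helper, and `fake_stream` imitates `client.stream_chat` with chunks that carry only the fields documented above.

```python
def collect_stream(chunks):
    """Join all `content` deltas and return (text, citations)."""
    parts, citations = [], []
    for chunk in chunks:
        if chunk["type"] == "content":
            parts.append(chunk["choices"][0]["delta"]["content"])
        elif chunk["type"] == "done":
            citations = chunk.get("citations", [])
        # text_block_start and text_stop carry no text, so they are ignored.
    return "".join(parts), citations


def fake_stream():
    """Stand-in generator emitting the four documented chunk types."""
    yield {"type": "text_block_start"}
    yield {"type": "content", "choices": [{"delta": {"content": "Revenue "}}]}
    yield {"type": "content", "choices": [{"delta": {"content": "grew."}}]}
    yield {"type": "text_stop"}
    yield {"type": "done", "citations": [{"document_id": "report-2024"}]}


text, citations = collect_stream(fake_stream())
print(text)       # Revenue grew.
print(citations)  # [{'document_id': 'report-2024'}]
```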

Metadata filtering

Every document stores arbitrary metadata for later filtering:

client.ingest_document(
    "contract.pdf",
    metadata={
        "tenant_id": "acme",
        "workspace_id": "ws-legal",
        "workflow_id": "wf-onboarding",
        "user_id": "user-42",
        "source_type": "workflow_upload",   # "workspace" | "workflow_upload"
    },
)

docs = client.list_documents(filters={"workspace_id": "ws-legal"})
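
Conceptually, a filter such as `{"workspace_id": "ws-legal"}` keeps the documents whose metadata contains every requested key with an equal value. The sketch below illustrates that idea. Exact-match semantics, the `metadata` key on each record, and both function names are assumptions; the SDK's real matching rules may differ.

```python
def matches(metadata: dict, filters: dict) -> bool:
    """True when every filter key is present in metadata with an equal value."""
    return all(key in metadata and metadata[key] == value
               for key, value in filters.items())


def filter_documents(documents, filters=None):
    """Return the documents whose metadata satisfies all filters."""
    if not filters:
        return list(documents)
    return [doc for doc in documents if matches(doc.get("metadata", {}), filters)]


docs = [
    {"id": "contract", "metadata": {"tenant_id": "acme", "workspace_id": "ws-legal"}},
    {"id": "report", "metadata": {"tenant_id": "acme", "workspace_id": "ws-finance"}},
]

print([d["id"] for d in filter_documents(docs, {"workspace_id": "ws-legal"})])
# ['contract']
```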

Workspace and workflow isolation

search() and scoped_chat() enforce strict isolation boundaries:

results = client.search(
    query="What are the obligations?",
    tenant_id="acme",
    workspace_id="ws-legal",
    include_workspace_context=True,
    include_uploaded_documents=True,
)

response = client.scoped_chat(
    messages=[{"role": "user", "content": "Analyse this contract."}],
    tenant_id="acme",
    workspace_id="ws-legal",
    workflow_id="wf-onboarding",
    include_workspace_context=True,
    include_uploaded_documents=True,
)

Boundary guarantees:

Never searches across tenants
Never searches across workspaces
Workflow-uploaded documents are only visible within that workflow
include_workspace_context=False → workspace docs excluded
include_uploaded_documents=False → workflow-uploaded docs excluded
Both False → empty context, no LLM call
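
The boundary guarantees above can be read as a single visibility predicate over document metadata. The sketch below is one possible encoding, not the SDK's implementation: `is_visible` is hypothetical, and it assumes that any document not marked `workflow_upload` counts as a workspace document.

```python
def is_visible(doc, *, tenant_id, workspace_id, workflow_id=None,
               include_workspace_context=True, include_uploaded_documents=True):
    """Apply the tenant, workspace, and workflow boundary rules to one document."""
    # Never cross tenant or workspace boundaries.
    if doc.get("tenant_id") != tenant_id or doc.get("workspace_id") != workspace_id:
        return False
    if doc.get("source_type") == "workflow_upload":
        # Workflow uploads are visible only inside their own workflow.
        return (include_uploaded_documents
                and workflow_id is not None
                and doc.get("workflow_id") == workflow_id)
    return include_workspace_context


docs = [
    {"id": "d1", "tenant_id": "acme", "workspace_id": "ws-legal", "source_type": "workspace"},
    {"id": "d2", "tenant_id": "acme", "workspace_id": "ws-legal",
     "source_type": "workflow_upload", "workflow_id": "wf-onboarding"},
    {"id": "d3", "tenant_id": "acme", "workspace_id": "ws-legal",
     "source_type": "workflow_upload", "workflow_id": "wf-other"},
    {"id": "d4", "tenant_id": "globex", "workspace_id": "ws-legal", "source_type": "workspace"},
]

scope = dict(tenant_id="acme", workspace_id="ws-legal", workflow_id="wf-onboarding")
print([d["id"] for d in docs if is_visible(d, **scope)])  # ['d1', 'd2']
```

With both include flags set to `False`, the predicate rejects every document, which matches the "empty context, no LLM call" rule.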

Data locality

All data is stored under your storage_path:

storage_path/
  documents/
    {document_id}/
      metadata.json
      tree.json
      extracted_text.json
      pages.json
      nodes.json
      retrievals/
      chats/
  manifests/
    manifest.json
  tasks/

No data is sent to PageIndex Cloud. The only external calls are to your configured LLM endpoint for:

Tree building and node summarisation (during ingestion)
Tree navigation and answer generation (during retrieval)
Chat responses
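
Since everything lives in plain files under `storage_path`, stored data can be inspected without the SDK. The sketch below builds paths from the layout above and reads `metadata.json`. The helper names are hypothetical, and the assumption that `metadata.json` holds a single JSON object is not confirmed by the layout listing.

```python
import json
import tempfile
from pathlib import Path


def document_dir(storage_path, document_id):
    """Directory holding all files for one document, per the layout above."""
    return Path(storage_path) / "documents" / document_id


def read_metadata(storage_path, document_id):
    """Load metadata.json for a document (assumed to be a JSON object)."""
    path = document_dir(storage_path, document_id) / "metadata.json"
    return json.loads(path.read_text(encoding="utf-8"))


# Demonstrate against a throwaway directory that mimics the layout.
with tempfile.TemporaryDirectory() as root:
    doc_dir = document_dir(root, "report-2024")
    doc_dir.mkdir(parents=True)
    (doc_dir / "metadata.json").write_text(
        json.dumps({"id": "report-2024"}), encoding="utf-8"
    )
    print(read_metadata(root, "report-2024"))  # {'id': 'report-2024'}
```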

API parity table

| Cloud Endpoint | Local SDK Method | Supported |
| --- | --- | --- |
| `POST /doc/` | `ingest_document()` / `ingest_pdf()` | Yes |
| `GET /doc/{id}/?type=tree` | `get_document_tree()` | Yes |
| `GET /doc/{id}/?type=tree&summary=true` | `get_document_tree(summary=True)` | Yes |
| `GET /doc/{id}/?type=ocr&format=page` | `get_document_ocr(format="page")` | Yes |
| `GET /doc/{id}/?type=ocr&format=node` | `get_document_ocr(format="node")` | Yes |
| `GET /doc/{id}/?type=ocr&format=raw` | `get_document_ocr(format="raw")` | Yes |
| `GET /doc/{id}/metadata` | `get_document_metadata()` | Yes |
| `GET /docs?limit=&offset=` | `list_documents(limit, offset)` | Yes |
| `DELETE /doc/{id}/` | `delete_document()` | Yes |
| `POST /markdown/` | `ingest_markdown()` / `convert_markdown_to_tree()` | Yes |
| `POST /chat/completions` | `chat_completion()` / `chat()` / `ask()` | Yes |
| `POST /chat/completions` (stream) | `stream_chat()` | Yes (Python generator) |
| `POST /retrieval/` | `retrieve()` / `create_retrieval_task()` | Yes |
| `GET /retrieval/{id}/` | `get_retrieval_result()` | Yes |
| Folder ingestion | `ingest_folder()` | Local-only |
| Batch ingestion | `ingest_documents()` / `batch_ingest()` | Local-only |
| Workspace isolation | `search()` / `scoped_chat()` | Local-only |

Known limitations

| Cloud Feature | Local Approximation |
| --- | --- |
| Enhanced cloud OCR | PyMuPDF / PyPDF2 text extraction (may be less accurate for scanned PDFs) |
| Hosted MCP tooling | Not implemented; the local SDK uses direct LLM calls |
| MCP streaming events | Approximated as `text_block_start` / `content` / `text_stop` / `done` chunks |
| Async processing queue | Synchronous: `create_retrieval_task()` runs inline and stores the result |
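
The last row means a "task" has already finished by the time its ID is returned, so polling is unnecessary. The pattern can be sketched as below. `create_task` and `get_result` are hypothetical names, and the in-memory dict is a simplification; per the table, the SDK stores the result rather than keeping it in memory.

```python
import uuid

# Simplified in-memory result store, used here only for illustration.
_results = {}


def create_task(run, *args):
    """Run `run` immediately and store its result under a fresh task ID."""
    task_id = str(uuid.uuid4())
    _results[task_id] = {
        "retrieval_id": task_id,
        "status": "completed",  # the work is done before the ID is returned
        "result": run(*args),
    }
    return {"retrieval_id": task_id}


def get_result(task_id):
    """Look up the stored result; no waiting or polling is involved."""
    return _results[task_id]


task = create_task(lambda query: f"answer to {query!r}", "What are the risks?")
print(get_result(task["retrieval_id"])["status"])  # completed
```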

Attribution

This package wraps the open-source PageIndex library by VectifyAI (MIT License). No PageIndex source code is incorporated — pageindex is used as a library dependency. See ATTRIBUTION.md for details.

local-pageindex is not affiliated with VectifyAI or PageIndex Cloud.

License

MIT

Project details

Version 0.1.1, released May 30, 2026.

