Local-only Python SDK mirroring PageIndex Cloud API — vectorless RAG, no cloud required.
local-pageindex
A local-only Python SDK that mirrors the PageIndex Cloud API as importable Python methods — no cloud account, no API key, no external service.
All document processing, retrieval, and chat happen on your machine. The only external calls are to your configured OpenAI-compatible LLM endpoint (for indexing, retrieval reasoning, and chat).
Why this exists
PageIndex is an open-source vectorless RAG system that builds hierarchical tree indexes from documents and uses LLM reasoning to navigate them. The project reports 98.7% accuracy on FinanceBench without a vector database.
This SDK wraps that library with:
- Local storage — all indexes, trees, and retrieval results saved under storage_path
- Full API parity — local equivalents for every PageIndex Cloud REST endpoint
- Streaming chat — Python generator interface
- Tenant/workspace/workflow isolation — multi-tenant scoping and boundary enforcement
- Batch ingestion — folder and file-list ingestion with success/failure summary
- Citation support — every retrieval and chat response includes source references
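To make the tree-navigation idea concrete, here is a minimal sketch of descending a hierarchical index one level at a time. It does not use this SDK or the pageindex library: the Node class, keyword_picker, and navigate are hypothetical names, and a simple word-overlap heuristic stands in for the LLM that would normally choose the next section.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One section of a document tree (hypothetical shape, for illustration only)."""
    title: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def keyword_picker(query: str, candidates: List[Node]) -> Node:
    """Stand-in for the LLM: pick the child whose title shares the most words with the query."""
    query_words = set(query.lower().split())
    return max(candidates, key=lambda n: len(query_words & set(n.title.lower().split())))


def navigate(root: Node, query: str, pick: Callable[[str, List[Node]], Node]) -> List[str]:
    """Descend from the root to a leaf, recording the titles visited."""
    path = [root.title]
    node = root
    while node.children:
        node = pick(query, node.children)
        path.append(node.title)
    return path


report = Node("Annual Report", children=[
    Node("Financial Statements", children=[Node("Revenue"), Node("Operating Costs")]),
    Node("Risk Factors", children=[Node("Market Risk"), Node("Credit Risk")]),
])

path = navigate(report, "what is the credit risk", keyword_picker)
print(" > ".join(path))  # Annual Report > Risk Factors > Credit Risk
```

Swapping keyword_picker for a function that asks an LLM to choose among section titles and summaries gives the general shape of reasoning-based retrieval, with no embeddings or vector store involved.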
Comparison with PageIndex Cloud
| Feature | PageIndex Cloud | local-pageindex |
|---|---|---|
| Document tree building | Cloud-hosted | Local, via open-source pageindex library |
| Storage | PageIndex servers | Your storage_path directory |
| Authentication | PAGEINDEX_API_KEY | Your OpenAI-compatible api_key |
| PDF text extraction | Cloud OCR | PyMuPDF / PyPDF2 |
| Retrieval | Cloud-hosted vectorless RAG | Local LLM tree navigation |
| Chat | Cloud API | Local LLM with retrieved context |
| Streaming | SSE over HTTPS | Python generator |
| Multi-tenant isolation | Managed | Metadata-based filtering |
| Cost | PageIndex pricing | LLM API calls only |
Installation
pip install local-pageindex
For better PDF extraction accuracy, also install PyMuPDF:
pip install "local-pageindex[pdf]"
Important: PDF and Markdown tree building requires the open-source pageindex library, which is not on PyPI. Install it separately:
pip install git+https://github.com/VectifyAI/PageIndex.git
Without pageindex, text file ingestion and all retrieval/chat methods work normally. PDF and Markdown ingestion will raise LLMProviderError at runtime.
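Because the failure only shows up at runtime, it can help to check which optional dependencies are present before ingesting PDFs or Markdown. The sketch below uses only the standard library. The import names are assumptions: pageindex for the open-source library, fitz for PyMuPDF, and PyPDF2 for PyPDF2. The helper check_optional_dependencies is a hypothetical name, not part of this SDK.

```python
from importlib.util import find_spec

# Import names assumed here: the open-source library as "pageindex",
# PyMuPDF as "fitz", and PyPDF2 as "PyPDF2".
OPTIONAL_MODULES = {
    "pageindex": "PDF and Markdown tree building",
    "fitz": "PyMuPDF text extraction",
    "PyPDF2": "PyPDF2 text extraction",
}


def check_optional_dependencies() -> dict:
    """Return {module_name: True/False} without importing the modules."""
    return {name: find_spec(name) is not None for name in OPTIONAL_MODULES}


availability = check_optional_dependencies()
for name, purpose in OPTIONAL_MODULES.items():
    state = "available" if availability[name] else "missing"
    print(f"{name:<10} {state:<10} {purpose}")
```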
Quick start
import os
from local_pageindex import LocalPageIndexClient
client = LocalPageIndexClient(
    storage_path="./my_index",
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ.get("OPENAI_BASE_URL"),  # optional
    model="gpt-4.1",
    reasoning_model="gpt-5.1",
)
# Ingest a document
result = client.ingest_document("report.pdf")
doc_id = result["doc_id"]
# Ask a question
answer = client.ask("What are the key findings?", doc_id=doc_id)
print(answer)
Document ingestion
result = client.ingest_document(
    "annual_report.pdf",
    document_id="report-2024",  # optional; UUID generated if omitted
    metadata={
        "tenant_id": "acme",
        "workspace_id": "ws-finance",
        "source_type": "workspace",
    },
)
# {"doc_id": "report-2024", "status": "completed", "retrieval_ready": True}
Plain text
result = client.ingest_document("notes.txt", document_id="notes-001")
Markdown
result = client.ingest_markdown(
    "README.md",
    options={
        "add_node_summary": True,
        "add_doc_description": True,
        # PageIndex-style strings also accepted: "if_add_node_summary": "yes"
    },
)
Folder ingestion
summary = client.ingest_folder(
    "./documents/",
    metadata_defaults={"tenant_id": "acme", "workspace_id": "ws-1"},
    recursive=True,
)
# {"succeeded": [...], "failed": [...], "skipped": [...], "success_count": N}
Supported types: .pdf, .md, .markdown, .txt, .text.
Unsupported types appear in skipped — no exception is raised.
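To preview which files a folder ingestion would pick up, the extension rule above can be approximated with pathlib alone. This is a sketch, not the SDK's own logic: preview_folder is a hypothetical helper, and matching extensions case-insensitively is an assumption that may differ from what ingest_folder() actually does.

```python
import tempfile
from pathlib import Path

# Extensions listed in this README as supported.
SUPPORTED = {".pdf", ".md", ".markdown", ".txt", ".text"}


def preview_folder(folder: str, recursive: bool = True) -> dict:
    """Split files under `folder` into ingestible and skipped, by extension only."""
    root = Path(folder)
    paths = root.rglob("*") if recursive else root.glob("*")
    plan = {"ingestible": [], "skipped": []}
    for path in sorted(p for p in paths if p.is_file()):
        # Case-insensitive matching is an assumption of this sketch.
        key = "ingestible" if path.suffix.lower() in SUPPORTED else "skipped"
        plan[key].append(path.relative_to(root).as_posix())
    return plan


# Demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["report.pdf", "notes.TXT", "logo.png", "sub/readme.md"]:
        target = Path(tmp, name)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("placeholder")
    plan = preview_folder(tmp)

print(plan)
```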
Get document data
# Hierarchical tree (sections / subsections)
tree = client.get_document_tree("report-2024")
tree_with_summaries = client.get_document_tree("report-2024", summary=True)
# Extracted text — page format [{"page_index": 1, "text": "..."}]
client.get_document_ocr("report-2024", format="page")
# Extracted text — node format [{"node_id": "0001", "title": "...", "text": "..."}]
client.get_document_ocr("report-2024", format="node")
# Extracted text — raw string
client.get_document_ocr("report-2024", format="raw")
# Metadata {"id", "name", "description", "status", "createdAt", "pageNum"}
client.get_document_metadata("report-2024")
# List {"documents": [...], "total": N, "limit": 50, "offset": 0}
client.list_documents(limit=20, offset=0)
# Delete
client.delete_document("report-2024")
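When an index holds more documents than one page, the limit/offset response shape shown above lends itself to a small pagination loop. The sketch below is written against any callable that returns that shape, so it runs without the SDK. fake_list_documents is a stand-in for client.list_documents, and the "id" key on each listed document is an assumption borrowed from the metadata shape.

```python
from typing import Callable, Dict, Iterator


def iter_all_documents(list_page: Callable[..., Dict], page_size: int = 50) -> Iterator[Dict]:
    """Yield every document by walking limit/offset pages until `total` is reached."""
    offset = 0
    while True:
        page = list_page(limit=page_size, offset=offset)
        documents = page["documents"]
        yield from documents
        offset += len(documents)
        # Stop on an empty page too, in case `total` is stale.
        if not documents or offset >= page["total"]:
            break


# Stand-in for client.list_documents, backed by an in-memory list.
_FAKE_STORE = [{"id": f"doc-{i}"} for i in range(7)]


def fake_list_documents(limit: int = 50, offset: int = 0) -> Dict:
    return {
        "documents": _FAKE_STORE[offset:offset + limit],
        "total": len(_FAKE_STORE),
        "limit": limit,
        "offset": offset,
    }


ids = [doc["id"] for doc in iter_all_documents(fake_list_documents, page_size=3)]
print(ids)
```

With a real client, passing client.list_documents as list_page should work as long as it accepts limit and offset as keyword arguments, as in the call above.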
Retrieval
result = client.retrieve(
    document_id="report-2024",
    query="What are the key risk factors?",
    thinking=False,  # True: use reasoning_model for deeper analysis
    max_results=5,
    max_context_tokens=4000,
)
# {
#   "retrieval_id": "...",
#   "doc_id": "report-2024",
#   "status": "completed",
#   "query": "...",
#   "retrieved_nodes": [
#     {
#       "title": "Risk Factors",
#       "node_id": "0005",
#       "relevant_contents": [
#         {"page_index": 12, "relevant_content": "The primary risk..."}
#       ]
#     }
#   ]
# }
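The nested response is convenient for display but awkward for logging or building prompts. A small helper can flatten it into one row per passage. This sketch relies only on the field names shown in the example response; flatten_retrieval is a hypothetical helper, and the sample values (including the retrieval_id) are made up.

```python
from typing import Dict, List


def flatten_retrieval(result: Dict) -> List[Dict]:
    """Turn nested retrieved_nodes into one row per relevant passage."""
    rows = []
    for node in result.get("retrieved_nodes", []):
        for item in node.get("relevant_contents", []):
            rows.append({
                "node_id": node["node_id"],
                "title": node["title"],
                "page_index": item["page_index"],
                "text": item["relevant_content"],
            })
    return rows


sample = {  # same shape as the example response; values are illustrative
    "retrieval_id": "r-1",
    "doc_id": "report-2024",
    "status": "completed",
    "query": "What are the key risk factors?",
    "retrieved_nodes": [{
        "title": "Risk Factors",
        "node_id": "0005",
        "relevant_contents": [{"page_index": 12, "relevant_content": "The primary risk..."}],
    }],
}

rows = flatten_retrieval(sample)
for row in rows:
    print(f'[{row["title"]}, p. {row["page_index"]}] {row["text"]}')
```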
Task-style retrieval (PageIndex Legacy API compatibility):
task = client.create_retrieval_task("report-2024", "What are the risks?")
result = client.get_retrieval_result(task["retrieval_id"])
Chat
result = client.chat_completion(
    messages=[{"role": "user", "content": "Summarise the key findings."}],
    doc_id="report-2024",
    enable_citations=True,
)
# {
#   "id": "chatcmpl-...",
#   "choices": [{"message": {"role": "assistant", "content": "..."}, "finish_reason": "end_turn"}],
#   "usage": {"prompt_tokens": N, "completion_tokens": N, "total_tokens": N},
#   "citations": [{"document_id": "...", "section_title": "...", "page_number": N, ...}]
# }
# Multi-document
result = client.chat_completion(
    messages=[{"role": "user", "content": "Compare the two reports."}],
    document_ids=["report-2024", "report-2023"],
)
# Convenience method — returns plain string
answer = client.ask("What is the revenue?", doc_id="report-2024")
Streaming chat
for chunk in client.stream_chat(
    messages=[{"role": "user", "content": "Summarise the findings."}],
    doc_id="report-2024",
):
    if chunk["type"] == "content":
        print(chunk["choices"][0]["delta"]["content"], end="", flush=True)
    elif chunk["type"] == "done":
        print()
        print("Citations:", chunk.get("citations", []))
Chunk types emitted:
| Type | Description |
|---|---|
| text_block_start | Stream begins |
| content | Text delta in choices[0].delta.content |
| text_stop | Text stream complete |
| done | Final chunk; includes citations list |
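If you want the full answer as a string rather than printing deltas as they arrive, the chunk types above can be folded into a single result. The sketch below runs against a fake generator, so it needs no LLM. collect_stream and fake_stream are hypothetical names, and the assumption that text_block_start and text_stop carry no text comes from the table, not from inspecting the SDK.

```python
from typing import Dict, Iterable, List, Tuple


def collect_stream(chunks: Iterable[Dict]) -> Tuple[str, List[Dict]]:
    """Join content deltas into one string and pick up citations from the final chunk."""
    parts: List[str] = []
    citations: List[Dict] = []
    for chunk in chunks:
        if chunk["type"] == "content":
            parts.append(chunk["choices"][0]["delta"]["content"])
        elif chunk["type"] == "done":
            citations = chunk.get("citations", [])
        # text_block_start and text_stop are assumed to carry no text and are ignored.
    return "".join(parts), citations


def fake_stream():
    """Stand-in for client.stream_chat(...), emitting the four chunk types."""
    yield {"type": "text_block_start"}
    for piece in ["Revenue ", "grew ", "in 2024."]:
        yield {"type": "content", "choices": [{"delta": {"content": piece}}]}
    yield {"type": "text_stop"}
    yield {"type": "done", "citations": [{"document_id": "report-2024", "page_number": 3}]}


text, citations = collect_stream(fake_stream())
print(text)
print(citations)
```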
Metadata filtering
Every document stores arbitrary metadata for later filtering:
client.ingest_document(
    "contract.pdf",
    metadata={
        "tenant_id": "acme",
        "workspace_id": "ws-legal",
        "workflow_id": "wf-onboarding",
        "user_id": "user-42",
        "source_type": "workflow_upload",  # "workspace" | "workflow_upload"
    },
)
docs = client.list_documents(filters={"workspace_id": "ws-legal"})
Workspace and workflow isolation
search() and scoped_chat() enforce strict isolation boundaries:
results = client.search(
    query="What are the obligations?",
    tenant_id="acme",
    workspace_id="ws-legal",
    include_workspace_context=True,
    include_uploaded_documents=True,
)
response = client.scoped_chat(
    messages=[{"role": "user", "content": "Analyse this contract."}],
    tenant_id="acme",
    workspace_id="ws-legal",
    workflow_id="wf-onboarding",
    include_workspace_context=True,
    include_uploaded_documents=True,
)
Boundary guarantees:
- Never searches across tenants
- Never searches across workspaces
- Workflow-uploaded documents are only visible within that workflow
- include_workspace_context=False → workspace docs excluded
- include_uploaded_documents=False → workflow-uploaded docs excluded
- Both False → empty context, no LLM call
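One way to read the guarantees above is as a single predicate over each document's metadata. The sketch below is an interpretation, not the SDK's implementation: is_visible is a hypothetical function, and it assumes that any document whose source_type is not "workflow_upload" counts as a workspace document.

```python
from typing import Dict, Optional


def is_visible(
    meta: Dict,
    *,
    tenant_id: str,
    workspace_id: str,
    workflow_id: Optional[str] = None,
    include_workspace_context: bool = True,
    include_uploaded_documents: bool = True,
) -> bool:
    """Decide whether one document's metadata falls inside the requested scope."""
    if meta.get("tenant_id") != tenant_id:
        return False  # never cross tenants
    if meta.get("workspace_id") != workspace_id:
        return False  # never cross workspaces
    if meta.get("source_type") == "workflow_upload":
        # Uploads are visible only inside their own workflow, and only when requested.
        return (
            include_uploaded_documents
            and workflow_id is not None
            and meta.get("workflow_id") == workflow_id
        )
    # Everything else is treated as a workspace document (an assumption of this sketch).
    return include_workspace_context


docs = [
    {"id": "policy", "tenant_id": "acme", "workspace_id": "ws-legal",
     "source_type": "workspace"},
    {"id": "contract", "tenant_id": "acme", "workspace_id": "ws-legal",
     "workflow_id": "wf-onboarding", "source_type": "workflow_upload"},
    {"id": "exit-form", "tenant_id": "acme", "workspace_id": "ws-legal",
     "workflow_id": "wf-offboarding", "source_type": "workflow_upload"},
    {"id": "foreign", "tenant_id": "globex", "workspace_id": "ws-legal",
     "source_type": "workspace"},
]
scope = {"tenant_id": "acme", "workspace_id": "ws-legal", "workflow_id": "wf-onboarding"}

visible = [d["id"] for d in docs if is_visible(d, **scope)]
uploads_only = [d["id"] for d in docs if is_visible(d, include_workspace_context=False, **scope)]
nothing = [
    d["id"] for d in docs
    if is_visible(d, include_workspace_context=False, include_uploaded_documents=False, **scope)
]
print(visible, uploads_only, nothing)
```

In this reading, an empty result with both flags set to False is the case where no LLM call would be needed.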
Data locality
All data is stored under your storage_path:
storage_path/
  documents/
    {document_id}/
      metadata.json
      tree.json
      extracted_text.json
      pages.json
      nodes.json
  retrievals/
  chats/
  manifests/
    manifest.json
  tasks/
No data is sent to PageIndex Cloud. The only external calls are to your configured LLM endpoint for:
- Tree building and node summarisation (during ingestion)
- Tree navigation and answer generation (during retrieval)
- Chat responses
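Since everything lives on disk, the store can be inspected with ordinary file tools. The sketch below reports which of the listed per-document files are absent from each document directory. It assumes the per-document files sit under storage_path/documents/{document_id}/, which is a reading of the layout rather than something verified against the SDK, and audit_storage is a hypothetical helper. A missing file is not necessarily an error, since some document types may not produce every file.

```python
import json
import tempfile
from pathlib import Path

EXPECTED_FILES = ["metadata.json", "tree.json", "extracted_text.json", "pages.json", "nodes.json"]


def audit_storage(storage_path: str) -> dict:
    """Map each document directory to the expected files that are missing from it."""
    documents_dir = Path(storage_path) / "documents"
    report = {}
    if not documents_dir.is_dir():
        return report
    for doc_dir in sorted(p for p in documents_dir.iterdir() if p.is_dir()):
        report[doc_dir.name] = [f for f in EXPECTED_FILES if not (doc_dir / f).exists()]
    return report


# Demonstration on a throwaway directory with one complete and one partial document.
with tempfile.TemporaryDirectory() as tmp:
    complete = Path(tmp, "documents", "report-2024")
    partial = Path(tmp, "documents", "notes-001")
    for directory in (complete, partial):
        directory.mkdir(parents=True)
    for name in EXPECTED_FILES:
        (complete / name).write_text(json.dumps({}))
    (partial / "metadata.json").write_text(json.dumps({"id": "notes-001"}))
    report = audit_storage(tmp)

print(report)
```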
API parity table
| Cloud Endpoint | Local SDK Method | Supported |
|---|---|---|
| POST /doc/ | ingest_document() / ingest_pdf() | Yes |
| GET /doc/{id}/?type=tree | get_document_tree() | Yes |
| GET /doc/{id}/?type=tree&summary=true | get_document_tree(summary=True) | Yes |
| GET /doc/{id}/?type=ocr&format=page | get_document_ocr(format="page") | Yes |
| GET /doc/{id}/?type=ocr&format=node | get_document_ocr(format="node") | Yes |
| GET /doc/{id}/?type=ocr&format=raw | get_document_ocr(format="raw") | Yes |
| GET /doc/{id}/metadata | get_document_metadata() | Yes |
| GET /docs?limit=&offset= | list_documents(limit, offset) | Yes |
| DELETE /doc/{id}/ | delete_document() | Yes |
| POST /markdown/ | ingest_markdown() / convert_markdown_to_tree() | Yes |
| POST /chat/completions | chat_completion() / chat() / ask() | Yes |
| POST /chat/completions (stream) | stream_chat() | Yes (Python generator) |
| POST /retrieval/ | retrieve() / create_retrieval_task() | Yes |
| GET /retrieval/{id}/ | get_retrieval_result() | Yes |
| Folder ingestion | ingest_folder() | Local-only |
| Batch ingestion | ingest_documents() / batch_ingest() | Local-only |
| Workspace isolation | search() / scoped_chat() | Local-only |
Known limitations
| Cloud Feature | Local Approximation |
|---|---|
| Enhanced cloud OCR | PyMuPDF / PyPDF2 text extraction (may be less accurate for scanned PDFs) |
| Hosted MCP tooling | Not implemented — local SDK uses direct LLM calls |
| MCP streaming events | Approximated as text_block_start / content / done chunks |
| Async processing queue | Synchronous — create_retrieval_task() runs inline and stores the result |
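Because retrieval tasks run inline, a caller that needs to stay responsive can move the blocking call onto a worker thread. The sketch below shows the pattern with a stand-in function. slow_retrieval is hypothetical and only simulates latency. Whether LocalPageIndexClient is safe to share across threads is not stated in this README, so treat that as an open assumption before applying this to real calls.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_retrieval(doc_id: str, query: str) -> dict:
    """Stand-in for a blocking call such as client.create_retrieval_task(...)."""
    time.sleep(0.1)  # simulates LLM latency
    return {"retrieval_id": f"{doc_id}:{len(query)}", "status": "completed"}


queries = ["What are the risks?", "What is the revenue?"]

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(slow_retrieval, "report-2024", q) for q in queries]
    # The caller is free to do other work here while the calls run.
    results = [f.result() for f in futures]

print([r["status"] for r in results])
```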
Attribution
This package wraps the open-source PageIndex library
by VectifyAI (MIT License). No PageIndex source code is
incorporated — pageindex is used as a library dependency. See ATTRIBUTION.md
for details.
local-pageindex is not affiliated with VectifyAI or PageIndex Cloud.
License