Compile PDFs into a queryable wiki.
OpenIndex
Overview
OpenIndex parses PDF documents into a hierarchical section tree and compiles them into a persistent, cross-linked wiki that agents can query.
It combines two projects:
- PageIndex — LLM-based hierarchical section extraction from PDFs
- OpenKB — compiles documents into a queryable wiki with cross-document concept pages
Unlike traditional RAG (which rediscovers knowledge on every query), OpenIndex compiles once: sections are indexed, summaries generated, concept pages created with bidirectional links, and a structured wiki is written to disk. An agent can then search the wiki to answer questions precisely.
Installation
From PyPI:
pip install openindex
From source:
uv pip install git+https://github.com/hienhayho/openindex.git
Usage
Set environment variables (or use a .env file):
OPENAI_MODEL_NAME=...
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_API_KEY=
OPENAI_EXTRA_BODY={}
Note: openindex works with any OpenAI-compatible API server (OpenAI, vLLM, Ollama, LM Studio, etc.). Set OPENAI_BASE_URL to point to your server.
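Because OPENAI_EXTRA_BODY must be valid JSON and a missing model name typically only surfaces once the first request is made, it can help to validate the settings up front. The following is a minimal sketch using only the standard library; `load_llm_settings` is a hypothetical helper (not part of openindex), and the choices to require only OPENAI_MODEL_NAME and to fall back to the public OpenAI URL are assumptions.

```python
import json
import os


def load_llm_settings(env=None):
    """Collect the OpenAI-compatible settings from an environment mapping.

    Hypothetical helper, not part of openindex. Assumes only the model name
    is strictly required (local servers often ignore the API key).
    """
    env = os.environ if env is None else env

    model_name = env.get("OPENAI_MODEL_NAME")
    if not model_name:
        raise ValueError("OPENAI_MODEL_NAME is not set")

    # An empty or unset OPENAI_EXTRA_BODY is treated as an empty JSON object.
    extra_body = json.loads(env.get("OPENAI_EXTRA_BODY") or "{}")
    if not isinstance(extra_body, dict):
        raise ValueError("OPENAI_EXTRA_BODY must be a JSON object")

    return {
        "model_name": model_name,
        "base_url": env.get("OPENAI_BASE_URL") or "https://api.openai.com/v1",
        "api_key": env.get("OPENAI_API_KEY", ""),
        "extra_body": extra_body,
    }
```

The returned keys match the keyword arguments used in the examples in this README, so the dictionary could be unpacked into those constructors, assuming they accept exactly these names.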
Index a PDF
Runs the full pipeline: section extraction → verification → tree building → summaries → wiki generation.
import os
import json
from dotenv import load_dotenv
from openindex import WikiIndex, TreeConfig
load_dotenv()
index = WikiIndex(
    model_name=os.getenv("OPENAI_MODEL_NAME"),
    base_url=os.getenv("OPENAI_BASE_URL"),
    api_key=os.getenv("OPENAI_API_KEY"),
    extra_body=json.loads(os.getenv("OPENAI_EXTRA_BODY", "{}")),
    config=TreeConfig(max_parallel_llm_calls=8),
)
result = index.build_wiki_sync("paper.pdf", "./wiki")
WikiIndex.print_result(result)
See tools/index.py for a full example.
Output wiki structure:
wiki/
├── index.md # master catalog
├── summaries/<doc>.md # section tree with page ranges
├── concepts/<slug>.md # cross-document concept pages
└── sources/<doc>.json # full per-page text
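Since the wiki is plain files on disk, it can be inspected without openindex at all. The sketch below lists what a build produced, assuming only the directory layout shown above; `summarize_wiki` is a hypothetical helper name, and nothing is assumed about the contents of the individual files.

```python
from pathlib import Path


def _stems(directory, pattern):
    """Return sorted file stems matching pattern, or [] if the directory is absent."""
    if not directory.is_dir():
        return []
    return sorted(p.stem for p in directory.glob(pattern))


def summarize_wiki(wiki_dir):
    """Report which documents and concepts exist in a compiled wiki directory.

    Hypothetical helper; relies only on the layout described in this README.
    """
    root = Path(wiki_dir)
    return {
        "has_index": (root / "index.md").is_file(),
        "summaries": _stems(root / "summaries", "*.md"),
        "concepts": _stems(root / "concepts", "*.md"),
        "sources": _stems(root / "sources", "*.json"),
    }


if __name__ == "__main__":
    print(summarize_wiki("./wiki"))
```

A check like this can serve as a quick sanity test after indexing, for example to confirm that each document has both a summary and a source file.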
Query the wiki
The query agent searches the compiled wiki to answer questions, fetching only the relevant pages.
import os
import json
from dotenv import load_dotenv
from openindex import WikiQueryAgent
load_dotenv()
agent = WikiQueryAgent(
    wiki_dir="./wiki",
    model_name=os.getenv("OPENAI_MODEL_NAME"),
    base_url=os.getenv("OPENAI_BASE_URL"),
    api_key=os.getenv("OPENAI_API_KEY"),
    extra_body=json.loads(os.getenv("OPENAI_EXTRA_BODY", "{}")),
)
answer = agent.ask_sync("What is RAG?")
print(answer)
See tools/query.py for a full example.
License
Apache 2.0. See LICENSE for details.
This project incorporates code from: