Compile PDFs into a queryable wiki.

OpenIndex

Overview

OpenIndex parses PDF documents into a hierarchical section tree and compiles them into a persistent, cross-linked wiki that agents can query.

It combines two projects:

  • PageIndex — LLM-based hierarchical section extraction from PDFs
  • OpenKB — compiles documents into a queryable wiki with cross-document concept pages

Unlike traditional RAG (which rediscovers knowledge on every query), OpenIndex compiles once: sections are indexed, summaries generated, concept pages created with bidirectional links, and a structured wiki is written to disk. An agent can then search the wiki to answer questions precisely.
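To make "bidirectional links" concrete, here is a toy sketch, not OpenIndex's internal data model. The function name `build_link_index` and the section identifiers are hypothetical; the only idea taken from the description above is that each concept page points to the sections that mention it, and each section points back to its concepts.

```python
from collections import defaultdict


def build_link_index(section_concepts):
    """Return (concept -> sections, section -> concepts) from one mapping."""
    concept_to_sections = defaultdict(set)
    section_to_concepts = defaultdict(set)
    for section, concepts in section_concepts.items():
        for concept in concepts:
            # Record the link in both directions in a single pass.
            concept_to_sections[concept].add(section)
            section_to_concepts[section].add(concept)
    return dict(concept_to_sections), dict(section_to_concepts)


# Hypothetical sections from two documents and the concepts they mention.
section_concepts = {
    "paper-a#2.1": ["retrieval", "chunking"],
    "paper-b#3": ["retrieval"],
}
concept_to_sections, section_to_concepts = build_link_index(section_concepts)
print(sorted(concept_to_sections["retrieval"]))
```

Because the links are computed once and stored, a later lookup in either direction is a dictionary access rather than a fresh search over the source documents.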

Installation

From PyPI:

pip install openindex

From source:

uv pip install git+https://github.com/hienhayho/openindex.git

Usage

Set environment variables (or use a .env file):

OPENAI_MODEL_NAME=...
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_API_KEY=
OPENAI_EXTRA_BODY={}

Note: openindex works with any OpenAI-compatible API server (OpenAI, vLLM, Ollama, LM Studio, etc.). Set OPENAI_BASE_URL to point to your server.
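A small helper can catch a missing model name or a malformed OPENAI_EXTRA_BODY before any LLM call is made. This is a sketch, not part of openindex: the function `load_openai_settings` is hypothetical, treating only OPENAI_MODEL_NAME as required is an assumption, and the model name in the example is a placeholder. The local URL is Ollama's default OpenAI-compatible endpoint.

```python
import json
import os


def load_openai_settings(env=None):
    """Collect the OPENAI_* variables into a dict of keyword arguments."""
    env = os.environ if env is None else env
    model_name = env.get("OPENAI_MODEL_NAME")
    if not model_name:
        raise ValueError("OPENAI_MODEL_NAME is not set")
    # An unset or empty OPENAI_EXTRA_BODY is treated as an empty JSON object.
    raw_extra = env.get("OPENAI_EXTRA_BODY") or "{}"
    try:
        extra_body = json.loads(raw_extra)
    except json.JSONDecodeError as exc:
        raise ValueError(f"OPENAI_EXTRA_BODY is not valid JSON: {exc}") from exc
    if not isinstance(extra_body, dict):
        raise ValueError("OPENAI_EXTRA_BODY must be a JSON object")
    return {
        "model_name": model_name,
        "base_url": env.get("OPENAI_BASE_URL") or "https://api.openai.com/v1",
        "api_key": env.get("OPENAI_API_KEY", ""),
        "extra_body": extra_body,
    }


# Example: point at a local Ollama server instead of api.openai.com.
settings = load_openai_settings({
    "OPENAI_MODEL_NAME": "my-local-model",  # placeholder model name
    "OPENAI_BASE_URL": "http://localhost:11434/v1",
})
print(settings["base_url"])
```

The returned keys match the keyword arguments used in the examples in this README, so the dict could plausibly be unpacked into the constructor call.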

Index a PDF

Runs the full pipeline: section extraction → verification → tree building → summaries → wiki generation.

import json
import os

from dotenv import load_dotenv
from openindex import TreeConfig, WikiIndex

# Read the OPENAI_* settings from a .env file, if one exists.
load_dotenv()

index = WikiIndex(
    model_name=os.getenv("OPENAI_MODEL_NAME"),
    base_url=os.getenv("OPENAI_BASE_URL"),
    api_key=os.getenv("OPENAI_API_KEY"),
    # OPENAI_EXTRA_BODY holds a JSON object; fall back to an empty one.
    extra_body=json.loads(os.getenv("OPENAI_EXTRA_BODY", "{}")),
    config=TreeConfig(max_parallel_llm_calls=8),
)

# Run the full pipeline on paper.pdf and write the wiki to ./wiki.
result = index.build_wiki_sync("paper.pdf", "./wiki")
WikiIndex.print_result(result)

See tools/index.py for a full example.
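To index several PDFs, one option is to repeat the same call for each file. The sketch below assumes that repeated `build_wiki_sync` calls with the same output directory add documents to one wiki; the mention of cross-document concept pages suggests this, but it is not confirmed here. The helper `build_wiki_for_folder` is hypothetical, and `index` is any object with a `build_wiki_sync(pdf_path, wiki_dir)` method, such as the `WikiIndex` instance above.

```python
from pathlib import Path


def build_wiki_for_folder(index, pdf_dir, wiki_dir):
    """Call index.build_wiki_sync once per PDF found in pdf_dir."""
    results = {}
    # Sort the paths so the files are processed in a stable order.
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        # Same call as the single-file example, with a shared output directory.
        results[pdf_path.name] = index.build_wiki_sync(str(pdf_path), str(wiki_dir))
    return results
```

The function returns the per-file results keyed by file name, so each one can still be passed to `WikiIndex.print_result`.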

Output wiki structure:

wiki/
├── index.md              # master catalog
├── summaries/<doc>.md    # section tree with page ranges
├── concepts/<slug>.md    # cross-document concept pages
└── sources/<doc>.json    # full per-page text
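Since the wiki is plain files on disk, it can be inspected without openindex. The sketch below relies only on the layout shown above; the helper name `list_wiki_pages` is hypothetical, and nothing is assumed about the contents of each file.

```python
from pathlib import Path


def list_wiki_pages(wiki_dir):
    """Group the files of a compiled wiki by the folders in the layout."""
    root = Path(wiki_dir)
    return {
        # True when the master catalog exists.
        "index": (root / "index.md").is_file(),
        # File names without extension, one entry per page.
        "summaries": sorted(p.stem for p in (root / "summaries").glob("*.md")),
        "concepts": sorted(p.stem for p in (root / "concepts").glob("*.md")),
        "sources": sorted(p.stem for p in (root / "sources").glob("*.json")),
    }
```

Calling `list_wiki_pages("./wiki")` after a build gives a quick check that the expected folders were written.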

Query the wiki

The query agent searches the compiled wiki to answer questions, fetching only the relevant pages.

import json
import os

from dotenv import load_dotenv
from openindex import WikiQueryAgent

# Read the OPENAI_* settings from a .env file, if one exists.
load_dotenv()

agent = WikiQueryAgent(
    # Directory written by the indexing step.
    wiki_dir="./wiki",
    model_name=os.getenv("OPENAI_MODEL_NAME"),
    base_url=os.getenv("OPENAI_BASE_URL"),
    api_key=os.getenv("OPENAI_API_KEY"),
    extra_body=json.loads(os.getenv("OPENAI_EXTRA_BODY", "{}")),
)

answer = agent.ask_sync("What is RAG?")
print(answer)

See tools/query.py for a full example.
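The idea of "fetching only the relevant pages" can be illustrated with a toy keyword ranker over the concepts folder. This is not how WikiQueryAgent selects pages, which this README does not describe; the function `rank_pages` and its word-overlap scoring are hypothetical stand-ins for illustration.

```python
import re
from pathlib import Path


def rank_pages(wiki_dir, question, top_k=3):
    """Rank concept pages by how many question words they contain."""
    words = set(re.findall(r"[a-z0-9]+", question.lower()))
    scored = []
    for page in sorted((Path(wiki_dir) / "concepts").glob("*.md")):
        tokens = set(re.findall(r"[a-z0-9]+", page.read_text(encoding="utf-8").lower()))
        score = len(words & tokens)
        if score:
            # Pages that share no word with the question are skipped.
            scored.append((score, page.name))
    # Highest score first; ties are broken alphabetically by file name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:top_k]]
```

Only the pages returned here would need to be read in full, which is the cost saving that a compiled wiki makes possible.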

License

Apache 2.0. See LICENSE for details.

This project incorporates code from:
