OpenKB: Open LLM Knowledge Base, powered by PageIndex

OpenKB — Open LLM Knowledge Base

Scale to long documents  •  Reasoning-based retrieval  •  Native multi-modality  •  No Vector DB


📑 What is OpenKB

OpenKB (Open Knowledge Base) is an open-source command-line tool that uses LLMs to compile raw documents into a structured, interlinked, wiki-style knowledge base. It relies on PageIndex for vectorless retrieval from long documents.

The idea is based on a concept described by Andrej Karpathy: LLMs generate summaries, concept pages, and cross-references, all maintained automatically. Knowledge compounds over time instead of being re-derived on every query.

Why not traditional RAG?

Traditional RAG rediscovers knowledge from scratch on every query. Nothing accumulates. OpenKB compiles knowledge once into a persistent wiki, then keeps it current. Cross-references already exist. Contradictions are flagged. Synthesis reflects everything consumed.

OpenKB has two layers: a wiki foundation that compiles and maintains your knowledge, and generators (query / chat / Skill Factory) that turn it into useful output. See Usage for the full command list.

🚀 Getting Started

Install

pip install openkb
Other install options
  • Latest from GitHub:

    pip install git+https://github.com/VectifyAI/OpenKB.git
    
  • Install from source (editable, for development):

    git clone https://github.com/VectifyAI/OpenKB.git
    cd OpenKB
    pip install -e .
    

Quick Start

# 1. Create a directory for your knowledge base
mkdir my-kb && cd my-kb

# 2. Initialize the knowledge base
openkb init

# 3. Add documents
openkb add paper.pdf
openkb add ~/papers/                            # Add a whole directory
openkb add https://arxiv.org/pdf/2509.11420     # Or fetch from a URL

# 4. Ask a question
openkb query "What are the main findings?"

# 5. Or chat interactively
openkb chat

# 6. Or distill your wiki into a redistributable skill
openkb skill new my-expert "Reason like an expert on <topic-from-your-docs>"

Set up your LLM

OpenKB comes with multi-LLM support (e.g., OpenAI, Claude, Gemini) via LiteLLM (pinned to a safe version).

Set your model during openkb init or in .openkb/config.yaml, using LiteLLM's provider/model format (for example, anthropic/claude-sonnet-4-6). OpenAI models can omit the prefix (for example, gpt-5.4).
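To make the provider/model convention concrete, here is a minimal sketch of how such a string could be split. The helper name split_model_name and the "openai" fallback label are assumptions for illustration; this is not OpenKB's or LiteLLM's actual parsing code.

```python
from typing import Tuple


def split_model_name(name: str, default_provider: str = "openai") -> Tuple[str, str]:
    """Split a 'provider/model' string into (provider, model).

    Hypothetical helper for illustration only; not part of OpenKB or LiteLLM.
    """
    name = name.strip()
    if not name:
        raise ValueError("model name must not be empty")
    # Split on the first slash only, so the model part may itself contain slashes.
    provider, sep, model = name.partition("/")
    if not sep:
        # No prefix: treat it as an OpenAI model, as described above.
        return default_provider, name
    if not provider or not model:
        raise ValueError(f"malformed model name: {name!r}")
    return provider, model


if __name__ == "__main__":
    for example in ("gpt-5.4", "anthropic/claude-sonnet-4-6"):
        print(example, "->", split_model_name(example))
```

Splitting on the first slash only keeps nested model identifiers intact in the model part.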

Create a .env file with your LLM API key:

LLM_API_KEY=your_llm_api_key

🧩 How OpenKB Works

Architecture

raw/                              You drop files here
 │
 ├─ Short docs ──→ markitdown ──→ LLM reads full text
 │                                     │
 ├─ Long PDFs ──→ PageIndex ────→ LLM reads document trees
 │                                     │
 │                                     ▼
 │                         Wiki Compilation (using LLM)
 │                                     │
 ▼                                     ▼
wiki/                                  │            ← the foundation
 ├── index.md            Knowledge base overview
 ├── log.md              Operations timeline
 ├── AGENTS.md           Wiki schema (LLM instructions)
 ├── sources/            Full-text conversions
 ├── summaries/          Per-document summaries
 ├── concepts/           Cross-document synthesis ← the good stuff
 ├── entities/           Specific named things (people, orgs, places, products)
 ├── explorations/       Saved query results
 └── reports/            Lint reports
                                       │
                ┌──────────────────────┼──────────────────────┐
                ▼                      ▼                      ▼
            query / chat         Skill Factory          (future)
          (LLM answers from     openkb skill new       ppt / podcast /
            the wiki)           → output/skills/        report / …
                                + marketplace.json

Short vs. Long Document Handling

|  | Short documents | Long documents (PDF ≥ 20 pages) |
|---|---|---|
| Convert | markitdown → Markdown | PageIndex → tree index + summaries |
| Images | Extracted inline (pymupdf) | Extracted by PageIndex |
| LLM reads | Full text | Document trees |
| Result | summary + concepts | summary + concepts |

Short docs are read in full by the LLM. Long PDFs are indexed by PageIndex into a hierarchical tree with summaries. The LLM reads the tree instead of the full text, which keeps long documents within context limits and lets retrieval target specific sections.
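The routing rule can be sketched as a small function. This is only an illustration of the documented behavior (PDFs at or above the page threshold go to PageIndex, everything else to markitdown); the Document class and choose_pipeline name are hypothetical and not OpenKB's internal API.

```python
from dataclasses import dataclass
from typing import Optional

# Mirrors the documented pageindex_threshold setting (20 pages).
PAGEINDEX_THRESHOLD = 20


@dataclass(frozen=True)
class Document:
    """Hypothetical stand-in for a file dropped into raw/."""
    name: str
    page_count: Optional[int]  # None for formats that have no page count


def choose_pipeline(doc: Document, threshold: int = PAGEINDEX_THRESHOLD) -> str:
    """Return 'pageindex' for long PDFs and 'markitdown' for everything else."""
    is_pdf = doc.name.lower().endswith(".pdf")
    if is_pdf and doc.page_count is not None and doc.page_count >= threshold:
        return "pageindex"
    return "markitdown"


if __name__ == "__main__":
    for doc in (Document("paper.pdf", 42), Document("memo.pdf", 3), Document("notes.md", None)):
        print(doc.name, "->", choose_pipeline(doc))
```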

Knowledge Compilation

When you add a document, the LLM:

  1. Generates a summary page
  2. Reads existing concept and entity pages
  3. Creates or updates concepts with cross-document synthesis
  4. Creates or updates entity pages (people, orgs, places, products)
  5. Updates the index and log

A single source might touch 10-15 wiki pages. Knowledge accumulates: each document enriches the existing wiki rather than sitting in isolation.
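The five steps above can be pictured with a toy in-memory model. This is a sketch under strong assumptions: the Wiki class, the page paths, and compile_document are invented for illustration, and the concepts and entities arguments stand in for what the LLM would extract. It is not how OpenKB stores or compiles pages.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Wiki:
    """Toy wiki: maps a page path to a list of strings (sources or page names)."""
    pages: Dict[str, List[str]] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)


def compile_document(wiki: Wiki, source: str,
                     concepts: List[str], entities: List[str]) -> List[str]:
    """Apply the five documented steps and return the sorted list of touched pages."""
    touched = set()

    # Step 1: generate a summary page for the new source.
    summary_page = f"summaries/{source}.md"
    wiki.pages[summary_page] = [source]
    touched.add(summary_page)

    # Steps 2 to 4: look up existing concept/entity pages, then create or update them.
    for folder, names in (("concepts", concepts), ("entities", entities)):
        for name in names:
            page = f"{folder}/{name}.md"
            sources = wiki.pages.setdefault(page, [])  # existing page, or a new one
            if source not in sources:
                sources.append(source)
            touched.add(page)

    # Step 5: update the index and the log.
    wiki.pages["index.md"] = sorted(p for p in wiki.pages if p != "index.md")
    touched.add("index.md")
    wiki.log.append(f"add {source}: touched {len(touched)} pages")
    return sorted(touched)
```

Adding a second source that mentions an existing concept appends to that concept page rather than creating a new one, which is the accumulation the text describes.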

⚙️ Usage

OpenKB commands fall into two layers: the wiki foundation (compile + manage your knowledge) and generators (turn that wiki into useful output).

🧱 Wiki Foundation — compile and maintain

| Command | Description |
|---|---|
| openkb init | Initialize a new knowledge base (interactive) |
| openkb add <file_or_dir_or_URL> | Add documents and compile to wiki. URL ingest auto-detects PDF (saved as .pdf → PageIndex / markitdown) vs HTML (trafilatura main-content extract → .md) |
| openkb remove <doc> | Remove a document and clean up its wiki pages, images, registry, and PageIndex state (use --dry-run to preview, --keep-raw / --keep-empty to retain artifacts) |
| openkb recompile [<doc>] [--all] | Re-run the current compile pipeline on already-indexed docs (e.g. to backfill the entities/ layer) without re-indexing. Regenerates summaries and rewrites concept pages; manual edits are overwritten. Use --dry-run to preview, --refresh-schema to also update wiki/AGENTS.md |
| openkb watch | Watch raw/ and auto-compile new files |
| openkb lint | Run structural + knowledge health checks |
| openkb list | List indexed documents and concepts |
| openkb status | Show knowledge base stats |
| openkb feedback ["msg"] | File feedback by opening a prefilled GitHub issue (use --type bug/feature/question to tag the issue) |

✨ Generators — turn the wiki into output

A "generator" reads from the compiled wiki and produces something usable: an answer, a conversation, a skill folder. The wiki is the substrate; generators are the surfaces.

| Command | Output |
|---|---|
| openkb query "question" | A grounded answer with citations (use --save to persist to wiki/explorations/) |
| openkb chat | Interactive multi-turn session over the wiki (use --resume, --list, --delete to manage sessions) |
| openkb skill new <name> "<intent>" | A redistributable Anthropic Skill at <kb>/output/skills/<name>/ + auto-updated marketplace.json |
| openkb skill validate [name] | Structural lint of compiled skills (frontmatter, file sizes, wikilinks, scripts/ stdlib check with --strict). Auto-runs at end of skill new |
| openkb skill eval <name> | Trigger-accuracy evaluation (does the description: field actually fire?). An LLM generates eval prompts; a grader LLM scores activation. --save persists the eval set |
| openkb skill history <name> / openkb skill rollback <name> | Iteration workspace: every overwrite saves the previous version to output/skills/<name>-workspace/iteration-N/ with a structural diff. Rollback restores any iteration |

Query & Chat — ask the wiki

openkb query "..." answers a single question. openkb chat is interactive — each turn carries history, so you can dig into a topic without re-typing context. Both use the same underlying wiki and the same retrieval primitives (PageIndex for long docs, direct concept reads for short).

openkb query "What does the literature say about attention scaling?"

openkb chat                       # start a new session
openkb chat --resume              # resume the most recent session
openkb chat --resume 20260411     # resume by id (unique prefix works)
openkb chat --list                # list all sessions
openkb chat --delete <id>         # delete a session
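Resuming by unique prefix can be sketched as follows. The session id format shown here (date plus time) and the resolve_session helper are assumptions made for illustration; the source only states that a unique prefix of the id works.

```python
from typing import Iterable


def resolve_session(prefix: str, session_ids: Iterable[str]) -> str:
    """Return the single session id that starts with `prefix`.

    Hypothetical helper: raises LookupError when the prefix matches
    no session or more than one session.
    """
    matches = [sid for sid in session_ids if sid.startswith(prefix)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise LookupError(f"no session matches prefix {prefix!r}")
    raise LookupError(f"prefix {prefix!r} is ambiguous: {sorted(matches)}")


if __name__ == "__main__":
    # Made-up ids; the real format may differ.
    ids = ["20260411-093000", "20260412-101500"]
    print(resolve_session("20260411", ids))
```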

Inside a chat, type / to access slash commands (Tab to complete):

  • /help — list available commands
  • /status — show knowledge base status
  • /list — list all documents
  • /add <path> — add a document or directory without leaving the chat
  • /skill new <name> "<intent>" — compile a skill from this chat (see below)
  • /save [name] — export the transcript to wiki/explorations/
  • /clear — start a fresh session (the current one stays on disk)
  • /lint — run knowledge base lint
  • /exit — exit (Ctrl-D also works)

🛠 Skill Factory — Drop in a book. Out comes a digital expert.

The newest generator. openkb skill new distills any subset of your wiki into an Anthropic Skill — a portable folder that Claude Code, Codex CLI, Gemini CLI, and Cursor all install and load natively. Drop in a book's worth of papers; out comes a specialist that other agents can call on.

openkb skill new karpathy-thinking \
  "Reason about transformers and attention in Karpathy's style"

This produces:

<kb>/output/skills/karpathy-thinking/
├── SKILL.md                   # YAML frontmatter + when-to-use + approach
├── references/                # depth material the agent loads on demand
│   ├── methodology.md
│   └── key-quotes.md
└── (scripts/)                 # optional, only if intent implies computation

…plus an auto-updated <kb>/.claude-plugin/marketplace.json so the whole KB is one-line installable.

Install locally:

cp -r output/skills/karpathy-thinking ~/.claude/skills/

Share with others — push your KB to GitHub, then anyone runs:

npx skills@latest add <your-org>/<your-repo>

Iterate from chat — compilation is one-shot, but follow-up edits aren't. Inside openkb chat, you can refine without re-running the whole pipeline:

/skill new karpathy-thinking "Reason about transformers like Karpathy"
[generation streams]
> description is too generic, make it about transformer implementations specifically
[agent edits SKILL.md frontmatter in place]

Quality gates — structural validation, trigger-accuracy + body-coverage evaluation, and full history/rollback:

# Lint structure (auto-runs at end of `skill new`)
openkb skill validate karpathy-thinking
openkb skill validate --strict          # treat warnings as failures

# Does the description actually fire when it should?
openkb skill eval karpathy-thinking --save

# History + rollback if a new iteration regresses
openkb skill history karpathy-thinking
openkb skill rollback karpathy-thinking --to 2
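As a rough picture of what a structural frontmatter check involves, the sketch below parses simple key: value frontmatter and reports missing fields. It is not the implementation behind openkb skill validate; the function names are hypothetical, the required fields (name, description) are an assumption, and a real validator would use a proper YAML parser.

```python
from typing import Dict, List, Sequence


def parse_frontmatter(text: str) -> Dict[str, str]:
    """Parse top-level 'key: value' lines between leading '---' delimiters.

    Returns an empty dict when the frontmatter is missing or unterminated.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: Dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        # Only accept unindented keys; nested YAML is out of scope for this sketch.
        if sep and key.strip() and not key.startswith((" ", "\t")):
            fields[key.strip()] = value.strip()
    return {}  # no closing delimiter found


def check_skill(text: str, required: Sequence[str] = ("name", "description")) -> List[str]:
    """Return a list of problems; an empty list means the check passed."""
    fields = parse_frontmatter(text)
    return [f"missing or empty field: {key}" for key in required if not fields.get(key)]
```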

Configuration

Settings are initialized by openkb init and stored in .openkb/config.yaml:

model: gpt-5.4                   # LLM model (any LiteLLM-supported provider)
language: en                     # Wiki output language
pageindex_threshold: 20          # PDF pages threshold for PageIndex

entity_types (optional): a YAML list overriding the entity-type vocabulary used for entity pages; omit it to use the default person, organization, place, product, work, event, other.

extra_headers (optional): a YAML mapping of extra HTTP headers sent with every LLM request (forwarded to LiteLLM's extra_headers). Useful for providers that expect custom headers, e.g. GitHub Copilot IDE-auth headers:

extra_headers:
  Editor-Version: vscode/1.95.0
  Copilot-Integration-Id: vscode-chat

Subscription-based providers that authenticate via OAuth device flow (e.g. chatgpt/*, github_copilot/*) need no API key — OpenKB skips the missing-key warning for them.
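One way to think about optional settings is as user values layered over defaults. The sketch below assumes the config has already been loaded into a dict. The entity_types default comes from the text above; the other values are copied from the example config and are only assumed to be defaults. resolve_config is a hypothetical name, not an OpenKB function.

```python
import copy
from typing import Any, Dict

# entity_types is the documented default; the rest are taken from the
# example config above and are assumptions, not confirmed defaults.
DEFAULTS: Dict[str, Any] = {
    "model": "gpt-5.4",
    "language": "en",
    "pageindex_threshold": 20,
    "entity_types": ["person", "organization", "place", "product", "work", "event", "other"],
    "extra_headers": {},
}


def resolve_config(user: Dict[str, Any]) -> Dict[str, Any]:
    """Return defaults overridden by user settings, without mutating either input."""
    merged = copy.deepcopy(DEFAULTS)
    merged.update(copy.deepcopy(user))
    return merged


if __name__ == "__main__":
    cfg = resolve_config({"model": "anthropic/claude-sonnet-4-6",
                          "entity_types": ["person", "gene", "disease"]})
    print(cfg["model"], cfg["entity_types"], cfg["pageindex_threshold"])
```

Note that a user-supplied entity_types list replaces the default list wholesale, matching the "overriding" wording above.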

Model names use provider/model LiteLLM format (OpenAI models can omit the prefix):

| Provider | Model example |
|---|---|
| OpenAI | gpt-5.4 |
| Anthropic | anthropic/claude-sonnet-4-6 |
| Gemini | gemini/gemini-3.1-pro-preview |

PageIndex Integration

Long documents are challenging for LLMs due to context limits, context rot, and summarization loss. PageIndex solves this with vectorless, reasoning-based retrieval — building a hierarchical tree index that lets LLMs reason over the index for context-aware retrieval.

PageIndex runs locally by default using the open-source version, with no external dependencies required.

Optional: Cloud Support

For large or complex PDFs, PageIndex Cloud can be used to access additional capabilities, including:

  • OCR support for scanned PDFs (via hosted VLM models)
  • Faster structure generation
  • Scalable indexing for large documents

Set PAGEINDEX_API_KEY in your .env to enable cloud features:

PAGEINDEX_API_KEY=your_pageindex_api_key

AGENTS.md

The wiki/AGENTS.md file defines wiki structure and conventions. It's the LLM's instruction manual for maintaining the wiki. Customize it to change how your wiki is organized.

At runtime, the LLM reads AGENTS.md from disk, so your edits take effect immediately.

Using with Obsidian

OpenKB's wiki is a directory of Markdown files with [[wikilinks]]. Obsidian renders it natively.

  1. Open wiki/ as an Obsidian vault
  2. Browse summaries, concepts, and explorations
  3. Use graph view to see knowledge connections
  4. Use Obsidian Web Clipper to add web articles to raw/
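Because the wiki is plain Markdown with [[wikilinks]], it is easy to inspect with a few lines of code. The sketch below extracts link targets and reports links that point at missing pages. It assumes link targets are page paths without the .md extension and handles only the common [[target]], [[target|alias]], and [[target#heading]] forms; it is not OpenKB's lint implementation.

```python
import re
from typing import Dict, List

# Captures the target of [[target]], [[target|alias]], and [[target#heading]].
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def extract_links(markdown: str) -> List[str]:
    """Return wikilink targets in order of appearance."""
    return [target.strip() for target in WIKILINK.findall(markdown)]


def broken_links(pages: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each page name to the link targets that do not exist in `pages`."""
    known = set(pages)
    report: Dict[str, List[str]] = {}
    for name, body in pages.items():
        missing = [t for t in extract_links(body) if t not in known]
        if missing:
            report[name] = missing
    return report


if __name__ == "__main__":
    wiki = {
        "index": "See [[concepts/attention]] and [[concepts/missing|a gap]].",
        "concepts/attention": "Background in [[index#Overview]].",
    }
    print(broken_links(wiki))
```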

Using with Claude Code / Codex / Gemini CLI

OpenKB ships a SKILL.md so any agent CLI can read your compiled wiki — no extra runtime, no MCP setup, just install the skill once.

Claude Code:

/plugin marketplace add VectifyAI/OpenKB
/plugin install openkb@vectify

Gemini CLI:

gemini skills install https://github.com/VectifyAI/OpenKB.git --path skills/openkb --consent

OpenAI Codex CLI (no marketplace command yet — manual symlink):

git clone https://github.com/VectifyAI/OpenKB.git ~/openkb-src
mkdir -p ~/.agents/skills
ln -s ~/openkb-src/skills/openkb ~/.agents/skills/openkb

The skill is read-only by default: it won't run openkb add, remove, or lint --fix unless you ask it to. See skills/openkb/SKILL.md for the full instruction set.

🧭 Learn More

Compared to Karpathy's Approach

|  | Karpathy's workflow | OpenKB |
|---|---|---|
| Short documents | LLM reads directly | markitdown → LLM reads |
| Long documents | Context limits, context rot | PageIndex tree index |
| Supported formats | Web clipper → .md | PDF, Word, PPT, Excel, HTML, text, CSV, .md |
| Wiki compilation | LLM agent | LLM agent (same) |
| Q&A | Query over wiki | Wiki + PageIndex retrieval |

The Stack

  • PageIndex — Vectorless, reasoning-based document indexing and retrieval
  • markitdown — Universal file-to-markdown conversion
  • OpenAI Agents SDK — Agent framework (supports non-OpenAI models via LiteLLM)
  • LiteLLM — Multi-provider LLM gateway
  • Click — CLI framework
  • watchdog — Filesystem monitoring

Roadmap

  • Extend long document handling to non-PDF formats
  • Scale to large document collections with nested folder support
  • Hierarchical concept (topic) indexing for massive knowledge bases
  • Database-backed storage engine
  • Web UI for browsing and managing wikis

Contributing

Contributions are welcome! Please submit a pull request, or open an issue for bugs or feature requests. For larger changes, consider opening an issue first to discuss the approach.

License

Apache 2.0. See LICENSE.

Support Us

If you find OpenKB useful, please give us a star 🌟 — and check out PageIndex too!
