KBLens
A progressive-disclosure code knowledge base generator for large C++ codebases. KBLens uses tree-sitter to extract AST skeletons, packs them into LLM-friendly batches, and generates hierarchical Markdown summaries — giving AI coding assistants structured context about your codebase without reading every file.
Why KBLens
Large codebases (100K+ files) are too big for LLMs to consume directly. Without structured context, AI assistants either hallucinate or say "I don't know" when asked about internal systems.
KBLens solves this by generating a three-layer knowledge base:
| Layer | Path | Content |
|---|---|---|
| L0 | INDEX.md | Project overview + package directory |
| L1 | packages/engine.md | Per-package component listing and architecture |
| L2 | packages/engine/ | Per-component: purpose, key types, public APIs, dependencies |
This gives AI assistants a reliable, searchable reference — like an always-up-to-date architecture document generated from actual code.
Key Features
- AST-based extraction — Uses tree-sitter to extract class/struct/enum/function signatures from C++ headers and source files. No guessing, no hallucination.
- Hierarchical summaries — Three levels of detail (project → package → component) with progressive disclosure. Ask about a package, get the overview. Ask about a class, get the details.
- Incremental updates — Only regenerates components whose source files changed. Tracks changes via file hash. A full run on 200+ components takes ~5 minutes; incremental runs take seconds.
- Change detection — Five-way classification (unchanged / changed / new / deleted / failed) with automatic cleanup of orphaned files and cascade updates to affected packages.
- Multi-source projects — One config file can define multiple source directories. Each source gets its own independent knowledge base with separate INDEX, metadata, and change tracking.
- Concurrent generation — Processes 8 components in parallel with 8 concurrent LLM calls. Includes exponential backoff retry (3 attempts) for transient failures.
- Resume from interruption — Progress is persisted after each component. Ctrl+C and re-run to continue where you left off.
- Live dashboard — Rich terminal UI showing real-time progress, active components, token usage, and error count.
- Anti-hallucination prompts — LLM prompts explicitly forbid speculative language and invented content. Dependencies are only listed when #include directives are visible in the AST.
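KBLens's actual metadata layout is internal, but the incremental-update and change-detection features above boil down to comparing stored content hashes against the current scan. A minimal sketch (function names and the `previous`/`current` shapes are hypothetical, not KBLens's real API):

```python
import hashlib
from pathlib import Path

def file_hash(path: Path) -> str:
    # Content hash used to decide whether a component's sources changed.
    return hashlib.sha256(path.read_bytes()).hexdigest()

def classify_components(previous: dict, current: dict) -> dict:
    """Five-way classification: unchanged / changed / new / deleted / failed.
    `previous` maps name -> {"hash": ..., "status": ...} from stored metadata;
    `current` maps name -> hash from the fresh scan."""
    states = {}
    for name, h in current.items():
        if name not in previous:
            states[name] = "new"
        elif previous[name]["hash"] != h:
            states[name] = "changed"
        elif previous[name].get("status") == "failed":
            states[name] = "failed"   # retried on the next run
        else:
            states[name] = "unchanged"
    for name in previous.keys() - current.keys():
        states[name] = "deleted"      # its orphaned .md files get cleaned up
    return states
```

Only components classified as changed, new, or failed would trigger LLM calls, which is why incremental runs finish in seconds.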
Prerequisites
- Python 3.11+
- C compiler — Required by tree-sitter for grammar compilation (GCC, Clang, or MSVC)
  - On Ubuntu/Debian: sudo apt install build-essential
  - On macOS: Xcode Command Line Tools (xcode-select --install)
  - On Windows: Visual Studio Build Tools or MinGW
Installation
# From PyPI (when published)
pip install kblens
# Or install from GitHub directly
pip install git+https://github.com/disrei/KBLens.git
# Or clone and install in development mode
git clone https://github.com/disrei/KBLens.git
cd KBLens
pip install -e .
# Verify
kblens version
Quick Start
1. Create a configuration
kblens init
This walks you through creating ~/.config/kblens/config.yaml with your source paths and LLM settings.
Or create it manually:
# ~/.config/kblens/config.yaml
version: 1
project: "my_engine"
output_dir: "~/kblens_kb/my_engine"
sources:
- path: "/absolute/path/to/packages"
name: "core"
llm:
model: "gpt-4o-mini"
# api_key: "your-api-key" # see "API Key Security" below
temperature: 0.2
max_concurrent: 8
max_concurrent_components: 8
summary_language: "en"
2. Preview
kblens generate --dry-run
This scans your source, extracts AST, and reports statistics without calling the LLM.
3. Generate
kblens generate
For a project with ~200 components, expect ~5 minutes and ~400K input tokens.
4. Use
The generated knowledge base is a directory of Markdown files. You can:
- Browse directly — Open INDEX.md and navigate through the hierarchy
- Search with grep — Find any class, function, or concept across all summaries
- Integrate with AI tools — Point your coding assistant's skill/tool at the knowledge base directory (see AI Assistant Integration below)
API Key Security
Never commit API keys to version control. Use one of these methods:
- Environment variable (recommended):
  export KBLENS_LLM_KEY=sk-your-key-here
- Local config override — Create a .local.yaml sibling next to your config file:
  # ~/.config/kblens/config.local.yaml (gitignored)
  llm:
    api_key: "sk-your-key-here"
- Config key_env reference — Point to any environment variable:
  llm:
    api_key_env: "MY_OPENAI_KEY"
Configuration
KBLens uses a two-layer config system:
| Layer | Location | Purpose |
|---|---|---|
| Global | ~/.config/kblens/config.yaml | Shared LLM settings, packing parameters |
| Project | ./kblens.yaml in project root | Project-specific sources and output |
Project config overrides global config. Each layer can have a .local.yaml sibling for sensitive values (API keys).
Config Reference
version: 1
project: "my_project" # Project name (displayed in CLI)
output_dir: "~/kblens_kb/my_project" # Knowledge base output root
sources: # Source directories to scan
- path: "/absolute/path/to/src" # Absolute path
name: "core" # Short name (used as subdirectory)
include_extensions: "auto" # "auto" or explicit list: [".h", ".cpp"]
exclude_patterns: # Glob patterns to skip
- "*/test/*"
- "*_test.*"
llm:
model: "gpt-4o-mini" # Any litellm-compatible model
api_base: "https://api.openai.com/v1"
api_key: "sk-..." # Or use api_key_env / KBLENS_LLM_KEY
temperature: 0.2
max_concurrent: 8 # Concurrent LLM calls
max_concurrent_components: 8 # Concurrent component pipelines
packing:
token_budget: 8000 # Target tokens per batch
token_min: 1000 # Minimum batch size
token_max: 24000 # Maximum batch size
component_split_threshold: 200 # File count threshold for splitting
summary_language: "en" # Language for generated summaries
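The packing parameters above control how AST entries are grouped into LLM batches. KBLens's real packer is internal; a simplified greedy sketch under those parameters (the function name and entry shape are illustrative assumptions) might look like:

```python
def pack_batches(entries, token_budget=8000, token_min=1000, token_max=24000):
    """Greedily group (name, token_count) AST entries into batches near
    token_budget. token_max is the hard cap the real packer uses when
    splitting oversized entries; this sketch omits that splitting step."""
    batches, cur, cur_tokens = [], [], 0
    for name, tokens in entries:
        # Start a new batch once adding this entry would exceed the budget.
        if cur and cur_tokens + tokens > token_budget:
            batches.append(cur)
            cur, cur_tokens = [], 0
        cur.append(name)
        cur_tokens += tokens
    if cur:
        # Fold a trailing undersized batch into the previous one.
        if batches and cur_tokens < token_min:
            batches[-1].extend(cur)
        else:
            batches.append(cur)
    return batches
```

Larger `token_budget` values mean fewer, broader LLM calls; smaller values yield more focused leaf summaries at the cost of more requests.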
Environment Variables
| Variable | Purpose |
|---|---|
| KBLENS_LLM_KEY | LLM API key (overrides config) |
CLI Reference
kblens generate # Generate all sources
kblens generate --source core # Generate only the "core" source
kblens generate --dry-run # Preview without LLM calls
kblens generate --config ./my.yaml # Use specific config file
kblens status # Show knowledge base status
kblens monitor # Monitor a running generation
kblens init # Interactive config setup
kblens version # Show version
Output Structure
For a project with two sources:
~/kblens_kb/my_project/
├── core/ # Source: core
│ ├── INDEX.md # L0: package directory with links
│ ├── _meta.json # Component status, hashes, token counts
│ ├── _progress.jsonl # Generation event log
│ └── core/ # packages (same name as source)
│ ├── engine.md # L1: engine package overview
│ ├── engine/
│ │ ├── SoundSystem.md # L2: component overview
│ │ ├── SoundSystem/ # Leaf batch files (large components)
│ │ │ ├── src_reverb.md
│ │ │ └── src_voice.md
│ │ └── Physics.md
│ ├── gameplay.md
│ └── gameplay/
│ └── ...
└── tools/ # Source: tools
├── INDEX.md
└── tools/
└── ...
Markdown Format
Each L2 component file follows a consistent structure:
# ComponentName
## Responsibility
One-to-two sentence description of what this component does.
## Key Types and Relationships
Classes, structs, enums and how they relate.
## Main Public Interfaces
Key methods with signatures.
## Dependencies
Explicit #include paths or "No explicit dependencies visible in AST excerpt."
How It Works
KBLens runs a six-phase pipeline for each source:
- Scan — Walk the directory tree, discover components (package/subdir pairs), count files and lines
- AST Extract — Parse C++ files with tree-sitter, extract class/struct/enum/function skeletons and #include directives
- Pack — Group AST entries into token-budgeted batches, create aggregation groups for large components
- Leaf Summarize — Send each batch to the LLM for a focused summary (Phase 4)
- Aggregate — Merge leaf summaries upward: fragments → component overview → package overview → INDEX (Phase 5a-5d)
- Write — Persist Markdown files and update _meta.json incrementally
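The leaf-summarize phase combines the bounded concurrency (8 concurrent LLM calls) and exponential-backoff retry (3 attempts) described under Key Features. How KBLens wires this internally isn't documented here; a stdlib-only asyncio sketch of the general pattern (all names hypothetical) could be:

```python
import asyncio
import random

async def with_retry(call, attempts=3, base_delay=0.01):
    # Exponential backoff with jitter for transient LLM failures (sketch).
    for i in range(attempts):
        try:
            return await call()
        except Exception:
            if i == attempts - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** i + random.random() * base_delay)

async def summarize_all(batches, llm_call, max_concurrent=8):
    # Bound the number of in-flight LLM requests with a semaphore.
    sem = asyncio.Semaphore(max_concurrent)

    async def one(batch):
        async with sem:
            # A fresh coroutine is created for every retry attempt.
            return await with_retry(lambda: llm_call(batch))

    # gather() preserves input order, so summaries line up with batches.
    return await asyncio.gather(*(one(b) for b in batches))
```

Bounding concurrency keeps the generator inside typical API rate limits while still saturating throughput on large component sets.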
Incremental Behavior
On subsequent runs:
- Unchanged components are skipped entirely (hash match)
- Changed components are regenerated, and their package's L1 overview is updated
- New components are generated and added to the package overview
- Deleted components have their .md files and metadata cleaned up
- Failed components (from previous timeouts or errors) are automatically retried
- L0 INDEX is regenerated only if any package changed
Language Support
Currently supports C++ only (.h, .hpp, .cpp, .cc, .cxx). Other file types are detected during scanning but produce 0 AST tokens and are skipped. Components with fewer than 100 AST tokens are excluded from LLM summarization.
AI Assistant Integration
KBLens generates Markdown knowledge bases that can be queried by AI coding assistants. An OpenCode skill template is included in skills/kblens-kb/SKILL.md.
OpenCode Setup
1. Copy the skill to your OpenCode config directory:
   # Linux / macOS
   mkdir -p ~/.config/opencode/skills/kblens-kb
   cp skills/kblens-kb/SKILL.md ~/.config/opencode/skills/kblens-kb/
   # Windows
   mkdir "%USERPROFILE%\.config\opencode\skills\kblens-kb"
   copy skills\kblens-kb\SKILL.md "%USERPROFILE%\.config\opencode\skills\kblens-kb\"
2. The skill automatically reads your ~/.config/kblens/config.yaml to find the knowledge base location.
3. Ask your AI assistant questions about your codebase — it will search the knowledge base for answers.
Other AI Tools
The knowledge base is plain Markdown files. You can integrate it with any AI tool that supports file-based context:
- Add the knowledge base directory as a reference path
- Use grep/search to find relevant .md files
- The three-layer hierarchy (INDEX → package → component) provides natural progressive disclosure
Notes
- The knowledge base uses absolute paths in _meta.json for change tracking. If you move your source code directory, regenerate the knowledge base with kblens generate.
- LLM model compatibility: KBLens uses litellm under the hood, so any model supported by litellm will work (OpenAI, Anthropic, local Ollama, etc.).
License
MIT — see LICENSE.