LSP-powered code intelligence with AI semantic search, exposed as an MCP server
Codebase Insights
Persistent code intelligence for AI coding agents.
Codebase Insights combines Language Server Protocol (LSP) analysis, LLM-generated summaries, and semantic embeddings to build a structured, reusable understanding of a codebase.
It exposes this understanding through an MCP (Model Context Protocol) server, making it usable from MCP-compatible clients such as Claude Desktop, GitHub Copilot, and other agent workflows.
Why this exists
Most coding agents still rely heavily on keyword search over guessed context when trying to find relevant code. That causes three recurring problems:
- **Noisy retrieval**: Lexical matches often find superficially similar code while missing the symbol or implementation that actually matters.
- **Weak structural understanding**: Definitions, references, implementations, and symbol hierarchies are crucial for navigating real codebases, but plain-text search does not model them directly.
- **No durable understanding across sessions**: Agents often pay the repository exploration cost over and over again, even when nothing has changed.
Codebase Insights addresses these limitations by combining:
- LSP servers for precise symbol extraction and navigation
- LLM summarization for higher-level semantic understanding
- Vector embeddings for natural-language retrieval
- Persistent local indexes so unchanged code does not need to be rediscovered from scratch
The result is a reusable code-understanding layer for intelligent coding assistants.
What it provides
- Workspace-wide symbol indexing
- Natural-language semantic code search
- Definition / references / implementation navigation via LSP
- Incremental re-indexing driven by file watching and hashes
- Persistent codebase understanding across sessions
- MCP server integration for AI clients and agents
Benchmark Highlights
Benchmark results below are from codebase-insights v0.1.1 on a real Electron + Vue + React Native monorepo (G:\SyntaxSenpai) with 118 files, 5,587 symbols, and 32,570 cross-references.
| Metric | Value |
|---|---|
| Full pipeline wall time | 427.7s (~7.1 min) |
| Pre-server startup | 6.17s |
| Workspace indexing | 62.01s |
| Semantic indexing | 178.26s |
| File summaries | 40.51s |
| Project summary (full) | 138.06s |
| Storage footprint | 13.80 MB |
| Peak RSS | 3,201 MiB |
| Retrieval Hit@1 | 68.4% |
| Retrieval Hit@3 | 89.5% |
| Retrieval Hit@5 | 100% |
| No-change catch-up | 0.05s |
Incremental updates
| Scenario | Total time |
|---|---|
| No change | 0.05s |
| Leaf-file edit | ~18s |
| Core-file edit | ~19s |
| New file | ~21s |
Retrieval quality highlights
- Semantic search outperforms keyword-only symbol matching on concept-level queries
- A single natural-language query for streaming surfaced provider `stream()` implementations across multiple backends
- LSP navigation resolved:
  - 23 references to `BaseAIProvider`
  - 24 implementations / subclasses
- Top-5 retrieval hit rate reached 100% on the benchmark query set

See the full benchmark report in `docs/benchmark.md` for methodology, per-query results, incremental update scenarios, and failure analysis.
Features
- Multi-language support — Python, JavaScript/TypeScript, C++, Rust via standard LSP servers
- Symbol indexing — full workspace scan with file-watch-driven incremental re-indexing
- Semantic search — AI-generated summaries + embeddings for natural-language retrieval
- Hybrid ranking — blends lexical matching with vector similarity and reference-aware ranking
- Flexible model backends — Ollama (local) or OpenAI-compatible APIs
- MCP server — exposes all capabilities over HTTP for any MCP-compatible client
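As a rough illustration of the hybrid ranking idea, the sketch below blends a lexical score with a vector-similarity score and applies a reference-aware boost. This is a minimal illustration only: the weights, the `log1p` boost, and the candidate scores are assumptions, not the project's actual scoring code.

```python
import math

def hybrid_score(lexical_score: float, vector_score: float, ref_count: int,
                 w_lex: float = 0.4, w_vec: float = 0.6) -> float:
    """Blend a lexical match score with a vector-similarity score,
    then boost symbols that are referenced more often."""
    base = w_lex * lexical_score + w_vec * vector_score
    # Reference-aware boost: heavily referenced symbols rank slightly higher.
    return base * (1.0 + 0.1 * math.log1p(ref_count))

# Rank candidate symbols (scores here are made-up illustrative values).
candidates = [
    ("stream_helper", 0.9, 0.3, 2),           # strong lexical, weak semantic match
    ("OllamaProvider.stream", 0.4, 0.9, 23),  # strong semantic match, many refs
]
ranked = sorted(candidates, key=lambda c: -hybrid_score(*c[1:]))
print(ranked[0][0])  # → OllamaProvider.stream
```

The point of the blend is that a semantically relevant, widely referenced symbol can outrank a purely lexical match.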
Architecture
src/codebase_insights/
├── main.py CLI entry point & startup orchestration
├── language_analysis.py Detects languages; parses .gitignore
├── LSP.py Async LSP client (hover, definition, references, symbols, …)
├── workspace_indexer.py Indexes symbols into SQLite; watches for file changes
├── semantic_indexer.py LLM summarization + ChromaDB vector indexing & search
├── semantic_config.py TOML config loader with interactive first-time setup wizard
└── mcp_server.py MCP server exposing all tools over HTTP
On-disk artifacts
These are created at the target project root and automatically added to .gitignore:
| File / Directory | Purpose |
|---|---|
| File / Directory | Purpose |
|---|---|
| `.codebase-index.db` | SQLite symbol database |
| `.codebase-semantic/` | ChromaDB vector store |
| `.codebase-insights.toml` | Configuration file |
How it works
1. **Startup**: Detects languages, validates required LSP servers, and initializes LSP clients.
2. **Workspace indexing**: Scans the repository with LSP `documentSymbol`, stores symbols and references in SQLite, and monitors changes with filesystem watching.
3. **Semantic indexing**: Extracts source context for qualifying symbols, generates short LLM summaries, and stores embeddings in ChromaDB.
4. **File and project summarization**: Generates file-level summaries and maintains a project summary for higher-level semantic retrieval and context.
5. **Incremental updates**: Uses file hashes and symbol-content hashes to skip unchanged work and only reprocess modified or new symbols.
6. **MCP query serving**: Exposes symbol and semantic capabilities over MCP:
   - `query_symbols(...)` reads from SQLite
   - `semantic_search(...)` performs hybrid lexical + vector ranking
   - `lsp_*` tools expose structural navigation directly from language servers
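The hash-based skipping behind the incremental updates can be sketched as follows. This is a simplified illustration under stated assumptions: the real logic lives in `workspace_indexer.py` and `semantic_indexer.py`, and the function names here are hypothetical.

```python
import hashlib
from pathlib import Path

def content_hash(text: str) -> str:
    """Stable digest of a file's (or symbol's) source text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def changed_files(root: Path, stored: dict[str, str]) -> list[Path]:
    """Return files whose current hash differs from the stored one.
    Unchanged files are skipped entirely -- this is what makes the
    no-change catch-up near-instant."""
    out = []
    for path in root.rglob("*.py"):
        digest = content_hash(path.read_text(encoding="utf-8"))
        if stored.get(str(path)) != digest:
            out.append(path)
    return out
```

On a repository with no edits, every stored hash matches and the function returns an empty list, so no re-indexing work is scheduled.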
Why not just use keyword search?
Keyword search is still useful, but it breaks down quickly when agents need to reason about structure and behavior.
| Task | Keyword search | Codebase Insights |
|---|---|---|
| Find exact symbol names | Good | Good |
| Find code by concept or behavior | Weak | Strong |
| Jump to definitions | Manual / indirect | Built-in |
| Find references | Approximate | Precise via LSP |
| Find implementations / subclasses | Hard | Built-in |
| Reuse code understanding across sessions | No | Yes |
| Reduce repeated exploration cost | No | Yes |
A concrete benchmark example:
- Query: “LLM streaming response handling”
- Semantic search returns provider `stream()` implementations directly
- Keyword symbol search mostly returns names that merely contain `"stream"`, such as helpers or chunking utilities
That difference matters a lot for coding agents.
Prerequisites
- Python 3.11+
- At least one LSP server matching the language(s) in the target repository
- Either:
- Ollama running locally, or
- an OpenAI-compatible API key
Supported LSP servers
| Language | Server | Install |
|---|---|---|
| Python | `pylsp` | `pip install python-lsp-server` |
| JavaScript / TypeScript | `typescript-language-server` | `npm install -g typescript-language-server` |
| C++ | `clangd` | clangd.llvm.org |
| Rust | `rust-analyzer` | `rustup component add rust-analyzer` |
Optional Python LSP plugins:
- `python-lsp-ruff`
- `python-lsp-black`
- `pylsp-mypy`
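Before the first run, it can be useful to check which of the supported servers are already on PATH. A quick shell sketch (not part of the tool itself):

```shell
# Report which of the supported LSP servers are installed and on PATH.
for srv in pylsp typescript-language-server clangd rust-analyzer; do
    if command -v "$srv" >/dev/null 2>&1; then
        echo "$srv: found"
    else
        echo "$srv: missing"
    fi
done
```

Only the servers matching the languages in your target repository actually need to be installed.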
Installation
From PyPI
pip install codebase-insights
From source
git clone https://github.com/your-org/codebase-insights
cd codebase-insights
pip install -e .
Quick start
Ollama
# Terminal 1
ollama serve
# Terminal 2
codebase-insights /path/to/your/project
OpenAI-compatible API
export OPENAI_API_KEY="sk-..."
codebase-insights /path/to/your/project --new-config
# choose "openai" when prompted for chat and embed providers
On first run, an interactive wizard creates .codebase-insights.toml and helps configure model providers and indexing settings.
The MCP server starts on `http://127.0.0.1:6789/mcp` using streamable HTTP transport.
Usage
codebase-insights <project_root> [options]
CLI options
| Flag | Description |
|---|---|
| `--new-config` | Re-run the setup wizard, overwriting the existing config |
| `--rebuild-index` | Drop and rebuild the SQLite symbol index from scratch |
| `--rebuild-semantic` | Drop all LLM summaries and ChromaDB vectors, and regenerate everything |
| `--rebuild-summaries` | Regenerate only file/project summaries (keeps symbol summaries) |
| `--rebuild-vectors` | Re-embed existing summaries with the current embedding model (no LLM calls) |
MCP tools
Once running, the following tools are available to connected MCP clients:
| Tool | Description |
|---|---|
| `languages_in_codebase()` | List detected languages in the project |
| `lsp_capabilities()` | Query active LSP server capabilities |
| `lsp_hover(file_uri, line, character)` | Type information and documentation at a position |
| `lsp_definition(file_uri, line, character)` | Jump to definition |
| `lsp_declaration(file_uri, line, character)` | Find declarations |
| `lsp_implementation(file_uri, line, character)` | Find implementations |
| `lsp_references(file_uri, line, character)` | Find all references to a symbol |
| `lsp_document_symbols(file_uri)` | List all symbols in a file |
| `query_symbols(path, kinds, name_query, limit)` | Query the SQLite index by path, kind, or name |
| `semantic_search(query, limit, kinds)` | Natural-language semantic search |
`file_uri` should use normal file URIs such as `file:///G:/repo/path/to/file.ts`.
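Python's `pathlib` can build such URIs from absolute paths. A small convenience sketch (the helper name is hypothetical; on Windows, `as_uri()` produces the drive-letter form shown above):

```python
from pathlib import Path

def to_file_uri(path: str) -> str:
    """Convert an absolute filesystem path to the file:// URI form
    expected by the lsp_* tools."""
    return Path(path).as_uri()  # raises ValueError for relative paths

print(to_file_uri("/repo/path/to/file.ts"))  # → file:///repo/path/to/file.ts
```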
Configuration
The config file .codebase-insights.toml is created interactively on first run.
Example
[chat]
provider = "ollama" # "ollama" | "openai"
[chat.ollama]
base_url = "http://localhost:11434"
model = "qwen2.5"
[embed]
provider = "ollama"
[embed.ollama]
model = "bge-m3"
[semantic]
index_kinds = ["Class", "Method", "Function", "Interface", "Enum", "Constructor"]
concurrency = 16
batch_size = 16
min_ref_count = 1
[ranking]
# noise penalties and re-ranking weights (see semantic_config.py for defaults)
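The `[semantic]` section controls which symbols get summarized and embedded; its filtering effect can be sketched as follows. This is a simplified illustration only; the field names passed in are assumptions, not the project's internal schema.

```python
def should_semantic_index(kind: str, ref_count: int,
                          index_kinds: list[str], min_ref_count: int) -> bool:
    """A symbol is summarized and embedded only if its kind is whitelisted
    (index_kinds) and it is referenced at least min_ref_count times."""
    return kind in index_kinds and ref_count >= min_ref_count

index_kinds = ["Class", "Method", "Function", "Interface", "Enum", "Constructor"]
print(should_semantic_index("Function", 3, index_kinds, 1))  # → True
print(should_semantic_index("Variable", 9, index_kinds, 1))  # → False
```

Raising `min_ref_count` or shrinking `index_kinds` cuts LLM and embedding cost at the price of coverage, which is the trade-off noted under Known limitations.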
Environment variable overrides
These environment variables override the corresponding TOML values:
- `OPENAI_API_KEY`
- `OPENAI_BASE_URL`
Example use cases
Codebase Insights is useful for questions like:
- “Find all implementations of this provider interface.”
- “What handles WebSocket messages in this repo?”
- “Where is configuration loaded and applied?”
- “Show me every reference to this base class.”
- “Find the real code responsible for streaming responses.”
- “Search the repository by behavior, not just by exact symbol names.”
It is particularly effective for agents that need to:
- move from natural-language intent to a likely symbol
- expand from that symbol to definitions, references, and implementations
- reduce file-opening and grep iteration overhead
- retain reusable codebase understanding across sessions
Known limitations
Current limitations and trade-offs include:
- **Only a subset of symbols is semantically indexed**: Filtering by symbol kind and reference count improves quality and cost, but reduces coverage.
- **Inline or anonymous logic is harder to retrieve semantically**: For example, anonymous callbacks and ad-hoc `try/catch` behavior do not form named symbols.
- **Convention-based framework behavior may be less visible**: File-system routing and other convention-heavy patterns may not map cleanly to LSP symbol graphs.
- **Full project summaries remain relatively expensive**: Incremental project-summary updates are much faster now, but full project summarization is still one of the largest rebuild costs.
Dependencies
| Package | Purpose |
|---|---|
| `mcp[cli]` | MCP server framework |
| `watchdog` | Filesystem monitoring |
| `langchain`, `langchain-ollama`, `langchain-openai` | LLM and embedding integration |
| `langchain-chroma`, `chromadb` | Vector store |
| `tqdm` | Progress bars |
Project status
Codebase Insights is an early-stage but functional code-understanding platform.
Validated so far:
- workspace-wide symbol indexing
- LSP-backed navigation
- semantic retrieval over indexed symbols
- persistent on-disk indexes
- near-zero no-change catch-up
- practical incremental update behavior
- incremental project summary updates
- automated benchmark coverage
Still improving:
- retrieval quality on diffuse / anonymous logic
- indexing coverage trade-offs
- performance on larger repositories
- ergonomics and documentation
Contributing
Contributions, benchmark results, bug reports, and design feedback are welcome.
Especially valuable areas include:
- performance optimization
- retrieval quality tuning
- incremental update behavior
- support for more repo styles and languages
- benchmark automation
- MCP client ergonomics
License
This project is licensed under the MIT License. See the LICENSE file for details.
File details
Details for the file codebase_insights-0.1.1.tar.gz.
File metadata
- Download URL: codebase_insights-0.1.1.tar.gz
- Upload date:
- Size: 264.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `69cab304169119bc31550c46392f72deba1b43255e8f0652ac4f2bffb3be1100` |
| MD5 | `2abc552230902f169dab00d45f887999` |
| BLAKE2b-256 | `d63de8974a2040d9ef63a99c93ec6d7bcbb3771542aeb85afbb0ffc0ec91e67a` |
Provenance
The following attestation bundles were made for codebase_insights-0.1.1.tar.gz:
Publisher: publish.yml on JimmyfaQwQ/Codebase-Insights

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: codebase_insights-0.1.1.tar.gz
- Subject digest: 69cab304169119bc31550c46392f72deba1b43255e8f0652ac4f2bffb3be1100
- Sigstore transparency entry: 1293654641
- Sigstore integration time:
- Permalink: JimmyfaQwQ/Codebase-Insights@e3e8b0a42440a102af118684fa5656b20dec84d7
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/JimmyfaQwQ
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@e3e8b0a42440a102af118684fa5656b20dec84d7
- Trigger Event: push
File details
Details for the file codebase_insights-0.1.1-py3-none-any.whl.
File metadata
- Download URL: codebase_insights-0.1.1-py3-none-any.whl
- Upload date:
- Size: 50.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `fc90357a08e65d570e27fa59237a7be1cb2d37fd10bd312745b5408ae76f5014` |
| MD5 | `b55fa63cfe507c89d92b63cda3770f4b` |
| BLAKE2b-256 | `9b952b057528f16d758d76bd0b581eccb7616038a01bd1410f9ae996387f5d20` |
Provenance
The following attestation bundles were made for codebase_insights-0.1.1-py3-none-any.whl:
Publisher: publish.yml on JimmyfaQwQ/Codebase-Insights

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: codebase_insights-0.1.1-py3-none-any.whl
- Subject digest: fc90357a08e65d570e27fa59237a7be1cb2d37fd10bd312745b5408ae76f5014
- Sigstore transparency entry: 1293654647
- Sigstore integration time:
- Permalink: JimmyfaQwQ/Codebase-Insights@e3e8b0a42440a102af118684fa5656b20dec84d7
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/JimmyfaQwQ
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@e3e8b0a42440a102af118684fa5656b20dec84d7
- Trigger Event: push