LSP-powered code intelligence with AI semantic search, exposed as an MCP server
Codebase Insights
Codebase Insights is a local MCP server that gives coding agents a persistent model of your repository instead of making them rediscover it from scratch every session.
It combines four layers:
- LSP navigation for definitions, references, implementations, hover, and document symbols
- SQLite indexing for persistent workspace-wide symbol and reference lookup
- LLM-generated summaries for symbols, files, and the whole project
- Vector retrieval for natural-language search by behavior or intent
The point is not just “better search”. The point is giving an agent a cheaper, more reliable way to decide what code to read.
Why it exists
Text search works when you already know the name.
Agents usually do not. The real query looks more like this:
- “Where is the code that strips HTML?”
- “Which file owns the IPC bridge?”
- “What builds the system prompt?”
- “Show me the interface, then the implementations, then the references.”
Without a persistent index, the default workflow is wasteful:
- Guess keywords.
- Run repeated grep or symbol-name searches.
- Open a lot of files that turn out not to matter.
- Carry all of that source text forward in context.
Codebase Insights short-circuits that loop:
- search_files() finds files by responsibility
- semantic_search() finds symbols by intent
- query_symbols() confirms exact names from SQLite
- LSP tools expand from there with precise structural navigation
That means fewer blind reads, less irrelevant context, and less token waste.
What it is good at
Codebase Insights is most useful when the caller knows what the code does, but not what it is called.
Examples:
- “find the function that strips HTML tags”
- “find the module responsible for encrypted key storage”
- “find the runtime object that tracks token usage”
- “find the implementation(s) of this interface and then all references”
It is less valuable when the target is already known and you just need to open one file.
Why it helps agents spend fewer tokens
Every exploratory file read adds raw source code to the conversation. In a ReAct-style loop, that source stays in context and makes later turns more expensive.
Codebase Insights reduces that cost by returning compact routing information first: ranked files, ranked symbols, and stored summaries. The agent can narrow the search space before it opens any source.
| Waste in a browse-first workflow | What Codebase Insights changes |
|---|---|
| Repeated keyword reformulation | Semantic retrieval accepts the original intent directly |
| Large irrelevant search dumps | Ranked symbols and files come back with short summaries |
| Re-exploring the same repo every session | SQLite and ChromaDB persist the repo model locally |
The difference matters most on concept-heavy tasks where the names are not obvious.
Benchmark Highlights
v1.2.0 note: this release is a pure TUI and infrastructure change (Textual CLI, section-routed logging, builtins.print patch, debounce removal). No indexing or retrieval logic changed, so the v1.1.1 numbers below still represent the current retrieval quality and token savings.
Agent token benchmark
The most recent five demo-agent benchmark runs all used the same model, the same task (syntaxsenpai-backup-import-restore), and the same baseline vs enhanced comparison in scripts/demo_agent_benchmark.py.
Across those five runs, the enhanced workflow consistently replaced exploratory browsing with summary-guided routing:
- total tokens fell from 7,036,458 to 3,189,240 (-54.7%)
- input tokens fell from 6,967,040 to 3,128,182 (-55.1%)
- agent turns fell from 187 to 116 (-38.0%)
- raw view calls fell from 73 to 22 (-69.9%)
- elapsed time fell from 2068.8s to 1329.1s (-35.8%)
The stable pattern in the logs is simple: baseline spends discovery turns on grep plus view, while enhanced starts with lsp_capabilities, get_indexer_criteria, get_project_summary, semantic_search, and get_file_summary, then opens only the confirmed owner files.
Read the consolidated report in benchmarks/agent-token-benchmark-v1.1.0.md.
Quick start
Requirements
- Python 3.11+
- At least one supported LSP server for the target repository
- Either Ollama or an OpenAI-compatible provider for chat and embeddings
Install
From PyPI:
pip install codebase-insights
From source:
pip install -e .
Supported LSP servers
| Language | Server | Install |
|---|---|---|
| Python | pylsp | pip install python-lsp-server |
| JavaScript / TypeScript | typescript-language-server | npm install -g typescript-language-server |
| C++ | clangd | https://clangd.llvm.org/installation.html |
| Rust | rust-analyzer | rustup component add rust-analyzer |
Optional Python plugins:
- python-lsp-ruff
- python-lsp-black
- pylsp-mypy
For C and C++, indexing quality depends on clangd having enough project context, usually through compile_commands.json or compile_flags.txt. For example, CMake projects can generate compile_commands.json by configuring with CMAKE_EXPORT_COMPILE_COMMANDS=ON.
Run it
codebase-insights <project_root> [options]
Quick start with Ollama:
# Terminal 1
ollama serve
# Terminal 2
codebase-insights /path/to/project
Quick start with an OpenAI-compatible provider:
# Export your API key first — it is never stored in the config file
export OPENAI_API_KEY=sk-...
codebase-insights /path/to/project --new-config
On first run, Codebase Insights creates .codebase-insights.toml using an interactive setup wizard.
How it works
- Detect supported languages in the target repo and start one LSP client per language.
- Walk the workspace, respect .gitignore, flatten documentSymbol output, and store symbols plus references in SQLite.
- Summarize eligible symbols with an LLM and store embeddings in ChromaDB.
- Generate per-file summaries and a project summary for higher-level routing.
- Keep the index warm incrementally by hashing files and only updating what changed.
- Expose the whole surface over MCP on http://127.0.0.1:6789/mcp.
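The incremental step is what keeps re-runs cheap. The sketch below is not the project's actual indexer, just a minimal illustration of hash-based change detection, assuming SHA-256 content hashes and a plain dict as the recorded state:

```python
import hashlib
from pathlib import Path

def content_hash(path: Path) -> str:
    """Hash a file's bytes; if the hash is unchanged, the existing index entry is still valid."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def files_needing_reindex(workspace: Path, stored: dict[str, str]) -> list[Path]:
    """Return only files whose content differs from the previously recorded hash."""
    changed: list[Path] = []
    for path in sorted(workspace.rglob("*")):
        if not path.is_file():
            continue  # the real walker also respects .gitignore and per-language filters
        digest = content_hash(path)
        if stored.get(str(path)) != digest:
            stored[str(path)] = digest
            changed.append(path)
    return changed
```

In the real server the recorded state is persisted locally (the SQLite index and ChromaDB store survive between sessions), so this comparison does not start from scratch each run.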
Recommended agent workflow
If you want token savings, the workflow matters.
Use Codebase Insights as a substitute for broad file browsing, not as a warm-up before browsing anyway.
Recommended sequence:
1. get_project_summary() to orient on architecture.
2. search_files() or semantic_search() to narrow candidates.
3. query_symbols() when you need exact-name or kind filtering.
4. get_file_summary() before opening source.
5. lsp_definition(), lsp_implementation(), and lsp_references() once you have the right symbol.
6. view the source only after the summaries point to the right place.
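As a concrete sketch, the same sequence can be written down as code. The call_tool helper, the example query, and the file and symbol names below are hypothetical; only the tool names and their parameters come from the tool tables further down.

```python
def explore(call_tool, task: str = "find the code that strips HTML tags"):
    """Hypothetical agent-side helper: call_tool(name, **args) stands in for any MCP client."""
    call_tool("get_project_summary")                        # 1. orient on the architecture
    call_tool("search_files", query=task, limit=5)          # 2. candidate files by responsibility
    call_tool("semantic_search", query=task, limit=10)      #    and candidate symbols by intent
    call_tool("query_symbols", name_query="strip_html")     # 3. confirm exact names (name is illustrative)
    call_tool("get_file_summary", file_path="src/html_utils.py")  # 4. summary first (path is illustrative)
    call_tool("lsp_references",                              # 5. precise structural expansion
              file_uri="file:///abs/project/src/html_utils.py", line=42, character=4)
    # 6. only now open ("view") the source the summaries pointed to
```

The point of the order is that steps 1 through 5 return compact summaries and locations, so the expensive raw reads in step 6 happen only for confirmed owner files.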
Reusable workflow instructions live in codebase-insights-instructions.md.
Best practices for Copilot, Claude Code, and Cursor
For agent-specific prompting guidance, see docs/agent-usage.md.
Use that guide if you want to configure Codebase Insights for GitHub Copilot, Claude Code, or Cursor without bloating the main README.
Configuration
Chat and embedding providers are configured independently, so combinations like OpenAI for chat plus Ollama for embeddings are supported.
Default configuration shape:
[chat]
provider = "ollama"
[chat.ollama]
base_url = "http://localhost:11434"
model = "qwen2.5"
[embed]
provider = "ollama"
[embed.ollama]
base_url = "http://localhost:11434"
model = "bge-m3"
[semantic]
index_kinds = ["Class", "Method", "Function", "Interface", "Enum", "Constructor"]
concurrency = 16
batch_size = 16
min_ref_count = 3
summary_update_threshold = 5
summary_file_idle_timeout = 30
summary_project_idle_timeout = 300
Summary refresh controls:
| Setting | Default | Meaning |
|---|---|---|
| summary_update_threshold | 5 | Refresh summaries after this many files become stale |
| summary_file_idle_timeout | 30 | Refresh one file summary after that file has been idle |
| summary_project_idle_timeout | 300 | Refresh all stale summaries plus the project summary after the repo has been idle |
Set any of these to 0 to disable that trigger. refresh_file_summary() and refresh_project_summary() always force regeneration immediately.
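As a rough illustration of how the three triggers combine, the snippet below restates the documented behaviour in code. It is not the project's implementation; the argument names are invented, and the idle values are assumed to be in the same units as the config settings.

```python
def refresh_actions(stale_files: int, file_idle: float, repo_idle: float, cfg: dict) -> dict[str, bool]:
    """Which documented refresh triggers would fire; a setting of 0 disables that trigger."""
    threshold = cfg.get("summary_update_threshold", 5)
    file_timeout = cfg.get("summary_file_idle_timeout", 30)
    project_timeout = cfg.get("summary_project_idle_timeout", 300)
    return {
        "refresh_stale_summaries": threshold > 0 and stale_files >= threshold,
        "refresh_idle_file": file_timeout > 0 and file_idle >= file_timeout,
        "refresh_project_summary": project_timeout > 0 and repo_idle >= project_timeout,
    }
```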
API keys for OpenAI-compatible providers are read exclusively from environment variables and are never stored in the config file:
| Variable | Used for |
|---|---|
| CODEBASE_INSIGHTS_CHAT_API_KEY | Chat / summarisation (highest priority) |
| CODEBASE_INSIGHTS_EMBED_API_KEY | Embeddings (highest priority) |
| OPENAI_API_KEY | Fallback for both chat and embeddings |
The setup wizard validates that the relevant variable is already exported and warns if it is not found.
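The lookup order is simple to state in code. This is only an illustration of the documented priority, not the project's implementation:

```python
import os

def resolve_api_key(role: str) -> str | None:
    """Scoped variable first, then the shared OPENAI_API_KEY fallback ('role' is 'chat' or 'embed')."""
    scoped = {
        "chat": "CODEBASE_INSIGHTS_CHAT_API_KEY",
        "embed": "CODEBASE_INSIGHTS_EMBED_API_KEY",
    }[role]
    return os.environ.get(scoped) or os.environ.get("OPENAI_API_KEY")
```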
CLI flags
| Flag | Purpose |
|---|---|
| --new-config | Re-run the interactive setup wizard |
| --rebuild-index | Clear and rebuild the SQLite symbol index |
| --rebuild-semantic | Clear all summaries and vector data, then regenerate everything |
| --rebuild-summaries | Rebuild file and project summaries only |
| --rebuild-vectors | Re-embed existing summaries without new LLM summarization |
Normal usage is incremental. Rebuild flags are mainly for maintenance, model changes, or benchmarking.
MCP tools
Workspace and LSP
| Tool | Purpose |
|---|---|
| languages_in_codebase() | Return detected languages |
| lsp_capabilities() | Return active LSP capability information |
| lsp_hover(file_uri, line, character) | Hover docs or type at a position |
| lsp_definition(file_uri, line, character) | Jump to definition |
| lsp_declaration(file_uri, line, character) | Find declaration |
| lsp_implementation(file_uri, line, character) | Find implementations |
| lsp_references(file_uri, line, character) | Find references |
| lsp_document_symbols(file_uri) | List symbols in a file |
Indexed and semantic retrieval
| Tool | Purpose |
|---|---|
| query_symbols(path, kinds, name_query, limit) | Search the SQLite symbol index by path, kind, or fuzzy name |
| get_symbol_summary(name, file_uri, line, character) | Fetch the stored summary for a symbol |
| get_indexer_criteria() | Return semantic indexing thresholds and eligible kinds |
| semantic_search(query, limit, kinds) | Find symbols by natural-language intent |
| search_files(query, limit) | Find files by natural-language responsibility |
| get_file_summary(file_path) | Fetch the stored summary for one file |
| get_project_summary() | Fetch the stored project-level overview |
| refresh_file_summary(file_path) | Force-regenerate one file summary |
| refresh_project_summary() | Force-regenerate all stale file summaries and the project summary |
LSP calls accept either file:///... URIs or absolute filesystem paths.
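To exercise these tools outside of an agent framework, a plain MCP client works too. The sketch below assumes the official MCP Python SDK (the mcp package) and that the server speaks the streamable HTTP transport at its default endpoint; treat it as a sketch rather than a guaranteed interface.

```python
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

async def main() -> None:
    # Default local endpoint exposed by codebase-insights.
    async with streamablehttp_client("http://127.0.0.1:6789/mcp") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Natural-language intent instead of a keyword guess.
            result = await session.call_tool(
                "semantic_search",
                {"query": "function that strips HTML tags", "limit": 5},
            )
            print(result.content)

asyncio.run(main())
```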
Generated files in the indexed repository
| Path | Purpose |
|---|---|
| .codebase-index.db | Persistent SQLite symbol and reference database |
| .codebase-semantic/ | ChromaDB data for symbol and file summaries |
| .codebase-insights.toml | Local provider and indexing configuration |
| .codebase-insights.lock.toml | Records the active embedding model to detect configuration drift |
If the target repo already has a .gitignore, Codebase Insights automatically adds .codebase-index.db to it. The other generated artifacts should also be ignored in normal use.
Repository layout
src/codebase_insights/
├── main.py CLI entry point and startup orchestration
├── cli_io.py Section-routed I/O facade and builtins.print patch for TUI
├── tui.py Textual TUI app (section switcher, overview, semantic watch)
├── LSP.py LSP client wrapper
├── language_analysis.py Language detection and .gitignore parsing
├── workspace_indexer.py SQLite symbol/reference index plus file watching
├── semantic_config.py Config loading, first-run setup wizard, embed lock file
├── semantic_indexer.py LLM summaries, embeddings, ranking, file/project summaries
├── mcp_server.py MCP tool surface
scripts/
├── benchmark_monitor.py Benchmark process monitor
├── demo_agent_benchmark.py Demo benchmark script
├── run_benchmark.py Benchmark orchestrator
docs/
├── agent-token-benchmark-v1.1.0.md Historical archived single-run benchmark
└── agent-usage.md Agent-specific usage guidance
benchmarks/
├── benchmark-v*.md Versioned indexing and retrieval reports
└── agent-token-benchmark-v*.md Agent token A/B benchmark reports
Limitations
- Not every symbol gets an AI summary. Semantic indexing is intentionally selective.
- Named declarations work better than anonymous inline logic.
- LSP quality sets the floor. If the language server cannot understand the workspace, indexing quality will degrade.
- Semantic search is a routing tool, not a proof of completeness.
- Timing numbers are environment-specific; retrieval quality is the more stable signal.
Benchmark material
- Latest benchmark report: benchmarks/benchmark-v1.1.0.md
- Latest agent token benchmark: benchmarks/agent-token-benchmark-v1.1.0.md
- Full agent context appendix: benchmarks/agent-token-benchmark-context-v1.1.0.md
- Historical archived agent token benchmark: docs/agent-token-benchmark-v1.1.0.md
- Benchmark runner: scripts/run_benchmark.py
- Benchmark skill: .github/skills/benchmark-eval/SKILL.md
- Agent token benchmark script: scripts/demo_agent_benchmark.py
- Agent token benchmark skill: .github/skills/agent-token-benchmark/SKILL.md
License
MIT. See LICENSE.