LSP-powered code intelligence with AI semantic search, exposed as an MCP server
Codebase Insights
Codebase Insights is a local MCP server that gives coding agents a persistent model of your repository instead of making them rediscover it from scratch every session.
It combines four layers:
- LSP navigation for definitions, references, implementations, hover, and document symbols
- SQLite indexing for persistent workspace-wide symbol and reference lookup
- LLM-generated summaries for symbols, files, and the whole project
- Vector retrieval for natural-language search by behavior or intent
The point is not just “better search”. The point is giving an agent a cheaper, more reliable way to decide what code to read.
Why it exists
Text search works when you already know the name.
Agents usually do not. The real query looks more like this:
- “Where is the code that strips HTML?”
- “Which file owns the IPC bridge?”
- “What builds the system prompt?”
- “Show me the interface, then the implementations, then the references.”
Without a persistent index, the default workflow is wasteful:
- Guess keywords.
- Run repeated grep or symbol-name searches.
- Open a lot of files that turn out not to matter.
- Carry all of that source text forward in context.
Codebase Insights short-circuits that loop:
- search_files() finds files by responsibility
- semantic_search() finds symbols by intent
- query_symbols() confirms exact names from SQLite
- LSP tools expand from there with precise structural navigation
That means fewer blind reads, less irrelevant context, and less token waste.
What it is good at
Codebase Insights is most useful when the caller knows what the code does, but not what it is called.
Examples:
- “find the function that strips HTML tags”
- “find the module responsible for encrypted key storage”
- “find the runtime object that tracks token usage”
- “find the implementation(s) of this interface and then all references”
It is less valuable when the target is already known and you just need to open one file.
Why it helps agents spend fewer tokens
Every exploratory file read adds raw source code to the conversation. In a ReAct-style loop, that source stays in context and makes later turns more expensive.
Codebase Insights reduces that cost by returning compact routing information first: ranked files, ranked symbols, and stored summaries. The agent can narrow the search space before it opens any source.
| Waste in a browse-first workflow | What Codebase Insights changes |
|---|---|
| Repeated keyword reformulation | Semantic retrieval accepts the original intent directly |
| Large irrelevant search dumps | Ranked symbols and files come back with short summaries |
| Re-exploring the same repo every session | SQLite and ChromaDB persist the repo model locally |
The difference matters most on concept-heavy tasks where the names are not obvious.
Benchmark Highlights
v1.2.0 note: this release is a pure TUI and infrastructure change (Textual CLI, section-routed logging, builtins.print patch, debounce removal). No indexing or retrieval logic changed, so the v1.1.1 numbers below still represent the current retrieval quality and token savings.
Agent token benchmark
The most recent five demo-agent benchmark runs all used the same model, the same task (syntaxsenpai-backup-import-restore), and the same baseline vs enhanced comparison in scripts/demo_agent_benchmark.py.
Across those five runs, the enhanced workflow consistently replaced exploratory browsing with summary-guided routing:
- total tokens fell from 7,036,458 to 3,189,240 (-54.7%)
- input tokens fell from 6,967,040 to 3,128,182 (-55.1%)
- agent turns fell from 187 to 116 (-38.0%)
- raw view calls fell from 73 to 22 (-69.9%)
- elapsed time fell from 2068.8s to 1329.1s (-35.8%)
The stable pattern in the logs is simple: baseline spends discovery turns on grep plus view, while enhanced starts with lsp_capabilities, get_indexer_criteria, get_project_summary, semantic_search, and get_file_summary, then opens only the confirmed owner files.
Read the consolidated report in benchmarks/agent-token-benchmark-v1.1.0.md.
Quick start
Requirements
- Python 3.11+
- At least one supported LSP server for the target repository
- Either Ollama or an OpenAI-compatible provider for chat and embeddings
Install
From PyPI:
pip install codebase-insights
From source:
pip install -e .
Supported LSP servers
| Language | Server | Install |
|---|---|---|
| Python | pylsp | pip install python-lsp-server |
| JavaScript / TypeScript | typescript-language-server | npm install -g typescript-language-server |
| C++ | clangd | https://clangd.llvm.org/installation.html |
| Rust | rust-analyzer | rustup component add rust-analyzer |
Optional Python plugins:
- python-lsp-ruff
- python-lsp-black
- pylsp-mypy
For C and C++, indexing quality depends on clangd having enough project context, usually through compile_commands.json or compile_flags.txt. For example, CMake projects can generate compile_commands.json by configuring with CMAKE_EXPORT_COMPILE_COMMANDS=ON.
Run it
codebase-insights <project_root> [options]
Quick start with Ollama:
# Terminal 1
ollama serve
# Terminal 2
codebase-insights /path/to/project
Quick start with an OpenAI-compatible provider:
# Export your API key first — it is never stored in the config file
export OPENAI_API_KEY=sk-...
codebase-insights /path/to/project --new-config
On first run, Codebase Insights creates .codebase-insights.toml using an interactive setup wizard.
How it works
- Detect supported languages in the target repo and start one LSP client per language.
- Walk the workspace, respect .gitignore, flatten documentSymbol output, and store symbols plus references in SQLite.
- Summarize eligible symbols with an LLM and store embeddings in ChromaDB.
- Generate per-file summaries and a project summary for higher-level routing.
- Keep the index warm incrementally by hashing files and only updating what changed.
- Expose the whole surface over MCP on http://127.0.0.1:6789/mcp.
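The incremental step is what keeps re-runs cheap. The sketch below is not the project's actual indexer, just a minimal illustration of hash-based change detection, assuming SHA-256 content hashes and a plain dict as the recorded state:

```python
import hashlib
from pathlib import Path

def content_hash(path: Path) -> str:
    """Hash a file's bytes; if the hash is unchanged, the existing index entry is still valid."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def files_needing_reindex(workspace: Path, stored: dict[str, str]) -> list[Path]:
    """Return only files whose content differs from the previously recorded hash."""
    changed: list[Path] = []
    for path in sorted(workspace.rglob("*")):
        if not path.is_file():
            continue  # the real walker also respects .gitignore and per-language filters
        digest = content_hash(path)
        if stored.get(str(path)) != digest:
            stored[str(path)] = digest
            changed.append(path)
    return changed
```

In the real server the recorded state is persisted locally (the SQLite index and ChromaDB store survive between sessions), so this comparison does not start from scratch each run.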
Recommended agent workflow
If you want token savings, the workflow matters.
Use Codebase Insights as a substitute for broad file browsing, not as a warm-up before browsing anyway.
Recommended sequence:
1. get_project_summary() to orient on architecture.
2. search_files() or semantic_search() to narrow candidates.
3. query_symbols() when you need exact-name or kind filtering.
4. get_file_summary() before opening source.
5. lsp_definition(), lsp_implementation(), and lsp_references() once you have the right symbol.
6. view the source only after the summaries point to the right place.
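As a concrete sketch, the same sequence can be written down as code. The call_tool helper, the example query, and the file and symbol names below are hypothetical; only the tool names and their parameters come from the tool tables further down.

```python
def explore(call_tool, task: str = "find the code that strips HTML tags"):
    """Hypothetical agent-side helper: call_tool(name, **args) stands in for any MCP client."""
    call_tool("get_project_summary")                        # 1. orient on the architecture
    call_tool("search_files", query=task, limit=5)          # 2. candidate files by responsibility
    call_tool("semantic_search", query=task, limit=10)      #    and candidate symbols by intent
    call_tool("query_symbols", name_query="strip_html")     # 3. confirm exact names (name is illustrative)
    call_tool("get_file_summary", file_path="src/html_utils.py")  # 4. summary first (path is illustrative)
    call_tool("lsp_references",                              # 5. precise structural expansion
              file_uri="file:///abs/project/src/html_utils.py", line=42, character=4)
    # 6. only now open ("view") the source the summaries pointed to
```

The point of the order is that steps 1 through 5 return compact summaries and locations, so the expensive raw reads in step 6 happen only for confirmed owner files.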
Reusable workflow instructions live in codebase-insights-instructions.md.
Best practices for Copilot, Claude Code, and Cursor
For agent-specific prompting guidance, see docs/agent-usage.md.
Use that guide if you want to configure Codebase Insights for GitHub Copilot, Claude Code, or Cursor without bloating the main README.
Configuration
Chat and embedding providers are configured independently, so combinations like OpenAI for chat plus Ollama for embeddings are supported.
Default configuration shape:
[chat]
provider = "ollama"
[chat.ollama]
base_url = "http://localhost:11434"
model = "qwen2.5"
[embed]
provider = "ollama"
[embed.ollama]
base_url = "http://localhost:11434"
model = "bge-m3"
[semantic]
index_kinds = ["Class", "Method", "Function", "Interface", "Enum", "Constructor"]
concurrency = 16
batch_size = 16
min_ref_count = 3
summary_update_threshold = 5
summary_file_idle_timeout = 30
summary_project_idle_timeout = 300
Summary refresh controls:
| Setting | Default | Meaning |
|---|---|---|
| summary_update_threshold | 5 | Refresh summaries after this many files become stale |
| summary_file_idle_timeout | 30 | Refresh one file summary after that file has been idle |
| summary_project_idle_timeout | 300 | Refresh all stale summaries plus the project summary after the repo has been idle |
Set any of these to 0 to disable that trigger. refresh_file_summary() and refresh_project_summary() always force regeneration immediately.
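As a rough illustration of how the three triggers combine, the snippet below restates the documented behaviour in code. It is not the project's implementation; the argument names are invented, and the idle values are assumed to be in the same units as the config settings.

```python
def refresh_actions(stale_files: int, file_idle: float, repo_idle: float, cfg: dict) -> dict[str, bool]:
    """Which documented refresh triggers would fire; a setting of 0 disables that trigger."""
    threshold = cfg.get("summary_update_threshold", 5)
    file_timeout = cfg.get("summary_file_idle_timeout", 30)
    project_timeout = cfg.get("summary_project_idle_timeout", 300)
    return {
        "refresh_stale_summaries": threshold > 0 and stale_files >= threshold,
        "refresh_idle_file": file_timeout > 0 and file_idle >= file_timeout,
        "refresh_project_summary": project_timeout > 0 and repo_idle >= project_timeout,
    }
```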
API keys for OpenAI-compatible providers are read exclusively from environment variables and are never stored in the config file:
| Variable | Used for |
|---|---|
| CODEBASE_INSIGHTS_CHAT_API_KEY | Chat / summarisation (highest priority) |
| CODEBASE_INSIGHTS_EMBED_API_KEY | Embeddings (highest priority) |
| OPENAI_API_KEY | Fallback for both chat and embeddings |
The setup wizard validates that the relevant variable is already exported and warns if it is not found.
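The lookup order is simple to state in code. This is only an illustration of the documented priority, not the project's implementation:

```python
import os

def resolve_api_key(role: str) -> str | None:
    """Scoped variable first, then the shared OPENAI_API_KEY fallback ('role' is 'chat' or 'embed')."""
    scoped = {
        "chat": "CODEBASE_INSIGHTS_CHAT_API_KEY",
        "embed": "CODEBASE_INSIGHTS_EMBED_API_KEY",
    }[role]
    return os.environ.get(scoped) or os.environ.get("OPENAI_API_KEY")
```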
CLI flags
| Flag | Purpose |
|---|---|
| --new-config | Re-run the interactive setup wizard |
| --rebuild-index | Clear and rebuild the SQLite symbol index |
| --rebuild-semantic | Clear all summaries and vector data, then regenerate everything |
| --rebuild-summaries | Rebuild file and project summaries only |
| --rebuild-vectors | Re-embed existing summaries without new LLM summarization |
Normal usage is incremental. Rebuild flags are mainly for maintenance, model changes, or benchmarking.
MCP tools
Workspace and LSP
| Tool | Purpose |
|---|---|
| languages_in_codebase() | Return detected languages |
| lsp_capabilities() | Return active LSP capability information |
| lsp_hover(file_uri, line, character) | Hover docs or type at a position |
| lsp_definition(file_uri, line, character) | Jump to definition |
| lsp_declaration(file_uri, line, character) | Find declaration |
| lsp_implementation(file_uri, line, character) | Find implementations |
| lsp_references(file_uri, line, character) | Find references |
| lsp_document_symbols(file_uri) | List symbols in a file |
Indexed and semantic retrieval
| Tool | Purpose |
|---|---|
| query_symbols(path, kinds, name_query, limit) | Search the SQLite symbol index by path, kind, or fuzzy name |
| get_symbol_summary(name, file_uri, line, character) | Fetch the stored summary for a symbol |
| get_indexer_criteria() | Return semantic indexing thresholds and eligible kinds |
| semantic_search(query, limit, kinds) | Find symbols by natural-language intent |
| search_files(query, limit) | Find files by natural-language responsibility |
| get_file_summary(file_path) | Fetch the stored summary for one file |
| get_project_summary() | Fetch the stored project-level overview |
| refresh_file_summary(file_path) | Force-regenerate one file summary |
| refresh_project_summary() | Force-regenerate all stale file summaries and the project summary |
LSP calls accept either file:///... URIs or absolute filesystem paths.
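To exercise these tools outside of an agent framework, a plain MCP client works too. The sketch below assumes the official MCP Python SDK (the mcp package) and that the server speaks the streamable HTTP transport at its default endpoint; treat it as a sketch rather than a guaranteed interface.

```python
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

async def main() -> None:
    # Default local endpoint exposed by codebase-insights.
    async with streamablehttp_client("http://127.0.0.1:6789/mcp") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Natural-language intent instead of a keyword guess.
            result = await session.call_tool(
                "semantic_search",
                {"query": "function that strips HTML tags", "limit": 5},
            )
            print(result.content)

asyncio.run(main())
```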
Generated files in the indexed repository
| Path | Purpose |
|---|---|
| .codebase-index.db | Persistent SQLite symbol and reference database |
| .codebase-semantic/ | ChromaDB data for symbol and file summaries |
| .codebase-insights.toml | Local provider and indexing configuration |
| .codebase-insights.lock.toml | Records the active embedding model to detect configuration drift |
If the target repo already has a .gitignore, Codebase Insights automatically adds .codebase-index.db to it. The other generated artifacts should also be ignored in normal use.
Repository layout
src/codebase_insights/
├── main.py CLI entry point and startup orchestration
├── cli_io.py Section-routed I/O facade and builtins.print patch for TUI
├── tui.py Textual TUI app (section switcher, overview, semantic watch)
├── LSP.py LSP client wrapper
├── language_analysis.py Language detection and .gitignore parsing
├── workspace_indexer.py SQLite symbol/reference index plus file watching
├── semantic_config.py Config loading, first-run setup wizard, embed lock file
├── semantic_indexer.py LLM summaries, embeddings, ranking, file/project summaries
├── mcp_server.py MCP tool surface
scripts/
├── benchmark_monitor.py Benchmark process monitor
├── demo_agent_benchmark.py Demo benchmark script
├── run_benchmark.py Benchmark orchestrator
docs/
├── agent-token-benchmark-v1.1.0.md Historical archived single-run benchmark
└── agent-usage.md Agent-specific usage guidance
benchmarks/
├── benchmark-v*.md Versioned indexing and retrieval reports
└── agent-token-benchmark-v*.md Agent token A/B benchmark reports
Limitations
- Not every symbol gets an AI summary. Semantic indexing is intentionally selective.
- Named declarations work better than anonymous inline logic.
- LSP quality sets the floor. If the language server cannot understand the workspace, indexing quality will degrade.
- Semantic search is a routing tool, not a proof of completeness.
- Timing numbers are environment-specific; retrieval quality is the more stable signal.
Benchmark material
- Latest benchmark report: benchmarks/benchmark-v1.1.0.md
- Latest agent token benchmark: benchmarks/agent-token-benchmark-v1.1.0.md
- Full agent context appendix: benchmarks/agent-token-benchmark-context-v1.1.0.md
- Historical archived agent token benchmark: docs/agent-token-benchmark-v1.1.0.md
- Benchmark runner: scripts/run_benchmark.py
- Benchmark skill: .github/skills/benchmark-eval/SKILL.md
- Agent token benchmark script: scripts/demo_agent_benchmark.py
- Agent token benchmark skill: .github/skills/agent-token-benchmark/SKILL.md
License
MIT. See LICENSE.