
LSP-powered code intelligence with AI semantic search, exposed as an MCP server


Codebase Insights

Persistent code intelligence for MCP-compatible coding agents.

Codebase Insights builds a reusable understanding of a repository from four sources at once:

  • LSP structure for symbols, definitions, references, and implementations
  • SQLite for a persistent workspace symbol/reference index
  • LLM summaries for higher-level meaning
  • Vector search for natural-language retrieval

It runs as a local MCP server, so agents can search by intent, navigate by structure, and reuse the same codebase understanding across sessions instead of re-exploring the repo every time.

What it is for

Codebase Insights is designed for the gap between plain text search and full repository understanding.

Typical agent problems it helps with:

  • "Find the code that sanitizes HTML even if I do not know the symbol name."
  • "Show me the interface, then all implementations, then all references."
  • "Which file owns this feature?"
  • "Give me a project-level summary before I start editing."

The project combines:

  • precise structural navigation from language servers
  • concept-level retrieval from AI summaries + embeddings
  • persistent local storage so unchanged code does not need to be rediscovered
  • incremental updates driven by file hashes and filesystem watching

Core capabilities

| Area | What it provides |
| --- | --- |
| Workspace indexing | Stores symbols, definition locations, and references in SQLite |
| Structural navigation | Hover, definition, declaration, implementation, references, document symbols |
| Semantic symbol search | semantic_search(query) finds symbols by behavior or intent |
| Semantic file search | search_files(query) finds files by responsibility or architectural role |
| Stored summaries | Per-symbol, per-file, and project-level summaries |
| Incremental maintenance | Skips unchanged files, watches for edits, updates changed content only |
| MCP integration | Exposes the index and navigation tools over streamable HTTP |
| Flexible model setup | Chat and embedding providers can be configured independently |

Supported languages today: Python, JavaScript/TypeScript, C++, and Rust via standard LSP servers.

Latest benchmark snapshot

The latest benchmark checked into this repository is v1.0.1, generated on 2026-04-16 against a real Electron + Vue + React Native monorepo (G:\SyntaxSenpai).

Benchmark target

| Metric | Value |
| --- | --- |
| Files processed | 89 |
| Symbols | 5,592 |
| Cross-references | 31,791 |

Retrieval quality

| Benchmark | Result |
| --- | --- |
| Symbol search Hit@1 | 26/28 (92.9%) |
| Symbol search Hit@3 | 27/28 (96.4%) |
| Symbol search Hit@5 | 27/28 (96.4%) |
| File search Hit@1 | 14/15 (93.3%) |
| File search Hit@3 | 14/15 (93.3%) |
| File search Hit@5 | 15/15 (100.0%) |
| Keyword baseline found expected symbol | 13/28 (46.4%) |

Key takeaways from the latest run:

  • semantic_search missed rank 1 on two queries: AIChatRuntime still appeared within the top 3, while StreamChunk was not found in the top 5.
  • search_files missed rank 1 only once (executor.ts), and still returned the correct file within the top 5.
  • The keyword baseline failed badly on concept-heavy queries such as translation lookup, HTML sanitization, deferred mount logic, and runtime metrics.

This matters because Codebase Insights is trying to solve the exact case where the caller knows what the code does, but not what the symbol is named.

The latest checked-in benchmark report is docs/benchmark-v1.0.1.md. Historical benchmark reports are also kept under docs/.

Performance baseline

For full rebuild and incremental update behavior, the v1.0.1 baseline is:

| Metric | Result |
| --- | --- |
| Full pipeline wall time | 499.79s (~8.3 min) |
| Storage footprint | 15.09 MB (6.37 MB SQLite + 8.72 MB ChromaDB) |
| No-change catch-up | 30.2s |
| Leaf-file edit | 27.2s (3.65s semantic + 7.0s file summary + 14.4s incremental project summary) |
| Core-file edit | 63.2s (3.65s semantic + 7.0s file summary + 14.4s incremental project summary) |
| New file | 67.0s (3.9s semantic + 3.9s file summary + 16.4s incremental project summary) |

The v1.0.1 project summary fix reduces per-update LLM output from ~14 KB to ~2 KB by removing the redundant per-file bullet list from the project summary prompt. Incremental project summary now takes ~15–20s instead of ~136s. Core-file edit dropped from ~179s to 63s and new-file addition from ~157s to 67s, both now completing end-to-end in under 70 seconds.

How it works

  1. Language detection and LSP startup
    The CLI scans the target repository, detects supported languages, validates the required language servers, and starts an LSP client per language.

  2. Workspace indexing
    The workspace indexer walks the repo, respects .gitignore, flattens documentSymbol results, records symbol definitions and references, hashes files, and stores everything in SQLite.

  3. Semantic indexing
    The semantic indexer reads eligible symbols from SQLite, extracts local source context, generates short natural-language summaries, and stores embeddings in ChromaDB.

  4. File and project summarization
    File summaries support module-level retrieval. A project summary gives agents a higher-level map of the codebase before they start drilling down.

  5. Incremental updates
    Unchanged files are skipped by file hash. On re-indexing an edited file, existing symbol summaries are carried over by (name, kind, container) key so that only genuinely changed symbols are re-summarized (see the sketch after this list). A watchdog observer keeps the index warm while the server is running.

  6. MCP serving
    All of that state is exposed through an MCP server on http://127.0.0.1:6789/mcp using streamable HTTP transport.
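As a rough illustration of step 5, here is a minimal sketch of hash-based skipping and summary carry-over. The hash algorithm, table name, and column names are assumptions for illustration; the actual logic lives in workspace_indexer.py and semantic_indexer.py.

```python
import hashlib
import sqlite3

def file_digest(path: str) -> str:
    """Hash file contents so unchanged files can be skipped on re-index.
    (sha256 is illustrative; the real implementation may use another digest.)"""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def needs_reindex(db: sqlite3.Connection, path: str) -> bool:
    # hypothetical schema: files(path TEXT PRIMARY KEY, hash TEXT)
    row = db.execute("SELECT hash FROM files WHERE path = ?", (path,)).fetchone()
    return row is None or row[0] != file_digest(path)

def carry_over_key(name: str, kind: str, container: str | None) -> tuple:
    """Symbols whose (name, kind, container) key is unchanged after an edit
    keep their stored summary instead of being re-summarized."""
    return (name, kind, container)
```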

Repository layout

src/codebase_insights/
├── main.py              CLI entry point and startup orchestration
├── LSP.py               LSP client wrapper
├── language_analysis.py Language detection and .gitignore parsing
├── workspace_indexer.py SQLite symbol/reference index + file watching
├── semantic_config.py   Config loading and first-run setup wizard
├── semantic_indexer.py  LLM summaries, embeddings, ranking, file/project summaries
└── mcp_server.py        MCP tool surface

scripts/
└── run_benchmark.py     Benchmark orchestrator

docs/
└── benchmark-v*.md      Versioned benchmark reports

Files created in the indexed repository

Path Purpose
.codebase-index.db Persistent SQLite symbol/reference database
.codebase-semantic/ ChromaDB collections for symbol and file summaries
.codebase-insights.toml Local configuration for model providers and semantic indexing

If the target repository already has a .gitignore, Codebase Insights automatically adds .codebase-index.db to it. The other generated artifacts should also be ignored in normal usage.
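In practice that means a .gitignore along these lines (drop the last entry if you prefer to commit your local config):

```
.codebase-index.db
.codebase-semantic/
.codebase-insights.toml
```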

Requirements

  • Python 3.11+
  • At least one supported LSP server for the target repository
  • Either:
    • Ollama, or
    • an OpenAI-compatible chat provider and embedding provider

Supported LSP servers

| Language | Server | Install |
| --- | --- | --- |
| Python | pylsp | pip install python-lsp-server |
| JavaScript / TypeScript | typescript-language-server | npm install -g typescript-language-server |
| C++ | clangd | https://clangd.llvm.org/installation.html |
| Rust | rust-analyzer | rustup component add rust-analyzer |

Optional Python plugins:

  • python-lsp-ruff
  • python-lsp-black
  • pylsp-mypy

For C/C++, indexing quality depends on clangd having enough project context. In practice that usually means a valid compile_commands.json or compile_flags.txt.
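For CMake projects, one common way to produce that context is the export flag (paths here are illustrative and depend on your build layout):

```sh
# generate compile_commands.json in the build directory
cmake -S . -B build -DCMAKE_EXPORT_COMPILE_COMMANDS=ON
# clangd can pick it up from build/; a symlink in the project root also works
ln -s build/compile_commands.json .
```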

Installation

From PyPI

```sh
pip install codebase-insights
```

From source

```sh
# from the repository root
pip install -e .
```

Configuration

On first run, Codebase Insights creates .codebase-insights.toml through an interactive setup wizard.

The wizard asks for:

  • chat provider (ollama or openai)
  • embedding provider (ollama or openai)
  • model names and base URLs
  • semantic indexing kinds
  • concurrency, batch size, and minimum reference count

Chat and embedding providers are configured independently, so mixed setups like OpenAI for chat + Ollama for embeddings are supported (see the example after the default shape below).

Default shape

```toml
[chat]
provider = "ollama"

[chat.ollama]
base_url = "http://localhost:11434"
model = "qwen2.5"

[embed]
provider = "ollama"

[embed.ollama]
base_url = "http://localhost:11434"
model = "bge-m3"

[semantic]
index_kinds = ["Class", "Method", "Function", "Interface", "Enum", "Constructor"]
concurrency = 16
batch_size = 16
min_ref_count = 3
```
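A mixed setup might look like the following. The keys under [chat.openai] are illustrative assumptions that mirror the ollama sections (the setup wizard writes the authoritative shape), and the model name is a placeholder:

```toml
[chat]
provider = "openai"

[chat.openai]
base_url = "https://api.openai.com/v1"   # any OpenAI-compatible endpoint
model = "gpt-4o-mini"                    # placeholder model name

[embed]
provider = "ollama"

[embed.ollama]
base_url = "http://localhost:11434"
model = "bge-m3"
```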

API key precedence

When using an OpenAI-compatible provider, runtime environment variables take precedence over keys stored in the config file (a resolution sketch follows the list):

  • CODEBASE_INSIGHTS_CHAT_API_KEY
  • CODEBASE_INSIGHTS_EMBED_API_KEY
  • OPENAI_API_KEY
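A minimal sketch of that resolution, assuming the more specific variable wins over OPENAI_API_KEY as the list order suggests:

```python
import os

def resolve_api_key(role: str, config_key: str | None) -> str | None:
    """role is "chat" or "embed"; config_key comes from .codebase-insights.toml.
    Environment variables win over the config file."""
    return (
        os.environ.get(f"CODEBASE_INSIGHTS_{role.upper()}_API_KEY")
        or os.environ.get("OPENAI_API_KEY")
        or config_key
    )
```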

Running

```sh
codebase-insights <project_root> [options]
```

Quick start with Ollama

```sh
# Terminal 1
ollama serve

# Terminal 2
codebase-insights /path/to/project
```

Quick start with an OpenAI-compatible provider

```sh
# set OPENAI_API_KEY in your environment first
codebase-insights /path/to/project --new-config
```

Then choose openai for chat and/or embeddings in the setup wizard.

CLI flags

| Flag | What it does |
| --- | --- |
| --new-config | Re-run the interactive setup wizard |
| --rebuild-index | Clear and rebuild the SQLite symbol index |
| --rebuild-semantic | Clear summaries and vector data, then regenerate everything |
| --rebuild-summaries | Rebuild file/project summaries only |
| --rebuild-vectors | Re-embed existing summaries without new LLM summarization |

Normal usage is incremental. The rebuild flags are mainly for maintenance, model changes, or benchmarking.
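For example, after switching to a different embedding model:

```sh
# re-embed existing summaries without re-running LLM summarization
codebase-insights /path/to/project --rebuild-vectors
```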

MCP tools

Workspace and LSP

| Tool | Purpose |
| --- | --- |
| languages_in_codebase() | Return detected languages |
| lsp_capabilities() | Return active LSP capability information |
| lsp_hover(file_uri, line, character) | Hover docs/type at a position |
| lsp_definition(file_uri, line, character) | Jump to definition |
| lsp_declaration(file_uri, line, character) | Find declaration |
| lsp_implementation(file_uri, line, character) | Find implementations |
| lsp_references(file_uri, line, character) | Find references |
| lsp_document_symbols(file_uri) | List symbols in a file |

Indexed symbol queries

| Tool | Purpose |
| --- | --- |
| query_symbols(path, kinds, name_query, limit) | Search the SQLite symbol index by path, kind, or fuzzy name |
| get_symbol_summary(name, file_uri, line, character) | Fetch the stored AI summary for a specific symbol |
| get_indexer_criteria() | Return the current semantic indexing thresholds and eligible kinds |

Semantic retrieval

| Tool | Purpose |
| --- | --- |
| semantic_search(query, limit, kinds) | Find symbols by natural-language intent |
| search_files(query, limit) | Find files by natural-language responsibility |
| get_file_summary(file_path) | Fetch the stored summary for one file |
| get_project_summary() | Fetch the stored project-level overview |

For LSP calls, the server accepts standard file:///... URIs as well as plain absolute filesystem paths.
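Both of these argument shapes are therefore equivalent (the path is hypothetical):

```python
# illustrative tool arguments; either location form is accepted
{"file_uri": "file:///home/me/project/src/app.py", "line": 10, "character": 4}
{"file_uri": "/home/me/project/src/app.py", "line": 10, "character": 4}
```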

A practical agent workflow

One effective flow is:

  1. Call get_project_summary() to understand the repo shape.
  2. Use search_files() to find the likely module.
  3. Use semantic_search() to find the likely symbol.
  4. Use query_symbols() when you know part of the name or want exact paths/kinds.
  5. Use lsp_definition(), lsp_implementation(), and lsp_references() to expand outward structurally.
  6. Use get_symbol_summary() when you want a compact natural-language explanation of a specific symbol.

That gives agents both semantic recall and structural precision.
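As a sketch of the first three steps from a script, assuming the official MCP Python SDK (the mcp package) and its streamable HTTP client; the tool names and endpoint come from the tables above:

```python
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

async def main() -> None:
    # connect to the local Codebase Insights MCP server
    async with streamablehttp_client("http://127.0.0.1:6789/mcp") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # 1. orient with the project-level map
            overview = await session.call_tool("get_project_summary", {})
            # 2. narrow to a likely module by responsibility
            files = await session.call_tool(
                "search_files", {"query": "HTML sanitization", "limit": 3}
            )
            # 3. find the likely symbol by intent
            symbols = await session.call_tool(
                "semantic_search", {"query": "sanitizes untrusted HTML", "limit": 5}
            )
            print(overview, files, symbols, sep="\n\n")

asyncio.run(main())
```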

Limitations and trade-offs

  • Not every symbol gets an AI summary.
    By default, semantic indexing only covers Class, Method, Function, Interface, Enum, and Constructor, and only when a symbol meets the min_ref_count threshold.

  • Named declarations work better than anonymous logic.
    The index intentionally filters or demotes anonymous LSP artifacts, callbacks, and low-signal pseudo-symbols.

  • LSP quality sets the floor.
    If the language server cannot fully understand the workspace, indexing and navigation quality drop with it.

  • Semantic search is strongest for intent, not exhaustive coverage.
    It is best used to find likely starting points, then combined with structural tools for complete navigation.

  • The benchmark measures retrieval quality and incremental update performance.
    Retrieval quality (Hit@k) is the primary benchmark; timing numbers reflect one run against a specific LLM/embedding stack and will vary with provider and hardware.

License

MIT. See LICENSE.
