
LSP-powered code intelligence with AI semantic search, exposed as an MCP server


Codebase Insights

Persistent code intelligence for AI coding agents.

Codebase Insights combines Language Server Protocol (LSP) analysis, LLM-generated summaries, and semantic embeddings to build a structured, reusable understanding of a codebase.

It exposes this understanding through an MCP (Model Context Protocol) server, making it usable from MCP-compatible clients such as Claude Desktop, GitHub Copilot, and other agent workflows.


Why this exists

Most coding agents still rely heavily on keyword search and guessed context when trying to find relevant code. That causes three recurring problems:

  1. Noisy retrieval
    Lexical matches often find superficially similar code while missing the symbol or implementation that actually matters.

  2. Weak structural understanding
    Definitions, references, implementations, and symbol hierarchies are crucial for navigating real codebases, but plain text search does not model them directly.

  3. No durable understanding across sessions
    Agents often pay the repository exploration cost over and over again, even when nothing has changed.

Codebase Insights addresses these limitations by combining:

  • LSP servers for precise symbol extraction and navigation
  • LLM summarization for higher-level semantic understanding
  • Vector embeddings for natural-language retrieval
  • Persistent local indexes so unchanged code does not need to be rediscovered from scratch

The result is a reusable code-understanding layer for intelligent coding assistants.


What it provides

  • Workspace-wide symbol indexing
  • Natural-language semantic code search
  • Definition / references / implementation navigation via LSP
  • Incremental re-indexing driven by file watching and hashes
  • Persistent codebase understanding across sessions
  • MCP server integration for AI clients and agents
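The hash-driven incremental re-indexing can be sketched in a few lines. This is a minimal illustration under assumed names (`files_to_reindex` and the per-file SHA-256 scheme are not necessarily the project's actual implementation):

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """Content hash used to decide whether a file needs re-indexing."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def files_to_reindex(root: Path, previous: dict[str, str]) -> list[Path]:
    """Compare current content hashes against the stored ones and return
    only the files whose content actually changed since the last run."""
    changed = []
    for path in sorted(root.rglob("*.py")):
        digest = file_digest(path)
        if previous.get(str(path)) != digest:
            changed.append(path)
            previous[str(path)] = digest  # remember for the next run
    return changed
```

On a second run with no edits, `files_to_reindex` returns an empty list, which is what makes no-change catch-up nearly free.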

Benchmark Highlights

Benchmark results below are from codebase-insights v0.1.1 on a real Electron + Vue + React Native monorepo (G:\SyntaxSenpai) with 118 files, 5,587 symbols, and 32,570 cross-references.

| Metric | Value |
|---|---|
| Full pipeline wall time | 427.7s (~7.1 min) |
| Pre-server startup | 6.17s |
| Workspace indexing | 62.01s |
| Semantic indexing | 178.26s |
| File summaries | 40.51s |
| Project summary (full) | 138.06s |
| Storage footprint | 13.80 MB |
| Peak RSS | 3,201 MiB |
| Retrieval Hit@1 | 68.4% |
| Retrieval Hit@3 | 89.5% |
| Retrieval Hit@5 | 100% |
| No-change catch-up | 0.05s |

Incremental updates

| Scenario | Total time |
|---|---|
| No change | 0.05s |
| Leaf-file edit | ~18s |
| Core-file edit | ~19s |
| New file | ~21s |

Retrieval quality highlights

  • Semantic search outperforms keyword-only symbol matching on concept-level queries
  • A single natural-language query for “streaming” surfaced provider stream() implementations across multiple backends
  • LSP navigation resolved:
    • 23 references to BaseAIProvider
    • 24 implementations / subclasses
  • Top-5 retrieval hit rate reached 100% on the benchmark query set

See the full benchmark report in docs/benchmark.md for methodology, per-query results, incremental update scenarios, and failure analysis.


Features

  • Multi-language support — Python, JavaScript/TypeScript, C++, Rust via standard LSP servers
  • Symbol indexing — full workspace scan with file-watch-driven incremental re-indexing
  • Semantic search — AI-generated summaries + embeddings for natural-language retrieval
  • Hybrid ranking — blends lexical matching with vector similarity and reference-aware ranking
  • Flexible model backends — Ollama (local) or OpenAI-compatible APIs
  • MCP server — exposes all capabilities over HTTP for any MCP-compatible client
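The hybrid ranking idea can be illustrated with a toy scorer that blends a lexical match score with vector similarity, then boosts symbols that are referenced more often. The weights and the log-based boost here are invented for illustration, not the shipped defaults:

```python
import math

def hybrid_score(lexical: float, vector_sim: float, ref_count: int,
                 w_lex: float = 0.3, w_vec: float = 0.7) -> float:
    """Blend a lexical match score (0..1) with cosine similarity (0..1),
    then mildly boost results that are referenced more in the codebase."""
    base = w_lex * lexical + w_vec * vector_sim
    return base * (1.0 + 0.1 * math.log1p(ref_count))

# Toy candidates: (symbol, lexical score, vector similarity, reference count)
candidates = [
    ("stream_chunks", 0.9, 0.40, 2),            # strong name match, weak concept match
    ("OpenAIProvider.stream", 0.5, 0.85, 23),   # strong concept match, well referenced
]
ranked = sorted(candidates, key=lambda c: hybrid_score(c[1], c[2], c[3]),
                reverse=True)
```

With this blend, the concept-level match outranks the purely lexical one, which is the behavior the benchmark's “streaming” query demonstrates.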

Architecture

src/codebase_insights/
├── main.py              CLI entry point & startup orchestration
├── language_analysis.py Detects languages; parses .gitignore
├── LSP.py               Async LSP client (hover, definition, references, symbols, …)
├── workspace_indexer.py Indexes symbols into SQLite; watches for file changes
├── semantic_indexer.py  LLM summarization + ChromaDB vector indexing & search
├── semantic_config.py   TOML config loader with interactive first-time setup wizard
└── mcp_server.py        MCP server exposing all tools over HTTP

On-disk artifacts

These are created at the target project root and automatically added to .gitignore:

| File / Directory | Purpose |
|---|---|
| .codebase-index.db | SQLite symbol database |
| .codebase-semantic/ | ChromaDB vector store |
| .codebase-insights.toml | Configuration file |

How it works

  1. Startup
    Detects languages, validates required LSP servers, and initializes LSP clients.

  2. Workspace indexing
    Scans the repository with LSP documentSymbol, stores symbols and references in SQLite, and monitors changes with filesystem watching.

  3. Semantic indexing
    Extracts source context for qualifying symbols, generates short LLM summaries, and stores embeddings in ChromaDB.

  4. File and project summarization
    Generates file-level summaries and maintains a project summary for higher-level semantic retrieval and context.

  5. Incremental updates
    Uses file hashes and symbol-content hashes to skip unchanged work and only reprocess modified or new symbols.

  6. MCP query serving
    Exposes symbol and semantic capabilities over MCP:

    • query_symbols(...) reads from SQLite
    • semantic_search(...) performs hybrid lexical + vector ranking
    • lsp_* tools expose structural navigation directly from language servers

Why not just use keyword search?

Keyword search is still useful, but it breaks down quickly when agents need to reason about structure and behavior.

| Task | Keyword search | Codebase Insights |
|---|---|---|
| Find exact symbol names | Good | Good |
| Find code by concept or behavior | Weak | Strong |
| Jump to definitions | Manual / indirect | Built-in |
| Find references | Approximate | Precise via LSP |
| Find implementations / subclasses | Hard | Built-in |
| Reuse code understanding across sessions | No | Yes |
| Reduce repeated exploration cost | No | Yes |

A concrete benchmark example:

  • Query: “LLM streaming response handling”
  • Semantic search returns provider stream() implementations directly
  • Keyword symbol search mostly returns names that merely contain "stream" such as helpers or chunking utilities

That difference matters a lot for coding agents.


Prerequisites

  • Python 3.11+
  • At least one LSP server matching the language(s) in the target repository
  • Either:
    • Ollama running locally, or
    • an OpenAI-compatible API key

Supported LSP servers

| Language | Server | Install |
|---|---|---|
| Python | pylsp | pip install python-lsp-server |
| JavaScript / TypeScript | typescript-language-server | npm install -g typescript-language-server |
| C++ | clangd | clangd.llvm.org |
| Rust | rust-analyzer | rustup component add rust-analyzer |

Optional Python LSP plugins:

  • python-lsp-ruff
  • python-lsp-black
  • pylsp-mypy

Installation

From PyPI

pip install codebase-insights

From source

git clone https://github.com/your-org/codebase-insights
cd codebase-insights
pip install -e .

Quick start

Ollama

# Terminal 1
ollama serve

# Terminal 2
codebase-insights /path/to/your/project

OpenAI-compatible API

export OPENAI_API_KEY="sk-..."
codebase-insights /path/to/your/project --new-config
# choose "openai" when prompted for chat and embed providers

On first run, an interactive wizard creates .codebase-insights.toml and helps configure model providers and indexing settings.

The MCP server starts on:

http://127.0.0.1:6789/mcp

using streamable HTTP transport.


Usage

codebase-insights <project_root> [options]

CLI options

| Flag | Description |
|---|---|
| --new-config | Re-run the setup wizard, overwriting the existing config |
| --rebuild-index | Drop and rebuild the SQLite symbol index from scratch |
| --rebuild-semantic | Drop all LLM summaries and ChromaDB vectors, then regenerate everything |
| --rebuild-summaries | Regenerate only file/project summaries (keeps symbol summaries) |
| --rebuild-vectors | Re-embed existing summaries with the current embedding model (no LLM calls) |

MCP tools

Once running, the following tools are available to connected MCP clients:

| Tool | Description |
|---|---|
| languages_in_codebase() | List detected languages in the project |
| lsp_capabilities() | Query active LSP server capabilities |
| lsp_hover(file_uri, line, character) | Type information and documentation at a position |
| lsp_definition(file_uri, line, character) | Jump to definition |
| lsp_declaration(file_uri, line, character) | Find declarations |
| lsp_implementation(file_uri, line, character) | Find implementations |
| lsp_references(file_uri, line, character) | Find all references to a symbol |
| lsp_document_symbols(file_uri) | List all symbols in a file |
| query_symbols(path, kinds, name_query, limit) | Query the SQLite index by path, kind, or name |
| semantic_search(query, limit, kinds) | Natural-language semantic search |
file_uri arguments should be standard file URIs, such as file:///G:/repo/path/to/file.ts.
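Constructing well-formed file URIs is easiest with pathlib, which handles drive letters and percent-escaping for you. A small illustration (the paths themselves are made up):

```python
from pathlib import PurePosixPath, PureWindowsPath

# A POSIX-style absolute path converts directly to a file URI:
posix_uri = PurePosixPath("/repo/src/app.ts").as_uri()

# A Windows path with a drive letter, matching the example above:
win_uri = PureWindowsPath(r"G:\repo\path\to\file.ts").as_uri()
```

Relative paths cannot be converted; resolve them to absolute paths first (e.g. with Path.resolve()) before building the URI.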


Configuration

The config file .codebase-insights.toml is created interactively on first run.

Example

[chat]
provider = "ollama"          # "ollama" | "openai"

[chat.ollama]
base_url = "http://localhost:11434"
model = "qwen2.5"

[embed]
provider = "ollama"

[embed.ollama]
model = "bge-m3"

[semantic]
index_kinds = ["Class", "Method", "Function", "Interface", "Enum", "Constructor"]
concurrency = 16
batch_size = 16
min_ref_count = 1

[ranking]
# noise penalties and re-ranking weights (see semantic_config.py for defaults)

Environment variable overrides

These environment variables override the corresponding TOML values:

  • OPENAI_API_KEY
  • OPENAI_BASE_URL

Example use cases

Codebase Insights is useful for questions like:

  • “Find all implementations of this provider interface.”
  • “What handles WebSocket messages in this repo?”
  • “Where is configuration loaded and applied?”
  • “Show me every reference to this base class.”
  • “Find the real code responsible for streaming responses.”
  • “Search the repository by behavior, not just by exact symbol names.”

It is particularly effective for agents that need to:

  • move from natural-language intent to a likely symbol
  • expand from that symbol to definitions, references, and implementations
  • reduce file-opening and grep iteration overhead
  • retain reusable codebase understanding across sessions

Known limitations

Current limitations and trade-offs include:

  • Only a subset of symbols are semantically indexed
    Filtering by symbol kind and reference count improves quality and cost, but reduces coverage.

  • Inline or anonymous logic is harder to retrieve semantically
    For example, anonymous callbacks and ad-hoc try/catch behavior do not form named symbols.

  • Convention-based framework behavior may be less visible
    File-system routing or other convention-heavy patterns may not map cleanly to LSP symbol graphs.

  • Full project summaries remain relatively expensive
    Incremental project-summary updates are much faster now, but full project summarization is still one of the largest rebuild costs.


Dependencies

| Package | Purpose |
|---|---|
| mcp[cli] | MCP server framework |
| watchdog | Filesystem monitoring |
| langchain, langchain-ollama, langchain-openai | LLM and embedding integration |
| langchain-chroma, chromadb | Vector store |
| tqdm | Progress bars |

Project status

Codebase Insights is an early-stage but functional code-understanding platform.

Validated so far:

  • workspace-wide symbol indexing
  • LSP-backed navigation
  • semantic retrieval over indexed symbols
  • persistent on-disk indexes
  • near-zero no-change catch-up
  • practical incremental update behavior
  • incremental project summary updates
  • automated benchmark coverage

Still improving:

  • retrieval quality on diffuse / anonymous logic
  • indexing coverage trade-offs
  • performance on larger repositories
  • ergonomics and documentation

Contributing

Contributions, benchmark results, bug reports, and design feedback are welcome.

Especially valuable areas include:

  • performance optimization
  • retrieval quality tuning
  • incremental update behavior
  • support for more repo styles and languages
  • benchmark automation
  • MCP client ergonomics

License

This project is licensed under the MIT License. See the LICENSE file for details.



Download files


Source Distribution

codebase_insights-0.1.1.tar.gz (264.2 kB)


Built Distribution


codebase_insights-0.1.1-py3-none-any.whl (50.9 kB)


File details

Details for the file codebase_insights-0.1.1.tar.gz.

File metadata

  • Download URL: codebase_insights-0.1.1.tar.gz
  • Upload date:
  • Size: 264.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for codebase_insights-0.1.1.tar.gz

| Algorithm | Hash digest |
|---|---|
| SHA256 | 69cab304169119bc31550c46392f72deba1b43255e8f0652ac4f2bffb3be1100 |
| MD5 | 2abc552230902f169dab00d45f887999 |
| BLAKE2b-256 | d63de8974a2040d9ef63a99c93ec6d7bcbb3771542aeb85afbb0ffc0ec91e67a |


Provenance

The following attestation bundles were made for codebase_insights-0.1.1.tar.gz:

Publisher: publish.yml on JimmyfaQwQ/Codebase-Insights

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file codebase_insights-0.1.1-py3-none-any.whl.


File hashes

Hashes for codebase_insights-0.1.1-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | fc90357a08e65d570e27fa59237a7be1cb2d37fd10bd312745b5408ae76f5014 |
| MD5 | b55fa63cfe507c89d92b63cda3770f4b |
| BLAKE2b-256 | 9b952b057528f16d758d76bd0b581eccb7616038a01bd1410f9ae996387f5d20 |


Provenance

The following attestation bundles were made for codebase_insights-0.1.1-py3-none-any.whl:

Publisher: publish.yml on JimmyfaQwQ/Codebase-Insights

