codescout
Semantic code search + LLM answers for your TypeScript/React codebase — runs entirely on your machine.
Quickstart • MCP Server • CLI • Embedding models • How it works • Configuration
CodeScout indexes TypeScript and React codebases with tree-sitter AST parsing, generates local embeddings via sentence-transformers, and answers natural-language questions with an LLM — all with zero external infrastructure. No Docker, no database server, no GPU, no API key for embeddings.
Run it as an MCP server and any agent (GitHub Copilot, Claude Code, Cursor) gets instant, cited answers from your codebase before writing a single line of code.
Quickstart
pip install codescout
export OPENROUTER_API_KEY=sk-or-... # free tier at openrouter.ai — only needed for ask
codescout ask "how does authentication work?"
On first run, codescout automatically initializes .codescout/ in your project, downloads the embedding model (~90 MB, cached), indexes every file, then answers your question. No other setup needed.
Main Features
- AST-aware chunking: parses TypeScript and TSX with tree-sitter; every function, component, hook, type, and interface becomes its own chunk with the right semantic label
- Enriched embeddings: each chunk is embedded with its natural-language description + imports + source, so queries match on meaning, not just keywords
- Incremental indexing: files are SHA256-hashed; only changed files are re-processed on subsequent runs
- Fully local: embeddings are generated by sentence-transformers on CPU and stored in FAISS + SQLite inside .codescout/; indexing and search send nothing off your machine (only ask sends the retrieved chunks to OpenRouter)
- MCP server: exposes search_codebase and index_status as MCP tools; Copilot, Claude Code, and Cursor call them automatically
- Direct LLM answers: codescout ask retrieves relevant chunks and returns a cited, plain-English answer via OpenRouter (not just a list of snippets)
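Hash-based incremental indexing can be pictured with a short sketch. This is an illustration of the general technique, not codescout's actual code: the helper names file_sha256 and changed_files and the in-memory known_hashes mapping are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the hex SHA256 digest of a file's bytes, read in blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def changed_files(paths: list[Path], known_hashes: dict[str, str]) -> list[Path]:
    """Return the files whose current digest differs from the recorded one.

    known_hashes maps a path string to the digest seen on the previous run
    (hypothetical structure; a real index would load it from its metadata store).
    New files have no recorded digest, so they are reported as changed too.
    """
    stale = []
    for path in paths:
        if known_hashes.get(str(path)) != file_sha256(path):
            stale.append(path)
    return stale
```

Only the paths returned by changed_files would need to be re-parsed and re-embedded; everything else can keep its existing vectors.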
MCP Server
Run codescout as an MCP server so your agent searches your codebase directly — before writing or modifying code.
Setup
pip install 'codescout[mcp]'
codescout index # index first
codescout mcp-init # generates config + agent instructions (run once per project)
mcp-init creates:
| File | Purpose |
|---|---|
| .vscode/mcp.json | Tells VS Code how to launch the MCP server |
| .github/copilot-instructions.md | Instructs GitHub Copilot to call search_codebase before tasks |
| CLAUDE.md | Same instructions for Claude Code |
Reload VS Code after running mcp-init. The agent will then call search_codebase automatically on every task.
Manual config
{
  "servers": {
    "codescout": {
      "type": "stdio",
      "command": "codescout",
      "args": ["mcp-serve"]
    }
  }
}
Available tools
| Tool | Description |
|---|---|
| search_codebase(query, top_k?) | Return the most relevant code chunks for a natural-language query |
| index_status() | Report how many files and chunks are indexed and when the last run was |
Commands
codescout init # Create .codescout/ config in current project
codescout index # Scan and embed (incremental by default)
codescout index --full # Force complete re-index from scratch
codescout index --verbose # Show per-file chunk counts while indexing
codescout ask "your question" # Semantic search + LLM answer
codescout ask "..." --show-context # Also print retrieved code chunks
codescout ask "..." --top-k 10 # Retrieve more chunks (default: 5)
codescout ask "..." --model openai/gpt-4o # Override LLM model
codescout status # Show index stats (files, chunks, size)
codescout config # View all config values
codescout config top_k 10 # Set a config value
codescout mcp-init # Generate MCP config + agent instruction files
codescout mcp-serve # Start MCP server (stdio)
Embedding Models
Embeddings are generated entirely locally with sentence-transformers — no API key, no internet after the first download. Models are cached at ~/.cache/huggingface/hub/.
| Model | Config value | Dims | Size | Code quality |
|---|---|---|---|---|
| all-MiniLM-L6-v2 (default) | all-MiniLM-L6-v2 | 384 | ~90 MB | ⭐⭐⭐ Good starting point |
| nomic CodeRankEmbed (recommended) | nomic-ai/CodeRankEmbed | 768 | ~520 MB | ⭐⭐⭐⭐⭐ Best open-source for code |
| Jina v2 Code | jinaai/jina-embeddings-v2-base-code | 768 | ~610 MB | ⭐⭐⭐⭐ Strong, 8k context |
| Salesforce CodeT5+ | Salesforce/codet5p-110m-embedding | 256 | ~420 MB | ⭐⭐⭐⭐ Compact index size |
All models run on CPU. No GPU required.
Switching models
// .codescout/config.json
{
  "model": "nomic-ai/CodeRankEmbed",
  "embedding_dim": 768
}
codescout index --full # always re-index after changing models
embedding_dim must match the model's output dimension. Mixing embeddings from two different models in the same index produces meaningless similarity scores, which is why a full re-index is required after every model change.
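A config check along the following lines would catch a mismatched embedding_dim before indexing starts. It is a sketch only: the dimensions are copied from the model table in this README, and the names KNOWN_DIMS and check_embedding_dim are hypothetical rather than part of codescout.

```python
# Output dimensions as listed in the model table of this README.
KNOWN_DIMS = {
    "all-MiniLM-L6-v2": 384,
    "nomic-ai/CodeRankEmbed": 768,
    "jinaai/jina-embeddings-v2-base-code": 768,
    "Salesforce/codet5p-110m-embedding": 256,
}


def check_embedding_dim(config: dict) -> None:
    """Raise ValueError if embedding_dim disagrees with the listed model dimension."""
    model = config["model"]
    expected = KNOWN_DIMS.get(model)
    if expected is None:
        return  # model not in the table: nothing to compare against
    actual = config["embedding_dim"]
    if actual != expected:
        raise ValueError(
            f"embedding_dim is {actual}, but {model} produces {expected}-dimensional vectors"
        )
```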
How It Works
- Scan: walks the project tree, respects .gitignore, filters by configured extensions
- Parse: tree-sitter extracts top-level declarations (functions, components, hooks, types, interfaces, classes)
- Chunk: each declaration becomes one chunk; a natural-language description is generated and prepended before embedding
- Embed: all chunks are encoded in a single batched call (memory-safe mini-batches of 32)
- Store: vectors go into FAISS (index.faiss), metadata + source into SQLite (metadata.db), both inside .codescout/
- Query: the question is embedded, FAISS finds the nearest vectors, their source is fetched from SQLite and sent to the LLM
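The Query step boils down to a nearest-neighbour lookup. The sketch below replaces FAISS with a brute-force scan in plain Python so that it runs without dependencies. Cosine similarity is an assumption (this README does not say which metric the index uses), and the function names are illustrative.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_chunks(
    query_vec: list[float],
    index: list[tuple[str, list[float]]],
    top_k: int = 5,
) -> list[str]:
    """Return the ids of the top_k chunks most similar to the query vector.

    index is a list of (chunk_id, vector) pairs standing in for the vector store.
    """
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in index]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:top_k]]
```

The returned ids would then be used to fetch each chunk's source and metadata, which become the context passed to the LLM.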
Chunk classification
| Type | Detection |
|---|---|
| component | JSX in body + uppercase first letter |
| hook | name starts with use + uppercase |
| function | any other named function or arrow |
| type / interface | TypeScript type alias or interface |
| class | class or abstract class declaration |
| constant | exported non-function declaration |
Why descriptions improve recall
A chunk for useAuthToken() gets the description "hook useAuthToken — Uses: token, setToken, userId". A query for "authentication token handling" matches this description even if the words "authentication" or "token" never appear in the function body. Interfaces get their field names extracted; hooks and components get their destructured state variables listed. Tools that embed raw source alone miss this signal.
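As a rough illustration, the description and the text handed to the embedding model might be assembled as follows. The description format reproduces the useAuthToken example above; the helper names, the order of the parts, and the newline separator are assumptions.

```python
def describe(chunk_type: str, name: str, uses: list[str]) -> str:
    """Build a one-line description such as 'hook useAuthToken — Uses: token, ...'."""
    base = f"{chunk_type} {name}"
    if not uses:
        return base
    return f"{base} — Uses: {', '.join(uses)}"


def embedding_text(description: str, imports: list[str], source: str) -> str:
    """Join description, import lines, and source into one string to embed."""
    return "\n".join([description, *imports, source])


if __name__ == "__main__":
    text = embedding_text(
        describe("hook", "useAuthToken", ["token", "setToken", "userId"]),
        ["import { useState } from 'react';"],
        "export function useAuthToken() { /* ... */ }",
    )
    print(text)
```

Because the description sits in the embedded text, the vector reflects the declaration's name and the identifiers it uses, not only the tokens of its body.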
Configuration
.codescout/config.json (created by codescout init):
{
  "model": "all-MiniLM-L6-v2",
  "embedding_dim": 384,
  "top_k": 5,
  "extensions": [".ts", ".tsx", ".js", ".jsx"],
  "exclude": ["node_modules", "dist", ".next", "build", "*.test.ts", "*.spec.ts"],
  "llm_model": "anthropic/claude-sonnet-4",
  "max_chunk_lines": 80,
  "min_chunk_lines": 3,
  "max_context_chars": 12000
}
Read or write any value:
codescout config # show all
codescout config top_k # read one
codescout config top_k 10 # write one
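To make the extensions and exclude settings concrete, here is one way such a filter could work. It is a sketch under assumptions: each exclude entry is matched as a shell-style pattern against every path component, which may differ from how codescout actually interprets the list, and the name is_indexed is hypothetical.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Default values copied from the config shown in this README.
EXTENSIONS = [".ts", ".tsx", ".js", ".jsx"]
EXCLUDE = ["node_modules", "dist", ".next", "build", "*.test.ts", "*.spec.ts"]


def is_indexed(path: str, extensions: list[str] = EXTENSIONS, exclude: list[str] = EXCLUDE) -> bool:
    """Return True if a project-relative path passes the extension and exclude filters."""
    parts = PurePosixPath(path)
    if parts.suffix not in extensions:
        return False
    # Reject the file if any directory or file name matches any exclude pattern.
    return not any(fnmatch(part, pattern) for part in parts.parts for pattern in exclude)
```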
Storage
Everything lives in .codescout/ inside the project root:
.codescout/
├── config.json # project config
├── index.faiss # FAISS vector index (~1.5 MB per 1,000 chunks at 384 dims)
├── metadata.db # SQLite: chunk source, file hashes, line numbers
└── .gitignore # auto-generated; prevents committing the index to git
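The size quoted for index.faiss is consistent with storing one uncompressed 32-bit float per dimension. A quick back-of-the-envelope check, assuming a flat float32 index (this README does not state which FAISS index type is used):

```python
def flat_index_bytes(num_chunks: int, dim: int, bytes_per_value: int = 4) -> int:
    """Approximate vector storage for an uncompressed index (float32 by default)."""
    return num_chunks * dim * bytes_per_value


if __name__ == "__main__":
    # 1,000 chunks at 384 dims: 1000 * 384 * 4 = 1,536,000 bytes, about 1.5 MB
    print(flat_index_bytes(1000, 384))
    # The same 1,000 chunks at 768 dims take twice as much: 3,072,000 bytes
    print(flat_index_bytes(1000, 768))
```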
Inspect directly:
sqlite3 .codescout/metadata.db \
"SELECT name, chunk_type, file_path, start_line FROM chunks LIMIT 20;"
Installation
pip install codescout # core
pip install 'codescout[mcp]' # + MCP server
Requires Python 3.10+. FAISS, sentence-transformers, and tree-sitter are bundled as dependencies.
License
MIT