Semantic code search + LLM answers for your TypeScript/React codebase — runs entirely on your machine.

codescout


CodeScout indexes TypeScript and React codebases with tree-sitter AST parsing, generates local embeddings via sentence-transformers, and answers natural-language questions with an LLM — all with zero external infrastructure. No Docker, no database server, no GPU, no API key for embeddings.

Run it as an MCP server and any agent (GitHub Copilot, Claude Code, Cursor) gets instant, cited answers from your codebase before writing a single line of code.


Quickstart

pip install codescout
export OPENROUTER_API_KEY=sk-or-...   # free tier at openrouter.ai — only needed for ask
codescout ask "how does authentication work?"

On first run, codescout automatically initializes .codescout/ in your project, downloads the embedding model (~90 MB, cached), indexes every file, then answers your question. No other setup needed.


Main Features

  • AST-aware chunking — parses TypeScript and TSX with tree-sitter; every function, component, hook, type, and interface becomes its own chunk with the right semantic label
  • Enriched embeddings — each chunk is embedded with its natural-language description + imports + source, so queries match on meaning, not just keywords
  • Incremental indexing — files are SHA256-hashed; only changed files are re-processed on subsequent runs
  • Fully local — embeddings are generated by sentence-transformers on CPU, stored in FAISS + SQLite inside .codescout/; nothing leaves your machine
  • MCP server — expose search_codebase and index_status as MCP tools; Copilot, Claude Code, and Cursor call them automatically
  • Direct LLM answers: codescout ask retrieves relevant chunks and returns a cited, plain-English answer via OpenRouter (not just a list of snippets)
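
The incremental-indexing idea can be illustrated with a minimal Python sketch. This is not codescout's actual implementation; the function names and the `known_hashes` mapping are hypothetical, and only the use of SHA-256 file hashes comes from the feature list above.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 64 KiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def changed_files(paths: list[Path], known_hashes: dict[str, str]) -> list[Path]:
    """Return the files whose current hash differs from the stored one.

    `known_hashes` (hypothetical) maps a path string to the digest recorded
    on the previous run. New files are absent from it and count as changed.
    """
    return [p for p in paths if known_hashes.get(str(p)) != file_sha256(p)]
```

Only the paths returned by `changed_files` would need to be parsed and embedded again; everything else can keep its existing vectors.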

MCP Server

Run codescout as an MCP server so your agent searches your codebase directly — before writing or modifying code.

Setup

pip install 'codescout[mcp]'
codescout index          # index first
codescout mcp-init       # generates config + agent instructions (run once per project)

mcp-init creates:

File                             Purpose
.vscode/mcp.json                 Tells VS Code how to launch the MCP server
.github/copilot-instructions.md  Instructs GitHub Copilot to call search_codebase before tasks
CLAUDE.md                        Same instructions for Claude Code

Reload VS Code after running mcp-init. The agent will then call search_codebase automatically on every task.

Manual config

{
  "servers": {
    "codescout": {
      "type": "stdio",
      "command": "codescout",
      "args": ["mcp-serve"]
    }
  }
}

Available tools

Tool                            Description
search_codebase(query, top_k?)  Return the most relevant code chunks for a natural-language query
index_status()                  Report how many files and chunks are indexed and when the last run was

Commands

codescout init                          # Create .codescout/ config in current project
codescout index                         # Scan and embed (incremental by default)
codescout index --full                  # Force complete re-index from scratch
codescout index --verbose               # Show per-file chunk counts while indexing
codescout ask "your question"           # Semantic search + LLM answer
codescout ask "..." --show-context      # Also print retrieved code chunks
codescout ask "..." --top-k 10          # Retrieve more chunks (default: 5)
codescout ask "..." --model openai/gpt-4o   # Override LLM model
codescout status                        # Show index stats (files, chunks, size)
codescout config                        # View all config values
codescout config top_k 10               # Set a config value
codescout mcp-init                      # Generate MCP config + agent instruction files
codescout mcp-serve                     # Start MCP server (stdio)

Embedding Models

Embeddings are generated entirely locally with sentence-transformers — no API key, no internet after the first download. Models are cached at ~/.cache/huggingface/hub/.

Model                              Config value                         Dims  Size     Code quality
all-MiniLM-L6-v2 (default)         all-MiniLM-L6-v2                     384   ~90 MB   ⭐⭐⭐ Good starting point
nomic CodeRankEmbed (recommended)  nomic-ai/CodeRankEmbed               768   ~520 MB  ⭐⭐⭐⭐⭐ Best open-source for code
Jina v2 Code                       jinaai/jina-embeddings-v2-base-code  768   ~610 MB  ⭐⭐⭐⭐ Strong, 8k context
Salesforce CodeT5+                 Salesforce/codet5p-110m-embedding    256   ~420 MB  ⭐⭐⭐⭐ Compact index size

All models run on CPU. No GPU required.

Switching models

// .codescout/config.json
{
  "model": "nomic-ai/CodeRankEmbed",
  "embedding_dim": 768
}

codescout index --full   # always re-index after changing models

embedding_dim must match the model's output dimension. Vectors from different models are not comparable, so mixing them in the same index produces meaningless search results.
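
A small pre-flight check can catch a mismatched embedding_dim before re-indexing. The sketch below is an illustration only: `check_embedding_dim` and `MODEL_DIMS` are hypothetical names, not part of codescout, and the dimensions are copied from the model table above.

```python
# Output dimensions copied from the model table above.
MODEL_DIMS = {
    "all-MiniLM-L6-v2": 384,
    "nomic-ai/CodeRankEmbed": 768,
    "jinaai/jina-embeddings-v2-base-code": 768,
    "Salesforce/codet5p-110m-embedding": 256,
}


def check_embedding_dim(config: dict) -> None:
    """Raise ValueError if embedding_dim disagrees with the listed model dimension."""
    model = config["model"]
    expected = MODEL_DIMS.get(model)
    if expected is None:
        return  # model not in the table: nothing to compare against
    actual = config["embedding_dim"]
    if actual != expected:
        raise ValueError(
            f"{model} outputs {expected} dims, but embedding_dim is {actual}"
        )
```

Calling `check_embedding_dim({"model": "nomic-ai/CodeRankEmbed", "embedding_dim": 384})` raises, which is the situation the note above warns about.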


How It Works

  1. Scan — walks the project tree, respects .gitignore, filters by configured extensions
  2. Parse — tree-sitter extracts top-level declarations (functions, components, hooks, types, interfaces, classes)
  3. Chunk — each declaration becomes one chunk; a natural-language description is generated and prepended before embedding
  4. Embed — all chunks are encoded in a single batched call (memory-safe mini-batches of 32)
  5. Store — vectors go into FAISS (index.faiss), metadata + source into SQLite (metadata.db), both inside .codescout/
  6. Query — question is embedded, FAISS finds nearest vectors, source is fetched from SQLite, sent to LLM
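
The Query step can be pictured with a pure-Python stand-in for the FAISS lookup. This is a sketch under assumptions: it uses brute-force cosine similarity, whereas the README does not say which FAISS index type or distance metric codescout uses, and the names `cosine` and `top_k` are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], vectors: dict[int, list[float]], k: int = 5) -> list[int]:
    """Return the ids of the k stored vectors most similar to the query, best first."""
    ranked = sorted(vectors, key=lambda cid: cosine(query, vectors[cid]), reverse=True)
    return ranked[:k]
```

The returned ids play the role of chunk ids: they would be used to fetch source text from SQLite before the prompt is assembled for the LLM.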

Chunk classification

Type              Detection
component         JSX in body + uppercase first letter
hook              name starts with use + uppercase
function          any other named function or arrow
type / interface  TypeScript type alias or interface
class             class or abstract class declaration
constant          exported non-function declaration
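
The rules in the table can be expressed as a short Python function. This is a sketch, not codescout's code: the `kind` and `has_jsx` inputs are hypothetical stand-ins for what the tree-sitter pass would report, and checking the component rule before the hook rule is an assumption about precedence.

```python
import re

# "use" followed by an uppercase letter, e.g. useAuthToken (but not "user").
HOOK_NAME = re.compile(r"^use[A-Z]")


def classify_chunk(name: str, kind: str, has_jsx: bool = False) -> str:
    """Map a declaration to a chunk type.

    `kind` (hypothetical) is one of 'function', 'type', 'interface',
    'class', or 'variable'. Callers are assumed to pass only exported
    declarations when kind is 'variable'.
    """
    if kind in ("type", "interface", "class"):
        return kind
    if kind == "function":
        if has_jsx and name[:1].isupper():
            return "component"
        if HOOK_NAME.match(name):
            return "hook"
        return "function"
    return "constant"
```

For example, `classify_chunk("LoginForm", "function", has_jsx=True)` yields "component", while `classify_chunk("user", "function")` stays a plain "function" because the letter after "use" is lowercase.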

Why descriptions improve recall

A chunk for useAuthToken() gets the description "hook useAuthToken — Uses: token, setToken, userId". A query for "authentication token handling" matches this description even if the words "authentication" or "token" never appear in the function body. Interfaces get their field names extracted; hooks and components get their destructured state variables listed. Tools that embed raw source alone miss this signal.
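
To make the enrichment concrete, here is a hedged Python sketch of how such a description and the final embedding text might be assembled. The function names are hypothetical, the description format is copied from the example above, and joining the parts with newlines is an assumption.

```python
def describe(chunk_type: str, name: str, uses: list[str]) -> str:
    """Build a description in the format quoted in the example above."""
    text = f"{chunk_type} {name}"
    if uses:
        # Separator copied verbatim from the README example.
        text += " — Uses: " + ", ".join(uses)
    return text


def embedding_text(description: str, imports: list[str], source: str) -> str:
    """Join description, imports, and source into the text handed to the embedder."""
    parts = [description, *imports, source]
    return "\n".join(part for part in parts if part)
```

With this layout the description comes first, so the model sees the natural-language summary before the raw source.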


Configuration

.codescout/config.json (created by codescout init):

{
  "model": "all-MiniLM-L6-v2",
  "embedding_dim": 384,
  "top_k": 5,
  "extensions": [".ts", ".tsx", ".js", ".jsx"],
  "exclude": ["node_modules", "dist", ".next", "build", "*.test.ts", "*.spec.ts"],
  "llm_model": "anthropic/claude-sonnet-4",
  "max_chunk_lines": 80,
  "min_chunk_lines": 3,
  "max_context_chars": 12000
}

Read or write any value:

codescout config                   # show all
codescout config top_k             # read one
codescout config top_k 10          # write one
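
For scripts that need the same values, the config file can be read with the standard library. A minimal sketch, assuming missing keys fall back to the defaults shown above; whether codescout itself merges defaults this way is not stated, and `load_config` is a hypothetical helper.

```python
import json
from pathlib import Path

# Defaults copied from the config.json example above.
DEFAULTS = {
    "model": "all-MiniLM-L6-v2",
    "embedding_dim": 384,
    "top_k": 5,
    "extensions": [".ts", ".tsx", ".js", ".jsx"],
    "exclude": ["node_modules", "dist", ".next", "build", "*.test.ts", "*.spec.ts"],
    "llm_model": "anthropic/claude-sonnet-4",
    "max_chunk_lines": 80,
    "min_chunk_lines": 3,
    "max_context_chars": 12000,
}


def load_config(project_root: Path) -> dict:
    """Return DEFAULTS overlaid with values from .codescout/config.json, if present."""
    config = dict(DEFAULTS)
    path = project_root / ".codescout" / "config.json"
    if path.exists():
        config.update(json.loads(path.read_text(encoding="utf-8")))
    return config
```

`load_config(Path("."))["top_k"]` then reads the same value that `codescout config top_k` prints, provided the assumption about defaults holds.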

Storage

Everything lives in .codescout/ inside the project root:

.codescout/
├── config.json       # project config
├── index.faiss       # FAISS vector index (~1.5 MB per 1,000 chunks at 384 dims)
├── metadata.db       # SQLite: chunk source, file hashes, line numbers
└── .gitignore        # auto-generated; prevents committing the index to git
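
The size figure for index.faiss is consistent with storing one uncompressed 32-bit float per dimension. The sketch below reproduces that arithmetic; the flat float32 layout is an assumption, since the README does not name the FAISS index type, and `flat_index_bytes` is a hypothetical helper.

```python
def flat_index_bytes(num_chunks: int, dim: int, bytes_per_value: int = 4) -> int:
    """Estimate raw vector storage: chunks x dimensions x bytes per float32 value."""
    return num_chunks * dim * bytes_per_value


# 1,000 chunks at 384 dims: 1,536,000 bytes, about 1.5 MB.
small = flat_index_bytes(1_000, 384)

# The same chunks at 768 dims need twice the space: 3,072,000 bytes.
large = flat_index_bytes(1_000, 768)
```

Under the same assumption, switching to a 768-dimension model roughly doubles the index size for the same codebase.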

Inspect directly:

sqlite3 .codescout/metadata.db \
  "SELECT name, chunk_type, file_path, start_line FROM chunks LIMIT 20;"

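The same query can be run from Python with the standard sqlite3 module. This is a sketch: the table and column names are taken from the command above, and `list_chunks` is a hypothetical helper, not part of codescout.

```python
import sqlite3
from contextlib import closing


def list_chunks(db_path: str, limit: int = 20) -> list[tuple]:
    """Return (name, chunk_type, file_path, start_line) rows from the chunks table."""
    # closing() ensures the connection is released even if the query fails.
    with closing(sqlite3.connect(db_path)) as conn:
        cursor = conn.execute(
            "SELECT name, chunk_type, file_path, start_line FROM chunks LIMIT ?",
            (limit,),
        )
        return cursor.fetchall()
```

For example, `list_chunks(".codescout/metadata.db")` returns up to 20 tuples that can be printed or filtered by chunk type.
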
Installation

pip install codescout            # core
pip install 'codescout[mcp]'     # + MCP server

Requires Python 3.10+. FAISS, sentence-transformers, and tree-sitter are installed automatically as dependencies.


License

MIT
