CLI-first semantic code search with MCP integration and interactive D3.js visualization for exploring code relationships
MCP Vector Search
CLI-first semantic code search with MCP integration
Production Release (v2.5.56): Stable and actively maintained. LanceDB is now the default backend for better performance and stability.
A modern, fast, and intelligent code search tool that understands your codebase through semantic analysis and AST parsing. Built with Python, powered by LanceDB, and designed for developer productivity.
Features
Core Capabilities
- Semantic Search: Find code by meaning, not just keywords
- AST-Aware Parsing: Understands code structure (functions, classes, methods)
- Multi-Language Support: 13 languages - Python, JavaScript, TypeScript, C#, Dart/Flutter, PHP, Ruby, Java, Go, Rust, HTML, Markdown, and Text (with extensible architecture)
- Knowledge Graph: Temporal knowledge graph with KuzuDB for entity extraction and relationship mapping (kg build, kg status, kg query)
- Interactive Visualization: D3.js-powered visualization with 5+ views (Treemap, Sunburst, Force Graph, Knowledge Graph, Heatmap)
- Development Narratives: Generate git history narratives with the story command (markdown, JSON, HTML output)
- Real-time Indexing: File watching with automatic index updates
- Automatic Version Tracking: Smart reindexing on tool upgrades
- Local-First: Complete privacy with on-device processing
- Zero Configuration: Auto-detects project structure and languages
Developer Experience
- CLI-First Design: Simple commands for immediate productivity
- Rich Output: Syntax highlighting, similarity scores, context
- Fast Performance: Sub-second search responses, efficient indexing with pipeline parallelism (37% faster); IVF-PQ vector index delivers 4.9x faster queries (3.4ms vs 16.7ms)
- Modern Architecture: Async-first, type-safe, modular design
- Semi-Automatic Reindexing: Multiple strategies without daemon processes
- 17 MCP Tools: Comprehensive MCP integration for AI assistants (search, analysis, documentation, KG, story generation)
- Chat Mode: LLM-powered code Q&A with iterative refinement (up to 30 queries), deep search, and KG query tools
- CodeT5+ Embeddings: Code-specific embeddings via the index-code command (Salesforce/codet5p-110m-embedding)
Technical Features
- Vector Database: LanceDB (serverless, file-based)
- Embedding Models: Configurable sentence transformers with GPU acceleration
- Smart Reindexing: Search-triggered, Git hooks, scheduled tasks, and manual options
- Extensible Parsers: Plugin architecture for new languages
- Configuration Management: Project-specific settings
- Production Ready: Write buffering, auto-indexing, comprehensive error handling
- Performance: Apple Silicon M4 Max optimizations (2-4x speedup with MPS)
Quick Start
Installation
# Install from PyPI (recommended)
pip install mcp-vector-search
# Or with UV (faster)
uv pip install mcp-vector-search
# Or install from source
git clone https://github.com/bobmatnyc/mcp-vector-search.git
cd mcp-vector-search
uv sync && uv pip install -e .
Verify Installation:
# Check that all dependencies are installed correctly
mcp-vector-search doctor
# Should show a check mark for every dependency
# If you see missing dependencies, try:
pip install --upgrade mcp-vector-search
Zero-Config Setup (Recommended)
The fastest way to get started - completely hands-off, just one command:
# Smart zero-config setup (recommended)
mcp-vector-search setup
What setup does automatically:
- Detects your project's languages and file types
- Initializes semantic search with optimal settings
- Indexes your entire codebase
- Configures ALL installed MCP platforms (Claude Code, Cursor, etc.)
- Uses native Claude CLI integration (claude mcp add) when available
- Falls back to .mcp.json if the Claude CLI is not available
- Sets up file watching for auto-reindex
- Zero user input required!
Behind the scenes:
- Server name: mcp (for consistency with other MCP projects)
- Command: uv run python -m mcp_vector_search.mcp.server {PROJECT_ROOT}
- File watching: Enabled via MCP_ENABLE_FILE_WATCHING=true
- Integration method: Native claude mcp add (or .mcp.json fallback)
Example output:
Smart Setup for mcp-vector-search
Detecting project...
✓ Found 3 language(s): Python, JavaScript, TypeScript
✓ Detected 8 file type(s)
✓ Found 2 platform(s): claude-code, cursor
Configuring...
✓ Embedding model: sentence-transformers/all-MiniLM-L6-v2
Initializing...
✓ Vector database created
✓ Configuration saved
Indexing codebase...
✓ Indexing completed in 12.3s
Configuring MCP integrations...
✓ Using Claude CLI for automatic setup
✓ Registered with Claude CLI
✓ Configured 2 platform(s)
Setup Complete!
Options:
# Force re-setup
mcp-vector-search setup --force
# Verbose output for debugging (shows Claude CLI commands)
mcp-vector-search setup --verbose
Advanced Setup Options
For more control over the installation process:
# Manual setup with MCP integration
mcp-vector-search install --with-mcp
# Custom file extensions
mcp-vector-search install --extensions .py,.js,.ts,.dart
# Skip automatic indexing
mcp-vector-search install --no-auto-index
# Just initialize (no indexing or MCP)
mcp-vector-search init
Add MCP Integration for AI Tools
Automatic (Recommended):
# One command sets up all detected platforms
mcp-vector-search setup
Manual Platform Installation:
# Add Claude Code integration (project-scoped)
mcp-vector-search install claude-code
# Add Cursor IDE integration (global)
mcp-vector-search install cursor
# See all available platforms
mcp-vector-search install list
Note: The setup command uses native claude mcp add when Claude CLI is available, providing better integration than manual .mcp.json creation.
Remove MCP Integrations
# Remove specific platform
mcp-vector-search uninstall claude-code
# Remove all integrations
mcp-vector-search uninstall --all
# List configured integrations
mcp-vector-search uninstall list
Basic Usage
# Search your code
mcp-vector-search search "authentication logic"
mcp-vector-search search "database connection setup"
mcp-vector-search search "error handling patterns"
# Index your codebase (if not done during setup)
mcp-vector-search index
# Index with code-specific embeddings (CodeT5+)
mcp-vector-search index-code
# Check project status
mcp-vector-search status
# Start file watching (auto-update index)
mcp-vector-search watch
# Interactive visualization (5+ views)
mcp-vector-search visualize
# Generate development narrative from git history
mcp-vector-search story
# Knowledge graph operations
mcp-vector-search kg build
mcp-vector-search kg status
mcp-vector-search kg query "find all Python functions"
# Chat mode with LLM
mcp-vector-search chat "explain the authentication flow"
# Code analysis
mcp-vector-search analyze complexity
mcp-vector-search analyze dead-code
Smart CLI with "Did You Mean" Suggestions
The CLI includes intelligent command suggestions for typos:
# Typos are detected automatically and the closest command is suggested
$ mcp-vector-search serach "auth"
No such command 'serach'. Did you mean 'search'?
$ mcp-vector-search indx
No such command 'indx'. Did you mean 'index'?
See docs/guides/cli-usage.md for more details.
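As a rough illustration of how this kind of suggestion can work, the sketch below uses Python's standard difflib module to pick the closest known command. It is not the tool's actual implementation: the command list, the function name suggest_command, and the 0.6 cutoff are all assumptions made for the example.

```python
import difflib
from typing import Optional, Sequence

# Hypothetical command list for illustration; the real CLI registers its own.
KNOWN_COMMANDS = ["search", "index", "status", "watch", "setup", "install"]


def suggest_command(typed: str, commands: Sequence[str] = KNOWN_COMMANDS) -> Optional[str]:
    """Return the closest known command, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(typed, commands, n=1, cutoff=0.6)
    return matches[0] if matches else None


for typo in ("serach", "indx"):
    suggestion = suggest_command(typo)
    print(f"No such command {typo!r}. Did you mean {suggestion!r}?")
```

A similarity cutoff keeps unrelated input from producing a misleading suggestion; when nothing clears the cutoff the function returns None.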
Versioning & Releasing
This project uses semantic versioning with an automated release workflow.
Quick Commands
- make version-show - Display current version
- make release-patch - Create patch release
- make publish - Publish to PyPI
See docs/development/versioning.md for complete documentation.
AI Code Review
Context-aware code review using your entire codebase as context, not just diff analysis!
What Makes It Different
Traditional code review tools only see individual files or diffs. MCP Vector Search analyzes code with full codebase context by:
- Semantic Search: Finding related patterns and similar implementations
- Knowledge Graph: Understanding dependencies and callers
- LLM Analysis: Deep analysis with language-specific standards
- Smart Caching: 5x speedup with intelligent result caching
Quick Examples
# Security review of your codebase
mvs analyze review security
# Review a pull request with full context
mvs analyze review-pr --baseline main --head feature-branch
# Review only changed files (fast!)
mvs analyze review security --changed-only --baseline main
# Run multiple review types at once
mvs analyze review --types security,quality,architecture
Review Types
| Type | Focus | Key Checks |
|---|---|---|
| security | OWASP Top 10, CWE | SQL injection, XSS, auth flaws, hardcoded secrets |
| architecture | SOLID principles | Coupling, circular deps, god classes, SRP violations |
| performance | Efficiency | N+1 queries, O(n²) algorithms, blocking I/O |
| quality | Maintainability | Code smells, duplication, magic numbers, dead code |
| testing | Test coverage | Missing tests, edge cases, test quality |
| documentation | Code docs | Missing docstrings, TODOs, outdated comments |
PR Review with Context
The killer feature is reviewing PRs with the entire codebase as context:
# Review PR with context-aware analysis
mvs analyze review-pr --baseline main --format github-json
# For each changed file, finds:
# - Similar patterns in codebase (consistency checking)
# - Callers and dependencies (impact analysis)
# - Existing tests (coverage gaps)
# - Language-specific idioms (12 languages supported)
Context Strategy:
Changed File → Vector Search (similar patterns)
             → Knowledge Graph (callers, deps)
             → Test Discovery (coverage)
             → LLM Analysis (with full context)
             → Actionable Comments
Multi-Language Support
12 languages with language-specific idioms, anti-patterns, and security checks:
Python • TypeScript • JavaScript • Java • C# • Ruby • Go • Rust • PHP • Swift • Kotlin • Scala
Each language has tailored standards:
- Python: PEP 8, type hints, context managers, SQL injection patterns
- TypeScript: Strict mode, no any, XSS patterns
- Java: SOLID principles, Optional over null, XXE patterns
- Ruby: Guard clauses, blocks, RuboCop standards
- Go: Error handling, goroutines, interfaces
Custom Instructions
Create .mcp-vector-search/review-instructions.yaml:
language_standards:
  python:
    - "Enforce type hints on all public functions"
    - "Use Pydantic for data validation"
scope_standards:
  src/auth:
    - "All auth functions must have audit logging"
custom_review_focus:
  security:
    - "Flag any hardcoded credentials"
Auto-Discovery
Automatically reads and applies standards from your existing config files:
- Python: pyproject.toml, .flake8, mypy.ini, ruff.toml
- TypeScript: tsconfig.json, .eslintrc.json
- Ruby: .rubocop.yml
- Java: checkstyle.xml, pom.xml
- +8 more languages
CI/CD Integration
# .github/workflows/code-review.yml
- name: Review PR
  run: |
    mvs analyze review-pr \
      --baseline ${{ github.base_ref }} \
      --format sarif \
      --output review.sarif
- name: Upload to Security tab
  uses: github/codeql-action/upload-sarif@v3
  with:
    sarif_file: review.sarif
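If you also want a CI step to act on the SARIF file itself, a small script can summarize it. The sketch below relies only on the general SARIF 2.1.0 layout (runs, results, level) and is not specific to this tool's output; the helper names and the sample document are made up for illustration.

```python
import json
from collections import Counter


def load_sarif(path: str) -> dict:
    """Read a SARIF file such as review.sarif from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def count_sarif_levels(sarif: dict) -> Counter:
    """Count results per level across all runs.

    SARIF treats a missing level as "warning", so that is the fallback here.
    """
    counts = Counter()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            counts[result.get("level", "warning")] += 1
    return counts


# Hand-written sample standing in for a real review.sarif.
example = {
    "version": "2.1.0",
    "runs": [{
        "tool": {"driver": {"name": "example"}},
        "results": [
            {"level": "error", "message": {"text": "hardcoded secret"}},
            {"level": "note", "message": {"text": "missing docstring"}},
            {"message": {"text": "no explicit level"}},
        ],
    }],
}
counts = count_sarif_levels(example)
print(dict(counts))
```

A pipeline could then exit non-zero when counts["error"] is greater than zero, which keeps the gating policy in your own script rather than in the review tool.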
Output Formats
- console: Rich, colored output for humans
- json: Machine-readable structured data
- sarif: GitHub Security tab integration
- markdown: Reports for documentation
- github-json: PR comments (summary + inline)
Performance
- Vector Search: <0.5s (find relevant code)
- KG Queries: <0.2s (relationships)
- LLM Analysis: 10-15s (deep analysis)
- Cache Hit: 5x speedup on repeat reviews
Smart Caching: Unchanged code chunks return cached findings instantly.
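One common way to get this behavior is to key cached findings by a hash of the chunk content, so any edit to a chunk invalidates its entry automatically. The sketch below shows that idea only; the class name FindingsCache, the key layout, and the in-memory dict are assumptions, not the project's actual cache.

```python
import hashlib


class FindingsCache:
    """In-memory cache of review findings keyed by chunk content hash."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(chunk_text: str, review_type: str) -> str:
        digest = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
        return f"{review_type}:{digest}"

    def get_or_compute(self, chunk_text, review_type, analyze):
        """Return cached findings, calling analyze(chunk_text) only on a miss."""
        key = self._key(chunk_text, review_type)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        findings = analyze(chunk_text)
        self._store[key] = findings
        return findings


calls = []


def fake_analyze(text):
    # Stand-in for the slow LLM analysis step.
    calls.append(text)
    return [f"finding for {len(text)} chars"]


cache = FindingsCache()
first = cache.get_or_compute("def login(): ...", "security", fake_analyze)
second = cache.get_or_compute("def login(): ...", "security", fake_analyze)
third = cache.get_or_compute("def login(user): ...", "security", fake_analyze)
print(cache.hits, cache.misses)
```

Including the review type in the key keeps a security finding from being returned for, say, a documentation review of the same chunk.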
Learn More
Complete Documentation: Architecture, examples, best practices
CI/CD Integration Guide: GitHub Actions, GitLab CI, pre-commit hooks
Multi-Language Support: 12 languages with standards
Documentation
Commands
setup - Zero-Config Smart Setup (Recommended)
# One command to do everything (recommended)
mcp-vector-search setup
# What it does automatically:
# - Detects project languages and file types
# - Initializes semantic search
# - Indexes entire codebase
# - Configures all detected MCP platforms
# - Sets up file watching
# - Zero configuration needed!
# Force re-setup
mcp-vector-search setup --force
# Verbose output for debugging
mcp-vector-search setup --verbose
Key Features:
- Zero Configuration: No user input required
- Smart Detection: Automatically discovers languages and platforms
- Comprehensive: Handles init + index + MCP setup in one command
- Idempotent: Safe to run multiple times
- Fast: Timeout-protected scanning (won't hang on large projects)
- Team-Friendly: Commit .mcp.json to share configuration
When to use:
- First-time project setup
- Team onboarding
- Quick testing in new codebases
- Setting up multiple MCP platforms at once
install - Install Project and MCP Integrations (Advanced)
# Manual setup with more control
mcp-vector-search install
# Install with all MCP integrations
mcp-vector-search install --with-mcp
# Custom file extensions
mcp-vector-search install --extensions .py,.js,.ts
# Skip automatic indexing
mcp-vector-search install --no-auto-index
# Platform-specific MCP integration
mcp-vector-search install claude-code # Project-scoped
mcp-vector-search install cursor # Global
mcp-vector-search install windsurf # Global
mcp-vector-search install vscode # Global
# List available platforms
mcp-vector-search install list
When to use:
- Use install when you need fine-grained control over extensions, models, or MCP platforms
- Use setup for quick, zero-config onboarding (recommended)
uninstall - Remove MCP Integrations
# Remove specific platform
mcp-vector-search uninstall claude-code
# Remove all integrations
mcp-vector-search uninstall --all
# List configured integrations
mcp-vector-search uninstall list
# Skip backup creation
mcp-vector-search uninstall claude-code --no-backup
# Alias (same as uninstall)
mcp-vector-search remove claude-code
init - Initialize Project (Simple)
# Basic initialization (no indexing or MCP)
mcp-vector-search init
# Custom configuration
mcp-vector-search init --extensions .py,.js,.ts --embedding-model sentence-transformers/all-MiniLM-L6-v2
# Force re-initialization
mcp-vector-search init --force
Note: For most users, use setup instead of init. The init command is for advanced users who want manual control.
index - Index Codebase
# Index all files
mcp-vector-search index
# Index specific directory
mcp-vector-search index /path/to/code
# Force re-indexing
mcp-vector-search index --force
# Reindex entire project
mcp-vector-search index reindex
# Reindex entire project (explicit)
mcp-vector-search index reindex --all
# Reindex entire project without confirmation
mcp-vector-search index reindex --force
# Reindex specific file
mcp-vector-search index reindex path/to/file.py
search - Semantic Search
# Basic search
mcp-vector-search search "function that handles user authentication"
# Adjust similarity threshold
mcp-vector-search search "database queries" --threshold 0.7
# Limit results
mcp-vector-search search "error handling" --limit 10
# Search in specific context
mcp-vector-search search similar "path/to/function.py:25"
auto-index - Automatic Reindexing
# Setup all auto-indexing strategies
mcp-vector-search auto-index setup --method all
# Setup specific strategies
mcp-vector-search auto-index setup --method git-hooks
mcp-vector-search auto-index setup --method scheduled --interval 60
# Check for stale files and auto-reindex
mcp-vector-search auto-index check --auto-reindex --max-files 10
# View auto-indexing status
mcp-vector-search auto-index status
# Remove auto-indexing setup
mcp-vector-search auto-index teardown --method all
watch - File Watching
# Start watching for changes
mcp-vector-search watch
# Check watch status
mcp-vector-search watch status
# Enable/disable watching
mcp-vector-search watch enable
mcp-vector-search watch disable
status - Project Information
# Basic status
mcp-vector-search status
# Detailed information
mcp-vector-search status --verbose
config - Configuration Management
# View configuration
mcp-vector-search config show
# Update settings
mcp-vector-search config set similarity_threshold 0.8
mcp-vector-search config set embedding_model microsoft/codebert-base
# Configure indexing behavior
mcp-vector-search config set skip_dotfiles true # Skip dotfiles (default)
mcp-vector-search config set respect_gitignore true # Respect .gitignore (default)
# Get specific setting
mcp-vector-search config get skip_dotfiles
mcp-vector-search config get respect_gitignore
# List available models
mcp-vector-search config models
# List all configuration keys
mcp-vector-search config list-keys
index-code - Code-Specific Embeddings
# Index with CodeT5+ embeddings (code-optimized)
mcp-vector-search index-code
# Feature-flagged via environment variable
export MCP_CODE_ENRICHMENT=true
mcp-vector-search index-code
visualize - Interactive D3.js Visualization
# Launch visualization server
mcp-vector-search visualize
# Start on custom port
mcp-vector-search visualize --port 8080
# Available views:
# - Treemap: Hierarchical view with size/complexity encoding
# - Sunburst: Radial hierarchical view
# - Force Graph: Network visualization of code relationships
# - Knowledge Graph: Entity and relationship visualization
# - Heatmap: Complexity and quality heatmap
story - Development Narrative Generation
# Generate development narrative from git history
mcp-vector-search story
# Output formats
mcp-vector-search story --format markdown
mcp-vector-search story --format json
mcp-vector-search story --format html
# Serve as HTTP endpoint
mcp-vector-search story --serve
# Extract-only mode (no LLM)
mcp-vector-search story --no-llm
# Custom LLM model
mcp-vector-search story --model gpt-4o
kg - Knowledge Graph Operations
# Build knowledge graph
mcp-vector-search kg build
# Check knowledge graph status
mcp-vector-search kg status
# Query knowledge graph
mcp-vector-search kg query "find all Python functions"
mcp-vector-search kg query "show classes in module auth"
# Browse document ontology (file-level document classification)
mcp-vector-search kg ontology
mcp-vector-search kg ontology --category guide # filter by category
mcp-vector-search kg ontology --verbose # include file paths
# Knowledge graph entities:
# - CodeFile, Function, Class, Person
# - ProgrammingLanguage, ProgrammingFramework
# - Document (file-level, with doc_category classification)
# - Topic (hierarchical taxonomy)
chat - LLM-Powered Code Q&A
# Ask questions about your codebase
mcp-vector-search chat "explain the authentication flow"
mcp-vector-search chat "how does error handling work?"
# Iterative refinement (up to 30 queries)
# Automatically uses deep search and KG query tools
# Advanced reasoning mode
mcp-vector-search chat "architectural patterns" --think
# Filter by files
mcp-vector-search chat "validation logic" --files "src/*.py"
analyze - Code Analysis
# Complexity analysis
mcp-vector-search analyze complexity
# Dead code detection
mcp-vector-search analyze dead-code
# Output formats
mcp-vector-search analyze complexity --json
mcp-vector-search analyze complexity --sarif
mcp-vector-search analyze complexity --output-format markdown
# CI/CD integration
mcp-vector-search analyze complexity --fail-on-smell
Performance Features
Search Optimizations
MCP Vector Search includes several query-time optimizations that are automatically enabled as your index grows.
IVF-PQ Index is built automatically after indexing more than 256 rows. It uses Inverted File with Product Quantization to partition vectors into clusters, so queries scan only a relevant subset rather than the full index. The index parameters adapt to your data: num_partitions = clamp(sqrt(N), 16, 512) and num_sub_vectors = dim // 4.
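The two parameter formulas can be written out directly. In this sketch the function name is illustrative, and truncating the square root to an integer (math.isqrt) is an assumption, since the text does not say how sqrt(N) is rounded.

```python
import math


def ivf_pq_params(num_rows: int, dim: int) -> tuple:
    """Return (num_partitions, num_sub_vectors) for the stated formulas."""
    # clamp(sqrt(N), 16, 512); integer truncation is an assumption.
    num_partitions = max(16, min(512, math.isqrt(num_rows)))
    # dim // 4, e.g. 384-dimensional embeddings give 96 sub-vectors.
    num_sub_vectors = dim // 4
    return num_partitions, num_sub_vectors


for rows, dim in [(300, 384), (10_000, 384), (1_000_000, 768)]:
    print(rows, dim, ivf_pq_params(rows, dim))
```

Small indexes hit the lower bound of 16 partitions and very large ones are capped at 512, so the partition count only tracks sqrt(N) between roughly 256 and 262,144 rows.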
Two-stage retrieval improves precision on top of the IVF-PQ scan: the engine probes 20 IVF partitions (nprobes=20) and fetches 5x the requested candidates, then reranks them with exact cosine similarity (refine_factor=5). This applies to both the LanceDB and legacy vector backends.
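The over-fetch-then-rerank idea can be sketched in plain Python. This is a toy model, not the engine's code: the sign-agreement coarse_score is a deliberately crude stand-in for a quantized distance, partition probing (nprobes) is not modeled at all, and every name here is hypothetical.

```python
import math


def cosine(a, b):
    """Exact cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def coarse_score(query, vec):
    """Cheap, lossy stand-in for an approximate distance: count matching signs."""
    return sum(1 for q, v in zip(query, vec) if (q >= 0) == (v >= 0))


def two_stage_search(query, vectors, k, refine_factor=5):
    """Return indices of the top-k vectors using over-fetch plus exact rerank."""
    # Stage 1: over-fetch k * refine_factor candidates with the approximate score.
    by_coarse = sorted(range(len(vectors)),
                       key=lambda i: coarse_score(query, vectors[i]),
                       reverse=True)
    candidates = by_coarse[: k * refine_factor]
    # Stage 2: rerank only those candidates with exact cosine similarity.
    reranked = sorted(candidates,
                      key=lambda i: cosine(query, vectors[i]),
                      reverse=True)
    return reranked[:k]


vectors = [[1, 0], [0.9, 0.1], [0, 1], [-1, 0], [0.5, 0.5], [-0.2, -0.9]]
query = [1, 0.05]
top = two_stage_search(query, vectors, k=2)
print(top)
```

The exact comparison runs on at most k * refine_factor vectors, so its cost stays bounded even when the index is large, while the rerank corrects ordering mistakes made by the approximate stage.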
Contextual chunking prepends a compact metadata header to each chunk before embedding, so the vector captures file, language, class, and function context rather than code text alone. Format: File: core/search.py | Lang: python | Class: Engine | Fn: search | Uses: lancedb. Based on Anthropic research showing 35-49% fewer retrieval failures.
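A header in that format could be assembled as follows. The function names, and the choice to omit fields that are missing, are assumptions for this sketch; only the field labels and separator come from the format shown above.

```python
def build_context_header(file, lang, cls=None, fn=None, uses=()):
    """Build a compact 'File: ... | Lang: ...' header; empty fields are skipped."""
    parts = [f"File: {file}", f"Lang: {lang}"]
    if cls:
        parts.append(f"Class: {cls}")
    if fn:
        parts.append(f"Fn: {fn}")
    if uses:
        parts.append("Uses: " + ", ".join(uses))
    return " | ".join(parts)


def contextualize(chunk_text, **metadata):
    """Prepend the header so the embedded text carries its structural context."""
    return build_context_header(**metadata) + "\n" + chunk_text


header = build_context_header("core/search.py", "python",
                              cls="Engine", fn="search", uses=["lancedb"])
print(header)
print(contextualize("def search(self, query): ...",
                    file="core/search.py", lang="python", fn="search"))
```

The text passed to the embedding model is the header plus the code, so two identically named functions in different files or classes end up with different vectors.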
| Optimization | Impact |
|---|---|
| IVF-PQ index + two-stage retrieval | 4.9x faster queries (3.4ms vs 16.7ms median) |
| Contextual chunking | 35-49% fewer retrieval failures |
| Pipeline parallelism | 37% faster indexing |
| Apple Silicon MPS | 2-4x faster embedding generation |
See docs/performance/search-optimizations.md for technical details and benchmark methodology.
LanceDB Backend (Default in v2.1+)
LanceDB is now the default vector database for better performance and stability:
- Serverless Architecture: No separate server process needed
- Better Scaling: Superior performance for large codebases (>100k chunks)
- File-Based Storage: Simple directory-based persistence
- Fewer Corruption Issues: More stable than ChromaDB's HNSW indices
- Write Buffering: 2-4x faster indexing with accumulated batch writes
To use ChromaDB (legacy), set environment variable:
export MCP_VECTOR_SEARCH_BACKEND=chromadb
Migrate existing ChromaDB database:
mcp-vector-search migrate db chromadb-to-lancedb
See docs/LANCEDB_BACKEND.md for detailed documentation.
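Write buffering in general means collecting records in memory and writing them in batches instead of one at a time. The class below is a generic sketch of that pattern with a caller-supplied flush function; it does not use the LanceDB API, and the name WriteBuffer and the default batch size are assumptions.

```python
class WriteBuffer:
    """Accumulate records in memory and hand them to flush_fn in batches."""

    def __init__(self, flush_fn, batch_size=1000):
        self._flush_fn = flush_fn
        self._batch_size = batch_size
        self._pending = []

    def add(self, record):
        """Queue one record, flushing automatically when the batch is full."""
        self._pending.append(record)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self):
        """Write out whatever is pending; safe to call when empty."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._flush_fn(batch)


batches = []
buffer = WriteBuffer(batches.append, batch_size=2)
for record in range(5):
    buffer.add(record)
buffer.flush()  # write the final partial batch
print(batches)
```

The explicit flush at the end matters: without it, the last partial batch would stay in memory and never reach storage.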
Apple Silicon M4 Max Optimizations
2-4x speedup on Apple Silicon with automatic hardware detection:
- MPS Backend: Metal Performance Shaders GPU acceleration for embeddings
- Intelligent Batch Sizing: Auto-detects GPU memory and picks a batch size to match (384-512 for M4 Max with 128GB RAM)
- Multi-Core Optimization: Utilizes all 12 performance cores efficiently
- Zero Configuration: Automatically enabled on Apple Silicon Macs
Environment variables for tuning:
export MCP_VECTOR_SEARCH_MPS_BATCH_SIZE=512 # Override MPS batch size
export MCP_VECTOR_SEARCH_BATCH_SIZE=128 # Override all backends
Semi-Automatic Reindexing
Multiple strategies to keep your index up-to-date without daemon processes:
- Search-Triggered: Automatically checks for stale files during searches
- Git Hooks: Triggers reindexing after commits, merges, checkouts
- Scheduled Tasks: System-level cron jobs or Windows tasks
- Manual Checks: On-demand via CLI commands
- Periodic Checker: In-process periodic checks for long-running apps
# Setup all strategies
mcp-vector-search auto-index setup --method all
# Check status
mcp-vector-search auto-index status
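A staleness check of the kind these strategies rely on can be as simple as comparing each file's current modification time with the time recorded at indexing. The sketch below assumes the index keeps such a path-to-mtime mapping; that mapping and the function name are hypothetical.

```python
import os
import tempfile


def find_stale_files(indexed_mtimes):
    """Return paths whose file changed (or vanished) since they were indexed."""
    stale = []
    for path, indexed_mtime in indexed_mtimes.items():
        try:
            current_mtime = os.path.getmtime(path)
        except OSError:
            stale.append(path)  # deleted or unreadable: the index entry is outdated
            continue
        if current_mtime > indexed_mtime:
            stale.append(path)
    return stale


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.py")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("print('hi')\n")
    recorded = {path: os.path.getmtime(path)}
    fresh = find_stale_files(recorded)
    # Simulate a later edit by moving the modification time forward.
    later = recorded[path] + 10
    os.utime(path, (later, later))
    stale = find_stale_files(recorded)

print(fresh, stale)
```

Because the check only reads file metadata, it is cheap enough to run before a search or from a Git hook without a background daemon.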
Configuration
Projects are configured via .mcp-vector-search/config.json:
{
  "project_root": "/path/to/project",
  "file_extensions": [".py", ".js", ".ts"],
  "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
  "similarity_threshold": 0.75,
  "languages": ["python", "javascript", "typescript"],
  "watch_files": true,
  "cache_embeddings": true,
  "skip_dotfiles": true,
  "respect_gitignore": true
}
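If you need to read this file from your own scripts, a small loader that overlays it on the documented defaults is usually enough. The sketch below is not part of the tool: the function name is made up, and only the three defaults stated in this README are filled in.

```python
import json
import tempfile
from pathlib import Path

# Defaults documented in this README; other keys have no stated default.
DOCUMENTED_DEFAULTS = {
    "skip_dotfiles": True,
    "respect_gitignore": True,
    "force_include_patterns": [],
}


def load_project_config(project_root):
    """Merge .mcp-vector-search/config.json over the documented defaults."""
    config_path = Path(project_root) / ".mcp-vector-search" / "config.json"
    config = dict(DOCUMENTED_DEFAULTS)
    if config_path.exists():
        config.update(json.loads(config_path.read_text(encoding="utf-8")))
    return config


with tempfile.TemporaryDirectory() as root:
    config_dir = Path(root) / ".mcp-vector-search"
    config_dir.mkdir()
    (config_dir / "config.json").write_text(
        json.dumps({"similarity_threshold": 0.75, "skip_dotfiles": False}),
        encoding="utf-8",
    )
    config = load_project_config(root)

print(config)
```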
Indexing Configuration Options
skip_dotfiles (default: true)
- Controls whether files and directories starting with "." are skipped during indexing
- Whitelisted directories are always indexed regardless of this setting:
  - .github/ - GitHub workflows and actions
  - .gitlab-ci/ - GitLab CI configuration
  - .circleci/ - CircleCI configuration
- When false: All dotfiles are indexed (subject to gitignore rules if respect_gitignore is true)
respect_gitignore (default: true)
- Controls whether .gitignore patterns are respected during indexing
- When false: Files in .gitignore are indexed (subject to skip_dotfiles if enabled)
force_include_patterns (default: [])
- Glob patterns to force-include files/directories even if they are gitignored
- Patterns support ** for recursive matching (e.g., repos/**/*.java matches all Java files in repos/ and subdirectories)
- Force-include patterns override .gitignore rules, allowing selective indexing of gitignored directories
- Example use case: Index specific file types in a gitignored repos/ directory
Example: Force-include Java files from gitignored directory
# Set force_include_patterns via JSON list
mcp-vector-search config set force_include_patterns '["repos/**/*.java", "repos/**/*.kt"]'
# Or add patterns one at a time (requires custom CLI command)
# This allows .gitignore to exclude repos/ from git, but mcp-vector-search still indexes Java/Kotlin files
Example config.json with force_include_patterns:
{
  "respect_gitignore": true,
  "force_include_patterns": [
    "repos/**/*.java",
    "repos/**/*.kt",
    "vendor/internal/**/*.go"
  ]
}
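To make the interaction of the three options concrete, here is one way the decision could be modeled. This is a sketch of the behavior described above, not the indexer's code: the caller supplies whether a path is gitignored, the order of the checks is an assumption, and glob_to_regex is a simplified pattern translator written for this example.

```python
import re

WHITELISTED_DOT_DIRS = {".github", ".gitlab-ci", ".circleci"}


def glob_to_regex(pattern):
    """Translate a glob where '**/' spans zero or more directories."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")  # a single '*' stays within one path segment
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def should_index(path, *, gitignored, skip_dotfiles=True,
                 respect_gitignore=True, force_include_patterns=()):
    """Decide whether a '/'-separated relative path would be indexed."""
    if skip_dotfiles:
        for part in path.split("/"):
            if part.startswith(".") and part not in WHITELISTED_DOT_DIRS:
                return False
    if respect_gitignore and gitignored:
        # Force-include patterns are the only way past .gitignore.
        return any(glob_to_regex(p).fullmatch(path)
                   for p in force_include_patterns)
    return True


patterns = ["repos/**/*.java", "repos/**/*.kt"]
print(should_index("repos/app/Main.java", gitignored=True,
                   force_include_patterns=patterns))
print(should_index("repos/app/Main.class", gitignored=True,
                   force_include_patterns=patterns))
```

With these patterns a gitignored repos/ directory contributes only its .java and .kt files, while build outputs such as .class files stay excluded.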
Configuration Use Cases
Default Behavior (Recommended for most projects):
# Skip dotfiles AND respect .gitignore
mcp-vector-search config set skip_dotfiles true
mcp-vector-search config set respect_gitignore true
Index Everything (Useful for deep code analysis):
# Index all files including dotfiles and gitignored files
mcp-vector-search config set skip_dotfiles false
mcp-vector-search config set respect_gitignore false
Index Dotfiles but Respect .gitignore:
# Index configuration files but skip build artifacts
mcp-vector-search config set skip_dotfiles false
mcp-vector-search config set respect_gitignore true
Skip Dotfiles but Ignore .gitignore:
# Useful when you want to index files in .gitignore but skip hidden config files
mcp-vector-search config set skip_dotfiles true
mcp-vector-search config set respect_gitignore false
Selective Gitignore Override with Force-Include Patterns:
# Index specific file types from gitignored directories
# Example: .gitignore excludes repos/, but you want to index Java/Kotlin files
mcp-vector-search config set respect_gitignore true
mcp-vector-search config set force_include_patterns '["repos/**/*.java", "repos/**/*.kt"]'
# This allows:
# - .gitignore to exclude repos/ from git (keeps your repo clean)
# - mcp-vector-search to index Java/Kotlin files in repos/ (semantic search)
# - Other files in repos/ (e.g., .class, .jar) remain excluded
Architecture
Core Components
- Parser Registry: Extensible system for language-specific parsing
- Semantic Indexer: Efficient code chunking and embedding generation
- Vector Database: LanceDB for similarity search
- File Watcher: Real-time monitoring and incremental updates
- CLI Interface: Rich, user-friendly command-line experience
Supported Languages
MCP Vector Search supports 13 languages and document formats with semantic search (Text and Markdown share one row in the table below):
| Language | Extensions | Status | Features |
|---|---|---|---|
| Python | .py, .pyw | Full | Functions, classes, methods, docstrings |
| JavaScript | .js, .jsx, .mjs | Full | Functions, classes, JSDoc, ES6+ syntax |
| TypeScript | .ts, .tsx | Full | Interfaces, types, generics, decorators |
| C# | .cs | Full | Classes, interfaces, structs, enums, methods, XML docs, attributes |
| Dart | .dart | Full | Functions, classes, widgets, async, dartdoc |
| PHP | .php, .phtml | Full | Classes, methods, traits, PHPDoc, Laravel patterns |
| Ruby | .rb, .rake, .gemspec | Full | Modules, classes, methods, RDoc, Rails patterns |
| Java | .java | Full | Classes, methods, annotations, interfaces |
| Go | .go | Full | Functions, structs, interfaces, packages |
| Rust | .rs | Full | Functions, structs, traits, implementations |
| HTML | .html, .htm | Full | Semantic content extraction, heading hierarchy, text chunking |
| Text/Markdown | .txt, .md, .markdown | Basic | Semantic chunking for documentation |
New Language Support
HTML Support (Unreleased):
- Semantic Extraction: Content from h1-h6, p, section, article, main, aside, nav, header, footer
- Intelligent Chunking: Based on heading hierarchy (h1-h6)
- Context Preservation: Maintains class and id attributes for searchability
- Script/Style Filtering: Ignores non-content elements
- Use Cases: Static sites, documentation, web templates, HTML fragments
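Heading-based chunking with script/style filtering can be illustrated with the standard library's html.parser. The sketch below is a simplified stand-in for the project's HTML parser: it starts a new chunk at every h1-h6, drops script and style content, and ignores the class/id preservation mentioned above. All names are hypothetical.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
SKIPPED = {"script", "style"}


class HeadingChunker(HTMLParser):
    """Split HTML text into chunks, one per heading and the content under it."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0
        self._in_heading = False
        self._current = {"heading": "", "level": 0, "text": []}

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self._skip_depth += 1
        elif tag in HEADINGS:
            self._flush()  # a new heading closes the previous chunk
            self._current = {"heading": "", "level": int(tag[1]), "text": []}
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag in HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return  # whitespace, or script/style content
        if self._in_heading:
            self._current["heading"] = (self._current["heading"] + " " + text).strip()
        else:
            self._current["text"].append(text)

    def _flush(self):
        if self._current["heading"] or self._current["text"]:
            self.chunks.append({
                "heading": self._current["heading"],
                "level": self._current["level"],
                "text": " ".join(self._current["text"]),
            })

    def close(self):
        super().close()
        self._flush()
        self._current = {"heading": "", "level": 0, "text": []}


def chunk_html(html):
    parser = HeadingChunker()
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = ("<h1>Guide</h1><p>Intro text.</p><script>var x = 1;</script>"
          "<h2>Install</h2><p>Run the installer.</p>")
chunks = chunk_html(sample)
print(chunks)
```

Each chunk keeps its heading level, so a consumer could rebuild the hierarchy (for example, attaching an h2 chunk to the nearest preceding h1) before embedding.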
Dart/Flutter Support (v0.4.15):
- Widget Detection: StatelessWidget, StatefulWidget recognition
- State Classes: Automatic parsing of _WidgetNameState patterns
- Async Support: Future and async function handling
- Dartdoc: Triple-slash comment extraction
- Tree-sitter AST: Fast, accurate parsing with regex fallback
PHP Support (v0.5.0):
- Class Detection: Classes, interfaces, traits
- Method Extraction: Public, private, protected, static methods
- Magic Methods: __construct, __get, __set, __call, etc.
- PHPDoc: Full comment extraction
- Laravel Patterns: Controllers, Models, Eloquent support
- Tree-sitter AST: Fast parsing with regex fallback
Ruby Support (v0.5.0):
- Module/Class Detection: Full namespace support (::)
- Method Extraction: Instance and class methods
- Special Syntax: Method names with ?, ! support
- Attribute Macros: attr_accessor, attr_reader, attr_writer
- RDoc: Comment extraction (# and =begin...=end)
- Rails Patterns: ActiveRecord, Controllers support
- Tree-sitter AST: Fast parsing with regex fallback
Contributing
We welcome contributions! Please see our Contributing Guide for details.
Development Setup
# Clone the repository
git clone https://github.com/bobmatnyc/mcp-vector-search.git
cd mcp-vector-search
# Install development environment (includes dependencies + editable install)
make dev
# Test CLI from source (recommended during development)
./dev-mcp version # Shows [DEV] indicator
./dev-mcp search "test" # No reinstall needed after code changes
# Run tests and quality checks
make test-unit # Run unit tests
make quality # Run linting and type checking
make fix # Auto-fix formatting issues
# View all available targets
make help
For detailed development workflow and dev-mcp usage, see the Development section below.
Adding Language Support
- Create a new parser in src/mcp_vector_search/parsers/
- Extend the BaseParser class
- Register the parser in parsers/registry.py
- Add tests and documentation
Performance
- Indexing Speed: ~1000 files/minute (typical Python project)
- Search Latency: 3.4ms median with IVF-PQ index (4.9x faster than without)
- Memory Usage: ~50MB baseline + ~1MB per 1000 code chunks
- Storage: ~1KB per code chunk (compressed embeddings)
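The memory and storage figures above lend themselves to a quick back-of-the-envelope estimate. The helper below just applies those approximate numbers; the function name is made up, and real usage will vary with the embedding model and chunk sizes.

```python
def estimate_resources(num_chunks):
    """Return (memory_mb, storage_kb) from the approximate figures above."""
    memory_mb = 50 + num_chunks / 1000  # ~50MB baseline + ~1MB per 1000 chunks
    storage_kb = num_chunks * 1         # ~1KB per code chunk
    return memory_mb, storage_kb


memory_mb, storage_kb = estimate_resources(100_000)
print(f"~{memory_mb:.0f} MB memory, ~{storage_kb / 1000:.0f} MB storage")
```

For a 100,000-chunk index this works out to roughly 150 MB of memory and about 100 MB on disk.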
Known Limitations (Alpha)
- Tree-sitter Integration: Currently using regex fallback parsing (Tree-sitter setup needs improvement)
- Search Relevance: Embedding model may need tuning for code-specific queries
- Error Handling: Some edge cases may not be gracefully handled
- Documentation: API documentation is minimal
- Testing: Limited test coverage, needs real-world validation
Feedback Needed
We're actively seeking feedback on:
- Search Quality: How relevant are the search results for your codebase?
- Performance: How does indexing and search speed feel in practice?
- Usability: Is the CLI interface intuitive and helpful?
- Language Support: Which languages would you like to see added next?
- Features: What functionality is missing for your workflow?
Please open an issue or start a discussion to share your experience!
Roadmap
v2.5: Production (Current)
- Core CLI interface
- Multi-language parsing (13 languages: Python, JavaScript, TypeScript, C#, Dart, PHP, Ruby, Java, Go, Rust, HTML, Markdown, Text)
- LanceDB default backend (ChromaDB legacy support)
- Apple Silicon optimizations (2-4x speedup with MPS)
- File watching and auto-reindexing
- MCP server implementation with 17 tools
- Advanced search modes (semantic, contextual, similar code)
- Code analysis tools (complexity, dead code detection, code smells)
- Interactive D3.js visualization (5+ views: Treemap, Sunburst, Force Graph, KG, Heatmap)
- Knowledge Graph with KuzuDB (entity extraction, relationship mapping)
- Development narrative generation (story command)
- Chat mode with LLM integration (iterative refinement, up to 30 queries)
- CodeT5+ code-specific embeddings
- Pipeline parallelism (37% faster indexing)
- Production-ready performance (write buffering, GPU acceleration, async pipeline)
- IVF-PQ vector index with two-stage retrieval (4.9x faster queries)
- Contextual chunking (metadata-enriched embeddings, 35-49% fewer retrieval failures)
- CodeRankEmbed model support (nomic-ai/CodeRankEmbed, 768d, 8K context)
- Document ontology with 23 categories (kg ontology command)
v2.6+: Enhancements
- Hybrid search (vector + keyword + BM25)
- Additional language support (more languages beyond 13)
- IDE extensions (VS Code, JetBrains)
- Team collaboration features
- Advanced code refactoring suggestions
- Real-time collaboration on knowledge graph
- Multi-project knowledge graph federation
Development
Three-Stage Development Workflow
Stage A: Local Development & Testing
# Setup development environment
make dev
# Run development tests
make test-unit
# Run CLI from source (recommended during development)
./dev-mcp version # Visual [DEV] indicator
./dev-mcp status # Any command works
./dev-mcp search "auth" # Immediate feedback on changes
# Run quality checks
make quality
# Alternative: use uv run directly
uv run mcp-vector-search version
Using the dev-mcp Development Helper
The ./dev-mcp script provides a streamlined way to run the CLI from source code during development, eliminating the need for repeated installations.
Key Features:
- Visual [DEV] Indicator: Shows [DEV] prefix to distinguish from installed version
- No Reinstall Required: Reflects code changes immediately
- Complete Argument Forwarding: Works with all CLI commands and options
- Verbose Mode: Debug output with --verbose flag
- Built-in Help: Script usage with --help
Usage Examples:
# Basic commands (note the [DEV] prefix in output)
./dev-mcp version
./dev-mcp status
./dev-mcp index
./dev-mcp search "authentication logic"
# With CLI options
./dev-mcp search "error handling" --limit 10
./dev-mcp index --force
# Script verbose mode (shows Python interpreter, paths)
./dev-mcp --verbose search "database"
# Script help (shows dev-mcp usage, not CLI help)
./dev-mcp --help
# CLI command help (forwards --help to the CLI)
./dev-mcp search --help
./dev-mcp index --help
When to Use:
- ./dev-mcp: Development workflow (runs from source code)
- mcp-vector-search: Production usage (runs installed version via pipx/pip)
Benefits:
- Instant Feedback: Changes to source code are reflected immediately
- No Build Step: Skip the reinstall cycle during active development
- Clear Context: Visual [DEV] indicator prevents confusion about which version is running
- Error Handling: Built-in checks for uv installation and project structure
Requirements:
- Must have uv installed (pip install uv)
- Must run from project root directory
- Requires pyproject.toml in current directory
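The checks and argument forwarding described above could look roughly like the sketch below. It is not the contents of the dev-mcp script. The function name build_dev_command is hypothetical, and the sketch assumes the wrapper ends up running the `uv run mcp-vector-search` form shown earlier as the alternative.

```python
import shutil
from pathlib import Path


def build_dev_command(args, cwd=None, which=shutil.which):
    """Return the argv a dev wrapper might run, or raise if a precondition fails.

    Assumption: the wrapper delegates to `uv run mcp-vector-search ...`.
    `which` is injectable so the uv lookup can be replaced in tests.
    """
    root = Path(cwd) if cwd is not None else Path.cwd()
    # Requirement: uv must be installed and on PATH.
    if which("uv") is None:
        raise RuntimeError("uv is not installed; try: pip install uv")
    # Requirement: run from the project root, where pyproject.toml lives.
    if not (root / "pyproject.toml").is_file():
        raise RuntimeError("pyproject.toml not found; run from the project root")
    # Forward every argument unchanged to the CLI.
    return ["uv", "run", "mcp-vector-search", *args]
```

Under these assumptions, calling build_dev_command(["search", "auth"]) from a project root with uv installed yields the argument list for `uv run mcp-vector-search search auth`.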
Stage B: Local Deployment Testing
# Build and test clean deployment
./scripts/deploy-test.sh
# Test on other projects
cd ~/other-project
mcp-vector-search init && mcp-vector-search index
Stage C: PyPI Publication
# Publish to PyPI
./scripts/publish.sh
# Verify published version
pip install mcp-vector-search --upgrade
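After upgrading, the installed version can also be checked from Python using the standard library. This is a small sketch rather than part of the project's tooling. The helper name installed_version is hypothetical, and the distribution name is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="mcp-vector-search"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if installed_version() is None:
    print("mcp-vector-search is not installed in this environment")
else:
    print("installed:", installed_version())
```

Comparing the printed value against the release just published confirms that pip picked up the new version rather than a cached one.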
Quick Reference
./scripts/workflow.sh # Show workflow overview
See DEVELOPMENT.md for detailed development instructions.
Documentation
For comprehensive documentation, see docs/index.md - the complete documentation hub.
Getting Started
- Installation Guide - Complete installation instructions
- First Steps - Quick start tutorial
- Configuration - Basic configuration
User Guides
- Searching Guide - Master semantic code search
- Indexing Guide - Indexing strategies and optimization
- CLI Usage - Advanced CLI features
- MCP Integration - AI tool integration
- File Watching - Real-time index updates
Reference
- CLI Commands - Complete command reference
- Configuration Options - All configuration settings
- Features - Feature overview
- Architecture - System architecture
Development
- Contributing - How to contribute
- Testing - Testing guide
- Code Quality - Linting and formatting
- API Reference - Internal API docs
- Deployment - Release and deployment guide
Advanced
- Troubleshooting - Common issues and solutions
- Performance - Performance optimization
- Extending - Adding new features
Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
License
Elastic License 2.0 - see LICENSE file for details.
Note: This software may not be provided to third parties as a hosted or managed service.
Acknowledgments
- LanceDB for vector database
- Tree-sitter for parsing infrastructure
- Sentence Transformers for embeddings
- Typer for CLI framework
- Rich for beautiful terminal output
Built with ❤️ for developers who love efficient code search