A semantic code search MCP server for AI agents to index and retrieve code from codebases
This project has been archived.
The maintainers of this project have marked this project as archived. No new releases are expected.
SemanticScout
A hybrid code intelligence system for AI agents - combining semantic search with structural understanding
SemanticScout is a Model Context Protocol (MCP) server that provides hybrid code intelligence by combining semantic search with structural code understanding. It goes beyond simple text matching to understand code relationships, dependencies, and architecture.
What's New in v2.4.0
LSP Integration - 7% More Symbols Extracted
- Language Server Protocol (LSP) - Uses real language servers for symbol extraction (default)
- Multi-Language Support - Python (jedi), C# (omnisharp), TypeScript/JavaScript (tsserver)
- Intelligent Fallback - Automatically falls back to tree-sitter if LSP is unavailable
- Session-Based Lifecycle - Servers stay alive for the entire MCP session (no per-request startup overhead)
- 7% More Symbols - LSP extracts 2,722 symbols vs 2,542 for tree-sitter (7.1% improvement)
- Better Accuracy - Language servers provide more accurate symbol information than AST parsing
Performance Comparison: LSP vs Tree-sitter
| Metric | Tree-sitter | LSP (jedi) | Improvement |
|---|---|---|---|
| Symbols Extracted | 2,542 | 2,722 | +7.1% |
| Dependencies Tracked | 63 | 63 | Same |
| Indexing Time | 1.85s | 3.88s | 2.1x slower |
| Accuracy | Good | Excellent | Better |
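The two derived columns in the table can be reproduced from the raw figures. The short check below is only a worked calculation on the numbers quoted above; the variable names are illustrative and not part of SemanticScout.

```python
# Figures copied from the comparison table above.
tree_sitter_symbols = 2542
lsp_symbols = 2722
tree_sitter_seconds = 1.85
lsp_seconds = 3.88

# Relative gain in extracted symbols, as a percentage of the tree-sitter count.
symbol_gain_pct = (lsp_symbols - tree_sitter_symbols) / tree_sitter_symbols * 100

# Ratio of indexing times (values above 1 mean LSP is slower).
slowdown = lsp_seconds / tree_sitter_seconds

print(f"{symbol_gain_pct:.1f}% more symbols")
print(f"{slowdown:.1f}x slower")
```

Note that the 7.1% figure measures how many symbols were found, not a measured accuracy rate.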
When to use LSP:
- Default: LSP is now the default for supported languages (Python, C#, TypeScript, JavaScript)
- Accuracy matters: When you need the most accurate symbol extraction
- Production use: For production codebases where quality matters more than speed
When to disable LSP:
- Speed critical: If indexing speed is more important than accuracy
- Unsupported languages: Languages without LSP support fall back to tree-sitter automatically
What's New in v2.2.0
Incremental Indexing - 5-10x Faster Updates
- Incremental Indexing - Only indexes changed files (5-10x speedup for small changes)
- Chunk-Level Granularity - Only re-embeds changed code chunks (50%+ reuse rate)
- Parallel Processing - Async parallel updates with 4x+ speedup
- Hybrid Change Detection - Automatic Git-based or hash-based detection
- Model Switching - Reuse indexes when switching embedding models (if dimensions match)
- Real-Time Updates - Process file change events from editors via MCP
Performance Benchmarks
- Incremental indexing: 10x faster for 5% file changes
- Chunk-level reuse: 50% fewer embeddings generated
- Parallel updates: 4.15x speedup with 3 workers
Features
Core Capabilities
- Semantic Code Search - Find code using natural language queries
- Symbol Resolution - Precise function/class/variable lookup (95%+ accuracy)
- Dependency Tracking - Understand code relationships and call graphs (90%+ completeness)
- Hybrid Retrieval - Combines semantic, symbol, and dependency-based search
- Context Expansion - Intelligent code context with dependency awareness
Technical Features
- LSP Integration (Default) - Language Server Protocol extracts about 7% more symbols than tree-sitter (Python, C#, TypeScript, JavaScript)
- Local Embeddings (Default) - sentence-transformers included (fast, no setup) or Ollama (optional, GPU support)
- AST-Based Fallback - tree-sitter for languages without LSP support or when LSP is unavailable (11 languages)
- Symbol Tables - SQLite-based symbol storage with FTS5 full-text search
- Dependency Graphs - NetworkX-based graph analysis and traversal
- Multi-Language Support - TypeScript, JavaScript, Python, Java, C#, Go, Rust, Ruby, PHP, C, C++
- High Performance - <100ms queries, <4s per file indexing (LSP), <1GB memory
- Security Built-in - Path validation, rate limiting, and resource limits
- MCP Integration - Works with Claude Desktop and other MCP clients
Quick Start
Get started in under 2 minutes with uvx - zero installation, zero configuration required!
Prerequisites
- uv - Install uv
- Claude Desktop (or other MCP client) - Install Claude Desktop
That's it! No Ollama, no language servers, no additional setup needed. Everything is included.
1. Configure Claude Desktop
Add to your Claude Desktop MCP configuration (%APPDATA%\Claude\claude_desktop_config.json on Windows or ~/Library/Application Support/Claude/claude_desktop_config.json on Mac):
{
"mcpServers": {
"semanticscout": {
"command": "uvx",
"args": ["--python", "3.12", "semanticscout@latest"]
}
}
}
That's it! This uses the default configuration:
- LSP integration - Accurate symbol extraction (Python, C#, TypeScript, JavaScript)
- sentence-transformers - Fast local embeddings (no Ollama needed)
- All enhancement features - Symbol tables, dependency graphs, hybrid search
Note: We specify --python 3.12 because some dependencies don't yet support Python 3.13. If you only have Python 3.13, install Python 3.12 with brew install python@3.12 (Mac) or download from python.org (Windows).
2. Restart Claude Desktop
That's it! SemanticScout will be automatically downloaded and run when Claude needs it.
Benefits:
- No manual installation
- No Ollama or language server setup required
- Always uses the latest version
- Automatic dependency management
- Isolated environment per run
- Works on Windows, Mac, and Linux
- Data stored in ~/.semanticscout/
Optional: Custom Data Directory
By default, data is stored in ~/.semanticscout/. To use a custom location:
{
"mcpServers": {
"semanticscout": {
"command": "uvx",
"args": [
"--python", "3.12",
"semanticscout@latest",
"--data-dir", "/path/to/your/data"
]
}
}
}
Incremental Indexing
SemanticScout v2.2.0 introduces incremental indexing for 5-10x faster updates when code changes.
How It Works
Automatic Change Detection:
- Git repositories: Uses git diff to detect files changed since the last index
- Non-Git directories: Uses MD5 file hashing to detect changes
- Chunk-level granularity: Only re-embeds changed code chunks (not entire files)
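As a rough illustration of the hash-based path, the sketch below compares MD5 digests against a snapshot saved from the previous run. It is not SemanticScout's implementation: the function names and the snapshot format (a dict of relative path to digest) are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the MD5 hex digest of a file, read in 64 KiB blocks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def detect_changes(root: Path, previous: dict) -> tuple:
    """Hash every file under root and diff the result against a previous snapshot.

    Returns (current_snapshot, changes), where changes lists added,
    modified, and deleted paths relative to root.
    """
    current = {
        str(p.relative_to(root)): file_md5(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
    changes = {
        "added": sorted(k for k in current if k not in previous),
        "modified": sorted(
            k for k in current if k in previous and current[k] != previous[k]
        ),
        "deleted": sorted(k for k in previous if k not in current),
    }
    return current, changes
```

Only the paths listed under added and modified would need re-chunking; the same comparison applied to per-chunk hashes gives the chunk-level reuse described above.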
Usage:
# Full indexing (indexes all files)
index_codebase(path="/path/to/project")
# Incremental indexing (only indexes changed files - 5-10x faster!)
index_codebase(path="/path/to/project", incremental=True)
Performance:
- Small changes (1-10% of files): 5-10x faster
- Chunk-level reuse: 50%+ fewer embeddings generated
- Parallel processing: 4x+ speedup with multiple files
When to use:
- Incremental: After initial indexing, for regular code updates
- Full: First-time indexing, major refactors, model changes
Real-Time File Change Events
Process file changes from editors in real-time:
import json

# Process file change events reported by an editor
process_file_changes(
    collection_name="my_project",
    changes=json.dumps({
        "events": [
            {"type": "modified", "path": "src/main.py", "timestamp": 1234567890}
        ],
        "workspace_root": "/path/to/project",
        "debounce_ms": 500
    }),
    auto_update=True  # Apply changes immediately
)
Security: All file paths are validated to prevent path traversal attacks.
LSP Integration Configuration
SemanticScout v2.4.0 uses Language Server Protocol (LSP) by default for more accurate symbol extraction.
How It Works
LSP vs Tree-sitter:
- LSP (default): Uses real language servers (jedi, omnisharp, tsserver) for symbol extraction
  - 7% more symbols extracted (2,722 vs 2,542)
  - More accurate type information and signatures
  - Better handling of complex language features
  - Trade-off: about 2x slower indexing (3.88s vs 1.85s per file)
- Tree-sitter (fallback): Fast AST-based parsing
  - Very fast indexing
  - Works for all 11 supported languages
  - Trade-off: less accurate symbol extraction
Automatic Fallback:
- LSP is used for supported languages (Python, C#, TypeScript, JavaScript)
- Tree-sitter is used for unsupported languages or if LSP fails
- No configuration needed - it just works!
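The selection rule can be pictured as a small lookup with a default. This is a minimal sketch of the behavior described above, not the project's code: `choose_extractor`, the `lsp_healthy` flag, and the lowercase language keys are hypothetical, while the server names come from this README.

```python
# Language -> language server, as listed in this README.
LSP_SERVERS = {
    "python": "jedi",
    "csharp": "omnisharp",
    "typescript": "tsserver",
    "javascript": "tsserver",
}


def choose_extractor(language: str, lsp_enabled: bool = True, lsp_healthy: bool = True) -> str:
    """Return the backend that would handle symbol extraction for a language.

    LSP is chosen only when it is enabled, the server is usable, and the
    language has a server; every other case falls back to tree-sitter.
    """
    server = LSP_SERVERS.get(language.lower())
    if lsp_enabled and lsp_healthy and server is not None:
        return f"lsp:{server}"
    return "tree-sitter"


print(choose_extractor("python"))                     # LSP via jedi
print(choose_extractor("go"))                         # no server, so tree-sitter
print(choose_extractor("python", lsp_healthy=False))  # server failed, so tree-sitter
```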
Supported Languages
| Language | LSP Server | Status |
|---|---|---|
| Python | jedi | Enabled by default |
| C# | omnisharp | Enabled by default |
| TypeScript | tsserver | Enabled by default |
| JavaScript | tsserver | Enabled by default |
| Go, Rust, Java, etc. | tree-sitter | Fallback |
Disabling LSP (Use Tree-sitter Only)
If you prefer faster indexing over accuracy, you can disable LSP:
{
"mcpServers": {
"semanticscout": {
"command": "uvx",
"args": ["--python", "3.12", "semanticscout@latest"],
"env": {
"SEMANTICSCOUT_CONFIG_JSON": "{\"enhancement_config\":{\"lsp_integration\":{\"enabled\":false}}}"
}
}
}
}
Per-Language Configuration
Disable LSP for specific languages:
{
"mcpServers": {
"semanticscout": {
"command": "uvx",
"args": ["--python", "3.12", "semanticscout@latest"],
"env": {
"SEMANTICSCOUT_CONFIG_JSON": "{\"enhancement_config\":{\"lsp_integration\":{\"languages\":{\"python\":{\"enabled\":false}}}}}"
}
}
}
}
Note: LSP servers are automatically installed via the multilspy package (included in dependencies).
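Hand-escaping the nested JSON in SEMANTICSCOUT_CONFIG_JSON is error-prone. One way to avoid it is to build both levels as Python dicts and let `json.dumps` do the escaping. The sketch below reuses the keys from the per-language example above; the variable names are illustrative.

```python
import json

# Inner settings as an ordinary dict (keys taken from the examples above).
semanticscout_config = {
    "enhancement_config": {
        "lsp_integration": {"languages": {"python": {"enabled": False}}}
    }
}

claude_config = {
    "mcpServers": {
        "semanticscout": {
            "command": "uvx",
            "args": ["--python", "3.12", "semanticscout@latest"],
            "env": {
                # Environment values must be strings, so serialize the inner config first.
                "SEMANTICSCOUT_CONFIG_JSON": json.dumps(
                    semanticscout_config, separators=(",", ":")
                ),
            },
        }
    }
}

# Serializing the outer document escapes the inner quotes automatically.
print(json.dumps(claude_config, indent=2))
```

The printed output can be pasted into claude_desktop_config.json as-is.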
Advanced Configuration
Default Configuration (Recommended)
No configuration needed! The default setup uses:
- LSP integration - Accurate symbol extraction (Python, C#, TypeScript, JavaScript)
- sentence-transformers - Fast local embeddings (30-60 sec for 500 chunks)
- All enhancement features - Symbol tables, dependency graphs, hybrid search
{
"mcpServers": {
"semanticscout": {
"command": "uvx",
"args": ["--python", "3.12", "semanticscout@latest"]
}
}
}
Embedding Provider Options
SemanticScout supports multiple embedding providers:
| Provider | Speed | Setup Required | Use Case |
|---|---|---|---|
| sentence-transformers (default) | ~30-60 sec for 500 chunks | None | Best for most users |
| Ollama (async) | ~2.6-4.4 min for 500 chunks | Ollama server | GPU acceleration, larger models |
| Ollama (sequential) | ~26-44 min for 500 chunks | Ollama server | Legacy/testing |
Option 1: sentence-transformers (Default - Recommended)
Already configured! This is the default. Available models:
- all-MiniLM-L6-v2 - 384 dims, very fast, good quality (default)
- all-mpnet-base-v2 - 768 dims, higher quality, slower
- paraphrase-MiniLM-L6-v2 - 384 dims, optimized for paraphrase
To use a different model:
{
"mcpServers": {
"semanticscout": {
"command": "uvx",
"args": ["--python", "3.12", "semanticscout@latest"],
"env": {
"SEMANTICSCOUT_CONFIG_JSON": "{\"embedding\":{\"provider\":\"sentence-transformers\",\"model\":\"all-mpnet-base-v2\"}}"
}
}
}
}
Option 2: Ollama (Optional - For GPU Acceleration)
Requires Ollama server running locally:
# Start Ollama and pull model
ollama serve
ollama pull nomic-embed-text
{
"mcpServers": {
"semanticscout": {
"command": "uvx",
"args": ["--python", "3.12", "semanticscout@latest"],
"env": {
"OLLAMA_BASE_URL": "http://localhost:11434",
"OLLAMA_MODEL": "nomic-embed-text",
"OLLAMA_MAX_CONCURRENT": "10",
"SEMANTICSCOUT_CONFIG_JSON": "{\"embedding\":{\"provider\":\"ollama\"}}"
}
}
}
}
Usage
Once configured in Claude Desktop, you can use natural language to interact with the MCP server:
Example Conversations
Index a codebase:
You: "Index my codebase at /workspace"
Claude: [Calls index_codebase tool and shows indexing progress]
Search for code:
You: "Find the authentication logic"
Claude: [Calls search_code tool and shows relevant code snippets]
List indexed projects:
You: "What codebases have been indexed?"
Claude: [Calls list_collections tool and shows all indexed projects]
Clear an index:
You: "Delete the index for my old project"
Claude: [Calls clear_index tool after confirmation]
Available MCP Tools
The server exposes these tools to Claude (you don't call them directly):
Core Tools
| Tool | Description | Parameters |
|---|---|---|
| index_codebase | Index a codebase for semantic and structural search | path (required) |
| search_code | Search with natural language + context expansion | query, collection_name, top_k, expansion_level |
| list_collections | List all indexed codebases | None |
| get_indexing_status | Get statistics for a collection | collection_name |
| clear_index | Delete a collection (permanent) | collection_name |
Enhanced Tools (Symbol & Dependency Analysis)
| Tool | Description | Parameters |
|---|---|---|
| find_symbol | Find symbols by name (functions, classes, etc.) | symbol_name, collection_name, symbol_type, limit |
| find_callers | Find all functions that call a given symbol | symbol_name, collection_name, max_depth |
| trace_dependencies | Trace dependency chains between files | file_path, collection_name, direction, max_depth |
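For readers curious what travels over STDIO when a client invokes one of these tools, the sketch below builds a JSON-RPC 2.0 `tools/call` request of the kind MCP clients send. The tool and parameter names come from the tables above; the argument values and the assumption that `top_k` is an integer are illustrative only.

```python
import json


def build_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# Hypothetical argument values for the search_code tool.
request = build_tool_call(
    1,
    "search_code",
    {"query": "authentication logic", "collection_name": "my_project", "top_k": 5},
)
print(request)
```

In normal use the MCP client constructs these messages itself, so this is only useful for debugging or for writing a custom client.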
Environment Variables
Most users don't need to configure anything! The defaults work great.
Optional Environment Variables
| Variable | Default | Description |
|---|---|---|
| MAX_FILE_SIZE_MB | 10.0 | Skip files larger than this |
| MAX_CODEBASE_SIZE_GB | 10.0 | Maximum total codebase size |
| MAX_FILES | 100000 | Maximum number of files |
| CHUNK_SIZE_MIN | 500 | Minimum chunk size (chars) |
| CHUNK_SIZE_MAX | 1500 | Maximum chunk size (chars) |
| LOG_LEVEL | INFO | Logging level |
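Environment variables always arrive as strings, so each value has to be converted before use. The sketch below shows one way such settings could be read with the defaults from the table; `load_settings` and the min/max sanity check are assumptions for illustration, not SemanticScout's loader.

```python
import os

# Defaults copied from the table above; the Python type of each default
# decides how the raw string is converted.
DEFAULTS = {
    "MAX_FILE_SIZE_MB": 10.0,
    "MAX_CODEBASE_SIZE_GB": 10.0,
    "MAX_FILES": 100000,
    "CHUNK_SIZE_MIN": 500,
    "CHUNK_SIZE_MAX": 1500,
    "LOG_LEVEL": "INFO",
}


def load_settings(environ=None) -> dict:
    """Read each variable, falling back to its default and keeping the default's type."""
    environ = os.environ if environ is None else environ
    settings = {}
    for name, default in DEFAULTS.items():
        raw = environ.get(name)
        settings[name] = default if raw is None else type(default)(raw)
    # Hypothetical sanity check: a minimum above the maximum cannot be satisfied.
    if settings["CHUNK_SIZE_MIN"] > settings["CHUNK_SIZE_MAX"]:
        raise ValueError("CHUNK_SIZE_MIN must not exceed CHUNK_SIZE_MAX")
    return settings


print(load_settings({"MAX_FILE_SIZE_MB": "20.0", "LOG_LEVEL": "DEBUG"}))
```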
Ollama-Specific Variables (Only if using Ollama)
| Variable | Default | Description |
|---|---|---|
| OLLAMA_BASE_URL | http://localhost:11434 | Ollama server URL |
| OLLAMA_MODEL | nomic-embed-text | Embedding model to use |
| OLLAMA_MAX_CONCURRENT | 10 | Max concurrent requests |
Example with Custom Settings
{
"mcpServers": {
"semanticscout": {
"command": "uvx",
"args": ["--python", "3.12", "semanticscout@latest"],
"env": {
"MAX_FILE_SIZE_MB": "20.0",
"LOG_LEVEL": "DEBUG"
}
}
}
}
Architecture
┌─────────────────┐
│   MCP Client    │  (Claude Desktop, etc.)
│   (AI Agent)    │
└────────┬────────┘
         │ JSON-RPC over STDIO
         │
┌────────▼────────┐
│   MCP Server    │
│    (FastMCP)    │
└────────┬────────┘
         │
    ┌────┴───┬────────┬──────────┬─────────┐
    │        │        │          │         │
┌───▼───┐ ┌──▼──┐ ┌───▼────┐ ┌───▼───┐ ┌───▼────┐
│Indexer│ │Query│ │Hybrid  │ │Vector │ │Symbol/ │
│       │ │Anal │ │Retriev │ │ Store │ │DepGraph│
└───┬───┘ └──┬──┘ └───┬────┘ └───┬───┘ └───┬────┘
    │        │        │          │         │
┌───▼────────▼────────▼──────────▼─────────▼────┐
│     ChromaDB + SQLite + NetworkX + Caches     │
└───────────────────────────────────────────────┘
Core Components
- File Discovery: Finds code files, respects .gitignore
- LSP Processor: Uses Language Server Protocol for accurate symbol extraction (Python, C#, TypeScript, JavaScript)
- AST Processor: Parses code with tree-sitter, extracts symbols and dependencies (fallback or unsupported languages)
- Code Chunker: AST-based semantic chunking
- Embedding Provider: Generates vector embeddings (Ollama or sentence-transformers)
- Vector Store: Stores and searches embeddings (ChromaDB)
- Symbol Table: SQLite-based symbol storage with FTS5 search
- Dependency Graph: NetworkX-based graph analysis
- Query Analyzer: Classifies queries and routes to optimal strategy
- Hybrid Retriever: Coordinates semantic, symbol, and dependency search
- Context Expander: Intelligent context expansion with dependency awareness
- Security Validators: Path validation, rate limiting, input sanitization
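To make the dependency-graph component more concrete, here is a minimal sketch of a depth-limited caller lookup in the spirit of the find_callers tool. SemanticScout uses NetworkX for its graphs; this example deliberately uses plain dicts and a breadth-first search so it runs with the standard library alone, and the function name, graph shape, and sample symbols are all hypothetical.

```python
from collections import deque


def find_callers(calls: dict, target: str, max_depth: int = 2) -> dict:
    """Return {caller: depth} for functions that reach target within max_depth call hops.

    calls maps each function name to the set of functions it calls directly.
    """
    # Invert caller -> callees into callee -> callers.
    callers_of = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers_of.setdefault(callee, set()).add(caller)

    found = {}
    queue = deque([(target, 0)])
    while queue:
        symbol, depth = queue.popleft()
        if depth == max_depth:
            continue  # Do not expand beyond the requested depth.
        for caller in sorted(callers_of.get(symbol, ())):
            if caller not in found and caller != target:
                found[caller] = depth + 1
                queue.append((caller, depth + 1))
    return found


call_graph = {
    "main": {"login"},
    "login": {"check_password"},
    "reset": {"check_password"},
}
print(find_callers(call_graph, "check_password"))
# {'login': 1, 'reset': 1, 'main': 2}
```

The visited check (`caller not in found`) also keeps the search from looping on recursive or mutually recursive functions.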
Development
Running Tests
# Run all tests
pytest
# Run with coverage
pytest --cov
# Run specific test file
pytest tests/unit/test_semantic_search.py -v
Test Coverage
Current coverage: 73% (351 tests passing)
Core Components:
- File Discovery: 80%
- Code Chunker: 89%
- Ollama Provider: 92%
- Vector Store: 89%
- Query Processor: 100%
- Semantic Search: 99%
- Security Validators: 95%
Enhanced Components:
- AST Processor: 82%
- Symbol Table: 79%
- Dependency Graph: 84%
- Query Analyzer: 100%
- Hybrid Retriever: 97%
- Context Expander: 82%
- Performance Monitoring: 93%
Project Structure
semanticscout/
├── src/semanticscout/
│   ├── mcp_server.py              # MCP server entry point
│   ├── config/                    # Configuration management
│   │   ├── __init__.py
│   │   └── enhancement_config.py
│   ├── logging_config.py          # Logging setup
│   ├── indexer/                   # Indexing components
│   │   ├── file_discovery.py
│   │   ├── code_chunker.py
│   │   └── pipeline.py
│   ├── lsp/                       # LSP integration (NEW in v2.4.0)
│   │   ├── __init__.py
│   │   ├── language_server_manager.py
│   │   ├── lsp_processor.py
│   │   └── lsp_symbol_mapper.py
│   ├── ast_processing/            # AST parsing & symbol extraction (fallback)
│   │   ├── ast_processor.py
│   │   └── ast_cache.py
│   ├── symbol_table/              # Symbol storage & lookup
│   │   └── symbol_table.py
│   ├── dependency_graph/          # Dependency tracking
│   │   └── dependency_graph.py
│   ├── query_analysis/            # Query classification
│   │   └── query_analyzer.py
│   ├── embeddings/                # Embedding providers
│   │   ├── base.py
│   │   └── ollama_provider.py
│   ├── vector_store/              # Vector database
│   │   └── chroma_store.py
│   ├── retriever/                 # Search components
│   │   ├── query_processor.py
│   │   ├── semantic_search.py
│   │   ├── hybrid_retriever.py
│   │   └── context_expander.py
│   ├── performance/               # Performance monitoring
│   │   ├── metrics.py
│   │   ├── memory.py
│   │   └── parallel.py
│   └── security/                  # Security & validation
│       └── validators.py
├── tests/                         # Unit & integration tests
│   ├── unit/                      # Unit tests (200+ tests)
│   ├── integration/               # Integration tests
│   └── validation/                # Validation tests
├── examples/                      # Example scripts
├── docs/                          # Documentation
│   ├── API_REFERENCE.md
│   ├── USER_GUIDE.md
│   ├── CONFIGURATION.md
│   └── PERFORMANCE_TUNING.md
└── config/                        # Configuration files
    └── enhancement_config.template.json
## Runtime Data Structure
SemanticScout stores all runtime data in `~/semanticscout/`:
~/semanticscout/                   # User's home directory
├── config/                        # Configuration files
│   └── enhancement_config.json
├── data/                          # Runtime data
│   ├── chroma_db/                 # Vector store database
│   ├── symbol_tables/             # Symbol databases
│   ├── dependency_graphs/         # Dependency graph files
│   └── ast_cache/                 # AST parsing cache
└── logs/                          # Log files
    └── mcp_server.log
## Documentation
Comprehensive documentation is available in the `docs/` directory:
- **[API_REFERENCE.md](docs/API_REFERENCE.md)** - Complete API documentation for all MCP tools
- **[USER_GUIDE.md](docs/USER_GUIDE.md)** - User guide with examples and best practices
- **[CONFIGURATION.md](docs/CONFIGURATION.md)** - Configuration options and feature flags
- **[PERFORMANCE_TUNING.md](docs/PERFORMANCE_TUNING.md)** - Performance optimization guide
### Examples
See the [examples/](examples/) directory for working examples:
- `test_full_pipeline.py` - Complete indexing and search workflow
- `test_retrieval_system.py` - Advanced search with filtering
- `index_weather_unified.py` - Real-world codebase indexing
## Troubleshooting
### Python Version Issues
**Error:** `No module named 'onnxruntime'` or tree-sitter compatibility issues
**Solution:** Use Python 3.12 (some dependencies do not yet support 3.13 or later). See [PYTHON_VERSION_ISSUE.md](PYTHON_VERSION_ISSUE.md).
### Ollama Not Running (Only if using Ollama)
**Error:** `Ollama server not available`
**Solution:** The default configuration uses sentence-transformers (no Ollama needed). If you explicitly configured Ollama, start it:
```bash
ollama serve
ollama pull nomic-embed-text
```
Or switch back to the default (sentence-transformers) by removing the Ollama configuration.
Rate Limit Exceeded
Error: Rate limit exceeded: Maximum X requests per hour
Solution: Adjust rate limits in .env:
MAX_INDEXING_REQUESTS_PER_HOUR=20
MAX_SEARCH_REQUESTS_PER_MINUTE=200
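For intuition about how limits like these behave, the sketch below implements a simple sliding-window limiter. It is not SemanticScout's rate limiter, whose algorithm this README does not describe; the class name and the injectable clock are assumptions made so the example is easy to test.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` calls in any `window_seconds` interval."""

    def __init__(self, limit: int, window_seconds: float, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self._stamps = deque()  # Timestamps of accepted calls, oldest first.

    def allow(self) -> bool:
        """Record and accept the call if the window has room, else reject it."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) >= self.limit:
            return False
        self._stamps.append(now)
        return True


# Limits mirroring the example settings above.
indexing_limiter = SlidingWindowLimiter(limit=20, window_seconds=3600)
search_limiter = SlidingWindowLimiter(limit=200, window_seconds=60)
```

A rejected call would correspond to the "Rate limit exceeded" error shown above.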
Path Not Allowed
Error: Path is not within allowed directories
Solution: The server only allows indexing within the current working directory by default.
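The kind of check behind this error can be sketched with `pathlib`: resolve the requested path and confirm it still sits under the allowed root. This is an illustrative version only; `is_within_allowed` is a hypothetical helper, not the validator shipped in the security module.

```python
from pathlib import Path


def is_within_allowed(candidate: str, allowed_root: str) -> bool:
    """Return True if candidate resolves to allowed_root or a location beneath it.

    Relative candidates are interpreted relative to allowed_root. Resolving
    first collapses ".." segments and symlinks, so traversal attempts such as
    "../../etc/passwd" are rejected.
    """
    root = Path(allowed_root).resolve()
    target = (root / candidate).resolve()
    return target == root or root in target.parents


print(is_within_allowed("src/main.py", "/path/to/project"))       # True
print(is_within_allowed("../other/secrets", "/path/to/project"))  # False
```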
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
License
MIT License - see LICENSE for details.
Acknowledgments
- Anthropic for the MCP protocol
- Ollama for local embeddings
- ChromaDB for vector storage
- Tree-sitter for code parsing
- multilspy for LSP integration
- Jedi, OmniSharp, and TypeScript Language Server for language servers
Built with ❤️ for the AI agent ecosystem