A language-aware semantic code search MCP server with intelligent filtering and 9.3x better dependency analysis
This project has been archived.
The maintainers of this project have marked this project as archived. No new releases are expected.
Project description
SemanticScout ๐
A hybrid code intelligence system for AI agents - combining semantic search with structural understanding
SemanticScout is a Model Context Protocol (MCP) server that provides hybrid code intelligence by combining semantic search with structural code understanding. It goes beyond simple text matching to understand code relationships, dependencies, and architecture with language-aware analysis and intelligent filtering.
๐ What's New in v2.9.0
๐ฏ Language-Aware Dependency Analysis - 9.3x Better Accuracy!
- โ Project Language Detection - Automatically detects primary languages (Rust, C#, Python, etc.)
- โ Language-Specific Routing - Routes dependency analysis to specialized strategies
- โ Rust Support - Advanced Cargo.toml parsing, mod declarations, crate resolution
- โ C# Support - Namespace resolution, using statements, project references
- โ Python Support - Import analysis, package detection, module resolution
- โ 9.3x Improvement - 100% accuracy vs 10.7% with generic analysis
๐ซ Intelligent Test Code Filtering - 0% Test Pollution!
- โ Multi-Strategy Detection - Path patterns, file names, AST analysis
- โ Production Code Focus - Automatically excludes test files from search results
- โ
Configurable Filtering - Enable/disable via
exclude_test_filesparameter - โ Zero False Positives - Comprehensive test detection patterns
- โ 24% โ 0% Test Pollution - Eliminates irrelevant test code from results
๐๏ธ Enhanced Git Filtering - Massive Project Support!
- โ Untracked File Detection - Automatically excludes untracked files from indexing
- โ Performance Optimization - 30-second caching of git status results
- โ Configurable Filtering - Enable/disable untracked file filtering
- โ Massive Project Support - Handles large repositories efficiently
- โ Graceful Fallbacks - Works with non-Git repositories
๐๏ธ Architectural Query Detection - Smart Pattern Recognition!
- โ DI Pattern Detection - Recognizes dependency injection queries
- โ Result Boosting - Prioritizes architectural files (Startup.cs, Program.cs)
- โ Context Expansion - Intelligent expansion for architectural queries
- โ Coverage Modes - Focused (5), Balanced (10), Comprehensive (20), Exhaustive (50)
- โ File-Level Deduplication - Eliminates duplicate results from same files
Performance Comparison: Language-Aware vs Generic Analysis
| Metric | Generic Analysis | Language-Aware | Improvement |
|---|---|---|---|
| Accuracy | 10.7% | 100% | 9.3x better |
| Test Pollution | 24% | 0% | Eliminated |
| Duplicate Results | 15% | 0% | Eliminated |
| Coverage | 3-5 files | 10-20 files | 2-4x more |
๐ง Recent Improvements (v2.8.0)
- โ UUID-Based Collection Naming - Collections now use unique identifiers for better multi-project support
- โ Embedding Dimensions Metadata - Embedding model and dimensions stored as separate metadata fields
- โ
Enhanced Collection Management - Improved collection naming scheme (e.g.,
morfeus_qt_<uuid>instead ofmorfeus_qt_nomic_embed_text)
๐ Previous Major Features
๐ Incremental Indexing (v2.2.0) - 5-10x Faster Updates!
- โ Incremental Indexing - Only indexes changed files (5-10x speedup for small changes)
- โ Chunk-Level Granularity - Only re-embeds changed code chunks (50%+ reuse rate)
- โ Parallel Processing - Async parallel updates with 4x+ speedup
- โ Hybrid Change Detection - Automatic Git-based or hash-based detection
- โ Model Switching - Reuse indexes when switching embedding models (if dimensions match)
- โ Real-Time Updates - Process file change events from editors via MCP
โจ Features
Core Capabilities
- ๐ Semantic Code Search - Find code using natural language queries with 100% accuracy
- ๐ฏ Symbol Resolution - Precise function/class/variable lookup (95%+ accuracy)
- ๐ Language-Aware Dependencies - Understand code relationships with specialized analysis (9.3x better)
- ๐ง Hybrid Retrieval - Combines semantic, symbol, and dependency-based search
- ๐ Context Expansion - Intelligent code context with dependency awareness
- ๐ซ Test Code Filtering - Automatically excludes test files (0% test pollution)
- ๐๏ธ Git Integration - Smart filtering of untracked files and git-aware indexing
Technical Features
- ๐ฏ Language Detection - Automatic project language detection and specialized routing
- ๐ฅ Local Embeddings (Default) - sentence-transformers included (fast, no setup) or Ollama (optional, GPU support)
- ๐ณ AST-Based Parsing - tree-sitter for accurate symbol extraction and dependency tracking (11 languages)
- ๐๏ธ Symbol Tables - SQLite-based symbol storage with FTS5 full-text search
- ๐ Dependency Graphs - NetworkX-based graph analysis and traversal
- ๐ Multi-Language Support - TypeScript, JavaScript, Python, Java, C#, Go, Rust, Ruby, PHP, C, C++
- โก High Performance - <100ms queries, <2s per file indexing, <1GB memory
- ๐ Security Built-in - Path validation, rate limiting, and resource limits
- ๐ค MCP Integration - Works with Claude Desktop and other MCP clients
- ๐ Coverage Modes - Focused (5), Balanced (10), Comprehensive (20), Exhaustive (50) results
๐ Quick Start
Get started in under 2 minutes with uvx - zero installation, zero configuration required!
Prerequisites
- uv - Install uv
- Claude Desktop (or other MCP client) - Install Claude Desktop
That's it! No Ollama, no language servers, no additional setup needed. Everything is included.
1. Configure Claude Desktop
Add to your Claude Desktop MCP configuration (%APPDATA%\Claude\claude_desktop_config.json on Windows or ~/Library/Application Support/Claude/claude_desktop_config.json on Mac):
{
"mcpServers": {
"semanticscout": {
"command": "uvx",
"args": ["--python", "3.12", "semanticscout@latest"]
}
}
}
That's it! This uses the default configuration:
- โ Language-aware analysis - Automatic language detection and specialized routing
- โ Tree-sitter parsing - Accurate symbol extraction and dependency tracking
- โ sentence-transformers - Fast local embeddings (no Ollama needed)
- โ Test code filtering - Excludes test files from search results
- โ Git filtering - Smart handling of untracked files
- โ All enhancement features - Symbol tables, dependency graphs, hybrid search
Note: We specify --python 3.12 because some dependencies don't yet support Python 3.13. If you only have Python 3.13, install Python 3.12 with brew install python@3.12 (Mac) or download from python.org (Windows).
2. Restart Claude Desktop
That's it! SemanticScout will be automatically downloaded and run when Claude needs it.
โจ Benefits:
- โ No manual installation
- โ No Ollama or language server setup required
- โ Always uses latest version
- โ Automatic dependency management
- โ Isolated environment per run
- โ Works on Windows, Mac, and Linux
- โ
Data stored in
~/semanticscout/
Optional: Custom Data Directory
By default, data is stored in ~/semanticscout/. To use a custom location:
{
"mcpServers": {
"semanticscout": {
"command": "uvx",
"args": [
"--python", "3.12",
"semanticscout@latest",
"--data-dir", "/path/to/your/data"
]
}
}
}
๐ Incremental Indexing & Git Integration
SemanticScout v2.9.0 provides advanced Git integration with enhanced filtering and 5-10x faster updates.
Enhanced Git Features
Smart File Filtering:
- Untracked file detection: Automatically excludes untracked files from indexing
- Git status caching: 30-second cache for performance optimization
- Configurable filtering: Enable/disable untracked file filtering
- Massive project support: Handles large repositories efficiently
Automatic Change Detection:
- Git repositories: Uses
git diffto detect changed files since last index - Non-Git directories: Uses MD5 file hashing to detect changes
- Chunk-level granularity: Only re-embeds changed code chunks (not entire files)
Usage:
# Full indexing (indexes all files)
index_codebase(path="/path/to/project")
# Incremental indexing (only indexes changed files - 5-10x faster!)
index_codebase(path="/path/to/project", incremental=True)
Performance:
- Small changes (1-10% of files): 5-10x faster
- Chunk-level reuse: 50%+ fewer embeddings generated
- Parallel processing: 4x+ speedup with multiple files
When to use:
- โ Incremental: After initial indexing, for regular code updates
- โ Full: First-time indexing, major refactors, model changes
Real-Time File Change Events
Process file changes from editors in real-time:
# Process file change events
process_file_changes(
collection_name="my_project",
changes=json.dumps({
"events": [
{"type": "modified", "path": "src/main.py", "timestamp": 1234567890}
],
"workspace_root": "/path/to/project",
"debounce_ms": 500
}),
auto_update=True # Apply changes immediately
)
Security: All file paths are validated to prevent path traversal attacks.
๐ฏ Language-Aware Analysis Configuration
SemanticScout v2.9.0 provides language-aware dependency analysis with 9.3x better accuracy than generic analysis.
How It Works
Language Detection & Routing:
- Automatic Detection: Analyzes project structure, config files, and file extensions
- Specialized Strategies: Routes to language-specific dependency analysis
- Rust Support: Cargo.toml parsing, mod declarations, crate resolution
- C# Support: Namespace resolution, using statements, project references
- Python Support: Import analysis, package detection, module resolution
Performance Comparison:
| Language | Generic Analysis | Language-Aware | Improvement |
|---|---|---|---|
| Rust | 8% accuracy | 100% accuracy | 12.5x better |
| C# | 12% accuracy | 100% accuracy | 8.3x better |
| Python | 15% accuracy | 100% accuracy | 6.7x better |
โก Advanced Configuration
Default Configuration (Recommended)
No configuration needed! The default setup uses:
- Language-aware analysis - Automatic language detection and specialized routing
- Tree-sitter parsing - Accurate symbol extraction and dependency tracking
- sentence-transformers - Fast local embeddings (30-60 sec for 500 chunks)
- Test code filtering - Excludes test files from search results
- Git filtering - Smart handling of untracked files
- All enhancement features - Symbol tables, dependency graphs, hybrid search
{
"mcpServers": {
"semanticscout": {
"command": "uvx",
"args": ["--python", "3.12", "semanticscout@latest"]
}
}
}
Embedding Provider Options
SemanticScout supports multiple embedding providers:
| Provider | Speed | Setup Required | Use Case |
|---|---|---|---|
| sentence-transformers (default) | ~30-60 sec for 500 chunks | โ None | Best for most users |
| Ollama (async) | ~2.6-4.4 min for 500 chunks | Ollama server | GPU acceleration, larger models |
| Ollama (sequential) | ~26-44 min for 500 chunks | Ollama server | Legacy/testing |
Option 1: sentence-transformers (Default - Recommended)
Already configured! This is the default. Available models:
all-MiniLM-L6-v2- 384 dims, very fast, good quality (default)all-mpnet-base-v2- 768 dims, higher quality, slowerparaphrase-MiniLM-L6-v2- 384 dims, optimized for paraphrase
To use a different model:
{
"mcpServers": {
"semanticscout": {
"command": "uvx",
"args": ["--python", "3.12", "semanticscout@latest"],
"env": {
"SEMANTICSCOUT_CONFIG_JSON": "{\"embedding\":{\"provider\":\"sentence-transformers\",\"model\":\"all-mpnet-base-v2\"}}"
}
}
}
}
Option 2: Ollama (Optional - For GPU Acceleration)
Requires Ollama server running locally:
# Start Ollama and pull model
ollama serve
ollama pull nomic-embed-text
{
"mcpServers": {
"semanticscout": {
"command": "uvx",
"args": ["--python", "3.12", "semanticscout@latest"],
"env": {
"OLLAMA_BASE_URL": "http://localhost:11434",
"OLLAMA_MODEL": "nomic-embed-text",
"OLLAMA_MAX_CONCURRENT": "10",
"SEMANTICSCOUT_CONFIG_JSON": "{\"embedding\":{\"provider\":\"ollama\"}}"
}
}
}
}
๐ Usage
Once configured in Claude Desktop, you can use natural language to interact with the MCP server:
Example Conversations
Index a codebase:
You: "Index my codebase at /workspace"
Claude: [Calls index_codebase tool and shows indexing progress]
Search for code:
You: "Find the authentication logic"
Claude: [Calls search_code tool and shows relevant code snippets]
List indexed projects:
You: "What codebases have been indexed?"
Claude: [Calls list_collections tool and shows all indexed projects]
Clear an index:
You: "Delete the index for my old project"
Claude: [Calls clear_index tool after confirmation]
Available MCP Tools
The server exposes these tools to Claude (you don't call them directly):
Core Tools
| Tool | Description | Parameters |
|---|---|---|
index_codebase |
Index a codebase with language-aware analysis | path (required), incremental (optional) |
search_code |
Search with natural language + context expansion | query, collection_name, coverage_mode, exclude_test_files |
list_collections |
List all indexed codebases | None |
get_indexing_status |
Get statistics for a collection | collection_name |
clear_index |
Delete a collection (permanent) | collection_name |
Enhanced Tools (Symbol & Dependency Analysis)
| Tool | Description | Parameters |
|---|---|---|
find_symbol |
Find symbols with language-aware lookup | symbol_name, collection_name, symbol_type |
find_callers |
Find all functions that call a given symbol | symbol_name, collection_name, max_results |
trace_dependencies |
Trace dependency chains with language-specific analysis | file_path, collection_name, depth |
process_file_changes |
Process real-time file change events | collection_name, changes, auto_update |
โ๏ธ Environment Variables
Most users don't need to configure anything! The defaults work great.
Optional Environment Variables
| Variable | Default | Description |
|---|---|---|
MAX_FILE_SIZE_MB |
10.0 |
Skip files larger than this |
MAX_CODEBASE_SIZE_GB |
10.0 |
Maximum total codebase size |
MAX_FILES |
100000 |
Maximum number of files |
CHUNK_SIZE_MIN |
500 |
Minimum chunk size (chars) |
CHUNK_SIZE_MAX |
1500 |
Maximum chunk size (chars) |
LOG_LEVEL |
INFO |
Logging level |
Ollama-Specific Variables (Only if using Ollama)
| Variable | Default | Description |
|---|---|---|
OLLAMA_BASE_URL |
http://localhost:11434 |
Ollama server URL |
OLLAMA_MODEL |
nomic-embed-text |
Embedding model to use |
OLLAMA_MAX_CONCURRENT |
10 |
Max concurrent requests |
Example with Custom Settings
{
"mcpServers": {
"semanticscout": {
"command": "uvx",
"args": ["--python", "3.12", "semanticscout@latest"],
"env": {
"MAX_FILE_SIZE_MB": "20.0",
"LOG_LEVEL": "DEBUG"
}
}
}
}
๐๏ธ Architecture
โโโโโโโโโโโโโโโโโโโ
โ MCP Client โ (Claude Desktop, etc.)
โ (AI Agent) โ
โโโโโโโโโโฌโโโโโโโโโ
โ JSON-RPC over STDIO
โ
โโโโโโโโโโผโโโโโโโโโ
โ MCP Server โ
โ (FastMCP) โ
โโโโโโโโโโฌโโโโโโโโโ
โ
โโโโโโดโโโโโฌโโโโโโโโโฌโโโโโโโโโโโฌโโโโโโโโโโโ
โ โ โ โ โ
โโโโโผโโโโ โโโโผโโโ โโโโโผโโโโโ โโโโผโโโโโ โโโโโผโโโโโ
โIndexerโ โQueryโ โHybrid โ โVector โ โSymbol/ โ
โ โ โAnal โ โRetriev โ โ Store โ โDepGraphโ
โโโโโฌโโโโ โโโโฌโโโ โโโโโฌโโโโโ โโโโโฌโโโโ โโโโโฌโโโโโ
โ โ โ โ โ
โโโโโผโโโโโโโโโผโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโผโโโโ
โ ChromaDB + SQLite + NetworkX + Caches โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Core Components
- File Discovery: Finds code files, respects
.gitignore - AST Processor: Parses code with tree-sitter, extracts symbols and dependencies
- Code Chunker: AST-based semantic chunking
- Embedding Provider: Generates vector embeddings (Ollama or sentence-transformers)
- Vector Store: Stores and searches embeddings (ChromaDB)
- Symbol Table: SQLite-based symbol storage with FTS5 search
- Dependency Graph: NetworkX-based graph analysis
- Query Analyzer: Classifies queries and routes to optimal strategy
- Hybrid Retriever: Coordinates semantic, symbol, and dependency search
- Context Expander: Intelligent context expansion with dependency awareness
- Security Validators: Path validation, rate limiting, input sanitization
๐งช Development
Running Tests
# Run all tests
pytest
# Run with coverage
pytest --cov
# Run specific test file
pytest tests/unit/test_semantic_search.py -v
Test Coverage
Current coverage: 85% (400+ tests passing)
Core Components:
- File Discovery: 85%
- Code Chunker: 89%
- Ollama Provider: 92%
- Vector Store: 89%
- Query Processor: 100%
- Semantic Search: 99%
- Security Validators: 95%
Enhanced Components:
- Language Detection: 90%
- Dependency Router: 88%
- AST Processor: 82%
- Symbol Table: 79%
- Dependency Graph: 84%
- Query Analyzer: 100%
- Hybrid Retriever: 97%
- Context Expander: 82%
- Git Integration: 85%
- Test Filtering: 92%
Project Structure
semanticscout/
โโโ src/semanticscout/
โ โโโ mcp_server.py # MCP server entry point
โ โโโ config/ # Configuration management
โ โ โโโ __init__.py
โ โ โโโ enhancement_config.py
โ โโโ logging_config.py # Logging setup
โ โโโ indexer/ # Indexing components
โ โ โโโ file_discovery.py
โ โ โโโ file_classifier.py # NEW: Test file detection
โ โ โโโ code_chunker.py
โ โ โโโ git_change_detector.py # NEW: Enhanced git filtering
โ โ โโโ pipeline.py
โ โโโ language_detection/ # Language detection
โ โ โโโ project_language_detector.py
โ โโโ dependency_analysis/ # Language-aware analysis
โ โ โโโ dependency_router.py
โ โ โโโ strategies.py
โ โโโ ast_processing/ # AST parsing & symbol extraction
โ โ โโโ ast_processor.py
โ โ โโโ ast_cache.py
โ โโโ symbol_table/ # Symbol storage & lookup
โ โ โโโ symbol_table.py
โ โโโ dependency_graph/ # Dependency tracking
โ โ โโโ dependency_graph.py
โ โโโ query_analysis/ # Query classification
โ โ โโโ query_analyzer.py
โ โโโ embeddings/ # Embedding providers
โ โ โโโ base.py
โ โ โโโ ollama_provider.py
โ โโโ vector_store/ # Vector database
โ โ โโโ chroma_store.py
โ โโโ retriever/ # Search components
โ โ โโโ query_processor.py
โ โ โโโ semantic_search.py # Enhanced with test filtering
โ โ โโโ hybrid_retriever.py # Enhanced with deduplication
โ โ โโโ context_expander.py # Enhanced with smart expansion
โ โโโ performance/ # Performance monitoring
โ โ โโโ metrics.py
โ โ โโโ memory.py
โ โ โโโ parallel.py
โ โโโ security/ # Security & validation
โ โโโ validators.py
โโโ tests/ # Unit & integration tests
โ โโโ unit/ # Unit tests (200+ tests)
โ โโโ integration/ # Integration tests
โ โโโ validation/ # Validation tests
โโโ examples/ # Example scripts
โโโ docs/ # Documentation
โ โโโ API_REFERENCE.md
โ โโโ USER_GUIDE.md
โ โโโ CONFIGURATION.md
โ โโโ PERFORMANCE_TUNING.md
โโโ config/ # Configuration files
โโโ enhancement_config.template.json
## ๐ Runtime Data Structure
SemanticScout stores all runtime data in `~/semanticscout/`:
~/semanticscout/ # User's home directory โโโ config/ # Configuration files โ โโโ enhancement_config.json โโโ data/ # Runtime data โ โโโ chroma_db/ # Vector store database โ โโโ symbol_tables/ # Symbol databases โ โโโ dependency_graphs/ # Dependency graph files โ โโโ ast_cache/ # AST parsing cache โโโ logs/ # Log files โโโ mcp_server.log
## ๐ Documentation
Comprehensive documentation is available in the `docs/` directory:
- **[API_REFERENCE.md](docs/API_REFERENCE.md)** - Complete API documentation for all MCP tools
- **[USER_GUIDE.md](docs/USER_GUIDE.md)** - User guide with examples and best practices
- **[CONFIGURATION.md](docs/CONFIGURATION.md)** - Configuration options and feature flags
- **[PERFORMANCE_TUNING.md](docs/PERFORMANCE_TUNING.md)** - Performance optimization guide
### Examples
See the [examples/](examples/) directory for working examples:
- `test_full_pipeline.py` - Complete indexing and search workflow
- `test_retrieval_system.py` - Advanced search with filtering
- `index_weather_unified.py` - Real-world codebase indexing
## ๐ Troubleshooting
### Python Version Issues
**Error:** `No module named 'onnxruntime'` or tree-sitter compatibility issues
**Solution:** Use Python 3.12 (not 3.14). See [PYTHON_VERSION_ISSUE.md](PYTHON_VERSION_ISSUE.md).
### Ollama Not Running (Only if using Ollama)
**Error:** `Ollama server not available`
**Solution:** The default configuration uses sentence-transformers (no Ollama needed). If you explicitly configured Ollama, start it:
```bash
ollama serve
ollama pull nomic-embed-text
Or switch back to the default (sentence-transformers) by removing Ollama configuration.
Rate Limit Exceeded
Error: Rate limit exceeded: Maximum X requests per hour
Solution: Adjust rate limits in .env:
MAX_INDEXING_REQUESTS_PER_HOUR=20
MAX_SEARCH_REQUESTS_PER_MINUTE=200
Path Not Allowed
Error: Path is not within allowed directories
Solution: The server only allows indexing within the current working directory by default.
๐ค Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
๐ License
MIT License - see LICENSE for details.
๐ Acknowledgments
- Anthropic for the MCP protocol
- Ollama for local embeddings
- ChromaDB for vector storage
- Tree-sitter for code parsing
Built with โค๏ธ for the AI agent ecosystem
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file semanticscout-3.0.1.tar.gz.
File metadata
- Download URL: semanticscout-3.0.1.tar.gz
- Upload date:
- Size: 175.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
42045783aef776bdb4ade362aa79d85c48844662c0680fbb459b9e3ba37d3a0d
|
|
| MD5 |
e31c441651f1b24762436a4d5aad7d6e
|
|
| BLAKE2b-256 |
97abc34bbd859b68ffe10bd53bf07946bf084888456add7bddf515299ba83688
|
File details
Details for the file semanticscout-3.0.1-py3-none-any.whl.
File metadata
- Download URL: semanticscout-3.0.1-py3-none-any.whl
- Upload date:
- Size: 162.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
9a2eb41cbe45aa611c23e7121ea80fdfa8ef4acce63b2bbf80e616eeaea2034a
|
|
| MD5 |
272aca3db0c1e641708e734513f30117
|
|
| BLAKE2b-256 |
6c1025f474102abd3af0893c5ca80299701c3b2165cff38d39aa2e442ac0c55d
|