Semantic code search with hybrid retrieval and MCP integration
clew
Semantic code search with hybrid retrieval and MCP integration for Claude Code.
Ask natural language questions about your codebase — clew indexes your code with AST-aware chunking, embeds it with Voyage AI, stores it in Qdrant, and serves results through both a CLI and an MCP server that Claude Code can call directly.
Features
- Hybrid search — Dense embeddings (Voyage voyage-code-3) + BM25 keyword matching fused with Reciprocal Rank Fusion, optionally re-ranked with Voyage rerank-2.5
- AST-aware chunking — tree-sitter parses Python, TypeScript, and JavaScript into semantic units (functions, classes, components) with token-aware fallback splitting
- Code relationship tracing — Extracts imports, calls, inheritance, decorators, JSX renders, test mappings, and API boundaries; traversable via BFS graph queries
- Incremental indexing — Git-aware change detection (with file-hash fallback) so re-indexing only touches what changed
- NL descriptions — LLM-generated descriptions for undocumented code, prepended before embedding to improve search quality
- Compact MCP responses — ~20x token reduction by default; returns signatures + docstring previews instead of full source
- Multi-collection — Separate code and docs collections with intent-driven routing
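The Reciprocal Rank Fusion step named in the hybrid search feature can be illustrated with a short sketch. This is not clew's implementation: the function name, the chunk IDs, and the constant `k=60` (a value commonly used for RRF) are assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not
    necessarily the value clew uses.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical result lists from the dense and BM25 retrievers.
dense_hits = ["auth.login", "auth.token", "db.session"]
bm25_hits = ["auth.token", "api.rate_limit", "auth.login"]
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Items that appear near the top of both lists (`auth.token`, `auth.login`) end up ahead of items that only one retriever returned, which is the point of fusing by rank rather than by raw score.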
Prerequisites
Quick start
1. Install clew
git clone https://github.com/ruminaider/clew.git
cd clew
pip install -e .
2. Start Qdrant
docker compose up -d qdrant
3. Set your API key
cp .env.example .env
# Edit .env and add your VOYAGE_API_KEY
Or export directly:
export VOYAGE_API_KEY=pa-xxxxxxxxxxxxxxxxxxxx
4. Index your project
clew index /path/to/your/project --full
5. Search
clew search "how do we handle authentication"
Configuration
Environment variables
| Variable | Required | Default | Description |
|---|---|---|---|
| VOYAGE_API_KEY | Yes | — | Voyage AI API key for embeddings and re-ranking |
| QDRANT_URL | No | http://localhost:6333 | Qdrant server endpoint |
| QDRANT_API_KEY | No | — | Qdrant API key (if auth is enabled) |
| CLEW_CACHE_DIR | No | Auto-detected from git root | SQLite cache directory (.clew/) |
| CLEW_LOG_LEVEL | No | INFO | Logging verbosity |
| ANTHROPIC_API_KEY | No | — | Required for NL description generation |
The cache directory resolves in order: CLEW_CACHE_DIR env var, then {git_root}/.clew/, then .clew/ relative to the working directory. This ensures the MCP server and CLI share the same cache.
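The resolution order above can be sketched roughly as follows. The function names `find_git_root` and `resolve_cache_dir` are hypothetical, and finding the git root by shelling out to `git rev-parse --show-toplevel` is an assumption; clew may detect it differently.

```python
import os
import subprocess
from pathlib import Path


def find_git_root(start):
    """Return the git top-level directory containing `start`, or None."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "--show-toplevel"],
            cwd=start, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None  # git missing, or `start` is not inside a repository
    return Path(out.stdout.strip())


def resolve_cache_dir(cwd=None, env=None):
    """Apply the documented order: env var, then git root, then cwd."""
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    env = os.environ if env is None else env
    override = env.get("CLEW_CACHE_DIR")
    if override:
        return Path(override)
    git_root = find_git_root(cwd)
    if git_root is not None:
        return git_root / ".clew"
    return cwd / ".clew"
```

Because the MCP server may be launched from a different working directory than the CLI, only the first two rules give both processes the same answer, which is why an absolute `CLEW_CACHE_DIR` is the safest setting.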
Project configuration (optional)
Create a config.yaml in your project root for fine-grained control:
project:
  name: "my-project"
  root: "."
collections:
  code:
    include:
      - "src/**/*.py"
      - "frontend/**/*.tsx"
    exclude:
      - "**/migrations/*.py"
      - "**/__pycache__/**"
  docs:
    include:
      - "**/*.md"
chunking:
  default_max_tokens: 3000
  overlap_tokens: 200
terminology_file: indexer/terminology.yaml
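To make the include and exclude patterns concrete, here is a minimal sketch of how such globs could select files. The helpers `glob_to_regex` and `is_indexed` are hypothetical, and the `**` semantics shown (zero or more directories) are an assumption; clew's own matcher may behave differently at the edges.

```python
import re


def glob_to_regex(pattern):
    """Translate a glob with `**` support into a compiled regex."""
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append(r"(?:[^/]+/)*")   # zero or more directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(r".*")            # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            out.append(r"[^/]*")         # anything within one path segment
            i += 1
        elif pattern[i] == "?":
            out.append(r"[^/]")          # one character, not a slash
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + r"\Z")


def is_indexed(path, include, exclude):
    """A path is indexed if some include matches and no exclude does."""
    def matches(patterns):
        return any(glob_to_regex(p).match(path) for p in patterns)
    return matches(include) and not matches(exclude)


# Patterns copied from the `code` collection in the example config.
INCLUDE = ["src/**/*.py", "frontend/**/*.tsx"]
EXCLUDE = ["**/migrations/*.py", "**/__pycache__/**"]
```

With these patterns, `src/auth/models.py` is selected, while `src/app/migrations/0001_initial.py` matches an include but is dropped by the exclude list.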
CLI usage
clew index
Index a codebase for search.
# Incremental — only re-index changed files
clew index /path/to/project
# Full reindex
clew index /path/to/project --full
# Generate NL descriptions for undocumented code (requires ANTHROPIC_API_KEY)
clew index /path/to/project --nl-descriptions
# Index specific files
clew index --files src/auth.py --files src/models.py
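Incremental indexing falls back to file hashes when git cannot report what changed. The sketch below shows one way that fallback could work, assuming a stored `{path: sha256}` mapping; the function names and the temporary demo file are hypothetical, and clew keeps its real cache in SQLite.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path):
    """Hash a file in 64 KiB blocks so large files are not read at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def changed_files(paths, known_hashes):
    """Return (changed, current): paths whose hash differs from the stored
    one, and the fresh {path: hash} mapping to store for next time."""
    current = {str(p): file_sha256(p) for p in paths}
    changed = [p for p, h in current.items() if known_hashes.get(p) != h]
    return changed, current


# Demo on a throwaway file: new file, unchanged file, edited file.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "auth.py"
    target.write_text("def login(): ...\n")
    first_changed, cache = changed_files([target], {})
    second_changed, cache = changed_files([target], cache)
    target.write_text("def login(user): ...\n")
    third_changed, cache = changed_files([target], cache)
```

Only the first and third passes report the file, so a re-index would skip the embedding call on the second pass.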
clew search
Search the indexed codebase.
# Natural language query
clew search "where is the rate limiter configured"
# Filter by language
clew search "database models" --language python
# Filter by chunk type
clew search "API endpoints" --chunk-type function
# Set intent explicitly (code, docs, debug, location)
clew search "why does login fail" --intent debug
# JSON output
clew search "user authentication" --raw
clew trace
Trace code relationships via BFS graph traversal.
# Show all relationships for an entity
clew trace "src/auth/models.py::User"
# Only inbound (what depends on this)
clew trace "src/auth/models.py::User" --direction inbound
# Limit depth and filter types
clew trace "src/api/views.py::handle_request" --depth 3 --type calls --type imports
# JSON output
clew trace "src/auth/models.py::User" --raw
Relationship types: imports, calls, inherits, decorates, renders, tests, calls_api
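As a rough illustration of the traversal behind `clew trace`, the sketch below runs a breadth-first search over an in-memory edge list. The edges, the `"outbound"` direction name, and the tuple-shaped result are assumptions made for the example, not clew's storage format or output.

```python
from collections import deque

# Hypothetical edges: (source, relationship, target).
EDGES = [
    ("src/api/views.py::handle_request", "calls", "src/auth/service.py::authenticate"),
    ("src/auth/service.py::authenticate", "imports", "src/auth/models.py::User"),
    ("tests/test_auth.py::test_login", "tests", "src/auth/service.py::authenticate"),
]


def trace(entity, edges, direction="both", max_depth=2, relationship_types=None):
    """Breadth-first walk returning (depth, source, relationship, target)."""
    seen_nodes, seen_edges = {entity}, set()
    queue, found = deque([(entity, 0)]), []
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand past the depth limit
        for src, rel, dst in edges:
            if relationship_types and rel not in relationship_types:
                continue
            if direction in ("outbound", "both") and src == node:
                neighbor = dst
            elif direction in ("inbound", "both") and dst == node:
                neighbor = src
            else:
                continue
            edge = (src, rel, dst)
            if edge not in seen_edges:  # report each edge once
                seen_edges.add(edge)
                found.append((depth + 1,) + edge)
            if neighbor not in seen_nodes:
                seen_nodes.add(neighbor)
                queue.append((neighbor, depth + 1))
    return found


inbound = trace("src/auth/models.py::User", EDGES, direction="inbound")
```

Here `inbound` lists the importer of `User` at depth 1, followed by that importer's caller and its test at depth 2.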
clew status
Show system health and index statistics.
clew status
clew serve
Start the MCP server (stdio transport) for Claude Code integration.
clew serve
MCP integration
Add clew to Claude Code's .mcp.json:
{
  "mcpServers": {
    "clew": {
      "command": "clew",
      "args": ["serve"],
      "env": {
        "VOYAGE_API_KEY": "pa-xxxxxxxxxxxxxxxxxxxx",
        "QDRANT_URL": "http://localhost:6333"
      }
    }
  }
}
MCP tools
search
Semantic search over the indexed codebase.
search(query, limit=5, collection="code", active_file=None,
intent=None, filters=None, detail="compact")
- detail="compact" (default) — returns signature + docstring snippet
- detail="full" — returns complete source content
- filters — metadata filters: language, chunk_type, app_name, layer, is_test
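To give a feel for what a compact result contains, the sketch below reduces Python source to signatures plus the first docstring line. It uses the standard-library `ast` module (Python 3.9+) as a stand-in for tree-sitter, and `compact_summary` is a hypothetical helper, so treat it as an illustration of the idea rather than clew's response format.

```python
import ast


def compact_summary(source, max_doc_chars=80):
    """Return one 'signature  # docstring preview' line per top-level def/class."""
    lines = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            prefix = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
            signature = f"{prefix} {node.name}({ast.unparse(node.args)})"
        elif isinstance(node, ast.ClassDef):
            signature = f"class {node.name}"
        else:
            continue  # skip imports, assignments, and other statements
        doc = (ast.get_docstring(node) or "").strip().splitlines()
        preview = doc[0][:max_doc_chars] if doc else ""
        lines.append(f"{signature}  # {preview}" if preview else signature)
    return lines


SAMPLE = '''
def login(user, password, remember=False):
    """Check credentials and start a session.

    Longer explanation that a compact result would omit.
    """
    return True

class RateLimiter:
    pass
'''
summary = compact_summary(SAMPLE)
```

The function body and the rest of the docstring are dropped, which is where the token savings of the compact mode come from.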
get_context
Read file content with optional related code chunks.
get_context(file_path, line_start=None, line_end=None, include_related=False)
explain
Search for context about a symbol or question in a file.
explain(file_path, symbol=None, question=None, detail="compact")
trace
Traverse code relationships (imports, calls, inheritance, etc.).
trace(entity, direction="both", max_depth=2, relationship_types=None)
index_status
Check health or trigger re-indexing.
index_status(action="status", project_root=None)
Architecture
                 ┌──────────────┐
                 │ Claude Code  │
                 │ (MCP client) │
                 └──────┬───────┘
                        │ stdio
                 ┌──────▼───────┐
                 │  MCP Server  │  search, get_context, explain, trace, index_status
                 └──────┬───────┘
                        │
           ┌────────────▼────────────┐
           │     Search Pipeline     │
           │  enhance → classify →   │
           │  hybrid search → rerank │
           └────────────┬────────────┘
                        │
           ┌────────────▼────────────┐
           │   Qdrant Collections    │
           │  code: py/ts/tsx/js/jsx │
           │  docs: markdown         │
           └────────────┬────────────┘
                        │
           ┌────────────▼────────────┐
           │    Indexing Pipeline    │
           │  discover → chunk →     │
           │  enrich → embed →       │
           │  upsert + relationships │
           └────────────┬────────────┘
                        │
     ┌──────────┬───────┼───────┬──────────┐
     ▼          ▼       ▼       ▼          ▼
tree-sitter  Voyage  SQLite    git     Anthropic
(AST parse)  (embed) (cache)  (diff)   (NL desc)
Search pipeline
- Query enhancement — Terminology expansion via YAML (abbreviations, synonyms)
- Intent classification — Heuristic routing: CODE, DOCS, DEBUG, LOCATION
- Hybrid search — Dense + BM25 multi-prefetch with structural boosting (same-module, test files for debug intent)
- Re-ranking — Voyage rerank-2.5 for final ordering
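The first two stages of the pipeline above can be sketched in a few lines. Everything here is assumed for illustration: the terminology map, the keyword cues, and the function names `enhance` and `classify`. clew loads its terminology from a YAML file and its real heuristics are not documented in this README.

```python
import re

# Hypothetical terminology map; clew reads its own from a YAML file.
TERMINOLOGY = {"auth": ["authentication", "authorization"], "db": ["database"]}

# Hypothetical keyword cues, checked in order; CODE is the fallback.
INTENT_CUES = [
    ("DEBUG", ("why", "fail", "error", "bug", "crash")),
    ("LOCATION", ("where", "which file", "defined")),
    ("DOCS", ("how do i", "guide", "documentation")),
]


def enhance(query):
    """Append expansions for any abbreviation found in the query."""
    extra = []
    for word in re.findall(r"\w+", query.lower()):
        extra.extend(TERMINOLOGY.get(word, []))
    return query if not extra else f"{query} {' '.join(extra)}"


def classify(query):
    """Return the first intent whose cue appears in the query."""
    lowered = query.lower()
    for intent, cues in INTENT_CUES:
        if any(cue in lowered for cue in cues):
            return intent
    return "CODE"
```

In this sketch `classify("why does login fail")` yields `DEBUG`, the intent that the pipeline description associates with boosting test files.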
Chunking strategy
| File pattern | Strategy | Token range |
|---|---|---|
| models.py | Class + fields as unit | 1,500 - 3,000 |
| views.py | Class as unit; split large actions | 2,000 - 4,000 |
| tasks.py | Function with decorators | 1,000 - 2,000 |
| *.tsx, *.jsx | Component boundaries | 1,500 - 3,000 |
| *.md | Section-level by headers | 1,000 - 2,000 |
| Migrations | Skipped | — |
Fallback chain: tree-sitter AST → token-recursive splitting → line-based splitting.
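A simplified version of this fallback chain is sketched below. It covers only the first and last tiers: Python's standard-library `ast` module stands in for tree-sitter, whitespace word counts stand in for a real tokenizer, and the token-recursive middle tier is omitted. All function names are hypothetical.

```python
import ast


def count_tokens(text):
    # Crude stand-in for a real tokenizer: whitespace-separated words.
    return len(text.split())


def split_lines(text, max_tokens):
    """Last resort: pack whole lines greedily into chunks under the budget.
    A single line larger than the budget is kept whole."""
    chunks, current = [], []
    for line in text.splitlines(keepends=True):
        if current and count_tokens("".join(current) + line) > max_tokens:
            chunks.append("".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("".join(current))
    return chunks


def chunk_python(source, max_tokens=3000):
    """Prefer one chunk per top-level statement; fall back to line packing."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return split_lines(source, max_tokens)   # parser failed
    chunks = []
    for node in tree.body:
        segment = ast.get_source_segment(source, node)
        if segment is None:
            continue
        if count_tokens(segment) <= max_tokens:
            chunks.append(segment)               # one semantic unit
        else:
            chunks.extend(split_lines(segment, max_tokens))  # unit too large
    return chunks
```

The design choice worth noting is that a parse failure never aborts indexing: a file the parser rejects still produces searchable chunks, only with less meaningful boundaries.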
Development
Setup
pip install -e ".[dev]"
Tests
# All tests with coverage
pytest --cov=clew -v
# Integration tests (requires running Qdrant)
pytest -m integration
# Single test file
pytest tests/search/test_hybrid.py -v
Linting and type checking
ruff format . # Format
ruff check . # Lint
mypy clew/ # Type check (strict mode)
Project structure
clew/
├── chunker/ # AST parsing, language strategies, token counting
├── clients/ # External service wrappers (Voyage, Qdrant, Anthropic)
├── indexer/ # Pipeline, caching, change detection, relationship extraction
│ └── extractors/ # Pluggable per-language relationship extractors
├── search/ # Engine, hybrid retrieval, intent classification, re-ranking
├── cli.py # Typer CLI
├── mcp_server.py # FastMCP server (5 tools)
├── config.py # Environment variable loading
├── factory.py # Component wiring (no global state)
├── models.py # Pydantic v2 config models
├── exceptions.py # Error hierarchy with fix hints
├── discovery.py # File discovery with ignore patterns and safety checks
└── safety.py # File size, chunk count, collection limits
tests/ # 491 tests, 92% coverage
docs/
├── DESIGN.md # Architecture and design decisions
├── IMPLEMENTATION.md # Concrete specs and schemas
├── adr/ # Architecture Decision Records
└── plans/ # Phase and version plans
Tech stack
| Component | Technology |
|---|---|
| Embeddings | Voyage AI voyage-code-3 (1024 dims) |
| Re-ranking | Voyage rerank-2.5 |
| Vector DB | Qdrant (self-hosted, Docker) |
| AST parsing | tree-sitter (Python, TypeScript, JavaScript) |
| CLI | typer + rich |
| Config | Pydantic v2 + YAML |
| Cache | SQLite (contextmanager, no ORM) |
| Change detection | git diff primary, SHA-256 file hash fallback |
| MCP | FastMCP (stdio transport) |
Troubleshooting
| Problem | Fix |
|---|---|
| Qdrant not running | docker compose up -d qdrant |
| VOYAGE_API_KEY not set | export VOYAGE_API_KEY=pa-... |
| No search results | Run clew index --full to reindex |
| MCP server can't find cache | Set CLEW_CACHE_DIR to an absolute path, or run from within the git repo |
| Stale results after code changes | Run clew index (incremental) to pick up changes |
License
MIT