clew

Semantic code search with hybrid retrieval and MCP integration for Claude Code.

Ask natural language questions about your codebase — clew indexes your code with AST-aware chunking, embeds it with Voyage AI, stores it in Qdrant, and serves results through both a CLI and an MCP server that Claude Code can call directly.

Features

  • Hybrid search — Dense embeddings (Voyage voyage-code-3) + BM25 keyword matching fused with Reciprocal Rank Fusion, optionally re-ranked with Voyage rerank-2.5
  • AST-aware chunking — tree-sitter parses Python, TypeScript, and JavaScript into semantic units (functions, classes, components) with token-aware fallback splitting
  • Code relationship tracing — Extracts imports, calls, inheritance, decorators, JSX renders, test mappings, and API boundaries; traversable via BFS graph queries
  • Incremental indexing — Git-aware change detection (with file-hash fallback) so re-indexing only touches what changed
  • NL descriptions — LLM-generated descriptions for undocumented code, prepended before embedding to improve search quality
  • Compact MCP responses — ~20x token reduction by default; returns signatures + docstring previews instead of full source
  • Multi-collection — Separate code and docs collections with intent-driven routing
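
To make the fusion step in the hybrid search bullet concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists. It is an illustration only: the function name, the sample IDs, and the constant `k=60` (a commonly used default) are assumptions, and clew may delegate fusion to Qdrant rather than compute it in Python.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list, best first.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in.
    k=60 is an assumed default, not a value taken from clew.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the output is deterministic
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the dense and BM25 retrievers
dense = ["auth.py::login", "auth.py::logout", "models.py::User"]
bm25 = ["models.py::User", "auth.py::login", "views.py::index"]
fused = reciprocal_rank_fusion([dense, bm25])
```

An item that ranks well in both lists rises above one that appears in only a single list, which is why the fused order can differ from either input.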

Prerequisites

Quick start

1. Install clew

git clone https://github.com/ruminaider/clew.git
cd clew
pip install -e .

2. Start Qdrant

docker compose up -d qdrant

3. Set your API key

cp .env.example .env
# Edit .env and add your VOYAGE_API_KEY

Or export directly:

export VOYAGE_API_KEY=pa-xxxxxxxxxxxxxxxxxxxx

4. Index your project

clew index /path/to/your/project --full

5. Search

clew search "how do we handle authentication"

Configuration

Environment variables

Variable           Required  Default                      Description
VOYAGE_API_KEY     Yes       -                            Voyage AI API key for embeddings and re-ranking
QDRANT_URL         No        http://localhost:6333        Qdrant server endpoint
QDRANT_API_KEY     No        -                            Qdrant API key (if auth is enabled)
CLEW_CACHE_DIR     No        Auto-detected from git root  SQLite cache directory (.clew/)
CLEW_LOG_LEVEL     No        INFO                         Logging verbosity
ANTHROPIC_API_KEY  No        -                            Required for NL description generation

The cache directory resolves in order: CLEW_CACHE_DIR env var, then {git_root}/.clew/, then .clew/ relative to the working directory. This ensures the MCP server and CLI share the same cache.
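
The resolution order can be sketched as follows. This is an assumption-laden illustration, not clew's actual code: `resolve_cache_dir` and `find_git_root` are hypothetical names, and the git root is found here by walking up to the nearest `.git` entry rather than by calling git.

```python
import os
from pathlib import Path


def find_git_root(start):
    """Return the nearest ancestor of `start` containing .git, or None."""
    for candidate in [start, *start.parents]:
        if (candidate / ".git").exists():
            return candidate
    return None


def resolve_cache_dir(start=None, env=None):
    """Apply the documented order: env var, then git root, then working dir."""
    env = os.environ if env is None else env
    start = Path(start).resolve() if start is not None else Path.cwd()

    override = env.get("CLEW_CACHE_DIR")
    if override:
        return Path(override)            # 1. explicit override wins

    git_root = find_git_root(start)
    if git_root is not None:
        return git_root / ".clew"        # 2. {git_root}/.clew/

    return start / ".clew"               # 3. relative to the working directory
```

Because the git root is the same no matter which subdirectory a process starts in, step 2 is what lets the CLI and the MCP server land on the same directory.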

Project configuration (optional)

Create a config.yaml in your project root for fine-grained control:

project:
  name: "my-project"
  root: "."

collections:
  code:
    include:
      - "src/**/*.py"
      - "frontend/**/*.tsx"
    exclude:
      - "**/migrations/*.py"
      - "**/__pycache__/**"
  docs:
    include:
      - "**/*.md"

chunking:
  default_max_tokens: 3000
  overlap_tokens: 200

terminology_file: indexer/terminology.yaml
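
As a rough illustration of how include and exclude patterns like the ones above could select files, the sketch below translates each glob into a regular expression. The exact glob semantics clew applies are not stated here, so treat the `**` handling and the helper names `glob_to_regex` and `select_files` as assumptions.

```python
import re


def glob_to_regex(pattern):
    """Translate a glob with ** support into a compiled regex (assumed semantics)."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")      # zero or more whole directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")            # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")         # anything within one path segment
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")          # one character within a segment
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def select_files(paths, include, exclude=()):
    """Keep paths matching any include pattern and no exclude pattern."""
    inc = [glob_to_regex(p) for p in include]
    exc = [glob_to_regex(p) for p in exclude]
    return [
        path for path in paths
        if any(r.fullmatch(path) for r in inc)
        and not any(r.fullmatch(path) for r in exc)
    ]


# Hypothetical repository-relative paths
paths = [
    "src/auth/models.py",
    "src/app/migrations/0001_initial.py",
    "src/__pycache__/x.py",
    "frontend/components/Login.tsx",
    "README.md",
]
code_files = select_files(
    paths,
    include=["src/**/*.py", "frontend/**/*.tsx"],
    exclude=["**/migrations/*.py", "**/__pycache__/**"],
)
```

Excludes are applied after includes, so a migration file under `src/` is dropped even though it matches `src/**/*.py`.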

CLI usage

clew index

Index a codebase for search.

# Incremental — only re-index changed files
clew index /path/to/project

# Full reindex
clew index /path/to/project --full

# Generate NL descriptions for undocumented code (requires ANTHROPIC_API_KEY)
clew index /path/to/project --nl-descriptions

# Index specific files
clew index --files src/auth.py --files src/models.py
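
When git is unavailable, incremental indexing falls back to file hashes. The sketch below shows one way such a fallback could work with SHA-256; the function names and the shape of the stored hash map are hypothetical, and clew's SQLite cache schema is not shown here.

```python
import hashlib


def file_sha256(path):
    """Hash a file in blocks so large files are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def detect_changes(paths, known_hashes):
    """Compare current file hashes with a previously stored {path: hash} map.

    Returns (changed, removed, current): paths that are new or modified,
    paths that were known but are no longer listed, and the fresh hash map.
    """
    current = {str(path): file_sha256(path) for path in paths}
    changed = sorted(p for p, h in current.items() if known_hashes.get(p) != h)
    removed = sorted(p for p in known_hashes if p not in current)
    return changed, removed, current
```

Only the paths in `changed` would need re-chunking and re-embedding; `removed` marks entries whose index points could be deleted.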

clew search

Search the indexed codebase.

# Natural language query
clew search "where is the rate limiter configured"

# Filter by language
clew search "database models" --language python

# Filter by chunk type
clew search "API endpoints" --chunk-type function

# Set intent explicitly (code, docs, debug, location)
clew search "why does login fail" --intent debug

# JSON output
clew search "user authentication" --raw

clew trace

Trace code relationships via BFS graph traversal.

# Show all relationships for an entity
clew trace "src/auth/models.py::User"

# Only inbound (what depends on this)
clew trace "src/auth/models.py::User" --direction inbound

# Limit depth and filter types
clew trace "src/api/views.py::handle_request" --depth 3 --type calls --type imports

# JSON output
clew trace "src/auth/models.py::User" --raw

Relationship types: imports, calls, inherits, decorates, renders, tests, calls_api
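
The traversal behind `clew trace` can be pictured as a breadth-first search over typed edges. The sketch below is a simplified model under stated assumptions: edges are plain `(source, relationship, target)` tuples, the direction value `"outbound"` is assumed as the counterpart of `inbound`, and the sample entities are invented.

```python
from collections import deque


def trace(edges, entity, direction="both", max_depth=2, relationship_types=None):
    """Breadth-first walk over (source, relationship, target) edges.

    Returns (depth, source, relationship, target) tuples in discovery order.
    """
    allowed = set(relationship_types) if relationship_types else None
    visited = {entity}
    seen_edges = set()
    queue = deque([(entity, 0)])
    found = []
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand beyond the depth limit
        for edge in edges:
            src, rel, dst = edge
            if allowed is not None and rel not in allowed:
                continue
            if direction in ("outbound", "both") and src == node:
                neighbor = dst
            elif direction in ("inbound", "both") and dst == node:
                neighbor = src
            else:
                continue
            if edge in seen_edges:
                continue  # an edge can be reached from both of its ends
            seen_edges.add(edge)
            found.append((depth + 1, src, rel, dst))
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, depth + 1))
    return found


# Hypothetical relationship graph
edges = [
    ("src/api/views.py::login_view", "calls", "src/auth/service.py::authenticate"),
    ("src/auth/service.py::authenticate", "imports", "src/auth/models.py::User"),
    ("tests/test_auth.py::test_login", "tests", "src/auth/service.py::authenticate"),
    ("src/auth/models.py::Admin", "inherits", "src/auth/models.py::User"),
]
dependents = trace(edges, "src/auth/models.py::User", direction="inbound")
```

With `direction="inbound"` the walk follows edges backwards, so the result lists what depends on `User` directly (depth 1) and transitively (depth 2).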

clew status

Show system health and index statistics.

clew status

clew serve

Start the MCP server (stdio transport) for Claude Code integration.

clew serve

MCP integration

Add clew to Claude Code's .mcp.json:

{
  "mcpServers": {
    "clew": {
      "command": "clew",
      "args": ["serve"],
      "env": {
        "VOYAGE_API_KEY": "pa-xxxxxxxxxxxxxxxxxxxx",
        "QDRANT_URL": "http://localhost:6333"
      }
    }
  }
}

MCP tools

search

Semantic search over the indexed codebase.

search(query, limit=5, collection="code", active_file=None,
       intent=None, filters=None, detail="compact")
  • detail="compact" (default) — returns signature + docstring snippet
  • detail="full" — returns complete source content
  • filters — metadata filters: language, chunk_type, app_name, layer, is_test
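
To show what a compact result might contain, the following sketch extracts a signature and a one-line docstring preview from Python source. It uses the standard library `ast` module as a stand-in for tree-sitter, and the output keys `signature` and `preview` are hypothetical, not clew's response schema.

```python
import ast


def compact_summary(source, preview_chars=80):
    """Return signature + docstring preview for each top-level def or class."""
    results = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            keyword = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
            signature = f"{keyword} {node.name}({ast.unparse(node.args)})"
            if node.returns is not None:
                signature += f" -> {ast.unparse(node.returns)}"
        elif isinstance(node, ast.ClassDef):
            bases = ", ".join(ast.unparse(base) for base in node.bases)
            signature = f"class {node.name}({bases})" if bases else f"class {node.name}"
        else:
            continue  # skip imports, assignments, and other statements
        doc = ast.get_docstring(node) or ""
        first_line = doc.splitlines()[0] if doc else ""
        results.append({"signature": signature, "preview": first_line[:preview_chars]})
    return results


# Hypothetical source file content
sample = '''
def authenticate(user: str, token: str) -> bool:
    """Check a token against the session store.

    Longer explanation that a compact response would omit.
    """
    return True


class User(Base):
    """A registered account."""
'''
summary = compact_summary(sample)
```

Dropping function bodies and keeping only the first docstring line is what makes the compact form so much smaller than `detail="full"`.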

get_context

Read file content with optional related code chunks.

get_context(file_path, line_start=None, line_end=None, include_related=False)

explain

Search for context about a symbol or question in a file.

explain(file_path, symbol=None, question=None, detail="compact")

trace

Traverse code relationships (imports, calls, inheritance, etc.).

trace(entity, direction="both", max_depth=2, relationship_types=None)

index_status

Check health or trigger re-indexing.

index_status(action="status", project_root=None)

Architecture

                    ┌──────────────┐
                    │ Claude Code  │
                    │ (MCP client) │
                    └──────┬───────┘
                           │ stdio
                    ┌──────▼───────┐
                    │  MCP Server  │  search, get_context, explain, trace, index_status
                    └──────┬───────┘
                           │
              ┌────────────▼────────────┐
              │     Search Pipeline     │
              │  enhance → classify →   │
              │  hybrid search → rerank │
              └────────────┬────────────┘
                           │
              ┌────────────▼────────────┐
              │    Qdrant Collections   │
              │  code: py/ts/tsx/js/jsx │
              │  docs: markdown         │
              └────────────┬────────────┘
                           │
              ┌────────────▼────────────┐
              │   Indexing Pipeline     │
              │  discover → chunk →     │
              │  enrich → embed →       │
              │  upsert + relationships │
              └────────────┬────────────┘
                           │
        ┌──────────┬───────┼───────┬──────────┐
        ▼          ▼       ▼       ▼          ▼
   tree-sitter  Voyage   SQLite   git     Anthropic
   (AST parse)  (embed)  (cache)  (diff)  (NL desc)

Search pipeline

  1. Query enhancement — Terminology expansion via YAML (abbreviations, synonyms)
  2. Intent classification — Heuristic routing: CODE, DOCS, DEBUG, LOCATION
  3. Hybrid search — Dense + BM25 multi-prefetch with structural boosting (same-module, test files for debug intent)
  4. Re-ranking — Voyage rerank-2.5 for final ordering
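
The first two stages can be sketched in a few lines. Everything specific here is an assumption made for illustration: the terminology entries (which would normally come from the YAML file), the keyword patterns, and the function names. The real heuristics are likely richer.

```python
import re

# Hypothetical terminology map; in practice this would be loaded from YAML
TERMINOLOGY = {
    "auth": ["authentication", "authorization"],
    "db": ["database"],
}

# Hypothetical keyword rules, checked in order; CODE is the default intent
INTENT_RULES = [
    ("DEBUG", re.compile(r"\b(why|error|fail\w*|bug|exception|traceback)\b", re.I)),
    ("LOCATION", re.compile(r"\b(where|which file|defined|located)\b", re.I)),
    ("DOCS", re.compile(r"\b(docs?|documentation|readme|guide|how to)\b", re.I)),
]


def enhance_query(query, terminology=TERMINOLOGY):
    """Append expansions for any known abbreviation found in the query."""
    extra = []
    for word in re.findall(r"\w+", query.lower()):
        for expansion in terminology.get(word, []):
            if expansion not in extra:
                extra.append(expansion)
    return f"{query} {' '.join(extra)}" if extra else query


def classify_intent(query):
    """Return the first intent whose pattern matches, else CODE."""
    for intent, pattern in INTENT_RULES:
        if pattern.search(query):
            return intent
    return "CODE"
```

The classified intent would then steer later stages, for example by boosting test files when the intent is DEBUG.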

Chunking strategy

File pattern  Strategy                            Token range
models.py     Class + fields as unit              1,500 - 3,000
views.py      Class as unit; split large actions  2,000 - 4,000
tasks.py      Function with decorators            1,000 - 2,000
*.tsx, *.jsx  Component boundaries                1,500 - 3,000
*.md          Section-level by headers            1,000 - 2,000
Migrations    Skipped                             -

Fallback chain: tree-sitter AST → token-recursive splitting → line-based splitting.
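
The fallback chain can be illustrated with a small Python-only sketch. Several substitutions are made and should be read as assumptions: the standard library `ast` module stands in for tree-sitter, whitespace word counts stand in for a real tokenizer, and "token-recursive splitting" is interpreted as recursive splitting on blank lines.

```python
import ast


def count_tokens(text):
    # Crude stand-in for a real tokenizer: whitespace-separated words
    return len(text.split())


def split_lines(text, max_tokens):
    """Last resort: pack whole lines greedily into chunks within the budget.

    A single line longer than the budget is kept as one oversized chunk.
    """
    chunks, current = [], []
    for line in text.splitlines():
        if current and count_tokens("\n".join(current + [line])) > max_tokens:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


def split_recursive(text, max_tokens):
    """Split on blank lines, recursing until each piece fits the budget."""
    if count_tokens(text) <= max_tokens:
        return [text]
    parts = [part for part in text.split("\n\n") if part.strip()]
    if len(parts) <= 1:
        return split_lines(text, max_tokens)   # no blank lines left to split on
    chunks = []
    for part in parts:
        chunks.extend(split_recursive(part, max_tokens))
    return chunks


def chunk_python(source, max_tokens=3000):
    """AST units first; fall back to text splitting if parsing fails."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return split_recursive(source, max_tokens)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        # Start at the first decorator so decorators stay with their function
        decorators = getattr(node, "decorator_list", [])
        start = min([node.lineno] + [d.lineno for d in decorators])
        segment = "\n".join(lines[start - 1:node.end_lineno])
        chunks.extend(split_recursive(segment, max_tokens))
    return chunks
```

Each stage only runs when the previous one cannot produce a chunk within the token budget, so well-formed, reasonably sized code is always chunked along syntactic boundaries.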

Development

Setup

pip install -e ".[dev]"

Tests

# All tests with coverage
pytest --cov=clew -v

# Integration tests (requires running Qdrant)
pytest -m integration

# Single test file
pytest tests/search/test_hybrid.py -v

Linting and type checking

ruff format .           # Format
ruff check .            # Lint
mypy clew/              # Type check (strict mode)

Project structure

clew/
├── chunker/             # AST parsing, language strategies, token counting
├── clients/             # External service wrappers (Voyage, Qdrant, Anthropic)
├── indexer/             # Pipeline, caching, change detection, relationship extraction
│   └── extractors/      # Pluggable per-language relationship extractors
├── search/              # Engine, hybrid retrieval, intent classification, re-ranking
├── cli.py               # Typer CLI
├── mcp_server.py        # FastMCP server (5 tools)
├── config.py            # Environment variable loading
├── factory.py           # Component wiring (no global state)
├── models.py            # Pydantic v2 config models
├── exceptions.py        # Error hierarchy with fix hints
├── discovery.py         # File discovery with ignore patterns and safety checks
└── safety.py            # File size, chunk count, collection limits

tests/                   # 491 tests, 92% coverage
docs/
├── DESIGN.md            # Architecture and design decisions
├── IMPLEMENTATION.md    # Concrete specs and schemas
├── adr/                 # Architecture Decision Records
└── plans/               # Phase and version plans

Tech stack

Component         Technology
Embeddings        Voyage AI voyage-code-3 (1024 dims)
Re-ranking        Voyage rerank-2.5
Vector DB         Qdrant (self-hosted, Docker)
AST parsing       tree-sitter (Python, TypeScript, JavaScript)
CLI               typer + rich
Config            Pydantic v2 + YAML
Cache             SQLite (contextmanager, no ORM)
Change detection  git diff primary, SHA-256 file hash fallback
MCP               FastMCP (stdio transport)

Troubleshooting

Problem                           Fix
Qdrant not running                docker compose up -d qdrant
VOYAGE_API_KEY not set            export VOYAGE_API_KEY=pa-...
No search results                 Run clew index --full to reindex
MCP server can't find cache       Set CLEW_CACHE_DIR to an absolute path, or run from within the git repo
Stale results after code changes  Run clew index (incremental) to pick up changes

License

MIT
