semantic-codesearch
Semantic code search CLI — find code by meaning, not just text.
Powered by bge-small-code-v1, a 33M-parameter code embedding model trained on 200K CoRNStack triplets across Python, JavaScript, Java, and Go.
Install
pip install semantic-codesearch
Usage
Index a codebase
codesearch index .
Walks the directory, chunks code files (30 lines with 5-line overlap), embeds each chunk with the ONNX model, and stores everything in a local SQLite database (.codesearch.db).
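The chunking step can be sketched as follows. This is a minimal illustration of 30-line windows with a 5-line overlap, not the package's actual code; the function name `chunk_lines` and the tuple layout are assumptions made for the example.

```python
def chunk_lines(lines, size=30, overlap=5):
    """Split a list of lines into overlapping windows.

    Returns (start_line, end_line, text) tuples, where line numbers are
    1-based and inclusive. Hypothetical helper, not the package's API.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # each new window starts 25 lines after the last
    chunks = []
    start = 0
    while start < len(lines):
        window = lines[start:start + size]
        chunks.append((start + 1, start + len(window), "\n".join(window)))
        if start + size >= len(lines):
            break  # this window already reached the end of the file
        start += step
    return chunks


if __name__ == "__main__":
    sample = [f"line {i}" for i in range(1, 61)]
    for first, last, _ in chunk_lines(sample):
        print(f"lines {first}-{last}")
```

The overlap means a function that straddles a window boundary still appears mostly intact in at least one chunk, at the cost of embedding some lines twice.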
Search by meaning
codesearch search "function that sorts users by date"
codesearch search "authentication middleware" -n 10
codesearch search "database connection pool" -d /path/to/repo
Results show file path, line range, similarity score, and a code preview:
────────────────────────────────────────────────────────────
#1 src/auth.py:26-32 (71.8% match)
────────────────────────────────────────────────────────────
26 if AUTH_TOKEN:
27 auth = request.headers.get("authorization", "")
28 if auth != f"Bearer {AUTH_TOKEN}":
29 return JSONResponse({"error": "unauthorized"}, ...)
View index stats
codesearch stats
Features
- Semantic search: finds code by meaning, not keywords. "sort by date" finds sorted(users, key=lambda u: u.created_at).
- Fast: ONNX model runs on CPU. Indexing ~50 files takes ~15 seconds. Searches are instant (cosine similarity on cached embeddings).
- Local & private: everything runs locally. No API calls, no data leaves your machine.
- Auto-downloads model: fetches bge-small-code-v1 ONNX from HuggingFace on first run (~130MB).
- 50+ file types: Python, JS, TS, Java, Go, Rust, C/C++, SQL, YAML, and more.
- Smart directory skipping: ignores .git, node_modules, __pycache__, .venv, dist, etc.
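Directory skipping of this kind is commonly done by pruning `os.walk` in place. The sketch below assumes that approach; `SKIP_DIRS` lists only the directories named above, `CODE_EXTENSIONS` is an illustrative subset rather than the tool's real list, and `iter_code_files` is a hypothetical name.

```python
import os

# Directories named in the feature list; the real tool skips more ("etc.").
SKIP_DIRS = {".git", "node_modules", "__pycache__", ".venv", "dist"}

# Illustrative subset of extensions, assumed for this example only.
CODE_EXTENSIONS = {".py", ".js", ".ts", ".java", ".go", ".rs", ".c", ".cpp", ".sql", ".yaml"}


def iter_code_files(root):
    """Yield paths of code files under root, skipping ignored directories."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Assigning to dirnames[:] prunes the walk, so os.walk never
        # descends into the skipped directories at all.
        dirnames[:] = sorted(d for d in dirnames if d not in SKIP_DIRS)
        for name in sorted(filenames):
            if os.path.splitext(name)[1].lower() in CODE_EXTENSIONS:
                yield os.path.join(dirpath, name)


if __name__ == "__main__":
    for path in iter_code_files("."):
        print(path)
```

Pruning during the walk avoids reading large trees such as node_modules, which is usually cheaper than filtering paths afterwards.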
How it works
- Chunking: splits each file into overlapping 30-line chunks
- Embedding: runs each chunk through bge-small-code-v1 (ONNX, 384-dim output)
- Storage: stores embeddings + metadata in SQLite (.codesearch.db)
- Search: embeds your query, computes cosine similarity against all chunks, returns top-k
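The storage and search steps above can be illustrated with the standard library alone. This is a sketch under stated assumptions: the table schema, the float32 blob encoding, and every function name here are hypothetical, and the tiny 3-dimensional vectors stand in for the model's 384-dimensional embeddings.

```python
import math
import sqlite3
from array import array


def to_blob(vec):
    """Pack a vector as raw float32 bytes for a BLOB column."""
    return array("f", vec).tobytes()


def from_blob(blob):
    """Unpack float32 bytes back into a list of floats."""
    values = array("f")
    values.frombytes(blob)
    return list(values)


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def create_index(conn):
    # Hypothetical schema: one row per chunk.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        "id INTEGER PRIMARY KEY, path TEXT, start_line INTEGER, "
        "end_line INTEGER, embedding BLOB)"
    )


def add_chunk(conn, path, start_line, end_line, vec):
    conn.execute(
        "INSERT INTO chunks (path, start_line, end_line, embedding) "
        "VALUES (?, ?, ?, ?)",
        (path, start_line, end_line, to_blob(vec)),
    )


def search(conn, query_vec, top_k=5):
    """Score every stored chunk against the query and return the best top_k."""
    rows = conn.execute("SELECT path, start_line, end_line, embedding FROM chunks")
    scored = [
        (cosine(query_vec, from_blob(blob)), path, start, end)
        for path, start, end, blob in rows
    ]
    scored.sort(key=lambda row: row[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_index(conn)
    add_chunk(conn, "a.py", 1, 30, [1.0, 0.0, 0.0])
    add_chunk(conn, "b.py", 1, 30, [0.0, 1.0, 0.0])
    for score, path, start, end in search(conn, [1.0, 0.0, 0.0], top_k=2):
        print(f"{path}:{start}-{end} ({score:.1%} match)")
```

A brute-force scan like this is linear in the number of chunks, which is usually acceptable for a single repository; larger indexes would typically move to a vectorized or approximate nearest-neighbor search.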
Model
Built on BAAI/bge-small-en-v1.5 (33M params), fine-tuned on CoRNStack code search triplets with Matryoshka loss for flexible embedding dimensions (384/256/128/64).
- Accuracy@1: 72.6% | Accuracy@10: 91.8% | NDCG@10: 82.5%
- ONNX INT8: 33.8MB — small enough to run in a browser
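Matryoshka-trained embeddings are normally shortened by keeping the leading dimensions and re-normalizing to unit length. The sketch below shows that general technique; whether the CLI exposes a dimension option is not stated here, and `truncate_embedding` is a hypothetical helper.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale to unit L2 norm."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    # Re-normalizing keeps cosine similarity well behaved after truncation.
    return [x / norm for x in head] if norm else head


if __name__ == "__main__":
    full = [3.0, 4.0, 12.0]  # toy vector standing in for a 384-dim embedding
    print(truncate_embedding(full, 2))
```

Smaller dimensions reduce storage and speed up similarity scans, generally at some cost in retrieval accuracy.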
Requirements
- Python 3.10-3.13 (onnxruntime doesn't support 3.14 yet)
- No GPU needed — runs on CPU
License
MIT