semantic-codesearch

Semantic code search CLI — find code by meaning, not just text.

Powered by bge-small-code-v1, a 33M-parameter code embedding model trained on 200K CoRNStack triplets across Python, JavaScript, Java, and Go.

Install

pip install semantic-codesearch

Usage

Index a codebase

codesearch index .

Walks the directory, chunks code files (30 lines with 5-line overlap), embeds each chunk with the ONNX model, and stores everything in a local SQLite database (.codesearch.db).
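The chunking step can be pictured as a sliding window over the lines of a file. The sketch below is an illustration, not the tool's actual implementation: the function name chunk_lines and its return shape are assumptions, and only the window size (30) and overlap (5) come from the description above.

```python
def chunk_lines(lines, size=30, overlap=5):
    """Split a list of lines into overlapping windows.

    Returns (first_line, last_line, text) tuples with 1-based,
    inclusive line numbers. Hypothetical helper for illustration.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # with 30/5, each window starts 25 lines later
    chunks = []
    for start in range(0, len(lines), step):
        window = lines[start:start + size]
        chunks.append((start + 1, start + len(window), "\n".join(window)))
        if start + size >= len(lines):
            break  # this window already reaches the end of the file
    return chunks


sample = [f"line {i}" for i in range(1, 61)]  # stand-in for a 60-line file
for first, last, _ in chunk_lines(sample):
    print(first, last)
```

With these settings a 60-line file yields three chunks covering lines 1-30, 26-55, and 51-60, so code near a window boundary appears in two chunks.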

Search by meaning

codesearch search "function that sorts users by date"
codesearch search "authentication middleware" -n 10
codesearch search "database connection pool" -d /path/to/repo

Results show file path, line range, similarity score, and a code preview:

────────────────────────────────────────────────────────────
  #1 src/auth.py:26-32  (71.8% match)
────────────────────────────────────────────────────────────
    26 if AUTH_TOKEN:
    27     auth = request.headers.get("authorization", "")
    28     if auth != f"Bearer {AUTH_TOKEN}":
    29         return JSONResponse({"error": "unauthorized"}, ...)

View index stats

codesearch stats

Features

  • Semantic search — finds code by meaning, not keywords. "sort by date" finds sorted(users, key=lambda u: u.created_at).
  • Fast — ONNX model runs on CPU. Indexing ~50 files takes ~15 seconds. Searches are instant (cosine similarity on cached embeddings).
  • Local & private — everything runs locally. No API calls, no data leaves your machine.
  • Auto-downloads model — fetches bge-small-code-v1 ONNX from HuggingFace on first run (~130MB).
  • 50+ file types — Python, JS, TS, Java, Go, Rust, C/C++, SQL, YAML, and more.
  • Smart directory skipping — ignores .git, node_modules, __pycache__, .venv, dist, etc.
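One common way to implement directory skipping is to prune the directory list that os.walk hands back, so ignored trees are never entered. The sketch below assumes that approach; the skip set contains only the names listed above, the extension set is an arbitrary subset chosen for illustration, and iter_code_files is a hypothetical name rather than part of the tool.

```python
import os

# Only the directory names mentioned above; the real list may be longer.
SKIP_DIRS = {".git", "node_modules", "__pycache__", ".venv", "dist"}
# Assumed subset of extensions, for illustration only.
CODE_EXTS = {".py", ".js", ".ts", ".java", ".go", ".rs", ".c", ".cpp", ".sql", ".yaml"}


def iter_code_files(root):
    """Yield paths of code files under root, skipping ignored directories."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in sorted(filenames):
            if os.path.splitext(name)[1].lower() in CODE_EXTS:
                yield os.path.join(dirpath, name)
```

Calling list(iter_code_files(".")) would then return every matching file below the current directory without touching the contents of node_modules or .git.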

How it works

  1. Chunking — splits each file into overlapping 30-line chunks
  2. Embedding — runs each chunk through bge-small-code-v1 (ONNX, 384-dim output)
  3. Storage — stores embeddings + metadata in SQLite (.codesearch.db)
  4. Search — embeds your query, computes cosine similarity against all chunks, returns top-k
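Steps 3 and 4 can be sketched with the standard library alone. Everything below is an assumption made for illustration: the table layout, the float32 BLOB encoding, the helper names, and the 3-dimensional toy vectors that stand in for real 384-dimensional model output. The actual schema of .codesearch.db is not documented here.

```python
import math
import sqlite3
from array import array


def to_blob(vec):
    """Pack a list of floats into float32 bytes for a BLOB column."""
    return array("f", vec).tobytes()


def from_blob(blob):
    """Unpack float32 bytes back into a list of floats."""
    values = array("f")
    values.frombytes(blob)
    return list(values)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(conn, query_vec, k=5):
    """Score every stored chunk against the query and return the k best."""
    rows = conn.execute("SELECT path, start_line, end_line, embedding FROM chunks")
    scored = [(cosine(query_vec, from_blob(blob)), path, start, end)
              for path, start, end, blob in rows]
    return sorted(scored, reverse=True)[:k]


# Hypothetical schema and toy data, held in memory instead of .codesearch.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (path TEXT, start_line INTEGER, "
             "end_line INTEGER, embedding BLOB)")
toy = [
    ("a.py", 1, 30, [1.0, 0.0, 0.0]),
    ("b.py", 1, 30, [0.0, 1.0, 0.0]),
    ("c.py", 26, 55, [0.5, 0.5, 0.0]),
]
conn.executemany("INSERT INTO chunks VALUES (?, ?, ?, ?)",
                 [(p, s, e, to_blob(v)) for p, s, e, v in toy])

for score, path, start, end in top_k(conn, [1.0, 0.0, 0.0], k=2):
    print(f"{path}:{start}-{end}  ({score:.1%} match)")
```

This is a brute-force scan: every query is compared against every stored chunk, which is simple and usually adequate at the scale of a single repository.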

Model

Built on BAAI/bge-small-en-v1.5 (33M params), fine-tuned on CoRNStack code search triplets with Matryoshka loss for flexible embedding dimensions (384/256/128/64).

  • Accuracy@1: 72.6% | Accuracy@10: 91.8% | NDCG@10: 82.5%
  • ONNX INT8: 33.8MB — small enough to run in a browser
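Embeddings trained with a Matryoshka loss are typically shortened by keeping the leading dimensions and re-normalizing. The sketch below shows that general technique; truncate_embedding is a hypothetical helper, and whether the CLI exposes a dimension setting is not stated here.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first dim components and rescale to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    # A zero vector cannot be normalized, so return it unchanged.
    return [x / norm for x in head] if norm else head


full = [3.0, 4.0, 0.0, 5.0]           # toy 4-dim vector, not real model output
short = truncate_embedding(full, 2)   # keep the first two dimensions
print(short)
```

Re-normalizing matters because cosine similarity on the shortened vectors should not depend on how much magnitude the dropped dimensions carried.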

Requirements

  • Python 3.10-3.13 (onnxruntime doesn't support 3.14 yet)
  • No GPU needed — runs on CPU

License

MIT
