Semantic Code Search
Search your codebase with natural language. No data leaves your computer.
Overview • Installation • Usage • Docs • How it works
Overview
sem is a command-line application that lets you search your git repository using natural language. For example, you can query for:
- 'Where are API requests authenticated?'
- 'Saving user objects to the database'
- 'Handling of webhook events'
- 'Where are jobs read from the queue?'
You will get a (visualized) list of code snippets and their file:line locations. You can use sem for exploring large codebases or, if you are as forgetful as I am, even small ones.
Basic usage:
sem 'my query'
This will present you with a list of code snippets that most closely match your search. You can select one and press Return
to open it in your editor of choice.
How does this work? In a nutshell, it uses a neural network to generate code embeddings. More info below.
NB: All processing is done on your hardware and no data is transmitted to the Internet.
Installation
You can install semantic-code-search via pip.
Pip (MacOS, Linux, Windows)
pip3 install semantic-code-search
Usage
TL;DR:
cd /my/repo
sem 'my query'
Run sem --help
to see all available options.
Searching for code
Inside your repo, simply run sem 'my query' (quotes can be omitted). Note that you need to be inside a git repository or provide a path to a repo with the -p argument.
Before you get your first search results, two things need to happen:
- The app downloads its model (~500 MB). This is done only once per installation.
- The app generates 'embeddings' of your code. These are cached in an .embeddings file at the root of the repo and reused in subsequent searches.
Depending on the project size, the above can take from a couple of seconds to minutes. Once this is complete, querying is very fast.
Example output:
foo@bar:~$ cd /my/repo
foo@bar:~$ sem 'parsing command line args'
Embeddings not found in /Users/kiril/src/semantic-code-search. Generating embeddings now.
Embedding 15 functions in 1 batches. This is done once and cached in .embeddings
Batches: 100%|█████████████████████████████████████████████████████████| 1/1 [00:07<00:00, 7.05s/it]
Navigating search results
By default, a list of the top 5 matches is shown, containing:
- Similarity score
- File path
- Line number
- Code snippet
You can navigate the list using the ↑ ↓ arrow keys or vim bindings. Pressing Return will open the relevant file at the line of the code snippet in your editor.
NB: The editor used for opening results can be set with the --editor argument.
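The exact commands sem runs when opening a result aren't shown here, but both supported editors have real command-line features for jumping to a line: VS Code's CLI accepts `code -g file:line`, and vim accepts a `+line` startup argument. A minimal sketch of how such an invocation could be built (the `editor_command` helper is hypothetical, not part of sem):

```python
import subprocess


def editor_command(editor: str, path: str, line: int) -> list:
    """Build the shell command to open `path` at `line` in the given editor."""
    if editor == 'vscode':
        # VS Code's CLI supports `code -g file:line` to jump straight to a line
        return ['code', '-g', f'{path}:{line}']
    if editor == 'vim':
        # vim accepts `+<line>` to position the cursor on startup
        return ['vim', f'+{line}', path]
    raise ValueError(f'unsupported editor: {editor}')


# Launching the editor would then be a single subprocess call, e.g.:
# subprocess.run(editor_command('vim', 'app/auth.py', 42))
```
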
Example results:
Command line flags
usage: sem [-h] [-p PATH] [-m MODEL] [-d] [-b BS] [-x EXT] [-n N]
[-e {vscode,vim}]
...
Search your codebase using natural language
positional arguments:
query_text
optional arguments:
-h, --help show this help message and exit
-p PATH, --path-to-repo PATH
Path to the root of the git repo to search or embed
-m MODEL, --model-name-or-path MODEL
Name or path of the model to use
-d, --embed (Re)create the embeddings index for codebase
-b BS, --batch-size BS
Batch size for embeddings generation
-x EXT, --file-extension EXT
File extension filter (e.g. "py" will only return
results from Python files)
-n N, --n-results N Number of results to return
-e {vscode,vim}, --editor {vscode,vim}
Editor to open selected result in
How it works
In a nutshell, this application uses a transformer machine learning model to generate embeddings of methods and functions in your codebase. Embeddings are information dense numerical representations of the semantics of the text/code they represent.
Here is a great blog post by Jay Alammar which explains the concept really nicely:
When the app is run with the --embed argument, function and method definitions are first extracted from the source files and then used for sentence embedding. To avoid doing this for every query, the results are compressed and saved in an .embeddings file.
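The on-disk format of the .embeddings file isn't documented here. As a sketch only, caching an embedding index could look like the following, assuming a gzip-compressed pickle of per-function metadata plus an embedding matrix (the layout and all names are hypothetical):

```python
import gzip
import os
import pickle
import tempfile

import numpy as np

# Hypothetical index layout: per-function metadata plus one embedding row each
index = {
    'functions': [{'file': 'auth.py', 'line': 12, 'name': 'check_token'}],
    'vectors': np.random.rand(1, 768).astype('float32'),
}

path = os.path.join(tempfile.mkdtemp(), '.embeddings')

# Compress and persist the index so later queries skip re-embedding the repo
with gzip.open(path, 'wb') as f:
    pickle.dump(index, f)

# A later search just loads the cached index back from disk
with gzip.open(path, 'rb') as f:
    cached = pickle.load(f)
```
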
When a query is processed, an embedding is generated from the query text. This is then used in a 'nearest neighbor' search to discover functions or methods with similar embeddings. We are basically comparing the cosine similarity between vectors.
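The nearest-neighbor step can be sketched with plain numpy. This illustrates cosine-similarity ranking over toy 3-dimensional vectors; it is not sem's actual search code, where vectors would be model-sized (e.g. 768 dimensions):

```python
import numpy as np


def top_k(query_vec, index_vecs, k=5):
    """Return indices and scores of the k most cosine-similar rows."""
    # Normalize both sides so a plain dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    m = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    scores = m @ q
    best = np.argsort(-scores)[:k]
    return best, scores[best]


# Toy 3-dimensional "embeddings": row 1 points the same way as the query
index = np.array([[1.0, 0.0, 0.0],
                  [0.6, 0.8, 0.0],
                  [0.0, 1.0, 0.0]])
query = np.array([0.6, 0.8, 0.0])
best, scores = top_k(query, index, k=2)  # row 1 first, then row 2
```
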
Model
The application uses the sentence transformer architecture to produce 'sentence' embeddings for functions and queries. The particular model is krlvi/sentence-t5-base-nlpl-code_search_net, which is based on a SentenceT5-Base checkpoint with 110M parameters and a pooling layer.
It has been further trained on the code_search_net dataset of 'natural language' โ 'programming language' pairs with a MultipleNegativesRanking loss function.
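The idea behind a multiple-negatives ranking objective can be illustrated in a few lines of numpy: within a batch, each query's matching code snippet is the positive, and every other snippet in the batch serves as a negative. This is a toy sketch of the loss, not the actual training code (which uses the sentence-transformers library):

```python
import numpy as np


def multiple_negatives_ranking_loss(query_emb, code_emb, scale=20.0):
    """In-batch negatives loss: the positive for query i is code i;
    all other codes in the batch act as negatives."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = code_emb / np.linalg.norm(code_emb, axis=1, keepdims=True)
    sim = scale * (q @ c.T)  # (batch, batch) cosine-similarity matrix
    # Softmax cross-entropy with the diagonal entries as the target labels
    log_probs = sim - np.log(np.sum(np.exp(sim), axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

When query and code embeddings line up pair-by-pair the loss is near zero; shuffling the pairing drives it up, which is what pushes matching pairs together during training.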
You can experiment with your own sentence transformer models via the --model-name-or-path parameter.
Bugs and limitations
- Currently, the .embeddings index is not updated when repository files change. As a temporary workaround, sem --embed can be re-run occasionally.
- Supported languages:
{ 'python', 'javascript', 'typescript', 'ruby', 'go', 'rust', 'java' }
- Supported text editors for opening results in:
{ 'vscode', 'vim' }
License
Semantic Code Search is distributed under AGPL-3.0-only. For Apache-2.0 exceptions → kiril@codeball.ai