file: README.md

git-semantic-similarity

Search git commit messages by semantic similarity with sentence-transformers.

Embeddings are stored on disk for faster retrieval, and can easily be checked into git.

$ gitsem "project scaffolding"

Commit 403836d2ee4900579b0d1e8169dd4bfebddab0ba
Author: Adrian Meidell Fiorito <adrianmefi@gmail.com>
Date:   2024-09-23 19:08:05
Similarity: 0.2299

    Change model, add src folder

Commit d2909a8ec352a881ab05cab8b8a67038b063f37a
Author: Adrian Meidell Fiorito <adrianmefi@gmail.com>
Date:   2024-09-23 19:08:05
Similarity: 0.2086

    Initial commit

...

Commit a09923166072aca4910e92272ef161e3398b1d89
Author: Adrian Meidell Fiorito <adrianmefi@gmail.com>
Date:   2024-09-23 19:08:05
Similarity: -0.0716

    Remove buggy rounding
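The similarity scores above can be read as a vector comparison between the query embedding and each commit message embedding. The sketch below assumes cosine similarity, which is a common choice with sentence-transformers and is consistent with the negative score in the output, but the README does not state the exact measure. The function names and the toy 3-dimensional vectors are hypothetical stand-ins for real model embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_commits(query_vec, commit_vecs, max_count=None):
    """Return (message, score) pairs sorted by descending similarity."""
    scored = [
        (message, cosine_similarity(query_vec, vec))
        for message, vec in commit_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # A slice with None as its end keeps every result (no limit).
    return scored[:max_count]


# Toy vectors (assumption): real embeddings have hundreds of dimensions.
query = [1.0, 0.0, 0.0]
commits = {
    "Change model, add src folder": [0.9, 0.1, 0.0],
    "Initial commit": [0.5, 0.5, 0.0],
    "Remove buggy rounding": [-0.2, 0.0, 1.0],
}

ranked = rank_commits(query, commits)
for message, score in ranked:
    print(f"{score:+.4f}  {message}")
```

A vector pointing away from the query yields a negative score, which is why the least relevant commit can fall below zero.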

Installation

Clone and run locally

git clone https://github.com/adrianmfi/git-semantic-similarity.git
cd git-semantic-similarity
pip install .

Usage

In a git repository, run: gitsem "query string"

To only show the 10 most relevant commits:

gitsem "changes to project documentation" -n 10

To use another pretrained model, for example a smaller and faster model:

gitsem "user service refactoring" --model sentence-transformers/all-MiniLM-L6-v2

A list of supported models can be found in the sentence-transformers documentation on pretrained models.

The tool supports forwarding arguments to git rev-list. For example, to only search in the 10 most recent commits:

gitsem "query string" -- -n 10

Or to filter by a specific author:

gitsem "query string" -- --author bob

Or format each result on a single line for further shell processing:

gitsem "query string" --sort False --oneline -- -n 100 | sort -n -r | head -n 10
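The forwarding convention used in these examples can be sketched as splitting the command line at the first `--` and passing everything after it to git rev-list. This is an illustration, not the tool's actual code: `split_forwarded_args` and `build_rev_list_command` are hypothetical names, and appending `HEAD` as the revision to walk is an assumption.

```python
def split_forwarded_args(argv):
    """Split argv at the first "--" into (tool_args, git_args)."""
    if "--" in argv:
        idx = argv.index("--")
        return argv[:idx], argv[idx + 1:]
    return list(argv), []


def build_rev_list_command(git_args):
    """Build a `git rev-list` invocation from the forwarded arguments.

    Walking from HEAD is an assumption made for this sketch.
    """
    return ["git", "rev-list", *git_args, "HEAD"]


tool_args, git_args = split_forwarded_args(
    ["query string", "--oneline", "--", "-n", "100"]
)
command = build_rev_list_command(git_args)
print(tool_args)
print(command)
```

Everything before `--` stays with the tool, so `-n 10` before the separator limits displayed results while `-n 10` after it limits the commits searched.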

Arguments

  • -m, --model [STRING]:
    A sentence-transformers model to use for embeddings. Default is all-mpnet-base-v2.

  • -c, --cache [BOOLEAN]:
    Whether to cache commit embeddings on disk for faster retrieval. Default is True.

  • --cache-dir [PATH]:
    Directory to store cached embeddings. If not specified, defaults to git_root/.git_semsim/model_name.

  • --oneline:
    Use a concise output format.

  • --sort [BOOLEAN]:
    Sort results by similarity score. Default is True.

  • -n, --max-count [INTEGER]:
    Limit the number of results displayed. If not provided, no limit is applied.

  • -b, --batch-size [INTEGER]:
    Batch size for embedding commits. Default is 1000.

  • query [STRING]:
    The query string to compare against commit messages.

  • git_args [STRING...]:
    Arguments after -- will be forwarded to git rev-list.
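How the cache and batch size options interact can be sketched as follows: embeddings already on disk are reused, and only the remaining commits are embedded, in chunks of at most the batch size. The on-disk layout here (one JSON file per commit hash) and the names `embed_with_cache` and `fake_embed_batch` are assumptions for illustration. The README does not describe the real cache format, and a real run would call a sentence-transformers model instead of the stand-in embedder.

```python
import json
import tempfile
from pathlib import Path


def embed_with_cache(commits, embed_batch, cache_dir, batch_size=1000):
    """Return {commit_hash: vector}, embedding only commits missing from cache_dir.

    commits: dict mapping commit hash -> commit message.
    embed_batch: callable taking a list of messages, returning a list of vectors.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    vectors = {}
    missing = []
    for commit_hash in commits:
        path = cache_dir / f"{commit_hash}.json"
        if path.exists():
            vectors[commit_hash] = json.loads(path.read_text())
        else:
            missing.append(commit_hash)

    # Embed uncached commits in fixed-size batches and persist each result.
    for start in range(0, len(missing), batch_size):
        batch = missing[start:start + batch_size]
        batch_vectors = embed_batch([commits[h] for h in batch])
        for commit_hash, vector in zip(batch, batch_vectors):
            (cache_dir / f"{commit_hash}.json").write_text(json.dumps(vector))
            vectors[commit_hash] = vector
    return vectors


# Stand-in embedder (assumption) that records how many messages each call gets.
calls = []


def fake_embed_batch(messages):
    calls.append(len(messages))
    return [[float(len(m)), 0.0] for m in messages]


commits = {
    "403836d": "Change model, add src folder",
    "d2909a8": "Initial commit",
    "a099231": "Remove buggy rounding",
}

with tempfile.TemporaryDirectory() as tmp:
    # Mirrors the documented default of git_root/.git_semsim/model_name.
    cache = Path(tmp) / ".git_semsim" / "all-mpnet-base-v2"
    first = embed_with_cache(commits, fake_embed_batch, cache, batch_size=2)
    second = embed_with_cache(commits, fake_embed_batch, cache, batch_size=2)
```

In this sketch the second call performs no embedding work, because every commit is already cached. Keeping one cache directory per model name avoids mixing vectors produced by different models.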

License

MIT
