Semantic Git Blame Tool

A semantic search tool for git repositories, built on LanceDB. Search your git history, find code authors, and trace how code evolved using natural language queries.

Features

  • 🔍 Semantic Search: Search commits, code changes, and blame data using natural language
  • 👤 Author Attribution: Find who implemented specific features or functionality
  • 📈 Code Evolution: Track how code has changed over time
  • ⚡ Fast Indexing: Efficient vector indexing with LanceDB
  • 🔌 IDE Integration: Works with Continue for in-editor code intelligence
  • 🌐 HTTP API: REST endpoint for integration with other tools

Quick Start

Prerequisites

  • Python 3.9+
  • Git
  • uv (recommended) or pip

Installation

Option 1: Global Installation (Recommended)

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository
git clone https://github.com/yourusername/git_blame_search.git
cd git_blame_search

# Install globally from local directory
uv tool install .

Option 2: Local Development

# Clone the repository
git clone https://github.com/yourusername/git_blame_search.git
cd git_blame_search

# Install dependencies
uv sync

Usage

1. Index Your Repository

# Index the current repository (most recent 100 commits)
git_blame_search index --max-commits 100

# Index a specific repository
git_blame_search index --repo /path/to/repo --max-commits 500

# For testing with minimal commits (e.g., new repos)
git_blame_search index --max-commits 1

2. Index Specific Files for Blame Analysis

# Index blame data for important files
git_blame_search index-file main.py
git_blame_search index-file src/core/auth.py

3. Search Commits

# Quick test in a new repository (such as this one)
git_blame_search search "init"

# Search by natural language
git_blame_search search "authentication implementation"
git_blame_search search "bug fix for memory leak"
git_blame_search search "refactoring database layer" --limit 20

4. Semantic Blame Search

# Find who wrote specific functionality
git_blame_search blame "error handling logic"
git_blame_search blame "database connection" --file src/db.py

5. Author Attribution

# Find who implemented features
git_blame_search who "authentication system"
git_blame_search who "caching implementation"
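
As a rough illustration of what an author-attribution query involves, the sketch below groups search hits by author and ranks authors by their summed relevance. This is an assumption about one plausible approach, not the tool's actual logic: `rank_authors`, the `score` field, and the sample hits are all hypothetical.

```python
from collections import defaultdict

def rank_authors(hits: list[dict], limit: int = 5) -> list[tuple[str, float]]:
    """Sum per-author relevance scores across hits; highest total first."""
    totals: dict[str, float] = defaultdict(float)
    for hit in hits:
        totals[hit["author"]] += hit["score"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:limit]

# Made-up hits standing in for semantic search results (higher score = closer match).
hits = [
    {"author": "Jane Smith", "score": 0.91},
    {"author": "Sam Lee", "score": 0.75},
    {"author": "Jane Smith", "score": 0.5},
]
ranking = rank_authors(hits)
for author, total in ranking:
    print(f"{author}: {total:.2f}")
```

Summing scores favors authors with many relevant lines; averaging instead would favor authors with a few very close matches.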

Continue Integration

1. Index Your Repository First

# Index your repository before using MCP. This can take a minute or more on repositories with a long history.
git_blame_search index --max-commits 100

# Optionally index specific files for blame analysis
git_blame_search index-file src/main.py

2. Configure Continue

Add to your ~/.continue/config.yml:

mcpServers:
  - name: git_blame_search
    command: git_blame_search
    args:
      - mcp
    cwd: "."

3. Use in VS Code

  • Invoke git_blame_search from the Continue chat
  • Ask questions like "who implemented the auth system?"

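Example Queries

These examples run the tool from a local development checkout (Option 2). With a global install, the git_blame_search command should accept the same subcommands, assuming it wraps the same script.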

Finding Code Authors

uv run python git_blame_tool.py who "security vulnerability fix"
uv run python git_blame_tool.py who "REST API implementation"

Understanding Code Evolution

uv run python git_blame_tool.py search "added try-catch blocks"
uv run python git_blame_tool.py search "refactored to use async/await"

Blame Analysis

uv run python git_blame_tool.py blame "validate.*email"
uv run python git_blame_tool.py blame "test_.*authentication" --file tests/

Architecture

The tool uses:

  • LanceDB: Vector database for storing and searching embeddings
  • Sentence Transformers: Microsoft's CodeBERT for code-aware embeddings
  • GitPython: For parsing git repository data
  • Rich: For beautiful CLI output
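
To make the retrieval step concrete, here is a minimal sketch of ranking commit messages against a query by cosine similarity, which is the core idea behind vector search. It uses only the standard library: `toy_embed` is a hypothetical bag-of-words stand-in for the real embedding model, and the in-memory list stands in for a LanceDB table. It illustrates the technique and is not the tool's code.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query: str, messages: list[str], limit: int = 10) -> list[tuple[float, str]]:
    """Return up to `limit` messages ranked by similarity to the query, best first."""
    q = toy_embed(query)
    scored = [(cosine(q, toy_embed(m)), m) for m in messages]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:limit]

commits = [
    "Add JWT authentication middleware",
    "Fix memory leak in cache eviction",
    "Refactor database layer to use connection pool",
]
top = search("authentication middleware", commits, limit=1)
print(top[0][1])
```

Unlike this word-count stand-in, a learned embedding model can also match paraphrases that share no words with the query, which is what makes the search semantic.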

Data Schema

Commits Table

  • Commit metadata (hash, author, timestamp, message)
  • Files changed, additions, deletions
  • Message vector embeddings

Blame Table

  • File path, line number, commit hash
  • Author information and timestamps
  • Code content and vector embeddings

Diffs Table

  • Change type, file paths
  • Old and new content
  • Diff chunk vector embeddings
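
The three tables can be pictured as record types. The sketch below mirrors the fields listed above using dataclasses. The field names, types, and the choice to store changed files as a list are assumptions made for illustration; the actual column names in the LanceDB tables may differ.

```python
from dataclasses import dataclass, field

@dataclass
class CommitRecord:
    hash: str
    author: str
    timestamp: str
    message: str
    files_changed: list[str] = field(default_factory=list)
    additions: int = 0
    deletions: int = 0
    vector: list[float] = field(default_factory=list)  # embedding of the message

@dataclass
class BlameRecord:
    file_path: str
    line_number: int
    commit_hash: str
    author: str
    timestamp: str
    content: str
    vector: list[float] = field(default_factory=list)  # embedding of the code content

@dataclass
class DiffRecord:
    change_type: str
    old_path: str
    new_path: str
    old_content: str
    new_content: str
    vector: list[float] = field(default_factory=list)  # embedding of the diff chunk

# Example row, reusing the sample commit from the HTTP API section.
commit = CommitRecord(
    hash="a1b2c3d4e5f6",
    author="Jane Smith",
    timestamp="2024-01-15",
    message="Add JWT authentication middleware",
)
print(commit.hash, commit.author)
```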

Performance Tips

Large Repositories

For repositories with 10k+ commits:

# Process in batches
uv run python git_blame_tool.py index --max-commits 1000 --skip 0
uv run python git_blame_tool.py index --max-commits 1000 --skip 1000
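
For a long history, writing each batch command by hand gets tedious. The sketch below generates one `index` invocation per batch using the `--max-commits` and `--skip` flags from the commands above. The helper name `batch_commands` is hypothetical, and the commands are only printed here, not executed.

```python
def batch_commands(total_commits: int, batch_size: int = 1000) -> list[list[str]]:
    """Build one `index` invocation per batch, stepping --skip by batch_size."""
    commands = []
    for skip in range(0, total_commits, batch_size):
        commands.append([
            "uv", "run", "python", "git_blame_tool.py", "index",
            "--max-commits", str(batch_size),
            "--skip", str(skip),
        ])
    return commands

for cmd in batch_commands(3000):
    print(" ".join(cmd))
    # To actually run a batch: subprocess.run(cmd, check=True)
```

Running batches sequentially with `check=True` would stop at the first failing batch, so a failed run can be resumed from that offset.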

Memory Optimization

# Use smaller batches
uv run python git_blame_tool.py index --max-commits 50

# Reduce CUDA memory fragmentation (stops PyTorch's allocator from splitting blocks larger than 512 MB)
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

Faster Embeddings

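Swap in a smaller general-purpose model to trade some code awareness for speed:

# In git_blame_tool.py, change:
self.embedder = SentenceTransformer('all-MiniLM-L6-v2')  # Faster
# Re-index afterwards: vectors produced by different models are not comparable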

API Reference

CLI Commands

  • index - Index a git repository
    • --repo, -r - Repository path (default: current directory)
    • --max-commits, -m - Maximum commits to index
  • index-file - Index blame data for a specific file
    • FILE_PATH - Path to file to index
  • search - Search commits
    • QUERY - Natural language search query
    • --limit, -l - Number of results (default: 10)
  • blame - Search blame data
    • QUERY - Natural language search query
    • --file, -f - Filter by file path
    • --limit, -l - Number of results (default: 5)
  • who - Find code authors
    • QUERY - Natural language search query
    • --limit, -l - Number of results (default: 5)
  • serve - Start HTTP server for IDE integration

HTTP API

POST /retrieve

Request body:

{
  "query": "authentication implementation"
}

Response:

{
  "results": [
    {
      "title": "Commit: Add JWT authentication",
      "content": "Author: Jane Smith\nDate: 2024-01-15\n\nAdd JWT authentication middleware",
      "metadata": {
        "type": "commit",
        "hash": "a1b2c3d4e5f6"
      }
    }
  ]
}
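
A client for this endpoint can be sketched with the standard library alone. The example below builds the POST request and flattens a response of the shape shown above into one line per result. The base URL is a placeholder assumption (use whatever address the `serve` command reports), and the helper names `build_retrieve_request` and `summarize` are hypothetical.

```python
import json
import urllib.request

# Assumption: placeholder address; replace with the one your server listens on.
BASE_URL = "http://localhost:8000"

def build_retrieve_request(query: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST /retrieve request carrying the query as a JSON body."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/retrieve",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def summarize(response_body: str) -> list[str]:
    """Turn a /retrieve response into 'type hash: title' lines."""
    payload = json.loads(response_body)
    return [
        f"{r['metadata']['type']} {r['metadata']['hash']}: {r['title']}"
        for r in payload["results"]
    ]

request = build_retrieve_request("authentication implementation")
# With the server running:
# with urllib.request.urlopen(request) as resp:
#     print("\n".join(summarize(resp.read().decode("utf-8"))))
```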

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License - see LICENSE file for details
