Semantic search tool for git repositories using LanceDB
Semantic Git Blame Tool
A powerful semantic search tool for git repositories using LanceDB. Search your git history, find code authors, and understand code evolution using natural language queries.
Features
- 🔍 Semantic Search: Search commits, code changes, and blame data using natural language
- 👤 Author Attribution: Find who implemented specific features or functionality
- 📈 Code Evolution: Track how code has changed over time
- ⚡ Fast Indexing: Efficient vector indexing with LanceDB
- 🔌 IDE Integration: Works with Continue for in-editor code intelligence
- 🌐 HTTP API: REST endpoint for integration with other tools
Quick Start
Prerequisites
- Python 3.9+
- Git
- uv (recommended) or pip
Installation
Option 1: Global Installation (Recommended)
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone the repository
git clone https://github.com/yourusername/git_blame_search.git
cd git_blame_search
# Install globally from local directory
uv tool install .
Option 2: Local Development
# Clone the repository
git clone https://github.com/yourusername/git_blame_search.git
cd git_blame_search
# Install dependencies
uv sync
Usage
1. Index Your Repository
# Index current repository (recent 100 commits)
git_blame_search index --max-commits 100
# Index a specific repository
git_blame_search index --repo /path/to/repo --max-commits 500
# For testing with minimal commits (e.g., new repos)
git_blame_search index --max-commits 1
2. Index Specific Files for Blame Analysis
# Index blame data for important files
git_blame_search index-file main.py
git_blame_search index-file src/core/auth.py
3. Search Commits
# Quick test in a new repository (such as this one)
git_blame_search search "init"
# Search by natural language
git_blame_search search "authentication implementation"
git_blame_search search "bug fix for memory leak"
git_blame_search search "refactoring database layer" --limit 20
4. Semantic Blame Search
# Find who wrote specific functionality
git_blame_search blame "error handling logic"
git_blame_search blame "database connection" --file src/db.py
5. Author Attribution
# Find who implemented features
git_blame_search who "authentication system"
git_blame_search who "caching implementation"
Continue Integration
1. Index Your Repository First
# Index your repository before using MCP. Indexing can take a minute or more on mature projects.
git_blame_search index --max-commits 100
# Optionally index specific files for blame analysis
git_blame_search index-file src/main.py
2. Configure Continue
Add to your ~/.continue/config.yml:
mcpServers:
  - name: git_blame_search
    command: git_blame_search
    args:
      - mcp
    cwd: "."
3. Use in VSCode
- Use git_blame_search in Continue chat
- Ask questions like "who implemented the auth system?"
Example Queries
Finding Code Authors
uv run python git_blame_tool.py who "security vulnerability fix"
uv run python git_blame_tool.py who "REST API implementation"
Understanding Code Evolution
uv run python git_blame_tool.py search "added try-catch blocks"
uv run python git_blame_tool.py search "refactored to use async/await"
Blame Analysis
uv run python git_blame_tool.py blame "validate.*email"
uv run python git_blame_tool.py blame "test_.*authentication" --file tests/
Architecture
The tool uses:
- LanceDB: Vector database for storing and searching embeddings
- Sentence Transformers: Microsoft's CodeBERT for code-aware embeddings
- GitPython: For parsing git repository data
- Rich: For beautiful CLI output
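To illustrate the idea behind the components listed above, here is a minimal sketch of vector search in plain Python. It is not the tool's implementation: the toy 3-dimensional vectors stand in for CodeBERT embeddings, the in-memory list stands in for a LanceDB table, and cosine similarity is an assumed metric (the README does not say which distance the tool uses). The function names `cosine_similarity` and `rank_commits` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_commits(query_vector, commits, limit=10):
    """Return the `limit` commits whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vector, c["vector"]), c) for c in commits]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:limit]]


# Toy vectors for illustration only; real embeddings have hundreds of dimensions.
commits = [
    {"hash": "aaa111", "message": "Add login middleware", "vector": [0.9, 0.1, 0.0]},
    {"hash": "bbb222", "message": "Fix memory leak in cache", "vector": [0.0, 0.2, 0.9]},
    {"hash": "ccc333", "message": "Refactor auth tokens", "vector": [0.8, 0.3, 0.1]},
]

# Pretend this vector is the embedding of the query "authentication implementation".
top = rank_commits([1.0, 0.2, 0.0], commits, limit=2)
print([c["hash"] for c in top])
```

In the real tool the embedding model produces the vectors and the vector database performs the nearest-neighbour lookup; the ranking step is conceptually the same.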
Data Schema
Commits Table
- Commit metadata (hash, author, timestamp, message)
- Files changed, additions, deletions
- Message vector embeddings
Blame Table
- File path, line number, commit hash
- Author information and timestamps
- Code content and vector embeddings
Diffs Table
- Change type, file paths
- Old and new content
- Diff chunk vector embeddings
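The three tables above could be modelled roughly as follows. This is a sketch under assumptions: the class and field names (`CommitRecord`, `commit_hash`, `old_path`, and so on) are hypothetical and chosen only to mirror the bullet points; the actual column names and types in the tool may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CommitRecord:
    """One row of the commits table (field names are assumptions)."""
    commit_hash: str
    author: str
    timestamp: str
    message: str
    files_changed: List[str] = field(default_factory=list)
    additions: int = 0
    deletions: int = 0
    vector: List[float] = field(default_factory=list)  # embedding of the message


@dataclass
class BlameRecord:
    """One row of the blame table: a single line of a file and who last touched it."""
    file_path: str
    line_number: int
    commit_hash: str
    author: str
    timestamp: str
    content: str
    vector: List[float] = field(default_factory=list)  # embedding of the code content


@dataclass
class DiffRecord:
    """One row of the diffs table: a single changed chunk."""
    change_type: str  # e.g. "added", "modified", "deleted" (assumed values)
    old_path: str
    new_path: str
    old_content: str
    new_content: str
    vector: List[float] = field(default_factory=list)  # embedding of the diff chunk


example = CommitRecord(
    commit_hash="a1b2c3d4e5f6",
    author="Jane Smith",
    timestamp="2024-01-15",
    message="Add JWT authentication middleware",
)
print(example.commit_hash, example.additions, example.vector)
```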
Performance Tips
Large Repositories
For repositories with 10k+ commits:
# Process in batches
uv run python git_blame_tool.py index --max-commits 1000 --skip 0
uv run python git_blame_tool.py index --max-commits 1000 --skip 1000
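For a larger history, the batch commands above can be generated rather than typed by hand. The helper below is a hypothetical convenience script, not part of the tool; it only builds command strings using the `--max-commits` and `--skip` flags exactly as shown above, and assumes the total commit count is known (for example from `git rev-list --count HEAD`).

```python
def batch_commands(total_commits, batch_size=1000):
    """Build one index command per batch of `batch_size` commits."""
    commands = []
    for skip in range(0, total_commits, batch_size):
        commands.append(
            "uv run python git_blame_tool.py index "
            f"--max-commits {batch_size} --skip {skip}"
        )
    return commands


# Example: a repository with 2500 commits needs three batches.
for cmd in batch_commands(2500):
    print(cmd)
```

Each printed line can then be run in sequence, so a failure partway through only requires re-running the remaining batches.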
Memory Optimization
# Use smaller batches
uv run python git_blame_tool.py index --max-commits 50
# Reduce CUDA memory fragmentation in PyTorch by capping the allocator's block split size
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
Faster Embeddings
Modify the embedder for speed:
# In git_blame_tool.py, change:
self.embedder = SentenceTransformer('all-MiniLM-L6-v2') # Faster
API Reference
CLI Commands
- index - Index a git repository
  - --repo, -r - Repository path (default: current directory)
  - --max-commits, -m - Maximum commits to index
- index-file - Index blame data for a specific file
  - FILE_PATH - Path to file to index
- search - Search commits
  - QUERY - Natural language search query
  - --limit, -l - Number of results (default: 10)
- blame - Search blame data
  - QUERY - Natural language search query
  - --file, -f - Filter by file path
  - --limit, -l - Number of results (default: 5)
- who - Find code authors
  - QUERY - Natural language search query
  - --limit, -l - Number of results (default: 5)
- serve - Start HTTP server for IDE integration
HTTP API
POST /retrieve
{
  "query": "authentication implementation"
}
Response:
{
  "results": [
    {
      "title": "Commit: Add JWT authentication",
      "content": "Author: Jane Smith\nDate: 2024-01-15\n\nAdd JWT authentication middleware",
      "metadata": {
        "type": "commit",
        "hash": "a1b2c3d4e5f6"
      }
    }
  ]
}
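A client for this endpoint might look like the sketch below, using only the Python standard library. The base URL `http://localhost:8000` is an assumption (the README does not state the host or port that `serve` listens on), and the helper names `build_retrieve_request`, `summarize_results`, and `retrieve` are hypothetical. The request and response shapes follow the JSON shown above.

```python
import json
import urllib.request

# Assumed address; adjust to wherever the server is actually listening.
DEFAULT_BASE_URL = "http://localhost:8000"


def build_retrieve_request(query, base_url=DEFAULT_BASE_URL):
    """Build (but do not send) a POST /retrieve request for the given query."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/retrieve",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize_results(payload):
    """Extract (title, commit hash) pairs from a /retrieve response body."""
    return [
        (item["title"], item["metadata"]["hash"])
        for item in payload.get("results", [])
    ]


def retrieve(query, base_url=DEFAULT_BASE_URL):
    """Send the query to a running server and return summarized results."""
    request = build_retrieve_request(query, base_url)
    with urllib.request.urlopen(request) as response:
        return summarize_results(json.load(response))


# Parsing the example response from above, without contacting a server:
sample = {
    "results": [
        {
            "title": "Commit: Add JWT authentication",
            "content": "Author: Jane Smith\nDate: 2024-01-15\n\nAdd JWT authentication middleware",
            "metadata": {"type": "commit", "hash": "a1b2c3d4e5f6"},
        }
    ]
}
print(summarize_results(sample))
```

With the server running, `retrieve("authentication implementation")` would send the request and return the same kind of list.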
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
MIT License - see LICENSE file for details
Acknowledgments
- Built with LanceDB - The multimodal vector database
- Uses Microsoft CodeBERT for code-aware embeddings
- Integrated with Continue for IDE support