CodeRAG
A plug-and-play Python package for code repository indexing and semantic search, enabling RAG (Retrieval-Augmented Generation) applications for codebases.
Features
- Smart Code Parsing: Analyzes code using tree-sitter, preserving semantic structure
- Language Support: Handles Python, JavaScript, TypeScript, and Java
- Flexible Storage: Pluggable vector database backend (ChromaDB is currently supported)
- Hierarchical Chunking: Preserves class-method relationships for improved context and search results
- Intelligent Chunking: Creates meaningful code chunks based on classes, functions, and logical blocks
- Summarization Option: Generate AI summaries of code chunks for more effective embedding
- Easy Integration: Simple API to add CodeRAG to any application
Installation
pip install coderag
Quick Start
from coderag import Repository, ChromaDBStore
# Initialize vector store
vector_store = ChromaDBStore(
    collection_name="my_repo",
    persist_directory="./vector_db"
)
# Initialize repository handler
repo = Repository(
    repo_path="path/to/your/repo",
    vector_store=vector_store,
    use_code_summaries=True  # Optional: use AI summaries for better embeddings
)
# Index the repository
repo.index()
# Search for code
results = repo.search("function to handle HTTP requests", top_k=5)
# Display results
for result in results:
    print(f"Score: {result['score']}")
    print(f"File: {result['metadata']['file_path']}")
    # Display hierarchical information if available
    if result['metadata']['type'] == 'method' and 'parent' in result['metadata']:
        print(f"Method in class: {result['metadata']['parent'].split(':')[-2]}")
    elif result['metadata']['type'] == 'class' and 'children' in result['metadata']:
        print(f"Class with methods: {len(result['metadata']['children'])}")
    if 'summary' in result['metadata']:
        print(f"Summary: {result['metadata']['summary']}")
    print(f"Code:\n{result['metadata']['content']}")
    print("-" * 50)
How It Works
CodeRAG uses tree-sitter to parse a code repository into semantically meaningful chunks. It recognizes structural units such as functions, classes, and imports, and builds its chunks around them.
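To make the idea of structure-aware chunking concrete, here is a minimal sketch that splits Python source into function, class, and method chunks and records parent/child links. It is not CodeRAG's implementation: it uses Python's built-in `ast` module as a stand-in for tree-sitter, and the function name `chunk_python_source` and the `path:kind:name` ID scheme are assumptions made for illustration only.

```python
import ast


def chunk_python_source(source: str, file_path: str) -> list[dict]:
    """Split Python source into function, class, and method chunks.

    Illustrative only: the chunk IDs below are a hypothetical scheme,
    not the IDs CodeRAG generates.
    """
    tree = ast.parse(source)
    chunks = []
    function_types = (ast.FunctionDef, ast.AsyncFunctionDef)

    for node in tree.body:
        if isinstance(node, function_types):
            # Top-level function: a chunk with no parent.
            chunks.append({
                "id": f"{file_path}:function:{node.name}",
                "type": "function",
                "name": node.name,
                "content": ast.get_source_segment(source, node),
            })
        elif isinstance(node, ast.ClassDef):
            class_id = f"{file_path}:class:{node.name}"
            method_ids = []
            for child in node.body:
                if isinstance(child, function_types):
                    # Method: a chunk that points back to its class.
                    method_id = f"{class_id}:{child.name}"
                    method_ids.append(method_id)
                    chunks.append({
                        "id": method_id,
                        "type": "method",
                        "name": child.name,
                        "parent": class_id,
                        "content": ast.get_source_segment(source, child),
                    })
            # Class: a chunk that lists the IDs of its methods.
            chunks.append({
                "id": class_id,
                "type": "class",
                "name": node.name,
                "children": method_ids,
                "content": ast.get_source_segment(source, node),
            })
    return chunks


SAMPLE = '''
class Client:
    def get(self, url):
        return url

def main():
    return Client().get("/")
'''

chunks = chunk_python_source(SAMPLE, "client.py")
for chunk in chunks:
    print(chunk["type"], chunk["name"])
```

Keeping a `parent` reference on each method and a `children` list on each class is what lets a search hit on a single method be traced back to its enclosing class.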
- Parsing: Repository files are parsed using language-specific parsers
- Chunking: Code is divided into logical chunks (functions, classes, imports, etc.)
- Hierarchical Chunking: Classes and methods are preserved in a hierarchical structure
- Embedding: Chunks are embedded using SentenceTransformers
- Storage: Embeddings are stored in a vector database
- Retrieval: Similar code is retrieved based on semantic similarity, preserving hierarchical context
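The embedding and retrieval steps above can be sketched as a ranking by cosine similarity. This is a toy illustration under stated assumptions: a bag-of-words counter stands in for a SentenceTransformers model, an in-memory list stands in for the vector database, and the helper names (`embed`, `cosine_similarity`, `search`) are hypothetical rather than part of CodeRAG.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a neural model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query: str, chunks: list[dict], top_k: int = 5) -> list[dict]:
    """Score every chunk against the query and return the best top_k."""
    query_vector = embed(query)
    scored = [
        {
            "score": cosine_similarity(query_vector, embed(chunk["content"])),
            "metadata": chunk,
        }
        for chunk in chunks
    ]
    scored.sort(key=lambda item: item["score"], reverse=True)
    return scored[:top_k]


# Hypothetical chunks; a real index would hold parsed repository code.
CHUNKS = [
    {"name": "send_request", "content": "def send_request(url):  # send an http request"},
    {"name": "parse_config", "content": "def parse_config(path):  # read yaml config file"},
    {"name": "retry", "content": "def retry(func):  # retry a failed call"},
]

results = search("http request", CHUNKS, top_k=2)
for result in results:
    print(f"{result['score']:.2f}", result["metadata"]["name"])
```

A real embedding model maps text to dense vectors so that semantically related code scores highly even without shared words; the ranking logic stays the same.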
Advanced Usage
Using Code Summaries
For better semantic matching, you can enable code summarization:
repo = Repository(
    repo_path="path/to/your/repo",
    vector_store=vector_store,
    use_code_summaries=True  # Enable AI summarization
)
Custom Embeddings
You can provide your own embedding model:
from coderag import CodeEmbedder, Repository
custom_embedder = CodeEmbedder(model_name="your-preferred-model")
repo = Repository(
    repo_path="path/to/your/repo",
    vector_store=vector_store,
    embedder=custom_embedder
)
Filtering Results
You can filter search results based on metadata:
# Search only for Python functions
results = repo.search(
    query="handle authentication",
    filter={"language": "python", "type": "function"}
)
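The filter above reads as an equality match on metadata fields. How CodeRAG evaluates it internally is not documented here, so the following is only a sketch of the same idea applied client-side to a list of results; `matches_filter` and `filter_results` are hypothetical helpers, and the sample results are made up for illustration.

```python
def matches_filter(metadata: dict, criteria: dict) -> bool:
    """Return True when every key in criteria equals the metadata value."""
    return all(metadata.get(key) == value for key, value in criteria.items())


def filter_results(results: list[dict], criteria: dict) -> list[dict]:
    """Keep only the results whose metadata satisfies all criteria."""
    return [r for r in results if matches_filter(r["metadata"], criteria)]


# Made-up results in the shape used throughout this README.
sample_results = [
    {"score": 0.91, "metadata": {"language": "python", "type": "function", "name": "login"}},
    {"score": 0.88, "metadata": {"language": "java", "type": "method", "name": "authenticate"}},
    {"score": 0.75, "metadata": {"language": "python", "type": "class", "name": "AuthManager"}},
]

python_functions = filter_results(
    sample_results, {"language": "python", "type": "function"}
)
print([r["metadata"]["name"] for r in python_functions])
```

Post-filtering like this can also narrow results further after a search, for example by `file_path`.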
Understanding Hierarchical Results
Search results include hierarchical metadata:
# Run a search, then inspect the hierarchy metadata of each result
results = repo.search("database connection method")
for result in results:
    metadata = result['metadata']
    # For methods, get the parent class
    if metadata['type'] == 'method' and 'parent' in metadata:
        parent_id = metadata['parent']
        print(f"Method '{metadata['name']}' belongs to class: {parent_id.split(':')[-2]}")
    # For classes, see what methods are included
    if metadata['type'] == 'class' and 'children' in metadata:
        method_names = [m.split(':')[-1] for m in metadata['children']]
        print(f"Class '{metadata['name']}' contains methods: {', '.join(method_names)}")
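When several hits come from the same class, it can help to group them. The sketch below groups method results by their raw `parent` ID, relying only on the `type`, `parent`, and `name` metadata keys shown above. The helper `group_methods_by_parent` and the sample parent IDs are hypothetical; the real ID format may differ.

```python
from collections import defaultdict


def group_methods_by_parent(results: list[dict]) -> dict[str, list[str]]:
    """Map each parent ID to the names of the method results under it."""
    grouped = defaultdict(list)
    for result in results:
        metadata = result["metadata"]
        if metadata.get("type") == "method" and "parent" in metadata:
            grouped[metadata["parent"]].append(metadata["name"])
    return dict(grouped)


# Made-up results; the parent IDs are placeholders, not real CodeRAG IDs.
sample_results = [
    {"score": 0.9, "metadata": {"type": "method", "name": "connect", "parent": "db-class-id"}},
    {"score": 0.8, "metadata": {"type": "method", "name": "close", "parent": "db-class-id"}},
    {"score": 0.7, "metadata": {"type": "function", "name": "main"}},
    {"score": 0.6, "metadata": {"type": "method", "name": "get", "parent": "cache-class-id"}},
]

grouped = group_methods_by_parent(sample_results)
for parent_id, method_names in grouped.items():
    print(parent_id, "->", ", ".join(method_names))
```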
Contributing
Contributions are welcome! To contribute:
- Fork the repository
- Create a feature branch
- Add your changes
- Submit a pull request
For major changes, please open an issue first to discuss what you would like to change.
License
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.