A plug-and-play Python package for code repository indexing and semantic search

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: Apache Software License
Programming Language
Topic
- Software Development :: Libraries

Project description

CodeRAG

A plug-and-play Python package for code repository indexing and semantic search, enabling RAG (Retrieval-Augmented Generation) applications for codebases.

Features

- Smart Code Parsing: Parses code with tree-sitter, preserving its semantic structure
- Language Support: Handles Python, JavaScript, TypeScript, and Java
- Flexible Storage: Designed to work with different vector databases (ChromaDB is the only backend currently supported)
- Hierarchical Chunking: Preserves class-method relationships for better context and search results
- Intelligent Chunking: Creates meaningful code chunks based on classes, functions, and logical blocks
- Summarization Option: Optionally generates AI summaries of code chunks for more effective embedding
- Easy Integration: Simple API for adding CodeRAG to any application

Installation

pip install coderag

Quick Start

from coderag import Repository, ChromaDBStore

# Initialize vector store
vector_store = ChromaDBStore(
    collection_name="my_repo",
    persist_directory="./vector_db"
)

# Initialize repository handler
repo = Repository(
    repo_path="path/to/your/repo",
    vector_store=vector_store,
    use_code_summaries=True  # Optional: use AI summaries for better embeddings
)

# Index the repository
repo.index()

# Search for code
results = repo.search("function to handle HTTP requests", top_k=5)

# Display results
for result in results:
    print(f"Score: {result['score']}")
    print(f"File: {result['metadata']['file_path']}")
    # Display hierarchical information if available
    if result['metadata']['type'] == 'method' and 'parent' in result['metadata']:
        print(f"Method in class: {result['metadata']['parent'].split(':')[-2]}")
    elif result['metadata']['type'] == 'class' and 'children' in result['metadata']:
        print(f"Class with methods: {len(result['metadata']['children'])}")
    if 'summary' in result['metadata']:
        print(f"Summary: {result['metadata']['summary']}")
    print(f"Code:\n{result['metadata']['content']}")
    print("-" * 50)

How It Works

CodeRAG uses tree-sitter to parse a code repository into semantically meaningful chunks. It recognizes structural units (functions, classes, imports) and organizes the chunks around them.

1. Parsing: Repository files are parsed using language-specific parsers
2. Hierarchical Chunking: Classes and methods are preserved in a hierarchical structure
3. Chunking: Code is divided into logical chunks (functions, classes, imports, etc.)
4. Embedding: Chunks are embedded using SentenceTransformers
5. Storage: Embeddings are stored in a vector database
6. Retrieval: Similar code is retrieved based on semantic similarity, preserving hierarchical context
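
The parsing and hierarchical chunking steps above can be sketched with Python's standard-library `ast` module standing in for tree-sitter. This is a minimal illustration, not CodeRAG's implementation: the `chunk_source` function, the sample source, and the `file:Class.method` ID format are all assumptions made for the example.

```python
import ast

# Toy source file used only for this illustration.
SAMPLE = '''
import os

class Database:
    """Toy connection wrapper."""

    def connect(self):
        return "connected"

    def close(self):
        return "closed"

def helper():
    return os.getcwd()
'''

FUNC_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef)


def chunk_source(source, file_path="example.py"):
    """Split Python source into class, method, and function chunks.

    Chunk IDs use a hypothetical "file:qualified_name" form; the ID format
    CodeRAG actually uses may differ.
    """
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.ClassDef):
            class_id = f"{file_path}:{node.name}"
            methods = [n for n in node.body if isinstance(n, FUNC_TYPES)]
            # The class chunk records its methods as children.
            chunks.append({
                "id": class_id,
                "type": "class",
                "name": node.name,
                "children": [f"{class_id}.{m.name}" for m in methods],
                "content": ast.get_source_segment(source, node),
            })
            # Each method chunk points back to its parent class.
            for m in methods:
                chunks.append({
                    "id": f"{class_id}.{m.name}",
                    "type": "method",
                    "name": m.name,
                    "parent": class_id,
                    "content": ast.get_source_segment(source, m),
                })
        elif isinstance(node, FUNC_TYPES):
            chunks.append({
                "id": f"{file_path}:{node.name}",
                "type": "function",
                "name": node.name,
                "content": ast.get_source_segment(source, node),
            })
    return chunks


chunks = chunk_source(SAMPLE)
for chunk in chunks:
    print(chunk["type"], chunk["id"], "parent=" + str(chunk.get("parent")))
```

Keeping `parent` and `children` on each chunk is what lets a search hit on a method be traced back to its enclosing class, which is the idea behind the hierarchical metadata shown later in this document.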

Advanced Usage

Using Code Summaries

For better semantic matching, you can enable code summarization:

repo = Repository(
    repo_path="path/to/your/repo",
    vector_store=vector_store,
    use_code_summaries=True  # Enable AI summarization
)

Custom Embeddings

You can provide your own embedding model:

from coderag import CodeEmbedder, Repository

custom_embedder = CodeEmbedder(model_name="your-preferred-model")

repo = Repository(
    repo_path="path/to/your/repo",
    vector_store=vector_store,
    embedder=custom_embedder
)

Filtering Results

You can filter search results based on metadata:

# Search only for Python functions
results = repo.search(
    query="handle authentication",
    filter={"language": "python", "type": "function"}
)
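
To illustrate what a metadata filter does conceptually, the sketch below applies an exact-match filter to a handful of toy records and then ranks the survivors by cosine similarity. The `search`, `matches`, and `cosine` helpers, the two-dimensional vectors, and the filter-before-ranking order are assumptions for illustration; they are not CodeRAG's or ChromaDB's API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def matches(metadata, flt):
    """True when every key in the filter equals the corresponding metadata value."""
    return all(metadata.get(key) == value for key, value in flt.items())


def search(query_vec, records, flt=None, top_k=5):
    """Filter records by metadata, then return the top_k most similar ones."""
    candidates = [r for r in records if matches(r["metadata"], flt or {})]
    scored = [
        {"score": cosine(query_vec, r["vector"]), "metadata": r["metadata"]}
        for r in candidates
    ]
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored[:top_k]


# Hand-made records; real embeddings would have hundreds of dimensions.
RECORDS = [
    {"vector": [1.0, 0.0], "metadata": {"language": "python", "type": "function", "name": "login"}},
    {"vector": [0.9, 0.1], "metadata": {"language": "java", "type": "method", "name": "authenticate"}},
    {"vector": [0.0, 1.0], "metadata": {"language": "python", "type": "function", "name": "render"}},
    {"vector": [0.8, 0.2], "metadata": {"language": "python", "type": "class", "name": "AuthService"}},
]

hits = search([1.0, 0.0], RECORDS, flt={"language": "python", "type": "function"})
for hit in hits:
    print(round(hit["score"], 3), hit["metadata"]["name"])
```

Note that a filter narrows the candidate set but does not guarantee relevance: in this toy data the `render` function passes the filter even though its similarity to the query is zero.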

Understanding Hierarchical Results

Search results include hierarchical metadata:

# Run a search, then inspect the hierarchy metadata on each result
results = repo.search("database connection method")

for result in results:
    metadata = result['metadata']

    # For methods, get the parent class
    if metadata['type'] == 'method' and 'parent' in metadata:
        parent_id = metadata['parent']
        print(f"Method '{metadata['name']}' belongs to class: {parent_id.split(':')[-2]}")

    # For classes, see what methods are included
    if metadata['type'] == 'class' and 'children' in metadata:
        method_names = [m.split(':')[-1] for m in metadata['children']]
        print(f"Class '{metadata['name']}' contains methods: {', '.join(method_names)}")
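
The loop above indexes metadata keys directly and splits IDs on `:`, which assumes every key is present and that IDs follow a particular layout. A more defensive variant is sketched below. It uses only the keys that appear in this README (`score`, `metadata`, `type`, `name`, `parent`, `children`); the `describe` helper and the sample results, including their ID strings, are hypothetical.

```python
def describe(result):
    """Build a one-line description of a search result without raising on missing keys."""
    metadata = result.get("metadata", {})
    kind = metadata.get("type", "unknown")
    name = metadata.get("name", "<unnamed>")
    line = f"{kind} {name} (score {result.get('score', 0.0):.2f})"
    if kind == "method" and metadata.get("parent"):
        # Show the raw parent ID rather than guessing how it is structured.
        line += f", parent id: {metadata['parent']}"
    elif kind == "class":
        line += f", {len(metadata.get('children', []))} child chunk(s)"
    return line


# Hypothetical results shaped like the dictionaries used in this README.
SAMPLE_RESULTS = [
    {"score": 0.91, "metadata": {"type": "method", "name": "connect", "parent": "db.py:Database"}},
    {"score": 0.85, "metadata": {"type": "class", "name": "Database",
                                 "children": ["db.py:Database.connect", "db.py:Database.close"]}},
    {"score": 0.5, "metadata": {"name": "helper"}},
]

lines = [describe(r) for r in SAMPLE_RESULTS]
for line in lines:
    print(line)
```

Using `dict.get` with defaults keeps the display code working for chunk types that carry no hierarchy information, such as standalone functions or import blocks.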

Contributing

Contributions are welcome! To contribute:

1. Fork the repository
2. Create a feature branch
3. Add your changes
4. Submit a pull request

For major changes, please open an issue first to discuss what you would like to change.

License

This project is licensed under the Apache License 2.0; see the LICENSE file for details.

Release history

0.1.0 (Apr 2, 2025)

Download files

Source Distribution

coderag-0.1.0.tar.gz (22.5 kB)

Uploaded Apr 2, 2025 Source

Built Distribution

coderag-0.1.0-py3-none-any.whl (26.3 kB)

Uploaded Apr 2, 2025 Python 3

File details

Details for the file coderag-0.1.0.tar.gz.

File metadata

Download URL: coderag-0.1.0.tar.gz
Upload date: Apr 2, 2025
Size: 22.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.1 CPython/3.13.2 Darwin/23.5.0

File hashes

Hashes for coderag-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`f4d26775c6458561b4cc92e9d214cdd0d1f18e44d3609be8e03e22b1f91d2870`
MD5	`77d79d711451c6b4182abeeb2a3831b8`
BLAKE2b-256	`76d8cc0569cc5e4a2aff03973826b9c6462d77298a5ac43efc98eb3e54d1dcee`

File details

Details for the file coderag-0.1.0-py3-none-any.whl.

File metadata

Download URL: coderag-0.1.0-py3-none-any.whl
Upload date: Apr 2, 2025
Size: 26.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.1 CPython/3.13.2 Darwin/23.5.0

File hashes

Hashes for coderag-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a1d602c75bcee396805b85becd3101a0091edf07e7f74a93b0383ed310d4c871`
MD5	`208f05436a3760559b692e074e6a9e1a`
BLAKE2b-256	`17bf8c9fded733307f7e9f11243d83c5a430dc1dc1a6f4123dadab70a9924d71`
