Vectorless RAG for Code Repositories - Navigate your codebase with LLM reasoning

🌲 CodeTree

Vectorless RAG for Code Repositories

Navigate your codebase like a human expert, using LLM reasoning instead of vector similarity.

Python 3.10+ | License: MIT


🤔 The Problem

Traditional RAG (Retrieval-Augmented Generation) for code has fundamental limitations:

| Problem | Description |
| --- | --- |
| ❌ Vector similarity ≠ code relevance | "login" and "logout" have similar embeddings, but they're completely different! |
| ❌ Chunking destroys structure | Splitting a class across chunks loses critical context |
| ❌ Can't follow call chains | "Who calls this function?" is nearly impossible with vectors |
| ❌ No architecture understanding | Vectors don't know that auth/ is for authentication |

💡 The Solution

CodeTree takes a different approach. It builds a hierarchical tree index of your codebase and uses LLM reasoning to navigate it, much as a human developer would:

  • ✅ AST-based parsing preserves code structure
  • ✅ LLM reasons about which files are relevant
  • ✅ Understands module relationships and dependencies
  • ✅ Can trace function calls across files
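To give a feel for what AST-based parsing preserves, here is a minimal sketch using Python's standard `ast` module to list a file's imports, classes, and functions. The `outline` helper and its output shape are assumptions made for illustration; they are not CodeTree's actual parser.

```python
import ast

def outline(source: str) -> dict:
    """Collect top-level imports, classes, and functions from Python source.

    Hypothetical helper for illustration; not part of CodeTree's API.
    """
    tree = ast.parse(source)
    result = {"imports": [], "classes": [], "functions": []}
    for node in tree.body:
        if isinstance(node, ast.Import):
            result["imports"].extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            result["imports"].append(node.module or ".")
        elif isinstance(node, ast.ClassDef):
            # Keep methods attached to their class instead of splitting them apart.
            methods = [n.name for n in node.body
                       if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
            result["classes"].append({"name": node.name, "methods": methods})
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            result["functions"].append(node.name)
    return result

sample = '''
import hashlib
from auth.base import Authenticator

class LoginService(Authenticator):
    def login(self, user, password):
        return verify(user, password)

    def logout(self, user):
        pass

def verify(user, password):
    return True
'''

print(outline(sample))
```

Unlike fixed-size chunking, the class and its methods stay together as one unit in the result.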

✨ Features

| Feature | Description |
| --- | --- |
| 🚫 No Vector Database | Uses code structure + LLM reasoning instead of embedding similarity |
| 🌳 AST-Based Indexing | Parses actual code structure: functions, classes, imports, dependencies |
| 🔗 Cross-File Intelligence | Tracks imports, function calls, and dependencies across your entire codebase |
| 🧠 Reasoning-Based Retrieval | LLM navigates the code tree like a human expert |
| 💬 Natural Language Queries | Ask questions in plain English |
| 🔒 Privacy-First | Works with local models (Ollama), in which case your code never leaves your machine |

📊 Comparison: Vector RAG vs CodeTree

| Feature | Vector RAG | CodeTree |
| --- | --- | --- |
| Understands code structure | ❌ | ✅ |
| Cross-file references | ❌ | ✅ |
| "Who calls this function?" | ❌ | ✅ |
| No chunking headaches | ❌ | ✅ |
| Explainable retrieval | ❌ | ✅ |
| Works offline | ⚠️ | ✅ |
| No vector DB needed | ❌ | ✅ |
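The "Who calls this function?" row is the kind of question a structural index can answer directly. The sketch below is one way to do it with the standard `ast` module; `find_callers` is a hypothetical name, and the matching is by bare name only, so it is a heuristic and not necessarily how CodeTree resolves calls.

```python
import ast

def find_callers(sources: dict[str, str], target: str) -> list[tuple[str, str, int]]:
    """Return (file, enclosing function, line) for each call to `target`.

    Matches by bare name, so `validate_user(...)` and `svc.validate_user(...)`
    both count. Calls inside nested functions are also reported for every
    enclosing function. Illustrative only.
    """
    hits = []
    for path, source in sources.items():
        tree = ast.parse(source)
        for func in ast.walk(tree):
            if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
                continue
            for node in ast.walk(func):
                if not isinstance(node, ast.Call):
                    continue
                callee = node.func
                name = callee.id if isinstance(callee, ast.Name) else getattr(callee, "attr", None)
                if name == target:
                    hits.append((path, func.name, node.lineno))
    return hits

# Made-up file contents standing in for a repository.
sources = {
    "routes.py": "def login_route(req):\n    return validate_user(req.user)\n",
    "jobs.py": "def nightly():\n    svc.validate_user(None)\n    cleanup()\n",
}
print(find_callers(sources, "validate_user"))
```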

🚀 Quick Start

Installation

git clone https://github.com/toller892/Oh-Code-Rag.git
cd Oh-Code-Rag
pip install -e .

Configuration

Set your LLM API key:

export OPENAI_API_KEY="sk-..."
# or
export ANTHROPIC_API_KEY="sk-ant-..."
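A script that wraps CodeTree may want to check which key is set before indexing. The helper below is a hypothetical sketch, not part of CodeTree; it only assumes the two environment variable names shown above.

```python
import os

# Variable names come from the configuration step; the mapping itself is an assumption.
PROVIDER_KEYS = {"openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}

def detect_provider(env=None):
    """Return the first provider whose API key is set, or None if neither is."""
    env = os.environ if env is None else env
    for provider, variable in PROVIDER_KEYS.items():
        if env.get(variable):
            return provider
    return None

print(detect_provider({"ANTHROPIC_API_KEY": "sk-ant-..."}))
```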

Basic Usage

from codetree import CodeTree

# Index your repository
tree = CodeTree("/path/to/your/repo")
tree.build_index()

# Ask questions about the code
answer = tree.query("How does the authentication system work?")
print(answer)

CLI Usage

# Index a repository
codetree index /path/to/repo

# Query the codebase  
codetree query "Where is database connection handled?"

# Interactive chat mode
codetree chat

# Show code structure
codetree tree

# Find symbol references
codetree find "UserService"

🎯 Use Cases

👨‍💻 For Developers

Onboarding to New Codebases:

  • "What's the overall architecture of this project?"
  • "How do requests flow from API to database?"
  • "Where should I add a new payment method?"

Code Review & Understanding:

  • "What does the processOrder function do?"
  • "Who calls the validateUser method?"
  • "What happens if authentication fails?"

🏢 Industry Applications

| Industry | Use Case | Example Query |
| --- | --- | --- |
| FinTech | Audit & Compliance | "How is user data encrypted?" |
| Healthcare | Security Review | "Where is patient data accessed?" |
| E-commerce | Feature Development | "How does the cart system work?" |
| DevOps | Incident Response | "What services depend on Redis?" |
| Education | Code Learning | "Explain the MVC pattern in this app" |

🔬 Research & Analysis

  • Legacy Code Migration: Understand old systems before rewriting
  • Security Auditing: Find all database queries, API endpoints
  • Documentation Generation: Auto-generate architecture docs
  • Dependency Analysis: Map out service dependencies

🔬 Real-World Examples

Example 1: Understanding Project Architecture

Query:

from codetree import CodeTree

tree = CodeTree("./my-project")
tree.build_index()

answer = tree.query("What's the overall architecture? What are the core modules?")
print(answer)

Output:

## Project Architecture

This project follows a modular architecture with these core components:

1. **CodeTree (core.py)** - Main entry point
   - `build_index()`: Builds the code tree
   - `query()`: Natural language queries
   - `find()`: Symbol search

2. **CodeIndexer (indexer.py)** - Index construction
   - Recursively parses directories
   - Builds TreeNode hierarchy
   
3. **CodeParser (parser.py)** - AST parsing
   - Supports Python, JS, Go, Rust, Java
   - Extracts functions, classes, imports

4. **CodeRetriever (retriever.py)** - LLM-based retrieval
   - Two-stage: retrieve → answer
   - Uses reasoning prompts

## Data Flow
User Query → CodeTree → Retriever → LLM Reasoning → File Selection → Answer

Example 2: Finding Function Usage

Query:

refs = tree.find("authenticate")
print(refs)

Output:

📍 Found 5 references to 'authenticate':

  [function]  src/auth/login.py:45 → authenticate
  [function]  src/auth/oauth.py:78 → authenticate_oauth
  [import]    src/api/middleware.py → from auth import authenticate
  [import]    src/api/routes.py → from auth.login import authenticate
  [class]     src/auth/base.py:12 → Authenticator
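A symbol search like this can be approximated with a flat list of symbol records. The sketch below is an assumption-laden illustration: the `Symbol` record, `find`, and `format_hit` are hypothetical names, and the matching is a plain case-insensitive substring test. The sample output above also lists `Authenticator`, so the real matcher appears to be looser than this.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Symbol:
    kind: str                   # "function", "class", or "import"
    path: str
    name: str
    line: Optional[int] = None  # imports are shown without a line number

def find(symbols, query):
    """Case-insensitive substring match over symbol names (illustrative)."""
    q = query.lower()
    return [s for s in symbols if q in s.name.lower()]

def format_hit(s: Symbol) -> str:
    """Render one hit in the column layout used in the sample output."""
    location = f"{s.path}:{s.line}" if s.line is not None else s.path
    return f"  [{s.kind}]".ljust(14) + f"{location} → {s.name}"

symbols = [
    Symbol("function", "src/auth/login.py", "authenticate", 45),
    Symbol("function", "src/auth/oauth.py", "authenticate_oauth", 78),
    Symbol("import", "src/api/routes.py", "from auth.login import authenticate"),
    Symbol("class", "src/auth/base.py", "Authenticator", 12),
]

for hit in find(symbols, "authenticate"):
    print(format_hit(hit))
```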

Example 3: Tracing Code Flow

Query:

answer = tree.query("How does a user login request flow through the system?")
print(answer)

Output:

## Login Request Flow

1. **Entry Point**: `src/api/routes.py`
   - @app.post("/login") routes to auth_service.authenticate()

2. **Authentication**: `src/auth/service.py`
   - Validates credentials against database
   - Generates JWT token on success
   
3. **Database**: `src/db/users.py`
   - get_user_by_email() fetches user record
   - verify_password() checks hash

4. **Response**: Returns JWT token or 401 error
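Tracing a flow like this amounts to walking a call graph. As a minimal sketch, assuming the graph has already been extracted into a dictionary (the `CALLS` data below is made up to mirror the example output), a breadth-first search finds the shortest call chain between two functions. `call_path` is a hypothetical helper, not a CodeTree function.

```python
from collections import deque

# Hypothetical call graph: caller -> list of callees.
CALLS = {
    "routes.login": ["auth_service.authenticate"],
    "auth_service.authenticate": ["users.get_user_by_email", "users.verify_password"],
    "users.get_user_by_email": [],
    "users.verify_password": [],
}

def call_path(graph, start, goal):
    """Return the shortest call chain from `start` to `goal`, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for callee in graph.get(path[-1], []):
            if callee not in seen:
                seen.add(callee)
                queue.append(path + [callee])
    return None

print(" -> ".join(call_path(CALLS, "routes.login", "users.verify_password")))
```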

🏗️ How It Works

Architecture Overview

┌──────────────────────────────────────────────────────────────┐
│                        CodeTree                              │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│   CodeParser ──────▶ CodeIndexer ──────▶ CodeIndex (JSON)    │
│   (AST Parse)        (Build Tree)        (Store)             │
│                                              │               │
│                                              ▼               │
│   Answer ◀────────── Retrieve ◀────────── CodeRetriever      │
│   (Markdown)         (Read Files)         (LLM Reasoning)    │
│                                                              │
└──────────────────────────────────────────────────────────────┘
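The "Build Tree" and "Store" steps in the diagram can be pictured with a small sketch. The `TreeNode` fields and the `build_tree` helper below are assumptions for illustration; the actual index schema is not documented here.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class TreeNode:
    name: str
    kind: str                                    # "dir" or "file"
    symbols: list = field(default_factory=list)  # symbol names found in a file
    children: list = field(default_factory=list)

def build_tree(files: dict) -> TreeNode:
    """Build a directory/file hierarchy from {relative_path: [symbol names]}."""
    root = TreeNode(name=".", kind="dir")
    for path, symbols in sorted(files.items()):
        node = root
        parts = path.split("/")
        for part in parts[:-1]:
            # Reuse an existing directory node, or create it on first sight.
            child = next((c for c in node.children
                          if c.name == part and c.kind == "dir"), None)
            if child is None:
                child = TreeNode(name=part, kind="dir")
                node.children.append(child)
            node = child
        node.children.append(TreeNode(name=parts[-1], kind="file", symbols=list(symbols)))
    return root

files = {
    "src/auth/login.py": ["authenticate"],
    "src/auth/oauth.py": ["authenticate_oauth"],
    "src/db/users.py": ["get_user_by_email", "verify_password"],
}
index = build_tree(files)
print(json.dumps(asdict(index), indent=2))  # JSON form, as in the "Store" step
```

Because the result is plain JSON, an LLM can be shown the tree (or a summary of it) as text and asked which branches to open.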

Two-Stage Retrieval Process

Stage 1: Reasoning-Based Navigation

User: "How does authentication work?"
                    │
                    ▼
┌─────────────────────────────────────────────────────────────┐
│ LLM analyzes code tree structure:                           │
│                                                             │
│ "Authentication relates to auth module...                   │
│  Let me check src/auth/ directory...                        │
│  login.py and oauth.py look relevant...                     │
│  Also need to check who imports these..."                   │
└─────────────────────────────────────────────────────────────┘
                    │
                    ▼
Selected Files: [src/auth/login.py, src/auth/oauth.py, ...]

Stage 2: Answer Generation

Read selected files → Generate comprehensive answer with code snippets
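The two stages can be sketched as two LLM calls with a file-reading step in between. Everything below is a hedged illustration: `two_stage_query`, the prompt wording, and the `fake_llm` stand-in are assumptions, and stage 1 here sees only a flat file listing where the real tool navigates a structured tree.

```python
from typing import Callable

def two_stage_query(question: str, files: dict[str, str], llm: Callable[[str], str]):
    """Stage 1: ask the model which files matter. Stage 2: answer from their contents."""
    listing = "\n".join(sorted(files))
    reply = llm(f"SELECT\nQuestion: {question}\nFiles:\n{listing}\n"
                "Reply with one relevant path per line.")
    # Drop anything the model names that is not actually in the index.
    selected = [p.strip() for p in reply.splitlines() if p.strip() in files]
    context = "\n\n".join(f"### {p}\n{files[p]}" for p in selected)
    answer = llm(f"ANSWER\nQuestion: {question}\nCode:\n{context}")
    return selected, answer

def fake_llm(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs offline."""
    if prompt.startswith("SELECT"):
        return "src/auth/login.py\nsrc/auth/missing.py\n"
    return f"(answer based on {prompt.count('###')} file(s))"

files = {
    "src/auth/login.py": "def authenticate(): ...",
    "src/db/users.py": "def get_user_by_email(): ...",
}
selected, answer = two_stage_query("How does authentication work?", files, fake_llm)
print(selected, answer)
```

Filtering the model's reply against the known paths keeps a hallucinated file name from breaking stage 2.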

🗣️ Supported Languages

| Language | Extensions | Status |
| --- | --- | --- |
| Python | .py, .pyi | ✅ Full |
| JavaScript | .js, .jsx, .mjs | ✅ Full |
| TypeScript | .ts, .tsx | ✅ Full |
| Go | .go | ✅ Full |
| Rust | .rs | ✅ Full |
| Java | .java | ✅ Full |
| C/C++ | .c, .cpp, .h | 🚧 Coming Soon |
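The table above maps directly onto an extension lookup. This is a small sketch of that mapping; the `detect_language` helper is hypothetical, and C/C++ is left out because it is listed as coming soon.

```python
from pathlib import Path

# Extensions copied from the table; the dictionary layout is an assumption.
EXTENSIONS = {
    "python": {".py", ".pyi"},
    "javascript": {".js", ".jsx", ".mjs"},
    "typescript": {".ts", ".tsx"},
    "go": {".go"},
    "rust": {".rs"},
    "java": {".java"},
}
LANGUAGE_BY_EXT = {ext: lang for lang, exts in EXTENSIONS.items() for ext in exts}

def detect_language(path: str):
    """Return the language for a file path, or None if it is unsupported."""
    return LANGUAGE_BY_EXT.get(Path(path).suffix.lower())

print(detect_language("src/app.tsx"), detect_language("main.cpp"))
```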

⚙️ Configuration

Create .codetree.yaml in your project:

# LLM Configuration
llm:
  provider: openai          # openai, anthropic, ollama
  model: gpt-4o
  temperature: 0.0
  max_tokens: 4096

# For local/private deployment
# llm:
#   provider: ollama
#   model: llama3
#   base_url: http://localhost:11434

# Index Settings  
index:
  languages:
    - python
    - javascript
    - typescript
    - go
  exclude:
    - node_modules
    - __pycache__
    - .git
    - venv
    - dist
  max_file_size: 100000    # Skip files larger than 100KB
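To make the `exclude` and `max_file_size` settings concrete, here is a rough sketch of how such a filter could behave. It assumes that an exclude entry matches any single path component and that the size limit is in bytes; both are guesses about semantics, and `should_index` is a hypothetical helper.

```python
from pathlib import PurePosixPath

# Mirrors the `index` section above as a plain dictionary.
index_cfg = {
    "exclude": ["node_modules", "__pycache__", ".git", "venv", "dist"],
    "max_file_size": 100000,
}

def should_index(path: str, size: int, cfg: dict) -> bool:
    """Skip files under an excluded directory or above the size limit."""
    parts = PurePosixPath(path).parts
    if any(part in cfg["exclude"] for part in parts):
        return False
    return size <= cfg["max_file_size"]

print(should_index("src/app.py", 2000, index_cfg))
print(should_index("node_modules/lib/index.js", 10, index_cfg))
```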

📈 Performance

| Metric | Small Repo (<100 files) | Medium Repo (<1000 files) | Large Repo (<10000 files) |
| --- | --- | --- | --- |
| Index Time | < 5s | < 30s | < 5min |
| Index Size | < 100KB | < 1MB | < 10MB |
| Query Time | 2-5s | 3-8s | 5-15s |

Times depend on LLM provider latency.


🤝 Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Areas to contribute:

  • 🌍 Add language parsers (C++, Ruby, PHP, etc.)
  • 🧪 Improve test coverage
  • 📖 Documentation and examples
  • 🚀 Performance optimizations
  • 🎨 CLI improvements

🔌 MCP Server (Claude Desktop & More)

CodeTree works as an MCP (Model Context Protocol) server, compatible with Claude Desktop, Cline, Continue, and other MCP clients.

Setup for Claude Desktop

Add to your Claude Desktop config:

{
  "mcpServers": {
    "codetree": {
      "command": "python",
      "args": ["/path/to/Oh-Code-Rag/mcp/server.py"],
      "env": {
        "OPENAI_API_KEY": "sk-your-key-here"
      }
    }
  }
}
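If your Claude Desktop config already lists other MCP servers, the entry needs to be merged in, not pasted over the file. The sketch below shows one way to do that on an in-memory dictionary; `add_codetree_server` is a hypothetical helper, and reading or writing the actual config file is left out because its location depends on your setup.

```python
import copy
import json

def add_codetree_server(config: dict, server_path: str, api_key: str) -> dict:
    """Return a copy of `config` with the codetree entry added or replaced."""
    merged = copy.deepcopy(config)
    servers = merged.setdefault("mcpServers", {})
    servers["codetree"] = {
        "command": "python",
        "args": [server_path],
        "env": {"OPENAI_API_KEY": api_key},
    }
    return merged

existing = {"mcpServers": {"other": {"command": "node", "args": ["server.js"]}}}
updated = add_codetree_server(existing, "/path/to/Oh-Code-Rag/mcp/server.py", "sk-your-key-here")
print(json.dumps(updated, indent=2))
```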

MCP Tools

| Tool | Description |
| --- | --- |
| codetree_index | Index a repository |
| codetree_query | Ask questions about code |
| codetree_tree | Show code structure |
| codetree_find | Find symbol references |
| codetree_stats | Get repo statistics |

See mcp/README.md for full documentation.


🤖 Clawdbot Skill

CodeTree also comes as a Clawdbot skill for AI assistant integration.

Install Skill

Copy the skill/ folder to your Clawdbot skills directory:

cp -r skill/ ~/.clawdbot/skills/codetree/

Skill Commands

# Index a repo
./scripts/codetree.sh index /path/to/repo

# Query code
./scripts/codetree.sh query /path/to/repo "How does auth work?"

# Show structure
./scripts/codetree.sh tree /path/to/repo

# Find symbol
./scripts/codetree.sh find /path/to/repo "UserService"

See skill/SKILL.md for full documentation.


📄 License

MIT License - see LICENSE for details.


🙏 Acknowledgments

Inspired by PageIndex, a vectorless RAG approach for documents.



If you find CodeTree useful, please give us a ⭐!
