Codebase Intelligence AI — chat with any codebase using semantic search and LLMs

AskRepo

A local-first code intelligence system that indexes source code and documentation into a vector database and answers natural language questions about it using semantic search and an LLM.


Overview

AskRepo parses your codebase at the AST level, assigns each function, class, and file a natural language description using an LLM, stores everything in a local ChromaDB vector store, and lets you query it in plain English.

It supports any public GitHub repository via shallow Git clone, and handles both structured code (Python, JavaScript, TypeScript) and unstructured documents (Markdown, JSON, TOML, YAML, plain text, config files).

Both the indexing descriptions and the query answering use pluggable LLM backends — Groq (cloud) or Ollama (local) — configurable independently of each other.


Retrieval-Augmented Generation (RAG)

AskRepo is built on the RAG pattern — a technique that improves LLM answers by grounding them in retrieved facts rather than relying on the model's training memory alone.

The problem RAG solves: a codebase is too large to send to an LLM in a single prompt, and even if it fit, the model has no specific knowledge of your code. RAG solves this by:

  1. Indexing — splitting the codebase into small, searchable chunks and storing them in a vector database.
  2. Retrieval — when a question arrives, finding the chunks most semantically relevant to it.
  3. Augmented generation — constructing a prompt that contains only those relevant chunks and sending it to the LLM, so the model answers from actual context rather than guessing.

Phase 1: Indexing

Source file
    └── Parsed into chunks (one per function / class / file / document)
            └── LLM writes a verbal description of each chunk
                    └── Description is converted to a vector (embedding)
                            └── Vector + metadata stored in ChromaDB

Each chunk is described in plain English by an LLM before being embedded. This is important: raw code contains a lot of syntactic noise (brackets, keywords, indentation) that degrades embedding quality. A natural language description captures intent, which embeds and retrieves far more accurately.

Phase 2: Retrieval and Generation

User question
    └── Question converted to a vector using the same embedding model
            └── Cosine similarity search against all stored vectors
                    └── Top-k most similar chunks retrieved
                            └── Chunks assembled into a context prompt
                                    └── LLM generates the answer
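
The retrieval steps above can be sketched with toy data. This is a minimal illustration, not AskRepo's own code: the three-dimensional vectors, the chunk names, and the `top_k` helper are invented for the example, and a plain dict stands in for ChromaDB.

```python
import heapq
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(question_vec, stored, k):
    """Return the k (score, chunk_id) pairs most similar to the question."""
    q = normalize(question_vec)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(vec))), chunk_id)
        for chunk_id, vec in stored.items()
    )
    return heapq.nlargest(k, scored)

# Hypothetical stored vectors (real embeddings have 384 dimensions).
stored = {
    "auth.py::login": [0.9, 0.1, 0.0],
    "crypt.py::encrypt": [0.1, 0.9, 0.1],
    "entries.py::add_entry": [0.0, 0.2, 0.9],
}
question = [1.0, 0.2, 0.0]  # pretend embedding of "how does login work?"

results = top_k(question, stored, k=2)
for score, chunk_id in results:
    print(f"{score:.3f}  {chunk_id}")
```

The chunk whose vector points in nearly the same direction as the question ranks first; the retrieved chunks are then passed on to prompt construction.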

What is an Embedding?

An embedding is a fixed-size array of floating-point numbers (a vector) that represents the semantic meaning of a piece of text. Text with similar meaning produces vectors that are geometrically close to each other in high-dimensional space.

For example, the phrase "function that hashes a password" and the phrase "bcrypt-based password encryption routine" will produce vectors with a high cosine similarity score — even though they share no words. This is what enables natural language search over code.
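
As a rough illustration of what "cosine similarity" measures, the sketch below computes it by hand. The four-dimensional vectors are invented stand-ins for real embeddings, and the variable names are only labels for the example.

```python
import math

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|); the result lies between -1 and 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings"; the values are made up for illustration.
hashes_password = [0.8, 0.6, 0.0, 0.1]
bcrypt_routine = [0.7, 0.7, 0.1, 0.0]
parses_yaml = [0.0, 0.1, 0.9, 0.4]

related = cosine_similarity(hashes_password, bcrypt_routine)
unrelated = cosine_similarity(hashes_password, parses_yaml)
print(f"related:   {related:.2f}")    # high score
print(f"unrelated: {unrelated:.2f}")  # much lower score
```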

AskRepo uses all-MiniLM-L6-v2, a 22.7M-parameter model that produces 384-dimensional vectors. It runs entirely on CPU and takes under a second per query on modern hardware.

Why not just send all the code to the LLM?

Approach                        Result
Send full codebase in prompt    Exceeds context window; expensive; model loses focus in large contexts
Keyword search (grep)           Finds exact text matches, misses semantic relationships
RAG with embeddings             Retrieves semantically relevant chunks; fits in context; accurate

How It Works

Source Code / GitHub Repo
         │
         ▼
  ┌──────────────┐
  │   parser.py  │  AST extraction (tree-sitter) for .py / .js / .ts / .tsx
  │              │  Raw content read for .md / .json / .toml / .yaml / etc.
  └──────┬───────┘
         │  Structured data: functions, classes, imports, globals
         ▼
  ┌──────────────┐
  │  chunker.py  │  One chunk per function, class, file overview, or document
  └──────┬───────┘
         │  List of typed chunks with metadata
         ▼
  ┌───────────────┐
  │ describer.py  │  LLM generates a 2-4 sentence verbal description per chunk
  │               │  Backend: Ollama (local) or Groq (cloud) — set in config.py
  └──────┬────────┘
         │  Chunks with `verbal` field populated
         ▼
  ┌───────────────┐
  │   store.py    │  Embeds the verbal description via sentence-transformers
  │               │  Stores vectors + metadata in local ChromaDB
  └──────┬────────┘
         │
         ◆  Index complete
         │
  ┌──────┴────────────────────────────────────────────────────┐
  │                        query.py                            │
  │  1. Embed the user's question                              │
  │  2. Retrieve top-k chunks via cosine similarity            │
  │  3. Build a context prompt from the retrieved chunks       │
  │  4. Call LLM (Groq or Ollama) → synthesise answer         │
  └───────────────────────────────────────────────────────────┘

Key design decisions

  • AST over full-file embedding — Each function and class is indexed independently. This gives precise semantic hits instead of retrieving large, diluted file blobs.
  • Verbal descriptions as the embedding target — Rather than embedding raw code (which encodes syntax, not intent), an LLM first writes a plain English description of each chunk. That description is what gets embedded. This dramatically improves retrieval relevance.
  • Fully local storage — ChromaDB persists all vectors to ./chroma_db/ on disk. No cloud vector database, no data leaves the machine (unless you use the Groq backend).
  • Shallow Git clones — GitHub repositories are fetched with git clone --depth=1, avoiding API rate limits and keeping clone sizes small.
  • Lazy model loading — The embedding model is only loaded into memory when a command actually needs it (query, index). Commands like list and count run instantly without touching the model.
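
The lazy-loading decision can be sketched as follows. This is an assumption about the general pattern, not AskRepo's actual implementation: `get_embedding_model`, `count_chunks`, and `embed_query` are hypothetical names, and a small dict stands in for the real model object.

```python
from functools import lru_cache

LOAD_COUNT = 0  # only here to make the laziness observable in the example

@lru_cache(maxsize=1)
def get_embedding_model():
    """Build the (expensive) model on first call, then reuse the cached instance."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    # Stand-in for loading a model such as all-MiniLM-L6-v2 from disk.
    return {"name": "all-MiniLM-L6-v2", "dimensions": 384}

def count_chunks(index):
    """A command like `count` never calls get_embedding_model()."""
    return len(index)

def embed_query(text):
    """A command like `query` triggers the load the first time it runs."""
    model = get_embedding_model()
    return [0.0] * model["dimensions"]  # placeholder vector, not a real embedding

print(count_chunks(["a", "b"]), LOAD_COUNT)  # model not loaded yet
embed_query("how does auth work?")
embed_query("what does Timers track?")
print(LOAD_COUNT)                            # loaded exactly once
```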

Project Structure

askrepo/
├── main.py            CLI entry point — all commands route through here
├── askrepo.bat        Windows launcher (run `askrepo` from the project directory)
├── config.py          Single source of truth for all settings
├── parser.py          AST extraction (Python, JS, TS) + simple file reader
├── chunker.py         Splits parsed output into indexable chunks
├── describer.py       LLM description generation (Groq / Ollama)
├── store.py           ChromaDB wrapper — add, search, count, metadata
├── query.py           Query pipeline — retrieve → prompt → LLM → answer
├── github_fetcher.py  Git clone / pull for public GitHub repositories
├── requirements.txt
├── .env               GROQ_API_KEY goes here
├── chroma_db/         Local vector store (auto-created on first index)
└── repos/             Cached GitHub repository clones

Requirements

  • Python 3.10+
  • Git (must be in PATH — used for index-repo)
  • Ollama with gemma:2b pulled (if using the Ollama backend)
  • A Groq API key (if using the Groq backend — free tier available)

Installation

git clone <this-repo>
cd askrepo

python -m venv .venv
.venv\Scripts\activate          # Windows
# source .venv/bin/activate     # macOS / Linux

pip install -r requirements.txt

Create a .env file in the project root:

GROQ_API_KEY=your_key_here

If you intend to use only Ollama, the .env file and Groq key are not required.

CLI setup (Windows)

The project includes askrepo.bat. To use askrepo as a command from anywhere, add the project directory to your system PATH, or simply run it from within the project directory:

askrepo query "how does authentication work?"

Alternatively, you can always invoke it directly:

python main.py query "how does authentication work?"

Configuration

All settings are in config.py. Edit this file directly; there are no per-setting CLI flags or environment variables to track down (the Groq API key in .env is the one exception).

# config.py

# Which LLM generates verbal descriptions during indexing
# "ollama"  — local, unlimited, no API key needed  (default)
# "groq"    — cloud, faster, 100k token/day free tier
DESCRIBER_BACKEND = "ollama"

# Which LLM synthesises answers during queries
# "groq"    — cloud, better reasoning quality      (default)
# "ollama"  — local, unlimited
QUERY_BACKEND = "groq"

# Ollama
OLLAMA_BASE_URL = "http://localhost:11434"
OLLAMA_MODEL    = "gemma:2b"

# Groq
GROQ_MODEL      = "llama-3.3-70b-versatile"
GROQ_CALL_DELAY = 0.5           # seconds between calls, respects free-tier limits

# Retrieval
TOP_K = 5                       # chunks returned per query

# Directories never descended into during indexing
SKIP_DIRS = {"venv", ".venv", "__pycache__", "node_modules", "docs", ...}
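
One way a fixed delay such as GROQ_CALL_DELAY might be enforced between LLM calls is shown below. The source does not show how AskRepo applies the delay, so the `Throttle` class is purely hypothetical; the injectable `clock` and `sleep` arguments exist only to make it easy to test.

```python
import time

class Throttle:
    """Ensure at least `delay` seconds pass between consecutive calls."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

GROQ_CALL_DELAY = 0.5  # same value as in the config.py listing above

throttle = Throttle(GROQ_CALL_DELAY)
start = time.monotonic()
for chunk_name in ["login", "logout", "refresh_token"]:
    throttle.wait()
    # a real describer would call the LLM for chunk_name here
elapsed = time.monotonic() - start
print(f"3 calls took {elapsed:.2f}s")
```

The first call goes through immediately; each later call waits only for whatever part of the delay has not already elapsed.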

Backend matrix

Use case                              DESCRIBER_BACKEND   QUERY_BACKEND
Default (local index, cloud query)    "ollama"            "groq"
Fully offline                         "ollama"            "ollama"
Groq daily limit hit                  "ollama"            "ollama"
Fastest indexing (burns tokens)       "groq"              "groq"

Usage

Index a local path

askrepo index ./myproject
askrepo index ./src/auth.py

Walks the directory recursively. Skips test files, dependency directories (node_modules, .venv, etc.), and documentation folders (docs/). Accepts a single file or any directory.

Index a GitHub repository

askrepo index-repo fastapi/fastapi
askrepo index-repo https://github.com/psf/requests
askrepo index-repo django/django --branch stable/4.2.x

Performs a shallow clone (--depth=1) into ./repos/<owner>_<repo>/. If the repository is already cached, runs git pull to update it instead of re-cloning.
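
The clone-or-pull behaviour described above could look roughly like this. It is a sketch under assumptions: `parse_repo_spec` and `plan_fetch` are hypothetical helpers (github_fetcher.py is not shown in this document), and the function only builds the git command rather than running it.

```python
from pathlib import Path
from urllib.parse import urlparse

def parse_repo_spec(spec):
    """Accept 'owner/repo' or a github.com URL and return (owner, repo)."""
    if spec.startswith(("http://", "https://")):
        path = urlparse(spec).path
    else:
        path = spec
    parts = [p for p in path.strip("/").split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected owner/repo, got {spec!r}")
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return owner, repo

def plan_fetch(spec, cache_root="repos", branch=None):
    """Return the git command to run: pull if already cached, else shallow clone."""
    owner, repo = parse_repo_spec(spec)
    target = Path(cache_root) / f"{owner}_{repo}"
    if target.exists():
        return ["git", "-C", str(target), "pull"]
    cmd = ["git", "clone", "--depth=1"]
    if branch:
        cmd += ["--branch", branch]
    cmd += [f"https://github.com/{owner}/{repo}.git", str(target)]
    return cmd

print(plan_fetch("fastapi/fastapi", cache_root="no-such-cache-dir"))
print(plan_fetch("https://github.com/psf/requests", cache_root="no-such-cache-dir"))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` to perform the fetch.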

Note on large repositories: repos with large docs/ folders (translations, tutorials) can generate hundreds of chunks and quickly exhaust the Groq free-tier token budget if those folders are indexed. For that reason docs/ is in SKIP_DIRS by default. Adjust SKIP_DIRS in config.py if needed.

Query

askrepo query "how does authentication work?"
askrepo query "what does the Timers class track?"
askrepo query "what python version does this require?"

Runs the full pipeline: embed the question → retrieve top-k chunks → build a context prompt → call the LLM → print the answer.
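
The "build a context prompt" step might look something like the sketch below. The prompt wording, the `build_prompt` helper, the chunk field names, and the example descriptions are all assumptions made for illustration; the actual template in query.py is not shown here.

```python
def build_prompt(question, chunks):
    """Join retrieved chunks into a context block followed by the question."""
    sections = []
    for i, chunk in enumerate(chunks, start=1):
        header = f"[{i}] {chunk['file']} :: {chunk['name']}"
        sections.append(f"{header}\n{chunk['verbal']}")
    context = "\n\n".join(sections)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical retrieved chunks; descriptions are invented for this sketch.
chunks = [
    {"file": "auth.py", "name": "generate_challenge",
     "verbal": "Creates a random challenge for the client to sign."},
    {"file": "crypt.py", "name": "encrypt_with_aesgcm",
     "verbal": "Encrypts a payload with AES-GCM using a derived key."},
]
prompt = build_prompt("how does authentication work?", chunks)
print(prompt)
```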

List the index

askrepo list

Prints a structured breakdown of everything currently indexed, grouped by source:

==============================================================
  INDEX BREAKDOWN
==============================================================
  Sources : 1
  Files   : 7
  Chunks  : 25

--------------------------------------------------------------
  Source : aswin-2005/MONOL-Server   (7 files | 25 chunks)
--------------------------------------------------------------
  auth.py          python    12 chunks  [file, 11x function]  ->  generate_challenge, ...
  crypt.py         python     4 chunks  [file, 3x function]   ->  encrypt_with_aesgcm, ...
  entries.py       python     5 chunks  [file, 4x function]   ->  add_entry, get_entries, ...
  requirements.txt text       1 chunk   [document]
  ...
==============================================================

Count chunks

askrepo count

Prints the total number of indexed chunks. Does not load the embedding model.

Clear the index

askrepo clear

Wipes the ChromaDB collection. Does not delete cached repository clones in ./repos/.


Supported File Types

Structured (AST-parsed)

These files are parsed with tree-sitter. Each function, class, and method becomes its own chunk with extracted metadata (parameters, return type, calls, docstring).

Extension          Language
.py                Python
.js, .mjs, .cjs    JavaScript
.ts                TypeScript
.tsx               TypeScript + JSX
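
To give a feel for per-function chunking, here is a small sketch. Note the swap: AskRepo uses tree-sitter, whereas this example uses Python's built-in `ast` module so that it runs without extra dependencies and handles Python only. The chunk field names are assumptions.

```python
import ast

SOURCE = '''
def add_entry(store, entry):
    """Append an entry to the store."""
    store.append(entry)
    return len(store)

class Timers:
    """Tracks named timers."""
    def start(self, name):
        self.active = name
'''

def extract_chunks(source):
    """Yield one dict per function, class, and method found in the source."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            yield {
                "type": "function",
                "name": node.name,
                "params": [a.arg for a in node.args.args],
                "docstring": ast.get_docstring(node),
                # Names of everything called inside the function body.
                "calls": sorted({
                    n.func.attr if isinstance(n.func, ast.Attribute) else n.func.id
                    for n in ast.walk(node)
                    if isinstance(n, ast.Call)
                    and isinstance(n.func, (ast.Attribute, ast.Name))
                }),
            }
        elif isinstance(node, ast.ClassDef):
            yield {"type": "class", "name": node.name,
                   "docstring": ast.get_docstring(node)}

for chunk in extract_chunks(SOURCE):
    print(chunk)
```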

Simple (raw content)

These files are read as plain text and stored as a single document chunk each.

Extension / Filename          Label
.md, .markdown                markdown
.txt                          text
.rst                          restructuredtext
.json                         json
.toml                         toml
.yaml, .yml                   yaml
.env, .ini, .cfg, .conf       env / config
Dockerfile, Makefile          dockerfile / makefile
.gitignore, .dockerignore     gitignore

Embedding Model

AskRepo uses all-MiniLM-L6-v2 from sentence-transformers.

Property             Value
Parameters           22.7 million
Vector dimensions    384
Max input tokens     256
Runs on              CPU (no GPU required)
Similarity metric    Cosine similarity
Local cache          ~/.cache/huggingface/

The model is downloaded once on first use and cached locally. All subsequent runs load from disk — no internet connection required after the initial download.

Why verbal descriptions are embedded, not raw code

Embedding raw source code produces vectors that are heavily influenced by syntax — language keywords, punctuation, indentation — rather than the semantic purpose of the code. An LLM-generated description strips the syntactic noise and represents what the code does in plain language, which maps much more faithfully to the kind of natural language questions users ask.

For example, a question like "how is the user session cleaned up?" will match the description "removes expired sessions from the in-memory store on a timed schedule" far more reliably than it would match the raw Python source of that function.

The model is lazy-loaded: it is only initialised when a command actually needs embeddings (index, query). Commands like list and count are instant.

To change the embedding model, update EMBEDDING_MODEL in config.py. Any model from the sentence-transformers model hub can be used.


Ollama Setup

Install Ollama and pull the model:

# Install from https://ollama.com
ollama pull gemma:2b
ollama serve        # Ollama usually auto-starts; only needed if not running

Verify it is reachable:

curl http://localhost:11434/api/tags
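
The same check can be done from Python using only the standard library. This is a small sketch; `ollama_is_reachable` is a hypothetical helper, not part of AskRepo, and it only tests for an HTTP 200 response without inspecting the body.

```python
import urllib.error
import urllib.request

def ollama_is_reachable(base_url="http://localhost:11434", timeout=2.0):
    """Return True if GET <base_url>/api/tags answers with HTTP 200."""
    url = base_url.rstrip("/") + "/api/tags"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False

if ollama_is_reachable():
    print("Ollama is running")
else:
    print("Ollama is not reachable; try `ollama serve`")
```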

The base URL and model name are configurable in config.py under OLLAMA_BASE_URL and OLLAMA_MODEL. Any Ollama-compatible model can be used.


Skipped Files and Directories

The following are automatically excluded during indexing to avoid token waste and retrieval noise:

Directories: venv, .venv, __pycache__, node_modules, dist, build, .git, vendor, third_party, site-packages, docs, doc, documentation, examples, example, benchmarks, bench

File name patterns:

  • Prefix: test_, spec_
  • Suffix: _test.py, _test.js, _test.ts, .test.js, .test.ts, .spec.js, .spec.ts, _spec.rb

All of these are configurable via SKIP_DIRS, SKIP_FILE_PREFIXES, and SKIP_FILE_SUFFIXES in config.py.
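
The exclusion rules listed above can be expressed as a single predicate. The lists are copied from this section; the matching logic (exact directory-name match on any parent component, prefix or suffix match on the file name) is an assumption about how the rules are applied, and `should_skip` is a hypothetical name.

```python
from pathlib import PurePosixPath

SKIP_DIRS = {"venv", ".venv", "__pycache__", "node_modules", "dist", "build",
             ".git", "vendor", "third_party", "site-packages", "docs", "doc",
             "documentation", "examples", "example", "benchmarks", "bench"}
SKIP_FILE_PREFIXES = ("test_", "spec_")
SKIP_FILE_SUFFIXES = ("_test.py", "_test.js", "_test.ts", ".test.js",
                      ".test.ts", ".spec.js", ".spec.ts", "_spec.rb")

def should_skip(path):
    """Return True if any parent directory or the file name is excluded."""
    p = PurePosixPath(path)
    if any(part in SKIP_DIRS for part in p.parts[:-1]):
        return True
    return (p.name.startswith(SKIP_FILE_PREFIXES)
            or p.name.endswith(SKIP_FILE_SUFFIXES))

for candidate in ["src/auth.py", "src/test_auth.py", "node_modules/lib/index.js"]:
    print(candidate, "->", "skip" if should_skip(candidate) else "index")
```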


Limitations

  • Groq free tier — 100,000 tokens per day. Indexing a large repository with many files can exhaust this quickly. Use Ollama for indexing (DESCRIBER_BACKEND = "ollama") and reserve Groq tokens for queries.
  • Query quality with small models — gemma:2b is capable but noticeably weaker than llama-3.3-70b-versatile on complex reasoning. For best answer quality, use Groq for queries.
  • No incremental re-indexing — Re-running index or index-repo on an already-indexed path will upsert (overwrite) existing chunks. This is safe but re-runs all LLM description calls.
  • No cross-collection search — All indexed sources share a single ChromaDB collection. Run clear if you want to start fresh.
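
To illustrate why re-indexing overwrites rather than duplicates, the sketch below shows upsert behaviour with stable chunk IDs. The ID scheme is an assumption made for the example (the source does not say how AskRepo derives IDs), and a plain dict stands in for the ChromaDB collection.

```python
import hashlib

def chunk_id(source, file_path, name):
    """Derive a stable ID so the same chunk maps to the same key on every run."""
    key = f"{source}|{file_path}|{name}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]

def upsert(index, source, file_path, name, verbal):
    """Insert the chunk, or overwrite it if the ID already exists."""
    index[chunk_id(source, file_path, name)] = {"name": name, "verbal": verbal}

index = {}  # stand-in for the ChromaDB collection
upsert(index, "local", "auth.py", "login", "Checks a password.")
upsert(index, "local", "auth.py", "login", "Verifies credentials and issues a token.")
print(len(index))  # still one entry; the second call replaced the first
```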

Dependencies

Package                   Purpose
chromadb                  Local vector database
sentence-transformers     Embedding model (all-MiniLM-L6-v2)
tree-sitter               AST parsing core
tree-sitter-python        Python grammar
tree-sitter-javascript    JavaScript grammar
tree-sitter-typescript    TypeScript / TSX grammar
groq                      Groq SDK for cloud LLM calls
python-dotenv             .env file loading
