Privacy-preserving CLI AI agent for querying local code repositories using RAG
Project description
Autonomous RAG Agent
A privacy-preserving command-line AI assistant for your local code repositories. Index any codebase, search it with natural language, ask questions in an interactive chat session, and generate code patches — all without uploading your source files to the cloud.
Only semantically selected code snippets are sent to the LLM. Raw repository files never leave your machine.
How it works
Your repo → Chunk → Embed (local CPU) → ChromaDB (local disk)
↓
Your query → Embed → Retrieve top-K chunks → Gemini / GPT / Claude
↓
Answer + sources
- Index — scans your repo, splits files into overlapping chunks, generates embeddings locally using
all-MiniLM-L6-v2(runs on CPU, no GPU needed), stores everything in ChromaDB on your disk. - Chat / Ask / Search / Patch — your query is embedded the same way, the most relevant chunks are retrieved, and only those chunks are sent to the LLM along with your question.
Installation
Prerequisites
- Python 3.10 or newer
- pip
Install (recommended — pipx keeps it isolated)
pipx install context_rag_cli
Install (standard pip)
pip install context_rag_cli
Install from source
git clone <repo-url>
cd project_cli
pip install -e .
Verify the install:
agent --help
CPU-only PyTorch (saves ~1 GB): The default
torchwheel from PyPI includes CUDA support. To install the lighter CPU-only build, see INSTALL.md.
The agent stores all its data in
~/.agent/— it never writes files to your project directory.
Quick start — using it on your own project
Step 1 — Configure your LLM provider
Pick one of the three supported providers and set your API key:
# Gemini (Google)
agent config set provider gemini
agent config set model gemini-1.5-flash
agent config set api_key YOUR_GEMINI_API_KEY
# OpenAI
agent config set provider openai
agent config set model gpt-4o-mini
agent config set api_key YOUR_OPENAI_API_KEY
# Anthropic (Claude)
agent config set provider anthropic
agent config set model claude-3-5-sonnet-20241022
agent config set api_key YOUR_ANTHROPIC_API_KEY
Confirm everything looks right:
agent config show
Step 2 — Index your project
Navigate to any project directory and index it:
cd /path/to/your/project
agent index .
Or index a specific path from anywhere:
agent index "/path/to/your/project"
This step runs entirely locally. It may take a minute on the first run while the embedding model is downloaded (~90 MB, cached after that). All subsequent runs are fast.
Step 3 — Start a chat session (recommended)
The best way to explore a codebase is the interactive chat mode. The embedding model loads once and you type questions directly. The agent remembers the last 6 turns by default, so follow-up questions work naturally:
cd /path/to/your/project
agent chat
RAG Agent — Interactive Chat
Project : /path/to/your/project
Provider: gemini / gemini-1.5-flash
Top-K : 5 | Memory: 6 turns
You: How does authentication work?
Agent: The authentication flow starts in auth.py...
You: What errors can it throw? ← follow-up works
Agent: Based on our discussion of the auth flow,
it raises AuthError on 401/403...
You: /search JWT token
# Shows raw search results without LLM
You: /top-k 10
# Now retrieves 10 chunks per question
You: /clear-history
# Forget conversation history without restarting
You: /exit
Step 4 — Or use single commands
agent ask "How does authentication work in this project?"
agent ask "Where is the database connection configured?"
agent search "error handling"
agent search "database query" --top-k 10
Step 5 — Generate a patch
agent patch "Add input validation to the login endpoint"
agent patch "Add docstrings to all public functions in utils.py" --dry-run
All commands
agent chat ⭐ recommended
Start an interactive chat session. The embedding model and vector store load once — no startup delay on every question. Conversation memory is built in — the agent remembers your last N turns so follow-up questions work naturally.
agent chat # default top-k 5, 6 turns memory, buffered responses
agent chat --stream # stream tokens as they arrive
agent chat --top-k 8 # retrieve 8 chunks per question
agent chat --history 10 # keep last 10 turns in memory
agent chat --history 0 # stateless mode (no memory)
agent chat --verbose # show source files after each answer
In-session commands:
| Command | Description |
|---|---|
/exit or /quit |
End the session |
/search <query> |
Search without calling the LLM |
/snippet <file> [question] |
Load an entire file as pinned context |
/top-k <N> |
Change chunk count for the rest of the session |
/clear |
Clear the screen |
/clear-history |
Forget conversation history without restarting |
/help |
Show available commands |
Ctrl+C |
Interrupt a slow response without killing the session |
Asking about a specific file or code snippet — /snippet:
The /snippet command loads any file directly into the prompt as the highest-priority context. No indexing required. This is the reliable way to ask about a specific file on Windows — avoids the terminal paste problem where multiline text fires as separate questions:
You: /snippet agent/llm_client.py What does TokenUsage do?
Agent: TokenUsage is a dataclass that tracks token usage...
You: /snippet mycode.py
Question about this file: Explain the retry logic
Agent: ...
You: /snippet config.yaml Is this configuration valid?
Agent: ...
The file content is injected first (highest priority), followed by RAG-retrieved chunks for supporting codebase context.
Paste mode — short snippets typed manually:
Start your message with ``` to enter paste mode. Type or paste line by line, then close with ``` on its own line:
You: ```
> def hello():
> return "world"
> ```
Snippet captured (2 lines). Now type your question.
You: What does this function return?
Agent: It returns the string "world".
Windows tip: Pasting multiline code directly into the terminal sends each line as a separate message, triggering a Gemini call for each line (rate limit errors). Always save the code to a file and use
/snippet <file>for anything larger than 2-3 lines.
Conversation memory example:
You: How does authentication work?
Agent: The auth module uses JWT tokens validated in auth.py...
You: What about error handling in that? ← follow-up works
Agent: Based on our discussion of the JWT flow, errors are
raised as AuthError (401/403) which...
You: /clear-history ← start fresh without restarting
Conversation history cleared.
You: /exit
agent index <path>
Indexes a repository. Creates or replaces the local ChromaDB collection.
agent index .
agent index "/path/to/project"
agent index . --verbose # show detailed timing logs
agent index . --changed # only re-index files that changed since last run
--changed compares each file's SHA-256 hash against the manifest saved during the previous index. Only new or modified files are re-embedded; deleted files have their chunks removed. Falls back to a full index if no previous index exists.
Always prints file count, chunk count, and elapsed time on completion.
agent ask "<question>"
Single-shot question. Retrieves relevant chunks and sends them to the LLM.
agent ask "How is the config loaded at startup?"
agent ask "What errors can the API return?" --top-k 10
agent ask "Summarise the architecture" --stream
Every answer includes a Sources table showing which files and lines were used as context.
agent search "<query>"
Searches the indexed repository and returns the most relevant code chunks — no LLM call.
Uses hybrid search (vector similarity + BM25 keyword) by default. This catches exact function/variable name matches that pure semantic search misses.
agent search "authentication middleware"
agent search "getUserById" # exact symbol name — BM25 helps here
agent search "database schema" --top-k 10
agent search "error handling" --semantic-only # pure vector only
Each result shows: file path, line numbers, relevance score (0.000–1.000), and a code excerpt.
agent patch "<instruction>"
Retrieves relevant chunks, asks the LLM to generate a unified diff, and optionally applies it.
agent patch "Add type hints to all functions in auth.py"
agent patch "Refactor the login flow to use async/await" --dry-run # preview only
agent patch --discard-backup src/auth.py # delete the .bak file after applying
You will be asked to confirm before any file is modified. Backups are created at <file>.bak before every modification.
agent config set <key> <value>
Writes a setting to ~/.agent/config.toml.
agent config set provider gemini
agent config set api_key YOUR_KEY
agent config set top_k 8
agent config set stream true
agent config show
Displays the current effective configuration. The API key is masked (last 4 chars visible).
agent config show
agent list
Lists all indexed collections with their repository path, chunk count, and index timestamp.
agent list
agent purge <path>
Permanently deletes the indexed collection for a repository.
agent purge .
agent purge "/path/to/old-project"
agent watch [path]
Watches a directory for file changes and automatically re-indexes modified files — no manual agent index . --changed needed.
Uses incremental indexing under the hood: only changed, new, or deleted files are re-embedded. A short debounce window (default 2 s) batches rapid saves (e.g. auto-save storms) into a single re-index pass.
agent watch . # watch current directory
agent watch /path/to/project # watch a specific path
agent watch . --verbose # log every file event
agent watch . --debounce 5 # wait 5 s of quiet before re-indexing
Press Ctrl+C to stop watching.
Practical workflow for a developer
# 1. Install once (ever)
pipx install context_rag_cli
agent config set provider gemini
agent config set model gemini-1.5-flash
agent config set api_key YOUR_KEY
# 2. Go to any project
cd ~/projects/my-django-app
# 3. Index it (once per project, re-run after big changes)
agent index .
# 4. Start a chat session and explore
agent chat
You: What does this project do?
You: How is the database connection managed?
You: Where is input validation handled?
You: /search rate limiting
You: /top-k 10
You: How does the caching layer work?
You: /exit
# 5. Re-index after making changes (fast incremental update)
agent index . --changed # only re-embeds modified files
# 6. Or keep the index always fresh automatically
agent watch . # stays running; re-indexes on every save
# 7. Generate code from the terminal
agent patch "Add request ID logging to all API endpoints" --dry-run
Token savings
This agent saves 97–99%+ of LLM tokens compared to sending your full codebase. See TOKEN_SAVINGS.md for the full math and a script to audit your own codebase.
Quick numbers for a typical medium project (120,000 tokens):
| Approach | Tokens/query | GPT-4o cost |
|---|---|---|
| Full repo | 120,000 | $0.60 |
| This agent (top-k=5) | ~2,300 | $0.012 |
| Saving | 97.9% | 50x cheaper |
Architecture
The project is a single Python package (agent/) with one module per component:
agent/
├── cli.py # Typer CLI — all commands including chat REPL
├── config.py # ConfigManager — reads ~/.agent/config.toml + AGENT_* env vars
├── chunking.py # ChunkingEngine — token-aware sliding window, Chunk dataclass
├── embedding.py # EmbeddingModel — wraps all-MiniLM-L6-v2 via sentence-transformers
├── vector_store.py # VectorStore — ChromaDB adapter, atomic replace_collection()
├── search.py # SemanticSearch — orchestrates embed → query → return ScoredChunks
├── prompt_compiler.py # PromptCompiler — assembles system + context + query into a Prompt
├── llm_client.py # LLMClient + GeminiProvider / OpenAIProvider / AnthropicProvider
├── tool_executor.py # ToolExecutor — validate/apply unified diffs, .bak backups
└── errors.py # All typed error classes (AgentError hierarchy)
All data is stored in ~/.agent/ — the agent never writes to your project directory:
~/.agent/
├── config.toml # your settings
├── chroma/ # ChromaDB vector store (all indexed collections)
└── logs/
└── agent-YYYY-MM-DD.log
Key design decisions
- Privacy by architecture — the LLM only sees
PromptCompiler-assembled text. No raw files are ever sent. - Zero GPU —
EmbeddingModelsetsdevice="cpu"explicitly. No CUDA required. - Atomic re-indexing —
VectorStore.replace_collection()writes to a temp collection first, then renames. No partial state is observable during re-index. - Provider agnosticism —
LLMProvideris an abstract base class.GeminiProvider,OpenAIProvider, andAnthropicProviderare concrete implementations selected bybuild_provider(config). - Hybrid search —
HybridSearchcombines ChromaDB vector similarity with BM25 keyword scoring via Reciprocal Rank Fusion (RRF). Vector search catches semantic similarity; BM25 catches exact symbol names. RRF combines both ranked lists without needing to tune score weights. - Cross-encoder reranking — after retrieving candidates via hybrid search, a
CrossEncoderReranker(default:ms-marco-MiniLM-L-6-v2, ~22 MB, CPU-only) jointly encodes the query and each candidate passage to produce precise relevance scores. Candidates are re-ordered by cross-encoder score before the top-k are sent to the LLM. Pass--no-rerankto skip this step for faster responses. - Single retry on timeout —
LLMClientretries exactly once (30s × 2) before raisingTimeoutError. All other errors propagate immediately. - Explicit confirmation for patches —
agent patchalways prompts yes/no before modifying files.--dry-runskips both the prompt and any file writes. - Chat REPL loads components once —
agent chatloads the embedding model once at startup and reuses it for every question in the session, eliminating per-command startup latency. - Conversation memory —
agent chataccumulates up to N prior turns (default 6) and injects them as realuser/assistantmessage pairs in the prompt, enabling natural follow-up questions. Reset anytime with/clear-historyor set--history 0for stateless mode.
Development setup
# Clone and install in editable mode with dev dependencies
git clone <repo-url>
cd project_cli
pip install -e ".[dev]"
Dev dependencies include pytest, hypothesis, and pytest-mock.
Running tests
# All unit tests
python -m pytest tests/unit/
# A specific test file
python -m pytest tests/unit/test_chunking.py -v
# Integration tests (requires a configured API key and network)
python -m pytest tests/integration/ -m integration
Test structure
tests/
├── unit/
│ ├── test_chunking.py # ChunkingEngine + Chunk dataclass
│ ├── test_config.py # ConfigManager (load, set, show, validate)
│ ├── test_embedding.py # EmbeddingModel shape/dtype/determinism
│ ├── test_prompt_compiler.py # PromptCompiler compile/render
│ ├── test_search.py # SemanticSearch delegation and guards
│ ├── test_tool_executor.py # ToolExecutor validate/apply/rollback
│ └── test_vector_store.py # VectorStore CRUD and atomicity
├── property/ # Hypothesis property-based tests (optional)
└── integration/ # End-to-end tests (marked pytest.mark.integration)
229 unit tests, all passing. No internet connection required for unit tests.
Adding a new LLM provider
- Open
agent/llm_client.py - Add a new class inheriting from
LLMProviderand implementcomplete(prompt, stream, timeout) - Map HTTP errors using the shared
_raise_for_status()helper - Register the new class in the
build_provider()factory dict
Supported file types
By default the agent indexes: .py .js .ts .jsx .tsx .java .c .cpp .h .go .rs .rb .php .swift .kt .scala .sh .bash .md .txt .yaml .yml .json .toml .html .css
To add or restrict extensions, see CONFIGURATION.md.
What gets excluded from indexing
The agent automatically skips directories that should never be indexed:
Always excluded (hardcoded, no configuration needed):
| Category | Directories skipped |
|---|---|
| Version control | .git, .hg, .svn |
| Python | __pycache__, .pytest_cache, .mypy_cache, .tox, .venv, venv, *.egg-info |
| JavaScript / Node | node_modules, .next, .nuxt, .turbo |
| Build artefacts | dist, build, out, target, bin, obj |
| IDE | .idea, .vscode, .vs |
.gitignore respected — if a .gitignore file exists at the root of the indexed directory, its patterns are applied automatically. Files and directories matched by .gitignore are skipped. This uses the pathspec library with gitwildmatch semantics (the same engine git uses).
To disable .gitignore parsing, pass respect_gitignore=False when constructing ChunkingEngine directly (no CLI flag needed for most use cases).
Privacy guarantee
- Embeddings are generated locally on CPU — no data sent to HuggingFace at runtime (model is cached after first download)
- ChromaDB stores all collections on your local disk at
~/.agent/chroma/ - Only the retrieved code chunks (not full files) are included in the prompt sent to the LLM
- A log of all outbound network requests is written to
~/.agent/logs/agent-YYYY-MM-DD.log - The agent never writes any files to your project directory
Further reading
| Document | What it covers |
|---|---|
| CONFIGURATION.md | All settings, env vars, system prompt customisation |
| TOKEN_SAVINGS.md | Token audit math + script to audit your own codebase |
| ARCHITECTURE_DX.md | Deployment options, DX analysis, production considerations |
Release process
- Bump
versioninpyproject.toml git commit -m "chore: release vX.Y.Z"git tag vX.Y.Zgit push && git push --tags- Monitor the
publish.ymlworkflow on GitHub Actions — it builds and publishes to PyPI automatically
Troubleshooting
No indexed collection found — run agent index . from inside your project directory first.
Got unexpected extra argument — your path has spaces. Wrap it in quotes: agent index "C:\My Projects\app"
Path separator on Windows — use forward slashes or quotes: agent index "D:\My Projects\app"
Gemini / LLM timeout — the model took too long. Try fewer chunks: agent ask "..." --top-k 2. The agent retries once automatically (30s per attempt, 60s total max) before giving up.
Noisy httpx logs on startup (first run only) — on the very first agent index run the embedding model is downloaded from HuggingFace (~90 MB). After that, the agent sets TRANSFORMERS_OFFLINE=1 automatically so subsequent commands skip the network check entirely and start instantly. If you need to force a model update, set TRANSFORMERS_OFFLINE=0 in your shell before running.
Re-index after large changes — if answers seem stale or wrong, re-run agent index . to refresh the collection.
Multiline paste breaks into separate questions (Windows) — pasting multiline code directly into the Windows terminal sends each line as a separate Enter, triggering a Gemini API call per line (causes 429 rate limit errors). Solution: save the code to a file first, then use /snippet in agent chat:
# Save your code to a file (e.g. snippet.py), then in agent chat:
You: /snippet snippet.py What does this class do?
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file context_rag_cli-0.1.0.tar.gz.
File metadata
- Download URL: context_rag_cli-0.1.0.tar.gz
- Upload date:
- Size: 85.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
79867e5c467c6a64a49b2e243f23d6304327f97f680b32b451608b7974d43692
|
|
| MD5 |
c1b0bba4f5ed48f141c3ff4463bd0c97
|
|
| BLAKE2b-256 |
6b68c402c414e22ef102483dc02e076e044b3f36f53881b9c31d75cdfce8aa02
|
Provenance
The following attestation bundles were made for context_rag_cli-0.1.0.tar.gz:
Publisher:
publish.yml on pa1nagar/Project-cli
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
context_rag_cli-0.1.0.tar.gz -
Subject digest:
79867e5c467c6a64a49b2e243f23d6304327f97f680b32b451608b7974d43692 - Sigstore transparency entry: 1818549374
- Sigstore integration time:
-
Permalink:
pa1nagar/Project-cli@f6f0822ca53f3f1567ea21dc6025b84a2695c214 -
Branch / Tag:
refs/tags/v0.1.0 - Owner: https://github.com/pa1nagar
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@f6f0822ca53f3f1567ea21dc6025b84a2695c214 -
Trigger Event:
push
-
Statement type:
File details
Details for the file context_rag_cli-0.1.0-py3-none-any.whl.
File metadata
- Download URL: context_rag_cli-0.1.0-py3-none-any.whl
- Upload date:
- Size: 68.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3870eae44fe32c12c66d2ec3bc68188d34782eff169ee0cb700cd2a618439c0c
|
|
| MD5 |
405c9ac6c126c60e9eebc1668b6217c0
|
|
| BLAKE2b-256 |
95b309794be72e95e52540df020e219ffe77fce3d42e440a619eb3b945c55aa1
|
Provenance
The following attestation bundles were made for context_rag_cli-0.1.0-py3-none-any.whl:
Publisher:
publish.yml on pa1nagar/Project-cli
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
context_rag_cli-0.1.0-py3-none-any.whl -
Subject digest:
3870eae44fe32c12c66d2ec3bc68188d34782eff169ee0cb700cd2a618439c0c - Sigstore transparency entry: 1818549895
- Sigstore integration time:
-
Permalink:
pa1nagar/Project-cli@f6f0822ca53f3f1567ea21dc6025b84a2695c214 -
Branch / Tag:
refs/tags/v0.1.0 - Owner: https://github.com/pa1nagar
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@f6f0822ca53f3f1567ea21dc6025b84a2695c214 -
Trigger Event:
push
-
Statement type: