Skip to main content

Privacy-preserving CLI AI agent for querying local code repositories using RAG

Project description

Autonomous RAG Agent

A privacy-preserving command-line AI assistant for your local code repositories. Index any codebase, search it with natural language, ask questions in an interactive chat session, and generate code patches — all without uploading your source files to the cloud.

Only semantically selected code snippets are sent to the LLM. Raw repository files never leave your machine.


How it works

Your repo  →  Chunk  →  Embed (local CPU)  →  ChromaDB (local disk)
                                                      ↓
Your query  →  Embed  →  Retrieve top-K chunks  →  Gemini / GPT / Claude
                                                      ↓
                                               Answer + sources
  1. Index — scans your repo, splits files into overlapping chunks, generates embeddings locally using all-MiniLM-L6-v2 (runs on CPU, no GPU needed), stores everything in ChromaDB on your disk.
  2. Chat / Ask / Search / Patch — your query is embedded the same way, the most relevant chunks are retrieved, and only those chunks are sent to the LLM along with your question.

Installation

Prerequisites

  • Python 3.10 or newer
  • pip

Install (recommended — pipx keeps it isolated)

pipx install context_rag_cli

Install (standard pip)

pip install context_rag_cli

Install from source

git clone <repo-url>
cd project_cli
pip install -e .

Verify the install:

agent --help

CPU-only PyTorch (saves ~1 GB): The default torch wheel from PyPI includes CUDA support. To install the lighter CPU-only build, see INSTALL.md.

The agent stores all its data in ~/.agent/ — it never writes files to your project directory.


Quick start — using it on your own project

Step 1 — Configure your LLM provider

Pick one of the three supported providers and set your API key:

# Gemini (Google)
agent config set provider gemini
agent config set model gemini-1.5-flash
agent config set api_key YOUR_GEMINI_API_KEY

# OpenAI
agent config set provider openai
agent config set model gpt-4o-mini
agent config set api_key YOUR_OPENAI_API_KEY

# Anthropic (Claude)
agent config set provider anthropic
agent config set model claude-3-5-sonnet-20241022
agent config set api_key YOUR_ANTHROPIC_API_KEY

Confirm everything looks right:

agent config show

Step 2 — Index your project

Navigate to any project directory and index it:

cd /path/to/your/project
agent index .

Or index a specific path from anywhere:

agent index "/path/to/your/project"

This step runs entirely locally. It may take a minute on the first run while the embedding model is downloaded (~90 MB, cached after that). All subsequent runs are fast.

Step 3 — Start a chat session (recommended)

The best way to explore a codebase is the interactive chat mode. The embedding model loads once and you type questions directly. The agent remembers the last 6 turns by default, so follow-up questions work naturally:

cd /path/to/your/project
agent chat
RAG Agent — Interactive Chat
  Project : /path/to/your/project
  Provider: gemini / gemini-1.5-flash
  Top-K   : 5  |  Memory: 6 turns

You: How does authentication work?
Agent: The authentication flow starts in auth.py...

You: What errors can it throw?          ← follow-up works
Agent: Based on our discussion of the auth flow,
       it raises AuthError on 401/403...

You: /search JWT token
# Shows raw search results without LLM

You: /top-k 10
# Now retrieves 10 chunks per question

You: /clear-history
# Forget conversation history without restarting

You: /exit

Step 4 — Or use single commands

agent ask "How does authentication work in this project?"
agent ask "Where is the database connection configured?"
agent search "error handling"
agent search "database query" --top-k 10

Step 5 — Generate a patch

agent patch "Add input validation to the login endpoint"
agent patch "Add docstrings to all public functions in utils.py" --dry-run

All commands

agent chat ⭐ recommended

Start an interactive chat session. The embedding model and vector store load once — no startup delay on every question. Conversation memory is built in — the agent remembers your last N turns so follow-up questions work naturally.

agent chat                    # default top-k 5, 6 turns memory, buffered responses
agent chat --stream           # stream tokens as they arrive
agent chat --top-k 8          # retrieve 8 chunks per question
agent chat --history 10       # keep last 10 turns in memory
agent chat --history 0        # stateless mode (no memory)
agent chat --verbose          # show source files after each answer

In-session commands:

Command Description
/exit or /quit End the session
/search <query> Search without calling the LLM
/snippet <file> [question] Load an entire file as pinned context
/top-k <N> Change chunk count for the rest of the session
/clear Clear the screen
/clear-history Forget conversation history without restarting
/help Show available commands
Ctrl+C Interrupt a slow response without killing the session

Asking about a specific file or code snippet — /snippet:

The /snippet command loads any file directly into the prompt as the highest-priority context. No indexing required. This is the reliable way to ask about a specific file on Windows — avoids the terminal paste problem where multiline text fires as separate questions:

You: /snippet agent/llm_client.py What does TokenUsage do?
Agent: TokenUsage is a dataclass that tracks token usage...

You: /snippet mycode.py
Question about this file: Explain the retry logic
Agent: ...

You: /snippet config.yaml Is this configuration valid?
Agent: ...

The file content is injected first (highest priority), followed by RAG-retrieved chunks for supporting codebase context.

Paste mode — short snippets typed manually:

Start your message with ``` to enter paste mode. Type or paste line by line, then close with ``` on its own line:

You: ```
  > def hello():
  >     return "world"
  > ```
Snippet captured (2 lines). Now type your question.
You: What does this function return?
Agent: It returns the string "world".

Windows tip: Pasting multiline code directly into the terminal sends each line as a separate message, triggering a Gemini call for each line (rate limit errors). Always save the code to a file and use /snippet <file> for anything larger than 2-3 lines.

Conversation memory example:

You: How does authentication work?
Agent: The auth module uses JWT tokens validated in auth.py...

You: What about error handling in that?   ← follow-up works
Agent: Based on our discussion of the JWT flow, errors are
       raised as AuthError (401/403) which...

You: /clear-history               ← start fresh without restarting
Conversation history cleared.

You: /exit

agent index <path>

Indexes a repository. Creates or replaces the local ChromaDB collection.

agent index .
agent index "/path/to/project"
agent index . --verbose        # show detailed timing logs
agent index . --changed        # only re-index files that changed since last run

--changed compares each file's SHA-256 hash against the manifest saved during the previous index. Only new or modified files are re-embedded; deleted files have their chunks removed. Falls back to a full index if no previous index exists.

Always prints file count, chunk count, and elapsed time on completion.


agent ask "<question>"

Single-shot question. Retrieves relevant chunks and sends them to the LLM.

agent ask "How is the config loaded at startup?"
agent ask "What errors can the API return?" --top-k 10
agent ask "Summarise the architecture" --stream

Every answer includes a Sources table showing which files and lines were used as context.


agent search "<query>"

Searches the indexed repository and returns the most relevant code chunks — no LLM call.

Uses hybrid search (vector similarity + BM25 keyword) by default. This catches exact function/variable name matches that pure semantic search misses.

agent search "authentication middleware"
agent search "getUserById"              # exact symbol name — BM25 helps here
agent search "database schema" --top-k 10
agent search "error handling" --semantic-only   # pure vector only

Each result shows: file path, line numbers, relevance score (0.000–1.000), and a code excerpt.


agent patch "<instruction>"

Retrieves relevant chunks, asks the LLM to generate a unified diff, and optionally applies it.

agent patch "Add type hints to all functions in auth.py"
agent patch "Refactor the login flow to use async/await" --dry-run    # preview only
agent patch --discard-backup src/auth.py    # delete the .bak file after applying

You will be asked to confirm before any file is modified. Backups are created at <file>.bak before every modification.


agent config set <key> <value>

Writes a setting to ~/.agent/config.toml.

agent config set provider gemini
agent config set api_key YOUR_KEY
agent config set top_k 8
agent config set stream true

agent config show

Displays the current effective configuration. The API key is masked (last 4 chars visible).

agent config show

agent list

Lists all indexed collections with their repository path, chunk count, and index timestamp.

agent list

agent purge <path>

Permanently deletes the indexed collection for a repository.

agent purge .
agent purge "/path/to/old-project"

agent watch [path]

Watches a directory for file changes and automatically re-indexes modified files — no manual agent index . --changed needed.

Uses incremental indexing under the hood: only changed, new, or deleted files are re-embedded. A short debounce window (default 2 s) batches rapid saves (e.g. auto-save storms) into a single re-index pass.

agent watch .                   # watch current directory
agent watch /path/to/project    # watch a specific path
agent watch . --verbose         # log every file event
agent watch . --debounce 5      # wait 5 s of quiet before re-indexing

Press Ctrl+C to stop watching.


Practical workflow for a developer

# 1. Install once (ever)
pipx install context_rag_cli
agent config set provider gemini
agent config set model gemini-1.5-flash
agent config set api_key YOUR_KEY

# 2. Go to any project
cd ~/projects/my-django-app

# 3. Index it (once per project, re-run after big changes)
agent index .

# 4. Start a chat session and explore
agent chat

You: What does this project do?
You: How is the database connection managed?
You: Where is input validation handled?
You: /search rate limiting
You: /top-k 10
You: How does the caching layer work?
You: /exit

# 5. Re-index after making changes (fast incremental update)
agent index . --changed   # only re-embeds modified files

# 6. Or keep the index always fresh automatically
agent watch .             # stays running; re-indexes on every save

# 7. Generate code from the terminal
agent patch "Add request ID logging to all API endpoints" --dry-run

Token savings

This agent saves 97–99%+ of LLM tokens compared to sending your full codebase. See TOKEN_SAVINGS.md for the full math and a script to audit your own codebase.

Quick numbers for a typical medium project (120,000 tokens):

Approach Tokens/query GPT-4o cost
Full repo 120,000 $0.60
This agent (top-k=5) ~2,300 $0.012
Saving 97.9% 50x cheaper

Architecture

The project is a single Python package (agent/) with one module per component:

agent/
├── cli.py              # Typer CLI — all commands including chat REPL
├── config.py           # ConfigManager — reads ~/.agent/config.toml + AGENT_* env vars
├── chunking.py         # ChunkingEngine — token-aware sliding window, Chunk dataclass
├── embedding.py        # EmbeddingModel — wraps all-MiniLM-L6-v2 via sentence-transformers
├── vector_store.py     # VectorStore — ChromaDB adapter, atomic replace_collection()
├── search.py           # SemanticSearch — orchestrates embed → query → return ScoredChunks
├── prompt_compiler.py  # PromptCompiler — assembles system + context + query into a Prompt
├── llm_client.py       # LLMClient + GeminiProvider / OpenAIProvider / AnthropicProvider
├── tool_executor.py    # ToolExecutor — validate/apply unified diffs, .bak backups
└── errors.py           # All typed error classes (AgentError hierarchy)

All data is stored in ~/.agent/ — the agent never writes to your project directory:

~/.agent/
├── config.toml          # your settings
├── chroma/              # ChromaDB vector store (all indexed collections)
└── logs/
    └── agent-YYYY-MM-DD.log

Key design decisions

  • Privacy by architecture — the LLM only sees PromptCompiler-assembled text. No raw files are ever sent.
  • Zero GPUEmbeddingModel sets device="cpu" explicitly. No CUDA required.
  • Atomic re-indexingVectorStore.replace_collection() writes to a temp collection first, then renames. No partial state is observable during re-index.
  • Provider agnosticismLLMProvider is an abstract base class. GeminiProvider, OpenAIProvider, and AnthropicProvider are concrete implementations selected by build_provider(config).
  • Hybrid searchHybridSearch combines ChromaDB vector similarity with BM25 keyword scoring via Reciprocal Rank Fusion (RRF). Vector search catches semantic similarity; BM25 catches exact symbol names. RRF combines both ranked lists without needing to tune score weights.
  • Cross-encoder reranking — after retrieving candidates via hybrid search, a CrossEncoderReranker (default: ms-marco-MiniLM-L-6-v2, ~22 MB, CPU-only) jointly encodes the query and each candidate passage to produce precise relevance scores. Candidates are re-ordered by cross-encoder score before the top-k are sent to the LLM. Pass --no-rerank to skip this step for faster responses.
  • Single retry on timeoutLLMClient retries exactly once (30s × 2) before raising TimeoutError. All other errors propagate immediately.
  • Explicit confirmation for patchesagent patch always prompts yes/no before modifying files. --dry-run skips both the prompt and any file writes.
  • Chat REPL loads components onceagent chat loads the embedding model once at startup and reuses it for every question in the session, eliminating per-command startup latency.
  • Conversation memoryagent chat accumulates up to N prior turns (default 6) and injects them as real user/assistant message pairs in the prompt, enabling natural follow-up questions. Reset anytime with /clear-history or set --history 0 for stateless mode.

Development setup

# Clone and install in editable mode with dev dependencies
git clone <repo-url>
cd project_cli
pip install -e ".[dev]"

Dev dependencies include pytest, hypothesis, and pytest-mock.

Running tests

# All unit tests
python -m pytest tests/unit/

# A specific test file
python -m pytest tests/unit/test_chunking.py -v

# Integration tests (requires a configured API key and network)
python -m pytest tests/integration/ -m integration

Test structure

tests/
├── unit/
│   ├── test_chunking.py        # ChunkingEngine + Chunk dataclass
│   ├── test_config.py          # ConfigManager (load, set, show, validate)
│   ├── test_embedding.py       # EmbeddingModel shape/dtype/determinism
│   ├── test_prompt_compiler.py # PromptCompiler compile/render
│   ├── test_search.py          # SemanticSearch delegation and guards
│   ├── test_tool_executor.py   # ToolExecutor validate/apply/rollback
│   └── test_vector_store.py    # VectorStore CRUD and atomicity
├── property/                   # Hypothesis property-based tests (optional)
└── integration/                # End-to-end tests (marked pytest.mark.integration)

229 unit tests, all passing. No internet connection required for unit tests.

Adding a new LLM provider

  1. Open agent/llm_client.py
  2. Add a new class inheriting from LLMProvider and implement complete(prompt, stream, timeout)
  3. Map HTTP errors using the shared _raise_for_status() helper
  4. Register the new class in the build_provider() factory dict

Supported file types

By default the agent indexes: .py .js .ts .jsx .tsx .java .c .cpp .h .go .rs .rb .php .swift .kt .scala .sh .bash .md .txt .yaml .yml .json .toml .html .css

To add or restrict extensions, see CONFIGURATION.md.


What gets excluded from indexing

The agent automatically skips directories that should never be indexed:

Always excluded (hardcoded, no configuration needed):

Category Directories skipped
Version control .git, .hg, .svn
Python __pycache__, .pytest_cache, .mypy_cache, .tox, .venv, venv, *.egg-info
JavaScript / Node node_modules, .next, .nuxt, .turbo
Build artefacts dist, build, out, target, bin, obj
IDE .idea, .vscode, .vs

.gitignore respected — if a .gitignore file exists at the root of the indexed directory, its patterns are applied automatically. Files and directories matched by .gitignore are skipped. This uses the pathspec library with gitwildmatch semantics (the same engine git uses).

To disable .gitignore parsing, pass respect_gitignore=False when constructing ChunkingEngine directly (no CLI flag needed for most use cases).


Privacy guarantee

  • Embeddings are generated locally on CPU — no data sent to HuggingFace at runtime (model is cached after first download)
  • ChromaDB stores all collections on your local disk at ~/.agent/chroma/
  • Only the retrieved code chunks (not full files) are included in the prompt sent to the LLM
  • A log of all outbound network requests is written to ~/.agent/logs/agent-YYYY-MM-DD.log
  • The agent never writes any files to your project directory

Further reading

Document What it covers
CONFIGURATION.md All settings, env vars, system prompt customisation
TOKEN_SAVINGS.md Token audit math + script to audit your own codebase
ARCHITECTURE_DX.md Deployment options, DX analysis, production considerations

Release process

  1. Bump version in pyproject.toml
  2. git commit -m "chore: release vX.Y.Z"
  3. git tag vX.Y.Z
  4. git push && git push --tags
  5. Monitor the publish.yml workflow on GitHub Actions — it builds and publishes to PyPI automatically

Troubleshooting

No indexed collection found — run agent index . from inside your project directory first.

Got unexpected extra argument — your path has spaces. Wrap it in quotes: agent index "C:\My Projects\app"

Path separator on Windows — use forward slashes or quotes: agent index "D:\My Projects\app"

Gemini / LLM timeout — the model took too long. Try fewer chunks: agent ask "..." --top-k 2. The agent retries once automatically (30s per attempt, 60s total max) before giving up.

Noisy httpx logs on startup (first run only) — on the very first agent index run the embedding model is downloaded from HuggingFace (~90 MB). After that, the agent sets TRANSFORMERS_OFFLINE=1 automatically so subsequent commands skip the network check entirely and start instantly. If you need to force a model update, set TRANSFORMERS_OFFLINE=0 in your shell before running.

Re-index after large changes — if answers seem stale or wrong, re-run agent index . to refresh the collection.

Multiline paste breaks into separate questions (Windows) — pasting multiline code directly into the Windows terminal sends each line as a separate Enter, triggering a Gemini API call per line (causes 429 rate limit errors). Solution: save the code to a file first, then use /snippet in agent chat:

# Save your code to a file (e.g. snippet.py), then in agent chat:
You: /snippet snippet.py What does this class do?

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

context_rag_cli-0.1.0.tar.gz (85.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

context_rag_cli-0.1.0-py3-none-any.whl (68.2 kB view details)

Uploaded Python 3

File details

Details for the file context_rag_cli-0.1.0.tar.gz.

File metadata

  • Download URL: context_rag_cli-0.1.0.tar.gz
  • Upload date:
  • Size: 85.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for context_rag_cli-0.1.0.tar.gz
Algorithm Hash digest
SHA256 79867e5c467c6a64a49b2e243f23d6304327f97f680b32b451608b7974d43692
MD5 c1b0bba4f5ed48f141c3ff4463bd0c97
BLAKE2b-256 6b68c402c414e22ef102483dc02e076e044b3f36f53881b9c31d75cdfce8aa02

See more details on using hashes here.

Provenance

The following attestation bundles were made for context_rag_cli-0.1.0.tar.gz:

Publisher: publish.yml on pa1nagar/Project-cli

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file context_rag_cli-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: context_rag_cli-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 68.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for context_rag_cli-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 3870eae44fe32c12c66d2ec3bc68188d34782eff169ee0cb700cd2a618439c0c
MD5 405c9ac6c126c60e9eebc1668b6217c0
BLAKE2b-256 95b309794be72e95e52540df020e219ffe77fce3d42e440a619eb3b945c55aa1

See more details on using hashes here.

Provenance

The following attestation bundles were made for context_rag_cli-0.1.0-py3-none-any.whl:

Publisher: publish.yml on pa1nagar/Project-cli

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page