Token-saving memory layer for AI coding assistants — local hybrid RAG (BM25 + semantic + reranker) over your project, exposed via MCP.
Tokengram — Hybrid RAG Memory Layer for AI Assistants
Azerbaijani version: README.az.md
Local MCP server that lets Claude Code (and other MCP-capable AI assistants)
search your project via hybrid RAG instead of Read-ing whole files.
Measured token savings: ~70% on a typical Python project.
Progressive RAM: 93 MB (BM25-only) → 700 MB (semantic) → 900 MB (+ reranker).
What it is
When you ask Claude Code a question about your code, the assistant normally reads whole files into its context window — 5–10K input tokens per query. Tokengram pre-vectorizes your project and returns only the 3–5 most relevant chunks when Claude asks. So:
- ⚡ ~70% fewer input tokens → cheaper API calls
- 🔒 Local-only — your code never leaves the machine (local embeddings, local search)
- ♻️ Auto re-index — saved files are picked up by the MCP server automatically
- 🎯 Hybrid search — semantic (meaning) + BM25 (keyword) + cross-encoder reranker
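This README does not say how the keyword and semantic rankings are merged. One common approach is reciprocal rank fusion, and the sketch below assumes that technique purely for illustration. The function name, the chunk IDs, and the constant k=60 are hypothetical and not taken from Tokengram's code.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk IDs into one list.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    so chunks ranked well by more than one retriever rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk ID for stability.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical chunk IDs from each retriever, best match first.
bm25_hits = ["auth.py:12", "login.py:40", "utils.py:3"]
semantic_hits = ["login.py:40", "session.py:8", "auth.py:12"]

fused = reciprocal_rank_fusion([bm25_hits, semantic_hits])
```

A cross-encoder reranker would then re-score only the first few entries of the fused list, which is why it is typically the most expensive and last stage.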
When it works
- ✅ Medium-to-large Python / JS / Go / Rust projects (50+ files)
- ✅ Contextual questions: "where is this function called from?"
- ✅ Refactoring (gathering all touch points)
When it doesn't (yet)
- ❌ Projects with < 20 files — Claude can just read them all
- ❌ Machines with < 4 GB RAM (embedding + reranker need headroom)
- ❌ Systems without Python installed
Install (~3 minutes)
Requirements
- Python 3.10+
- VS Code + Claude Code extension
- 4 GB RAM (8 GB recommended)
- ~500 MB disk (for model cache)
Windows
git clone <repo-url> mla-agent
cd mla-agent
./setup.ps1
macOS / Linux
git clone <repo-url> mla-agent
cd mla-agent
./setup.sh
After install
- Close VS Code completely and reopen it (so it picks up the new MCP server entry).
- In Claude Code, run /mcp; mla-agent should appear as connected.
- Ask a question:
"Use mla-agent.search_code — where is authentication handled?"
Claude calls the MCP tool automatically, receives the top-4 matching chunks as context, and answers.
Indexing another project
Each project needs its own index.
Option 1 — run setup again with a project path
./setup.ps1 -ProjectDir "C:\path\to\my-project"
Option 2 — direct command
python main.py index /path/to/my-project --fresh
Note: chroma_db/ currently lives inside the Tokengram folder, so only one project can be indexed at a time. Multi-project support is on the roadmap.
CLI
python main.py index <dir> [--fresh] # Index a project
python main.py stats # Index statistics
python main.py route <dir> # Smart router suggestion
python main.py reset --yes # Wipe the index
python main.py watch <dir> # Auto re-index (Ctrl+C to stop)
python main.py ask "question" # CLI-only (requires API key)
MCP tools
Claude Code discovers these automatically and calls them on demand:
| Tool | What it does |
|---|---|
| search_code(query, top_k, mode, file_type) | 3-tier search (see below) |
| index_directory(path, fresh) | Index a project |
| index_file(path) | Re-index a single file |
| get_stats() | Index statistics |
| route_project(path) | Project-size-aware recommendation |
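Claude Code builds the tool calls itself, but it can help to see what one looks like on the wire. Under MCP a client invokes a tool with a JSON-RPC 2.0 tools/call request. The sketch below constructs such a message for search_code; the helper name build_tool_call and the argument values are illustrative assumptions, and file_type is omitted because its accepted values are not documented here.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Example values only; the assistant chooses these at run time.
request = build_tool_call(
    1,
    "search_code",
    {"query": "where is authentication handled?", "top_k": 4, "mode": "fast"},
)
payload = json.dumps(request)
```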
search_code modes (progressive RAM)
| mode | RAM delta | First call | Best for |
|---|---|---|---|
| "fast" (default) | 0 MB | ~50 ms | Function / class / file-name lookup, syntactic search |
| "semantic" | +400–700 MB | 30–60 s cold | Conceptual queries ("how does auth work?") |
| "rerank" | +200 MB | +30 s first | Maximum ranking quality |
Ideally Claude defaults to "fast". If a syntactic result is enough, it
stops. For conceptual answers it escalates to "semantic". Reranker is
reserved for the highest-stakes picks.
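The escalation described above can be sketched as a loop over the three modes in order of RAM cost. In practice the assistant makes this decision, not a script; the search callable, the is_good_enough check, and fake_search below are hypothetical stand-ins so the example runs without a server.

```python
MODES = ("fast", "semantic", "rerank")


def escalating_search(search, query, is_good_enough, top_k=4):
    """Try each mode in order of RAM cost and stop at the first adequate result."""
    results, mode = [], None
    for mode in MODES:
        results = search(query, top_k=top_k, mode=mode)
        if is_good_enough(results):
            break
    return mode, results


# Stand-in for the real tool: "fast" finds nothing, other modes return one hit.
def fake_search(query, top_k, mode):
    return [] if mode == "fast" else [f"{mode} hit for {query!r}"]


# `bool` treats an empty result list as "not good enough".
used_mode, hits = escalating_search(fake_search, "how does auth work?", bool)
```

Stopping at the first adequate mode means the heavier models are only loaded when a query actually needs them.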
Slash commands
Drop-in actions inside Claude Code (each wraps the back-end Python script of the same name; the VS Code status-bar QuickPick calls them too):
| Command | What it does |
|---|---|
| /MlaOn | Enable Read-enforcement (block large indexed files) |
| /MlaOff | Disable enforcement |
| /MlaRefreshBar | Force the status bar to refresh |
| /MlaResetBar | Reset savings history (old history is backed up) |
| /MlaArch | Show the auto-generated architecture doc |
| /MlaSync | Manually flush session log → history |
| /MlaReindex | Full re-index after extension list changes |
See SLASH_COMMANDS.md for behaviour details and the 4-source token-tracking model.
Supported file types (36 extensions)
.py .sol .sql .md
.ts .tsx .js .jsx .mjs .cjs
.go .rs .java .kt .cs .cpp .cc .cxx .c .h .hpp
.rb .php .swift .dart
.html .css .scss
.json .yaml .yml .toml
.sh .bash .ps1 .tf
Excluded by default: node_modules/, dist/, build/, target/,
.venv/, .next/, … plus lock files (package-lock.json, Cargo.lock,
go.sum, …), minified bundles (*.min.js, *.bundle.js), source maps,
and any file > 1 MB. All filters live in mla_constants.py.
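A minimal sketch of how such filters might be applied to a candidate file. The constant names, the shortened lists, the .map suffix for source maps, and the reading of 1 MB as 1024 * 1024 bytes are all assumptions for illustration; the authoritative lists are in mla_constants.py.

```python
from pathlib import PurePosixPath

# Illustrative subsets only; names and contents here are hypothetical.
SUPPORTED_EXTENSIONS = {".py", ".ts", ".js", ".go", ".rs", ".md", ".json"}
EXCLUDED_DIRS = {"node_modules", "dist", "build", "target", ".venv", ".next"}
EXCLUDED_NAMES = {"package-lock.json", "Cargo.lock", "go.sum"}
EXCLUDED_SUFFIXES = (".min.js", ".bundle.js", ".map")
MAX_FILE_BYTES = 1024 * 1024  # assumed meaning of "any file > 1 MB"


def should_index(rel_path, size_bytes):
    """Return True if a project-relative path passes every filter."""
    path = PurePosixPath(rel_path)
    # Reject anything that sits under an excluded directory.
    if any(part in EXCLUDED_DIRS for part in path.parts[:-1]):
        return False
    # Reject lock files, minified bundles, and source maps by name.
    if path.name in EXCLUDED_NAMES or path.name.endswith(EXCLUDED_SUFFIXES):
        return False
    if size_bytes > MAX_FILE_BYTES:
        return False
    return path.suffix.lower() in SUPPORTED_EXTENSIONS
```

Checking the exclusions before the extension matters: package-lock.json has a supported extension but is still skipped.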
Localization
User-facing strings are localized. Two languages are bundled out of the box: English (default) and Azerbaijani.
# Linux / macOS
export MLA_LANG=az # Azerbaijani
export MLA_LANG=en # English (default)
# Windows PowerShell
$env:MLA_LANG = 'az'
The VS Code extension picks its locale from vscode.env.language (your
VS Code display language).
To add another language: copy messages/en.json to
messages/<code>.json, translate the values, and add the code to
_SUPPORTED in i18n.py. For the extension, add a new entry to
MESSAGES and to detectLocale() in
vscode-extension/src/i18n.ts.
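The lookup behind tr() might work roughly as follows: read MLA_LANG, fall back to English for unsupported codes, and fall back to the English string for missing keys. This is a sketch under those assumptions, not the contents of i18n.py; the message key, both strings, and the in-memory dictionary standing in for messages/<code>.json are invented.

```python
import os

_SUPPORTED = ("en", "az")
_DEFAULT = "en"

# In-memory stand-in for messages/<code>.json; key and strings are invented.
_MESSAGES = {
    "en": {"index.done": "Indexed {count} files"},
    "az": {"index.done": "{count} fayl indeksləndi"},
}


def current_lang(env=None):
    """Resolve the active language code from MLA_LANG, defaulting to English."""
    env = os.environ if env is None else env
    lang = env.get("MLA_LANG", _DEFAULT).lower()
    return lang if lang in _SUPPORTED else _DEFAULT


def tr(key, env=None, **params):
    """Look up a message, falling back to English, then to the key itself."""
    lang = current_lang(env)
    template = _MESSAGES[lang].get(key) or _MESSAGES[_DEFAULT].get(key, key)
    return template.format(**params)
```

For example, tr("index.done", env={"MLA_LANG": "az"}, count=3) uses the Azerbaijani template, while an unsupported code such as "fr" falls back to English.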
VS Code status bar
The widget at the bottom-right shows a live X.Xk saved · YY% counter
based on the last 24 hours of activity. Hovering reveals a breakdown
(search / Grep / Read-block) plus lifetime totals.
Click the widget to open a QuickPick with all 6 actions (toggle, refresh, arch doc, sync, re-index, reset) — no terminal typing needed.
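How the counter text could be derived from raw token counts, as a rough sketch. It assumes the percentage is saved tokens divided by the tokens that would have been used without Tokengram; the function name and that definition are assumptions, not the extension's actual formula.

```python
def format_savings(saved_tokens, baseline_tokens):
    """Format a status-bar label such as '3.5k saved · 70%'."""
    # Guard against an empty window (no activity in the last 24 hours).
    percent = round(100 * saved_tokens / baseline_tokens) if baseline_tokens else 0
    return f"{saved_tokens / 1000:.1f}k saved · {percent}%"


label = format_savings(3500, 5000)
```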
Auto re-index
Two parallel mechanisms feed the same queue (.mla_pending.txt); the
MCP server's background worker drains it:
1. Built-in watcher (default; works in every IDE)
The MCP server uses watchdog to watch the project folder. VS Code,
Antigravity, Cursor, plain editors — all good. No hooks needed.
- Override the watched directory: MLA_WATCH_DIR=/path/to/project
- Disable: MLA_DISABLE_WATCHER=1
- Debounce: 2 s (rapid saves are coalesced)
2. Claude Code hooks (Claude Code only)
- PostToolUse (~50 ms) after every Edit / Write / MultiEdit: queues the file
- Stop: drains the queue and updates the architecture doc
Both mechanisms cooperate: the drainer hashes + deduplicates, so no file is re-indexed twice.
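The "hashes + deduplicates" step could look something like the sketch below. It assumes the drainer hashes file contents with SHA-256 and remembers the last indexed hash per path; the source does not state what is hashed or how, and the function name and in-memory file table are hypothetical.

```python
import hashlib


def drain_queue(pending_paths, read_bytes, indexed_hashes):
    """Return the queued paths that actually need re-indexing.

    pending_paths: paths in the order they were queued (may repeat).
    read_bytes: callable returning a file's current content as bytes.
    indexed_hashes: dict of path -> content hash from the last index run;
        it is updated in place.
    """
    to_reindex = []
    for path in dict.fromkeys(pending_paths):  # drop repeats, keep order
        digest = hashlib.sha256(read_bytes(path)).hexdigest()
        if indexed_hashes.get(path) != digest:
            indexed_hashes[path] = digest
            to_reindex.append(path)
    return to_reindex


# In-memory stand-in for the file system.
files = {"a.py": b"print(1)\n", "b.py": b"print(2)\n"}
seen = {}
first = drain_queue(["a.py", "b.py", "a.py"], files.__getitem__, seen)
second = drain_queue(["a.py", "b.py"], files.__getitem__, seen)
files["b.py"] = b"print(3)\n"
third = drain_queue(["b.py", "b.py"], files.__getitem__, seen)
```

With this scheme it does not matter whether the watcher, the hook, or both queued a file: an unchanged file is skipped and a changed one is indexed once per drain.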
FAQ
Do I need an API key?
No. When you use Tokengram via the MCP server inside Claude Code, all
reasoning happens in Claude. An API key is only needed for the standalone
CLI command python main.py ask.
Why is the first search slow?
The first search_code call that needs the embedding model takes 10–30 s while the model loads into RAM. Subsequent calls are ~100 ms.
What's the disk footprint?
- Embedding model: ~130 MB (bge-small-en-v1.5)
- Reranker model: ~90 MB (ms-marco-MiniLM-L-6-v2)
- ChromaDB index: ~5–10 KB per file
- Total for a 100-file project: ~220 MB.
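The total above follows directly from the listed figures; the short calculation below reproduces it. The constant and function names are illustrative, and the conversion assumes 1 MB = 1000 KB.

```python
EMBEDDING_MODEL_MB = 130
RERANKER_MODEL_MB = 90
INDEX_KB_PER_FILE = (5, 10)  # low and high estimate per indexed file


def disk_footprint_mb(n_files):
    """Return (low, high) disk usage in MB for a project of n_files files."""
    models = EMBEDDING_MODEL_MB + RERANKER_MODEL_MB
    low, high = (models + n_files * kb / 1000 for kb in INDEX_KB_PER_FILE)
    return low, high


low_mb, high_mb = disk_footprint_mb(100)
```

For 100 files the index adds only 0.5 to 1 MB, so the two models account for nearly all of the ~220 MB.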
Something broke?
Check PROBLEMS.md — 14 issues encountered during development and how each was resolved. The test pass record is in TEST_REPORT.md (66/66 green).
Project layout
mla-agent/
├── config.py # All parameters (re-exports from mla_constants)
├── mla_constants.py # Single source for extension / dir / file filters
├── i18n.py # tr() — localized message lookup
├── messages/
│ ├── en.json # English strings (default)
│ └── az.json # Azerbaijani strings
├── document_loader.py # AST + regex + fallback chunking
├── vector_store.py # ChromaDB wrapper
├── search_engine.py # Hybrid (semantic + BM25 + reranker)
├── llm_agent.py # Standalone CLI agent (optional)
├── smart_router.py # Project-size-aware mode selector
├── file_watcher.py # Standalone watchdog (optional)
├── main.py # CLI entry point
├── mcp_server.py # MCP server for Claude Code
├── hooks/ # All Stop / PreToolUse / PostToolUse hooks
├── tests/ # 5 phase suites, 66 cases total
├── vscode-extension/ # Status-bar widget + QuickPick menu
├── .mcp.json # Claude Code MCP registration
├── .claude/
│ ├── settings.json # Hook configuration
│ └── commands/ # Slash command definitions
├── setup.ps1, setup.sh # Install scripts
├── TEST_PLAN.md, TEST_REPORT.md
└── PROBLEMS.md
License
Commercial. See vscode-extension/LICENSE.txt. Feedback: n.ilkin.humbatov@gmail.com
File details
Details for the file tokengram-0.1.0-py3-none-any.whl.
File metadata
- Download URL: tokengram-0.1.0-py3-none-any.whl
- Upload date:
- Size: 736.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8fa4056ce365b99fe4bcebcf82864214df9a906f2a3e21a2a460592ad4ed5ff5 |
| MD5 | c34658714ae6d5a5a5f05e5ba1f7c572 |
| BLAKE2b-256 | 44bd4909875205a2c3288091efb5b328ebc2670c4ce999ef1f278565e239da5e |