# 🦖 VelociRAG

Lightning-fast RAG for AI agents.

Four-layer retrieval fusion powered by ONNX Runtime. No PyTorch. Sub-200ms warm search. Incremental graph updates. MCP-ready.
Most RAG solutions either drag in 2GB+ of PyTorch or limit you to single-layer vector search. VelociRAG gives you four retrieval methods: vector similarity, BM25 keyword matching, knowledge graph traversal, and metadata filtering, fused through reciprocal rank fusion with cross-encoder reranking. All of it runs on ONNX Runtime, with no GPU and no API keys. It comes with an MCP server for agent integration, a Unix socket daemon for warm queries, and a CLI that just works.
## 🚀 Quick Start
### MCP Server (Claude, Cursor, Windsurf)
pip install "velocirag[mcp]"
velocirag index ./my-docs
velocirag mcp
Claude Code: add to `.mcp.json` in your project root:
```json
{
  "mcpServers": {
    "velocirag": {
      "command": "velocirag",
      "args": ["mcp"],
      "env": { "VELOCIRAG_DB": "/path/to/data" }
    }
  }
}
```
Then open `/mcp` in Claude Code and enable the velocirag server. If using a virtualenv, use the full path to the binary (e.g. `.venv/bin/velocirag`).
Claude Desktop: add to `claude_desktop_config.json`:
```json
{
  "mcpServers": {
    "velocirag": {
      "command": "velocirag",
      "args": ["mcp", "--db", "/path/to/data"]
    }
  }
}
```
Cursor: add to `.cursor/mcp.json`:
```json
{
  "mcpServers": {
    "velocirag": {
      "command": "velocirag",
      "args": ["mcp", "--db", "/path/to/data"]
    }
  }
}
```
### Python API
```python
from velocirag import Embedder, VectorStore, Searcher

embedder = Embedder()
store = VectorStore('./my-db', embedder)
store.add_directory('./my-docs')

searcher = Searcher(store, embedder)
results = searcher.search('query', limit=5)
```
### CLI
```bash
pip install velocirag
velocirag index ./my-docs
velocirag search "your query here"
```
### Search Daemon (warm engine for CLI users)
```bash
velocirag serve --db ./my-data   # start daemon (background)
velocirag search "query"         # auto-routes through daemon
velocirag status                 # check daemon health
velocirag stop                   # stop daemon
```
The daemon keeps the ONNX model + FAISS index warm over a Unix socket. The first query loads the engine (~1s); subsequent queries return in ~180ms with full 4-layer fusion.
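The `velocirag search` command handles the socket routing for you, but the shape of a client is simple. Here is a minimal sketch of talking to the daemon directly, assuming a hypothetical newline-delimited JSON request/response exchange (the actual wire format is internal and may differ):

```python
import json
import socket

SOCKET_PATH = "/tmp/velocirag-daemon.sock"  # default VELOCIRAG_SOCKET

def daemon_search(query: str, limit: int = 5) -> dict:
    """Send one search request to a warm daemon over its Unix socket.

    Assumes a hypothetical newline-delimited JSON protocol; in practice
    `velocirag search` does this routing automatically.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(SOCKET_PATH)
        request = {"action": "search", "query": query, "limit": limit}
        sock.sendall((json.dumps(request) + "\n").encode())
        # Read until a full line (one JSON response) arrives
        buf = b""
        while not buf.endswith(b"\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
    return json.loads(buf)

print(daemon_search("connection pooling"))
```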
## 🎯 Why VelociRAG?
| | VelociRAG | LangChain | LlamaIndex | Chroma | mcp-local-rag |
|---|---|---|---|---|---|
| Search layers | 4 | 2 | 2 | 1 | 2 |
| Cross-encoder reranking | ✅ | ❌ | ❌ | ❌ | ❌ |
| Knowledge graph | ✅ | ❌ | ❌ | ❌ | ❌ |
| Incremental updates | ✅ | ❌ | ❌ | ❌ | ❌ |
| LLM required for search | ❌ | ⚠️ | ⚠️ | ❌ | ❌ |
| MCP server | ✅ | ❌ | ❌ | ❌ | ✅ |
| GPU required | ❌ | ❌ | ❌ | ❌ | ❌ |
| PyTorch required | ❌ | ✅ | ✅ | ❌ | ❌ |
| Install size | ~80MB | ~750MB+ | ~750MB+ | ~50MB | ~30MB |
| Warm search latency | ~3ms | n/a | n/a | ~50ms | ~200ms |
## 🏗️ How It Works
The 4-layer pipeline:

```
Query → expand (acronyms, variants)
      → [Vector]   FAISS cosine similarity (384d, MiniLM-L6-v2 via ONNX)
      → [Keyword]  BM25 via SQLite FTS5
      → [Graph]    Knowledge graph traversal
      → [Metadata] Structured SQL filters (tags, status, project)
      → RRF Fusion → Cross-encoder rerank → Results
```
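Reciprocal rank fusion itself is a small amount of code. The following is a minimal sketch of the standard RRF formula (score = Σ 1/(k + rank), with k = 60 by convention) used to merge the four ranked lists; VelociRAG's internal weighting and tie-breaking may differ:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Merge ranked result lists with reciprocal rank fusion.

    Each list is ordered best-first; a document appearing near the top
    of several layers accumulates the highest fused score.
    """
    scores: dict[str, float] = defaultdict(float)
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# One ranked list per layer: vector, keyword, graph, metadata
fused = rrf_fuse([
    ["doc3", "doc1", "doc7"],   # vector similarity
    ["doc1", "doc3", "doc9"],   # BM25 keyword
    ["doc7", "doc1"],           # graph traversal
    ["doc1"],                   # metadata filter
])
print(fused[0])  # doc1 ranks first: it appears in all four layers
```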
What each layer catches:
| Query type | Vector | Keyword | Graph | Metadata |
|---|---|---|---|---|
| Conceptual ("improve error handling") | ✅ | ❌ | ❌ | ❌ |
| Exact match ("ERR_CONNECTION_REFUSED") | ❌ | ✅ | ❌ | ❌ |
| Connected concepts | ❌ | ❌ | ✅ | ❌ |
| Filtered ("#python status:active") | ❌ | ❌ | ❌ | ✅ |
| Combined ("React state management") | ✅ | ✅ | ✅ | ❌ |
## ✨ Features
- **ONNX Runtime**: 184ms cold start, 3ms cached. No PyTorch, no GPU
- **Four-layer fusion**: FAISS vector similarity + SQLite FTS5 (BM25) + knowledge graph + metadata filtering, merged via reciprocal rank fusion
- **Cross-encoder reranking**: TinyBERT reranker via ONNX Runtime, included in the base install with no PyTorch needed. Downloads a ~17MB model on first use
- **Incremental graph updates**: file-centric provenance tracking detects what changed and only rebuilds affected nodes/edges. Multi-source support with isolated provenance per source
- **MCP server**: five tools (search, index, add_document, health, list_sources) for Claude, Cursor, Windsurf
- **Search daemon**: Unix socket server keeps the ONNX model + FAISS index warm between queries
- **Knowledge graph**: analyzers build entity, temporal, topic, and explicit-link edges from markdown. Optional GLiNER NER. 418 files in 2.1s
- **Smart chunking**: header-aware splitting preserves document structure and parent context
- **Query expansion**: acronym registry, casing/spacing variants, underscore-aware tokenization (see the sketch after this list)
- **Runs anywhere**: CPU-only, 8GB RAM, no API keys, no external services
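To picture what query expansion does, here is a hypothetical sketch of the variant generation described above. The function name and the acronym table are illustrative only, not VelociRAG's actual API:

```python
import re

# Illustrative acronym registry; VelociRAG maintains its own
ACRONYMS = {"rag": "retrieval augmented generation", "kg": "knowledge graph"}

def expand_query(query: str) -> set[str]:
    """Generate casing/spacing/underscore variants plus acronym expansions."""
    variants = {query, query.lower()}
    # Underscore-aware tokenization: treat snake_case as separate words
    variants.add(query.replace("_", " "))
    # Split camelCase into words ("VectorStore" -> "Vector Store")
    variants.add(re.sub(r"(?<=[a-z])(?=[A-Z])", " ", query))
    # Expand known acronyms
    for token in query.lower().split():
        if token in ACRONYMS:
            variants.add(query.lower().replace(token, ACRONYMS[token]))
    return variants

print(expand_query("RAG vector_store"))
```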
## 🤖 MCP Server
VelociRAG exposes a Model Context Protocol server for seamless agent integration:
Available tools:

- `search` – 4-layer fusion search with reranking
- `index` – add documents to the knowledge base
- `add_document` – insert a single document
- `health` – system diagnostics
- `list_sources` – show indexed document sources
The MCP server process stays alive between queries, so models load once and every subsequent search is warm. Works with any MCP-compatible client.
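Under the hood, an MCP client speaks JSON-RPC 2.0 over the server's stdio. Here is a minimal sketch of driving `velocirag mcp` by hand, using the standard MCP stdio handshake (newline-delimited JSON messages); the `query`/`limit` argument names for the search tool are an assumption:

```python
import json
import subprocess

# Spawn the MCP server exactly as a client like Claude would
proc = subprocess.Popen(
    ["velocirag", "mcp"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)

def send(message: dict) -> None:
    proc.stdin.write(json.dumps(message) + "\n")
    proc.stdin.flush()

# Standard MCP handshake: initialize, then the initialized notification
send({"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {
    "protocolVersion": "2024-11-05",
    "capabilities": {},
    "clientInfo": {"name": "demo", "version": "0.1"},
}})
print(proc.stdout.readline())  # initialize response

send({"jsonrpc": "2.0", "method": "notifications/initialized"})

# Call the search tool (argument names here are assumed)
send({"jsonrpc": "2.0", "id": 2, "method": "tools/call", "params": {
    "name": "search", "arguments": {"query": "connection pooling", "limit": 5},
}})
print(proc.stdout.readline())  # tool result
```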
## 🐍 Python API
Full 4-layer unified search:
```python
from velocirag import (
    Embedder, VectorStore, Searcher,
    GraphStore, MetadataStore, UnifiedSearch,
    GraphPipeline,
)

# Build the full stack
embedder = Embedder()
store = VectorStore('./search-db', embedder)
graph_store = GraphStore('./search-db/graph.db')
metadata_store = MetadataStore('./search-db/metadata.db')

# Index with graph + metadata
store.add_directory('./docs')
pipeline = GraphPipeline(graph_store, embedder, metadata_store)
pipeline.build('./docs', source_name='my-docs')

# Unified search across all layers
searcher = Searcher(store, embedder)
unified = UnifiedSearch(searcher, graph_store, metadata_store)
results = unified.search(
    'machine learning algorithms',
    limit=5,
    enrich_graph=True,
    filters={'tags': ['python'], 'status': 'active'},
)
```
Quick semantic search:
```python
from velocirag import Embedder, VectorStore, Searcher

embedder = Embedder()
store = VectorStore('./db', embedder)
store.add_directory('./docs')

searcher = Searcher(store, embedder)
results = searcher.search('neural networks', limit=10)
```
Incremental graph updates:
```python
from velocirag import Embedder, GraphStore, GraphPipeline

# First run: full build, populates provenance
gs = GraphStore('./db/graph.db')
pipeline = GraphPipeline(gs, embedder=Embedder())
pipeline.build('./docs', source_name='my-docs')   # full build

# Subsequent runs: only changed files get reprocessed
pipeline.build('./docs', source_name='my-docs')   # incremental (automatic)

# Force full rebuild
pipeline.build('./docs', source_name='my-docs', force_rebuild=True)

# Multi-source graphs
pipeline.build('./project-a', source_name='project-a')
pipeline.build('./project-b', source_name='project-b')  # isolated provenance
```
## 💻 CLI Reference
```bash
# Index documents (graph + metadata built by default)
velocirag index <path> [--no-graph] [--no-metadata] [--gliner] [--full-graph] [--force]
                       [--source NAME] [--db PATH]

# Search across all layers (auto-routes through daemon if running)
velocirag search <query> [--limit N] [--threshold F] [--format text|json]

# Search daemon
velocirag serve [--db PATH] [-f]   # start daemon (-f for foreground)
velocirag stop                     # stop daemon
velocirag status                   # check daemon health

# Metadata queries
velocirag query [--tags TAG] [--status S] [--project P] [--recent N]

# System health and status
velocirag health [--format text|json]

# Start MCP server
velocirag mcp [--db PATH] [--transport stdio|sse]
```
Options:
- `--no-graph` – skip knowledge graph build
- `--no-metadata` – skip metadata extraction
- `--full-graph` – build graph WITH semantic similarity edges (~2GB extra RAM)
- `--source NAME` – label for multi-source provenance isolation
- `--force` – clear and rebuild from scratch
- `--gliner` – use GLiNER for entity extraction (requires `pip install "velocirag[ner]"`)
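Putting the flags together, a typical multi-source workflow looks like this (the paths and source names are illustrative):

```bash
# Index two sources into one database with isolated provenance
velocirag index ./api-docs --source api-docs --db ./kb
velocirag index ./runbooks --source runbooks --db ./kb

# Keep the engine warm, then query with JSON output for scripting
velocirag serve --db ./kb
velocirag search "rate limiting strategy" --limit 3 --format json
velocirag health --format json
```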
## 📊 Performance
Real benchmarks on ByteByteGo/system-design-101 (418 files, 1,001 chunks):
| Metric | Value |
|---|---|
| Index (418 files) | 13.6s |
| Search (warm, 5 results) | 35โ90ms |
| Graph build (light) | 2.1s → 2,397 nodes, 8,717 edges |
| Incremental update (1 file) | 1.3s |
| Reranker | Cross-encoder TinyBERT via ONNX |
| Install size | ~80MB (no PyTorch) |
| RAM usage | <1GB with all models loaded |
Production deployment (3,400+ documents, 4 sources):
| Metric | Value |
|---|---|
| Embedding (warm) | 3ms |
| Embedding (cold) | 184ms |
| Full 4-layer search (warm) | 76โ350ms |
| Hit rate (100-query benchmark) | 99/100 |
| Graph | 3,740 nodes, 92,719 edges |
| Provenance | 870 files across 4 sources |
## ⚙️ Configuration
| Environment Variable | Default | Description |
|---|---|---|
| `VELOCIRAG_DB` | `./.velocirag` | Database directory |
| `VELOCIRAG_SOCKET` | `/tmp/velocirag-daemon.sock` | Daemon socket path |
| `NO_COLOR` | (unset) | Disable colored output |
Dependencies (all included in base install):
- `onnxruntime` – ONNX inference (embedder + reranker)
- `tokenizers` + `huggingface-hub` – model loading
- `faiss-cpu` – vector similarity search
- `networkx` + `scikit-learn` – knowledge graph + topic clustering
- `numpy`, `click`, `pyyaml`, `python-frontmatter`
Optional extras:
pip install "velocirag[mcp]"โ MCP server (addsfastmcp)pip install "velocirag[ner]"โ GLiNER entity extraction (addsgliner, requires PyTorch)
## 📄 License

MIT. Use it anywhere, build anything.
Need agent integration help? Check AGENTS.md for machine-readable project context.
Built for agents who think fast and remember faster.
## File details: velocirag-0.6.3.tar.gz

- Size: 183.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `f066261e2723b047b6d8ac6af78a7b477016f6f9c5453a6f5d107a1532423321` |
| MD5 | `4c68be775be21daaaecff9934f1ffc60` |
| BLAKE2b-256 | `73e55c6c7cc67ac6cdfcaa05bea3cd0fa9de4d69bb3d71efd78659fa295ff24c` |
### Provenance

The following attestation bundle was made for velocirag-0.6.3.tar.gz:

- Publisher: publish.yml on HaseebKhalid1507/VelociRAG
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: velocirag-0.6.3.tar.gz
- Subject digest: f066261e2723b047b6d8ac6af78a7b477016f6f9c5453a6f5d107a1532423321
- Sigstore transparency entry: 1190831179
- Permalink: HaseebKhalid1507/VelociRAG@72c6abb984847ce95a77cf0a9768c3f3fe320571
- Branch / Tag: refs/tags/v0.6.3
- Owner: https://github.com/HaseebKhalid1507
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@72c6abb984847ce95a77cf0a9768c3f3fe320571
- Trigger Event: release
## File details: velocirag-0.6.3-py3-none-any.whl

- Size: 114.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `9dce04fc1a3d856d48b95b9403f4787b29a554c315c07c5e07ba3bc94b7579ed` |
| MD5 | `4f876c88ef9d6510ac95460349082fc0` |
| BLAKE2b-256 | `5fe56de6984928c9e9df00797722fde82bb4da71639d7043d1298f2c4bd92fbc` |
### Provenance

The following attestation bundle was made for velocirag-0.6.3-py3-none-any.whl:

- Publisher: publish.yml on HaseebKhalid1507/VelociRAG
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: velocirag-0.6.3-py3-none-any.whl
- Subject digest: 9dce04fc1a3d856d48b95b9403f4787b29a554c315c07c5e07ba3bc94b7579ed
- Sigstore transparency entry: 1190831213
- Permalink: HaseebKhalid1507/VelociRAG@72c6abb984847ce95a77cf0a9768c3f3fe320571
- Branch / Tag: refs/tags/v0.6.3
- Owner: https://github.com/HaseebKhalid1507
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@72c6abb984847ce95a77cf0a9768c3f3fe320571
- Trigger Event: release