
🦖 VelociRAG

Lightning-fast RAG for AI agents.

Four-layer retrieval fusion powered by ONNX Runtime. No PyTorch. Sub-200ms warm search. Incremental graph updates. MCP-ready.


Most RAG solutions either drag in 2GB+ of PyTorch or limit you to single-layer vector search. VelociRAG gives you four retrieval methods (vector similarity, BM25 keyword matching, knowledge graph traversal, and metadata filtering) fused through reciprocal rank fusion with cross-encoder reranking. All running on ONNX Runtime, no GPU, no API keys. Comes with an MCP server for agent integration, a Unix socket daemon for warm queries, and a CLI that just works.

🚀 Quick Start

MCP Server (Claude, Cursor, Windsurf)

pip install "velocirag[mcp]"
velocirag index ./my-docs --graph --metadata
velocirag mcp

Claude Code – add to .mcp.json in your project root:

{
  "mcpServers": {
    "velocirag": {
      "command": "velocirag",
      "args": ["mcp"],
      "env": { "VELOCIRAG_DB": "/path/to/data" }
    }
  }
}

Then open /mcp in Claude Code and enable the velocirag server. If using a virtualenv, use the full path to the binary (e.g. .venv/bin/velocirag).

Claude Desktop – add to claude_desktop_config.json:

{
  "mcpServers": {
    "velocirag": {
      "command": "velocirag",
      "args": ["mcp", "--db", "/path/to/data"]
    }
  }
}

Cursor – add to .cursor/mcp.json:

{
  "mcpServers": {
    "velocirag": {
      "command": "velocirag",
      "args": ["mcp", "--db", "/path/to/data"]
    }
  }
}

Python API

from velocirag import Embedder, VectorStore, Searcher

embedder = Embedder()                     # ONNX embedding model (MiniLM-L6-v2)
store = VectorStore('./my-db', embedder)  # persistent FAISS-backed store
store.add_directory('./my-docs')          # chunk, embed, and index the docs
searcher = Searcher(store, embedder)
results = searcher.search('query', limit=5)

CLI

pip install velocirag
velocirag index ./my-docs --graph --metadata
velocirag search "your query here"

Search Daemon (warm engine for CLI users)

velocirag serve --db ./my-data        # start daemon (background)
velocirag search "query"              # auto-routes through daemon
velocirag status                      # check daemon health
velocirag stop                        # stop daemon

The daemon keeps the ONNX model + FAISS index warm over a Unix socket. First query loads the engine (~1s), subsequent queries return in ~180ms with full 4-layer fusion.
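If you are scripting around the daemon, here is a minimal sketch, assuming the default socket path and only the documented CLI flags. The daemon's socket protocol is internal, so the sketch shells out to the CLI, which auto-routes through the daemon whenever the socket exists:

import json
import os
import subprocess

SOCKET = os.environ.get("VELOCIRAG_SOCKET", "/tmp/velocirag-daemon.sock")

# Start the daemon once if its socket is missing; subsequent
# `velocirag search` calls route through the warm engine.
if not os.path.exists(SOCKET):
    subprocess.run(["velocirag", "serve", "--db", "./my-data"], check=True)

out = subprocess.run(
    ["velocirag", "search", "your query here", "--format", "json"],
    capture_output=True, text=True, check=True,
)
results = json.loads(out.stdout)  # the JSON payload shape is an assumption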

🎯 Why VelociRAG?

                          VelociRAG   LangChain   LlamaIndex   Chroma   mcp-local-rag
Search layers             4           2           2            1        2
Cross-encoder reranking   ✅          ❌          ✅           ❌       ❌
Knowledge graph           ✅          ❌          ✅           ❌       ❌
Incremental updates       ✅          ❌          ❌           ❌       ❌
LLM required for search   ❌          ⚠️          ⚠️           ❌       ❌
MCP server                ✅          ❌          ❌           ❌       ✅
GPU required              ❌          ❌          ❌           ❌       ❌
PyTorch required          ❌          ✅          ✅           ❌       ❌
Install size              ~80MB       ~750MB+     ~750MB+      ~50MB    ~30MB
Warm search latency       ~3ms        –           –            ~50ms    ~200ms

๐Ÿ—๏ธ How It Works

The 4-layer pipeline:

Query → expand (acronyms, variants)
      → [Vector]   FAISS cosine similarity (384d, MiniLM-L6-v2 via ONNX)
      → [Keyword]  BM25 via SQLite FTS5
      → [Graph]    Knowledge graph traversal
      → [Metadata] Structured SQL filters (tags, status, project)
      → RRF Fusion → Cross-encoder rerank → Results
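
Reciprocal rank fusion itself is simple; here is an illustrative sketch of the fusion step (not VelociRAG's actual implementation), where each layer contributes 1/(k + rank) per document and the conventional k = 60 damps the influence of any single list:

from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Merge ranked lists of document IDs via reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:              # one list per retrieval layer
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: vector, keyword, graph, and metadata layers each return IDs.
print(rrf_fuse([["a", "b", "c"], ["b", "a"], ["c", "b"], ["b"]]))  # 'b' first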

What each layer catches:

Query type                               Vector   Keyword   Graph   Metadata
Conceptual ("improve error handling")    ✅       –         –       –
Exact match ("ERR_CONNECTION_REFUSED")   –        ✅        –       –
Connected concepts                       –        –         ✅      –
Filtered ("#python status:active")       –        –         –       ✅
Combined ("React state management")      ✅       ✅        ✅      ✅

✨ Features

  • ONNX Runtime – 184ms cold start, 3ms cached. No PyTorch, no GPU
  • Four-layer fusion – FAISS vector similarity + SQLite FTS5 (BM25) + knowledge graph + metadata filtering, merged via reciprocal rank fusion
  • Cross-encoder reranking – TinyBERT reranker via ONNX Runtime, included in the base install; no PyTorch needed. Downloads a ~17MB model on first use
  • Incremental graph updates – file-centric provenance tracking detects what changed and only rebuilds affected nodes/edges. Multi-source support with isolated provenance per source
  • MCP server – Five tools (search, index, add_document, health, list_sources) for Claude, Cursor, Windsurf
  • Search daemon – Unix socket server keeps the ONNX model + FAISS index warm between queries
  • Knowledge graph – Analyzers build entity, temporal, topic, and explicit-link edges from markdown. Optional GLiNER NER. 418 files in 2.1s
  • Smart chunking – Header-aware splitting preserves document structure and parent context
  • Query expansion – Acronym registry, casing/spacing variants, underscore-aware tokenization (see the sketch after this list)
  • Runs anywhere – CPU-only, 8GB RAM, no API keys, no external services
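
A hedged sketch of the query-expansion idea (illustrative only; VelociRAG's actual registry and rules are internal and may differ):

import re

# Hypothetical acronym registry; the real one ships inside VelociRAG.
ACRONYMS = {"rag": "retrieval augmented generation",
            "ner": "named entity recognition"}

def expand_query(query: str) -> list[str]:
    variants = {query, query.lower()}
    # underscore-aware tokenization, then acronym expansion
    for token in query.lower().replace("_", " ").split():
        if token in ACRONYMS:
            variants.add(query.lower().replace(token, ACRONYMS[token]))
    # casing/spacing variant, e.g. "VectorStore" -> "vector store"
    variants.add(re.sub(r"(?<=[a-z])(?=[A-Z])", " ", query).lower())
    return sorted(variants)

print(expand_query("RAG pipeline"))  # includes the expanded acronym form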

🤖 MCP Server

VelociRAG exposes a Model Context Protocol server for seamless agent integration:

Available tools:

  • search – 4-layer fusion search with reranking
  • index – Add documents to the knowledge base
  • add_document – Insert a single document
  • health – System diagnostics
  • list_sources – Show indexed document sources

The MCP server process stays alive between queries, so models load once and every subsequent search is warm. Works with any MCP-compatible client.
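
On the wire these are standard MCP tools/call requests. A sketch of the payload a client might send for the search tool (the argument names are illustrative assumptions, not documented parameters):

# Hypothetical MCP tools/call payload for the search tool; the argument
# names ("query", "limit") are assumptions for illustration.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search",
        "arguments": {"query": "incremental graph updates", "limit": 5},
    },
}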

๐Ÿ Python API

Full 4-layer unified search:

from velocirag import (
    Embedder, VectorStore, Searcher,
    GraphStore, MetadataStore, UnifiedSearch,
    GraphPipeline
)

# Build the full stack
embedder = Embedder()
store = VectorStore('./search-db', embedder)
graph_store = GraphStore('./search-db/graph.db')
metadata_store = MetadataStore('./search-db/metadata.db')

# Index with graph + metadata
store.add_directory('./docs')
pipeline = GraphPipeline(graph_store, embedder, metadata_store)
pipeline.build('./docs', source_name='my-docs')

# Unified search across all layers
searcher = Searcher(store, embedder)
unified = UnifiedSearch(searcher, graph_store, metadata_store)
results = unified.search(
    'machine learning algorithms',
    limit=5,
    enrich_graph=True,
    filters={'tags': ['python'], 'status': 'active'}
)

Quick semantic search:

from velocirag import Embedder, VectorStore, Searcher

embedder = Embedder()
store = VectorStore('./db', embedder)
store.add_directory('./docs')
searcher = Searcher(store, embedder)
results = searcher.search('neural networks', limit=10)

Incremental graph updates:

from velocirag import Embedder, GraphStore, GraphPipeline

# First run – full build, populates provenance
gs = GraphStore('./db/graph.db')
pipeline = GraphPipeline(gs, embedder=Embedder())
pipeline.build('./docs', source_name='my-docs')  # full build

# Subsequent runs – only changed files get reprocessed
pipeline.build('./docs', source_name='my-docs')  # incremental (automatic)

# Force full rebuild
pipeline.build('./docs', source_name='my-docs', force_rebuild=True)

# Multi-source graphs
pipeline.build('./project-a', source_name='project-a')
pipeline.build('./project-b', source_name='project-b')  # isolated provenance

💻 CLI Reference

# Index documents (graph built with light mode by default)
velocirag index <path> [--graph] [--metadata] [--gliner] [--full-graph] [--force]
                       [--source NAME] [--db PATH]

# Search across all layers (auto-routes through daemon if running)
velocirag search <query> [--limit N] [--threshold F] [--format text|json]

# Search daemon
velocirag serve [--db PATH] [-f]         # start daemon (-f for foreground)
velocirag stop                            # stop daemon
velocirag status                          # check daemon health

# Metadata queries
velocirag query [--tags TAG] [--status S] [--project P] [--recent N]

# System health and status
velocirag health [--format text|json]

# Start MCP server
velocirag mcp [--db PATH] [--transport stdio|sse]

Graph options:

  • --graph – Build knowledge graph (light mode by default, skips semantic similarity)
  • --full-graph – Build graph WITH semantic similarity edges (~2GB extra RAM)
  • --source NAME – Label for multi-source provenance isolation
  • --force – Clear and rebuild from scratch
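
For example, indexing two sources into one database and then querying it (paths illustrative, flags as documented above):

velocirag index ./project-a --graph --metadata --source project-a --db ./data
velocirag index ./project-b --graph --metadata --source project-b --db ./data
velocirag search "connection pooling" --limit 5 --format json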

📊 Performance

Real benchmarks on ByteByteGo/system-design-101 (418 files, 1,001 chunks):

Metric                        Value
Index (418 files)             13.6s
Search (warm, 5 results)      35–90ms
Graph build (light)           2.1s → 2,397 nodes, 8,717 edges
Incremental update (1 file)   1.3s
Reranker                      Cross-encoder TinyBERT via ONNX
Install size                  ~80MB (no PyTorch)
RAM usage                     <1GB with all models loaded

Production deployment (3,400+ documents, 4 sources):

Metric                           Value
Embedding (warm)                 3ms
Embedding (cold)                 184ms
Full 4-layer search (warm)       76–350ms
Hit rate (100-query benchmark)   99/100
Graph                            3,740 nodes, 92,719 edges
Provenance                       870 files across 4 sources

โš™๏ธ Configuration

Environment Variable   Default                      Description
VELOCIRAG_DB           ./.velocirag                 Database directory
VELOCIRAG_SOCKET       /tmp/velocirag-daemon.sock   Daemon socket path
NO_COLOR               –                            Disable colored output
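
The environment variables compose with every entry point (CLI, daemon, MCP server); for instance (paths illustrative):

export VELOCIRAG_DB=~/notes/.velocirag
export VELOCIRAG_SOCKET=/tmp/velocirag-notes.sock
velocirag serve && velocirag search "meeting notes"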

Dependencies (all included in base install):

  • onnxruntime – ONNX inference (embedder + reranker)
  • tokenizers + huggingface-hub – model loading
  • faiss-cpu – vector similarity search
  • networkx + scikit-learn – knowledge graph + topic clustering
  • numpy, click, pyyaml, python-frontmatter

Optional extras:

  • pip install "velocirag[mcp]" – MCP server (adds fastmcp)
  • pip install "velocirag[ner]" – GLiNER entity extraction (adds gliner, requires PyTorch)

📄 License

MIT – Use it anywhere, build anything.

Need agent integration help? Check AGENTS.md for machine-readable project context.


Built for agents who think fast and remember faster.

