High-performance CUDA-accelerated vector retrieval core.
LocVec — Local Vector Retrieval Engine
High-speed, hardware-aware vector search for local RAG on consumer-grade GPUs — no cloud dependency required.
Overview
LocVec is a hardware-aware vector database and retrieval library written in Python with custom CUDA backends. It is designed for low-latency Retrieval-Augmented Generation (RAG) on consumer-grade hardware with limited VRAM.
By implementing a dynamic memory handoff system and optimized clustering algorithms, LocVec enables high-speed search across large document sets — entirely on-device.
Features
- Custom CUDA Kernels — Accelerated matrix operations for vector similarity search.
- Dynamic VRAM Handoff — Warm Loading cycle flushes the encoder after retrieval, freeing VRAM for LLM inference.
- IVF Indexing — Reduces search complexity from O(N) to O(N/K) via K-Means clustered Voronoi partitioning with dynamic K calculation.
- Streaming LLM Inference — Token-by-token generation from local models (e.g., Phi-3 via Ollama).
- PDF Ingestion — Built-in sharding and indexing pipeline for document corpora.
- No Cloud Required — Fully local; no API calls, no data leaving your machine.
System Requirements
- NVIDIA GPU (CUDA-capable)
- NVIDIA CUDA Toolkit — version matching your GPU drivers
- Python 3.10+
Verify your CUDA installation:
nvcc --version
If the command is not recognized, add the CUDA bin directory to your system PATH.
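The same check can be scripted, which is handy in setup scripts that should fail early. The sketch below uses only the Python standard library; the helper name `find_nvcc` is illustrative and not part of LocVec.

```python
import shutil
import subprocess


def find_nvcc():
    """Return the `nvcc --version` banner, or None if nvcc is not on PATH."""
    nvcc_path = shutil.which("nvcc")
    if nvcc_path is None:
        return None
    try:
        result = subprocess.run(
            [nvcc_path, "--version"],
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        # The file exists on PATH but could not be executed
        return None
    return result.stdout.strip() or None


banner = find_nvcc()
print(banner if banner else "nvcc not found; add the CUDA bin directory to PATH")
```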
Installation
# Clone the repository
git clone https://github.com/rAdvirtua/locvec.git
cd locvec
# Install required dependencies
pip install -r requirements.txt
# Compile and install the library (editable mode)
pip install -e .
Note: Editable mode (-e) is required to trigger local compilation of the C/CUDA extensions via the setup.py build script.
Setting Up a Local LLM with Ollama
LocVec uses Ollama to run LLMs locally. Follow these steps to get a model running before querying with LocVec.
1. Install Ollama
Download and install Ollama for your platform from ollama.com/download, or via the terminal:
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
For Windows, use the installer from the website.
2. Pull a Model
Once Ollama is installed, pull a model. Phi-3 is recommended for low-VRAM setups:
# Lightweight — good for 4–6 GB VRAM
ollama pull phi3
# Alternatively, for higher quality output
ollama pull llama3
ollama pull mistral
Browse the full model library at ollama.com/library.
3. Verify the Model is Running
ollama run phi3
You should see an interactive prompt. Type /bye to exit. Ollama runs as a background service automatically, so no manual server start is needed before using LocVec.
Tip: Match your model choice to available VRAM. On GPUs with less than 6 GB, stick to phi3 or gemma:2b to avoid OOM errors during the inference phase.
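To confirm from a script that the Ollama service is reachable and see which models are pulled, you can query its local HTTP API. This is a minimal sketch that assumes Ollama is listening on its default address (http://localhost:11434) and uses the `/api/tags` endpoint; the helper name `list_ollama_models` is illustrative and not part of LocVec.

```python
import json
import urllib.error
import urllib.request


def list_ollama_models(base_url="http://localhost:11434", timeout=2.0):
    """Return names of locally pulled models, or [] if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            data = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply
        return []
    if not isinstance(data, dict):
        return []
    return [m.get("name", "") for m in data.get("models", [])]


models = list_ollama_models()
if models:
    print("Ollama is up. Models:", ", ".join(models))
else:
    print("Ollama is not reachable, or no models have been pulled yet.")
```

An empty list does not distinguish between "server down" and "no models pulled"; run `ollama list` in a terminal if you need to tell the two apart.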
Usage
The usage example requires PyMuPDF for PDF text extraction. Install it with:
pip install pymupdf
import sys
import time

import fitz  # PyMuPDF
from locvec import LocalVec


def extract_and_chunk_pdf(file_path, chunk_size=300):
    """Extract all page text and split it into fixed-size character chunks."""
    text = ""
    with fitz.open(file_path) as doc:
        for page in doc:
            text += page.get_text()
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Explicitly initialize the engine with the preferred model
engine = LocalVec(model_name='all-MiniLM-L6-v2')

pdf_path = "research_paper.pdf"
chunks = extract_and_chunk_pdf(pdf_path)
print(f"Extracted {len(chunks)} shards from {pdf_path}")
engine.build_full_index(chunks)

query = "Summarize the key findings of this document."
start_time = time.perf_counter()
idx, context = engine.search(query)
latency_ms = (time.perf_counter() - start_time) * 1000

# A negative index signals a failed search; context then holds the error message
if idx < 0:
    print(f"Error: {context}")
    sys.exit(1)
print(f"Search completed in {latency_ms:.2f}ms")

# Free the encoder's VRAM before the LLM starts generating
engine.offload_encoder()

print("\nAI Response:")
for token in engine.query_llm_stream("phi3", query, context):
    print(token, end="", flush=True)
print("\n")
PDF Chat Session
An interactive chat session over a PDF document. Maintains a short history window and enforces concise responses via a strict token limit.
import json
import urllib.request

import fitz  # PyMuPDF
from locvec import LocalVec


class PDFChatSession:
    def __init__(self, pdf_path, db_name="chat_store"):
        self.pdf_path = pdf_path
        self.lv = LocalVec(db_name=db_name)
        self.history = []
        self.llm_endpoint = "http://localhost:11434/api/generate"
        self.model = "phi3"

    def ingest_pdf(self, shard_size=600):
        # Flatten the PDF into one string, then cut fixed-size character shards
        with fitz.open(self.pdf_path) as doc:
            full_text = "".join(page.get_text().replace('\n', ' ') for page in doc)
        shards = [full_text[i:i + shard_size] for i in range(0, len(full_text), shard_size)]
        self.lv.build_full_index(shards)

    def _call_llm_stream(self, prompt):
        payload = {
            "model": self.model,
            "prompt": prompt,
            "stream": True,
            # num_predict caps the reply length; num_ctx sets the context window
            "options": {"temperature": 0.2, "num_ctx": 4096, "num_predict": 150}
        }
        req = urllib.request.Request(
            self.llm_endpoint,
            data=json.dumps(payload).encode('utf-8'),
            headers={'Content-Type': 'application/json'}
        )
        full_response = ""
        with urllib.request.urlopen(req) as response:
            # Ollama streams one JSON object per line
            for line in response:
                if not line.strip():
                    continue
                chunk = json.loads(line.decode('utf-8'))
                token = chunk.get('response', '')
                full_response += token
                print(token, end="", flush=True)
                if chunk.get('done'):
                    break
        return full_response

    def start_chat(self):
        while True:
            query = input("\nUser: ").strip()
            if query.lower() in ['exit', 'quit']:
                break
            _, context_shards = self.lv.search(query, top_k=2)
            # History window: only the most recent exchange is replayed
            history_str = "\n".join(f"Q: {h['q']}\nA: {h['a']}" for h in self.history[-1:])
            full_prompt = (
                f"SYSTEM: You are a concise technical assistant. "
                f"Answer using ONLY the context provided. "
                f"STRICT RULE: Your response MUST be 3-4 lines maximum.\n\n"
                f"CONTEXT: {context_shards}\n\n"
                f"HISTORY: {history_str}\n\n"
                f"USER QUERY: {query}\n\n"
                f"CONCISE ANSWER:"
            )
            print("\nAI: ", end="")
            answer = self._call_llm_stream(full_prompt)
            print()
            self.history.append({"q": query, "a": answer})


if __name__ == "__main__":
    PDF_FILE = "research_paper.pdf"
    session = PDFChatSession(PDF_FILE)
    session.ingest_pdf()
    session.start_chat()
IVF vs. Brute-Force Benchmark
Compares LocVec's IVF-Voronoi search against a GPU flat (brute-force) baseline at scale. Useful for validating speedup on your specific hardware.
import gc
import json
import random
import time
import urllib.request

import fitz  # PyMuPDF
import numpy as np
import torch
from locvec import LocalVec

SOURCE_PDF = "research_paper.pdf"
QUERY = "What is the specific methodology used in this paper?"
TARGET_SHARDS = 200000
OLLAMA_MODEL = "phi3"


def prepare_corpus(file_path, total_shards):
    """Repeat the PDF's pages, tagged with a pass id, until total_shards is reached."""
    with fitz.open(file_path) as doc:
        base_text = [p.get_text().replace('\n', ' ') for p in doc]
    multiplier = (total_shards // len(base_text)) + 1
    corpus = []
    for i in range(multiplier):
        for shard in base_text:
            if len(corpus) < total_shards:
                corpus.append(f"{shard} [id_{i}]")
    random.seed(42)
    random.shuffle(corpus)
    return corpus


def stream_inference(query, context):
    payload = {
        "model": OLLAMA_MODEL,
        "prompt": f"Context: {context}\n\nQuery: {query}\n\nTechnical Answer:",
        "stream": True,
        "options": {"temperature": 0.1, "num_predict": 350}
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode('utf-8'),
        headers={'Content-Type': 'application/json'}
    )
    with urllib.request.urlopen(req) as response:
        for line in response:
            if line.strip():
                chunk = json.loads(line.decode('utf-8'))
                yield chunk.get('response', '')
                if chunk.get('done'):
                    break


if __name__ == "__main__":
    corpus = prepare_corpus(SOURCE_PDF, TARGET_SHARDS)
    lv = LocalVec(db_name="benchmark_final")
    lv.build_full_index(corpus)
    time.sleep(20)  # Thermal recovery for laptop hardware

    embeddings_path = f"{lv.db_prefix}_offline_embeddings.bin"
    raw_data = np.fromfile(embeddings_path, dtype=np.float32).reshape(-1, lv.dims)
    all_vectors_gpu = torch.from_numpy(raw_data).to('cuda').half()
    q_vec_gpu = torch.from_numpy(lv.encoder.encode(QUERY)).to('cuda').half()

    # GPU Brute-Force: warm up the kernel, then drain the queue before timing
    _ = torch.norm(all_vectors_gpu[:100] - q_vec_gpu, dim=1)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    dist = torch.norm(all_vectors_gpu - q_vec_gpu, dim=1)
    flat_idx = torch.argmin(dist).item()
    torch.cuda.synchronize()
    latency_flat = (time.perf_counter() - t0) * 1000
    flat_context = corpus[flat_idx]

    # LocVec IVF
    t1 = time.perf_counter()
    lv_idx, lv_context = lv.search(QUERY, top_k=4)
    latency_lv = (time.perf_counter() - t1) * 1000

    peak_vram = torch.cuda.max_memory_allocated() / (1024**2)

    # Release GPU memory before LLM inference
    lv.offload_encoder()
    del all_vectors_gpu
    gc.collect()
    torch.cuda.empty_cache()

    speedup = latency_flat / latency_lv
    print(f"Brute-Force: {latency_flat:.2f}ms | LocVec IVF: {latency_lv:.2f}ms | Speedup: {speedup:.2f}x | Peak VRAM: {peak_vram:.2f}MB")

    # Answer generated from the brute-force context
    for token in stream_inference(QUERY, flat_context):
        print(token, end="", flush=True)
    print()

    # Answer generated from the LocVec IVF context
    for token in stream_inference(QUERY, lv_context):
        print(token, end="", flush=True)
    print()
User API Reference
To use the vector engine in your own scripts, import the main class:
from locvec import LocalVec
LocalVec(model_name='all-MiniLM-L6-v2')
Initializes the vector engine and loads the embedding model into memory.
- model_name (str, optional): Specify a different SentenceTransformers model from HuggingFace. Defaults to a fast, lightweight 384-dimension model ('all-MiniLM-L6-v2').
engine.build_full_index(texts)
Ingests a list of text shards and builds the optimized vector index on your GPU. It automatically calculates the best cluster distribution for your dataset size.
- texts (list of str): A list of strings (e.g., your chunked PDF pages or text documents).
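How LocVec picks the cluster count is not documented here. As a rough illustration only, a common rule of thumb for IVF indexes is K ≈ √N: probing one cluster costs about K centroid comparisons plus N/K vector comparisons, and that sum is smallest near K = √N. The function name and the clamping bounds below are assumptions for the sketch, not LocVec's actual rule.

```python
import math


def suggest_num_clusters(num_vectors, min_k=1, max_k=4096):
    """Square-root heuristic for the IVF cluster count (illustrative only)."""
    if num_vectors <= 0:
        raise ValueError("num_vectors must be positive")
    k = round(math.sqrt(num_vectors))
    # Never ask for more clusters than there are vectors
    return max(min_k, min(k, max_k, num_vectors))


for n in (100, 10_000, 200_000):
    print(f"N={n:>7} -> K={suggest_num_clusters(n)}")
```

For the 200,000-shard corpus used in the benchmark above, this heuristic gives K = 447, or roughly 447 vectors per cluster if the clusters are balanced.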
engine.search(query_text)
Performs a high-speed, hardware-accelerated vector search against your indexed documents to find the most relevant context.
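- query_text (str): The question or prompt you want to search for.
- Returns (tuple): (idx, context), where idx is the integer ID of the best-matching chunk and context is the text of that chunk. As the usage example shows, a negative idx signals an error, and context then holds the error message.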
engine.offload_encoder()
Flushes the embedding model from your GPU's VRAM. Call this before passing the retrieved context to a local LLM. It releases the memory held by the encoder so that it is available for generation, which helps prevent out-of-memory crashes on consumer cards (4 GB / 6 GB).
engine.query_llm_stream(model, query, context)
Sends the user's query alongside the retrieved context to your local Ollama instance, returning a stream of the response.
- model (str): The name of the Ollama model to use (e.g., "phi3" or "llama3").
- query (str): The user's original question.
- context (str): The text payload returned by the engine.search() method.
- Returns (generator): Yields individual string tokens for real-time console or UI streaming.
How It Works
CUDA Backend & Memory Management
Many local RAG setups run out of memory on consumer hardware because the embedding model and the LLM compete for the same VRAM pool. LocVec addresses this with a two-phase lifecycle:
- Retrieval Phase — Custom CUDA kernels handle vector similarity computations.
- Warm Loading / VRAM Handoff — The encoder is flushed from GPU memory immediately after retrieval, clearing space for LLM inference and preventing Out-of-Memory (OOM) errors.
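The handoff in the second phase can be pictured with a small sketch. This is not LocVec's implementation: `DummyEncoder` and `HandoffSketch` are hypothetical stand-ins, and the assumption is that releasing an encoder amounts to dropping the last reference to it, running the garbage collector, and asking PyTorch (if installed and CUDA is available) to return cached blocks to the driver.

```python
import gc
import weakref


class DummyEncoder:
    """Hypothetical stand-in for an embedding model held in GPU memory."""

    def encode(self, text):
        return [float(len(text))]


class HandoffSketch:
    """Illustrates the retrieve-then-release lifecycle, not LocVec's internals."""

    def __init__(self):
        self.encoder = DummyEncoder()

    def release_encoder(self):
        # Drop the only reference so the model's memory can be reclaimed
        self.encoder = None
        gc.collect()
        try:
            import torch
        except ImportError:
            return
        if torch.cuda.is_available():
            # Return cached, now-unused blocks to the CUDA driver
            torch.cuda.empty_cache()


sketch = HandoffSketch()
probe = weakref.ref(sketch.encoder)  # lets us observe when the encoder is freed
_ = sketch.encoder.encode("retrieval happens first")
sketch.release_encoder()
print("encoder released:", probe() is None)
```

The ordering matters: any other live reference to the model (a local variable, a cached tensor) keeps its memory allocated, so the release has to happen after retrieval finishes and before the LLM is loaded.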
IVF Indexing
LocVec replaces a linear scan over all vectors with a scan restricted to one cluster, using Inverted File Indexing (IVF):
| Step | Description |
|---|---|
| Clustering | K-Means partitions the vector space into K Voronoi cells dynamically based on dataset size. |
| Coarse Search | Query is matched to the nearest centroid. |
| Fine Search | k-NN search is performed only within the relevant cluster. |
| Complexity | Per-query comparisons drop from O(N) to roughly O(K + N/K), assuming evenly sized clusters and a single probed cluster. |
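The three steps in the table can be sketched on the CPU with NumPy. This is an illustrative reimplementation under simple assumptions (plain Lloyd's k-means, squared Euclidean distance, random data), not LocVec's CUDA kernels; all function names and the `nprobe` parameter are hypothetical.

```python
import numpy as np


def sq_dists(points, centers):
    """Squared Euclidean distances, shape (len(points), len(centers))."""
    return ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)


def build_ivf(vectors, k, iters=10, seed=0):
    """Clustering step: k-means centroids plus one inverted list per cell."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = sq_dists(vectors, centroids).argmin(axis=1)
        for c in range(k):
            members = vectors[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # Final assignment so the lists match the final centroids
    labels = sq_dists(vectors, centroids).argmin(axis=1)
    lists = [np.flatnonzero(labels == c) for c in range(k)]
    return centroids, lists


def ivf_search(query, vectors, centroids, lists, nprobe=1):
    """Coarse search over centroids, then fine search inside the probed cells."""
    coarse = ((centroids - query) ** 2).sum(axis=1)
    probed = np.argsort(coarse)[:nprobe]
    candidates = np.concatenate([lists[c] for c in probed])
    if candidates.size == 0:
        return -1
    fine = ((vectors[candidates] - query) ** 2).sum(axis=1)
    return int(candidates[fine.argmin()])


rng = np.random.default_rng(0)
vectors = rng.standard_normal((2000, 16))
centroids, lists = build_ivf(vectors, k=45)

query = vectors[123]
brute = int(((vectors - query) ** 2).sum(axis=1).argmin())
print("brute force:", brute)
print("IVF, 1 cell:", ivf_search(query, vectors, centroids, lists, nprobe=1))
print("vectors scanned per cell (avg):", len(vectors) / len(lists))
```

Probing a single cell is approximate: a query near a cell boundary may have its true nearest neighbor in an adjacent cell. Raising `nprobe` trades speed for recall, and probing every cell reduces to brute force.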
Project Structure
locvec/
├── src/
│ ├── cuda/ # Custom CUDA kernels (k-means training, IVF search)
│ ├── bridge/ # C wrappers interfacing Python with CUDA memory management
│ └── locvec/ # Core Python library, CTypes bindings, and high-level API
├── setup.py # Build configuration for C/CUDA extensions
├── testcase.py # Reference implementation for PDF ingestion and search
└── README.md # Documentation
License
See LICENSE for details.