High-performance CUDA-accelerated vector retrieval core.

These details have not been verified by PyPI

Project description

LocVec — Local Vector Retrieval Engine

High-speed, hardware-aware vector search for local RAG on consumer-grade GPUs — no cloud dependency required.

Overview

LocVec is a hardware-aware vector database and retrieval library written in Python with custom CUDA backends. It is designed for low-latency Retrieval-Augmented Generation (RAG) on consumer-grade hardware with limited VRAM.

By implementing a dynamic memory handoff system and optimized clustering algorithms, LocVec enables high-speed search across large document sets — entirely on-device.

Features

Custom CUDA Kernels — Accelerated matrix operations for vector similarity search.
Dynamic VRAM Handoff — Warm Loading cycle flushes the encoder after retrieval, freeing VRAM for LLM inference.
IVF Indexing — Reduces search complexity from O(N) to O(N/K) via K-Means clustered Voronoi partitioning with dynamic K calculation.
Streaming LLM Inference — Token-by-token generation from local models (e.g., Phi-3 via Ollama).
PDF Ingestion — Built-in sharding and indexing pipeline for document corpora.
No Cloud Required — Fully local; no API calls, no data leaving your machine.

System Requirements

NVIDIA GPU (CUDA-capable)
NVIDIA CUDA Toolkit — version matching your GPU drivers
Python 3.10+

Verify your CUDA installation:

nvcc --version

If the command is not recognized, add the CUDA bin directory to your system PATH.

Installation

# Clone the repository
git clone https://github.com/rAdvirtua/locvec.git
cd locvec

# Install required dependencies
pip install -r requirements.txt

# Compile and install the library (editable mode)
pip install -e .

Note: Editable mode (-e) is required to trigger local compilation of the C/CUDA extensions via the setup.py build script.

Setting Up a Local LLM with Ollama

LocVec uses Ollama to run LLMs locally. Follow these steps to get a model running before querying with LocVec.

1. Install Ollama

Download and install Ollama for your platform from ollama.com/download, or via the terminal:

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

For Windows, use the installer from the website.

2. Pull a Model

Once Ollama is installed, pull a model. Phi-3 is recommended for low-VRAM setups:

# Lightweight — good for 4–6 GB VRAM
ollama pull phi3

# Alternatively, for higher quality output
ollama pull llama3
ollama pull mistral

Browse the full model library at ollama.com/library.

3. Verify the Model is Running

ollama run phi3

You should see an interactive prompt. Type /bye to exit. Ollama runs as a background service automatically, so no manual server start is needed before using LocVec.

Tip: Match your model choice to available VRAM. On GPUs with less than 6 GB, stick to phi3 or gemma:2b to avoid OOM errors during the inference phase.

Usage

The usage example requires PyMuPDF for PDF text extraction. Install it with:

pip install pymupdf

import fitz
import time
import os
from locvec import LocalVec

def extract_and_chunk_pdf(file_path, chunk_size=300):
    text = ""
    with fitz.open(file_path) as doc:
        for page in doc:
            text += page.get_text()

    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Explicitly initialize the engine with the preferred model
engine = LocalVec(model_name='all-MiniLM-L6-v2')

pdf_path = "research_paper.pdf"
chunks = extract_and_chunk_pdf(pdf_path)

print(f"Extracted {len(chunks)} shards from {pdf_path}")
engine.build_full_index(chunks)

query = "Summarize the key findings of this document."
start_time = time.perf_counter()

idx, context = engine.search(query)

if idx < 0:
    print(f"Error: {context}")
    exit()

latency_ms = (time.perf_counter() - start_time) * 1000
print(f"Search completed in {latency_ms:.2f}ms")

engine.offload_encoder()

print("\nAI Response:")
for token in engine.query_llm_stream("phi3", query, context):
    print(token, end="", flush=True)
print("\n")

PDF Chat Session

An interactive chat session over a PDF document. Maintains a short history window and enforces concise responses via a strict token limit.

import os
import json
import fitz
import urllib.request
from locvec import LocalVec

class PDFChatSession:
    def __init__(self, pdf_path, db_name="chat_store"):
        self.pdf_path = pdf_path
        self.lv = LocalVec(db_name=db_name)
        self.history = []
        self.llm_endpoint = "http://localhost:11434/api/generate"
        self.model = "phi3"

    def ingest_pdf(self, shard_size=600):
        doc = fitz.open(self.pdf_path)
        full_text = "".join([page.get_text().replace('\n', ' ') for page in doc])
        shards = [full_text[i:i + shard_size] for i in range(0, len(full_text), shard_size)]
        self.lv.build_full_index(shards)

    def _call_llm_stream(self, prompt):
        payload = {
            "model": self.model,
            "prompt": prompt,
            "stream": True,
            "options": {"temperature": 0.2, "num_ctx": 4096, "num_predict": 150}
        }
        req = urllib.request.Request(
            self.llm_endpoint,
            data=json.dumps(payload).encode('utf-8'),
            headers={'Content-Type': 'application/json'}
        )
        with urllib.request.urlopen(req) as response:
            full_response = ""
            for line in response:
                if line:
                    chunk = json.loads(line.decode('utf-8'))
                    token = chunk.get('response', '')
                    full_response += token
                    print(token, end="", flush=True)
                    if chunk.get('done'):
                        break
            return full_response

    def start_chat(self):
        while True:
            query = input("\nUser: ").strip()
            if query.lower() in ['exit', 'quit']:
                break

            _, context_shards = self.lv.search(query, top_k=2)
            history_str = "\n".join([f"Q: {h['q']}\nA: {h['a']}" for h in self.history[-1:]])

            full_prompt = (
                f"SYSTEM: You are a concise technical assistant. "
                f"Answer using ONLY the context provided. "
                f"STRICT RULE: Your response MUST be 3-4 lines maximum.\n\n"
                f"CONTEXT: {context_shards}\n\n"
                f"HISTORY: {history_str}\n\n"
                f"USER QUERY: {query}\n\n"
                f"CONCISE ANSWER:"
            )
            print("\nAI: ", end="")
            answer = self._call_llm_stream(full_prompt)
            print()
            self.history.append({"q": query, "a": answer})

if __name__ == "__main__":
    PDF_FILE = "research_paper.pdf"
    session = PDFChatSession(PDF_FILE)
    session.ingest_pdf()
    session.start_chat()

IVF vs. Brute-Force Benchmark

Compares LocVec's IVF-Voronoi search against a GPU flat (brute-force) baseline at scale. Useful for validating speedup on your specific hardware.

import time
import gc
import fitz
import random
import numpy as np
import torch
import json
import urllib.request
from locvec import LocalVec

SOURCE_PDF = "research_paper.pdf"
QUERY = "What is the specific methodology used in this paper?"
TARGET_SHARDS = 200000
OLLAMA_MODEL = "phi3"

def prepare_corpus(file_path, total_shards):
    doc = fitz.open(file_path)
    base_text = [p.get_text().replace('\n', ' ') for p in doc]
    multiplier = (total_shards // len(base_text)) + 1
    corpus = []
    for i in range(multiplier):
        for shard in base_text:
            if len(corpus) < total_shards:
                corpus.append(f"{shard} [id_{i}]")
    random.seed(42)
    random.shuffle(corpus)
    return corpus

def stream_inference(query, context):
    payload = {
        "model": OLLAMA_MODEL,
        "prompt": f"Context: {context}\n\nQuery: {query}\n\nTechnical Answer:",
        "stream": True,
        "options": {"temperature": 0.1, "num_predict": 350}
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode('utf-8'),
        headers={'Content-Type': 'application/json'}
    )
    with urllib.request.urlopen(req) as response:
        for line in response:
            if line:
                chunk = json.loads(line.decode('utf-8'))
                yield chunk.get('response', '')
                if chunk.get('done'):
                    break

if __name__ == "__main__":
    corpus = prepare_corpus(SOURCE_PDF, TARGET_SHARDS)

    lv = LocalVec(db_name="benchmark_final")
    lv.build_full_index(corpus)

    time.sleep(20)  # Thermal recovery for laptop hardware

    embeddings_path = f"{lv.db_prefix}_offline_embeddings.bin"
    raw_data = np.fromfile(embeddings_path, dtype=np.float32).reshape(-1, lv.dims)
    all_vectors_gpu = torch.from_numpy(raw_data).to('cuda').half()
    q_vec_gpu = torch.from_numpy(lv.encoder.encode(QUERY)).to('cuda').half()

    # GPU Brute-Force
    torch.cuda.synchronize()
    _ = torch.norm(all_vectors_gpu[:100] - q_vec_gpu, dim=1)
    t0 = time.perf_counter()
    dist = torch.norm(all_vectors_gpu - q_vec_gpu, dim=1)
    flat_idx = torch.argmin(dist).item()
    torch.cuda.synchronize()
    latency_flat = (time.perf_counter() - t0) * 1000
    flat_context = corpus[flat_idx]

    # LocVec IVF
    t1 = time.perf_counter()
    lv_idx, lv_context = lv.search(QUERY, top_k=4)
    latency_lv = (time.perf_counter() - t1) * 1000

    peak_vram = torch.cuda.max_memory_allocated() / (1024**2)
    lv.offload_encoder()

    del all_vectors_gpu
    gc.collect()
    torch.cuda.empty_cache()

    speedup = latency_flat / latency_lv
    print(f"Brute-Force: {latency_flat:.2f}ms | LocVec IVF: {latency_lv:.2f}ms | Speedup: {speedup:.2f}x | Peak VRAM: {peak_vram:.2f}MB")

    for token in stream_inference(QUERY, flat_context):
        print(token, end="", flush=True)
    print()

    for token in stream_inference(QUERY, lv_context):
        print(token, end="", flush=True)
    print()

User API Reference

To use the vector engine in your own scripts, import the main class:

from locvec import LocalVec

`LocalVec(model_name='all-MiniLM-L6-v2')`

Initializes the vector engine and loads the embedding model into memory.

model_name (str, optional): Specify a different SentenceTransformers model from HuggingFace. Defaults to a fast, lightweight 384-dimension model ('all-MiniLM-L6-v2').

`engine.build_full_index(texts)`

Ingests a list of text shards and builds the optimized vector index on your GPU. It automatically calculates the best cluster distribution for your dataset size.

texts (list of str): A list of strings (e.g., your chunked PDF pages or text documents).

`engine.search(query_text)`

Performs a high-speed, hardware-accelerated vector search against your indexed documents to find the most relevant context.

query_text (str): The question or prompt you want to search for.
Returns (tuple): (idx, context) — The integer ID of the best-matching chunk and the actual text string of that chunk.

`engine.offload_encoder()`

Flushes the embedding model from your GPU's VRAM. Always call this before passing the retrieved context to a Local LLM. This guarantees your VRAM is completely free for generation, preventing out-of-memory crashes on consumer cards (4GB/6GB).

`engine.query_llm_stream(model, query, context)`

Sends the user's query alongside the retrieved context to your local Ollama instance, returning a stream of the response.

model (str): The name of the Ollama model to use (e.g., "phi3" or "llama3").
query (str): The user's original question.
context (str): The text payload returned by the engine.search() method.
Returns (generator): Yields individual string tokens for real-time console or UI streaming.

How It Works

CUDA Backend & Memory Management

Most local RAG implementations fail on consumer hardware because the embedding model and the LLM compete for the same VRAM pool. LocVec solves this with a two-phase lifecycle:

Retrieval Phase — Custom CUDA kernels handle vector similarity computations.
Warm Loading / VRAM Handoff — The encoder is flushed from GPU memory immediately after retrieval, clearing space for LLM inference and preventing Out-of-Memory (OOM) errors.

IVF Indexing

LocVec transitions search complexity from linear to clustered using Inverted File Indexing (IVF):

Step	Description
Clustering	K-Means partitions the vector space into K Voronoi cells dynamically based on dataset size.
Coarse Search	Query is matched to the nearest centroid.
Fine Search	k-NN search is performed only within the relevant cluster.
Complexity	Reduced from O(N) → O(N/K).

Project Structure

locvec/
├── src/
│   ├── cuda/        # Custom CUDA kernels (k-means training, IVF search)
│   ├── bridge/      # C wrappers interfacing Python with CUDA memory management
│   └── locvec/      # Core Python library, CTypes bindings, and high-level API
├── setup.py         # Build configuration for C/CUDA extensions
├── testcase.py      # Reference implementation for PDF ingestion and search
└── README.md        # Documentation

License

See LICENSE for details.

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

This version

0.1.8

May 20, 2026

0.1.7 yanked

May 20, 2026

Reason this release was yanked:

incomplete code

0.1.6 yanked

May 20, 2026

Reason this release was yanked:

incomplete code

0.1.5

May 20, 2026

0.1.4

May 20, 2026

0.1.3

May 20, 2026

0.1.2

May 20, 2026

0.1.1

May 20, 2026

0.1.0

May 20, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

locvec-0.1.8.tar.gz (343.7 kB view details)

Uploaded May 20, 2026 Source

File details

Details for the file locvec-0.1.8.tar.gz.

File metadata

Download URL: locvec-0.1.8.tar.gz
Upload date: May 20, 2026
Size: 343.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for locvec-0.1.8.tar.gz
Algorithm	Hash digest
SHA256	`f8d20ff74b47511a1ceee9d5929be3cfbe198b5f6e940d725b3ee223d5ff309a`
MD5	`1820f8b8402e5dc0a87bd1649ccd7e54`
BLAKE2b-256	`88f10d517da2639210ec030adf81f5b0221a428eea466b4e11c32c824a7becf3`

See more details on using hashes here.

locvec 0.1.8

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

LocVec — Local Vector Retrieval Engine

Overview

Features

System Requirements

Installation

Setting Up a Local LLM with Ollama

1. Install Ollama

2. Pull a Model

3. Verify the Model is Running

Usage

PDF Chat Session

IVF vs. Brute-Force Benchmark

User API Reference

LocalVec(model_name='all-MiniLM-L6-v2')

engine.build_full_index(texts)

engine.search(query_text)

engine.offload_encoder()

engine.query_llm_stream(model, query, context)

How It Works

CUDA Backend & Memory Management

IVF Indexing

Project Structure

License

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

File details

File metadata

File hashes

`LocalVec(model_name='all-MiniLM-L6-v2')`

`engine.build_full_index(texts)`

`engine.search(query_text)`

`engine.offload_encoder()`

`engine.query_llm_stream(model, query, context)`