A high-performance static internet index for LLM RAG applications
llmindex
🔍 Local semantic search for LLM applications
A lightweight Python library for searching a pre-trained FAISS index locally. Returns URLs and text content ready for your LLM's context window.
Installation
```bash
pip install -e .
```
Quick Start
```python
from llmindex import LLMIndex
import json

# Load the pre-trained index
index = LLMIndex(model_dir="./models")

# Search
results_json = index.search("machine learning algorithms", top_k=5)

# Parse results
results = json.loads(results_json)
for item in results:
    print(f"URL: {item['url']}")
    print(f"Content: {item['content']}\n")
```
Architecture
```mermaid
flowchart LR
    A[User Query] --> B[Encode with SentenceTransformer]
    B --> C["PCA: 384 → 64 dims"]
    C --> D["PackBits to 8-byte binary"]
    D --> E[FAISS Binary Index Search]
    E --> F[Get FAISS Indices]
    F --> G["Lookup → (dataset_id, row_id)"]
    G --> H[Fetch from HuggingFace Datasets]
    subgraph DataSources
        W[[Wikipedia]]
        X[[FineWeb]]
    end
    W --> H
    X --> H
    H --> I["Return Results + Optional Rerank"]
    I --> J[JSON Output]
```
Detailed Architecture Flow
1. Query Encoding
- Input: User search query string
- Component: SentenceTransformer (`all-MiniLM-L6-v2`)
- Output: 384-dimensional dense embedding vector
- Device: Auto-detected (CUDA/GPU or CPU)
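A minimal sketch of this step, assuming the standard `sentence-transformers` API rather than llmindex's internal loading code:

```python
# Sketch of step 1 (illustrative): encode a query with sentence-transformers.
import torch
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("all-MiniLM-L6-v2", device=device)

# encode() returns a (384,) float32 numpy array for a single string
embedding = model.encode("machine learning algorithms")
print(embedding.shape)  # (384,)
```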
2. PCA Compression
- Input: 384-dim embedding
- Component: Pre-trained PCA model
- Process: Project to 64 dimensions
- Output: 64-dim normalized float vector in [0, 1] range
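A sketch of the projection, assuming a scikit-learn PCA fitted offline by train.py; the artifact format in ./models is not documented here, and the exact scaling into [0, 1] is an assumption beyond the stated output range:

```python
# Sketch of step 2 (illustrative): project 384-dim embeddings to 64 dims.
import numpy as np
from sklearn.decomposition import PCA

# Offline: fit on a sample of corpus embeddings (placeholder data here)
corpus = np.random.randn(10_000, 384).astype(np.float32)
pca = PCA(n_components=64).fit(corpus)

# Online: project the query embedding, then scale into [0, 1]
query_vec = np.random.randn(384).astype(np.float32)  # stands in for step 1's output
reduced = pca.transform(query_vec.reshape(1, -1))[0]
normalized = (reduced - reduced.min()) / (reduced.max() - reduced.min() + 1e-9)
```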
3. Binary Quantization
- Input: 64-dim float vector
- Process: PackBits thresholding (value > 0 → 1, else 0)
- Output: 8-byte binary vector (64 bits)
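The quantization is a one-liner with NumPy; a sketch using the threshold rule stated above:

```python
# Sketch of step 3 (illustrative): threshold and pack 64 floats into 8 bytes.
import numpy as np

vec64 = np.random.randn(64).astype(np.float32)  # stands in for the PCA output
bits = (vec64 > 0).astype(np.uint8)  # value > 0 -> 1, else 0
packed = np.packbits(bits)           # shape (8,), dtype uint8: the 64-bit code
```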
4. FAISS Binary Index Search
- Component: FAISS binary index
- Process: k-NN search on binary vectors
- Output: Top-K indices with distances
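A toy version of the search, assuming FAISS's `IndexBinaryFlat`; llmindex presumably loads a prebuilt index from ./models rather than building one on the fly:

```python
# Sketch of step 4 (illustrative): Hamming-distance k-NN on binary codes.
import faiss
import numpy as np

d = 64  # code size in bits
index = faiss.IndexBinaryFlat(d)

# Database codes: uint8 rows of d // 8 bytes each
db = np.random.randint(0, 256, size=(1000, d // 8), dtype=np.uint8)
index.add(db)

query = np.random.randint(0, 256, size=(1, d // 8), dtype=np.uint8)
distances, ids = index.search(query, k=5)  # Hamming distances and row ids
```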
5. Mapping Lookup
- Component: Mapping pickle file
- Format: List of `(dataset_id, row_id)` tuples
- Purpose: Links FAISS indices to original dataset rows
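A sketch of the lookup; the file name `mapping.pkl` is an assumption, as the README only specifies the pickled list-of-tuples format:

```python
# Sketch of step 5 (illustrative): map FAISS row ids back to dataset rows.
import pickle

with open("./models/mapping.pkl", "rb") as f:  # hypothetical file name
    mapping = pickle.load(f)  # e.g. [("wikipedia", 12345), ("fineweb", 678), ...]

faiss_ids = [42, 7, 913]  # ids returned by the binary index search
for dataset_id, row_id in (mapping[i] for i in faiss_ids):
    print(dataset_id, row_id)
```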
6. Context Fetch (Parallel)
- Component: HuggingFace Datasets Server API
- Datasets:
  - Wikipedia (`wikimedia/wikipedia`, config: `20231101.en`)
  - FineWeb (`HuggingFaceFW/fineweb`, config: `CC-MAIN-2025-26`)
- Process: Fetch `url`, `text`, `date`, `source`, `dump`
- Implementation: Up to 16 concurrent HTTP requests
- Note: This step is parallelized and transparent to users
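A sketch of the parallel fetch against the public Datasets Server `/rows` endpoint; the request code is illustrative rather than llmindex's implementation, and the `train` split name is an assumption:

```python
# Sketch of step 6 (illustrative): fetch rows concurrently from the
# HuggingFace Datasets Server.
from concurrent.futures import ThreadPoolExecutor
import requests

API = "https://datasets-server.huggingface.co/rows"

def fetch_row(dataset: str, config: str, row_id: int) -> dict:
    params = {"dataset": dataset, "config": config,
              "split": "train",  # assumed split name
              "offset": row_id, "length": 1}
    resp = requests.get(API, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["rows"][0]["row"]

hits = [("wikimedia/wikipedia", "20231101.en", 12345),
        ("HuggingFaceFW/fineweb", "CC-MAIN-2025-26", 678)]

with ThreadPoolExecutor(max_workers=16) as pool:  # up to 16 concurrent requests
    rows = list(pool.map(lambda h: fetch_row(*h), hits))
```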
7. Optional Reranking
- When enabled: Fetch additional candidates, re-encode with full embeddings, compute cosine similarity, return best results
- Benefit: Improves relevance over pure binary search
- Cost: Additional encoding time
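A sketch of the rerank, assuming cosine similarity over full 384-dim embeddings as described above; llmindex's internals may differ:

```python
# Sketch of step 7 (illustrative): rerank fetched candidates by cosine
# similarity between full query and document embeddings.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "machine learning algorithms"
candidates = ["first candidate text ...", "second ...", "third ..."]

q = model.encode(query, normalize_embeddings=True)
docs = model.encode(candidates, normalize_embeddings=True)

scores = docs @ q              # cosine similarity for unit vectors
order = np.argsort(-scores)    # best match first
reranked = [candidates[i] for i in order]
```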
8. Final Output
- Returns: JSON array with result objects containing URL, content, source, date, dump
API
LLMIndex(model_dir="./models", device=None)
Initialize the search index.
Parameters:
- `model_dir` - Directory containing the pre-trained models
- `device` - `'cuda'` or `'cpu'` (auto-detected if None)
index.search(query, top_k=5) → str
Search and return results as JSON.
Parameters:
- `query` - Search query string
- `top_k` - Number of results (default: 5)
Returns: JSON string with results:

```json
[
  {"url": "https://example.com/page1", "content": "..."},
  {"url": "https://example.com/page2", "content": "..."}
]
```
Getting the Models
The pre-trained models are required to use this library. You have two options:
Option 1: Train Locally (see train.py)
```bash
python train.py --target-docs 100000000 --save-dir ./models/
```
Option 2: Download from HuggingFace
Coming soon - pre-trained models will be available on HuggingFace Hub.
Use Cases
RAG with LLMs
```python
from llmindex import LLMIndex
from openai import OpenAI
import json

index = LLMIndex()
query = "What are transformers?"

# Get context
results_json = index.search(query, top_k=3)
results = json.loads(results_json)
context = "\n".join([r["content"] for r in results])

# Send to LLM
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"Context:\n{context}\n\nQuestion: {query}"
    }]
)
print(response.choices[0].message.content)
```
Local Search Engine
```python
from llmindex import LLMIndex
import json

index = LLMIndex()
while True:
    query = input("Search: ")
    results_json = index.search(query, top_k=10)
    results = json.loads(results_json)
    for i, item in enumerate(results, 1):
        print(f"{i}. {item['url']}")
```
Requirements
- Python 3.8+
- PyTorch
- FAISS
- Sentence Transformers
- NumPy, scikit-learn
See requirements.txt for exact versions.
Performance
- Search latency: ~50ms (GPU) to 200ms (CPU) per query
- Memory: ~4GB for 100M document index
- Disk: ~2GB for all model files
License
MIT
Download files
File details
Details for the file llmsearchindex-1.0.0.tar.gz.
File metadata
- Download URL: llmsearchindex-1.0.0.tar.gz
- Upload date:
- Size: 6.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `08edc26f23a19e8cfb39619dbcfa34f7920c17b6d3079003c6b44a3c4f4da01d` |
| MD5 | `49301e23027b771e04d8950cd82ac2e4` |
| BLAKE2b-256 | `cb922623a01e18d3e0a159489cd03f330f972b769b26dceb8380745bc5ef7c56` |
File details
Details for the file llmsearchindex-1.0.0-py3-none-any.whl.
File metadata
- Download URL: llmsearchindex-1.0.0-py3-none-any.whl
- Upload date:
- Size: 7.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `ca3f6f02b9a835058993c787a10360dc5031514d51b745e9f3cf86b1856c43a9` |
| MD5 | `01525c1e246f5f3cbdd8522634122070` |
| BLAKE2b-256 | `5b862b7d693bc540975114f8eb75d6abefb9b55416912f907f808b41fabea29e` |