# PRIME

**Predictive Retrieval with Intelligent Memory Embeddings**

> Predict what you need before you search.
## What Makes PRIME Different
Traditional RAG systems are reactive: they embed your query and search for similar documents every single turn. This leads to:
- Over-retrieval: Wasted compute searching when context hasn't changed
- Suboptimal results: Query embeddings don't always match ideal context embeddings
- Memory fragmentation: Similar content stored as separate, redundant entries
PRIME is predictive. Inspired by Meta FAIR's VL-JEPA, it addresses each of these problems directly:
| Problem | PRIME's Solution |
|---|---|
| When to retrieve? | Semantic State Monitor detects topic shifts via variance |
| What to retrieve? | Embedding Predictor predicts ideal context before searching |
| How to store? | Memory Cluster Store consolidates similar memories automatically |
```
Traditional RAG: Query → Embed → Search Every Time → Retrieve → Generate

PRIME:           Query → Monitor Variance → Predict Target → Targeted Search → Generate
                               ↓                   ↓
                     (skip if same topic)  (search for predicted ideal, not query)
```
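The "Monitor Variance" step can be illustrated with a small sketch (illustrative only, not PRIME's actual implementation; `window_variance` is a hypothetical name): the Semantic State Monitor tracks recent turn embeddings and measures how far the window has drifted from its centroid.

```python
import numpy as np

def window_variance(embeddings: np.ndarray) -> float:
    """Mean squared distance of the window's embeddings from their centroid.

    A low value means the conversation is circling one topic; a rising value
    suggests a topic shift and, past a threshold, warrants retrieval.
    """
    centroid = embeddings.mean(axis=0)
    return float(np.mean(np.sum((embeddings - centroid) ** 2, axis=1)))

# Identical embeddings -> zero variance; diverging ones -> positive variance
same_topic = np.tile(np.array([1.0, 0.0, 0.0]), (4, 1))
topic_shift = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(window_variance(same_topic))   # 0.0
print(window_variance(topic_shift))  # 0.5
```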
## Quick Start

### Installation

```bash
# Basic install
pip install prime-rag

# With all optional dependencies
pip install "prime-rag[all]"

# Development install from source
git clone https://github.com/Mathews-Tom/PRIME.git
cd PRIME
uv sync
```
### Library Usage (Recommended for Getting Started)

```python
from prime import PRIME, PRIMEConfig

# Initialize with the testing config (in-memory, no external services)
config = PRIMEConfig.for_testing()
prime = PRIME(config)

# Process a conversation turn
response = prime.process_turn(
    "What is machine learning?",
    session_id="demo_session",
)
print(f"Action: {response.action.value}")  # continue, prepare, retrieve, or retrieve_consolidate
print(f"Boundary crossed: {response.boundary_crossed}")
print(f"Variance: {response.variance:.4f}")

# If retrieval was triggered, you get memories
if response.retrieved_memories:
    for mem in response.retrieved_memories:
        print(f"  [{mem.similarity:.2f}] {mem.content[:100]}...")

# Record a response to memory for future retrieval
prime.record_response(
    "Machine learning is a subset of AI that enables systems to learn from data...",
    session_id="demo_session",
)

# Ingest external knowledge
prime.write_external_knowledge(
    "JEPA (Joint Embedding Predictive Architecture) predicts in embedding space...",
    metadata={"source": "research_paper", "topic": "ml_architectures"},
)

# Direct memory search (bypasses SSM boundary detection)
results = prime.search_memory("neural network architectures", k=5)
```
### REST API Usage

Start the server:

```bash
# Using uvicorn directly
uvicorn prime.api.app:app --host 0.0.0.0 --port 8000

# Or with uv
uv run uvicorn prime.api.app:app --reload
```

Make requests:

```bash
# Process a turn
curl -X POST http://localhost:8000/api/v1/process \
  -H "Content-Type: application/json" \
  -d '{"input": "What is JEPA?", "session_id": "demo"}'

# Write to memory
curl -X POST http://localhost:8000/api/v1/memory/write \
  -H "Content-Type: application/json" \
  -d '{"content": "JEPA predicts in embedding space rather than pixel space."}'

# Search memory
curl -X POST http://localhost:8000/api/v1/memory/search \
  -H "Content-Type: application/json" \
  -d '{"query": "embedding prediction", "k": 5}'

# Health check
curl http://localhost:8000/api/v1/health
```
### Python SDK (for API Clients)

```python
from prime.client import PRIMEClient, PRIMEClientSync

# Async client
async with PRIMEClient(base_url="http://localhost:8000") as client:
    response = await client.process_turn("What is JEPA?")
    print(f"Action: {response.action}")
    await client.write_memory("JEPA uses joint embedding spaces.")

# Sync client
with PRIMEClientSync(base_url="http://localhost:8000") as client:
    response = client.process_turn("What is JEPA?")
```
## Framework Integrations

### LangChain

```python
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

from prime import PRIME, PRIMEConfig
from prime.adapters.langchain import PRIMERetriever

# Initialize PRIME
prime = PRIME(PRIMEConfig.for_testing())

# Create a LangChain retriever
retriever = PRIMERetriever(
    prime=prime,
    mode="process_turn",  # or "search" for direct search
    session_id="langchain_session",
    top_k=5,
)

# Use in a chain
llm = ChatOpenAI(model="gpt-4")
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
answer = qa_chain.invoke("What is predictive retrieval?")
```
### LlamaIndex

```python
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.llms.openai import OpenAI

from prime import PRIME, PRIMEConfig
from prime.adapters.llamaindex import PRIMELlamaIndexRetriever

# Initialize PRIME
prime = PRIME(PRIMEConfig.for_testing())

# Create a LlamaIndex retriever
retriever = PRIMELlamaIndexRetriever(
    prime=prime,
    mode="process_turn",
    session_id="llamaindex_session",
)

# Build query engine
query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    llm=OpenAI(model="gpt-4"),
)
response = query_engine.query("Explain memory consolidation in PRIME")
```
Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ PRIME SYSTEM │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ User Input │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────-───┐ │
│ │ SEMANTIC STATE MONITOR (SSM) │ │
│ │ ───────────────────────────── │ │
│ │ • Encodes input via Y-Encoder │ │
│ │ • Maintains sliding window │ │
│ │ • Calculates variance from centroid│ │
│ │ • Triggers on boundary crossing │ │
│ └──────────────┬─────────────────────-──┘ │
│ │ │
│ ┌─-────────┴─────────┐ │
│ │ Boundary crossed? │ │
│ └───-──────┬─────────┘ │
│ No │ │ Yes │
│ │ ▼ │
│ │ ┌────────────────────────────────────-──┐ │
│ │ │ EMBEDDING PREDICTOR │ │
│ │ │ ─────────────────── │ │
│ │ │ • Takes context window + query │ │
│ │ │ • Transformer predicts target Ŝ_Y │ │
│ │ │ • Trained with InfoNCE loss │ │
│ │ └──────────────┬──────────────────────-─┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌─────────────────────────────────────-─┐ │
│ │ │ MEMORY CLUSTER STORE (MCS) │ │
│ │ │ ────────────────────────── │ │
│ │ │ • FAISS/Qdrant vector search │ │
│ │ │ • Searches with PREDICTED embedding│ │
│ │ │ • Auto-clusters similar memories │ │
│ │ │ • Consolidates into prototypes │ │
│ │ └──────────────┬─────────────────────-──┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌────────────────────────────────────────────────────────-──┐ │
│ │ RESPONSE │ │
│ │ • retrieved_memories: List[MemoryReadResult] │ │
│ │ • boundary_crossed: bool │ │
│ │ • variance: float │ │
│ │ • action: CONTINUE | PREPARE | RETRIEVE | CONSOLIDATE │ │
│ └────────────────────────────────────────────────────────-──┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
## Core Components

| Component | Purpose | Key Behavior |
|---|---|---|
| Semantic State Monitor | Decide when to retrieve | Tracks conversation trajectory; triggers on variance threshold |
| Embedding Predictor | Decide what to retrieve | Predicts ideal context embedding before search |
| Memory Cluster Store | Storage + retrieval | Auto-clusters, consolidates, and searches memories |
| Y-Encoder | Text → embedding | Encodes content for storage and prediction targets |
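As a rough illustration of how the Memory Cluster Store's auto-clustering can work (a sketch under assumptions, not PRIME's code; `assign_to_cluster` is a hypothetical helper): a new memory joins the most similar cluster prototype when cosine similarity clears a cutoff, otherwise it seeds a new cluster.

```python
import numpy as np

def assign_to_cluster(vec, prototypes, similarity_threshold=0.85):
    """Return the index of the cluster this memory joins.

    Joins the most similar existing prototype if cosine similarity clears
    the threshold; otherwise appends the vector as a new cluster prototype.
    The 0.85 default mirrors the MCSConfig example in this README.
    """
    vec = vec / np.linalg.norm(vec)
    best_idx, best_sim = -1, -1.0
    for i, proto in enumerate(prototypes):
        sim = float(vec @ (proto / np.linalg.norm(proto)))
        if sim > best_sim:
            best_idx, best_sim = i, sim
    if best_idx >= 0 and best_sim >= similarity_threshold:
        return best_idx
    prototypes.append(vec)
    return len(prototypes) - 1

protos = [np.array([1.0, 0.0])]
print(assign_to_cluster(np.array([1.0, 0.1]), protos))  # 0 (joins existing cluster)
print(assign_to_cluster(np.array([0.0, 1.0]), protos))  # 1 (new cluster)
```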
## Action States

```
CONTINUE              → Stay on topic, no retrieval needed
PREPARE               → Approaching boundary (variance > 0.7θ), pre-warm caches
RETRIEVE              → Topic shift detected (variance > θ), retrieve context
RETRIEVE_CONSOLIDATE  → Major shift (variance > 2θ), retrieve + consolidate clusters
```
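The thresholds above read as a simple decision ladder. A minimal sketch (illustrative only; `classify_action` is a hypothetical name, and θ here defaults to 0.15 to match the SSMConfig example later in this README):

```python
def classify_action(variance: float, theta: float = 0.15) -> str:
    """Map the SSM's variance reading to an action, per the ladder above.

    Checks run from the largest shift down, so each reading maps to
    exactly one action.
    """
    if variance > 2 * theta:
        return "retrieve_consolidate"
    if variance > theta:
        return "retrieve"
    if variance > 0.7 * theta:
        return "prepare"
    return "continue"

print(classify_action(0.05))  # continue
print(classify_action(0.12))  # prepare
print(classify_action(0.20))  # retrieve
print(classify_action(0.40))  # retrieve_consolidate
```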
## Configuration

### Environment Variables

```bash
# API Server
PRIME_HOST=0.0.0.0
PRIME_PORT=8000
PRIME_WORKERS=4
PRIME_RATE_LIMIT=60

# Vector Database (for production)
QDRANT_HOST=localhost
QDRANT_PORT=6333
QDRANT_API_KEY=your-api-key  # optional

# Model Configuration
PRIME_ENCODER_MODEL=sentence-transformers/all-MiniLM-L6-v2

# Evaluation
PRIME_RAGAS_ENABLED=true
PRIME_RAGAS_MODEL=gpt-4.1-mini
```
### Programmatic Configuration

```python
from prime import PRIMEConfig
from prime.ssm import SSMConfig
from prime.mcs import MCSConfig
from prime.predictor import PredictorConfig
from prime.encoder import YEncoderConfig

# Full custom configuration
config = PRIMEConfig(
    ssm=SSMConfig(
        variance_threshold=0.15,   # Lower = more sensitive
        window_size=5,             # Conversation turns to track
        smoothing_factor=0.3,      # EMA smoothing
    ),
    mcs=MCSConfig(
        similarity_threshold=0.85, # Cluster membership cutoff
        consolidation_threshold=5, # Min size to consolidate
        index_type="faiss",        # or "qdrant" for production
    ),
    predictor=PredictorConfig(
        input_dim=384,
        hidden_dim=768,
        output_dim=384,
        num_layers=2,
        num_heads=4,
    ),
    y_encoder=YEncoderConfig(
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        embedding_dim=384,
    ),
)
prime = PRIME(config)
```
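For intuition, a predictor with the shape described by `PredictorConfig` might look like the following sketch (an assumption-laden illustration; `TargetPredictor` is a hypothetical class, and PRIME's actual architecture may differ):

```python
import torch
import torch.nn as nn

class TargetPredictor(nn.Module):
    """Sketch: project 384-d turn embeddings into a wider hidden space,
    run a small transformer encoder over the context window, and read the
    predicted target embedding from the final position."""

    def __init__(self, input_dim=384, hidden_dim=768, output_dim=384,
                 num_layers=2, num_heads=4):
        super().__init__()
        self.proj_in = nn.Linear(input_dim, hidden_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj_out = nn.Linear(hidden_dim, output_dim)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (batch, turns, input_dim) -> (batch, output_dim)
        hidden = self.encoder(self.proj_in(context))
        return self.proj_out(hidden[:, -1])

predictor = TargetPredictor()
predicted = predictor(torch.randn(2, 5, 384))
print(predicted.shape)  # torch.Size([2, 384])
```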
### Preset Configurations

```python
# Testing (in-memory, small models)
config = PRIMEConfig.for_testing()

# Production (from environment variables)
config = PRIMEConfig.from_env()

# Validate for production deployment
config.validate_for_production()  # Raises if misconfigured
```
## API Reference

### Core Methods

| Method | Description | Returns |
|---|---|---|
| `process_turn(text, session_id, force_retrieval, k)` | Process conversation turn with SSM boundary detection | `PRIMEResponse` |
| `record_response(content, session_id, metadata)` | Store LLM response to memory | `MemoryWriteResult` |
| `write_external_knowledge(content, metadata)` | Ingest external documents | `MemoryWriteResult` |
| `search_memory(query, k, min_similarity)` | Direct memory search (bypasses SSM) | `List[MemoryReadResult]` |
| `get_diagnostics()` | System health and metrics | `PRIMEDiagnostics` |
| `reset_session(session_id)` | Clear session state | `None` |
### REST Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/api/v1/process` | POST | Process turn with predictive retrieval |
| `/api/v1/memory/write` | POST | Write content to memory |
| `/api/v1/memory/search` | POST | Search memory directly |
| `/api/v1/health` | GET | Health check |
| `/api/v1/diagnostics` | GET | System diagnostics |
| `/api/v1/clusters` | GET | List memory clusters |
| `/api/v1/config` | GET/PUT | View/update configuration |
## Examples

See the `examples/` directory for runnable demonstrations:

- `library_quickstart.py` - In-process library usage
- `sdk_client.py` - Python SDK client example
- `api_server.md` - REST API usage guide
## Production Notes

### Vector Database Selection

| Mode | Backend | Use Case |
|---|---|---|
| `index_type="faiss"` | FAISS (in-memory) | Development, testing, small deployments |
| `index_type="qdrant"` | Qdrant | Production, persistence, horizontal scaling |
### Known Limitations (Alpha)

- **Cluster state is in-memory**: Even with Qdrant for vectors, cluster bookkeeping (membership, prototypes) lives in-process. A service restart loses cluster state unless you implement persistence.
- **Single-process sessions**: Session context (for the Predictor) is stored in memory per process. Multi-worker deployments need external session storage.
- **Predictor is untrained by default**: The Embedding Predictor initializes with random weights. For production quality, train it on your domain using the InfoNCE objective.
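The InfoNCE objective mentioned above can be sketched as follows (a common in-batch-negatives formulation; PRIME's exact training code may differ, and `info_nce` is a hypothetical helper):

```python
import numpy as np

def info_nce(pred: np.ndarray, target: np.ndarray, temperature: float = 0.07) -> float:
    """InfoNCE with in-batch negatives: each predicted embedding should score
    highest against its own target row, treating the other rows as negatives."""
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    target = target / np.linalg.norm(target, axis=1, keepdims=True)
    logits = pred @ target.T / temperature
    # Row-wise log-softmax; the "correct" class for row i is column i
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Perfect predictions drive the loss toward zero; mismatches raise it
targets = np.eye(4)
print(info_nce(targets, targets))                      # near 0
print(info_nce(np.roll(targets, 1, axis=0), targets))  # large
```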
### Scaling Recommendations

```python
# For production with Qdrant
config = PRIMEConfig(
    mcs=MCSConfig(
        index_type="qdrant",
        qdrant_host="qdrant.your-infra.com",
        qdrant_port=6333,
        qdrant_api_key="...",
    ),
)
```
## Development

```bash
# Install dev dependencies
uv sync --group dev

# Run tests
uv run pytest

# Run tests with coverage
uv run pytest --cov=src --cov-report=term-missing

# Type checking
uv run mypy src

# Linting
uv run ruff check .
uv run ruff check . --fix  # Auto-fix
```
## Project Status
PRIME is in alpha. The core architecture is implemented and functional, but:
- The Embedding Predictor needs training on domain-specific data
- Production deployment patterns are still being refined
- API may change before 1.0
Contributions and feedback welcome.
## References
- VL-JEPA: Video-Language Joint Embedding Predictive Architecture - The architectural inspiration
- PRIME Project Overview - Full design specification
- InfoNCE Loss - Training objective for the Predictor
## License

MIT License - see LICENSE for details.

*PRIME: Because the best retrieval is the one you predicted you'd need.*