
Routing Memory (RM)

Lightweight long-term memory for LLM agents via vector-quantized routing.

RM replaces brute-force dense retrieval with a VQ codebook that compresses N items into K centroid buckets. Queries probe only the top-n centroids, rerank by dot product, and return results — achieving 768x per-item compression and 99%+ recall at a fraction of the latency and memory cost.

pip install routing-memory

Quick Start

from rm import RoutingMemory

memory = RoutingMemory()

# Store memories
memory.add("User prefers dark mode for all applications")
memory.add("Meeting with Alice scheduled for March 15 at 2pm")
memory.add("Project deadline is end of Q1 2026")

# Search
results = memory.search("what are the user's UI preferences?", top_k=3)
for r in results:
    print(f"  [{r['score']:.3f}] {r['text']}")

Features

| Feature | Description |
| --- | --- |
| VQ Codebook | MiniBatchKMeans clustering with adaptive K = ceil(N/B_target) |
| Multi-probe retrieval | Query top-n centroids, collect candidates, rerank by dot product |
| Score filtering | Threshold-based filtering saves tokens by dropping low-relevance results |
| Drift detection | Rolling qerr monitoring with automatic alarm when distribution shifts |
| Online adaptation | EMA centroid updates, bucket splits, idle centroid pruning |
| Persistence | SQLite backend for durable storage across sessions |
| Pluggable backends | Swap embedding models or storage engines via clean interfaces |
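
The adaptive-K rule above (K = ceil(N/B_target)) works out as follows; the `adaptive_k` helper and the default bucket target of 64 are illustrative, not the package's internals:

```python
import math

def adaptive_k(n_items: int, b_target: int = 64) -> int:
    """Pick a codebook size so each bucket holds ~b_target items on average."""
    return max(1, math.ceil(n_items / b_target))

print(adaptive_k(5000))  # 79 centroids for 5K items at a bucket target of 64
```

Scaling K with N keeps bucket occupancy roughly constant, so per-query candidate counts stay bounded as the memory grows.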

Architecture

Query ──> Encode ──> Top-n Centroids ──> Collect Candidates ──> Dot-Product Rerank ──> Filter ──> Results
                          |                                           |
                     VQ Codebook                               Score Threshold
                     (K centroids)                               (tau >= 0.3)
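
The pipeline above can be sketched end-to-end with NumPy; the toy codebook, random data, and `search` function below are illustrative, not RM's internals:

```python
import numpy as np

rng = np.random.default_rng(42)
dim, n_items, k = 8, 100, 5

# Unit-normalized item embeddings and a toy codebook of k centroids.
items = rng.normal(size=(n_items, dim))
items /= np.linalg.norm(items, axis=1, keepdims=True)
centroids = rng.normal(size=(k, dim))
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

# Each item stores only its nearest-centroid ID.
assign = (items @ centroids.T).argmax(axis=1)

def search(query, n_probes=2, top_k=3, tau=0.0):
    q = query / np.linalg.norm(query)
    # 1. Probe the top-n centroids for the query.
    probe = (centroids @ q).argsort()[::-1][:n_probes]
    # 2. Collect candidates from those buckets only.
    cand = np.where(np.isin(assign, probe))[0]
    # 3. Rerank by dot product, then filter by score threshold.
    scores = items[cand] @ q
    order = scores.argsort()[::-1][:top_k]
    return [(int(cand[i]), float(scores[i])) for i in order if scores[i] >= tau]

hits = search(items[0])  # querying with item 0's own embedding returns item 0 first
```

Only the probed buckets are scored, which is why the candidate set (and hence latency) stays small even as N grows.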

Compression: Each item needs only a 2-byte centroid assignment vs 384x4 = 1536 bytes for dense fp32. That's 768x compression.
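
The compression figure is straightforward arithmetic:

```python
dense_bytes = 384 * 4   # 384-dim fp32 embedding
vq_bytes = 2            # one uint16 centroid ID per item
print(dense_bytes // vq_bytes)  # 768
```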

Recall: With n=4 probes on 5K items: R@5 = 0.9916 (99.2% of dense baseline).

API Reference

RoutingMemory

RoutingMemory(
    db_path="rm_memory.db",      # SQLite path (None for in-memory)
    embedding_model="all-MiniLM-L6-v2",  # any sentence-transformers model
    n_probes=3,                  # centroids to probe per query
    score_threshold=0.3,         # minimum retrieval score
    seed=42,                     # random seed
)

Methods:

| Method | Description |
| --- | --- |
| add(text, item_id=None, metadata=None) | Store a memory item; returns the item ID |
| search(query, top_k=5, threshold=None) | Semantic search; returns a list of dicts |
| search_with_signals(query, top_k=5) | Search with routing signals (confidence, margin, qerr) |
| stats() | Memory statistics (item count, K, compression, drift) |
| codebook_info() | Codebook details (K, dim, Gini, dead codes) |
| save() | Persist codebook state |
| close() | Close storage connection |

Low-level Components

from rm import Codebook, L1Retriever, DriftMonitor

# Direct codebook access
cb = Codebook(dim=384, seed=42)
cb.fit(embeddings, item_ids)
centroid_id, qerr = cb.encode(query_embedding)
conf = cb.conf(query_embedding)
margin = cb.margin(query_embedding)

# Retriever
retriever = L1Retriever(cb, n_probes=4, top_k=10, score_threshold=0.3)
result = retriever.query(query_embedding)  # returns L1Result

# Drift monitor
monitor = DriftMonitor()
alarm = monitor.record(qerr, margin)  # returns DriftAlarm or None
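
A minimal illustration of the rolling-qerr idea behind drift detection; this toy class is not the package's DriftMonitor, and the window and factor are illustrative defaults:

```python
from collections import deque

class RollingDrift:
    """Alarm when the recent mean quantization error exceeds the
    long-run mean by a fixed factor."""

    def __init__(self, window=10, factor=1.5):
        self.recent = deque(maxlen=window)
        self.factor = factor
        self.total = 0.0
        self.count = 0

    def record(self, qerr):
        self.recent.append(qerr)
        self.total += qerr
        self.count += 1
        if self.count < 2 * self.recent.maxlen:
            return False  # still warming up
        baseline = self.total / self.count
        return sum(self.recent) / len(self.recent) > self.factor * baseline

monitor = RollingDrift()
stable = [monitor.record(1.0) for _ in range(30)]   # no alarms on a stable stream
shifted = [monitor.record(2.0) for _ in range(10)]  # alarm fires after the shift
```

A persistent rise in qerr means incoming items sit far from every centroid, which is the signal that the codebook should be refit or adapted online.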

Experiment Suite

RM ships with 13 reproducible experiments (7 hypothesis tests + 6 application benchmarks).

# Run all experiments
python -m rm.experiments.run_all

# Run specific experiments
python -m rm.experiments.run_all --select H1 H2 A4

Results Summary

| Exp | Name | Key Metric | Result |
| --- | --- | --- | --- |
| H1 | Codebook Fundamentals | Fidelity@K=64 | 0.7415 |
| H2 | Retrieval Quality | R@5 (n=4 probes) | 0.9916 |
| H3 | Score-Based Filtering | Savings@tau=0.7 | 59.1% (R@5=0.958) |
| H4 | Adaptive K Heuristics | Best heuristic | sqrtN (lowest Gini) |
| H5 | Drift Detection | Alarm latency | 11 episodes |
| H6 | Multi-Encoder Robustness | RM/Dense ratio spread | 0.0036 |
| H7 | Storage & Latency | Per-item compression | 768x vs fp32 |
| A1 | MS-MARCO Passage Retrieval | R@5 (n=4) | 0.9585 |
| A2 | LoCoMo Conversational Memory | R@5 | 0.9934 |
| A3 | Enrichment Generalization | Delta RM | +0.053 |
| A4 | Million-Scale (1M items) | R@5 | 0.8556 |
| A5 | Pareto Frontier | RM dominant at n>=4 | 91.8% R@5 @ 2.8ms |
| A6 | Bucket Imbalance | Gini (100K, K=256) | 0.3628 |

All experiments use real embeddings (all-MiniLM-L6-v2, d=384) and seed=42.

Project Structure

rm/
  rm/                     # Core package
    __init__.py
    codebook.py           # VQ codebook (MiniBatchKMeans, adaptive K)
    retrieval.py          # Multi-probe retrieval with dot-product rerank
    filtering.py          # Score-based result filtering
    drift.py              # Distribution drift detection
    memory.py             # RoutingMemory high-level API
    embeddings/           # Pluggable embedding backends
      base.py             # Abstract interface
      local.py            # sentence-transformers wrapper
    storage/              # Pluggable storage backends
      base.py             # Abstract interface
      sqlite.py           # SQLite persistence
  experiments/            # 13 reproducible experiments
    run_all.py            # Experiment runner (--select support)
    shared/               # Data generation, plotting utilities
    h1_codebook/ .. h7_storage/   # Hypothesis tests
    a1_msmarco/ .. a6_imbalance/  # Application benchmarks
  tests/                  # pytest test suite
  pyproject.toml          # Package configuration
  LICENSE                 # MIT

Development

git clone https://github.com/AhmetYSertel/routing-memory.git
cd routing-memory
pip install -e ".[dev]"
pytest tests/ -v

Citation

If you use RM in your research, please cite the HGA paper:

@article{sertel2026hga,
  title={Hybrid Governance Architecture: Structured Memory and Adaptive Routing for LLM Agents},
  author={Sertel, Ahmet Yigit},
  year={2026}
}

License

MIT
