Self-improving memory store for LLM agents: hybrid retrieval, clustered retrieval-induced forgetting, optional LLM enrichment
Project description
lethe
Λήθη: the ancient Greek personification of forgetfulness, and one of the five rivers of the underworld.
A memory store for LLM agents. Hybrid BM25 + dense retrieval, cross-encoder reranking, clustered retrieval-induced forgetting (RIF), and an optional LLM enrichment layer at write time.
Read this before the benchmark numbers
Most memory-tool READMEs advertise NDCG or recall in the 95 to 99% range. Those numbers come from evaluating on a per-query fresh database of ~50 sessions, at session granularity, with recall@5. That's a roughly 2000× easier task than "find this specific turn among 200,000". Some implementations additionally leak ground-truth annotations into the index.
lethe's numbers are reported on the full 199,509-turn LongMemEval S corpus, turn-level retrieval, NDCG@10, no leakage. Random baseline is 0.005%. Vector-only search gets 0.14. BM25 alone gets 0.24. When other tools are run on this setup, they land in the same 0.1 to 0.4 range. The 99% numbers don't translate.
No 99%. No cherry-picking. These are the honest numbers on a hard benchmark:
| Stage | NDCG@10 | notes |
|---|---|---|
| Hybrid BM25 + vector (RRF) | 0.217 | basic retrieval (most popular) |
| + cross-encoder reranking | 0.293 | +35% from semantic reranking |
| + clustered+gap RIF (checkpoint 13) | 0.312 | +6.5% from retrieval-induced forgetting |
| + LLM enrichment, on covered queries | 0.473 | +21% on the 75 queries where the answer turn was Haiku-enriched |
Enrichment only helps where it's been applied; the full-corpus number is diluted by the uncovered 425 queries. Full methodology in BENCHMARKS.md. The research path, including 10 failed approaches before anything worked, is in RESEARCH_JOURNEY.md.
https://github.com/user-attachments/assets/ea6e9657-7525-4fed-ba40-df7a14e6b9f4
Lethe full stack vs basic hybrid RRF on the 199,509-turn LongMemEval corpus. NDCG@10 over 5,000 queries.
What's new here: retrieval-induced forgetting
RIF is a well-studied phenomenon in cognitive psychology (Anderson, 1994). In plain words:
When you retrieve one thing, related things that could have been retrieved but weren't get quietly suppressed. Ask someone to name a fruit; they say "apple". A few minutes later, "pear" and "orange" are measurably harder for them to recall, even though nothing asked about them. Remembering is also an act of forgetting.
lethe brings this mechanism to vector memory:
- Every retrieval surfaces a candidate pool (BM25 ∪ vector top-k).
- The cross-encoder picks the winners.
- Entries that reached the pool but lost don't disappear. They accumulate a per-cluster suppression score.
- On future retrievals in the same query cluster ("travel-like queries", "debugging-like queries"), those entries get their scores penalised before the cross-encoder sees them, freeing slots for entries that were previously crowded out.
The key detail: suppression is cue-dependent, not global. An entry suppressed for travel queries stays fully available for food queries. Global suppression was ~5× weaker in our benchmarks. The clustering is what made it work.
Why this matters: a memory that evolves
Most memory stores are static caches. You put strings in; you retrieve them by similarity. The retrieval function never changes.
RIF makes retrieval behaviour adapt over time without retraining any model:
- Entries that keep losing to more relevant alternatives get quieter in their own cluster.
- Entries that keep winning get amplified via tier promotion (naive → gc → memory) and exemption from decay.
- The system learns "in the context of this kind of query, you usually mean that chunk, not this one" from its own retrieval history. No fine-tuning. No extra LLM calls. Just O(n_clusters × n_entries) bookkeeping.
Concrete example: you store both "I prefer window seats on flights" and "my wife needs aisle seats". Travel queries keep picking your preference first. The system quietly penalises your wife's preference for travel-sounding queries, not globally, just in that cluster. Ask about your spouse's preferences and both entries remain equally findable.
Checkpoint 13 (clustered + rank-gap RIF) gives +6.5% NDCG and +9.5% recall@30 on LongMemEval over the baseline hybrid pipeline. As far as I can find, this is the first implementation of clustered RIF in a production retrieval system.
Claude Code plugin
Lethe ships as a Claude Code plugin that persists memory across sessions per-project:
/plugin marketplace add teimurjan/lethe
/plugin install lethe
Hooks write daily markdown to .lethe/memory/YYYY-MM-DD.md, summarize each turn with claude -p --model haiku (no extra API key), reindex via the lethe CLI, and a memory-recall skill surfaces prior-session context on demand.
See plugins/claude-code/README.md for hook behavior, CLI reference, and local-install instructions.
Quick start
from lethe import MemoryStore
from sentence_transformers import SentenceTransformer, CrossEncoder
store = MemoryStore(
"./my_memories",
bi_encoder=SentenceTransformer("all-MiniLM-L6-v2"),
cross_encoder=CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2"),
)
store.add("I prefer window seats on flights", session_id="trip")
store.add("My wife needs aisle seats", session_id="trip")
store.add("I work at Google as a software engineer", session_id="work")
results = store.retrieve("What are my travel preferences?", k=5)
for entry_id, content, score in results:
print(f" [{score:.1f}] {content}")
store.save()
store.close()
Architecture
Query
│
├── FAISS top-30 (dense vector similarity)
├── BM25 top-30 (sparse keyword match)
│
└── Merge (RRF)
│
└── RIF suppression penalty (per-cluster, gap-based)
│
└── Cross-encoder rerank → top-k
│
└── Update suppression state, affinities, tier
Optional write-time LLM enrichment layer (src/lethe/enrichment.py): before indexing, each memory can be processed by an LLM (default claude-haiku-4-5) to produce a gist, 3 anticipated queries, entities, and temporal markers. All fields index alongside the original text; cross-encoder still scores against original. Attacks the vocabulary-mismatch failure mode.
RIF: technical details
The formula that made it work (checkpoint 13 of 17):
competition_strength = max(0, xenc_rank - initial_rank) / pool_size × sigmoid(-xenc_score)
suppression[eid, cluster] += learning_rate × competition_strength
score_adjusted = base_score - alpha × suppression[eid, cluster]
The rank-gap term only penalises entries that dropped in rank between the initial hybrid retrieval and the cross-encoder rerank AND were actively rejected (negative xenc score). Entries that just lost a close race aren't suppressed. That prevents the system from forgetting legitimately useful context because it happened to rank second once.
k-means runs once (30 clusters) on the bi-encoder query embedding at retrieval time; the cluster assignment is the "cue" coordinate. Based on Anderson's inhibition theory (1994) and the SAM competitive-sampling model (Raaijmakers & Shiffrin, 1981).
Three storage layers
| Layer | File | Purpose |
|---|---|---|
| DuckDB | lethe.duckdb |
Entries, suppression state, rescue cache, stats. Many project DBs can be ATTACHed simultaneously for cross-project search. |
| numpy + FAISS | embeddings.npz, faiss.index |
Vector storage |
| BM25 | In-memory, rebuilt on startup | Sparse keyword index |
Entry lifecycle (germinal-center inspired)
NAIVE → GC → MEMORY
↓
APOPTOTIC
- Naive: new entries, unproven
- GC: retrieved 3+ times, actively evaluated
- Memory: high affinity + frequently retrieved, stable, exempt from decay
- Apoptotic: low affinity + idle > 1000 steps, excluded from search
Useful for long-running agents; doesn't directly improve retrieval quality (that's what RIF and enrichment do).
Deduplication (on add)
- Exact: SHA-256 content hash (free)
- Near-duplicate: cosine similarity > 0.95 (keeps the longer entry)
Install
From PyPI (the distribution name is lethe-memory; the import name is still lethe):
pip install lethe-memory
# or
uv pip install lethe-memory
As the lethe CLI on PATH without a venv:
uv tool install lethe-memory
lethe --version
From source, for development:
git clone https://github.com/teimurjan/lethe && cd lethe
uv venv --python 3.12 && uv pip install -e .
As a Claude Code plugin (see plugin README):
/plugin marketplace add teimurjan/lethe
/plugin install lethe
Global search across projects
Every lethe index registers the project in ~/.lethe/projects.json. lethe search --all then ATTACHes each registered lethe.duckdb as a read-only schema and runs the full hybrid + RIF + cross-encoder pipeline across the union.
# In each project you care about:
lethe index
# Search across all registered projects:
lethe search "mongodb pool" --all --top-k 5
# [p_site_fix_pentest_8a99] [+4.63] 06e6... - MongoDB connection pool issue at 100% rollout…
# [p_lethe_77be] [+3.41] 4670... - lethe search returning real hybrid results…
# Restrict to a subset:
lethe search "auth flow" --projects p_site_fix_pentest_8a99,p_api_server_12ab
# Inspect / manage the registry:
lethe projects list
lethe projects add /path/to/other-project
lethe projects remove <slug-or-path>
lethe projects prune # drop entries whose roots no longer exist
Opt out of auto-registration in three ways:
- Per invocation:
lethe index --no-register - Per project:
lethe config set auto_register false - Globally:
export LETHE_DISABLE_GLOBAL_REGISTRY=1
The registry only stores absolute project paths, never contents. Cross-project search is read-only; it does not mutate RIF state. Local lethe search continues to learn as normal.
DuckDB has no cap on attached databases, so the 10/125 ceiling in SQLite's ATTACH DATABASE doesn't apply. Real-world bottleneck before you'd feel it is the per-project BM25 rebuild, which scales linearly.
Benchmark
# prep LongMemEval
uv run python experiments/data_prep.py --dataset longmemeval
# retrieval-only baseline + RIF variants
uv run python benchmarks/run_benchmark.py
uv run python benchmarks/run_rif_benchmark.py
# LLM enrichment layer (needs ANTHROPIC_API_KEY)
export ANTHROPIC_API_KEY=sk-ant-...
uv run python experiments/enrich_longmemeval.py # one-time, ~$16 for 10k entries
uv run python benchmarks/run_rif_enriched.py # 3-arm benchmark
See BENCHMARKS.md for results tables and methodology details.
Project structure
src/lethe/ # Production library (162 tests, ~95% coverage)
├── memory_store.py
├── markdown_store.py # Markdown chunker for .lethe/memory/*.md
├── cli.py # `lethe` console script
├── union_store.py # Cross-project read-only search via DuckDB ATTACH
├── db.py # DuckDB persistence
├── vectors.py # FAISS + BM25 index
├── reranker.py # Cross-encoder + adaptive depth
├── rif.py # Retrieval-induced forgetting (clustered + gap)
├── enrichment.py # Optional LLM write-time enrichment (Anthropic SDK)
├── dedup.py # Hash + cosine deduplication
└── entry.py # MemoryEntry dataclass + Tier enum
plugins/claude-code/ # Claude Code plugin (hooks + memory-recall skill)
.claude-plugin/ # Marketplace manifest
benchmarks/ # Per-checkpoint benchmark scripts
├── run_*.py
├── _lib/ # Benchmark-only helpers (metrics, etc.)
└── results/ # Raw per-run output markdowns
scripts/ # Reproducibility utilities (dataset prep, enrichment runner)
research/ # Experimental / research code (not production)
├── gc_mutation/ # Germinal-center mutation thread (checkpoints 1-10)
└── sdm/ # Sparse Distributed Memory prototype (checkpoint 15)
tests/ # Production unit tests
Research background
This project started as an experiment porting the immune system's germinal-center mechanism to vector memory. 17 checkpoints so far:
- Checkpoints 1-10 (GC-mutation approaches): all failed to improve retrieval quality. Useful for lifecycle management, not retrieval.
- Checkpoints 11-13 (RIF): cognitive-science inspired. Clustered + gap-formula variant gives +6.5% NDCG, +9.5% recall@30 on the full 500-query eval. First positive learned-retrieval result.
- Checkpoints 14-15 (exploration/rescue, Sparse Distributed Memory prototype): both negative at full scale.
- Checkpoint 16 (extended behavior metrics for checkpoint 13): RIF's gain is primarily from reducing cross-topic retrieval (−1.6pp wrong_family); sibling confusion and stale-fact rate unchanged.
- Checkpoint 17 (LLM enrichment layer): write-time structured extraction via Haiku. On covered queries: +8.3pp NDCG over checkpoint 13, the largest single-lever gain in the journey. Overall diluted by partial coverage; scaling to full coverage is pending.
Full journey in RESEARCH_JOURNEY.md.
License
MIT.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file lethe_memory-0.2.1.tar.gz.
File metadata
- Download URL: lethe_memory-0.2.1.tar.gz
- Upload date:
- Size: 2.8 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ee496ffbc1b19cda44b123d03d698d4504d374d02a6789f14eb8312278e54350
|
|
| MD5 |
3592defb9137e868b0e08e6540c856e0
|
|
| BLAKE2b-256 |
71f5c327af6ad9f31afce66428ced698c8eaf26a43bfa07d3b2b0a2fc9513c0b
|
Provenance
The following attestation bundles were made for lethe_memory-0.2.1.tar.gz:
Publisher:
release-please.yml on teimurjan/lethe
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
lethe_memory-0.2.1.tar.gz -
Subject digest:
ee496ffbc1b19cda44b123d03d698d4504d374d02a6789f14eb8312278e54350 - Sigstore transparency entry: 1328228992
- Sigstore integration time:
-
Permalink:
teimurjan/lethe@ba1116105467eaf2e2aefb4ff45fc7d3a17e6aa6 -
Branch / Tag:
refs/heads/main - Owner: https://github.com/teimurjan
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release-please.yml@ba1116105467eaf2e2aefb4ff45fc7d3a17e6aa6 -
Trigger Event:
push
-
Statement type:
File details
Details for the file lethe_memory-0.2.1-py3-none-any.whl.
File metadata
- Download URL: lethe_memory-0.2.1-py3-none-any.whl
- Upload date:
- Size: 44.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
40ef2c9bc9bdc1b654cd88a1dda1de448351bde95a3cc6e19cc36acab7fc4d3b
|
|
| MD5 |
0065e31323a7434684abc5d49895b1c9
|
|
| BLAKE2b-256 |
5abe538f8e3c793bba3db0c131cc0809531b66c1ea060efc5b089b4d3ec48157
|
Provenance
The following attestation bundles were made for lethe_memory-0.2.1-py3-none-any.whl:
Publisher:
release-please.yml on teimurjan/lethe
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
lethe_memory-0.2.1-py3-none-any.whl -
Subject digest:
40ef2c9bc9bdc1b654cd88a1dda1de448351bde95a3cc6e19cc36acab7fc4d3b - Sigstore transparency entry: 1328229002
- Sigstore integration time:
-
Permalink:
teimurjan/lethe@ba1116105467eaf2e2aefb4ff45fc7d3a17e6aa6 -
Branch / Tag:
refs/heads/main - Owner: https://github.com/teimurjan
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release-please.yml@ba1116105467eaf2e2aefb4ff45fc7d3a17e6aa6 -
Trigger Event:
push
-
Statement type: