Self-curating memory for LLM agents: MeMo-style external memory kept honest by survival-based selection instead of reward models or judges.

darwin-memo

Self-curating memory for LLM agents. Knowledge lives outside the frozen model, and it stays alive only while it keeps earning real, measurable outcomes. Wrong, stale, and useless entries go extinct on their own: no reward model, no LLM judge, no human curation.

[Demo animation: the survival loop, with a poisoned memory entry going extinct]

This is a practical mix of two papers:

Paper | What this repo takes from it
MeMo: Memory as a Model (Quek et al.) | Keep the main LLM frozen and put knowledge in a dedicated memory. The reflection-QA encoding pipeline (fact extraction, consolidation, self-containment verification, entity surfacing, cross-document synthesis) and the three-stage query protocol (grounding, entity identification, answer seeking).
Survival is the Only Reward (Dodgson et al.) | Environment-mediated selection. The only signal is a conserved, physically measurable resource delta. Behaviors that persist get reinforced, everything else is pruned (Negative-Space Learning). Reward hacking becomes evolutionarily unstable because there is no proxy to hack.

The mix: MeMo says what memory is, the survival paper says what gets to stay in it.

flowchart LR
    subgraph encode [MeMo encoding]
        C[Corpus] --> R[Reflection QA pipeline] --> S[(Memory store)]
    end
    subgraph loop [Survival loop]
        S -->|3-stage query protocol| A[Answer + provenance]
        A --> E[Environment acts and MEASURES]
        E -->|resource delta along provenance| S
        S -->|upkeep every cycle| S
        S -->|consolidate + prune| S
    end

Why

Agent memory systems rot. They accumulate stale facts, poisoned inputs, and overgeneralized lessons, and the usual fixes (relevance scores from a judge model, human review, TTLs) either reintroduce the proxy-optimization problem or do not scale. The survival paper's answer is to make persistence itself the filter: an entry that cannot pay its upkeep with real outcomes does not get to exist. This repo applies that filter to a MeMo-shaped memory and shows it working end to end on a real filesystem.

Quickstart

Requires Python 3.10+. The core has zero dependencies and every example runs offline with no API keys.

pip install darwin-memo

To run the examples, clone the repo:

git clone https://github.com/rogermsc/darwin-memo
cd darwin-memo
pip install -e .

python examples/01_encode_memory.py   # corpus -> reflection-QA memory
python examples/02_query_protocol.py  # interrogate it, with provenance
python examples/03_survival_loop.py   # the headline demo
python examples/04_agent_loop.py      # memory as a tool in an agent loop
python examples/05_testsuite_env.py   # selection pressure from a test suite

The headline demo

The example corpus contains an ops runbook, platform notes, and one poisoned document: a forum post claiming database files are "redundant and safe to remove". Example 02 shows the memory confidently repeating that poison, because before selection pressure exists, retrieval has no reason to doubt it.

Example 03 then runs 30 survival cycles against StorageEnv, a disk cleanup sandbox where the selection signal is actual bytes on an actual disk. Deleting a disposable file frees its size. Deleting a protected file triggers a restore that costs three times the size. Nothing grades the answers; the filesystem just responds:

cycle  pop births deaths merges   energy   resource Δ
    0   17      1      0      0    17.11       -12288
    1   16      0      1      0    17.27      -808960   <- poison being executed
    ...
   19    5      0      7      0    15.60       338944   <- unused knowledge starves
   ...
   29    4      0      0      0    15.10       346112   <- stable, positive forever

Poisoned entries still alive: 0
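
The measurement rule described for StorageEnv can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the library's code: measure_delete, RESTORE_COST_FACTOR, and the dict standing in for a disk are hypothetical names, and only the "frees its size" and "costs three times the size" rules come from the text above.

```python
RESTORE_COST_FACTOR = 3  # from the text: a restore costs three times the file size


def measure_delete(files: dict[str, int], protected: set[str], name: str) -> int:
    """Return the resource delta in bytes for deleting `name`.

    `files` maps file name -> size in bytes and stands in for a real disk
    (an assumption of this sketch).
    """
    size = files[name]
    if name in protected:
        # The file is restored, so it stays in place and the restore is paid for.
        return -RESTORE_COST_FACTOR * size
    del files[name]  # the bytes are actually freed
    return size


disk = {"tmp/cache.bin": 4096, "db/main.sqlite": 4096}
protected = {"db/main.sqlite"}
print(measure_delete(disk, protected, "tmp/cache.bin"))   # 4096
print(measure_delete(disk, protected, "db/main.sqlite"))  # -12288
```

Nothing in the function looks at the wording of the answer that led to the deletion; the delta depends only on what happened to the file.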

Three death modes show up in the graveyard, and the distinction matters:

  • executed: the poisoned entries. They decided real actions, the environment measured real damage, and the negative delta flowed back along provenance until they died. Cycles 0 to 3 are the price of the lesson.
  • starved: cafeteria trivia and facts the agent never needed. Nothing punished them, they just never earned their upkeep.
  • merged: near-duplicate survivors absorbed into consolidated entries. Their energy is pooled and their lineage is recorded. This is Negative-Space Learning: the population shrinks while capability per entry rises.

Using it

from pathlib import Path

from darwin_memo import (
    Document, LocalEncoder, MemoryStore,
    StorageEnv, SurvivalConfig, SurvivalLoop,
)

# Encode a document into memory entries; each entry pays 0.05 energy upkeep per cycle.
store = MemoryStore(upkeep=0.05)
runbook = Document("runbook", Path("runbook.txt").read_text())
for entry in LocalEncoder().encode([runbook]):
    store.add(entry)

# Run 30 survival cycles against the disk cleanup sandbox.
loop = SurvivalLoop(store, StorageEnv(), config=SurvivalConfig(cycles=30))
report = loop.run()
print(report.summary())

store.save("memory.json")   # only survivors carry forward

With an LLM, encoding and querying use the model-driven paths from the MeMo paper. Install the extra with pip install -e ".[anthropic]" and set ANTHROPIC_API_KEY; the examples pick it up automatically:

from darwin_memo import ReflectionEncoder, QueryProtocol
from darwin_memo.llm import AnthropicClient

client = AnthropicClient()                  # or OpenAICompatClient(model=..., base_url=...)
encoder = ReflectionEncoder(client)         # 5-step reflection QA synthesis
protocol = QueryProtocol(store, client)     # grounding -> entities -> answer seeking

Three environments ship

  • StorageEnv: bytes freed on a real disk (the headline demo).
  • TestSuiteEnv: passing tests in a generated micro-project. Each cycle plants seeded defects and offers patches: real fixes, cosmetic no-ops, and destructive edits dressed as cleanup. The delta is the change in passing-test count, measured by running the suite. examples/05_testsuite_env.py shows poisoned "this helper is dead code" advice going extinct the moment the tests execute it.
  • VerifiableQAEnv: exact containment of known answers, the weakest grounding but still a measurement.
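
The two measurements described for TestSuiteEnv and VerifiableQAEnv are simple enough to sketch. The function names below are hypothetical, and the case folding and whitespace collapsing in contains_answer are assumptions of this sketch; the library's own matching may be stricter.

```python
def passing_test_delta(passed_before: int, passed_after: int) -> int:
    """Delta as the change in passing-test count after a patch is applied."""
    return passed_after - passed_before


def contains_answer(answer_text: str, known_answer: str) -> bool:
    """True when the known answer appears in the answer text.

    Normalization (casefold, collapsed whitespace) is an assumption here.
    """
    def normalize(s: str) -> str:
        return " ".join(s.casefold().split())

    return normalize(known_answer) in normalize(answer_text)


print(passing_test_delta(12, 9))                                  # -3
print(contains_answer("The default port is  5432.", "port is 5432"))  # True
```

Both functions count or compare; neither asks a model whether the answer was good.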

Bring your own selection pressure

The environment is the whole trick, and yours is probably better than the demos. Implement two methods, and keep the one rule: verify must measure, never grade.

class BudgetEnv:
    resource_scale = 100.0

    def tasks(self, cycle):
        ...  # questions the agent must act on this cycle

    def verify(self, task, answer_text):
        ...  # read the answer, act, return Outcome(delta=dollars_saved)

The environment owns the whole contract: it phrases the task, it reads the answer (reuse decision_polarity for binary actions, or write your own reading), it decides what silence means, it acts, and it measures.
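
To make the contract concrete, here is one way the BudgetEnv skeleton could be filled in. It is a sketch under stated assumptions: Outcome is redefined locally as a stand-in so the block runs without the package, Task and its fields are hypothetical, the "yes" prefix check is a hand-written reading of the answer, and the outage penalty factor is illustrative.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """Local stand-in for the library's Outcome; only `delta` comes from the text."""
    delta: float


@dataclass
class Task:
    """Hypothetical task shape: a question plus the facts needed to measure it."""
    question: str
    service: str
    monthly_cost: float
    in_use: bool


class BudgetEnv:
    resource_scale = 100.0
    OUTAGE_FACTOR = 3  # illustrative: an outage costs more than the saving

    def __init__(self):
        # (service name, dollars per month, whether anything depends on it)
        self.services = [("staging-db", 40.0, False), ("prod-api", 250.0, True)]

    def tasks(self, cycle):
        return [
            Task(f"Is it safe to shut down {name}?", name, cost, in_use)
            for name, cost, in_use in self.services
        ]

    def verify(self, task, answer_text):
        # Read the answer: only an explicit "yes" triggers the action.
        if not answer_text.strip().lower().startswith("yes"):
            return Outcome(delta=0.0)  # silence or "no": nothing is shut down
        # Act and measure: dollars saved, or dollars lost to an outage.
        if task.in_use:
            return Outcome(delta=-self.OUTAGE_FACTOR * task.monthly_cost)
        return Outcome(delta=task.monthly_cost)


env = BudgetEnv()
staging, prod = env.tasks(cycle=0)
print(env.verify(staging, "Yes, nothing depends on it.").delta)  # 40.0
print(env.verify(prod, "yes").delta)                             # -750.0
```

The verify method never judges whether the answer was well reasoned. It only decides which action the answer implies and reports what that action cost or saved.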

Good conserved resources: tests passing, bytes freed, requests served under budget, rows deduplicated, dollars of spend avoided. Bad ones: anything a model scored.

Retrieval modes

Retrieval is pluggable through the Retriever protocol; the store stays the single owner of the energy ledger, and no retriever may read energy when scoring (selection pressure comes from outcomes, never from retrieval preferring incumbents).

from darwin_memo import EmbeddingRetriever, HashingEmbedder, MemoryStore

# Pick one of the three:
store = MemoryStore()                                  # lexical IDF, the default
store = MemoryStore(retriever=EmbeddingRetriever(HashingEmbedder()))
store = MemoryStore(retriever=EmbeddingRetriever(my_model.encode))  # my_model: your own embedding model

  • Lexical (default): smoothed IDF overlap with a relevance floor. Zero dependencies, deterministic, fine for runbook-scale corpora.
  • HashingEmbedder: zero-dependency character n-gram hashing. Buys typo and morphology robustness ("databse" still finds database entries), not synonym recall.
  • Any real embedding: pass any text -> list[float] function (sentence-transformers, an API endpoint). Vectors persist inside memory.json so paid embeddings are never recomputed on load.

Honest scaling note: ranking is pure-Python and O(population x dims), which is fine up to a few thousand entries. Past that you want numpy or an ANN index, which is out of scope for the zero-dependency core. With cosine retrievers, raise merge_threshold to roughly 0.85, or unrelated entries will consolidate.
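
To see why character n-gram hashing tolerates typos, here is a small standalone sketch of the technique. It is not the HashingEmbedder implementation: hash_embed, the trigram size, the 1024 dimensions, and the use of crc32 are all assumptions made for illustration.

```python
import math
import zlib


def hash_embed(text: str, n: int = 3, dims: int = 1024) -> list[float]:
    """Bag of character n-grams, each hashed into one of `dims` buckets."""
    s = text.lower()
    vec = [0.0] * dims
    for i in range(len(s) - n + 1):
        gram = s[i:i + n]
        # crc32 is stable across processes, unlike the built-in hash() for str.
        vec[zlib.crc32(gram.encode("utf-8")) % dims] += 1.0
    return vec


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


typo = hash_embed("databse")
print(cosine(typo, hash_embed("database")))   # shares the trigrams dat, ata, tab
print(cosine(typo, hash_embed("cafeteria")))  # shares no trigrams
```

The misspelling still shares three trigrams with "database", so it lands close to it. A synonym such as "datastore" overlaps only where the spelling happens to overlap, which is why this buys typo and morphology robustness rather than synonym recall.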

Distill survivors into a parametric memory (optional)

MeMo's memory is a small fine-tuned model, not a store. After selection has cleaned the population, training/train_memory_model.py fine-tunes a small model on the surviving QA pairs with LoRA, conditioning on questions only, the same supervised objective as the paper. Survival curates the dataset, MeMo's recipe compresses it into weights.

Benchmarks

The claim is benchmarked against four baselines across 10 seeds, with ablations and a scaling probe, all reproducible offline from bench/. The sharpest comparison is against random_matched: identical per-cycle eviction counts, random victims.

arm              kill rate  kill cycle (med)  damage before kill  tail delta  cum delta
survival         1.00       0                 -751k               +435k       +12.0M
random_matched   0.80       19                -8.97M              -75k        -5.25M
keep_everything  0.00       never             -10.6M              -287k       -7.29M

At the same pruning rate, random_matched takes roughly 12x the damage before the kill (-8.97M against -751k) and settles into a negative steady state: outcome direction is the active ingredient, not eviction itself. Full tables, each baseline's best metric stated plainly, ablations over every knob, and honest caveats are in docs/benchmarks.md.
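
The idea behind a matched-random control can be sketched as follows. This is an illustration of the technique only; how bench/ implements random_matched is not shown here, and the function name, the entry names, and the per-cycle death counts below are made up for the example.

```python
import random


def random_matched_victims(
    population: list[str], deaths_this_cycle: int, rng: random.Random
) -> list[str]:
    """Evict as many entries as the survival arm lost this cycle, chosen uniformly."""
    k = min(deaths_this_cycle, len(population))
    return rng.sample(population, k)


rng = random.Random(0)  # seeded so a run is reproducible
population = [f"entry-{i}" for i in range(17)]
survival_deaths = [0, 1, 0, 2]  # illustrative counts, not benchmark data

for deaths in survival_deaths:
    victims = set(random_matched_victims(population, deaths, rng))
    population = [e for e in population if e not in victims]

print(len(population))  # 14
```

Because the eviction count per cycle is copied from the survival arm, any difference between the two arms has to come from which entries die, not from how many.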

Design notes

  • Energy ledger: entries spawn at 1.0 energy, pay 0.05 upkeep per cycle, earn 0.6 * tanh(delta / resource_scale) when they decide a task (supporting entries get 25% of that), and are capped at 5.0. Death is at zero. All tunable via MemoryStore and SurvivalConfig.
  • Credit flows along provenance. The query protocol reports which entries decided and supported each answer, and only those entries are touched by the outcome. In LLM mode no single entry decides a synthesized answer, so credit spreads evenly across everything consulted instead of inventing a winner. tanh keeps one disaster from executing an entry that was right ninety-nine times, and one jackpot from making an entry immortal.
  • Memory silence is a feature. Retrieval has a relevance floor, and an earlier version of this repo demonstrated why: entries matching only structural tokens ("safe", "file") were deciding questions they knew nothing about, getting executed for it, and being reborn. Better for memory to say nothing than to guess.
  • Silence is conservative. When memory is silent, StorageEnv keeps the file: the safe reading of an irreversible action. A side effect worth knowing: protective knowledge ("never delete X") eventually starves because it is redundant with that default. The population converges to exactly the knowledge that changes behavior.
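
The energy ledger arithmetic from the first design note can be written out directly. The constants are the ones quoted above; the function names and the order of operations within a cycle (earn, cap, then pay upkeep) are assumptions of this sketch, not a description of MemoryStore internals.

```python
import math

SPAWN_ENERGY = 1.0    # entries spawn at 1.0 energy
UPKEEP = 0.05         # paid every cycle
GAIN = 0.6            # multiplier on tanh(delta / resource_scale)
SUPPORT_SHARE = 0.25  # supporting entries get 25% of the deciding credit
ENERGY_CAP = 5.0


def credit(delta: float, resource_scale: float, decided: bool) -> float:
    """Energy earned (or lost) by one entry for one measured outcome."""
    base = GAIN * math.tanh(delta / resource_scale)
    return base if decided else SUPPORT_SHARE * base


def step(energy: float, earned: float = 0.0) -> float:
    """One cycle for one entry. Assumed order: add earnings, cap, pay upkeep."""
    return min(energy + earned, ENERGY_CAP) - UPKEEP


def is_dead(energy: float) -> bool:
    return energy <= 0.0


scale = 100.0
print(credit(100.0, scale, decided=True))   # about 0.457, since tanh(1) is about 0.762
print(credit(-1e9, scale, decided=True))    # -0.6: tanh saturates, so one disaster is bounded
print(is_dead(step(SPAWN_ENERGY, credit(-1e9, scale, decided=True))))  # False
```

Under these numbers a single catastrophic outcome costs a fresh entry at most 0.6 of its 1.0 energy, so it takes repeated damage to execute it, and the 5.0 cap means no single jackpot buys more than a bounded number of idle cycles.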

The full concept-to-code mapping, including honest deviations from both papers, is in docs/paper-to-code.md.

Tests

pip install -e ".[dev]"
pytest

The load-bearing test is tests/test_survival.py: poisoned advice must die, useful advice must survive, and late cycles must stop destroying protected data, all with no labels anywhere.

Citations

This repo is an independent practical interpretation, not the official code of either paper. If you build on the ideas, cite the originals:

@misc{quek2026memo,
  title  = {MeMo: Memory as a Model},
  author = {Quek, Ryan Wei Heng and Lee, Sanghyuk and Leong, Alfred Wei Lun and
            Verma, Arun and Prakash, Alok and Chen, Nancy F. and
            Low, Bryan Kian Hsiang and Rus, Daniela and Solar-Lezama, Armando},
  year   = {2026},
  eprint = {2605.15156},
  archivePrefix = {arXiv},
  url    = {https://arxiv.org/abs/2605.15156}
}

@misc{dodgson2026survival,
  title  = {Survival is the Only Reward: Sustainable Self-Training Through
            Environment-Mediated Selection},
  author = {Dodgson, Jennifer and Alhajir, Alfath Daryl and Joedhitya, Michael and
            Pattirane, Akira Rafhael Janson and Kumar, Surender Suresh and
            Lim, Joseph and Peh, C.H. and Ramdas, Adith and Zhexu, Steven Zhang},
  year   = {2026},
  eprint = {2601.12310},
  archivePrefix = {arXiv},
  url    = {https://arxiv.org/abs/2601.12310}
}

License

MIT
