mnemo - a zero-dependency memory layer for AI agents: value-ranked recall, per-type decay, consolidation, and semantic+lexical auto-mode. Extracted from an autonomous research system running over ~9,000 notes.
Project description
Mnemosyne · mnemo
A memory layer for AI agents — the one that already runs an autonomous research OS over ~6,000 notes.
Memory is the mother of the Muses. An agent with no memory has no ideas.
pip install agora-mnemo · PyPI · Hugging Face · DOI 10.5281/zenodo.21128550 · MIT · v0.4.1
mnemo is the recall + consolidation core of Agora — an
autonomous research system — distilled into a single file with no required dependencies. It does
the four things agent memory actually needs, the way that held up running in production for weeks.
Most "agent memory" libraries are demos. This one is extracted from a system that has used it daily to curate a 6,000-note knowledge base, and whose consolidation behaviour we have measured, not assumed (see Provenance below).
Install
# single file, zero dependencies
curl -O https://raw.githubusercontent.com/DanceNitra/agora/main/mnemo/mnemo.py
Use
from mnemo import Mnemo
m = Mnemo("memory.json") # persists to JSON; or Mnemo("memory.json", embed=my_model)
m.remember("Pre-trend tests catch only ~31% of fatal DiD bias.", tags=["causal"], value=3, mtype="semantic")
m.recall("difference in differences", k=5) # relevance × value, decayed by the memory's per-type half-life
m.consolidate(keep=200) # the "dream" pass: hubs, dedup, STATE-TOGGLE, keep-budget
m.consolidate_clusters(threshold=15) # cluster-TRIGGERED: consolidate only a topic that's grown dense
m.contradictions() # flag incompatible memories for REVIEW (never deletes)
m.value_by_cohort() # value reported per tag/time-block, not per memory
Bring any text→vector function as embed= for semantic recall; with none, mnemo falls back to a
forgiving lexical match so it runs anywhere, today. Once the store grows past the threshold, recall
fuses lexical (BM25) + semantic with Reciprocal Rank Fusion. On high-lexical-overlap agent memory
(e.g. LoCoMo) the fused hybrid measurably beats either channel alone (recall@20 +0.06 over the best
single channel, 9/10 conversations, conversation-level bootstrap CI excludes 0; receipt:
probes/locomo_retrieval_map.py); where the embedder already dominates
(paraphrase-heavy corpora, see benchmarks) fusion adds little. mode='auto' fuses; mode='lexical' /
'semantic' force a single channel.
Poison-resistant recall: recall(..., influence_only=True) (0.4.0)
Retrieval-time / embedding-geometry defenses do not stop memory poisoning in general. We red-teamed
mnemo with a real AgentPoison-style single-instance attack (Chen et al., NeurIPS 2024; PoisonedRAG, Zou
et al., USENIX Security 2025): a plain-English trigger sentence in one poisoned memory hijacks raw
top-1 retrieval 88–100%, it is scale-invariant (60→10 000 memories), it evades a perplexity
filter (natural triggers have natural perplexity), and coherence/outlier retrieval defenses don't
generalize across encoders. The layer that does generalize is influence-gating by corroboration:
recall(..., influence_only=True) returns only memories that earned the same bar as episodic→semantic
graduation (a credited good outcome, or ≥2 distinct-source links). Retrieve freely for context; gate what
drives an action. Measured: single-instance poison rank-1 hijack → 0% on MiniLM/BGE/Contriever and
at every scale, because an injected poison never earns corroboration while real memories earn it through
use — and it generalizes precisely because it lives in provenance metadata, not embedding geometry.
Honest cost (a calibration tradeoff): a rare-but-true memory that hasn't earned corroboration is filtered
too (recall 1.00 corroborated vs 0.08 uncorroborated), so this is for adversarial / untrusted-ingestion
use. It raises attacker cost (defeating it needs ≥3 coordinated records with ≥2 forged independent
provenances), it does not make poisoning impossible. Receipts: probes/agentpoison_influence_gate.py,
probes/agentpoison_influence_gate_validation.py.
Soft metadata filter: recall(prefer=..., prefer_trust=...) (0.4.1)
A hard metadata filter (where={"speaker": x}) deletes non-matching memories — great when the filter is
right, but when your extractor guesses the wrong value it hard-deletes the answer. The soft version
only boosts matching memories, weighted by how much you trust the cue this call, and leaves everything
else rankable: recall(q, prefer={"speaker": x}, prefer_trust=t), t∈[0,1] (0 = no filter, 1 = strong
preference). Pass a low prefer_trust when the match is weak/ambiguous so the filter backs off toward
plain recall. The point is to weight by the a-priori reliability of the extraction (e.g. alias-match
strength: exact-name hit → ~1.0, no-name/ambiguous guess → ~0.0), not by the extractor model's own
self-reported confidence (which is corrupted exactly when it's wrong). MEASURED end-to-end through
recall() on LoCoMo (receipt: probes/locomo_soft_prefer_filter.py):
with an extractor that is reliable on exact-name questions (5% wrong) but guesses on ambiguous ones (67%
wrong), alias-strength-weighted prefer scores recall@20 0.718 (+0.144 over no filter, best of all,
10/10 conversations) and — on the subset where the extractor picked the wrong speaker — recovers to
0.315 vs the hard filter's 0.110 (which craters by deleting the right answer). Soft prefer gives the
filter's upside without the hard filter's downside. Reversible: prefer=None = legacy recall.
Use it as an MCP server (any Claude / Cursor / agent client)
mnemo ships an MCP stdio server so any MCP-compatible agent can
use it as long-term memory — remember (with a per-type decay prior), value-ranked recall,
consolidate, consolidate_clusters, contradictions, value_by_cohort, forget (verified erasure).
mnemo.py stays
zero-dependency; only the server needs the SDK:
pip install "mcp[cli]"
curl -O https://raw.githubusercontent.com/DanceNitra/agora/main/mnemo/mnemo.py
curl -O https://raw.githubusercontent.com/DanceNitra/agora/main/mnemo/mnemo_mcp.py
MNEMO_PATH=./agent_memory.json python mnemo_mcp.py # speaks MCP over stdio
Register it with a client — e.g. Claude Code (.mcp.json) or Claude Desktop
(claude_desktop_config.json):
{
"mcpServers": {
"mnemo": {
"command": "python",
"args": ["/abs/path/to/mnemo/mnemo_mcp.py"],
"env": { "MNEMO_PATH": "/abs/path/to/agent_memory.json" }
}
}
}
For semantic recall, point it at any OpenAI-compatible embeddings endpoint via
MNEMO_EMBED_URL / MNEMO_EMBED_MODEL / MNEMO_EMBED_KEY; with none set it uses the lexical
fallback. The agent then calls recall(query) before reasoning and remember(fact) as it learns —
its memory is value-ranked and append-only, not a recency buffer.
The four operations
| op | what it does |
|---|---|
remember(text, tags, value, mtype, key) |
append-only raw capture, absolute UTC time, never edited; mtype ∈ {episodic, semantic, procedural} sets the decay prior (events fade fast, durable facts slow, rules barely). Optional key = a deterministic (subject, relation) supersession key: a new value retires every active record with the same key — no similarity threshold, no LLM — so recall never serves the stale value (bi-temporal: a back-filled earlier value can't overwrite the current one) |
recall(query, k, where=…) |
value-ranked retrieval: relevance × value, decayed by the memory's per-type half-life (access resets the clock), so important durable memories beat both merely-similar and stale ones. Optional where = a metadata pre-filter (the cheap filter-before-you-rank lever): field → scalar / list / operator ($gte $lte $gt $lt $in $nin $ne $contains), matched top-level then meta, ALL fields AND-ed — e.g. a hard time-range where={"valid_from":{"$gte":t0,"$lte":t1}} or a closed-set entity where={"speaker":{"$in":[…]}}. Measured to beat retriever choice on LoCoMo (probes/locomo_metadata_prefilter.py); it's a HARD filter, so on lossy/predicted extraction keep it loose (a wrong filter hard-deletes the answer). Reinforcement is relevance-weighted (a bullseye hit reinforces value more than one that squeaked into top-k, so a weak-but-frequent false positive can't go immortal); a repeatedly-recalled episodic memory graduates to semantic only when corroborated — by an earned outcome, or by ≥2 distinct canonical sources (entity-resolved before counting, so sybil variants of one origin — Wikipedia / wikipedia.org / a full URL — collapse to one and can't mint durability); and a memory whose source was later contradicted is provenance-demoted + flagged stale_derived |
consolidate(keep) |
the dream pass: flag universal-matcher hubs, link near-duplicates, apply the state-toggle guard (a polarity clash supersedes, doesn't merge), supersede the low-value surplus — only adds a derived layer |
consolidate_clusters(threshold) |
cluster-triggered consolidation: consolidate a semantic cluster only once it's grown past threshold — sparse topics keep their raw episodes, dense ones don't grow unbounded |
contradictions() |
flag mutually-incompatible related memories (similarity-gated) for human review |
forget(ids, where) |
the one op that truly deletes (the rest is append-only): hard-removes the matched records and scrubs their ids from every survivor's links + toggle pointers + the vec/token caches, so a forgotten memory can't resurface via recall, a consolidation link, or the dream pass. For erasure / right-to-be-forgotten, poison removal, or a hard correction — measured 15/15 on a verified-forgetting severe-test |
Five rules it won't break (each one cost us to learn)
- Raw capture is immutable. Consolidation adds links and markers; it never overwrites the source. This is what stops the slow accuracy drift of LLM-rewritten memory.
- Absolute timestamps at write time. Relative/derived times rot the moment they're consolidated.
- Value-ranked, type-aware decay. Retention is
value × a per-type half-life, not recency or access-frequency alone. A uniform access-reset clock keeps merely-popular memories while a load-bearing-but-cold fact — queried once a month, prevents a destructive action — starves; we measured exactly that failure. The fix is that the half-life is set by kind, not by read count: episodic events fade in days, semantic facts in months, procedural rules barely at all. A cold-but-critical fact survives by being typed semantic/procedural (long half-life × its high value), not by frequent reads; access only resets the clock within a type's window. - Value is reported at the cohort level (tag / time-block), never per-memory.
- Contradictions are flagged, never auto-resolved. Silent rewrites destroy trust in the whole memory.
Provenance — why these rules, with receipts
mnemo's design isn't taste; it's what Agora's lab measured:
- Semantic recall beats keyword recall, and the gap widens with scale — as the store grows to
the ~6,000-note full corpus, lexical
recall@5decays from 0.94 (small store) to 0.25, while semantic holds at ~0.65 — ≈2.6× at full scale (Agora Labb4c260); on paraphrase queries semanticrecall@5is 0.86 vs 0.20 lexical (3501f1). The embedder is the real lever at scale; the lexical overlap match is the zero-dependency floor that still runs anywhere on a small store. (Honest footnote: pruning universal-matcher hub notes lifts lexical recall ~20% only when a store is link-spammed, and does not move semantic recall — it's a lexical/hybrid optimisation, not a headline.) - Value-ranked consolidation — under a keep-budget, ranking what to keep by value beats FIFO/random, and the advantage scales super-linearly as the budget shrinks (≈1.8× at half budget → ≈4× at one-eighth), surviving heavy estimation noise.
- Retention must blend value with recency, not decay on access alone — we simulated a
half-life-with-access-reset policy (a popularity signal) against a value-aware blend under a
shrinking budget, with value made deliberately anti-correlated with access-frequency for a
load-bearing-but-cold subset. At a 30% keep-budget the access-decay policy retained only 2.8%
of the high-value/low-frequency memories and 20% of total value, vs 100% and 64% for
the blend — about 3× more value kept (the gap persists, ≈2.2× retained value, even at a 7%
budget). Pure access-frequency decay starves the rarely-queried-but-critical memories; forgetting
must consume an explicit value channel separate from access recency. (Agora Lab
19d802.) - Supersession needs a deterministic key, not embedding similarity — replicating an external
result (MemStrata / Yadav, arXiv 2606.26511) on our own local
nomicstack: a cosine-similarity classifier separating a contradicted fact from a rephrased duplicate scores AUROC ~0.61 (near chance) — a contradiction is often more embedding-similar to the original than a true rephrase is. A similarity-based store therefore serves the stale value ~42% of the time; the deterministic(subject, relation, object)supersession key (remember(..., key=...)) drives that to 0% (Agora Labexp_supersession_replication, severe-test 8/8). This is why supersession is a key, not a threshold. - No single recall mechanism survives all operating points — only the layered store does —
head-to-head on a synthetic evolving + contaminated stream (stable / superseded / poisoned facts,
local
nomic): a naive cosine top-1 store scores 42% (fine on stable, but blind to supersession — 0/8 on updated facts — and fooled by repeated lies); a recency store 67% (fixes supersession but serves the freshest lie — 0/8 on poison);mnemo— deterministic supersession key + corroboration gate + value-ranking — is 100%, robust across all three. Each single mechanism wins one regime and loses another (the memory operating-point trap), which is why the durable layer needs all three together (probemnemo/probes/operating_point_memory.py). - Cohort-level value — per-memory outcome attribution is statistically underpowered at n-of-1 (the best proxy reached only ~0.36 power at realistic sample sizes); the cohort is where the signal lives. Hence rule 4.
- Contradiction detection runs in production over the 6,000-note vault; the lesson that it must flag, not auto-edit (rule 5) is why silent rewrites are forbidden.
(Methods + numbers live in the Agora track record: https://dancenitra.github.io/agora/.)
The second_brain thinking layer
mnemo_mcp gives an agent memory. second_brain_mcp gives it a second brain to think over —
point it at any folder of Markdown notes (an Obsidian vault, a Zettelkasten, a docs/ tree) and an
MCP client (Claude Desktop, Claude Code, Cursor, your own agent) gets the substrate to reason
against those notes: pull what's relevant, find where the network is blind, surface non-obvious
bridges, isolate the claims worth checking, and generate ideas by named methods.
The split that keeps it honest. The server returns retrieval + structure; the calling LLM does the reasoning. The tool is the memory and the map; the agent is the mind. There is no LLM call inside this server — it scores, links, and slices your notes, then hands the material back. So the claims below are about what an agent did with the tools, not about the tool "thinking" on its own. No autonomous oracle.
Runs today, zero config. It indexes your notes into an in-process mnemo store at startup; with
no embedder it uses the lexical-overlap fallback. An embedder (MNEMO_EMBED_URL/MODEL/KEY) is optional
and matters at scale: on a ~6,000-note vault, lexical recall@5 decays from 0.94 (small store) to
0.25 at full corpus while semantic holds ~0.65 — ≈2.6× (Agora Lab b4c260); on paraphrase
queries semantic recall@5 is 0.86 vs 0.20 lexical (3501f1).
NOTES_DIR=/path/to/your/vault python second_brain_mcp.py # run after a flat download of both files
See it run (no setup)
python examples/demo.py runs every tool against a tiny bundled sample vault — no MCP client, no
key, no embedder. (Regenerate the GIF with python examples/_make_gif.py (Pillow) or
examples/demo.tape + vhs.)
The same session in text:
▸ relevant_notes("how does feedback speed up learning", k=3)
→ Deliberate Practice (Learning) relevance 0.60
→ Expected Value (Decisions) relevance 0.20
▸ find_gaps() → isolated: ["Sourdough Starter"] (the one note with no [[links]])
▸ bridge_candidates("Deliberate Practice")
→ Habit Loops (Habits, DISTANT domain) — both turn on "feedback latency", and nothing links them
▸ extract_claims("Deliberate Practice")
→ "Feedback latency is the hidden variable: the longer the gap between an action
and its feedback, the slower the learning." (line 3 — go ground or challenge it)
▸ idea_methods() → 10 recipes (Hidden-Connection Bridge, Missing-Reciprocity, …)
That bridge_candidates hit is the point: a connection across two folders that you never linked —
the agent now writes the mapping (or rejects it). The tool found the material; the agent does the thinking.
Register it with an MCP client (point args at the file's absolute path so mnemo.py, which sits
beside it, is found):
{
"mcpServers": {
"second_brain": {
"command": "python",
"args": ["/abs/path/to/second_brain_mcp.py"],
"env": {
"NOTES_DIR": "/abs/path/to/your/vault",
"SECOND_BRAIN_INDEX": "/abs/path/to/second_brain_index.json"
}
}
}
}
| tool | returns |
|---|---|
index_status |
notes indexed, folder spread, resolved NOTES_DIR (call first; 0 ⇒ fix NOTES_DIR) |
relevant_notes |
the k most relevant notes by relevance × accrued value (value accrues with use; a cold index is effectively relevance-ranked), with excerpts |
coverage_gap |
the negative space of a question: top notes + a measured completeness score + the explicit sub-terms with no supporting note — a WYSIATI guard so the agent sees what's missing and doesn't answer a tidy-but-incomplete context with false confidence |
find_gaps |
isolated/under-linked notes + thin folders — where the network is blind (noisy on a tiny vault; earns its keep at scale) |
bridge_candidates |
distant notes (different folder, no link) that are semantically close = candidate connections; the agent writes or rejects the mapping |
extract_claims |
claim-like sentences from a note so the agent can ground or challenge them |
idea_methods |
a toolkit of named idea-generation recipes, so generation is principled, not a vibe |
Dogfood result, stated honestly: pointed at the maintainer's own ~6,000-note vault, an agent using
these tools caught a number in his own forecasting note inflated ~7× ("60-78%" vs the real ~6-11%),
surfaced two silently-contradicting notes, and proposed ideas via idea_methods — two of which were
then severe-tested in Agora's separate research lab (not inside this server) and held. The LLM did
the reasoning; the corrections still warrant a source-check before public citation.
Trust & safety
- Read-only over your notes. The server reads
NOTES_DIRrecursively; it does noeval, no shell, no subprocess, and writes only its own index file. Symlinks/junctions that point outsideNOTES_DIRare deliberately not followed (so a planted link in a shared/cloned vault can't leak files from elsewhere on disk). - The embedder is a trust boundary. If you set
MNEMO_EMBED_URL, the full text of every note is POSTed there. It's validated at startup —httpsanywhere, plainhttponly to loopback (local Ollama, etc.), and cloud-metadata/link-local targets are refused. Point it only at an endpoint you trust. - Notes over ~2 MB are skipped (configurable via
SECOND_BRAIN_MAX_BYTES) so a single huge file can't exhaust memory.
Status
v0.2 — the core, honest and runnable, now with two MCP servers (mnemo_mcp for memory,
second_brain_mcp for the thinking layer over your notes) and a deterministic supersession key
(remember(..., key=...)) that closes the embedding supersession blind spot. Roadmap: pluggable
vector stores, a hosted tier. Open-core; the core stays free.
MIT-licensed · part of Agora.
Self-maintaining (maintain.py)
The #1 second-brain frustration is maintenance, not capture. maintain.py runs the chore people
stop doing — over a folder of Markdown notes it finds dead [[wikilinks]], orphan notes, stale
notes, near-duplicate clusters, and a vault health score (self_legibility = % of notes in the
link graph's giant component — knowledge debt is a percolation collapse, so it warns before the
cliff). Crucially it turns findings into actions: for each orphan it suggests which existing
note to link it to (re-connecting it to the graph), and flags archive candidates (old +
isolated). It resolves links by filename or frontmatter alias, and dates notes by frontmatter
(not git-reset mtime) — both learned from dogfooding it on a real ~7,700-note vault (it rescued ~300
falsely-flagged orphans). Advisory + safe: it returns a plan and an action list; it never edits,
moves, or deletes a note. And it can apply the fix when you ask: apply_suggestions appends a
marked ## Related (auto-suggested) block of [[links]] to each orphan — additive only, idempotent
(re-running replaces its own block), dry-run by default. python maintain.py runs a verified
round-trip on a synthetic vault (diagnose → suggest → apply); maintenance_report and apply_links
in second_brain_mcp.py expose it to any MCP agent.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file agora_mnemo-0.4.1.tar.gz.
File metadata
- Download URL: agora_mnemo-0.4.1.tar.gz
- Upload date:
- Size: 41.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
7380de31cecef36fe5862f528f42fd7b8777feec89885102220029bbb24898d6
|
|
| MD5 |
bcb33089bbcbf42950771f8193ee200e
|
|
| BLAKE2b-256 |
4b64b1c947982c5e70d81d4264b4cc1c31d5dabf857dffb86c06704040c2c708
|
File details
Details for the file agora_mnemo-0.4.1-py3-none-any.whl.
File metadata
- Download URL: agora_mnemo-0.4.1-py3-none-any.whl
- Upload date:
- Size: 42.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
e15e53f44564dd1b2d783479ab7b15fb3e8bfd46ef1647c5d84764a4d7b46235
|
|
| MD5 |
db46c21a5d7b005e45146a76f3bf79f7
|
|
| BLAKE2b-256 |
a83be47ab81d270fdedb1bf3ab09a32b0614484ea1b953ad4bee180f907c1121
|