agentmemory

Persistent, graph-backed memory for AI coding agents: Bayesian confidence, FTS5 search, HRR vocabulary bridge, entity-index retrieval, and an MCP server.
Persistent memory for AI coding agents. Your agent remembers what you discussed, decided, and corrected, so the next session does not start from scratch.
Read the handbook · Install · Workflow · Architecture · Benchmarks · Project writeup
Why
When a session ends, your agent forgets everything. You end up re-explaining the project, re-stating the same preferences, and watching the same mistakes happen again.
agentmemory captures decisions, corrections, and context as you work, and hands them back to the agent next time. No manual notes. No context files. Just memory.
Install
```shell
uv pip install git+https://github.com/robot-rocket-science/agentmemory.git
agentmemory setup
```
Restart Claude Code, then in any project:

```
/mem:onboard .
```
Full prerequisites and troubleshooting: docs/INSTALL.md.
What it does
- Remembers automatically. Captures decisions, corrections, and preferences from your conversations without you lifting a finger.
- Learns what matters. Memories that help get stronger over time. Memories that hurt get weaker. The system tunes itself to your project.
- Stays on your machine. Everything lives in local SQLite. No cloud, no vector database, no telemetry unless you opt in.
- Works with any MCP agent. Claude Code is the primary target, but any MCP-compatible client can connect to the server.
A sketch of what using it feels like
```text
Session 1
─────────
you     We decided to use uv for this project, not poetry.
agent   Got it.

...session ends, days pass, new session opens...

Session 2
─────────
you     Set up the environment please.
agent   Using uv, per the project decision from last week.
        Pinning Python 3.12 as configured. Proceeding.
```
The second session starts already knowing. That is the whole pitch.
How it works
Conversations become scored beliefs in a local graph. Each belief gets stronger or weaker based on whether it helped. Retrieval pulls the most relevant subset into the agent's context on every turn, within a fixed token budget.
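The "stronger or weaker" scoring above can be sketched as a Beta-distribution update, which is also the basis for the uncertainty scoring mentioned later. This is an illustrative model, not agentmemory's actual code: the `Belief` class, its field names, and the uniform Beta(1, 1) prior are all assumptions; only the idea of reinforcing helpful beliefs and weakening unhelpful ones comes from the text.

```python
from dataclasses import dataclass

@dataclass
class Belief:
    """Hypothetical belief record: alpha counts helpful uses, beta unhelpful ones."""
    alpha: float = 1.0  # Beta prior pseudo-count of successes
    beta: float = 1.0   # Beta prior pseudo-count of failures

    def reinforce(self, helped: bool) -> None:
        # Bayesian update: each observed outcome increments one Beta parameter.
        if helped:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def confidence(self) -> float:
        # Posterior mean of the Beta distribution.
        return self.alpha / (self.alpha + self.beta)

    @property
    def uncertainty(self) -> float:
        # Beta variance: high when evidence is scarce or conflicting.
        n = self.alpha + self.beta
        return (self.alpha * self.beta) / (n * n * (n + 1))

b = Belief()
for helped in (True, True, True, False):
    b.reinforce(helped)
print(round(b.confidence, 2), round(b.uncertainty, 4))  # -> 0.67 0.0317
```

The appeal of the Beta form is that "strength" and "uncertainty" fall out of the same two counters, with no extra state to maintain per belief.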
Deep dive in the handbook: Chapter 5 - Architecture.
Documentation
The full handbook is at docs/README.md and is structured as a short book with prev/next navigation on every page. Jump to a chapter:
- Part I - Getting Started: Installation · Workflow
- Part II - Reference: Commands · Obsidian
- Part III - Under the Hood: Architecture · Privacy
- Part IV - Benchmarks and Research: Protocol · Results · Research Freeze
Wonder and Reason
agentmemory includes two graph-aware research commands that go beyond simple keyword search. They use the belief graph -- edges like SUPPORTS, CONTRADICTS, SUPERSEDES, CITES -- to surface connected evidence and detect reasoning gaps.
/mem:wonder <topic> -- Deep Research
Wonder is exploratory. You give it a topic and it fans out across the belief graph to collect everything relevant, even things you did not directly search for.
- Retrieves seed beliefs via FTS5 keyword search
- Expands outward along graph edges (BFS, configurable depth)
- Scores uncertainty for each belief using Beta distribution variance
- Detects contradictions between beliefs in the result set
- Outputs a structured context block with three sections: Known Facts (direct hits), Connected Evidence (reached via graph traversal), and Open Questions (high-uncertainty beliefs)
Use wonder when you want to survey what the system knows about a topic before making a decision. It answers: "what do we know, what is connected, and where are we uncertain?"
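The retrieve-then-expand steps above amount to a breadth-first search over typed edges. The sketch below is illustrative only: the `EDGES` dict stands in for the SQLite belief graph, and `expand` is a hypothetical helper, not agentmemory's traversal code.

```python
from collections import deque

# Hypothetical in-memory graph: belief id -> list of (edge_type, neighbour id).
EDGES = {
    "b1": [("SUPPORTS", "b2"), ("CITES", "b3")],
    "b2": [("CONTRADICTS", "b4")],
    "b3": [],
    "b4": [],
}

def expand(seeds, max_depth=2):
    """BFS outward from FTS5 seed hits, collecting connected evidence."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    connected = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue  # configurable depth cap
        for edge, nbr in EDGES.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                connected.append((node, edge, nbr))
                frontier.append((nbr, depth + 1))
    return connected

print(expand(["b1"]))
# -> [('b1', 'SUPPORTS', 'b2'), ('b1', 'CITES', 'b3'), ('b2', 'CONTRADICTS', 'b4')]
```

Here `b4` would land in "Connected Evidence" even though no keyword search matched it directly; that reach-through is the point of the graph expansion.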
/mem:reason <question> -- Hypothesis Testing
Reason is focused. You give it a question or hypothesis and it builds branching consequence paths to evaluate whether the evidence supports it.
- Retrieves seed beliefs, then checks relevance (content-word overlap filter)
- Builds consequence paths -- chains of beliefs linked by edges, with compound confidence decay at each hop
- Checks constraints -- compares paths against locked beliefs for conflicts
- Detects impasses -- four types: ties (contradicting beliefs at similar confidence), gaps (dead-end paths), constraint failures (conflicts with locked beliefs), and no-change (all low-confidence evidence)
- Issues a verdict: SUFFICIENT, INSUFFICIENT, UNCERTAIN, CONTRADICTORY, or PARTIAL
Use reason when you need to evaluate a specific claim or decision. It answers: "does the evidence support this, and if not, where does the reasoning break down?"
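The compound-confidence-decay and verdict steps can be sketched as below. The helpers, the 0.9 per-hop decay factor, and the verdict thresholds are all invented for illustration; only the verdict labels themselves come from the list above.

```python
def path_confidence(confidences, hop_decay=0.9):
    """Compound confidence along a consequence path: multiply each belief's
    confidence, applying a decay penalty per graph hop so long inference
    chains count for less than direct evidence."""
    conf = confidences[0]
    for c in confidences[1:]:
        conf *= c * hop_decay
    return conf

def verdict(path_conf, contradiction=False, tie=False):
    """Hypothetical mapping from path evidence onto the documented labels."""
    if contradiction:
        return "CONTRADICTORY"
    if tie:
        return "UNCERTAIN"
    if path_conf >= 0.6:
        return "SUFFICIENT"
    if path_conf >= 0.3:
        return "PARTIAL"
    return "INSUFFICIENT"

# Three beliefs at 0.9, 0.8, 0.7 confidence, chained over two hops.
print(verdict(path_confidence([0.9, 0.8, 0.7])))  # -> PARTIAL
```

The multiplicative decay is what makes "gap" impasses detectable: a dead-end path's compound confidence simply never clears the threshold for a SUFFICIENT verdict.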
The difference
Wonder is divergent -- cast a wide net, see what is out there. Reason is convergent -- evaluate a specific claim against the evidence. Together they form a research loop: wonder to survey the landscape, reason to test specific hypotheses that emerge from it.
Benchmarks
> [!NOTE]
> About these numbers. I run and publish benchmarks because I believe objective, replicable methodology and transparent result reporting matter, and that readers deserve to see them. I place limited personal weight on the numbers themselves. V&V for agent memory systems is a specialized area where I do not have deep hands-on experience, and I cannot be fully confident that Claude and I have exercised these systems as rigorously as a dedicated evaluator would.
>
> What I can commit to is the scientific rigor I was trained on and the professional engineering standards I am obligated to uphold: pre-registered hypotheses, contamination protocols, protocol-correct evaluation, and full methodology disclosure.
>
> I welcome constructive criticism, independent replication, and analysis that refutes or supports any of these claims, and I would be glad to collaborate with anyone interested in strengthening the evaluation.
Evaluated across 5 published benchmarks. All results are protocol-correct with contamination-proof isolation (separate GT files, verified by verify_clean.py, enforced by 65 pytest protocol tests). No embeddings, no vector DB. Methodology follows the Lin checklist for reproducibility.
Results by version
| Benchmark | Metric | v1.0 | v1.1 | v1.2.1 | v2.2.2 |
|---|---|---|---|---|---|
| MAB SH 262K | SEM | 60% | 90% | 90% | 92% |
| MAB MH 262K | SEM | 6% | 35%* | 60% | 58% |
| StructMemEval | Accuracy | 29% | 100% | 100% | 100% |
| LongMemEval | Opus judge | -- | -- | 59.0% | 59.6% |
| LoCoMo | F1 | -- | 66.1% | 66.1% | 50.8%** |
- `*` chain-valid score; raw SEM was 47%
- `**` reader variance; retrieval code unchanged from v1.2.1 (see analysis below)
Compared to published systems
| Benchmark | agentmemory (best) | Paper ceiling / SOTA | Other published systems |
|---|---|---|---|
| MAB SH 262K | 92% SEM | 88% GPT-4o, 45% GPT-4o-mini | o4-mini 100% (6K context only) |
| MAB MH 262K | 58% SEM | <=7% all methods (paper ceiling) | -- |
| StructMemEval | 100% (14/14) | vector stores fail | -- |
| LongMemEval | 59.6% | 60.6% GPT-4o pipeline | -- |
| LoCoMo | 66.1% (v1.2.1) | 87.9% human ceiling | 92.3% EverMemOS, 74.0% Letta, 68.5% Mem0, 51.6% GPT-4-turbo |
LoCoMo comparison note: EverMemOS (92.3%), Letta (74.0%), and Mem0 (68.5%) use different retrieval architectures (embeddings, filesystem grep, LLM extraction respectively). agentmemory uses FTS5 keyword retrieval only, no embeddings. The v2.2.2 LoCoMo regression (50.8%) is driven by LLM reader variance from sub-agent batching, not retrieval quality. Per Lin's methodology, single-run results are insufficient when the reader is a variable; >=5 runs with mean +/- std are needed.
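The ">=5 runs with mean +/- std" requirement amounts to a simple computation. The run scores below are hypothetical, made up purely to show the arithmetic; only the 50.8 single-run figure appears in the results above.

```python
from statistics import mean, stdev

# Hypothetical F1 scores from five LoCoMo runs with a stochastic LLM reader.
# Only 50.8 is a real measurement; the rest are placeholders.
runs = [50.8, 55.2, 62.1, 58.7, 53.9]

# Report mean +/- sample standard deviation, not a single run.
print(f"F1 = {mean(runs):.1f} +/- {stdev(runs):.1f}")
```

With reader variance of a few points per run, a single-run delta like 66.1 vs. 50.8 cannot be attributed to retrieval without repeated trials.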
Methodology, per-benchmark details, and audit trails: Chapter 8 - Benchmark Results.
Session metrics
Beyond benchmarks, agentmemory tracks real-world usage metrics from conversation logs. Run `agentmemory metrics` for the full report.
| Metric | Value | Note |
|---|---|---|
| Correction rate | 0.72% | FP-adjusted, ~90% precision |
| Retrieval tokens/search | ~1,800 | Stable (2K budget cap) |
| Retrieval budget fill | 73% -> 100% | Improving as belief store grows |
| Correction trend | 1.7% -> 0.5% | Suggestive, not yet significant |
| Fix commit rate | 12% | 50/404 commits in dev period |
Evaluation protocol: docs/EVALUATION_PROTOCOL.md -- three-part framework covering benchmarks, acceptance tests (872+), and session metrics.
Development
```shell
git clone https://github.com/robot-rocket-science/agentmemory.git
cd agentmemory
uv sync --all-groups
uv run pytest tests/ -x -q
uv run pyright src/
```
Contributions welcome. See CONTRIBUTING.md.
Citation
If you use agentmemory in your research or project, please cite:
```bibtex
@software{agentmemory2026,
  author  = {robotrocketscience},
  title   = {agentmemory: Persistent Memory for AI Coding Agents},
  year    = {2026},
  url     = {https://github.com/robot-rocket-science/agentmemory},
  license = {MIT}
}
```
License
MIT -- free for personal, commercial, and any other use. Citation appreciated.
File details

Details for the file agentmemory_rrs-2.5.0.tar.gz.

- Size: 4.9 MB
- Tags: Source
- Uploaded using Trusted Publishing: Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12

| Algorithm | Hash digest |
|---|---|
| SHA256 | `d62c7da3fdf5f32d94832303b6587a5ce6ffc269abba0892f3a05bb198f2476a` |
| MD5 | `f9eff67c2496f68ffdc7882d52fb3be9` |
| BLAKE2b-256 | `e1fc44920cc46f311f6c21b5c304cc8b86a9fcf7728504d186ccf2df886e525b` |
Provenance

The following attestation bundles were made for agentmemory_rrs-2.5.0.tar.gz:

- Publisher: publish.yml on robot-rocket-science/agentmemory
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: agentmemory_rrs-2.5.0.tar.gz
- Subject digest: `d62c7da3fdf5f32d94832303b6587a5ce6ffc269abba0892f3a05bb198f2476a`
- Sigstore transparency entry: 1341763961
- Permalink: robot-rocket-science/agentmemory@3ed1c373bf055f544460b2df4f32d201ebbb4849
- Branch / Tag: refs/tags/v2.5.0
- Owner: https://github.com/robot-rocket-science
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@3ed1c373bf055f544460b2df4f32d201ebbb4849
- Trigger Event: push
File details

Details for the file agentmemory_rrs-2.5.0-py3-none-any.whl.

- Size: 195.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing: Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12

| Algorithm | Hash digest |
|---|---|
| SHA256 | `424d390346234686e181ed729ac389667af7c55faf28952ee291eb00ba90ea14` |
| MD5 | `0f59c9b328733c261b454fefc51ad2d2` |
| BLAKE2b-256 | `2968dd931258c8dd1b0036c50447ec0b1f1eb267168696ef1dd9d79fdf01ff21` |
Provenance

The following attestation bundles were made for agentmemory_rrs-2.5.0-py3-none-any.whl:

- Publisher: publish.yml on robot-rocket-science/agentmemory
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: agentmemory_rrs-2.5.0-py3-none-any.whl
- Subject digest: `424d390346234686e181ed729ac389667af7c55faf28952ee291eb00ba90ea14`
- Sigstore transparency entry: 1341763964
- Permalink: robot-rocket-science/agentmemory@3ed1c373bf055f544460b2df4f32d201ebbb4849
- Branch / Tag: refs/tags/v2.5.0
- Owner: https://github.com/robot-rocket-science
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@3ed1c373bf055f544460b2df4f32d201ebbb4849
- Trigger Event: push