
Persistent, graph-backed memory for AI coding agents. Bayesian confidence, FTS5 search, HRR vocabulary bridge, entity-index retrieval, MCP server.

Project description

agentmemory

Persistent memory for AI coding agents. Your agent remembers what you discussed, decided, and corrected, so the next session does not start from scratch.

License: MIT · Python 3.12+

Read the handbook · Install · Workflow · Architecture · Benchmarks · Project writeup


Why

When a session ends, your agent forgets everything. You end up re-explaining the project, re-stating the same preferences, and watching the same mistakes happen again.

agentmemory captures decisions, corrections, and context as you work, and hands them back to the agent next time. No manual notes. No context files. Just memory.

Install

uv pip install git+https://github.com/robot-rocket-science/agentmemory.git
agentmemory setup

Restart Claude Code, then in any project:

/mem:onboard .

Full prerequisites and troubleshooting: docs/INSTALL.md.

What it does

  • Remembers automatically. Captures decisions, corrections, and preferences from your conversations without you lifting a finger.
  • Learns what matters. Memories that help get stronger over time. Memories that hurt get weaker. The system tunes itself to your project.
  • Stays on your machine. Everything lives in local SQLite. No cloud, no vector database, no telemetry unless you opt in.
  • Works with any MCP agent. Claude Code is the primary target, but any MCP-compatible client can connect to the server.
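
For a non-Claude MCP client, registration is typically a JSON entry pointing at the server command. A sketch of what that might look like (the `serve` subcommand is an assumption here, not a documented interface; check docs/INSTALL.md for the real command):

```json
{
  "mcpServers": {
    "agentmemory": {
      "command": "agentmemory",
      "args": ["serve"]
    }
  }
}
```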

A sketch of what using it feels like

Session 1
─────────
you    We decided to use uv for this project, not poetry.
agent  Got it.

   ...session ends, days pass, new session opens...

Session 2
─────────
you    Set up the environment please.
agent  Using uv, per the project decision from last week.
       Pinning Python 3.12 as configured. Proceeding.

The second session starts already knowing. That is the whole pitch.

How it works

Conversations become scored beliefs in a local graph. Each belief gets stronger or weaker based on whether it helped. Retrieval pulls the most relevant subset into the agent's context on every turn, within a fixed token budget.
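The scoring model described above can be sketched with a Beta distribution: each belief carries success/failure counts, confidence is the Beta mean, and variance captures uncertainty. The class and field names below are illustrative, not the actual API:

```python
from dataclasses import dataclass

@dataclass
class Belief:
    """Illustrative belief record; alpha/beta are Beta-distribution counts."""
    text: str
    alpha: float = 1.0  # prior + times the belief helped
    beta: float = 1.0   # prior + times the belief hurt

    @property
    def confidence(self) -> float:
        # Beta mean: expected probability the belief is useful
        return self.alpha / (self.alpha + self.beta)

    @property
    def variance(self) -> float:
        # Beta variance: high variance means an uncertain belief
        n = self.alpha + self.beta
        return (self.alpha * self.beta) / (n * n * (n + 1))

    def feedback(self, helped: bool) -> None:
        # Each turn's outcome nudges the counts up or down
        if helped:
            self.alpha += 1
        else:
            self.beta += 1

b = Belief("use uv, not poetry")
for _ in range(3):
    b.feedback(helped=True)
b.feedback(helped=False)
print(round(b.confidence, 2))  # (1+3)/(1+3+1+1) = 4/6 -> 0.67
```

A new belief starts at confidence 0.5 with high variance; repeated positive feedback pulls it toward 1.0 while shrinking the variance, which is exactly the "gets stronger over time" behavior.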

[Diagram: agentmemory pipeline -- ingestion, retrieval, and feedback]

Deep dive in the handbook: Chapter 5 - Architecture.

Documentation

The full handbook lives at docs/README.md and is structured as a short book with prev/next navigation on every page.

Wonder and Reason

agentmemory includes two graph-aware research commands that go beyond simple keyword search. They use the belief graph -- edges like SUPPORTS, CONTRADICTS, SUPERSEDES, CITES -- to surface connected evidence and detect reasoning gaps.

/mem:wonder <topic> -- Deep Research

Wonder is exploratory. You give it a topic and it fans out across the belief graph to collect everything relevant, even things you did not directly search for.

  1. Retrieves seed beliefs via FTS5 keyword search
  2. Expands outward along graph edges (BFS, configurable depth)
  3. Scores uncertainty for each belief using Beta distribution variance
  4. Detects contradictions between beliefs in the result set
  5. Outputs a structured context block with three sections: Known Facts (direct hits), Connected Evidence (reached via graph traversal), and Open Questions (high-uncertainty beliefs)
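Steps 2-3 above can be sketched roughly as follows. The graph representation, names, and variance threshold are assumptions for illustration, not the real internals:

```python
from collections import deque

# Toy belief graph: node -> list of (edge_type, neighbor)
GRAPH = {
    "use-uv": [("SUPPORTS", "pin-py312"), ("SUPERSEDES", "use-poetry")],
    "pin-py312": [("CITES", "ci-config")],
    "use-poetry": [],
    "ci-config": [],
}

# (alpha, beta) Beta counts per belief, as in the confidence model
COUNTS = {"use-uv": (9, 1), "pin-py312": (2, 2),
          "use-poetry": (1, 9), "ci-config": (1, 1)}

def beta_variance(a: float, b: float) -> float:
    n = a + b
    return (a * b) / (n * n * (n + 1))

def expand(seeds, max_depth=2):
    """BFS outward from seed beliefs, collecting connected evidence."""
    seen, frontier = set(seeds), deque((s, 0) for s in seeds)
    connected = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for edge, nbr in GRAPH.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                connected.append((node, edge, nbr))
                frontier.append((nbr, depth + 1))
    return connected

hits = expand(["use-uv"])
# High-variance beliefs surface as "Open Questions"
open_questions = [n for n in COUNTS if beta_variance(*COUNTS[n]) > 0.03]
print(hits)
print(open_questions)
```

Here the seed `use-uv` pulls in `use-poetry` and `pin-py312` at one hop and `ci-config` at two, while the low-evidence beliefs (near-uniform Beta counts) land in the open-questions bucket.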

Use wonder when you want to survey what the system knows about a topic before making a decision. It answers: "what do we know, what is connected, and where are we uncertain?"

/mem:reason <question> -- Hypothesis Testing

Reason is focused. You give it a question or hypothesis and it builds branching consequence paths to evaluate whether the evidence supports it.

  1. Retrieves seed beliefs, then checks relevance (content-word overlap filter)
  2. Builds consequence paths -- chains of beliefs linked by edges, with compound confidence decay at each hop
  3. Checks constraints -- compares paths against locked beliefs for conflicts
  4. Detects impasses -- four types: ties (contradicting beliefs at similar confidence), gaps (dead-end paths), constraint failures (conflicts with locked beliefs), and no-change (all low-confidence evidence)
  5. Issues a verdict: SUFFICIENT, INSUFFICIENT, UNCERTAIN, CONTRADICTORY, or PARTIAL
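The compound confidence decay in step 2 can be read as multiplying per-belief confidences along the chain, discounted at every traversed edge. A minimal sketch, assuming a fixed per-hop decay factor (the factor and function name are illustrative):

```python
def path_confidence(confidences, hop_decay=0.9):
    """Compound confidence of a consequence path: product of
    per-belief confidences, decayed once per traversed edge."""
    score = 1.0
    for i, c in enumerate(confidences):
        score *= c
        if i > 0:  # decay applies to each hop, not the seed belief
            score *= hop_decay
    return score

# A 3-belief chain: strong seed, two weaker consequences
print(round(path_confidence([0.9, 0.8, 0.7]), 3))  # 0.9 * (0.8*0.9) * (0.7*0.9) = 0.408
```

The multiplicative form means long chains of even fairly confident beliefs decay quickly, which is what makes gap and no-change impasses detectable: a path whose compound score falls below threshold cannot move the verdict.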

Use reason when you need to evaluate a specific claim or decision. It answers: "does the evidence support this, and if not, where does the reasoning break down?"

The difference

Wonder is divergent -- cast a wide net, see what is out there. Reason is convergent -- evaluate a specific claim against the evidence. Together they form a research loop: wonder to survey the landscape, reason to test specific hypotheses that emerge from it.

Benchmarks

[!NOTE] About these numbers. I run and publish benchmarks because I believe objective, replicable methodology and transparent result reporting matter, and that readers deserve to see them. I place limited personal weight on the numbers themselves. V&V for agent memory systems is a specialized area where I do not have deep hands-on experience, and I cannot be fully confident that Claude and I have exercised these systems as rigorously as a dedicated evaluator would.

What I can commit to is the scientific rigor I was trained on and the professional engineering standards I am obligated to uphold: pre-registered hypotheses, contamination protocols, protocol-correct evaluation, and full methodology disclosure.

I welcome constructive criticism, independent replication, and analysis that refutes or supports any of these claims, and I would be glad to collaborate with anyone interested in strengthening the evaluation.

Evaluated across 5 published benchmarks. All results are protocol-correct with contamination-proof isolation (separate GT files, verified by verify_clean.py, enforced by 65 pytest protocol tests). No embeddings, no vector DB. Methodology follows the Lin checklist for reproducibility.

Results by version

| Benchmark | Metric | v1.0 | v1.1 | v1.2.1 | v2.2.2 |
|---|---|---|---|---|---|
| MAB SH 262K | SEM | 60% | 90% | 90% | 92% |
| MAB MH 262K | SEM | 6% | 35%* | 60% | 58% |
| StructMemEval | Accuracy | 29% | 100% | 100% | 100% |
| LongMemEval | Opus judge | -- | -- | 59.0% | 59.6% |
| LoCoMo | F1 | -- | 66.1% | 66.1% | 50.8%** |

* chain-valid score; raw SEM was 47%
** reader variance; retrieval code unchanged from v1.2.1 (see analysis below)

Compared to published systems

| Benchmark | agentmemory (best) | Paper ceiling / SOTA | Other published systems |
|---|---|---|---|
| MAB SH 262K | 92% SEM | 88% GPT-4o, 45% GPT-4o-mini | o4-mini 100% (6K context only) |
| MAB MH 262K | 58% SEM | <=7% all methods (paper ceiling) | -- |
| StructMemEval | 100% (14/14) | vector stores fail | -- |
| LongMemEval | 59.6% | 60.6% GPT-4o pipeline | -- |
| LoCoMo | 66.1% (v1.2.1) | 87.9% human ceiling | 92.3% EverMemOS, 74.0% Letta, 68.5% Mem0, 51.6% GPT-4-turbo |

LoCoMo comparison note: EverMemOS (92.3%), Letta (74.0%), and Mem0 (68.5%) use different retrieval architectures (embeddings, filesystem grep, LLM extraction respectively). agentmemory uses FTS5 keyword retrieval only, no embeddings. The v2.2.2 LoCoMo regression (50.8%) is driven by LLM reader variance from sub-agent batching, not retrieval quality. Per Lin's methodology, single-run results are insufficient when the reader is a variable; >=5 runs with mean +/- std are needed.
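Reporting mean and standard deviation across reader runs is straightforward; a sketch with hypothetical scores (these five numbers are illustrative, not measured results):

```python
from statistics import mean, stdev

# Hypothetical F1 scores from 5 independent reader runs
runs = [50.8, 58.2, 61.5, 55.0, 63.1]
print(f"{mean(runs):.1f} +/- {stdev(runs):.1f}")  # sample std dev, n-1 denominator
```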

Methodology, per-benchmark details, and audit trails: Chapter 8 - Benchmark Results.

Session metrics

Beyond benchmarks, agentmemory tracks real-world usage metrics from conversation logs. Run agentmemory metrics for the full report.

| Metric | Value | Note |
|---|---|---|
| Correction rate | 0.72% | FP-adjusted, ~90% precision |
| Retrieval tokens/search | ~1,800 | Stable (2K budget cap) |
| Retrieval budget fill | 73% -> 100% | Improving as belief store grows |
| Correction trend | 1.7% -> 0.5% | Suggestive, not yet significant |
| Fix commit rate | 12% | 50/404 commits in dev period |

Evaluation protocol: docs/EVALUATION_PROTOCOL.md -- three-part framework covering benchmarks, acceptance tests (872+), and session metrics.

Development

git clone https://github.com/robot-rocket-science/agentmemory.git
cd agentmemory
uv sync --all-groups
uv run pytest tests/ -x -q
uv run pyright src/

Contributions welcome. See CONTRIBUTING.md.

Citation

If you use agentmemory in your research or project, please cite:

@software{agentmemory2026,
  author    = {robotrocketscience},
  title     = {agentmemory: Persistent Memory for AI Coding Agents},
  year      = {2026},
  url       = {https://github.com/robot-rocket-science/agentmemory},
  license   = {MIT}
}

License

MIT -- free for personal, commercial, and any other use. Citation appreciated.

Project details


Download files

Download the file for your platform.

Source Distribution

agentmemory_rrs-2.5.0.tar.gz (4.9 MB)


Built Distribution


agentmemory_rrs-2.5.0-py3-none-any.whl (195.5 kB)


File details

Details for the file agentmemory_rrs-2.5.0.tar.gz.

File metadata

  • Download URL: agentmemory_rrs-2.5.0.tar.gz
  • Size: 4.9 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for agentmemory_rrs-2.5.0.tar.gz
| Algorithm | Hash digest |
|---|---|
| SHA256 | d62c7da3fdf5f32d94832303b6587a5ce6ffc269abba0892f3a05bb198f2476a |
| MD5 | f9eff67c2496f68ffdc7882d52fb3be9 |
| BLAKE2b-256 | e1fc44920cc46f311f6c21b5c304cc8b86a9fcf7728504d186ccf2df886e525b |


Provenance

The following attestation bundles were made for agentmemory_rrs-2.5.0.tar.gz:

Publisher: publish.yml on robot-rocket-science/agentmemory

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file agentmemory_rrs-2.5.0-py3-none-any.whl.

File metadata

  • Download URL: agentmemory_rrs-2.5.0-py3-none-any.whl
  • Size: 195.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for agentmemory_rrs-2.5.0-py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | 424d390346234686e181ed729ac389667af7c55faf28952ee291eb00ba90ea14 |
| MD5 | 0f59c9b328733c261b454fefc51ad2d2 |
| BLAKE2b-256 | 2968dd931258c8dd1b0036c50447ec0b1f1eb267168696ef1dd9d79fdf01ff21 |


Provenance

The following attestation bundles were made for agentmemory_rrs-2.5.0-py3-none-any.whl:

Publisher: publish.yml on robot-rocket-science/agentmemory

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
