GRAFOMEM — agent-memory conformance benchmark and compliance toolkit
Project description
GRAFOMEM
An agent-memory benchmark that became a memory protocol.
GRAFOMEM began as a benchmark for one question — what should a standard for agent memory actually specify? — and turned into the answer: a benchmark, a protocol (GMP), an executable conformance suite, and a certified, persistent reference implementation. The thesis in one line:
Memory capabilities are orthogonal, a declared capability is not the same as observed behavior, and the only way to tell them apart is to test — so agent memory should be specified and conformance-checked like any other protocol.
Clean-room research project. grafomem.com
The thesis, in three results
- Four orthogonal axes. A memory standard must separately specify: representational capability (versioning / supersession), embedding quality, retention policy, and a two-sided privacy primitive (deletion and tenant isolation). The benchmark shows these are separately specifiable and verifiable.
- Claims ≠ behavior. A backend can declare
HARD_DELETEorMULTI_TENANTand still leak forbidden data (findings F10, F12). The declaration is not the guarantee. This is the load-bearing result — and the reason a conformance suite has to exist. - Protocol + conformance. "Supports capability X" is defined operationally: passes the conformance suite for X. The spec, the suite, and implementations that certify against it all exist and agree.
The stack
| Layer | What it is | Where |
|---|---|---|
| Benchmark | 10 workloads (W1–W10), 20 findings; locked corpus 135 traces / 61,754 turns / 17,612 queries (v0.2.0) | src/aml/generator/, scripts/run_w*.py |
| Paper | arXiv technical report | docs/grafomem-paper.pdf |
| Spec | GMP v0.2 (draft) — protocol semantics (RFC 2119) | docs/gmp-spec-v0.2.md |
| Conformance | executable §8: supports X ≝ passes the suite for X |
src/aml/eval/conformance.py |
| Reference | in-memory backend, self-certifying | src/aml/backends/gmp_reference.py |
| Wire | HTTP + JSON binding; the client is a MemoryBackend |
src/aml/wire.py |
| Store | persistent SQLite + sqlite-vec; survives restart | src/aml/backends/sqlite_gmp.py |
Each layer certifies the one beside it. The reference backend runs the conformance suite on itself; the wire client runs the same suite over a socket; the SQLite store runs it on a file. The contract is transport- and implementation-independent by construction — not by assertion.
v0.2: W7–W10 built (findings F14–F20); W7, W9, W10 corpus-locked into v0.2.0, W8 held out; W10 (operational concurrency & isolation) gated by M8 isolation conformance — §4.10, gmp-spec §10.
Key findings
The benchmark is the evidence base. Full table in the paper (Appendix C); the load-bearing ones:
- Capabilities are inert without the workload that needs them. On a pure-vector
retrieval task, declaring
SUPERSESSION_CHAIN/BI_TEMPORALchanges nothing (Δ = +0.000). On a drift task, supersession recovers recall from 0.281 → 0.867 at a tight budget (+0.585). The capability matters exactly when the workload exercises it. - The embedder is the lever, not the capabilities. Swapping the stub for a real embedder moves recall +0.510 at budget 32; toggling capabilities on the same task moves it +0.000.
- Declared ≠ honest (F10, F12). A backend that claims
HARD_DELETEbut soft-deletes leaks deleted facts with probability 1.0; one that claimsMULTI_TENANTbut shares an index leaks across tenants with probability 1.0. Both pass their own type signature and fail conformance. Deletion and tenant isolation unify at the read path — a singleForbidden(q)set — which is why one two-sided test catches both.
Latency — reference store, locked corpus
The W1–W6 subset ingested into one growing SQLite + sqlite-vec store (N = 38,882 rows), BGE-small embeddings, on an Apple-Silicon laptop. Numbers are post-v0.2 (the metadata-column pre-filter):
| op | count | p50 | p95 |
|---|---|---|---|
| write | 37,015 | 10.17ms | 13.66ms |
| supersede | 2,262 | 9.85ms | 12.90ms |
| delete | 395 | 0.03ms | 0.05ms |
| retrieve | 15,342 | 11.27ms | 27.80ms |
What the numbers say:
- The embedder is the floor. Every op that embeds sits at ~10ms;
delete(the one op that doesn't) is 0.03ms. The store's own machinery is sub-millisecond. Single-item write throughput is ~97/s on MPS — and awrite_manybulk-ingest path that batches the embedder under one transaction hits 847/s (8.6×) with an identical resulting store, confirming the embedder, not the store, was the entire write cost. - The v0.2 pre-filter crushed the tail. In v0.1, selective queries (
as_of, tenant) ranked-then-filtered and triggered an adaptive widening loop, putting retrieve p95 at 82ms. v0.2 pushes the tenant/valid-time predicate into the KNN as metadata columns and boundskby the char budget — p95 fell to 27.80ms (−66%) with identical results, and the high-N p50 sits at or below v0.1's flat region. - Retrieve p50 still grows with N (~10ms under 10k → ~25ms at 25–50k). sqlite-vec is brute-force — there is no ANN index — so the scan is O(N). The pre-filter made retrieval correct and tight, not sublinear; the next lever for retrieve-at-scale is an actual ANN index, not more tuning. At 38k rows / 25ms p50 it isn't needed yet.
- sqlite-vec ≈ numpy brute force on pure-vector workloads (every bucket within ~1ms). The store's value at this scale is persistence and not pinning the corpus in RAM, not speed — confirming the in-memory reference is the right default below large scale.
Running it
Editable install (src-layout; this also puts aml on the path permanently):
python -m venv .venv && source .venv/bin/activate
pip install -e ".[backends]" # aml + sentence-transformers + torch
pip install sqlite-vec apsw # for the persistent store
macOS note: the store needs SQLite extension loading. If your Python's
sqlite3lacks it, the backend transparently falls back toapsw(bundled SQLite). Keepapswinstalled.
Self-validating smokes — each stands up a backend and runs the conformance suite against it:
python -m aml.backends.gmp_reference # reference impl certifies itself
python -m aml.wire # conformance suite passes over HTTP
python -m aml.backends.sqlite_gmp # persistent store: survives reopen + passes suite
Benchmark experiments and the scale probe:
python -m scripts.run_w1 # F1, F2 — vector vs recency floor + budget sweep
python -m scripts.run_w2 # F3–F5 — drift, supersession, bi-temporal
python -m scripts.run_w3 # F6, F7 — distractor noise; the embedder lever
python -m scripts.scale_probe # corpus latency + sqlite-vec vs brute force
Status & roadmap
v0.1 — complete. Benchmark, paper, GMP v0.1 spec, conformance suite, in-memory
reference (certified), HTTP+JSON wire binding (suite passes over a socket), persistent
SQLite + sqlite-vec store (survives restart, passes the full profile), scale probe.
v0.1 normative subset: {AUDIT, SUPERSESSION_CHAIN, BI_TEMPORAL, HARD_DELETE, MULTI_TENANT}.
v0.2 — in progress.
- Metadata-column pre-filter in the store — done.
tenant_id/ valid-time live in the vec0 table (nulls sentinel-encoded, since sqlite-vec metadata filters don't supportIS NULL); the KNN filters natively andkis bounded by the char budget. Drops retrieve p95 82→28ms with identical results and exact selective-filter retrieval; the O(N) brute-force scan remains (no ANN index — a separate lever, not needed at this scale). - Batched embedding on the ingest path — done. A
write_manyfast-path embeds a batch in one forward pass under one transaction: 97 → 847 items/s (8.6×), same resulting store. Optional accelerator; the single-writeProtocol path is unchanged. - Reserved capabilities — provenance pair done.
PROVENANCE(normative) andCRYPTOGRAPHIC_PROVENANCE(optional) are implemented in both backends and certified by the v0.2 conformance suite: Ed25519 over a content-store fact_id,sourcepersisted (signatures survive restart). Provenance has no benchmark workload by design — it is integrity metadata, "verifiability, not ranking" (gmp-spec §7.5), so the suite tests it with constructed probes likeAUDIT, not a retrieval workload (§8.3). - Workloads W7–W9 — built. W7 (Conflict Detection), W8 (Forgetting Curve), W9 (Cross-Session
Deletion); generators, backend spectrums, runners, findings F14–F18. W7 and W9 are corpus-locked
into v0.2.0; W8 is held out pending the summarise/merge retention variant. These home the last
two reserved flags —
CONFLICT_DETECTION(W7) andCROSS_SESSION_PROPAGATION(W9), now un-reserved in gmp-spec §7.4. - W10 — Operational Concurrency & Isolation — built and corpus-locked. Trace-schema v0.2
carries set-valued ground truth;
interface.pyaddssubmit_concurrent(gated byCONCURRENCY_CONTROL, the 10th flag),IsolationPolicy, and adeclared_policyself-description on the concurrent backend. A five-store spectrum (serializable → resurrecting) drives M8 isolation conformance, which catches both the over-claimer (declares serializable, delivers read-committed — F19) and the §10.4 durability violator (resurrects a committed delete — F20). Locked into v0.2.0 (135 traces). The suite's last open frontier, now closed.
The arc is protocol-first: the spec and suite are the standard; the implementations are the proof it's real and runnable. "Postgres for agent memory" is the destination, not the starting point.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file grafomem-0.2.0.tar.gz.
File metadata
- Download URL: grafomem-0.2.0.tar.gz
- Upload date:
- Size: 162.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
7773f18afbf3be13694fad4d44e61d3e32183e5475296da96f6462a9c237af45
|
|
| MD5 |
fe0065a62b1995d94a57da94ff517d9a
|
|
| BLAKE2b-256 |
b40022b894637ec27c52493cfd192755ec4cc1d24534f2278e6582d5e4b404db
|
File details
Details for the file grafomem-0.2.0-py3-none-any.whl.
File metadata
- Download URL: grafomem-0.2.0-py3-none-any.whl
- Upload date:
- Size: 195.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
18d5f6e92328802c9554125f50a6955f3b0d55298b80cb84ecb03272354179f2
|
|
| MD5 |
df9af136df84276e3a351bc379f69c73
|
|
| BLAKE2b-256 |
5d16c817cda8ea0ecfb8d67b8e5958625d40e2fe5170e78638860c7db453f9b9
|