# memnotary

Memory your AI agents can actually trust — the open-source memory reliability layer for production AI agents.
AI agents accumulate memories over time, and some will contradict each other. Your agent stored "refund policy is 30 days" in January, then "refund policy is 14 days" in March. Both sit in your vector store. When retrieved together, your LLM picks one — silently, often wrongly, with no flag that a contradiction exists.
memnotary wraps your existing vector backend and adds what's missing: contradiction detection, memory health scoring, automatic consolidation, and an audit trail. You don't replace anything. You just stop trusting your memory blindly.
## What memnotary is not
memnotary is not a vector database and does not replace Qdrant, Chroma, or Postgres. It is the reliability layer on top: health checks, conflict detection, consolidation, and provenance. Your existing storage stays exactly where it is.
## When to use memnotary
Use memnotary if your agent stores long-lived memories and you need to know:
- whether two memories contradict each other
- whether old facts are still being retrieved when they shouldn't be
- whether memory quality is getting worse over time
- why a memory exists and where it came from
- what would happen before automatic consolidation mutates state
## 30-second example
memnotary is provider-agnostic. Bring your own LLM and embedding function:
```python
async def your_llm(prompt: str) -> str:
    # OpenAI, Anthropic, a local model — anything works
    response = await client.chat.completions.create(...)
    return response.choices[0].message.content

def embed(text: str) -> list[float]:
    # any embedding function — OpenAI, sentence-transformers, etc.
    ...
```
```python
from memnotary import Memnotary, ContradictionDetector, Consolidator, Memory, InMemoryAdapter

mn = Memnotary(
    InMemoryAdapter(),
    detector=ContradictionDetector(llm_fn=your_llm),
    consolidator=Consolidator(llm_fn=your_llm),
)

async with mn:
    await mn.store(Memory(agent_id="bot", text="Refund policy is 30 days", embedding=embed("Refund policy is 30 days")))
    await mn.store(Memory(agent_id="bot", text="Refund policy changed to 14 days", embedding=embed("Refund policy changed to 14 days")))
    # ↑ Conflict detected on the second store. memnotary saved a ConflictRecord.

    results = await mn.search("bot", embed("refund policy"), top_k=5)
    for result in results:
        if result.conflict_flag:
            print(result.conflict_summary)  # one-sentence explanation of the conflict
        print(result.recommended)  # False if a higher-ranked result already covers this

    await mn.consolidate("bot")
    # memnotary supersedes, merges, or flags the conflict depending on type and confidence.
```
## What it does
**Contradiction detection.** Every `store()` call runs a similarity search against existing memories. If potential conflicts are found, your LLM classifies them. Confirmed contradictions become `ConflictRecord` objects you can inspect, act on, or queue for review.
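The shape of that pipeline is: a cheap cosine-similarity gate first, then LLM classification only for near neighbors. Here is a self-contained sketch; `detect_conflicts`, `classify_fn`, and the 0.82 default are illustrative rather than memnotary's internals (the value mirrors the cluster threshold mentioned in the benchmark notes):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

async def detect_conflicts(new_text, new_vec, existing, classify_fn, threshold=0.82):
    """existing: (text, vector) pairs already stored.
    classify_fn: async (a, b) -> bool; an LLM call in practice."""
    confirmed = []
    for text, vec in existing:
        if cosine(new_vec, vec) >= threshold:      # cheap embedding gate
            if await classify_fn(new_text, text):  # expensive LLM step
                confirmed.append(text)
    return confirmed
```

The gate is what keeps per-store cost bounded: the LLM only ever sees candidate pairs that are already semantically close.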
**Health scoring.** `await mn.health(agent_id)` returns a snapshot with signals like `contradiction_score`, `freshness_score`, and `confidence_accuracy_gap`. Useful for dashboards, or for deciding when to run consolidation.
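The README does not spell out the scoring formulas, but as an intuition for what a freshness signal can look like, here is a toy exponential-decay version. The half-life and the formula are invented for illustration; this is not memnotary's actual computation:

```python
from datetime import datetime, timezone

def freshness_score(timestamps: list[datetime], half_life_days: float = 30.0) -> float:
    """Toy freshness signal: exponential decay by memory age, averaged."""
    if not timestamps:
        return 1.0
    now = datetime.now(timezone.utc)
    decayed = [0.5 ** ((now - ts).days / half_life_days) for ts in timestamps]
    return sum(decayed) / len(decayed)
```

A store full of month-old facts would score near 0.5 under this toy rule, which is the kind of drift a dashboard alert could catch.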
**Consolidation.** `await mn.consolidate(agent_id)` reads all pending conflicts and plans a batch of actions: supersede the outdated memory, merge duplicates, or flag uncertain cases for a human. Then it executes them.
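The three outcomes map naturally onto a small decision rule. The sketch below is a hypothetical reconstruction; the conflict-type strings, threshold, and action names are assumptions, not memnotary's actual logic:

```python
def plan_action(conflict_type: str, confidence: float, review_threshold: float = 0.7) -> str:
    """Pick one of the three consolidation actions described above."""
    if confidence < review_threshold:
        return "flag_for_review"  # uncertain: leave it to a human
    if conflict_type == "duplicate":
        return "merge"            # same fact stated twice
    return "supersede"            # newer fact replaces the outdated one
```

Planning first and executing second is what makes the "what would happen before consolidation mutates state" question from the checklist above answerable.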
**Provenance.** A memory can carry a `ProvenanceRecord`: where it came from, who ingested it, and what it was derived from. `await mn.export_provenance_json(agent_id, memory_id)` gives you a compliance-ready audit trail.
## How it compares
| | memnotary | Mem0 | Zep | Raw vector DB |
|---|---|---|---|---|
| Stores memories | wraps yours | yes | yes | yes |
| Detects contradictions | yes | partial | no | no |
| Health scoring | yes | no | no | no |
| Provenance / audit trail | yes | no | partial | no |
| Bring your own backend | yes | no | no | — |
| Bring your own LLM | yes | partial | yes | — |
memnotary doesn't replace Mem0 or Zep — you can run it on top of either. It replaces the blind trust in whatever is already storing your memories.
## LangChain bridge

Drop-in `VectorStore` and `BaseChatMessageHistory` implementations backed by memnotary. Bulk adds skip per-document detection; call `scan_contradictions()` after loading to catch conflicts across the batch.
```python
from memnotary.integrations.langchain import MemnotaryVectorStore, MemnotaryChatMessageHistory
from langchain_openai import OpenAIEmbeddings
from langchain_core.messages import HumanMessage, AIMessage

store = MemnotaryVectorStore(mn, embeddings=OpenAIEmbeddings(), agent_id="bot")
await store.aadd_texts(["Refund policy is 30 days", "Refund policy changed to 14 days"])
await mn.scan_contradictions("bot")  # detect conflicts across the batch

docs = await store.asimilarity_search("what is the refund policy", k=3)
# docs[0].metadata["_memory_id"] lets you trace back to the original Memory

history = MemnotaryChatMessageHistory(mn, session_id="conv-42")
await history.aadd_messages([
    HumanMessage(content="What's the refund policy?"),
    AIMessage(content="It's 14 days."),
])
msgs = await history.aget_messages()
```
## Backends
| Backend | Best for | Install |
|---|---|---|
| In-memory | tests, local development | built-in |
| Qdrant | production deployments, hybrid search | pip install "memnotary[qdrant]" |
| Chroma | local-first apps, prototypes | pip install "memnotary[chroma]" |
| pgvector | teams already on Postgres | pip install "memnotary[pgvector]" |
Every backend is a subclass of `AbstractAdapter`. Adding your own takes one file.
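To give a feel for what an adapter involves, here is a dict-backed sketch with brute-force cosine search. The method names (`store`, `search`) and signatures are assumptions for illustration only; check `AbstractAdapter` for the real contract:

```python
import math

class DictAdapter:
    """Toy adapter: memories in a dict, brute-force cosine search."""

    def __init__(self):
        self._rows: dict[str, tuple[str, list[float]]] = {}

    async def store(self, memory_id: str, text: str, embedding: list[float]) -> None:
        self._rows[memory_id] = (text, embedding)

    async def search(self, query: list[float], top_k: int = 5) -> list[tuple[str, float]]:
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(y * y for y in b))
            return dot / (na * nb)

        scored = [(text, cos(query, vec)) for text, vec in self._rows.values()]
        return sorted(scored, key=lambda r: r[1], reverse=True)[:top_k]
```

Swapping the dict for a real backend client is the only part that should change; the async surface stays the same.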
## Installation
```bash
pip install memnotary              # core + in-memory adapter
pip install "memnotary[qdrant]"    # + Qdrant
pip install "memnotary[chroma]"    # + Chroma
pip install "memnotary[pgvector]"  # + pgvector (requires asyncpg)
pip install "memnotary[langchain]" # + LangChain bridge
pip install "memnotary[all]"       # everything
```
Requires Python 3.11+. memnotary is fully async.
## Benchmark
The benchmark has two tracks. Track 1 checks infrastructure correctness. Track 2 is behavioral: how each system handles real contradiction scenarios, scored on three dimensions.
### Track 2 — Behavioral (the headline numbers)
We ran 7 real-world conflict scenarios against memnotary, Mem0, and raw Qdrant. Each scenario was scored on:
- **Correctness** (weight 0.4) — did the right answer come back?
- **Signal** (weight 0.4) — was the stale or conflicting memory flagged or suppressed?
- **Preservation** (weight 0.2) — were unrelated facts left untouched?
memnotary flagged 6 of 7 contradiction scenarios. Mem0 flagged 3. Raw Qdrant flagged 2. All three systems returned the right answer — the difference is whether the wrong answer was also surfaced silently alongside it.
| System | Overall | Correctness | Signal | Preservation | Risk |
|---|---|---|---|---|---|
| memnotary | 0.94 | 1.00 | 0.86 | 1.00 | LOW |
| mem0 | 0.77 | 1.00 | 0.43 | 1.00 | MEDIUM |
| naive-qdrant | 0.71 | 1.00 | 0.29 | 1.00 | MEDIUM |
Every system eventually surfaces the right answer in the top-k. But Mem0 and raw Qdrant also return the contradicting wrong answer alongside it, with no flag. Your LLM sees both. It picks one. You don't know which. Signal is the difference between an agent that knows it's uncertain and one that confidently returns the wrong policy.
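The Overall column can be reproduced as a plain weighted sum of the three dimensions (assuming the weights combine linearly, which matches the published numbers):

```python
def overall(correctness: float, signal: float, preservation: float) -> float:
    # Weights from the scoring rubric above.
    return round(0.4 * correctness + 0.4 * signal + 0.2 * preservation, 2)

# Signal is flagged-scenarios / 7; correctness and preservation were 1.00 for all three.
print(overall(1.0, 6 / 7, 1.0))  # memnotary: 0.94
print(overall(1.0, 3 / 7, 1.0))  # mem0: 0.77
print(overall(1.0, 2 / 7, 1.0))  # naive-qdrant: 0.71
```

In other words, with perfect correctness and preservation across the board, the entire spread in the table comes from Signal.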
Per-scenario breakdown:
| Scenario | memnotary | mem0 | naive-qdrant | What it tests |
|---|---|---|---|---|
| B1 — Direct contradiction | 1.00 | 0.60 | 0.60 | Old fact superseded by new |
| B2 — Retention | 1.00 | 1.00 | 1.00 | Three unrelated facts all survive |
| B3 — Temporal chain | 1.00 | 0.60 | 0.60 | Three versions; only the latest surfaces |
| B4 — False positive guard | 1.00 | 1.00 | 1.00 | Two non-contradictory sub-policies both survive |
| B5 — Temporal language | 1.00 | 1.00 | 0.60 | Rescheduled event; old schedule flagged |
| B6 — Lexically varied temporal | 0.60 | 0.60 | 0.60 | Same fact, different phrasing |
| B7 — Metadata timestamp | 1.00 | 0.60 | 0.60 | Structured timestamps override insertion order |
B6: All three systems score 0.60 here. The two sentences are phrased differently enough that their cosine similarity falls below memnotary's 0.82 cluster threshold, so the LLM classifier is never invoked. This is a known trade-off: conflict detection requires semantic overlap at the embedding level before the more expensive LLM step fires. Varied real-world phrasing that expresses the same underlying fact can fall below this threshold.
To reproduce (requires OPENAI_API_KEY and Docker):
```bash
docker run -d --name memnotary-qdrant-mem0 -p 6333:6333 qdrant/qdrant
OPENAI_API_KEY=sk-... python benchmark/run_track2.py
```
### Track 1 — Infrastructure Reliability
50 deterministic test cases across five backends — four memnotary-backed adapters and one raw Qdrant wrapper with no memnotary data model. No API key required.
| Backend | Score | Risk | Pass |
|---|---|---|---|
| memnotary-inmemory | 0.88 | LOW | 44/50 |
| memnotary-qdrant | 0.88 | LOW | 44/50 |
| memnotary-chroma | 0.88 | LOW | 44/50 |
| memnotary-pgvector | 0.88 | LOW | 44/50 |
| naive-qdrant | 0.42 | CRITICAL | 20/50 |
Score is identical across all four memnotary backends — reliability comes from the data model, not the choice of vector backend. The largest gap is in temporal reliability: memnotary scores 1.00, naive Qdrant scores 0.05.
To reproduce:
```bash
python benchmark/run_track1.py  # ~2 min, no API key needed
python benchmark/report.py      # prints the table above
```
## Limitations
- Track 1 uses synthetic 16-dim embeddings; production embeddings (768–3072 dim) will produce different absolute scores. The data-model gap should hold but margins may compress.
- Track 2 is 7 hand-crafted scenarios. Small N is intentional — every failure is inspectable — but it is not a stress test.
- B6 reveals a real ceiling: lexically distant phrasings of the same fact fall below the cosine cluster threshold and never reach the LLM classifier. Improving this is on the 0.2 roadmap.
- Mem0 was tested with default settings; advanced configurations may close the signal gap.
- Cost/latency comparison (tokens per stored memory across systems) is coming in 0.2.
See benchmark/README.md for full setup and Docker requirements.
## Status
0.1.0 — the core reliability loop (store → detect → score → consolidate → provenance) is complete and covered by 880+ unit tests. The pgvector adapter and LangChain bridge are included.
Not production-tested yet. The API is stable but may have breaking changes before 1.0.
What's planned for 0.2: LlamaIndex bridge, a sync facade for non-async code, OpenTelemetry instrumentation, and cost/latency benchmarks.
See CONTRIBUTING.md if you want to help build it.
## License
MIT