memnotary

Memory your AI agents can actually trust — the open-source memory reliability layer for production AI agents.



AI agents accumulate memories over time, and some will contradict each other. Your agent stored "refund policy is 30 days" in January, then "refund policy is 14 days" in March. Both sit in your vector store. When retrieved together, your LLM picks one — silently, often wrongly, with no flag that a contradiction exists.

memnotary wraps your existing vector backend and adds what's missing: contradiction detection, memory health scoring, automatic consolidation, and an audit trail. You don't replace anything. You just stop trusting your memory blindly.

What memnotary is not

memnotary is not a vector database and does not replace Qdrant, Chroma, or Postgres. It is the reliability layer on top: health checks, conflict detection, consolidation, and provenance. Your existing storage stays exactly where it is.

When to use memnotary

Use memnotary if your agent stores long-lived memories and you need to know:

  • whether two memories contradict each other
  • whether old facts are still being retrieved when they shouldn't be
  • whether memory quality is getting worse over time
  • why a memory exists and where it came from
  • what would happen before automatic consolidation mutates state

30-second example

memnotary is provider-agnostic. Bring your own LLM and embedding function:

async def your_llm(prompt: str) -> str:
    # OpenAI, Anthropic, a local model — anything works
    response = await client.chat.completions.create(...)
    return response.choices[0].message.content

def embed(text: str) -> list[float]:
    # any embedding function — OpenAI, sentence-transformers, etc.
    ...

from memnotary import memnotary, ContradictionDetector, Consolidator, Memory, InMemoryAdapter

eng = memnotary(
    InMemoryAdapter(),
    detector=ContradictionDetector(llm_fn=your_llm),
    consolidator=Consolidator(llm_fn=your_llm),
)

async with eng:
    await eng.store(Memory(agent_id="bot", text="Refund policy is 30 days", embedding=embed("Refund policy is 30 days")))
    await eng.store(Memory(agent_id="bot", text="Refund policy changed to 14 days", embedding=embed("Refund policy changed to 14 days")))
    # ↑ Conflict detected on the second store. memnotary saved a ConflictRecord.

    results = await eng.search("bot", embed("refund policy"), top_k=5)
    for result in results:
        if result.conflict_flag:
            print(result.conflict_summary)  # one-sentence explanation of the conflict
            print(result.recommended)       # False if a higher-ranked result already covers this

    await eng.consolidate("bot")
    # memnotary supersedes, merges, or flags the conflict depending on type and confidence.

What it does

Contradiction detection. Every store() call runs a similarity search against existing memories. If potential conflicts are found, your LLM classifies them. Confirmed contradictions become ConflictRecord objects you can inspect, act on, or queue for review.
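The two-stage flow described above can be sketched in plain Python. This is an illustration of the shape of the pipeline, not memnotary's internals; `stub_cosine` and `stub_classify` are stand-ins for a real embedding-similarity function and a real LLM call.

```python
from typing import Callable

def detect_conflicts(
    new_text: str,
    new_vec: list[float],
    existing: list[tuple[str, list[float]]],
    cosine_fn: Callable[[list[float], list[float]], float],
    classify_fn: Callable[[str, str], str],
    threshold: float = 0.82,
) -> list[str]:
    """Cheap similarity gate first; the expensive LLM call only for survivors."""
    conflicts = []
    for old_text, old_vec in existing:
        if cosine_fn(new_vec, old_vec) >= threshold:                # stage 1: vector gate
            if classify_fn(new_text, old_text) == "contradiction":  # stage 2: LLM verdict
                conflicts.append(old_text)
    return conflicts

# Stubs standing in for a real similarity function and LLM classifier.
stub_cosine = lambda a, b: 1.0 if a == b else 0.5
def stub_classify(new: str, old: str) -> str:
    return "contradiction" if "14 days" in new and "30 days" in old else "compatible"

existing = [("Refund policy is 30 days", [1.0, 0.0])]
print(detect_conflicts("Refund policy changed to 14 days", [1.0, 0.0],
                       existing, stub_cosine, stub_classify))
# ['Refund policy is 30 days']
```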

Health scoring. await eng.health(agent_id) returns a snapshot with signals like contradiction_score, freshness_score, and confidence_accuracy_gap. Useful for dashboards or for deciding when to run consolidation.

Consolidation. await eng.consolidate(agent_id) reads all pending conflicts and plans a batch of actions: supersede the outdated memory, merge duplicates, or flag uncertain cases for a human. Then it executes them.

Provenance. A memory can carry a ProvenanceRecord — where it came from, who ingested it, and what it was derived from. await eng.export_provenance_json(agent_id, memory_id) gives you a compliance-ready audit trail.
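As a sketch of what such a record might hold: the field names here (`source`, `ingested_by`, `derived_from`) are hypothetical stand-ins inferred from the description above, not memnotary's actual ProvenanceRecord schema.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical shape, for illustration only; consult memnotary's
# ProvenanceRecord for the real fields.
@dataclass
class ProvenanceSketch:
    source: str              # where the memory came from
    ingested_by: str         # who (or what process) stored it
    derived_from: list[str]  # upstream memory IDs it was derived from

record = ProvenanceSketch(
    source="support-ticket-8841",
    ingested_by="ingest-worker-2",
    derived_from=["mem-123"],
)
print(json.dumps(asdict(record), indent=2))
```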

How it compares

| Capability | memnotary | Mem0 | Zep | Raw vector DB |
|---|---|---|---|---|
| Stores memories | wraps yours | yes | yes | yes |
| Detects contradictions | yes | partial | no | no |
| Health scoring | yes | no | no | no |
| Provenance / audit trail | yes | no | partial | no |
| Bring your own backend | yes | no | no | n/a |
| Bring your own LLM | yes | partial | yes | n/a |

memnotary doesn't replace Mem0 or Zep — you can run it on top of either. It replaces the blind trust in whatever is already storing your memories.

LangChain bridge

Drop-in VectorStore and BaseChatMessageHistory backed by memnotary. Bulk adds skip per-document detection — call scan_contradictions() after loading to catch conflicts across the batch.

from memnotary.integrations.langchain import MemnotaryVectorStore, MemnotaryChatMessageHistory
from langchain_openai import OpenAIEmbeddings
from langchain_core.messages import HumanMessage, AIMessage

store = MemnotaryVectorStore(eng, embeddings=OpenAIEmbeddings(), agent_id="bot")
await store.aadd_texts(["Refund policy is 30 days", "Refund policy changed to 14 days"])
await eng.scan_contradictions("bot")  # detect conflicts across the batch
docs = await store.asimilarity_search("what is the refund policy", k=3)
# docs[0].metadata["_memory_id"] lets you trace back to the original Memory

history = MemnotaryChatMessageHistory(eng, session_id="conv-42")
await history.aadd_messages([
    HumanMessage(content="What's the refund policy?"),
    AIMessage(content="It's 14 days."),
])
msgs = await history.aget_messages()

Backends

| Backend | Best for | Install |
|---|---|---|
| In-memory | tests, local development | built-in |
| Qdrant | production deployments, hybrid search | `pip install "memnotary[qdrant]"` |
| Chroma | local-first apps, prototypes | `pip install "memnotary[chroma]"` |
| pgvector | teams already on Postgres | `pip install "memnotary[pgvector]"` |

Every backend is a subclass of AbstractAdapter. Adding your own takes one file.
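A toy illustration of that adapter shape (the method names here are hypothetical; the real interface is whatever `AbstractAdapter` declares, so check memnotary's source before implementing one):

```python
from abc import ABC, abstractmethod

# Hypothetical skeleton, for illustration only. It mirrors the idea of
# "one subclass, one file" rather than memnotary's actual AbstractAdapter.
class AdapterSketch(ABC):
    @abstractmethod
    async def upsert(self, memory_id: str, vector: list[float], payload: dict) -> None: ...

    @abstractmethod
    async def query(self, vector: list[float], top_k: int) -> list[dict]: ...

class DictAdapter(AdapterSketch):
    """Toy adapter over a plain dict, in the spirit of the built-in in-memory backend."""

    def __init__(self) -> None:
        self._rows: dict[str, tuple[list[float], dict]] = {}

    async def upsert(self, memory_id, vector, payload):
        self._rows[memory_id] = (vector, payload)

    async def query(self, vector, top_k):
        # A real adapter ranks by vector similarity; the toy just returns payloads.
        return [payload for _, payload in list(self._rows.values())[:top_k]]
```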

Installation

pip install memnotary                   # core + in-memory adapter
pip install "memnotary[qdrant]"         # + Qdrant
pip install "memnotary[chroma]"         # + Chroma
pip install "memnotary[pgvector]"       # + pgvector (requires asyncpg)
pip install "memnotary[langchain]"      # + LangChain bridge
pip install "memnotary[all]"            # everything

Requires Python 3.11+. memnotary is fully async.

Benchmark

Two tracks. Track 1 is infrastructure correctness. Track 2 is behavioral — how each system handles real contradiction scenarios, scored on three dimensions.

Track 2 — Behavioral (the headline numbers)

We ran 7 real-world conflict scenarios against memnotary, Mem0, and raw Qdrant. Each scenario was scored on:

  • Correctness (weight 0.4) — did the right answer come back?
  • Signal (weight 0.4) — was the stale or conflicting memory flagged or suppressed?
  • Preservation (weight 0.2) — were unrelated facts left untouched?

memnotary flagged 6 of 7 contradiction scenarios. Mem0 flagged 3. Raw Qdrant flagged 2. All three systems returned the right answer — the difference is whether the wrong answer was also surfaced silently alongside it.

| System | Overall | Correctness | Signal | Preservation | Risk |
|---|---|---|---|---|---|
| memnotary | 0.94 | 1.00 | 0.86 | 1.00 | LOW |
| mem0 | 0.77 | 1.00 | 0.43 | 1.00 | MEDIUM |
| naive-qdrant | 0.71 | 1.00 | 0.29 | 1.00 | MEDIUM |
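Assuming the overall score is the weighted average of the three dimensions at the stated weights, the headline numbers check out (memnotary's signal of 0.86 is 6 of 7 scenarios flagged):

```python
# Overall = 0.4 * correctness + 0.4 * signal + 0.2 * preservation
def overall(correctness: float, signal: float, preservation: float) -> float:
    return 0.4 * correctness + 0.4 * signal + 0.2 * preservation

print(round(overall(1.00, 6 / 7, 1.00), 2))  # 0.94  (memnotary: 6/7 flagged)
print(round(overall(1.00, 0.43, 1.00), 2))   # 0.77  (mem0's row)
```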

Every system eventually surfaces the right answer in the top-k. But Mem0 and raw Qdrant also return the contradicting wrong answer alongside it, with no flag. Your LLM sees both. It picks one. You don't know which. Signal is the difference between an agent that knows it's uncertain and one that confidently returns the wrong policy.

Per-scenario breakdown:

| Scenario | memnotary | mem0 | naive-qdrant | What it tests |
|---|---|---|---|---|
| B1 — Direct contradiction | 1.00 | 0.60 | 0.60 | Old fact superseded by new |
| B2 — Retention | 1.00 | 1.00 | 1.00 | Three unrelated facts all survive |
| B3 — Temporal chain | 1.00 | 0.60 | 0.60 | Three versions; only the latest surfaces |
| B4 — False positive guard | 1.00 | 1.00 | 1.00 | Two non-contradictory sub-policies both survive |
| B5 — Temporal language | 1.00 | 1.00 | 0.60 | Rescheduled event; old schedule flagged |
| B6 — Lexically varied temporal | 0.60 | 0.60 | 0.60 | Same fact, different phrasing |
| B7 — Metadata timestamp | 1.00 | 0.60 | 0.60 | Structured timestamps override insertion order |

B6: All three systems score 0.60 here. The two sentences are phrased differently enough that their cosine similarity falls below memnotary's 0.82 cluster threshold, so the LLM classifier is never invoked. This is a known trade-off: conflict detection requires semantic overlap at the embedding level before the more expensive LLM step fires. Varied real-world phrasing that expresses the same underlying fact can fall below this threshold.
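A minimal illustration of that gate, with made-up 3-dimensional vectors purely for demonstration (real embeddings have hundreds of dimensions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

CLUSTER_THRESHOLD = 0.82  # the gate applied before the LLM classifier fires

# One near-duplicate pair, one lexically distant pair (illustrative only).
close_pair = cosine([0.9, 0.4, 0.1], [0.8, 0.5, 0.2])
distant_pair = cosine([0.9, 0.4, 0.1], [0.2, 0.9, 0.6])

print(close_pair > CLUSTER_THRESHOLD)    # True  -> LLM classifier would run
print(distant_pair > CLUSTER_THRESHOLD)  # False -> the conflict is never inspected
```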

To reproduce (requires OPENAI_API_KEY and Docker):

docker run -d --name memnotary-qdrant-mem0 -p 6333:6333 qdrant/qdrant
OPENAI_API_KEY=sk-... python benchmark/run_track2.py

Track 1 — Infrastructure Reliability

50 deterministic test cases across five backends — four memnotary-backed adapters and one raw Qdrant wrapper with no memnotary data model. No API key required.

| Backend | Score | Risk | Pass |
|---|---|---|---|
| memnotary-inmemory | 0.88 | LOW | 44/50 |
| memnotary-qdrant | 0.88 | LOW | 44/50 |
| memnotary-chroma | 0.88 | LOW | 44/50 |
| memnotary-pgvector | 0.88 | LOW | 44/50 |
| naive-qdrant | 0.42 | CRITICAL | 20/50 |

Score is identical across all four memnotary backends — reliability comes from the data model, not the choice of vector backend. The largest gap is in temporal reliability: memnotary scores 1.00, naive Qdrant scores 0.05.

To reproduce:

python benchmark/run_track1.py   # ~2 min, no API key needed
python benchmark/report.py       # prints the table above

Limitations

  • Track 1 uses synthetic 16-dim embeddings; production embeddings (768–3072 dim) will produce different absolute scores. The data-model gap should hold but margins may compress.
  • Track 2 is 7 hand-crafted scenarios. Small N is intentional — every failure is inspectable — but it is not a stress test.
  • B6 reveals a real ceiling: lexically distant phrasings of the same fact fall below the cosine cluster threshold and never reach the LLM classifier. Improving this is on the 0.2 roadmap.
  • Mem0 was tested with default settings; advanced configurations may close the signal gap.
  • Cost/latency comparison (tokens per stored memory across systems) is coming in 0.2.

See benchmark/README.md for full setup and Docker requirements.

Status

0.1.0a2 — the core reliability loop (store → detect → score → consolidate → provenance) is complete and covered by 880+ unit tests. The pgvector adapter and LangChain bridge are included.

Not production-tested yet. The API has settled, but there may still be breaking changes before 1.0.

What's planned for 0.2: LlamaIndex bridge, a sync facade for non-async code, OpenTelemetry instrumentation, and cost/latency benchmarks.

See CONTRIBUTING.md if you want to help build it.

License

MIT
