Episodic Memory

A memory layer that remembers how your agent decided,
not just what it knows.


Why?

Most memory systems for AI agents store facts: "user prefers Python", "the API key is at X", "the last conversation was about Y".

They are semantic memories — they know what is true.

But agents don't just need to know things. They need to judge things:

  • "Should I modify this config file directly, or ask the user first?"
  • "The last time I changed database credentials, it broke the connection — try a different approach."
  • "I've been burned by skipping tests in CI before. Never again."

These are episodic memories: decisions made, reasoning used, outcomes observed. They capture how the agent should behave, not just what is true.

Episodic Memory is a lightweight Python library that adds this missing dimension to your agent stack.


How is this different?

| | Semantic Memory (Mem0, RAG, vector stores) | Episodic Memory (this) |
|---|---|---|
| What it stores | Facts, preferences, conversation history | Decisions, judgments, reasoning chains |
| Query pattern | "What does the user prefer?" | "How should I handle this situation?" |
| Feedback loop | None; stored facts are trusted | Utility-weighted: was the judgment adopted or corrected? |
| Ranking | Cosine similarity only | Dynamic: similarity × (1 + α · utility_score) |
| Learning over time | One-shot recall | Accumulates verified judgments: what works, what doesn't |
| Training data output | Not designed for this | Export (context, judgment, outcome) triples for fine-tuning |

The key insight: semantic memory answers "what is relevant?" Episodic memory answers "what has been proven correct?"


Does it actually work? (Benchmark)

We built a synthetic judgment-recall task: 10 scenarios, each with two competing judgments that look alike to an embedder — one correct (validated), one wrong (corrected). Then we measure how often the correct judgment ranks first under each strategy.

metric                            cosine    +utility flywheel
--------------------------------------------------------
precision@1 (higher better)         0.40        0.90
mean rank of correct (lower)        1.90        1.30

Pure cosine retrieval ranks the correct judgment first in only 40% of the scenarios, because the embedder can't distinguish "looks relevant" from "is actually correct." Adding the utility flywheel raises that to 90%.

The benchmark is fully reproducible — run it yourself:

pip install episodic-memory sentence-transformers
python benchmarks/judgment_recall.py

Note: This is a synthetic proof-of-concept, not production validation. Real-world results depend on your domain, embedding quality, and verification coverage. The script is meant to be a starting point — we encourage you to adapt it to your own use case.
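
If you adapt the benchmark to your own scenarios, the two reported metrics can be computed roughly as sketched below. This is not the code in benchmarks/judgment_recall.py: the function names and the sample ranks are made up for illustration. Each entry is the 1-based rank of the correct judgment in one scenario.

```python
from statistics import mean


def precision_at_1(ranks):
    """Fraction of scenarios where the correct judgment ranked first."""
    return sum(1 for r in ranks if r == 1) / len(ranks)


def mean_rank(ranks):
    """Average 1-based rank of the correct judgment (lower is better)."""
    return mean(ranks)


# Hypothetical ranks for 10 scenarios (illustrative, not the benchmark's data)
example_ranks = [1, 2, 1, 3, 1, 1, 2, 1, 1, 1]

print(precision_at_1(example_ranks))  # 7 of 10 scenarios -> 0.7
print(mean_rank(example_ranks))       # 14 / 10 -> 1.4
```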


Quickstart

pip install episodic-memory

from episodic_memory import EpisodicMemory

# Create an in-process memory store
memory = EpisodicMemory()

# Save a judgment and keep the returned ID for later verification
memory_id = memory.store(
    trigger="User asks agent to modify config.json in production",
    judgment="Production config changes must be confirmed with the user first",
    reasoning="Direct config writes have caused outages before. The agent should always propose, not execute.",
    domain="ops",
)
# → "mem_abc123"

# Later, when a similar situation arises, query past judgments
results = memory.search(
    query="Can I edit the production config file?",
    top_k=3,
)

for r in results:
    print(f"  [{r.distance:.2f}] {r.judgment}")
# → [0.21] Production config changes must be confirmed with the user first

# Did this judgment actually help? Let the feedback loop know.
memory.verify(memory_id, adopted=True)

# Next time, the utility-weighted search will rank this judgment higher.
results = memory.search("Can I edit the production config?", use_utility=True)

The Utility Flywheel

The core idea: judgments that have been repeatedly validated should rank higher than untested or disproven ones, even when the semantic similarity is the same.

                      ┌──────────────────┐
                      │  store() saves   │
                      │  a judgment      │
                      └────────┬─────────┘
                               │
                               ▼
  ┌─────────────────────────────────────────┐
  │  search(query, use_utility=True)        │
  │  ranks by sim × (1 + α · utility)       │
  └─────────┬───────────────────────────────┘
            │
            ▼
  ┌─────────────────────────────────────────┐
  │  Agent follows the judgment — or not?   │
  └─────────┬───────────────────────────────┘
            │
            ▼
  ┌─────────────────────────────────────────┐
  │  verify(id, adopted=True/False)         │
  │  updates utility_score                  │
  └─────────┬───────────────────────────────┘
            │
            ▼
  search() with use_utility=True (loop back)

The formula is simple: rank_score = cosine_similarity × (1 + α · utility_score), where utility_score = adoption_count / (adoption_count + correction_count).

  • α = 0.5 by default, adjustable via utility_weight
  • Raw cosine similarity is always the base: utility only multiplies it by a factor between 1 and 1 + α, so it separates judgments of similar relevance and cannot lift a judgment with zero similarity
  • use_utility=False (the default) keeps plain cosine-similarity ranking
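
To make the formula concrete, here is a small worked example. It is a standalone sketch of the arithmetic only, not the library's internal code; the function name rank_score and the similarity values are assumptions chosen for illustration.

```python
def rank_score(cosine_similarity, utility_score, alpha=0.5):
    """rank_score = cosine_similarity × (1 + α · utility_score)."""
    return cosine_similarity * (1 + alpha * utility_score)


# Two judgments that look equally relevant to the embedder (same similarity).
validated = rank_score(0.80, utility_score=1.0)  # always adopted so far
corrected = rank_score(0.80, utility_score=0.0)  # never adopted so far

print(f"{validated:.2f}")  # 1.20
print(f"{corrected:.2f}")  # 0.80

# With alpha=0 the utility term vanishes and ranking is plain cosine similarity.
print(rank_score(0.80, utility_score=1.0, alpha=0.0))  # 0.8
```

With equal similarity, the validated judgment ranks first. The boost is capped at a factor of 1 + α, which is why a larger utility_weight makes the flywheel more aggressive.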

When to use it

| Scenario | Without Episodic Memory | With Episodic Memory |
|---|---|---|
| Agent repeatedly makes the same mistake | Each run starts from scratch | Past judgment is retrieved: "last time this broke" |
| Agent needs to know your operation style | Hardcoded in system prompt, never evolves | Utility feedback loop reinforces what works |
| Onboarding new agents | Every agent needs its own instructions | Shared memory of accumulated operational wisdom |
| Debugging agent behavior | "Why did it do that?" is guesswork | Every judgment carries its reasoning chain |

API Reference

EpisodicMemory(embedder=None, db_path=":memory:")

Create a memory store. Defaults to in-memory SQLite; pass db_path for persistence.

| Parameter | Default | Description |
|---|---|---|
| embedder | None (auto: SentenceTransformer) | Custom embedding function: callable(str) → list[float] |
| db_path | ":memory:" | Path to SQLite database file |
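
As an illustration of the embedder contract (a callable that takes a str and returns a list[float]), here is a toy hashing embedder. It is a sketch for tests or offline experiments, not a substitute for a real embedding model, and the name hashing_embedder and the 64-dimension default are assumptions of this example.

```python
import hashlib
import math


def hashing_embedder(text: str, dim: int = 64) -> list[float]:
    """Toy bag-of-words embedder: hash each token into one of `dim` buckets,
    then L2-normalize. Deterministic, but carries no real semantics."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


vector = hashing_embedder("edit the production config")
print(len(vector))  # 64

# Assumed usage, following the constructor signature documented above:
#   memory = EpisodicMemory(embedder=hashing_embedder)
```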

store(trigger, judgment, reasoning, domain=None, metadata=None) → str

Save a new episodic memory. Raises ValueError if trigger, judgment, or reasoning is empty.

search(query, top_k=5, domain=None, min_score=0.0, use_utility=False, utility_weight=0.5) → list[SearchResult]

Search past judgments by semantic similarity, optionally weighted by proven utility.

| Parameter | Default | Description |
|---|---|---|
| query | required | Natural language query |
| top_k | 5 | Max results |
| domain | None | Filter by domain |
| min_score | 0.0 | Minimum similarity threshold |
| use_utility | False | Rank by relevance × utility (the flywheel) |
| utility_weight | 0.5 | How strongly utility boosts the ranking |

Returns a list of SearchResult(id, judgment, reasoning, trigger, domain, distance, metadata, utility_score).

verify(memory_id, adopted, user_correction=None) → None

Record whether a retrieved judgment was useful. Raises KeyError if the memory_id doesn't exist.

| Parameter | Required | Description |
|---|---|---|
| memory_id | Yes | ID returned by store() |
| adopted | Yes | True if agent followed this judgment |
| user_correction | No | User's correction if the judgment was wrong |
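
The effect of repeated verify() calls on utility_score can be pictured with a small counter. This is a standalone sketch, not the library's implementation: the class name UtilityCounter is hypothetical, and it assumes that adopted=False counts as a correction.

```python
from dataclasses import dataclass


@dataclass
class UtilityCounter:  # hypothetical helper, not part of the library
    adoption_count: int = 0
    correction_count: int = 0

    def record(self, adopted: bool) -> None:
        """Mirror one verify() call (assumes adopted=False is a correction)."""
        if adopted:
            self.adoption_count += 1
        else:
            self.correction_count += 1

    @property
    def utility_score(self) -> float:
        """adoption_count / (adoption_count + correction_count), 0 if unverified."""
        total = self.adoption_count + self.correction_count
        return self.adoption_count / total if total else 0.0


counter = UtilityCounter()
print(counter.utility_score)  # 0.0 (no feedback yet)

for adopted in (True, True, False, True):
    counter.record(adopted)
print(counter.utility_score)  # 3 adoptions / 4 verifications -> 0.75
```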

export_triples(min_utility=0.0) → list[Triple]

Export (context, judgment, outcome) triples for fine-tuning. Only exports memories with utility_score > min_utility (strictly greater, so unverified memories are excluded by default).
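
The strictly-greater filter and one possible way to serialize the result for fine-tuning are sketched below. The Record class is a hypothetical stand-in (the library's Triple type may have different fields), and JSON Lines is just one common choice of output format.

```python
import json
from dataclasses import dataclass


@dataclass
class Record:  # hypothetical stand-in for a stored memory
    context: str
    judgment: str
    outcome: str
    utility_score: float


def select_for_export(records, min_utility=0.0):
    """Keep records whose utility_score is strictly greater than min_utility."""
    return [r for r in records if r.utility_score > min_utility]


records = [
    Record("edit prod config", "confirm with the user first", "adopted", 1.0),
    Record("skip tests in CI", "skipping is fine", "corrected", 0.0),
    Record("rotate credentials", "use a staged rollout", "unverified", 0.0),
]

exported = select_for_export(records)
jsonl = "\n".join(
    json.dumps({"context": r.context, "judgment": r.judgment, "outcome": r.outcome})
    for r in exported
)
print(len(exported))  # 1: only the record with utility_score > 0.0 survives
print(jsonl)
```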

close()

Close the underlying storage connection. EpisodicMemory also works as a context manager: with EpisodicMemory() as m: ...


When NOT to use it

  • If your agent only needs facts and preferences: use Mem0 or a vector store. This library is not designed for semantic memory.
  • If you don't have a way to verify judgments: the flywheel needs feedback. Without verify(), the utility scores stay at 0 and the library behaves like a plain vector store.
  • If you store more than ~10K records: the current knn_search scans every row in Python. At that scale, swap in sqlite-vec or pgvector (the only interface to implement is Storage).
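
To see why a full scan stops scaling, consider the general brute-force pattern below. It is a sketch of the technique, not the library's knn_search: the function names and the (id, vector) row layout are assumptions. Every query touches every stored vector, so cost grows linearly with the number of records.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def brute_force_knn(query_vec, rows, top_k=5):
    """Score every (id, vector) row against the query: O(n · d) per search."""
    scored = [(cosine(query_vec, vec), row_id) for row_id, vec in rows]
    scored.sort(reverse=True)
    return scored[:top_k]


rows = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
top = brute_force_knn([1.0, 0.1], rows, top_k=2)
print([row_id for _, row_id in top])  # ['a', 'c']
```

An ANN index such as the ones in sqlite-vec or pgvector avoids scoring every row, trading a small amount of recall for much lower query latency.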

Project Status

This is an early release (v0.1.0). The API is stable for the core loop (store → search → verify → weighted recall), but expect additions before 1.0:

  • Pluggable ANN backends (sqlite-vec, pgvector)
  • Time-decayed utility weighting
  • Memory consolidation (merge duplicate judgments)
  • Streaming export for online fine-tuning

Related Work

  • Mem0 (⭐58k) — Universal semantic memory layer for AI agents. Complementary: use Mem0 for facts, episodic-memory for judgments.
  • LangGraph (⭐35k) — Stateful agent orchestration. Integrates via its MemorySaver interface.

License

MIT — see LICENSE.


Contributing

We welcome contributions! See CONTRIBUTING.md to get started.

Questions, ideas, or bugs? Open an issue.


⭐ If this project resonates, star it on GitHub — it helps others find it.
