A memory layer that remembers how your agent decided, not just what it knows
Why?
Most memory systems for AI agents store facts: "user prefers Python", "the API key is at X", "the last conversation was about Y".
They are semantic memories — they know what is true.
But agents don't just need to know things. They need to judge things:
- "Should I modify this config file directly, or ask the user first?"
- "The last time I changed database credentials, it broke the connection — try a different approach."
- "I've been burned by skipping tests in CI before. Never again."
These are episodic memories: decisions made, reasoning used, outcomes observed. They capture how the agent should behave, not just what is true.
Episodic Memory is a lightweight Python library that adds this missing dimension to your agent stack.
How is this different?
| | Semantic Memory (Mem0, RAG, vector stores) | Episodic Memory (this) |
|---|---|---|
| What it stores | Facts, preferences, conversation history | Decisions, judgments, reasoning chains |
| Query pattern | "What does the user prefer?" | "How should I handle this situation?" |
| Feedback loop | None — stored facts are trusted | Utility-weighted: was the judgment adopted or corrected? |
| Ranking | Cosine similarity only | Dynamic: similarity × (1 + α · utility_score) |
| Learning over time | One-shot recall | Accumulates verified judgments — what works, what doesn't |
| Training data output | Not designed for this | Export (context, judgment, outcome) triples for fine-tuning |
The key insight: semantic memory answers "what is relevant?" Episodic memory answers "what has been proven correct?"
Does it actually work? (Benchmark)
We built a synthetic judgment-recall task: 10 scenarios, each with two competing judgments that look alike to an embedder — one correct (validated), one wrong (corrected). Then we measure how often the correct judgment ranks first under each strategy.
| Metric | Cosine | + Utility flywheel |
|---|---|---|
| precision@1 (higher is better) | 0.40 | 0.90 |
| Mean rank of correct judgment (lower is better) | 1.90 | 1.30 |
Pure cosine retrieval finds the right judgment only 40% of the time — because the embedder can't distinguish "looks relevant" from "is actually correct." Adding the utility flywheel brings it to 90%.
The benchmark is fully reproducible — run it yourself:
pip install episodic-memory sentence-transformers
python benchmarks/judgment_recall.py
Note: This is a synthetic proof-of-concept, not production validation. Real-world results depend on your domain, embedding quality, and verification coverage. The script is meant to be a starting point — we encourage you to adapt it to your own use case.
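For readers adapting the benchmark, here is a minimal sketch of how the two reported metrics can be computed from per-scenario ranks. The function names and the example ranks are illustrative assumptions; they are not taken from benchmarks/judgment_recall.py and do not reproduce its numbers.

```python
from typing import Sequence


def precision_at_1(ranks: Sequence[int]) -> float:
    """Fraction of scenarios where the correct judgment is ranked first."""
    return sum(1 for r in ranks if r == 1) / len(ranks)


def mean_rank(ranks: Sequence[int]) -> float:
    """Average 1-based rank of the correct judgment across scenarios."""
    return sum(ranks) / len(ranks)


# Hypothetical 1-based ranks of the correct judgment in 10 scenarios
# (made-up values for illustration, not the benchmark's output).
example_ranks = [1, 2, 1, 1, 2, 1, 1, 1, 1, 1]

p_at_1 = precision_at_1(example_ranks)
avg_rank = mean_rank(example_ranks)
print(f"precision@1 = {p_at_1:.2f}, mean rank = {avg_rank:.2f}")
```

Running the same two functions over the ranks produced by each strategy (pure cosine, then utility-weighted) gives a side-by-side comparison like the table above.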
Quickstart
pip install episodic-memory
from episodic_memory import EpisodicMemory
# Create an in-process memory store
memory = EpisodicMemory()
# Save a judgment; store() returns the new memory's ID
memory_id = memory.store(
    trigger="User asks agent to modify config.json in production",
    judgment="Production config changes must be confirmed with the user first",
    reasoning="Direct config writes have caused outages before. The agent should always propose, not execute.",
    domain="ops",
)
# → e.g. "mem_abc123"
# Later, when a similar situation arises, query past judgments
results = memory.search(
    query="Can I edit the production config file?",
    top_k=3,
)
for r in results:
    print(f" [{r.distance:.2f}] {r.judgment}")
# → [0.21] Production config changes must be confirmed with the user first
# Did this judgment actually help? Let the feedback loop know.
memory.verify(memory_id, adopted=True)
# Next time, the utility-weighted search will rank this judgment higher.
results = memory.search("Can I edit the production config?", use_utility=True)
The Utility Flywheel
The core idea: judgments that have been repeatedly validated should rank higher than untested or disproven ones, even when the semantic similarity is the same.
┌──────────────────┐
│ store() saves │
│ a judgment │
└────────┬─────────┘
│
▼
┌─────────────────────────────────────────┐
│ search(query, use_utility=True) │
│ ranks by sim × (1 + α · utility) │
└─────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ Agent follows the judgment — or not? │
└─────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ verify(id, adopted=True/False) │
│ updates utility_score │
└─────────┬───────────────────────────────┘
│
▼
search() with use_utility=True (loop back)
The formula is simple: rank_score = cosine_similarity × (1 + α · utility_score), where utility_score = adoption_count / (adoption_count + correction_count).
- α = 0.5 by default, adjustable via utility_weight
- Raw cosine similarity is always the base: utility can't override relevance, only disambiguate it
- use_utility=False (the default) preserves pure vanilla behavior
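A worked example makes the formula concrete. The sketch below reimplements the ranking arithmetic in plain Python; it is not the library's code. The helper names and the candidate data are made up for illustration, and scoring a memory with no feedback as 0 follows the statement elsewhere in this README that utility scores stay at 0 without verify().

```python
def utility_score(adoption_count: int, correction_count: int) -> float:
    """adoption_count / (adoption_count + correction_count)."""
    total = adoption_count + correction_count
    # Assumption: a memory with no feedback yet scores 0 (avoids 0/0).
    return adoption_count / total if total else 0.0


def rank_score(similarity: float, adoption_count: int,
               correction_count: int, alpha: float = 0.5) -> float:
    """cosine_similarity * (1 + alpha * utility_score)."""
    return similarity * (1 + alpha * utility_score(adoption_count, correction_count))


# Two hypothetical candidates: (judgment, cosine similarity, adoptions, corrections).
candidates = [
    ("edit the config directly", 0.82, 0, 3),     # slightly closer, but corrected
    ("confirm with the user first", 0.80, 4, 0),  # slightly farther, but validated
]

ranked = sorted(candidates, key=lambda c: rank_score(c[1], c[2], c[3]), reverse=True)
for judgment, sim, adopted, corrected in ranked:
    print(f"{rank_score(sim, adopted, corrected):.2f}  {judgment}")
```

Because utility_score lies between 0 and 1, the multiplier stays between 1 and 1 + α. With the default α = 0.5, the validated judgment scores 0.80 × 1.5 = 1.20 and overtakes the corrected one at 0.82 × 1.0 = 0.82, even though its raw similarity is slightly lower.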
When to use it
| Scenario | Without Episodic Memory | With Episodic Memory |
|---|---|---|
| Agent repeatedly makes the same mistake | Each run starts from scratch | Past judgment is retrieved — "last time this broke" |
| Agent needs to know your operation style | Hardcoded in system prompt, never evolves | Utility feedback loop reinforces what works |
| Onboarding new agents | Every agent needs its own instructions | Shared memory of accumulated operational wisdom |
| Debugging agent behavior | "Why did it do that?" is guesswork | Every judgment carries its reasoning chain |
API Reference
EpisodicMemory(embedder=None, db_path=":memory:")
Create a memory store. Defaults to in-memory SQLite; pass db_path for persistence.
| Parameter | Default | Description |
|---|---|---|
| embedder | None (auto: SentenceTransformer) | Custom embedding function: callable(str) → list[float] |
| db_path | ":memory:" | Path to SQLite database file |
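To illustrate the embedder contract (a callable taking a str and returning a list of floats), here is a toy hashed bag-of-words embedder. It is a sketch for tests or offline experiments only, not a substitute for a real sentence embedder. The function name, the 64-dimension default, and the judgments.db file name are hypothetical.

```python
import hashlib
import math


def hashed_bow_embedder(text: str, dim: int = 64) -> list[float]:
    """Map text to a unit-length vector by hashing each token into a bucket."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # sha256 keeps bucket assignment stable across runs (built-in hash() is salted).
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    # Empty input yields the zero vector rather than dividing by zero.
    return [v / norm for v in vec] if norm else vec


vector = hashed_bow_embedder("Can I edit the production config file?")
print(len(vector))

# Passing it to the store, following the constructor signature documented above
# (left commented out so this snippet runs without the library installed):
# memory = EpisodicMemory(embedder=hashed_bow_embedder, db_path="judgments.db")
```

A hashing embedder only captures token overlap, so it will not separate paraphrases the way a trained model does; its value here is showing the shape of the callable.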
store(trigger, judgment, reasoning, domain=None, metadata=None) → str
Save a new episodic memory. Raises ValueError if trigger, judgment, or reasoning is empty.
search(query, top_k=5, domain=None, min_score=0.0, use_utility=False, utility_weight=0.5) → list[SearchResult]
Search past judgments by semantic similarity, optionally weighted by proven utility.
| Parameter | Default | Description |
|---|---|---|
| query | required | Natural language query |
| top_k | 5 | Max results |
| domain | None | Filter by domain |
| min_score | 0.0 | Minimum similarity threshold |
| use_utility | False | Rank by relevance × utility (the flywheel) |
| utility_weight | 0.5 | How strongly utility boosts the ranking |
Returns list of SearchResult(id, judgment, reasoning, trigger, domain, distance, metadata, utility_score).
verify(memory_id, adopted, user_correction=None) → None
Record whether a retrieved judgment was useful. Raises KeyError if the memory_id doesn't exist.
| Parameter | Required | Description |
|---|---|---|
| memory_id | ✅ | ID returned by store() |
| adopted | ✅ | True if agent followed this judgment |
| user_correction | | User's correction if the judgment was wrong |
export_triples(min_utility=0.0) → list[Triple]
Export (context, judgment, outcome) triples for fine-tuning. Only exports memories with utility_score > min_utility (strictly greater, so unverified memories are excluded by default).
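The strictly-greater comparison is easy to misread, so here is a small sketch of the filtering rule on its own. The Record type and the sample data are hypothetical stand-ins; the library's actual Triple fields are not shown in this README and are not assumed here.

```python
from typing import NamedTuple


class Record(NamedTuple):
    """Hypothetical stand-in for a stored memory; not the library's Triple."""
    context: str
    judgment: str
    utility_score: float


def filter_for_export(records: list[Record], min_utility: float = 0.0) -> list[Record]:
    """Keep records whose utility_score is strictly greater than min_utility."""
    return [r for r in records if r.utility_score > min_utility]


records = [
    Record("edit prod config", "confirm with the user first", 1.0),
    Record("skip CI tests", "never skip tests", 0.0),          # never verified
    Record("rotate DB credentials", "change in place", 0.25),  # mostly corrected
]

default_export = filter_for_export(records)
strict_export = filter_for_export(records, min_utility=0.25)
print(len(default_export), len(strict_export))
```

With the default threshold the unverified record (score 0.0) is dropped; raising min_utility to 0.25 also drops the record that scores exactly 0.25, because the comparison is strict.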
close()
Close the underlying storage connection. Also works as a context manager: with EpisodicMemory() as m: ...
When NOT to use it
- If your agent only needs facts and preferences: use Mem0 or a vector store. This library is not designed for semantic memory.
- If you don't have a way to verify judgments: the flywheel needs feedback. Without verify(), the utility scores stay at 0 and the library behaves like a plain vector store.
- For more than ~10K records: the current knn_search scans all rows in Python. At scale, swap in sqlite-vec or pgvector (the interface is just Storage).
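To show why a full scan becomes the bottleneck, the following sketch implements a brute-force nearest-neighbor search in pure Python. It is written for illustration and is not the library's knn_search. The function names and sample rows are assumptions.

```python
import heapq
import math
from typing import Iterable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def brute_force_knn(query_vec: Sequence[float],
                    rows: Iterable[tuple[str, Sequence[float]]],
                    top_k: int = 5) -> list[tuple[float, str]]:
    """Score every row against the query: O(n * d) work per search."""
    scored = ((cosine(query_vec, vec), mem_id) for mem_id, vec in rows)
    return heapq.nlargest(top_k, scored)


# Hypothetical stored rows: (memory_id, embedding).
rows = [("mem_a", [1.0, 0.0]), ("mem_b", [0.0, 1.0]), ("mem_c", [0.7, 0.7])]
top = brute_force_knn([1.0, 0.1], rows, top_k=2)
print([mem_id for _, mem_id in top])
```

Every query touches every stored vector, so latency grows linearly with the number of records. An approximate-nearest-neighbor index avoids that by visiting only a fraction of the rows.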
Project Status
This is an early release (v0.1.0). The API is stable for the core loop (store → search → verify → weighted recall), but expect additions before 1.0:
- Pluggable ANN backends (sqlite-vec, pgvector)
- Time-decayed utility weighting
- Memory consolidation (merge duplicate judgments)
- Streaming export for online fine-tuning
Related Work
- Mem0 (⭐58k): Universal semantic memory layer for AI agents. Complementary: use Mem0 for facts, episodic-memory for judgments.
- LangGraph (⭐35k): Stateful agent orchestration. Integrates via its MemorySaver interface.
License
MIT — see LICENSE.
Contributing
We welcome contributions! See CONTRIBUTING.md to get started.
Questions, ideas, or bugs? Open an issue.
⭐ If this project resonates, star it on GitHub — it helps others find it.