episodic-judgment

A memory layer that remembers how your agent decided, not just what it knows

Episodic Memory

Why?

Most memory systems for AI agents store facts: "user prefers Python", "the API key is at X", "the last conversation was about Y".

These are semantic memories: they record what is true.

But agents don't just need to know things. They need to judge things:

"Should I modify this config file directly, or ask the user first?"
"The last time I changed database credentials, it broke the connection — try a different approach."
"I've been burned by skipping tests in CI before. Never again."

These are episodic memories: decisions made, reasoning used, outcomes observed. They capture how the agent should behave, not just what is true.

Episodic Memory is a lightweight Python library that adds this missing dimension to your agent stack.

How is this different?

	Semantic Memory (Mem0, RAG, vector stores)	Episodic Memory (this)
What it stores	Facts, preferences, conversation history	Decisions, judgments, reasoning chains
Query pattern	"What does the user prefer?"	"How should I handle this situation?"
Feedback loop	None — stored facts are trusted	Utility-weighted: was the judgment adopted or corrected?
Ranking	Cosine similarity only	Dynamic: similarity × (1 + α · utility_score)
Learning over time	One-shot recall	Accumulates verified judgments — what works, what doesn't
Training data output	Not designed for this	Export (context, judgment, outcome) triples for fine-tuning

The key insight: semantic memory answers "what is relevant?" Episodic memory answers "what has been proven correct?"

Does it actually work? (Benchmark)

We built a synthetic judgment-recall task: 10 scenarios, each with two competing judgments that look alike to an embedder — one correct (validated), one wrong (corrected). Then we measure how often the correct judgment ranks first under each strategy.

metric                            cosine    +utility flywheel
--------------------------------------------------------
precision@1 (higher better)         0.40        0.90
mean rank of correct (lower)        1.90        1.30

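Pure cosine retrieval ranks the correct judgment first only 40% of the time (precision@1 = 0.40), because the embedder can't distinguish "looks relevant" from "is actually correct." Adding the utility flywheel raises precision@1 to 0.90 and improves the mean rank of the correct judgment from 1.90 to 1.30.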
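For readers unfamiliar with the two metrics in the table, here is a minimal sketch of how they can be computed. It is not the benchmark script itself: the helper names and the toy `scenarios` data are made up for illustration and do not reproduce the numbers above.

```python
from statistics import mean


def rank_of_correct(ranked_ids, correct_id):
    """Return the 1-based position of the correct judgment in a ranked list."""
    return ranked_ids.index(correct_id) + 1


def precision_at_1(ranks):
    """Fraction of scenarios in which the correct judgment was ranked first."""
    return sum(1 for r in ranks if r == 1) / len(ranks)


# Hypothetical toy data: (ranked result ids, id of the correct judgment).
scenarios = [
    (["good_a", "bad_a"], "good_a"),
    (["bad_b", "good_b"], "good_b"),
    (["good_c", "bad_c"], "good_c"),
    (["bad_d", "good_d"], "good_d"),
]

ranks = [rank_of_correct(ranked, correct) for ranked, correct in scenarios]
print(f"precision@1 = {precision_at_1(ranks):.2f}")
print(f"mean rank   = {mean(ranks):.2f}")
```

Running the same two functions over the rankings produced with and without utility weighting gives one column of the table each.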

The benchmark is fully reproducible — run it yourself:

pip install episodic-judgment sentence-transformers
python benchmarks/judgment_recall.py

Note: This is a synthetic proof-of-concept, not production validation. Real-world results depend on your domain, embedding quality, and verification coverage. The script is meant to be a starting point — we encourage you to adapt it to your own use case.

Quickstart

pip install episodic-judgment

from episodic_memory import EpisodicMemory

# Create an in-process memory store
memory = EpisodicMemory()

# Save a judgment and keep the returned ID for later verification
memory_id = memory.store(
    trigger="User asks agent to modify config.json in production",
    judgment="Production config changes must be confirmed with the user first",
    reasoning="Direct config writes have caused outages before. The agent should always propose, not execute.",
    domain="ops",
)
# memory_id → "mem_abc123"

# Later, when a similar situation arises, query past judgments
results = memory.search(
    query="Can I edit the production config file?",
    top_k=3,
)

for r in results:
    print(f"  [{r.distance:.2f}] {r.judgment}")
# → [0.21] Production config changes must be confirmed with the user first

# Did this judgment actually help? Let the feedback loop know.
memory.verify(memory_id, adopted=True)

# Next time, the utility-weighted search will rank this judgment higher.
results = memory.search(query="Can I edit the production config?", use_utility=True)

The Utility Flywheel

The core idea: judgments that have been repeatedly validated should rank higher than untested or disproven ones, even when the semantic similarity is the same.

                      ┌──────────────────┐
                      │  store() saves   │
                      │  a judgment      │
                      └────────┬─────────┘
                               │
                               ▼
  ┌─────────────────────────────────────────┐
  │  search(query, use_utility=True)        │
  │  ranks by sim × (1 + α · utility)       │
  └─────────┬───────────────────────────────┘
            │
            ▼
  ┌─────────────────────────────────────────┐
  │  Agent follows the judgment — or not?   │
  └─────────┬───────────────────────────────┘
            │
            ▼
  ┌─────────────────────────────────────────┐
  │  verify(id, adopted=True/False)         │
  │  updates utility_score                  │
  └─────────┬───────────────────────────────┘
            │
            ▼
  search() with use_utility=True (loop back)

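The formula is simple: rank_score = cosine_similarity × (1 + α · utility_score), where utility_score = adoption_count / (adoption_count + correction_count). A memory that has never been verified has a utility_score of 0, so its rank_score is just its cosine similarity.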

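α = 0.5 by default, adjustable via utility_weight
Raw cosine similarity is always the base: utility_score lies between 0 and 1, so the boost is at most a factor of 1 + α (1.5 by default). Utility can reorder close matches but cannot rescue an irrelevant one
use_utility=False (the default) ranks by cosine similarity alone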
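As a worked example, the sketch below applies the formula to three made-up candidates. It is an illustration of the arithmetic only, not the library's internal code; the function names and candidate data are hypothetical, and treating an unverified memory as utility 0 is an assumption based on the behavior described in this README.

```python
def utility_score(adoption_count: int, correction_count: int) -> float:
    """adoptions / (adoptions + corrections); assumed 0.0 when never verified."""
    total = adoption_count + correction_count
    return adoption_count / total if total else 0.0


def rank_score(similarity: float, utility: float, alpha: float = 0.5) -> float:
    """Cosine similarity boosted by at most a factor of (1 + alpha)."""
    return similarity * (1 + alpha * utility)


# Hypothetical candidates: (label, cosine similarity, adoptions, corrections).
candidates = [
    ("edit config directly", 0.82, 0, 3),
    ("confirm with user first", 0.80, 4, 1),
    ("unrelated judgment", 0.30, 5, 0),
]

ranked = sorted(
    candidates,
    key=lambda c: rank_score(c[1], utility_score(c[2], c[3])),
    reverse=True,
)
for label, sim, adopted, corrected in ranked:
    score = rank_score(sim, utility_score(adopted, corrected))
    print(f"{score:.2f}  {label}")
```

The validated judgment (0.80 × 1.4 = 1.12) overtakes the slightly more similar but repeatedly corrected one (0.82 × 1.0 = 0.82), while the unrelated judgment stays last (0.30 × 1.5 = 0.45) despite a perfect track record.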

When to use it

Scenario	Without Episodic Memory	With Episodic Memory
Agent repeatedly makes the same mistake	Each run starts from scratch	Past judgment is retrieved — "last time this broke"
Agent needs to know your operation style	Hardcoded in system prompt, never evolves	Utility feedback loop reinforces what works
Onboarding new agents	Every agent needs its own instructions	Shared memory of accumulated operational wisdom
Debugging agent behavior	"Why did it do that?" is guesswork	Every judgment carries its reasoning chain

API Reference

`EpisodicMemory(embedder=None, db_path=":memory:")`

Create a memory store. Defaults to in-memory SQLite; pass db_path for persistence.

Parameter	Default	Description
`embedder`	`None` (auto: SentenceTransformer)	Custom embedding function: `callable(str) → list[float]`
`db_path`	`":memory:"`	Path to SQLite database file
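To illustrate the `callable(str) → list[float]` shape, here is a toy hashing embedder built only on the standard library. It is a stand-in for experimentation, not a recommended embedding model; the function name, the 64-dimension default, and the L2 normalization are all assumptions rather than requirements stated by the library.

```python
import hashlib
import math


def hashing_embedder(text: str, dim: int = 64) -> list[float]:
    """Toy embedder: hash each lowercase token into one of `dim` buckets,
    count hits per bucket, then L2-normalize the resulting vector."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    # An empty string has no tokens, so the zero vector is returned unchanged.
    return [v / norm for v in vec] if norm else vec


vector = hashing_embedder("edit the production config")
print(len(vector))
```

Going by the constructor signature above, such a function would be passed as `EpisodicMemory(embedder=hashing_embedder)`.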

`store(trigger, judgment, reasoning, domain=None, metadata=None) → str`

Save a new episodic memory. Raises ValueError if trigger, judgment, or reasoning is empty.

`search(query, top_k=5, domain=None, min_score=0.0, use_utility=False, utility_weight=0.5) → list[SearchResult]`

Search past judgments by semantic similarity, optionally weighted by proven utility.

Parameter	Default	Description
`query`	required	Natural language query
`top_k`	`5`	Max results
`domain`	`None`	Filter by domain
`min_score`	`0.0`	Minimum similarity threshold
`use_utility`	`False`	Rank by relevance × utility — the flywheel
`utility_weight`	`0.5`	How strongly utility boosts the ranking

Returns a list of SearchResult(id, judgment, reasoning, trigger, domain, distance, metadata, utility_score).

`verify(memory_id, adopted, user_correction=None) → None`

Record whether a retrieved judgment was useful. Raises KeyError if the memory_id doesn't exist.

Parameter	Required	Description
`memory_id`	✅	ID returned by `store()`
`adopted`	✅	`True` if agent followed this judgment
`user_correction`		User's correction if the judgment was wrong

`export_triples(min_utility=0.0) → list[Triple]`

Export (context, judgment, outcome) triples for fine-tuning. Only exports memories with utility_score > min_utility (strictly greater, so unverified memories are excluded by default).
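The strict comparison matters for what ends up in a training set. The sketch below mimics the filter on hand-made records and serializes the survivors as JSON Lines. `ToyTriple` and its field names are hypothetical; the library's actual `Triple` type may be shaped differently.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ToyTriple:
    # Hypothetical field names; the library's real Triple type may differ.
    context: str
    judgment: str
    outcome: str
    utility_score: float


def filter_triples(triples, min_utility=0.0):
    """Keep only triples whose utility is strictly greater than min_utility."""
    return [t for t in triples if t.utility_score > min_utility]


triples = [
    ToyTriple("edit prod config", "confirm with user first", "adopted", 0.8),
    ToyTriple("skip CI tests", "skipping is fine", "corrected", 0.0),
    ToyTriple("rotate credentials", "stage the change", "unverified", 0.0),
]

kept = filter_triples(triples)
# One JSON object per line, a common input format for fine-tuning tools.
jsonl = "\n".join(json.dumps(asdict(t)) for t in kept)
print(jsonl)
```

With the default threshold, both the unverified record and the one that was only ever corrected are dropped, since each has a utility of 0.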

`close()`

Close the underlying storage connection. Also works as a context manager: with EpisodicMemory() as m: ...

When NOT to use it

If your agent only needs facts and preferences — use Mem0 or a vector store. This library is not designed for semantic memory.
If you don't have a way to verify judgments — the flywheel needs feedback. Without verify(), the utility scores stay at 0 and the library behaves like a plain vector store.
For more than ~10K records — the current knn_search scans all rows in Python. At scale, swap in sqlite-vec or pgvector (the interface is just Storage).
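To see why a full scan stops scaling, consider this minimal pure-Python sketch of brute-force nearest-neighbor search. It illustrates the general pattern only and is not the library's `knn_search`; the function names and sample rows are made up.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def brute_force_knn(query_vec, rows, top_k=5):
    """Score every stored row against the query, then keep the best top_k.
    Each search costs O(n * d) for n rows of dimension d."""
    scored = ((cosine_similarity(query_vec, vec), row_id) for row_id, vec in rows)
    return heapq.nlargest(top_k, scored)


# Hypothetical stored rows: (memory id, embedding).
rows = [
    ("mem_1", [1.0, 0.0, 0.0]),
    ("mem_2", [0.9, 0.1, 0.0]),
    ("mem_3", [0.0, 1.0, 0.0]),
]

top = brute_force_knn([1.0, 0.0, 0.0], rows, top_k=2)
for score, row_id in top:
    print(f"{score:.3f}  {row_id}")
```

Every query touches every row, so latency grows linearly with the store. An approximate index such as the ones mentioned above avoids that by examining only a subset of candidates.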

Project Status

This is an early release (v0.1.0). The API is stable for the core loop (store → search → verify → weighted recall), but expect additions before 1.0:

Pluggable ANN backends (sqlite-vec, pgvector)
Time-decayed utility weighting
Memory consolidation (merge duplicate judgments)
Streaming export for online fine-tuning

Related Work

Mem0 (⭐58k) — Universal semantic memory layer for AI agents. Complementary: use Mem0 for facts, episodic-memory for judgments.
LangGraph (⭐35k) — Stateful agent orchestration. Integrates via its MemorySaver interface.

License

MIT — see LICENSE.

Contributing

We welcome contributions! See CONTRIBUTING.md to get started.

Questions, ideas, or bugs? Open an issue.

⭐ If this project resonates, star it on GitHub — it helps others find it.

This version: 0.1.0 (Jun 18, 2026)
