memory-core-eval
Reproducible evaluation harness for agent memory systems.
This repository lets anyone benchmark a memory system — a BM25 or hybrid baseline, Memory Core, or a custom adapter — against the LongMemEval and LoCoMo retrieval benchmarks, and produce comparable, auditable results.
It is the open verification layer of the Memory Core project. The goal is simple: anyone should be able to reproduce a score, plug in their own system, and compare head-to-head without trusting anyone's marketing.
What this is / is not
Is:
- An eval harness with a stable MemoryAdapter interface.
- Built-in baselines: BM25, dense (sentence-transformers), and a BM25+dense RRF hybrid. These are the paper baselines.
- Two peer adapters for context: Hindsight (Vectorize AI) and m_flow (FlowElement-ai).
- A Memory Core adapter that talks to a self-hosted instance over HTTP.
- A reproducibility contract: pinned dataset revision + hash, deterministic ordering, full traces.
Is not:
- The Memory Core engine. Retrieval, ranking, and consolidation live in the main Memory Core repo.
- An end-to-end QA benchmark. This measures retrieval (Recall@k), not answer generation.
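To make the metric concrete, here is a minimal sketch of session-level Recall@k. It assumes a question counts as a hit when any gold (evidence) session appears in the top k results; the harness may define the hit rule differently, and the function names are illustrative rather than part of mceval.

```python
from typing import Sequence


def recall_at_k(ranked_ids: Sequence[str], gold_ids: set[str], k: int) -> float:
    """Return 1.0 if any gold session appears in the top-k results, else 0.0."""
    return 1.0 if any(sid in gold_ids for sid in ranked_ids[:k]) else 0.0


def mean_recall_at_k(runs: Sequence[tuple[Sequence[str], set[str]]], k: int) -> float:
    """Average the per-question hit indicator and scale it to a percentage."""
    if not runs:
        raise ValueError("need at least one question")
    return 100.0 * sum(recall_at_k(ranked, gold, k) for ranked, gold in runs) / len(runs)


# Two toy questions: (ranked session ids returned by a system, gold session ids).
runs = [
    (["s3", "s1", "s7"], {"s1"}),  # gold session sits at rank 2
    (["s4", "s5", "s6"], {"s9"}),  # gold session is never retrieved
]
print(mean_recall_at_k(runs, k=1))  # 0.0
print(mean_recall_at_k(runs, k=2))  # 50.0
```

Because the score depends only on which sessions are returned and in what order, nothing about answer generation enters the number.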
Install
memory-core-eval is not on PyPI yet — install editable from source:
git clone https://github.com/Evanyuan-builder/memory-core-eval.git
cd memory-core-eval
pip install -e . # core + BM25 baseline
pip install -e ".[dense]" # + sentence-transformers for dense / hybrid
pip install -e ".[dev]" # + pytest, ruff
Hindsight and m_flow adapters require their upstream client packages
(hindsight-client, m_flow); install separately if you want to run those
peers.
Quick start
# BM25 baseline on a 20-question stratified sample
mceval run --adapter bm25 --sample 20 --seed 0 --stratified
# LongMemEval session-haystack split
mceval run --adapter bm25 --dataset longmemeval --split s --sample 100
# LoCoMo (long-range conversational memory)
mceval run --adapter bm25 --dataset locomo --sample 100
# Memory Core against a self-hosted instance
mceval run --adapter memory-core --base-url http://localhost:8001 --sample 100
# Head-to-head comparison
mceval compare --adapters bm25,dense,hybrid-rrf,memory-core \
--base-url http://localhost:8001 --sample 100
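The --sample, --seed, and --stratified flags above suggest a seeded, stratified draw. The sketch below shows one way such a draw can be made deterministic. The stratification key (a "type" field), the proportional allocation rule, and the function name are assumptions for illustration, not the harness's actual implementation.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, n, seed):
    """Draw exactly n items, giving each stratum a share proportional to its size."""
    if n > len(items):
        raise ValueError("sample size exceeds population")
    rng = random.Random(seed)  # local RNG: the same seed always yields the same draw
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    labels = sorted(strata)  # sorted labels keep the iteration order deterministic

    # Largest-remainder allocation: floor each proportional quota, then hand the
    # leftover slots to the strata with the biggest fractional parts.
    exact = {lab: n * len(strata[lab]) / len(items) for lab in labels}
    quota = {lab: int(exact[lab]) for lab in labels}
    leftover = n - sum(quota.values())
    by_remainder = sorted(labels, key=lambda lab: exact[lab] - quota[lab], reverse=True)
    for lab in by_remainder[:leftover]:
        quota[lab] += 1

    picked = []
    for lab in labels:
        picked.extend(rng.sample(strata[lab], quota[lab]))
    return picked


# Hypothetical question records: 40 "temporal" and 60 "multi-session".
questions = [
    {"id": i, "type": "temporal" if i % 5 < 2 else "multi-session"} for i in range(100)
]
sample = stratified_sample(questions, key=lambda q: q["type"], n=10, seed=0)
print(len(sample), sum(q["type"] == "temporal" for q in sample))  # 10 4
```

Using a private random.Random(seed) instead of the module-level generator keeps the draw independent of any other code that touches global random state.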
Latest results
Paper baselines are included as anchors, not apples-to-apples leaderboard claims — sample sizes and harness versions differ. They are the strongest published numbers we know of for the same datasets, included so a reader can position the current run within that landscape.
LoCoMo (Maharana et al. 2024) — long-range conversational memory. Session-level Recall@k. n=100 stratified, seed=0, top_k=10:
| System | n | R@1 | R@5 | R@10 |
|---|---|---|---|---|
| BM25 (paper anchor) | 100 | 54.0 | 74.0 | 84.0 |
| Hybrid-RRF (paper anchor) | 100 | 50.0 | 78.0 | 85.0 |
| Memory Core (current run) | 100 | 57.0 | 80.0 | 87.0 |
LongMemEval-S (Wu et al. 2024) — session-haystack (~50 sessions / question). The Memory Core run is n=100 stratified; the paper anchors are at n=500. Treat the gap as suggestive until the larger sweep lands.
| System | n | R@10 |
|---|---|---|
| BM25 (paper anchor) | 500 | 96.2 |
| Hybrid-RRF (paper anchor) | 500 | 97.9 |
| Memory Core (current run) | 100 | 98.9 |
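To see why the sample-size difference matters, the following sketch puts an approximate 95% Wilson score interval around two of the R@10 values in the table. It treats each question as an independent pass/fail trial and ignores stratification, so it is a rough guide rather than the harness's own analysis; the function name is illustrative.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion observed on n questions."""
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


for label, score, n in [
    ("Hybrid-RRF (paper anchor)", 97.9, 500),
    ("Memory Core (current run)", 98.9, 100),
]:
    lo, hi = wilson_interval(score / 100.0, n)
    print(f"{label}: R@10 {score:.1f}, approx. 95% interval {100 * lo:.1f} to {100 * hi:.1f}")
```

Under these assumptions the n=100 interval is noticeably wider than the n=500 one and the two overlap, which is consistent with reading the gap as suggestive.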
Cross-restart stability is verified in reference benchmark runs. Canonical
reference JSONs live under baselines/.
LongMemEval-M and full n=500 sweeps are queued.
Datasets
- LongMemEval (xiaowu0162/longmemeval-cleaned on HuggingFace): three haystack splits via --split:
  - oracle (default): only evidence sessions, saturates at the top.
  - s: ~50 sessions / question, the discriminative split.
  - m: long-horizon, multi-month haystack.
- LoCoMo (snap-research/locomo10.json): 10 conversations × ~30 sessions × ~600 turns, ~200 QA pairs each. Session-level Recall@k (looser than the paper's dia_id-level metric, but apples-to-apples with LongMemEval here).
Both loaders pin a dataset revision + content hash for reproducibility.
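A content-hash pin of this kind can be checked with the standard library alone. The sketch below is an illustration of the idea, not the loaders' actual code; the function names and the way the expected digest is supplied are assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pinned(path: Path, expected_sha256: str) -> None:
    """Raise ValueError if the downloaded file does not match the pinned digest."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(f"{path.name}: expected {expected_sha256}, got {actual}")
```

Failing loudly on a mismatch, rather than warning, keeps a silently changed upstream file from producing scores that look comparable but are not.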
Adapter inventory
Built into the harness:
| Adapter | Role | Extra deps |
|---|---|---|
| bm25 | Paper baseline (rank-bm25) | core only |
| dense | Paper baseline (sentence-transformers MiniLM) | [dense] |
| hybrid-rrf | Paper baseline (BM25 + dense, RRF k=60) | [dense] |
| memory-core | Memory Core HTTP client | core only |
| hindsight | Vectorize AI peer | hindsight-client (separate install) |
| mflow | FlowElement-ai m_flow peer | m_flow (separate install) |
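For readers unfamiliar with the hybrid-rrf row, reciprocal rank fusion (RRF) combines ranked lists by summing 1 / (k + rank) for each item across lists. The sketch below uses the k=60 constant from the table; the function name, the tie-breaking rule, and the toy rankings are assumptions and not taken from the harness.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60, top_k: int = 10) -> list[str]:
    """Fuse ranked id lists: each list adds 1 / (k + rank) to an id's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_k]


bm25_ranking = ["s2", "s5", "s1"]   # toy BM25 output, best first
dense_ranking = ["s5", "s9", "s2"]  # toy dense-retriever output, best first
print(rrf_fuse([bm25_ranking, dense_ranking]))  # ['s5', 's2', 's9', 's1']
```

RRF uses only rank positions, so BM25 scores and embedding similarities never need to be put on a common scale.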
Writing an adapter
Implement the MemoryAdapter protocol (a name attribute plus the methods below):
from datetime import datetime

from mceval.adapters.base import MemoryAdapter, Turn, Memory


class MyAdapter:
    name = "my-system"

    def reset(self, namespace: str) -> None: ...

    def store(self, namespace: str, turn: Turn) -> str: ...  # returns memory id

    def search(
        self,
        namespace: str,
        query: str,
        top_k: int,
        as_of_date: datetime | None = None,  # for relative-time queries
    ) -> list[Memory]: ...
as_of_date is the reference time for resolving phrases like "yesterday"
or "last Tuesday" against a stored timeline. Adapters that don't model time
can ignore it.
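As an illustration of what a time-aware adapter might do with as_of_date, the sketch below resolves a few relative phrases to calendar days. The helper name, the supported phrases, and the reading of "last Tuesday" as the most recent Tuesday strictly before the reference day are all assumptions; real adapters may resolve time quite differently.

```python
from datetime import datetime, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def resolve_relative_day(phrase: str, as_of_date: datetime) -> datetime:
    """Map a small set of relative phrases to a calendar day anchored at as_of_date."""
    today = as_of_date.replace(hour=0, minute=0, second=0, microsecond=0)
    phrase = phrase.strip().lower()
    if phrase == "today":
        return today
    if phrase == "yesterday":
        return today - timedelta(days=1)
    if phrase.startswith("last ") and phrase[5:] in WEEKDAYS:
        target = WEEKDAYS.index(phrase[5:])
        # Days back to the most recent earlier occurrence of that weekday (1 to 7).
        days_back = (today.weekday() - target - 1) % 7 + 1
        return today - timedelta(days=days_back)
    raise ValueError(f"unsupported phrase: {phrase!r}")


as_of = datetime(2024, 5, 15, 9, 30)  # a Wednesday
print(resolve_relative_day("yesterday", as_of).date())     # 2024-05-14
print(resolve_relative_day("last Tuesday", as_of).date())  # 2024-05-14
```

The same stored turn can therefore match or miss depending on the reference time, which is why the harness passes as_of_date per query instead of relying on the wall clock.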
Run the contract tests against your adapter:
pytest tests/test_adapter_contract.py -k my_system
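For orientation, here is a toy in-memory adapter with the reset, store, and search methods described above. It is a sketch only: the Turn and Memory dataclasses are hypothetical stand-ins with invented fields (the real types live in mceval.adapters.base and may differ), and word-overlap scoring is chosen purely for brevity.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


@dataclass
class Turn:  # hypothetical stand-in for mceval.adapters.base.Turn
    session_id: str
    text: str


@dataclass
class Memory:  # hypothetical stand-in for mceval.adapters.base.Memory
    id: str
    session_id: str
    text: str
    score: float


class KeywordOverlapAdapter:
    name = "keyword-overlap"

    def __init__(self) -> None:
        self._store: dict[str, list[tuple[str, Turn]]] = {}

    def reset(self, namespace: str) -> None:
        # Drop everything stored under this namespace.
        self._store[namespace] = []

    def store(self, namespace: str, turn: Turn) -> str:
        bucket = self._store.setdefault(namespace, [])
        memory_id = f"{namespace}:{len(bucket)}"
        bucket.append((memory_id, turn))
        return memory_id

    def search(
        self,
        namespace: str,
        query: str,
        top_k: int,
        as_of_date: datetime | None = None,
    ) -> list[Memory]:
        # as_of_date is accepted but ignored: this toy adapter does not model time.
        query_terms = set(query.lower().split())
        hits = []
        for memory_id, turn in self._store.get(namespace, []):
            overlap = len(query_terms & set(turn.text.lower().split()))
            if overlap:
                hits.append(Memory(memory_id, turn.session_id, turn.text, float(overlap)))
        # Highest overlap first; ties broken by id for a deterministic order.
        hits.sort(key=lambda m: (-m.score, m.id))
        return hits[:top_k]


adapter = KeywordOverlapAdapter()
adapter.reset("q1")
adapter.store("q1", Turn("s1", "I adopted a cat named Miso"))
adapter.store("q1", Turn("s2", "The flight to Lisbon was delayed"))
print([m.session_id for m in adapter.search("q1", "what is my cat called", top_k=5)])  # ['s1']
```

Keeping each namespace isolated and making reset idempotent lets one question's haystack be loaded without leaking into the next.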
Diagnostic tool
For investigating why a specific question lands or misses against a baseline, use the A/B probe:
MEMORY_CORE_URL=http://127.0.0.1:8001 \
python -m mceval.diagnose.ab conv-41:q19 conv-42:q186
The probe stores the same haystack in memory-core and hybrid-rrf, runs each
question through both, and reports where the gold session lands in each top-K.
This tells you whether a gap is upstream of ranking (the retrieval candidate
pool) or downstream (rank ordering).
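The upstream-versus-downstream distinction can be read off from the gold session's position in each system's ranked list. The sketch below is one simplified interpretation, not the probe's actual logic: it assumes a gold session missing from the retrieved list points to the candidate pool, while one retrieved below the cutoff points to rank ordering. The function names and labels are hypothetical.

```python
from __future__ import annotations


def gold_rank(ranked_session_ids: list[str], gold_session: str) -> int | None:
    """1-based position of the gold session, or None if it was not retrieved."""
    try:
        return ranked_session_ids.index(gold_session) + 1
    except ValueError:
        return None


def classify(ranked_session_ids: list[str], gold_session: str, cutoff: int) -> str:
    """Label a result as a hit, a rank-ordering miss, or a candidate-pool miss."""
    rank = gold_rank(ranked_session_ids, gold_session)
    if rank is None:
        return "pool-miss"  # never retrieved: look upstream of ranking
    if rank > cutoff:
        return "rank-miss"  # retrieved but ordered too low: look at ranking
    return "hit"


# Toy ranked outputs for one question whose gold session is "s7".
results = {
    "memory-core": ["s4", "s9", "s2", "s7"],
    "hybrid-rrf": ["s2", "s4", "s1", "s3"],
}
for system, ranked in results.items():
    print(system, gold_rank(ranked, "s7"), classify(ranked, "s7", cutoff=3))
# memory-core 4 rank-miss
# hybrid-rrf None pool-miss
```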
Reproducibility
Each run pins:
- Dataset revision hash (HuggingFace dataset for LongMemEval; commit-pinned URL for LoCoMo).
- SHA-256 of the downloaded file.
- Adapter name and harness version.
- Full per-question trace (question → stored turns → search results → verdict) as JSONL when --trace is passed.
Canonical baseline JSONs (paper anchors, the current Memory Core canonical
state, and cross-restart determinism evidence) live under baselines/.
License
Apache-2.0. See LICENSE.