memory-core-eval

Reproducible evaluation harness for agent memory systems.

This repository lets anyone benchmark a memory system — a BM25 or hybrid baseline, Memory Core, or a custom adapter — against the LongMemEval and LoCoMo retrieval benchmarks, and produce comparable, auditable results.

It is the open verification layer of the Memory Core project. The goal is simple: anyone should be able to reproduce a score, plug in their own system, and compare head-to-head without trusting anyone's marketing.

What this is / is not

Is:

An eval harness with a stable MemoryAdapter interface.
Built-in baselines: BM25, dense (sentence-transformers), BM25+dense RRF hybrid — the paper baselines.
Two peer adapters for context: Hindsight (Vectorize AI) and m_flow (FlowElement-ai).
A Memory Core adapter that talks to a self-hosted instance over HTTP.
A reproducibility contract: pinned dataset revision + hash, deterministic ordering, full traces.

Is not:

The Memory Core engine. Retrieval, ranking, and consolidation live in the main Memory Core repo.
An end-to-end QA benchmark. This measures retrieval (Recall@k), not answer generation.
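For orientation, session-level Recall@k can be sketched as below. This is a minimal illustration rather than the harness's scoring code: it assumes a question counts as a hit when any gold session id appears in the top k results, and the function names `hit_at_k` and `recall_at_k` are hypothetical.

```python
from typing import Sequence, Set, Tuple


def hit_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int) -> bool:
    """Return True if any gold session id appears in the first k results."""
    return any(sid in gold_ids for sid in ranked_ids[:k])


def recall_at_k(runs: Sequence[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Fraction of questions with at least one gold session in the top k."""
    if not runs:
        return 0.0
    hits = sum(hit_at_k(ranked, gold, k) for ranked, gold in runs)
    return hits / len(runs)


# Each entry pairs a ranked result list with the gold session ids (toy data).
runs = [
    (["s3", "s1", "s7"], {"s1"}),  # gold at rank 2
    (["s4", "s9", "s2"], {"s2"}),  # gold at rank 3
    (["s5", "s6", "s8"], {"s0"}),  # gold not retrieved
]
for k in (1, 2, 3):
    print(k, recall_at_k(runs, k))
```

Answer quality never enters this calculation, which is why the harness is described as a retrieval benchmark rather than a QA benchmark.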

Install

memory-core-eval 0.2.0 is published on PyPI. To work from a checkout, install editable from source:

git clone https://github.com/Evanyuan-builder/memory-core-eval.git
cd memory-core-eval
pip install -e .                  # core + BM25 baseline
pip install -e ".[dense]"         # + sentence-transformers for dense / hybrid
pip install -e ".[dev]"           # + pytest, ruff

Hindsight and m_flow adapters require their upstream client packages (hindsight-client, m_flow); install separately if you want to run those peers.

Quick start

# BM25 baseline on a 20-question stratified sample
mceval run --adapter bm25 --sample 20 --seed 0 --stratified

# LongMemEval session-haystack split
mceval run --adapter bm25 --dataset longmemeval --split s --sample 100

# LoCoMo (long-range conversational memory)
mceval run --adapter bm25 --dataset locomo --sample 100

# Memory Core against a self-hosted instance
mceval run --adapter memory-core --base-url http://localhost:8001 --sample 100

# Head-to-head comparison
mceval compare --adapters bm25,dense,hybrid-rrf,memory-core \
  --base-url http://localhost:8001 --sample 100
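The --sample, --seed, and --stratified flags above suggest a seeded, stratified draw of questions. The sketch below shows one way such a draw could work. It is an assumption-laden illustration, not the harness's sampler: the stratum key (here a `type` field), the proportional quota rule, and the function name `stratified_sample` are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable, List, Sequence, TypeVar

T = TypeVar("T")


def stratified_sample(
    items: Sequence[T], key: Callable[[T], Hashable], n: int, seed: int
) -> List[T]:
    """Draw about n items, spread proportionally across strata, reproducibly."""
    rng = random.Random(seed)  # private RNG so the global state is untouched
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    picked: List[T] = []
    # Visit strata in sorted order so the draw does not depend on input order
    # of the labels.
    for label in sorted(strata, key=str):
        group = strata[label]
        quota = max(1, round(n * len(group) / len(items)))
        picked.extend(rng.sample(group, min(quota, len(group))))
    return picked[:n]  # rounding can overshoot slightly; trim to at most n


questions = [{"id": i, "type": "temporal" if i % 2 else "single-hop"} for i in range(40)]
sample = stratified_sample(questions, key=lambda q: q["type"], n=10, seed=0)
print([q["id"] for q in sample])
```

Because the RNG is seeded and the strata are visited in a fixed order, the same seed returns the same questions on every run.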

Latest results

Paper baselines are included as anchors, not apples-to-apples leaderboard claims — sample sizes and harness versions differ. They are the strongest published numbers we know of for the same datasets, included so a reader can position the current run within that landscape.

LoCoMo (Maharana et al. 2024) — long-range conversational memory. Session-level Recall@k. n=100 stratified, seed=0, top_k=10:

System	n	R@1	R@5	R@10
BM25 (paper anchor)	100	54.0	74.0	84.0
Hybrid-RRF (paper anchor)	100	50.0	78.0	85.0
Memory Core (current run)	100	57.0	80.0	87.0

LongMemEval-S (Wu et al. 2024) — session-haystack (~50 sessions / question). The Memory Core run is n=100 stratified; the paper anchors are at n=500. Treat the gap as suggestive until the larger sweep lands.

System	n	R@10
BM25 (paper anchor)	500	96.2
Hybrid-RRF (paper anchor)	500	97.9
Memory Core (current run)	100	98.9
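One way to see why the gap is only suggestive is to compare sampling uncertainty at the two sample sizes. The sketch below computes a Wilson score interval for the R@10 values in the table above. It makes a simplifying assumption that each question is an independent hit or miss; the harness may average a different per-question score, so read the output as a rough guide only.

```python
import math
from typing import Tuple


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


rows = [
    ("Hybrid-RRF (paper anchor)", 0.979, 500),
    ("Memory Core (current run)", 0.989, 100),
]
for label, p_hat, n in rows:
    low, high = wilson_interval(p_hat, n)
    print(f"{label}: n={n}, interval {low:.3f} to {high:.3f}")
```

Under that assumption the n=100 interval is noticeably wider than the n=500 one and the two overlap, which is consistent with waiting for the larger sweep before drawing conclusions.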

Reference benchmark runs check cross-restart stability, meaning results stay the same after the system under test is restarted. Canonical reference JSONs live under baselines/.

LongMemEval-M and full n=500 sweeps are queued.

Datasets

LongMemEval (xiaowu0162/longmemeval-cleaned on HuggingFace) — three haystack splits via --split:
- oracle (default): only evidence sessions, so scores saturate near the top.
- s: ~50 sessions / question, the discriminative split.
- m: long-horizon, multi-month haystack.
LoCoMo (snap-research/locomo10.json) — 10 conversations × ~30 sessions × ~600 turns, ~200 QA pairs each. Session-level Recall@k (looser than the paper's dia_id-level metric, but apples-to-apples with LongMemEval here).

Both loaders pin a dataset revision + content hash for reproducibility.
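A content-hash check of this kind can be sketched with the standard library. The function names and the error message are hypothetical, and the loaders' actual verification code is not shown in this README; the sketch only illustrates comparing a downloaded file against a pinned SHA-256.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pinned(path: Path, expected_sha256: str) -> None:
    """Raise if the file on disk does not match the pinned hash."""
    actual = sha256_of_file(path)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"hash mismatch for {path}: expected {expected_sha256}, got {actual}"
        )
```

Failing loudly on a mismatch means a silently updated upstream file cannot change scores without anyone noticing.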

Adapter inventory

Built into the harness:

Adapter	Role	Extra deps
`bm25`	Paper baseline (rank-bm25)	core only
`dense`	Paper baseline (sentence-transformers MiniLM)	`[dense]`
`hybrid-rrf`	Paper baseline (BM25 + dense, RRF k=60)	`[dense]`
`memory-core`	Memory Core HTTP client	core only
`hindsight`	Vectorize AI peer	`hindsight-client` (separate install)
`mflow`	FlowElement-ai m_flow peer	`m_flow` (separate install)
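For readers unfamiliar with the fusion step behind `hybrid-rrf`, reciprocal rank fusion with k=60 can be sketched as follows. This is a generic illustration of the technique, not the harness's implementation; the function name `rrf_fuse` and the tie-breaking rule are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60, top_k: int = 10) -> List[str]:
    """Fuse ranked id lists: each list adds 1 / (k + rank) to an id's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_k]


bm25_ranking = ["s2", "s1", "s3"]
dense_ranking = ["s1", "s4", "s2"]
print(rrf_fuse([bm25_ranking, dense_ranking]))  # ['s1', 's2', 's4', 's3']
```

RRF uses only ranks, never raw scores, so BM25 and cosine similarities do not need to be put on a common scale before fusing.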

Writing an adapter

Implement the MemoryAdapter protocol (a name attribute plus the methods shown here):

from datetime import datetime
from mceval.adapters.base import MemoryAdapter, Turn, Memory

class MyAdapter:
    name = "my-system"

    def reset(self, namespace: str) -> None: ...
    def store(self, namespace: str, turn: Turn) -> str: ...                # returns memory id
    def search(
        self,
        namespace: str,
        query: str,
        top_k: int,
        as_of_date: datetime | None = None,                                # for relative-time queries
    ) -> list[Memory]: ...

as_of_date is the reference time for resolving phrases like "yesterday" or "last Tuesday" against a stored timeline. Adapters that don't model time can ignore it.
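To make the role of as_of_date concrete, here is a small sketch of resolving a relative phrase against a reference time. It is illustrative only: the helper `resolve_relative` is hypothetical, it handles just a few phrases, and it reads "last Tuesday" as the most recent Tuesday strictly before the reference date, which is one of several reasonable interpretations.

```python
from datetime import date, datetime, timedelta
from typing import Optional

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def resolve_relative(phrase: str, as_of_date: datetime) -> Optional[date]:
    """Map a relative phrase to a calendar date, or None if unsupported."""
    text = phrase.strip().lower()
    today = as_of_date.date()
    if text == "today":
        return today
    if text == "yesterday":
        return today - timedelta(days=1)
    if text.startswith("last "):
        name = text[len("last "):]
        if name in WEEKDAYS:
            # Days back to the most recent such weekday; a full week if the
            # reference date already falls on that weekday.
            delta = (today.weekday() - WEEKDAYS.index(name)) % 7 or 7
            return today - timedelta(days=delta)
    return None


as_of = datetime(2024, 5, 15, 9, 30)  # a Wednesday
print(resolve_relative("yesterday", as_of))     # 2024-05-14
print(resolve_relative("last Tuesday", as_of))  # 2024-05-14
```

An adapter that models time could use a resolved date like this to filter or boost stored turns before ranking.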

Run the contract tests against your adapter:

pytest tests/test_adapter_contract.py -k my_system
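As a starting point before running those tests, a toy adapter with the same method shapes might look like the sketch below. The `Turn` and `Memory` classes here are stand-ins with hypothetical fields so the example runs on its own; the real types come from mceval.adapters.base and their fields are not shown in this README. The scoring is naive word overlap and the adapter ignores as_of_date.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Turn:  # stand-in for mceval's Turn; fields are assumptions
    session_id: str
    text: str


@dataclass(frozen=True)
class Memory:  # stand-in for mceval's Memory; fields are assumptions
    id: str
    session_id: str
    text: str
    score: float


class OverlapAdapter:
    name = "toy-overlap"

    def __init__(self) -> None:
        self._store: Dict[str, List[Memory]] = {}

    def reset(self, namespace: str) -> None:
        self._store[namespace] = []  # drop everything stored in this namespace

    def store(self, namespace: str, turn: Turn) -> str:
        bucket = self._store.setdefault(namespace, [])
        memory_id = f"{namespace}:{len(bucket)}"
        bucket.append(Memory(memory_id, turn.session_id, turn.text, 0.0))
        return memory_id

    def search(
        self,
        namespace: str,
        query: str,
        top_k: int,
        as_of_date: Optional[datetime] = None,  # ignored: no time modelling
    ) -> List[Memory]:
        terms = set(query.lower().split())
        scored = [
            Memory(m.id, m.session_id, m.text,
                   float(len(terms & set(m.text.lower().split()))))
            for m in self._store.get(namespace, [])
        ]
        scored.sort(key=lambda m: (-m.score, m.id))  # deterministic ordering
        return scored[:top_k]


adapter = OverlapAdapter()
adapter.reset("q1")
adapter.store("q1", Turn("s1", "I adopted a cat named Miso"))
adapter.store("q1", Turn("s2", "We talked about the weather"))
print(adapter.search("q1", "cat named", top_k=1)[0].session_id)  # s1
```

Keeping each namespace in its own bucket means reset for one question cannot leak state into another.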

Diagnostic tool

For investigating why a specific question lands or misses against a baseline, use the A/B probe:

MEMORY_CORE_URL=http://127.0.0.1:8001 \
    python -m mceval.diagnose.ab conv-41:q19 conv-42:q186

The probe stores the same haystack in memory-core and hybrid-rrf, runs the question through both, and reports where the gold session lands in each top-K. This tells you whether a gap originates upstream of ranking (the retrieval candidate pool) or downstream (rank ordering).
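The upstream versus downstream distinction can be expressed as a small classification over ranked lists. The sketch below is not the probe's code or output format, which this README does not show; the function names, the labels, and the idea of having a separate candidate list are assumptions for illustration.

```python
from typing import Optional, Sequence


def gold_rank(ranked_session_ids: Sequence[str], gold_session_id: str) -> Optional[int]:
    """1-based rank of the gold session, or None if it is absent."""
    for rank, sid in enumerate(ranked_session_ids, start=1):
        if sid == gold_session_id:
            return rank
    return None


def classify_gap(
    candidates: Sequence[str], final: Sequence[str], gold: str, top_k: int
) -> str:
    """Label where a miss comes from: candidate pool or rank ordering."""
    if gold_rank(final[:top_k], gold) is not None:
        return "hit"
    if gold_rank(candidates, gold) is None:
        return "upstream: gold missing from candidate pool"
    return "downstream: gold in pool but ranked below top-k"


candidates = ["s4", "s2", "s9", "s1"]
final = ["s4", "s2", "s9", "s1"]
print(classify_gap(candidates, final, gold="s1", top_k=3))
```

A downstream miss points at the reranking or scoring stage, while an upstream miss means no amount of reordering would have surfaced the gold session.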

Reproducibility

Each run pins:

Dataset revision hash (HuggingFace dataset for LongMemEval; commit-pinned URL for LoCoMo).
SHA-256 of the downloaded file.
Adapter name and harness version.
Full per-question trace (question → stored turns → search results → verdict) as JSONL when --trace is passed.
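A JSONL trace is convenient to audit because each question is one self-contained line. The sketch below writes and reads such a file with the standard library. The record fields (`question_id`, `results`, `hit`) are hypothetical placeholders; the actual trace schema written by --trace is not documented here.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, List


def write_trace(path: Path, records: Iterable[Dict]) -> None:
    """Write one JSON object per line, with sorted keys for stable diffs."""
    with path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, sort_keys=True) + "\n")


def read_trace(path: Path) -> List[Dict]:
    """Load every non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def missed_questions(records: Iterable[Dict]) -> List[str]:
    """Ids of questions whose verdict was a miss (hypothetical 'hit' field)."""
    return [r["question_id"] for r in records if not r["hit"]]


example = [
    {"question_id": "q1", "results": ["s1", "s3"], "hit": True},
    {"question_id": "q2", "results": ["s5", "s6"], "hit": False},
]
```

Sorting keys on write keeps two traces of the same run byte-identical, which makes plain `diff` usable for comparing runs.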

Canonical baseline JSONs (paper anchors, the current Memory Core canonical state, and cross-restart determinism evidence) live under baselines/.

License

Apache-2.0. See LICENSE.

This version: 0.2.0 (May 21, 2026)


File details

Details for the file memory_core_eval-0.2.0.tar.gz.

File metadata

Download URL: memory_core_eval-0.2.0.tar.gz
Upload date: May 21, 2026
Size: 1.3 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for memory_core_eval-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`771393de56af2f760fe58d80daeb4d96a6cee9af251e12697eae8df63e406c26`
MD5	`22ca2485495e03ad21400f8b443fbf1d`
BLAKE2b-256	`79a9b6a3bd69994a35051644835bdd270c68820d1e305bffa8a5f9849ab56f40`

File details

Details for the file memory_core_eval-0.2.0-py3-none-any.whl.

File metadata

Download URL: memory_core_eval-0.2.0-py3-none-any.whl
Upload date: May 21, 2026
Size: 40.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for memory_core_eval-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0cb15e3ee7100e331f40b227682a59e42a67f4e7648fbf587e9762b008c098d2`
MD5	`2d0def8068653ebe16b677c0080fb240`
BLAKE2b-256	`6b466c26129879efcda2a6e2d5a56f6dd4c8746bc028a50ec4c51806cc75e3f3`
