Agent-agnostic faithfulness evaluation framework — evaluator → judge → curator pipeline for agent memory quality scoring

dream-eval

Agent-agnostic faithfulness evaluation framework for agent memory quality scoring.

What it does

dream-eval implements the evaluator → judge → curator pipeline pattern:

  • Evaluator reads the transcripts and the soul (the interpretive lens), then proposes items
  • Judge scores the proposed items against labels WITHOUT reading the soul (enforcing objectivity)
  • Curator writes the results (enforcing separation of concerns)

This pattern is unique in the agent memory space — no competitor (mem0, Cognee, LangMem) offers automated faithfulness evaluation.
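
The separation of roles described above can be illustrated with a small, self-contained sketch. It does not use dream-eval's own API: `Proposal`, `evaluate`, `judge`, and `curate` are hypothetical names, and the keyword rule in the evaluator is a toy assumption. The point is only that the judge's signature has no access to the soul, and that only the curator writes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    """Hypothetical stand-in for an item the evaluator proposes."""
    id: str
    category: str


def evaluate(transcripts: list[str], soul: str) -> list[Proposal]:
    """Evaluator: sees transcripts AND the soul, proposes items.

    Toy rule (assumption): propose a 'pref' item for every transcript
    line that mentions a keyword listed in the comma-separated soul.
    """
    keywords = [word.strip().lower() for word in soul.split(",") if word.strip()]
    proposals = []
    for index, line in enumerate(transcripts, start=1):
        if any(keyword in line.lower() for keyword in keywords):
            proposals.append(Proposal(id=f"pref-{index}", category="pref"))
    return proposals


def judge(proposals: list[Proposal], label_ids: set[str]) -> dict:
    """Judge: scores against labels only; the soul is not a parameter."""
    matched = [p.id for p in proposals if p.id in label_ids]
    score = len(matched) / len(proposals) if proposals else 0.0
    return {"matched": matched, "score": score}


def curate(verdict: dict, store: dict) -> None:
    """Curator: the only stage that writes results."""
    store["latest"] = verdict


transcripts = ["User asked for dark mode.", "User talked about lunch."]
store: dict = {}
proposals = evaluate(transcripts, soul="dark mode, merge")
curate(judge(proposals, label_ids={"pref-1"}), store)
print(store["latest"])
```

Keeping the soul out of the judge's parameters makes the objectivity constraint structural rather than a matter of convention.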

Install

pip install dream-eval

Quick start

from dream_eval import compute_faithfulness
from dream_eval.types import ProposedItem, LabeledItem

proposed = [
    ProposedItem(id="pref-1", category="pref", content={"key": "dark_mode"}),
    ProposedItem(id="workflow-1", category="workflow", content={"key": "ci_merge"}),
]
labels = [
    LabeledItem(id="pref-1", category="pref"),
    LabeledItem(id="workflow-1", category="workflow"),
]

report = compute_faithfulness(proposed, labels)
print(f"Faithfulness: {report.faithfulness_score}")
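
The source does not state how the faithfulness score is computed. As a rough illustration of label-based scoring in general, the sketch below compares proposed items with labels by `(id, category)` pairs and reports precision and recall. `Item` and `match_scores` are hypothetical names, and the matching rule is an assumption, not dream-eval's actual algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical stand-in for a proposed or labeled item (id + category only)."""
    id: str
    category: str


def match_scores(proposed: list[Item], labels: list[Item]) -> dict[str, float]:
    """Compare proposed items with labels by (id, category) pairs."""
    proposed_keys = {(item.id, item.category) for item in proposed}
    label_keys = {(item.id, item.category) for item in labels}
    hits = proposed_keys & label_keys
    # Precision: share of proposals backed by a label.
    precision = len(hits) / len(proposed_keys) if proposed_keys else 0.0
    # Recall: share of labels that were proposed.
    recall = len(hits) / len(label_keys) if label_keys else 0.0
    return {"precision": precision, "recall": recall}


proposed = [
    Item("pref-1", "pref"),
    Item("workflow-1", "workflow"),
    Item("pref-9", "pref"),  # not present in the labels
]
labels = [Item("pref-1", "pref"), Item("workflow-1", "workflow")]
scores = match_scores(proposed, labels)
print(scores)
```

In this toy input the unlabeled `pref-9` proposal lowers precision while recall stays at 1.0, which is the kind of gap a faithfulness report is meant to surface.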

CLI

# Score evaluator report against labels
dream-eval score --report report.json --labels labels.json

# Run deterministic gates
dream-eval gate --labels labels.json --output evaluator_output.txt

# Export to metrics.json format
dream-eval export --input eval_result.json --output metrics.json

Deterministic gates

These gates fail the evaluation regardless of LLM scores:

  • secret_leak — checks for forbidden patterns (API keys, tokens, passwords)
  • hash_determinism — verifies BOM/CRLF normalization produces stable hashes
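
To make the two gates concrete, here is a minimal standard-library sketch of what such checks might look like. The function names, the regex patterns, and the choice of SHA-256 are assumptions made for illustration; the source does not document the internals of dream-eval's `gates.py`.

```python
import hashlib
import re

# Illustrative patterns only (assumptions); a real gate would use a vetted list.
FORBIDDEN_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS-style access key id
    re.compile(r"(?i)\b(?:password|token|api[_-]?key)\s*[:=]\s*\S+"),
]


def secret_leak_gate(text: str) -> bool:
    """Return True when the text passes, i.e. no forbidden pattern is found."""
    return not any(pattern.search(text) for pattern in FORBIDDEN_PATTERNS)


def normalized_hash(raw: bytes) -> str:
    """Hash bytes after dropping a UTF-8 BOM and converting CRLF/CR to LF."""
    if raw.startswith(b"\xef\xbb\xbf"):
        raw = raw[3:]
    raw = raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return hashlib.sha256(raw).hexdigest()


def hash_determinism_gate(variants: list[bytes]) -> bool:
    """Return True when all encodings of the same content hash identically."""
    return len({normalized_hash(variant) for variant in variants}) == 1


print(secret_leak_gate("user prefers dark mode"))
print(secret_leak_gate("api_key = abc123"))
print(hash_determinism_gate([b"a\nb\n", b"\xef\xbb\xbfa\r\nb\r\n"]))
```

Both checks are pure functions of their input, which is what lets them act as hard pass/fail gates independent of any model-produced score.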

Memory backend adapter

dream-eval works with any memory backend via BaseMemoryBackend:

from dream_eval.adapter import BaseMemoryBackend

class MyBackend(BaseMemoryBackend):
    def read_transcripts(self, corpus_path=None):
        # Read from your storage
        ...

    def read_labels(self, labels_path=None):
        # Read ground truth labels
        ...

    def write_eval_result(self, result):
        # Write evaluation results
        ...

A built-in DictMemoryBackend is provided for testing.
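
For readers who want to see the adapter shape end to end without installing the package, the following sketch implements the three methods shown above against a local stand-in base class. `BackendInterface` and `InMemoryBackend` are hypothetical names, and the return types are assumptions; this is not the library's DictMemoryBackend.

```python
from abc import ABC, abstractmethod
from typing import Any, Optional


class BackendInterface(ABC):
    """Local stand-in (assumption) mirroring the three methods shown above."""

    @abstractmethod
    def read_transcripts(self, corpus_path: Optional[str] = None) -> list[str]: ...

    @abstractmethod
    def read_labels(self, labels_path: Optional[str] = None) -> list[dict[str, Any]]: ...

    @abstractmethod
    def write_eval_result(self, result: dict[str, Any]) -> None: ...


class InMemoryBackend(BackendInterface):
    """Keeps transcripts, labels, and results in plain Python containers."""

    def __init__(self, transcripts: list[str], labels: list[dict[str, Any]]):
        self._transcripts = list(transcripts)
        self._labels = list(labels)
        self.results: list[dict[str, Any]] = []

    def read_transcripts(self, corpus_path=None):
        # corpus_path is ignored: everything already lives in memory.
        return list(self._transcripts)

    def read_labels(self, labels_path=None):
        # labels_path is ignored for the same reason.
        return list(self._labels)

    def write_eval_result(self, result):
        self.results.append(result)


backend = InMemoryBackend(
    transcripts=["User asked for dark mode."],
    labels=[{"id": "pref-1", "category": "pref"}],
)
backend.write_eval_result({"faithfulness_score": 1.0})
print(backend.read_transcripts(), backend.read_labels(), backend.results)
```

Returning copies from the read methods keeps callers from mutating the backend's stored data by accident, which is convenient in tests.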

Architecture

dream-eval/
├── src/dream_eval/
│   ├── __init__.py      # Package exports
│   ├── types.py         # Pydantic models (EvalResult, FaithfulnessReport, etc.)
│   ├── scoring.py       # Faithfulness, precision, recall algorithms
│   ├── gates.py         # Deterministic gates (secret_leak, hash_determinism)
│   ├── adapter.py       # Abstract BaseMemoryBackend + DictMemoryBackend
│   └── cli.py           # CLI entry point
└── tests/               # Test suite

License

MIT
