Claim Memory Graph: a lightweight audit layer for inspectable LLM-as-a-judge decisions.

Maintainers

ML0037

Project description

CMG - Claim Memory Graph

An LLM judge usually hands you a verdict and little else. You get a PASS, but you cannot tell whether it really checked your rubric or used the evidence you gave it. CMG closes that gap by making the judge back up each verdict with claims and tying every claim to the evidence behind it. A set of plain checks then flags the cases where the verdict does not hold up, without putting a second LLM in the loop. It will not tell you who is right, but it will tell you which verdicts you can trust and which ones a person should read.

Why

LLM judges are useful, but they are not neutral. Researchers keep finding the same failure modes.

Zheng et al. report position bias, verbosity bias, self-enhancement bias, and limited reasoning.
Li et al. show scoring bias from rubric order, score ids, and reference answer scoring.
Feng et al. show that explicit rubrics and criteria can help judge consistency, but do not solve it.
Wang et al. show weak evidence verification in research-agent judging.
Chen et al. show reliability gaps for long-form outputs, even when rubrics or references are present.

CMG does not pretend to fix these biases, but it does make them easy to spot. You tell the judge what to check by passing the task, the answer, an optional reference, the rubric, and the criteria, and CMG saves all of that as evidence for the judge to make claims against. Each verdict then has to rest on real claims, and each claim has to point back to a piece of that evidence. When the judge cuts a corner, the viewer flags it, whether that is missing evidence, an ignored reference, a rubric item nobody checked, a bad verdict, or an unsafe verdict change.
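To make the idea of evidence-linked claims concrete, here is a minimal sketch of three such checks in plain Python. It is not CMG's implementation: the `Claim` dataclass, the evidence ids (`task-1`, `ans-1`, `ref-1`, `rub-1`), and the `audit` function are hypothetical, and only the flag names are taken from this README.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """Hypothetical claim record; CMG's real schema may differ."""
    content: str
    refs: list[str] = field(default_factory=list)  # ids of cited evidence
    retracted: bool = False


def audit(claims, evidence_ids, reference_id=None):
    """Return flag names for one verdict, using only set lookups (no LLM)."""
    active = [c for c in claims if not c.retracted]
    flags = []
    # A verdict with no active claim that cites anything is unbacked.
    if not any(c.refs for c in active):
        flags.append("uncited_verdict")
    # A claim counts as supported only if a ref points at stored evidence.
    if not any(set(c.refs) & evidence_ids for c in active):
        flags.append("no_supported_claims")
    # If a reference answer exists, some active claim should cite it.
    if reference_id is not None and not any(reference_id in c.refs for c in active):
        flags.append("reference_ignored")
    return flags


# Assumed evidence ids for the task, answer, reference, and rubric.
evidence = {"task-1", "ans-1", "ref-1", "rub-1"}
claims = [Claim("The answer is complete.", refs=["ans-1"])]
print(audit(claims, evidence, reference_id="ref-1"))  # ['reference_ignored']
```

The point of the sketch is that every check is a deterministic lookup over recorded claims and evidence, which is why no second model is needed to raise a flag.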

For now the local viewer is the dashboard.

cmg-view cmg-runs/*.cmg.jsonl --flagged-only

A web dashboard can read the same report data later.

When to use CMG

Use CMG when you run an LLM judge and cannot just trust what it says.

Large eval runs. You score thousands of cases and cannot read every explanation by hand, so CMG flags the ones that need a human and lets you skip the rest.
Reference checks. You want to catch a verdict that never cited the gold answer (reference_ignored).
Rubric coverage. You need every criterion checked, not quietly skipped (rubric_coverage_gap).
Audit and debugging. You want a replayable trail for each decision, so you can explain a score or work out why scores drift between runs.
Multi-turn judging. You need to catch a verdict that flipped without a proper retraction (verdict_flip_without_invalidation).

CMG will not tell you whether the judge is right, because that call still belongs to a person. What it does check is whether the judge backed its verdict, covered your rubric, and stayed consistent, and it points you at the cases where it did not.
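For a large run, the triage step could look roughly like the sketch below. It assumes each report is a dict whose `human_review_flags` entry is a list of flag-name strings; that shape, the `triage` helper, and the sample reports are illustrative assumptions, not documented CMG behavior.

```python
from collections import Counter


def triage(reports):
    """Split case ids into those needing review and those that can be skipped."""
    needs_review, clean = [], []
    flag_counts = Counter()
    for case_id, report in reports.items():
        # Assumption: flags are plain strings such as "reference_ignored".
        flags = report.get("human_review_flags") or []
        flag_counts.update(flags)
        (needs_review if flags else clean).append(case_id)
    return needs_review, clean, flag_counts


# Hypothetical reports keyed by case id; real ones would come from judge_report(graph).
reports = {
    "case-1": {"verdict": "pass", "human_review_flags": []},
    "case-2": {"verdict": "pass", "human_review_flags": ["reference_ignored"]},
    "case-3": {"verdict": None, "human_review_flags": ["missing_verdict"]},
}
needs_review, clean, flag_counts = triage(reports)
print(needs_review)  # ['case-2', 'case-3']
print(clean)         # ['case-1']
print(flag_counts.most_common())
```

Note that case-2 is a PASS that still lands in the review pile, which is the situation the audit layer is meant to surface.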

Install

pip install claim-memory-graph

Optional provider helpers:

pip install 'claim-memory-graph[openai]'
pip install 'claim-memory-graph[anthropic]'

The distribution is named claim-memory-graph, but you import it as cmg. The core package has no runtime dependencies.

Quickstart

Start with the local demo. It needs no API key.

python examples/local_judge_demo.py
cmg-view cmg-runs/*.cmg.jsonl --summary
cmg-view cmg-runs/*.cmg.jsonl --show-evidence
cmg-view cmg-runs/*.cmg.jsonl --flagged-only

The --summary view gives you the whole run at a glance.

Screenshot: cmg-view --summary terminal output with the owl mascot, verdict bars, hard and soft flag counts, criteria coverage, and top review cases.

Once that runs, wire CMG into your own judge. You keep the main task and the rubric. CMG only adds the audit layer.

import asyncio
from pathlib import Path

from cmg import ClaimGraph, JsonlStorage, arun_judge, judge_report


async def judge_fn(messages):
    # Placeholder: swap in your own async call to the judge model.
    return await call_your_judge_model(messages)


async def main():
    # "async with" is only valid inside a coroutine, so the run lives in main().
    async with ClaimGraph(JsonlStorage(Path("cmg-runs/case-1.cmg.jsonl"))) as graph:
        result = await arun_judge(
            graph,
            judge_fn,
            prompt="Question shown to the candidate model.",
            candidate_output="Candidate model answer.",
            reference_answer="Optional gold answer.",
            rubric="How the judge should decide.",
            criteria=("Correctness", "Completeness"),
            verdicts=("pass", "fail"),
        )

        # Build the report while the graph is still open.
        report = judge_report(graph)

    if result.decision is None:
        print("The judge returned a missing or invalid verdict.")
    else:
        print(result.decision.content)

    print(report["human_review_flags"])


asyncio.run(main())

What the judge must return

The judge's visible answer has to start with a verdict line.

VERDICT: pass

It should also add a hidden CMG block with its claims.

```cmg
{"ops": [{"op": "commitment", "content": "The answer matches the reference.", "refs": ["s-..."]}]}
```

CMG records the final Decision itself, so if the model sends a decision op, arun_judge ignores it. And if the model returns maybe when only pass and fail are allowed, CMG records no decision and the report marks the case for human review.
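The response format above can be illustrated with a small standalone parser. This is a sketch of the rules as described here, not the parser CMG ships: `parse_judge_response`, the regexes, and the evidence id `s-1` are hypothetical names chosen for the example.

```python
import json
import re

FENCE = "`" * 3  # built at runtime so the example can sit inside a fenced block
CMG_BLOCK = re.compile(
    re.escape(FENCE) + r"cmg\s*\n(.*?)\n" + re.escape(FENCE), re.DOTALL
)
VERDICT_LINE = re.compile(r"^VERDICT:\s*(\S+)\s*$")


def parse_judge_response(text, allowed=("pass", "fail")):
    """Return (verdict or None, list of ops) from one raw judge response."""
    lines = text.strip().splitlines()
    match = VERDICT_LINE.match(lines[0]) if lines else None
    verdict = match.group(1).lower() if match else None
    if verdict not in allowed:
        verdict = None  # e.g. "maybe": treated as no decision
    ops = []
    block = CMG_BLOCK.search(text)
    if block:
        try:
            payload = json.loads(block.group(1))
        except json.JSONDecodeError:
            payload = None  # malformed block: keep the verdict, drop the claims
        if isinstance(payload, dict):
            ops = payload.get("ops", [])
    # Mirror the rule above: a decision op sent by the model is ignored.
    ops = [op for op in ops if isinstance(op, dict) and op.get("op") != "decision"]
    return verdict, ops


claim = {"op": "commitment", "content": "The answer matches the reference.", "refs": ["s-1"]}
stray = {"op": "decision", "content": "pass"}
response = "\n".join(
    ["VERDICT: pass", "", FENCE + "cmg", json.dumps({"ops": [claim, stray]}), FENCE]
)
verdict, ops = parse_judge_response(response)
print(verdict, len(ops))  # pass 1
```

Calling `parse_judge_response("VERDICT: maybe")` with the default allowed list returns `(None, [])`, which corresponds to the case the report marks for human review.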

What you get

judge_report(graph) returns these fields.

verdict
claims
criteria
judge_responses
verdict_errors
retracted
human_review_flags
violations

Flags come in two kinds. Hard flags are real failures in the audit. Soft flags are gentler, just things to review. Here are the ones you will use most.

| Flag | Meaning |
| --- | --- |
| `missing_verdict` | The judge did not return a valid verdict line. |
| `invalid_verdict` | The verdict was not in the allowed list. |
| `uncited_verdict` | A verdict has no active cited claims. |
| `no_supported_claims` | No active claim has valid evidence. |
| `criterion_citation_gap` | A criterion was discussed or may be covered, but no active claim cited that exact criterion id. |
| `rubric_coverage_gap` | A criterion does not appear to be covered by any active claim text. |
| `reference_ignored` | A reference answer exists, but no active claim cites it. |
| `verdict_flip_without_invalidation` | A verdict changed without retracting old claims first. |
| `silent_commitment_drop` | A later decision dropped an active claim without a retraction. |
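As an illustration of the multi-turn flags, the sketch below replays an ordered event log and reports a verdict flip that happened while a claim behind the earlier verdict was still active. The event shape (`type`, `id`, `verdict`) and the function name are assumptions made for this example; CMG's own log format and check may differ.

```python
def flips_without_invalidation(events):
    """Return True if a verdict changed while an earlier backing claim stayed active."""
    active = set()    # claim ids that have not been retracted
    backing = set()   # claims that were active when the last verdict was recorded
    last_verdict = None
    for ev in events:
        kind = ev["type"]
        if kind == "claim":
            active.add(ev["id"])
        elif kind == "retract":
            active.discard(ev["id"])
        elif kind == "decision":
            changed = last_verdict is not None and ev["verdict"] != last_verdict
            # A flip is unsafe if anything that backed the old verdict still stands.
            if changed and backing & active:
                return True
            last_verdict = ev["verdict"]
            backing = set(active)
    return False


unsafe = [
    {"type": "claim", "id": "c1"},
    {"type": "decision", "verdict": "pass"},
    {"type": "claim", "id": "c2"},
    {"type": "decision", "verdict": "fail"},  # c1 was never retracted
]
safe = [
    {"type": "claim", "id": "c1"},
    {"type": "decision", "verdict": "pass"},
    {"type": "retract", "id": "c1"},
    {"type": "claim", "id": "c2"},
    {"type": "decision", "verdict": "fail"},
]
print(flips_without_invalidation(unsafe))  # True
print(flips_without_invalidation(safe))    # False
```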

Integrations

CMG does not replace your eval framework. It sits inside it. Keep using the framework for datasets, model calls, scores, and totals. Let CMG hold the per-case audit log. Each example below is a small adapter you can drop into one common setup.

DeepEval. Wrap arun_judge in a custom metric. examples/deepeval_metric.py subclasses BaseMetric, so each measure call writes a per-case .cmg.jsonl, turns the verdict into a score, and puts the CMG path and review flags in the metric's reason.
Inspect AI. Register a @scorer that runs the judge. examples/inspect_ai_scorer.py returns an Inspect Score and keeps the CMG graph path, review flags, and claims in the score metadata, so the audit data rides along with every sample.
OpenAI, or any provider. For a judge with no framework around it, examples/openai_judge_demo.py passes make_openai_llm_fn(...) straight in as the judge_fn. CMG does not care which provider sits behind it.

Use a fresh output file for each case run. Do not append many runs of the same case to one JSONL file.
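One way to follow this rule is to derive a unique file name for every run of a case. The helper below is a sketch: `fresh_run_path` and its naming scheme are hypothetical, and the `cmg-runs` directory is only reused from the examples above so that the `cmg-runs/*.cmg.jsonl` glob still matches.

```python
import uuid
from datetime import datetime, timezone
from pathlib import Path


def fresh_run_path(case_id, root="cmg-runs"):
    """Build a unique .cmg.jsonl path for one run of one case (no file is created)."""
    # UTC timestamp keeps runs sortable; the random suffix avoids same-second clashes.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    suffix = uuid.uuid4().hex[:8]
    return Path(root) / f"{case_id}-{stamp}-{suffix}.cmg.jsonl"


first = fresh_run_path("case-1")
second = fresh_run_path("case-1")
print(first != second)  # True
```

The returned path can then be handed to the storage object in place of a hard-coded file name, so that re-running a case never appends to an earlier log.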

Docs

| Topic | Link |
| --- | --- |
| User guide | docs/user-guide.md |
| Developer guide | docs/dev-guide.md |
| Release checklist | docs/release.md |

These docs, this README included, were drafted with AI and reviewed by hand.

Sources

License

Apache-2.0.

Project details

Release history

This version

0.1.1

Jun 13, 2026

0.1.0

May 27, 2026

Download files

Source Distribution

claim_memory_graph-0.1.1.tar.gz (3.9 MB)

Uploaded Jun 13, 2026 Source

Built Distribution

claim_memory_graph-0.1.1-py3-none-any.whl (32.6 kB)

Uploaded Jun 13, 2026 Python 3

File details

Details for the file claim_memory_graph-0.1.1.tar.gz.

File metadata

Download URL: claim_memory_graph-0.1.1.tar.gz
Upload date: Jun 13, 2026
Size: 3.9 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for claim_memory_graph-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`eaca26d5aa254217d99724437067ee387c90dda3566f3d9052e96fcf201fee04`
MD5	`5b90e28c366a15d5d372ea8a43655afb`
BLAKE2b-256	`bff2b1c388464e7cac9c335628989d830bb0d6d0f1b6bc1b5b5bb0bac6036f78`

Provenance

The following attestation bundles were made for claim_memory_graph-0.1.1.tar.gz:

Publisher: publish.yml on MatteoLeonesi/claim-memory-graph-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: claim_memory_graph-0.1.1.tar.gz
- Subject digest: eaca26d5aa254217d99724437067ee387c90dda3566f3d9052e96fcf201fee04
- Sigstore transparency entry: 1809335144
- Sigstore integration time: Jun 13, 2026
Source repository:
- Permalink: MatteoLeonesi/claim-memory-graph-sdk@84bd2d070995dcd3c5f884bf18d4b985e5d1a3e4
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/MatteoLeonesi
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@84bd2d070995dcd3c5f884bf18d4b985e5d1a3e4
- Trigger Event: release

File details

Details for the file claim_memory_graph-0.1.1-py3-none-any.whl.

File metadata

Download URL: claim_memory_graph-0.1.1-py3-none-any.whl
Upload date: Jun 13, 2026
Size: 32.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for claim_memory_graph-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7ef54c59c9e4eb1ac7eae456141fa040761a38f79714da119f384f3930339fb8`
MD5	`d2d00f6411f949b49b91eef8d620e191`
BLAKE2b-256	`0d7f5dffa587b3d0749a787393fafb5532d9de10c31d3b078d85f766f344cbf0`

Provenance

The following attestation bundles were made for claim_memory_graph-0.1.1-py3-none-any.whl:

Publisher: publish.yml on MatteoLeonesi/claim-memory-graph-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: claim_memory_graph-0.1.1-py3-none-any.whl
- Subject digest: 7ef54c59c9e4eb1ac7eae456141fa040761a38f79714da119f384f3930339fb8
- Sigstore transparency entry: 1809335163
- Sigstore integration time: Jun 13, 2026
Source repository:
- Permalink: MatteoLeonesi/claim-memory-graph-sdk@84bd2d070995dcd3c5f884bf18d4b985e5d1a3e4
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/MatteoLeonesi
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@84bd2d070995dcd3c5f884bf18d4b985e5d1a3e4
- Trigger Event: release
