Transparent multimodal reasoning metrics from the CRYSTAL benchmark (Match F1, Ordered Match F1, accuracy).
crystal-metrics
Your model gets the right answer. But does it actually reason? Standard
benchmarks only check the final answer, so a lucky guess scores the same as
sound reasoning. crystal-metrics scores the reasoning chain itself —
step-level precision/recall, ordering, and answer accuracy.
What it measures
| Metric | Measures |
|---|---|
| Match F1 | Step-level F1 of predicted vs. reference reasoning steps via semantic-similarity matching |
| Precision | Fraction of predicted steps that match a reference step (high when few predicted steps are spurious or wrong) |
| Recall | Fraction of reference steps that were covered (completeness — the hard part) |
| Ordered Match F1 | Match F1 penalized for out-of-order reasoning (Kendall's τ or LIS ratio) |
| Accuracy | Multi-format final-answer correctness (yes/no, numeric, multiple choice, free text) |
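To make the step-level rows of the table concrete, here is a minimal sketch of how precision, recall, and Match F1 follow from a similarity matrix. The greedy one-to-one matching, the helper names `match_steps` and `precision_recall_f1`, and the hand-written similarity values are all assumptions for illustration; the matching strategy the library actually uses is not described here.

```python
from typing import List, Tuple


def match_steps(sim: List[List[float]], threshold: float = 0.35) -> List[Tuple[int, int]]:
    """Greedy one-to-one matching (an assumed strategy, not necessarily the library's).

    sim[i][j] is the similarity between predicted step i and reference step j.
    Pairs are taken in order of decreasing similarity; each step is used at most once.
    """
    candidates = sorted(
        ((s, i, j) for i, row in enumerate(sim) for j, s in enumerate(row) if s >= threshold),
        reverse=True,
    )
    used_pred, used_ref, pairs = set(), set(), []
    for _, i, j in candidates:
        if i not in used_pred and j not in used_ref:
            used_pred.add(i)
            used_ref.add(j)
            pairs.append((i, j))
    return sorted(pairs)


def precision_recall_f1(n_matched: int, n_pred: int, n_ref: int) -> Tuple[float, float, float]:
    """Step-level precision, recall, and their harmonic mean."""
    precision = n_matched / n_pred if n_pred else 0.0
    recall = n_matched / n_ref if n_ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative similarities for 3 predicted steps x 4 reference steps (made-up values).
sim = [
    [0.82, 0.30, 0.25, 0.10],
    [0.20, 0.41, 0.77, 0.12],
    [0.05, 0.10, 0.15, 0.64],
]
pairs = match_steps(sim)
precision, recall, f1 = precision_recall_f1(len(pairs), len(sim), len(sim[0]))
print(pairs, precision, recall, round(f1, 3))
```

In this toy case every predicted step finds a partner, so precision is perfect, while one reference step stays uncovered and pulls recall down. crystal-metrics itself derives the similarities from sentence embeddings rather than hand-written numbers.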
It also ships the RL reward functions used to train models on CRYSTAL —
Causal Process Reward (CPR) and Semantic Process Reward (SPR) — in
crystal_metrics.rewards (pure Python, model-agnostic). See the
rewards docs.
Install
pip install crystal-metrics            # core metrics (no LLM required)
pip install "crystal-metrics[judge]"   # + optional LLM judge for free-form answers (quoted so shells like zsh do not expand the brackets)
Requires Python 3.8+. The default embedding model all-distilroberta-v1 is
downloaded and cached on first use.
Quickstart
from crystal_metrics import MLLMReasoningEvaluator
evaluator = MLLMReasoningEvaluator() # all-distilroberta-v1, threshold τ=0.35 (paper defaults)
m = evaluator.evaluate_single(
    predicted_steps=[
        "Three objects sit on the table",
        "The middle console is the smallest",
        "Therefore the answer is C",
    ],
    reference_steps=[
        "There are three objects in the image",
        "Compare the sizes of the three objects",
        "The middle object is smallest",
        "Select option C",
    ],
    alpha=0.3,  # enable Ordered Match F1 (0 = order-insensitive)
)
print(f"Match F1: {m.match_f1:.3f}")
print(f"Precision: {m.precision:.3f}")
print(f"Recall: {m.recall:.3f}")
print(f"Ordered Match F1: {m.ordered_match_f1:.3f}")
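The ordering penalty behind Ordered Match F1 can be illustrated with the two order scores named in the metrics table, Kendall's τ and the LIS ratio. The sketch below is not the library's code: `lis_ratio`, `kendall_tau`, and especially the blend in `ordered_f1` are hypothetical, chosen only so that `alpha=0` reduces to plain Match F1, as the comment on `alpha` above suggests.

```python
from bisect import bisect_left
from itertools import combinations
from typing import List


def lis_ratio(ref_positions: List[int]) -> float:
    """Longest strictly increasing subsequence length divided by sequence length."""
    if not ref_positions:
        return 1.0
    tails: List[int] = []  # tails[k] = smallest tail of an increasing run of length k + 1
    for x in ref_positions:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(tails) / len(ref_positions)


def kendall_tau(ref_positions: List[int]) -> float:
    """Kendall's tau-a for distinct values: (concordant - discordant) / total pairs."""
    n = len(ref_positions)
    if n < 2:
        return 1.0
    concordant = discordant = 0
    for a, b in combinations(ref_positions, 2):  # pairs keep their sequence order
        if a < b:
            concordant += 1
        elif a > b:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def ordered_f1(match_f1: float, order_score: float, alpha: float) -> float:
    """HYPOTHETICAL blend: shrink F1 in proportion to alpha and the order violation."""
    return match_f1 * (1.0 - alpha * (1.0 - order_score))


# Reference-step indices in the order the matched predicted steps appear (made-up).
matched_ref_order = [0, 2, 1, 3]
print(lis_ratio(matched_ref_order))
print(round(kendall_tau(matched_ref_order), 3))
print(round(ordered_f1(0.8, lis_ratio(matched_ref_order), alpha=0.3), 3))
```

Only the LIS ratio is fed into the blend here because it already lies in [0, 1]; Kendall's τ ranges over [-1, 1] and would need rescaling first.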
Answer accuracy
from crystal_metrics import AccuracyCalculator
calc = AccuracyCalculator(use_llm_grader=False) # rule-based, no LLM
acc = calc.evaluate_dataset(predictions, references)
print(acc["overall_accuracy"], acc["type_statistics"])
The optional LLM judge (for free-form text) needs the [judge] extra and any
OpenAI-compatible endpoint (e.g. a local Ollama server):
calc = AccuracyCalculator(use_llm_grader=True, llm_model="gpt-oss:120b",
                          base_url="http://localhost:11434/v1")
Command line
crystal-metrics evaluate predictions.json references.json --alpha 0.3
=== CRYSTAL metrics ===
samples : 3
match_f1 : 0.5524
precision : 0.6667
recall : 0.4722
ordered_match_f1 : 0.4952
accuracy : 0.6667
Data format
// predictions
{"<id>": {"question": "...", "reasoning_steps": ["..."], "answer": "..."}}
// references
{"<id>": {"reference_steps": ["..."], "answer": "..."}}
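As a sketch of producing the two input files in the shapes shown above, the snippet below writes a single sample with the standard `json` module. The id `q1`, the question text, and the output directory are placeholders invented for the example; only the key names come from the format above.

```python
import json
from pathlib import Path

# One sample keyed by a placeholder id; the keys follow the data format above.
predictions = {
    "q1": {
        "question": "Which object is the smallest?",
        "reasoning_steps": [
            "Three objects sit on the table",
            "The middle object is the smallest",
        ],
        "answer": "C",
    }
}
references = {
    "q1": {
        "reference_steps": [
            "There are three objects in the image",
            "The middle object is smallest",
        ],
        "answer": "C",
    }
}

out_dir = Path(".")  # placeholder: write next to where the CLI will be run
(out_dir / "predictions.json").write_text(json.dumps(predictions, indent=2), encoding="utf-8")
(out_dir / "references.json").write_text(json.dumps(references, indent=2), encoding="utf-8")

# Sanity check: every reference id should have a prediction before evaluating.
missing = sorted(set(references) - set(predictions))
print("ids without a prediction:", missing)
```

The `// predictions` and `// references` lines above are annotations only; standard JSON has no comment syntax, so the files themselves should contain just the objects.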
Paper defaults
| Setting | Value | Source |
|---|---|---|
| Embedding model | all-distilroberta-v1 | Paper §4.3 |
| Similarity threshold τ | 0.35 | Ablation-validated |
| Recommended alpha | 0.3 | Paper |
| Numeric tolerance | ε_abs = 0.05, ε_rel = 0.10 | Paper Eq. (2) |
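For intuition about the numeric tolerance row, here is a minimal sketch of a tolerance check using the two default values. Paper Eq. (2) is not reproduced in this README, so the way the tolerances combine (a prediction passes if it is within either one) is an assumption, and `numeric_match` is a hypothetical helper rather than part of the package.

```python
def numeric_match(pred: float, ref: float, eps_abs: float = 0.05, eps_rel: float = 0.10) -> bool:
    """Assumed rule: accept if within the absolute OR the relative tolerance."""
    diff = abs(pred - ref)
    return diff <= eps_abs or diff <= eps_rel * abs(ref)


print(numeric_match(3.14, 3.1))     # small absolute error
print(numeric_match(108.0, 100.0))  # 8% relative error
print(numeric_match(120.0, 100.0))  # 20% relative error
print(numeric_match(0.2, 0.0))      # relative tolerance is zero when ref is 0
```

Under this assumed rule the absolute tolerance matters mostly for references near zero, where a relative bound alone would accept almost nothing.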
Documentation
Benchmark: 🤗 waybarrios/CRYSTAL · Project: github.com/waybarrios/crystal
Citation
@misc{barrios2026crystal,
title = {Beyond Final Answers: CRYSTAL Benchmark for Transparent
Multimodal Reasoning Evaluation},
author = {Wayner Barrios and SouYoung Jin},
year = {2026},
eprint = {2603.13099},
archivePrefix = {arXiv},
primaryClass = {cs.AI},
url = {https://arxiv.org/abs/2603.13099}
}
License
MIT — see the project repository.