Transparent multimodal reasoning metrics from the CRYSTAL benchmark (Match F1, Ordered Match F1, accuracy).

crystal-metrics

Transparent multimodal reasoning metrics from the CRYSTAL benchmark.

Your model gets the right answer. But does it actually reason? Standard benchmarks only check the final answer, so a lucky guess scores the same as sound reasoning. crystal-metrics scores the reasoning chain itself — step-level precision/recall, ordering, and answer accuracy.

What it measures

| Metric | Measures |
|---|---|
| Match F1 | Step-level F1 of predicted vs. reference reasoning steps via semantic-similarity matching |
| Precision | Fraction of predicted steps that match a reference step (few wrong things) |
| Recall | Fraction of reference steps that were covered (completeness — the hard part) |
| Ordered Match F1 | Match F1 penalized for out-of-order reasoning (Kendall's τ or LIS ratio) |
| Accuracy | Multi-format final-answer correctness (yes/no, numeric, multiple choice, free text) |
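
To make the precision/recall definitions above concrete, here is a minimal sketch that computes them from a precomputed similarity matrix. It is an illustration and not the package's implementation: the function name `match_f1`, the greedy one-to-one matching rule, and the example similarity values are all assumptions. Only the 0.35 threshold comes from the paper defaults listed on this page.

```python
from typing import List, Tuple


def match_f1(sim: List[List[float]], threshold: float = 0.35) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) from a predicted-by-reference similarity matrix.

    sim[i][j] is the similarity between predicted step i and reference step j.
    Assumption: greedy one-to-one matching, highest-similarity pairs first.
    """
    n_pred = len(sim)
    n_ref = len(sim[0]) if sim else 0
    # Candidate pairs at or above the threshold, best first.
    pairs = sorted(
        (
            (sim[i][j], i, j)
            for i in range(n_pred)
            for j in range(n_ref)
            if sim[i][j] >= threshold
        ),
        reverse=True,
    )
    used_pred, used_ref = set(), set()
    for _, i, j in pairs:
        if i not in used_pred and j not in used_ref:
            used_pred.add(i)
            used_ref.add(j)
    matched = len(used_pred)
    precision = matched / n_pred if n_pred else 0.0
    recall = matched / n_ref if n_ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical similarities for 3 predicted steps vs. 4 reference steps.
sim = [
    [0.80, 0.20, 0.10, 0.05],
    [0.10, 0.30, 0.70, 0.10],
    [0.05, 0.10, 0.20, 0.60],
]
precision, recall, f1 = match_f1(sim)
print(precision, recall, round(f1, 3))
```

In this toy case every predicted step finds a partner, so precision is perfect, while one reference step stays uncovered and pulls recall down. That asymmetry is what the table means by recall being the hard part.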

It also ships the RL reward functions used to train models on CRYSTAL — Causal Process Reward (CPR) and Semantic Process Reward (SPR) — in crystal_metrics.rewards (pure Python, model-agnostic). See the rewards docs.

Install

pip install crystal-metrics          # core metrics (no LLM required)
pip install crystal-metrics[judge]   # + optional LLM judge for free-form answers

Requires Python 3.8+. The default embedding model all-distilroberta-v1 is downloaded and cached on first use.

Quickstart

from crystal_metrics import MLLMReasoningEvaluator

evaluator = MLLMReasoningEvaluator()  # all-distilroberta-v1, threshold τ=0.35 (paper defaults)

m = evaluator.evaluate_single(
    predicted_steps=[
        "Three objects sit on the table",
        "The middle console is the smallest",
        "Therefore the answer is C",
    ],
    reference_steps=[
        "There are three objects in the image",
        "Compare the sizes of the three objects",
        "The middle object is smallest",
        "Select option C",
    ],
    alpha=0.3,  # enable Ordered Match F1 (0 = order-insensitive)
)

print(f"Match F1:         {m.match_f1:.3f}")
print(f"Precision:        {m.precision:.3f}")
print(f"Recall:           {m.recall:.3f}")
print(f"Ordered Match F1: {m.ordered_match_f1:.3f}")
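
The `alpha` parameter above controls how strongly out-of-order reasoning is penalized. The sketch below shows one way an LIS-ratio ordering score could be folded into Match F1. Both `lis_ratio` and `ordered_match_f1` are hypothetical helpers written for this illustration, and the blending formula is an assumption chosen only so that `alpha = 0` reproduces plain Match F1. The formula the package actually uses is defined in the paper.

```python
from bisect import bisect_left
from typing import List


def lis_ratio(ref_indices: List[int]) -> float:
    """Longest strictly increasing subsequence length divided by list length.

    ref_indices holds the matched reference-step index for each matched
    predicted step, listed in predicted order.
    """
    if not ref_indices:
        return 1.0  # assumption: nothing matched, so nothing is out of order
    tails: List[int] = []
    for x in ref_indices:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(tails) / len(ref_indices)


def ordered_match_f1(f1: float, order_score: float, alpha: float) -> float:
    # Assumed blend: alpha = 0 leaves F1 unchanged; larger alpha penalizes disorder more.
    return f1 * (1.0 - alpha * (1.0 - order_score))


order = lis_ratio([0, 2, 1, 3])  # steps 2 and 1 are swapped
print(order, ordered_match_f1(0.8, order, alpha=0.3))
```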

Answer accuracy

from crystal_metrics import AccuracyCalculator

calc = AccuracyCalculator(use_llm_grader=False)          # rule-based, no LLM
acc = calc.evaluate_dataset(predictions, references)
print(acc["overall_accuracy"], acc["type_statistics"])

The optional LLM judge (for free-form text) needs the [judge] extra and any OpenAI-compatible endpoint (e.g. a local Ollama server):

calc = AccuracyCalculator(use_llm_grader=True, llm_model="gpt-oss:120b",
                          base_url="http://localhost:11434/v1")

Command line

crystal-metrics evaluate predictions.json references.json --alpha 0.3
=== CRYSTAL metrics ===
  samples           : 3
  match_f1          : 0.5524
  precision         : 0.6667
  recall            : 0.4722
  ordered_match_f1  : 0.4952
  accuracy          : 0.6667

Data format

// predictions
{"<id>": {"question": "...", "reasoning_steps": ["..."], "answer": "..."}}
// references
{"<id>": {"reference_steps": ["..."], "answer": "..."}}
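
As a rough illustration of the layout above, the following sketch writes a one-sample `predictions.json` and `references.json` using only the standard library. The id `q1` and the sample text are made up for the example. The `//` lines in the listing above are labels and must not appear in the files, since JSON has no comment syntax.

```python
import json
import tempfile
from pathlib import Path

# One hypothetical sample, keyed by an arbitrary id shared between both files.
predictions = {
    "q1": {
        "question": "Which object is smallest?",
        "reasoning_steps": [
            "Three objects sit on the table",
            "The middle one is the smallest",
        ],
        "answer": "C",
    }
}
references = {
    "q1": {
        "reference_steps": [
            "There are three objects in the image",
            "The middle object is smallest",
        ],
        "answer": "C",
    }
}

out_dir = Path(tempfile.mkdtemp())
pred_path = out_dir / "predictions.json"
ref_path = out_dir / "references.json"
pred_path.write_text(json.dumps(predictions, indent=2), encoding="utf-8")
ref_path.write_text(json.dumps(references, indent=2), encoding="utf-8")

# Sanity check: every prediction id should have a reference.
missing = sorted(set(predictions) - set(references))
print("ids without a reference:", missing)
```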

Paper defaults

| Setting | Value | Source |
|---|---|---|
| Embedding model | all-distilroberta-v1 | Paper §4.3 |
| Similarity threshold τ | 0.35 | Ablation-validated |
| Recommended alpha | 0.3 | Paper |
| Numeric tolerance | ε_abs = 0.05, ε_rel = 0.10 | Paper Eq. (2) |
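
To show how the two numeric tolerances might interact, here is a small sketch of a tolerance check. The helper `numeric_match` is hypothetical, and combining the two tolerances with "or" is an assumption. The exact rule is the paper's Eq. (2), which is not reproduced on this page.

```python
def numeric_match(pred: float, ref: float, eps_abs: float = 0.05, eps_rel: float = 0.10) -> bool:
    # Assumption: an answer is accepted if EITHER tolerance is met.
    diff = abs(pred - ref)
    return diff <= eps_abs or diff <= eps_rel * abs(ref)


print(numeric_match(3.14, 3.1416))  # small absolute difference
print(numeric_match(108.0, 100.0))  # within 10% of the reference
print(numeric_match(0.5, 0.2))      # outside both tolerances
```

Under this reading, the absolute tolerance matters for references near zero, where a relative tolerance alone would reject almost everything.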

Documentation

Benchmark: 🤗 waybarrios/CRYSTAL · Project: github.com/waybarrios/crystal

Citation

@misc{barrios2026crystal,
  title   = {Beyond Final Answers: CRYSTAL Benchmark for Transparent
             Multimodal Reasoning Evaluation},
  author  = {Wayner Barrios and SouYoung Jin},
  year    = {2026},
  eprint  = {2603.13099},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url     = {https://arxiv.org/abs/2603.13099}
}

License

MIT — see the project repository.
