Transparent multimodal reasoning metrics from the CRYSTAL benchmark (Match F1, Ordered Match F1, accuracy).

crystal-metrics

Your model gets the right answer. But does it actually reason? Standard benchmarks only check the final answer, so a lucky guess scores the same as sound reasoning. crystal-metrics scores the reasoning chain itself (step-level precision and recall, plus step ordering) alongside final-answer accuracy.

What it measures

Metric            Measures
Match F1          Step-level F1 of predicted vs. reference reasoning steps, matched by semantic similarity
Precision         Fraction of predicted steps that match a reference step (penalizes spurious or incorrect steps)
Recall            Fraction of reference steps covered by a predicted step (completeness; the hard part)
Ordered Match F1  Match F1 penalized for out-of-order reasoning (Kendall's τ or LIS ratio)
Accuracy          Final-answer correctness across formats (yes/no, numeric, multiple choice, free text)
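
To make the step-level definitions above concrete, here is a minimal sketch that derives precision, recall, and F1 from a precomputed similarity matrix. It is not the package's implementation: the function name `match_scores`, the greedy one-to-one matching rule, and the similarity values are all assumptions made for illustration. Only the 0.35 threshold comes from the defaults listed on this page.

```python
from typing import Sequence, Tuple


def match_scores(sim: Sequence[Sequence[float]], threshold: float = 0.35) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) from a predicted-by-reference similarity matrix.

    sim[i][j] is the similarity between predicted step i and reference step j.
    Greedy one-to-one matching is an assumption of this sketch.
    """
    n_pred = len(sim)
    n_ref = len(sim[0]) if n_pred else 0
    if n_pred == 0 or n_ref == 0:
        return 0.0, 0.0, 0.0

    # Candidate pairs at or above the threshold, highest similarity first.
    pairs = sorted(
        ((sim[i][j], i, j) for i in range(n_pred) for j in range(n_ref) if sim[i][j] >= threshold),
        reverse=True,
    )
    used_pred, used_ref = set(), set()
    for _, i, j in pairs:
        # Each predicted step and each reference step may be matched at most once.
        if i not in used_pred and j not in used_ref:
            used_pred.add(i)
            used_ref.add(j)

    matched = len(used_pred)
    precision = matched / n_pred  # share of predicted steps that found a reference
    recall = matched / n_ref      # share of reference steps that were covered
    f1 = 0.0 if matched == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up similarities for 3 predicted steps against 4 reference steps.
sim = [
    [0.90, 0.10, 0.20, 0.00],
    [0.10, 0.20, 0.80, 0.10],
    [0.00, 0.10, 0.10, 0.70],
]
precision, recall, f1 = match_scores(sim)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy matrix every predicted step finds a reference step, but one reference step (the second) is never covered, so precision is perfect while recall is not.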

Install

pip install crystal-metrics            # core metrics (no LLM required)
pip install "crystal-metrics[judge]"   # + optional LLM judge for free-form answers (quotes keep shells like zsh from expanding the brackets)

Requires Python 3.8+. The default embedding model all-distilroberta-v1 is downloaded and cached on first use.

Quickstart

from crystal_metrics import MLLMReasoningEvaluator

evaluator = MLLMReasoningEvaluator()  # all-distilroberta-v1, threshold τ=0.35 (paper defaults)

m = evaluator.evaluate_single(
    predicted_steps=[
        "Three objects sit on the table",
        "The middle console is the smallest",
        "Therefore the answer is C",
    ],
    reference_steps=[
        "There are three objects in the image",
        "Compare the sizes of the three objects",
        "The middle object is smallest",
        "Select option C",
    ],
    alpha=0.3,  # enable Ordered Match F1 (0 = order-insensitive)
)

print(f"Match F1:         {m.match_f1:.3f}")
print(f"Precision:        {m.precision:.3f}")
print(f"Recall:           {m.recall:.3f}")
print(f"Ordered Match F1: {m.ordered_match_f1:.3f}")
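
The `alpha` argument above controls how strongly out-of-order reasoning is penalized. The sketch below shows one way an LIS-ratio ordering score could be computed and folded into Match F1. Both helper names (`lis_ratio`, `ordered_f1`) and the penalty formula are hypothetical; this page does not state the exact formula. The formula was chosen only because it reproduces the documented behavior that `alpha=0` is order-insensitive.

```python
from bisect import bisect_left
from typing import List, Sequence


def lis_ratio(ref_positions: Sequence[int]) -> float:
    """Fraction of matched steps that already appear in reference order.

    ref_positions[k] is the reference index matched by the k-th matched
    predicted step, listed in predicted order.
    """
    if not ref_positions:
        return 1.0  # nothing matched, so nothing is out of order (sketch convention)
    tails: List[int] = []  # tails[k] = smallest tail of an increasing subsequence of length k + 1
    for pos in ref_positions:
        k = bisect_left(tails, pos)
        if k == len(tails):
            tails.append(pos)
        else:
            tails[k] = pos
    return len(tails) / len(ref_positions)


def ordered_f1(match_f1: float, order_score: float, alpha: float) -> float:
    """Hypothetical penalty: alpha=0 returns match_f1 unchanged."""
    return match_f1 * (1.0 - alpha * (1.0 - order_score))


in_order = lis_ratio([0, 2, 3])   # predicted steps follow the reference order
shuffled = lis_ratio([2, 0, 3])   # first two matched steps are swapped
print(in_order, shuffled)
print(ordered_f1(0.8, shuffled, alpha=0.3))
```

A Kendall's τ variant would replace `lis_ratio` with a pairwise concordance count; the LIS version is shown here because it needs only the standard library.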

Answer accuracy

from crystal_metrics import AccuracyCalculator

calc = AccuracyCalculator(use_llm_grader=False)          # rule-based, no LLM
acc = calc.evaluate_dataset(predictions, references)
print(acc["overall_accuracy"], acc["type_statistics"])

The optional LLM judge (for free-form text) needs the [judge] extra and any OpenAI-compatible endpoint (e.g. a local Ollama server):

calc = AccuracyCalculator(use_llm_grader=True, llm_model="gpt-oss:120b",
                          base_url="http://localhost:11434/v1")

Command line

crystal-metrics evaluate predictions.json references.json --alpha 0.3
=== CRYSTAL metrics ===
  samples           : 3
  match_f1          : 0.5524
  precision         : 0.6667
  recall            : 0.4722
  ordered_match_f1  : 0.4952
  accuracy          : 0.6667

Data format

// predictions
{"<id>": {"question": "...", "reasoning_steps": ["..."], "answer": "..."}}
// references
{"<id>": {"reference_steps": ["..."], "answer": "..."}}
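
The `//` lines above are labels for the two files, not part of the JSON itself, since JSON has no comment syntax. As a sketch, the two input files could be written from Python as follows. The sample id, question, and step text are invented for illustration, and the files go to a temporary directory.

```python
import json
import tempfile
from pathlib import Path

# One entry per sample id; "sample-001" and its contents are made up.
predictions = {
    "sample-001": {
        "question": "Which object is the smallest?",
        "reasoning_steps": [
            "Three objects sit on the table",
            "The middle object is the smallest",
        ],
        "answer": "C",
    }
}
references = {
    "sample-001": {
        "reference_steps": [
            "There are three objects in the image",
            "The middle object is smallest",
        ],
        "answer": "C",
    }
}

out_dir = Path(tempfile.mkdtemp())
pred_path = out_dir / "predictions.json"
ref_path = out_dir / "references.json"
pred_path.write_text(json.dumps(predictions, indent=2), encoding="utf-8")
ref_path.write_text(json.dumps(references, indent=2), encoding="utf-8")

# Sanity check before evaluating: every reference id should have a prediction.
missing = set(references) - set(predictions)
print(f"wrote {pred_path.name} and {ref_path.name}; missing ids: {sorted(missing)}")
```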

Paper defaults

Setting                 Value                        Source
Embedding model         all-distilroberta-v1         Paper §4.3
Similarity threshold τ  0.35                         Ablation-validated
Recommended alpha       0.3                          Paper
Numeric tolerance       ε_abs = 0.05, ε_rel = 0.10   Paper Eq. (2)
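
For intuition about the numeric tolerance row, here is a minimal sketch of a tolerance check using those two values. The paper's Eq. (2) is not reproduced on this page, so accepting an answer when either bound holds is an assumption of this sketch, and `numeric_match` is a hypothetical helper rather than part of the package.

```python
def numeric_match(pred: float, ref: float, eps_abs: float = 0.05, eps_rel: float = 0.10) -> bool:
    """Return True if pred is close enough to ref.

    Assumption: the answer passes if it is within the absolute tolerance
    OR within the relative tolerance of the reference value.
    """
    diff = abs(pred - ref)
    return diff <= eps_abs or diff <= eps_rel * abs(ref)


print(numeric_match(10.4, 10.0))  # within 10% of the reference
print(numeric_match(0.03, 0.0))   # within the absolute tolerance
print(numeric_match(12.0, 10.0))  # outside both tolerances
```

The absolute bound matters when the reference is at or near zero, where a purely relative check would reject every nonzero prediction.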

Documentation

Benchmark: 🤗 waybarrios/CRYSTAL · Project: github.com/waybarrios/crystal

Citation

@misc{barrios2026crystal,
  title   = {Beyond Final Answers: CRYSTAL Benchmark for Transparent
             Multimodal Reasoning Evaluation},
  author  = {Wayner Barrios and SouYoung Jin},
  year    = {2026},
  eprint  = {2603.13099},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url     = {https://arxiv.org/abs/2603.13099}
}

License

MIT — see the project repository.
