A multi-dimensional behavioral framework for evaluating LLM reasoning quality beyond accuracy

LLM Reasoning Quality Evaluation Framework

A config-driven, multi-dimensional framework for evaluating reasoning quality in Large Language Models — beyond simple answer correctness.

6 metrics · 9 models (5 API + 4 local) · 4 benchmark datasets · no code changes needed to add models or datasets


Overview

Standard LLM evaluation asks: "Is the answer correct?"

This framework asks: "How well does the model reason?"

It evaluates models across 6 complementary dimensions of reasoning quality, producing a composite score that captures correctness, behavioral stability, robustness, logical integrity, and efficiency simultaneously.

Q = f(CQ, CS, RS, LS, ES, SS)

The framework is fully config-driven — models, datasets, metrics, and aggregation weights are all controlled from a single YAML file. No code changes are needed for common use cases.


Metrics

| Symbol | Name | Formula | What It Measures |
| --- | --- | --- | --- |
| CQ | Correctness | (1/N) Σ I(ŷᵢ = yᵢ) | Fraction of correct answers |
| CS | Consistency | (2/K(K−1)) Σ I(ŷᵢ⁽ᵏ⁾ = ŷᵢ⁽ˡ⁾) | Same answer across K repeated runs (pairwise)? |
| RS | Robustness | (1/N) Σ (1/P) Σ I(ŷᵢ = ŷᵢᵖ) · I(yᵢ = ŷᵢ) | Same answer on semantically equivalent rephrases? |
| LS | Logical Coherence | 1 − (1/N) Σ (1/(nᵢ−1)) Σ ψ(sⱼ, sⱼ₊₁) | No contradictions between consecutive reasoning steps? |
| ES | Efficiency | Harmonic mean of CQ and inverse normalized token count | Correct and concise? |
| SS | Stability | (2/K(K−1)) Σ BERTScore(Tᵢ⁽ᵏ⁾, Tᵢ⁽ˡ⁾) | Same reasoning process across K runs? |

Key design decisions

CQ — Multi-strategy matching pipeline: Raw model outputs are often verbose (e.g. "John has 8 apples." instead of "8"). The correctness metric applies 7 sequential matching strategies before marking an answer wrong: exact match → normalized → number extraction → yes/no extraction → A/B/C/D extraction → substring match → numeric tolerance. This prevents local models from being penalized purely for output format.
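The sequence of strategies above could be sketched roughly as follows. This is an illustration only: the helper names, the normalization rules, and the default tolerance are assumptions, and the framework's own pipeline in metrics/accuracy.py may differ in every detail.

```python
import re

def to_float(text: str):
    """Parse a string as a float, ignoring thousands separators; None if not numeric."""
    try:
        return float(text.replace(",", ""))
    except ValueError:
        return None

def last_number(text: str):
    """Return the last number mentioned in the text, or None."""
    found = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(found[-1]) if found else None

def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces (keeping '.' and '-'), collapse whitespace."""
    cleaned = re.sub(r"[^\w\s.\-]", " ", text.lower())
    return " ".join(cleaned.split()).strip(". ")

def is_correct(prediction: str, gold: str, tol: float = 1e-6) -> bool:
    """Try each strategy in order; the first one that matches accepts the answer."""
    pred, ref = prediction.strip(), gold.strip()
    n_pred, n_ref = normalize(pred), normalize(ref)
    gold_num, pred_num = to_float(ref), last_number(pred)

    if pred == ref:                                        # 1. exact match
        return True
    if n_pred == n_ref:                                    # 2. normalized match
        return True
    if gold_num is not None and pred_num == gold_num:      # 3. number extraction
        return True
    yes_no = re.search(r"\b(yes|no)\b", n_pred)
    if n_ref in {"yes", "no"} and yes_no and yes_no.group(1) == n_ref:     # 4. yes/no
        return True
    choice = re.search(r"\b([ABCD])\b", pred)
    if ref in {"A", "B", "C", "D"} and choice and choice.group(1) == ref:  # 5. A/B/C/D
        return True
    # 6. substring match on whole words, only for non-numeric gold answers
    if gold_num is None and n_ref and re.search(rf"(?<!\w){re.escape(n_ref)}(?!\w)", n_pred):
        return True
    if gold_num is not None and pred_num is not None:      # 7. numeric tolerance
        return abs(pred_num - gold_num) <= tol * max(1.0, abs(gold_num))
    return False

print(is_correct("John has 8 apples.", "8"))
print(is_correct("The answer is B.", "B"))
```

Ordering the strategies from strict to lenient means a clean answer is accepted early, and the looser heuristics only run for verbose outputs.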

RS — Conditioned on correctness: Robustness is only counted for questions the model originally answered correctly. Without this condition, a model that repeats the same wrong answer on every rephrasing would trivially score RS=1.0.
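A small sketch of the RS formula from the metrics table, with the correctness indicator acting as a gate. The record layout (gold, original, paraphrased) and the plain string comparison are assumptions made for illustration.

```python
def robustness(records):
    """records: dicts with 'gold', 'original', and 'paraphrased' (answers to the P rephrasings)."""
    if not records:
        return 0.0
    total = 0.0
    for r in records:
        originally_correct = r["original"] == r["gold"]   # I(y_i = ŷ_i)
        answers = r["paraphrased"]
        if originally_correct and answers:
            # Fraction of rephrasings that keep the original answer: (1/P) Σ I(ŷ_i = ŷ_i^p)
            total += sum(a == r["original"] for a in answers) / len(answers)
    return total / len(records)                           # average over all N items

records = [
    {"gold": "8", "original": "8", "paraphrased": ["8", "8", "7"]},        # correct, 2 of 3 stable
    {"gold": "yes", "original": "no", "paraphrased": ["no", "no", "no"]},  # stable but wrong
]
print(robustness(records))
```

The second record is perfectly stable across rephrasings but contributes nothing, because the original answer was wrong.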

LS — NLI-based contradiction detection: Uses cross-encoder/nli-deberta-v3-small to detect contradictions between consecutive reasoning steps. Falls back gracefully to LS=1.0 if the NLI model is unavailable.
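To make the LS formula concrete, here is a sketch in which the contradiction detector ψ is passed in as a function. The toy detector below is a placeholder for the NLI model, and treating traces with fewer than two steps as contradiction-free is an assumption.

```python
def logical_coherence(traces, contradicts):
    """traces: list of reasoning-step lists; contradicts(a, b) is True when ψ(a, b) = 1."""
    if not traces:
        return 1.0
    penalty = 0.0
    for steps in traces:
        pairs = list(zip(steps, steps[1:]))      # consecutive steps (s_j, s_j+1)
        if pairs:                                # assumption: short traces add no penalty
            penalty += sum(contradicts(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - penalty / len(traces)

def toy_contradicts(a, b):
    """Placeholder detector: flags a step that literally negates its neighbor."""
    return b == f"not {a}" or a == f"not {b}"

traces = [["x > 0", "not x > 0", "done"], ["a", "b"]]
print(logical_coherence(traces, toy_contradicts))
```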

ES — Harmonic mean: The harmonic mean is dominated by the smaller of its two inputs, so neither a short-but-wrong nor a long-but-correct response scores well. Both correctness and conciseness must be high for ES to be high.
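The effect of the harmonic mean is easy to see numerically. In the sketch below, the conciseness normalization (one minus the share of a token budget used) is a hypothetical choice; the source only specifies an "inverse normalized token count".

```python
def conciseness(mean_tokens: float, token_budget: float) -> float:
    """Hypothetical normalization: 1.0 for empty output, 0.0 at or above the budget."""
    return max(0.0, 1.0 - mean_tokens / token_budget)

def efficiency(cq: float, concise_score: float) -> float:
    """Harmonic mean of correctness and conciseness; 0 if either input is 0."""
    if cq + concise_score == 0:
        return 0.0
    return 2 * cq * concise_score / (cq + concise_score)

print(efficiency(0.9, 0.9))   # both high
print(efficiency(0.9, 0.1))   # correct but verbose
print(efficiency(0.0, 1.0))   # concise but wrong
```

An arithmetic mean would give the second case 0.5; the harmonic mean pulls it much closer to the weaker component.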

SS — BERTScore similarity: Measures semantic similarity between reasoning traces across runs, not just whether the final answer matches. Falls back to Jaccard similarity if bert-score is not installed.
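For reference, a Jaccard similarity between two reasoning traces can be computed as below. Lowercased whitespace tokenization is an assumption; the framework's fallback may tokenize differently.

```python
def jaccard(trace_a: str, trace_b: str) -> float:
    """Size of the shared token set divided by the size of the combined token set."""
    tokens_a, tokens_b = set(trace_a.lower().split()), set(trace_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0   # two empty traces are treated as identical (assumption)
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

print(jaccard("add 3 and 5 to get 8", "add 5 and 3 to get 8"))
print(jaccard("a b c", "b c d"))
```

Unlike BERTScore, this measure only sees token overlap, so two traces that express the same reasoning in different words score low.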

CS/SS and temperature: Running with deterministic: true (temperature=0) produces CS=SS=1.0 for all models — this is a mathematical artifact, not a real measurement. Set temperature: 0.7 per model in config to get meaningful CS/SS scores.
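The pairwise average behind CS can be sketched as follows, which also shows why identical outputs force a score of 1.0. The function names are illustrative, and returning 1.0 when fewer than two runs are available is an assumption.

```python
from itertools import combinations

def pairwise_mean(items, score):
    """Average score(a, b) over all unordered pairs; equals (2 / (K(K-1))) * Σ for K items."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 1.0   # assumption: a single run cannot disagree with itself
    return sum(score(a, b) for a, b in pairs) / len(pairs)

def consistency(answers):
    """CS for one question: share of run pairs that produced the same final answer."""
    return pairwise_mean(answers, lambda a, b: float(a == b))

print(consistency(["8", "8", "8"]))   # what greedy decoding produces
print(consistency(["8", "8", "9"]))   # one of three sampled runs disagrees
```

SS follows the same pattern with a text-similarity function in place of the equality check.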

Aggregation strategies

Seven built-in weighting schemes are computed for every experiment. All appear as separate columns in the Excel output.

| Strategy | CQ | CS | RS | LS | ES | SS | Use case |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Balanced | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | General comparison |
| Safety Priority | 0.30 | 0.20 | 0.30 | 0.10 | 0.05 | 0.05 | High-stakes deployment |
| Accuracy Priority | 0.40 | 0.25 | 0.15 | 0.10 | 0.05 | 0.05 | Accuracy-critical tasks |
| Efficiency Priority | 0.20 | 0.15 | 0.15 | 0.10 | 0.30 | 0.10 | Resource-constrained deployment |
| Medical Triage | 0.40 | 0.05 | 0.30 | 0.20 | 0.03 | 0.02 | Clinical decision support |
| Legal/Compliance | 0.15 | 0.25 | 0.20 | 0.35 | 0.03 | 0.02 | Audit-sensitive applications |
| Edge Device/IoT | 0.30 | 0.03 | 0.10 | 0.05 | 0.50 | 0.02 | Resource-limited edge deployment |

Custom strategies can be added directly in config.yaml — no code changes needed.
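As a rough illustration of how a weighting scheme turns six metric scores into one number, the sketch below normalizes the weights and takes a weighted sum. The weighted-sum form of Q is an assumption (the source only writes Q = f(...)), and the metric scores are made-up values, not results.

```python
METRICS = ("correctness", "consistency", "robustness",
           "logical_coherence", "efficiency", "stability")

def normalize_weights(weights):
    """Rescale weights so they sum to 1; metrics without a weight count as 0."""
    total = sum(weights.get(m, 0.0) for m in METRICS)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {m: weights.get(m, 0.0) / total for m in METRICS}

def composite(scores, weights):
    """Weighted sum of the six metric scores (assumed form of Q)."""
    w = normalize_weights(weights)
    return sum(w[m] * scores[m] for m in METRICS)

# Safety Priority weights from the table above.
safety_priority = {"correctness": 0.30, "consistency": 0.20, "robustness": 0.30,
                   "logical_coherence": 0.10, "efficiency": 0.05, "stability": 0.05}

# Made-up scores for illustration only.
scores = {"correctness": 0.80, "consistency": 0.90, "robustness": 0.70,
          "logical_coherence": 0.95, "efficiency": 0.60, "stability": 0.85}

print(round(composite(scores, safety_priority), 4))
print(normalize_weights({"correctness": 5, "robustness": 3, "logical_coherence": 2}))
```

Because of the normalization step, weights written as 5 / 3 / 2 behave the same as 0.5 / 0.3 / 0.2.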

Models

| # | Model | Provider | Type | Parameters | RAM estimate |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4o-mini | OpenAI | API | | |
| 2 | Gemini 2.0 Flash | Google | API | | |
| 3 | DeepSeek-V3 | DeepSeek | API | | |
| 4 | Groq LLaMA-3.3-70B | Groq | API (OpenAI-compatible) | | |
| 5 | Phi-2 | Microsoft | Local (HF) | 2.7B | ~6 GB (float32) |
| 6 | Qwen2.5-1.5B-Instruct | Alibaba | Local (HF) | 1.5B | ~4 GB (float32) |
| 7 | Mistral-7B-Instruct-v0.3 | Mistral AI | Local (HF) | 7B | ~5 GB (4-bit) |
| 8 | LLaMA-3-8B-Instruct | Meta | Local (HF) | 8B | ~6 GB (4-bit) |
| 9 | Claude Haiku 4.5 | Anthropic | API | | |

Local models are loaded one at a time and released from RAM before the next model loads — allowing evaluation on machines without enough RAM to hold all models simultaneously.

HuggingFace models are downloaded automatically on first run and cached in ~/.cache/huggingface/.


Datasets

| Dataset | Type | Size used | Answer format | Source |
| --- | --- | --- | --- | --- |
| Synthetic | Auto-generated | Configurable | Mixed | Built-in |
| GSM8K | Math word problems | 250 (default) | Numerical | openai/gsm8k |
| StrategyQA | Commonsense reasoning | 250 (default) | Yes / No | wics/strategy-qa |
| MMLU | Multi-subject knowledge | 225 (default) | A / B / C / D | cais/mmlu |

Note: MMLU loads 225 items by default (not 250) because the moral_reasoning subject does not exist in the cais/mmlu dataset. The framework skips missing subjects automatically and continues with the remaining ones.

The synthetic dataset generates three categories of questions automatically: arithmetic/logic reasoning, adversarial questions (designed to expose brittle reasoning), and robustness items (paraphrase pairs).

Custom JSON datasets can also be added — see Adding New Datasets.


Installation

Option 1 — Install via pip (recommended)

pip install llm-reasoning-quality

After installation, run from any directory:

# Download the default config
curl -O https://raw.githubusercontent.com/senolali/LLM-Reasoning-Quality-Evaluation-Metrics/main/config/config.yaml

# Set your API keys
export OPENAI_API_KEY="sk-..."

# Run evaluation
llm-eval --config config.yaml

Or use as a Python library:

from llm_reasoning_quality import run_evaluation
run_evaluation(config_path="config.yaml")  # path to the config file downloaded above

Option 2 — Install from source (for development)

Prerequisites

  • Python 3.11 (recommended)
  • Miniconda or Anaconda

⚠️ PyTorch must be installed first and separately — platform-specific instructions below. Do not run pip install -r requirements.txt before PyTorch is installed.


Windows + NVIDIA GPU (tested & recommended)

Tested configuration: GTX 1650 4 GB · CUDA 12.1 · Python 3.11 · PyTorch 2.4.0 · bitsandbytes 0.44.0

⚠️ Use conda install for PyTorch on Windows — not pip install torch --index-url. The pip CUDA wheels cause fbgemm.dll or cusparse64_11.dll errors on many Windows systems. Conda resolves all DLL dependencies automatically.

Step 1 — Create environment:

conda create -n llm_eval_gpu python=3.11 -y
conda activate llm_eval_gpu

Step 2 — Install PyTorch via conda (CUDA 12.1):

conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia

Step 3 — Verify GPU:

python -c "import torch; print('CUDA:', torch.cuda.is_available()); print('GPU:', torch.cuda.get_device_name(0))"

Expected: CUDA: True and your GPU name.

Step 4 — Install bitsandbytes (pinned version):

pip install bitsandbytes==0.44.0

Step 5 — Install remaining dependencies:

pip install -r requirements.txt
pip install transformers -U

Windows CPU-only

Step 1 — Create environment:

conda create -n llm_eval python=3.11 -y
conda activate llm_eval

Step 2 — Install PyTorch CPU wheel (max 2.3.x):

pip install torch==2.3.1 --index-url https://download.pytorch.org/whl/cpu

⚠️ Do NOT use PyTorch 2.4+ on Windows CPU — it causes fbgemm.dll errors.

Step 3 — Pin transformers:

pip install "transformers==4.45.2"

Step 4 — Install remaining dependencies:

pip install -r requirements.txt

Linux / macOS

conda create -n llm_eval_gpu python=3.11 -y
conda activate llm_eval_gpu

# Option A: GPU (CUDA 12.1, tested; Linux with an NVIDIA GPU)
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install bitsandbytes==0.44.0

# Option B: CPU only (run this instead of the two GPU lines above)
pip install torch

pip install -r requirements.txt
pip install transformers -U

Quick Start

1. Set API keys

Windows PowerShell:

$env:OPENAI_API_KEY    = "sk-..."
$env:ANTHROPIC_API_KEY = "sk-ant-..."
$env:GOOGLE_API_KEY    = "AIza..."
$env:DEEPSEEK_API_KEY  = "sk-..."

Linux / macOS:

export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GOOGLE_API_KEY="AIza..."
export DEEPSEEK_API_KEY="sk-..."

Models with missing API keys are automatically skipped — you don't need all keys to run the framework.
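Before a long run it can help to check which keys are visible to Python. The sketch below uses only the variable names from the examples above; key_status is a hypothetical helper, not part of the package.

```python
import os

# Environment variable names taken from the examples above.
API_KEY_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY", "DEEPSEEK_API_KEY")

def key_status(env=None):
    """Map each variable name to True if it is set to a non-empty value."""
    env = os.environ if env is None else env
    return {name: bool(env.get(name)) for name in API_KEY_VARS}

if __name__ == "__main__":
    for name, present in key_status().items():
        print(f"{name}: {'set' if present else 'missing'}")
```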

2. Run full evaluation

python main.py --config config/config.yaml

Or specify a custom config:

python main.py --config config/my_experiment.yaml

Custom Evaluation: Your Own Dataset & Weights

This section shows how to evaluate your own dataset with custom metric weights — no code changes needed.

Step 1 — Prepare your dataset

Create a JSON file (e.g. my_dataset.json):

[
  {
    "id": "q001",
    "question": "What is the capital of France?",
    "answer": "Paris",
    "type": "reasoning",
    "perturbations": [
      "Name the capital city of France.",
      "Which city serves as France's capital?",
      "What city is the capital of France?"
    ]
  },
  {
    "id": "q002",
    "question": "If a train travels 120 km in 2 hours, what is its speed?",
    "answer": "60",
    "type": "reasoning",
    "perturbations": [
      "A train covers 120 km in 2 hours. Find its speed.",
      "What speed does a train reach if it travels 120 km in 2 hours?",
      "Calculate the speed of a train that goes 120 km in 2 hours."
    ]
  }
]

Rules:

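  • answer must be a string (CQ compares it with the model output, starting with exact match and then the other matching strategies described under Metrics)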
  • perturbations — 3 rephrased versions of the same question (used for RS)
  • type — any label you choose (used for grouping in output)
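A dataset file can be checked against these rules before starting an evaluation. The validator below is a sketch: validate_items and validate_file are hypothetical helpers, and treating id, question, and type as required strings is an assumption based on the example above.

```python
import json

REQUIRED_STRING_FIELDS = ("id", "question", "answer", "type")  # assumed to be required

def validate_items(items):
    """Return a list of human-readable problems; an empty list means the items look fine."""
    errors = []
    for i, item in enumerate(items):
        for field in REQUIRED_STRING_FIELDS:
            if not isinstance(item.get(field), str):
                errors.append(f"item {i}: '{field}' must be a string")
        perturbations = item.get("perturbations", [])
        if not (isinstance(perturbations, list)
                and all(isinstance(p, str) for p in perturbations)):
            errors.append(f"item {i}: 'perturbations' must be a list of strings")
    return errors

def validate_file(path):
    """Load a JSON dataset file and validate its items."""
    with open(path, encoding="utf-8") as f:
        return validate_items(json.load(f))

# A numeric answer (60 instead of "60") is the most common slip.
print(validate_items([{"id": "q002", "question": "Speed?", "answer": 60, "type": "reasoning"}]))
```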

Step 2 — Define your custom weights

Create config/config_custom.yaml:

experiment:
  name: "my_custom_eval"
  seed: 42
  output_dir: "outputs"

models:
  - name: "GPT-4o-mini"
    type: "openai"
    params:
      model_id: "gpt-4o-mini"
      api_key_env: "OPENAI_API_KEY"
      max_tokens: 256
      temperature: 0.7

datasets:
  - name: "my_dataset"
    type: "json"
    params:
      path: "my_dataset.json"   # path to your JSON file
      num_samples: 100          # how many items to use

metrics:
  consistency_runs: 3
  robustness_perturbations: 3
  stability_runs: 3
  nli_model: "cross-encoder/nli-deberta-v3-small"
  bertscore_model: "distilbert-base-uncased"

aggregation:
  strategies:

    # Equal weights — general baseline
    balanced:
      correctness:       0.1667
      consistency:       0.1667
      robustness:        0.1667
      logical_coherence: 0.1667
      efficiency:        0.1667
      stability:         0.1667

    # Example: high-stakes domain (correctness + coherence matter most)
    my_custom_strategy:
      correctness:       0.40
      logical_coherence: 0.30
      robustness:        0.15
      consistency:       0.10
      efficiency:        0.03
      stability:         0.02

Weight rules:

  • Six keys: correctness, consistency, robustness, logical_coherence, efficiency, stability
  • Weights are auto-normalized — they don't need to sum exactly to 1.0
  • Add as many custom strategies as you need

Step 3 — Run evaluation

python main.py --config config/config_custom.yaml

Step 4 — Read your results

Results are saved to outputs/my_custom_eval_<timestamp>/:

| File | What it contains |
| --- | --- |
| reasoning_quality_results.xlsx | CQ, CS, RS, LS, ES, SS scores + your custom strategy scores |
| radar_plot.png | Visual comparison across all 6 dimensions |
| summary.json | Full results in machine-readable format |

In the Excel file, your custom strategy appears as a separate column next to the built-in strategies — making it easy to compare how different weight choices affect model rankings.

Configuration

Everything is controlled from a single YAML file. The default is config/config.yaml.

Experiment settings

experiment:
  name: "my_experiment"     # Used as prefix for output folder name
  seed: 42                  # Random seed for reproducibility
  deterministic: true       # true = greedy decoding (temperature=0)
  output_dir: "outputs"     # Where results are saved

max_workers: 1              # Always set to 1 for local models — parallel loading
                            # causes meta tensor errors and CUDA OOM

Adding an API model

models:
  - name: "GPT-4o"          # Display name (appears in Excel / radar chart)
    type: "openai"           # openai | anthropic | gemini | deepseek | local | mock
    params:
      model_id: "gpt-4o"
      api_key_env: "OPENAI_API_KEY"   # Environment variable name
      max_tokens: 512
      temperature: 0.7       # Optional — overrides deterministic setting for CS/SS
      max_retries: 3
      timeout: 60

Adding a local HuggingFace model

models:
  - name: "Qwen2.5-1.5B"
    type: "local"
    params:
      model_id: "Qwen/Qwen2.5-1.5B-Instruct"
      device: "cuda"
      use_4bit: true                   # Attempts 4-bit; falls back to float16 if unsupported
      max_new_tokens: 64               # 64 is sufficient for all benchmark answer formats
      temperature: 0.7

VRAM guide for local models:

| Model size | use_4bit | VRAM needed |
| --- | --- | --- |
| 1.5B–2.7B | false | ~4–6 GB (float32) |
| 1.5B–2.7B | true | ~1.5–2 GB (4-bit, may fall back to float16) |
| 7B–8B | true | ~5–6 GB (4-bit, required) |

Note on 4-bit fallback: For small models (Qwen2.5-1.5B, Phi-2) on some hardware/driver configurations, 4-bit loading may fail with a meta tensor error. The framework catches this automatically and falls back to float16 CUDA. The copying from a non-meta parameter warnings in the log are expected in this case and do not affect results.

Note on max_new_tokens: 64 tokens is sufficient for numerical answers (GSM8K, synthetic), Yes/No answers (StrategyQA), and A/B/C/D answers (MMLU). Longer explanations may be truncated, but this does not affect metric scoring since answer extraction reads the first matching token.

Metric settings

metrics:
  consistency_runs: 3           # K — number of repeated runs per question for CS
  robustness_perturbations: 3   # P — number of paraphrase variants per question for RS
  stability_runs: 3             # K — number of repeated runs per question for SS
  nli_model: "cross-encoder/nli-deberta-v3-small"
  bertscore_model: "distilbert-base-uncased"

Performance note for local models: Each item requires 1 + consistency_runs + robustness_perturbations inference calls. With the defaults (3+3) this is 7 calls/item. At ~7–8 s/call on a GTX 1650 (float16), 975 items (6,825 calls) take approximately 13–15 hours per local model. Reducing to consistency_runs: 2 and robustness_perturbations: 2 (5 calls/item) brings this to roughly 9.5–11 hours.
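The call count in the note above is simple enough to script when planning a run. In this sketch the function names are hypothetical and the seconds-per-call figure is an input you would measure on your own hardware.

```python
def calls_per_item(consistency_runs: int, robustness_perturbations: int) -> int:
    """One base call plus the repeated runs and the paraphrase variants."""
    return 1 + consistency_runs + robustness_perturbations

def estimate_hours(items: int, consistency_runs: int,
                   robustness_perturbations: int, seconds_per_call: float) -> float:
    """Rough wall-clock estimate for one local model, ignoring model load time."""
    total_calls = items * calls_per_item(consistency_runs, robustness_perturbations)
    return total_calls * seconds_per_call / 3600

# 975 items, default settings, 7.5 s/call (midpoint of the range quoted above).
print(f"{estimate_hours(975, 3, 3, 7.5):.1f} h")
# Same run with two consistency runs and two perturbations.
print(f"{estimate_hours(975, 2, 2, 7.5):.1f} h")
```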

Adding a custom aggregation strategy

aggregation:
  strategies:
    my_strategy:
      correctness:       0.50
      robustness:        0.30
      logical_coherence: 0.20
      consistency:       0.00
      efficiency:        0.00
      stability:         0.00

Weights are auto-normalized if they don't sum exactly to 1.0.

Temperature and CS/SS measurement

By default, deterministic: true sets temperature=0. This causes CS=SS=1.0 for all models (deterministic models always produce the same output — this is a mathematical artifact, not a meaningful measurement).

To get meaningful CS/SS scores, add temperature: 0.7 per model:

experiment:
  deterministic: true    # Keep this — only temperature param overrides it

models:
  - name: "GPT-4o-mini"
    type: "openai"
    params:
      model_id: "gpt-4o-mini"
      temperature: 0.7    # ← This overrides deterministic for this model only

Outputs

All results are saved to outputs/<experiment_name>_<timestamp>/:

| File | Description |
| --- | --- |
| reasoning_quality_results.xlsx | Full results: raw metrics, all aggregation strategies, per-dataset breakdown, metadata |
| radar_plot.png | Multi-dimensional radar chart, one polygon per model |
| summary.json | Complete results in machine-readable JSON |
| <ModelName>_result.json | Per-model detailed results |

Excel structure

The Excel file contains multiple sheets:

  • Results — one row per model, columns: CQ, CS, RS, LS, ES, SS + all aggregation strategy scores
  • Per-Dataset Breakdown — same metrics split by dataset (GSM8K, MMLU, etc.)
  • Experiment Metadata — config parameters, timestamps, dataset sizes

Project Structure

LLM-Reasoning-Quality-Evaluation-Metrics/
│
├── .github/
│   └── workflows/
│       └── publish.yml         ← Automated PyPI publishing
├── config/
│   ├── config.yaml             ← Main config: add models/datasets/strategies here
│   └── config_test.yaml        ← Quick test (mock + Phi-2 + synthetic, ~5 min)
│
├── models/
│   ├── base_model.py           ← Abstract base class (cache, interface)
│   │                             Cache disabled for stochastic models (temperature>0)
│   ├── openai_model.py         ← GPT-4o-mini, GPT-4o, any OpenAI-compatible API
│   ├── anthropic_model.py      ← Claude models
│   ├── gemini_model.py         ← Gemini models
│   ├── deepseek_model.py       ← DeepSeek (OpenAI-compatible endpoint)
│   ├── local_model.py          ← HuggingFace local models
│   │                             4-bit quantization with float16 fallback
│   │                             Sequential RAM management (one model at a time)
│   │                             Pre-loading before evaluation loop (no per-item reload)
│   │                             Prompt templates per model family
│   └── mock_model.py           ← Deterministic mock for testing without APIs
│
├── llm_datasets/
│   ├── base_dataset.py         ← Abstract base + JSON file loader
│   ├── synthetic_dataset.py    ← Auto-generated reasoning/adversarial/robustness items
│   ├── gsm8k_dataset.py        ← GSM8K math word problems
│   ├── mmlu_dataset.py         ← MMLU multi-subject multiple choice
│   │                             Skips missing subjects (e.g. moral_reasoning) gracefully
│   ├── strategyqa_dataset.py   ← StrategyQA commonsense yes/no
│   └── multi_dataset.py        ← Combines multiple datasets, tracks source per item
│
├── llm_reasoning_quality/
│   ├── __init__.py             ← PyPI package entry point
│   └── cli.py                  ← llm-eval CLI command
├── metrics/
│   ├── accuracy.py             ← CQ — 7-strategy fuzzy matching pipeline
│   ├── consistency.py          ← CS — pairwise agreement across K runs
│   ├── robustness.py           ← RS — perturbation matching (conditioned on CQ)
│   ├── logical_consistency.py  ← LS — NLI contradiction detection
│   ├── efficiency.py           ← ES — harmonic mean of CQ and inverse token count
│   ├── explainability.py       ← SS — BERTScore across reasoning traces
│   └── aggregation.py          ← Weighted composite Q score, 7 strategies
│
├── evaluation/
│   └── evaluator.py            ← Main pipeline: load → generate → 6 metrics → export
│                                 Local model detection → forces workers=1
│                                 Pre-loads model once before evaluation loop
│                                 Per-dataset breakdown support
│
├── visualization/
│   └── radar_plot.py           ← Radar chart + grouped bar chart
│
├── utils/
│   ├── logger.py               ← Structured logging
│   ├── reproducibility.py      ← Seed setting across Python / NumPy / PyTorch
│   └── experiment_tracker.py   ← JSON + Excel export, result aggregation
│
├── outputs/                    ← Auto-created; all results go here
├── MANIFEST.in                 ← Package file inclusion rules
├── requirements.txt
├── main.py                     ← Entry point; config parsing + model/dataset registration
└── pyproject.toml              ← PyPI package configuration



Adding New Models

Option A — Config only (API models)

For any OpenAI-compatible API:

- name: "My-Model"
  type: "openai"
  params:
    model_id: "my-model-id"
    api_key_env: "MY_API_KEY"
    max_tokens: 512

For HuggingFace local models:

- name: "My-Local-Model"
  type: "local"
  params:
    model_id: "org/model-name"
    device: "cuda"
    use_4bit: true            # Falls back to float16 if unsupported
    max_new_tokens: 64
    temperature: 0.7

Option B — Custom model class

  1. Create models/my_model.py extending BaseModel
  2. Implement generate(prompt) and generate_with_trace(prompt)
  3. Add a _build_mytype() function in main.py
  4. Register in the MODEL_BUILDERS dict in main.py
  5. Use type: "mytype" in config
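The steps above might look roughly like the sketch below. Everything here is a stand-in so the example runs on its own: the BaseModel class is a simplified placeholder for models/base_model.py, the constructor arguments and the (answer, trace) return shape of generate_with_trace are assumptions, and the MODEL_BUILDERS dict mimics the one in main.py. Check the real base class before copying any of it.

```python
from abc import ABC, abstractmethod

class BaseModel(ABC):
    """Placeholder for the framework's base class; the real interface may differ."""
    def __init__(self, name: str):
        self.name = name

    @abstractmethod
    def generate(self, prompt: str) -> str: ...

    @abstractmethod
    def generate_with_trace(self, prompt: str): ...

class MyModel(BaseModel):
    """Toy model that always returns a fixed answer (step 1 and step 2)."""
    def __init__(self, name: str, canned_answer: str):
        super().__init__(name)
        self.canned_answer = canned_answer

    def generate(self, prompt: str) -> str:
        return self.canned_answer

    def generate_with_trace(self, prompt: str):
        trace = f"Step 1: read the question. Step 2: answer {self.canned_answer}."
        return self.canned_answer, trace   # assumed return shape: (answer, trace)

def _build_mytype(cfg: dict) -> MyModel:
    """Builder that turns one config entry into a model instance (step 3)."""
    return MyModel(cfg["name"], cfg["params"]["canned_answer"])

# Step 4: in the framework this dict lives in main.py.
MODEL_BUILDERS = {"mytype": _build_mytype}

# Step 5: a config entry with type "mytype" would be dispatched like this.
model = MODEL_BUILDERS["mytype"]({"name": "My-Model", "params": {"canned_answer": "42"}})
print(model.generate("What is 6 x 7?"))
```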

Adding New Datasets

Option A — JSON file (no code needed)

Prepare a JSON file with this structure:

[
  {
    "id": "q001",
    "question": "What is 2 + 2?",
    "answer": "4",
    "type": "reasoning",
    "perturbations": [
      "What does 2 plus 2 equal?",
      "Calculate 2 + 2",
      "Find the sum of 2 and 2"
    ]
  }
]

Then add to config:

datasets:
  - name: "my_dataset"
    type: "json"
    params:
      path: "data/my_questions.json"
      num_samples: 100

The perturbations field is used for RS (robustness) metric. If omitted, robustness is skipped for that item.

Option B — HuggingFace dataset class

  1. Create llm_datasets/my_dataset.py extending BaseDataset
  2. Implement the load() method to populate self._data
  3. Register the type in main.py

Known Issues & Platform Notes

Windows GPU — DLL errors (fbgemm.dll / cusparse64_11.dll)

Both errors share the same cause: pip CUDA wheels have DLL dependency issues on many Windows systems.

Fix — use conda install instead of pip install:

conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia

This resolves all DLL issues automatically.

⚠️ Do NOT install torch 2.10.x. It breaks torchvision/torchaudio compatibility and reintroduces DLL errors. If you accidentally upgrade, restore with:

pip uninstall torch torchvision torchaudio bitsandbytes -y
pip cache purge
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install bitsandbytes==0.44.0

bitsandbytes version compatibility

| bitsandbytes | torch | Status |
| --- | --- | --- |
| 0.44.0 | 2.4.0 + CUDA 12.1 | ✅ Tested, works |
| 0.49.x | 2.4.0 | ❌ Incompatible: causes CUDA errors |
| any | 2.10.x | ❌ Do not use torch 2.10.x |

Windows CPU — PyTorch version

PyTorch 2.4+ causes fbgemm.dll errors with the Windows CPU pip wheels. Use 2.3.x for CPU-only:

pip install torch==2.3.1 --index-url https://download.pytorch.org/whl/cpu
pip install "transformers==4.45.2"

transformers version (CPU-only systems)

transformers >= 4.46 requires torch >= 2.4. On Windows CPU where 2.4 cannot be installed, pin transformers to 4.45.2. On GPU systems with PyTorch 2.4+, install the latest transformers freely:

pip install transformers -U

Local model — meta tensor / copying from non-meta parameter warnings

During 4-bit model loading you may see many warnings like:

UserWarning: for model.layers.X...: copying from a non-meta parameter in the
checkpoint to a meta parameter in the current model, which is a no-op.

This is expected and harmless. It means the 4-bit loading path was attempted but fell back to float16 CUDA. The model loads correctly in float16 and inference proceeds normally. The warning appears once per layer (28 layers × ~10 weights = ~280 lines for Qwen2.5-1.5B).

Local model — 4-bit fallback to float16

For small models (Qwen2.5-1.5B, Phi-2) 4-bit quantization may fail on some hardware with:

Cannot copy out of meta tensor; no data!

The framework catches this and automatically falls back to float16 CUDA. You will see:

[Qwen2.5-1.5B] 4-bit failed (...), falling back to float16 CUDA

This is not an error — evaluation continues normally. float16 uses slightly more VRAM (~3 GB for 1.5B) but works reliably on GTX 1650.

Local model — workers must be 1

The evaluator automatically detects local models and forces workers=1 regardless of the max_workers config setting. Running multiple local model workers causes repeated HuggingFace downloads, CUDA OOM, and meta tensor errors. This is by design.

Flash attention warning

Torch was not compiled with flash attention.

This is harmless on GTX 1650 (Turing architecture). The model uses standard scaled dot-product attention instead, which works correctly. Flash attention requires Ampere or newer (RTX 3000+).

logits type warning

Starting from v4.46, the logits model output will have the same type as the model

Harmless informational warning from transformers. Does not affect results.

MMLU — moral_reasoning subject not found

[MMLU] Could not load subject 'moral_reasoning': BuilderConfig 'moral_reasoning' not found.

moral_reasoning does not exist in the cais/mmlu dataset. The framework logs this warning and skips it automatically. MMLU loads 225 items from the remaining subjects instead of 250. This is expected.

Mistral / LLaMA tokenizer error

Cannot instantiate this tokenizer from a slow version... sentencepiece

Fix:

pip install sentencepiece

CS = SS = 1.0 for all models

This happens when deterministic: true and no temperature is set per model. The cache returns the same response for all K runs. Fix: add temperature: 0.7 to each model in config. See Temperature and CS/SS measurement.

Per-dataset breakdown is slow (NLI/BERTScore reload)

In earlier versions, NLI and BERTScore models were reloaded for each dataset in the per-dataset breakdown, causing 30–80 min overhead per dataset. This is fixed in v5: NLI/BERTScore models are now kept in memory across all per-dataset passes and released only once after all datasets are processed.


Citation

If you use this framework in your research, please cite:

@article{senol2026reasoning,
  title         = {Measuring Reasoning Quality in Large Language Models: A Multi-Dimensional Behavioral Framework},
  author        = {Şenol, Ali and Agrawal, Garima and Liu, Huan},
  year          = {2026},
  eprint        = {2605.24661},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url           = {https://arxiv.org/abs/2605.24661}
}

License

MIT License — see LICENSE for details.

