A multi-dimensional behavioral framework for evaluating LLM reasoning quality beyond accuracy

LLM Reasoning Quality Evaluation Framework

A config-driven, multi-dimensional framework for evaluating reasoning quality in Large Language Models — beyond simple answer correctness.

6 metrics · 7 models (5 API + 2 local) · 4 benchmark datasets · CLI + no-code web interface · no code changes needed to add models or datasets

📄 Paper (preprint): Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework

Overview
Metrics
Models
Datasets
Installation
Quick Start
Web Interface (No Code Required)
Custom Evaluation: Your Own Dataset & Weights
Reproducing the Paper Results
Configuration
Outputs
Project Structure
Adding New Models
Adding New Datasets
Known Issues & Platform Notes
Citation

Overview

Standard LLM evaluation asks: "Is the answer correct?"

This framework asks: "How well does the model reason?"

It evaluates models across 6 complementary dimensions of reasoning quality, producing a composite score that captures correctness, behavioral stability, robustness, logical integrity, and efficiency simultaneously.

Q = f(CQ, CS, RS, LS, ES, SS)

The framework is fully config-driven — models, datasets, metrics, and aggregation weights are all controlled from a single YAML file. No code changes are needed for common use cases.

Metrics

Symbol	Name	What It Measures
CQ	Correctness	Fraction of correct final answers
CS	Consistency	Same answer across K independent runs?
RS	Robustness	Same answer on semantically equivalent rephrases?
LS	Local Logical Coherence	No contradictions between consecutive reasoning steps?
ES	Efficiency	Correct and concise? (harmonic mean of CQ and inverse normalized token count)
SS	Stability	Same reasoning process across K runs? (BERTScore over traces)

Formal definitions of all six metrics are given in Section 3 of the paper.

Key design decisions

CQ — Multi-strategy matching pipeline: Raw model outputs are often verbose (e.g. "John has 8 apples." instead of "8"). The correctness metric applies 7 sequential matching strategies before marking an answer wrong: exact match → normalized → number extraction → yes/no extraction → A/B/C/D extraction → substring match → numeric tolerance. This prevents local models from being penalized purely for output format.
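The matching order above can be illustrated with a small sketch. This is not the framework's `accuracy.py`; the helper names, regular expressions, and the tolerance value are assumptions chosen only to show how seven strategies can be tried in sequence before an answer is marked wrong.

```python
import re

_NUMBER = r"-?\d+(?:\.\d+)?"

def _normalize(text: str) -> str:
    """Lowercase, drop most punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s.\-]", " ", text.lower())
    return " ".join(text.split()).strip(" .")

def _last_number(text: str):
    """Return the last number found in the text (as a string), or None."""
    found = re.findall(_NUMBER, text.replace(",", ""))
    return found[-1] if found else None

def is_correct(prediction: str, reference: str, tol: float = 1e-6) -> bool:
    """Hypothetical 7-step matcher; every rule here is an illustrative assumption."""
    pred, ref = prediction.strip(), reference.strip()
    n_pred, n_ref = _normalize(pred), _normalize(ref)
    ref_clean = ref.replace(",", "")
    ref_is_number = re.fullmatch(_NUMBER, ref_clean) is not None
    pred_num = _last_number(pred)

    if pred == ref:                                   # 1. exact match
        return True
    if n_pred == n_ref:                               # 2. normalized match
        return True
    if ref_is_number and pred_num == ref_clean:       # 3. number extraction
        return True
    if n_ref in ("yes", "no"):                        # 4. yes/no extraction
        m = re.search(r"\b(yes|no)\b", n_pred)
        if m:
            return m.group(1) == n_ref
    if re.fullmatch(r"[A-Da-d]", ref):                # 5. A/B/C/D extraction
        m = (re.search(r"answer\s*(?:is|:)?\s*\(?([A-D])\b", pred, flags=re.IGNORECASE)
             or re.match(r"\(?([A-D])\b", pred))
        if m:
            return m.group(1).upper() == ref.upper()
    if n_ref and re.search(r"(?<!\w)" + re.escape(n_ref) + r"(?!\w)", n_pred):
        return True                                   # 6. whole-word substring match
    if ref_is_number and pred_num is not None:        # 7. numeric tolerance
        return abs(float(pred_num) - float(ref_clean)) <= tol * max(1.0, abs(float(ref_clean)))
    return False

print(is_correct("John has 8 apples.", "8"))
print(is_correct("The answer is (B).", "B"))
```

In this sketch the yes/no and letter steps are decisive once a candidate is found, so a response such as "No." is not later rescued by the substring step.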

RS — Conditioned on correctness: Robustness is only counted for questions the model originally answered correctly. A model that gets everything wrong would trivially get RS = 1.0 otherwise.
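A minimal sketch of this conditioning, assuming each result record carries the original answer, a correctness flag, and the answers to the rephrased questions (all field names are hypothetical, as is the choice to average per item and to return `None` when no item is eligible):

```python
def robustness_score(results):
    """Average agreement with the original answer, over originally-correct items only."""
    eligible = [r for r in results if r["original_correct"] and r["perturbed_answers"]]
    if not eligible:
        return None  # assumption: RS is undefined when nothing was answered correctly
    per_item = [
        sum(ans == r["original_answer"] for ans in r["perturbed_answers"])
        / len(r["perturbed_answers"])
        for r in eligible
    ]
    return sum(per_item) / len(per_item)

results = [
    {"original_answer": "8", "original_correct": True, "perturbed_answers": ["8", "8", "9"]},
    # Consistently wrong: excluded, so it cannot inflate RS.
    {"original_answer": "5", "original_correct": False, "perturbed_answers": ["5", "5", "5"]},
    {"original_answer": "yes", "original_correct": True, "perturbed_answers": ["yes", "yes", "yes"]},
]
print(robustness_score(results))
```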

LS — NLI-based contradiction detection: Uses cross-encoder/nli-deberta-v3-small to detect contradictions between consecutive reasoning steps. Single-sentence responses receive LS = 1.0 by convention (a single atomic step admits no internal contradiction). Falls back gracefully to LS = 1.0 if the NLI model is unavailable.
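The conventions in this paragraph can be sketched without loading an NLI model by passing the contradiction check in as a function. The sentence splitter, the scoring formula (one minus the fraction of contradictory consecutive pairs), and the toy predicate are assumptions for illustration; the formal definition is the one in Section 3 of the paper.

```python
import re
from typing import Callable, Optional

def split_steps(trace: str) -> list[str]:
    """Naive sentence split; a real pipeline may segment steps differently."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", trace.strip()) if s.strip()]

def local_coherence(trace: str,
                    contradicts: Optional[Callable[[str, str], bool]]) -> float:
    """Assumed scoring: 1 - (contradictory consecutive pairs / all consecutive pairs)."""
    if contradicts is None:          # NLI model unavailable: fall back to 1.0
        return 1.0
    steps = split_steps(trace)
    if len(steps) < 2:               # single-step convention
        return 1.0
    pairs = list(zip(steps, steps[1:]))
    bad = sum(1 for a, b in pairs if contradicts(a, b))
    return 1.0 - bad / len(pairs)

def toy_contradicts(a: str, b: str) -> bool:
    """Stand-in for an NLI classifier; flags only the even/odd clash."""
    return ("even" in a and "odd" in b) or ("odd" in a and "even" in b)

print(local_coherence("x is 4. x is even. x is odd.", toy_contradicts))
print(local_coherence("The answer is 4.", toy_contradicts))
```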

ES — Harmonic mean: Prevents rewarding short-but-wrong or long-but-correct responses equally. Both correctness and conciseness must be high for ES to be high.
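As a rough illustration of why the harmonic mean behaves this way, the sketch below assumes conciseness is defined as one minus the token count divided by a token budget. That normalization is an assumption; the paper's Section 3 gives the actual definition.

```python
def efficiency_score(cq: float, tokens: float, max_tokens: float) -> float:
    """Harmonic mean of correctness and an assumed conciseness term in [0, 1]."""
    conciseness = 1.0 - min(tokens, max_tokens) / max_tokens
    if cq + conciseness == 0:
        return 0.0
    return 2 * cq * conciseness / (cq + conciseness)

print(efficiency_score(1.0, 64, 256))    # correct and concise: high
print(efficiency_score(0.0, 10, 256))    # short but wrong: 0.0
print(efficiency_score(1.0, 256, 256))   # correct but uses the whole budget: 0.0
```

Because the harmonic mean is dominated by the smaller of its two inputs, either a wrong answer or a response that exhausts the budget pulls the score to zero.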

SS — BERTScore similarity: Measures semantic similarity between reasoning traces across runs, not just whether the final answer matches. Falls back to Jaccard similarity if bert-score is not installed.
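The Jaccard fallback can be pictured as word-set overlap averaged over all pairs of traces. The tokenization (lowercased whitespace split) and the pairwise averaging below are assumptions, not a copy of the framework's implementation.

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two traces."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def stability_jaccard(traces: list[str]) -> float:
    """Mean pairwise Jaccard similarity across the K reasoning traces."""
    pairs = list(combinations(traces, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

print(stability_jaccard(["a b c", "a b d", "a b c"]))
```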

CS/SS and temperature: Running with deterministic: true (temperature = 0) produces CS = SS = 1.0 for all models — this is a mathematical artifact, not a real measurement. Set temperature: 0.7 per model in config to get meaningful CS/SS scores. All paper results were obtained at temperature = 0.7.
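A short sketch shows why identical outputs force a perfect score. It assumes CS is the fraction of agreeing answer pairs across the K runs, which is one plausible reading of "pairwise agreement"; the function name is hypothetical.

```python
from itertools import combinations

def consistency(answers: list[str]) -> float:
    """Fraction of run pairs that produced the same final answer."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

# Temperature 0: every run returns the same text, so CS is 1.0 by construction.
print(consistency(["8", "8", "8"]))
# Temperature 0.7: runs may disagree, so CS carries information.
print(consistency(["8", "8", "9"]))
```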

Aggregation strategies

Seven built-in weighting schemes are computed for every experiment. All appear as separate columns in the Excel output.

Strategy	CQ	CS	RS	LS	ES	SS	Use case
Balanced	1/6	1/6	1/6	1/6	1/6	1/6	General comparison
Safety Priority	0.30	0.20	0.30	0.10	0.05	0.05	High-stakes deployment
Accuracy Priority	0.40	0.25	0.15	0.10	0.05	0.05	Accuracy-critical tasks
Efficiency Priority	0.20	0.15	0.15	0.10	0.30	0.10	Resource-constrained deployment
Medical Triage	0.40	0.05	0.30	0.20	0.03	0.02	Clinical decision support
Legal/Compliance	0.15	0.25	0.20	0.35	0.03	0.02	Audit-sensitive applications
Edge Device/IoT	0.30	0.03	0.10	0.05	0.50	0.02	Resource-limited edge deployment

These weight vectors are theoretically motivated illustrative defaults; practitioners should calibrate them against their own operational requirements. Custom strategies can be added directly in config.yaml — no code changes needed.
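To make the aggregation concrete, the sketch below assumes Q is a linear weighted sum of the six metrics with weights rescaled to sum to 1. The weights are the Safety Priority row from the table; the metric scores are made-up numbers used only for the arithmetic.

```python
def normalize(weights: dict) -> dict:
    """Rescale weights so they sum to 1.0."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {name: value / total for name, value in weights.items()}

def composite_q(scores: dict, weights: dict) -> float:
    """Assumed aggregation: weighted sum of metric scores."""
    w = normalize(weights)
    return sum(w[name] * scores[name] for name in w)

safety_priority = {
    "correctness": 0.30, "consistency": 0.20, "robustness": 0.30,
    "logical_coherence": 0.10, "efficiency": 0.05, "stability": 0.05,
}
# Hypothetical metric values for one model.
scores = {
    "correctness": 0.80, "consistency": 0.90, "robustness": 0.70,
    "logical_coherence": 1.00, "efficiency": 0.60, "stability": 0.85,
}
print(round(composite_q(scores, safety_priority), 4))
```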

Models

The seven models evaluated in the paper:

#	Model	Provider	Type	Parameters	Access
1	GPT-4o-mini	OpenAI	API	—	OpenAI API
2	Claude Haiku 4.5	Anthropic	API	—	Anthropic API
3	DeepSeek-V3	DeepSeek AI	API	—	DeepSeek API
4	Gemini 2.5 Flash	Google	API	—	Google API
5	LLaMA-3-70B	Meta	API (OpenAI-compatible)	70B	OpenRouter
6	Qwen2.5-1.5B-Instruct	Alibaba	Local (HF)	1.5B	HuggingFace, float16
7	Phi-2	Microsoft	Local (HF)	2.7B	HuggingFace, float16

The framework additionally supports any OpenAI-compatible endpoint (e.g., Groq) and any HuggingFace causal LM (e.g., Mistral-7B-Instruct-v0.3, LLaMA-3-8B-Instruct with 4-bit quantization) via config only — see Adding New Models. These additional models are supported by the framework but were not part of the paper's evaluation.

Local models are loaded one at a time and released from RAM before the next model loads — allowing evaluation on machines without enough RAM to hold all models simultaneously. HuggingFace models are downloaded automatically on first run and cached in ~/.cache/huggingface/.

Datasets

The 975-item evaluation suite used in the paper:

Dataset	Type	Size (paper)	Answer format	Source
GSM8K	Math word problems	250	Numerical	`openai/gsm8k`
MMLU	9 reasoning subjects	225	A / B / C / D	`cais/mmlu`
StrategyQA	Commonsense reasoning	250	Yes / No	`wics/strategy-qa`
Synthetic	Built-in generator	250	Mixed	This repository

MMLU subjects (9): logical fallacies, formal logic, abstract algebra, elementary mathematics, high school mathematics, college mathematics, high school statistics, conceptual physics, philosophy. The framework skips subjects missing from cais/mmlu automatically and continues with the remaining ones, yielding 225 items.

Synthetic dataset (250 items): 100 arithmetic word problems with numerical variation, 75 adversarial instances embedding deliberate logical contradictions into otherwise valid premises, and 75 robustness probes pairing each item with two surface-level paraphrases. Sizes are configurable; the values above reproduce the paper.

All sampling uses a fixed random seed (seed: 42) for reproducibility. Custom JSON datasets can also be added — see Adding New Datasets.
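Seeded sampling of this kind can be sketched with the standard library. The function name and the choice of `random.Random` are assumptions; the point is only that the same seed yields the same subset on every run.

```python
import random

def sample_items(items: list, num_samples: int, seed: int = 42) -> list:
    """Draw a reproducible subset; returns everything if fewer items exist."""
    if num_samples >= len(items):
        return list(items)
    rng = random.Random(seed)      # local generator, leaves global state untouched
    return rng.sample(items, num_samples)

pool = [f"q{i:03d}" for i in range(1, 1001)]
first = sample_items(pool, 250)
second = sample_items(pool, 250)
print(first == second)
```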

Installation

Option A — Install from PyPI (recommended for using the framework)

pip install llm-reasoning-quality

Then scaffold a ready-to-run project in any directory:

mkdir my-llm-eval && cd my-llm-eval
llm-eval setup                    # copies main.py, app.py, config/ into the current directory
llm-eval --config config/config_test.yaml   # quick smoke test, no API keys needed

CLI commands:

Command	Description
`llm-eval setup`	Set up project files in the current directory (`--dir` for a different target)
`llm-eval --config <file>`	Run an evaluation with the given YAML config
`llm-eval --version`	Show installed version

Option B — Install from source (recommended for reproducing the paper / development)

Prerequisites

Python 3.11 (recommended)
Miniconda or Anaconda

⚠️ PyTorch must be installed first and separately — platform-specific instructions below. Do not run pip install -r requirements.txt before PyTorch is installed.

Windows + NVIDIA GPU (tested & recommended)

Tested configuration: GTX 1650 4 GB · CUDA 12.1 · Python 3.11 · PyTorch 2.4.0 · bitsandbytes 0.44.0

⚠️ Use conda install for PyTorch on Windows — not pip install torch --index-url. The pip CUDA wheels cause fbgemm.dll or cusparse64_11.dll errors on many Windows systems. Conda resolves all DLL dependencies automatically.

# Step 1 — Create environment
conda create -n llm_eval_gpu python=3.11 -y
conda activate llm_eval_gpu

# Step 2 — Install PyTorch via conda (CUDA 12.1)
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia

# Step 3 — Verify GPU
python -c "import torch; print('CUDA:', torch.cuda.is_available()); print('GPU:', torch.cuda.get_device_name(0))"

# Step 4 — Install bitsandbytes (pinned version)
pip install bitsandbytes==0.44.0

# Step 5 — Install remaining dependencies
pip install -r requirements.txt
pip install transformers -U

Windows CPU-only

conda create -n llm_eval python=3.11 -y
conda activate llm_eval

# PyTorch CPU wheel (max 2.3.x — 2.4+ causes fbgemm.dll errors on Windows CPU)
pip install torch==2.3.1 --index-url https://download.pytorch.org/whl/cpu
pip install "transformers==4.45.2"
pip install -r requirements.txt

Linux / macOS

conda create -n llm_eval_gpu python=3.11 -y
conda activate llm_eval_gpu

# GPU (CUDA 12.1, tested). Run either these two commands or the CPU line below, not both:
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install bitsandbytes==0.44.0

# CPU-only alternative:
pip install torch

pip install -r requirements.txt
pip install transformers -U

Quick Start

1. Set API keys

Windows PowerShell:

$env:OPENAI_API_KEY     = "sk-..."
$env:ANTHROPIC_API_KEY  = "sk-ant-..."
$env:GOOGLE_API_KEY     = "AIza..."
$env:DEEPSEEK_API_KEY   = "sk-..."
$env:OPENROUTER_API_KEY = "sk-or-..."

Linux / macOS:

export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GOOGLE_API_KEY="AIza..."
export DEEPSEEK_API_KEY="sk-..."
export OPENROUTER_API_KEY="sk-or-..."

Models with missing API keys are automatically skipped — you don't need all keys to run the framework.
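The skipping behavior can be pictured as a filter over the model entries. This is a sketch under the assumption that each entry names its key variable in `params.api_key_env`, as in the configuration examples; the function name is hypothetical.

```python
import os

def runnable_models(models: list[dict], env=None) -> list[dict]:
    """Keep models whose API key variable is set, or that need no key at all."""
    env = os.environ if env is None else env
    kept = []
    for model in models:
        key_var = model.get("params", {}).get("api_key_env")
        if key_var and not env.get(key_var):
            print(f"Skipping {model['name']}: {key_var} is not set")
            continue
        kept.append(model)
    return kept

models = [
    {"name": "GPT-4o-mini", "type": "openai", "params": {"api_key_env": "OPENAI_API_KEY"}},
    {"name": "Phi-2", "type": "local", "params": {"model_id": "microsoft/phi-2"}},
]
# Simulated environment with no keys: only the local model remains.
print([m["name"] for m in runnable_models(models, env={})])
```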

2. Run full evaluation

python main.py --config config/config.yaml

Or specify a custom config:

python main.py --config config/my_experiment.yaml

For a quick smoke test without API keys (~5 min):

python main.py --config config/config_test.yaml

Web Interface (No Code Required)

For users who prefer a graphical interface — researchers, clinicians, or domain experts without a coding background — the framework includes a Streamlit web app. Everything is point-and-click: no YAML, no terminal commands, no code.

Launch

pip install streamlit
cd my-llm-eval          # your project directory created by 'llm-eval setup'
streamlit run app.py

A browser window opens automatically.

What you can do in the browser

Step 1 — Dataset

Upload your own dataset (JSON or CSV), with a live preview tab, or
Pick a built-in benchmark (GSM8K, StrategyQA, MMLU, synthetic) and set the number of items with a slider

⚠️ CSV uploads map only question and answer columns — the RS (robustness) metric requires perturbations and is skipped for CSV datasets. Use the JSON format (see Custom Evaluation) to evaluate all six dimensions.
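If your data starts out as a CSV, one way to keep RS is to convert it to the JSON format first. The sketch below assumes a hypothetical column convention in which any column whose name starts with `perturbation` holds a rephrased question; the generated `id` values and the `type` label are likewise assumptions.

```python
import csv
import io
import json

def csv_to_items(csv_text: str, item_type: str = "custom") -> list[dict]:
    """Convert CSV rows into items in the JSON dataset format."""
    reader = csv.DictReader(io.StringIO(csv_text))
    items = []
    for i, row in enumerate(reader, start=1):
        perturbations = [
            value.strip()
            for column, value in row.items()
            if column and column.startswith("perturbation") and value and value.strip()
        ]
        item = {
            "id": f"q{i:03d}",
            "question": row["question"].strip(),
            "answer": row["answer"].strip(),   # answers stay strings
            "type": item_type,
        }
        if perturbations:                      # omit the field when there are none
            item["perturbations"] = perturbations
        items.append(item)
    return items

example = """question,answer,perturbation_1,perturbation_2
What is 2 + 2?,4,What does 2 plus 2 equal?,Calculate 2 + 2
"""
items = csv_to_items(example)
print(json.dumps(items, indent=2))
```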

Step 2 — Models

Enable/disable any model with a checkbox
Add new models with + Add New Model — choose the provider (OpenAI, Anthropic, Gemini, DeepSeek, local HuggingFace, mock), enter the model ID and API key directly in the browser
Edit temperature, max tokens, device, quantization, and base URL per model
A Mock Model is available for trying the interface without any API key

Step 3 — Aggregation Strategy

Choose a preset (Balanced, Clinical/Medical, Legal/Compliance, Accuracy Priority, Efficiency Priority), or
Build a Custom strategy with one slider per dimension (weights auto-normalized to 1.0)

Step 4 — Run

Click Run Evaluation and watch live progress with streaming logs
Inspect the generated YAML in the Config Preview tab
View results as per-model score cards and full tables (raw metrics + aggregated scores)
Download the complete Excel report with one click

Example: a clinician comparing models on their own cases

streamlit run app.py
Upload a JSON file of clinical questions and reference answers
Enable GPT-4o-mini and Claude, paste API keys
Select the Clinical/Medical preset
Click Run Evaluation and download the Excel report

Custom Evaluation: Your Own Dataset & Weights

Step 1 — Prepare your dataset

Create a JSON file (e.g. my_dataset.json):

[
  {
    "id": "q001",
    "question": "What is the capital of France?",
    "answer": "Paris",
    "type": "reasoning",
    "perturbations": [
      "Name the capital city of France.",
      "Which city serves as France's capital?",
      "What city is the capital of France?"
    ]
  }
]

Rules:

answer must be a string (matched by the CQ pipeline)

perturbations — rephrased versions of the same question, used for RS (omit to skip RS for that item)

type — any label you choose, used for grouping in output
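A quick pre-flight check along these lines can catch format problems before a long run. The validator below is a hypothetical helper, not part of the framework; the duplicate-id check is an extra assumption beyond the rules listed above.

```python
def validate_items(items: list[dict]) -> list[str]:
    """Return a list of human-readable problems; empty means the items look fine."""
    problems = []
    seen_ids = set()
    for index, item in enumerate(items):
        label = str(item.get("id", f"index {index}"))
        for field in ("question", "answer"):
            value = item.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"{label}: '{field}' must be a non-empty string")
        perturbations = item.get("perturbations")
        if perturbations is not None and not (
            isinstance(perturbations, list)
            and all(isinstance(p, str) for p in perturbations)
        ):
            problems.append(f"{label}: 'perturbations' must be a list of strings")
        if label in seen_ids:
            problems.append(f"{label}: duplicate id")
        seen_ids.add(label)
    return problems

good = {"id": "q001", "question": "What is the capital of France?", "answer": "Paris"}
bad = {"id": "q002", "question": "What is 2 + 2?", "answer": 4}   # answer is not a string
print(validate_items([good, bad]))
```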

Step 2 — Define your custom weights

Add a strategy to your config (auto-normalized if weights don't sum to 1.0):

aggregation:
  strategies:
    my_strategy:
      correctness:       0.50
      robustness:        0.30
      logical_coherence: 0.20
      consistency:       0.00
      efficiency:        0.00
      stability:         0.00

Step 3 — Run

llm-eval --config config/config_custom.yaml

Or do all of the above with zero code in the web interface.

Reproducing the Paper Results

The configuration below reproduces the experimental setup reported in the paper (975 items, 7 models, temperature = 0.7, max_new_tokens = 256, seed = 42):

experiment:
  name: "paper_reproduction"
  seed: 42
  deterministic: true        # overridden per model by each model's temperature setting
  output_dir: "outputs"

metrics:
  consistency_runs: 3        # K = 3
  robustness_perturbations: 3  # P = 3
  stability_runs: 3
  nli_model: "cross-encoder/nli-deberta-v3-small"
  bertscore_model: "distilbert-base-uncased"

datasets:
  - { name: "gsm8k",      params: { num_samples: 250 } }
  - { name: "mmlu",       params: { num_samples: 250 } }   # yields 225 after subject filtering
  - { name: "strategyqa", params: { num_samples: 250 } }
  - { name: "synthetic",  params: { num_samples: 250 } }

Every model entry must set temperature: 0.7 and max_new_tokens: 256 (API models: max_tokens: 256).
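Before starting a multi-hour run, it can help to confirm that every model entry matches these settings. The checker below is a hypothetical helper that works on the already-parsed `models` list (for example, the result of loading the YAML file); it assumes local models use `max_new_tokens` and all other types use `max_tokens`, as stated above.

```python
def check_paper_settings(models: list[dict]) -> list[str]:
    """List every model entry that deviates from temperature 0.7 / 256 tokens."""
    issues = []
    for model in models:
        params = model.get("params", {})
        limit_key = "max_new_tokens" if model.get("type") == "local" else "max_tokens"
        if params.get("temperature") != 0.7:
            issues.append(f"{model['name']}: temperature should be 0.7")
        if params.get(limit_key) != 256:
            issues.append(f"{model['name']}: {limit_key} should be 256")
    return issues

models = [
    {"name": "GPT-4o-mini", "type": "openai",
     "params": {"temperature": 0.7, "max_tokens": 256}},
    {"name": "Phi-2", "type": "local",
     "params": {"temperature": 0.7, "max_new_tokens": 64}},
]
print(check_paper_settings(models))
```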

⚠️ max_new_tokens matters for LS and SS. Final-answer extraction (CQ) works even with max_new_tokens: 64, but LS and SS are computed over the full reasoning trace. Truncating generation below 256 tokens shortens or removes traces, inflating LS via the single-step convention and distorting SS. Use 256 to reproduce the paper. Lower values are acceptable only for quick correctness-oriented smoke tests.

Runtime note for local models: Each item requires 1 + consistency_runs + robustness_perturbations inference calls (7 with defaults). At ~7–8 s/call on a GTX 1650 (float16), 975 items take approximately 13–15 hours per local model (975 items × 7 calls × 7–8 s). Reducing to consistency_runs: 2 and robustness_perturbations: 2 lowers this to 5 calls per item, or roughly 9.5–11 hours, at the cost of deviating from the paper setup.
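The estimate follows directly from the call count, and the same arithmetic can be reused for other hardware or dataset sizes. The helper name is hypothetical; the per-call timings are the ones quoted above.

```python
def estimated_hours(items: int, consistency_runs: int,
                    robustness_perturbations: int, sec_per_call: float) -> float:
    """Total wall-clock hours for one local model, ignoring load time."""
    calls_per_item = 1 + consistency_runs + robustness_perturbations
    return items * calls_per_item * sec_per_call / 3600

for runs, perturbations in [(3, 3), (2, 2)]:
    low = estimated_hours(975, runs, perturbations, 7.0)
    high = estimated_hours(975, runs, perturbations, 8.0)
    print(f"K={runs}, P={perturbations}: {low:.1f} to {high:.1f} hours")
```

With the quoted 7 to 8 s per call, this gives about 13.3 to 15.2 hours for the default setup and about 9.5 to 10.8 hours for the reduced one.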

Configuration

Everything is controlled from a single YAML file. The default is config/config.yaml.

Experiment settings

experiment:
  name: "my_experiment"     # Used as prefix for output folder name
  seed: 42                  # Random seed for reproducibility
  deterministic: true       # true = greedy decoding (temperature=0)
  output_dir: "outputs"     # Where results are saved

max_workers: 1              # Always set to 1 for local models — parallel loading
                            # causes meta tensor errors and CUDA OOM

Adding an API model

models:
  - name: "GPT-4o"          # Display name (appears in Excel / radar chart)
    type: "openai"           # openai | anthropic | gemini | deepseek | local | mock
    params:
      model_id: "gpt-4o"
      api_key_env: "OPENAI_API_KEY"   # Environment variable name
      max_tokens: 256
      temperature: 0.7       # Optional — overrides deterministic setting for CS/SS
      max_retries: 3
      timeout: 60

Adding a local HuggingFace model

models:
  - name: "Qwen2.5-1.5B"
    type: "local"
    params:
      model_id: "Qwen/Qwen2.5-1.5B-Instruct"
      device: "cuda"
      use_4bit: true                   # Attempts 4-bit; falls back to float16 if unsupported
      max_new_tokens: 256              # Use 256 for full-trace metrics (LS/SS); see note above
      temperature: 0.7

VRAM guide for local models:

Model size	`use_4bit`	VRAM needed
1.5B–2.7B	`false`	~4–6 GB (float32)
1.5B–2.7B	`true`	~1.5–2 GB (4-bit, may fall back to float16)
7B–8B	`true`	~5–6 GB (4-bit, required)

Note on 4-bit fallback: For small models (Qwen2.5-1.5B, Phi-2) on some hardware/driver configurations, 4-bit loading may fail with a meta tensor error. The framework catches this automatically and falls back to float16 CUDA. In that case the log fills with "copying from a non-meta parameter" warnings; these are expected and do not affect results.

Adding a custom aggregation strategy

aggregation:
  strategies:
    my_strategy:
      correctness:       0.50
      robustness:        0.30
      logical_coherence: 0.20
      consistency:       0.00
      efficiency:        0.00
      stability:         0.00

Weights are auto-normalized if they don't sum exactly to 1.0.

Temperature and CS/SS measurement

By default, deterministic: true sets temperature = 0. This yields CS = SS = 1.0 for every model, because greedy decoding always produces the same output; the score is a mathematical artifact, not a meaningful measurement.

To get meaningful CS/SS scores, add temperature: 0.7 per model:

experiment:
  deterministic: true    # Keep this — only the temperature param overrides it

models:
  - name: "GPT-4o-mini"
    type: "openai"
    params:
      model_id: "gpt-4o-mini"
      temperature: 0.7    # ← This overrides deterministic for this model only
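The precedence described here (a per-model temperature wins over the experiment-level flag) can be written as a small resolver. This is a sketch of the rule as documented, not the framework's code; `sampling_default` is a hypothetical value for the case where `deterministic` is false and no temperature is given, which this README does not specify.

```python
def effective_temperature(experiment: dict, model_params: dict,
                          sampling_default: float = 1.0) -> float:
    """Resolve the temperature actually used for one model."""
    if model_params.get("temperature") is not None:
        return float(model_params["temperature"])   # per-model override
    if experiment.get("deterministic", True):
        return 0.0                                  # greedy decoding
    return sampling_default                         # assumption, see lead-in

experiment = {"deterministic": True}
print(effective_temperature(experiment, {"model_id": "gpt-4o-mini", "temperature": 0.7}))
print(effective_temperature(experiment, {"model_id": "gpt-4o-mini"}))
```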

Outputs

All results are saved to outputs/<experiment_name>_<timestamp>/:

File	Description
`reasoning_quality_results.xlsx`	Full results: raw metrics, all aggregation strategies, per-dataset breakdown, metadata
`radar_plot.png`	Multi-dimensional radar chart — one polygon per model
`summary.json`	Complete results in machine-readable JSON
`<ModelName>_result.json`	Per-model detailed results

Excel structure

Overall Raw Metrics — one row per model, columns: CQ, CS, RS, LS, ES, SS
Aggregated Scores — composite Q scores per model × all seven aggregation strategies
Additional sheets — per-dataset breakdown and experiment metadata (config parameters, timestamps, dataset sizes)

Project Structure

LLM-Reasoning-Quality-Evaluation-Metrics/
│
├── config/
│   ├── config.yaml             ← Main config: add models/datasets/strategies here
│   └── config_test.yaml        ← Quick test (mock + Phi-2 + synthetic, ~5 min)
│
├── models/
│   ├── base_model.py           ← Abstract base class (cache, interface)
│   │                             Cache disabled for stochastic models (temperature>0)
│   ├── openai_model.py         ← GPT-4o-mini, GPT-4o, any OpenAI-compatible API
│   ├── anthropic_model.py      ← Claude models
│   ├── gemini_model.py         ← Gemini models
│   ├── deepseek_model.py       ← DeepSeek (OpenAI-compatible endpoint)
│   ├── local_model.py          ← HuggingFace local models
│   │                             4-bit quantization with float16 fallback
│   │                             Sequential RAM management (one model at a time)
│   │                             Pre-loading before evaluation loop (no per-item reload)
│   │                             Prompt templates per model family
│   └── mock_model.py           ← Deterministic mock for testing without APIs
│
├── llm_datasets/
│   ├── base_dataset.py         ← Abstract base + JSON file loader
│   ├── synthetic_dataset.py    ← Auto-generated reasoning/adversarial/robustness items
│   ├── gsm8k_dataset.py        ← GSM8K math word problems
│   ├── mmlu_dataset.py         ← MMLU multi-subject multiple choice
│   │                             Skips missing subjects gracefully
│   ├── strategyqa_dataset.py   ← StrategyQA commonsense yes/no
│   └── multi_dataset.py        ← Combines multiple datasets, tracks source per item
│
├── metrics/
│   ├── accuracy.py             ← CQ — 7-strategy fuzzy matching pipeline
│   ├── consistency.py          ← CS — pairwise agreement across K runs
│   ├── robustness.py           ← RS — perturbation matching (conditioned on CQ)
│   ├── logical_consistency.py  ← LS — NLI contradiction detection
│   ├── efficiency.py           ← ES — harmonic mean of CQ and inverse token count
│   ├── explainability.py       ← SS — BERTScore across reasoning traces
│   └── aggregation.py          ← Weighted composite Q score, 7 built-in strategies
│
├── evaluation/
│   └── evaluator.py            ← Main pipeline: load → generate → 6 metrics → export
│
├── visualization/
│   └── radar_plot.py           ← Radar chart + grouped bar chart
│
├── utils/
│   ├── logger.py               ← Structured logging
│   ├── reproducibility.py      ← Seed setting across Python / NumPy / PyTorch
│   └── experiment_tracker.py   ← JSON + Excel export, result aggregation
│
├── outputs/                    ← Auto-created; all results go here
├── app.py                      ← Streamlit web interface (streamlit run app.py)
├── requirements.txt
└── main.py                     ← Entry point; config parsing + model/dataset registration

(Installed via PyPI, the package additionally provides the `llm-eval` CLI:
`llm-eval setup` copies main.py, app.py and config/ into your working directory.)

Adding New Models

Option A — Config only (API models)

For any OpenAI-compatible API (OpenRouter, Groq, etc.):

- name: "My-Model"
  type: "openai"
  params:
    model_id: "my-model-id"
    api_key_env: "MY_API_KEY"
    max_tokens: 256

For HuggingFace local models:

- name: "My-Local-Model"
  type: "local"
  params:
    model_id: "org/model-name"
    device: "cuda"
    use_4bit: true            # Falls back to float16 if unsupported
    max_new_tokens: 256
    temperature: 0.7

Option B — Custom model class

Create models/my_model.py extending BaseModel
Implement generate(prompt) and generate_with_trace(prompt)
Add a _build_mytype() function in main.py
Register in the MODEL_BUILDERS dict in main.py
Use type: "mytype" in config
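The steps above might look roughly like the following. To stay self-contained, the sketch defines its own stand-in `BaseModel`; the real base class in models/base_model.py may have a different constructor and return types. In particular, the `(answer, trace)` tuple returned by `generate_with_trace` and the `(name, params)` signature of the builder are assumptions.

```python
from abc import ABC, abstractmethod

class BaseModel(ABC):
    """Stand-in for models/base_model.py; the real interface may differ."""
    def __init__(self, name: str):
        self.name = name

    @abstractmethod
    def generate(self, prompt: str) -> str: ...

    @abstractmethod
    def generate_with_trace(self, prompt: str): ...

class EchoModel(BaseModel):
    """Toy model that repeats the prompt; replace the bodies with real inference."""
    def __init__(self, name: str, prefix: str = "Answer:"):
        super().__init__(name)
        self.prefix = prefix

    def generate(self, prompt: str) -> str:
        return f"{self.prefix} {prompt}"

    def generate_with_trace(self, prompt: str):
        answer = self.generate(prompt)
        trace = "Step 1: read the prompt. Step 2: echo it back."
        return answer, trace          # assumed return shape

def _build_mytype(name: str, params: dict) -> BaseModel:
    """Builder that would live in main.py (hypothetical signature)."""
    return EchoModel(name, **params)

MODEL_BUILDERS = {"mytype": _build_mytype}   # maps config 'type' to a builder

model = MODEL_BUILDERS["mytype"]("Echo", {"prefix": "Answer:"})
print(model.generate("2 + 2"))
```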

Adding New Datasets

Option A — JSON file (no code needed)

Prepare a JSON file with this structure:

[
  {
    "id": "q001",
    "question": "What is 2 + 2?",
    "answer": "4",
    "type": "reasoning",
    "perturbations": [
      "What does 2 plus 2 equal?",
      "Calculate 2 + 2",
      "Find the sum of 2 and 2"
    ]
  }
]

Then add to config:

datasets:
  - name: "my_dataset"
    type: "json"
    params:
      path: "data/my_questions.json"
      num_samples: 100

The perturbations field is used for the RS (robustness) metric. If omitted, robustness is skipped for that item.

Option B — HuggingFace dataset class

Create llm_datasets/my_dataset.py extending BaseDataset
Implement the load() method to populate self._data
Register the type in main.py

Known Issues & Platform Notes

Windows GPU — DLL errors (fbgemm.dll / cusparse64_11.dll)

Both errors share the same cause: pip CUDA wheels have DLL dependency issues on many Windows systems.

Fix — use conda install instead of pip install:

conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia

⚠️ Do NOT install torch 2.10.x. It breaks torchvision/torchaudio compatibility and reintroduces DLL errors. If you accidentally upgrade, restore with:
pip uninstall torch torchvision torchaudio bitsandbytes -y
pip cache purge
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install bitsandbytes==0.44.0

bitsandbytes version compatibility

bitsandbytes	torch	Status
0.44.0	2.4.0 + CUDA 12.1	✅ Tested, works
0.49.x	2.4.0	❌ Incompatible — causes CUDA errors
any	2.10.x	❌ Do not use torch 2.10.x

Windows CPU — PyTorch version

PyTorch 2.4+ causes fbgemm.dll errors with Windows CPU pip wheels. Use 2.3.x for CPU-only:

pip install torch==2.3.1 --index-url https://download.pytorch.org/whl/cpu
pip install "transformers==4.45.2"

transformers version (CPU-only systems)

transformers >= 4.46 requires torch >= 2.4. On Windows CPU where 2.4 cannot be installed, pin transformers to 4.45.2. On GPU systems with PyTorch 2.4+, install the latest transformers freely.

Local model — meta tensor / copying from non-meta parameter warnings

During 4-bit model loading you may see many warnings like:

UserWarning: for model.layers.X...: copying from a non-meta parameter in the
checkpoint to a meta parameter in the current model, which is a no-op.

This is expected and harmless. It means the 4-bit loading path was attempted but fell back to float16 CUDA. The model loads correctly in float16 and inference proceeds normally.

Local model — 4-bit fallback to float16

For small models (Qwen2.5-1.5B, Phi-2), 4-bit quantization may fail on some hardware with the error "Cannot copy out of meta tensor; no data!". The framework catches this and automatically falls back to float16 CUDA. This is not a failure; evaluation continues normally. float16 uses slightly more VRAM (~3 GB for 1.5B) but works reliably on GTX 1650.

Local model — workers must be 1

The evaluator automatically detects local models and forces workers=1 regardless of the max_workers config setting. Running multiple local model workers causes repeated HuggingFace downloads, CUDA OOM, and meta tensor errors. This is by design.

Flash attention warning

The warning "Torch was not compiled with flash attention." is harmless on GTX 1650 (Turing architecture). The model uses standard scaled dot-product attention instead. Flash attention requires Ampere or newer (RTX 3000+).

MMLU — missing subjects

Subjects not present in cais/mmlu are logged and skipped automatically; MMLU loads 225 items from the 9 available reasoning subjects. This is expected and matches the paper.

Mistral / LLaMA tokenizer error

If tokenizer loading fails with "Cannot instantiate this tokenizer from a slow version... sentencepiece", install the missing package:

pip install sentencepiece

CS = SS = 1.0 for all models

This happens when deterministic: true is set and no temperature is given per model. The cache returns the same response for all K runs. Fix: add temperature: 0.7 to each model in config. See Temperature and CS/SS measurement.

Citation

If you use this framework in your research, please cite:

@article{senol2026reasoning,
  title         = {Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework},
  author        = {Şenol, Ali and Agrawal, Garima and Liu, Huan},
  year          = {2026},
  eprint        = {2605.24661},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url           = {https://arxiv.org/abs/2605.24661}
}

License

MIT License — see LICENSE for details.
