A multi-dimensional behavioral framework for evaluating LLM reasoning quality beyond accuracy

LLM Reasoning Quality Evaluation Framework

A config-driven, multi-dimensional framework for evaluating reasoning quality in Large Language Models — beyond simple answer correctness.

6 metrics · 7 models (4 API + 3 local) · 4 benchmark datasets · no code changes needed to add models or datasets

Overview
Metrics
Installation
Quick Start
Custom Evaluation
Models
Datasets
Configuration
Outputs
Project Structure
Adding New Models
Adding New Datasets
Known Issues & Platform Notes
Citation

Overview

Standard LLM evaluation asks: "Is the answer correct?"

This framework asks: "How well does the model reason?"

It evaluates models across 6 complementary dimensions of reasoning quality, producing a composite score that captures correctness, behavioral stability, robustness, logical integrity, and efficiency simultaneously.

Q = f(CQ, CS, RS, LS, ES, SS)

The framework is fully config-driven — models, datasets, metrics, and aggregation weights are all controlled from a single YAML file. No code changes are needed for common use cases.

Metrics

Symbol	Name	Formula	What It Measures
CQ	Correctness	`(1/N) Σ I(ŷᵢ = yᵢ)`	Fraction of correct answers
CS	Consistency	`(2/K(K−1)) Σ I(ŷᵢ⁽ᵏ⁾ = ŷᵢ⁽ˡ⁾)`	Same answer across K repeated runs (pairwise)?
RS	Robustness	`(1/|C|) Σ (1/P) Σ I(ŷᵢ = ŷᵢᵖ)`	Same answer on semantically equivalent rephrases?
LS	Local Logical Coherence	`1 − (1/N) Σ (1/(nᵢ−1)) Σ ψ(sⱼ, sⱼ₊₁)`	No contradictions between consecutive reasoning steps?
ES	Efficiency	Harmonic mean of CQ and inverse normalized token count	Correct and concise?
SS	Stability	`(2/K(K−1)) Σ BERTScore(Tᵢ⁽ᵏ⁾, Tᵢ⁽ˡ⁾)`	Same reasoning process across K runs?

Key design decisions

CQ — Multi-strategy matching pipeline: Raw model outputs are often verbose (e.g. "John has 8 apples." instead of "8"). The correctness metric applies 7 sequential matching strategies before marking an answer wrong: exact match → normalized → number extraction → yes/no extraction → A/B/C/D extraction → substring match → numeric tolerance. This prevents local models from being penalized purely for output format.
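The cascade described above can be illustrated with a minimal sketch. It is not the package's implementation: the helper names (`normalize`, `last_number`, `is_correct`), the regular expressions, the choice of the last number in the output, and the folding of numeric tolerance into the number-extraction step are all assumptions made for illustration.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and strip surrounding whitespace and trailing punctuation."""
    return re.sub(r"[\s.,!?]+$", "", text.strip().lower())

def last_number(text: str):
    """Return the last number found in the text as a float, or None."""
    found = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(found[-1]) if found else None

def is_correct(prediction: str, gold: str, tol: float = 1e-6) -> bool:
    """Try progressively looser matches before declaring the answer wrong."""
    pred, ref = normalize(prediction), normalize(gold)
    if prediction == gold or pred == ref:        # exact match, then normalized match
        return True
    gold_num = last_number(ref)
    if gold_num is not None:                     # number extraction with tolerance
        pred_num = last_number(pred)
        return pred_num is not None and abs(pred_num - gold_num) <= tol
    if ref in ("yes", "no"):                     # yes/no extraction
        match = re.search(r"\b(yes|no)\b", pred)
        return match is not None and match.group(1) == ref
    if ref in ("a", "b", "c", "d"):              # A/B/C/D extraction
        match = re.search(r"\b([A-D])\b", prediction)
        return match is not None and match.group(1).lower() == ref
    return ref in pred                           # substring match

print(is_correct("John has 8 apples.", "8"))     # True
print(is_correct("The answer is (B).", "B"))     # True
print(is_correct("It is Lyon.", "Paris"))        # False
```

The verbose answer "John has 8 apples." is accepted because the numeric step runs before the answer is marked wrong, which is the behavior the paragraph above describes.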

RS — Conditioned on correctness: Robustness is computed only over questions the model originally answered correctly. Without this condition, a model that gives the same wrong answer to every rephrasing would trivially score RS=1.0.
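A small sketch of the conditioning, following the RS formula in the metrics table. The item layout (`gold`, `original`, `perturbed`), the plain string comparison, and the return value of 0.0 when no item is eligible are assumptions for illustration, not the framework's actual data model.

```python
def robustness(items):
    """RS over items with keys 'gold', 'original' and 'perturbed' (a list of answers)."""
    # Keep only items answered correctly on the original question that also
    # have paraphrase answers to compare against.
    eligible = [it for it in items if it["original"] == it["gold"] and it["perturbed"]]
    if not eligible:
        return 0.0  # assumption: nothing eligible means no robustness credit
    per_item = [
        sum(ans == it["original"] for ans in it["perturbed"]) / len(it["perturbed"])
        for it in eligible
    ]
    return sum(per_item) / len(eligible)

items = [
    # Correct on the original question, stable on 2 of 3 paraphrases.
    {"gold": "60", "original": "60", "perturbed": ["60", "60", "58"]},
    # Perfectly stable but wrong: excluded by the conditioning.
    {"gold": "Paris", "original": "Lyon", "perturbed": ["Lyon", "Lyon", "Lyon"]},
]
print(robustness(items))
```

Only the first item counts, so the score is 2/3. Without the conditioning, the consistently wrong second item would pull the average up.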

LS — NLI-based contradiction detection: Uses cross-encoder/nli-deberta-v3-small to detect contradictions between consecutive reasoning steps. Single-sentence responses receive LS=1.0 by convention — no internal contradiction is possible. High LS reflects absence of detected local contradiction, not deep semantic validity.
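The LS formula from the metrics table can be sketched with the contradiction judge passed in as a function. In the framework that judge is the NLI cross-encoder; here a keyword stub stands in for it so the example runs without downloading a model. The function name `local_coherence` and the stub are hypothetical.

```python
from typing import Callable, List

def local_coherence(responses: List[List[str]],
                    contradicts: Callable[[str, str], bool]) -> float:
    """LS = 1 - mean over responses of the fraction of contradicting consecutive step pairs."""
    if not responses:
        raise ValueError("need at least one response")
    penalties = []
    for steps in responses:
        if len(steps) < 2:
            penalties.append(0.0)  # single-step response: LS = 1.0 by convention
            continue
        pairs = list(zip(steps, steps[1:]))
        penalties.append(sum(contradicts(a, b) for a, b in pairs) / len(pairs))
    return 1.0 - sum(penalties) / len(penalties)

def stub_judge(premise: str, hypothesis: str) -> bool:
    """Toy stand-in for an NLI model: flags a step that negates its predecessor."""
    return " not " in f" {hypothesis} "

responses = [
    ["x is 2", "x is not 2", "so y is 4"],  # one contradiction in two pairs
    ["the answer is 4"],                    # single step, no penalty
]
print(local_coherence(responses, stub_judge))  # 0.75
```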

ES — Harmonic mean: The harmonic mean is dominated by the smaller of its two inputs, so neither a short-but-wrong nor a long-but-correct response scores well. Both correctness and conciseness must be high for ES to be high.
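To see why the harmonic mean is used, the sketch below compares it with a plain average. It takes a conciseness score in [0, 1] as given; how the framework normalizes token counts into that score is not shown here, and the example values are illustrative only.

```python
def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two scores in [0, 1]; 0.0 when both are zero."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)

cases = [
    ("long but correct", 0.9, 0.1),
    ("short but wrong", 0.1, 0.9),
    ("short and correct", 0.9, 0.9),
]
for label, cq, conciseness in cases:
    hm = harmonic_mean(cq, conciseness)
    am = (cq + conciseness) / 2
    print(f"{label}: harmonic={hm:.2f} arithmetic={am:.2f}")
```

The two unbalanced cases get 0.18 from the harmonic mean but 0.50 from the arithmetic mean, while the balanced case gets 0.90 from both.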

SS — BERTScore similarity: Measures semantic similarity between reasoning traces across runs, not just whether the final answer matches. Falls back to Jaccard similarity if bert-score is not installed.
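As a rough illustration of the fallback path, the sketch below averages Jaccard similarity over all pairs of reasoning traces, mirroring the pairwise form of the SS formula. Lowercased whitespace tokenization and the function names are assumptions; the package may tokenize differently.

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two texts."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty traces are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def stability(traces) -> float:
    """Mean pairwise similarity over K traces for one question."""
    pairs = list(combinations(traces, 2))
    if not pairs:
        raise ValueError("need at least two traces")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

traces = ["add 2 and 3", "add 3 and 2", "multiply 2 by 3"]
print(round(stability(traces), 3))  # 0.556
```

The first two traces share every token (similarity 1), and each shares 2 of 6 distinct tokens with the third, giving (1 + 1/3 + 1/3) / 3 = 5/9.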

CS/SS and temperature: Running with deterministic: true (temperature=0) produces CS=SS=1.0 for all models — this is a mathematical artifact, not a real measurement. Set temperature: 0.7 per model in config to get meaningful CS/SS scores.
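The artifact follows directly from the pairwise definition of CS. In the sketch below (a hypothetical helper using plain string equality), identical answers from greedy decoding give 1.0 by construction, whereas sampled answers can disagree.

```python
from itertools import combinations

def consistency(answers) -> float:
    """Fraction of answer pairs that agree across K repeated runs of one question."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        raise ValueError("need at least two runs")
    return sum(a == b for a, b in pairs) / len(pairs)

print(consistency(["60", "60", "60"]))  # 1.0, what temperature=0 produces
print(consistency(["60", "60", "58"]))  # one agreeing pair out of three
```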

Aggregation strategies

Seven built-in weighting schemes are computed for every experiment. All appear as separate columns in the Excel output.

Strategy	CQ	CS	RS	LS	ES	SS	Use case
Balanced	1/6	1/6	1/6	1/6	1/6	1/6	General comparison
Safety Priority	0.30	0.20	0.30	0.10	0.05	0.05	High-stakes deployment
Accuracy Priority	0.40	0.25	0.15	0.10	0.05	0.05	Accuracy-critical tasks
Efficiency Priority	0.20	0.15	0.15	0.10	0.30	0.10	Resource-constrained deployment
Medical Triage	0.40	0.05	0.30	0.20	0.03	0.02	Clinical decision support
Legal/Compliance	0.15	0.25	0.20	0.35	0.03	0.02	Audit-sensitive applications
Edge Device/IoT	0.30	0.03	0.10	0.05	0.50	0.02	Resource-limited edge deployment

Custom strategies can be added directly in config.yaml — no code changes needed.
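For intuition, a weighted composite with auto-normalized weights might look like the sketch below. The metric keys match the ones used in the config; the function name `composite_score` and the example scores are made up for illustration and are not results from any model.

```python
METRICS = ("correctness", "consistency", "robustness",
           "logical_coherence", "efficiency", "stability")

def composite_score(scores: dict, weights: dict) -> float:
    """Weighted sum of the six metrics, with weights rescaled to sum to 1."""
    total = sum(weights.get(m, 0.0) for m in METRICS)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return sum(scores[m] * weights.get(m, 0.0) / total for m in METRICS)

# Illustrative scores only.
scores = {"correctness": 0.80, "consistency": 0.90, "robustness": 0.70,
          "logical_coherence": 0.95, "efficiency": 0.60, "stability": 0.85}

balanced = {m: 1 / 6 for m in METRICS}
safety_priority = {"correctness": 0.30, "consistency": 0.20, "robustness": 0.30,
                   "logical_coherence": 0.10, "efficiency": 0.05, "stability": 0.05}

print(round(composite_score(scores, balanced), 4))         # 0.8
print(round(composite_score(scores, safety_priority), 4))  # 0.7975
```

Because the weights are divided by their sum, doubling every weight in a strategy leaves its score unchanged.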

Installation

Option 1 — Install via pip (recommended)

pip install llm-reasoning-quality

Then set up your project directory:

mkdir my-llm-eval
cd my-llm-eval
llm-eval setup

llm-eval setup copies all config files, main.py, and .env.example to your directory. Then:

# Quick test — no API keys or GPU needed (~2 min)
python main.py --config config/config_test.yaml

# Full evaluation
python main.py --config config/config.yaml

Option 2 — Install from source (for development)

Prerequisites

Python 3.11 (recommended)
Miniconda or Anaconda

⚠️ PyTorch must be installed first and separately — platform-specific instructions below. Do not run pip install -r requirements.txt before PyTorch is installed.

Windows + NVIDIA GPU (tested & recommended)

Tested configuration: GTX 1650 4 GB · CUDA 12.1 · Python 3.11 · PyTorch 2.4.0 · bitsandbytes 0.44.0

⚠️ Use conda install for PyTorch on Windows — not pip install torch --index-url. The pip CUDA wheels cause fbgemm.dll or cusparse64_11.dll errors on many Windows systems. Conda resolves all DLL dependencies automatically.

Step 1 — Create environment:

conda create -n llm_eval_gpu python=3.11 -y
conda activate llm_eval_gpu

Step 2 — Install PyTorch via conda (CUDA 12.1):

conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia

Step 3 — Verify GPU:

python -c "import torch; print('CUDA:', torch.cuda.is_available()); print('GPU:', torch.cuda.get_device_name(0))"

Step 4 — Install bitsandbytes:

pip install bitsandbytes==0.44.0

Step 5 — Install remaining dependencies:

pip install -r requirements.txt
pip install transformers -U

Windows CPU-only

conda create -n llm_eval python=3.11 -y
conda activate llm_eval
pip install torch==2.3.1 --index-url https://download.pytorch.org/whl/cpu
pip install "transformers==4.45.2"
pip install -r requirements.txt

⚠️ Do NOT use PyTorch 2.4+ on Windows CPU — it causes fbgemm.dll errors.

Linux / macOS

conda create -n llm_eval_gpu python=3.11 -y
conda activate llm_eval_gpu

# GPU (CUDA 12.1):
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install bitsandbytes==0.44.0

# CPU:
pip install torch

pip install -r requirements.txt
pip install transformers -U

Quick Start

1. Set API keys

Windows PowerShell:

$env:OPENAI_API_KEY    = "sk-..."
$env:ANTHROPIC_API_KEY = "sk-ant-..."
$env:GOOGLE_API_KEY    = "AIza..."
$env:DEEPSEEK_API_KEY  = "sk-..."
$env:GROQ_API_KEY      = "gsk_..."

Linux / macOS:

export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GOOGLE_API_KEY="AIza..."
export DEEPSEEK_API_KEY="sk-..."
export GROQ_API_KEY="gsk_..."

Models with missing API keys are automatically skipped — you don't need all keys to run the framework.
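Before a long run it can help to check which configured models would be skipped. The sketch below reads the `api_key_env` field used in the config examples. The function name is hypothetical, and treating models without `api_key_env` as runnable is an assumption about how local and mock models are handled.

```python
import os

def split_by_key(models, env=os.environ):
    """Return (runnable, skipped) model names based on which key variables are set."""
    runnable, skipped = [], []
    for model in models:
        key_var = model.get("params", {}).get("api_key_env")
        # Assumption: models without api_key_env (local, mock) need no key.
        if key_var is None or env.get(key_var):
            runnable.append(model["name"])
        else:
            skipped.append(model["name"])
    return runnable, skipped

models = [
    {"name": "GPT-4o-mini", "params": {"api_key_env": "OPENAI_API_KEY"}},
    {"name": "Qwen2.5-1.5B", "params": {"device": "cuda"}},
]
print(split_by_key(models, env={"OPENAI_API_KEY": ""}))
```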

2. Quick test (no API keys needed)

python main.py --config config/config_test.yaml

Uses a mock model + synthetic dataset. Runs in ~2 minutes on any machine, no API keys or GPU required.

3. Run full evaluation

python main.py --config config/config.yaml

Or specify a custom config:

python main.py --config config/my_experiment.yaml

Custom Evaluation: Your Own Dataset & Weights

Step 1 — Prepare your dataset

Create a JSON file (e.g. my_dataset.json):

[
  {
    "id": "q001",
    "question": "What is the capital of France?",
    "answer": "Paris",
    "type": "reasoning",
    "perturbations": [
      "Name the capital city of France.",
      "Which city serves as France's capital?",
      "What city is the capital of France?"
    ]
  },
  {
    "id": "q002",
    "question": "If a train travels 120 km in 2 hours, what is its speed?",
    "answer": "60",
    "type": "reasoning",
    "perturbations": [
      "A train covers 120 km in 2 hours. Find its speed.",
      "What speed does a train reach if it travels 120 km in 2 hours?",
      "Calculate the speed of a train that goes 120 km in 2 hours."
    ]
  }
]

Rules:

answer must be a string (it is matched against the model output for CQ)

perturbations — 3 rephrased versions of the same question (used for RS)

type — any label you choose (used for grouping in output)
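A quick pre-flight check of a dataset file against these rules could look like the sketch below. The helper names are hypothetical, and the checks cover only what the rules above state: string `question` and `answer` fields, and `perturbations` as a list of strings when present.

```python
import json

def validate_items(items):
    """Return a list of readable problems; an empty list means the items look usable."""
    problems = []
    for idx, item in enumerate(items):
        label = item.get("id", f"item #{idx}")
        for field in ("question", "answer"):
            if not isinstance(item.get(field), str):
                problems.append(f"{label}: '{field}' must be a string")
        perturbations = item.get("perturbations", [])
        if not (isinstance(perturbations, list)
                and all(isinstance(p, str) for p in perturbations)):
            problems.append(f"{label}: 'perturbations' must be a list of strings")
    return problems

def validate_file(path):
    """Load a JSON dataset file and validate its items."""
    with open(path, encoding="utf-8") as fh:
        return validate_items(json.load(fh))

# A numeric answer (60 instead of "60") is the most common mistake.
print(validate_items([{"id": "q002", "question": "Speed?", "answer": 60}]))
```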

Step 2 — Define your custom weights

Create config/config_custom.yaml:

experiment:
  name: "my_custom_eval"
  seed: 42
  output_dir: "outputs"

models:
  - name: "GPT-4o-mini"
    type: "openai"
    params:
      model_id: "gpt-4o-mini"
      api_key_env: "OPENAI_API_KEY"
      max_tokens: 256
      temperature: 0.7

datasets:
  - name: "my_dataset"
    type: "json"
    params:
      path: "my_dataset.json"
      num_samples: 100

metrics:
  consistency_runs: 3
  robustness_perturbations: 3
  stability_runs: 3
  nli_model: "cross-encoder/nli-deberta-v3-small"
  bertscore_model: "distilbert-base-uncased"

aggregation:
  strategies:
    balanced:
      correctness:       0.1667
      consistency:       0.1667
      robustness:        0.1667
      logical_coherence: 0.1667
      efficiency:        0.1667
      stability:         0.1667
    my_custom_strategy:
      correctness:       0.40
      logical_coherence: 0.30
      robustness:        0.15
      consistency:       0.10
      efficiency:        0.03
      stability:         0.02

Weight rules:

Six keys: correctness, consistency, robustness, logical_coherence, efficiency, stability

Weights are auto-normalized — they don't need to sum exactly to 1.0

Add as many custom strategies as you need

Step 3 — Run evaluation

python main.py --config config/config_custom.yaml

Step 4 — Read your results

Results are saved to outputs/my_custom_eval_<timestamp>/:

File	What it contains
`reasoning_quality_results.xlsx`	CQ, CS, RS, LS, ES, SS scores + your custom strategy scores
`radar_plot.png`	Visual comparison across all 6 dimensions
`summary.json`	Full results in machine-readable format

In the Excel file, your custom strategy appears as a separate column next to the built-in strategies.

Models

#	Model	Provider	Type	Parameters	RAM estimate
1	GPT-4o-mini	OpenAI	API	—	—
2	Gemini 2.0 Flash	Google	API	—	—
3	DeepSeek-V3	DeepSeek AI	API	—	—
4	Groq LLaMA-3.3-70B	Groq	API (OpenAI-compatible)	—	—
—	Claude Haiku 4.5	Anthropic	API	—	—
5	Phi-2	Microsoft	Local (HF)	2.7B	~6 GB (float32)
6	Qwen2.5-1.5B-Instruct	Alibaba	Local (HF)	1.5B	~4 GB (float32)
7	Mistral-7B-Instruct-v0.3	Mistral AI	Local (HF)	7B	~5 GB (4-bit)
8	LLaMA-3-8B-Instruct	Meta	Local (HF)	8B	~6 GB (4-bit)

Local models are loaded one at a time and released from RAM before the next model loads — allowing evaluation on machines without enough RAM to hold all models simultaneously.

HuggingFace models are downloaded automatically on first run and cached in ~/.cache/huggingface/.

Datasets

Dataset	Type	Size used	Answer format	Source
Synthetic	Auto-generated	Configurable	Mixed	Built-in
GSM8K	Math word problems	250 (default)	Numerical	`openai/gsm8k`
StrategyQA	Commonsense reasoning	250 (default)	Yes / No	`wics/strategy-qa`
MMLU	Multi-subject knowledge	225 (default)	A / B / C / D	`cais/mmlu`

Note: MMLU loads 225 items by default (not 250) because the moral_reasoning subject does not exist in the cais/mmlu dataset. The framework skips missing subjects automatically.

The synthetic dataset generates three categories of questions automatically: arithmetic/logic reasoning, adversarial questions (designed to expose brittle reasoning), and robustness items (paraphrase pairs).

Custom JSON datasets can also be added — see Adding New Datasets.

Configuration

Everything is controlled from a single YAML file. The default is config/config.yaml.

Experiment settings

experiment:
  name: "my_experiment"     # Used as prefix for output folder name
  seed: 42                  # Random seed for reproducibility
  deterministic: false      # true = greedy decoding (causes CS=SS=1.0 artifact)
  output_dir: "outputs"     # Where results are saved
  verbose: true

Adding an API model

models:
  - name: "GPT-4o"
    type: "openai"           # openai | anthropic | gemini | deepseek | local | mock
    params:
      model_id: "gpt-4o"
      api_key_env: "OPENAI_API_KEY"
      max_tokens: 512
      temperature: 0.7
      max_retries: 3
      timeout: 60

Adding a local HuggingFace model

models:
  - name: "Qwen2.5-1.5B"
    type: "local"
    params:
      model_id: "Qwen/Qwen2.5-1.5B-Instruct"
      device: "cuda"
      use_4bit: true
      max_new_tokens: 64
      trust_remote_code: true
      temperature: 0.7

RAM guide for local models:

Model size	`use_4bit`	VRAM needed
1.5B–2.7B	`false`	~4–6 GB (float32)
1.5B–2.7B	`true`	~1.5–2 GB (4-bit, may fall back to float16)
7B–8B	`true`	~5–6 GB (4-bit, required)

Note on 4-bit fallback: For small models on some hardware, 4-bit loading may fail and fall back to float16 on CUDA automatically. The "copying from a non-meta parameter" warnings printed in that case are expected and harmless.

Note on max_new_tokens: 64 tokens is sufficient for numerical answers (GSM8K), Yes/No (StrategyQA), and A/B/C/D (MMLU). Longer explanations may be truncated, but this does not affect metric scoring.

Metric settings

metrics:
  consistency_runs: 3           # K — number of repeated runs per question for CS
  robustness_perturbations: 3   # P — number of paraphrase variants per question for RS
  stability_runs: 3             # K — number of repeated runs per question for SS
  nli_model: "cross-encoder/nli-deberta-v3-small"
  bertscore_model: "distilbert-base-uncased"

Performance note for local models: Each item requires 1 + consistency_runs + robustness_perturbations inference calls, so the defaults (3 and 3) give 7 calls per item. At ~7–8 s per call on a GTX 1650 (float16), 975 items take ~12–15 hours per local model. Reducing to consistency_runs: 2 and robustness_perturbations: 2 (5 calls per item) brings this to ~8–10 hours.
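The estimate can be reproduced with a few lines of arithmetic. This is a rough linear model using only the call count stated above; it ignores model loading, caching, and the NLI/BERTScore passes, so treat the result as an order-of-magnitude guide.

```python
def estimated_hours(items: int, consistency_runs: int,
                    robustness_perturbations: int, seconds_per_call: float) -> float:
    """Rough wall-clock estimate for one local model."""
    calls_per_item = 1 + consistency_runs + robustness_perturbations
    return items * calls_per_item * seconds_per_call / 3600

# Defaults (3 + 3) versus the reduced setting (2 + 2), at 7.5 s per call.
print(f"{estimated_hours(975, 3, 3, 7.5):.1f} h")
print(f"{estimated_hours(975, 2, 2, 7.5):.1f} h")
```

Changing `seconds_per_call` to a timing measured on your own hardware gives a more useful figure than the GTX 1650 numbers.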

Adding a custom aggregation strategy

aggregation:
  strategies:
    my_strategy:
      correctness:       0.50
      robustness:        0.30
      logical_coherence: 0.20
      consistency:       0.00
      efficiency:        0.00
      stability:         0.00

Weights are auto-normalized if they don't sum exactly to 1.0.

Temperature and CS/SS measurement

By default, deterministic: false enables stochastic sampling. To get meaningful CS/SS scores, add temperature: 0.7 per model:

models:
  - name: "GPT-4o-mini"
    type: "openai"
    params:
      model_id: "gpt-4o-mini"
      temperature: 0.7    # ← enables stochastic sampling for CS/SS

Outputs

All results are saved to outputs/<experiment_name>_<timestamp>/:

File	Description
`reasoning_quality_results.xlsx`	Full results: raw metrics, all aggregation strategies, per-dataset breakdown, metadata
`radar_plot.png`	Multi-dimensional radar chart — one polygon per model
`summary.json`	Complete results in machine-readable JSON
`<ModelName>_result.json`	Per-model detailed results
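For post-processing, the newest run directory can be located and its summary.json loaded as in the sketch below. The helper name is hypothetical, the timestamp format in the folder name is not assumed (the newest folder is chosen by modification time), and nothing is assumed about the keys inside summary.json.

```python
import json
from pathlib import Path

def latest_summary(output_dir="outputs", experiment="my_experiment"):
    """Return (run_dir, parsed summary.json) for the most recently modified run."""
    runs = [p for p in Path(output_dir).glob(f"{experiment}_*")
            if (p / "summary.json").is_file()]
    if not runs:
        raise FileNotFoundError(f"no runs for '{experiment}' under {output_dir}")
    newest = max(runs, key=lambda p: p.stat().st_mtime)
    with open(newest / "summary.json", encoding="utf-8") as fh:
        return newest, json.load(fh)

# Usage, after at least one evaluation has finished:
#   run_dir, summary = latest_summary("outputs", "my_custom_eval")
#   print(run_dir, type(summary))
```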

Excel structure

The Excel file contains multiple sheets:

Overall Raw Metrics — one row per model, columns: CQ, CS, RS, LS, ES, SS
Aggregated Scores — all 7 aggregation strategy scores per model
Per-Dataset Breakdown — same metrics split by dataset (GSM8K, MMLU, etc.)
Experiment Metadata — config parameters, timestamps, dataset sizes

Project Structure

LLM-Reasoning-Quality-Evaluation-Metrics/
│
├── config/
│   ├── config.yaml             ← Main config: add models/datasets/strategies here
│   └── config_test.yaml        ← Quick test (mock + synthetic, ~2 min, no API needed)
│
├── models/
│   ├── base_model.py           ← Abstract base class (cache, interface)
│   ├── openai_model.py         ← GPT-4o-mini, GPT-4o, any OpenAI-compatible API
│   ├── anthropic_model.py      ← Claude models
│   ├── gemini_model.py         ← Gemini models
│   ├── deepseek_model.py       ← DeepSeek (OpenAI-compatible endpoint)
│   ├── local_model.py          ← HuggingFace local models (4-bit with float16 fallback)
│   └── mock_model.py           ← Deterministic mock for testing without APIs
│
├── llm_datasets/
│   ├── base_dataset.py         ← Abstract base + JSON file loader
│   ├── synthetic_dataset.py    ← Auto-generated reasoning/adversarial/robustness items
│   ├── gsm8k_dataset.py        ← GSM8K math word problems
│   ├── mmlu_dataset.py         ← MMLU multi-subject multiple choice
│   ├── strategyqa_dataset.py   ← StrategyQA commonsense yes/no
│   └── multi_dataset.py        ← Combines multiple datasets, tracks source per item
│
├── metrics/
│   ├── accuracy.py             ← CQ — 7-strategy fuzzy matching pipeline
│   ├── consistency.py          ← CS — pairwise agreement across K runs
│   ├── robustness.py           ← RS — perturbation matching (conditioned on CQ)
│   ├── logical_consistency.py  ← LS — NLI contradiction detection
│   ├── efficiency.py           ← ES — harmonic mean of CQ and inverse token count
│   ├── explainability.py       ← SS — BERTScore across reasoning traces
│   └── aggregation.py          ← Weighted composite Q score, 7 strategies
│
├── evaluation/
│   └── evaluator.py            ← Main pipeline: load → generate → 6 metrics → export
│
├── visualization/
│   └── radar_plot.py           ← Radar chart + grouped bar chart
│
├── utils/
│   ├── logger.py               ← Structured logging
│   ├── reproducibility.py      ← Seed setting across Python / NumPy / PyTorch
│   └── experiment_tracker.py   ← JSON + Excel export, result aggregation
│
├── llm_reasoning_quality/      ← PyPI package entry point
│   ├── __init__.py             ← run_evaluation() function
│   ├── cli.py                  ← llm-eval CLI (setup + run commands)
│   └── _data/                  ← Bundled config files for llm-eval setup
│
├── outputs/                    ← Auto-created; all results go here
├── pyproject.toml
├── requirements.txt
├── .env.example                ← Copy to .env and fill in API keys
└── main.py                     ← Entry point; config parsing + model/dataset registration

Adding New Models

Option A — Config only (API models)

For any OpenAI-compatible API:

- name: "My-Model"
  type: "openai"
  params:
    model_id: "my-model-id"
    api_key_env: "MY_API_KEY"
    max_tokens: 512

For HuggingFace local models:

- name: "My-Local-Model"
  type: "local"
  params:
    model_id: "org/model-name"
    device: "cuda"
    use_4bit: true
    max_new_tokens: 64
    temperature: 0.7

Option B — Custom model class

1. Create models/my_model.py extending BaseModel
2. Implement generate(prompt) and generate_with_trace(prompt)
3. Add a _build_mytype() function in main.py
4. Register in the MODEL_BUILDERS dict in main.py
5. Use type: "mytype" in config
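The steps above can be sketched as follows. Everything here is a stand-in: the real `BaseModel` in models/base_model.py may have a different constructor and return types, the `(answer, trace)` tuple returned by `generate_with_trace` is an assumption, and the builder signature `(name, params)` is a guess at how `MODEL_BUILDERS` entries are called. Check the source files before copying this pattern.

```python
from abc import ABC, abstractmethod

class BaseModel(ABC):
    """Stand-in for models/base_model.py; the real interface may differ."""
    def __init__(self, name: str):
        self.name = name

    @abstractmethod
    def generate(self, prompt: str) -> str: ...

    @abstractmethod
    def generate_with_trace(self, prompt: str): ...

class EchoModel(BaseModel):
    """Toy model that echoes the prompt; replace the body with real inference."""
    def __init__(self, name: str, prefix: str = "Answer:"):
        super().__init__(name)
        self.prefix = prefix

    def generate(self, prompt: str) -> str:
        return f"{self.prefix} {prompt}"

    def generate_with_trace(self, prompt: str):
        answer = self.generate(prompt)
        return answer, f"Echoed the prompt. {answer}"  # assumed (answer, trace) pair

def _build_echo(name: str, params: dict) -> BaseModel:
    """Hypothetical builder: turns the config 'params' block into a model instance."""
    return EchoModel(name, **params)

# Hypothetical registry shape; main.py holds the real MODEL_BUILDERS dict.
MODEL_BUILDERS = {"echo": _build_echo}

model = MODEL_BUILDERS["echo"]("Echo-Demo", {"prefix": "A:"})
print(model.generate("What is 2 + 2?"))
```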

Adding New Datasets

Option A — JSON file (no code needed)

Prepare a JSON file with this structure:

[
  {
    "id": "q001",
    "question": "What is 2 + 2?",
    "answer": "4",
    "type": "reasoning",
    "perturbations": [
      "What does 2 plus 2 equal?",
      "Calculate 2 + 2",
      "Find the sum of 2 and 2"
    ]
  }
]

Then add to config:

datasets:
  - name: "my_dataset"
    type: "json"
    params:
      path: "data/my_questions.json"
      num_samples: 100

The perturbations field is used for the RS metric. If it is omitted, robustness is skipped for that item.

Option B — HuggingFace dataset class

1. Create llm_datasets/my_dataset.py extending BaseDataset
2. Implement the load() method to populate self._data
3. Register the type in main.py

Known Issues & Platform Notes

Windows GPU — DLL errors (`fbgemm.dll` / `cusparse64_11.dll`)

Both errors share the same cause: pip CUDA wheels have DLL dependency issues on many Windows systems.

Fix — use conda install instead of pip install:

conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia

⚠️ Do NOT install torch 2.10.x. If it has already been installed, restore the tested setup with:

pip uninstall torch torchvision torchaudio bitsandbytes -y
pip cache purge
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install bitsandbytes==0.44.0

bitsandbytes version compatibility

bitsandbytes	torch	Status
0.44.0	2.4.0 + CUDA 12.1	✅ Tested, works
0.49.x	2.4.0	❌ Incompatible — causes CUDA errors
any	2.10.x	❌ Do not use torch 2.10.x

Windows CPU — PyTorch version

PyTorch 2.4+ causes fbgemm.dll errors with the Windows CPU pip wheels. Use 2.3.x:

pip install torch==2.3.1 --index-url https://download.pytorch.org/whl/cpu
pip install "transformers==4.45.2"

Local model — meta tensor warnings

During 4-bit model loading you may see:

UserWarning: copying from a non-meta parameter in the checkpoint to a meta parameter

This is expected and harmless — 4-bit loading fell back to float16. Evaluation continues normally.

Local model — workers must be 1

The evaluator automatically forces workers=1 for local models. Running multiple workers causes CUDA OOM and meta tensor errors. This is by design.

Flash attention warning

Torch was not compiled with flash attention.

Harmless on GTX 1650 (Turing architecture). Flash attention requires Ampere or newer (RTX 3000+).

CS = SS = 1.0 for all models

Happens when deterministic: true is set and no per-model temperature is given. Fix: add temperature: 0.7 to each model in config.

MMLU — `moral_reasoning` subject not found

[MMLU] Could not load subject 'moral_reasoning': BuilderConfig not found.

Expected — this subject doesn't exist in cais/mmlu. Framework skips it automatically. MMLU loads 225 items instead of 250.

Mistral / LLaMA tokenizer error

Cannot instantiate this tokenizer from a slow version... sentencepiece

Fix: pip install sentencepiece

Per-dataset breakdown is slow (NLI/BERTScore reload)

Fixed in current version: NLI/BERTScore models are now kept in memory across all per-dataset passes and released only once after all datasets are processed.

Citation

If you use this framework in your research, please cite:

@article{senol2026reasoning,
  title         = {Measuring Reasoning Quality in Large Language Models: A Multi-Dimensional Behavioral Framework},
  author        = {{\c{S}}enol, Ali and Agrawal, Garima and Liu, Huan},
  year          = {2026},
  eprint        = {2605.24661},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url           = {https://arxiv.org/abs/2605.24661}
}

License

MIT License — see LICENSE for details.

