
🧠 LLMThinkBench


A Framework for Evaluating Basic Math Reasoning Capabilities and Overthinking of Language Models

LLMThinkBench is a comprehensive framework designed to rigorously evaluate the basic math reasoning capabilities of Language Models, while also identifying instances of overthinking—where models apply unnecessarily complex logic to simple problems. Through standardized and reproducible benchmarks, it offers valuable insights into how well models perform on various reasoning tasks, from basic arithmetic to complex logical operations.

📰 News & Releases

  • v0.1.6 (Latest) - Enhanced reporting with efficiency rankings to identify models that achieve high accuracy with minimal tokens. Introduced Efficiency Score and Overthinking Ratio metrics that properly balance accuracy and verbosity.
  • v0.1.5 - Major backend revamp with robust error handling, intelligent fallback support, better token estimation, flexible device control, and smoother interruption handling.
  • v0.1.4 - Major improvements to parsing robustness for fair evaluations. Enhanced result validation mechanisms and edge case handling.
  • v0.1.3 - Added mean, median, mode tasks and implemented GPU customization options, allowing users to specify GPU memory allocation.
  • v0.1.2 - Expanded task library with find_maximum, find_minimum, absolute_difference, and division tasks. Improved documentation.
  • v0.1.1 - Fixed several inference issues and optimized performance using vLLM for high-throughput evaluation.
  • v0.1.0 - Initial release with core functionality including sorting and comparison tasks.

🌟 Key Features

  • Comprehensive Evaluation: Test LLMs on a range of mathematical and logical reasoning tasks
  • 🎯 Advanced Overthinking Detection: Novel efficiency metrics that balance accuracy and conciseness
  • 📊 Efficiency Rankings: Identify models that achieve high performance without excessive verbosity
  • Modular Architecture: Easily extend with custom evaluation tasks
  • Fair Comparison: Standardized methodology for comparing models
  • Efficient Inference: Built on vLLM for high-throughput batched evaluation
  • Detailed Metrics: Comprehensive reports on accuracy, instruction following, and output characteristics
  • Multi-GPU Support: Scale evaluations across multiple GPUs
  • Reproducible Results: Consistent methodology across model comparisons
  • Output Analysis: Identify when and how models make reasoning errors

🧮 Revolutionary Overthinking Metrics

Efficiency Score

Formula: Efficiency Score = 2 × (Accuracy × Token_Efficiency) / (Accuracy + Token_Efficiency)

Where:

  • Token_Efficiency = 1 - normalized_tokens
  • normalized_tokens = (tokens - min_tokens) / (max_tokens - min_tokens)

Benefits:

  • Balances accuracy and conciseness using harmonic mean
  • Penalizes verbosity while rewarding high accuracy
  • Provides intuitive ranking: Higher score = better model
  • Handles edge cases gracefully
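
To make the formula concrete, here is a minimal illustrative Python sketch (not the package's internal implementation); min_tokens and max_tokens are assumed to be the smallest and largest average token counts among the models or tasks being compared:

def efficiency_score(accuracy, tokens, min_tokens, max_tokens):
    """Harmonic mean of accuracy and token efficiency (illustrative sketch)."""
    if max_tokens == min_tokens:
        token_efficiency = 1.0  # every entry used the same token budget
    else:
        normalized_tokens = (tokens - min_tokens) / (max_tokens - min_tokens)
        token_efficiency = 1.0 - normalized_tokens
    if accuracy + token_efficiency == 0:
        return 0.0  # degenerate case: zero accuracy and maximal verbosity
    return 2 * (accuracy * token_efficiency) / (accuracy + token_efficiency)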

Overthinking Ratio

Formula: Overthinking Ratio = (tokens - baseline) / accuracy

Interpretation:

  • Lower values indicate less overthinking
  • Measures "extra tokens per accuracy point"
  • Helps identify models that are unnecessarily verbose
  • Infinite ratio for models with zero accuracy
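
A matching sketch for the Overthinking Ratio; baseline_tokens stands for whatever reference token count the report uses (assumed here, for illustration, to be the most concise run observed):

def overthinking_ratio(tokens, baseline_tokens, accuracy):
    """Extra tokens spent per unit of accuracy (illustrative sketch)."""
    if accuracy == 0:
        return float('inf')  # zero accuracy: the ratio is infinite by definition
    return (tokens - baseline_tokens) / accuracy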

Example Comparison

| Model   | Accuracy | Tokens | Efficiency Score | Overthinking Ratio |
|---------|----------|--------|------------------|--------------------|
| Model A | 98%      | 100    | 0.892            | 50.0               |
| Model B | 99%      | 120    | 0.881            | 70.7               |

Result: Model A is more efficient despite slightly lower accuracy!

📊 Supported Tasks

| Task Type        | Task                | Description                                                            |
|------------------|---------------------|------------------------------------------------------------------------|
| Basic Operations | Sorting             | Evaluates ability to correctly sort lists of numbers                   |
| Basic Operations | Comparison          | Tests number comparison abilities (greater than, less than, equal to)  |
| Basic Operations | Sum                 | Assesses ability to calculate the sum of multiple numbers              |
| Basic Operations | Subtraction         | Measures accuracy in subtracting two numbers                           |
| Basic Operations | Multiplication      | Tests multiplication of numbers                                        |
| Basic Operations | Division            | Evaluates division operations                                          |
| List Processing  | Find Maximum        | Finds the largest value in a list                                      |
| List Processing  | Find Minimum        | Identifies the smallest value in a list                                |
| List Processing  | Odd Count           | Counts odd numbers in a list                                           |
| List Processing  | Even Count          | Counts even numbers in a list                                          |
| Statistical      | Mean                | Calculates the arithmetic mean of a list                               |
| Statistical      | Median              | Finds the median value of a list                                       |
| Statistical      | Mode                | Identifies the most frequent value(s) in a list                        |
| Advanced         | Absolute Difference | Calculates the absolute difference between numbers                     |
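
Task identifiers passed on the command line or to the Python API are lowercase with underscores (for example find_maximum, absolute_difference, as listed in the release notes). A short illustrative selection, assuming the remaining identifiers follow the same convention:

from llmthinkbench import evaluate

# Evaluate a subset of list-processing and statistical tasks
results = evaluate(
    model_id="Qwen/Qwen2.5-1.5B-Instruct",
    tasks=["find_maximum", "find_minimum", "mean", "median", "mode"]
)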

🚀 Installation

# Install from PyPI
pip install llmthinkbench

# Install from source
git clone https://github.com/ctrl-gaurav/llmthinkbench.git
cd llmthinkbench
pip install -e .
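
To confirm the installation, you can query the installed version using only the standard library (no package-specific API assumed):

from importlib import metadata

# Prints the installed version, e.g. 0.1.6
print(metadata.version("llmthinkbench"))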

📈 Quick Start

Command Line Interface

# Basic usage
llmthinkbench --model_id "Qwen/Qwen2.5-1.5B-Instruct" --tasks sorting comparison

# Comprehensive evaluation
llmthinkbench --model_id "meta-llama/Llama-2-7b-chat-hf" \
  --tensor_parallel_size 2 \
  --tasks sorting comparison sum multiplication \
  --datapoints 1000 \
  --list_sizes 8 16 32 \
  --folds 3 \
  --range -1000 1000 \
  --store_details \
  --output_dir "./llama2_evaluation_results"

Python API

from llmthinkbench import evaluate

# Simple evaluation
results = evaluate(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    tasks=["sorting", "comparison", "sum"]
)

# Advanced configuration
results = evaluate(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    tasks=["sorting", "comparison", "sum", "multiplication"],
    datapoints=500,
    list_sizes=[8, 16, 32],
    folds=3,
    range=[-1000, 1000],
    store_details=True,
    output_dir="./custom_results",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)
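
The exact structure of the returned results depends on the tasks run; assuming it is JSON-serializable (like the reports written to output_dir), a quick way to archive it for later inspection:

import json

# Snapshot whatever evaluate() returned
# (assumes the results object is JSON-serializable, matching the saved reports)
with open("results_snapshot.json", "w") as f:
    json.dump(results, f, indent=2, default=str)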

Detailed API Usage

from llmthinkbench.models.model_handler import ModelHandler
from llmthinkbench.tasks.sorting_task import SortingTask
from llmthinkbench.tasks.comparison_task import ComparisonTask
from llmthinkbench.utils.reporting import generate_final_report

# Initialize model
model_handler = ModelHandler(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9
)

# Configure output directory
output_dir = "llama2_eval_results"

# Run sorting task
sorting = SortingTask(
    model_handler=model_handler,
    output_dir=output_dir,
    min_val=-100,
    max_val=100,
    num_folds=3,
    num_samples=500,
    store_details=True,
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)

# Evaluate multiple list sizes
list_sizes = [8, 16, 32]
sorting_metrics = sorting.run_evaluation(list_sizes)

# Run comparison task
comparison = ComparisonTask(
    model_handler=model_handler,
    output_dir=output_dir,
    min_val=-100,
    max_val=100,
    num_folds=3,
    num_samples=500,
    store_details=True,
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)

comparison_metrics = comparison.run_evaluation()

# Generate comprehensive report
all_metrics = sorting_metrics + comparison_metrics
report = generate_final_report(all_metrics, list_sizes, output_dir)

📊 Example Results

Below is an example report generated by LLMThinkBench v0.1.6:

+------------------+----------------------------+------------------+---------------------+-----------------------------------+-------------------------+------------------------------+------------------------------+
| Task             | Accuracy                   | Efficiency Score | Overthinking Ratio  | Instruction Followed              | Tokens                  | Chars                        | Words                        |
+------------------+----------------------------+------------------+---------------------+-----------------------------------+-------------------------+------------------------------+------------------------------+
| sorting_8        | 95.20% ± 3.60              | 0.892            | 50.0                | 98.80% ± 1.20                     | 186.2 ± 32.6            | 612.6 ± 98.4                 | 93.5 ± 15.6                  |
| sorting_16       | 87.40% ± 4.80              | 0.743            | 89.2                | 96.70% ± 2.30                     | 312.5 ± 48.7            | 982.3 ± 156.5                | 167.9 ± 26.9                 |
| sorting_32       | 68.60% ± 7.20              | 0.521            | 198.5               | 92.40% ± 3.50                     | 645.7 ± 92.2            | 1872.2 ± 283.6               | 348.8 ± 52.8                 |
| comparison       | 99.20% ± 1.20              | 0.951            | 44.1                | 99.60% ± 0.50                     | 93.8 ± 16.2             | 324.8 ± 52.2                 | 48.3 ± 8.1                   |
| sum_8            | 97.80% ± 2.10              | 0.923            | 35.2                | 99.30% ± 0.70                     | 134.6 ± 23.9            | 452.2 ± 78.3                 | 68.9 ± 11.7                  |
| multiplication   | 94.60% ± 3.50              | 0.885            | 47.9                | 98.40% ± 1.60                     | 114.3 ± 19.4            | 386.7 ± 64.3                 | 58.4 ± 9.7                   |
+------------------+----------------------------+------------------+---------------------+-----------------------------------+-------------------------+------------------------------+------------------------------+

🏆 Efficiency Rankings (Best Balance of Accuracy & Conciseness)
+--------+---------------+------------------+----------+--------+---------------------+
|  Rank  | Task          | Efficiency Score | Accuracy | Tokens | Overthinking Ratio  |
+--------+---------------+------------------+----------+--------+---------------------+
|   1    | comparison    | 0.951            | 99.2%    | 93.8   | 44.1                |
|   2    | sum_8         | 0.923            | 97.8%    | 134.6  | 35.2                |
|   3    | sorting_8     | 0.892            | 95.2%    | 186.2  | 50.0                |
|   4    | multiplication| 0.885            | 94.6%    | 114.3  | 47.9                |
|   5    | sorting_16    | 0.743            | 87.4%    | 312.5  | 89.2                |
|   6    | sorting_32    | 0.521            | 68.6%    | 645.7  | 198.5               |
+--------+---------------+------------------+----------+--------+---------------------+

📈 Visualization

You can visualize LLMThinkBench results including the new efficiency metrics:

import json
import matplotlib.pyplot as plt
import pandas as pd

# Load results
with open("final_report.json") as f:
    results = json.load(f)

# Extract efficiency rankings
rankings = results['efficiency_rankings']

# Create dataframe for plotting
df = pd.DataFrame(rankings)

# Plot efficiency comparison
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))

# Efficiency Score vs Accuracy
scatter = ax1.scatter(df['accuracy']*100, df['efficiency_score'], 
                     s=100, alpha=0.7, c=df['tokens'], cmap='viridis')
ax1.set_xlabel('Accuracy (%)')
ax1.set_ylabel('Efficiency Score')
ax1.set_title('Efficiency Score vs Accuracy')
plt.colorbar(scatter, ax=ax1, label='Tokens')

# Add task labels
for i, row in df.iterrows():
    ax1.annotate(row['task'], (row['accuracy']*100, row['efficiency_score']), 
                xytext=(5, 5), textcoords='offset points', fontsize=8)

# Efficiency Score ranking
ax2.barh(range(len(df)), df['efficiency_score'])
ax2.set_yticks(range(len(df)))
ax2.set_yticklabels(df['task'])
ax2.set_xlabel('Efficiency Score')
ax2.set_title('Efficiency Score Rankings')
ax2.invert_yaxis()

# Overthinking Ratio vs Accuracy
# Filter out infinite values for plotting
finite_df = df[df['overthinking_ratio'] != float('inf')]
scatter2 = ax3.scatter(finite_df['accuracy']*100, finite_df['overthinking_ratio'], 
                      s=100, alpha=0.7, c=finite_df['tokens'], cmap='plasma')
ax3.set_xlabel('Accuracy (%)')
ax3.set_ylabel('Overthinking Ratio')
ax3.set_title('Overthinking Ratio vs Accuracy')
plt.colorbar(scatter2, ax=ax3, label='Tokens')

# Tokens vs Accuracy with Efficiency Score as color
scatter3 = ax4.scatter(df['accuracy']*100, df['tokens'], 
                      s=100, alpha=0.7, c=df['efficiency_score'], cmap='coolwarm')
ax4.set_xlabel('Accuracy (%)')
ax4.set_ylabel('Tokens')
ax4.set_title('Tokens vs Accuracy (colored by Efficiency)')
plt.colorbar(scatter3, ax=ax4, label='Efficiency Score')

plt.tight_layout()
plt.savefig("efficiency_analysis.png", dpi=300, bbox_inches='tight')
plt.show()

⚙️ Advanced Configuration

Command Line Parameters

| Parameter                | Description                        | Default        |
|--------------------------|------------------------------------|----------------|
| --model_id               | Hugging Face model ID              | Required       |
| --tasks                  | Tasks to evaluate                  | ["sorting"]    |
| --datapoints             | Number of samples per test case    | 1000           |
| --folds                  | Number of evaluation folds         | 1              |
| --range                  | Number range for evaluation        | [-100, 100]    |
| --list_sizes             | List sizes for list-based tasks    | [8]            |
| --store_details          | Store detailed per-example results | False          |
| --output_dir             | Directory to save results          | Auto-generated |
| --tensor_parallel_size   | Number of GPUs to use              | 1              |
| --gpu_memory_utilization | GPU memory utilization threshold   | 0.9            |
| --temperature            | Sampling temperature               | 0.7            |
| --top_p                  | Sampling top_p value               | 0.9            |
| --max_tokens             | Maximum tokens for sampling        | 512            |

🧩 Extending with Custom Tasks

LLMThinkBench is designed to be easily extensible. Here's how to create a custom evaluation task:

  1. Create a new task module:
# llmthinkbench/tasks/custom_task.py
import random
from ..tasks.base_task import BaseTask

class CustomTask(BaseTask):
    """Implementation of a custom task"""
    
    @property
    def task_name(self):
        return "custom_task"
    
    def generate_data(self):
        """Generate random data for the task"""
        data = []
        for _ in range(self.num_samples):
            a = random.randint(self.min_val, self.max_val)
            b = random.randint(self.min_val, self.max_val)
            result = a * b + a  # Example operation
            data.append({"a": a, "b": b, "expected": result})
        return data
    
    def create_prompt(self, data_point):
        """Create prompt for the task"""
        return (f"Calculate a * b + a where a = {data_point['a']} and b = {data_point['b']}.\n\n"
                f"Provide your answer in the format \\boxed{{result}} at the end.")
    
    def evaluate_response(self, response, data_point):
        """Evaluate model response for the task"""
        from ..utils.custom_parsing import parse_boxed_answer
        
        boxed_answer = parse_boxed_answer(response)
        instruction_followed = boxed_answer is not None
        accuracy = 0
        
        if instruction_followed and boxed_answer:
            try:
                parsed_answer = int(boxed_answer[0])
                accuracy = 1 if parsed_answer == data_point['expected'] else 0
            except (ValueError, TypeError):
                pass  # parsed answer was not a valid integer; leave accuracy at 0
        
        return {
            "a": data_point['a'],
            "b": data_point['b'],
            "expected": data_point['expected'],
            "parsed_answer": boxed_answer[0] if boxed_answer and len(boxed_answer) > 0 else None,
            "accuracy": accuracy,
            "instruction_followed": instruction_followed
        }
    
    def run_evaluation(self):
        """Run evaluation for custom task"""
        all_metrics = []
        
        # Generate evaluation data
        data = self.generate_data()
        
        # Run each fold
        for fold in range(1, self.num_folds + 1):
            metrics = self.run_fold(data, "custom_task", fold)
            all_metrics.append(metrics)
        
        return all_metrics
  2. Create a parsing utility for your task (a quick sanity check of this parser appears at the end of this list):
# llmthinkbench/utils/custom_parsing.py
import re

def parse_boxed_answer(text):
    """Parse boxed answers from text.
    
    Args:
        text (str): Model response text
        
    Returns:
        list: List of answers found in \\boxed{} notation
    """
    matches = re.findall(r'\\boxed\{([^}]*)\}', text)
    
    if not matches:
        # Alternative formats
        matches = re.findall(r'\[([^]]*)\]', text)
        
    return matches if matches else None
  3. Add your task to the available tasks in __init__.py

  4. Use your custom task:

llmthinkbench --model_id "meta-llama/Llama-2-7b-chat-hf" --tasks custom_task
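
Before registering the task, you can sanity-check the parsing utility from step 2 directly (illustrative only):

# Quick check of the custom parser from step 2
from llmthinkbench.utils.custom_parsing import parse_boxed_answer

sample = "Adding a gives 56, so the final answer is \\boxed{56}."
print(parse_boxed_answer(sample))  # expected: ['56']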

🔍 Contributing

Contributions to LLMThinkBench are welcome! To contribute:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

For major changes, please open an issue first to discuss what you would like to change.

📜 License

LLMThinkBench is licensed under the MIT License - see the LICENSE file for details.

📚 Citation

If you use LLMThinkBench in your research, please cite:

@software{llmthinkbench2025,
  author = {Gaurav Srivastava and Aafiya Hussain and Sriram Srinivasan and Xuan Wang},
  title = {LLMThinkBench: Advanced Reasoning and Overthinking Evaluation Framework for LLMs},
  year = {2025},
  url = {https://github.com/ctrl-gaurav/LLMThinkBench/},
  version = {0.1.6}
}

📧 Contact
