
A framework for evaluating overthinking and basic reasoning capabilities of Large Language Models

Project description

🧠 LLMThinkBench: An Advanced Reasoning and Overthinking Evaluation Framework for Language Models


LLMThinkBench is a robust, extensible framework for rigorously evaluating the reasoning capabilities and "overthinking" tendencies of Large Language Models. Through standardized, reproducible benchmarks, it provides crucial insights into model performance on core reasoning tasks.

[Figure: LLMThinkBench overview]

🌟 Key Features

  • Modular Architecture: Easily extend with custom evaluation tasks
  • Efficient Inference: Built on vLLM for high-throughput batched evaluation
  • Detailed Metrics: Comprehensive reports on accuracy, instruction following, and more
  • Multi-GPU Support: Scale evaluations across multiple GPUs
  • Reproducible Results: Consistent methodology across model comparisons

📊 Supported Tasks

| Task | Description | Metrics |
|------|-------------|---------|
| Sorting | Evaluates ability to correctly sort numerical lists of varying sizes | Accuracy, Instruction Following |
| Comparison | Tests number comparison abilities across different relationships | Accuracy across comparison types |
| Custom Tasks | Easily add your own evaluation tasks | Customizable metrics |
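
For intuition, here is the kind of check the sorting task performs. The prompt wording and the \boxed{...} answer format below are illustrative assumptions (borrowed from the custom-task example later in this README), not the package's actual prompts:

import random

# Illustrative only: the real prompts and answer parsing live inside llmthinkbench
list_size = 8
numbers = [random.randint(-100, 100) for _ in range(list_size)]

prompt = (
    f"Sort the following list of numbers in ascending order:\n\n"
    f"{numbers}\n\n"
    f"Your final answer must be in the format \\boxed{{sorted list}} at the end."
)
expected = sorted(numbers)  # ground truth the model's response is scored against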

🚀 Installation

# From PyPI
pip install llmthinkbench

# From source
git clone https://github.com/ctrl-gaurav/LLMThinkBench.git
cd LLMThinkBench
pip install -e .

📈 Quick Start

Command Line Interface

# Basic usage with default parameters
llmthinkbench --model_id "Qwen/Qwen2.5-1.5B-Instruct" --tasks sorting comparison

# Comprehensive evaluation
llmthinkbench --model_id "meta-llama/Llama-2-7b-chat-hf" \
  --tensor_parallel_size 2 \
  --tasks sorting comparison \
  --datapoints 1000 \
  --list_sizes 8 16 32 64 \
  --folds 3 \
  --range -1000 1000 \
  --store_details \
  --output_dir "./my_evaluation_results"

Python API

from llmthinkbench.models.model_handler import ModelHandler
from llmthinkbench.tasks.sorting_task import SortingTask
from llmthinkbench.tasks.comparison_task import ComparisonTask
from llmthinkbench.utils.reporting import generate_final_report

# Initialize model
model_handler = ModelHandler(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9
)

# Configure output directory
output_dir = "llama2_eval_results"

# Run sorting task
sorting = SortingTask(
    model_handler=model_handler,
    output_dir=output_dir,
    min_val=-100,
    max_val=100,
    num_folds=3,
    num_samples=500,
    store_details=True,
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)

# Evaluate multiple list sizes
list_sizes = [8, 16, 32]
sorting_metrics = sorting.run_evaluation(list_sizes)

# Run comparison task
comparison = ComparisonTask(
    model_handler=model_handler,
    output_dir=output_dir,
    min_val=-100,
    max_val=100,
    num_folds=3,
    num_samples=500,
    store_details=True,
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)

# Run evaluation
comparison_metrics = comparison.run_evaluation()

# Generate comprehensive report
all_metrics = sorting_metrics + comparison_metrics
report = generate_final_report(all_metrics, list_sizes, output_dir)

📝 Example Results

Below is an example report generated by LLMThinkBench:

+----------------+------------------+---------------+----------------------+-----------+-----------+-------------+
| Test Case      | Accuracy (Mean)  | Accuracy (Std)| Instruction Followed | Avg Chars | Avg Words | Avg Tokens  |
+----------------+------------------+---------------+----------------------+-----------+-----------+-------------+
| sorting_8      | 95.20%           | 3.60%         | 98.80%               | 612.57    | 93.45     | 186.23      |
| sorting_16     | 87.40%           | 4.80%         | 96.70%               | 982.32    | 167.85    | 312.45      |
| sorting_32     | 68.60%           | 7.20%         | 92.40%               | 1872.15   | 348.76    | 645.65      |
| comparison     | 99.20%           | 1.20%         | 99.60%               | 324.83    | 48.27     | 93.75       |
+----------------+------------------+---------------+----------------------+-----------+-----------+-------------+
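
The Accuracy (Mean) and Accuracy (Std) columns summarize performance across evaluation folds, while Avg Chars/Words/Tokens capture how verbose (and potentially overthinking) a model's responses are. As a minimal sketch of the aggregation behind the mean/std columns, assuming each fold reports a fractional accuracy (the actual report schema may differ):

import statistics

# Hypothetical per-fold accuracies for a single test case (e.g. sorting_8 with --folds 3)
fold_accuracies = [0.92, 0.95, 0.988]

mean_acc = statistics.mean(fold_accuracies) * 100
std_acc = statistics.stdev(fold_accuracies) * 100  # sample standard deviation across folds

print(f"Accuracy (Mean): {mean_acc:.2f}%")
print(f"Accuracy (Std):  {std_acc:.2f}%")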

⚙️ Advanced Configuration

Command Line Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| --model_id | Hugging Face model ID | Required |
| --tasks | Tasks to evaluate | ["sorting"] |
| --datapoints | Number of samples per test case | 1000 |
| --folds | Number of evaluation folds | 1 |
| --range | Number range for evaluation | [-100, 100] |
| --list_sizes | List sizes for sorting task | [8] |
| --store_details | Store detailed per-example results | False |
| --output_dir | Directory to save results | Auto-generated |
| --tensor_parallel_size | Number of GPUs to use | 1 |
| --gpu_memory_utilization | GPU memory utilization threshold | 0.9 |
| --temperature | Sampling temperature | 0.7 |
| --top_p | Sampling top_p value | 0.9 |
| --max_tokens | Maximum tokens for sampling | 512 |

🧩 Extending with Custom Tasks

LLMThinkBench is designed to be easily extensible. Here's how to create a custom evaluation task:

  1. Create a new task module:
# llmthinkbench/tasks/addition_task.py
import random
from ..utils.parsing import parse_boxed_answer
from .base_task import BaseTask

class AdditionTask(BaseTask):
    """Implementation of the addition task"""
    
    @property
    def task_name(self):
        return "addition"
    
    def generate_data(self):
        """Generate random number pairs for addition"""
        data = []
        for _ in range(self.num_samples):
            a = random.randint(self.min_val, self.max_val)
            b = random.randint(self.min_val, self.max_val)
            data.append({"a": a, "b": b, "sum": a + b})
        return data
    
    def create_prompt(self, data_point):
        """Create prompt for addition task"""
        return (f"Calculate the sum of these two numbers:\n\n"
                f"First number: {data_point['a']}\n"
                f"Second number: {data_point['b']}\n\n"
                f"Provide the result. Your final answer must be in the format "
                f"\\boxed{{result}} at the end.")
    
    def evaluate_response(self, response, data_point):
        """Evaluate model response for addition task"""
        boxed_answer = parse_boxed_answer(response)
        instruction_followed = boxed_answer is not None
        accuracy = 0
        
        if instruction_followed and len(boxed_answer) == 1:
            accuracy = 1 if boxed_answer[0] == data_point['sum'] else 0
        
        return {
            "num1": data_point['a'],
            "num2": data_point['b'],
            "expected_sum": data_point['sum'],
            "parsed_answer": boxed_answer[0] if boxed_answer and len(boxed_answer) > 0 else None,
            "accuracy": accuracy,
            "instruction_followed": instruction_followed
        }
    
    def run_evaluation(self):
        """Run evaluation for addition task"""
        all_metrics = []
        
        # Generate evaluation data
        data = self.generate_data()
        
        # Run each fold
        for fold in range(1, self.num_folds + 1):
            metrics = self.run_fold(data, "addition", fold)
            all_metrics.append(metrics)
        
        return all_metrics
  2. Use your custom task (a Python API sketch follows the command below):
llmthinkbench --model_id "meta-llama/Llama-2-7b-chat-hf" --tasks addition

📊 Visualization

LLMThinkBench results can be visualized using any plotting library. Here's a simple example using matplotlib:

import json
import matplotlib.pyplot as plt
import pandas as pd

# Load results
with open("final_report.json") as f:
    results = json.load(f)

# Create dataframe for plotting
data = []
for task, metrics in results.items():
    data.append({
        "Task": task,
        "Accuracy": metrics["accuracy"]["mean"] * 100,
        "Instruction Following": metrics["instruction_followed"]["mean"] * 100
    })

df = pd.DataFrame(data)

# Plot results
df.plot(x="Task", y=["Accuracy", "Instruction Following"], kind="bar", figsize=(12, 6))
plt.title("LLMThinkBench Results")
plt.ylabel("Percentage")
plt.ylim(0, 100)
plt.grid(axis="y")
plt.tight_layout()
plt.savefig("results_comparison.png")

🔍 Contributing

Contributions to LLMThinkBench are welcome! Please check out our contributing guidelines for more information.

📜 License

LLMThinkBench is licensed under the MIT License - see the LICENSE file for details.

📚 Citation

If you use LLMThinkBench in your research, please cite:

@software{llmthinkbench2025,
  author = {Srivastava, Gaurav and Hussain, Aafiya and Srinivasan, Sriram and Chauhan, Aninditaa},
  title = {LLMThinkBench: Advanced Reasoning and Overthinking Evaluation Framework for LLMs},
  year = {2025},
  url = {https://github.com/ctrl-gaurav/LLMThinkBench/}
}

📧 Contact

For questions, issues, or feedback, please open an issue on GitHub.



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llmthinkbench-0.1.1.tar.gz (50.5 kB)


Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

llmthinkbench-0.1.1-py3-none-any.whl (70.5 kB)


File details

Details for the file llmthinkbench-0.1.1.tar.gz.

File metadata

  • Download URL: llmthinkbench-0.1.1.tar.gz
  • Upload date:
  • Size: 50.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.7

File hashes

Hashes for llmthinkbench-0.1.1.tar.gz

| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | 56ae0b9280247e1033b40893e8fb8d4042979b83791942db40b925f5522f5c65 |
| MD5 | 9e2a4ce768d4192d4dff2e11aa077711 |
| BLAKE2b-256 | ecf60bb8bc8aa6ed7aeb96e7d31490ce4f57ffac3c3666e16a366ce6649cb186 |

See more details on using hashes here.
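
For example, a downloaded archive can be verified against the published SHA256 digest before installation:

import hashlib

# Compare a locally downloaded file against the SHA256 digest listed above
expected = "56ae0b9280247e1033b40893e8fb8d4042979b83791942db40b925f5522f5c65"

sha256 = hashlib.sha256()
with open("llmthinkbench-0.1.1.tar.gz", "rb") as f:
    for chunk in iter(lambda: f.read(8192), b""):
        sha256.update(chunk)

assert sha256.hexdigest() == expected, "Hash mismatch - do not install this file"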

File details

Details for the file llmthinkbench-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: llmthinkbench-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 70.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.7

File hashes

Hashes for llmthinkbench-0.1.1-py3-none-any.whl

| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | 24f9abb622cbc0d696fc9cee43be7a05d66c620cf993cd87afe03e15fa4d07ed |
| MD5 | 57e62407295869e82dc25962c8e1e539 |
| BLAKE2b-256 | 922283444781eaf43865bb469d3d5c50cd0d43be84dfde3f56f9a66306c82c8f |

See more details on using hashes here.
