A framework for evaluating overthinking and basic reasoning capabilities of Large Language Models

🧠 LLMThinkBench: An Advanced Reasoning and Overthinking Evaluation Framework for Language Models

LLMThinkBench is a robust, extensible framework for rigorously evaluating the reasoning capabilities and "overthinking" tendencies of Large Language Models. Through standardized, reproducible benchmarks, it provides crucial insights into model performance on core reasoning tasks.

LLMThinkBench Overview

🌟 Key Features

  • Modular Architecture: Easily extend with custom evaluation tasks
  • Efficient Inference: Built on vLLM for high-throughput batched evaluation
  • Detailed Metrics: Comprehensive reports on accuracy, instruction following, and more
  • Multi-GPU Support: Scale evaluations across multiple GPUs
  • Reproducible Results: Consistent methodology across model comparisons

📊 Supported Tasks

| Task | Description | Metrics |
|------|-------------|---------|
| Sorting | Evaluates ability to correctly sort numerical lists of varying sizes | Accuracy, Instruction Following |
| Comparison | Tests number comparison abilities across different relationships | Accuracy across comparison types |
| Custom Tasks | Easily add your own evaluation tasks | Customizable metrics |
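To make the task definitions concrete, here is a rough sketch of how a sorting test case might be generated and scored. The function names are illustrative only, not the framework's actual API:

```python
import random

def make_sorting_example(size, min_val=-100, max_val=100):
    """Generate a random list and its sorted ground truth."""
    nums = [random.randint(min_val, max_val) for _ in range(size)]
    return {"input": nums, "expected": sorted(nums)}

def score_sorting_answer(parsed_answer, expected):
    """Exact-match accuracy: the parsed list must equal the sorted ground truth."""
    return 1 if parsed_answer == expected else 0

example = make_sorting_example(8)
assert score_sorting_answer(sorted(example["input"]), example["expected"]) == 1
```

Accuracy for a test case is then the fraction of examples scored 1, and larger list sizes make exact matches progressively harder.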

🚀 Installation

# From PyPI
pip install llmthinkbench

# From source
git clone https://github.com/ctrl-gaurav/LLMThinkBench.git
cd llmthinkbench
pip install -e .

📈 Quick Start

Command Line Interface

# Basic usage with default parameters
llmthinkbench --model_id "Qwen/Qwen2.5-1.5B-Instruct" --tasks sorting comparison

# Comprehensive evaluation
llmthinkbench --model_id "meta-llama/Llama-2-7b-chat-hf" \
  --tensor_parallel_size 2 \
  --tasks sorting comparison \
  --datapoints 1000 \
  --list_sizes 8 16 32 64 \
  --folds 3 \
  --range -1000 1000 \
  --store_details \
  --output_dir "./my_evaluation_results"

Python API

from llmthinkbench.models.model_handler import ModelHandler
from llmthinkbench.tasks.sorting_task import SortingTask
from llmthinkbench.tasks.comparison_task import ComparisonTask
from llmthinkbench.utils.reporting import generate_final_report

# Initialize model
model_handler = ModelHandler(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9
)

# Configure output directory
output_dir = "llama2_eval_results"

# Run sorting task
sorting = SortingTask(
    model_handler=model_handler,
    output_dir=output_dir,
    min_val=-100,
    max_val=100,
    num_folds=3,
    num_samples=500,
    store_details=True,
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)

# Evaluate multiple list sizes
list_sizes = [8, 16, 32]
sorting_metrics = sorting.run_evaluation(list_sizes)

# Run comparison task
comparison = ComparisonTask(
    model_handler=model_handler,
    output_dir=output_dir,
    min_val=-100,
    max_val=100,
    num_folds=3,
    num_samples=500,
    store_details=True,
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)

# Run evaluation
comparison_metrics = comparison.run_evaluation()

# Generate comprehensive report
all_metrics = sorting_metrics + comparison_metrics
report = generate_final_report(all_metrics, list_sizes, output_dir)

📝 Example Results

Below is an example report generated by LLMThinkBench:

+----------------+------------------+---------------+----------------------+-----------+-----------+-------------+
| Test Case      | Accuracy (Mean)  | Accuracy (Std)| Instruction Followed | Avg Chars | Avg Words | Avg Tokens  |
+----------------+------------------+---------------+----------------------+-----------+-----------+-------------+
| sorting_8      | 95.20%           | 3.60%         | 98.80%               | 612.57    | 93.45     | 186.23      |
| sorting_16     | 87.40%           | 4.80%         | 96.70%               | 982.32    | 167.85    | 312.45      |
| sorting_32     | 68.60%           | 7.20%         | 92.40%               | 1872.15   | 348.76    | 645.65      |
| comparison     | 99.20%           | 1.20%         | 99.60%               | 324.83    | 48.27     | 93.75       |
+----------------+------------------+---------------+----------------------+-----------+-----------+-------------+
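The Accuracy (Mean) and Accuracy (Std) columns are aggregated over the evaluation folds. Conceptually, the aggregation looks like the following sketch (illustrative fold values, not the framework's reporting code):

```python
from statistics import mean, stdev

# Hypothetical per-fold accuracies for one test case (e.g. sorting_8 over 3 folds)
fold_accuracies = [0.97, 0.95, 0.935]

acc_mean = mean(fold_accuracies) * 100   # reported as "Accuracy (Mean)"
acc_std = stdev(fold_accuracies) * 100   # reported as "Accuracy (Std)"
print(f"{acc_mean:.2f}% +/- {acc_std:.2f}%")
```

Running more folds (the --folds parameter) tightens these estimates at the cost of proportionally more inference.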

⚙️ Advanced Configuration

Command Line Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| --model_id | Hugging Face model ID | Required |
| --tasks | Tasks to evaluate | ["sorting"] |
| --datapoints | Number of samples per test case | 1000 |
| --folds | Number of evaluation folds | 1 |
| --range | Number range for evaluation | [-100, 100] |
| --list_sizes | List sizes for sorting task | [8] |
| --store_details | Store detailed per-example results | False |
| --output_dir | Directory to save results | Auto-generated |
| --tensor_parallel_size | Number of GPUs to use | 1 |
| --gpu_memory_utilization | GPU memory utilization threshold | 0.9 |
| --temperature | Sampling temperature | 0.7 |
| --top_p | Sampling top_p value | 0.9 |
| --max_tokens | Maximum tokens for sampling | 512 |

🧩 Extending with Custom Tasks

LLMThinkBench is designed to be easily extensible. Here's how to create a custom evaluation task:

  1. Create a new task module:
# llmthinkbench/tasks/addition_task.py
import random
from ..utils.parsing import parse_boxed_answer
from .base_task import BaseTask

class AdditionTask(BaseTask):
    """Implementation of the addition task"""
    
    @property
    def task_name(self):
        return "addition"
    
    def generate_data(self):
        """Generate random number pairs for addition"""
        data = []
        for _ in range(self.num_samples):
            a = random.randint(self.min_val, self.max_val)
            b = random.randint(self.min_val, self.max_val)
            data.append({"a": a, "b": b, "sum": a + b})
        return data
    
    def create_prompt(self, data_point):
        """Create prompt for addition task"""
        return (f"Calculate the sum of these two numbers:\n\n"
                f"First number: {data_point['a']}\n"
                f"Second number: {data_point['b']}\n\n"
                f"Provide the result. Your final answer must be in the format "
                f"\\boxed{{result}} at the end.")
    
    def evaluate_response(self, response, data_point):
        """Evaluate model response for addition task"""
        boxed_answer = parse_boxed_answer(response)
        instruction_followed = boxed_answer is not None
        accuracy = 0
        
        if instruction_followed and len(boxed_answer) == 1:
            accuracy = 1 if boxed_answer[0] == data_point['sum'] else 0
        
        return {
            "num1": data_point['a'],
            "num2": data_point['b'],
            "expected_sum": data_point['sum'],
            "parsed_answer": boxed_answer[0] if boxed_answer and len(boxed_answer) > 0 else None,
            "accuracy": accuracy,
            "instruction_followed": instruction_followed
        }
    
    def run_evaluation(self):
        """Run evaluation for addition task"""
        all_metrics = []
        
        # Generate evaluation data
        data = self.generate_data()
        
        # Run each fold
        for fold in range(1, self.num_folds + 1):
            metrics = self.run_fold(data, "addition", fold)
            all_metrics.append(metrics)
        
        return all_metrics
  2. Use your custom task:
llmthinkbench --model_id "meta-llama/Llama-2-7b-chat-hf" --tasks addition
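The example above relies on parse_boxed_answer from the framework's utilities to pull the final answer out of the model response. As a rough mental model, a minimal parser for the \boxed{result} format might look like this (an illustrative sketch, not the framework's implementation):

```python
import re

def parse_boxed_numbers(response):
    """Extract numeric answers wrapped in \\boxed{...}; return None if absent."""
    matches = re.findall(r"\\boxed\{(-?\d+(?:\.\d+)?)\}", response)
    if not matches:
        return None
    # Keep integers as int, decimals as float
    return [int(m) if "." not in m else float(m) for m in matches]

print(parse_boxed_numbers("The sum is \\boxed{42}."))   # [42]
print(parse_boxed_numbers("No boxed answer here."))     # None
```

Returning None when no boxed answer is found is what lets evaluate_response distinguish an incorrect answer from a failure to follow instructions.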

📊 Visualization

LLMThinkBench results can be visualized using any plotting library. Here's a simple example using matplotlib:

import json
import matplotlib.pyplot as plt
import pandas as pd

# Load results
with open("final_report.json") as f:
    results = json.load(f)

# Create dataframe for plotting
data = []
for task, metrics in results.items():
    data.append({
        "Task": task,
        "Accuracy": metrics["accuracy"]["mean"] * 100,
        "Instruction Following": metrics["instruction_followed"]["mean"] * 100
    })

df = pd.DataFrame(data)

# Plot results (pandas creates the figure, so pass figsize here
# rather than via a separate plt.figure call)
df.plot(x="Task", y=["Accuracy", "Instruction Following"], kind="bar", figsize=(12, 6))
plt.title("LLMThinkBench Results")
plt.ylabel("Percentage")
plt.ylim(0, 100)
plt.grid(axis="y")
plt.tight_layout()
plt.savefig("results_comparison.png")

🔍 Contributing

Contributions to LLMThinkBench are welcome! Please check out our contributing guidelines for more information.

📜 License

LLMThinkBench is licensed under the MIT License - see the LICENSE file for details.

📚 Citation

If you use LLMThinkBench in your research, please cite:

@software{llmthinkbench2025,
  author = {Srivastava, Gaurav and Hussain, Aafiya and Srinivasan, Sriram and Chauhan, Aninditaa},
  title = {LLMThinkBench: Advanced Reasoning and Overthinking Evaluation Framework for LLMs},
  year = {2025},
  url = {https://github.com/ctrl-gaurav/LLMThinkBench/}
}

📧 Contact

For questions, issues, or feedback, please open an issue on GitHub.
