A framework for evaluating overthinking and basic reasoning capabilities of Large Language Models

🧠 LLMThinkBench: An Advanced Reasoning and Overthinking Evaluation Framework for Language Models

LLMThinkBench is a robust, extensible framework for rigorously evaluating the reasoning capabilities and "overthinking" tendencies of Large Language Models. Through standardized, reproducible benchmarks, it provides crucial insights into model performance on core reasoning tasks.

LLMThinkBench Overview

🌟 Key Features

  • Modular Architecture: Easily extend with custom evaluation tasks
  • Efficient Inference: Built on vLLM for high-throughput batched evaluation
  • Detailed Metrics: Comprehensive reports on accuracy, instruction following, and more
  • Multi-GPU Support: Scale evaluations across multiple GPUs
  • Reproducible Results: Consistent methodology across model comparisons

📊 Supported Tasks

| Task | Description | Metrics |
|------|-------------|---------|
| Sorting | Evaluates ability to correctly sort numerical lists of varying sizes | Accuracy, Instruction Following |
| Comparison | Tests number comparison abilities across different relationships | Accuracy across comparison types |
| Custom Tasks | Easily add your own evaluation tasks | Customizable metrics |
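
To illustrate how a sorting response might be scored, here is a minimal sketch (hypothetical; the framework's actual parsing and metric code may differ) that parses the model's output list and compares it against the sorted ground truth:

```python
# Hypothetical sketch of sorting-task scoring; the framework's actual
# parsing logic may differ.
import ast

def score_sorting(response_list_text, original_list):
    """Return 1 if the model's list equals the sorted ground truth, else 0."""
    try:
        parsed = ast.literal_eval(response_list_text)  # e.g. "[1, 2, 3]"
    except (ValueError, SyntaxError):
        return 0  # unparseable output counts as incorrect
    return int(list(parsed) == sorted(original_list))

print(score_sorting("[-5, 2, 9]", [9, -5, 2]))  # → 1 (correct order)
print(score_sorting("[2, -5, 9]", [9, -5, 2]))  # → 0 (wrong order)
```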

🚀 Installation

# From PyPI
pip install llmthinkbench

# From source
git clone https://github.com/ctrl-gaurav/LLMThinkBench.git
cd llmthinkbench
pip install -e .

📈 Quick Start

Command Line Interface

# Basic usage with default parameters
llmthinkbench --model_id "Qwen/Qwen2.5-1.5B-Instruct" --tasks sorting comparison

# Comprehensive evaluation
llmthinkbench --model_id "meta-llama/Llama-2-7b-chat-hf" \
  --tensor_parallel_size 2 \
  --tasks sorting comparison \
  --datapoints 1000 \
  --list_sizes 8 16 32 64 \
  --folds 3 \
  --range -1000 1000 \
  --store_details \
  --output_dir "./my_evaluation_results"

Python API

from llmthinkbench.models.model_handler import ModelHandler
from llmthinkbench.tasks.sorting_task import SortingTask
from llmthinkbench.tasks.comparison_task import ComparisonTask
from llmthinkbench.utils.reporting import generate_final_report

# Initialize model
model_handler = ModelHandler(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9
)

# Configure output directory
output_dir = "llama2_eval_results"

# Run sorting task
sorting = SortingTask(
    model_handler=model_handler,
    output_dir=output_dir,
    min_val=-100,
    max_val=100,
    num_folds=3,
    num_samples=500,
    store_details=True,
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)

# Evaluate multiple list sizes
list_sizes = [8, 16, 32]
sorting_metrics = sorting.run_evaluation(list_sizes)

# Run comparison task
comparison = ComparisonTask(
    model_handler=model_handler,
    output_dir=output_dir,
    min_val=-100,
    max_val=100,
    num_folds=3,
    num_samples=500,
    store_details=True,
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)

# Run evaluation
comparison_metrics = comparison.run_evaluation()

# Generate comprehensive report
all_metrics = sorting_metrics + comparison_metrics
report = generate_final_report(all_metrics, list_sizes, output_dir)
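
Since each task is run over multiple folds, `generate_final_report` aggregates per-fold metrics into the mean and standard deviation columns shown in the example report. Conceptually, that aggregation looks like this (an illustrative sketch with made-up fold values, not the library's actual code):

```python
# Illustrative sketch of the aggregation behind the report's
# "Accuracy (Mean)" and "Accuracy (Std)" columns; not the library's code.
from statistics import mean, pstdev

fold_accuracies = [0.94, 0.99, 0.92]  # hypothetical per-fold accuracies

acc_mean = mean(fold_accuracies)
acc_std = pstdev(fold_accuracies)
print(f"Accuracy: {acc_mean:.2%} ± {acc_std:.2%}")
```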

📝 Example Results

Below is an example report generated by LLMThinkBench:

+----------------+------------------+---------------+----------------------+-----------+-----------+-------------+
| Test Case      | Accuracy (Mean)  | Accuracy (Std)| Instruction Followed | Avg Chars | Avg Words | Avg Tokens  |
+----------------+------------------+---------------+----------------------+-----------+-----------+-------------+
| sorting_8      | 95.20%           | 3.60%         | 98.80%               | 612.57    | 93.45     | 186.23      |
| sorting_16     | 87.40%           | 4.80%         | 96.70%               | 982.32    | 167.85    | 312.45      |
| sorting_32     | 68.60%           | 7.20%         | 92.40%               | 1872.15   | 348.76    | 645.65      |
| comparison     | 99.20%           | 1.20%         | 99.60%               | 324.83    | 48.27     | 93.75       |
+----------------+------------------+---------------+----------------------+-----------+-----------+-------------+
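
The report also exposes the "overthinking" angle: as list size grows, accuracy falls while response length balloons. Using the numbers from the example table above, a rough verbosity signal (tokens spent per percentage point of accuracy) can be computed like so:

```python
# Rough verbosity signal derived from the example report above:
# average tokens per percentage point of accuracy.
results = {
    "sorting_8": (95.20, 186.23),   # (accuracy %, avg tokens)
    "sorting_16": (87.40, 312.45),
    "sorting_32": (68.60, 645.65),
    "comparison": (99.20, 93.75),
}

for task, (accuracy, avg_tokens) in results.items():
    print(f"{task}: {avg_tokens / accuracy:.2f} tokens per accuracy point")
```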

⚙️ Advanced Configuration

Command Line Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| --model_id | Hugging Face model ID | Required |
| --tasks | Tasks to evaluate | ["sorting"] |
| --datapoints | Number of samples per test case | 1000 |
| --folds | Number of evaluation folds | 1 |
| --range | Number range for evaluation | [-100, 100] |
| --list_sizes | List sizes for sorting task | [8] |
| --store_details | Store detailed per-example results | False |
| --output_dir | Directory to save results | Auto-generated |
| --tensor_parallel_size | Number of GPUs to use | 1 |
| --gpu_memory_utilization | GPU memory utilization threshold | 0.9 |
| --temperature | Sampling temperature | 0.7 |
| --top_p | Sampling top_p value | 0.9 |
| --max_tokens | Maximum tokens for sampling | 512 |

🧩 Extending with Custom Tasks

LLMThinkBench is designed to be easily extensible. Here's how to create a custom evaluation task:

  1. Create a new task module:
# llmthinkbench/tasks/addition_task.py
import random
from ..utils.parsing import parse_boxed_answer
from .base_task import BaseTask

class AdditionTask(BaseTask):
    """Implementation of the addition task"""
    
    @property
    def task_name(self):
        return "addition"
    
    def generate_data(self):
        """Generate random number pairs for addition"""
        data = []
        for _ in range(self.num_samples):
            a = random.randint(self.min_val, self.max_val)
            b = random.randint(self.min_val, self.max_val)
            data.append({"a": a, "b": b, "sum": a + b})
        return data
    
    def create_prompt(self, data_point):
        """Create prompt for addition task"""
        return (f"Calculate the sum of these two numbers:\n\n"
                f"First number: {data_point['a']}\n"
                f"Second number: {data_point['b']}\n\n"
                f"Provide the result. Your final answer must be in the format "
                f"\\boxed{{result}} at the end.")
    
    def evaluate_response(self, response, data_point):
        """Evaluate model response for addition task"""
        boxed_answer = parse_boxed_answer(response)
        instruction_followed = boxed_answer is not None
        accuracy = 0
        
        if instruction_followed and len(boxed_answer) == 1:
            accuracy = 1 if boxed_answer[0] == data_point['sum'] else 0
        
        return {
            "num1": data_point['a'],
            "num2": data_point['b'],
            "expected_sum": data_point['sum'],
            "parsed_answer": boxed_answer[0] if boxed_answer and len(boxed_answer) > 0 else None,
            "accuracy": accuracy,
            "instruction_followed": instruction_followed
        }
    
    def run_evaluation(self):
        """Run evaluation for addition task"""
        all_metrics = []
        
        # Generate evaluation data
        data = self.generate_data()
        
        # Run each fold
        for fold in range(1, self.num_folds + 1):
            metrics = self.run_fold(data, "addition", fold)
            all_metrics.append(metrics)
        
        return all_metrics
  2. Use your custom task:
llmthinkbench --model_id "meta-llama/Llama-2-7b-chat-hf" --tasks addition
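
The `parse_boxed_answer` helper imported in the task above ships with the package; its implementation isn't shown here, but a minimal regex-based sketch (an assumption for illustration, not the library's actual code) could look like this:

```python
# Hypothetical sketch of a \boxed{...} parser; the real
# llmthinkbench.utils.parsing.parse_boxed_answer may behave differently.
import re

def parse_boxed_answer(response):
    """Return all integers found inside \\boxed{...} spans, or None if absent."""
    matches = re.findall(r"\\boxed\{(-?\d+)\}", response)
    if not matches:
        return None
    return [int(m) for m in matches]

print(parse_boxed_answer(r"The sum is \boxed{42}."))  # → [42]
print(parse_boxed_answer("No boxed answer here"))     # → None
```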

📊 Visualization

LLMThinkBench results can be visualized using any plotting library. Here's a simple example using matplotlib:

import json
import matplotlib.pyplot as plt
import pandas as pd

# Load results
with open("final_report.json") as f:
    results = json.load(f)

# Create dataframe for plotting
data = []
for task, metrics in results.items():
    data.append({
        "Task": task,
        "Accuracy": metrics["accuracy"]["mean"] * 100,
        "Instruction Following": metrics["instruction_followed"]["mean"] * 100
    })

df = pd.DataFrame(data)

# Plot results
# Plot results (df.plot creates its own figure, so pass figsize here
# rather than via a separate, unused plt.figure call)
df.plot(x="Task", y=["Accuracy", "Instruction Following"],
        kind="bar", figsize=(12, 6))
plt.title("LLMThinkBench Results")
plt.ylabel("Percentage")
plt.ylim(0, 100)
plt.grid(axis="y")
plt.tight_layout()
plt.savefig("results_comparison.png")

🔍 Contributing

Contributions to LLMThinkBench are welcome! Please check out our contributing guidelines for more information.

📜 License

LLMThinkBench is licensed under the MIT License - see the LICENSE file for details.

📚 Citation

If you use LLMThinkBench in your research, please cite:

@software{llmthinkbench2025,
  author = {Srivastava, Gaurav and Hussain, Aafiya and Srinivasan, Sriram and Chauhan, Aninditaa},
  title = {LLMThinkBench: Advanced Reasoning and Overthinking Evaluation Framework for LLMs},
  year = {2025},
  url = {https://github.com/ctrl-gaurav/LLMThinkBench/}
}

📧 Contact

For questions, issues, or feedback, please open an issue on GitHub.
