A framework for evaluating overthinking and basic reasoning capabilities of Large Language Models
🧠 LLMThinkBench: An Advanced Reasoning and Overthinking Evaluation Framework for Language Models
LLMThinkBench is a robust, extensible framework for rigorously evaluating the reasoning capabilities and "overthinking" tendencies of Large Language Models. Through standardized, reproducible benchmarks, it provides crucial insights into model performance on core reasoning tasks.
🌟 Key Features
- Modular Architecture: Easily extend with custom evaluation tasks
- Efficient Inference: Built on vLLM for high-throughput batched evaluation
- Detailed Metrics: Comprehensive reports on accuracy, instruction following, and more
- Multi-GPU Support: Scale evaluations across multiple GPUs
- Reproducible Results: Consistent methodology across model comparisons
📊 Supported Tasks
| Task | Description | Metrics |
|---|---|---|
| Sorting | Evaluates ability to correctly sort numerical lists of varying sizes | Accuracy, Instruction Following |
| Comparison | Tests number comparison abilities across different relationships | Accuracy across comparison types |
| Custom Tasks | Easily add your own evaluation tasks | Customizable metrics |
🚀 Installation
```bash
# From PyPI
pip install llmthinkbench

# From source
git clone https://github.com/ctrl-gaurav/LLMThinkBench.git
cd LLMThinkBench
pip install -e .
```
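To verify the installation, the command-line entry point should now be available (the exact help text depends on the installed version):

```bash
llmthinkbench --help
```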
📈 Quick Start
Command Line Interface
```bash
# Basic usage with default parameters
llmthinkbench --model_id "Qwen/Qwen2.5-1.5B-Instruct" --tasks sorting comparison

# Comprehensive evaluation
llmthinkbench --model_id "meta-llama/Llama-2-7b-chat-hf" \
    --tensor_parallel_size 2 \
    --tasks sorting comparison \
    --datapoints 1000 \
    --list_sizes 8 16 32 64 \
    --folds 3 \
    --range -1000 1000 \
    --store_details \
    --output_dir "./my_evaluation_results"
```
Python API
```python
from llmthinkbench.models.model_handler import ModelHandler
from llmthinkbench.tasks.sorting_task import SortingTask
from llmthinkbench.tasks.comparison_task import ComparisonTask
from llmthinkbench.utils.reporting import generate_final_report

# Initialize model
model_handler = ModelHandler(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9
)

# Configure output directory
output_dir = "llama2_eval_results"

# Run sorting task
sorting = SortingTask(
    model_handler=model_handler,
    output_dir=output_dir,
    min_val=-100,
    max_val=100,
    num_folds=3,
    num_samples=500,
    store_details=True,
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)

# Evaluate multiple list sizes
list_sizes = [8, 16, 32]
sorting_metrics = sorting.run_evaluation(list_sizes)

# Run comparison task
comparison = ComparisonTask(
    model_handler=model_handler,
    output_dir=output_dir,
    min_val=-100,
    max_val=100,
    num_folds=3,
    num_samples=500,
    store_details=True,
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)

# Run evaluation
comparison_metrics = comparison.run_evaluation()

# Generate comprehensive report
all_metrics = sorting_metrics + comparison_metrics
report = generate_final_report(all_metrics, list_sizes, output_dir)
```
📝 Example Results
Below is an example report generated by LLMThinkBench:
```
+----------------+-----------------+----------------+----------------------+-----------+-----------+------------+
| Test Case      | Accuracy (Mean) | Accuracy (Std) | Instruction Followed | Avg Chars | Avg Words | Avg Tokens |
+----------------+-----------------+----------------+----------------------+-----------+-----------+------------+
| sorting_8      | 95.20%          | 3.60%          | 98.80%               | 612.57    | 93.45     | 186.23     |
| sorting_16     | 87.40%          | 4.80%          | 96.70%               | 982.32    | 167.85    | 312.45     |
| sorting_32     | 68.60%          | 7.20%          | 92.40%               | 1872.15   | 348.76    | 645.65     |
| comparison     | 99.20%          | 1.20%          | 99.60%               | 324.83    | 48.27     | 93.75      |
+----------------+-----------------+----------------+----------------------+-----------+-----------+------------+
```
⚙️ Advanced Configuration
Command Line Parameters
| Parameter | Description | Default |
|---|---|---|
| `--model_id` | Hugging Face model ID | Required |
| `--tasks` | Tasks to evaluate | `["sorting"]` |
| `--datapoints` | Number of samples per test case | `1000` |
| `--folds` | Number of evaluation folds | `1` |
| `--range` | Number range for evaluation | `[-100, 100]` |
| `--list_sizes` | List sizes for sorting task | `[8]` |
| `--store_details` | Store detailed per-example results | `False` |
| `--output_dir` | Directory to save results | Auto-generated |
| `--tensor_parallel_size` | Number of GPUs to use | `1` |
| `--gpu_memory_utilization` | GPU memory utilization threshold | `0.9` |
| `--temperature` | Sampling temperature | `0.7` |
| `--top_p` | Sampling top_p value | `0.9` |
| `--max_tokens` | Maximum tokens for sampling | `512` |
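The same settings are exposed through the Python API. As a rough sketch (constructor argument names are taken from the Quick Start example above; the exact mapping may differ), the CLI flags line up with the task constructor arguments roughly as follows:

```python
# Rough CLI-to-API correspondence (illustrative only):
#   --datapoints    -> num_samples
#   --folds         -> num_folds
#   --range MIN MAX -> min_val / max_val
#   --store_details -> store_details
#   --list_sizes    -> argument to SortingTask.run_evaluation
from llmthinkbench.models.model_handler import ModelHandler
from llmthinkbench.tasks.sorting_task import SortingTask

model_handler = ModelHandler(
    model_id="Qwen/Qwen2.5-1.5B-Instruct",
    tensor_parallel_size=1,                 # --tensor_parallel_size 1
    gpu_memory_utilization=0.9,             # --gpu_memory_utilization 0.9
)

sorting = SortingTask(
    model_handler=model_handler,
    output_dir="./my_evaluation_results",   # --output_dir
    min_val=-1000, max_val=1000,            # --range -1000 1000
    num_folds=3,                            # --folds 3
    num_samples=1000,                       # --datapoints 1000
    store_details=True,                     # --store_details
    temperature=0.7, top_p=0.9, max_tokens=512,
)

sorting_metrics = sorting.run_evaluation([8, 16, 32, 64])  # --list_sizes 8 16 32 64
```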
🧩 Extending with Custom Tasks
LLMThinkBench is designed to be easily extensible. Here's how to create a custom evaluation task:
- Create a new task module:
```python
# llmthinkbench/tasks/addition_task.py
import random

from ..utils.parsing import parse_boxed_answer
from .base_task import BaseTask


class AdditionTask(BaseTask):
    """Implementation of the addition task"""

    @property
    def task_name(self):
        return "addition"

    def generate_data(self):
        """Generate random number pairs for addition"""
        data = []
        for _ in range(self.num_samples):
            a = random.randint(self.min_val, self.max_val)
            b = random.randint(self.min_val, self.max_val)
            data.append({"a": a, "b": b, "sum": a + b})
        return data

    def create_prompt(self, data_point):
        """Create prompt for addition task"""
        return (f"Calculate the sum of these two numbers:\n\n"
                f"First number: {data_point['a']}\n"
                f"Second number: {data_point['b']}\n\n"
                f"Provide the result. Your final answer must be in the format "
                f"\\boxed{{result}} at the end.")

    def evaluate_response(self, response, data_point):
        """Evaluate model response for addition task"""
        boxed_answer = parse_boxed_answer(response)
        instruction_followed = boxed_answer is not None
        accuracy = 0
        if instruction_followed and len(boxed_answer) == 1:
            accuracy = 1 if boxed_answer[0] == data_point['sum'] else 0
        return {
            "num1": data_point['a'],
            "num2": data_point['b'],
            "expected_sum": data_point['sum'],
            "parsed_answer": boxed_answer[0] if boxed_answer and len(boxed_answer) > 0 else None,
            "accuracy": accuracy,
            "instruction_followed": instruction_followed
        }

    def run_evaluation(self):
        """Run evaluation for addition task"""
        all_metrics = []

        # Generate evaluation data
        data = self.generate_data()

        # Run each fold
        for fold in range(1, self.num_folds + 1):
            metrics = self.run_fold(data, "addition", fold)
            all_metrics.append(metrics)

        return all_metrics
```
- Use your custom task from the command line (or through the Python API, as sketched after this example):

```bash
llmthinkbench --model_id "meta-llama/Llama-2-7b-chat-hf" --tasks addition
```
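The new task can also be driven directly from Python, mirroring the built-in tasks. A minimal sketch, assuming the constructor signature inherited from `BaseTask` matches the Quick Start examples above:

```python
from llmthinkbench.models.model_handler import ModelHandler
from llmthinkbench.tasks.addition_task import AdditionTask

model_handler = ModelHandler(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
)

addition = AdditionTask(
    model_handler=model_handler,
    output_dir="addition_eval_results",   # hypothetical output directory
    min_val=-100, max_val=100,
    num_folds=3,
    num_samples=500,
    store_details=True,
    temperature=0.7, top_p=0.9, max_tokens=512,
)

# Returns one metrics dict per fold, as in run_evaluation above
addition_metrics = addition.run_evaluation()
```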
📊 Visualization
LLMThinkBench results can be visualized using any plotting library. Here's a simple example using matplotlib:
```python
import json

import matplotlib.pyplot as plt
import pandas as pd

# Load results
with open("final_report.json") as f:
    results = json.load(f)

# Create dataframe for plotting
data = []
for task, metrics in results.items():
    data.append({
        "Task": task,
        "Accuracy": metrics["accuracy"]["mean"] * 100,
        "Instruction Following": metrics["instruction_followed"]["mean"] * 100
    })

df = pd.DataFrame(data)

# Plot results (pass ax explicitly so pandas draws into the sized figure
# instead of creating its own)
plt.figure(figsize=(12, 6))
df.plot(x="Task", y=["Accuracy", "Instruction Following"], kind="bar", ax=plt.gca())
plt.title("LLMThinkBench Results")
plt.ylabel("Percentage")
plt.ylim(0, 100)
plt.grid(axis="y")
plt.tight_layout()
plt.savefig("results_comparison.png")
```
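The same `final_report.json` structure also makes cross-model comparisons straightforward. A minimal sketch, assuming one report file per evaluated model (the paths below are hypothetical):

```python
import json

import pandas as pd

# Hypothetical report paths: one final_report.json per evaluation run
runs = {
    "Llama-2-7b-chat": "llama2_eval_results/final_report.json",
    "Qwen2.5-1.5B": "qwen_eval_results/final_report.json",
}

rows = []
for model_name, path in runs.items():
    with open(path) as f:
        results = json.load(f)
    for task, metrics in results.items():
        rows.append({
            "Model": model_name,
            "Task": task,
            "Accuracy (%)": metrics["accuracy"]["mean"] * 100,
        })

# One row per task, one column per model
comparison = pd.DataFrame(rows).pivot(index="Task", columns="Model", values="Accuracy (%)")
print(comparison.round(2))
```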
🔍 Contributing
Contributions to LLMThinkBench are welcome! Please check out our contributing guidelines for more information.
📜 License
LLMThinkBench is licensed under the MIT License - see the LICENSE file for details.
📚 Citation
If you use LLMThinkBench in your research, please cite:
```bibtex
@software{llmthinkbench2025,
  author = {Srivastava, Gaurav and Hussain, Aafiya and Srinivasan, Sriram and Chauhan, Aninditaa},
  title  = {LLMThinkBench: Advanced Reasoning and Overthinking Evaluation Framework for LLMs},
  year   = {2025},
  url    = {https://github.com/ctrl-gaurav/LLMThinkBench/}
}
```
📧 Contact
For questions, issues, or feedback, please open an issue on GitHub.