A framework for evaluating overthinking and basic reasoning capabilities of Large Language Models
🧠 LLMThinkBench: An Advanced Reasoning and Overthinking Evaluation Framework for Language Models
LLMThinkBench is a robust, extensible framework for rigorously evaluating the reasoning capabilities and "overthinking" tendencies of Large Language Models. Through standardized, reproducible benchmarks, it provides crucial insights into model performance on core reasoning tasks.
🌟 Key Features
- Modular Architecture: Easily extend with custom evaluation tasks
- Efficient Inference: Built on vLLM for high-throughput batched evaluation
- Detailed Metrics: Comprehensive reports on accuracy, instruction following, and more
- Multi-GPU Support: Scale evaluations across multiple GPUs
- Reproducible Results: Consistent methodology across model comparisons
📊 Supported Tasks
| Task | Description | Metrics |
|---|---|---|
| Sorting | Evaluates ability to correctly sort numerical lists of varying sizes | Accuracy, Instruction Following |
| Comparison | Tests number comparison abilities across different relationships | Accuracy across comparison types |
| Custom Tasks | Easily add your own evaluation tasks | Customizable metrics |
🚀 Installation
```bash
# From PyPI
pip install llmthinkbench

# From source
git clone https://github.com/ctrl-gaurav/LLMThinkBench.git
cd LLMThinkBench
pip install -e .
```
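To verify the installation, the command-line entry point should now be available (the exact help text depends on the installed version):

```bash
llmthinkbench --help
```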
📈 Quick Start
Command Line Interface
```bash
# Basic usage with default parameters
llmthinkbench --model_id "Qwen/Qwen2.5-1.5B-Instruct" --tasks sorting comparison

# Comprehensive evaluation
llmthinkbench --model_id "meta-llama/Llama-2-7b-chat-hf" \
    --tensor_parallel_size 2 \
    --tasks sorting comparison \
    --datapoints 1000 \
    --list_sizes 8 16 32 64 \
    --folds 3 \
    --range -1000 1000 \
    --store_details \
    --output_dir "./my_evaluation_results"
```
Python API
```python
from llmthinkbench.models.model_handler import ModelHandler
from llmthinkbench.tasks.sorting_task import SortingTask
from llmthinkbench.tasks.comparison_task import ComparisonTask
from llmthinkbench.utils.reporting import generate_final_report

# Initialize model
model_handler = ModelHandler(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9
)

# Configure output directory
output_dir = "llama2_eval_results"

# Run sorting task
sorting = SortingTask(
    model_handler=model_handler,
    output_dir=output_dir,
    min_val=-100,
    max_val=100,
    num_folds=3,
    num_samples=500,
    store_details=True,
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)

# Evaluate multiple list sizes
list_sizes = [8, 16, 32]
sorting_metrics = sorting.run_evaluation(list_sizes)

# Run comparison task
comparison = ComparisonTask(
    model_handler=model_handler,
    output_dir=output_dir,
    min_val=-100,
    max_val=100,
    num_folds=3,
    num_samples=500,
    store_details=True,
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)

# Run evaluation
comparison_metrics = comparison.run_evaluation()

# Generate comprehensive report
all_metrics = sorting_metrics + comparison_metrics
report = generate_final_report(all_metrics, list_sizes, output_dir)
```
📝 Example Results
Below is an example report generated by LLMThinkBench:
```
+----------------+-----------------+----------------+----------------------+-----------+-----------+------------+
| Test Case      | Accuracy (Mean) | Accuracy (Std) | Instruction Followed | Avg Chars | Avg Words | Avg Tokens |
+----------------+-----------------+----------------+----------------------+-----------+-----------+------------+
| sorting_8      | 95.20%          | 3.60%          | 98.80%               | 612.57    | 93.45     | 186.23     |
| sorting_16     | 87.40%          | 4.80%          | 96.70%               | 982.32    | 167.85    | 312.45     |
| sorting_32     | 68.60%          | 7.20%          | 92.40%               | 1872.15   | 348.76    | 645.65     |
| comparison     | 99.20%          | 1.20%          | 99.60%               | 324.83    | 48.27     | 93.75      |
+----------------+-----------------+----------------+----------------------+-----------+-----------+------------+
```
⚙️ Advanced Configuration
Command Line Parameters
| Parameter | Description | Default |
|---|---|---|
| `--model_id` | Hugging Face model ID | Required |
| `--tasks` | Tasks to evaluate | `["sorting"]` |
| `--datapoints` | Number of samples per test case | `1000` |
| `--folds` | Number of evaluation folds | `1` |
| `--range` | Number range for evaluation | `[-100, 100]` |
| `--list_sizes` | List sizes for sorting task | `[8]` |
| `--store_details` | Store detailed per-example results | `False` |
| `--output_dir` | Directory to save results | Auto-generated |
| `--tensor_parallel_size` | Number of GPUs to use | `1` |
| `--gpu_memory_utilization` | GPU memory utilization threshold | `0.9` |
| `--temperature` | Sampling temperature | `0.7` |
| `--top_p` | Sampling top_p value | `0.9` |
| `--max_tokens` | Maximum tokens for sampling | `512` |
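The same settings are exposed through the Python API. As a rough sketch (constructor argument names are taken from the Quick Start example above; the exact mapping may differ), the CLI flags line up with the task constructor arguments roughly as follows:

```python
# Rough CLI-to-API correspondence (illustrative only):
#   --datapoints    -> num_samples
#   --folds         -> num_folds
#   --range MIN MAX -> min_val / max_val
#   --store_details -> store_details
#   --list_sizes    -> argument to SortingTask.run_evaluation
from llmthinkbench.models.model_handler import ModelHandler
from llmthinkbench.tasks.sorting_task import SortingTask

model_handler = ModelHandler(
    model_id="Qwen/Qwen2.5-1.5B-Instruct",
    tensor_parallel_size=1,                 # --tensor_parallel_size 1
    gpu_memory_utilization=0.9,             # --gpu_memory_utilization 0.9
)

sorting = SortingTask(
    model_handler=model_handler,
    output_dir="./my_evaluation_results",   # --output_dir
    min_val=-1000, max_val=1000,            # --range -1000 1000
    num_folds=3,                            # --folds 3
    num_samples=1000,                       # --datapoints 1000
    store_details=True,                     # --store_details
    temperature=0.7, top_p=0.9, max_tokens=512,
)

sorting_metrics = sorting.run_evaluation([8, 16, 32, 64])  # --list_sizes 8 16 32 64
```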
🧩 Extending with Custom Tasks
LLMThinkBench is designed to be easily extensible. Here's how to create a custom evaluation task:
- Create a new task module:
```python
# llmthinkbench/tasks/addition_task.py
import random

from ..utils.parsing import parse_boxed_answer
from .base_task import BaseTask


class AdditionTask(BaseTask):
    """Implementation of the addition task"""

    @property
    def task_name(self):
        return "addition"

    def generate_data(self):
        """Generate random number pairs for addition"""
        data = []
        for _ in range(self.num_samples):
            a = random.randint(self.min_val, self.max_val)
            b = random.randint(self.min_val, self.max_val)
            data.append({"a": a, "b": b, "sum": a + b})
        return data

    def create_prompt(self, data_point):
        """Create prompt for addition task"""
        return (f"Calculate the sum of these two numbers:\n\n"
                f"First number: {data_point['a']}\n"
                f"Second number: {data_point['b']}\n\n"
                f"Provide the result. Your final answer must be in the format "
                f"\\boxed{{result}} at the end.")

    def evaluate_response(self, response, data_point):
        """Evaluate model response for addition task"""
        boxed_answer = parse_boxed_answer(response)
        instruction_followed = boxed_answer is not None
        accuracy = 0
        if instruction_followed and len(boxed_answer) == 1:
            accuracy = 1 if boxed_answer[0] == data_point['sum'] else 0
        return {
            "num1": data_point['a'],
            "num2": data_point['b'],
            "expected_sum": data_point['sum'],
            "parsed_answer": boxed_answer[0] if boxed_answer and len(boxed_answer) > 0 else None,
            "accuracy": accuracy,
            "instruction_followed": instruction_followed
        }

    def run_evaluation(self):
        """Run evaluation for addition task"""
        all_metrics = []

        # Generate evaluation data
        data = self.generate_data()

        # Run each fold
        for fold in range(1, self.num_folds + 1):
            metrics = self.run_fold(data, "addition", fold)
            all_metrics.append(metrics)

        return all_metrics
```
- Use your custom task from the command line (or through the Python API, as sketched after this example):

```bash
llmthinkbench --model_id "meta-llama/Llama-2-7b-chat-hf" --tasks addition
```
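The new task can also be driven directly from Python, mirroring the built-in tasks. A minimal sketch, assuming the constructor signature inherited from `BaseTask` matches the Quick Start examples above:

```python
from llmthinkbench.models.model_handler import ModelHandler
from llmthinkbench.tasks.addition_task import AdditionTask

model_handler = ModelHandler(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
)

addition = AdditionTask(
    model_handler=model_handler,
    output_dir="addition_eval_results",   # hypothetical output directory
    min_val=-100, max_val=100,
    num_folds=3,
    num_samples=500,
    store_details=True,
    temperature=0.7, top_p=0.9, max_tokens=512,
)

# Returns one metrics dict per fold, as in run_evaluation above
addition_metrics = addition.run_evaluation()
```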
📊 Visualization
LLMThinkBench results can be visualized using any plotting library. Here's a simple example using matplotlib:
```python
import json

import matplotlib.pyplot as plt
import pandas as pd

# Load results
with open("final_report.json") as f:
    results = json.load(f)

# Create dataframe for plotting
data = []
for task, metrics in results.items():
    data.append({
        "Task": task,
        "Accuracy": metrics["accuracy"]["mean"] * 100,
        "Instruction Following": metrics["instruction_followed"]["mean"] * 100
    })

df = pd.DataFrame(data)

# Plot results (pass ax explicitly so pandas draws into the sized figure
# instead of creating its own)
plt.figure(figsize=(12, 6))
df.plot(x="Task", y=["Accuracy", "Instruction Following"], kind="bar", ax=plt.gca())
plt.title("LLMThinkBench Results")
plt.ylabel("Percentage")
plt.ylim(0, 100)
plt.grid(axis="y")
plt.tight_layout()
plt.savefig("results_comparison.png")
```
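The same `final_report.json` structure also makes cross-model comparisons straightforward. A minimal sketch, assuming one report file per evaluated model (the paths below are hypothetical):

```python
import json

import pandas as pd

# Hypothetical report paths: one final_report.json per evaluation run
runs = {
    "Llama-2-7b-chat": "llama2_eval_results/final_report.json",
    "Qwen2.5-1.5B": "qwen_eval_results/final_report.json",
}

rows = []
for model_name, path in runs.items():
    with open(path) as f:
        results = json.load(f)
    for task, metrics in results.items():
        rows.append({
            "Model": model_name,
            "Task": task,
            "Accuracy (%)": metrics["accuracy"]["mean"] * 100,
        })

# One row per task, one column per model
comparison = pd.DataFrame(rows).pivot(index="Task", columns="Model", values="Accuracy (%)")
print(comparison.round(2))
```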
🔍 Contributing
Contributions to LLMThinkBench are welcome! Please check out our contributing guidelines for more information.
📜 License
LLMThinkBench is licensed under the MIT License - see the LICENSE file for details.
📚 Citation
If you use LLMThinkBench in your research, please cite:
```bibtex
@software{llmthinkbench2025,
  author = {Srivastava, Gaurav and Hussain, Aafiya and Srinivasan, Sriram and Chauhan, Aninditaa},
  title  = {LLMThinkBench: Advanced Reasoning and Overthinking Evaluation Framework for LLMs},
  year   = {2025},
  url    = {https://github.com/ctrl-gaurav/LLMThinkBench/}
}
```
📧 Contact
For questions, issues, or feedback, please open an issue on GitHub.