BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models
101+ Models Evaluated | 44 Reasoning Tasks | 117 Variations | >10^15 Unique Instances
What is BeyondBench?
BeyondBench evaluates the reasoning capabilities of language models without relying on static benchmarks. It generates novel problems on the fly across 44 distinct reasoning tasks with 117 variations, so a model cannot score well by memorizing published solutions and has to solve each instance as it is posed.
Key Features
- Dynamic Problem Generation — Problem space >10^15 unique instances, zero risk of data contamination
- Three Difficulty Levels — Easy (29 tasks), Medium (5 tasks, 49 variations), Hard (10 tasks, 68 variations)
- Multi-Backend Support — OpenAI, Gemini, Anthropic APIs + vLLM and HuggingFace Transformers
- Contamination-Resistant — No static benchmark memorization, novel problems every run
- Comprehensive Metrics — Accuracy, instruction-following compliance, token efficiency
- 101+ Models Evaluated — Open-source and proprietary, regularly updated
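To make the dynamic-generation idea concrete, here is a minimal sketch of a seeded generator for a "sum of a list" problem, together with a count of how many distinct instances of that shape exist. The function names, the prompt wording, and the parameter ranges are illustrative assumptions and are not taken from the BeyondBench source.

```python
import random


def make_sum_instance(seed, length=8, low=-1000, high=1000):
    """Generate one 'sum of a list' problem and its ground-truth answer.

    Hypothetical helper: names and ranges are assumptions for illustration.
    """
    rng = random.Random(seed)  # seeded, so the same seed reproduces the instance
    numbers = [rng.randint(low, high) for _ in range(length)]
    prompt = f"Compute the sum of the following numbers: {numbers}"
    return {"prompt": prompt, "numbers": numbers, "answer": sum(numbers)}


def instance_space_size(length=8, low=-1000, high=1000):
    """Number of distinct ordered lists of this shape."""
    return (high - low + 1) ** length


instance = make_sum_instance(seed=42)
print(instance["prompt"])
print("Distinct instances of this shape:", instance_space_size())
```

Because the answer is computed from the generated numbers, each instance carries its own ground truth, and even this single toy task has far more than 10^15 possible instances.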
Installation
From PyPI
pip install beyondbench
With Optional Dependencies
# All API clients (OpenAI, Gemini, Anthropic)
pip install "beyondbench[all-apis]"
# vLLM support (requires CUDA)
pip install "beyondbench[vllm]"
# Everything (all APIs + vLLM + dev tools + visualization)
pip install "beyondbench[full]"
From Source
git clone https://github.com/ctrl-gaurav/BeyondBench.git
cd BeyondBench
pip install -e .
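After installing, one way to confirm that the package is visible to the current interpreter is to query its metadata with the standard library. This is a small sketch; `installed_version` is a helper name introduced here, not part of BeyondBench.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("beyondbench")
print(f"beyondbench version: {found}" if found else "beyondbench is not installed")
```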
Quick Start
Interactive Wizard
beyondbench
Command Line
# Evaluate GPT-4o on the easy suite
beyondbench evaluate --model-id gpt-4o --api-provider openai --suite easy
# Evaluate a local model with vLLM
beyondbench evaluate --model-id meta-llama/Llama-3.2-3B-Instruct --backend vllm --suite all
# Evaluate Claude on hard tasks
beyondbench evaluate --model-id claude-sonnet-4-20250514 --api-provider anthropic --suite hard
# List all available tasks
beyondbench list-tasks
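When several models need to be evaluated in a row, the commands above can be assembled programmatically. The sketch below only uses the flags shown in this section (`--model-id`, `--api-provider`, `--backend`, `--suite`); `build_command` is a hypothetical helper, and the script prints the commands rather than running them.

```python
import shlex


def build_command(model_id, suite, api_provider=None, backend=None):
    """Build a `beyondbench evaluate` argument list from the documented flags."""
    command = ["beyondbench", "evaluate", "--model-id", model_id]
    if api_provider is not None:
        command += ["--api-provider", api_provider]
    if backend is not None:
        command += ["--backend", backend]
    command += ["--suite", suite]
    return command


runs = [
    build_command("gpt-4o", "easy", api_provider="openai"),
    build_command("meta-llama/Llama-3.2-3B-Instruct", "all", backend="vllm"),
]
for command in runs:
    # To execute instead of print: subprocess.run(command, check=True)
    print(shlex.join(command))
```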
Python API
from beyondbench import EvaluationEngine, ModelHandler, TaskRegistry
# Initialize model handler
model = ModelHandler(
    model_id="gpt-4o",
    api_provider="openai",
    api_key="your-api-key"
)
# Run evaluation
engine = EvaluationEngine(model_handler=model, output_dir="./results")
results = engine.run_evaluation(suite="easy", datapoints=100)
# Print results
print(f"Average Accuracy: {results['summary']['avg_accuracy']:.2%}")
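The only result field shown above is `results['summary']['avg_accuracy']`. A script that post-processes results may want to tolerate a missing field; the sketch below does that with a stand-in dictionary. `format_summary` and the mock value are assumptions for illustration, and the real results object may contain more keys than shown.

```python
def format_summary(results):
    """Render the summary block of a results dict as a one-line report."""
    summary = results.get("summary", {})
    accuracy = summary.get("avg_accuracy")
    if accuracy is None:
        return "Average Accuracy: unavailable"
    return f"Average Accuracy: {accuracy:.2%}"


# Stand-in for the dict returned by engine.run_evaluation(...); the value is made up.
mock_results = {"summary": {"avg_accuracy": 0.8125}}
print(format_summary(mock_results))  # Average Accuracy: 81.25%
print(format_summary({}))            # Average Accuracy: unavailable
```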
Supported Backends
| Backend | Models | Features |
|---|---|---|
| OpenAI | GPT-4o, GPT-4o-mini, GPT-5, GPT-5-mini | Reasoning effort control |
| Gemini | Gemini 2.5 Pro, Gemini 2.5 Flash | Thinking budget configuration |
| Anthropic | Claude Sonnet 4, Claude Opus 4 | Latest Claude models |
| vLLM | Any HuggingFace model | Batch processing, tensor parallelism |
| Transformers | Any HuggingFace model | CPU/GPU inference |
Task Suites
Easy Suite (29 Tasks)
- Arithmetic: sum, multiplication, subtraction, division, absolute_difference
- Statistics: mean, median, mode
- Counting: odd_count, even_count, count_negative, count_unique, and more
- Extrema: find_maximum, find_minimum, second_maximum, range, and more
- Sequences: sorting, longest_increasing_subsequence, alternating_sum, sum_of_digits
- Comparison
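Each easy task has an answer that a short deterministic function can compute, which is what makes automatic grading of generated instances possible. Below are reference solvers for three of the task names listed above. The exact task definitions are assumptions: the sign convention for `alternating_sum`, the use of distinct values in `second_maximum`, and the strictly increasing reading of `longest_increasing_subsequence` may differ from BeyondBench's own.

```python
from bisect import bisect_left


def alternating_sum(values):
    """a0 - a1 + a2 - ... (the sign convention is an assumption)."""
    return sum(v if i % 2 == 0 else -v for i, v in enumerate(values))


def second_maximum(values):
    """Second-largest distinct value; raises if there are fewer than two."""
    distinct = sorted(set(values), reverse=True)
    if len(distinct) < 2:
        raise ValueError("need at least two distinct values")
    return distinct[1]


def longest_increasing_subsequence(values):
    """Length of the longest strictly increasing subsequence, in O(n log n)."""
    tails = []  # tails[k] = smallest tail of any increasing subsequence of length k + 1
    for v in values:
        pos = bisect_left(tails, v)
        if pos == len(tails):
            tails.append(v)   # v extends the longest subsequence found so far
        else:
            tails[pos] = v    # v gives a smaller tail for length pos + 1
    return len(tails)


data = [3, 1, 4, 1, 5, 9, 2, 6]
print(alternating_sum(data))                 # -3
print(second_maximum(data))                  # 6
print(longest_increasing_subsequence(data))  # 4
```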
Medium Suite (5 Tasks, 49 Variations)
Fibonacci Sequence (6 variations), Algebraic Sequence (10), Geometric Sequence (10), Prime Sequence (11), Complex Pattern (12)
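Sequence tasks of this kind can be generated by sampling the parameters of a rule, showing the first few terms, and keeping the next term as the hidden answer. The sketch below does this for a geometric rule and a Fibonacci-style rule. The generator names, parameter ranges, and instance layout are hypothetical, not BeyondBench's actual variations.

```python
import random


def geometric_instance(seed, shown_terms=5):
    """Sample a geometric sequence; the term after the shown ones is the answer."""
    rng = random.Random(seed)
    first = rng.randint(1, 9)
    ratio = rng.randint(2, 5)
    terms = [first * ratio**k for k in range(shown_terms + 1)]
    return {"shown": terms[:-1], "answer": terms[-1]}


def fibonacci_like_instance(seed, shown_terms=6):
    """Sample two starting values and extend with the rule t[n] = t[n-1] + t[n-2]."""
    rng = random.Random(seed)
    terms = [rng.randint(1, 9), rng.randint(1, 9)]
    while len(terms) < shown_terms + 1:
        terms.append(terms[-1] + terms[-2])
    return {"shown": terms[:-1], "answer": terms[-1]}


for make in (geometric_instance, fibonacci_like_instance):
    instance = make(seed=7)
    print(make.__name__, instance["shown"], "->", instance["answer"])
```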
Hard Suite (10 Tasks, 68 Variations)
Tower of Hanoi, N-Queens, Graph Coloring, Boolean SAT, Sudoku, Cryptarithmetic, Matrix Chain, Modular Systems, Constraint Optimization, Logic Grid Puzzles
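Hard tasks such as Tower of Hanoi can have many valid answers, so grading usually means checking a proposed solution against the rules instead of comparing strings. The sketch below pairs a standard recursive solver with a move validator. The move encoding `(from_peg, to_peg)` with pegs numbered 0 to 2 is an assumption made for this example.

```python
def solve_hanoi(n, source=0, spare=1, target=2):
    """Return the optimal move list [(from_peg, to_peg), ...] for n disks."""
    if n == 0:
        return []
    return (
        solve_hanoi(n - 1, source, target, spare)    # park n-1 disks on the spare peg
        + [(source, target)]                         # move the largest disk
        + solve_hanoi(n - 1, spare, source, target)  # bring the n-1 disks on top
    )


def is_valid_solution(n, moves, source=0, target=2):
    """Replay the moves and check the rules and the final configuration."""
    pegs = [[], [], []]
    pegs[source] = list(range(n, 0, -1))  # largest disk at the bottom
    for src, dst in moves:
        if src not in (0, 1, 2) or dst not in (0, 1, 2):
            return False  # unknown peg
        if not pegs[src]:
            return False  # nothing to move from this peg
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            return False  # larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[target] == list(range(n, 0, -1))


moves = solve_hanoi(4)
print(len(moves))                     # 15, i.e. 2**4 - 1
print(is_valid_solution(4, moves))    # True
```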
Leaderboard (Top 5)
| Rank | Model | Overall | Instruction Following |
|---|---|---|---|
| 1 | GPT-5* | 83.56% | 96.15% |
| 2 | GPT-5-Nano* | 82.04% | 93.58% |
| 3 | GPT-5-Mini* | 81.67% | 94.23% |
| 4 | o3* | 80.36% | 94.96% |
| 5 | o4-Mini* | 79.04% | 95.30% |
*Model uses reasoning/thinking tokens. Full results for all 101+ models are available on the leaderboard.
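The table reports accuracy and instruction following as separate numbers. One way two such scores can be derived from the same responses is sketched below, under loudly stated assumptions: the required output format (a final line of the form `Answer: <integer>`) and the rule that an unformatted answer earns no credit are both hypothetical, and BeyondBench's actual scoring may differ.

```python
import re

# Hypothetical required format: the last line must read "Answer: <integer>".
ANSWER_LINE = re.compile(r"^Answer:\s*(-?\d+)\s*$")


def score_response(response, expected):
    """Return (followed_format, correct) for one model response."""
    lines = response.strip().splitlines()
    last_line = lines[-1] if lines else ""
    match = ANSWER_LINE.match(last_line)
    if match is None:
        return False, False  # assumption: no credit when the format is ignored
    return True, int(match.group(1)) == expected


def aggregate(scores):
    """Average both flags over a list of (followed_format, correct) pairs."""
    if not scores:
        raise ValueError("no scores to aggregate")
    total = len(scores)
    return {
        "instruction_following": sum(1 for followed, _ in scores if followed) / total,
        "accuracy": sum(1 for _, correct in scores if correct) / total,
    }


scores = [
    score_response("3 + 4 = 7\nAnswer: 7", 7),   # formatted and correct
    score_response("The result is 7", 7),         # right value, wrong format
    score_response("Answer: 8", 7),               # formatted but wrong
    score_response("Answer: -2", -2),             # formatted and correct
]
print(aggregate(scores))  # {'instruction_following': 0.75, 'accuracy': 0.5}
```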
Environment Variables
export OPENAI_API_KEY="sk-..."
export GEMINI_API_KEY="..."
export ANTHROPIC_API_KEY="sk-ant-..."
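Before launching a long evaluation it can help to check that the keys for the chosen providers are actually set. The sketch below uses the three variable names listed above. The provider labels follow the `--api-provider` values shown earlier (`openai`, `anthropic`), with `gemini` assumed by analogy, and `missing_keys` is a hypothetical helper.

```python
import os

# Environment variable expected for each provider (names from this section).
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",  # provider label assumed by analogy
    "anthropic": "ANTHROPIC_API_KEY",
}


def missing_keys(providers, env=None):
    """Return the variable names that are unset or empty for the given providers."""
    env = os.environ if env is None else env
    return [PROVIDER_KEYS[p] for p in providers if not env.get(PROVIDER_KEYS[p])]


missing = missing_keys(["openai", "anthropic"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required API keys are set.")
```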
Citation
If you use BeyondBench in your research, please cite our paper (accepted at ICLR 2026):
@misc{srivastava2025beyondbenchbenchmarkfreeevaluationreasoning,
  title={BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models},
  author={Gaurav Srivastava and Aafiya Hussain and Zhenyu Bi and Swastik Roy and Priya Pitre and Meng Lu and Morteza Ziyadi and Xuan Wang},
  year={2025},
  eprint={2509.24210},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.24210},
}
Links
- Leaderboard: ctrl-gaurav.github.io/BeyondBench
- GitHub: github.com/ctrl-gaurav/BeyondBench
- Paper: arXiv:2509.24210
- Documentation: Full Docs | Usage Guide
- Issues: GitHub Issues
- Email: gks@vt.edu, xuanw@vt.edu
Made with care by the BeyondBench Team | Virginia Tech, Department of Computer Science | Amazon AGI
Advancing the frontier of AI reasoning evaluation, one benchmark at a time.
License: MIT
Download files
File details
Details for the file beyondbench-0.0.2.tar.gz.
File metadata
- Download URL: beyondbench-0.0.2.tar.gz
- Upload date:
- Size: 278.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c8e771707123fa78689ba866fec976c45a4e138b0b523f5f662e879d71c3ae20 |
| MD5 | 2515cbce383aa367cf63846dd5d55ea6 |
| BLAKE2b-256 | 128cf9afb8f719be7733e42e157b8577ed5818a4d2236404558c674d154fc2cd |
File details
Details for the file beyondbench-0.0.2-py3-none-any.whl.
File metadata
- Download URL: beyondbench-0.0.2-py3-none-any.whl
- Upload date:
- Size: 338.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 270a11ec235646ef56a91eb848e674bac48f36444faecf05c714e74b0c5991b0 |
| MD5 | f20bc5b8b1c537368313a24288affd06 |
| BLAKE2b-256 | c0a4aa5da8f1e4d2e6267e33699643b193475b70c1b759650adaddcea0f53c70 |