BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models
Project description
101+ Models Evaluated • 44 Reasoning Tasks • 117 Variations • >10^15 Unique Instances
Explore Leaderboard | Read Paper | PyPI | Documentation
Latest News
| Date | Update |
|---|---|
| Feb 2026 | v0.0.1 released: 44 tasks, 117 variations, 101+ models |
| Jan 2026 | Paper accepted at ICLR 2026 |
| Jan 2026 | Interactive leaderboard website launched |
| Sep 2025 | Paper submitted: arXiv:2509.24210 |
What is BeyondBench?
BeyondBench evaluates reasoning in language models without relying on static benchmarks. Instead of fixed test sets, it dynamically generates novel problems across 44 distinct reasoning tasks with 117 variations, so models cannot rely on memorized solutions and must demonstrate genuine reasoning on every instance.
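The core idea can be sketched in a few lines: each evaluation draws a fresh problem instance from a seeded generator, so the gold answer is computed on the fly rather than stored in a dataset a model could have memorized. A minimal illustration of the pattern (not BeyondBench's actual generator; the function name and value ranges are hypothetical):

```python
import random

def generate_sum_instance(n_terms: int, seed: int):
    """Draw a fresh arithmetic instance; the gold answer is computed, not stored."""
    rng = random.Random(seed)
    numbers = [rng.randint(-999, 999) for _ in range(n_terms)]
    prompt = f"Compute the sum of: {numbers}"
    answer = sum(numbers)
    return prompt, answer

# Each seed yields a distinct instance, reproducible for scoring.
prompt, answer = generate_sum_instance(5, seed=42)
```

Because the seed fully determines the instance, the same problem can be regenerated later to verify a model's answer, while the space of possible seeds makes train-set contamination implausible.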
Key Highlights
- Dynamic Problem Generation
- Three Difficulty Levels
- Multi-Backend Support
- Comprehensive Metrics
- Contamination-Resistant
- Extensive Coverage
Installation
From PyPI
```shell
pip install beyondbench
```
From Source
```shell
git clone https://github.com/ctrl-gaurav/BeyondBench.git
cd BeyondBench
pip install -e .
```
With Optional Dependencies
```shell
# All API clients (OpenAI, Gemini, Anthropic)
pip install "beyondbench[all-apis]"

# vLLM support (requires CUDA)
pip install "beyondbench[vllm]"

# Everything
pip install "beyondbench[full]"
```
Quick Start
Interactive Wizard
```shell
beyondbench
```
Command Line
```shell
# Evaluate GPT-4o on the easy suite
beyondbench evaluate --model-id gpt-4o --api-provider openai --suite easy

# Evaluate a local model with vLLM
beyondbench evaluate --model-id meta-llama/Llama-3.2-3B-Instruct --backend vllm --suite all

# Evaluate Claude on hard tasks
beyondbench evaluate --model-id claude-sonnet-4-20250514 --api-provider anthropic --suite hard

# List available tasks
beyondbench list-tasks
```
Python API
```python
from beyondbench import EvaluationEngine, ModelHandler, TaskRegistry

# Initialize the model handler
model = ModelHandler(
    model_id="gpt-4o",
    api_provider="openai",
    api_key="your-api-key",
)

# Run the evaluation
engine = EvaluationEngine(model_handler=model, output_dir="./results")
results = engine.run_evaluation(suite="easy", datapoints=100)

# Print results
print(f"Average Accuracy: {results['summary']['avg_accuracy']:.2%}")
```
Supported Backends
| Backend | Models | Features |
|---|---|---|
| OpenAI | GPT-4o, GPT-4o-mini, GPT-5, GPT-5-mini | Reasoning effort control |
| Gemini | Gemini 2.5 Pro, Gemini 2.5 Flash | Thinking budget configuration |
| Anthropic | Claude Sonnet 4, Claude Opus 4 | Latest Claude models |
| vLLM | Any HuggingFace model | Batch processing, tensor parallelism |
| Transformers | Any HuggingFace model | CPU/GPU inference |
Results

Leaderboard (Top Models)

| Rank | Model | Overall | Instruction Following |
|---|---|---|---|
| 1 | GPT-5* | 83.56% | 96.15% |
| 2 | GPT-5-Nano* | 82.04% | 93.58% |
| 3 | GPT-5-Mini* | 81.67% | 94.23% |
| 4 | o3* | 80.36% | 94.96% |
| 5 | o4-Mini* | 79.04% | 95.30% |
*Models marked with * use reasoning/thinking tokens. Full results for 101+ models available in the paper and on the leaderboard.
Key Findings
- Reasoning Gap: Even top models show 20-30% performance drops on hard reasoning tasks
- Scaling Effects: Larger models generally perform better, but the relationship is not always linear
- Instruction vs. Accuracy: High accuracy does not guarantee perfect instruction-following
Task Suites
Easy Suite (29 Tasks)
| Category | Tasks |
|---|---|
| Arithmetic | sum, multiplication, subtraction, division, absolute_difference |
| Statistics | mean, median, mode |
| Counting | odd_count, even_count, count_negative, count_unique, count_greater_than_previous, count_palindromic, count_perfect_squares, count_multiples, local_maxima_count |
| Extrema | find_maximum, find_minimum, second_maximum, range, index_of_maximum, max_adjacent_difference, sum_of_max_indices |
| Sequences | sorting, longest_increasing_subsequence, alternating_sum, sum_of_digits |
| Comparison | comparison |
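What makes easy-suite tasks suitable for exact automatic scoring is that each has a short, unambiguous reference solution. As a flavor, a task such as `count_greater_than_previous` from the table above reduces to a one-liner (an illustrative sketch; the package's own scoring code may differ):

```python
def count_greater_than_previous(xs):
    """Count elements strictly greater than their immediate predecessor."""
    return sum(1 for prev, cur in zip(xs, xs[1:]) if cur > prev)

count_greater_than_previous([3, 5, 2, 8, 8])  # returns 2 (5>3 and 8>2 qualify)
```

A model's free-text answer can then be compared against this computed value for any freshly generated list.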
Medium Suite (5 Tasks, 49 Variations)
| Task | Variations |
|---|---|
| Fibonacci Sequence | 6 (Tribonacci, Lucas numbers, modified recursive) |
| Algebraic Sequence | 10 (Polynomial, arithmetic, quadratic) |
| Geometric Sequence | 10 (Exponential, compound growth, factorial) |
| Prime Sequence | 11 (Prime gaps, twin primes, Sophie Germain) |
| Complex Pattern | 12 (Interleaved, conditional, multi-rule) |
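Medium-suite variations keep the prompt shape but swap the underlying rule. For instance, the Tribonacci variation of the Fibonacci task extends the recurrence to the sum of the three preceding terms; a hedged sketch of such a sequence generator (the start values and function name are illustrative, not the package's code):

```python
def tribonacci(n, start=(0, 1, 1)):
    """First n terms of a Tribonacci-style sequence:
    each term is the sum of the three preceding terms."""
    seq = list(start)
    while len(seq) < n:
        seq.append(seq[-1] + seq[-2] + seq[-3])
    return seq[:n]

tribonacci(8)  # [0, 1, 1, 2, 4, 7, 13, 24]
```

Varying the recurrence (Lucas numbers, modified weights) while reusing the same scaffold is what yields many variations per task.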
Hard Suite (10 Tasks, 68 Variations)
| Task | Variations | Complexity |
|---|---|---|
| Tower of Hanoi | 6 | O(2^n) moves |
| N-Queens | 4 | NP-complete |
| Graph Coloring | 10 | NP-complete |
| Boolean SAT | 5 | NP-complete |
| Sudoku | 8 | Constraint satisfaction |
| Cryptarithmetic | 12 | Constraint satisfaction |
| Matrix Chain | 5 | Dynamic programming |
| Modular Systems | 5 | Number theory |
| Constraint Optimization | 5 | Operations research |
| Logic Grid Puzzles | 8 | Deductive reasoning |
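The O(2^n) entry for Tower of Hanoi is concrete: the optimal solution for n disks takes exactly 2^n - 1 moves, which gives a cheap check on whether a model's move list is optimal. A sketch of the standard recursive reference solver (the package's own verifier may differ):

```python
def hanoi_moves(n, src="A", dst="C", aux="B"):
    """Optimal move list for n disks; its length is exactly 2**n - 1."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, aux, dst)   # clear n-1 disks onto the spare peg
            + [(src, dst)]                      # move the largest disk
            + hanoi_moves(n - 1, aux, dst, src))  # restack the n-1 disks on top

moves = hanoi_moves(4)
assert len(moves) == 2**4 - 1  # 15 moves
```

The exponential move count is why even modest n values stress a model's ability to plan rather than pattern-match.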
Documentation
- Full Documentation: Complete API reference and configuration guide
- Usage Guide: Detailed usage examples for all backends
Environment Variables
```shell
export OPENAI_API_KEY="sk-..."
export GEMINI_API_KEY="..."
export ANTHROPIC_API_KEY="sk-ant-..."
```
Contributing
We welcome contributions! See the Contributing Guide for details.
```shell
git clone https://github.com/ctrl-gaurav/BeyondBench.git
cd BeyondBench
pip install -e ".[dev]"
pre-commit install
pytest tests/ -v
```
Ways to Contribute
- Bug Reports: Found an issue? Report it here
- Feature Requests: Have ideas? Share them here
- Code Contributions: Submit PRs for improvements
- Documentation: Help improve our docs
- Model Submissions: Suggest models for evaluation
Citation
If you use BeyondBench in your research, please cite our paper (accepted at ICLR 2026):
```bibtex
@misc{srivastava2025beyondbenchbenchmarkfreeevaluationreasoning,
      title={BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models},
      author={Gaurav Srivastava and Aafiya Hussain and Zhenyu Bi and Swastik Roy and Priya Pitre and Meng Lu and Morteza Ziyadi and Xuan Wang},
      year={2025},
      eprint={2509.24210},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.24210},
}
```
Contact & Support
- Email: gks@vt.edu, xuanw@vt.edu
- Issues: GitHub Issues
- Discussions: GitHub Discussions
License
This project is licensed under the MIT License - see the LICENSE file for details.
Ready to Explore the Future of AI Evaluation?

Made with ❤️ by the BeyondBench Team

Advancing the frontier of AI reasoning evaluation, one benchmark at a time

| Home | Leaderboard | Paper | Code |
|---|---|---|---|
| Main website | Interactive rankings | Research paper | Source code |

Transform your understanding of AI capabilities. BeyondBench reveals what language models can truly reason about, beyond memorization. Start exploring now.
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file beyondbench-0.0.1.tar.gz.
File metadata
- Download URL: beyondbench-0.0.1.tar.gz
- Upload date:
- Size: 273.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `ff801a48712c6d6f967f40aeaf2eef0dbe95c6e8d8805c383d884a53f3e96f57` |
| MD5 | `d97d098b38dadb0e70c185e7508fb11c` |
| BLAKE2b-256 | `b6dd51cd56f33cbc30c159d9c7ead6d3d10735ca67d359ef7f7202d5c0182560` |
File details
Details for the file beyondbench-0.0.1-py3-none-any.whl.
File metadata
- Download URL: beyondbench-0.0.1-py3-none-any.whl
- Upload date:
- Size: 334.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `9e1c623c8610994353c2fcb5e096905888e8280c9520698e91eb5b1606823592` |
| MD5 | `23c0c42d04c2d7c647e9e78dfa482720` |
| BLAKE2b-256 | `58785d3203b7e281eea6ae92ff31142c7c9203495e85dec42add81336fe1b5f6` |