BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models
Project description
101+ Models Evaluated • 44 Reasoning Tasks • 117 Variations • >10^15 Unique Instances
Explore Leaderboard | Read Paper | PyPI | Documentation
Latest News
| Date | Update |
|---|---|
| Feb 2026 | v0.0.1 released: 44 tasks, 117 variations, 101+ models |
| Jan 2026 | Paper accepted at ICLR 2026 |
| Jan 2026 | Interactive leaderboard website launched |
| Sep 2025 | Paper submitted: arXiv:2509.24210 |
What is BeyondBench?
BeyondBench evaluates reasoning in language models without relying on static benchmarks. Instead of fixed test sets, it dynamically generates novel problems across 44 distinct reasoning tasks with 117 variations, so models cannot rely on memorized solutions and must demonstrate genuine reasoning on every instance.
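The core idea can be sketched in a few lines: each evaluation draws a fresh problem instance from a seeded generator, so the gold answer is computed on the fly rather than stored in a dataset a model could have memorized. A minimal illustration of the pattern (not BeyondBench's actual generator; the function name and value ranges are hypothetical):

```python
import random

def generate_sum_instance(n_terms: int, seed: int):
    """Draw a fresh arithmetic instance; the gold answer is computed, not stored."""
    rng = random.Random(seed)
    numbers = [rng.randint(-999, 999) for _ in range(n_terms)]
    prompt = f"Compute the sum of: {numbers}"
    answer = sum(numbers)
    return prompt, answer

# Each seed yields a distinct instance, reproducible for scoring.
prompt, answer = generate_sum_instance(5, seed=42)
```

Because the seed fully determines the instance, the same problem can be regenerated later to verify a model's answer, while the space of possible seeds makes train-set contamination implausible.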
Key Highlights
- Dynamic Problem Generation
- Three Difficulty Levels
- Multi-Backend Support
- Comprehensive Metrics
- Contamination-Resistant
- Extensive Coverage
Installation
From PyPI
```shell
pip install beyondbench
```
From Source
```shell
git clone https://github.com/ctrl-gaurav/BeyondBench.git
cd BeyondBench
pip install -e .
```
With Optional Dependencies
```shell
# All API clients (OpenAI, Gemini, Anthropic)
pip install "beyondbench[all-apis]"

# vLLM support (requires CUDA)
pip install "beyondbench[vllm]"

# Everything
pip install "beyondbench[full]"
```
Quick Start
Interactive Wizard
```shell
beyondbench
```
Command Line
```shell
# Evaluate GPT-4o on the easy suite
beyondbench evaluate --model-id gpt-4o --api-provider openai --suite easy

# Evaluate a local model with vLLM
beyondbench evaluate --model-id meta-llama/Llama-3.2-3B-Instruct --backend vllm --suite all

# Evaluate Claude on hard tasks
beyondbench evaluate --model-id claude-sonnet-4-20250514 --api-provider anthropic --suite hard

# List available tasks
beyondbench list-tasks
```
Python API
```python
from beyondbench import EvaluationEngine, ModelHandler, TaskRegistry

# Initialize the model handler
model = ModelHandler(
    model_id="gpt-4o",
    api_provider="openai",
    api_key="your-api-key",
)

# Run the evaluation
engine = EvaluationEngine(model_handler=model, output_dir="./results")
results = engine.run_evaluation(suite="easy", datapoints=100)

# Print results
print(f"Average Accuracy: {results['summary']['avg_accuracy']:.2%}")
```
Supported Backends
| Backend | Models | Features |
|---|---|---|
| OpenAI | GPT-4o, GPT-4o-mini, GPT-5, GPT-5-mini | Reasoning effort control |
| Gemini | Gemini 2.5 Pro, Gemini 2.5 Flash | Thinking budget configuration |
| Anthropic | Claude Sonnet 4, Claude Opus 4 | Latest Claude models |
| vLLM | Any HuggingFace model | Batch processing, tensor parallelism |
| Transformers | Any HuggingFace model | CPU/GPU inference |
Results

Leaderboard (Top Models)

| Rank | Model | Overall | Instruction Following |
|---|---|---|---|
| 1 | GPT-5* | 83.56% | 96.15% |
| 2 | GPT-5-Nano* | 82.04% | 93.58% |
| 3 | GPT-5-Mini* | 81.67% | 94.23% |
| 4 | o3* | 80.36% | 94.96% |
| 5 | o4-Mini* | 79.04% | 95.30% |
*Models marked with * use reasoning/thinking tokens. Full results for 101+ models available in the paper and on the leaderboard.
Key Findings
- Reasoning Gap: Even top models show 20-30% performance drops on hard reasoning tasks
- Scaling Effects: Larger models generally perform better, but the relationship is not always linear
- Instruction vs. Accuracy: High accuracy does not guarantee perfect instruction-following
Task Suites
Easy Suite (29 Tasks)
| Category | Tasks |
|---|---|
| Arithmetic | sum, multiplication, subtraction, division, absolute_difference |
| Statistics | mean, median, mode |
| Counting | odd_count, even_count, count_negative, count_unique, count_greater_than_previous, count_palindromic, count_perfect_squares, count_multiples, local_maxima_count |
| Extrema | find_maximum, find_minimum, second_maximum, range, index_of_maximum, max_adjacent_difference, sum_of_max_indices |
| Sequences | sorting, longest_increasing_subsequence, alternating_sum, sum_of_digits |
| Comparison | comparison |
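What makes easy-suite tasks suitable for exact automatic scoring is that each has a short, unambiguous reference solution. As a flavor, a task such as `count_greater_than_previous` from the table above reduces to a one-liner (an illustrative sketch; the package's own scoring code may differ):

```python
def count_greater_than_previous(xs):
    """Count elements strictly greater than their immediate predecessor."""
    return sum(1 for prev, cur in zip(xs, xs[1:]) if cur > prev)

count_greater_than_previous([3, 5, 2, 8, 8])  # returns 2 (5>3 and 8>2 qualify)
```

A model's free-text answer can then be compared against this computed value for any freshly generated list.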
Medium Suite (5 Tasks, 49 Variations)
| Task | Variations |
|---|---|
| Fibonacci Sequence | 6 (Tribonacci, Lucas numbers, modified recursive) |
| Algebraic Sequence | 10 (Polynomial, arithmetic, quadratic) |
| Geometric Sequence | 10 (Exponential, compound growth, factorial) |
| Prime Sequence | 11 (Prime gaps, twin primes, Sophie Germain) |
| Complex Pattern | 12 (Interleaved, conditional, multi-rule) |
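Medium-suite variations keep the prompt shape but swap the underlying rule. For instance, the Tribonacci variation of the Fibonacci task extends the recurrence to the sum of the three preceding terms; a hedged sketch of such a sequence generator (the start values and function name are illustrative, not the package's code):

```python
def tribonacci(n, start=(0, 1, 1)):
    """First n terms of a Tribonacci-style sequence:
    each term is the sum of the three preceding terms."""
    seq = list(start)
    while len(seq) < n:
        seq.append(seq[-1] + seq[-2] + seq[-3])
    return seq[:n]

tribonacci(8)  # [0, 1, 1, 2, 4, 7, 13, 24]
```

Varying the recurrence (Lucas numbers, modified weights) while reusing the same scaffold is what yields many variations per task.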
Hard Suite (10 Tasks, 68 Variations)
| Task | Variations | Complexity |
|---|---|---|
| Tower of Hanoi | 6 | O(2^n) moves |
| N-Queens | 4 | NP-complete |
| Graph Coloring | 10 | NP-complete |
| Boolean SAT | 5 | NP-complete |
| Sudoku | 8 | Constraint satisfaction |
| Cryptarithmetic | 12 | Constraint satisfaction |
| Matrix Chain | 5 | Dynamic programming |
| Modular Systems | 5 | Number theory |
| Constraint Optimization | 5 | Operations research |
| Logic Grid Puzzles | 8 | Deductive reasoning |
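The O(2^n) entry for Tower of Hanoi is concrete: the optimal solution for n disks takes exactly 2^n - 1 moves, which gives a cheap check on whether a model's move list is optimal. A sketch of the standard recursive reference solver (the package's own verifier may differ):

```python
def hanoi_moves(n, src="A", dst="C", aux="B"):
    """Optimal move list for n disks; its length is exactly 2**n - 1."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, aux, dst)   # clear n-1 disks onto the spare peg
            + [(src, dst)]                      # move the largest disk
            + hanoi_moves(n - 1, aux, dst, src))  # restack the n-1 disks on top

moves = hanoi_moves(4)
assert len(moves) == 2**4 - 1  # 15 moves
```

The exponential move count is why even modest n values stress a model's ability to plan rather than pattern-match.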
Documentation
- Full Documentation: Complete API reference and configuration guide
- Usage Guide: Detailed usage examples for all backends
Environment Variables
```shell
export OPENAI_API_KEY="sk-..."
export GEMINI_API_KEY="..."
export ANTHROPIC_API_KEY="sk-ant-..."
```
Contributing
We welcome contributions! See the Contributing Guide for details.
```shell
git clone https://github.com/ctrl-gaurav/BeyondBench.git
cd BeyondBench
pip install -e ".[dev]"
pre-commit install
pytest tests/ -v
```
Ways to Contribute
- Bug Reports: Found an issue? Report it here
- Feature Requests: Have ideas? Share them here
- Code Contributions: Submit PRs for improvements
- Documentation: Help improve our docs
- Model Submissions: Suggest models for evaluation
Citation
If you use BeyondBench in your research, please cite our paper (accepted at ICLR 2026):
```bibtex
@misc{srivastava2025beyondbenchbenchmarkfreeevaluationreasoning,
      title={BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models},
      author={Gaurav Srivastava and Aafiya Hussain and Zhenyu Bi and Swastik Roy and Priya Pitre and Meng Lu and Morteza Ziyadi and Xuan Wang},
      year={2025},
      eprint={2509.24210},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.24210},
}
```
Contact & Support
- Email: gks@vt.edu, xuanw@vt.edu
- Issues: GitHub Issues
- Discussions: GitHub Discussions
License
This project is licensed under the MIT License - see the LICENSE file for details.
Ready to Explore the Future of AI Evaluation?

Made with ❤️ by the BeyondBench Team

Advancing the frontier of AI reasoning evaluation, one benchmark at a time

| Home | Leaderboard | Paper | Code |
|---|---|---|---|
| Main website | Interactive rankings | Research paper | Source code |

Transform your understanding of AI capabilities. BeyondBench reveals what language models can truly reason about, beyond memorization. Start exploring now.
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file beyondbench-0.0.1.tar.gz.
File metadata
- Download URL: beyondbench-0.0.1.tar.gz
- Upload date:
- Size: 273.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `ff801a48712c6d6f967f40aeaf2eef0dbe95c6e8d8805c383d884a53f3e96f57` |
| MD5 | `d97d098b38dadb0e70c185e7508fb11c` |
| BLAKE2b-256 | `b6dd51cd56f33cbc30c159d9c7ead6d3d10735ca67d359ef7f7202d5c0182560` |
File details
Details for the file beyondbench-0.0.1-py3-none-any.whl.
File metadata
- Download URL: beyondbench-0.0.1-py3-none-any.whl
- Upload date:
- Size: 334.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `9e1c623c8610994353c2fcb5e096905888e8280c9520698e91eb5b1606823592` |
| MD5 | `23c0c42d04c2d7c647e9e78dfa482720` |
| BLAKE2b-256 | `58785d3203b7e281eea6ae92ff31142c7c9203495e85dec42add81336fe1b5f6` |