
BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models

Project description


Contamination-Resistant Evaluation of Reasoning in Language Models

๐Ÿ† 101+ Models Evaluated โ€ข ๐Ÿง  44 Reasoning Tasks โ€ข ๐ŸŽฏ 117 Variations โ€ข ๐Ÿ“Š >1015 Unique Instances

๐ŸŒŸ Explore Leaderboard | ๐Ÿ“– Read Paper | ๐Ÿ“ฆ PyPI | ๐Ÿ“š Documentation


📢 Latest News

Date Update
Feb 2026 v0.0.1 released: 44 tasks, 117 variations, 101+ models
Jan 2026 Paper accepted at ICLR 2026
Jan 2026 Interactive leaderboard website launched
Sep 2025 Paper submitted: arXiv:2509.24210

💡 What is BeyondBench?

BeyondBench evaluates the reasoning capabilities of language models without relying on traditional static benchmarks. It dynamically generates novel problems across 44 distinct reasoning tasks with 117 variations, so models cannot rely on memorized solutions and must demonstrate genuine reasoning ability.

🌟 Key Highlights

🔄 Dynamic Problem Generation

  • Problem space >10^15 unique instances
  • Zero risk of data contamination
  • Fresh problems on every evaluation
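To make the idea concrete, here is a minimal, self-contained sketch of seeded dynamic generation (illustrative only; `make_sum_instance` is a hypothetical helper, not BeyondBench's actual generator). Each call samples a fresh instance and derives the ground truth programmatically, so there is no fixed answer key to memorize:

```python
import random

def make_sum_instance(rng, length=8, lo=-999, hi=999):
    """Sample a fresh list-sum problem; the answer is computed, never stored."""
    xs = [rng.randint(lo, hi) for _ in range(length)]
    prompt = f"Compute the sum of: {xs}"
    return prompt, sum(xs)

# A fresh seed on every evaluation yields a problem the model has never seen;
# even this toy task spans roughly 1999^8 (> 10^26) distinct instances.
prompt, answer = make_sum_instance(random.Random())
```

Because the answer is recomputed from the sampled values, a leaked test set is impossible by construction.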

🎯 Three Difficulty Levels

  • Easy: 29 fundamental reasoning tasks
  • Medium: 5 tasks with 49 variations
  • Hard: 10 tasks with 68 variations

🤖 Multi-Backend Support

  • OpenAI, Gemini, Anthropic APIs
  • vLLM for high-throughput local inference
  • HuggingFace Transformers

📊 Comprehensive Metrics

  • Accuracy across difficulty levels
  • Instruction-following compliance
  • Token efficiency analysis
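As a sketch of how such metrics aggregate, assuming a hypothetical per-item record schema (not BeyondBench's actual output format):

```python
def summarize(records):
    """Aggregate per-item results into accuracy, compliance, and token cost."""
    n = len(records)
    return {
        "accuracy": sum(r["correct"] for r in records) / n,
        "instruction_following": sum(r["followed_format"] for r in records) / n,
        "avg_output_tokens": sum(r["output_tokens"] for r in records) / n,
    }

summarize([
    {"correct": True,  "followed_format": True,  "output_tokens": 120},
    {"correct": False, "followed_format": True,  "output_tokens": 310},
])
# → {'accuracy': 0.5, 'instruction_following': 1.0, 'avg_output_tokens': 215.0}
```

Tracking instruction-following separately from accuracy matters because, as the results below show, the two can diverge.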

🛡️ Contamination-Resistant

  • No static benchmark memorization
  • Novel problem generation
  • Fair model comparison

⚡ Extensive Coverage

  • 101+ models evaluated
  • Open-source and proprietary
  • Regular updates with new models

🚀 Installation

From PyPI

pip install beyondbench

From Source

git clone https://github.com/ctrl-gaurav/BeyondBench.git
cd BeyondBench
pip install -e .

With Optional Dependencies

# All API clients (OpenAI, Gemini, Anthropic)
pip install beyondbench[all-apis]

# vLLM support (requires CUDA)
pip install beyondbench[vllm]

# Everything
pip install beyondbench[full]

⚡ Quick Start

Interactive Wizard

beyondbench

Command Line

# Evaluate GPT-4o on the easy suite
beyondbench evaluate --model-id gpt-4o --api-provider openai --suite easy

# Evaluate a local model with vLLM
beyondbench evaluate --model-id meta-llama/Llama-3.2-3B-Instruct --backend vllm --suite all

# Evaluate Claude on hard tasks
beyondbench evaluate --model-id claude-sonnet-4-20250514 --api-provider anthropic --suite hard

# List available tasks
beyondbench list-tasks

Python API

from beyondbench import EvaluationEngine, ModelHandler, TaskRegistry

# Initialize model handler
model = ModelHandler(
    model_id="gpt-4o",
    api_provider="openai",
    api_key="your-api-key"
)

# Run evaluation
engine = EvaluationEngine(model_handler=model, output_dir="./results")
results = engine.run_evaluation(suite="easy", datapoints=100)

# Print results
print(f"Average Accuracy: {results['summary']['avg_accuracy']:.2%}")

🔌 Supported Backends

Backend Models Features
OpenAI GPT-4o, GPT-4o-mini, GPT-5, GPT-5-mini Reasoning effort control
Gemini Gemini 2.5 Pro, Gemini 2.5 Flash Thinking budget configuration
Anthropic Claude Sonnet 4, Claude Opus 4 Latest Claude models
vLLM Any HuggingFace model Batch processing, tensor parallelism
Transformers Any HuggingFace model CPU/GPU inference

📊 Results

🏆 Leaderboard (Top Models)

๐Ÿ… Rank ๐Ÿค– Model ๐Ÿ“Š Overall ๐ŸŽฏ Instruction Following
๐Ÿฅ‡GPT-5*83.56%96.15%
๐ŸฅˆGPT-5-Nano*82.04%93.58%
๐Ÿฅ‰GPT-5-Mini*81.67%94.23%
4o3*80.36%94.96%
5o4-Mini*79.04%95.30%

*Models marked with * use reasoning/thinking tokens. Full results for 101+ models available in the paper and on the leaderboard.

๐Ÿ” Key Findings

  • Reasoning Gap: Even top models show 20-30% performance drops on hard reasoning tasks
  • Scaling Effects: Larger models generally perform better, but the relationship is not always linear
  • Instruction vs. Accuracy: High accuracy does not guarantee perfect instruction-following

🧩 Task Suites

Easy Suite (29 Tasks)
Category Tasks
Arithmetic sum, multiplication, subtraction, division, absolute_difference
Statistics mean, median, mode
Counting odd_count, even_count, count_negative, count_unique, count_greater_than_previous, count_palindromic, count_perfect_squares, count_multiples, local_maxima_count
Extrema find_maximum, find_minimum, second_maximum, range, index_of_maximum, max_adjacent_difference, sum_of_max_indices
Sequences sorting, longest_increasing_subsequence, alternating_sum, sum_of_digits
Comparison comparison
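Most easy-suite tasks have a short programmatic ground truth. For example, our reading of count_greater_than_previous (an illustrative sketch based on the task name, not the package's code) counts positions whose value exceeds the preceding one:

```python
def count_greater_than_previous(xs):
    """Count elements strictly greater than their immediate predecessor."""
    return sum(1 for prev, cur in zip(xs, xs[1:]) if cur > prev)

count_greater_than_previous([3, 5, 2, 8, 8])  # → 2  (5 > 3 and 8 > 2)
```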

Medium Suite (5 Tasks, 49 Variations)
Task Variations
Fibonacci Sequence 6 (Tribonacci, Lucas numbers, modified recursive)
Algebraic Sequence 10 (Polynomial, arithmetic, quadratic)
Geometric Sequence 10 (Exponential, compound growth, factorial)
Prime Sequence 11 (Prime gaps, twin primes, Sophie Germain)
Complex Pattern 12 (Interleaved, conditional, multi-rule)
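Medium-suite variations extend familiar recurrences. A sketch of the Tribonacci variation, where each term sums the previous three (the seed values 0, 1, 1 are our assumption; the actual task may use different seeds):

```python
def tribonacci(n, seed=(0, 1, 1)):
    """n-th term of a Tribonacci-style recurrence: t(k) = t(k-1) + t(k-2) + t(k-3)."""
    a, b, c = seed
    for _ in range(n):
        a, b, c = b, c, a + b + c
    return a

[tribonacci(i) for i in range(8)]  # → [0, 1, 1, 2, 4, 7, 13, 24]
```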

Hard Suite (10 Tasks, 68 Variations)
Task Variations Complexity
Tower of Hanoi 6 O(2^n) moves
N-Queens 4 NP-complete
Graph Coloring 10 NP-complete
Boolean SAT 5 NP-complete
Sudoku 8 Constraint satisfaction
Cryptarithmetic 12 Constraint satisfaction
Matrix Chain 5 Dynamic programming
Modular Systems 5 Number theory
Constraint Optimization 5 Operations research
Logic Grid Puzzles 8 Deductive reasoning
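The complexity column matters because solution length itself can grow quickly. For Tower of Hanoi, the optimal solution takes 2^n - 1 moves, which a minimal recursive solver demonstrates (a standard textbook sketch, not BeyondBench's task code):

```python
def hanoi(n, src="A", aux="B", dst="C", moves=None):
    """Return the optimal move list for n discs from src to dst."""
    if moves is None:
        moves = []
    if n > 0:
        hanoi(n - 1, src, dst, aux, moves)   # park n-1 discs on the spare peg
        moves.append((src, dst))             # move the largest disc
        hanoi(n - 1, aux, src, dst, moves)   # restack n-1 discs on top
    return moves

len(hanoi(6))  # → 63, i.e. 2**6 - 1
```

A model asked to enumerate the moves for even modest n must therefore produce an exponentially long, fully consistent sequence, which is hard to fake from memorization.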

📚 Documentation

Environment Variables

export OPENAI_API_KEY="sk-..."
export GEMINI_API_KEY="..."
export ANTHROPIC_API_KEY="sk-ant-..."

๐Ÿค Contributing

We welcome contributions! See the Contributing Guide for details.

git clone https://github.com/ctrl-gaurav/BeyondBench.git
cd BeyondBench
pip install -e ".[dev]"
pre-commit install
pytest tests/ -v

🛠️ Ways to Contribute

  • ๐Ÿ› Bug Reports: Found an issue? Report it here
  • โœจ Feature Requests: Have ideas? Share them here
  • ๐Ÿ”ง Code Contributions: Submit PRs for improvements
  • ๐Ÿ“š Documentation: Help improve our docs
  • ๐Ÿค– Model Submissions: Suggest models for evaluation

๐Ÿ“ Citation

If you use BeyondBench in your research, please cite our paper (accepted at ICLR 2026):

@misc{srivastava2025beyondbenchbenchmarkfreeevaluationreasoning,
      title={BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models},
      author={Gaurav Srivastava and Aafiya Hussain and Zhenyu Bi and Swastik Roy and Priya Pitre and Meng Lu and Morteza Ziyadi and Xuan Wang},
      year={2025},
      eprint={2509.24210},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.24210},
}


📜 License

This project is licensed under the MIT License - see the LICENSE file for details.


🚀 Ready to Explore the Future of AI Evaluation?

Explore Leaderboard

Made with ❤️ by the BeyondBench Team

Virginia Tech Amazon AGI

Advancing the frontier of AI reasoning evaluation, one benchmark at a time 🌟


๐Ÿ  Home ๐Ÿ“Š Leaderboard ๐Ÿ“– Paper ๐Ÿ’ป Code
Main website Interactive rankings Research paper Source code

🎯 Transform your understanding of AI capabilities. BeyondBench reveals what language models can truly reason about, beyond memorization. Start exploring now →

