BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models

101+ Models Evaluated | 79 Reasoning Tasks | 138 Variations | >10^15 Unique Instances

What is BeyondBench?

BeyondBench evaluates the reasoning capabilities of language models without relying on static benchmarks. It generates new problems at evaluation time across 79 distinct reasoning tasks with 138 variations, so a model cannot score well by recalling memorized solutions and has to work each problem out.

Key Features

  • Dynamic Problem Generation: A problem space of more than 10^15 unique instances, which makes data contamination highly unlikely
  • Three Difficulty Levels: Easy (44 tasks), Medium (15 tasks, 49 variations), Hard (20 tasks, 68 variations)
  • Multi-Backend Support: OpenAI, Gemini, and Anthropic APIs, plus vLLM and HuggingFace Transformers
  • Contamination-Resistant: Problems are generated fresh on every run, so there is no static benchmark to memorize
  • Comprehensive Metrics: Accuracy, instruction-following compliance, token efficiency
  • 101+ Models Evaluated: Open-source and proprietary, regularly updated
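
To make the idea of dynamic problem generation concrete, here is a minimal sketch of a seeded generator for a "sum" style task. It is not BeyondBench's implementation: `generate_sum_instance`, `instance_space_size`, and their parameters are hypothetical names chosen for illustration. The sketch shows how one seed yields one reproducible instance with a computed reference answer, and how quickly the number of possible instances grows.

```python
import random


def generate_sum_instance(seed, length=8, low=-100, high=100):
    """Hypothetical generator: build one 'sum' problem and its reference answer."""
    rng = random.Random(seed)  # local RNG, so the same seed reproduces the same instance
    numbers = [rng.randint(low, high) for _ in range(length)]
    prompt = f"Compute the sum of the following numbers: {numbers}"
    return {"prompt": prompt, "numbers": numbers, "answer": sum(numbers)}


def instance_space_size(length=8, low=-100, high=100):
    """Count the distinct ordered lists this toy generator can produce."""
    return (high - low + 1) ** length


instance = generate_sum_instance(seed=42)
print(instance["prompt"])
print("reference answer:", instance["answer"])
print("instance space:", instance_space_size())
```

Even this toy task, with eight integers drawn from [-100, 100], has 201^8 possible instances, which is already above 10^15. The figure quoted for BeyondBench itself comes from the project, not from this sketch.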

Installation

From PyPI

pip install beyondbench

With Optional Dependencies

# All API clients (OpenAI, Gemini, Anthropic)
# Quote the extras so shells such as zsh do not expand the square brackets
pip install "beyondbench[all-apis]"

# vLLM support (requires CUDA)
pip install "beyondbench[vllm]"

# Everything (all APIs + vLLM + dev tools + visualization)
pip install "beyondbench[full]"

From Source

git clone https://github.com/ctrl-gaurav/BeyondBench.git
cd BeyondBench
pip install -e .

Quick Start

Interactive Wizard

beyondbench

Command Line

# Evaluate GPT-4o on the easy suite
beyondbench evaluate --model-id gpt-4o --api-provider openai --suite easy

# Evaluate a local model with vLLM
beyondbench evaluate --model-id meta-llama/Llama-3.2-3B-Instruct --backend vllm --suite all

# Evaluate Claude on hard tasks
beyondbench evaluate --model-id claude-sonnet-4-20250514 --api-provider anthropic --suite hard

# List all available tasks
beyondbench list-tasks

Python API

import os

from beyondbench import EvaluationEngine, ModelHandler, TaskRegistry

# Initialize model handler; read the key from the environment instead of hard-coding it
model = ModelHandler(
    model_id="gpt-4o",
    api_provider="openai",
    api_key=os.environ["OPENAI_API_KEY"],
)

# Run evaluation
engine = EvaluationEngine(model_handler=model, output_dir="./results")
results = engine.run_evaluation(suite="easy", datapoints=100)

# Print results
print(f"Average Accuracy: {results['summary']['avg_accuracy']:.2%}")

New in v0.2.1 (Apr 17, 2026)

Critical packaging fix — four subpackages (parsers.strategies, configs, eval, prompts) were missing from the v0.2.0 PyPI wheel, causing ImportError on fresh installs. No API changes.

New in v0.2.0 (Apr 16, 2026)

  • 79 Reasoning Tasks: 44 easy + 15 medium (49 variations) + 20 hard (68 variations)
  • Multi-GPU Parallel Evaluation: Automatic batch tuning and tensor parallelism
  • Plugin SDK: Create and share custom tasks with beyondbench plugin scaffold
  • Gradio Dashboard: Real-time evaluation monitoring with --dashboard flag
  • Response Caching: Skip redundant API calls across runs
  • Universal Parser: Unified parsing engine with confidence scoring
  • 1000+ Tests: Comprehensive unit, integration, and end-to-end test coverage
  • API Server: beyondbench serve - FastAPI REST API with WebSocket support
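
The response caching item above can be pictured with a small sketch. This is an assumption about how such a cache might work, not the library's code: `ResponseCache`, `make_key`, `get_or_call`, and `fake_model` are hypothetical names. The idea shown is to hash the model id, prompt, and sampling parameters into a key and to call the backend only when that key has not been seen.

```python
import hashlib
import json


class ResponseCache:
    """Hypothetical in-memory cache keyed by model id, prompt, and sampling parameters."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def make_key(model_id, prompt, params):
        # sort_keys makes the key independent of dictionary ordering
        payload = json.dumps(
            {"model": model_id, "prompt": prompt, "params": params}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model_id, prompt, params, call_fn):
        key = self.make_key(model_id, prompt, params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = call_fn(prompt)  # only reached on a cache miss
        self._store[key] = response
        return response


def fake_model(prompt):
    """Stand-in for a real API call, used only for this illustration."""
    return prompt.upper()


cache = ResponseCache()
first = cache.get_or_call("demo-model", "2 + 2 = ?", {"temperature": 0.0}, fake_model)
second = cache.get_or_call("demo-model", "2 + 2 = ?", {"temperature": 0.0}, fake_model)
print(first == second, cache.hits, cache.misses)
```

A persistent cache would store the same keys on disk so that hits survive across runs; that part is omitted here.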

Supported Backends

Backend | Models | Features
OpenAI | GPT-4o, GPT-4o-mini, GPT-5, GPT-5-mini | Reasoning effort control
Gemini | Gemini 2.5 Pro, Gemini 2.5 Flash | Thinking budget configuration
Anthropic | Claude Sonnet 4, Claude Opus 4 | Latest Claude models
vLLM | Any HuggingFace model | Batch processing, tensor parallelism
Transformers | Any HuggingFace model | CPU/GPU inference

Task Suites

Easy Suite (44 Tasks)

  • Arithmetic: sum, multiplication, subtraction, division, absolute_difference
  • Statistics: mean, median, mode
  • Counting: odd_count, even_count, count_negative, count_unique, and more
  • Extrema: find_maximum, find_minimum, second_maximum, range, and more
  • Sequences: sorting, longest_increasing_subsequence, alternating_sum, sum_of_digits
  • Comparison

Medium Suite (15 Tasks, 49 Variations)

Fibonacci Sequence (6 variations), Algebraic Sequence (10), Geometric Sequence (10), Prime Sequence (11), Complex Pattern (12)
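
As an illustration of what a sequence task in this suite might look like, the sketch below generates a Fibonacci-like sequence from random starting terms and asks for the next term. The function name, the prompt wording, and the range of starting values are assumptions made for this example and may differ from the actual task definitions.

```python
import random


def generate_fibonacci_like(seed, visible_terms=6):
    """Hypothetical task generator: show `visible_terms` terms, ask for the next one."""
    if visible_terms < 2:
        raise ValueError("visible_terms must be at least 2")
    rng = random.Random(seed)
    terms = [rng.randint(1, 20), rng.randint(1, 20)]  # random starting pair
    while len(terms) < visible_terms + 1:
        terms.append(terms[-1] + terms[-2])  # each term is the sum of the previous two
    shown, answer = terms[:-1], terms[-1]
    prompt = f"What is the next term in the sequence {shown}?"
    return {"prompt": prompt, "shown": shown, "answer": answer}


task = generate_fibonacci_like(seed=3)
print(task["prompt"])
print("reference answer:", task["answer"])
```

Because the starting pair is random, the sequence rarely matches the textbook Fibonacci numbers, so recalling the standard sequence does not help.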

Hard Suite (20 Tasks, 68 Variations)

Tower of Hanoi, N-Queens, Graph Coloring, Boolean SAT, Sudoku, Cryptarithmetic, Matrix Chain, Modular Systems, Constraint Optimization, Logic Grid Puzzles
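
Tasks such as N-Queens can have many valid answers, so a generated instance is naturally scored by checking the model's answer against the constraints rather than against one stored solution. The sketch below is a generic N-Queens checker written for illustration; it is not taken from BeyondBench, and the answer format (one column index per row) is an assumption.

```python
def is_valid_n_queens(columns):
    """Check a placement where columns[r] is the column of the queen in row r."""
    n = len(columns)
    if sorted(columns) != list(range(n)):
        return False  # an out-of-range column or two queens sharing a column
    # Queens share a diagonal exactly when r + c or r - c repeats
    sums = {r + c for r, c in enumerate(columns)}
    diffs = {r - c for r, c in enumerate(columns)}
    return len(sums) == n and len(diffs) == n


print(is_valid_n_queens([1, 3, 0, 2]))  # a valid 4-queens placement
print(is_valid_n_queens([0, 1, 2, 3]))  # all queens on one diagonal
```

The check runs in O(n log n) time because of the sort, so verifying an answer stays cheap even for large boards.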


Leaderboard (Top 5)

Rank | Model | Overall | Instruction Following
1 | GPT-5* | 83.56% | 96.15%
2 | GPT-5-Nano* | 82.04% | 93.58%
3 | GPT-5-Mini* | 81.67% | 94.23%
4 | o3* | 80.36% | 94.96%
5 | o4-Mini* | 79.04% | 95.30%

*These models use reasoning/thinking tokens. Full results for 101+ models are available on the leaderboard.
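
For readers who want a feel for the two leaderboard columns, here is a rough sketch of how an accuracy figure and an instruction-following rate could be computed from per-problem records. The record fields `correct` and `followed_format` and the function `summarize` are hypothetical, and the leaderboard's exact scoring rules may differ from this simple averaging.

```python
def summarize(records):
    """Hypothetical aggregation over per-problem results."""
    if not records:
        raise ValueError("records must not be empty")
    total = len(records)
    # Booleans count as 0 or 1, so the sums are plain counts
    accuracy = sum(r["correct"] for r in records) / total
    instruction_following = sum(r["followed_format"] for r in records) / total
    return {"accuracy": accuracy, "instruction_following": instruction_following}


records = [
    {"correct": True, "followed_format": True},
    {"correct": False, "followed_format": True},
    {"correct": True, "followed_format": True},
    {"correct": True, "followed_format": True},
]
summary = summarize(records)
print(f"accuracy: {summary['accuracy']:.2%}")
print(f"instruction following: {summary['instruction_following']:.2%}")
```

Keeping the two rates separate matters because a model can produce the right answer in the wrong format, or follow the format while answering incorrectly.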


Environment Variables

export OPENAI_API_KEY="sk-..."
export GEMINI_API_KEY="..."
export ANTHROPIC_API_KEY="sk-ant-..."
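
Before starting a long run it can help to confirm which keys are actually set. The sketch below uses only the Python standard library and the three variable names listed above; the provider labels and the helper `configured_providers` are assumptions for illustration, not part of the BeyondBench API.

```python
import os

# Variable names come from the section above; the provider labels are illustrative
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def configured_providers(env=None):
    """Return the providers whose API key variable is set to a non-empty value."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_ENV_VARS.items() if env.get(var)]


print("Providers with keys set:", configured_providers())
```

Passing a plain dictionary as `env` makes the helper easy to test without touching the real environment.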

Citation

If you use BeyondBench in your research, please cite our paper (accepted at ICLR 2026):

@misc{srivastava2025beyondbenchbenchmarkfreeevaluationreasoning,
      title={BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models},
      author={Gaurav Srivastava and Aafiya Hussain and Zhenyu Bi and Swastik Roy and Priya Pitre and Meng Lu and Morteza Ziyadi and Xuan Wang},
      year={2025},
      eprint={2509.24210},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.24210},
}

Made with care by the BeyondBench Team | Virginia Tech, Department of Computer Science | Amazon AGI

Advancing the frontier of AI reasoning evaluation, one benchmark at a time.

License: Apache-2.0
