BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models

101+ Models Evaluated | 79 Reasoning Tasks | 138 Variations | >10^15 Unique Instances

What is BeyondBench?

BeyondBench evaluates the reasoning capabilities of language models without relying on static benchmarks. It generates novel problems dynamically across 79 distinct reasoning tasks with 138 variations, so a model cannot succeed by memorizing solutions and has to reason through each instance.
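
To make the idea of dynamic generation concrete, here is a minimal sketch of a generator for a "sum"-style task. It is not BeyondBench's implementation: `Problem`, `generate_sum_problem`, and `generate_batch` are hypothetical names, and the value range and prompt wording are assumptions. The point it illustrates is that the ground truth is computed in code at generation time, so no stored answer key exists to leak.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    prompt: str
    answer: int


def generate_sum_problem(rng: random.Random, length: int = 8) -> Problem:
    """Build a fresh 'sum' instance and compute its ground truth in code."""
    numbers = [rng.randint(-1000, 1000) for _ in range(length)]
    prompt = f"Compute the sum of the following list: {numbers}"
    return Problem(prompt=prompt, answer=sum(numbers))


def generate_batch(seed: int, count: int) -> list[Problem]:
    """Same seed reproduces a run; a different seed yields new instances."""
    rng = random.Random(seed)
    return [generate_sum_problem(rng) for _ in range(count)]


if __name__ == "__main__":
    for problem in generate_batch(seed=0, count=3):
        print(problem.prompt, "->", problem.answer)
```

Seeding the generator keeps an individual run reproducible while still letting every new run draw previously unseen instances.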

Key Features

Dynamic Problem Generation: Problem space of more than 10^15 unique instances, which makes training-data contamination highly unlikely
Three Difficulty Levels: Easy (44 tasks), Medium (15 tasks, 49 variations), Hard (20 tasks, 68 variations)
Multi-Backend Support: OpenAI, Gemini, and Anthropic APIs, plus vLLM and HuggingFace Transformers
Contamination-Resistant: Novel problems are generated on every run, so there is no static benchmark to memorize
Comprehensive Metrics: Accuracy, instruction-following compliance, token efficiency
101+ Models Evaluated: Open-source and proprietary, regularly updated
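
The three metrics named above can be sketched as simple aggregates over per-response records. This is an illustration under stated assumptions, not the package's scoring code: `Record` and `summarize` are hypothetical names, and "token efficiency" is approximated here as the mean number of generated tokens, which may differ from the definition BeyondBench uses.

```python
from dataclasses import dataclass


@dataclass
class Record:
    correct: bool          # parsed answer matched the ground truth
    followed_format: bool  # response obeyed the requested output format
    output_tokens: int     # tokens the model generated for this response


def summarize(records: list[Record]) -> dict[str, float]:
    """Aggregate per-response records into run-level metrics."""
    if not records:
        raise ValueError("no records to summarize")
    n = len(records)
    return {
        "accuracy": sum(r.correct for r in records) / n,
        "instruction_following": sum(r.followed_format for r in records) / n,
        # Assumed proxy for token efficiency: lower means fewer tokens spent.
        "avg_output_tokens": sum(r.output_tokens for r in records) / n,
    }


if __name__ == "__main__":
    demo = [Record(True, True, 120), Record(False, True, 340)]
    print(summarize(demo))
```

Tracking format compliance separately from correctness distinguishes a model that reasons poorly from one that reasons well but ignores the requested answer format.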

Installation

From PyPI

pip install beyondbench

With Optional Dependencies

# All API clients (OpenAI, Gemini, Anthropic)
# Quoting the requirement keeps shells such as zsh from expanding the brackets
pip install "beyondbench[all-apis]"

# vLLM support (requires CUDA)
pip install "beyondbench[vllm]"

# Everything (all APIs + vLLM + dev tools + visualization)
pip install "beyondbench[full]"

From Source

git clone https://github.com/ctrl-gaurav/BeyondBench.git
cd BeyondBench
pip install -e .

Quick Start

Interactive Wizard

beyondbench

Command Line

# Evaluate GPT-4o on the easy suite
beyondbench evaluate --model-id gpt-4o --api-provider openai --suite easy

# Evaluate a local model with vLLM
beyondbench evaluate --model-id meta-llama/Llama-3.2-3B-Instruct --backend vllm --suite all

# Evaluate Claude on hard tasks
beyondbench evaluate --model-id claude-sonnet-4-20250514 --api-provider anthropic --suite hard

# List all available tasks
beyondbench list-tasks

Python API

import os

from beyondbench import EvaluationEngine, ModelHandler, TaskRegistry

# Initialize model handler; read the key from the environment
# (see Environment Variables) rather than hard-coding it
model = ModelHandler(
    model_id="gpt-4o",
    api_provider="openai",
    api_key=os.environ["OPENAI_API_KEY"],
)

# Run evaluation on the easy suite with 100 datapoints
engine = EvaluationEngine(model_handler=model, output_dir="./results")
results = engine.run_evaluation(suite="easy", datapoints=100)

# Print the average accuracy as a percentage
print(f"Average Accuracy: {results['summary']['avg_accuracy']:.2%}")

New in v0.2.0

79 Reasoning Tasks: 44 easy + 15 medium (49 variations) + 20 hard (68 variations)
Multi-GPU Parallel Evaluation: Automatic batch tuning and tensor parallelism
Plugin SDK: Create and share custom tasks with beyondbench plugin scaffold
Gradio Dashboard: Real-time evaluation monitoring with --dashboard flag
Response Caching: Skip redundant API calls across runs
Universal Parser: Unified parsing engine with confidence scoring
1000+ Tests: Comprehensive unit, integration, and end-to-end test coverage
API Server: FastAPI REST API with WebSocket support, started with beyondbench serve
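
As an illustration of the response-caching idea listed above, the following sketch stores each response on disk under a hash of the model, prompt, and sampling parameters. It is not the package's cache: `cache_key` and `ResponseCache` are hypothetical names, and the one-JSON-file-per-entry layout is an assumption chosen for brevity.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


def cache_key(model_id: str, prompt: str, params: dict) -> str:
    """Stable key: identical model, prompt, and parameters hash identically."""
    payload = json.dumps(
        {"model": model_id, "prompt": prompt, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    """Hypothetical on-disk cache with one JSON file per cached response."""

    def __init__(self, directory: Path) -> None:
        self.directory = directory
        self.directory.mkdir(parents=True, exist_ok=True)

    def get_or_call(
        self, model_id: str, prompt: str, params: dict, call: Callable[[str], str]
    ) -> str:
        path = self.directory / f"{cache_key(model_id, prompt, params)}.json"
        if path.exists():  # cache hit: skip the API call entirely
            return json.loads(path.read_text(encoding="utf-8"))["response"]
        response = call(prompt)
        path.write_text(json.dumps({"response": response}), encoding="utf-8")
        return response


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = ResponseCache(Path(tmp))
        print(cache.get_or_call("demo-model", "2 + 2 = ?", {}, lambda p: "4"))
```

Because problems are regenerated on each run, a cache of this kind only produces hits when a run is repeated with the same generated prompts, for example when resuming an interrupted evaluation.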

Supported Backends

Backend	Models	Features
OpenAI	GPT-4o, GPT-4o-mini, GPT-5, GPT-5-mini	Reasoning effort control
Gemini	Gemini 2.5 Pro, Gemini 2.5 Flash	Thinking budget configuration
Anthropic	Claude Sonnet 4, Claude Opus 4	Latest Claude models
vLLM	Any HuggingFace model	Batch processing, tensor parallelism
Transformers	Any HuggingFace model	CPU/GPU inference

Task Suites

Easy Suite (44 Tasks)

Arithmetic: sum, multiplication, subtraction, division, absolute_difference
Statistics: mean, median, mode
Counting: odd_count, even_count, count_negative, count_unique, and more
Extrema: find_maximum, find_minimum, second_maximum, range, and more
Sequences: sorting, longest_increasing_subsequence, alternating_sum, sum_of_digits
Comparison
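
For a sense of what ground-truth functions for a few of these tasks might look like, here is a small sketch. The function names mirror the task names above, but the exact definitions are assumptions (for example, whether second_maximum counts duplicates, or which sign alternating_sum starts with); BeyondBench's own definitions may differ.

```python
from bisect import bisect_left


def second_maximum(values: list[int]) -> int:
    """Second-largest distinct value (assumed definition)."""
    distinct = sorted(set(values), reverse=True)
    if len(distinct) < 2:
        raise ValueError("need at least two distinct values")
    return distinct[1]


def alternating_sum(values: list[int]) -> int:
    """v0 - v1 + v2 - ... (assumed sign convention)."""
    return sum(v if i % 2 == 0 else -v for i, v in enumerate(values))


def longest_increasing_subsequence(values: list[int]) -> int:
    """Length of the longest strictly increasing subsequence, O(n log n)."""
    tails: list[int] = []  # tails[k] = smallest tail of a subsequence of length k+1
    for v in values:
        pos = bisect_left(tails, v)
        if pos == len(tails):
            tails.append(v)
        else:
            tails[pos] = v
    return len(tails)


if __name__ == "__main__":
    data = [10, 9, 2, 5, 3, 7, 101, 18]
    print(second_maximum(data), alternating_sum(data),
          longest_increasing_subsequence(data))
```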

Medium Suite (15 Tasks, 49 Variations)

Fibonacci Sequence (6 variations), Algebraic Sequence (10), Geometric Sequence (10), Prime Sequence (11), Complex Pattern (12)
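
A sequence task of this kind can be sketched as "show the first terms, ask for the next one". The example below is an assumption-laden illustration of a single Fibonacci-style variation, not one of the package's generators: `fibonacci_like` and `make_next_term_problem` are hypothetical names, and the starting-value range and prompt wording are invented for the sketch.

```python
import random


def fibonacci_like(first: int, second: int, length: int) -> list[int]:
    """Sequence where each term is the sum of the two preceding terms."""
    terms = [first, second]
    while len(terms) < length:
        terms.append(terms[-1] + terms[-2])
    return terms[:length]


def make_next_term_problem(rng: random.Random, shown: int = 6) -> tuple[str, int]:
    """Return (prompt, answer): show `shown` terms and hide the one after."""
    first, second = rng.randint(1, 9), rng.randint(1, 9)
    terms = fibonacci_like(first, second, shown + 1)
    prompt = "What is the next term? " + ", ".join(map(str, terms[:shown]))
    return prompt, terms[shown]


if __name__ == "__main__":
    prompt, answer = make_next_term_problem(random.Random(0))
    print(prompt, "->", answer)
```

Randomizing the starting values means the familiar 1, 1, 2, 3, 5 sequence is only one of many possible instances, so recalling it does not help.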

Hard Suite (20 Tasks, 68 Variations)

Tower of Hanoi, N-Queens, Graph Coloring, Boolean SAT, Sudoku, Cryptarithmetic, Matrix Chain, Modular Systems, Constraint Optimization, Logic Grid Puzzles
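
Several of these puzzles admit many correct answers, so a response is more naturally checked against the constraints than compared with one stored solution. The sketch below does this for N-Queens; it is an illustration rather than BeyondBench's checker, and both the name `is_valid_n_queens` and the answer encoding (one column index per row) are assumptions.

```python
def is_valid_n_queens(columns: list[int]) -> bool:
    """columns[r] is the column of the queen in row r (assumed encoding)."""
    n = len(columns)
    if n == 0 or any(not 0 <= c < n for c in columns):
        return False
    if len(set(columns)) != n:  # two queens share a column
        return False
    # Queens share a diagonal exactly when r - c or r + c collide.
    if len({r - c for r, c in enumerate(columns)}) != n:
        return False
    return len({r + c for r, c in enumerate(columns)}) == n


if __name__ == "__main__":
    print(is_valid_n_queens([1, 3, 0, 2]))  # a valid 4-queens placement
    print(is_valid_n_queens([0, 1, 2, 3]))  # all queens on one diagonal
```

One queen per row is guaranteed by the encoding itself, so only columns and the two diagonal directions need explicit checks.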

Leaderboard (Top 5)

Rank	Model	Overall	Instruction Following
1	GPT-5*	83.56%	96.15%
2	GPT-5-Nano*	82.04%	93.58%
3	GPT-5-Mini*	81.67%	94.23%
4	o3*	80.36%	94.96%
5	o4-Mini*	79.04%	95.30%

* Model uses reasoning/thinking tokens. Full results for 101+ models are available on the leaderboard.

Environment Variables

export OPENAI_API_KEY="sk-..."
export GEMINI_API_KEY="..."
export ANTHROPIC_API_KEY="sk-ant-..."
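
The text above does not say whether the package reads these variables automatically, so the following is a hedged sketch of resolving a key explicitly before passing it as `api_key`. The variable names come from the exports above; the mapping and the `resolve_api_key` helper are hypothetical and not part of BeyondBench.

```python
import os
from typing import Mapping

# Variable names are taken from the exports above; the mapping is illustrative.
ENV_VAR_BY_PROVIDER = {
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def resolve_api_key(provider: str, environ: Mapping[str, str] = os.environ) -> str:
    """Look up the key for a provider and fail early with a clear message."""
    try:
        var = ENV_VAR_BY_PROVIDER[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None
    key = environ.get(var)
    if not key:
        raise RuntimeError(f"{var} is not set")
    return key


if __name__ == "__main__":
    try:
        resolve_api_key("openai")
        print("OPENAI_API_KEY found")
    except RuntimeError as exc:
        print(exc)
```

Failing before the evaluation starts avoids discovering a missing key partway through a long run.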

Citation

If you use BeyondBench in your research, please cite our paper (accepted at ICLR 2026):

@misc{srivastava2025beyondbenchbenchmarkfreeevaluationreasoning,
      title={BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models},
      author={Gaurav Srivastava and Aafiya Hussain and Zhenyu Bi and Swastik Roy and Priya Pitre and Meng Lu and Morteza Ziyadi and Xuan Wang},
      year={2025},
      eprint={2509.24210},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.24210},
}

Release history

0.2.1: Apr 17, 2026
0.2.0: Apr 16, 2026 (this version)
0.1.0: Mar 6, 2026
0.0.2: Feb 26, 2026
0.0.1: Feb 25, 2026

Download files

Source Distribution

beyondbench-0.2.0.tar.gz (473.4 kB)
Upload date: Apr 16, 2026
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.12

Algorithm	Hash digest
SHA256	`ebd340fb5b27857f27202e13a00a4d81bfcb9b9669b9f546af214d2f32135aad`
MD5	`b2f944c31e8fe071ed5ea2a621ab0b12`
BLAKE2b-256	`9b48d3abddde75fc18e0f91898374d15592c2bb91b402a4c2779b9123bb8bd1e`

Built Distribution

beyondbench-0.2.0-py3-none-any.whl (535.7 kB)
Upload date: Apr 16, 2026
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.12

Algorithm	Hash digest
SHA256	`71f7fa91a0817c43cfedd536ffbbd98869d80821d2ff1e77cce87bc83296b68d`
MD5	`c686177ee7cc05d62fced5797abcabae`
BLAKE2b-256	`5de0691e603c31387e3bf4ba4b580939797c1961e4f5550f6d7193d02e6eb9b7`
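
A downloaded file can be checked against the SHA256 digests listed above with the standard library alone. This is a generic sketch rather than anything shipped with the package; `sha256_of` and `matches_published_hash` are hypothetical helper names.

```python
import hashlib
import sys
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare against a published digest, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Usage: python verify.py <downloaded-file> <expected-sha256>
    if len(sys.argv) == 3:
        ok = matches_published_hash(Path(sys.argv[1]), sys.argv[2])
        print("OK" if ok else "MISMATCH")
```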
