GSM-Infinite Benchmark for LLMs
GSM-Infinite
Infinitely Scalable Long-context Reasoning Benchmark for Large Language Models
*Equal Contributions | 1Carnegie Mellon University | 2Meta AI
Limitations of Existing Long-context Benchmarks
- Lack of reasoning complexity: Most tasks rely on text retrieval, text summarization, or QA.
- Lack of context length: Some tasks are inherently short-context but are inflated to long-context by injecting semantically irrelevant noise.
- Lack of scalability: Tasks with high reasoning complexity and high information density do exist, but they require substantial human effort to gather, deduplicate, and verify. As a result, they do not scale in quantity, which makes it hard for them to gain traction in the community.
Problem Statement: How can we develop a benchmark that contains sufficient problems at every fine-grained level of reasoning difficulty, from easy retrieval tasks to infinitely hard challenges, while providing infinitely customizable context length with high information density?
Overview
GSM-Infinite is a completely synthetic reasoning benchmark that generates problems with infinitely scalable context length and reasoning complexity. Unlike existing benchmarks that rely on text retrieval or summarization, GSM-Infinite creates high information density tasks that can only be solved by long-context LLMs, not by RAG systems.
Key Features
- Infinitely Scalable: Generate problems of any context length and reasoning complexity
- High Information Density: Every token matters - RAG systems cannot solve these problems
- Three Difficulty Levels: Symbolic, Medium, and Hard subsets
- Comprehensive Evaluation: Built-in evaluation scripts and leaderboards
- Synthetic Generation: No LLMs in the loop, ensuring unbiased benchmarks
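To make "synthetic generation with scalable reasoning complexity" concrete, here is a minimal toy sketch in Python. It is not the benchmark's actual generator: the function names `generate_problem` and `solve`, the `V0`, `V1`, ... variable format, and the restriction to addition and subtraction are all assumptions made for illustration. It shows only the general idea that a program can emit a problem with a chosen number of reasoning steps and compute the ground-truth answer without an LLM.

```python
import random


def generate_problem(num_ops, seed=0):
    """Build a toy chain of assignments with `num_ops` arithmetic steps.

    Hypothetical format for illustration only, not the benchmark's own.
    Returns (context_lines, question, answer).
    """
    rng = random.Random(seed)
    values = {"V0": rng.randint(1, 9)}
    lines = [f"V0 = {values['V0']}"]
    for i in range(1, num_ops + 1):
        prev = f"V{i - 1}"
        op = rng.choice(["+", "-"])
        k = rng.randint(1, 9)
        values[f"V{i}"] = values[prev] + k if op == "+" else values[prev] - k
        lines.append(f"V{i} = {prev} {op} {k}")
    rng.shuffle(lines)  # line order gives no hint about the dependency chain
    return lines, f"What is V{num_ops}?", values[f"V{num_ops}"]


def solve(lines, target):
    """Recompute the answer by walking the dependency chain (no LLM needed)."""
    exprs = dict(line.split(" = ") for line in lines)
    pending = []
    name = target
    while True:
        parts = exprs[name].split()
        if len(parts) == 1:  # reached the constant at the start of the chain
            total = int(parts[0])
            break
        name, op, k = parts
        pending.append((op, int(k)))
    for op, k in reversed(pending):
        total = total + k if op == "+" else total - k
    return total


if __name__ == "__main__":
    lines, question, answer = generate_problem(num_ops=5, seed=42)
    print("\n".join(lines))
    print(question, "->", answer)
    assert solve(lines, "V5") == answer
```

In this toy setting, raising `num_ops` lengthens both the context and the reasoning chain, and the answer can always be checked programmatically.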
Why GSM-Infinite?
RAG systems fail on GSM-Infinite due to high information density
Traditional long-context benchmarks can often be solved by RAG systems, making them insufficient for evaluating true long-context reasoning. GSM-Infinite addresses this by:
- High Information Density: Every part of the context is essential
- Reasoning Complexity: Requires multi-step mathematical reasoning
- Infinite Scalability: Generate unlimited test cases at any difficulty
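The following toy Python sketch illustrates why retrieving only a few passages can fall short when every part of the context is needed. Everything in it is hypothetical: the `facts` chain, the `needed_facts` helper, and the "retriever" that keeps only the `k` facts nearest the question are illustrative assumptions, not the benchmark's data format or any real RAG system.

```python
def needed_facts(facts, target):
    """Return every variable that `target` transitively depends on."""
    needed, stack = set(), [target]
    while stack:
        name = stack.pop()
        if name in needed:
            continue
        needed.add(name)
        # Follow references to other variables mentioned in this fact.
        stack.extend(tok for tok in facts[name].split() if tok in facts)
    return needed


# A toy dependency chain: each fact references the previous one.
facts = {"V0": "4"}
for i in range(1, 8):
    facts[f"V{i}"] = f"V{i - 1} + {i}"

required = needed_facts(facts, "V7")

# Hypothetical retriever: keep only the k facts "closest" to the question,
# modeled here as the last k variables in the chain.
k = 3
retrieved = set(sorted(facts, key=lambda name: int(name[1:]))[-k:])

missing = required - retrieved
print(f"required={len(required)} retrieved={len(retrieved)} missing={len(missing)}")
# prints: required=8 retrieved=3 missing=5
```

Because the question depends on the whole chain, any retrieval budget smaller than the full context leaves out facts the answer needs in this toy example.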
Quick Start
Installation
# Clone the repository
git clone https://github.com/Infini-AI-Lab/gsm_infinite.git
cd gsm_infinite
# Install dependencies
pip install -r requirements.txt
# or
pip install -e .
Basic Usage
- Configure your setup by editing gsm-infinite/config.sh:
  # Set your API configuration
  backend_type='openai' # or 'gemini', 'anthropic'
  SAMPLER_OPENAI_BASE_URL='your_api_url'
  SAMPLER_OPENAI_API_KEY='your_api_key'
  # Configure model and dataset
  model_name='your_model_name'
  save_name='your_save_name'
- Run evaluation:
  cd gsm-infinite
  bash run.sh
  Results are stored in gsm-infinite/results
- View results with the interactive dashboard:
  streamlit run app.py
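Before running the evaluation, it can be useful to confirm that the placeholder values in config.sh have been replaced. The sketch below is a hypothetical helper, not part of the package: `parse_config` and `unset_keys` are names invented here, and the parser only handles simple `KEY='value'` lines like those shown above, not general shell syntax.

```python
import re

# Placeholder values copied from the config template shown above.
PLACEHOLDERS = {"your_api_url", "your_api_key", "your_model_name", "your_save_name"}

# Matches KEY=value, KEY='value', or KEY="value", with an optional trailing comment.
ASSIGNMENT = re.compile(r"^\s*(\w+)=(['\"]?)(.*?)\2\s*(?:#.*)?$")


def parse_config(text):
    """Parse simple shell-style assignments, skipping comment lines."""
    settings = {}
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue
        match = ASSIGNMENT.match(line)
        if match:
            settings[match.group(1)] = match.group(3)
    return settings


def unset_keys(settings):
    """Return keys whose value is empty or still a template placeholder."""
    return sorted(k for k, v in settings.items() if v == "" or v in PLACEHOLDERS)


if __name__ == "__main__":
    sample = """\
# Set your API configuration
backend_type='openai' # or 'gemini', 'anthropic'
SAMPLER_OPENAI_BASE_URL='your_api_url'
SAMPLER_OPENAI_API_KEY='your_api_key'
# Configure model and dataset
model_name='your_model_name'
save_name='your_save_name'
"""
    print("Still unset:", unset_keys(parse_config(sample)))
```

To check a real file, the same functions could be applied to the contents of gsm-infinite/config.sh read with `open(...).read()`.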
Project Structure
gsm_infinite/
├── gsm-infinite/          # Main package
│   ├── app.py             # Streamlit results viewer
│   ├── config.sh          # Configuration file
│   ├── run.sh             # Main execution script
│   ├── preprocess.py      # Data preprocessing
│   ├── data/              # Data generation modules
│   │   ├── symbolic/      # Symbolic dataset generation
│   │   └── realistic/     # Medium/Hard dataset generation
│   └── pred/              # Prediction and evaluation scripts
├── docs/                  # Detailed documentation
├── static/                # Web assets and images
├── requirements.txt       # Python dependencies
└── pyproject.toml         # Package configuration
Dataset Information
GSM-Infinite provides three types of datasets:
| Dataset | Description | Context Length |
|---|---|---|
| Symbolic | Abstract mathematical operations | 0-32K+ tokens |
| Medium | Realistic problems with implicit relationships among at most 2 entities | 0-32K+ tokens |
| Hard | Realistic problems with implicit relationships among at most 3 entities | 0-32K+ tokens |
Documentation
For detailed information, please refer to our comprehensive documentation:
- Installation Guide - Detailed setup instructions
- Usage Guide - Complete usage examples
- Leaderboards - Current model rankings
Results
Our benchmark reveals significant differences in long-context reasoning capabilities across models. For the latest results and the complete analysis, see our paper and leaderboards.
Citation
If you use GSM-Infinite in your research, please cite our paper:
@misc{zhou2025gsminfinitellmsbehaveinfinitely,
title={GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?},
author={Yang Zhou and Hongyi Liu and Zhuoming Chen and Yuandong Tian and Beidi Chen},
year={2025},
eprint={2502.05252},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.05252},
}
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Contact: yangzho6@andrew.cmu.edu