GSM-Infinite Benchmark for LLMs

GSM-Infinite

Infinitely Scalable Long-context Reasoning Benchmark for Large Language Models

Yang Zhou*1, Hongyi Liu*1, Zhuoming Chen1, Yuandong Tian2, Beidi Chen1
*Equal Contributions | 1Carnegie Mellon University | 2Meta AI

Limitations of Existing Long-context Benchmarks

RAG can robustly solve most of today's popular long-context benchmarks
In this paper, we first point out the shortcomings of current long-context LLM evaluation, highlighting:
  1. Lack of reasoning complexity: Most tasks rely on text retrieval, text summarization, or QA.
  2. Lack of context length: Some tasks are inherently short-context tasks that are inflated to long context by injecting semantically irrelevant noise.
  3. Lack of scalability: Tasks with high reasoning complexity and high information density do exist, but they require substantial human effort to gather, deduplicate, and verify. As a result, they cannot scale in quantity, which makes them hard to adopt widely in the community.
The first two points are examined further in the figure above. These tasks are not ones that only long-context LLMs can solve: we show that RAG systems are robust and perform on par with long-context LLMs. Moreover, because RAG systems are efficient to build and to run inference on, RAG is the more favorable choice in practice for these tasks. This leaves us with the following problem to solve.

Problem Statement: How can we develop a benchmark that contains sufficient problems at every fine-grained level of reasoning difficulty, from easy retrieval tasks to infinitely hard challenges, while providing infinitely customizable context length with high information density?

Overview

GSM-Infinite is a completely synthetic reasoning benchmark that generates problems with infinitely scalable context length and reasoning complexity. Unlike existing benchmarks that rely on text retrieval or summarization, GSM-Infinite creates high information density tasks that can only be solved by long-context LLMs, not by RAG systems.

Key Features

  • Infinitely Scalable: Generate problems of any context length and reasoning complexity
  • High Information Density: Every token matters - RAG systems cannot solve these problems
  • Three Difficulty Levels: Symbolic, Medium, and Hard subsets
  • Comprehensive Evaluation: Built-in evaluation scripts and leaderboards
  • Synthetic Generation: No LLMs in the loop, ensuring unbiased benchmarks

Why GSM-Infinite?

RAG systems fail on GSM-Infinite due to high information density

Traditional long-context benchmarks can often be solved by RAG systems, making them insufficient for evaluating true long-context reasoning. GSM-Infinite addresses this by:

  1. High Information Density: Every part of the context is essential
  2. Reasoning Complexity: Requires multi-step mathematical reasoning
  3. Infinite Scalability: Generate unlimited test cases at any difficulty
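
To make the information-density argument concrete, here is a toy sketch in Python. It is not the GSM-Infinite generator, and every name in it (`make_chain_problem`, `solve`, `retrieve_top_k`) is hypothetical. It builds a chain of dependent assignments in which each variable needs the previous one, then shows that a naive retriever keeping only `k` statements cannot recover the answer, because dropping any single statement breaks the chain.

```python
import random


def make_chain_problem(num_ops, seed=0):
    """Toy problem: v0 is a constant and each v_i depends on v_{i-1}."""
    rng = random.Random(seed)
    statements = [f"v0 = {rng.randint(1, 9)}"]
    for i in range(1, num_ops + 1):
        op = rng.choice(["+", "-"])
        statements.append(f"v{i} = v{i - 1} {op} {rng.randint(1, 9)}")
    rng.shuffle(statements)  # position in the context gives no hint
    return statements, f"v{num_ops}"


def solve(statements, target):
    """Evaluate assignments until no progress; None if target is underivable."""
    values = {}
    pending = list(statements)
    progress = True
    while pending and progress:
        progress = False
        for stmt in list(pending):
            name, expr = stmt.split(" = ")
            tokens = expr.split()
            if len(tokens) == 1:            # constant, e.g. "v0 = 7"
                values[name] = int(tokens[0])
            elif tokens[0] in values:       # e.g. "v3 = v2 + 5" with v2 known
                left, right = values[tokens[0]], int(tokens[2])
                values[name] = left + right if tokens[1] == "+" else left - right
            else:
                continue                    # dependency not yet available
            pending.remove(stmt)
            progress = True
    return values.get(target)


def retrieve_top_k(statements, query, k):
    """Naive retriever: rank statements by tokens shared with the query."""
    query_tokens = set(query.split())
    ranked = sorted(
        statements,
        key=lambda s: len(query_tokens & set(s.split())),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    statements, target = make_chain_problem(num_ops=10, seed=1)
    print("full context:", solve(statements, target))
    print("top-5 only:  ", solve(retrieve_top_k(statements, target, 5), target))
```

In this toy setting the number of operations (`num_ops`) controls both reasoning depth and context length, which loosely mirrors the two axes the benchmark scales. The real datasets are considerably richer than a single linear chain.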

Quick Start

Installation

# Clone the repository
git clone https://github.com/Infini-AI-Lab/gsm_infinite.git
cd gsm_infinite

# Install dependencies
pip install -r requirements.txt

# Alternatively, install the package in editable mode
pip install -e .
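
If you chose the editable install, a quick way to confirm that it registered is to query the installed distribution metadata. This is a minimal sketch: the helper `installed_version` is hypothetical, and the distribution name `gsm_infinite` is an assumption based on the repository name, so adjust it if your environment lists the package differently.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="gsm_infinite"):
    """Return the installed version string, or None if it is not installed."""
    # "gsm_infinite" is an assumed distribution name, not confirmed by the README.
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print("installed version:", found if found else "not installed")
```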

Basic Usage

  1. Configure your setup by editing gsm-infinite/config.sh:

    # Set your API configuration
    backend_type='openai'  # or 'gemini', 'anthropic'
    SAMPLER_OPENAI_BASE_URL='your_api_url'
    SAMPLER_OPENAI_API_KEY='your_api_key'
    
    # Configure model and dataset
    model_name='your_model_name'
    save_name='your_save_name'
    
  2. Run evaluation:

    cd gsm-infinite
    bash run.sh
    

Results are stored in gsm-infinite/results

  3. View results with the interactive dashboard:
    streamlit run app.py
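
Before running the evaluation in step 2, it can help to check that the placeholder values from step 1 have actually been replaced. The sketch below is not part of the repository: the helper names are hypothetical, and it only understands simple `KEY='value'` lines like the ones shown above, so any other shell constructs in `config.sh` are ignored.

```python
import re

# Placeholder strings copied from the configuration example above.
PLACEHOLDERS = {"your_api_url", "your_api_key", "your_model_name", "your_save_name"}

# Matches KEY='value', KEY="value", or KEY=value, with an optional trailing comment.
ASSIGNMENT = re.compile(
    r"""^\s*([A-Za-z_][A-Za-z0-9_]*)=(['"]?)(.*?)\2\s*(?:#.*)?$"""
)


def parse_shell_assignments(text):
    """Collect simple variable assignments; skip comments and other lines."""
    settings = {}
    for line in text.splitlines():
        match = ASSIGNMENT.match(line)
        if match:
            settings[match.group(1)] = match.group(3)
    return settings


def unfilled_placeholders(settings):
    """Return the sorted names of variables still set to a placeholder."""
    return sorted(k for k, v in settings.items() if v in PLACEHOLDERS)


if __name__ == "__main__":
    # Path taken from step 1; run this from the repository root.
    with open("gsm-infinite/config.sh", encoding="utf-8") as handle:
        missing = unfilled_placeholders(parse_shell_assignments(handle.read()))
    print("still unset:", missing if missing else "none")
```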
    

Project Structure

gsm_infinite/
├── gsm-infinite/          # Main package
│   ├── app.py             # Streamlit results viewer
│   ├── config.sh          # Configuration file
│   ├── run.sh             # Main execution script
│   ├── preprocess.py      # Data preprocessing
│   ├── data/              # Data generation modules
│   │   ├── symbolic/      # Symbolic dataset generation
│   │   └── realistic/     # Medium/Hard dataset generation
│   └── pred/              # Prediction and evaluation scripts
├── docs/                  # Detailed documentation
├── static/                # Web assets and images
├── requirements.txt       # Python dependencies
└── pyproject.toml         # Package configuration

Dataset Information

GSM-Infinite provides three types of datasets:

| Dataset  | Description                                                    | Context Length |
|----------|----------------------------------------------------------------|----------------|
| Symbolic | Abstract mathematical operations                               | 0-32K+ tokens  |
| Medium   | Realistic problems with at most 2-entity implicit relationship | 0-32K+ tokens  |
| Hard     | Realistic problems with at most 3-entity implicit relationship | 0-32K+ tokens  |

Documentation

For detailed information, please refer to the documentation in the docs/ directory of the repository.

Results

Our benchmark reveals significant differences in long-context reasoning capabilities across models. For the latest results, see our leaderboard; for complete results and analysis, see our paper.

Citation

If you use GSM-Infinite in your research, please cite our paper:

@misc{zhou2025gsminfinitellmsbehaveinfinitely,
    title={GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?}, 
    author={Yang Zhou and Hongyi Liu and Zhuoming Chen and Yuandong Tian and Beidi Chen},
    year={2025},
    eprint={2502.05252},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2502.05252}, 
}

Made with ❤️ by the Infini-AI Lab team
