GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents

GSO (Global Software Optimization) is a benchmark for evaluating language models' capabilities in developing high-performance software. We present 100+ challenging optimization tasks across 10 codebases spanning diverse domains and programming languages. Each task provides a codebase and a performance test that serves as a precise specification; agents must optimize the codebase, and their results are measured against expert developer commits.

📰 News

  • [Feb 2, 2026]: Released integrations with frameworks like Harbor and other scaffolds: gso-bench/scaffolds.
  • [Dec 23, 2025]: Released evaluation logs and transcripts with Docent support: gso-bench/gso-experiments.
  • [Nov 3, 2025]: Released GSO's HackDetector, which catches reward hacking by models: GSO Blog.
  • [May 30, 2025]: 🤗 GSO dataset is now available on HuggingFace! Access it at gso-bench/gso.
  • [May 30, 2025]: Prebuilt docker images for GSO tasks are now available on Docker Hub.
  • [May 30, 2025]: Initial release of the GSO benchmark: gso-bench.github.io

👋 Overview

GSO evaluates language models on software performance optimization. Each task provides:

  • A codebase with a specific performance bottleneck
  • A performance test as a precise specification

The agent must generate a patch that improves runtime efficiency. Success is measured against expert developer optimizations.
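
As a rough illustration of what "measured against expert developer optimizations" can mean, the sketch below compares the speedup of an agent's patch with the speedup of the expert commit on the same performance test. The function names, the timing inputs, and the 0.95 threshold are assumptions made for this example; they are not taken from the GSO harness, whose actual scoring rule is described in its documentation.

```python
def speedup(baseline_seconds: float, optimized_seconds: float) -> float:
    """Return how many times faster the optimized run is than the baseline."""
    if baseline_seconds <= 0 or optimized_seconds <= 0:
        raise ValueError("runtimes must be positive")
    return baseline_seconds / optimized_seconds


def matches_expert(baseline_seconds: float,
                   patch_seconds: float,
                   expert_seconds: float,
                   threshold: float = 0.95) -> bool:
    """Hypothetical check: does the patch reach `threshold` of the expert's speedup?

    The 0.95 default is an assumed value for illustration only.
    """
    patch_gain = speedup(baseline_seconds, patch_seconds)
    expert_gain = speedup(baseline_seconds, expert_seconds)
    return patch_gain >= threshold * expert_gain


if __name__ == "__main__":
    # Made-up timings: baseline 10 s, agent patch 4 s, expert commit 2.5 s.
    print(speedup(10.0, 4.0))
    print(matches_expert(10.0, 4.0, 2.5))
```

With these made-up timings the patch is faster than the baseline but still falls short of the expert commit, which is the gap a comparison like this is meant to expose.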

To load the GSO dataset from HuggingFace, run:

from datasets import load_dataset
gso = load_dataset('gso-bench/gso', split='test')

🚀 Setup

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

git clone --recursive https://github.com/gso-bench/gso.git
cd gso && uv venv && source .venv/bin/activate
uv sync

(Additional) Set up your HuggingFace token:

export HF_TOKEN="huggingface_token"
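
If you drive GSO from a Python script, it can help to confirm the token is visible before starting a long run. The helper below is a minimal sketch; `hf_token_is_set` is a hypothetical name, not part of the GSO package, and it only checks that the `HF_TOKEN` variable exported above is present and non-empty.

```python
import os


def hf_token_is_set(environ=None) -> bool:
    """Return True if HF_TOKEN is present and non-empty in the environment."""
    env = os.environ if environ is None else environ
    return bool(env.get("HF_TOKEN", "").strip())


if __name__ == "__main__":
    if not hf_token_is_set():
        print("HF_TOKEN is not set; export it as shown above.")
```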

💽 Usage

Evaluation Harness

  1. Building Docker images for GSO tasks:
docker login

uv run src/gso/harness/prepare_images.py \
    --push_to_registry True \
    --dockerhub_username <dockerhub_username> \
    --dockerhub_repo <dockerhub_repo>
  2. Running Evaluations:
uv run src/gso/harness/opt_at_k.py \
    --prediction_paths <prediction_path> \
    --timeout 3600 \
    --run_id <run_id> \
    --k 1 \
    --model <modelname>

For detailed instructions and options, see the Harness documentation.
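
To script several evaluation runs, the command above can be assembled in Python. This is a sketch under stated assumptions: `build_eval_command` and `run_eval` are hypothetical helpers, not part of the GSO package, and they pass only the flags shown above. The file name `preds.jsonl` in the usage lines is a placeholder.

```python
import subprocess


def build_eval_command(prediction_path, run_id, model, k=1, timeout=3600):
    """Assemble the opt_at_k.py invocation shown above as an argument list."""
    return [
        "uv", "run", "src/gso/harness/opt_at_k.py",
        "--prediction_paths", str(prediction_path),
        "--timeout", str(timeout),
        "--run_id", str(run_id),
        "--k", str(k),
        "--model", str(model),
    ]


def run_eval(prediction_path, run_id, model, **kwargs):
    """Run the harness from the repository root; raises if it exits non-zero."""
    cmd = build_eval_command(prediction_path, run_id, model, **kwargs)
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder arguments; print the command instead of running it.
    print(" ".join(build_eval_command("preds.jsonl", "demo", "my-model")))
```

Passing an argument list to `subprocess.run` avoids shell quoting problems when paths or model names contain spaces.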

GSO Collection Framework

The collection framework enables you to create your own GSO tasks through a four-step pipeline:

  1. Commit Extraction & Filtering: Extract performance-related commits using LLMs
  2. API Identification: Identify affected high-level APIs for each commit
  3. Performance Test Generation: Generate tests for API-Commit pairs
  4. Test Execution: Execute tests to identify performance improvements

For detailed instructions and usage, see the Collection Framework documentation.
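
The first and last steps of the pipeline can be sketched in a few lines. This is not the collection framework's code: every name below is hypothetical, a keyword match stands in for the LLM-based commit filter, and the 1.2x speedup cutoff is an assumed value. Steps 2 and 3 are omitted because they depend on the framework's own tooling.

```python
from dataclasses import dataclass
import time

# Assumed keyword heuristic; the real pipeline uses LLMs for this step.
PERF_KEYWORDS = ("speed up", "faster", "optimiz", "performance")


@dataclass
class Commit:
    sha: str
    message: str


def filter_perf_commits(commits):
    """Step 1 stand-in: keep commits whose message mentions performance."""
    return [c for c in commits
            if any(k in c.message.lower() for k in PERF_KEYWORDS)]


def time_call(fn, repeats=5):
    """Best-of-N wall-clock time, in seconds, of a zero-argument test function."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def improves(before_fn, after_fn, min_speedup=1.2):
    """Step 4 stand-in: is the test at least `min_speedup` times faster after the commit?"""
    before = time_call(before_fn)
    after = max(time_call(after_fn), 1e-12)  # guard against a zero reading
    return before / after >= min_speedup
```

Taking the best of several repeats reduces the effect of scheduling noise on short tests.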

⬇️ Artifacts

  • Datasets: 💿 GSO
  • Tools: 🔧 Evaluation Harness, 🔧 Collection Framework
  • Dockers: 🐳 Docker Hub

💫 Contributions

We welcome contributions from the broader NLP, Machine Learning, and Software Engineering research communities! Please file a new pull request or issue and fill in the corresponding templates accordingly.

✍️ Citation & license

GSO is released under the MIT license. See the LICENSE file for details.
