
Gymnasium environments, workloads, and scheduling baselines for realistic GPU cluster research.


GPU ChooChoo

Keep your GPUs chugging along!

Overview

GPU ChooChoo is a library of Gymnasium environments for training deep RL agents on GPU cluster scheduling. Its workloads are statistically realistic and designed to generalize across different cluster configurations.

Key Features

1. Realistic Workload Generation

Based on actual ML/AI cluster characteristics:

  • Poisson arrivals with time-varying rate λ(t)

    • Business hours effect (higher load 9am-5pm)
    • Non-homogeneous Poisson process
    • Burst arrivals (researchers submitting job batches)
  • Power-law job sizes (P(k GPUs) ∝ k^(-2.5))

    • Most jobs small (1-2 GPUs): ~75%
    • Few large jobs (8+ GPUs): ~5%
    • Realistic for ML workloads
  • Log-normal durations with size correlation

    • Heavy-tailed distribution
    • Larger jobs run longer (correlation)
    • Range: 1 minute to 48 hours
  • Correlated characteristics

    • Larger jobs prefer newer GPUs (H100 > A100)
    • VRAM scales with job size
    • Realistic GPU type preferences
  • GPU tier awareness

    • Built-in catalog describing V100/A100/H100-style tiers
    • Jobs expose preferred/acceptable GPU type lists and per-GPU VRAM minima
    • Scheduler enforces compatibility automatically
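
A minimal sketch of the arrival and size models above (the function names and parameter values are illustrative, not the library's API):

```python
import random

def sample_arrivals(rate_fn, horizon_h, max_rate, rng):
    """Non-homogeneous Poisson arrivals via Lewis-Shedler thinning."""
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(max_rate)            # candidate inter-arrival
        if t >= horizon_h:
            return arrivals
        if rng.random() < rate_fn(t) / max_rate:  # accept with prob rate(t)/max_rate
            arrivals.append(t)

def business_hours_rate(t_h, base=2.0, peak=6.0):
    """Higher submission rate between 9am and 5pm (jobs/hour)."""
    return peak if 9 <= t_h % 24 < 17 else base

def sample_job_size(rng, alpha=2.5, max_gpus=16):
    """Power-law size: P(k GPUs) proportional to k**-alpha."""
    sizes = list(range(1, max_gpus + 1))
    weights = [k ** -alpha for k in sizes]
    return rng.choices(sizes, weights=weights)[0]

rng = random.Random(42)
arrivals = sample_arrivals(business_hours_rate, 24.0, 6.0, rng)
sizes = [sample_job_size(rng) for _ in arrivals]
```

With α = 2.5, roughly three quarters of sampled jobs request a single GPU, matching the size distribution reported under Workload Statistics.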

2. Multi-Scenario Training

Train on diverse scenarios to learn general policies:

  • Curriculum learning: Easy → Medium → Hard progression
  • Difficulty levels: Based on load factor and cluster size
  • Held-out test scenarios: Evaluate generalization
  • Automatic infeasible job filtering: No hanging on impossible jobs
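
The curriculum option can be pictured as a schedule that shifts sampling mass from easy toward hard scenarios as training progresses; a sketch with purely illustrative weights:

```python
import random

def curriculum_difficulty(progress, rng):
    """Sample a difficulty level; progress runs from 0.0 (start) to 1.0 (end)."""
    weights = {
        "easy":   max(0.0, 1.0 - 2.0 * progress),   # fades out by the midpoint
        "medium": 1.0 - abs(1.0 - 2.0 * progress),  # peaks at the midpoint
        "hard":   max(0.0, 2.0 * progress - 1.0),   # ramps in after the midpoint
    }
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names])[0]

rng = random.Random(0)
# Early training draws easy scenarios, late training draws hard ones.
early, late = curriculum_difficulty(0.0, rng), curriculum_difficulty(1.0, rng)
```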

3. Safety Features

  • Adaptive step limits: Automatically scales with scenario size (num_jobs × 3)
    • Easy scenarios (17 jobs): ~50-100 steps
    • Medium scenarios (95 jobs): ~300 steps
    • Hard scenarios (760 jobs): ~2300 steps
    • Ensures 24-hour traces complete successfully
  • Infeasible job removal: Jobs requiring more GPUs than any node are filtered
  • Proper termination: All episodes complete without hanging
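
Both safety mechanisms are simple to express; a sketch (the dict-based job and node shapes are illustrative, not the library's types):

```python
def adaptive_step_limit(num_jobs, factor=3, minimum=50):
    """Step budget that scales with scenario size (num_jobs x 3)."""
    return max(minimum, num_jobs * factor)

def filter_infeasible(jobs, nodes):
    """Drop jobs requesting more GPUs than the largest single node offers."""
    largest_node = max(node["gpus"] for node in nodes)
    return [job for job in jobs if job["gpus"] <= largest_node]

nodes = [{"gpus": 4}, {"gpus": 8}]
jobs = [{"gpus": 1}, {"gpus": 8}, {"gpus": 16}]   # the 16-GPU job cannot fit
feasible = filter_infeasible(jobs, nodes)
```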

Quick Start

from gym_env.multi_scenario_wrapper import create_diverse_training_env

# Create environment with 30 diverse scenarios
env = create_diverse_training_env(
    num_scenarios=30,
    difficulty_distribution='curriculum',  # or 'balanced', 'easy-heavy', 'hard-heavy'
    seed=42,
    max_steps=200  # Truncate after 200 steps
)

# Standard Gymnasium loop
obs, info = env.reset()
done = False

while not done:
    action = your_policy(obs)  # Your RL agent
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

# Get scenario info
scenario_info = env.get_scenario_info()
print(f"Difficulty: {scenario_info['difficulty']}")
print(f"Load Factor: {scenario_info['load_factor']:.2f}")

Running Tests

Test Workload Generation

python -m pytest gpu_choochoo/tests/test_gpu_tier_preferences.py

Validates:

  • Poisson arrival process (homogeneous & time-varying)
  • Power-law job size distribution
  • Log-normal duration with size correlation
  • GPU type selection logic, tier preferences, and per-GPU VRAM constraints enforced by the scheduler
  • Load factor computation

Test Multi-Scenario Wrapper

python test_multi_scenario_quick.py

Tests:

  • Multi-scenario environment creation
  • Episode execution without hanging
  • Performance across difficulty levels
  • Proper truncation with max_steps

Files

Core Implementation

  • gym_env/gpu_scheduler_env.py - Base Gymnasium environment
  • gym_env/realistic_workload_generator.py - Statistical workload generation
  • gym_env/multi_scenario_wrapper.py - Multi-scenario training wrapper

Tests

  • test_realistic_workloads.py - Validate workload statistics
  • test_multi_scenario_quick.py - Quick integration test
  • test_simple_scenario.py - Debug single scenario

Original Files

  • test_baseline.py - Test EASY Backfilling baseline
  • test_all_baselines.py - Compare FCFS, SJF, EASY Backfilling
  • test_gym_env.py - Basic environment tests
  • example_rl_training.py - Training loop template

Workload Statistics

Job Size Distribution (Power-Law, α=2.5)

1 GPU:    74.4% ####################################
2 GPUs:   14.0% #######
3 GPUs:    5.2% ##
4 GPUs:    2.1% #
5+ GPUs:   4.3% ##

Duration by Job Size (Log-Normal)

Size  | Mean Duration | Median Duration
------|---------------|----------------
1 GPU |   0.77 hours  |   0.28 hours
2 GPU |   1.21 hours  |   0.41 hours
4 GPU |   1.26 hours  |   0.41 hours
8 GPU |   1.55 hours  |   0.58 hours
16 GPU|   1.91 hours  |   0.73 hours
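
A log-normal sampler whose median grows with job size reproduces the shape of this table; the parameter values below are fitted by eye and purely illustrative:

```python
import math
import random

def sample_duration_h(num_gpus, rng, base_median=0.28, size_exp=0.35, sigma=1.3):
    """Heavy-tailed duration in hours; median scales as num_gpus**size_exp."""
    median = base_median * num_gpus ** size_exp   # ~0.28 h for 1 GPU, ~0.73 h for 16
    duration = rng.lognormvariate(math.log(median), sigma)
    return min(max(duration, 1 / 60), 48.0)       # clamp to 1 minute .. 48 hours

rng = random.Random(7)
samples_1 = sorted(sample_duration_h(1, rng) for _ in range(5000))
samples_16 = sorted(sample_duration_h(16, rng) for _ in range(5000))
```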

Difficulty Levels

Easy

  • 2 nodes, 4-8 GPUs per node
  • ~17 jobs over 4 hours
  • Load factor: 0.5-2.0
  • No burst arrivals

Medium

  • 4 nodes, 4-8 GPUs per node
  • ~95 jobs over 8 hours
  • Load factor: 0.7-1.0
  • 2 burst events

Hard

  • 8 nodes, 4-16 GPUs per node
  • ~760 jobs over 24 hours
  • Load factor: 0.8-1.2
  • 5 burst events
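
Load factor here is offered demand relative to capacity over the trace horizon; a sketch of one common definition (the library's exact formula may differ):

```python
def load_factor(jobs, total_gpus, horizon_h):
    """GPU-hours demanded divided by GPU-hours available."""
    demanded = sum(job["gpus"] * job["duration_h"] for job in jobs)
    return demanded / (total_gpus * horizon_h)

# Two nodes x 8 GPUs over 4 hours = 64 GPU-hours of capacity
jobs = [{"gpus": 2, "duration_h": 4.0}, {"gpus": 8, "duration_h": 1.0}]
lf = load_factor(jobs, total_gpus=16, horizon_h=4.0)  # -> 0.25
```

Values above 1.0 mean the trace demands more GPU-hours than the cluster can supply, so queues must form.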

Baseline Performance

Running benchmark_baselines.py gives the following example output:

======================================================================
BASELINE SCHEDULER BENCHMARK ON REALISTIC SCENARIOS
======================================================================

Generating test scenarios...
Created 30 scenarios:
  Easy: 10
  Medium: 10
  Hard: 10

======================================================================
Benchmarking: EASY Backfilling
======================================================================
  Scenario 5: easy   - Util= 18.3%, Jobs=15/15
  Scenario 10: easy   - Util= 15.2%, Jobs=19/19
  Scenario 15: medium - Util= 52.7%, Jobs=99/99
  Scenario 20: medium - Util= 26.9%, Jobs=85/85
  Scenario 25: hard   - Util= 31.9%, Jobs=731/731
  Scenario 30: hard   - Util= 27.7%, Jobs=736/736

EASY Backfilling Results:
======================================================================

EASY:
  Scenarios: 10
  Utilization: 16.08% ± 6.28%
  Range: [8.3%, 28.2%]
  Median: 13.58%
  Avg Wait Time: 0.4 minutes

MEDIUM:
  Scenarios: 10
  Utilization: 29.55% ± 10.73%
  Range: [16.6%, 52.7%]
  Median: 25.65%
  Avg Wait Time: 10.7 minutes

HARD:
  Scenarios: 10
  Utilization: 32.21% ± 6.81%
  Range: [21.8%, 43.4%]
  Median: 31.28%
  Avg Wait Time: 29.9 minutes

ALL:
  Scenarios: 30
  Utilization: 25.95% ± 10.81%
  Range: [8.3%, 52.7%]
  Median: 25.63%
  Avg Wait Time: 13.7 minutes

======================================================================
Benchmarking: Shortest Job First
======================================================================
  Scenario 5: easy   - Util= 18.3%, Jobs=15/15
  Scenario 10: easy   - Util= 15.2%, Jobs=19/19
  Scenario 15: medium - Util= 40.0%, Jobs=99/99
  Scenario 20: medium - Util= 23.6%, Jobs=85/85
  Scenario 25: hard   - Util= 31.9%, Jobs=731/731
  Scenario 30: hard   - Util= 27.7%, Jobs=736/736

Shortest Job First Results:
======================================================================

EASY:
  Scenarios: 10
  Utilization: 16.08% ± 6.28%
  Range: [8.3%, 28.2%]
  Median: 13.58%
  Avg Wait Time: 0.4 minutes

MEDIUM:
  Scenarios: 10
  Utilization: 27.66% ± 8.25%
  Range: [16.6%, 40.0%]
  Median: 23.99%
  Avg Wait Time: 10.5 minutes

HARD:
  Scenarios: 10
  Utilization: 28.32% ± 5.52%
  Range: [21.0%, 38.1%]
  Median: 26.60%
  Avg Wait Time: 18.4 minutes

ALL:
  Scenarios: 30
  Utilization: 24.02% ± 8.81%
  Range: [8.3%, 40.0%]
  Median: 23.72%
  Avg Wait Time: 9.8 minutes

======================================================================
Benchmarking: Pure FCFS
======================================================================
  Scenario 5: easy   - Util= 18.3%, Jobs=15/15
  Scenario 10: easy   - Util= 15.2%, Jobs=19/19
  Scenario 15: medium - Util= 22.4%, Jobs=99/99
  Scenario 20: medium - Util= 25.3%, Jobs=85/85
  Scenario 25: hard   - Util= 27.3%, Jobs=731/731
  Scenario 30: hard   - Util=  3.1%, Jobs=736/736

Pure FCFS Results:
======================================================================

EASY:
  Scenarios: 10
  Utilization: 16.08% ± 6.28%
  Range: [8.3%, 28.2%]
  Median: 13.58%
  Avg Wait Time: 0.4 minutes

MEDIUM:
  Scenarios: 10
  Utilization: 25.06% ± 6.37%
  Range: [16.6%, 35.3%]
  Median: 22.91%
  Avg Wait Time: 79.8 minutes

HARD:
  Scenarios: 10
  Utilization: 23.55% ± 9.23%
  Range: [3.1%, 34.2%]
  Median: 24.95%
  Avg Wait Time: 1954.4 minutes

ALL:
  Scenarios: 30
  Utilization: 21.56% ± 8.39%
  Range: [3.1%, 35.3%]
  Median: 20.84%
  Avg Wait Time: 678.2 minutes

======================================================================
BASELINE COMPARISON SUMMARY
======================================================================

Scheduler                 | Overall      | Easy         | Medium       | Hard        
------------------------------------------------------------------------------------------
EASY Backfilling          | 25.95% ± 10.81 | 16.08% ± 6.28 | 29.55% ± 10.73 | 32.21% ± 6.81
Shortest Job First        | 24.02% ± 8.81 | 16.08% ± 6.28 | 27.66% ± 8.25 | 28.32% ± 5.52
Pure FCFS                 | 21.56% ± 8.39 | 16.08% ± 6.28 | 25.06% ± 6.37 | 23.55% ± 9.23

======================================================================
TARGET FOR RL AGENT: Beat 25.95% average utilization
======================================================================
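
The utilization figures above measure achieved busy time relative to capacity; a self-contained sketch of that metric (the job fields are illustrative):

```python
def utilization(finished_jobs, total_gpus, horizon_h):
    """GPU-hours actually busy divided by GPU-hours available."""
    busy = sum(job["gpus"] * (min(job["end_h"], horizon_h) - job["start_h"])
               for job in finished_jobs if job["start_h"] < horizon_h)
    return busy / (total_gpus * horizon_h)

# One 8-GPU job running for 2 of the 4 simulated hours on a 16-GPU cluster
finished = [{"gpus": 8, "start_h": 1.0, "end_h": 3.0}]
u = utilization(finished, total_gpus=16, horizon_h=4.0)  # -> 0.25
```

An RL agent beats the strongest baseline if this metric, averaged over the 30 test scenarios, exceeds 0.2595.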

Packaging and PyPI Release

The repository already contains a standard pyproject.toml that points setuptools at the gpu_choochoo package living one directory below the repo root. To ship a new release:

  1. Update metadata

    • Bump the version string in both pyproject.toml ([project].version) and gpu_choochoo/__init__.py (__version__).
    • Fill in/adjust the author, license, and classifiers so they match the release you want to publish.
  2. Build the distribution artifacts

    python -m pip install --upgrade build twine
    rm -rf dist/ build/
    python -m build
    

    The dist/ folder will contain both a source tarball and a wheel that package the gpu_choochoo module tree.

  3. Smoke-test the build locally

    python -m venv .venv-test
    source .venv-test/bin/activate
    python -m pip install dist/gpu_choochoo-*.whl
    

  4. Upload to TestPyPI (recommended) and then PyPI

    # TestPyPI
    python -m twine upload --repository testpypi dist/*

    # Production PyPI
    python -m twine upload dist/*


    Twine will prompt for your PyPI (or TestPyPI) credentials, or can read them from ~/.pypirc.

  5. Consume the published package

    python -m pip install gpu-choochoo


    Users can then import everything from gpu_choochoo, including GPUSchedulerEnv, MultiScenarioWrapper, and RealisticWorkloadGenerator. Package discovery works even though the code lives in the nested gpu_choochoo/ directory because setuptools is configured to include gpu_choochoo*.

Troubleshooting

Episodes not terminating?

  • Increase max_steps parameter (default: 1000)
  • Check for infeasible jobs in logs (verbose=True)

Low utilization?

  • Check load factor of scenarios (should be 0.7-1.2)
  • Verify policy is scheduling jobs (not all no-ops)
  • Compare against baselines (see test_all_baselines.py)

Jobs filtered as infeasible?

  • Workload generator ensures max_gpus ≤ largest node
  • Check cluster config has sufficient capacity
  • Set verbose=True to see which jobs are filtered

Next Steps

  1. Train your RL agent on diverse scenarios
  2. Compare to baselines (target: >45% utilization)
  3. Test generalization on held-out scenarios
  4. Analyze learned policies - what strategies emerge?
  5. Scale up - add more scenarios, larger clusters

References

  • Workload characteristics based on Google cluster traces, Azure ML traces
  • Power-law distributions: Reiss et al., "Google cluster-usage traces" (2011)
  • EASY Backfilling: Lifka, "The ANL/IBM SP scheduling system" (1995)
