# GPU ChooChoo

Keep your GPUs chugging along!

Gymnasium environments, workloads, and scheduling baselines for realistic GPU cluster research.

## Overview
GPU ChooChoo is a library of Gymnasium environments for training deep RL agents on GPU cluster scheduling, driven by statistically realistic workloads that generalize across different cluster configurations.
## Key Features

### 1. Realistic Workload Generation

Based on actual ML/AI cluster characteristics:

- **Poisson arrivals** with time-varying rate λ(t)
  - Business-hours effect (higher load 9am-5pm)
  - Non-homogeneous Poisson process
  - Burst arrivals (researchers submitting job batches)
- **Power-law job sizes** (P(k GPUs) ∝ k^(-2.5))
  - Most jobs small (1-2 GPUs): ~75%
  - Few large jobs (8+ GPUs): ~5%
  - Realistic for ML workloads
- **Log-normal durations** with size correlation
  - Heavy-tailed distribution
  - Larger jobs run longer (positive size-duration correlation)
  - Range: 1 minute to 48 hours
- **Correlated characteristics**
  - Larger jobs prefer newer GPUs (H100 > A100)
  - VRAM requirements scale with job size
  - Realistic GPU type preferences
- **GPU tier awareness**
  - Built-in catalog describing V100/A100/H100-style tiers
  - Jobs expose preferred/acceptable GPU type lists and per-GPU VRAM minima
  - The scheduler enforces compatibility automatically
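As a rough illustration of these distributions, here is a self-contained sketch of power-law sizes, size-correlated log-normal durations, and time-varying Poisson arrivals via thinning. This is not the library's `RealisticWorkloadGenerator` implementation; the parameter values (base rate, σ, median scaling) are assumptions chosen to roughly match the statistics reported later in this README.

```python
import math
import random

rng = random.Random(42)

def sample_job_size(alpha=2.5, max_gpus=16):
    """Sample a GPU count from a truncated power law: P(k) ∝ k^(-alpha)."""
    ks = list(range(1, max_gpus + 1))
    weights = [k ** -alpha for k in ks]
    return rng.choices(ks, weights=weights, k=1)[0]

def sample_duration_hours(num_gpus, sigma=1.4):
    """Log-normal duration whose median grows with job size, clipped to 1 min-48 h."""
    mu = math.log(0.3) + 0.3 * math.log(num_gpus)  # larger jobs get longer medians
    return min(max(rng.lognormvariate(mu, sigma), 1 / 60), 48.0)

def sample_arrivals(horizon_h=24.0, base_rate=10.0):
    """Non-homogeneous Poisson arrivals via thinning: λ(t) doubles from 9am to 5pm."""
    lam_max = 2.0 * base_rate
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(lam_max)
        if t > horizon_h:
            return arrivals
        lam_t = base_rate * (2.0 if 9 <= t % 24 < 17 else 1.0)
        if rng.random() < lam_t / lam_max:  # accept with probability λ(t)/λ_max
            arrivals.append(t)

sizes = [sample_job_size() for _ in range(10_000)]
print(f"share of 1-2 GPU jobs: {sum(s <= 2 for s in sizes) / len(sizes):.0%}")
```

With α = 2.5 the sampled mass concentrates heavily on 1-2 GPU jobs, and the thinning loop is the standard way to draw from a non-homogeneous Poisson process without inverting its rate function.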
### 2. Multi-Scenario Training

Train on diverse scenarios to learn general policies:

- **Curriculum learning**: easy → medium → hard progression
- **Difficulty levels**: based on load factor and cluster size
- **Held-out test scenarios**: evaluate generalization
- **Automatic infeasible-job filtering**: no hanging on impossible jobs
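One way such a curriculum split could be realized is equal easy/medium/hard thirds in order; this is only an illustration of the idea, and the wrapper's actual `difficulty_distribution` logic may differ:

```python
def curriculum_difficulties(num_scenarios):
    """Sketch of a curriculum split: easy -> medium -> hard in equal thirds.
    (Illustrative only; not the wrapper's actual implementation.)"""
    third = num_scenarios // 3
    return (["easy"] * third
            + ["medium"] * third
            + ["hard"] * (num_scenarios - 2 * third))

levels = curriculum_difficulties(30)
print(levels[0], levels[15], levels[-1])  # easy medium hard
```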
3. Safety Features
- Adaptive step limits: Automatically scales with scenario size (num_jobs × 3)
- Easy scenarios (17 jobs): ~50-100 steps
- Medium scenarios (95 jobs): ~300 steps
- Hard scenarios (760 jobs): ~2300 steps
- Ensures 24-hour traces complete successfully
- Infeasible job removal: Jobs requiring more GPUs than any node are filtered
- Proper termination: All episodes complete without hanging
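The num_jobs × 3 rule above is simple enough to sketch directly; the floor of 50 steps is an assumption added here to match the "~50-100 steps" range quoted for easy scenarios:

```python
def adaptive_step_limit(num_jobs, factor=3, minimum=50):
    """Step budget scaling with scenario size (sketch of the num_jobs × 3 rule;
    the `minimum` floor is an assumption, not confirmed library behavior)."""
    return max(num_jobs * factor, minimum)

print(adaptive_step_limit(95), adaptive_step_limit(760))  # 285 2280
```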
## Quick Start

```python
from gym_env.multi_scenario_wrapper import create_diverse_training_env

# Create environment with 30 diverse scenarios
env = create_diverse_training_env(
    num_scenarios=30,
    difficulty_distribution='curriculum',  # or 'balanced', 'easy-heavy', 'hard-heavy'
    seed=42,
    max_steps=200,  # Truncate after 200 steps
)

# Standard Gymnasium loop
obs, info = env.reset()
done = False
while not done:
    action = your_policy(obs)  # Your RL agent
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

# Get scenario info
scenario_info = env.get_scenario_info()
print(f"Difficulty: {scenario_info['difficulty']}")
print(f"Load Factor: {scenario_info['load_factor']:.2f}")
```
## Running Tests

### Test Workload Generation

```bash
python -m pytest gpu_choochoo/tests/test_gpu_tier_preferences.py
```

Validates:

- Poisson arrival process (homogeneous and time-varying)
- Power-law job size distribution
- Log-normal duration with size correlation
- GPU type selection logic, tier preferences, and per-GPU VRAM constraints enforced by the scheduler
- Load factor computation

### Test Multi-Scenario Wrapper

```bash
python test_multi_scenario_quick.py
```

Tests:

- Multi-scenario environment creation
- Episode execution without hanging
- Performance across difficulty levels
- Proper truncation with `max_steps`
## Files

### Core Implementation

- `gym_env/gpu_scheduler_env.py` - Base Gymnasium environment
- `gym_env/realistic_workload_generator.py` - Statistical workload generation
- `gym_env/multi_scenario_wrapper.py` - Multi-scenario training wrapper

### Tests

- `test_realistic_workloads.py` - Validate workload statistics
- `test_multi_scenario_quick.py` - Quick integration test
- `test_simple_scenario.py` - Debug a single scenario

### Original Files

- `test_baseline.py` - Test EASY Backfilling baseline
- `test_all_baselines.py` - Compare FCFS, SJF, EASY Backfilling
- `test_gym_env.py` - Basic environment tests
- `example_rl_training.py` - Training loop template
## Workload Statistics

### Job Size Distribution (Power-Law, α=2.5)

```
1 GPU:   74.4% ####################################
2 GPUs:  14.0% #######
3 GPUs:   5.2% ##
4 GPUs:   2.1% #
5+ GPUs:  4.3% ##
```

### Duration by Job Size (Log-Normal)

| Size   | Mean Duration | Median Duration |
|--------|---------------|-----------------|
| 1 GPU  | 0.77 hours    | 0.28 hours      |
| 2 GPU  | 1.21 hours    | 0.41 hours      |
| 4 GPU  | 1.26 hours    | 0.41 hours      |
| 8 GPU  | 1.55 hours    | 0.58 hours      |
| 16 GPU | 1.91 hours    | 0.73 hours      |
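For a log-normal distribution, mean = median · exp(σ²/2), so the mean/median ratio in the table pins down the shape parameter. Backing σ out of the 1-GPU row (0.77 h mean, 0.28 h median):

```python
import math

# mean = median * exp(sigma**2 / 2)  =>  sigma = sqrt(2 * ln(mean / median))
sigma = math.sqrt(2 * math.log(0.77 / 0.28))
print(round(sigma, 2))  # 1.42
```

A σ around 1.4 is what gives the heavy right tail described above, where the mean sits well above the median.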
## Difficulty Levels

### Easy

- 2 nodes, 4-8 GPUs per node
- ~17 jobs over 4 hours
- Load factor: 0.5-2.0
- No burst arrivals

### Medium

- 4 nodes, 4-8 GPUs per node
- ~95 jobs over 8 hours
- Load factor: 0.7-1.0
- 2 burst events

### Hard

- 8 nodes, 4-16 GPUs per node
- ~760 jobs over 24 hours
- Load factor: 0.8-1.2
- 5 burst events
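The load factors above can be read as offered load. A common definition, assumed here rather than taken from the library's source, is requested GPU-hours divided by available GPU-hours over the trace:

```python
def load_factor(jobs, total_gpus, horizon_hours):
    """Offered load: requested GPU-hours over available GPU-hours
    (an assumed definition for illustration)."""
    demanded = sum(gpus * hours for gpus, hours in jobs)
    return demanded / (total_gpus * horizon_hours)

# Two jobs on a 16-GPU cluster over a 4-hour trace:
print(load_factor([(2, 4.0), (8, 1.0)], total_gpus=16, horizon_hours=4.0))  # 0.25
```

A load factor above 1.0 means more work arrives than the cluster can serve in the horizon, which is why the hard scenarios (0.8-1.2) force real queueing decisions.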
## Baseline Performance

Running `benchmark_baselines.py` produces example output like the following:

```
======================================================================
BASELINE SCHEDULER BENCHMARK ON REALISTIC SCENARIOS
======================================================================
Generating test scenarios...
Created 30 scenarios:
  Easy: 10
  Medium: 10
  Hard: 10

======================================================================
Benchmarking: EASY Backfilling
======================================================================
  Scenario 5: easy - Util= 18.3%, Jobs=15/15
  Scenario 10: easy - Util= 15.2%, Jobs=19/19
  Scenario 15: medium - Util= 52.7%, Jobs=99/99
  Scenario 20: medium - Util= 26.9%, Jobs=85/85
  Scenario 25: hard - Util= 31.9%, Jobs=731/731
  Scenario 30: hard - Util= 27.7%, Jobs=736/736

EASY Backfilling Results:
======================================================================
EASY:
  Scenarios: 10
  Utilization: 16.08% ± 6.28%
  Range: [8.3%, 28.2%]
  Median: 13.58%
  Avg Wait Time: 0.4 minutes
MEDIUM:
  Scenarios: 10
  Utilization: 29.55% ± 10.73%
  Range: [16.6%, 52.7%]
  Median: 25.65%
  Avg Wait Time: 10.7 minutes
HARD:
  Scenarios: 10
  Utilization: 32.21% ± 6.81%
  Range: [21.8%, 43.4%]
  Median: 31.28%
  Avg Wait Time: 29.9 minutes
ALL:
  Scenarios: 30
  Utilization: 25.95% ± 10.81%
  Range: [8.3%, 52.7%]
  Median: 25.63%
  Avg Wait Time: 13.7 minutes

======================================================================
Benchmarking: Shortest Job First
======================================================================
  Scenario 5: easy - Util= 18.3%, Jobs=15/15
  Scenario 10: easy - Util= 15.2%, Jobs=19/19
  Scenario 15: medium - Util= 40.0%, Jobs=99/99
  Scenario 20: medium - Util= 23.6%, Jobs=85/85
  Scenario 25: hard - Util= 31.9%, Jobs=731/731
  Scenario 30: hard - Util= 27.7%, Jobs=736/736

Shortest Job First Results:
======================================================================
EASY:
  Scenarios: 10
  Utilization: 16.08% ± 6.28%
  Range: [8.3%, 28.2%]
  Median: 13.58%
  Avg Wait Time: 0.4 minutes
MEDIUM:
  Scenarios: 10
  Utilization: 27.66% ± 8.25%
  Range: [16.6%, 40.0%]
  Median: 23.99%
  Avg Wait Time: 10.5 minutes
HARD:
  Scenarios: 10
  Utilization: 28.32% ± 5.52%
  Range: [21.0%, 38.1%]
  Median: 26.60%
  Avg Wait Time: 18.4 minutes
ALL:
  Scenarios: 30
  Utilization: 24.02% ± 8.81%
  Range: [8.3%, 40.0%]
  Median: 23.72%
  Avg Wait Time: 9.8 minutes

======================================================================
Benchmarking: Pure FCFS
======================================================================
  Scenario 5: easy - Util= 18.3%, Jobs=15/15
  Scenario 10: easy - Util= 15.2%, Jobs=19/19
  Scenario 15: medium - Util= 22.4%, Jobs=99/99
  Scenario 20: medium - Util= 25.3%, Jobs=85/85
  Scenario 25: hard - Util= 27.3%, Jobs=731/731
  Scenario 30: hard - Util= 3.1%, Jobs=736/736

Pure FCFS Results:
======================================================================
EASY:
  Scenarios: 10
  Utilization: 16.08% ± 6.28%
  Range: [8.3%, 28.2%]
  Median: 13.58%
  Avg Wait Time: 0.4 minutes
MEDIUM:
  Scenarios: 10
  Utilization: 25.06% ± 6.37%
  Range: [16.6%, 35.3%]
  Median: 22.91%
  Avg Wait Time: 79.8 minutes
HARD:
  Scenarios: 10
  Utilization: 23.55% ± 9.23%
  Range: [3.1%, 34.2%]
  Median: 24.95%
  Avg Wait Time: 1954.4 minutes
ALL:
  Scenarios: 30
  Utilization: 21.56% ± 8.39%
  Range: [3.1%, 35.3%]
  Median: 20.84%
  Avg Wait Time: 678.2 minutes

======================================================================
BASELINE COMPARISON SUMMARY
======================================================================
Scheduler          | Overall        | Easy          | Medium         | Hard
------------------------------------------------------------------------------------------
EASY Backfilling   | 25.95% ± 10.81 | 16.08% ± 6.28 | 29.55% ± 10.73 | 32.21% ± 6.81
Shortest Job First | 24.02% ± 8.81  | 16.08% ± 6.28 | 27.66% ± 8.25  | 28.32% ± 5.52
Pure FCFS          | 21.56% ± 8.39  | 16.08% ± 6.28 | 25.06% ± 6.37  | 23.55% ± 9.23
======================================================================
TARGET FOR RL AGENT: Beat 25.95% average utilization
======================================================================
```
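For reference, EASY backfilling keeps FCFS order but lets later jobs jump the queue only if they cannot delay the head job's reservation. The following is a minimal batch-mode sketch (all jobs available at t=0, conservative backfill test), an illustration of the policy rather than the benchmark's actual scheduler:

```python
import heapq

def easy_backfill(jobs, total_gpus):
    """EASY backfilling for a batch of (gpus, runtime) jobs, all ready at t=0.
    Returns {job_index: start_time}. Sketch only; assumes every job fits the cluster."""
    assert all(g <= total_gpus for g, _ in jobs)
    free, t = total_gpus, 0.0
    running = []                     # min-heap of (finish_time, gpus)
    starts, queue = {}, list(range(len(jobs)))
    while queue:
        # Start jobs in FCFS order while they fit.
        while queue and jobs[queue[0]][0] <= free:
            i = queue.pop(0)
            starts[i] = t
            free -= jobs[i][0]
            heapq.heappush(running, (t + jobs[i][1], jobs[i][0]))
        if not queue:
            break
        # Reservation ("shadow time"): earliest completion that frees enough
        # GPUs for the queue head.
        head_gpus, avail, shadow = jobs[queue[0]][0], free, None
        for finish, g in sorted(running):
            avail += g
            if avail >= head_gpus:
                shadow = finish
                break
        # Backfill: start any later job that fits now AND finishes before the
        # reservation, so the head job is never delayed.
        for i in list(queue[1:]):
            g, r = jobs[i]
            if g <= free and t + r <= shadow:
                queue.remove(i)
                starts[i] = t
                free -= g
                heapq.heappush(running, (t + r, g))
        # Advance time to the next completion and release its GPUs.
        finish, g = heapq.heappop(running)
        t, free = finish, free + g
        while running and running[0][0] == t:
            _, g = heapq.heappop(running)
            free += g
    return starts

# 4-GPU cluster: the 1-GPU job backfills ahead of the blocked 4-GPU job
# without delaying it.
print(easy_backfill([(2, 2.0), (4, 4.0), (1, 1.0)], total_gpus=4))
```

In the example, job 1 (4 GPUs) must wait for job 0 to finish at t=2; job 2 (1 GPU, 1 hour) fits in the gap and finishes before the reservation, so it starts immediately, which is exactly the idle-slot reuse that gives EASY its utilization edge over pure FCFS in the table above.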
## Packaging and PyPI Release

The repository already contains a standard `pyproject.toml` that points setuptools at the `gpu_choochoo` package living one directory below the repo root. To ship a new release:

1. **Update metadata**
   - Bump the version string in both `pyproject.toml` (`[project].version`) and `gpu_choochoo/__init__.py` (`__version__`).
   - Fill in/adjust the author, license, and classifiers so they match the release you want to publish.

2. **Build the distribution artifacts**

   ```bash
   python -m pip install --upgrade build twine
   rm -rf dist/ build/
   python -m build
   ```

   The `dist/` folder will contain both a source tarball and a wheel that package the `gpu_choochoo` module tree.

3. **Smoke-test the build locally**

   ```bash
   python -m venv .venv-test
   source .venv-test/bin/activate
   python -m pip install dist/gpu_choochoo-*.whl
   ```

4. **Upload to TestPyPI (recommended) and then PyPI**

   ```bash
   # TestPyPI
   python -m twine upload --repository testpypi dist/*

   # Production PyPI
   python -m twine upload dist/*
   ```

   Twine will prompt for your PyPI (or TestPyPI) credentials, or can read them from `~/.pypirc`.

5. **Consume the published package**

   ```bash
   python -m pip install gpu-choochoo
   ```

   Users can then import everything from `gpu_choochoo`, including `GPUSchedulerEnv`, `MultiScenarioWrapper`, and `RealisticWorkloadGenerator`. Package discovery works even though the code lives in the nested `gpu_choochoo/` directory because setuptools is configured to include `gpu_choochoo*`.
## Troubleshooting

**Episodes not terminating?**

- Increase the `max_steps` parameter (default: 1000)
- Check for infeasible jobs in the logs (`verbose=True`)

**Low utilization?**

- Check the load factor of your scenarios (should be 0.7-1.2)
- Verify the policy is actually scheduling jobs (not all no-ops)
- Compare against the baselines (see `test_all_baselines.py`)

**Jobs filtered as infeasible?**

- The workload generator ensures `max_gpus` ≤ the largest node
- Check that the cluster config has sufficient capacity
- Set `verbose=True` to see which jobs are filtered
## Next Steps

- **Train your RL agent** on diverse scenarios
- **Compare to baselines** (target: >45% utilization)
- **Test generalization** on held-out scenarios
- **Analyze learned policies**: what strategies emerge?
- **Scale up**: add more scenarios, larger clusters
## References

- Workload characteristics based on Google cluster traces and Azure ML traces
- Power-law distributions: Reiss et al., "Google cluster-usage traces" (2011)
- EASY Backfilling: Lifka, "The ANL/IBM SP scheduling system" (1995)