
Gymnasium environments, workloads, and scheduling baselines for realistic GPU cluster research.


GPU ChooChoo

Keep your GPUs chugging along!

Overview

GPU ChooChoo is a library of Gymnasium environments for training deep RL agents on GPU cluster scheduling. Its workloads are statistically realistic and designed to generalize across different cluster configurations.

Key Features

1. Realistic Workload Generation

Based on actual ML/AI cluster characteristics:

  • Poisson arrivals with time-varying rate λ(t)

    • Business hours effect (higher load 9am-5pm)
    • Non-homogeneous Poisson process
    • Burst arrivals (researchers submitting job batches)
  • Power-law job sizes (P(k GPUs) ∝ k^(-2.5))

    • Most jobs small (1-2 GPUs): ~75%
    • Few large jobs (8+ GPUs): ~5%
    • Realistic for ML workloads
  • Log-normal durations with size correlation

    • Heavy-tailed distribution
    • Larger jobs run longer (correlation)
    • Range: 1 minute to 48 hours
  • Correlated characteristics

    • Larger jobs prefer newer GPUs (H100 > A100)
    • VRAM scales with job size
    • Realistic GPU type preferences
  • GPU tier awareness

    • Built-in catalog describing V100/A100/H100-style tiers
    • Jobs expose preferred/acceptable GPU type lists and per-GPU VRAM minima
    • Scheduler enforces compatibility automatically
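
A minimal sketch of the arrival and size models above (the function names and parameter values are illustrative, not the library's API):

```python
import random

def sample_arrivals(rate_fn, horizon_h, max_rate, rng):
    """Non-homogeneous Poisson arrivals via Lewis-Shedler thinning."""
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(max_rate)            # candidate inter-arrival
        if t >= horizon_h:
            return arrivals
        if rng.random() < rate_fn(t) / max_rate:  # accept with prob rate(t)/max_rate
            arrivals.append(t)

def business_hours_rate(t_h, base=2.0, peak=6.0):
    """Higher submission rate between 9am and 5pm (jobs/hour)."""
    return peak if 9 <= t_h % 24 < 17 else base

def sample_job_size(rng, alpha=2.5, max_gpus=16):
    """Power-law size: P(k GPUs) proportional to k**-alpha."""
    sizes = list(range(1, max_gpus + 1))
    weights = [k ** -alpha for k in sizes]
    return rng.choices(sizes, weights=weights)[0]

rng = random.Random(42)
arrivals = sample_arrivals(business_hours_rate, 24.0, 6.0, rng)
sizes = [sample_job_size(rng) for _ in arrivals]
```

With α = 2.5, roughly three quarters of sampled jobs request a single GPU, matching the size distribution reported under Workload Statistics.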

2. Multi-Scenario Training

Train on diverse scenarios to learn general policies:

  • Curriculum learning: Easy → Medium → Hard progression
  • Difficulty levels: Based on load factor and cluster size
  • Held-out test scenarios: Evaluate generalization
  • Automatic infeasible job filtering: No hanging on impossible jobs
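
The curriculum option can be pictured as a schedule that shifts sampling mass from easy toward hard scenarios as training progresses; a sketch with purely illustrative weights:

```python
import random

def curriculum_difficulty(progress, rng):
    """Sample a difficulty level; progress runs from 0.0 (start) to 1.0 (end)."""
    weights = {
        "easy":   max(0.0, 1.0 - 2.0 * progress),   # fades out by the midpoint
        "medium": 1.0 - abs(1.0 - 2.0 * progress),  # peaks at the midpoint
        "hard":   max(0.0, 2.0 * progress - 1.0),   # ramps in after the midpoint
    }
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names])[0]

rng = random.Random(0)
# Early training draws easy scenarios, late training draws hard ones.
early, late = curriculum_difficulty(0.0, rng), curriculum_difficulty(1.0, rng)
```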

3. Safety Features

  • Adaptive step limits: Automatically scales with scenario size (num_jobs × 3)
    • Easy scenarios (17 jobs): ~50-100 steps
    • Medium scenarios (95 jobs): ~300 steps
    • Hard scenarios (760 jobs): ~2300 steps
    • Ensures 24-hour traces complete successfully
  • Infeasible job removal: Jobs requiring more GPUs than any node are filtered
  • Proper termination: All episodes complete without hanging
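
Both safety mechanisms are simple to express; a sketch (the dict-based job and node shapes are illustrative, not the library's types):

```python
def adaptive_step_limit(num_jobs, factor=3, minimum=50):
    """Step budget that scales with scenario size (num_jobs x 3)."""
    return max(minimum, num_jobs * factor)

def filter_infeasible(jobs, nodes):
    """Drop jobs requesting more GPUs than the largest single node offers."""
    largest_node = max(node["gpus"] for node in nodes)
    return [job for job in jobs if job["gpus"] <= largest_node]

nodes = [{"gpus": 4}, {"gpus": 8}]
jobs = [{"gpus": 1}, {"gpus": 8}, {"gpus": 16}]   # the 16-GPU job cannot fit
feasible = filter_infeasible(jobs, nodes)
```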

Quick Start

from gym_env.multi_scenario_wrapper import create_diverse_training_env

# Create environment with 30 diverse scenarios
env = create_diverse_training_env(
    num_scenarios=30,
    difficulty_distribution='curriculum',  # or 'balanced', 'easy-heavy', 'hard-heavy'
    seed=42,
    max_steps=200  # Truncate after 200 steps
)

# Standard Gymnasium loop
obs, info = env.reset()
done = False

while not done:
    action = your_policy(obs)  # Your RL agent
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

# Get scenario info
scenario_info = env.get_scenario_info()
print(f"Difficulty: {scenario_info['difficulty']}")
print(f"Load Factor: {scenario_info['load_factor']:.2f}")

Running Tests

Test Workload Generation

python -m pytest gpu_choochoo/tests/test_gpu_tier_preferences.py

Validates:

  • Poisson arrival process (homogeneous & time-varying)
  • Power-law job size distribution
  • Log-normal duration with size correlation
  • GPU type selection logic, tier preferences, and per-GPU VRAM constraints enforced by the scheduler
  • Load factor computation

Test Multi-Scenario Wrapper

python test_multi_scenario_quick.py

Tests:

  • Multi-scenario environment creation
  • Episode execution without hanging
  • Performance across difficulty levels
  • Proper truncation with max_steps

Files

Core Implementation

  • gym_env/gpu_scheduler_env.py - Base Gymnasium environment
  • gym_env/realistic_workload_generator.py - Statistical workload generation
  • gym_env/multi_scenario_wrapper.py - Multi-scenario training wrapper

Tests

  • test_realistic_workloads.py - Validate workload statistics
  • test_multi_scenario_quick.py - Quick integration test
  • test_simple_scenario.py - Debug single scenario

Original Files

  • test_baseline.py - Test EASY Backfilling baseline
  • test_all_baselines.py - Compare FCFS, SJF, EASY Backfilling
  • test_gym_env.py - Basic environment tests
  • example_rl_training.py - Training loop template

Workload Statistics

Job Size Distribution (Power-Law, α=2.5)

1 GPU:    74.4% ####################################
2 GPUs:   14.0% #######
3 GPUs:    5.2% ##
4 GPUs:    2.1% #
5+ GPUs:   4.3% ##

Duration by Job Size (Log-Normal)

Size  | Mean Duration | Median Duration
------|---------------|----------------
1 GPU |   0.77 hours  |   0.28 hours
2 GPU |   1.21 hours  |   0.41 hours
4 GPU |   1.26 hours  |   0.41 hours
8 GPU |   1.55 hours  |   0.58 hours
16 GPU|   1.91 hours  |   0.73 hours
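
A log-normal sampler whose median grows with job size reproduces the shape of this table; the parameter values below are fitted by eye and purely illustrative:

```python
import math
import random

def sample_duration_h(num_gpus, rng, base_median=0.28, size_exp=0.35, sigma=1.3):
    """Heavy-tailed duration in hours; median scales as num_gpus**size_exp."""
    median = base_median * num_gpus ** size_exp   # ~0.28 h for 1 GPU, ~0.73 h for 16
    duration = rng.lognormvariate(math.log(median), sigma)
    return min(max(duration, 1 / 60), 48.0)       # clamp to 1 minute .. 48 hours

rng = random.Random(7)
samples_1 = sorted(sample_duration_h(1, rng) for _ in range(5000))
samples_16 = sorted(sample_duration_h(16, rng) for _ in range(5000))
```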

Difficulty Levels

Easy

  • 2 nodes, 4-8 GPUs per node
  • ~17 jobs over 4 hours
  • Load factor: 0.5-2.0
  • No burst arrivals

Medium

  • 4 nodes, 4-8 GPUs per node
  • ~95 jobs over 8 hours
  • Load factor: 0.7-1.0
  • 2 burst events

Hard

  • 8 nodes, 4-16 GPUs per node
  • ~760 jobs over 24 hours
  • Load factor: 0.8-1.2
  • 5 burst events
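
Load factor here is offered demand relative to capacity over the trace horizon; a sketch of one common definition (the library's exact formula may differ):

```python
def load_factor(jobs, total_gpus, horizon_h):
    """GPU-hours demanded divided by GPU-hours available."""
    demanded = sum(job["gpus"] * job["duration_h"] for job in jobs)
    return demanded / (total_gpus * horizon_h)

# Two nodes x 8 GPUs over 4 hours = 64 GPU-hours of capacity
jobs = [{"gpus": 2, "duration_h": 4.0}, {"gpus": 8, "duration_h": 1.0}]
lf = load_factor(jobs, total_gpus=16, horizon_h=4.0)  # -> 0.25
```

Values above 1.0 mean the trace demands more GPU-hours than the cluster can supply, so queues must form.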

Baseline Performance

Running benchmark_baselines.py gives the following example output:

======================================================================
BASELINE SCHEDULER BENCHMARK ON REALISTIC SCENARIOS
======================================================================

Generating test scenarios...
Created 30 scenarios:
  Easy: 10
  Medium: 10
  Hard: 10

======================================================================
Benchmarking: EASY Backfilling
======================================================================
  Scenario 5: easy   - Util= 18.3%, Jobs=15/15
  Scenario 10: easy   - Util= 15.2%, Jobs=19/19
  Scenario 15: medium - Util= 52.7%, Jobs=99/99
  Scenario 20: medium - Util= 26.9%, Jobs=85/85
  Scenario 25: hard   - Util= 31.9%, Jobs=731/731
  Scenario 30: hard   - Util= 27.7%, Jobs=736/736

EASY Backfilling Results:
======================================================================

EASY:
  Scenarios: 10
  Utilization: 16.08% ± 6.28%
  Range: [8.3%, 28.2%]
  Median: 13.58%
  Avg Wait Time: 0.4 minutes

MEDIUM:
  Scenarios: 10
  Utilization: 29.55% ± 10.73%
  Range: [16.6%, 52.7%]
  Median: 25.65%
  Avg Wait Time: 10.7 minutes

HARD:
  Scenarios: 10
  Utilization: 32.21% ± 6.81%
  Range: [21.8%, 43.4%]
  Median: 31.28%
  Avg Wait Time: 29.9 minutes

ALL:
  Scenarios: 30
  Utilization: 25.95% ± 10.81%
  Range: [8.3%, 52.7%]
  Median: 25.63%
  Avg Wait Time: 13.7 minutes

======================================================================
Benchmarking: Shortest Job First
======================================================================
  Scenario 5: easy   - Util= 18.3%, Jobs=15/15
  Scenario 10: easy   - Util= 15.2%, Jobs=19/19
  Scenario 15: medium - Util= 40.0%, Jobs=99/99
  Scenario 20: medium - Util= 23.6%, Jobs=85/85
  Scenario 25: hard   - Util= 31.9%, Jobs=731/731
  Scenario 30: hard   - Util= 27.7%, Jobs=736/736

Shortest Job First Results:
======================================================================

EASY:
  Scenarios: 10
  Utilization: 16.08% ± 6.28%
  Range: [8.3%, 28.2%]
  Median: 13.58%
  Avg Wait Time: 0.4 minutes

MEDIUM:
  Scenarios: 10
  Utilization: 27.66% ± 8.25%
  Range: [16.6%, 40.0%]
  Median: 23.99%
  Avg Wait Time: 10.5 minutes

HARD:
  Scenarios: 10
  Utilization: 28.32% ± 5.52%
  Range: [21.0%, 38.1%]
  Median: 26.60%
  Avg Wait Time: 18.4 minutes

ALL:
  Scenarios: 30
  Utilization: 24.02% ± 8.81%
  Range: [8.3%, 40.0%]
  Median: 23.72%
  Avg Wait Time: 9.8 minutes

======================================================================
Benchmarking: Pure FCFS
======================================================================
  Scenario 5: easy   - Util= 18.3%, Jobs=15/15
  Scenario 10: easy   - Util= 15.2%, Jobs=19/19
  Scenario 15: medium - Util= 22.4%, Jobs=99/99
  Scenario 20: medium - Util= 25.3%, Jobs=85/85
  Scenario 25: hard   - Util= 27.3%, Jobs=731/731
  Scenario 30: hard   - Util=  3.1%, Jobs=736/736

Pure FCFS Results:
======================================================================

EASY:
  Scenarios: 10
  Utilization: 16.08% ± 6.28%
  Range: [8.3%, 28.2%]
  Median: 13.58%
  Avg Wait Time: 0.4 minutes

MEDIUM:
  Scenarios: 10
  Utilization: 25.06% ± 6.37%
  Range: [16.6%, 35.3%]
  Median: 22.91%
  Avg Wait Time: 79.8 minutes

HARD:
  Scenarios: 10
  Utilization: 23.55% ± 9.23%
  Range: [3.1%, 34.2%]
  Median: 24.95%
  Avg Wait Time: 1954.4 minutes

ALL:
  Scenarios: 30
  Utilization: 21.56% ± 8.39%
  Range: [3.1%, 35.3%]
  Median: 20.84%
  Avg Wait Time: 678.2 minutes

======================================================================
BASELINE COMPARISON SUMMARY
======================================================================

Scheduler                 | Overall      | Easy         | Medium       | Hard        
------------------------------------------------------------------------------------------
EASY Backfilling          | 25.95% ± 10.81 | 16.08% ± 6.28 | 29.55% ± 10.73 | 32.21% ± 6.81
Shortest Job First        | 24.02% ± 8.81 | 16.08% ± 6.28 | 27.66% ± 8.25 | 28.32% ± 5.52
Pure FCFS                 | 21.56% ± 8.39 | 16.08% ± 6.28 | 25.06% ± 6.37 | 23.55% ± 9.23

======================================================================
TARGET FOR RL AGENT: Beat 25.95% average utilization
======================================================================
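
The utilization figures above measure achieved busy time relative to capacity; a self-contained sketch of that metric (the job fields are illustrative):

```python
def utilization(finished_jobs, total_gpus, horizon_h):
    """GPU-hours actually busy divided by GPU-hours available."""
    busy = sum(job["gpus"] * (min(job["end_h"], horizon_h) - job["start_h"])
               for job in finished_jobs if job["start_h"] < horizon_h)
    return busy / (total_gpus * horizon_h)

# One 8-GPU job running for 2 of the 4 simulated hours on a 16-GPU cluster
finished = [{"gpus": 8, "start_h": 1.0, "end_h": 3.0}]
u = utilization(finished, total_gpus=16, horizon_h=4.0)  # -> 0.25
```

An RL agent beats the strongest baseline if this metric, averaged over the 30 test scenarios, exceeds 0.2595.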

Packaging and PyPI Release

The repository already contains a standard pyproject.toml that points setuptools at the gpu_choochoo package living one directory below the repo root. To ship a new release:

  1. Update metadata

    • Bump the version string in both pyproject.toml ([project].version) and gpu_choochoo/__init__.py (__version__).
    • Fill in/adjust the author, license, and classifiers so they match the release you want to publish.
  2. Build the distribution artifacts

    python -m pip install --upgrade build twine
    rm -rf dist/ build/
    python -m build
    

    The dist/ folder will contain both a source tarball and a wheel that package the gpu_choochoo module tree.

  3. Smoke-test the build locally

    python -m venv .venv-test
    source .venv-test/bin/activate
    python -m pip install dist/gpu_choochoo-*.whl
    

  4. Upload to TestPyPI (recommended) and then PyPI

    # TestPyPI
    python -m twine upload --repository testpypi dist/*

    # Production PyPI
    python -m twine upload dist/*


    Twine will prompt for your PyPI (or TestPyPI) credentials, or can read them from ~/.pypirc.

  5. Consume the published package

    python -m pip install gpu-choochoo


    Users can then import everything from gpu_choochoo, including GPUSchedulerEnv, MultiScenarioWrapper, and RealisticWorkloadGenerator. Package discovery works even though the code lives in the nested gpu_choochoo/ directory because setuptools is configured to include gpu_choochoo*.

Troubleshooting

Episodes not terminating?

  • Increase max_steps parameter (default: 1000)
  • Check for infeasible jobs in logs (verbose=True)

Low utilization?

  • Check load factor of scenarios (should be 0.7-1.2)
  • Verify policy is scheduling jobs (not all no-ops)
  • Compare against baselines (see test_all_baselines.py)

Jobs filtered as infeasible?

  • Workload generator ensures max_gpus ≤ largest node
  • Check cluster config has sufficient capacity
  • Set verbose=True to see which jobs are filtered

Next Steps

  1. Train your RL agent on diverse scenarios
  2. Compare to baselines (target: >45% utilization)
  3. Test generalization on held-out scenarios
  4. Analyze learned policies - what strategies emerge?
  5. Scale up - add more scenarios, larger clusters

References

  • Workload characteristics based on Google cluster traces, Azure ML traces
  • Power-law distributions: Reiss et al., "Google cluster-usage traces" (2011)
  • EASY Backfilling: Lifka, "The ANL/IBM SP scheduling system" (1995)
