
aihpi - AI High Performance Infrastructure

A Python package that simplifies distributed job submission on SLURM clusters, with container support. Built on top of submitit, it adds features designed specifically for AI/ML workloads.

Features

  • Simple API: Configure and submit jobs with minimal code
  • Command Line Interface: aihpi CLI for easy job submission and management
  • Distributed Training: Automatic setup for multi-node distributed training
  • Container Support: First-class support for Pyxis/Enroot containers
  • Remote Submission: Submit jobs via SSH from remote machines
  • LlamaFactory Integration: Built-in support for LlamaFactory training
  • Job Monitoring: Real-time job status tracking and log streaming
  • Experiment Tracking: Integration with Weights & Biases, MLflow, and local tracking
  • Flexible Configuration: Dataclass-based configuration system

Setup and Installation

Prerequisites

  • Python ≥ 3.8
  • submitit ≥ 1.4.0
  • Access to a SLURM cluster with Pyxis/Enroot (for container jobs)

Installation with pip
  1. Clone the repository:

    git clone https://github.com/aihpi/aihpi-cluster.git
    cd aihpi-cluster
    
  2. Install the package:

    # Basic installation
    pip install -e .
    
    # With experiment tracking support
    pip install -e ".[tracking]"
    
    # With all optional dependencies
    pip install -e ".[all]"
    
Installation with UV (Recommended)

UV is a Python package manager that offers faster installs and better dependency resolution than pip:

  1. Install UV (if not already installed):

    curl -LsSf https://astral.sh/uv/install.sh | sh
    # or
    pip install uv
    
  2. Clone and setup:

    git clone https://github.com/aihpi/aihpi-cluster.git
    cd aihpi-cluster
    
  3. Install with UV:

    # Basic installation
    uv pip install -e .
    
    # With experiment tracking support
    uv pip install -e ".[tracking]"
    
    # With all optional dependencies (recommended)
    uv pip install -e ".[all]"
    

Quick Start

After installation, start using aihpi:

from aihpi import SlurmJobExecutor, JobConfig

config = JobConfig(
    job_name="my-training",
    num_nodes=1,
    gpus_per_node=2,
    walltime="01:00:00",
    partition="aisc",
    login_node="10.130.0.6"  # Your SLURM login node IP
)

executor = SlurmJobExecutor(config)
job = executor.submit_function(my_training_function)
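
Since aihpi is built on submitit, the returned job handle should expose submitit's standard Job interface; a minimal follow-up to the snippet above, under that assumption:

# Continue from the snippet above (assumes a standard submitit Job object).
print(f"Submitted SLURM job {job.job_id}")

# Block until the job completes; returns the function's return value and
# raises if the job failed.
result = job.result()
print(f"Job finished with result: {result!r}")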

User Guide

Using the Tool

  1. Configure your job using JobConfig with resource requirements and SLURM parameters
    • Important: Set login_node to your SLURM login node IP for remote job submission
  2. Create an executor with SlurmJobExecutor(config)
  3. Submit your function with executor.submit_function(func) or executor.submit_distributed_training(func)
  4. Monitor progress using JobMonitor for real-time status updates
  5. Track experiments with Weights & Biases, MLflow, or local tracking

Basic Example

from aihpi import SlurmJobExecutor, JobConfig, ContainerConfig

# Configure multi-node distributed training
config = JobConfig(
    job_name="distributed-training",
    num_nodes=4,
    gpus_per_node=2,
    walltime="04:00:00",
    partition="aisc",
    login_node="10.130.0.6",  # Your SLURM login node IP
)

# Configure container
config.container = ContainerConfig(
    name="torch2412",
    mounts=["/data:/workspace/data"]
)

executor = SlurmJobExecutor(config)

def distributed_training():
    import os
    print(f"Node rank: {os.getenv('NODE_RANK')}")
    print(f"World size: {os.getenv('WORLD_SIZE')}")
    # Your distributed training code here

job = executor.submit_distributed_training(distributed_training)
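
Inside the function, those environment variables can seed the usual PyTorch process-group setup. The sketch below is illustrative only: NODE_RANK and WORLD_SIZE are shown above, while RANK, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT are assumed to be exported as well (as a torchrun-style launcher would do); check aihpi/examples/ for the exact variables aihpi sets.

import os

import torch
import torch.distributed as dist

def distributed_training():
    # WORLD_SIZE comes from the launcher (see above); RANK, LOCAL_RANK and the
    # MASTER_* rendezvous variables are assumptions and may differ on your cluster.
    world_size = int(os.environ["WORLD_SIZE"])
    rank = int(os.environ.get("RANK", os.environ["NODE_RANK"]))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))

    # The default env:// init method reads MASTER_ADDR / MASTER_PORT.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)

    # ... build the model, wrap it in DistributedDataParallel, run the loop ...

    dist.destroy_process_group()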

Command Line Interface

The aihpi CLI provides a convenient command-line interface for job submission and management:

# Submit a single-node Python job
aihpi run train.py --config slurm_config.py

# Submit with monitoring
aihpi run train.py --config slurm_config.py --monitor

# Submit distributed job (automatically detected from config)
aihpi run train.py --config distributed_config.py

# Submit LlamaFactory job with app config
aihpi run llamafactory-cli train --config job_config.py --app-config train.yaml

# Monitor a running job
aihpi monitor 12345 --follow

# Check job status
aihpi status

# Cancel a job
aihpi cancel 12345

CLI Configuration Files

The CLI uses Python configuration files containing a JobConfig object:

# config.py
from aihpi import JobConfig
from pathlib import Path

config = JobConfig(
    job_name="my_job",
    num_nodes=1,
    gpus_per_node=2,
    walltime="02:00:00",
    partition="gpu",
    log_dir=Path("./logs"),
    login_node="10.130.0.6"
)

The CLI automatically determines the submission mode:

  • Function mode: Single-node Python scripts
  • Distributed mode: Multi-node Python scripts (when num_nodes > 1; see the example config below)
  • CLI mode: Non-Python executables
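
For example, a configuration like the following would be submitted in distributed mode because num_nodes > 1 (the values are illustrative):

# distributed_config.py
from pathlib import Path

from aihpi import JobConfig

config = JobConfig(
    job_name="distributed_job",
    num_nodes=4,              # > 1, so the CLI picks distributed mode
    gpus_per_node=2,
    walltime="04:00:00",
    partition="gpu",
    log_dir=Path("./logs"),
    login_node="10.130.0.6",
)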

Advanced Features

  • Job Monitoring: Real-time status tracking and log streaming
  • Experiment Tracking: Automatic logging of metrics, parameters, and artifacts
  • Remote Submission: Submit jobs via SSH from any machine
  • LlamaFactory Integration: Built-in support for LLM fine-tuning

See aihpi/examples/ for comprehensive usage examples.
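
As a rough idea of how the monitoring piece fits in: the user guide names a JobMonitor class, but its exact interface is not shown here, so the import path, constructor, and method below are hypothetical placeholders; see aihpi/examples/monitoring.py for the actual API.

from aihpi import SlurmJobExecutor, JobConfig, JobMonitor  # JobMonitor export path assumed

def train():
    print("training...")

config = JobConfig(
    job_name="monitored-job",
    num_nodes=1,
    gpus_per_node=1,
    walltime="01:00:00",
    partition="gpu",
    login_node="10.130.0.6",
)

job = SlurmJobExecutor(config).submit_function(train)

# Hypothetical interface: attach a monitor to the submitted job and follow its
# logs until completion. The constructor and method name are assumptions.
monitor = JobMonitor(job)
monitor.follow()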

Recommendations

  • Use containerized jobs for reproducible environments
  • Enable experiment tracking for better ML workflow management
  • Monitor long-running jobs with the built-in monitoring utilities
  • Configure SSH keys for seamless remote job submission

Package Structure

aihpi/
├── cli.py              # Command-line interface
├── core/               # Core job submission functionality
│   ├── config.py      # Configuration classes
│   └── executor.py    # Job executors
├── monitoring/        # Job monitoring utilities
│   └── monitoring.py  # Real-time job status and log streaming
├── tracking/          # Experiment tracking integrations
│   └── tracking.py    # W&B, MLflow, and local tracking
└── examples/          # Usage examples
    ├── basic.py       # Basic job submission examples
    └── monitoring.py  # Monitoring and tracking examples

Limitations

  • SLURM Dependency: Requires access to a SLURM cluster environment
  • Container Runtime: Container features require Pyxis/Enroot setup
  • Network Access: Remote submission requires SSH connectivity to login nodes

License

MIT License - see LICENSE file for details.


Acknowledgements


The AI Service Centre Berlin Brandenburg is funded by the Federal Ministry of Research, Technology and Space under the funding code 01IS22092.
