
aihpi - AI High Performance Infrastructure

A Python package for simplified distributed job submission on SLURM clusters with container support. Built on top of submitit with additional features specifically designed for AI/ML workloads.

Features

  • Simple API: Configure and submit jobs with minimal code
  • Command Line Interface: aihpi CLI for easy job submission and management
  • Distributed Training: Automatic setup for multi-node distributed training
  • Container Support: First-class support for Pyxis/Enroot containers
  • Remote Submission: Submit jobs via SSH from remote machines
  • LlamaFactory Integration: Built-in support for LlamaFactory training
  • Job Monitoring: Real-time job status tracking and log streaming
  • Experiment Tracking: Integration with Weights & Biases, MLflow, and local tracking
  • Flexible Configuration: Dataclass-based configuration system

Setup and Installation

Prerequisites

  • Python ≥ 3.8
  • submitit ≥ 1.4.0
  • Access to a SLURM cluster with Pyxis/Enroot (for container jobs)

Installation with pip
  1. Clone the repository:

    git clone https://github.com/aihpi/aihpi-cluster.git
    cd aihpi-cluster
    
  2. Install the package:

    # Basic installation
    pip install -e .
    
    # With experiment tracking support
    pip install -e ".[tracking]"
    
    # With all optional dependencies
    pip install -e ".[all]"
    
Installation with UV (Recommended)

UV is a fast Python package manager that provides better dependency resolution and faster installs:

  1. Install UV (if not already installed):

    curl -LsSf https://astral.sh/uv/install.sh | sh
    # or
    pip install uv
    
  2. Clone and setup:

    git clone https://github.com/aihpi/aihpi-cluster.git
    cd aihpi-cluster
    
  3. Install with UV:

    # Basic installation
    uv pip install -e .
    
    # With experiment tracking support
    uv pip install -e ".[tracking]"
    
    # With all optional dependencies (recommended)
    uv pip install -e ".[all]"
    

Quick Start

After installation, start using aihpi:

from aihpi import SlurmJobExecutor, JobConfig

config = JobConfig(
    job_name="my-training",
    num_nodes=1,
    gpus_per_node=2,
    walltime="01:00:00",
    partition="aisc",
    login_node="10.130.0.6"  # Your SLURM login node IP
)

def my_training_function():
    print("training...")  # your training code here

executor = SlurmJobExecutor(config)
job = executor.submit_function(my_training_function)
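The `walltime` string follows SLURM's `HH:MM:SS` convention. As an illustration of that format (this helper is not part of aihpi, and it assumes the plain `HH:MM:SS` form rather than SLURM's longer `D-HH:MM:SS` variants):

```python
def walltime_to_seconds(walltime: str) -> int:
    """Convert a SLURM-style HH:MM:SS walltime string to seconds."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return hours * 3600 + minutes * 60 + seconds

print(walltime_to_seconds("01:00:00"))  # the walltime used in the config above -> 3600
```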

User Guide

Using the Tool

  1. Configure your job using JobConfig with resource requirements and SLURM parameters
    • Important: Set login_node to your SLURM login node IP for remote job submission
  2. Create an executor with SlurmJobExecutor(config)
  3. Submit your function with executor.submit_function(func) or executor.submit_distributed_training(func)
  4. Monitor progress using JobMonitor for real-time status updates
  5. Track experiments with Weights & Biases, MLflow, or local tracking

Basic Example

from aihpi import SlurmJobExecutor, JobConfig, ContainerConfig

# Configure multi-node distributed training
config = JobConfig(
    job_name="distributed-training",
    num_nodes=4,
    gpus_per_node=2,
    walltime="04:00:00",
    partition="aisc",
    login_node="10.130.0.6",  # Your SLURM login node IP
)

# Configure container
config.container = ContainerConfig(
    name="torch2412",
    mounts=["/data:/workspace/data"]
)

executor = SlurmJobExecutor(config)

def distributed_training():
    import os
    print(f"Node rank: {os.getenv('NODE_RANK')}")
    print(f"World size: {os.getenv('WORLD_SIZE')}")
    # Your distributed training code here

job = executor.submit_distributed_training(distributed_training)
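The `NODE_RANK` and `WORLD_SIZE` variables above follow the usual multi-node training convention. For illustration (this is the standard distributed-training arithmetic, not aihpi internals), each GPU process's global rank is typically derived from its node rank, local rank, and the GPU count per node:

```python
def global_rank(node_rank: int, local_rank: int, gpus_per_node: int) -> int:
    """Standard mapping: node 0 holds ranks 0..gpus_per_node-1, node 1 the next block, etc."""
    return node_rank * gpus_per_node + local_rank

# With the config above (4 nodes x 2 GPUs), the world size is 8
world_size = 4 * 2
print(global_rank(node_rank=2, local_rank=1, gpus_per_node=2))  # -> 5
```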

Command Line Interface

The aihpi CLI provides a convenient command-line interface for job submission and management:

# Submit a single-node Python job
aihpi run train.py --config slurm_config.py

# Submit with monitoring
aihpi run train.py --config slurm_config.py --monitor

# Submit distributed job (automatically detected from config)
aihpi run train.py --config distributed_config.py

# Submit LlamaFactory job with app config
aihpi run llamafactory-cli train --config job_config.py --app-config train.yaml

# Monitor a running job
aihpi monitor 12345 --follow

# Check job status
aihpi status

# Cancel a job
aihpi cancel 12345

CLI Configuration Files

The CLI uses Python configuration files containing a JobConfig object:

# config.py
from aihpi import JobConfig
from pathlib import Path

config = JobConfig(
    job_name="my_job",
    num_nodes=1,
    gpus_per_node=2,
    walltime="02:00:00",
    partition="gpu",
    log_dir=Path("./logs"),
    login_node="10.130.0.6"
)

The CLI automatically determines the submission mode:

  • Function mode: Single-node Python scripts
  • Distributed mode: Multi-node Python scripts (when num_nodes > 1)
  • CLI mode: Non-Python executables

Advanced Features

  • Job Monitoring: Real-time status tracking and log streaming
  • Experiment Tracking: Automatic logging of metrics, parameters, and artifacts
  • Remote Submission: Submit jobs via SSH from any machine
  • LlamaFactory Integration: Built-in support for LLM fine-tuning
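Log streaming of the kind described above is typically implemented as a file tail that picks up lines as the job appends them. A standalone sketch of the idea (illustrative only; not how aihpi's `JobMonitor` is actually implemented):

```python
import tempfile
import time
from pathlib import Path

def tail_log(path, poll_interval=0.1, max_idle_polls=3):
    """Yield lines appended to a log file, stopping after max_idle_polls empty reads."""
    idle = 0
    with open(path) as f:
        while idle < max_idle_polls:
            line = f.readline()
            if line:
                idle = 0
                yield line.rstrip("\n")
            else:
                idle += 1
                time.sleep(poll_interval)

# Demo on a temporary "log" file standing in for a SLURM job log:
log = Path(tempfile.mkdtemp()) / "job.log"
log.write_text("epoch 1 done\nepoch 2 done\n")
print(list(tail_log(log)))  # -> ['epoch 1 done', 'epoch 2 done']
```

A real monitor would keep polling while the job is running and stop when SLURM reports a terminal state, rather than after a fixed number of idle reads.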

See aihpi/examples/ for comprehensive usage examples.

Recommendations

  • Use containerized jobs for reproducible environments
  • Enable experiment tracking for better ML workflow management
  • Monitor long-running jobs with the built-in monitoring utilities
  • Configure SSH keys for seamless remote job submission

Package Structure

aihpi/
├── cli.py              # Command-line interface
├── core/               # Core job submission functionality
│   ├── config.py      # Configuration classes
│   └── executor.py    # Job executors
├── monitoring/        # Job monitoring utilities
│   └── monitoring.py  # Real-time job status and log streaming
├── tracking/          # Experiment tracking integrations
│   └── tracking.py    # W&B, MLflow, and local tracking
└── examples/          # Usage examples
    ├── basic.py       # Basic job submission examples
    └── monitoring.py  # Monitoring and tracking examples

Limitations

  • SLURM Dependency: Requires access to a SLURM cluster environment
  • Container Runtime: Container features require Pyxis/Enroot setup
  • Network Access: Remote submission requires SSH connectivity to login nodes

License

MIT License - see LICENSE file for details.


Acknowledgements

The AI Service Centre Berlin Brandenburg is funded by the Federal Ministry of Research, Technology and Space under the funding code 01IS22092.

