aihpi - AI High Performance Infrastructure
Distributed job submission for SLURM clusters with container support
A Python package for simplified distributed job submission on SLURM clusters with container support. Built on top of submitit with additional features specifically designed for AI/ML workloads.
Features
- Simple API: Configure and submit jobs with minimal code
- Command Line Interface: `aihpi` CLI for easy job submission and management
- Distributed Training: Automatic setup for multi-node distributed training
- Container Support: First-class support for Pyxis/Enroot containers
- Remote Submission: Submit jobs via SSH from remote machines
- LlamaFactory Integration: Built-in support for LlamaFactory training
- Job Monitoring: Real-time job status tracking and log streaming
- Experiment Tracking: Integration with Weights & Biases, MLflow, and local tracking
- Flexible Configuration: Dataclass-based configuration system
Setup and Installation
Prerequisites
- Python ≥ 3.8
- submitit ≥ 1.4.0
- Access to SLURM cluster with Pyxis/Enroot (for container jobs)
Installation with pip
1. Clone the repository:

```shell
git clone https://github.com/aihpi/aihpi-cluster.git
cd aihpi-cluster
```

2. Install the package:

```shell
# Basic installation
pip install -e .

# With experiment tracking support
pip install -e ".[tracking]"

# With all optional dependencies
pip install -e ".[all]"
```
Installation with UV (Recommended)
UV is a fast Python package manager that provides better dependency resolution and faster installs:
1. Install UV (if not already installed):

```shell
curl -LsSf https://astral.sh/uv/install.sh | sh
# or
pip install uv
```

2. Clone and set up:

```shell
git clone https://github.com/aihpi/aihpi-cluster.git
cd aihpi-cluster
```

3. Install with UV:

```shell
# Basic installation
uv pip install -e .

# With experiment tracking support
uv pip install -e ".[tracking]"

# With all optional dependencies (recommended)
uv pip install -e ".[all]"
```
Quick Start
After installation, start using aihpi:
```python
from aihpi import SlurmJobExecutor, JobConfig

config = JobConfig(
    job_name="my-training",
    num_nodes=1,
    gpus_per_node=2,
    walltime="01:00:00",
    partition="aisc",
    login_node="10.130.0.6",  # Your SLURM login node IP
)

executor = SlurmJobExecutor(config)
job = executor.submit_function(my_training_function)
```
User Guide
Using the Tool
1. Configure your job using `JobConfig` with resource requirements and SLURM parameters
   - Important: Set `login_node` to your SLURM login node IP for remote job submission
2. Create an executor with `SlurmJobExecutor(config)`
3. Submit your function with `executor.submit_function(func)` or `executor.submit_distributed_training(func)`
4. Monitor progress using `JobMonitor` for real-time status updates
5. Track experiments with Weights & Biases, MLflow, or local tracking
Basic Example
```python
from aihpi import SlurmJobExecutor, JobConfig, ContainerConfig

# Configure multi-node distributed training
config = JobConfig(
    job_name="distributed-training",
    num_nodes=4,
    gpus_per_node=2,
    walltime="04:00:00",
    partition="aisc",
    login_node="10.130.0.6",  # Your SLURM login node IP
)

# Configure container
config.container = ContainerConfig(
    name="torch2412",
    mounts=["/data:/workspace/data"],
)

executor = SlurmJobExecutor(config)

def distributed_training():
    import os
    print(f"Node rank: {os.getenv('NODE_RANK')}")
    print(f"World size: {os.getenv('WORLD_SIZE')}")
    # Your distributed training code here

job = executor.submit_distributed_training(distributed_training)
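Inside the submitted function, the per-node environment variables can be combined into a global process rank. The helper below is an illustrative sketch, not part of aihpi: `NODE_RANK` appears in the example above, while `LOCAL_RANK` is an assumption about your launcher's environment, so verify the variable names on your cluster.

```python
import os

def global_rank(gpus_per_node: int) -> int:
    """Illustrative helper: derive a global rank from per-node env vars.

    NODE_RANK is exported by aihpi (see the example above); LOCAL_RANK
    is assumed here for illustration and may differ on your launcher.
    """
    node_rank = int(os.getenv("NODE_RANK", "0"))
    local_rank = int(os.getenv("LOCAL_RANK", "0"))
    return node_rank * gpus_per_node + local_rank
```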
Command Line Interface
The aihpi CLI provides convenient commands for job submission and management:
```shell
# Submit a single-node Python job
aihpi run train.py --config slurm_config.py

# Submit with monitoring
aihpi run train.py --config slurm_config.py --monitor

# Submit distributed job (automatically detected from config)
aihpi run train.py --config distributed_config.py

# Submit LlamaFactory job with app config
aihpi run llamafactory-cli train --config job_config.py --app-config train.yaml

# Monitor a running job
aihpi monitor 12345 --follow

# Check job status
aihpi status

# Cancel a job
aihpi cancel 12345
```
CLI Configuration Files
The CLI uses Python configuration files containing a JobConfig object:
```python
# config.py
from aihpi import JobConfig
from pathlib import Path

config = JobConfig(
    job_name="my_job",
    num_nodes=1,
    gpus_per_node=2,
    walltime="02:00:00",
    partition="gpu",
    log_dir=Path("./logs"),
    login_node="10.130.0.6",
)
```
The CLI automatically determines the submission mode:
- Function mode: single-node Python scripts
- Distributed mode: multi-node Python scripts (when `num_nodes > 1`)
- CLI mode: non-Python executables
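The selection rule above can be sketched in a few lines. This is a hypothetical re-implementation for illustration only, not the actual aihpi code:

```python
from pathlib import Path

def submission_mode(script: str, num_nodes: int) -> str:
    """Illustrative sketch of the CLI's mode-selection rule
    (not the actual aihpi implementation)."""
    if Path(script).suffix != ".py":
        return "cli"           # non-Python executables
    if num_nodes > 1:
        return "distributed"   # multi-node Python scripts
    return "function"          # single-node Python scripts
```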
Advanced Features
- Job Monitoring: Real-time status tracking and log streaming
- Experiment Tracking: Automatic logging of metrics, parameters, and artifacts
- Remote Submission: Submit jobs via SSH from any machine
- LlamaFactory Integration: Built-in support for LLM fine-tuning
See aihpi/examples/ for comprehensive usage examples.
Recommendations
- Use containerized jobs for reproducible environments
- Enable experiment tracking for better ML workflow management
- Monitor long-running jobs with the built-in monitoring utilities
- Configure SSH keys for seamless remote job submission
Package Structure
```
aihpi/
├── cli.py            # Command-line interface
├── core/             # Core job submission functionality
│   ├── config.py     # Configuration classes
│   └── executor.py   # Job executors
├── monitoring/       # Job monitoring utilities
│   └── monitoring.py # Real-time job status and log streaming
├── tracking/         # Experiment tracking integrations
│   └── tracking.py   # W&B, MLflow, and local tracking
└── examples/         # Usage examples
    ├── basic.py      # Basic job submission examples
    └── monitoring.py # Monitoring and tracking examples
```
Limitations
- SLURM Dependency: Requires access to a SLURM cluster environment
- Container Runtime: Container features require Pyxis/Enroot setup
- Network Access: Remote submission requires SSH connectivity to login nodes
License
MIT License - see LICENSE file for details.
Acknowledgements
The AI Service Centre Berlin Brandenburg is funded by the Federal Ministry of Research, Technology and Space under the funding code 01IS22092.