AI High Performance Infrastructure - Distributed job submission for SLURM clusters
aihpi - AI High Performance Infrastructure
A Python package for simplified distributed job submission on SLURM clusters with container support. Built on top of submitit with additional features specifically designed for AI/ML workloads.
Installation
# Basic installation
pip install aihpi
# With experiment tracking support
pip install aihpi[tracking]
# With all optional dependencies
pip install aihpi[all]
Quick Start
from aihpi import SlurmJobExecutor, JobConfig

config = JobConfig(
    job_name="my-training",
    num_nodes=1,
    gpus_per_node=2,
    walltime="01:00:00",
    partition="gpu",
    login_node="10.130.0.6",  # Your SLURM login node IP
)

executor = SlurmJobExecutor(config)
# my_training_function is your own Python callable, defined elsewhere
job = executor.submit_function(my_training_function)
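The Quick Start does not show the function being submitted. As a minimal sketch, assuming the function is executed inside the SLURM job and that the standard SLURM environment variables `SLURM_PROCID`, `SLURM_NTASKS`, and `SLURM_JOB_ID` are visible to it, a placeholder could look like the following. The name `my_training_function` matches the snippet above; the body, the `epochs` parameter, and the returned dictionary are illustrative and not part of aihpi.

```python
import os


def my_training_function(epochs: int = 3) -> dict:
    """Placeholder training loop that reports which SLURM task ran it."""
    # Standard SLURM variables; the defaults let the function run locally too.
    rank = int(os.environ.get("SLURM_PROCID", "0"))
    world_size = int(os.environ.get("SLURM_NTASKS", "1"))
    job_id = os.environ.get("SLURM_JOB_ID", "local")

    loss = 1.0
    for epoch in range(epochs):
        loss *= 0.5  # stand-in for a real optimisation step
        print(f"[job {job_id}] rank {rank}/{world_size} epoch {epoch} loss {loss:.3f}")

    return {"rank": rank, "world_size": world_size, "final_loss": loss}
```

Because every argument has a default, the function can be called without arguments and tested on a workstation before it is sent to the cluster.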
Features
- Simple API: Configure and submit jobs with minimal code
- Command Line Interface: aihpi CLI for easy job submission and management
- Distributed Training: Automatic setup for multi-node distributed training
- Container Support: First-class support for Pyxis/Enroot containers
- Container Submission: Submit jobs from within containers via SSH to login nodes
- LlamaFactory Integration: Built-in support for LlamaFactory training
- Job Monitoring: Real-time job status tracking and log streaming
- Experiment Tracking: Integration with Weights & Biases, MLflow, and local tracking
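To make the Container Submission feature more concrete: submitting from inside a container generally means running `sbatch` on the login node over SSH, since the container itself has no SLURM client. How aihpi does this internally is not described here, so the following is only a generic sketch of the idea. The helper name `build_remote_sbatch_command`, the user name, and the script path are hypothetical.

```python
import shlex
from typing import List, Optional


def build_remote_sbatch_command(
    login_node: str, script_path: str, user: Optional[str] = None
) -> List[str]:
    """Return an argv list that runs `sbatch <script>` on the login node over SSH."""
    target = f"{user}@{login_node}" if user else login_node
    # Quote the remote command so paths with spaces survive the remote shell.
    remote_command = shlex.join(["sbatch", script_path])
    return ["ssh", target, remote_command]


if __name__ == "__main__":
    cmd = build_remote_sbatch_command("10.130.0.6", "/shared/jobs/train.sh", user="alice")
    # Inspect the command first; it could then be passed to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Building the command as a list and keeping execution separate makes it easy to log or dry-run the submission before anything reaches the cluster.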
Command Line Usage
# Submit a Python job
aihpi run train.py --config config.py
# Submit with monitoring
aihpi run train.py --config config.py --monitor
# Submit distributed job
aihpi run train.py --config distributed_config.py
# Monitor a running job
aihpi monitor 12345 --follow
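For readers who want to see what monitoring amounts to at the SLURM level, the sketch below polls a job's state with `squeue`. It is not how `aihpi monitor` is implemented (that is not documented here); it only assumes that `squeue` is on the PATH of the machine running it. The function names and the `runner` parameter are hypothetical and exist so the logic can be exercised without a cluster.

```python
import subprocess
from typing import Callable, List


def squeue_state_command(job_id: int) -> List[str]:
    # -h suppresses the header; -o "%T" prints only the job state (e.g. RUNNING).
    return ["squeue", "-j", str(job_id), "-h", "-o", "%T"]


def _run(cmd: List[str]) -> str:
    """Run a command and return its standard output as text."""
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout


def job_state(job_id: int, runner: Callable[[List[str]], str] = _run) -> str:
    """Return the SLURM state, or "UNKNOWN" once the job has left the queue."""
    output = runner(squeue_state_command(job_id)).strip()
    return output.splitlines()[0] if output else "UNKNOWN"
```

Calling `job_state(12345)` in a loop with a short `time.sleep` between calls gives a basic status tracker; finished jobs disappear from `squeue`, which is why an empty result is mapped to "UNKNOWN" rather than treated as an error.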
Documentation & Examples
For detailed documentation, examples, and setup instructions, visit:
- GitHub Repository: aihpi/aihpi-cluster
- Full Documentation: README.md
Requirements
- Python ≥ 3.8
- Access to a SLURM cluster
- submitit ≥ 1.4.0
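The Python and submitit requirements above can be checked before installing. The following is a small sketch using only the standard library; the helper names `parse_version` and `meets_requirements` are hypothetical, and the version parser is deliberately simple (it ignores suffixes such as `.dev1` or `rc1`).

```python
import sys
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn "1.4.0" into (1, 4, 0); stops at the first non-numeric component."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    # Pad so that "1.4" compares equal to "1.4.0".
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def meets_requirements() -> bool:
    """Check Python >= 3.8 and an installed submitit >= 1.4.0."""
    if sys.version_info < (3, 8):
        return False
    from importlib import metadata  # in the standard library since Python 3.8

    try:
        installed = metadata.version("submitit")
    except metadata.PackageNotFoundError:
        return False
    return parse_version(installed) >= (1, 4, 0)


if __name__ == "__main__":
    print("Requirements met:", meets_requirements())
```

Access to a SLURM cluster cannot be verified this way and still has to be confirmed with the cluster administrators.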
License
MIT License
Acknowledgements
The AI Service Centre Berlin Brandenburg is funded by the Federal Ministry of Research, Technology and Space under the funding code 01IS22092.