
AI High Performance Infrastructure - Distributed job submission for SLURM clusters



aihpi - AI High Performance Infrastructure

aihpi is a Python package that simplifies distributed job submission on SLURM clusters, with container support. It is built on top of submitit and adds features aimed at AI/ML workloads.

Installation

# Basic installation
pip install aihpi

# With experiment tracking support (quoted so shells such as zsh do not expand the brackets)
pip install "aihpi[tracking]"

# With all optional dependencies
pip install "aihpi[all]"

Quick Start

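from aihpi import SlurmJobExecutor, JobConfig


def my_training_function():
    # Placeholder: replace the body with your own training code
    print("training...")


# Describe the resources the job needs
config = JobConfig(
    job_name="my-training",
    num_nodes=1,
    gpus_per_node=2,
    walltime="01:00:00",  # HH:MM:SS
    partition="gpu",
    login_node="10.130.0.6",  # Your SLURM login node IP
)

# Submit the function as a SLURM job
executor = SlurmJobExecutor(config)
job = executor.submit_function(my_training_function)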
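
A function submitted this way usually needs to know which worker it is running as. The following is a minimal sketch, not part of aihpi: it assumes the job environment exposes either the torchrun-style variables `RANK`, `WORLD_SIZE` and `LOCAL_RANK`, or SLURM's own `SLURM_PROCID`, `SLURM_NTASKS` and `SLURM_LOCALID`. Whether aihpi sets the torchrun-style names is an assumption here; the helper names `read_distributed_env` and `report_worker_identity` are hypothetical.

```python
import os


def read_distributed_env(env=None):
    """Return rank information from environment variables.

    Prefers torchrun-style names and falls back to SLURM's names,
    then to single-process defaults. Assumption: one of the two
    naming schemes is present when running on the cluster.
    """
    env = os.environ if env is None else env

    def pick(primary, fallback, default):
        return int(env.get(primary, env.get(fallback, default)))

    return {
        "rank": pick("RANK", "SLURM_PROCID", 0),
        "world_size": pick("WORLD_SIZE", "SLURM_NTASKS", 1),
        "local_rank": pick("LOCAL_RANK", "SLURM_LOCALID", 0),
    }


def report_worker_identity():
    """Hypothetical function body that could be passed to submit_function."""
    info = read_distributed_env()
    print(
        f"rank {info['rank']} of {info['world_size']} "
        f"(local rank {info['local_rank']})"
    )
    return info
```

Run outside a cluster, the helper falls back to a single-process layout, which makes the same function usable for local debugging.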

Features

  • Simple API: Configure and submit jobs with minimal code
  • Command Line Interface: aihpi CLI for easy job submission and management
  • Distributed Training: Automatic setup for multi-node distributed training
  • Container Support: First-class support for Pyxis/Enroot containers
  • Container Submission: Submit jobs from within containers via SSH to login nodes
  • LlamaFactory Integration: Built-in support for LlamaFactory training
  • Job Monitoring: Real-time job status tracking and log streaming
  • Experiment Tracking: Integration with Weights & Biases, MLflow, and local tracking

Command Line Usage

# Submit a Python job
aihpi run train.py --config config.py

# Submit with monitoring
aihpi run train.py --config config.py --monitor

# Submit a distributed job
aihpi run train.py --config distributed_config.py

# Monitor a running job
aihpi monitor 12345 --follow
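
When these commands are launched from a Python script (for example a sweep driver), it helps to build the argument list in one place. The sketch below uses only the subcommands and flags shown above (`run`, `--config`, `--monitor`, `monitor`, `--follow`); the helper names are hypothetical and nothing here is part of aihpi itself.

```python
import shlex


def build_run_command(script, config, monitor=False):
    """Build the argv list for `aihpi run`, using only flags shown above."""
    cmd = ["aihpi", "run", script, "--config", config]
    if monitor:
        cmd.append("--monitor")
    return cmd


def build_monitor_command(job_id, follow=False):
    """Build the argv list for `aihpi monitor <job id>`."""
    cmd = ["aihpi", "monitor", str(job_id)]
    if follow:
        cmd.append("--follow")
    return cmd


# Print the shell-quoted form; pass the list itself to subprocess.run to execute it.
print(shlex.join(build_run_command("train.py", "config.py", monitor=True)))
print(shlex.join(build_monitor_command(12345, follow=True)))
```

Keeping the command as a list rather than a string avoids quoting problems when script or config paths contain spaces.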

Documentation & Examples

For detailed documentation, examples, and setup instructions, visit:

Requirements

  • Python ≥ 3.8
  • Access to a SLURM cluster
  • submitit ≥ 1.4.0
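
The version requirements above can be checked before installing or submitting anything. This is a minimal sketch using only the standard library; the function names are hypothetical, and the version parser is deliberately simple (it reads the leading numeric components and ignores suffixes such as `rc1`).

```python
import sys


def parse_version(text):
    """Turn '1.4.0' or '1.5.2rc1' into a 3-tuple of ints such as (1, 5, 2)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:  # pad so '1.4' compares equal to '1.4.0'
        parts.append(0)
    return tuple(parts[:3])


def check_requirements():
    """Return a list of human-readable problems; empty means all good."""
    if sys.version_info < (3, 8):
        return ["Python 3.8 or newer is required"]

    # importlib.metadata exists from Python 3.8, so import it after the check
    from importlib import metadata

    problems = []
    try:
        installed = metadata.version("submitit")
    except metadata.PackageNotFoundError:
        problems.append("submitit is not installed")
    else:
        if parse_version(installed) < (1, 4, 0):
            problems.append(f"submitit {installed} is older than 1.4.0")
    return problems
```

Access to a SLURM cluster cannot be verified this way; that still has to be checked on the cluster itself.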

License

MIT License


Acknowledgements


The AI Service Centre Berlin Brandenburg is funded by the Federal Ministry of Research, Technology and Space under the funding code 01IS22092.


Download files


Source Distribution

aihpi-0.1.5.tar.gz (43.7 kB)

Uploaded Source

Built Distribution


aihpi-0.1.5-py3-none-any.whl (28.6 kB)

Uploaded Python 3

File details

Details for the file aihpi-0.1.5.tar.gz.

File metadata

  • Download URL: aihpi-0.1.5.tar.gz
  • Upload date:
  • Size: 43.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for aihpi-0.1.5.tar.gz
Algorithm Hash digest
SHA256 26ae7f177ddac2ff887292e54409bec93b5d5b548209e0ce6f26d40d03b15a65
MD5 aa3d30440fd4e0cb9e7973ad372233eb
BLAKE2b-256 bd4ff77cbd80ed7a51fdbe951c700dca11a95a1f944d69eb3284db8924cd620f


File details

Details for the file aihpi-0.1.5-py3-none-any.whl.

File metadata

  • Download URL: aihpi-0.1.5-py3-none-any.whl
  • Upload date:
  • Size: 28.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for aihpi-0.1.5-py3-none-any.whl
Algorithm Hash digest
SHA256 ec86651377ce24af3b4a326e1352e2595a2132840682bccdc1e558c5c38fe760
MD5 a9366cc3e639fc391edf476641b56ce7
BLAKE2b-256 6448414e390ea6b09c1f4b7c54f22a30f0bae8441bc8b17a25efe333fafad40b
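
A downloaded file can be checked against the SHA256 digests listed above with Python's standard `hashlib` module. The sketch below is one way to do it; the function names and the `EXPECTED_SHA256` table are illustrative, with the digests copied from this page.

```python
import hashlib
from pathlib import Path

# SHA256 digests as published on this page for release 0.1.5
EXPECTED_SHA256 = {
    "aihpi-0.1.5.tar.gz":
        "26ae7f177ddac2ff887292e54409bec93b5d5b548209e0ce6f26d40d03b15a65",
    "aihpi-0.1.5-py3-none-any.whl":
        "ec86651377ce24af3b4a326e1352e2595a2132840682bccdc1e558c5c38fe760",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path):
    """Return True if the file's SHA256 matches the digest for its file name."""
    expected = EXPECTED_SHA256[Path(path).name]
    return sha256_of(path) == expected
```

Calling `verify("aihpi-0.1.5.tar.gz")` on a downloaded copy returns `True` only when the file matches the published digest; a file name that is not in the table raises `KeyError`.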

