AI High Performance Infrastructure - Distributed job submission for SLURM clusters

aihpi - AI High Performance Infrastructure

A Python package for simplified distributed job submission on SLURM clusters with container support. Built on top of submitit with additional features specifically designed for AI/ML workloads.

Installation

# Basic installation
pip install aihpi

# With experiment tracking support (quotes keep shells such as zsh from expanding the brackets)
pip install "aihpi[tracking]"

# With all optional dependencies
pip install "aihpi[all]"

Quick Start

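from aihpi import SlurmJobExecutor, JobConfig


def my_training_function():
    """Placeholder: replace with your own training code."""
    print("training...")


# Describe the resources the job needs
config = JobConfig(
    job_name="my-training",
    num_nodes=1,
    gpus_per_node=2,
    walltime="01:00:00",      # One hour, as HH:MM:SS
    partition="gpu",
    login_node="10.130.0.6",  # Your SLURM login node IP
)

# Submit the function to the cluster
executor = SlurmJobExecutor(config)
job = executor.submit_function(my_training_function)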
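
A function submitted this way runs inside a SLURM allocation, so it can inspect the environment variables that SLURM exports to each task. The sketch below is not part of aihpi: the helper name slurm_task_info is hypothetical, and it assumes the job launches its tasks through SLURM so that SLURM_PROCID, SLURM_NTASKS and SLURM_LOCALID are set. When they are missing (for example, when testing locally), it falls back to a single-task layout.

```python
import os
from typing import Dict, Mapping, Optional


def slurm_task_info(env: Optional[Mapping[str, str]] = None) -> Dict[str, int]:
    """Read the task layout that SLURM exports to each launched task."""
    env = os.environ if env is None else env
    return {
        # Global index of this task across the job step.
        "rank": int(env.get("SLURM_PROCID", "0")),
        # Total number of tasks in the job step.
        "world_size": int(env.get("SLURM_NTASKS", "1")),
        # Index of this task on its own node; often used to pick a GPU.
        "local_rank": int(env.get("SLURM_LOCALID", "0")),
    }


def my_training_function() -> Dict[str, int]:
    """Hypothetical entry point; replace the body with real training code."""
    info = slurm_task_info()
    print(f"task {info['rank']} of {info['world_size']} (local {info['local_rank']})")
    return info
```

Passing a plain dictionary instead of os.environ makes the helper easy to test without a cluster.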

Features

  • Simple API: Configure and submit jobs with minimal code
  • Command Line Interface: aihpi CLI for easy job submission and management
  • Distributed Training: Automatic setup for multi-node distributed training
  • Container Support: First-class support for Pyxis/Enroot containers
  • Remote Submission: Submit jobs via SSH from remote machines
  • LlamaFactory Integration: Built-in support for LlamaFactory training
  • Job Monitoring: Real-time job status tracking and log streaming
  • Experiment Tracking: Integration with Weights & Biases, MLflow, and local tracking

Command Line Usage

# Submit a Python job
aihpi run train.py --config config.py

# Submit with monitoring
aihpi run train.py --config config.py --monitor

# Submit a distributed job
aihpi run train.py --config distributed_config.py

# Monitor a running job
aihpi monitor 12345 --follow
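
If you want to drive the CLI from a Python script (for example, to submit several configurations in a loop), the commands above can be assembled as argument lists and passed to subprocess. This is a minimal sketch: the function names are hypothetical, and only the subcommands and flags shown above (run, monitor, --config, --monitor, --follow) are used.

```python
import subprocess
from typing import List


def build_run_command(script: str, config: str, monitor: bool = False) -> List[str]:
    """Build the argument list for `aihpi run <script> --config <config>`."""
    cmd = ["aihpi", "run", script, "--config", config]
    if monitor:
        cmd.append("--monitor")
    return cmd


def build_monitor_command(job_id: int, follow: bool = True) -> List[str]:
    """Build the argument list for `aihpi monitor <job_id>`."""
    cmd = ["aihpi", "monitor", str(job_id)]
    if follow:
        cmd.append("--follow")
    return cmd


def run(cmd: List[str]) -> int:
    """Run the command and return its exit code (requires aihpi on PATH)."""
    return subprocess.run(cmd, check=False).returncode
```

Building the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.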

Documentation & Examples

For detailed documentation, examples, and setup instructions, visit:

Requirements

  • Python ≥ 3.8
  • Access to a SLURM cluster
  • submitit ≥ 1.4.0
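
The Python and submitit requirements can be checked before submitting anything. The sketch below uses only the standard library; the helper names are hypothetical, and the version comparison is deliberately simplified (it compares numeric components only and ignores pre-release suffixes). Access to a SLURM cluster is not checked.

```python
import sys
from importlib import metadata
from typing import Dict, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.4.0' into (1, 4, 0), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two versions after padding them to the same length."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_requirements() -> Dict[str, bool]:
    """Report whether the Python and submitit requirements are met."""
    report = {"python": sys.version_info >= (3, 8)}
    try:
        report["submitit"] = meets_minimum(metadata.version("submitit"), "1.4.0")
    except metadata.PackageNotFoundError:
        # submitit is not installed in this environment.
        report["submitit"] = False
    return report
```

Calling check_requirements() returns a dictionary with one boolean per requirement, which is convenient to print at the top of a submission script.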

License

MIT License


Acknowledgements

The AI Service Centre Berlin Brandenburg is funded by the Federal Ministry of Research, Technology and Space under the funding code 01IS22092.
