
Painless distributed training with torch

Project description


A torch library for easy distributed deep learning on HPC clusters. Supports both Slurm and MPI. No unnecessary abstractions or overhead. A simple yet powerful API.

Highlights

  • Simple, yet powerful, API
  • Easy initialization of torch.distributed
  • Distributed metrics
  • Extensive logging and diagnostics
  • Wandb support
  • Tensorboard support
  • A wealth of useful utility functions

Installation

dmlcloud can be installed directly from PyPI:

pip install dmlcloud

Alternatively, you can install the latest development version directly from GitHub:

pip install git+https://github.com/sehoffmann/dmlcloud.git

Documentation

You can find the official documentation at Read the Docs.

Minimal Example

See examples/mnist.py for a minimal example of how to train on MNIST with multiple GPUs. To run it with 4 GPUs, use:

dmlrun -n 4 python examples/mnist.py

dmlrun is a thin wrapper around torchrun that makes it easier to prototype on a single node.

Slurm Support

dmlcloud automatically looks for Slurm environment variables to initialize torch.distributed. On a Slurm cluster, you can therefore simply call srun from within an sbatch script to train on multiple nodes:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --gpu-bind=none

srun python examples/mnist.py
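
To illustrate what "looking for Slurm environment variables" can mean in practice, here is a minimal sketch of how rank information could be derived from the variables that srun exports for each task. This is not dmlcloud's actual implementation; the names RankInfo and rank_info_from_slurm are hypothetical and exist only for this example. It assumes the standard Slurm variables SLURM_PROCID, SLURM_NTASKS and SLURM_LOCALID are present in the environment of each task.

```python
import os
from typing import Mapping, NamedTuple, Optional


class RankInfo(NamedTuple):
    """Hypothetical container, not part of dmlcloud."""

    rank: int        # global index of this process within the job step
    world_size: int  # total number of processes in the job step
    local_rank: int  # index of this process on its own node


def rank_info_from_slurm(
    env: Optional[Mapping[str, str]] = None,
) -> Optional[RankInfo]:
    """Return rank information from Slurm variables, or None if they are absent."""
    env = os.environ if env is None else env
    try:
        return RankInfo(
            rank=int(env["SLURM_PROCID"]),
            world_size=int(env["SLURM_NTASKS"]),
            local_rank=int(env["SLURM_LOCALID"]),
        )
    except KeyError:
        # At least one variable is missing, so the process was most likely
        # not started by srun.
        return None


if __name__ == "__main__":
    info = rank_info_from_slurm()
    print(info if info is not None else "not running under srun")
```

With the sbatch script above (2 nodes with 4 tasks each), each of the 8 processes would be expected to see a world size of 8, a distinct global rank, and a local rank between 0 and 3. Values like these could then be passed to torch.distributed.init_process_group, which additionally needs a rendezvous address for the processes to find each other.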

FAQ

How is dmlcloud different from similar libraries like pytorch lightning or fastai?

dmlcloud was designed foremost with one underlying principle:

No unnecessary abstractions, just help with distributed training

As a consequence, dmlcloud code is almost identical to a regular PyTorch training loop and requires only a few adjustments. In contrast, other libraries often introduce extensive APIs that can quickly feel overwhelming due to their sheer number of options.

For instance, the constructor of lightning.Trainer has 51 arguments! dml.Pipeline only has 2.
