Painless distributed training with torch

A torch library for easy distributed deep learning on HPC clusters. It supports both Slurm and MPI, avoids unnecessary abstractions and overhead, and offers a simple yet powerful API.

Highlights

  • Simple, yet powerful, API
  • Easy initialization of torch.distributed
  • Distributed metrics
  • Extensive logging and diagnostics
  • Wandb support
  • Tensorboard support
  • A wealth of useful utility functions

Installation

dmlcloud can be installed directly from PyPI:

pip install dmlcloud

Alternatively, you can install the latest development version directly from GitHub:

pip install git+https://github.com/sehoffmann/dmlcloud.git

Documentation

The official documentation is hosted on Read the Docs.

Minimal Example

See examples/mnist.py for a minimal example of how to train on MNIST with multiple GPUs. To run it with 4 GPUs, use:

dmlrun -n 4 python examples/mnist.py

dmlrun is a thin wrapper around torchrun that makes it easier to prototype on a single node.
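
Since dmlrun wraps torchrun, each launched process would normally see the environment variables that torchrun sets (RANK, LOCAL_RANK, WORLD_SIZE). The following is a minimal sketch of how a script could read them. It is not dmlcloud code: the helper name read_torchrun_env is hypothetical, and the sketch assumes dmlrun passes these variables through unchanged.

```python
import os


def read_torchrun_env(env=None):
    """Return (rank, local_rank, world_size) from torchrun-style variables.

    Hypothetical helper for illustration; returns None when the variables
    are missing, e.g. when the script was started with plain `python`.
    """
    env = os.environ if env is None else env
    keys = ("RANK", "LOCAL_RANK", "WORLD_SIZE")
    if not all(key in env for key in keys):
        return None
    return tuple(int(env[key]) for key in keys)


# Simulated environment of the last of 4 processes on a single node.
simulated = {"RANK": "3", "LOCAL_RANK": "3", "WORLD_SIZE": "4"}
print(read_torchrun_env(simulated))
```

On a single node the global rank and the local rank coincide, which is why both are 3 in the simulated environment.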

Slurm Support

dmlcloud automatically looks for Slurm environment variables to initialize torch.distributed. On a Slurm cluster, you can therefore simply use srun from within an sbatch script to train on multiple nodes:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --gpu-bind=none

srun python examples/mnist.py
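
To illustrate the general idea, the sketch below maps the standard per-task variables that Slurm exports under srun (SLURM_PROCID, SLURM_NTASKS, SLURM_LOCALID, SLURM_NODEID) to the quantities torch.distributed needs. It is an assumption about the technique, not dmlcloud's actual implementation, and the function name slurm_dist_info is hypothetical.

```python
import os


def slurm_dist_info(env=None):
    """Translate Slurm task variables into distributed-training terms.

    Hypothetical helper for illustration only; dmlcloud's real
    initialization logic may differ.
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env["SLURM_PROCID"]),         # global task index
        "world_size": int(env["SLURM_NTASKS"]),   # total number of tasks
        "local_rank": int(env["SLURM_LOCALID"]),  # task index within its node
        "node_rank": int(env["SLURM_NODEID"]),    # index of the node in the job
    }


# Simulated environment for one task of the 2-node x 4-task job above:
# the second task on the second node.
simulated = {
    "SLURM_PROCID": "5",
    "SLURM_NTASKS": "8",
    "SLURM_LOCALID": "1",
    "SLURM_NODEID": "1",
}
info = slurm_dist_info(simulated)
print(info)
```

In a manual setup, rank and world_size would typically be passed to torch.distributed.init_process_group together with a rendezvous address, while local_rank selects the GPU for the process.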

FAQ

How is dmlcloud different from similar libraries like PyTorch Lightning or fastai?

dmlcloud was designed foremost with one underlying principle:

No unnecessary abstractions, just help with distributed training

As a consequence, dmlcloud code is almost identical to a regular PyTorch training loop and requires only a few adjustments. In contrast, other libraries often introduce extensive APIs that can quickly feel overwhelming because of their sheer number of options.

For instance, the constructor of lightning.Trainer has 51 arguments! dml.Pipeline only has 2.
