An EXPERIMENTAL torchrun decorator for Metaflow
Project description
Metaflow torchrun decorator
Introduction
This repository implements a plugin to run parallel Metaflow tasks as nodes in a torchrun job which can be submitted to AWS Batch or a Kubernetes cluster.
Features
- Automatic torchrun integration: This extension provides a simple and intuitive way to incorporate PyTorch distributed programs in your Metaflow workflows using the
@torchrun
decorator - No changes to model code: The
@torchrun
decorator exposes a new method on the Metaflow current object, so you can run your existing torch distributed programs inside Metaflow tasks with no changes in the research code. - Run one command: You don't need to log into many nodes and run commands on each. Instead, the
@torchrun
decorator will select arguments for the torchrun command based on the requests in Metaflow compute decorators like number of GPUs. Network addresses are automatically discoverable. - No user-facing subprocess calls: At the end of the day,
@torchrun
is calling a subprocess inside a Metaflow task. Although many Metaflow users do this, it can make code difficult to read for beginners. One major goal of this plugin is to motivate hardening and automating a pattern for submitting subprocess calls inside Metaflow tasks.
Installation
You can install it with:
pip install metaflow-torchrun
Getting Started
And then you can import it and use in parallel steps:
from metaflow import FlowSpec, step, torchrun
...
class MyGPT(FlowSpec):
@step
def start(self):
self.next(self.torch_multinode, num_parallel=N_NODES)
@kubernetes(cpu=N_CPU, gpu=N_GPU, memory=MEMORY)
@torchrun
@step
def torch_multinode(self):
...
current.torch.run(
entrypoint="main.py", # No changes made to original script.
entrypoint_args = {"main-arg-1": "123", "main-arg-2": "777"},
nproc_per_node=1, # edge case of a torchrun arg user-facing.
)
...
...
Examples
Directory | torch script description |
---|---|
Hello | Each process prints their rank and the world size. |
Tensor pass | Main process passes a tensor to the workers. |
Torch DDP | A flow that uses a script from the torchrun tutorials on multi-node DDP. |
MinGPT | A flow that runs a torchrun GPT demo that simplifies Karpathy's minGPT in a set of parallel Metaflow tasks each contributing their @resources . |
License
metaflow-torchrun
is distributed under the Apache License.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
metaflow_torchrun-0.0.8.tar.gz
(11.3 kB
view hashes)
Built Distribution
Close
Hashes for metaflow_torchrun-0.0.8-py2.py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 28dac15c9135b050f1c7348dbb98e24bccc2ceb847116cb14b498d31d7947226 |
|
MD5 | f110f148e2ed6f2a6630b93e4e213f00 |
|
BLAKE2b-256 | a1855cadb0443c38bdb9d639beaca520a2284f4b0fbb4aa11f4debbfe9438825 |