An EXPERIMENTAL torchrun decorator for Metaflow
Metaflow torchrun decorator
Introduction
This repository implements a plugin to run parallel Metaflow tasks as nodes in a torchrun job which can be submitted to AWS Batch or a Kubernetes cluster.
Features
- Automatic torchrun integration: This extension provides a simple and intuitive way to incorporate PyTorch distributed programs into your Metaflow workflows using the `@torchrun` decorator.
- No changes to model code: The `@torchrun` decorator exposes a new method on the Metaflow current object, so you can run your existing torch distributed programs inside Metaflow tasks with no changes to the research code.
- Run one command: You don't need to log into many nodes and run commands on each. Instead, the `@torchrun` decorator selects arguments for the torchrun command based on the requests in Metaflow compute decorators, such as the number of GPUs. Network addresses are discovered automatically.
- No user-facing subprocess calls: Under the hood, `@torchrun` calls a subprocess inside a Metaflow task. Although many Metaflow users do this, it can make code difficult to read for beginners. One major goal of this plugin is to motivate hardening and automating a pattern for submitting subprocess calls inside Metaflow tasks.
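To make the "run one command" idea concrete, the sketch below shows roughly what kind of command has to be assembled on each node. It is not the plugin's actual implementation: `build_torchrun_command` and its parameters are hypothetical names chosen for illustration, and the assumption that each entry of the arguments dictionary becomes a `--key value` pair is ours. Only the torchrun flags themselves (`--nnodes`, `--nproc_per_node`, `--node_rank`, `--master_addr`, `--master_port`) come from torchrun.

```python
import shlex
from typing import Dict, List, Optional


def build_torchrun_command(
    entrypoint: str,
    entrypoint_args: Optional[Dict[str, str]] = None,
    *,
    nnodes: int,
    node_rank: int,
    nproc_per_node: int,
    master_addr: str,
    master_port: int = 29500,
) -> List[str]:
    """Assemble a torchrun argument list for one node of a multi-node job.

    Hypothetical helper for illustration only; the plugin derives these
    values from the Metaflow task context instead of taking them as inputs.
    """
    cmd = [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        entrypoint,
    ]
    # Assumption: each dictionary entry is forwarded to the script
    # as a "--key value" pair.
    for key, value in (entrypoint_args or {}).items():
        cmd.extend([f"--{key}", str(value)])
    return cmd


# Example: the command for the first of two nodes, one process per node.
example_cmd = build_torchrun_command(
    "main.py",
    {"main-arg-1": "123", "main-arg-2": "777"},
    nnodes=2,
    node_rank=0,
    nproc_per_node=1,
    master_addr="10.0.0.1",  # placeholder address
)
print(shlex.join(example_cmd))
```

A list of arguments like this can be handed to `subprocess.run` without a shell, which is the pattern the decorator hides from the user.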
Installation
Install the plugin from PyPI with pip:
pip install metaflow-torchrun
Getting Started
Then import the decorator and use it in parallel steps:
from metaflow import FlowSpec, current, kubernetes, step, torchrun

...

class MyGPT(FlowSpec):

    @step
    def start(self):
        # Fan out into N_NODES parallel tasks, one per torchrun node.
        self.next(self.torch_multinode, num_parallel=N_NODES)

    @kubernetes(cpu=N_CPU, gpu=N_GPU, memory=MEMORY)
    @torchrun
    @step
    def torch_multinode(self):
        ...
        current.torch.run(
            entrypoint="main.py",  # No changes made to the original script.
            entrypoint_args={"main-arg-1": "123", "main-arg-2": "777"},
            nproc_per_node=1,  # Edge case: a torchrun argument exposed to the user.
        )
        ...
    ...
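For orientation, here is a minimal sketch of what a `main.py` entrypoint could look like on the receiving end. It is a hypothetical script, not one shipped with this repository, and it assumes that `entrypoint_args` reach the script as `--main-arg-1 123 --main-arg-2 777`. The `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables are the ones torchrun sets for every worker process; the defaults below only exist so the script also runs outside torchrun.

```python
import argparse
import os
from typing import List, Mapping, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the two illustrative flags; unrecognized flags are ignored."""
    parser = argparse.ArgumentParser(description="Hypothetical torchrun entrypoint")
    parser.add_argument("--main-arg-1", default="0")
    parser.add_argument("--main-arg-2", default="0")
    args, _unknown = parser.parse_known_args(argv)
    return args


def describe_process(env: Mapping[str, str]) -> str:
    """Summarize this worker's position using the variables torchrun exports."""
    rank = int(env.get("RANK", "0"))
    local_rank = int(env.get("LOCAL_RANK", "0"))
    world_size = int(env.get("WORLD_SIZE", "1"))
    return f"rank {rank} (local rank {local_rank}) of world size {world_size}"


def main(argv: Optional[List[str]] = None) -> None:
    args = parse_args(argv)
    print(describe_process(os.environ))
    # argparse maps "--main-arg-1" to the attribute name "main_arg_1".
    print(f"main-arg-1={args.main_arg_1}, main-arg-2={args.main_arg_2}")


if __name__ == "__main__":
    main()
```

Because the script only reads command-line flags and environment variables, it can be tested locally with plain `python main.py` before being launched across nodes.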
Examples
Directory | Torch script description |
---|---|
Hello | Each process prints its rank and the world size. |
Tensor pass | Main process passes a tensor to the workers. |
Torch DDP | A flow that uses a script from the torchrun tutorials on multi-node DDP. |
MinGPT | A flow that runs a torchrun GPT demo, a simplified version of Karpathy's minGPT, in a set of parallel Metaflow tasks, each contributing its `@resources`. |
License
`metaflow-torchrun` is distributed under the Apache License.