
An EXPERIMENTAL torchrun decorator for Metaflow


Metaflow torchrun decorator

Introduction

This repository implements a plugin that runs parallel Metaflow tasks as nodes in a torchrun job, which can be submitted to AWS Batch or a Kubernetes cluster.

Features

  • Automatic torchrun integration: This extension provides a simple way to incorporate PyTorch distributed programs into your Metaflow workflows using the @torchrun decorator.
  • No changes to model code: The @torchrun decorator exposes a new method on the Metaflow current object, so you can run your existing torch distributed programs inside Metaflow tasks without changing the research code.
  • Run one command: You don't need to log into many nodes and run commands on each. Instead, the @torchrun decorator selects arguments for the torchrun command based on the resources requested in Metaflow compute decorators, such as the number of GPUs. Network addresses are discovered automatically.
  • No user-facing subprocess calls: Ultimately, @torchrun calls a subprocess inside a Metaflow task. Many Metaflow users already do this themselves, but it can make code difficult for beginners to read. One major goal of this plugin is to motivate hardening and automating a pattern for submitting subprocess calls inside Metaflow tasks.
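
To make the "run one command" idea concrete, here is a minimal sketch of how a torchrun command line could be assembled for each node from a handful of values. It is not the plugin's actual implementation: the function name build_torchrun_command and its parameters are hypothetical, and the mapping of each entrypoint_args entry to a "--key value" pair is an assumption. The torchrun flags themselves (--nnodes, --nproc_per_node, --node_rank, --master_addr, --master_port) are standard.

```python
from typing import Dict, List, Optional


def build_torchrun_command(
    entrypoint: str,
    nnodes: int,
    node_rank: int,
    master_addr: str,
    master_port: int = 29500,  # torchrun's default rendezvous port
    nproc_per_node: int = 1,
    entrypoint_args: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Assemble the torchrun command for one node (hypothetical helper)."""
    cmd = [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        entrypoint,
    ]
    # Assumption: each dict entry is passed to the script as "--key value".
    for key, value in (entrypoint_args or {}).items():
        cmd.extend([f"--{key}", str(value)])
    return cmd


if __name__ == "__main__":
    # Example values only; a real run would take them from the cluster.
    example = build_torchrun_command(
        entrypoint="main.py",
        nnodes=2,
        node_rank=0,
        master_addr="10.0.0.1",
        entrypoint_args={"main-arg-1": "123", "main-arg-2": "777"},
    )
    print(" ".join(example))
```

In a sketch like this, every node runs the same command and only node_rank differs, which is why the values can be filled in automatically rather than typed on each machine.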

Installation

You can install it with:

pip install metaflow-torchrun

Getting Started

Then import it and use it in parallel steps:

from metaflow import FlowSpec, step, torchrun, kubernetes, current

...
class MyGPT(FlowSpec):

    @step
    def start(self):
        self.next(self.torch_multinode, num_parallel=N_NODES)

    @kubernetes(cpu=N_CPU, gpu=N_GPU, memory=MEMORY)
    @torchrun
    @step
    def torch_multinode(self):
        ...
        current.torch.run(
            entrypoint="main.py",  # The original script, unchanged.
            entrypoint_args={"main-arg-1": "123", "main-arg-2": "777"},
            nproc_per_node=1,      # A torchrun argument exposed to the user for edge cases.
        )
        ...
    ...
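
For reference, here is a hypothetical main.py that a call like the one above could launch. It is a sketch, not a file from this repository: the argument names simply mirror the entrypoint_args keys in the example, and it assumes they reach the script as "--main-arg-1 123" style flags. RANK, WORLD_SIZE and LOCAL_RANK are the environment variables torchrun sets for each worker process.

```python
import argparse
import os
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the script's own arguments (names are illustrative)."""
    parser = argparse.ArgumentParser(description="Hypothetical torchrun entrypoint")
    parser.add_argument("--main-arg-1", default="0")
    parser.add_argument("--main-arg-2", default="0")
    # parse_known_args tolerates extra flags this sketch does not define.
    args, _unknown = parser.parse_known_args(argv)
    return args


def describe_process(args: argparse.Namespace) -> str:
    """Report this worker's place in the job using torchrun's env variables."""
    # Defaults let the script also run standalone, outside torchrun.
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    return (
        f"rank {rank} of {world_size} (local rank {local_rank}): "
        f"main-arg-1={args.main_arg_1}, main-arg-2={args.main_arg_2}"
    )


if __name__ == "__main__":
    print(describe_process(parse_args()))
```

Because the script reads only its command-line arguments and the standard torchrun environment, it contains nothing Metaflow-specific, which is the sense in which the research code stays unchanged.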

Examples

Directory: torch script description

  • Hello: Each process prints its rank and the world size.
  • Tensor pass: The main process passes a tensor to the workers.
  • Torch DDP: A flow that uses a script from the torchrun tutorials on multi-node DDP.
  • MinGPT: A flow that runs a torchrun GPT demo, a simplified version of Karpathy's minGPT, in a set of parallel Metaflow tasks that each contribute their @resources.

License

metaflow-torchrun is distributed under the Apache License.
