
Metaflow torchrun decorator

Introduction

This repository implements a plugin that runs parallel Metaflow tasks as nodes of a torchrun job, which can be submitted to AWS Batch or a Kubernetes cluster.

Features

  • Automatic torchrun integration: This extension provides a simple, intuitive way to incorporate PyTorch distributed programs into your Metaflow workflows using the @torchrun decorator.
  • No changes to model code: The @torchrun decorator exposes a new method on the Metaflow current object, so you can run your existing torch distributed programs inside Metaflow tasks with no changes to the research code.
  • Run one command: You don't need to log into many nodes and run commands on each. Instead, the @torchrun decorator selects arguments for the torchrun command based on the resource requests in Metaflow compute decorators, such as the number of GPUs. Network addresses are discovered automatically.
  • No user-facing subprocess calls: At the end of the day, @torchrun calls a subprocess inside a Metaflow task. Although many Metaflow users do this, it can make code difficult to read for beginners. One major goal of this plugin is to harden and automate a pattern for making subprocess calls inside Metaflow tasks.
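To make the "run one command" bullet concrete, here is a rough sketch of how a torchrun invocation can be assembled from a node count, a rendezvous address, and per-node process count. This is an illustration, not the plugin's actual internals; `build_torchrun_cmd` is a hypothetical helper, and in reality the decorator derives these values from the compute decorator and Metaflow's parallel runtime.

```python
import shlex

def build_torchrun_cmd(entrypoint, entrypoint_args, nnodes, node_rank,
                       master_addr, master_port, nproc_per_node=1):
    """Assemble a torchrun command line (illustrative; hypothetical helper)."""
    cmd = [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc-per-node={nproc_per_node}",
        f"--node-rank={node_rank}",
        f"--master-addr={master_addr}",
        f"--master-port={master_port}",
        entrypoint,
    ]
    # entrypoint_args become ordinary CLI flags for the entrypoint script.
    for key, value in entrypoint_args.items():
        cmd += [f"--{key}", str(value)]
    return shlex.join(cmd)

print(build_torchrun_cmd("main.py", {"main-arg-1": "123"},
                         nnodes=2, node_rank=0,
                         master_addr="10.0.0.1", master_port=29500))
```

The decorator effectively runs a command like this on every node, with only `--node-rank` differing per task.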

Installation

You can install it with:

pip install metaflow-torchrun

Getting Started

Then you can import it and use it in parallel steps:

from metaflow import FlowSpec, current, kubernetes, step, torchrun

...
class MyGPT(FlowSpec):

    @step
    def start(self):
        # N_NODES (like N_CPU, N_GPU, MEMORY below) is a user-defined constant.
        self.next(self.torch_multinode, num_parallel=N_NODES)

    @kubernetes(cpu=N_CPU, gpu=N_GPU, memory=MEMORY)
    @torchrun
    @step
    def torch_multinode(self):
        ...
        current.torch.run(
            entrypoint="main.py",  # No changes made to the original script.
            entrypoint_args={"main-arg-1": "123", "main-arg-2": "777"},
            nproc_per_node=1,      # The one torchrun argument that stays user-facing.
        )
        ...
    ...
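The entrypoint itself needs no Metaflow-specific code. Below is a minimal sketch of what such a main.py could look like; the flag names mirror the snippet above, and RANK/WORLD_SIZE are the standard environment variables torchrun sets in every worker process. A real script would also call torch.distributed.init_process_group(), omitted here so the sketch stays dependency-free.

```python
import argparse
import os

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # entrypoint_args passed to current.torch.run() arrive as ordinary CLI flags.
    parser.add_argument("--main-arg-1", dest="main_arg_1", default=None)
    parser.add_argument("--main-arg-2", dest="main_arg_2", default=None)
    return parser.parse_args(argv)

def describe_worker(argv=None):
    args = parse_args(argv)
    rank = int(os.environ.get("RANK", 0))              # this worker's global rank
    world_size = int(os.environ.get("WORLD_SIZE", 1))  # total number of workers
    return f"rank {rank} of {world_size}: arg1={args.main_arg_1} arg2={args.main_arg_2}"

# Demo with explicit arguments; under torchrun these come from the command line.
print(describe_worker(["--main-arg-1", "123", "--main-arg-2", "777"]))
```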

Examples

Directory     Description
Hello         Each process prints its rank and the world size.
Tensor pass   The main process passes a tensor to the workers.
Torch DDP     A flow that uses a script from the torchrun tutorials on multi-node DDP.
MinGPT        A flow that runs a torchrun GPT demo, a simplified version of Karpathy's minGPT, in a set of parallel Metaflow tasks, each contributing its @resources.

License

metaflow-torchrun is distributed under the Apache License.

Download files

Built Distribution

metaflow_torchrun-0.2.0-py2.py3-none-any.whl (14.9 kB)

Uploaded: Python 2, Python 3

No source distribution files are available for this release.

File details

Hashes for metaflow_torchrun-0.2.0-py2.py3-none-any.whl

Algorithm    Hash digest
SHA256       fc2030501f02302c8e989aba5118645a5ea665a01e17ccd5c4a639b2c7afa03a
MD5          f0884731fdbd969157da7028053293f4
BLAKE2b-256  2307efc714a77e247e9c60d8e48562b8cd96f7667c7b3a4f538678fa8aa92697
