An EXPERIMENTAL Deepspeed decorator for Metaflow

Introduction

Deepspeed is a highly scalable framework from Microsoft for distributed training and model serving. The Metaflow @deepspeed decorator helps you run these workflows inside of Metaflow tasks.

Features

  • Automatic SSH configuration: Multi-node Deepspeed jobs are built around OpenMPI or Horovod. Like Metaflow's @mpi decorator, the @deepspeed decorator automatically configures SSH requirements between nodes, so you can focus on research code.
  • Seamless Python interface: The @deepspeed decorator exposes a current.deepspeed.run method that runs Deepspeed commands on your transient MPI cluster the same way you would launch Deepspeed from the terminal, independent of Metaflow. A major design goal is to provide the orchestration and other benefits of Metaflow without requiring changes to research code.
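
For context, a multi-node Deepspeed launch from the terminal typically needs a hostfile that lists each node with its GPU slots, plus passwordless SSH between those nodes. The sketch below only illustrates that manual setup, which the decorator is meant to spare you. `build_hostfile`, `build_launch_command`, the host names, and the script name are hypothetical, and this is not the decorator's actual implementation.

```python
import shlex


def build_hostfile(hosts, slots_per_host):
    """Return Deepspeed-style hostfile text: one '<host> slots=<n>' line per node."""
    return "\n".join(f"{host} slots={slots_per_host}" for host in hosts) + "\n"


def build_launch_command(hostfile_path, entrypoint, script_args=()):
    """Return the shell command a user might type for a multi-node launch."""
    argv = ["deepspeed", "--hostfile", hostfile_path, entrypoint, *script_args]
    return shlex.join(argv)


# Hypothetical two-node cluster with 8 GPUs per node.
hostfile_text = build_hostfile(["node-0", "node-1"], slots_per_host=8)
command = build_launch_command("hostfile", "my_torch_dist_script.py")
```

Here `hostfile_text` holds two lines, one per node, and `command` is the string you would otherwise run by hand after setting up SSH yourself.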

Installation

Install this experimental module:

pip install metaflow-deepspeed

Getting Started

After installing the module, you can import the deepspeed decorator and use it in your Metaflow steps. The decorator exposes the current.deepspeed.run method, to which you can map the terminal commands you would use to run Deepspeed.

from metaflow import FlowSpec, step, deepspeed, current, batch, environment

class HelloDeepspeed(FlowSpec):

    @step
    def start(self):
        self.next(self.train, num_parallel=2)

    @environment(vars={"NCCL_SOCKET_IFNAME": "eth0"})
    @batch(gpu=8, cpu=64, memory=256000)
    @deepspeed
    @step
    def train(self):
        current.deepspeed.run(
            entrypoint="my_torch_dist_script.py"
        )
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass
        
if __name__ == "__main__":
    HelloDeepspeed()
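
The flow above refers to my_torch_dist_script.py without showing it. As a minimal sketch of how such an entrypoint might begin, the hypothetical helper below works out the process's rank and world size from environment variables. It assumes the launcher sets either the PyTorch-style RANK, LOCAL_RANK, and WORLD_SIZE variables or Open MPI's OMPI_COMM_WORLD_* variables; which of these are present in your setup is an assumption to verify.

```python
import os


def read_dist_env(env=None):
    """Return rank, local rank, and world size as integers.

    Falls back to single-process defaults when no launcher variables are set.
    """
    env = os.environ if env is None else env

    def first(names, default):
        # Use the first variable that is present, converted to int.
        for name in names:
            if name in env:
                return int(env[name])
        return default

    return {
        "rank": first(("RANK", "OMPI_COMM_WORLD_RANK"), 0),
        "local_rank": first(("LOCAL_RANK", "OMPI_COMM_WORLD_LOCAL_RANK"), 0),
        "world_size": first(("WORLD_SIZE", "OMPI_COMM_WORLD_SIZE"), 1),
    }


if __name__ == "__main__":
    info = read_dist_env()
    print(f"rank {info['rank']} of {info['world_size']} (local rank {info['local_rank']})")
```

Run on its own, outside any launcher, the script reports a single process, which makes it a cheap way to check that the entrypoint is wired up before requesting GPUs.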

Examples

Directory | MPI program description
CPU Check | The easiest way to check your Deepspeed infrastructure on CPUs.
Hello Deepspeed | The easiest way to check your Deepspeed infrastructure on GPUs.
BERT | Train your BERT model using Deepspeed!
Dolly | A multi-node implementation of Databricks' Dolly.

Cloud-specific use cases

Directory | MPI program description
Automatically upload a directory on AWS | Push a checkpoint of any directory to S3 after the Deepspeed process completes.
Automatically upload a directory on Azure | Push a checkpoint of any directory to Azure Blob storage after the Deepspeed process completes.
Use Metaflow S3 client from the Deepspeed process | Upload arbitrary bytes to S3 storage from the Deepspeed process.
Use Metaflow Azure Blob client from the Deepspeed process | Upload arbitrary bytes to Azure Blob storage from the Deepspeed process.
Use a Metaflow Huggingface checkpoint on S3 | Push a checkpoint to S3 at the end of each epoch using a customizable Huggingface callback. See the implementation here to build your own.
Use a Metaflow Huggingface checkpoint on Azure | Push a checkpoint to Azure Blob storage at the end of each epoch using a customizable Huggingface callback. See the implementation here to build your own.
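
As a rough sketch of the "upload a directory" pattern, the hypothetical helper below walks a checkpoint directory and builds (key, path) pairs. It assumes the pairs are then handed to an uploader such as the put_files method of Metaflow's S3 client. The helper name, key prefix, and directory layout are illustrative, and the examples listed above may do this differently.

```python
import os
import tempfile


def collect_upload_pairs(local_dir, key_prefix):
    """Return sorted (object_key, local_path) pairs for every file under local_dir."""
    pairs = []
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, local_dir)
            # Object keys use forward slashes regardless of the local OS.
            key = "/".join([key_prefix.strip("/")] + rel.split(os.sep))
            pairs.append((key, path))
    return sorted(pairs)


# Small self-contained demonstration on a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "epoch-1"))
    with open(os.path.join(tmp, "epoch-1", "model.bin"), "wb") as f:
        f.write(b"weights")
    demo_keys = [key for key, _path in collect_upload_pairs(tmp, "checkpoints/")]

# Assumed usage inside a Metaflow step, with Metaflow installed:
#     from metaflow import S3
#     with S3(run=self) as s3:
#         s3.put_files(collect_upload_pairs("checkpoints", "checkpoints"))
```

Building the pairs separately from the upload keeps the directory walk easy to test locally, without cloud credentials.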

License

metaflow-deepspeed is distributed under the Apache License.

