An EXPERIMENTAL Deepspeed decorator for Metaflow
Project description
Introduction
Deepspeed is a highly scalable framework from Microsoft for distributed training and model serving. The Metaflow @deepspeed
decorator helps you run these workloads inside Metaflow tasks.
Features
- Automatic SSH configuration: Multi-node Deepspeed jobs are built around OpenMPI or Horovod. Like Metaflow's @mpi decorator, the @deepspeed decorator automatically configures the SSH requirements between nodes, so you can focus on research code.
- Seamless Python interface: Metaflow's @deepspeed exposes a method, current.deepspeed.run, that makes it easy to run Deepspeed commands on your transient MPI cluster, just as you'd launch Deepspeed from the terminal outside of Metaflow (see the sketch after this list). A major design goal is to get the orchestration and other benefits of Metaflow without requiring modification to research code.
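For illustration, the mapping from a terminal launch to the Python interface looks roughly like this. This is a sketch: entrypoint is the keyword shown in Getting Started below, while entrypoint_args and the script name are assumptions used here to show how script arguments might be forwarded; check the signature of current.deepspeed.run in your installed version.

# Terminal launch, outside of Metaflow:
#   deepspeed my_torch_dist_script.py --epochs 3
#
# Roughly equivalent call inside a @deepspeed-decorated @step.
# `entrypoint` is documented below; `entrypoint_args` is an assumed keyword
# for forwarding script arguments and may differ in your version.
current.deepspeed.run(
    entrypoint="my_torch_dist_script.py",
    entrypoint_args=["--epochs", "3"],
)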
Installation
Install this experimental module:
pip install metaflow-deepspeed
Getting Started
After installing the module, you can import the deepspeed decorator and use it in your Metaflow steps.
This exposes the current.deepspeed.run method, to which you can map the terminal commands you would otherwise use to launch Deepspeed.
from metaflow import FlowSpec, step, deepspeed, current, batch, environment


class HelloDeepspeed(FlowSpec):

    @step
    def start(self):
        # num_parallel=2 launches two nodes that form the transient cluster.
        self.next(self.train, num_parallel=2)

    @environment(vars={"NCCL_SOCKET_IFNAME": "eth0"})
    @batch(gpu=8, cpu=64, memory=256000)
    @deepspeed
    @step
    def train(self):
        current.deepspeed.run(
            entrypoint="my_torch_dist_script.py"
        )
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    HelloDeepspeed()
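The entrypoint passed to current.deepspeed.run is an ordinary script that Deepspeed launches on every node; nothing in it needs to know about Metaflow. Below is a minimal sketch of what a hypothetical my_torch_dist_script.py could contain; the model, data, and config values are placeholders rather than anything shipped with metaflow-deepspeed.

# my_torch_dist_script.py -- a minimal, hypothetical Deepspeed entrypoint.
# The Deepspeed launcher passes --local_rank to every process it starts.
import argparse

import torch
import deepspeed

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()

# Placeholder model and config; replace with your own research code.
model = torch.nn.Linear(32, 2)
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

# deepspeed.initialize wraps the model in a distributed engine.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

for _ in range(10):
    x = torch.randn(8, 32).to(model_engine.device)
    y = torch.randint(0, 2, (8,)).to(model_engine.device)
    loss = torch.nn.functional.cross_entropy(model_engine(x), y)
    model_engine.backward(loss)  # DeepSpeed handles scaling and allreduce
    model_engine.step()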
Examples
Directory | Description
---|---
CPU Check | The easiest way to check your Deepspeed infrastructure on CPUs. |
Hello Deepspeed | The easiest way to check your Deepspeed infrastructure on GPUs. |
BERT | Train your BERT model using Deepspeed! |
Dolly | A multi-node implementation of Databricks' Dolly. |
Cloud-specific use cases
Directory | Description
---|---
Automatically upload a directory on AWS | Push a checkpoint of any directory to S3 after the Deepspeed process completes. |
Automatically upload a directory on Azure | Push a checkpoint of any directory to Azure Blob storage after the Deepspeed process completes. |
Use Metaflow S3 client from the Deepspeed process | Upload arbitrary bytes to S3 storage from the Deepspeed process. |
Use Metaflow Azure Blob client from the Deepspeed process | Upload arbitrary bytes to Azure Blob storage from the Deepspeed process. |
Use a Metaflow Huggingface checkpoint on S3 | Push a checkpoint to S3 at the end of each epoch using a customizable Huggingface callback. See the implementation here to build your own. |
Use a Metaflow Huggingface checkpoint on Azure | Push a checkpoint to Azure Blob storage at the end of each epoch using a customizable Huggingface callback. See the implementation here to build your own. |
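For the "Use Metaflow S3 client from the Deepspeed process" row above, the upload itself is ordinary Metaflow S3 usage. Here is a minimal sketch, assuming your Metaflow deployment is already configured for S3; the bucket path and key are placeholders.

# Sketch: push arbitrary bytes to S3 from inside the Deepspeed process.
# The s3root below is a placeholder; point it at a bucket/prefix you own.
from metaflow import S3

checkpoint_bytes = b"..."  # e.g. a serialized model state_dict

with S3(s3root="s3://my-bucket/deepspeed-checkpoints/") as s3:
    # put() uploads a string or bytes object under the given key.
    url = s3.put("epoch-0/model.pt", checkpoint_bytes)
    print("uploaded to", url)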
License
metaflow-deepspeed is distributed under the Apache License.
Project details
Download files
File details
Details for the file metaflow_deepspeed-0.0.9.tar.gz.
File metadata
- Download URL: metaflow_deepspeed-0.0.9.tar.gz
- Upload date:
- Size: 27.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.12.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | b31ece64b7183da5796fb02fa21f4e3e00a02ca139654848e6d16211c5a1d035
MD5 | 7089ba3aea0b0ca0c42323dda429c2e4
BLAKE2b-256 | a077facd58d7a094f819b4bcb68769251f8e8faf52bc2342909c88e309dbded6
File details
Details for the file metaflow_deepspeed-0.0.9-py2.py3-none-any.whl.
File metadata
- Download URL: metaflow_deepspeed-0.0.9-py2.py3-none-any.whl
- Upload date:
- Size: 35.5 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.12.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | df7f603428da66ff2053b8e6d12fe77027fc92cf8bd76d28fe52b2cae8b625c0
MD5 | 1f346cc275aea949d4cba0979ea3ccf3
BLAKE2b-256 | 091352e212adcde83b16566423d7d35aef7c5356cca25fa157cb33988322db64