A framework for efficient fault tolerance in large-scale distributed training with pipeline templates.
Oobleck
Resilient Distributed Training Framework
Oobleck is a large-model training framework with fast fault recovery support utilizing the concept of pipeline templates.
It is the first training framework that realizes:
- Dynamic reconfiguration: Oobleck can reconfigure the distributed training configuration after failures without restarting.
- Pipeline template instantiation: Oobleck pre-generates a set of pipeline templates, then combines pipelines instantiated from them to form a distributed execution plan. After a failure, the same set of templates is reused to instantiate a different set of pipelines.
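The idea of covering a changing number of nodes with a fixed set of templates can be illustrated with a toy planner. This is only a sketch, not Oobleck's actual planning algorithm: it assumes each template is characterized solely by the number of nodes it occupies, and `plan_pipelines` and `template_sizes` are hypothetical names introduced for illustration.

```python
from typing import Optional


def plan_pipelines(template_sizes: list[int], num_nodes: int) -> Optional[list[int]]:
    """Pick one template size per pipeline so the sizes sum to num_nodes.

    Toy model (an assumption, not Oobleck's planner): a template is
    described only by how many nodes one pipeline built from it uses.
    Returns None when no combination covers exactly num_nodes.
    """
    sizes = sorted(s for s in set(template_sizes) if s > 0)
    # best[n] holds one combination of template sizes using exactly n nodes.
    best: list[Optional[list[int]]] = [None] * (num_nodes + 1)
    best[0] = []
    for n in range(1, num_nodes + 1):
        for size in sizes:
            if size <= n and best[n - size] is not None:
                best[n] = best[n - size] + [size]
                break
    return best[num_nodes]


if __name__ == "__main__":
    templates = [2, 3, 4]  # hypothetical template sizes, in nodes
    print(plan_pipelines(templates, 13))  # plan before a failure
    print(plan_pipelines(templates, 12))  # same templates, one node lost
```

The point of the sketch is that the template set stays fixed; only the combination of instantiated pipelines changes when the node count does.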
Getting Started
Install
Use pip to install Oobleck:
pip install oobleck
Oobleck relies on cornstarch for pipeline templates and on Colossal-AI as the training backend.
Optionally, install apex, xformers, and flash-attn to boost throughput (follow the instructions in each README).
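To see which of the optional packages are visible to the current Python environment, a small check along these lines may help. It is a sketch under assumptions: the import names `apex`, `xformers`, and `flash_attn` are the usual module names for those projects, and `available_optional_packages` is a hypothetical helper, not part of Oobleck.

```python
import importlib.util

# Assumed mapping from package name to importable module name.
OPTIONAL_MODULES = {
    "apex": "apex",
    "xformers": "xformers",
    "flash-attn": "flash_attn",
}


def available_optional_packages(modules: dict[str, str] = OPTIONAL_MODULES) -> dict[str, bool]:
    """Report, per package, whether its module can be found without importing it."""
    return {
        package: importlib.util.find_spec(module) is not None
        for package, module in modules.items()
    }


if __name__ == "__main__":
    for package, found in available_optional_packages().items():
        print(f"{package}: {'installed' if found else 'not found'}")
```

`find_spec` only locates the module, so the check does not trigger any GPU initialization.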
Run
Please refer to this README.
Cluster Management
Oobleck provides a command line interface (CLI) that manages the cluster. Use oobleck to access the master agent:
$ oobleck --ip <master_ip> --port <master_port> <command> <command_options>
where the master port appears in the stdout of the running master service:
| INFO | __main__:serve:430 - Running master service on port 45145
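If the master's stdout is captured to a file or pipe, the port can be pulled out with a regular expression. This is a minimal sketch that assumes the log message keeps the wording of the sample line above; `find_master_port` is a hypothetical helper name.

```python
import re
from typing import Optional

# Pattern assumed from the sample log line; it may change between releases.
PORT_PATTERN = re.compile(r"Running master service on port (\d+)")


def find_master_port(stdout_text: str) -> Optional[int]:
    """Return the first master port found in captured stdout, or None."""
    for line in stdout_text.splitlines():
        match = PORT_PATTERN.search(line)
        if match:
            return int(match.group(1))
    return None


if __name__ == "__main__":
    sample = "| INFO | __main__:serve:430 - Running master service on port 45145"
    print(find_master_port(sample))
```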
Currently you can see the list of agents and send a request to gracefully terminate an agent:
$ oobleck --ip <master_ip> --port <master_port> get_agent_list
=== Agents ===
[0] IP: node1:10000 Status: up (device indices: 0,1)
[1] IP: node1:10000 Status: up (device indices: 2,3)
[2] IP: node2:10000 Status: up (device indices: 0,1)
[3] IP: node2:10000 Status: up (device indices: 2,3)
==============
$ oobleck --ip <master_ip> --port <master_port> kill_agent --agent_index 2
| INFO | __main__:KillAgent:340 - Terminating agent 2 on node1:10000
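For scripting around the CLI, the get_agent_list output can be parsed into records. The sketch below assumes the output keeps exactly the layout of the sample above; the `Agent` class and `parse_agent_list` function are hypothetical and not part of Oobleck.

```python
import re
from dataclasses import dataclass


@dataclass
class Agent:
    index: int
    address: str
    status: str
    devices: list[int]


# Line layout assumed from the sample output: "[i] IP: host:port Status: s (device indices: a,b)"
AGENT_LINE = re.compile(
    r"^\[(\d+)\]\s+IP:\s+(\S+)\s+Status:\s+(\w+)\s+\(device indices:\s*([\d,\s]*)\)$"
)


def parse_agent_list(output: str) -> list[Agent]:
    """Parse agent lines, skipping banner lines such as '=== Agents ==='."""
    agents = []
    for line in output.splitlines():
        match = AGENT_LINE.match(line.strip())
        if match is None:
            continue
        index, address, status, devices = match.groups()
        device_ids = [int(d) for d in devices.split(",") if d.strip()]
        agents.append(Agent(int(index), address, status, device_ids))
    return agents
```

The resulting `index` values are what `kill_agent --agent_index` expects, so a script can select an agent by address or device before issuing the request.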
Citation
@inproceedings{oobleck-sosp23,
title = {Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates},
author = {Jang, Insu and Yang, Zhenning and Zhang, Zhen and Jin, Xin and Chowdhury, Mosharaf},
booktitle = {ACM SIGOPS 29th Symposium on Operating Systems Principles (SOSP '23)},
year = {2023},
}
Download files
File details
Details for the file oobleck-0.1.1.tar.gz.
File metadata
- Download URL: oobleck-0.1.1.tar.gz
- Upload date:
- Size: 34.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.13
File hashes
Algorithm | Hash digest
---|---
SHA256 | 7c061c7012a52477f3ede642c616f4f425e7c081acec4dc40719ca55687a849f
MD5 | dc6476cb83c839361b9d59d8fdadae2b
BLAKE2b-256 | f2fdad26205344f09c5361542e3a48ed2ea7b4c5682890fd071f3463faa05e76
File details
Details for the file oobleck-0.1.1-cp310-cp310-manylinux_2_28_x86_64.whl.
File metadata
- Download URL: oobleck-0.1.1-cp310-cp310-manylinux_2_28_x86_64.whl
- Upload date:
- Size: 946.6 kB
- Tags: CPython 3.10, manylinux: glibc 2.28+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.13
File hashes
Algorithm | Hash digest
---|---
SHA256 | b9765947dc7c620104f1437e6d6bef014beb9fc4d0c06c56b96304456d678abe
MD5 | defb60e1669f5fb6ddeafb084a85b805
BLAKE2b-256 | 77bef97212c07bce6ff9b73037cbf20d19875fc7c59845b70046edb2d66838c6
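A downloaded file can be checked against the SHA256 digests listed above using Python's standard library. This is a minimal sketch; the helper names are hypothetical, and the local file path in the usage comment is an assumption about where the download was saved.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage, assuming the source distribution was saved in the current directory:
# matches_published_digest(
#     Path("oobleck-0.1.1.tar.gz"),
#     "7c061c7012a52477f3ede642c616f4f425e7c081acec4dc40719ca55687a849f",
# )
```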