A framework for efficient fault tolerance in large-scale distributed training with pipeline templates.
Oobleck
Resilient Distributed Training Framework
Oobleck is a large-model training framework that supports fast fault recovery using pipeline templates.
It is the first training framework that realizes:
- Dynamic reconfiguration: Oobleck can reconfigure the distributed training configuration after failures without restarting.
- Pipeline template instantiation: Oobleck pre-generates a set of pipeline templates, then combines their instantiated pipelines to form a distributed execution plan. After failures, the same set of pipeline templates is reused to instantiate different pipelines.
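The idea of reusing one set of templates across failures can be illustrated with a toy planner. This is only a sketch under strong assumptions, not Oobleck's planner: each template is reduced to a node count, and the hypothetical function plan_pipelines simply finds one combination of templates that uses every available node, ignoring everything a real planner would weigh (such as throughput).

```python
def plan_pipelines(template_sizes, num_nodes):
    """Pick templates (by node count) whose sizes sum exactly to num_nodes.

    template_sizes: node counts of the pre-generated templates (hypothetical input).
    Returns a list with one entry per instantiated pipeline, or None if the
    available nodes cannot be covered exactly.
    """
    # Try larger templates first; a template may be instantiated more than once.
    sizes = sorted(set(template_sizes), reverse=True)

    def search(remaining, start):
        if remaining == 0:
            return []
        for i in range(start, len(sizes)):
            size = sizes[i]
            if size <= remaining:
                rest = search(remaining - size, i)
                if rest is not None:
                    return [size] + rest
        return None

    return search(num_nodes, 0)


templates = [2, 3, 4]                      # templates are generated once
before = plan_pipelines(templates, 9)      # plan for 9 healthy nodes
after = plan_pipelines(templates, 8)       # same templates after one node fails
```

With templates of 2, 3, and 4 nodes, 9 nodes yields [4, 3, 2]; after one node fails, the same templates yield [4, 4] for the remaining 8 nodes, without generating any new template.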
Getting Started
Install
Use pip to install Oobleck:
pip install oobleck
Oobleck relies on cornstarch for pipeline templates and on Colossal-AI as its training backend.
Optionally, install apex, xformers and flash-attn to boost throughput (follow instructions in each README).
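To check which of these optional packages are already present in the current environment, a small helper like the following may be useful. It is a sketch, not part of Oobleck; the import names apex, xformers, and flash_attn are assumed to match the installed distributions, and available_optional_modules is a hypothetical helper.

```python
from importlib.util import find_spec

# Assumed import names for the optional packages mentioned above.
OPTIONAL_MODULES = ("apex", "xformers", "flash_attn")


def available_optional_modules(names=OPTIONAL_MODULES):
    """Map each top-level module name to True if it can be imported here."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, present in available_optional_modules().items():
        print(f"{name}: {'installed' if present else 'missing'}")
```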
Run
Please refer to this README.
Cluster Management
Oobleck provides a command line interface (CLI) that manages the cluster. Use oobleck to access the master agent:
$ oobleck --ip <master_ip> --port <master_port> <command> <command_options>
where <master_port> is the port that the master service prints to stdout when it starts:
| INFO | __main__:serve:430 - Running master service on port 45145
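If the master's stdout is captured to a file or a pipe, the port can be extracted from that log line programmatically. The following is a minimal sketch that assumes the message text stays exactly as shown above; find_master_port is a hypothetical helper, not an Oobleck API.

```python
import re

# Matches the log message shown above; the exact wording is an assumption.
MASTER_PORT_PATTERN = re.compile(r"Running master service on port (\d+)")


def find_master_port(stdout_text):
    """Return the master port found in captured stdout, or None if absent."""
    match = MASTER_PORT_PATTERN.search(stdout_text)
    return int(match.group(1)) if match else None


log = "| INFO | __main__:serve:430 - Running master service on port 45145"
port = find_master_port(log)
```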
Currently, you can list the agents and send a request to gracefully terminate an agent:
$ oobleck --ip <master_ip> --port <master_port> get_agent_list
=== Agents ===
[0] IP: node1:10000 Status: up (device indices: 0,1)
[1] IP: node1:10000 Status: up (device indices: 2,3)
[2] IP: node2:10000 Status: up (device indices: 0,1)
[3] IP: node2:10000 Status: up (device indices: 2,3)
==============
$ oobleck --ip <master_ip> --port <master_port> kill_agent --agent_index 2
| INFO | __main__:KillAgent:340 - Terminating agent 2 on node2:10000
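For scripting around the CLI, the get_agent_list output can be parsed into structured records. This sketch assumes the line format is exactly the one printed in the example above; parse_agent_list is a hypothetical helper and not part of Oobleck.

```python
import re

# One agent per line, e.g. "[0] IP: node1:10000 Status: up (device indices: 0,1)".
AGENT_LINE = re.compile(
    r"^\[(\d+)\] IP: (\S+) Status: (\S+) \(device indices: (\d+(?:,\d+)*)\)$"
)


def parse_agent_list(output):
    """Parse get_agent_list output into a list of dicts, skipping other lines."""
    agents = []
    for line in output.splitlines():
        match = AGENT_LINE.match(line.strip())
        if match:
            agents.append({
                "index": int(match.group(1)),
                "address": match.group(2),
                "status": match.group(3),
                "devices": [int(d) for d in match.group(4).split(",")],
            })
    return agents


sample = """=== Agents ===
[0] IP: node1:10000 Status: up (device indices: 0,1)
[1] IP: node1:10000 Status: up (device indices: 2,3)
[2] IP: node2:10000 Status: up (device indices: 0,1)
[3] IP: node2:10000 Status: up (device indices: 2,3)
=============="""
agents = parse_agent_list(sample)
```

Each record carries the agent index that kill_agent expects in --agent_index, so a script can select agents by address or device before sending the request.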
Citation
@inproceedings{oobleck-sosp23,
title = {Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates},
author = {Jang, Insu and Yang, Zhenning and Zhang, Zhen and Jin, Xin and Chowdhury, Mosharaf},
booktitle = {ACM SIGOPS 29th Symposium on Operating Systems Principles (SOSP '23)},
year = {2023},
}
File details
Details for the file oobleck-0.1.1.tar.gz.
File metadata
- Download URL: oobleck-0.1.1.tar.gz
- Upload date:
- Size: 34.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7c061c7012a52477f3ede642c616f4f425e7c081acec4dc40719ca55687a849f |
| MD5 | dc6476cb83c839361b9d59d8fdadae2b |
| BLAKE2b-256 | f2fdad26205344f09c5361542e3a48ed2ea7b4c5682890fd071f3463faa05e76 |
File details
Details for the file oobleck-0.1.1-cp310-cp310-manylinux_2_28_x86_64.whl.
File metadata
- Download URL: oobleck-0.1.1-cp310-cp310-manylinux_2_28_x86_64.whl
- Upload date:
- Size: 946.6 kB
- Tags: CPython 3.10, manylinux: glibc 2.28+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b9765947dc7c620104f1437e6d6bef014beb9fc4d0c06c56b96304456d678abe |
| MD5 | defb60e1669f5fb6ddeafb084a85b805 |
| BLAKE2b-256 | 77bef97212c07bce6ff9b73037cbf20d19875fc7c59845b70046edb2d66838c6 |
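A downloaded file can be checked against the SHA256 digests listed above using only the Python standard library. This is a generic sketch; sha256_of_file and matches_sha256 are hypothetical helper names, and the file path in the usage comment is an assumption about where the download was saved.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Return True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage (assumes the source distribution was saved in the current directory):
# matches_sha256(
#     "oobleck-0.1.1.tar.gz",
#     "7c061c7012a52477f3ede642c616f4f425e7c081acec4dc40719ca55687a849f",
# )
```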