This package is a Python library that supports multiple communication groups for PyTorch's distributed package.
MultiWorld Framework for PyTorch
About
This repository implements the MultiWorld framework for PyTorch. It enables fault management functionality for collective communication libraries (CCL) such as NCCL on top of the PyTorch distributed package. The fault management functionality includes (i) detection, (ii) tolerance (or resilience), and (iii) recovery. The framework in the multiworld folder can be installed as a Python package using the instructions given below.
Project Summary
Background and Motivation
In machine learning (ML) and artificial intelligence (AI), it is crucial for models to be reliable. As ML models are increasingly deployed in production, they face problems such as hardware and network failures. Since ML inference is a long-running service, inference workloads must handle these problems quickly and gracefully. In particular, as models grow larger, deploying them across multiple GPUs and hosts becomes unavoidable, which makes fault management challenging.
MultiWorld is a framework aimed at supporting fault management in ML inference workloads. Built on PyTorch, a prominent deep learning framework, MultiWorld addresses the need for robustness in ML deployments.
Key Contributions
The framework is built on top of PyTorch, a widely used deep learning framework, and will support various backends such as NCCL and Gloo for distributed computing.
The MultiWorld framework allows each worker to be part of multiple worlds, as displayed in the above figure. Using MultiWorld, each worker can send data to and receive data from any of the worlds with a single line of code and minimal switching cost. MultiWorld ships as a Python package.
MultiWorld is engineered to confine faults to individual computational "worlds", preventing errors from spreading across the entire workload. If something goes wrong in one worker, only the worlds to which that worker belongs are affected; the other worlds keep running. Despite adding fault management mechanisms, MultiWorld maintains the integrity of each computational context, preserving the underlying structure and minimizing overhead. This approach allows developers to enhance fault management without requiring significant changes to their existing codebase or workflow. In many cases, developers only need to replace PyTorch's send/recv with the MultiWorld counterpart (send/recv under the WorldCommunicator module).
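The fault confinement idea described above can be illustrated with a small toy model. This is only a sketch in plain Python and does not use the MultiWorld or PyTorch APIs: the ToyWorlds class, its methods, and the world and worker names are all hypothetical, chosen purely to show that a failed worker affects only the worlds it belongs to.

```python
from collections import defaultdict


class ToyWorlds:
    """Hypothetical in-memory model of workers that belong to several worlds.

    This is NOT the MultiWorld API; it only illustrates fault confinement.
    """

    def __init__(self):
        self.members = defaultdict(set)  # world name -> set of worker names
        self.broken = set()              # names of worlds marked as broken

    def join(self, world, worker):
        """Add a worker to a world; a worker may join many worlds."""
        self.members[world].add(worker)

    def fail_worker(self, worker):
        """Mark every world that contains the failed worker as broken."""
        for world, workers in self.members.items():
            if worker in workers:
                self.broken.add(world)

    def healthy_worlds(self):
        """Return the sorted names of worlds that are not broken."""
        return sorted(w for w in self.members if w not in self.broken)


worlds = ToyWorlds()
# The leader is part of two worlds, each shared with a different worker.
worlds.join("world1", "leader")
worlds.join("world1", "worker_a")
worlds.join("world2", "leader")
worlds.join("world2", "worker_b")

# worker_a fails: only world1 is affected, world2 stays usable.
worlds.fail_worker("worker_a")
print(worlds.healthy_worlds())  # ['world2']
```

In this toy model the leader can keep exchanging data with worker_b through world2, which is the behavior the paragraph above describes at the level of communication groups.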
Folder Information
- docs contains additional documents.
- examples contains examples that demonstrate the usage of the multiworld framework.
- multiworld contains the source code for the multiworld package.
- patch contains patch files to install the multiworld source code into the installed PyTorch package.
- scripts contains scripts for generating the patch file, primarily for developers contributing to the multiworld source code.
Key Source Files Information
- multiworld/world_manager.py contains the WorldManager class to create and manage multiple worlds.
- multiworld/world_communicator.py contains the WorldCommunicator class to manage communication between different worlds.
- multiworld/watchdog.py contains the WatchDog class to closely monitor the status of the worlds and clean up the broken worlds.
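To give a rough intuition for the monitor-and-clean-up role described for the watchdog, here is a minimal sketch in plain Python. It assumes a heartbeat-with-timeout scheme, which is an assumption made for illustration and not a description of how WatchDog is implemented; the ToyWatchdog class, its methods, and the timeout value are hypothetical.

```python
class ToyWatchdog:
    """Hypothetical heartbeat-based monitor; NOT the MultiWorld WatchDog class."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # world name -> timestamp of the last heartbeat

    def heartbeat(self, world, now):
        """Record that a world was observed alive at time `now` (seconds)."""
        self.last_seen[world] = now

    def sweep(self, now):
        """Remove and return worlds whose last heartbeat is older than the timeout."""
        stale = sorted(
            world
            for world, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
        for world in stale:
            del self.last_seen[world]
        return stale


dog = ToyWatchdog(timeout_s=5.0)
dog.heartbeat("world1", now=0.0)
dog.heartbeat("world2", now=0.0)
dog.heartbeat("world2", now=8.0)  # world2 keeps reporting, world1 goes silent

removed = dog.sweep(now=10.0)
print(removed)  # ['world1']
```

Timestamps are passed in explicitly so the sketch stays deterministic; a real monitor would typically read a clock and run the sweep periodically in the background.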
Dependencies and Version
- PyTorch version: 2.4.0
Installation
To use the latest official package,
pip install multiworld
To install the package from source,
pip install .
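After installing, it can be useful to confirm which versions ended up in the environment. The snippet below is a small sketch that uses only the Python standard library (importlib.metadata); the helper name installed_version is hypothetical, and the distribution names multiworld and torch are taken from the install command and dependency listed above.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Print what is installed; None means the distribution was not found.
for name in ("multiworld", "torch"):
    print(name, installed_version(name))
```

If torch reports a version other than the one listed under Dependencies and Version, that mismatch is worth checking first when something does not work.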
Running Examples
The list of all available examples can be found in the examples folder.
We recommend starting with the send_recv example.
Contributors
How to Contribute
If you wish to contribute or suggest additional functionality, please check out the Contributing Guidelines.
Citation
@misc{m8d2024,
title={Enabling Elastic Model Serving with MultiWorld},
author={Myungjin Lee and Akshay Jajoo and Ramana Rao Kompella},
year={2024},
eprint={2407.08980},
archivePrefix={arXiv},
primaryClass={cs.DC},
url={https://arxiv.org/abs/2407.08980},
}
License