This package is a Python library that supports multiple communication groups for PyTorch's distributed package.

MultiWorld Framework for PyTorch

About

This repository implements the MultiWorld framework for PyTorch. It enables fault management for collective communication libraries (CCLs) such as NCCL on top of the PyTorch distributed package. Fault management includes (i) detection, (ii) tolerance (or resilience), and (iii) recovery. The framework in the multiworld folder can be installed as a Python package using the instructions given below.

Project Summary

Single World vs. Multi World

Background and Motivation

In machine learning (ML) and artificial intelligence (AI), model reliability is crucial. As ML models are deployed more widely in production, they encounter problems such as hardware and network failures. Because ML inference is a long-running service, inference workloads must handle these failures quickly and gracefully. In particular, as models grow larger, deploying them across multiple GPUs and hosts becomes unavoidable, which makes fault management challenging.

MultiWorld is a framework that supports fault management in ML inference workloads. Built on PyTorch, a prominent deep learning framework, it addresses the need for robustness in ML deployments.

Key Contributions

The framework is built on top of PyTorch, a widely-used deep learning framework, and will support various backends such as NCCL and Gloo for distributed computing.

The MultiWorld framework allows each worker to be part of multiple worlds, as displayed in the figure above. Using MultiWorld, each worker can send data to and receive data from any of the worlds with a single line of code and minimal switching cost. MultiWorld is built on top of the PyTorch framework and ships as a Python package.

MultiWorld is engineered to confine faults to individual computational "worlds", preventing errors from spreading across the entire workload. If something goes wrong in one worker, only the worlds that the worker belongs to are affected; the other worlds keep running. Despite adding fault management mechanisms, MultiWorld maintains the integrity of each computational context, preserving the underlying structure and minimizing overhead. This approach allows developers to enhance fault management without significant changes to their existing codebase or workflow. In many cases, developers only need to replace PyTorch's send/recv with their MultiWorld counterparts (send/recv in the WorldCommunicator module).
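
The fault-confinement idea described above can be illustrated with a toy model in plain Python. This is only a conceptual sketch: the `World` class and the `affected_worlds` helper are hypothetical names invented for this illustration, they are not part of the multiworld API, and no real communication takes place.

```python
from dataclasses import dataclass, field


@dataclass
class World:
    """Toy stand-in for a communication group (hypothetical, not multiworld's class)."""
    name: str
    workers: set = field(default_factory=set)
    broken: bool = False


def affected_worlds(worlds, failed_worker):
    """Mark and return the names of the worlds that contain the failed worker."""
    hit = []
    for world in worlds:
        if failed_worker in world.workers:
            world.broken = True
            hit.append(world.name)
    return hit


# Worker "B" belongs to two worlds; workers "C" and "D" share a third one.
worlds = [
    World("world1", {"A", "B"}),
    World("world2", {"B", "C"}),
    World("world3", {"C", "D"}),
]

# Worker "A" fails: only the world it belongs to is marked as broken.
broken = affected_worlds(worlds, failed_worker="A")
healthy = [w.name for w in worlds if not w.broken]
print("broken:", broken)
print("healthy:", healthy)
```

In this model, worker "B" loses `world1` but can keep using `world2`, which mirrors the claim that a fault affects only the worlds containing the failed worker.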

Folder Information

  • docs contains additional documents
  • examples contains examples that demonstrate the usage of the multiworld framework.
  • multiworld contains the source code for the multiworld package.
  • patch contains patch files to install the multiworld source code into the installed PyTorch package.
  • scripts contains scripts for generating the patch file, primarily for developers contributing to the multiworld source code.

Key Source Files Information

  • multiworld/world_manager.py contains WorldManager class to create and manage multiple worlds.
  • multiworld/world_communicator.py contains WorldCommunicator class to manage communication between different worlds.
  • multiworld/watchdog.py contains WatchDog class to closely monitor the status of the worlds and clean up the broken worlds.
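
To make the monitoring role concrete, the sketch below shows one generic way a watchdog could detect and clean up unresponsive worlds. The source does not describe how WatchDog detects a broken world, so the heartbeat-timeout mechanism here is an assumption chosen for illustration, and `ToyWatchdog` with its methods is a hypothetical class, not the library's implementation.

```python
class ToyWatchdog:
    """Heartbeat-based liveness tracker (illustrative; not multiworld's WatchDog)."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # world name -> timestamp of its last heartbeat

    def heartbeat(self, world, now):
        """Record that a world reported in at time `now`."""
        self.last_seen[world] = now

    def sweep(self, now):
        """Remove and return worlds whose last heartbeat is older than the timeout."""
        stale = sorted(w for w, t in self.last_seen.items() if now - t > self.timeout)
        for w in stale:
            del self.last_seen[w]
        return stale


watchdog = ToyWatchdog(timeout=3.0)
watchdog.heartbeat("world1", now=0.0)
watchdog.heartbeat("world2", now=0.0)
watchdog.heartbeat("world2", now=4.0)  # world2 keeps reporting; world1 goes silent

removed = watchdog.sweep(now=5.0)
print("cleaned up:", removed)
print("still tracked:", sorted(watchdog.last_seen))
```

Timestamps are passed in explicitly so the example is deterministic; a real monitor would typically read a clock and run the sweep periodically in the background.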

Dependencies and Version

Installation

To use the latest official package,

pip install multiworld

To install the package from source,

pip install .
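
After installing, one way to confirm that the package is visible to the current Python environment is to query its metadata with the standard library. This is a minimal sketch; `installed_version` is a helper name introduced here, and it relies only on `importlib.metadata`, not on any multiworld API.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version if multiworld is installed in this environment, else None.
print("multiworld version:", installed_version("multiworld"))
```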

Running Examples

The full list of available examples can be found in the examples folder. We recommend starting with the send_recv example.

Contributors

contributors

How to Contribute

If you wish to contribute or suggest additional functionality, please check out the Contributing Guidelines.

Citation

@misc{m8d2024,
      title={Enabling Elastic Model Serving with MultiWorld}, 
      author={Myungjin Lee and Akshay Jajoo and Ramana Rao Kompella},
      year={2024},
      eprint={2407.08980},
      archivePrefix={arXiv},
      primaryClass={cs.DC},
      url={https://arxiv.org/abs/2407.08980}, 
}

License

Apache License 2.0.
