A lightweight library for adding fault tolerance to large-scale PyTorch distributed training workloads.
torchsnapshot
This library is currently in Alpha and does not yet have a stable release. The API may change in ways that are not backward compatible. If you have suggestions for improvements, please open a GitHub issue. We'd love to hear your feedback.
Install
Requires Python >= 3.7 and PyTorch >= 1.11
From pip:
pip install --pre torchsnapshot-nightly
From source:
git clone https://github.com/facebookresearch/torchsnapshot
cd torchsnapshot
pip install -r requirements.txt
python setup.py install
Concepts
- Stateful object - an object whose state can be obtained via .state_dict() and restored via .load_state_dict(). Most PyTorch components (e.g. Module, Optimizer, LRScheduler) already implement this protocol.
- App state - the application state described using multiple stateful objects.
- Snapshot - the persisted app state.
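To make the stateful-object protocol concrete, here is a minimal sketch of a custom class that implements it in plain Python. The class name TrainProgress and its fields are hypothetical and are not part of torchsnapshot; the only things taken from the text above are the two method names.

```python
from typing import Any, Dict


class TrainProgress:
    """Hypothetical stateful object tracking how far training has advanced."""

    def __init__(self) -> None:
        self.epoch = 0
        self.step = 0

    def state_dict(self) -> Dict[str, Any]:
        # Return a plain dict capturing everything needed to resume.
        return {"epoch": self.epoch, "step": self.step}

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        # Restore the fields from a previously captured state.
        self.epoch = state_dict["epoch"]
        self.step = state_dict["step"]


# Capture the state of one object ...
progress = TrainProgress()
progress.epoch, progress.step = 3, 1200
saved = progress.state_dict()

# ... and load it into a fresh one, as a restore would.
resumed = TrainProgress()
resumed.load_state_dict(saved)
```

An object like this could presumably be placed in the app state alongside the model and optimizer, so that training counters are persisted together with the weights.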
Basic Usage
Describing the application state with multiple stateful objects:
app_state = {"model": model, "optimizer": optimizer}
Taking a snapshot of the application state:
from torchsnapshot import Snapshot
# File System
snapshot = Snapshot.take(path="/foo/bar/baz", app_state=app_state)
# S3
snapshot = Snapshot.take(path="s3://foo/bar", app_state=app_state)
# Google Cloud Storage
snapshot = Snapshot.take(path="gcs://foo/bar", app_state=app_state)
Referencing an existing snapshot:
snapshot = Snapshot(path="/foo/bar/baz")
Restoring the application state from a snapshot:
snapshot.restore(app_state=app_state)
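For fault tolerance, a training script typically tries to restore before it starts training. The helper below is a rough sketch of that pattern, not part of the library: the function name restore_if_present is hypothetical, and it assumes a local file system path where an existing directory means a snapshot was taken earlier. Only Snapshot(path=...) and .restore(app_state=...) come from the usage shown above.

```python
import os
from typing import Any, Dict


def restore_if_present(path: str, app_state: Dict[str, Any]) -> bool:
    """Restore app_state in place from a local snapshot directory, if one exists.

    Returns True when a restore happened, False when training starts fresh.
    """
    # Assumption: `path` is a local file system path, and a directory at that
    # path means a snapshot was taken there earlier.
    if not os.path.isdir(path):
        return False

    # Imported here so the helper can be defined, and the fresh-start branch
    # exercised, on machines where torchsnapshot is not installed.
    from torchsnapshot import Snapshot

    Snapshot(path=path).restore(app_state=app_state)
    return True
```

A script might call this once before its training loop and then take new snapshots at whatever interval suits the workload. Remote paths such as s3:// or gcs:// would need a different existence check.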
See the example directory for more examples.
License
torchsnapshot is BSD licensed, as found in the LICENSE file.
File details
Details for the file torchsnapshot-nightly-2022.10.9.tar.gz.
File metadata
- Download URL: torchsnapshot-nightly-2022.10.9.tar.gz
- Upload date:
- Size: 48.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.7.13
File hashes
Algorithm | Hash digest
---|---
SHA256 | 822f8a3ca6b3fba07046bd948bba9069aaecfa1ec440fa79c1c5e6e246eb14d0
MD5 | c43637c5ac9fbfdff81a7687dec299f2
BLAKE2b-256 | 74550a66435198e230274eb61aaf31d652c3cdbb23847a903f61c68c62faa52a
File details
Details for the file torchsnapshot_nightly-2022.10.9-py3-none-any.whl.
File metadata
- Download URL: torchsnapshot_nightly-2022.10.9-py3-none-any.whl
- Upload date:
- Size: 59.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.7.13
File hashes
Algorithm | Hash digest
---|---
SHA256 | 98a45143b5c25195eebb91907c46afa8be5ed65dfc5fcfa72dc1cbacc3bbfbd3
MD5 | f97af94bed1003c881631e274eb527ec
BLAKE2b-256 | 65ea43b11c563d99e3183fae90bb622edc81702bdbd2455c0c8c0916b169e2b8