A lightweight library for adding fault tolerance to large-scale PyTorch distributed training workloads.
torchsnapshot
This library is in Alpha and does not yet have a stable release. The API may change in ways that are not backward compatible. If you have suggestions for improvements, please open a GitHub issue. We'd love to hear your feedback.
Install
Requires Python >= 3.7 and PyTorch >= 1.11
From pip:
pip install --pre torchsnapshot-nightly
From source:
git clone https://github.com/facebookresearch/torchsnapshot
cd torchsnapshot
pip install -r requirements.txt
python setup.py install
Concepts
- Stateful object - an object whose state can be obtained via .state_dict() and restored via .load_state_dict(). Most PyTorch components (e.g. Module, Optimizer, LRScheduler) already implement this protocol.
- App state - the application state described using multiple stateful objects.
- Snapshot - the persisted app state.
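To make the stateful-object protocol concrete, here is a minimal pure-Python sketch. The `Progress` class and the in-memory `saved` dictionary are hypothetical names used only for illustration. They are not part of torchsnapshot and do not reflect how the library persists data. The sketch only shows the two methods an object needs and how an app state groups such objects by name.

```python
from typing import Any, Dict


class Progress:
    """Hypothetical stateful object that tracks training progress."""

    def __init__(self) -> None:
        self.epoch = 0
        self.steps = 0

    def state_dict(self) -> Dict[str, Any]:
        # Return everything needed to reconstruct this object later.
        return {"epoch": self.epoch, "steps": self.steps}

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        # Overwrite the current state with previously saved values.
        self.epoch = state_dict["epoch"]
        self.steps = state_dict["steps"]


# App state: a dictionary of named stateful objects.
progress = Progress()
progress.epoch, progress.steps = 3, 1200
app_state = {"progress": progress}

# In-memory stand-in for persistence (illustration only):
# collect the state dict of every object in the app state.
saved = {name: obj.state_dict() for name, obj in app_state.items()}

# Later, e.g. after a restart, load the saved state into fresh objects.
restored_app_state = {"progress": Progress()}
for name, obj in restored_app_state.items():
    obj.load_state_dict(saved[name])
```

Any object that exposes these two methods can sit in the app state next to a Module or an Optimizer.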
Basic Usage
Describing the application state with multiple stateful objects:
app_state = {"model": model, "optimizer": optimizer}
Taking a snapshot of the application state:
from torchsnapshot import Snapshot
# File System
snapshot = Snapshot.take(path="/foo/bar/baz", app_state=app_state)
# S3
snapshot = Snapshot.take(path="s3://foo/bar", app_state=app_state)
# Google Cloud Storage
snapshot = Snapshot.take(path="gcs://foo/bar", app_state=app_state)
Referencing an existing snapshot:
snapshot = Snapshot(path="foo/bar/baz")
Restoring the application state from a snapshot:
snapshot.restore(app_state=app_state)
See the example directory for more examples.
License
torchsnapshot is BSD licensed, as found in the LICENSE file.
File details
Details for the file torchsnapshot-nightly-2022.8.13.tar.gz.
File metadata
- Download URL: torchsnapshot-nightly-2022.8.13.tar.gz
- Upload date:
- Size: 38.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.7.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f0ea3a9868b880e3be3e8f93084754e439aa64661ae62fd60190874a13e99064 |
| MD5 | d9ee633bfafdb048147e4d88be5e1924 |
| BLAKE2b-256 | b4bc0108af676f7d6135a5d81748f2fc29045400c5bbc042e38b329bf47795e3 |
File details
Details for the file torchsnapshot_nightly-2022.8.13-py3-none-any.whl.
File metadata
- Download URL: torchsnapshot_nightly-2022.8.13-py3-none-any.whl
- Upload date:
- Size: 48.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.7.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5611a0b95fe823ec6c57055861345e46b19738a6c512c1f8263f033f77d598e7 |
| MD5 | 73942ea8e30e3107e9a4fecf3046454e |
| BLAKE2b-256 | 217430e5650689be9572ecfd77aa4e8501f09bb325617953b90c66a3d1eb40a6 |