A lightweight library for adding fault tolerance to large-scale PyTorch distributed training workloads.

torchsnapshot

This library is currently in Alpha and does not yet have a stable release. The API may change and may not be backward compatible. If you have suggestions for improvements, please open a GitHub issue. We'd love to hear your feedback.

Install

Requires Python >= 3.7 and PyTorch >= 1.11

From pip:

pip install --pre torchsnapshot-nightly

From source:

git clone https://github.com/facebookresearch/torchsnapshot
cd torchsnapshot
pip install -r requirements.txt
python setup.py install

Concepts

  • Stateful object - an object whose state can be obtained via .state_dict() and restored via .load_state_dict(). Most PyTorch components (e.g. Module, Optimizer, LRScheduler) already implement this protocol.
  • App state - the application state described using multiple stateful objects.
  • Snapshot - the persisted app state.
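
The stateful-object protocol described above can be illustrated without PyTorch. The following is a minimal sketch, not code from torchsnapshot: the Progress class and its epoch/step fields are hypothetical names chosen for illustration. It only shows the two methods the protocol asks for and a round trip between them.

```python
from typing import Any, Dict


class Progress:
    """Hypothetical training-progress tracker that follows the stateful protocol."""

    def __init__(self) -> None:
        self.epoch = 0
        self.step = 0

    def state_dict(self) -> Dict[str, Any]:
        # Return everything needed to resume, as plain values.
        return {"epoch": self.epoch, "step": self.step}

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        # Overwrite the current state in place with the saved values.
        self.epoch = state_dict["epoch"]
        self.step = state_dict["step"]


# Round trip: capture the state of one object and load it into a fresh one.
progress = Progress()
progress.epoch, progress.step = 3, 1200
saved = progress.state_dict()

resumed = Progress()
resumed.load_state_dict(saved)
print(resumed.epoch, resumed.step)  # 3 1200
```

Because it exposes .state_dict() and .load_state_dict(), an object like this could presumably be placed in the app state next to a model and an optimizer, for example under a "progress" key.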

Basic Usage

Describing the application state with multiple stateful objects:

app_state = {"model": model, "optimizer": optimizer}

Taking a snapshot of the application state:

from torchsnapshot import Snapshot

# File System
snapshot = Snapshot.take(path="/foo/bar/baz", app_state=app_state)

# S3
snapshot = Snapshot.take(path="s3://foo/bar", app_state=app_state)

# Google Cloud Storage
snapshot = Snapshot.take(path="gcs://foo/bar", app_state=app_state)

Referencing an existing snapshot:

snapshot = Snapshot(path="/foo/bar/baz")

Restoring the application state from a snapshot:

snapshot.restore(app_state=app_state)

See the example directory for more examples.

License

torchsnapshot is BSD licensed, as found in the LICENSE file.
