Fast Low-Overhead Recovery

License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

FLOR: Experiment Management for ML Engineers

You can use FLOR to take checkpoints during model training. These checkpoints allow you to restore arbitrary training data post-hoc and efficiently, thanks to memoization and parallelism speedups on replay.

FLOR is a suite of machine learning tools for hindsight logging. Hindsight logging is an optimistic logging practice favored by agile model developers. Model developers log training metrics such as the loss and accuracy by default, and selectively restore additional training data --- like tensor histograms, images, and overlays --- post-hoc, if and when there is evidence of a problem.

FLOR is software developed at UC Berkeley's RISE Lab, and is being released as part of an accompanying VLDB publication.

Installation

pip install pyflor

FLOR expects a recent version of Python (3.7+) and PyTorch (1.0+).

git checkout -b flor.shadow
python3 examples/linear.py --flor linear

Run the linear.py script to test your installation. This script trains a small linear model on MNIST. Think of it as the "hello world" of deep learning. We will cover FLOR shadow branches later.

ls ~/.flor/linear

Confirm that FLOR saved checkpoints of the linear.py execution in your home directory. FLOR accesses and interprets the contents of ~/.flor automatically. Keep an eye on the storage footprint, though: if you see disk space running out, check ~/.flor. FLOR includes utilities for spooling its checkpoints to S3.
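To keep an eye on that footprint, a small standard-library script can report how much space the checkpoint directory occupies. This is a minimal sketch that is not part of FLOR: the helper names directory_size_bytes and format_size are hypothetical, and the only detail taken from the text above is the ~/.flor location.

```python
from pathlib import Path


def directory_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root (0 if root is missing)."""
    if not root.exists():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def format_size(num_bytes: int) -> str:
    """Render a byte count with a binary unit suffix."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} GiB"


if __name__ == "__main__":
    # ~/.flor is where the text above says FLOR stores its checkpoints.
    flor_home = Path.home() / ".flor"
    print(f"{flor_home}: {format_size(directory_size_bytes(flor_home))}")
```

Running it after a few experiments gives a quick sense of whether it is time to clean up or spool checkpoints elsewhere.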

Preparing your Training Script

from flor import MTK as Flor
for epoch in Flor.loop(range(...)):
    ...

First, wrap the iterator of the main loop with FLOR's generator: Flor.loop. The generator enables FLOR to parallelize replay of the main loop, and to jump to an arbitrary epoch for data recovery.

from flor import MTK as Flor

import torch

trainloader: torch.utils.data.DataLoader
testloader:  torch.utils.data.DataLoader
optimizer:   torch.optim.Optimizer
net:         torch.nn.Module
criterion:   torch.nn.modules.loss._Loss

Flor.checkpoints(net, optimizer)
for epoch in Flor.loop(range(...)):
    for data in Flor.loop(trainloader):
        inputs, labels = data
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        print(f"loss: {loss.item()}")
    eval(net, testloader)

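Next, pass the model and optimizer to Flor.checkpoints before the main loop, and wrap the nested training loop with Flor.loop as well, as in the listing above. That's it: your training code is now ready for record-replay.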

Training your model

python3 training_script.py --flor NAME [your_script_flags]

Before you train your model, make sure that your model training code is part of a git repository. Model training is exploratory, and it's common to iterate dozens of times before finding the right fit. We'd hate for you to be manually responsible for managing all those versions. Instead, we ask you to create a FLOR shadow branch that we can automatically commit changes to. Think of it as a sandbox: you get the benefits of autosaving, without worrying about us polluting your main branch with frequent, automatic commits. Later, you can merge the changes you like.

In FLOR, all experiments need a name. As your training scripts and configurations evolve, keep the same experiment name so FLOR associates the checkpoints as versions of the same experiment. If you want to re-use the name from the previous run, you may leave the field blank.

Hindsight Logging

from flor import MTK as Flor
import torch

trainloader: torch.utils.data.DataLoader
testloader:  torch.utils.data.DataLoader
optimizer:   torch.optim.Optimizer
net:         torch.nn.Module
criterion:   torch.nn.modules.loss._Loss

for epoch in Flor.loop(range(...)):
    for batch in Flor.loop(trainloader):
        ...
    eval(net, testloader)
    log_confusion_matrix(net, testloader)

Suppose you want to view a confusion matrix as it changes throughout training. Add the code that generates the confusion matrix, as shown above.

python3 training_script.py --replay_flor

First, switch to the FLOR shadow branch and select the version you wish to replay from the git log. In our example, we won't check out a version, because we want to replay the latest version, which is selected by default.

Tell FLOR to replay by setting the flag --replay_flor. FLOR performs fast replay, so you may generalize this example to recover ad-hoc training data. In our example, FLOR will compute your confusion matrix and automatically skip the nested training loop by loading its checkpoints.
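The idea of skipping work by loading checkpoints can be illustrated with a toy record-replay loop. This is only a sketch of the general memoization technique and not FLOR's implementation: run_epochs, train_epoch, probe, and the pickle-per-epoch file layout are all hypothetical names and choices made for this example.

```python
import pickle
import tempfile
from pathlib import Path


def run_epochs(num_epochs, checkpoint_dir, replay, train_epoch, probe=None):
    """Toy record-replay loop (illustration only, not FLOR's code).

    Record: run train_epoch and pickle the resulting state after each epoch.
    Replay: load the pickled state instead of retraining, then call probe.
    """
    checkpoint_dir = Path(checkpoint_dir)
    state = 0
    for epoch in range(num_epochs):
        ckpt = checkpoint_dir / f"epoch_{epoch}.pkl"
        if replay and ckpt.exists():
            # Memoized path: restore the end-of-epoch state, skip training.
            state = pickle.loads(ckpt.read_bytes())
        else:
            state = train_epoch(state, epoch)
            ckpt.write_bytes(pickle.dumps(state))
        if probe is not None:
            # Hindsight logging: code added after the fact sees restored state.
            probe(epoch, state)
    return state


if __name__ == "__main__":
    def train_epoch(state, epoch):
        print(f"training epoch {epoch}")
        return state + epoch

    with tempfile.TemporaryDirectory() as tmp:
        run_epochs(3, tmp, replay=False, train_epoch=train_epoch)
        # On replay, no "training epoch" lines are printed; only the probe runs.
        run_epochs(3, tmp, replay=True, train_epoch=train_epoch,
                   probe=lambda e, s: print(f"epoch {e}: state={s}"))
```

In this toy version the probe plays the role of log_confusion_matrix: it was not part of the recorded run, yet it can be evaluated for every epoch without re-running the expensive step.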

from flor import MTK as Flor
import torch

trainloader: torch.utils.data.DataLoader
testloader:  torch.utils.data.DataLoader
optimizer:   torch.optim.Optimizer
net:         torch.nn.Module
criterion:   torch.nn.modules.loss._Loss

for epoch in Flor.loop(range(...)):
    for batch in Flor.loop(trainloader, probed=True):
        ...
    eval(net, testloader)
    log_confusion_matrix(net, testloader)

Now, suppose you also want TensorBoard to plot the tensor histograms. In this case, it is not possible to skip the nested training loop because we are probing intermediate data. We tell FLOR to step into the nested training loop by setting probed=True.

Although we can't skip the nested training loop, we can parallelize replay or re-execute just a fraction of the epochs (e.g. near the epoch where we see a loss anomaly).

python3 training_script.py --replay_flor PID/NGPUS [your_flags]

As before, you tell FLOR to run in replay mode by setting --replay_flor. You'll also tell FLOR how many GPUs from the pool to use for parallelism, and you'll dispatch this script simultaneously, varying the pid:<int> to span all the GPUs. To run segment 3 out of 5 segments, you would write: --replay_flor 3/5.
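Dispatching one replay process per segment can be scripted. The sketch below only builds the command lines using the --replay_flor PID/NGPUS flag described above. It assumes that pid values are zero-based (0 through NGPUS-1), which the text does not state explicitly; replay_commands and launch_all are hypothetical helper names, not part of FLOR.

```python
import subprocess
import sys
from typing import List, Sequence


def replay_commands(script: str, ngpus: int,
                    extra_args: Sequence[str] = ()) -> List[List[str]]:
    """Build one replay command per segment.

    Assumption: pid runs from 0 to ngpus - 1. Adjust the range if your
    FLOR version numbers segments differently.
    """
    return [
        [sys.executable, script, "--replay_flor", f"{pid}/{ngpus}", *extra_args]
        for pid in range(ngpus)
    ]


def launch_all(commands: List[List[str]]) -> List[int]:
    """Start every command at once and wait for all of them to finish."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]


if __name__ == "__main__":
    # Print the commands instead of launching them; call launch_all to run.
    for cmd in replay_commands("training_script.py", ngpus=4):
        print(" ".join(cmd))
```

Printing the commands first is a cheap way to check the flags before occupying several GPUs.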

If, instead of replaying all of training, you wish to re-execute only a fraction of the epochs, you can do so by setting the values of pid and ngpus accordingly. Suppose you want to run the tenth epoch of a training job that ran for 200 epochs. You would set pid:9 and ngpus:200.
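The arithmetic behind these two examples can be sketched as follows. The source does not spell out how FLOR maps pid and ngpus to epochs, so this assumes a contiguous, near-equal split with zero-based pid values, which is consistent with pid:9 and ngpus:200 selecting the tenth epoch. Both parse_replay_flag and epochs_for_segment are hypothetical helpers.

```python
from typing import Tuple


def parse_replay_flag(value: str) -> Tuple[int, int]:
    """Split a 'PID/NGPUS' string into integers (zero-based pid assumed)."""
    pid_text, ngpus_text = value.split("/")
    pid, ngpus = int(pid_text), int(ngpus_text)
    if ngpus <= 0 or not 0 <= pid < ngpus:
        raise ValueError(f"expected 0 <= PID < NGPUS, got {value!r}")
    return pid, ngpus


def epochs_for_segment(num_epochs: int, pid: int, ngpus: int) -> range:
    """Return the zero-based epoch indices covered by segment pid.

    Assumed scheme: segment pid covers [pid * E // n, (pid + 1) * E // n),
    so the segments are contiguous and together cover every epoch once.
    """
    start = pid * num_epochs // ngpus
    stop = (pid + 1) * num_epochs // ngpus
    return range(start, stop)


if __name__ == "__main__":
    pid, ngpus = parse_replay_flag("9/200")
    # With 200 epochs split 200 ways, each segment holds a single epoch.
    print(list(epochs_for_segment(200, pid, ngpus)))
```

Under this assumed scheme, setting ngpus equal to the number of epochs turns each segment into a single epoch, which is how one flag can serve both parallel replay and targeted re-execution.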

We provide additional examples in the examples directory. A good starting point is linear.py.

Publications

To cite this work, please refer to the Hindsight Logging paper (VLDB '21).

FLOR is open source software developed at UC Berkeley. Joe Hellerstein (databases), Joey Gonzalez (machine learning), and Koushik Sen (programming languages) are the primary faculty members leading this work.

This work is released as part of Rolando Garcia's doctoral dissertation at UC Berkeley, and has been the subject of study by Eric Liu and Anusha Dandamudi, both of whom completed their master's theses on FLOR. Our list of publications is reproduced below. Finally, we thank Vikram Sreekanti, Dan Crankshaw, and Neeraja Yadwadkar for guidance, comments, and advice. Bobby Yan was instrumental in the development of FLOR and its corresponding experimental evaluation.

Hindsight Logging for Model Training. R Garcia, E Liu, V Sreekanti, B Yan, A Dandamudi, JE Gonzalez, JM Hellerstein, K Sen. The VLDB Journal, 2021.
Fast Low-Overhead Logging Extending Time. A Dandamudi. EECS Department, UC Berkeley Technical Report, 2021.
Low Overhead Materialization with FLOR. E Liu. EECS Department, UC Berkeley Technical Report, 2020.

License

FLOR is licensed under the Apache v2 License.
