Fast Low-Overhead Recovery

FLOR

Flor (for "fast low-overhead recovery") is a record-replay system for deep learning, and other forms of machine learning that train models on GPUs. Flor was developed to speed up hindsight logging: a cyclic-debugging practice that involves adding logging statements after encountering a surprise, and then efficiently re-training with more logging. Flor takes low-overhead checkpoints during training (the record phase) and uses those checkpoints for replay speedups based on memoization and parallelism.
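To make the record-replay idea concrete, here is a minimal, library-free sketch of how per-epoch checkpoints could let replay be split into independent segments. It does not use Flor's API or reflect its internals: `partition_epochs`, `record`, `replay_segment`, and the integer "model state" are all hypothetical stand-ins chosen for illustration.

```python
from typing import Dict, List, Tuple


def train_one_epoch(state: int, epoch: int) -> int:
    # Stand-in for an expensive training epoch: a deterministic state update.
    return state + (epoch + 1)


def record(num_epochs: int) -> Dict[int, int]:
    """Record phase: save the state at the start of every epoch."""
    checkpoints: Dict[int, int] = {}
    state = 0
    for epoch in range(num_epochs):
        checkpoints[epoch] = state
        state = train_one_epoch(state, epoch)
    return checkpoints


def partition_epochs(num_epochs: int, num_workers: int) -> List[range]:
    """Split epochs into contiguous segments, at most one per worker."""
    base, extra = divmod(num_epochs, num_workers)
    segments, start = [], 0
    for worker in range(num_workers):
        size = base + (1 if worker < extra else 0)
        if size:
            segments.append(range(start, start + size))
        start += size
    return segments


def replay_segment(checkpoints: Dict[int, int], segment: range) -> List[Tuple[int, int]]:
    """Replay phase: restore the segment's first checkpoint, then re-run
    only that segment with the newly added (hindsight) log statement."""
    state = checkpoints[segment.start]
    logs = []
    for epoch in segment:
        state = train_one_epoch(state, epoch)
        logs.append((epoch, state))  # the log statement added in hindsight
    return logs


checkpoints = record(6)
segments = partition_epochs(6, 3)
# Segments are independent, so each could be handed to a separate worker;
# they run sequentially here to keep the sketch simple.
replayed = [entry for seg in segments for entry in replay_segment(checkpoints, seg)]
```

Because each segment starts from a stored checkpoint, no worker has to re-run the epochs that precede its segment.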

FlorDB integrates Flor, git, and sqlite3 to manage model developers' logs, execution data, versions of code, and training checkpoints. In addition to serving as an experiment management solution for ML engineers, FlorDB extends hindsight logging across model training versions for the retroactive evaluation of iterative ML.

FlorFlow will extend FlorDB to support Dataflow operations.

Flor, FlorDB, and FlorFlow are software developed at UC Berkeley's RISE Lab.

FlorDB Demo

You can follow along yourself by starting a Jupyter server from this directory and opening tutorial.ipynb.

Installation

pip install florflow

Getting Started

We start by selecting (or creating) a git repository to save our model training code as we iterate and experiment. Flor automatically commits your changes on every run, so no change is lost. Below we provide a sample repository you can use to follow along:

$ git clone git@github.com:ucbepic/ml_tutorial
$ cd ml_tutorial/

Run the train.py script to train a small linear model, and test your florflow installation.

$ python train.py

Flor will manage checkpoints, logs, command-line arguments, code changes, and other experiment metadata on each run (more details below). All of this data is then exposed to the user via SQL or Pandas queries.
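As a rough illustration of what querying experiment data with SQL can look like, the sketch below stores a few log records in an in-memory sqlite3 database and selects the most accurate run. The `logs` table and its columns are hypothetical and are not FlorDB's actual schema; the accuracy values are taken from the two runs shown in the experiment history in this document.

```python
import sqlite3

# Hypothetical schema: one row per logged (name, value) pair.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE logs (projid TEXT, tstamp TEXT, filename TEXT, name TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO logs VALUES (?, ?, ?, ?, ?)",
    [
        ("ml_tutorial", "2023-08-28T15:04:07", "train.py", "hidden", "500"),
        ("ml_tutorial", "2023-08-28T15:04:07", "train.py", "accuracy", "97.71"),
        ("ml_tutorial", "2023-08-28T15:04:35", "train.py", "hidden", "500"),
        ("ml_tutorial", "2023-08-28T15:04:35", "train.py", "accuracy", "98.01"),
    ],
)

# Find the run (identified by its timestamp) with the highest accuracy.
best = conn.execute(
    "SELECT tstamp, CAST(value AS REAL) AS acc "
    "FROM logs WHERE name = 'accuracy' "
    "ORDER BY acc DESC LIMIT 1"
).fetchone()
```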

View your experiment history

From the same directory where you ran the examples above, open a terminal, then load and pivot the log records:

$ python -m flor dataframe

        projid               tstamp  filename device seed hidden epochs batch_size     lr print_every accuracy correct
0  ml_tutorial  2023-08-28T15:04:07  train.py    cpu   78    500      5         32  0.001         500    97.71    9771
1  ml_tutorial  2023-08-28T15:04:35  train.py    cpu    8    500      5         32  0.001         500    98.01    9801

Run some more experiments

The train.py script has been prepared in advance to define and manage four different hyper-parameters:

$ cat train.py | grep flor.arg
hidden_size = flor.arg("hidden", default=500)
num_epochs = flor.arg("epochs", 5)
batch_size = flor.arg("batch_size", 32)
learning_rate = flor.arg("lr", 1e-3)

You can control any of the hyper-parameters (e.g. hidden) using Flor's command-line interface:

$ python train.py --kwargs hidden=75
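The following sketch shows one way `name=value` overrides like the one above could be parsed and merged with defaults. It is not Flor's implementation: `parse_kwargs` and `arg` are hypothetical helpers, and casting an override to the type of its default is an assumption made for illustration.

```python
from typing import Any, Dict, List


def parse_kwargs(argv: List[str]) -> Dict[str, str]:
    """Collect the name=value tokens that follow a --kwargs flag."""
    overrides: Dict[str, str] = {}
    if "--kwargs" not in argv:
        return overrides
    start = argv.index("--kwargs") + 1
    for token in argv[start:]:
        if token.startswith("--"):
            break  # another flag begins; stop collecting overrides
        name, sep, raw = token.partition("=")
        if not sep:
            raise ValueError(f"expected name=value, got {token!r}")
        overrides[name] = raw
    return overrides


def arg(name: str, default: Any, overrides: Dict[str, str]) -> Any:
    """Return the override cast to the default's type, else the default."""
    if name not in overrides:
        return default
    if default is None:
        return overrides[name]  # no type to cast to; keep the raw string
    # Note: this simple cast is not suitable for bool defaults,
    # because bool("False") evaluates to True.
    return type(default)(overrides[name])


overrides = parse_kwargs(["train.py", "--kwargs", "hidden=75"])
hidden_size = arg("hidden", 500, overrides)    # overridden on the command line
learning_rate = arg("lr", 1e-3, overrides)     # falls back to the default
```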

Application Programming Interface (API)

Flor ships with utilities for serializing and checkpointing PyTorch state, and utilities for resuming, auto-parallelizing, and memoizing executions from checkpoints.

The model developer passes objects for checkpointing to flor.checkpointing(**kwargs), and gives Flor control over loop iterators by calling flor.loop(name, iterator) as follows:

import flor
import torch

hidden_size = flor.arg("hidden", default=500)
num_epochs = flor.arg("epochs", 5)
batch_size = flor.arg("batch_size", 32)
learning_rate = flor.arg("lr", 1e-3)

# Assumed to be constructed earlier in the script; shown as annotations only.
trainloader: torch.utils.data.DataLoader
testloader:  torch.utils.data.DataLoader
optimizer:   torch.optim.Optimizer
net:         torch.nn.Module
criterion:   torch.nn.modules.loss._Loss

with flor.checkpointing(model=net, optimizer=optimizer):
    for epoch in flor.loop("epoch", range(num_epochs)):
        for data in flor.loop("step", trainloader):
            inputs, labels = data
            optimizer.zero_grad()
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            flor.log("loss", loss.item())
            optimizer.step()
        eval(net, testloader)  # user-defined evaluation function (shadows the built-in eval)

As shown, we wrap both the main (epoch) loop and the nested training (step) loop with flor.loop so that Flor can manage their state. Flor uses loop iteration boundaries to store selected checkpoints adaptively, and at replay time it uses those same checkpoints to resume training from the appropriate epoch.
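The idea of checkpointing at loop iteration boundaries and resuming from a stored checkpoint can be sketched with a small generator wrapper. This is not how flor.loop is implemented: `CheckpointedLoop`, its `every` parameter, and the dictionary used as model state are hypothetical, and a fixed checkpoint period is swapped in for Flor's adaptive policy.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Optional


class CheckpointedLoop:
    """Hypothetical loop wrapper that snapshots state every `every` iterations."""

    def __init__(self, get_state: Callable[[], Any],
                 set_state: Callable[[Any], None], every: int = 1) -> None:
        self.get_state = get_state
        self.set_state = set_state
        self.every = every
        self.store: Dict[int, Any] = {}  # iteration index -> saved state

    def loop(self, iterable: Iterable[Any],
             resume_at: Optional[int] = None) -> Iterator[Any]:
        items = list(iterable)
        start = 0
        if resume_at is not None:
            # Restore the latest checkpoint taken at or before resume_at.
            candidates = [i for i in self.store if i <= resume_at]
            if candidates:
                start = max(candidates)
                self.set_state(self.store[start])
        for i in range(start, len(items)):
            if i % self.every == 0:
                self.store[i] = self.get_state()  # checkpoint at the boundary
            yield items[i]


model = {"weight": 0}
ckpt = CheckpointedLoop(get_state=lambda: dict(model),
                        set_state=model.update, every=2)

# Record: run all five epochs; checkpoints are taken at epochs 0, 2, and 4.
for epoch in ckpt.loop(range(5)):
    model["weight"] += epoch + 1
final_weight = model["weight"]

# Replay: pretend the state was lost, then resume to reach epoch 3.
model["weight"] = -1
visited = []
for epoch in ckpt.loop(range(5), resume_at=3):
    visited.append(epoch)
    model["weight"] += epoch + 1
```

Replay restarts at epoch 2, the nearest checkpoint at or before the requested epoch, rather than at epoch 0, and ends in the same state as the original run.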

Logging API

Call flor.log(name, value) to log metrics, and flor.arg(name, default=None) to register tunable hyper-parameters.

$ cat train.py | grep flor.arg
hidden_size = flor.arg("hidden", default=500)
num_epochs = flor.arg("epochs", 5)
batch_size = flor.arg("batch_size", 32)
learning_rate = flor.arg("lr", 1e-3)

$ cat train.py | grep flor.log
        flor.log("loss", loss.item()),

Each name you pass to flor.log or flor.arg becomes a column (measure) in the full pivoted view (see View your experiment history).
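How logged names turn into columns can be illustrated with a small pivot from long-format records to one row per run. This is a plain-Python sketch and not FlorDB's code: the `pivot` helper and the record layout are hypothetical, and when a name is logged more than once in a run this simplified version keeps only the last value.

```python
from typing import Any, Dict, List, Tuple

# Long format: one (projid, tstamp, name, value) tuple per logged value.
records: List[Tuple[str, str, str, Any]] = [
    ("ml_tutorial", "2023-08-28T15:04:07", "hidden", 500),
    ("ml_tutorial", "2023-08-28T15:04:07", "accuracy", 97.71),
    ("ml_tutorial", "2023-08-28T15:04:35", "hidden", 500),
    ("ml_tutorial", "2023-08-28T15:04:35", "accuracy", 98.01),
    ("ml_tutorial", "2023-08-28T15:04:35", "correct", 9801),
]


def pivot(records: List[Tuple[str, str, str, Any]]) -> List[Dict[str, Any]]:
    """Turn each logged name into a column, with one row per (projid, tstamp)."""
    rows: Dict[Tuple[str, str], Dict[str, Any]] = {}
    columns: List[str] = []
    for projid, tstamp, name, value in records:
        row = rows.setdefault((projid, tstamp), {"projid": projid, "tstamp": tstamp})
        row[name] = value
        if name not in columns:
            columns.append(name)
    # Fill names that a run never logged with None so all rows share columns.
    return [{**{c: None for c in columns}, **row} for row in rows.values()]


table = pivot(records)
```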

Publications

To cite this work, please refer to the Multiversion Hindsight Logging paper (pre-print '23).

FLOR is open source software developed at UC Berkeley. Joe Hellerstein (databases), Joey Gonzalez (machine learning), and Koushik Sen (programming languages) are the primary faculty members leading this work.

This work is released as part of Rolando Garcia's doctoral dissertation at UC Berkeley, and has been the subject of study by Eric Liu and Anusha Dandamudi, both of whom completed their master's theses on FLOR. Our list of publications is reproduced below. Finally, we thank Vikram Sreekanti, Dan Crankshaw, and Neeraja Yadwadkar for guidance, comments, and advice. Bobby Yan was instrumental in the development of FLOR and its corresponding experimental evaluation.

License

FLOR is licensed under the Apache v2 License.
