Manage fast data loading with FFCV and PyTorch Lightning.

FFCV Dataloader with Pytorch Lightning

FFCV is a fast dataloader for neural network training: https://github.com/libffcv/ffcv

This repository presents all the steps needed to install FFCV and configure it with pytorch-lightning.
It also provides some useful methods to quickly create, preprocess and load datasets with FFCV and pytorch-lightning.

Installation

Dependencies

Installing the FFCV package has some known issues.
See, for instance, FFCV issues #133 and #54.

The recommended way to install the dependencies is to use the provided environment.yml file:

conda env create --file environment.yml

This should correctly create a conda environment named ffcv-pl.

If that does not work, you can try installing the packages manually:

  1. create conda environment

    conda create --name ffcv-pl
    conda activate ffcv-pl
    
  2. install pytorch following the instructions on the official website

    # in my environment the command is the following 
    conda install pytorch torchvision torchaudio pytorch-cuda=11.6 -c pytorch -c nvidia
    
  3. install ffcv dependencies

    # can take a very long time, but should not create conflicts
    conda install cupy pkg-config compilers libjpeg-turbo opencv numba -c pytorch -c conda-forge
    
  4. install ffcv and pytorch-lightning

    pip install ffcv
    pip install pytorch-lightning
    

Package

Once the dependencies are installed, it is safe to install the package:

pip install ffcv_pl
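
Because the installation can fail in several places, a quick sanity check may be useful before moving on. The following is a minimal sketch, not part of ffcv_pl: the helper name `find_missing` is hypothetical, and the module names are assumed to be the import names of the packages installed above (`pytorch-lightning` imports as `pytorch_lightning`). It only checks that the modules can be located, not that they work with your CUDA setup.

```python
import importlib.util

# Import names of the packages installed above (assumed, see lead-in).
REQUIRED_MODULES = ("torch", "ffcv", "pytorch_lightning", "ffcv_pl")


def find_missing(modules=REQUIRED_MODULES):
    """Return the names in `modules` that the import system cannot locate."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Run it inside the activated ffcv-pl environment; any name it reports points to the installation step that needs to be repeated.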

Dataset Creation

You need to save your dataset in ffcv format (.beton).
Official FFCV docs.

This package supports different types of datasets, listed in the datasets subpackage. A quick example of how to create a dataset is provided in the dataset_creation.py script:

from ffcv_pl.ffcv_utils.generate_dataset import create_image_dataset

if __name__ == '__main__':

    # write dataset in ".beton" format
    test_folder = '/media/dserez/datasets/imagenet/test/'
    create_image_dataset(test_folder=test_folder) 

For example, this code creates the file /media/dserez/datasets/imagenet/test.beton, loading images from the folder /media/dserez/datasets/imagenet/test/.

Note that you can also pass train and validation folders, all in one call.
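
The naming convention described above (a folder `.../test/` produces a file `.../test.beton` next to it) can be sketched as a small helper. This is only an illustration: `beton_path_for` is a hypothetical function, not part of ffcv_pl, and it assumes the output file always takes the folder name plus the `.beton` extension.

```python
from pathlib import Path


def beton_path_for(folder: str) -> Path:
    """Hypothetical helper: map an image folder to the .beton file next to it.

    Mirrors the naming described above (".../test/" -> ".../test.beton").
    """
    folder_path = Path(folder)  # Path ignores a trailing slash
    # Append to the full folder name instead of using with_suffix(), so that
    # folder names containing dots (e.g. "v1.0") are not truncated.
    return folder_path.with_name(folder_path.name + ".beton")


if __name__ == "__main__":
    print(beton_path_for("/media/dserez/datasets/imagenet/test/"))
```

A helper like this can be handy for building the paths that are later passed to the data module, so the folder and file names stay in sync.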

Dataloader and Datamodule

The data module merges the PL DataModule with the FFCV Loader object.
It should be compatible with DDP/multiprocessing.
See main.py for a complete example.
Official FFCV docs.

import pytorch_lightning as pl
import torch
from pytorch_lightning.strategies.ddp import DDPStrategy

from torch import nn
from torch.optim import Adam

from ffcv_pl.datasets.image import ImageDataModule


# define the LightningModule
class LitAutoEncoder(pl.LightningModule):

    def __init__(self):

        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(256 * 256 * 3, 64), nn.ReLU(), nn.Linear(64, 3))
        self.decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 256 * 256 * 3))

    def training_step(self, batch, batch_idx):

        x, y = batch

        b, c, h, w = x.shape
        x = x.reshape(b, -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)
        # Logging to TensorBoard by default
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = Adam(self.parameters(), lr=1e-3)
        return optimizer


if __name__ == '__main__':

    SEED = 1234

    pl.seed_everything(SEED, workers=True)

    dataset = 'cub2002011'
    image_size = 256
    batch_size = 16
    train_folder = f'/media/dserez/datasets/{dataset}/train.beton'
    val_folder = f'/media/dserez/datasets/{dataset}/test.beton'

    gpus = 2
    workers = 8

    # define model
    model = LitAutoEncoder()

    # trainer
    trainer = pl.Trainer(strategy=DDPStrategy(find_unused_parameters=False), deterministic=True,
                         accelerator='gpu', devices=gpus, num_nodes=1, max_epochs=5)

    # Note: set is_dist True if you are using DDP and more than one GPU
    data_module = ImageDataModule(train_folder, val_folder, val_folder, image_size, torch.float32, batch_size,
                                  num_workers=workers, is_dist=gpus > 1, seed=SEED)

    trainer.fit(model, data_module)

Each ffcv_pl.datasets.* module contains a pair of classes (Dataset, Dataloader).
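
When moving between one and several GPUs, it may help to derive the distributed flag and the global batch size from the same values used in the example. The sketch below is not part of ffcv_pl: `ddp_settings` is a hypothetical helper, and it assumes that `batch_size` is the per-process batch size, as is usual under DDP, so the effective batch size scales with the number of processes.

```python
def ddp_settings(batch_size: int, gpus: int, num_nodes: int = 1) -> dict:
    """Derive the distributed flag and the effective (global) batch size.

    Assumption: batch_size is per process, with one process per GPU.
    """
    if batch_size < 1 or gpus < 1 or num_nodes < 1:
        raise ValueError("batch_size, gpus and num_nodes must all be >= 1")
    world_size = gpus * num_nodes  # total number of DDP processes
    return {
        "is_dist": world_size > 1,
        "world_size": world_size,
        "effective_batch_size": batch_size * world_size,
    }


if __name__ == "__main__":
    # Values from the example above: batch_size = 16, gpus = 2, num_nodes = 1
    print(ddp_settings(batch_size=16, gpus=2))
```

With the values from the example, each optimizer step would see 32 samples in total under this assumption, which is worth keeping in mind when comparing single-GPU and multi-GPU runs.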

Citations

  1. Pytorch-Lightning:
    Falcon, W., & The PyTorch Lightning team. (2019). PyTorch Lightning (Version 1.4) [Computer software]. https://doi.org/10.5281/zenodo.3828935

  2. FFCV:

    @misc{leclerc2022ffcv,
        author = {Guillaume Leclerc and Andrew Ilyas and Logan Engstrom and Sung Min Park and Hadi Salman and Aleksander Madry},
        title = {{FFCV}: Accelerating Training by Removing Data Bottlenecks},
        year = {2022},
        howpublished = {\url{https://github.com/libffcv/ffcv/}},
        note = {commit xxxxxxx}
    }
    
