Manage fast data loading with FFCV and PyTorch Lightning

FFCV Dataloader with Pytorch Lightning

FFCV is a fast data loader for neural network training: https://github.com/libffcv/ffcv

This repository presents the steps needed to install FFCV and configure it with pytorch-lightning.
It also provides utility methods to quickly create, preprocess and load datasets with FFCV and pytorch-lightning.

Installation

Dependencies

Installing the FFCV package has some known problems.
See, for instance, FFCV issues #133 and #54.

The recommended way to install the dependencies is to use the provided environment.yml file:

conda env create --file environment.yml

This should correctly create a conda environment named ffcv-pl.

If that does not work, try installing the packages manually:

Create a conda environment:

conda create --name ffcv-pl
conda activate ffcv-pl

Install PyTorch following the instructions on the official website:

# in my environment the command is the following
conda install pytorch torchvision torchaudio pytorch-cuda=11.6 -c pytorch -c nvidia

Install the FFCV dependencies:

# can take a very long time, but should not create conflicts
conda install cupy pkg-config compilers libjpeg-turbo opencv numba -c pytorch -c conda-forge

Install ffcv and pytorch-lightning:

pip install ffcv
pip install pytorch-lightning

Package

Once the dependencies are installed, you can install the package:

pip install ffcv_pl
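
As a quick sanity check after installation, the snippet below reports which of the expected packages the current interpreter can locate. It is a minimal sketch using only the standard library. The module names are the import names used in the examples of this README, and the helper name check_packages is made up for illustration.

```python
from importlib.util import find_spec


def check_packages(names):
    """Map each top-level module name to True if its spec can be located.

    The modules themselves are not imported, so this only checks that
    they are installed and visible, not that they load without errors.
    """
    return {name: find_spec(name) is not None for name in names}


if __name__ == '__main__':
    # import names taken from the examples in this README
    required = ['torch', 'ffcv', 'pytorch_lightning', 'ffcv_pl']
    for name, found in check_packages(required).items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

Run it inside the activated ffcv-pl environment. A MISSING entry points to the installation step that needs to be repeated.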

Dataset Creation

You need to save your dataset in FFCV format (.beton).
See the official FFCV docs.

This package supports different types of datasets, listed in the dataset subpackage. The dataset_creation.py script gives a quick example of how to create one:

from ffcv_pl.ffcv_utils.generate_dataset import create_image_dataset

if __name__ == '__main__':

    # write dataset in ".beton" format
    test_folder = '/media/dserez/datasets/imagenet/test/'
    create_image_dataset(test_folder=test_folder)

For example, this code creates the file /media/dserez/datasets/imagenet/test.beton, loading images from the folder /media/dserez/datasets/imagenet/test/.

Note that you can also pass train and validation folders in the same call.
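
The naming convention in the example above (a folder path becomes a sibling .beton file) can be reproduced with a small helper. This is useful for computing the paths to pass to the data module later. It is only a sketch of the convention as described here, not the code ffcv_pl actually uses, and beton_path_for is a hypothetical name.

```python
def beton_path_for(folder):
    """Hypothetical helper mirroring the naming shown above.

    '/some/path/test/' -> '/some/path/test.beton'
    """
    # drop any trailing slash, then append the FFCV extension
    return folder.rstrip('/') + '.beton'


if __name__ == '__main__':
    print(beton_path_for('/media/dserez/datasets/imagenet/test/'))
    # prints /media/dserez/datasets/imagenet/test.beton
```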

Dataloader and Datamodule

The data module merges the PL DataModule with the FFCV Loader object.
It should be compatible with DDP/multiprocessing.
See main.py for a complete example.
See also the official FFCV docs.

import pytorch_lightning as pl
import torch
from pytorch_lightning.strategies.ddp import DDPStrategy

from torch import nn
from torch.optim import Adam

from ffcv_pl.datasets.image import ImageDataModule


# define the LightningModule
class LitAutoEncoder(pl.LightningModule):

    def __init__(self):

        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(256 * 256 * 3, 64), nn.ReLU(), nn.Linear(64, 3))
        self.decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 256 * 256 * 3))

    def training_step(self, batch, batch_idx):

        x, y = batch

        b, c, h, w = x.shape
        x = x.reshape(b, -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)
        # Logging to TensorBoard by default
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = Adam(self.parameters(), lr=1e-3)
        return optimizer


if __name__ == '__main__':

    SEED = 1234

    pl.seed_everything(SEED, workers=True)

    dataset = 'cub2002011'
    image_size = 256
    batch_size = 16
    train_folder = f'/media/dserez/datasets/{dataset}/train.beton'
    val_folder = f'/media/dserez/datasets/{dataset}/test.beton'

    gpus = 2
    workers = 8

    # define model
    model = LitAutoEncoder()

    # trainer
    trainer = pl.Trainer(strategy=DDPStrategy(find_unused_parameters=False), deterministic=True,
                         accelerator='gpu', devices=gpus, num_nodes=1, max_epochs=5)

    # Note: set is_dist to True if you are using DDP and more than one GPU
    data_module = ImageDataModule(train_folder, val_folder, val_folder, image_size, torch.float32, batch_size,
                                  num_workers=workers, is_dist=gpus > 1, seed=SEED)

    trainer.fit(model, data_module)

Each module in ffcv_pl.datasets.* contains a pair of classes (Dataset, Dataloader).
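
The example model flattens each image before the first linear layer, so that layer's width has to match the image size passed to the data module. The arithmetic can be checked with a few lines, assuming square 3-channel images as in the example (flattened_features is an illustrative name, not part of ffcv_pl):

```python
def flattened_features(image_size, channels=3):
    """Number of values per image after x.reshape(b, -1) for a square image."""
    return channels * image_size * image_size


if __name__ == '__main__':
    image_size = 256
    # must equal the in_features of the first nn.Linear in LitAutoEncoder (256 * 256 * 3)
    print(flattened_features(image_size))  # 196608
```

If image_size is changed in the main script, the two nn.Linear layers that use 256 * 256 * 3 need to change with it.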

Citations

Pytorch-Lightning:
Falcon, W., & The PyTorch Lightning team. (2019). PyTorch Lightning (Version 1.4) [Computer software]. https://doi.org/10.5281/zenodo.3828935

FFCV:

@misc{leclerc2022ffcv,
    author = {Guillaume Leclerc and Andrew Ilyas and Logan Engstrom and Sung Min Park and Hadi Salman and Aleksander Madry},
    title = {{FFCV}: Accelerating Training by Removing Data Bottlenecks},
    year = {2022},
    howpublished = {\url{https://github.com/libffcv/ffcv/}},
    note = {commit xxxxxxx}
}

Release history

0.3.2

Jul 17, 2023

0.3.1

Jul 13, 2023

0.2.3

Jun 1, 2023

0.2.2

Jun 1, 2023

0.2.1

May 18, 2023

0.2.0

May 18, 2023

0.1.5

Mar 21, 2023

0.1.4

Mar 10, 2023

This version

0.1.3

Feb 7, 2023

0.1.2

Feb 6, 2023

0.1.1

Jan 24, 2023

0.1.0

Jan 24, 2023

Download files

Source Distribution

ffcv_pl-0.1.3.tar.gz (7.9 kB)

Uploaded Feb 7, 2023 Source

Built Distribution

ffcv_pl-0.1.3-py3-none-any.whl (7.8 kB)

Uploaded Feb 7, 2023 Python 3

Hashes for ffcv_pl-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`51ea6625051491defd153a1dcc65fbce609b895e88ada74110d5483525a9275a`
MD5	`3beac29869da4a01e8de719729cc958e`
BLAKE2b-256	`c2c278a95a4eb826deba52cf9d81f1ff8b79b0cbadf680f89d4fbf1f809cea97`

Hashes for ffcv_pl-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3f99ba3206e25d5c34b24c7a080ac4b2738b78256e4d3972e6e4205e810eb479`
MD5	`e724be43631744bcdbaa96f08ca6abbb`
BLAKE2b-256	`7e59865a5b63df6db5f33a7c7cd120b1210b7d2e6f6cfcf3d47c20b89c538b83`