FFCV Dataloader with Pytorch Lightning
Manage fast data loading with FFCV and PyTorch Lightning.
FFCV is a fast dataloader for neural network training: https://github.com/libffcv/ffcv
This repository presents all the steps needed to install and configure FFCV with PyTorch Lightning.
It also provides some useful methods to quickly create, preprocess, and load datasets with FFCV and PyTorch Lightning.
Installation
Dependencies
There are currently some known issues with the installation of the FFCV package.
In particular, even a successful installation may raise the following error when trying to import ffcv (this seems to happen also in version 1.0.x of FFCV):

ImportError: libopencv_imgproc.so.405: cannot open shared object file: No such file or directory

There is a closed issue about this: #136.
To install everything correctly, I suggest using Conda (I also tried pip, but encountered the error above).
First, try to install the dependencies with the environment.yml file:

```shell
conda env create --file environment.yml
```

This should create a conda environment named ffcv-pl.
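For reference, a minimal sketch of what such an environment.yml could contain, reconstructed from the package list in the manual steps below (hypothetical contents; the repository's actual file may differ):

```yaml
name: ffcv-pl
channels:
  - pytorch
  - nvidia
  - conda-forge
dependencies:
  - pytorch
  - torchvision
  - torchaudio
  - pytorch-cuda=11.6
  - cupy
  - pkg-config
  - compilers
  - libjpeg-turbo
  - opencv
  - numba
  - pytorch-lightning
  - pip
  - pip:
      - ffcv
```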
If the above does not work, you can try installing the packages manually:

- create the conda environment

```shell
conda create --name ffcv-pl
conda activate ffcv-pl
```

- install pytorch according to the official website

```shell
# in my environment the command is the following
conda install pytorch torchvision torchaudio pytorch-cuda=11.6 -c pytorch -c nvidia
```

- install the ffcv dependencies and pytorch-lightning

```shell
# can take a very long time, but should not create conflicts
conda install cupy pkg-config compilers libjpeg-turbo opencv numba pytorch-lightning -c pytorch -c conda-forge
```

- install ffcv

```shell
pip install ffcv
```
Package
Once the dependencies are installed, it is safe to install the package:

```shell
pip install ffcv_pl
```
Dataset Creation
You need to save your dataset in ffcv format (.beton
).
Official FFCV docs.
This package allows different types of Datasets, listed in the dataset
subpackage.
A quick example on how to create a dataset is provided in the dataset_creation.py script
:
```python
from ffcv_pl.ffcv_utils.generate_dataset import create_image_label_dataset

if __name__ == '__main__':

    # write dataset in ".beton" format
    train_folder = '/media/dserez/datasets/cub/train/'
    test_folder = '/media/dserez/datasets/cub/test/'

    create_image_label_dataset(train_folder=train_folder, test_folder=test_folder)
```
For example, this code will create the files /media/dserez/datasets/cub/train.beton and /media/dserez/datasets/cub/test.beton, loading images from the folders /media/dserez/datasets/cub/train/ and /media/dserez/datasets/cub/test/, respectively.
Note that you can also pass more folders, all in one call.
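The folder-to-file naming convention described above (each image folder gets a sibling .beton file with the same name) can be sketched in plain Python. Note that beton_path is a hypothetical helper for illustration only, not part of the ffcv_pl API:

```python
from pathlib import Path


def beton_path(folder: str) -> str:
    # Hypothetical helper (not part of ffcv_pl): maps an image folder
    # like ".../cub/train/" to the ".beton" file ".../cub/train.beton"
    # that the dataset-creation step would write next to it.
    p = Path(folder.rstrip('/'))
    return str(p.with_suffix('.beton'))


print(beton_path('/media/dserez/datasets/cub/train/'))
# /media/dserez/datasets/cub/train.beton
```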
Dataloader and Datamodule
Merge the PL Datamodule with the FFCV Loader object.
It should be compatible with ddp/multiprocessing.
See main.py
for a complete example.
Official FFCV docs.
```python
import pytorch_lightning as pl
import torch
from pytorch_lightning.strategies.ddp import DDPStrategy
from torch import nn
from torch.optim import Adam

from ffcv_pl.datasets.image import ImageDataModule


# define the LightningModule
class LitAutoEncoder(pl.LightningModule):

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(256 * 256 * 3, 64), nn.ReLU(), nn.Linear(64, 3))
        self.decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 256 * 256 * 3))

    def training_step(self, batch, batch_idx):
        x, y = batch
        b, c, h, w = x.shape
        x = x.reshape(b, -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)
        # Logging to TensorBoard by default
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = Adam(self.parameters(), lr=1e-3)
        return optimizer


if __name__ == '__main__':

    SEED = 1234
    pl.seed_everything(SEED, workers=True)

    dataset = 'cub'
    image_size = 256
    batch_size = 16
    train_folder = f'/media/dserez/datasets/{dataset}/train.beton'
    val_folder = f'/media/dserez/datasets/{dataset}/test.beton'
    gpus = 2
    workers = 8

    # define model
    model = LitAutoEncoder()

    # trainer
    trainer = pl.Trainer(strategy=DDPStrategy(find_unused_parameters=False), deterministic=True,
                         accelerator='gpu', devices=gpus, num_nodes=1, max_epochs=5)

    # Note: set is_dist True if you are using DDP and more than one GPU
    data_module = ImageDataModule(train_folder, val_folder, val_folder, image_size, torch.float32, batch_size,
                                  num_workers=workers, is_dist=gpus > 1, seed=SEED)

    trainer.fit(model, data_module)
```
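The reshape bookkeeping in training_step, which flattens each (channels, height, width) image into a single vector for the linear encoder, can be sketched with NumPy standing in for torch (a toy 4x4 batch instead of 256x256, purely illustrative):

```python
import numpy as np

# toy batch shaped like the autoencoder input: (batch, channels, height, width)
x = np.zeros((2, 3, 4, 4), dtype=np.float32)

b, c, h, w = x.shape
flat = x.reshape(b, -1)  # each image becomes one vector of c * h * w values

assert flat.shape == (2, 3 * 4 * 4)
# with 256x256 RGB images, each vector would have 256 * 256 * 3 = 196608 entries,
# matching the in_features of the first nn.Linear layer above
```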
Each ffcv_pl.datasets.* module contains a pair of classes (Dataset, Dataloader).
Citations

- Pytorch-Lightning:

Falcon, W., & The PyTorch Lightning team. (2019). PyTorch Lightning (Version 1.4) [Computer software]. https://doi.org/10.5281/zenodo.3828935

- FFCV:

```
@misc{leclerc2022ffcv,
    author = {Guillaume Leclerc and Andrew Ilyas and Logan Engstrom and Sung Min Park and Hadi Salman and Aleksander Madry},
    title = {{FFCV}: Accelerating Training by Removing Data Bottlenecks},
    year = {2022},
    howpublished = {\url{https://github.com/libffcv/ffcv/}},
    note = {commit xxxxxxx}
}
```