ml-dataloader
ml-dataloader is an efficient and flexible data loading pipeline for deep learning, written in pure Python.
Install

::

    pip install ml-dataloader
Examples (similar to PyTorch DataLoader)

- suppose the data is stored in a Python list
.. code:: python

    from dataloader.dataset import Dataset
    from dataloader.dataloader import DataLoader
    from dataloader.util.data_kind import DataKind

    data = list(range(10))
    dataset = Dataset(data, kind=DataKind.MEM_SEQ)

    dl = DataLoader(dataset, batch_size=2, shuffle=False)
    for batch in dl:
        print(batch)
tf.Tensor([0 1], shape=(2,), dtype=int32)
tf.Tensor([2 3], shape=(2,), dtype=int32)
tf.Tensor([4 5], shape=(2,), dtype=int32)
tf.Tensor([6 7], shape=(2,), dtype=int32)
tf.Tensor([8 9], shape=(2,), dtype=int32)
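For intuition, the grouping above can be sketched in plain Python — a stand-in for the batching behaviour, not the library's actual implementation:

.. code:: python

    def batched(data, batch_size):
        """Yield consecutive fixed-size chunks from a sequence."""
        for i in range(0, len(data), batch_size):
            yield data[i:i + batch_size]

    data = list(range(10))
    batches = list(batched(data, 2))
    print(batches)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]

The real ``DataLoader`` additionally converts each chunk to a tensor, as shown in the output above.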
- data with index
.. code:: python

    from dataloader.dataset import DatasetWithIndex
    from dataloader.dataloader import DataLoader
    from dataloader.util.data_kind import DataKind

    data = list(range(10))
    dataset = DatasetWithIndex(data, kind=DataKind.MEM_SEQ)

    dl = DataLoader(dataset, batch_size=2, shuffle=False)
    for batch in dl:
        print(batch)
(tf.Tensor([0, 1], shape=(2,), dtype=int32), tf.Tensor([0, 1], shape=(2,), dtype=int32))
(tf.Tensor([2, 3], shape=(2,), dtype=int32), tf.Tensor([2, 3], shape=(2,), dtype=int32))
(tf.Tensor([4, 5], shape=(2,), dtype=int32), tf.Tensor([4, 5], shape=(2,), dtype=int32))
(tf.Tensor([6, 7], shape=(2,), dtype=int32), tf.Tensor([6, 7], shape=(2,), dtype=int32))
(tf.Tensor([8, 9], shape=(2,), dtype=int32), tf.Tensor([8, 9], shape=(2,), dtype=int32))
- suppose ``train.tsv`` stores the data
.. code:: python

    from dataloader.dataset import Dataset
    from dataloader.dataloader import DataLoader
    from dataloader.util.data_kind import DataKind

    filename = 'train.tsv'
    dataset = Dataset(filename, kind=DataKind.FILE)

    dl = DataLoader(dataset, batch_size=2, shuffle=True)
    for batch in dl:
        print(batch)
- suppose ``train.tsv`` stores the data and ``mmap`` is used
.. code:: python

    from dataloader.dataset import Dataset
    from dataloader.dataloader import DataLoader
    from dataloader.util.data_kind import DataKind

    filename = 'train.tsv'
    dataset = Dataset(filename, kind=DataKind.MMAP_FILE)

    dl = DataLoader(dataset, batch_size=2, shuffle=True)
    for batch in dl:
        print(batch)
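A memory-mapped file lets single samples be fetched in any order without loading the whole file into memory. The sketch below illustrates the general idea with the standard ``mmap`` module — it is an assumption about the technique, not the library's actual ``MMAP_FILE`` implementation:

.. code:: python

    import mmap
    import os
    import tempfile

    # A tiny stand-in for train.tsv.
    tmp = tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False)
    tmp.write("a\t1\nb\t2\nc\t3\n")
    tmp.close()

    # Memory-map the file and record the byte offset of every line, so a
    # single sample can be sliced out directly by its index.
    with open(tmp.name, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        offsets = [0]
        while True:
            nl = mm.find(b"\n", offsets[-1])
            if nl == -1 or nl + 1 >= mm.size():
                break
            offsets.append(nl + 1)

        def get_line(i):
            return mm[offsets[i]:mm.find(b"\n", offsets[i])].decode()

        lines = [get_line(i) for i in range(len(offsets))]
        mm.close()

    os.unlink(tmp.name)
    print(lines)  # ['a\t1', 'b\t2', 'c\t3']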
NOTES:

- if the transform is slow, the DataLoader may get stuck when ``num_workers`` > 0
Examples with Pipeline (similar to Tensorpack DataFlow)

- suppose the data is stored in a Python list
.. code:: python

    from dataloader.pipeline.dataset import Dataset
    from dataloader.pipeline.dataloader import DataLoader
    from dataloader.pipeline.processor import MapDataProcessKind
    from dataloader.util.data_kind import DataKind

    data = list(range(10))
    dataset = Dataset(data, kind=DataKind.MEM_SEQ)

    dl = DataLoader(dataset, batch_size=2, shuffle=False, processor_kind=MapDataProcessKind.NORMAL)
    for batch in dl:
        print(batch)
tf.Tensor([0 1], shape=(2,), dtype=int32)
tf.Tensor([2 3], shape=(2,), dtype=int32)
tf.Tensor([4 5], shape=(2,), dtype=int32)
tf.Tensor([6 7], shape=(2,), dtype=int32)
tf.Tensor([8 9], shape=(2,), dtype=int32)
- suppose ``train.tsv`` stores the data
.. code:: python

    from dataloader.pipeline.dataset import Dataset
    from dataloader.pipeline.dataloader import DataLoader
    from dataloader.pipeline.processor import MapDataProcessKind
    from dataloader.util.data_kind import DataKind

    filename = 'train.tsv'
    dataset = Dataset(filename, kind=DataKind.FILE)

    dl = DataLoader(dataset, batch_size=2, shuffle=True, processor_kind=MapDataProcessKind.MULTI_PROCESS, num_procs=20)
    for batch in dl:
        print(batch)
- suppose ``train.tsv`` stores the data and ``mmap`` is used
.. code:: python

    from dataloader.pipeline.dataset import Dataset
    from dataloader.pipeline.dataloader import DataLoader
    from dataloader.pipeline.processor import MapDataProcessKind
    from dataloader.util.data_kind import DataKind

    filename = 'train.tsv'
    dataset = Dataset(filename, kind=DataKind.MMAP_FILE)

    dl = DataLoader(dataset, batch_size=2, shuffle=True, processor_kind=MapDataProcessKind.MULTI_PROCESS, num_procs=20)
    for batch in dl:
        print(batch)
NOTES:

- for the fully supported parameters, please refer to the `DataLoader <https://github.com/ericxsun/ml-dataloader/blob/main/dataloader/dataloader.py>`__ definition
- with `MultiThreadMapData/MultiProcessMapDataZMQ <https://github.com/ericxsun/ml-dataloader/blob/main/dataloader/pipeline/processor.py>`__, the order won't be kept as defined in the dataset
- in order to keep the order defined in ``Dataset``, `MapData <https://github.com/ericxsun/ml-dataloader/blob/main/dataloader/pipeline/processor.py>`__ can be used, but it can be slow compared with ``MultiThreadMapData`` and ``MultiProcessMapDataZMQ``. Alternatively, process the data with `pool_transform <https://github.com/ericxsun/ml-dataloader/blob/main/dataloader/transform/misc.py>`__ first, then pass the processed data into ``Dataset`` as the ``DataKind.MEM_SEQ`` kind, i.e. ``dataset = Dataset(processed, DataKind.MEM_SEQ)``, avoiding ``MultiThreadMapData/MultiProcessMapDataZMQ`` altogether
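That last alternative — transform everything once up front, then treat the result as an in-memory dataset — can be sketched with the standard library. This is a stand-in illustrating the idea; ``pool_transform``'s real signature may differ:

.. code:: python

    from multiprocessing.dummy import Pool  # thread-backed; a process pool works the same way

    def transform(x):
        # stand-in for an expensive per-sample transform
        return x * x

    data = list(range(10))

    # Transform the whole dataset once, up front, in a worker pool.
    # Pool.map preserves the input order, which is exactly what the
    # multi-process pipeline processors do not guarantee.
    with Pool(4) as pool:
        processed = pool.map(transform, data)

    print(processed)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
    # the result can then be fed back in as an in-memory dataset:
    # dataset = Dataset(processed, DataKind.MEM_SEQ)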
Refs:

- `pytorch-data <https://github.com/pytorch/pytorch/tree/master/torch/utils/data>`__
- `MONAI <https://github.com/Project-MONAI/MONAI>`__
- `tensorpack-dataflow <https://github.com/tensorpack/dataflow>`__
- `performance-tuning <https://github.com/tensorpack/tensorpack/blob/master/docs/tutorial/performance-tuning.md>`__
- `tensorpack-benchmark <https://github.com/tensorpack/benchmarks/blob/master/ResNet-Horovod/imagenet-resnet-horovod.py>`__
FAQ

- ``[__NSPlaceholderDate initialize] may have been in progress in another thread when fork()``

  if you hit this error, set ``OBJC_DISABLE_INITIALIZE_FORK_SAFETY`` to yes::

      export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
- publish to PyPI (on mac)::

      pip install pypandoc twine
      brew install pandoc
      python setup.py sdist
      twine upload --verbose --repository-url "https://upload.pypi.org/legacy/" -u "pypi-username" -p "pypi-password" dist/*