
ml-dataloader

ml-dataloader is an efficient and flexible data loading pipeline for deep learning, written in pure Python.

Install

pip install ml-dataloader

Examples (similar to PyTorch DataLoader)

  • suppose the data is stored in a Python list

.. code:: python

    from dataloader.dataset import Dataset
    from dataloader.dataloader import DataLoader
    from dataloader.util.data_kind import DataKind

    data = list(range(10))
    dataset = Dataset(data, kind=DataKind.MEM_SEQ)

    dl = DataLoader(dataset, batch_size=2, shuffle=False)
    for batch in dl:
        print(batch)

    # tf.Tensor([0 1], shape=(2,), dtype=int32)
    # tf.Tensor([2 3], shape=(2,), dtype=int32)
    # tf.Tensor([4 5], shape=(2,), dtype=int32)
    # tf.Tensor([6 7], shape=(2,), dtype=int32)
    # tf.Tensor([8 9], shape=(2,), dtype=int32)

  • suppose the data is stored in a Python list and each batch should also return the corresponding indices

.. code:: python

    from dataloader.dataset import DatasetWithIndex
    from dataloader.dataloader import DataLoader
    from dataloader.util.data_kind import DataKind

    data = list(range(10))
    dataset = DatasetWithIndex(data, kind=DataKind.MEM_SEQ)

    dl = DataLoader(dataset, batch_size=2, shuffle=False)
    for batch in dl:
        print(batch)

    # (tf.Tensor([0, 1], shape=(2,), dtype=int32), tf.Tensor([0, 1], shape=(2,), dtype=int32))
    # (tf.Tensor([2, 3], shape=(2,), dtype=int32), tf.Tensor([2, 3], shape=(2,), dtype=int32))
    # (tf.Tensor([4, 5], shape=(2,), dtype=int32), tf.Tensor([4, 5], shape=(2,), dtype=int32))
    # (tf.Tensor([6, 7], shape=(2,), dtype=int32), tf.Tensor([6, 7], shape=(2,), dtype=int32))
    # (tf.Tensor([8, 9], shape=(2,), dtype=int32), tf.Tensor([8, 9], shape=(2,), dtype=int32))

  • suppose the data is stored in train.tsv

.. code:: python

    from dataloader.dataset import Dataset
    from dataloader.dataloader import DataLoader
    from dataloader.util.data_kind import DataKind

    filename = 'train.tsv'
    dataset = Dataset(filename, kind=DataKind.FILE)

    dl = DataLoader(dataset, batch_size=2, shuffle=True)
    for batch in dl:
        print(batch)

  • suppose the data is stored in train.tsv and read via mmap

.. code:: python

    from dataloader.dataset import Dataset
    from dataloader.dataloader import DataLoader
    from dataloader.util.data_kind import DataKind

    filename = 'train.tsv'
    dataset = Dataset(filename, kind=DataKind.MMAP_FILE)

    dl = DataLoader(dataset, batch_size=2, shuffle=True)
    for batch in dl:
        print(batch)

NOTES:

  • if the transform is slow, the DataLoader can get stuck when num_workers > 0; see the sketch below
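For reference, here is a minimal sketch of the situation the note describes. It assumes DataLoader accepts transform and num_workers keyword arguments; those names are not confirmed here, so check the DataLoader definition for the actually supported parameters.

.. code:: python

    from dataloader.dataset import Dataset
    from dataloader.dataloader import DataLoader
    from dataloader.util.data_kind import DataKind

    def transform(x):
        # stand-in for a (potentially slow) per-sample preprocessing step
        return x * 2

    data = list(range(10))
    dataset = Dataset(data, kind=DataKind.MEM_SEQ)

    # `transform` and `num_workers` are assumed keyword arguments here;
    # see the DataLoader definition for the supported parameters
    dl = DataLoader(dataset, batch_size=2, shuffle=False, transform=transform, num_workers=4)
    for batch in dl:
        print(batch)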

Examples with Pipeline (similar to Tensorpack DataFlow)

  • suppose the data is stored in a Python list

.. code:: python

    from dataloader.pipeline.dataset import Dataset
    from dataloader.pipeline.dataloader import DataLoader
    from dataloader.pipeline.processor import MapDataProcessKind
    from dataloader.util.data_kind import DataKind

    data = list(range(10))
    dataset = Dataset(data, kind=DataKind.MEM_SEQ)

    dl = DataLoader(dataset, batch_size=2, shuffle=False, processor_kind=MapDataProcessKind.NORMAL)
    for batch in dl:
        print(batch)

    # tf.Tensor([0 1], shape=(2,), dtype=int32)
    # tf.Tensor([2 3], shape=(2,), dtype=int32)
    # tf.Tensor([4 5], shape=(2,), dtype=int32)
    # tf.Tensor([6 7], shape=(2,), dtype=int32)
    # tf.Tensor([8 9], shape=(2,), dtype=int32)

  • suppose the data is stored in train.tsv

.. code:: python

    from dataloader.pipeline.dataset import Dataset
    from dataloader.pipeline.dataloader import DataLoader
    from dataloader.pipeline.processor import MapDataProcessKind
    from dataloader.util.data_kind import DataKind

    filename = 'train.tsv'
    dataset = Dataset(filename, kind=DataKind.FILE)

    dl = DataLoader(dataset, batch_size=2, shuffle=True, processor_kind=MapDataProcessKind.MULTI_PROCESS, num_procs=20)
    for batch in dl:
        print(batch)

  • suppose the data is stored in train.tsv and read via mmap

.. code:: python

    from dataloader.pipeline.dataset import Dataset
    from dataloader.pipeline.dataloader import DataLoader
    from dataloader.pipeline.processor import MapDataProcessKind
    from dataloader.util.data_kind import DataKind

    filename = 'train.tsv'
    dataset = Dataset(filename, kind=DataKind.MMAP_FILE)

    dl = DataLoader(dataset, batch_size=2, shuffle=True, processor_kind=MapDataProcessKind.MULTI_PROCESS, num_procs=20)
    for batch in dl:
        print(batch)

NOTES:

  1. for the fully supported parameters, please refer to the `DataLoader <https://github.com/ericxsun/ml-dataloader/blob/main/dataloader/dataloader.py>`__ definition
  2. with `MultiThreadMapData/MultiProcessMapDataZMQ <https://github.com/ericxsun/ml-dataloader/blob/main/dataloader/pipeline/processor.py>`__, the order defined in the dataset is not preserved
  3. to keep the order defined in the Dataset, `MapData <https://github.com/ericxsun/ml-dataloader/blob/main/dataloader/pipeline/processor.py>`__ can be used, but it can be slower than MultiThreadMapData and MultiProcessMapDataZMQ. Alternatively, process the data with `pool_transform <https://github.com/ericxsun/ml-dataloader/blob/main/dataloader/transform/misc.py>`__ first, then pass the processed data into Dataset as DataKind.MEM_SEQ, i.e. dataset = Dataset(processed, DataKind.MEM_SEQ), and avoid MultiThreadMapData/MultiProcessMapDataZMQ; see the sketch after these notes
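As a rough sketch of the pool_transform route from note 3: the data is processed up front, then fed back in as DataKind.MEM_SEQ so that no MultiThreadMapData/MultiProcessMapDataZMQ is involved and the order is preserved. The call pool_transform(transform, data) is an assumed signature; check dataloader/transform/misc.py for the actual interface.

.. code:: python

    from dataloader.pipeline.dataset import Dataset
    from dataloader.pipeline.dataloader import DataLoader
    from dataloader.pipeline.processor import MapDataProcessKind
    from dataloader.transform.misc import pool_transform
    from dataloader.util.data_kind import DataKind

    def transform(x):
        # stand-in preprocessing step, applied up front before batching
        return x * 2

    data = list(range(10))

    # assumed call pattern: pool_transform(func, data) -> processed sequence;
    # see dataloader/transform/misc.py for the actual interface
    processed = pool_transform(transform, data)

    # the processed data goes back in as DataKind.MEM_SEQ, so order is kept
    dataset = Dataset(processed, kind=DataKind.MEM_SEQ)

    dl = DataLoader(dataset, batch_size=2, shuffle=False, processor_kind=MapDataProcessKind.NORMAL)
    for batch in dl:
        print(batch)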

Refs:

  • `pytorch-data <https://github.com/pytorch/pytorch/tree/master/torch/utils/data>`__
  • `MONAI <https://github.com/Project-MONAI/MONAI>`__
  • `tensorpack-dataflow <https://github.com/tensorpack/dataflow>`__
  • `performance-tuning <https://github.com/tensorpack/tensorpack/blob/master/docs/tutorial/performance-tuning.md>`__
  • `tensorpack-benchmark <https://github.com/tensorpack/benchmarks/blob/master/ResNet-Horovod/imagenet-resnet-horovod.py>`__

FAQ

  • [__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called

    if this error occurs (typically on macOS), set OBJC_DISABLE_INITIALIZE_FORK_SAFETY to YES: export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
