
ml-dataloader is an efficient and flexible data loading pipeline for deep learning, written in pure Python.
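Conceptually, a data loader shuffles indices and yields fixed-size batches of the underlying data. A minimal pure-Python sketch of that behavior (an illustration only, not ml-dataloader's actual implementation):

```python
import random

def batches(data, batch_size, shuffle=False, seed=None):
    """Optionally shuffle indices, then yield fixed-size batches."""
    indices = list(range(len(data)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

for batch in batches(list(range(10)), batch_size=2):
    print(batch)
# [0, 1]
# [2, 3]
# [4, 5]
# [6, 7]
# [8, 9]
```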

Install

pip install ml-dataloader

Examples (similar to PyTorch DataLoader)

  • suppose the data is stored in a Python list

from dataloader.dataset import Dataset
from dataloader.dataloader import DataLoader
from dataloader.util.data_kind import DataKind

data = list(range(10))
kind = DataKind.MEM_SEQ
dataset = Dataset(data, kind)

dl = DataLoader(dataset, batch_size=2, shuffle=False)
for batch in dl:
    print(batch)

# tf.Tensor([0 1], shape=(2,), dtype=int32)
# tf.Tensor([2 3], shape=(2,), dtype=int32)
# tf.Tensor([4 5], shape=(2,), dtype=int32)
# tf.Tensor([6 7], shape=(2,), dtype=int32)
# tf.Tensor([8 9], shape=(2,), dtype=int32)
  • suppose train.tsv stores the data

from dataloader.dataset import Dataset
from dataloader.dataloader import DataLoader
from dataloader.util.data_kind import DataKind

filename = 'train.tsv'
kind = DataKind.FILE
dataset = Dataset(filename, kind)

dl = DataLoader(dataset, batch_size=2, shuffle=True)
for batch in dl:
    print(batch)
  • suppose train.tsv stores the data and is accessed via mmap

import mmap

from dataloader.dataset import Dataset
from dataloader.dataloader import DataLoader
from dataloader.util.data_kind import DataKind

filename = 'train.tsv'

fp = open(filename, 'rb')
mm = mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ)
fp.close()

kind = DataKind.MMAP_FILE
dataset = Dataset(mm, kind, filename=filename)

dl = DataLoader(dataset, batch_size=2, shuffle=True)
for batch in dl:
    print(batch)
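The mmap pattern above maps the file into memory so the Dataset can read it even after the file object is closed. A self-contained stdlib illustration of that pattern, independent of ml-dataloader (the temp-file contents are a stand-in for train.tsv):

```python
import mmap
import os
import tempfile

# Create a small TSV file to map (stand-in for train.tsv).
fd, path = tempfile.mkstemp(suffix='.tsv')
with os.fdopen(fd, 'wb') as f:
    f.write(b'a\t1\nb\t2\nc\t3\n')

with open(path, 'rb') as fp:
    # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
    mm = mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ)
# The file object is now closed: the mapping keeps the data accessible.

lines = []
while True:
    line = mm.readline()
    if not line:
        break
    lines.append(line.rstrip(b'\n').split(b'\t'))

print(lines)  # [[b'a', b'1'], [b'b', b'2'], [b'c', b'3']]
mm.close()
os.remove(path)
```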

NOTES:

  • if the transform is slow, the DataLoader can get stuck when num_workers > 0

Examples with Pipeline (similar to Tensorpack DataFlow)

  • suppose the data is stored in a Python list

from dataloader.pipeline.dataset import Dataset
from dataloader.pipeline.dataloader import DataLoader
from dataloader.pipeline.processor import MapDataProcessKind
from dataloader.util.data_kind import DataKind

data = list(range(10))
kind = DataKind.MEM_SEQ
dataset = Dataset(data, kind)

dl = DataLoader(dataset, batch_size=2, shuffle=False, processor_kind=MapDataProcessKind.NORMAL)
for batch in dl:
    print(batch)

# tf.Tensor([0 1], shape=(2,), dtype=int32)
# tf.Tensor([2 3], shape=(2,), dtype=int32)
# tf.Tensor([4 5], shape=(2,), dtype=int32)
# tf.Tensor([6 7], shape=(2,), dtype=int32)
# tf.Tensor([8 9], shape=(2,), dtype=int32)
  • suppose train.tsv stores the data

from dataloader.pipeline.dataset import Dataset
from dataloader.pipeline.dataloader import DataLoader
from dataloader.pipeline.processor import MapDataProcessKind
from dataloader.util.data_kind import DataKind

filename = 'train.tsv'
kind = DataKind.FILE
dataset = Dataset(filename, kind)

dl = DataLoader(dataset, batch_size=2, shuffle=True, processor_kind=MapDataProcessKind.MULTI_PROCESS, num_procs=20)
for batch in dl:
    print(batch)
  • suppose train.tsv stores the data and is accessed via mmap

import mmap

from dataloader.pipeline.dataset import Dataset
from dataloader.pipeline.dataloader import DataLoader
from dataloader.pipeline.processor import MapDataProcessKind
from dataloader.util.data_kind import DataKind

filename = 'train.tsv'

fp = open(filename, 'rb')
mm = mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ)
fp.close()

kind = DataKind.MMAP_FILE
dataset = Dataset(mm, kind, filename=filename)

dl = DataLoader(dataset, batch_size=2, shuffle=True, processor_kind=MapDataProcessKind.MULTI_PROCESS, num_procs=20)
for batch in dl:
    print(batch)

NOTES:

  1. for the full set of supported parameters, please refer to the DataLoader definition

  2. with MultiThreadMapData/MultiProcessMapDataZMQ, the output order is not guaranteed to match the order defined in the Dataset

  3. to keep the order defined in the Dataset, MapData can be used, but it can be slow compared with MultiThreadMapData and MultiProcessMapDataZMQ. Alternatively, preprocess the data with pool_transform, then pass the processed data into Dataset as DataKind.MEM_SEQ, i.e., dataset = Dataset(processed, DataKind.MEM_SEQ), and avoid MultiThreadMapData/MultiProcessMapDataZMQ entirely
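The second workaround in note 3 can be sketched with the stdlib: preprocess eagerly with an order-preserving map, then hand the in-memory result to Dataset as DataKind.MEM_SEQ. Since pool_transform's exact signature is not documented above, concurrent.futures.ThreadPoolExecutor is used here as a stand-in, and the transform is a placeholder:

```python
from concurrent.futures import ThreadPoolExecutor

def transform(x):
    # Placeholder per-example transform; substitute real preprocessing here.
    return x * x

data = list(range(10))

# Executor.map preserves input order regardless of completion order,
# which is exactly the property MapData-style processing provides.
with ThreadPoolExecutor(max_workers=4) as ex:
    processed = list(ex.map(transform, data))

print(processed)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
# The ordered result can then be fed back in memory:
# dataset = Dataset(processed, DataKind.MEM_SEQ)
```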

FAQ

1. How to fix "[__NSPlaceholderDate initialize] may have been in progress in another thread when fork()"?

This usually appears only on macOS; running export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES resolves it.
