ml-dataloader is an efficient and flexible data loading pipeline for deep learning, written in pure Python.
Install
pip install ml-dataloader
Examples (similar to the PyTorch DataLoader)
Suppose the data is stored in a Python list:
from dataloader.dataset import Dataset
from dataloader.dataloader import DataLoader
from dataloader.util.data_kind import DataKind
data = list(range(10))
kind = DataKind.MEM_SEQ
dataset = Dataset(data, kind)
dl = DataLoader(dataset, batch_size=2, shuffle=False)
for batch in dl:
    print(batch)
# tf.Tensor([0 1], shape=(2,), dtype=int32)
# tf.Tensor([2 3], shape=(2,), dtype=int32)
# tf.Tensor([4 5], shape=(2,), dtype=int32)
# tf.Tensor([6 7], shape=(2,), dtype=int32)
# tf.Tensor([8 9], shape=(2,), dtype=int32)
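The grouping shown in the output above (consecutive pairs when batch_size=2 and shuffle=False) can be reproduced in plain Python. This is only an illustrative sketch of sequential batching, not the library's implementation; the helper name iter_batches is hypothetical, and the result is plain lists rather than tensors.

```python
def iter_batches(data, batch_size):
    """Yield consecutive slices of `data`, each with at most `batch_size` items."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


# Same input as the example above: ten integers, batches of two, no shuffling.
batches = list(iter_batches(list(range(10)), 2))
print(batches)
```

If the input length is not a multiple of batch_size, this sketch yields a shorter final batch; how ml-dataloader handles that case is not stated here.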
Suppose the data is stored in train.tsv:
from dataloader.dataset import Dataset
from dataloader.dataloader import DataLoader
from dataloader.util.data_kind import DataKind
filename = 'train.tsv'
kind = DataKind.FILE
dataset = Dataset(filename, kind)
dl = DataLoader(dataset, batch_size=2, shuffle=True)
for batch in dl:
    print(batch)
Suppose the data is stored in train.tsv and accessed through mmap:
import mmap
from dataloader.dataset import Dataset
from dataloader.dataloader import DataLoader
from dataloader.util.data_kind import DataKind
filename = 'train.tsv'
# Map the whole file read-only; the mapping stays valid after the file object is closed.
with open(filename, 'rb') as fp:
    mm = mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ)
kind = DataKind.MMAP_FILE
dataset = Dataset(mm, kind, filename=filename)
dl = DataLoader(dataset, batch_size=2, shuffle=True)
for batch in dl:
    print(batch)
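To show what the memory-mapped object in the example above provides, here is a minimal standard-library sketch that maps a small file read-only and reads it back line by line. It does not use ml-dataloader at all; the helper name read_lines_via_mmap and the sample file contents are hypothetical, and the line format that Dataset expects is not assumed.

```python
import mmap
import os
import tempfile


def read_lines_via_mmap(path):
    """Return the file's lines as bytes objects, read through a read-only memory map."""
    with open(path, "rb") as fp:
        # Length 0 maps the entire file (the file must not be empty).
        mm = mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ)
    # The mapping remains usable after the file object has been closed.
    try:
        # mm.readline() returns b"" once the end of the map is reached.
        return [line.rstrip(b"\n") for line in iter(mm.readline, b"")]
    finally:
        mm.close()


# Write a tiny, made-up TSV file so the sketch is self-contained.
tmp = tempfile.NamedTemporaryFile("wb", suffix=".tsv", delete=False)
tmp.write(b"hello world\t0\ngood morning\t1\n")
tmp.close()

lines = read_lines_via_mmap(tmp.name)
os.remove(tmp.name)
print(lines)
```

Note that the constructor is mmap.mmap, since mmap itself is the module and cannot be called directly.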
NOTES:
If transform is slow, the DataLoader can get stuck when num_workers > 0.
Examples with Pipeline (similar to Tensorpack DataFlow)
Suppose the data is stored in a Python list:
from dataloader.pipeline.dataset import Dataset
from dataloader.pipeline.dataloader import DataLoader
from dataloader.pipeline.processor import MapDataProcessKind
from dataloader.util.data_kind import DataKind
data = list(range(10))
kind = DataKind.MEM_SEQ
dataset = Dataset(data, kind)
dl = DataLoader(dataset, batch_size=2, shuffle=False, processor_kind=MapDataProcessKind.NORMAL)
for batch in dl:
    print(batch)
# tf.Tensor([0 1], shape=(2,), dtype=int32)
# tf.Tensor([2 3], shape=(2,), dtype=int32)
# tf.Tensor([4 5], shape=(2,), dtype=int32)
# tf.Tensor([6 7], shape=(2,), dtype=int32)
# tf.Tensor([8 9], shape=(2,), dtype=int32)
Suppose the data is stored in train.tsv:
from dataloader.pipeline.dataset import Dataset
from dataloader.pipeline.dataloader import DataLoader
from dataloader.pipeline.processor import MapDataProcessKind
from dataloader.util.data_kind import DataKind
filename = 'train.tsv'
kind = DataKind.FILE
dataset = Dataset(filename, kind)
dl = DataLoader(dataset, batch_size=2, shuffle=True, processor_kind=MapDataProcessKind.MULTI_PROCESS, num_procs=20)
for batch in dl:
    print(batch)
Suppose the data is stored in train.tsv and accessed through mmap:
import mmap
from dataloader.pipeline.dataset import Dataset
from dataloader.pipeline.dataloader import DataLoader
from dataloader.pipeline.processor import MapDataProcessKind
from dataloader.util.data_kind import DataKind
filename = 'train.tsv'
# Map the whole file read-only; the mapping stays valid after the file object is closed.
with open(filename, 'rb') as fp:
    mm = mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ)
kind = DataKind.MMAP_FILE
dataset = Dataset(mm, kind, filename=filename)
dl = DataLoader(dataset, batch_size=2, shuffle=True, processor_kind=MapDataProcessKind.MULTI_PROCESS, num_procs=20)
for batch in dl:
    print(batch)
NOTES:
For the full list of supported parameters, refer to the DataLoader definition.
With MultiThreadMapData/MultiProcessMapDataZMQ, the order defined in the dataset is not preserved.
To keep the order defined in the Dataset, use MapData, although it can be slower than MultiThreadMapData and MultiProcessMapDataZMQ. Alternatively, process the data up front with pool_transform, then pass the processed data to Dataset with the DataKind.MEM_SEQ kind, i.e., dataset = Dataset(processed, DataKind.MEM_SEQ), and avoid MultiThreadMapData/MultiProcessMapDataZMQ.
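The second approach in the last note (transform everything first, in order, then load it as an in-memory sequence) can be sketched as follows. The signature of pool_transform is not shown in this document, so this sketch uses the standard library's ThreadPoolExecutor.map as a stand-in; it returns results in input order even when workers finish out of order. The names transform and ordered_parallel_map are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def transform(x):
    # Hypothetical per-sample transform; replace with the real preprocessing.
    return x * x


def ordered_parallel_map(fn, items, max_workers=4):
    """Apply fn to every item in parallel and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(fn, items))


data = list(range(10))
processed = ordered_parallel_map(transform, data)
print(processed)

# The ordered result would then be wrapped as an in-memory sequence, as the note describes:
# dataset = Dataset(processed, DataKind.MEM_SEQ)
```

For CPU-bound transforms, ProcessPoolExecutor offers the same map interface and may be a better fit than threads.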
FAQ
1. How do I fix the error "[__NSPlaceholderDate initialize] may have been in progress in another thread when fork()"?
This usually occurs only on Mac. Running export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES resolves it.