Skip to main content

Framework-agnostic library for loading data

Project description

dpshdl

A framework-agnostic library for loading data.

Installation

Install the package using:

pip install dpshdl

Or, to install the latest branch:

pip install 'dpshdl @ git+https://github.com/kscalelabs/dpshdl.git@master'

Usage

Datasets should override a single method, next, which returns a single sample.

from dpshdl.dataset import Dataset
from dpshdl.dataloader import Dataloader
import numpy as np

class MyDataset(Dataset[int, np.ndarray]):
    def next(self) -> int:
        return 1

# Loops forever.
with Dataloader(MyDataset(), batch_size=2) as loader:
    for sample in loader:
        assert sample.shape == (2,)

Error Handling

You can wrap any dataset in an ErrorHandlingDataset to catch and log errors:

from dpshdl.dataset import ErrorHandlingDataset

with Dataloader(ErrorHandlingDataset(MyDataset()), batch_size=2) as loader:
    ...

This wrapper will detect errors in the next function and log error summaries, to avoid crashing the entire program.

Ad-hoc Testing

While developing datasets, you usually want to loop through a few samples to make sure everything is working. You can do this easily as follows:

MyDataset().test(
    max_samples=100,
    handle_errors=True,  # To automatically wrap the dataset in an ErrorHandlingDataset.
    print_fn=lambda i, sample: print(f"Sample {i}: {sample}")
)

Collating

This package provides a default implementation of dataset collating, which can be used as follows:

from dpshdl.collate import collate

class MyDataset(Dataset[int, np.ndarray]):
    def collate(self, items: list[int]) -> np.ndarray:
        return collate(items)

Alternatively, you can implement your own custom collating strategy:

from dpshdl.collate import collate

class MyDataset(Dataset[int, list[int]]):
    def collate(self, items: list[int]) -> list[int]:
        return items

There are additional arguments that can be passed to the collate function to automatically handle padding and batching:

from dpshdl.collate import pad_all, pad_sequence
import functools
import random
import numpy as np

items = [np.random.random(random.randint(5, 10)) for _ in range(5)]  # Randomly sized arrays.
collate(items)  # Will fail because the arrays are of different sizes.
collate(items, pad=True)  # Use the default padding strategy.
collate(items, pad=functools.partial(pad_all, left_pad=True))  # Left-padding.
collate(items, pad=functools.partial(pad_sequence, dim=0, left_pad=True))  # Pads a specific dimension.

Prefetching

Sometimes it is a good idea to trigger a host-to-device transfer before a batch of samples is needed, so that it can take place asynchronously while other computation is happening. This is called prefetching. This package provides a simple utility class to do this:

from dpshdl.dataset import Dataset
from dpshdl.dataloader import Dataloader
from dpshdl.prefetcher import Prefetcher
import numpy as np
import torch
from torch import Tensor


class MyDataset(Dataset[int, np.ndarray]):
    def next(self) -> int:
        return 1


def to_device_func(sample: np.ndarray) -> Tensor:
    # Because this is non-blocking, the H2D transfer can take place in the
    # background while other computation is happening.
    return torch.from_numpy(sample).to("cuda", non_blocking=True)


with Prefetcher(to_device_func, Dataloader(MyDataset(), batch_size=2)) as loader:
    for sample in loader:
        assert sample.device.type == "cuda"

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dpshdl-0.0.21.tar.gz (26.6 kB view hashes)

Uploaded Source

Built Distribution

dpshdl-0.0.21-py3-none-any.whl (27.2 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page