
Project description

dpshdl

A framework-agnostic library for loading data.

Installation

Install the package using:

pip install dpshdl

Or, to install the latest version from the master branch:

pip install 'dpshdl @ git+https://github.com/kscalelabs/dpshdl.git@master'
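
To confirm the installation, a quick import check is enough; this is just generic Python, not a dpshdl-specific command:

python -c "import dpshdl; print('ok')"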

Usage

Datasets should override a single method, next, which returns a single sample:

from dpshdl.dataset import Dataset
from dpshdl.dataloader import Dataloader
import numpy as np

class MyDataset(Dataset[int, np.ndarray]):
    def next(self) -> int:
        return 1

# Loops forever.
with Dataloader(MyDataset(), batch_size=2) as loader:
    for sample in loader:
        assert sample.shape == (2,)
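
For the (2,)-shaped batch above to come out of integer samples, the samples also have to be collated into an array, which is covered in the Collating section below. As a sketch that simply combines the snippets on this page, a self-contained version of the example looks like this:

from dpshdl.dataset import Dataset
from dpshdl.dataloader import Dataloader
from dpshdl.collate import collate
import numpy as np

class MyDataset(Dataset[int, np.ndarray]):
    def next(self) -> int:
        # Return one raw sample; a real dataset would read from disk, a
        # database, a stream, etc.
        return 1

    def collate(self, items: list[int]) -> np.ndarray:
        # Stack the raw samples into a batch using the default helper
        # described in the Collating section below.
        return collate(items)

# Break after one batch, since the loader loops forever.
with Dataloader(MyDataset(), batch_size=2) as loader:
    for sample in loader:
        assert sample.shape == (2,)
        break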

Error Handling

You can wrap any dataset in an ErrorHandlingDataset to catch and log errors:

from dpshdl.dataset import ErrorHandlingDataset

with Dataloader(ErrorHandlingDataset(MyDataset()), batch_size=2) as loader:
    ...

This wrapper catches errors raised in the next function and logs error summaries instead of crashing the entire program.
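
As a sketch of when this matters, here is a dataset whose next occasionally raises; the exception and failure rate are made up for illustration, while ErrorHandlingDataset and Dataloader are used exactly as above:

from dpshdl.dataset import Dataset, ErrorHandlingDataset
from dpshdl.dataloader import Dataloader
import numpy as np
import random

class FlakyDataset(Dataset[int, np.ndarray]):
    def next(self) -> int:
        # Simulate an occasional failure, such as a corrupt file or a
        # dropped connection.
        if random.random() < 0.1:
            raise RuntimeError("simulated sample failure")
        return 1

with Dataloader(ErrorHandlingDataset(FlakyDataset()), batch_size=2) as loader:
    for i, sample in enumerate(loader):
        # Errors raised in next are logged by the wrapper instead of
        # propagating here.
        if i >= 10:
            break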

Ad-hoc Testing

While developing datasets, you usually want to loop through a few samples to make sure everything is working. You can do this easily as follows:

MyDataset().test(
    max_samples=100,
    handle_errors=True,  # To automatically wrap the dataset in an ErrorHandlingDataset.
    print_fn=lambda i, sample: print(f"Sample {i}: {sample}")
)

Collating

This package provides a default implementation of dataset collating, which can be used as follows:

from dpshdl.collate import collate

class MyDataset(Dataset[int, np.ndarray]):
    def collate(self, items: list[int]) -> np.ndarray:
        return collate(items)
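
Following the first example above, where two integer samples collate into a (2,)-shaped array, a quick sanity check of the default collate might look like this:

from dpshdl.collate import collate
import numpy as np

batch = collate([1, 2, 3])
assert isinstance(batch, np.ndarray)
assert batch.shape == (3,)  # One entry per sample, matching the (2,) batch above.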

Alternatively, you can implement your own custom collating strategy:

class MyDataset(Dataset[int, list[int]]):
    def collate(self, items: list[int]) -> list[int]:
        return items

There are additional arguments that can be passed to the collate function to automatically handle padding and batching:

from dpshdl.collate import collate, pad_all, pad_sequence
import functools
import random
import numpy as np

items = [np.random.random(random.randint(5, 10)) for _ in range(5)]  # Randomly sized arrays.
collate(items)  # Will fail because the arrays are of different sizes.
collate(items, pad=True)  # Use the default padding strategy.
collate(items, pad=functools.partial(pad_all, left_pad=True))  # Left-padding.
collate(items, pad=functools.partial(pad_sequence, dim=0, left_pad=True))  # Pads a specific dimension.
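
To see what the padded batch looks like, a small self-contained check like the one below can help; expecting one row per item is an assumption based on the examples above, and the padded length depends on the longest item:

from dpshdl.collate import collate
import numpy as np
import random

items = [np.random.random(random.randint(5, 10)) for _ in range(5)]
batch = collate(items, pad=True)
print(batch.shape)  # Expected: one row per item, padded to a common length.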

Prefetching

Sometimes it is a good idea to trigger a host-to-device transfer before a batch of samples is needed, so that it can take place asynchronously while other computation is happening. This is called prefetching. This package provides a simple utility class to do this:

from dpshdl.dataset import Dataset
from dpshdl.dataloader import Dataloader
from dpshdl.prefetcher import Prefetcher
import numpy as np
import torch
from torch import Tensor


class MyDataset(Dataset[int, np.ndarray]):
    def next(self) -> int:
        return 1


def to_device_func(sample: np.ndarray) -> Tensor:
    # Because this is non-blocking, the H2D transfer can take place in the
    # background while other computation is happening.
    return torch.from_numpy(sample).to("cuda", non_blocking=True)


with Prefetcher(to_device_func, Dataloader(MyDataset(), batch_size=2)) as loader:
    for sample in loader:
        assert sample.device.type == "cuda"
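
As a sketch of how the prefetched batches might be consumed, here is a toy training step that continues the example above; the model, loss, and stopping condition are placeholders and not part of dpshdl:

model = torch.nn.Linear(1, 1).to("cuda")
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

with Prefetcher(to_device_func, Dataloader(MyDataset(), batch_size=2)) as loader:
    for i, batch in enumerate(loader):
        # The batch is already on the GPU, so the forward pass can start
        # immediately while the next transfer happens in the background.
        loss = model(batch.float().unsqueeze(-1)).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if i >= 10:
            break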

Download files

Download the file for your platform.

Source Distribution

dpshdl-0.0.21.tar.gz (26.6 kB)

Built Distribution

dpshdl-0.0.21-py3-none-any.whl (27.2 kB)

File details

Details for the file dpshdl-0.0.21.tar.gz.

File metadata

  • Download URL: dpshdl-0.0.21.tar.gz
  • Upload date:
  • Size: 26.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.0.0 CPython/3.12.2

File hashes

Hashes for dpshdl-0.0.21.tar.gz

  • SHA256: 72e4da98750f10ff53437226c9aca2347c8054ba032563b2b361ae7a8c60910e
  • MD5: 6027373d315ff686634b762068c814ca
  • BLAKE2b-256: 7a91a7907b61eb4e6573f80b886dd849d3fba56cf772c233a86df47c16631d94


File details

Details for the file dpshdl-0.0.21-py3-none-any.whl.

File metadata

  • Download URL: dpshdl-0.0.21-py3-none-any.whl
  • Upload date:
  • Size: 27.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.0.0 CPython/3.12.2

File hashes

Hashes for dpshdl-0.0.21-py3-none-any.whl

  • SHA256: 15d2fc76b782193d01e8c7f8d67f10e882015e7dd35e7805f46d148fe1f8c79c
  • MD5: 61d283922ef79d3b3a5903906c67ce0e
  • BLAKE2b-256: 1e0892e2593e0a2fbb66c8231b63d3fa66d2f1e8744458e3e7daa1bfd991d10a

