dpshdl
A framework-agnostic library for loading data.
Installation
Install the package using:
pip install dpshdl
Or, to install the latest version from the master branch:
pip install 'dpshdl @ git+https://github.com/kscalelabs/dpshdl.git@master'
Usage
Datasets should override a single method, next, which returns a single sample.
from dpshdl.dataset import Dataset
from dpshdl.dataloader import Dataloader
import numpy as np

class MyDataset(Dataset[int, np.ndarray]):
    def next(self) -> int:
        return 1

# Loops forever.
with Dataloader(MyDataset(), batch_size=2) as loader:
    for sample in loader:
        assert sample.shape == (2,)
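To make the relationship between next and batching concrete, here is a minimal standard-library sketch of the idea. It is not how dpshdl implements its Dataloader; the names batches and next_sample are hypothetical and exist only for illustration.

```python
import itertools
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def batches(next_fn: Callable[[], T], batch_size: int) -> Iterator[list[T]]:
    """Yields lists of `batch_size` samples by calling `next_fn` repeatedly."""
    while True:
        yield [next_fn() for _ in range(batch_size)]


# Hypothetical stand-in for a dataset whose `next` always returns 1.
def next_sample() -> int:
    return 1


# The stream never ends, so take a bounded number of batches.
first_three = list(itertools.islice(batches(next_sample, batch_size=2), 3))
print(first_three)
```

Because the stream is infinite, any consumer has to decide when to stop, which is why the loop in the example above runs forever.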
Error Handling
You can wrap any dataset in an ErrorHandlingDataset to catch and log errors:
from dpshdl.dataset import ErrorHandlingDataset

with Dataloader(ErrorHandlingDataset(MyDataset()), batch_size=2) as loader:
    ...
This wrapper catches errors raised in the next function and logs error summaries instead of crashing the entire program.
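The general pattern behind such a wrapper can be sketched with the standard library alone. This is an illustration under stated assumptions, not the ErrorHandlingDataset implementation: the class name RetryingSource, the max_retries parameter, and the retry-then-give-up policy are all hypothetical.

```python
import logging
from collections import Counter
from typing import Any, Callable

logger = logging.getLogger(__name__)


class RetryingSource:
    """Hypothetical wrapper: retries a failing sample function and counts error types."""

    def __init__(self, next_fn: Callable[[], Any], max_retries: int = 10) -> None:
        self.next_fn = next_fn
        self.max_retries = max_retries
        self.error_counts = Counter()  # Maps exception class name to occurrences.

    def next(self) -> Any:
        for _ in range(self.max_retries):
            try:
                return self.next_fn()
            except Exception as e:
                # Record and log the failure, then try to draw another sample.
                self.error_counts[type(e).__name__] += 1
                logger.warning("Sample failed: %r", e)
        raise RuntimeError(f"Gave up after {self.max_retries} consecutive errors")


# A deliberately flaky sample function: every odd-numbered call raises.
calls = {"n": 0}


def flaky() -> int:
    calls["n"] += 1
    if calls["n"] % 2 == 1:
        raise ValueError("bad sample")
    return calls["n"]


source = RetryingSource(flaky)
samples = [source.next() for _ in range(3)]
print(samples, dict(source.error_counts))
```

Bounding the number of consecutive retries is a design choice in this sketch: it keeps a dataset that always fails from spinning forever.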
Ad-hoc Testing
While developing datasets, you usually want to loop through a few samples to make sure everything is working. You can do this easily as follows:
MyDataset().test(
    max_samples=100,
    handle_errors=True,  # To automatically wrap the dataset in an ErrorHandlingDataset.
    print_fn=lambda i, sample: print(f"Sample {i}: {sample}"),
)
Collating
This package provides a default implementation of dataset collating, which can be used as follows:
from dpshdl.collate import collate

class MyDataset(Dataset[int, np.ndarray]):
    def collate(self, items: list[int]) -> np.ndarray:
        return collate(items)
Alternatively, you can implement your own custom collating strategy:
class MyDataset(Dataset[int, list[int]]):
    def collate(self, items: list[int]) -> list[int]:
        # Custom strategy: keep the batch as a plain list of samples.
        return items
There are additional arguments that can be passed to the collate function to automatically handle padding and batching:
import functools
import random

import numpy as np
from dpshdl.collate import collate, pad_all, pad_sequence

items = [np.random.random(random.randint(5, 10)) for _ in range(5)]  # Randomly sized arrays.
collate(items)  # Will fail because the arrays are of different sizes.
collate(items, pad=True)  # Use the default padding strategy.
collate(items, pad=functools.partial(pad_all, left_pad=True))  # Left-padding.
collate(items, pad=functools.partial(pad_sequence, dim=0, left_pad=True))  # Pads a specific dimension.
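For readers unfamiliar with left- versus right-padding, the following sketch shows the idea on plain Python lists. It does not reproduce the behavior of pad_all or pad_sequence; the function pad_to_longest and its pad_value default of 0.0 are assumptions made for illustration.

```python
from typing import Sequence


def pad_to_longest(
    items: Sequence[Sequence[float]],
    pad_value: float = 0.0,
    left_pad: bool = False,
) -> list[list[float]]:
    """Pads every sequence to the length of the longest one (hypothetical helper)."""
    max_len = max(len(item) for item in items)
    padded = []
    for item in items:
        fill = [pad_value] * (max_len - len(item))
        # Left-padding puts the fill before the data, right-padding after it.
        padded.append(fill + list(item) if left_pad else list(item) + fill)
    return padded


right = pad_to_longest([[1.0, 2.0, 3.0], [4.0]])
left = pad_to_longest([[1.0, 2.0, 3.0], [4.0]], left_pad=True)
print(right)  # [[1.0, 2.0, 3.0], [4.0, 0.0, 0.0]]
print(left)   # [[1.0, 2.0, 3.0], [0.0, 0.0, 4.0]]
```

Once every item has the same length, the items can be stacked into a single rectangular batch.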
Prefetching
It is often useful to trigger a host-to-device transfer before a batch of samples is needed, so that the transfer can take place asynchronously while other computation is happening. This is called prefetching. This package provides a simple utility class to do this:
from dpshdl.dataset import Dataset
from dpshdl.dataloader import Dataloader
from dpshdl.prefetcher import Prefetcher
import numpy as np
import torch
from torch import Tensor

class MyDataset(Dataset[int, np.ndarray]):
    def next(self) -> int:
        return 1

def to_device_func(sample: np.ndarray) -> Tensor:
    # Because this is non-blocking, the H2D transfer can take place in the
    # background while other computation is happening.
    return torch.from_numpy(sample).to("cuda", non_blocking=True)

with Prefetcher(to_device_func, Dataloader(MyDataset(), batch_size=2)) as loader:
    for sample in loader:
        assert sample.device.type == "cuda"
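The ordering that prefetching relies on can be illustrated without a GPU. The sketch below is not the Prefetcher implementation; prefetch and start_transfer are hypothetical names, and start_transfer is a synchronous stand-in for a call that would return immediately in practice (such as a non-blocking device copy).

```python
from typing import Callable, Iterable, Iterator, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def prefetch(func: Callable[[A], B], iterable: Iterable[A]) -> Iterator[B]:
    """Applies `func` to item i+1 before handing item i to the consumer."""
    it = iter(iterable)
    try:
        pending = func(next(it))
    except StopIteration:
        return
    for item in it:
        upcoming = func(item)  # Kick off work on the next item first.
        yield pending
        pending = upcoming
    yield pending


events: list[str] = []


def start_transfer(x: int) -> int:
    events.append(f"start {x}")
    return x * 10


for batch in prefetch(start_transfer, [1, 2, 3]):
    events.append(f"use {batch}")

print(events)
```

In the recorded events, the transfer for the second batch starts before the first batch is used. This only pays off when the transfer function does not block, which is why the example above passes non_blocking=True.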