h5torch
HDF5 data utilities for PyTorch.
h5torch consists of two main parts: (1) h5torch.File, a wrapper around h5py.File that serves as an interface for creating HDF5 files compatible with (2) h5torch.Dataset, a wrapper around torch.utils.data.Dataset. As a library, h5torch establishes a convention (a "code") for how datasets should be saved, so that a single dataset object can load data for a variety of machine learning settings, reducing boilerplate in your projects.
:weary: but y tho?
Loading data from HDF5 files lets you read samples from disk on demand instead of holding the whole dataset in memory, which reduces memory overhead. Additionally, you will find your datasets to be more organized in the HDF5 format, as everything is neatly stored in a single file.
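To make the on-disk access pattern concrete, here is a minimal sketch that uses plain h5py and NumPy rather than h5torch. The file name sketch.h5 and the dataset name "X" are illustrative choices, not part of the h5torch format.

```python
import os
import tempfile

import h5py
import numpy as np

X = np.random.randn(100, 15)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sketch.h5")  # hypothetical file name

    # Write the full array to disk once.
    with h5py.File(path, "w") as f:
        f.create_dataset("X", data=X)

    # Reopen for reading: the dataset handle exposes shape and dtype
    # without loading the array data into memory.
    with h5py.File(path, "r") as f:
        dset = f["X"]
        n_rows = dset.shape[0]
        row = dset[5]  # only the requested slice is returned as a NumPy array
```

Indexing the dataset handle returns just the requested row, which is the access pattern a per-sample data loader relies on.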
Install
conda create -n "gaetan_h5torch" python=3.10
conda activate gaetan_h5torch
# install torch manually for your specific system
git clone https://github.com/gdewael/h5torch
cd h5torch
pip install -e .
Usage
The simplest use case is an ML setting with a 2-D matrix X as the central object and corresponding labels y aligned along its first axis.
import h5torch
import numpy as np

# Write: X is the central object, y is aligned to its axis 0.
f = h5torch.File("example.h5t", "w")
X = np.random.randn(100, 15)
y = np.random.rand(100)
f.register(X, "central")
f.register(y, 0, name = "y")
f.close()

# Read: index a single sample and query the dataset length.
dataset = h5torch.Dataset("example.h5t")
dataset[5], len(dataset)
Note that the labels y can also play the role of central object. Both are equivalent in this simple case.
import h5torch
import numpy as np
f = h5torch.File("example.h5t", "w")
X = np.random.randn(100, 15)
y = np.random.rand(100)
f.register(y, "central")
f.register(X, 0, name = "X")
f.close()
dataset = h5torch.Dataset("example.h5t")
dataset[5], len(dataset)
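Why the two layouts are equivalent can be sketched without h5torch at all: a sample at index i pairs the i-th slice of the central object with the i-th entry of every object aligned to axis 0. The class below is a hypothetical pure-Python stand-in. AlignedSketch and its key names are illustrative assumptions, not h5torch's API or its output format.

```python
class AlignedSketch:
    """Toy stand-in: a central object plus objects aligned to its first axis."""

    def __init__(self, central):
        self.central = central
        self.aligned = {}

    def register(self, data, name):
        # An aligned object needs exactly one entry per element of the central axis.
        if len(data) != len(self.central):
            raise ValueError(
                f"{name!r} has length {len(data)}, expected {len(self.central)}"
            )
        self.aligned[name] = data

    def __len__(self):
        return len(self.central)

    def __getitem__(self, i):
        # Bundle the i-th central slice with the i-th entry of each aligned object.
        sample = {"central": self.central[i]}
        for name, data in self.aligned.items():
            sample[name] = data[i]
        return sample


X = [[float(r * 10 + c) for c in range(3)] for r in range(4)]  # 4 samples, 3 features
y = [0, 1, 1, 0]

x_central = AlignedSketch(X)  # X central, y aligned
x_central.register(y, "y")

y_central = AlignedSketch(y)  # y central, X aligned
y_central.register(X, "X")

sample_a = x_central[2]
sample_b = y_central[2]
```

Both objects have the same length and return the same pairing of features and label for a given index; only the key under which each piece appears differs.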
The next example uses a 2-dimensional matrix Y (such as a score matrix) as the central object, with objects aligned to both of its axes. Y is stored and sampled in "coo" mode: the length of the dataset is the number of nonzero elements in the score matrix, and each sample consists of one such nonzero element together with the stored information for its row and column.
import h5torch
import numpy as np
f = h5torch.File("example.h5t", "w")
Y = (np.random.rand(1000, 500) > 0.95).astype(int)
row_features = np.random.randn(1000, 15)
col_names = np.arange(500).astype(bytes)
f.register(Y, "central", mode = "coo")
f.register(row_features, 0, name = "row_features")
f.register(col_names, 1, name = "col_names")
f.close()
dataset = h5torch.Dataset("example.h5t", sampling = "coo")
dataset[5], len(dataset)
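The "coo" sampling semantics described above can be illustrated with a small pure-Python sketch. This is not h5torch's implementation: the helper names to_coo and get_sample, the sample keys, and the row-major ordering of nonzero entries are all assumptions made for illustration.

```python
def to_coo(matrix):
    """Return (row, col, value) triples for the nonzero entries, in row-major order."""
    return [
        (r, c, v)
        for r, row in enumerate(matrix)
        for c, v in enumerate(row)
        if v != 0
    ]


Y = [
    [0, 1, 0],
    [0, 0, 0],
    [1, 0, 1],
]
row_features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # one entry per row of Y
col_names = [b"a", b"b", b"c"]                       # one entry per column of Y

entries = to_coo(Y)


def get_sample(i):
    # A sample is one nonzero element plus the data aligned to its row and column.
    r, c, v = entries[i]
    return {"value": v, "row_features": row_features[r], "col_name": col_names[c]}


n_samples = len(entries)  # dataset length = number of nonzero elements
sample = get_sample(1)
```

Here Y has three nonzero elements, so the sketch has three samples even though the matrix holds nine entries. The second sample is the element at row 2, column 0, so it carries the features of row 2 and the name of column 0.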
Note: h5torch does not limit the number of dimensions of its central data object (and hence also the number of axes to align objects to).
Package roadmap
- Implement typing
- Provide data type conversion capabilities for registering datasets
- Add support for custom samplers
- Add support for making data splits