h5torch

HDF5 data utilities for PyTorch.

h5torch consists of two main parts: (1) h5torch.File, a wrapper around h5py.File that serves as an interface for creating HDF5 files compatible with (2) h5torch.Dataset, a wrapper around torch.utils.data.Dataset. As a library, h5torch establishes a convention for how datasets should be saved, so that a single dataset object can load data for a variety of machine learning settings, reducing boilerplate in your projects.

😩 but y tho?

Loading data from HDF5 files allows for efficient data loading from an on-disk format: samples are read from disk as they are requested instead of the whole dataset being held in memory, which drastically reduces memory overhead. Additionally, you will find your datasets to be more organized using the HDF5 format, as everything is neatly arrayed in a single file.

Install

conda create -n "gaetan_h5torch" python=3.10
conda activate gaetan_h5torch
# install torch manually for your specific system
git clone https://github.com/gdewael/h5torch
cd h5torch
pip install -e .

Usage

The simplest use case is an ML setting with a 2-D matrix X as the central object and corresponding labels y aligned along its first axis.

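import h5torch
import numpy as np

# create a new HDF5 file for writing
f = h5torch.File("example.h5t", "w")
X = np.random.randn(100, 15)  # 100 samples, 15 features
y = np.random.rand(100)       # one label per sample
f.register(X, "central")      # X is the central object
f.register(y, 0, name = "y")  # y is aligned to axis 0 of the central object
f.close()

# open the file as a dataset, then fetch sample 5 and the number of samples
dataset = h5torch.Dataset("example.h5t")
dataset[5], len(dataset)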

Note that the labels y can also play the role of the central object. The two layouts are equivalent in this simple case.

import h5torch
import numpy as np
f = h5torch.File("example.h5t", "w")
X = np.random.randn(100, 15)
y = np.random.rand(100)
f.register(y, "central")
f.register(X, 0, name = "X")
f.close()

dataset = h5torch.Dataset("example.h5t")
dataset[5], len(dataset)
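
Conceptually, both variants above describe the same thing for an index i: element i of the central object together with element i of every object aligned to axis 0. The following is a minimal pure-Python sketch of that idea. It is not h5torch's implementation: the function name `sample_along_axis0` and the dictionary keys are hypothetical, and the real library reads from the HDF5 file and may return samples in a different format.

```python
def sample_along_axis0(central, aligned, i):
    """Return item i of the central object plus item i of each aligned object.

    Illustrative only: names and output format are assumptions.
    """
    sample = {"central": central[i]}
    for name, obj in aligned.items():
        # Every aligned object must have one entry per entry of the central object.
        if len(obj) != len(central):
            raise ValueError(f"{name!r} is not aligned to axis 0 of the central object")
        sample[name] = obj[i]
    return sample


X = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 samples, 2 features each
y = [0, 1, 0]                             # 3 labels

# X as central object with y aligned to axis 0 ...
a = sample_along_axis0(X, {"y": y}, 1)
# ... or y as central object with X aligned to axis 0.
b = sample_along_axis0(y, {"X": X}, 1)
```

In both cases the sample at index 1 carries the same information (the feature row and its label); only the role of "central" changes.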

The next example uses a 2-dimensional matrix Y (such as a score matrix) as the central object, with objects aligned to both of its axes. Y is stored and sampled in "coo" mode: the length of the dataset is the number of nonzero elements in the score matrix, and each sample consists of one such nonzero element together with the stored information for its row and column.

import h5torch
import numpy as np
f = h5torch.File("example.h5t", "w")

# sparse binary score matrix: roughly 5% of the entries are 1
Y = (np.random.rand(1000, 500) > 0.95).astype(int)
row_features = np.random.randn(1000, 15)  # one feature vector per row of Y
col_names = np.arange(500).astype(bytes)  # one byte-string name per column of Y

f.register(Y, "central", mode = "coo")
f.register(row_features, 0, name = "row_features")  # aligned to axis 0 (rows)
f.register(col_names, 1, name = "col_names")        # aligned to axis 1 (columns)
f.close()

dataset = h5torch.Dataset("example.h5t", sampling = "coo")
dataset[5], len(dataset)
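
To make the "coo" behaviour described above more concrete, here is a small pure-Python sketch on a toy matrix. It only illustrates the idea (one sample per nonzero element, plus the row and column information for that element). The helper names `coo_entries` and `coo_sample`, the output keys, and the row-major ordering of the entries are assumptions made for illustration, not h5torch's actual internals.

```python
def coo_entries(matrix):
    """List (row, col, value) for every nonzero entry, in row-major order."""
    return [
        (r, c, value)
        for r, row in enumerate(matrix)
        for c, value in enumerate(row)
        if value != 0
    ]


def coo_sample(entries, row_features, col_names, k):
    """Return the k-th nonzero entry with the info aligned to its row and column."""
    r, c, value = entries[k]
    return {
        "central": value,
        "row_features": row_features[r],
        "col_names": col_names[c],
    }


Y = [[0, 1, 0],
     [1, 0, 1]]
row_features = [[0.5, 0.1], [0.9, 0.3]]  # one entry per row of Y
col_names = ["a", "b", "c"]              # one entry per column of Y

entries = coo_entries(Y)
n_samples = len(entries)  # dataset length = number of nonzero elements
sample = coo_sample(entries, row_features, col_names, 2)
```

Here the toy matrix has three nonzero elements, so the sketch yields three samples rather than two (rows) or three-by-two (all cells).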

Note: h5torch does not limit the number of dimensions of its central data object, and hence also not the number of axes that objects can be aligned to.
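
The same alignment idea carries over to more than two axes. As a hedged illustration only, the pure-Python sketch below finds the nonzero entries of a 3-dimensional nested list and, for one of them, looks up the objects aligned to each axis. The helpers `nonzero_indices` and `nd_sample`, the example names, and the "axis/name" key format are all hypothetical and are not taken from h5torch.

```python
def nonzero_indices(nested, prefix=()):
    """Recursively collect the index tuples of nonzero entries in nested lists."""
    if not isinstance(nested, list):
        return [prefix] if nested != 0 else []
    found = []
    for i, item in enumerate(nested):
        found.extend(nonzero_indices(item, prefix + (i,)))
    return found


def nd_sample(index, aligned_by_axis):
    """For each axis, pick the aligned objects' entries at that axis's index."""
    return {
        f"{axis}/{name}": obj[index[axis]]
        for axis, objects in aligned_by_axis.items()
        for name, obj in objects.items()
    }


# A 2 x 2 x 3 central tensor with two nonzero entries.
T = [[[0, 0, 1], [0, 0, 0]],
     [[0, 0, 0], [1, 0, 0]]]
aligned_by_axis = {
    0: {"user": ["u0", "u1"]},        # aligned to axis 0 (length 2)
    1: {"item": ["i0", "i1"]},        # aligned to axis 1 (length 2)
    2: {"time": ["t0", "t1", "t2"]},  # aligned to axis 2 (length 3)
}

indices = nonzero_indices(T)
nd = nd_sample(indices[1], aligned_by_axis)
```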

Package roadmap

  • Implement typing
  • Provide data type conversion capabilities for registering datasets
  • Add support for custom samplers
  • Add support for making data splits
