HDF5 data utilities for PyTorch

h5torch

h5torch consists of two main parts: (1) h5torch.File, a wrapper around h5py.File that serves as an interface for creating HDF5 files compatible with (2) h5torch.Dataset, a wrapper around torch.utils.data.Dataset. As a library, h5torch establishes a "code" for linking h5py and torch. To do this, the package formulates a vocabulary for how datasets generally look, unifying as many ML settings as it can. In turn, this vocabulary allows data from a wide range of machine learning settings to be loaded with a single dataset class definition, reducing boilerplate in your projects.

Who is this package for?

Loading data from HDF5 files allows samples to be read efficiently from an on-disk format, drastically reducing memory overhead compared with holding the whole dataset in memory. Additionally, you will find your datasets to be more organized in the HDF5 format, as everything is neatly stored in a single file.

If you want to use this package but are not sure your use-case is covered by the current formulation of the package, feel free to open an issue.

Install

Since PyTorch is a dependency of h5torch, we recommend installing PyTorch independently first, as your system may require a specific build (e.g. one matching your CUDA drivers).

After PyTorch is installed, h5torch can be installed using pip:

pip install h5torch

Package concepts

Storing

The main idea behind h5torch is that datasets can usually be formulated as being aligned to a central object. E.g. in a classical supervised learning setup, features/inputs are aligned to a label vector/matrix. In recommender systems, a score matrix is the central object, with features aligned to rows and columns.

h5torch allows creating and reading HDF5 datasets for use in PyTorch following this principle. When creating a new dataset, the first data object that should be registered is the central object. The type of the central object is flexible:

N-D: For regular dense data. The number of dimensions in this object dictates how many aligned axes can exist.
coo: The sparse variant of N-D. The number of dimensions can be arbitrarily high.
csr: For sparse 2D arrays. This central data type can only have 2 aligned axes and can only be sampled along the first dimension.
vlen: For variable-length 1D arrays. This central data type can only have one aligned axis (0).
separate: For objects that are better stored in separate groups instead of as one dataset. An example is variable-shape N-D objects such as variably-sized images. This central data type can only have one aligned axis (0).

Axis objects can be aligned to the axes of this central object. The length of the first dimension of any axis object must equal the length of the central object along the axis it is aligned to. For example, a central data object of shape (50, 40) can only have objects of length 50 aligned to its first axis and objects of length 40 aligned to its second axis. For axis objects, these possibilities are available:

N-D: Can have an arbitrary number of dimensions. E.g. equally-sized images: (N, 3, H, W).
csr: Max 2 dimensions; rows will be sampled. E.g. a sparse scRNA-seq count matrix.
vlen: Variable-length 1D arrays. E.g. tokenized text as variable-length arrays of integers.
separate: For objects that are better stored in separate groups instead of as one dataset. An example is variable shape N-D objects such as variably-sized images.
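
The alignment rule described above can be illustrated with a small, library-independent sketch. The function name check_alignment is hypothetical and not part of h5torch; it only restates the shape constraint in code.

```python
def check_alignment(central_shape, axis, object_length):
    """Return True if an object with `object_length` entries along its
    first dimension can be aligned to `axis` of the central object.

    Hypothetical helper for illustration only; not part of h5torch.
    """
    if not 0 <= axis < len(central_shape):
        raise ValueError(f"central object has no axis {axis}")
    return central_shape[axis] == object_length


central_shape = (50, 40)
print(check_alignment(central_shape, 0, 50))  # True: axis 0 has length 50
print(check_alignment(central_shape, 1, 40))  # True: axis 1 has length 40
print(check_alignment(central_shape, 0, 40))  # False: axis 0 has length 50
```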

Note that the coo data type is not supported for aligned objects, because aligned axis objects require efficient indexing along their first dimension.
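
To see why first-dimension indexing differs between the two sparse layouts, here is a minimal pure-Python sketch of the generic csr and coo formats. It is not how h5torch stores or reads data internally; the variable and function names are made up for illustration.

```python
# The same 3x3 matrix in two sparse layouts:
# [[0, 5, 0],
#  [0, 0, 0],
#  [7, 0, 9]]

# csr: non-zero values stored row by row, plus one pointer per row boundary.
csr_data = [5, 7, 9]
csr_indices = [1, 0, 2]      # column index of each stored value
csr_indptr = [0, 1, 1, 3]    # row i occupies data[indptr[i]:indptr[i + 1]]

# coo: one (row, column, value) triple per non-zero, in no guaranteed order.
coo_rows = [0, 2, 2]
coo_cols = [1, 0, 2]
coo_vals = [5, 7, 9]


def csr_row(i):
    """Fetch row i with a single slice: two pointer lookups, no scan."""
    start, end = csr_indptr[i], csr_indptr[i + 1]
    return list(zip(csr_indices[start:end], csr_data[start:end]))


def coo_row(i):
    """Fetch row i by scanning every stored entry."""
    return [(c, v) for r, c, v in zip(coo_rows, coo_cols, coo_vals) if r == i]


print(csr_row(2))  # [(0, 7), (2, 9)]
print(coo_row(2))  # [(0, 7), (2, 9)]
print(csr_row(1))  # []
```

Both functions return the same (column, value) pairs, but the csr lookup touches only the requested row, whereas the coo lookup in this sketch has to inspect all stored entries.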

Also note that there is no limit on the number of data objects aligned to an axis. For example, in the case of images aligned to a central label vector, extra information about every image can be added, such as its URL, the date it was taken, or its geolocation.

Besides the central and axis objects, you can also store unstructured data, which can be of any length or dimension and can follow any of the above-mentioned data types (including coo). This could, for example, be a vocabulary vector or the names of classes.

Sampling

Once a dataset is created using h5torch.File, it can be used as a PyTorch Dataset using h5torch.Dataset. Sampling can occur along any of the axes in the central object, upon which the corresponding indices in the objects aligned to that axis are also sampled. Alternatively, coo sampling (available for N-D and coo-type central objects) samples one specific element of the central dataset, along with the corresponding indices of all axis-aligned objects.
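
The two sampling modes can be sketched with a toy in-memory class. This is a conceptual illustration under stated assumptions, not the h5torch.Dataset implementation: the class name ToyAlignedDataset, its constructor arguments, and the "axis/name" keys in the returned dictionaries are all hypothetical, and the central object is restricted to a dense 2D list of lists.

```python
class ToyAlignedDataset:
    """Toy model of axis sampling and coo sampling (not h5torch itself)."""

    def __init__(self, central, aligned, sampling=0):
        self.central = central    # dense 2D central object (list of rows)
        self.aligned = aligned    # {axis: {name: list aligned to that axis}}
        self.sampling = sampling  # 0, 1, or "coo"

    def __len__(self):
        n_rows, n_cols = len(self.central), len(self.central[0])
        if self.sampling == "coo":
            return n_rows * n_cols  # one sample per central element
        return n_rows if self.sampling == 0 else n_cols

    def _aligned_at(self, axis, idx):
        # Pick entry `idx` from every object aligned to `axis`.
        return {f"{axis}/{name}": values[idx]
                for name, values in self.aligned.get(axis, {}).items()}

    def __getitem__(self, idx):
        if self.sampling == "coo":
            # Map the flat index to one (row, column) element.
            i, j = divmod(idx, len(self.central[0]))
            sample = {"central": self.central[i][j]}
            sample.update(self._aligned_at(0, i))
            sample.update(self._aligned_at(1, j))
            return sample
        if self.sampling == 0:
            sample = {"central": self.central[idx]}                     # one row
        else:
            sample = {"central": [row[idx] for row in self.central]}    # one column
        sample.update(self._aligned_at(self.sampling, idx))
        return sample


# Recommender-style example: a 2 x 3 score matrix with user and item names.
scores = [[5, 0, 3],
          [1, 4, 0]]
aligned = {0: {"user": ["ann", "bob"]},
           1: {"item": ["book", "film", "game"]}}

print(ToyAlignedDataset(scores, aligned, sampling=0)[1])
# {'central': [1, 4, 0], '0/user': 'bob'}
print(ToyAlignedDataset(scores, aligned, sampling=1)[2])
# {'central': [3, 0], '1/item': 'game'}
print(ToyAlignedDataset(scores, aligned, sampling="coo")[4])
# {'central': 4, '0/user': 'bob', '1/item': 'film'}
```

Sampling along axis 0 yields one row of the score matrix with that user's aligned data, sampling along axis 1 yields one column with that item's data, and coo sampling yields a single score together with both the user and the item it belongs to.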

Usage

Refer to the tutorial on the documentation page.

Package roadmap

Implement typing
Provide data type conversion capabilities for registering datasets
Add support for custom samplers
Add support for making data splits
Add a slice sampler
Implement a way to pre-specify dataset size and append to it
Add better docs
Add tests
Implement a collater for variable length objects
Benchmarks
