pythonic ORM tool for reading and writing HDF5 data

h5mapper

h5mapper is a pythonic ORM-like tool for reading and writing HDF5 data.

It is built on top of h5py and lets you define types of .h5 files as python classes which you can then easily create from raw sources (e.g. files, urls...), serve (use as Dataset for a Dataloader), or dynamically populate (logs, checkpoints of an experiment).

Installation

`pip`

h5mapper is on PyPI. To install it, run

pip install h5mapper

developer install

To play around with the internals of the package, first

git clone https://github.com/ktonal/h5mapper.git

and then

pip install -e h5mapper/

which installs the repo in editable mode.

Quickstart

TypedFile

h5m assumes that you want to store collections of contiguous arrays in single datasets and that you want several such concatenated datasets in a file.

Thus, TypedFile allows you to create and read files that maintain a 2-d reference system, where contiguous arrays are stored within features and indexed by their source's id.

Such a file might then look like

<Experiment "experiment.h5">
----------------------------------------------------> sources' ids axis
|                   "planes/01.jpeg"  |     "train"
|                                     |
|   data/                             |
|        images/        (32, 32)      |       None
|        labels/        (1, )         |       None
|   logs/                             |
|        loss/           None         |       (10000,)
|        ...
V
features axis

where the entries correspond to the shapes of arrays or their absence (None).

Note that this is a different approach from storing each file or image in a separate dataset. In that case, there would be an h5py.Dataset located at data/images/planes/01.jpeg, whereas in our example the only dataset is at data/images/ and one of its regions is indexed by the id "planes/01.jpeg".
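
To make the difference concrete, here is a minimal pure-Python sketch of the two layouts. It is not h5mapper's implementation: `ConcatFeature`, its `add`/`get` methods and the `refs` dict are illustrative stand-ins, and plain lists replace the HDF5 dataset.

```python
# Layout 1: one "dataset" per source (what h5mapper does NOT do).
per_source = {
    "data/images/planes/01.jpeg": [[0, 1], [2, 3]],
    "data/images/planes/02.jpeg": [[4, 5]],
}


# Layout 2: a single concatenated "dataset" plus an index of regions.
class ConcatFeature:
    """Toy stand-in for one feature (hypothetical, for illustration only)."""

    def __init__(self):
        self.rows = []   # the one concatenated dataset, rows along the first axis
        self.refs = {}   # source id -> (start, stop) region inside `rows`

    def add(self, source_id, array):
        start = len(self.rows)
        self.rows.extend(array)
        self.refs[source_id] = (start, len(self.rows))

    def get(self, source_id):
        start, stop = self.refs[source_id]
        return self.rows[start:stop]


images = ConcatFeature()
images.add("planes/01.jpeg", [[0, 1], [2, 3]])   # occupies rows 0..1
images.add("planes/02.jpeg", [[4, 5]])           # occupies row 2

first = images.get("planes/01.jpeg")
second = images.get("planes/02.jpeg")
total_rows = len(images.rows)
```

In the second layout the number of datasets stays constant however many sources are added; only the region index grows.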

For interacting with files that follow this particular structure, simply define a class

import h5mapper as h5m

class Experiment(h5m.TypedFile):

    data = h5m.Group(
            # your custom h5m.Feature classes:
            images=Image(),
            labels=DirLabels()
            )
    logs = h5m.Group(
            loss=h5m.Array()
            )

`create`, `add`

Now create an instance, load data from files through parallel jobs, and add data on the fly:

# create instance from raw sources
exp = Experiment.create("experiment.h5",
        # those are then used as ids :
        sources=["planes/01.jpeg", "planes/02.jpeg"],
        n_workers=8)
...
# add id <-> data on the fly :
exp.logs.add("train", dict(loss=losses_array))
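
The idea of loading sources through parallel jobs and keying the results by their ids can be pictured with the standard library alone. This is only a sketch under assumptions: `load` and `create` are hypothetical helpers, a thread pool stands in for whatever workers h5mapper actually uses, and a dict stands in for the .h5 file.

```python
from concurrent.futures import ThreadPoolExecutor


def load(source):
    """Hypothetical loader: the 'array' is just the character codes of the path."""
    return [ord(c) for c in source]


def create(sources, n_workers=2):
    """Load every source in a worker pool and index the results by source id."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        arrays = list(pool.map(load, sources))   # map() preserves input order
    return dict(zip(sources, arrays))


store = create(["planes/01.jpeg", "planes/02.jpeg"], n_workers=2)

# adding id <-> data on the fly, after creation
store["train"] = [0.9, 0.5, 0.2]
```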

`get`, `refs` and `getitem`

There are 3 main options to read data from a TypedFile or one of its Proxies:

1/ By their id

>> exp.logs.get("train")
Out: {"loss": np.array([...])}
# which, in this case, is equivalent to 
>> exp.logs["train"]
Out: {"loss": np.array([...])}
# because `exp.logs` is a Group and Groups only support id-based indexing

2/ By the index of their ids through their refs attribute :

>> exp.data.images[exp.data.images.refs[0]].shape
Out: (32, 32)

Which works because exp.data.images is a Dataset and only Datasets have refs

3/ with any item supported by the h5py.Dataset

>> exp.data.labels[:32]
Out: np.array([0, 0, ....])

Which only works for Datasets - not for Groups.

Note that, in this last case, you are indexing into the concatenation of all sub-arrays along their first axis.
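
The three read paths can be mimicked in a few lines of plain Python. The sketch below is an assumption-laden toy, not h5mapper's Proxy: `ToyDataset` is a hypothetical class, a list replaces the h5py.Dataset, and `refs` is modelled as a list of slices.

```python
class ToyDataset:
    """Toy concatenated dataset supporting id, ref and raw indexing."""

    def __init__(self, arrays_by_id):
        self.data = []        # concatenation of all sub-arrays along the first axis
        self._regions = {}    # source id -> slice into `data`
        for source_id, rows in arrays_by_id.items():
            start = len(self.data)
            self.data.extend(rows)
            self._regions[source_id] = slice(start, len(self.data))
        # regions in insertion order, addressable by the index of their id
        self.refs = list(self._regions.values())

    def __getitem__(self, item):
        if isinstance(item, str):      # 1/ by id: translate to a region first
            item = self._regions[item]
        return self.data[item]         # 2/ a ref or 3/ any int or slice


labels = ToyDataset({"a.jpeg": [0, 0], "b.jpeg": [1, 1, 1]})

by_id = labels["b.jpeg"]            # only the rows of "b.jpeg"
by_ref = labels[labels.refs[0]]     # rows of the first id
by_slice = labels[1:4]              # crosses the boundary between the two ids
```

The last lookup shows why raw indexing is said to operate on the concatenation: the slice returns rows belonging to both sources.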

The same interface is also implemented for set(source, data) and __setitem__

Feature

h5m exposes a class that helps you configure the behaviour of your TypedFile classes and the properties of the .h5 they create.

The Feature class helps you define:

how sources' ids are loaded into arrays (feature.load(source))
which types of files are supported
how the data is stored by h5py (compression, chunks)
which extraction parameters need to be stored with the data (e.g. sample rate of audio files)
custom-methods relevant to this kind of data

Once you have defined a Feature class, attach it to the class dict of a TypedFile, and that's it!

For example :

import h5mapper as h5m


class MyFeature(h5m.Feature):

    # only sources matching this pattern will be passed to load(...)
    __re__ = r"\.special$"

    # args for the h5py.Dataset
    __ds_kwargs__ = dict(compression='lzf', chunks=(1, 350))
    
    def __init__(self, my_extraction_param=0):
        self.my_extraction_param = my_extraction_param

    @property
    def attrs(self):
        # those are then written in the h5py.Group.attrs
        return {"p": self.my_extraction_param}

    def load(self, source):
        """your method to get an np.ndarray or a dict thereof
        from a path, an url, whatever sources you have..."""   
        return data

    def plot(self, data):
        """custom plotting method for this kind of data"""
        # ...

# attach it
class Data(h5m.TypedFile):
    feat = MyFeature(47)

# load sources...
f = Data.create(....)

# read your data through __getitem__ 
batch = f.feat[4:8]

# access your method 
f.feat.plot(batch)

# modify the file through __setitem__
f.feat[4:8] = batch ** 2
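
How a pattern such as the `__re__` attribute above selects sources can be checked with the standard `re` module. Whether h5mapper applies the pattern with `re.search` or another matching function is an assumption here; the source list is made up for illustration.

```python
import re

sources = ["a.special", "b.jpeg", "dir/c.special", "d.special.bak"]

# pattern with the dot escaped: only a literal ".special" suffix matches
strict = re.compile(r"\.special$")
accepted = [s for s in sources if strict.search(s)]

# an unescaped "." matches any character, so this pattern is looser
loose = re.compile(r".special$")
loose_hit = loose.search("notspecial") is not None
strict_hit = strict.search("notspecial") is not None
```

Only the sources in `accepted` would be handed to `load` under this assumption; the others are skipped by this feature.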

For more examples, check out h5mapper/h5mapper/features.py.

`serve`

Primarily designed with PyTorch users in mind, h5m plays very nicely with the Dataset class:

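import torch
import h5mapper as h5m

# MyInputFeature and MyLabelFeature stand for your own h5m.Feature subclasses
class MyDS(h5m.TypedFile, torch.utils.data.Dataset):
    
    x = MyInputFeature(42)
    labels = MyLabelFeature()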
    
    def __getitem__(self, item):
        return self.x[item], self.labels[item]
  
    def __len__(self):
        return len(self.x)

ds = MyDS.create("train.h5", sources, keep_open=True)

dl = torch.utils.data.DataLoader(ds, batch_size=16, num_workers=8, pin_memory=True)
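
The reason this works is that a map-style dataset only has to provide `__getitem__` and `__len__`. The sketch below shows, in plain Python, roughly what a sequential single-process loader does with such an object. `Pairs` and `iter_batches` are hypothetical names, and the real DataLoader additionally handles shuffling, worker processes and collating items into tensors.

```python
class Pairs:
    """Toy map-style dataset returning (input, label) pairs."""

    def __init__(self, xs, labels):
        self.xs = xs
        self.labels = labels

    def __getitem__(self, item):
        return self.xs[item], self.labels[item]

    def __len__(self):
        return len(self.xs)


def iter_batches(dataset, batch_size):
    """Yield lists of items in order; the last batch may be shorter."""
    batch = []
    for i in range(len(dataset)):
        batch.append(dataset[i])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


ds = Pairs([10, 11, 12, 13, 14], [0, 1, 0, 1, 0])
batches = list(iter_batches(ds, batch_size=2))
```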

TypedFile even has a serve method that takes the DataLoader args and a batch object filled with BatchItems, and returns a DataLoader that will yield such batch objects.

Example :

f = h5m.TypedFile("train.h5", keep_open=True)
loader = f.serve(
    # batch object :
    dict(
        x=h5m.Input(key='data/image', getter=h5m.GetId()),
        labels=h5m.Target(key='data/labels', getter=h5m.GetId())
    ),
    # Dataloader kwargs :
    num_workers=8, pin_memory=True, batch_size=32, shuffle=True
)

Examples

In h5mapper/examples you'll find, for now:

a train script with data, checkpoints and logs in dataset_and_logs.py
a script for benchmarking batch-loading times of different options

Development

h5mapper is just getting started and you're welcome to contribute!

You'll find some tests you can run from the root of the repo with a simple

pytest

If you'd like to get involved, just drop us an email : ktonalberlin@gmail.com

License

h5mapper is distributed under the terms of the MIT License.

Release history

0.3.3 (Jan 10, 2025)
0.3.2 (Apr 26, 2023)
0.3.1 (Apr 13, 2023)
0.3.0 (Apr 7, 2023)
0.2.4 (Nov 19, 2022)
0.2.3 (May 24, 2022)
0.2.2 (Apr 27, 2022)
0.2.1 (Mar 5, 2022)
0.2.0 (Feb 16, 2022)
0.1.0 (Jul 2, 2021)
