Supporting a FAIR Research Data lifecycle using Python and HDF5.

HDF5 Research Data Management Toolbox

A Python package that helps researchers achieve sustainable FAIR (Findable, Accessible, Interoperable, Reusable) data management with HDF5 files. The toolbox provides a comprehensive approach to the research data lifecycle through five main stages: planning, collecting, analyzing, sharing, and reusing data.

Note: This project is actively under development.

✨ Quick Start

Semantic metadata is one of the core features of the toolbox and a key success factor for achieving FAIR data.

Let's see how we can use RDF (semantic metadata) together with HDF5 files using the toolbox:

# Install the package
pip install h5RDMtoolbox

import h5rdmtoolbox as h5tbx
import rdflib
import numpy as np

# Define the metadata4ing (m4i) namespace used for the semantic annotations
M4I = rdflib.Namespace("http://w3id.org/nfdi4ing/metadata4ing#")

# Create a new HDF5 file with FAIR metadata
with h5tbx.File("example.h5", "w") as h5:
    h5.create_dataset("temperature", data=np.array([20, 21, 19, 22]))
    h5.attrs["units", M4I.hasUnit] = "degree_Celsius"
    h5.rdf["units"].object = "http://qudt.org/vocab/unit/DEG_C"
    h5.attrs["description", "https://schema.org/description"] = "Room temperature measurements"

    ttl = h5.serialize("ttl")

The Turtle (ttl) output is the RDF serialization of the HDF5 file's structure and metadata (array data are not included):

@prefix hdf: <http://purl.allotrope.org/ontologies/hdf5/1.8#> .
@prefix m4i: <http://w3id.org/nfdi4ing/metadata4ing#> .
@prefix schema: <https://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

hdf:H5T_INTEL_I64 a hdf:Datatype .

[] a hdf:File ;
    hdf:rootGroup [ a hdf:Group ;
            hdf:attribute [ a hdf:StringAttribute ;
                    hdf:data "Room temperature measurements" ;
                    hdf:name "description" ],
                [ a hdf:StringAttribute ;
                    hdf:data "degree_Celsius" ;
                    hdf:name "units" ] ;
            hdf:member [ a hdf:Dataset ;
                    hdf:dataspace [ a hdf:SimpleDataspace ;
                            hdf:dimension [ a hdf:DataspaceDimension ;
                                    hdf:dimensionIndex 0 ;
                                    hdf:size 4 ] ] ;
                    hdf:datatype hdf:H5T_INTEGER,
                        hdf:H5T_INTEL_I64 ;
                    hdf:layout hdf:H5D_CONTIGUOUS ;
                    hdf:maximumSize 4 ;
                    hdf:name "/temperature" ;
                    hdf:rank 1 ;
                    hdf:size 4 ] ;
            hdf:name "/" ;
            m4i:hasUnit <http://qudt.org/vocab/unit/DEG_C> ;
            schema:description "Room temperature measurements" ] .

🚀 Key Features

🔗 HDF5 + Xarray Integration: Seamless access to metadata during data analysis with native xarray support
🏷️ Persistent Identifiers: Assign globally unique identifiers using RDF triples for FAIR compliance
📋 Standardized Conventions: Define and enforce community-specific metadata standards
☁️ Repository Integration: Direct upload to Zenodo and other research repositories
🗄️ Database Support: Use HDF5 files with MongoDB or native HDF5 databases
🔒 Semantic Enrichment: Add RDF-based semantic meaning to your data
🌐 Catalog Integration: Search and discover datasets through SPARQL-based catalogs

For example code and further details, check out the 📚 full Documentation.

Who is the package for?

For everybody who ...

... is looking for a management approach for their data.
... belongs to a community that has not yet established a stable convention.
... works with small or big data that fits into HDF5 files.
... is looking for an easy way to work with HDF5, especially through Jupyter Notebooks.
... is trying to integrate HDF5 with repositories and databases.
... wishes to enrich data semantically with the RDF standard.
... is looking for a way to do all of the above without needing to learn a new syntax.
... is new to HDF5 and wants to learn about it, especially with respect to the FAIR principles and data management.

Who is it not for?

For everybody who ...

... is looking for a management approach that at the same time allows high-performance and/or parallel work with HDF5.
... already has well-established conventions and management approaches in their community.

Package Architecture/structure

The toolbox implements six modules, which are shown below. The numbers refer to their main usage in the stages of the data lifecycle shown here. The wrapper module implements the main interface between the user and the HDF5 file and extends the features of the underlying h5py library. Some of these features are implemented in other modules, so the wrapper module depends on the convention, database, and linked data (ld) modules.

Current implementation highlights in the modules:

The wrapper module adds functionality on top of the h5py package. It allows including so-called standard names, which are defined in conventions, and it implements interfaces such as the one to the xarray package, which carries metadata from HDF5 to the user. Other high-level interfaces, such as .rdf, allow assigning semantic information to the HDF5 file.
For the database module, hdfDB and mongoDB are implemented. The hdfDB module allows using HDF5 files as a database. The mongoDB module allows using MongoDB as a database by mapping the metadata of HDF5 files to the database.
For the repository module, a Zenodo interface is implemented. Zenodo is a repository that allows uploading and downloading data with a persistent identifier.
For the convention module, the standard attributes are implemented.
The layout module allows defining expectations on the internal layout (object names, location, attributes, properties) of HDF5 files.
The catalog module allows interfacing with HDF5 and RDF data published on Zenodo. Via a catalog file, providers describe the data in the various Zenodo records they want to share. Through the catalog file, users can work with the data without downloading the full HDF5 files first.
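
To illustrate the idea of using an HDF5 file as a database, here is a sketch in plain h5py and NumPy. It does not use the toolbox's hdfDB interface; the helper find_datasets_by_attr, the file name, and the file content are hypothetical and only show how datasets can be located by their attributes.

```python
import os
import tempfile

import h5py
import numpy as np


def find_datasets_by_attr(filename, attr_name, attr_value):
    """Return the sorted names of all datasets whose attribute equals attr_value."""
    matches = []

    def visitor(name, obj):
        # visititems passes names relative to the group it is called on.
        if isinstance(obj, h5py.Dataset) and obj.attrs.get(attr_name) == attr_value:
            matches.append("/" + name)

    with h5py.File(filename, "r") as h5:
        h5.visititems(visitor)
    return sorted(matches)


# Build a small example file (hypothetical content).
tmpdir = tempfile.mkdtemp()
filename = os.path.join(tmpdir, "example_db.h5")
with h5py.File(filename, "w") as h5:
    temperature = h5.create_dataset("temperature", data=np.array([20, 21, 19, 22]))
    temperature.attrs["units"] = "degree_Celsius"
    pressure = h5.create_dataset("run1/pressure", data=np.array([1.0, 1.1]))
    pressure.attrs["units"] = "Pa"

found = find_datasets_by_attr(filename, "units", "degree_Celsius")
print(found)
```

The query walks the whole file on every call, which is acceptable for a sketch but is one reason a dedicated database interface can be preferable for many files.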

Quickstart

A quickstart notebook can be tested by clicking on the following badge:

Documentation

Comprehensive documentation with many examples is available here or by clicking on the image, which shows the research data lifecycle in the center and the respective toolbox features on the outside:

A paper is published in the journal ing.grid.

Installation

Use Python 3.9 or higher (automated tests cover versions up to 3.13). Regular users can install the package via pip:

pip install h5RDMtoolbox

Install from source:

Developers may clone the repository and install the package from source. Clone the repository first:

git clone --branch main https://github.com/matthiasprobst/h5RDMtoolbox.git

Then, run

pip install h5RDMtoolbox/

Add --user if you do not have root access.

For development installation run

pip install -e h5RDMtoolbox/
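
After installing, it can be useful to confirm that the interpreter meets the version requirement and that the distribution is visible to Python. The following sketch uses only the standard library; the helper installed_version is a hypothetical name, and the distribution name is taken from the pip command above.

```python
import sys
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


python_ok = sys.version_info >= (3, 9)
version = installed_version("h5RDMtoolbox")
print(f"Python >= 3.9: {python_ok}")
print(f"h5RDMtoolbox version: {version or 'not installed'}")
```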

Dependencies

The core functionality depends on the following packages. Some of them are for general management; others are specific to the features of the package:

General dependencies are ...

numpy: Scientific computing, handling of arrays
matplotlib: Plotting
appdirs: Managing user and application directories
packaging: Version handling
IPython: Pretty display of data in notebooks
regex: Working with regular expressions

Specific to the package are ...

h5py: HDF5 file interface
xarray: Working with scientific arrays in combination with attributes. Allows carrying metadata from HDF5 to user
pint: Allows working with units
pint_xarray: Working with units for usage with xarray
python-forge: Used to update function signatures when using the standard attributes
pydantic: Used to validate standard attributes
pyyaml: Reading and writing of YAML files, e.g. metadata definitions (conventions). Note: older pyyaml versions conflict with Python 3.11
requests: Used to download files from the internet or validate URLs, e.g. metadata definitions (conventions)
rdflib: Used to enable working with RDF
ontolutils: Required to work with RDF and derive semantic description of HDF5 file content

Optional dependencies

To run unit tests or to enable certain features, additional dependencies must be installed.

Install optional dependencies by specifying them in square brackets after the package name, e.g.:

pip install h5RDMtoolbox[mongodb]

[mongodb]

pymongo: Database solution for HDF5 files

[csv]

pandas: Mainly used for reading csv and pretty printing

[snt]

xmltodict: Reading of xml files
tabulate: Pretty printing of tables
python-gitlab: Access to gitlab repositories
pypandoc: Conversion of markdown files to html
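
To see which extras are already usable in the current environment, one can check whether the corresponding modules are importable. This is a standard-library sketch, not part of the toolbox; the EXTRAS mapping and the helper missing_modules are hypothetical, and the import names are assumed from the packages listed above (python-gitlab is imported as gitlab).

```python
from importlib.util import find_spec

# Import names of the optional dependencies, grouped by extra.
# Note: the python-gitlab distribution is imported as "gitlab".
EXTRAS = {
    "mongodb": ["pymongo"],
    "csv": ["pandas"],
    "snt": ["xmltodict", "tabulate", "gitlab", "pypandoc"],
}


def missing_modules(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if find_spec(name) is None]


for extra, modules in EXTRAS.items():
    missing = missing_modules(modules)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"[{extra}] {status}")
```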

Citing the package

If you intend to use the package in your work, you may cite the software itself as published in the Zenodo repository (latest version). A related paper is published in the journal ing.grid. Thank you!

Alternatively or additionally, you can consult the CITATION.cff file.

Here is the BibTeX entry:

@article{probst2024h5rdmtoolbox,
	author = {Matthias Probst and Balazs Pritz},
	title = {h5RDMtoolbox - A Python Toolbox for FAIR Data Management around HDF5},
	volume = {2},
	year = {2024},
	url = {https://www.inggrid.org/article/id/4028/},
	issue = {1},
	doi = {10.48694/inggrid.4028},
	month = {8},
	keywords = {Data management,HDF5,metadata,data lifecycle,Python,database},
	issn = {2941-1300},
	publisher={Universitäts- und Landesbibliothek Darmstadt},
	journal = {ing.grid}
}

Contribution

Feel free to contribute. Make sure to write docstrings for your methods and classes, and follow PEP 8 (https://peps.python.org/pep-0008/).

Please write tests for your code and put them into the test/ folder. Visit the README file in the test folder for more information.

Please also add a Jupyter notebook in the docs/ folder to document your code. Visit the README file in the docs folder for more information on how to compile the documentation.

Please use the numpy style for the docstrings: https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_numpy.html#example-numpy
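
As a reminder of what the numpy docstring style looks like, here is a short sketch. The function celsius_to_kelvin is a hypothetical example and not part of the toolbox.

```python
def celsius_to_kelvin(temperature):
    """Convert a temperature from degree Celsius to Kelvin.

    Parameters
    ----------
    temperature : float
        Temperature in degree Celsius.

    Returns
    -------
    float
        Temperature in Kelvin.

    Raises
    ------
    ValueError
        If the temperature is below absolute zero.
    """
    if temperature < -273.15:
        raise ValueError("temperature is below absolute zero")
    return temperature + 273.15
```

Each section header is underlined with dashes of the same length, and parameter descriptions are indented below the "name : type" line.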
