Supporting a FAIR Research Data lifecycle using Python and HDF5.

HDF5 Research Data Management Toolbox

Note that the project is still under development!

The "HDF5 Research Data Management Toolbox" (h5RDMtoolbox) is a Python package that supports everybody working with HDF5 in achieving a sustainable data lifecycle that follows the FAIR (Findable, Accessible, Interoperable, Reusable) principles. It specifically supports the five main steps of planning, collecting, analyzing, sharing and reusing data. Please visit the documentation for detailed information or try the quickstart using Colab.

Highlights

Who is the package for?

For everybody who is ...

  • ... looking for a management approach for his or her data.
  • ... part of a community that has not yet established a stable convention.
  • ... working with small or big data that fits into HDF5 files.
  • ... looking for an easy way to work with HDF5, especially through Jupyter Notebooks.
  • ... trying to integrate HDF5 with repositories and databases.
  • ... wishing to enrich data semantically with the RDF standard.
  • ... looking for a way to do all of the above while not needing to learn a new syntax.
  • ... new to HDF5 and wanting to learn about it, especially with respect to the FAIR principles and data management.

Who is it not for?

For everybody who ...

  • ... is looking for a management approach which at the same time allows high-performance and/or parallel work with HDF5.
  • ... already has well-established conventions and management approaches in his or her community.

Package Architecture/structure

The toolbox implements five modules, which are shown below. The numbers refer to their main usage in the stages of the data lifecycle above. Except for the wrapper module, which uses the convention module, all modules are independent of each other.

H5TBX modules

Current implementation highlights in the modules:

  • The wrapper module adds functionality on top of the h5py package. It allows including so-called standard names, which are defined in conventions, and it implements interfaces, such as to the xarray package, which allows carrying metadata from HDF5 to the user. Other high-level interfaces like .rdf allow assigning semantic information to the HDF5 file.
  • For the database module, hdfDB and mongoDB are implemented. The hdfDB module allows using HDF5 files as a database. The mongoDB module allows using MongoDB as a database by mapping the metadata of HDF5 files to the database.
  • For the repository module, a Zenodo interface is implemented. Zenodo is a repository that allows uploading and downloading data with a persistent identifier.
  • For the convention module, the standard attributes are implemented.
  • The layout module allows defining expectations on the internal layout (object names, location, attributes, properties) of HDF5 files.
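
To illustrate what "expectations on the internal layout" can mean in practice, here is a minimal conceptual sketch. It does not use the toolbox API: a plain dictionary stands in for an HDF5 file, and the names `toy_file`, `expectations` and `validate` are hypothetical and chosen only for this illustration.

```python
# Toy stand-in for an HDF5 file: object path -> attributes of that object.
# A real check would read this information from the file itself.
toy_file = {
    "/": {"title": "demo"},
    "/velocity": {"units": "m/s"},
    "/pressure": {},
}

# Hypothetical expectations: each entry names an object that must exist
# and lists the attributes it must carry.
expectations = {
    "/velocity": ["units"],
    "/pressure": ["units"],
    "/time": [],
}


def validate(objects, expected):
    """Return a list of human-readable violations (empty if there are none)."""
    problems = []
    for path, required_attrs in expected.items():
        if path not in objects:
            problems.append(f"missing object {path}")
            continue
        for attr in required_attrs:
            if attr not in objects[path]:
                problems.append(f"{path} lacks attribute '{attr}'")
    return problems


for problem in validate(toy_file, expectations):
    print(problem)
```

The point of the sketch is only the separation of concerns: the expectations are declared once, independently of any particular file, and can then be checked against many files.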

Quickstart

A quickstart notebook can be tested by clicking on the following badge:

Open Quickstart Notebook

Documentation

Please find comprehensive documentation with many examples here or by clicking on the image, which shows the research data lifecycle in the center and the respective toolbox features on the outside:

A paper is published in the journal inggrid.

Installation

Use Python 3.8 or higher (automatic testing is performed up to 3.12). If you are a regular user, you can install the package via pip:

pip install h5RDMtoolbox
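
If you are unsure whether your interpreter meets the stated minimum of Python 3.8, a small check like the following may help. It is a generic sketch that uses only the standard library; the helper name `python_is_supported` is hypothetical and not part of the package.

```python
import sys

MIN_PYTHON = (3, 8)  # minimum version stated in this README


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if not python_is_supported():
    print(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]} or higher is required.")
```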

Install from source:

Developers may clone the repository and install the package from source. Clone the repository first:

git clone --branch main https://github.com/matthiasprobst/h5RDMtoolbox.git

Then, run

pip install h5RDMtoolbox/

Add --user if you do not have root access.

For a development (editable) installation, run

pip install -e h5RDMtoolbox/

Dependencies

The core functionality depends on the following packages. Some of them are for general management; others are very specific to the features of the package.

General dependencies are ...

  • numpy>=1.20: Scientific computing, handling of arrays
  • matplotlib>=3.5.2: Plotting
  • appdirs>=1.4.4: Managing user and application directories
  • packaging: Version handling
  • IPython>=8.4.0: Pretty display of data in notebooks
  • regex>=2020.7.9: Working with regular expressions

Specific to the package are ...

  • h5py=3.7.0: HDF5 file interface
  • xarray>=2022.3.0: Working with scientific arrays in combination with attributes. Allows carrying metadata from HDF5 to user
  • pint>=0.19.2: Allows working with units
  • pint_xarray>=0.2.1: Working with units for usage with xarray
  • python-forge==18.6.0: Used to update function signatures when using the standard attributes
  • pydantic: Used to validate standard attributes
  • pyyaml>6.0.0: Reading and writing of YAML files, e.g. metadata definitions (conventions). Note that lower versions conflict with Python 3.11
  • requests: Used to download files from the internet or validate URLs, e.g. metadata definitions (conventions)
  • rdflib: Used to enable working with RDF
  • ontolutils: Required to work with RDF and derive semantic description of HDF5 file content

Optional dependencies

To run unit tests or to enable certain features, additional dependencies must be installed.

Install optional dependencies by specifying them in square brackets after the package name, e.g.:

pip install h5RDMtoolbox[mongodb]

[mongodb]

  • pymongo>=4.2.0: Database solution for HDF5 files

[csv]

  • pandas>=1.4.3: Mainly used for reading csv and pretty printing

[snt]

  • xmltodict: Reading of xml files
  • tabulate>=0.8.10: Pretty printing of tables
  • python-gitlab: Access to gitlab repositories
  • pypandoc>=2.3: Conversion of markdown files to html
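
To see which of the optional packages listed above are already available in the current environment, a check along the following lines can be used. This is a sketch using only the standard library; the mapping `EXTRAS` and the helper `missing_modules` are hypothetical, and the import names are assumptions (for instance, python-gitlab is imported as `gitlab`).

```python
from importlib.util import find_spec

# Extras from this README mapped to the import names of their packages.
# The import names are assumptions; they can differ from the pip names.
EXTRAS = {
    "mongodb": ["pymongo"],
    "csv": ["pandas"],
    "snt": ["xmltodict", "tabulate", "gitlab", "pypandoc"],
}


def missing_modules(extra, extras=EXTRAS):
    """Return the import names of `extra` that cannot be found."""
    return [name for name in extras[extra] if find_spec(name) is None]


for extra in EXTRAS:
    missing = missing_modules(extra)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"[{extra}] {status}")
```

Note that this only checks whether a module can be found, not whether its version satisfies the minimum stated above.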

Citing the package

If you use the package in your work, you may cite the paper in the journal inggrid.

Here is the BibTeX entry for it:

@article{probst2023h5rdmtoolbox,
  title={h5RDMtoolbox-A Python Toolbox for FAIR Data Management around HDF5},
  author={Probst, Matthias and Pritz, Balazs},
  year={2023},
  publisher={ing. grid Preprint Repository}
}

Contribution

Feel free to contribute. Make sure to write docstrings for your methods and classes and follow PEP 8 (https://peps.python.org/pep-0008/).

Please write tests for your code and put them into the test/ folder. Visit the README file in the test folder for more information.

Please also add a Jupyter notebook to the docs/ folder in order to document your code. Visit the README file in the docs folder for more information on how to compile the documentation.

Please use the numpy style for the docstrings: https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_numpy.html#example-numpy

