A parallel HDF5-based archiving tool.

HDF5Vault

HDF5Vault compresses multiple small TIFF files of similar size into an HDF5-based container. HDF5Vault is designed for fast archive generation on parallel file systems (GPFS, Lustre, VAST Data). It compresses multiple input files in parallel and simultaneously writes the data to multiple archive files.

Inside each archive, the contents of each compressed file are stored as an HDF5 dataset of type bytes. The dataset group and subgroup reflect the directory structure and the dataset name corresponds to the original file name with an extension reflecting the compression (e.g., .blosc2). HDF5Vault is based on Python and uses Blosc2 for fast and efficient compression.
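The layout described above can be sketched as a mapping from a file path to an HDF5 group and dataset name. This is only an illustration, not HDF5Vault's own code: the function name `dataset_location` is hypothetical, and it assumes that the compression suffix is appended to the complete original file name (the text does not say whether the original extension is kept).

```python
from pathlib import PurePosixPath
from typing import Tuple


def dataset_location(relative_path: str, suffix: str = ".blosc2") -> Tuple[str, str]:
    """Map a path inside the archived directory to (HDF5 group, dataset name).

    Hypothetical helper for illustration; not part of the HDF5Vault API.
    """
    path = PurePosixPath(relative_path)
    # Directory components become nested HDF5 groups; top-level files
    # end up in the root group "/".
    group = "/" + "/".join(path.parent.parts)
    # Assumption: the suffix is appended to the full original file name.
    return group, path.name + suffix


print(dataset_location("plate1/well_A01/img_0001.tif"))
```

Under these assumptions, a file stored at `plate1/well_A01/img_0001.tif` would be found in the group `/plate1/well_A01` as the dataset `img_0001.tif.blosc2`.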

Use cases

HDF5Vault is intended for handling the large number (10^5 to 10^6) of small (few MBs) files created by acquisition systems (e.g., the Yokogawa CellVoyager microscope). The tool may be suitable for other use cases if these conditions are met:

  1. Most files are at least a few MBs in size.
  2. The size of each file is much smaller than the available memory.

Although condition 1 was not tested, HDF5Vault would likely perform poorly when compressing many files that are much smaller than a few MBs. Condition 2 allows HDF5Vault to load n files into memory and compress them without chunking, where n is the number of parallel MPI processes.

HDF5 archives can easily be unpacked again, but are intended to be used directly during further processing, in particular during the creation of OME-Zarr files (see workflow repository).

Installation

HDF5Vault is provided as a PyPI package and requires a minimal Python setup with h5py, blosc2 and mpi4py. The package and its dependencies can be installed using a Python virtual environment or Pixi.

Using Python Virtual Environments and pip

If Python >= 3.8, python-venv and MPI are available on the system, HDF5Vault can be installed using a virtual environment:

python3 -m venv hdf5vault
source hdf5vault/bin/activate
pip install hdf5vault

Using Pixi

The following commands install HDF5Vault using Pixi:

pixi init hdf5vault
cd hdf5vault
pixi add python=3.12 pip
pixi add h5py mpi4py python-blosc2 pandas humanize
pixi run pip install hdf5vault

Usage

HDF5Vault is invoked as an MPI program.

mpirun -n NUM_TASKS hdf5vault_create \
           DIRECTORY_TO_BE_ARCHIVED \
           ARCHIVE_BASENAME \
           -c COMPRESSION_LEVEL \
           -t THREADS \
           -w NUM_WRITERS

The contents of DIRECTORY_TO_BE_ARCHIVED will be packed into NUM_WRITERS archives with the names ARCHIVE_BASENAME_X.h5, where X stands for the archive number. (If only one writer is specified, just one archive file with the name ARCHIVE_BASENAME.h5 is generated.) The COMPRESSION_LEVEL ranges from 1 (low) to 9 (high) and is passed to Blosc2. The MPI processes assigned to compression work in parallel, each using THREADS threads.
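The naming rule can be written down as a short sketch, for example to predict which files a run will produce. The helper `archive_names` is hypothetical and not part of HDF5Vault; it assumes that archive numbers start at 1, as in the example in this section.

```python
from typing import List


def archive_names(basename: str, num_writers: int) -> List[str]:
    """Return the archive file names expected for a given number of writers.

    Hypothetical helper for illustration; not part of the HDF5Vault API.
    """
    if num_writers < 1:
        raise ValueError("num_writers must be at least 1")
    if num_writers == 1:
        # A single writer produces one archive without a number.
        return [f"{basename}.h5"]
    # Assumption: archives are numbered from 1 to num_writers.
    return [f"{basename}_{i}.h5" for i in range(1, num_writers + 1)]


print(archive_names("data_archive", 4))
```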

NUM_TASKS must be at least NUM_WRITERS + 3 (so at least 4 when NUM_WRITERS is 1). One MPI task is dedicated to scheduling compression and one to scheduling writing.

For example,

mpirun -n 20 hdf5vault_create data data_archive -c 7 -t 4 -w 4 

will create 4 archives with the names data_archive_1.h5 to data_archive_4.h5. 20 MPI tasks are involved, with 4 tasks for writing, 14 tasks for compressing and 2 for scheduling.
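The task arithmetic can be checked before submitting a job. The following is a minimal sketch based only on the rules stated in this section (two scheduling tasks, NUM_WRITERS writing tasks, and the remainder compressing); the function `task_roles` is hypothetical and not part of HDF5Vault.

```python
from typing import Dict


def task_roles(num_tasks: int, num_writers: int) -> Dict[str, int]:
    """Split NUM_TASKS into writers, schedulers and compressors.

    Hypothetical helper for illustration; not part of the HDF5Vault API.
    """
    if num_writers < 1:
        raise ValueError("num_writers must be at least 1")
    # Writers, two schedulers, and at least one compressing task.
    minimum = num_writers + 3
    if num_tasks < minimum:
        raise ValueError(
            f"NUM_TASKS must be at least {minimum} for {num_writers} writer(s)"
        )
    schedulers = 2  # one schedules compression, one schedules writing
    return {
        "writers": num_writers,
        "schedulers": schedulers,
        "compressors": num_tasks - num_writers - schedulers,
    }


print(task_roles(20, 4))
# {'writers': 4, 'schedulers': 2, 'compressors': 14}
```

A check like this could be run in a job script to catch an invalid combination of `-n` and `-w` before `mpirun` is started.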

Notes

HDF5Vault was redesigned to work efficiently on a VAST storage system. A previous version used the MPI driver for HDF5 to write the compressed content of multiple files to a single archive in parallel. We found that parallel writing to a single file did not produce any benefits on VAST. The current version writes multiple archives and overlaps the writing of compressed file contents with the compression operation.

Changelog

See Changelog.md
