
A parallel HDF5-based archiving tool.

Project description

HDF5Vault

HDF5Vault is an MPI-based tool for efficiently concatenating and compressing very large numbers of files into a small number of HDF5-based archive files. It is designed for high-throughput archive generation on parallel file systems such as GPFS, Lustre or VAST Data, where metadata operations and I/O parallelism are critical for performance.

HDF5Vault compresses multiple input files concurrently and pipelines file scanning, compression, and archive writing across MPI processes. While files are being compressed, data are written in parallel to multiple archive files, maximizing both CPU and storage utilization. The tool is implemented in Python and uses Blosc2 for fast and efficient compression.

Archive layout

Within each archive, every input file is stored as one or more HDF5 datasets of byte type. The group and subgroup hierarchy mirrors the directory structure of the input data, and each dataset is named after the original file, with an additional extension indicating the compression format (e.g., .blosc2).
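
The mapping from an input file's relative path to its dataset path inside the archive can be sketched in plain Python (dataset_path is a hypothetical helper, not part of HDF5Vault's API; the .blosc2 suffix follows the convention described above):

```python
from pathlib import PurePosixPath

def dataset_path(relative_file: str, extension: str = ".blosc2") -> str:
    """Map an input file's relative path to its HDF5 dataset path.

    Directories become groups; the dataset keeps the original file name
    plus a suffix naming the compression format.
    """
    p = PurePosixPath(relative_file)
    return str(p.parent / (p.name + extension))

print(dataset_path("plate1/well_B03/frame_0001.tif"))
# plate1/well_B03/frame_0001.tif.blosc2
```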

Files larger than a configurable size threshold (default: 1 GB) are automatically split into multiple datasets. This chunking ensures compatibility with Blosc2 size limits and reduces memory usage.
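
The splitting step amounts to slicing a file's bytes at the size threshold; a minimal sketch (split_into_chunks is an illustrative name, not HDF5Vault's internal function):

```python
def split_into_chunks(data: bytes, chunk_size: int = 1 << 30) -> list[bytes]:
    """Split a byte string into pieces of at most chunk_size bytes.

    The default of 1 GiB mirrors the documented size threshold.
    """
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# Scaled-down example: a 2500-byte "file" with a 1000-byte threshold
# becomes three datasets.
parts = split_into_chunks(b"x" * 2500, chunk_size=1000)
print([len(p) for p in parts])  # [1000, 1000, 500]
```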

Use cases

HDF5Vault was originally developed to handle on the order of 10⁵ to 10⁷ small files (typically a few MBs each) generated by high-throughput acquisition systems, such as the Yokogawa CellVoyager microscope.

In a representative production workload, several hundred thousand files with a total uncompressed size of 6.2 TB were compressed into eight HDF5 archives (2 TB total) in approximately 30 minutes on a compute cluster connected to a VAST storage system.

Limitations

HDF5Vault is not intended for archiving data stored on a single local disk. Its performance benefits rely on parallel I/O and concurrent metadata operations provided by distributed file systems. In addition, compression is applied independently at the file level; in workloads where strong redundancy exists across files, tools that exploit cross-file similarity may achieve better compression ratios.

Downstream processing

Archives created with HDF5Vault can be unpacked efficiently in parallel. In some workflows, they may also be accessed directly without unpacking, for example during the generation of OME-Zarr datasets (see the example workflow repository).

Installation

HDF5Vault is distributed as a PyPI package and requires Python ≥ 3.8 with the following dependencies:

  • h5py
  • mpi4py
  • blosc2

Installation can be performed using either a standard Python virtual environment or Pixi.

Using Python virtual environments and pip

If Python ≥ 3.8, venv, and an MPI implementation are available on your system, HDF5Vault can be installed as follows:

python3 -m venv hdf5vault
source hdf5vault/bin/activate
pip install hdf5vault

Using Pixi

Pixi can be used to create a fully reproducible environment, including MPI-related dependencies:

pixi init hdf5vault
cd hdf5vault
pixi add openmpi
pixi add python=3.12 pip
pixi add h5py mpi4py python-blosc2 pandas humanize
pixi run pip install hdf5vault

Usage

All HDF5Vault tools (create, verify, unpack) are executed as MPI programs. To create an archive, use:

mpirun -n NUM_TASKS hdf5vault_create \
                    DIRECTORY_TO_BE_ARCHIVED \
                    ARCHIVE_BASENAME \
                    -c COMPRESSION_LEVEL \
                    -t THREADS \
                    -w NUM_WRITERS \
                    -s CHUNKSIZE

The contents of the DIRECTORY_TO_BE_ARCHIVED are packed into NUM_WRITERS archive files named ARCHIVE_BASENAME_X.h5, where X denotes the archive index. If only one writer is specified, just one archive file with the name ARCHIVE_BASENAME.h5 is created.
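
The resulting file names can be sketched as follows (archive_names is an illustrative helper, not part of the CLI):

```python
def archive_names(basename: str, num_writers: int) -> list[str]:
    """Archive file names produced for a given basename, per the rule above."""
    if num_writers == 1:
        return [f"{basename}.h5"]
    return [f"{basename}_{i}.h5" for i in range(num_writers)]

print(archive_names("data_archive", 4))
# ['data_archive_0.h5', 'data_archive_1.h5', 'data_archive_2.h5', 'data_archive_3.h5']
print(archive_names("data_archive", 1))
# ['data_archive.h5']
```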

The compression level ranges from 1 (fastest) to 9 (highest compression) and is passed directly to Blosc2. Each compression process uses THREADS threads to compress the data.

At least NUM_WRITERS + 3 MPI processes are required (with a minimum of 4 when NUM_WRITERS = 1). MPI ranks are assigned to distinct roles:

  • one for directory scanning and scheduling compression,
  • one for scheduling archive writes,
  • NUM_WRITERS for writing (one per archive file),
  • and the remaining ranks for compression.
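
Under these rules, the rank budget for a run can be sketched as follows (the role names are illustrative, not HDF5Vault's internal terminology):

```python
def plan_ranks(num_tasks: int, num_writers: int) -> dict:
    """Break down MPI ranks by role, following the rules above."""
    required = num_writers + 3  # scanner + write scheduler + writers + >=1 worker
    if num_tasks < required:
        raise ValueError(f"need at least {required} MPI processes")
    return {
        "scanner/compression scheduler": 1,
        "write scheduler": 1,
        "writers": num_writers,
        "compression workers": num_tasks - num_writers - 2,
    }

print(plan_ranks(20, 4))
# {'scanner/compression scheduler': 1, 'write scheduler': 1,
#  'writers': 4, 'compression workers': 14}
```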

Example

mpirun -n 20 hdf5vault_create data data_archive -c 7 -t 4 -w 4 

This command creates four archive files (data_archive_0.h5 to data_archive_3.h5) using 20 MPI processes: four writers, two scheduling/scanning ranks, and fourteen compression workers.

The maximum dataset size is controlled with -s (default 1 GB). The maximum chunk size supported by Blosc2 is 2.1 GB.

Archive verification

To verify the integrity of an archive against the original directory, use

mpirun -n $NUM_TASKS hdf5vault_check \
                     -d DIRECTORY_THAT_WAS_ARCHIVED \
                     -f ARCHIVE_FILE(S) \
                     [-j JSON_SUMMARY_FILE]

The verification tool recomputes checksums for each file and compares the archived data with the on-disk originals. The result of the verification (pass or fail) is written to standard output. When the -j option is used, MD5 checksums for both archived and original files are written to a JSON file.

Since one MPI rank is dedicated to directory scanning, at least two MPI processes are required.
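
The per-file comparison amounts to an MD5 digest check, which can be sketched with Python's standard hashlib (a simplified, serial stand-in for the parallel tool):

```python
import hashlib

def md5_digest(data: bytes) -> str:
    """MD5 hex digest of a byte payload, as used for the per-file comparison."""
    return hashlib.md5(data).hexdigest()

original = b"microscope frame data"
unpacked = b"microscope frame data"
assert md5_digest(original) == md5_digest(unpacked)  # verification passes
```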

Unpacking archives

Archives can be unpacked in parallel using:

mpirun -n $NUM_TASKS hdf5vault_unpack \
                     -f ARCHIVE_FILE(S) \
                     -d DESTDIR 

If DESTDIR is not specified, the base archive name (without the numeric suffix) is used as the destination directory (e.g., data_archive in the above example).
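
Deriving the default destination directory amounts to stripping the numeric suffix from the archive file name; a rough sketch (the exact rule used by hdf5vault_unpack may differ):

```python
import re

def default_destdir(archive_name: str) -> str:
    """Strip a trailing _<index>.h5 (or plain .h5) to recover the base name."""
    m = re.match(r"^(.*?)(?:_\d+)?\.h5$", archive_name)
    if m is None:
        raise ValueError(f"not an HDF5Vault archive name: {archive_name}")
    return m.group(1)

print(default_destdir("data_archive_0.h5"))  # data_archive
print(default_destdir("data_archive.h5"))    # data_archive
```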

Logging and verbosity

In all HDF5Vault tools, debugging information can be turned on with the -e flag. The flag -q suppresses all output except errors and warnings.

Archive format metadata

Each HDF5Vault archive includes an attribute named __description__ that contains basic information about the archive format. The definition can be found at src/hdf5vault/hdf5_archive_info.py.

Changelog

See Changelog.md for a record of changes, including new features, bug fixes, and compatibility notes.



Download files

Download the file for your platform.

Source Distribution

hdf5vault-0.3.2.tar.gz (19.3 kB)


Built Distribution


hdf5vault-0.3.2-py3-none-any.whl (22.1 kB)


File details

Details for the file hdf5vault-0.3.2.tar.gz.

File metadata

  • Download URL: hdf5vault-0.3.2.tar.gz
  • Upload date:
  • Size: 19.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: python-httpx/0.28.1

File hashes

Hashes for hdf5vault-0.3.2.tar.gz:

  • SHA256: 93f229a53c75ff41e48649a416c23507e6174cbe4f236f7bac6b3fefef3907a6
  • MD5: 0eabf360983801b7f949e07f2929fdcf
  • BLAKE2b-256: cb9b28985a4d1d1a284c6666aab64e648adec15c220811e814b274cb00cad6d7


File details

Details for the file hdf5vault-0.3.2-py3-none-any.whl.

File metadata

  • Download URL: hdf5vault-0.3.2-py3-none-any.whl
  • Upload date:
  • Size: 22.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: python-httpx/0.28.1

File hashes

Hashes for hdf5vault-0.3.2-py3-none-any.whl:

  • SHA256: 3bdad4c47c2336343b47cb6b4253dfdc02a04958dd28b8f2afc3f69980a756b9
  • MD5: 1db3c6e86730ee69e81e624006590371
  • BLAKE2b-256: 7d27cfb6e50f7a235c858e25ee9b529adb3915ae2ab8ade745264ebef9064e50

