
FileLoader HDF5

HDF5 file loader using h5py — tree extraction, node metadata, and dataset loading for the vcti-fileloader framework.

When to Use This Loader

Use vcti-fileloader-hdf5 when you need to inspect the structure of an HDF5 file — groups, datasets, attributes — without reading every dataset array into memory upfront. The separated loading design lets you:

  • Browse the tree hierarchy first, then fetch only the datasets you need.
  • Retrieve node metadata (names, types, byte sizes) for display or filtering before committing to a full data load.
  • Load attributes selectively by node ID instead of scanning the whole file.

If you only need raw array access without tree/metadata introspection, use h5py directly.

Installation

pip install "vcti-fileloader-hdf5>=1.0.0"

Quick Start

from pathlib import Path
from vcti.fileloader_hdf5 import H5pyLoader, get_loader_descriptor
from vcti.fileloader import LoaderRegistry

# Context manager (recommended)
loader = H5pyLoader()
with loader.open(Path("data.h5")) as handle:
    tree = loader.load_tree(handle)
    info = loader.load_node_info(handle)
    node = loader.load_dataset(handle, node_id=2)

# Manual load/unload
loader = H5pyLoader()
handle = loader.load(Path("data.h5"))
try:
    tree = loader.load_tree(handle)
finally:
    loader.unload(handle)

# Registry-based usage
registry = LoaderRegistry()
registry.register(get_loader_descriptor())
desc = registry.get("hdf5-h5py-loader")
with desc.loader.open(Path("data.h5")) as handle:
    tree = desc.loader.load_tree(handle)

Example Output

load_tree() — structured array

Each row represents a node in the HDF5 hierarchy. Pointers use node IDs (0 = no link).

id  parent_id  first_child_id  prev_sibling_id  next_sibling_id
 1          0               2                0                0   ← / (root)
 2          1               4                0                3   ← results/
 3          1               0                2                0   ← ids
 4          2               0                0                0   ← results/stress
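The pointer fields let you walk the hierarchy without recursion. A minimal sketch, assuming the field names match the columns above (the array below is hand-built to mirror the example, not produced by the loader):

```python
import numpy as np

# Hand-built structured array mirroring the example table; field names
# and dtypes are assumptions based on the columns shown above.
tree = np.array(
    [(1, 0, 2, 0, 0),   # /
     (2, 1, 4, 0, 3),   # results/
     (3, 1, 0, 2, 0),   # ids
     (4, 2, 0, 0, 0)],  # results/stress
    dtype=[("id", "i8"), ("parent_id", "i8"), ("first_child_id", "i8"),
           ("prev_sibling_id", "i8"), ("next_sibling_id", "i8")],
)

def children(tree, node_id):
    """Yield child IDs by following first_child / next_sibling links."""
    row = tree[tree["id"] == node_id][0]
    child = row["first_child_id"]
    while child != 0:  # 0 means "no link"
        yield int(child)
        child = tree[tree["id"] == child][0]["next_sibling_id"]

print(list(children(tree, 1)))  # → [2, 3]  (children of the root)
```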

load_node_info() — structured array

id  name               type       size
 1  /                   group         0
 2  results             group         0
 3  ids                 dataset      24   ← 3 × int64 = 24 bytes
 4  results/stress      dataset      24   ← 3 × float64 = 24 bytes
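Because the result is a NumPy structured array, ordinary boolean indexing works for filtering before any data load. A hand-built sketch mirroring the table above (field names and string dtypes are assumptions, not the loader's actual dtype):

```python
import numpy as np

# Hypothetical node-info array matching the example table.
info = np.array(
    [(1, "/", "group", 0),
     (2, "results", "group", 0),
     (3, "ids", "dataset", 24),
     (4, "results/stress", "dataset", 24)],
    dtype=[("id", "i8"), ("name", "U64"), ("type", "U16"), ("size", "i8")],
)

# Keep only dataset nodes, e.g. to decide which ones are worth loading.
datasets = info[info["type"] == "dataset"]
print(datasets["name"].tolist())    # → ['ids', 'results/stress']
print(int(datasets["size"].sum()))  # → 48  (total bytes if all are loaded)
```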

load_dataset() — DataNode

node = loader.load_dataset(handle, node_id=4)
node.data          # np.array([1.0, 2.0, 3.0])
node.attributes    # {'units': 'MPa', 'type': 'dataset', 'shape': (3,), 'dtype': 'float64'}

API

H5pyLoader

Method Description
load(path, **options) Open HDF5 file, return h5py.File handle
open(path, **options) Context manager — loads and auto-unloads
unload(data) Close HDF5 file and clear cached mappings
can_load(path) Check extension (.h5, .hdf5, .he5)
load_tree(data) Tree structure as structured array
load_node_info(data) Node metadata (id, name, type, size)
load_attributes(data, node_ids) Attributes dict per node
load_dataset(data, node_id) DataNode with array + attributes

Helpers

Name Description
get_loader_descriptor() Create LoaderDescriptor for registry
H5pyValidator Check h5py availability
H5pySetup No-op setup (h5py needs no config)

Error Handling

The loader raises specific exceptions for different failure modes:

from vcti.fileloader import LoadError, UnloadError, UnsupportedFormatError

loader = H5pyLoader()
try:
    with loader.open(Path("data.h5")) as handle:
        node = loader.load_dataset(handle, node_id=99)
except FileNotFoundError:
    # File does not exist at the given path
    ...
except UnsupportedFormatError:
    # File exists but is not a valid HDF5 file
    ...
except LoadError:
    # Other failure during file open (e.g., permissions)
    ...
except KeyError:
    # Node ID not found in load_dataset
    ...
except ValueError:
    # File handle is closed
    ...

Performance

Node map caching

On the first call to any load method, the loader walks the HDF5 hierarchy once via h5py.File.visit() to build a bidirectional path-to-ID / ID-to-path mapping. This mapping is cached per file handle (via WeakKeyDictionary) and reused by all subsequent calls — load_tree, load_node_info, load_attributes, load_dataset — so you never pay for a second traversal.

Memory overhead

The node map stores two Python dicts (path string and integer ID per node). Rough overhead: ~200-300 bytes per node. For a file with 100,000 nodes, expect ~20-30 MB for the mapping alone. The structured arrays returned by load_tree and load_node_info add ~20 bytes and ~300 bytes per node respectively.

Traversal time

h5py.File.visit() is backed by HDF5's C-level H5Literate, so traversal is fast — typically < 1 second for 100K nodes on local SSD. The bottleneck for large files is usually dataset I/O, not tree walking.

Filtered vs. full attribute loading

  • load_attributes(handle) — reads attributes for every node. Use this when you need a complete picture (e.g., building a search index).
  • load_attributes(handle, node_ids=np.array([2, 5])) — reads only the specified nodes. Prefer this when you know which nodes you need, as it avoids touching unrelated HDF5 objects.

Full array loading

load_dataset() reads the entire dataset into memory via obj[:]. For very large datasets (multi-GB), consider using h5py slicing directly on the file handle instead.
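For that case, plain h5py slicing reads only the requested range. A sketch with an illustrative file and dataset name:

```python
import os
import tempfile

import numpy as np
import h5py

# Create a small demo file standing in for a multi-GB one.
path = os.path.join(tempfile.mkdtemp(), "big.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("results/stress", data=np.arange(10_000, dtype="f8"))

total = 0.0
with h5py.File(path, "r") as f:
    dset = f["results/stress"]             # no array data read yet
    for start in range(0, dset.shape[0], 2_500):
        chunk = dset[start:start + 2_500]  # reads only this slice
        total += float(chunk.sum())

print(total)  # → 49995000.0  (sum of 0..9999, computed 2 500 rows at a time)
```

Each slice touches only its portion of the file, so peak memory stays at one chunk rather than the full array.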


Thread Safety

h5py file handles are not thread-safe. Do not share a single h5py.File handle across threads. Instead, open a separate handle per thread or serialize access with a lock.
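A sketch of the per-thread-handle approach (the file contents and worker logic are illustrative):

```python
import os
import tempfile
import threading

import numpy as np
import h5py

path = os.path.join(tempfile.mkdtemp(), "data.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("ids", data=np.arange(100))

results = {}

def worker(i):
    # Each thread opens its own read-only handle; handles are never shared.
    with h5py.File(path, "r") as f:
        results[i] = int(f["ids"][:].sum())

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # every thread computed 4950 (sum of 0..99) independently
```

If per-thread handles are too costly (e.g., very many short-lived tasks), the alternative is one shared handle guarded by a single threading.Lock around every access.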

