# FileLoader HDF5

HDF5 file loader using h5py — tree extraction, node metadata, and dataset loading for the vcti-fileloader framework.
## When to Use This Loader

Use vcti-fileloader-hdf5 when you need to inspect the structure of an
HDF5 file — groups, datasets, attributes — without reading every dataset
array into memory upfront. The separated loading design lets you:

- Browse the tree hierarchy first, then fetch only the datasets you need.
- Retrieve node metadata (names, types, byte sizes) for display or filtering before committing to a full data load.
- Load attributes selectively by node ID instead of scanning the whole file.

If you only need raw array access without tree/metadata introspection, use h5py directly.
## Installation

```shell
pip install "vcti-fileloader-hdf5>=1.0.0"
```

(The quotes keep the `>=` specifier from being interpreted as a shell redirect.)
## Quick Start

```python
from pathlib import Path

from vcti.fileloader import LoaderRegistry
from vcti.fileloader_hdf5 import H5pyLoader, get_loader_descriptor

# Context manager (recommended)
loader = H5pyLoader()
with loader.open(Path("data.h5")) as handle:
    tree = loader.load_tree(handle)
    info = loader.load_node_info(handle)
    node = loader.load_dataset(handle, node_id=2)

# Manual load/unload
loader = H5pyLoader()
handle = loader.load(Path("data.h5"))
try:
    tree = loader.load_tree(handle)
finally:
    loader.unload(handle)

# Registry-based usage
registry = LoaderRegistry()
registry.register(get_loader_descriptor())
desc = registry.get("hdf5-h5py-loader")
with desc.loader.open(Path("data.h5")) as handle:
    tree = desc.loader.load_tree(handle)
```
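To follow along, you can create a `data.h5` whose layout matches the Example Output tables below. This is a sketch: the `ids` values are arbitrary placeholders (only the dtype and element count matter for the byte sizes shown later).

```python
import h5py
import numpy as np

# Build a small data.h5 matching the example hierarchy: a top-level
# "ids" dataset (3 x int64) and a "results/" group holding a "stress"
# dataset (3 x float64) with a "units" attribute.
with h5py.File("data.h5", "w") as f:
    f.create_dataset("ids", data=np.array([10, 20, 30], dtype=np.int64))
    stress = f.create_dataset("results/stress",
                              data=np.array([1.0, 2.0, 3.0]))
    stress.attrs["units"] = "MPa"
```

h5py creates the intermediate `results` group automatically when given the nested path.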
## Example Output

### load_tree() — structured array

Each row represents a node in the HDF5 hierarchy. Pointers use node IDs (0 = no link).

| id | parent_id | first_child_id | prev_sibling_id | next_sibling_id | node |
|---|---|---|---|---|---|
| 1 | 0 | 2 | 0 | 0 | / (root) |
| 2 | 1 | 4 | 0 | 3 | results/ |
| 3 | 1 | 0 | 2 | 0 | ids |
| 4 | 2 | 0 | 0 | 0 | results/stress |
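The pointer fields make sibling-linked traversal straightforward. A minimal sketch, assuming the field names shown above and using a hand-built array in place of a real `load_tree()` result:

```python
import numpy as np

# Stand-in for a load_tree() result, mirroring the table above
# (0 means "no link" in every pointer field).
tree = np.array(
    [(1, 0, 2, 0, 0),   # /
     (2, 1, 4, 0, 3),   # results/
     (3, 1, 0, 2, 0),   # ids
     (4, 2, 0, 0, 0)],  # results/stress
    dtype=[("id", "i8"), ("parent_id", "i8"), ("first_child_id", "i8"),
           ("prev_sibling_id", "i8"), ("next_sibling_id", "i8")],
)
by_id = {int(row["id"]): row for row in tree}

def children(node_id):
    """Yield child IDs: follow first_child_id, then next_sibling_id links."""
    child = int(by_id[node_id]["first_child_id"])
    while child != 0:
        yield child
        child = int(by_id[child]["next_sibling_id"])

# children(1) -> [2, 3]; children(2) -> [4]
```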
### load_node_info() — structured array

| id | name | type | size | note |
|---|---|---|---|---|
| 1 | / | group | 0 | |
| 2 | results | group | 0 | |
| 3 | ids | dataset | 24 | 3 × int64 = 24 bytes |
| 4 | results/stress | dataset | 24 | 3 × float64 = 24 bytes |
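Because the metadata comes back as a structured array, you can filter with plain NumPy before loading any data. A sketch, assuming the field names shown above and using a hand-built stand-in array:

```python
import numpy as np

# Stand-in for a load_node_info() result, mirroring the table above.
info = np.array(
    [(1, "/", "group", 0),
     (2, "results", "group", 0),
     (3, "ids", "dataset", 24),
     (4, "results/stress", "dataset", 24)],
    dtype=[("id", "i8"), ("name", "U64"), ("type", "U16"), ("size", "i8")],
)

# Keep only datasets, then only those at or above a size threshold.
datasets = info[info["type"] == "dataset"]
big = datasets[datasets["size"] >= 24]
# big["id"] now holds the candidate IDs to pass to load_dataset()
```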
### load_dataset() — DataNode

```python
node = loader.load_dataset(handle, node_id=4)
node.data        # np.array([1.0, 2.0, 3.0])
node.attributes  # {'units': 'MPa', 'type': 'dataset', 'shape': (3,), 'dtype': 'float64'}
```
## API

### H5pyLoader

| Method | Description |
|---|---|
| `load(path, **options)` | Open HDF5 file, return h5py.File handle |
| `open(path, **options)` | Context manager — loads and auto-unloads |
| `unload(data)` | Close HDF5 file and clear cached mappings |
| `can_load(path)` | Check extension (.h5, .hdf5, .he5) |
| `load_tree(data)` | Tree structure as structured array |
| `load_node_info(data)` | Node metadata (id, name, type, size) |
| `load_attributes(data, node_ids)` | Attributes dict per node |
| `load_dataset(data, node_id)` | DataNode with array + attributes |
### Helpers

| Helper | Description |
|---|---|
| `get_loader_descriptor()` | Create LoaderDescriptor for registry |
| `H5pyValidator` | Check h5py availability |
| `H5pySetup` | No-op setup (h5py needs no config) |
## Error Handling

The loader raises specific exceptions for different failure modes:

```python
from pathlib import Path

from vcti.fileloader import LoadError, UnloadError, UnsupportedFormatError
from vcti.fileloader_hdf5 import H5pyLoader

loader = H5pyLoader()
try:
    with loader.open(Path("data.h5")) as handle:
        node = loader.load_dataset(handle, node_id=99)
except FileNotFoundError:
    # File does not exist at the given path
    ...
except UnsupportedFormatError:
    # File exists but is not a valid HDF5 file
    ...
except LoadError:
    # Other failure during file open (e.g., permissions)
    ...
except KeyError:
    # Node ID not found in load_dataset
    ...
except ValueError:
    # File handle is closed
    ...
```
## Performance

### Node map caching
On the first call to any load method, the loader walks the HDF5 hierarchy
once via h5py.File.visit() to build a bidirectional path-to-ID /
ID-to-path mapping. This mapping is cached per file handle (via
WeakKeyDictionary) and reused by all subsequent calls — load_tree,
load_node_info, load_attributes, load_dataset — so you never pay
for a second traversal.
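The caching pattern described above can be sketched with the standard library's `WeakKeyDictionary`. Here `build_node_map` and `get_node_map` are hypothetical stand-ins for the loader's internals, not its actual API:

```python
from weakref import WeakKeyDictionary

# Cache keyed by the file handle itself: when the handle is closed and
# garbage-collected, its cached node map is dropped automatically.
_node_maps = WeakKeyDictionary()

def get_node_map(handle, build_node_map):
    """Return the cached node map for this handle, building it once."""
    try:
        return _node_maps[handle]
    except KeyError:
        _node_maps[handle] = node_map = build_node_map(handle)
        return node_map
```

This is why repeated calls to `load_tree`, `load_node_info`, and friends on the same handle never re-walk the hierarchy.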
### Memory overhead
The node map stores two Python dicts (path string and integer ID per node).
Rough overhead: ~200-300 bytes per node. For a file with 100,000 nodes,
expect ~20-30 MB for the mapping alone. The structured arrays returned by
load_tree and load_node_info add ~20 bytes and ~300 bytes per node
respectively.
### Traversal time
h5py.File.visit() is backed by HDF5's C-level H5Literate, so
traversal is fast — typically < 1 second for 100K nodes on local SSD.
The bottleneck for large files is usually dataset I/O, not tree walking.
### Filtered vs. full attribute loading

- `load_attributes(handle)` — reads attributes for every node. Use this when you need a complete picture (e.g., building a search index).
- `load_attributes(handle, node_ids=np.array([2, 5]))` — reads only the specified nodes. Prefer this when you know which nodes you need, as it avoids touching unrelated HDF5 objects.
### Full array loading
load_dataset() reads the entire dataset into memory via obj[:]. For
very large datasets (multi-GB), consider using h5py slicing directly on
the file handle instead.
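A sketch of the direct-slicing alternative, using a small temporary file in place of a multi-GB one (file path and dataset name are illustrative):

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "big.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("results/stress", data=np.arange(1000.0))

with h5py.File(path, "r") as f:
    dset = f["results/stress"]   # lazy handle; no array data read yet
    chunk = dset[100:110]        # HDF5 reads only these 10 elements
```

The slice goes through HDF5's partial I/O, so memory use scales with the slice, not the dataset.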
Thread Safety
h5py file handles are not thread-safe. Do not share a single
h5py.File handle across threads. Instead, open a separate handle per
thread or serialize access with a lock.
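One way to keep a handle per thread is `threading.local`. The names `thread_handle` and `open_handle` below are illustrative, not part of the loader's API; `open_handle` is any zero-argument callable that opens a fresh handle:

```python
import threading

# Thread-local storage: each thread sees its own "handle" attribute.
_local = threading.local()

def thread_handle(open_handle):
    """Return this thread's handle, opening one on first use."""
    if not hasattr(_local, "handle"):
        _local.handle = open_handle()
    return _local.handle
```

In real use, `open_handle` might wrap `H5pyLoader().load(path)`; each thread then works against its own h5py.File, so no lock is needed for reads.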
## Dependencies
- h5py (>=3.0)
- numpy (>=1.24)
- vcti-fileloader (>=1.0.0)
- vcti-array-tree (>=1.0.0) — DataNode
## File details

### vcti_fileloader_hdf5-1.0.0.tar.gz

- Size: 14.3 kB
- Tags: Source
- Uploaded using Trusted Publishing: Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

| Algorithm | Hash digest |
|---|---|
| SHA256 | `885b4935b9e8a03959a7defc9d79d37bd3965620c8e335f09073716202c4bf1a` |
| MD5 | `a7dac71692a662b002bf0c3e468352e7` |
| BLAKE2b-256 | `b35b5626c404cfbb52f7c7e7e881d6b9d42121dd8d4706caba114304118472b2` |
### Provenance

Attestation for vcti_fileloader_hdf5-1.0.0.tar.gz:

- Publisher: publish.yml on vcollab/vcti-python-fileloader-hdf5
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject digest: 885b4935b9e8a03959a7defc9d79d37bd3965620c8e335f09073716202c4bf1a
- Sigstore transparency entry: 1193196272
- Permalink: vcollab/vcti-python-fileloader-hdf5@efeca56e7f962c1e6175bc74b7ba8e4b328bc73d
- Branch / Tag: refs/heads/main
- Owner: https://github.com/vcollab
- Access: private
- Token issuer: https://token.actions.githubusercontent.com
- Runner environment: github-hosted
- Publication workflow: publish.yml@efeca56e7f962c1e6175bc74b7ba8e4b328bc73d
- Trigger event: workflow_dispatch
### vcti_fileloader_hdf5-1.0.0-py3-none-any.whl

- Size: 9.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing: Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

| Algorithm | Hash digest |
|---|---|
| SHA256 | `1e2e1c8e5525cb4392cb394d5f4566625a764539ae35e9ced1da36810f3ecd8b` |
| MD5 | `dd6e2ea29e7cdfee6b82d76ec2893044` |
| BLAKE2b-256 | `371dc7fd2e74f2add08a8a0c4e951aee18c58af59dd428072b263d7c50863568` |
### Provenance

Attestation for vcti_fileloader_hdf5-1.0.0-py3-none-any.whl:

- Publisher: publish.yml on vcollab/vcti-python-fileloader-hdf5
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject digest: 1e2e1c8e5525cb4392cb394d5f4566625a764539ae35e9ced1da36810f3ecd8b
- Sigstore transparency entry: 1193196333
- Permalink: vcollab/vcti-python-fileloader-hdf5@efeca56e7f962c1e6175bc74b7ba8e4b328bc73d
- Branch / Tag: refs/heads/main
- Owner: https://github.com/vcollab
- Access: private
- Token issuer: https://token.actions.githubusercontent.com
- Runner environment: github-hosted
- Publication workflow: publish.yml@efeca56e7f962c1e6175bc74b7ba8e4b328bc73d
- Trigger event: workflow_dispatch