Compact, lazy-readable HDF5 trajectories with incremental atomistic property storage.
dumpDUCK
dumpDUCK stores atomistic trajectories as compact, lazy-readable HDF5 files. It is designed for large MD trajectories where you want to read one frame at a time, and for incremental labelling workflows where new properties are added after the trajectory already exists.
Installation
pip install -e .
Optional Zstandard/Blosc compression:
pip install -e '.[compression]'
Convert a trajectory
LAMMPS dump:
dumpduck convert 4-azif_hda_512FUs_seed4-quenched_1500K_to_300K_rate0.1Kps-equilibrated300K_5ns.dump 4-azif_hda_512FUs_seed4-quenched_1500K_to_300K_rate0.1Kps-equilibrated300K_5ns.h5 \
    --format lammpstrj \
    --type-map '1:C,2:H,3:N,4:Zn' \
    --compression gzip \
    --compression-level 6 \
    --float-dtype float32 \
    --chunk-frames 16

dumpduck convert azif_rmc_2010_nmr_300K_10fs.lammpstrj azif_rmc_2010_nmr_nvt_300K_10fs.h5 \
    --format lammpstrj \
    --type-map '1:C,2:H,3:N,4:Zn' \
    --compression blosc-zstd \
    --compression-level 9 \
    --float-dtype float32 \
    --chunk-frames 64

TYPE_MAP='1:Zn,2:Zn'
TYPE_MAP="${TYPE_MAP},3:H,4:H,5:H,6:H,7:H,8:H,9:H,10:H,11:H,12:H,13:H,14:H"
TYPE_MAP="${TYPE_MAP},15:C,16:C,17:C,18:C,19:C,20:C,21:C,22:C,23:C,24:C,25:C,26:C"
TYPE_MAP="${TYPE_MAP},27:N,28:N,29:N,30:N,31:N,32:N,33:N,34:N"

dumpduck convert \
    zif4_2x2x2_300K_nvt_nmr_nvt_300K_1ns_10fs.lammpstrj \
    zif4_2x2x2_300K_nvt_nmr_nvt_300K_1ns_10fs.h5 \
    --format lammpstrj \
    --type-map "${TYPE_MAP}" \
    --compression blosc-zstd \
    --compression-level 7 \
    --float-dtype float32 \
    --chunk-frames 100 \
    --n-frames 100000

dumpduck info zif4_2x2x2_300K_nvt_nmr_nvt_300K_1ns_10fs.h5
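Long type maps like the 34-type one above are easy to mistype by hand. The sketch below builds the same `type:symbol` string from ranges and parses it back for a sanity check. `build_type_map` and `parse_type_map` are hypothetical helper names used only for illustration; they are not part of dumpDUCK, and how dumpDUCK parses `--type-map` internally is not shown here.

```python
def build_type_map(ranges):
    """Expand (first_type, last_type, symbol) ranges into a 'type:symbol' string."""
    pairs = []
    for first, last, symbol in ranges:
        for lammps_type in range(first, last + 1):  # last_type is inclusive
            pairs.append(f"{lammps_type}:{symbol}")
    return ",".join(pairs)


def parse_type_map(text):
    """Parse a string such as '1:C,2:H' into {1: 'C', 2: 'H'}."""
    mapping = {}
    for item in text.split(","):
        lammps_type, symbol = item.split(":")
        mapping[int(lammps_type)] = symbol.strip()
    return mapping


# Same assignment as the shell TYPE_MAP above: 1-2 Zn, 3-14 H, 15-26 C, 27-34 N.
type_map = build_type_map([(1, 2, "Zn"), (3, 14, "H"), (15, 26, "C"), (27, 34, "N")])
print(type_map)
print(len(parse_type_map(type_map)), "LAMMPS types mapped")
```

The printed string can be passed to `--type-map` in place of the hand-assembled shell variable.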
ASE-readable trajectory:
dumpduck convert trajectory.xyz trajectory.h5 --chunk-frames 16
Inspect a file
dumpduck info trajectory.h5
Example output:
file: trajectory.h5
format: dumpduck-hdf5
version: 0.2.0
frames: 100001
atoms: 4352
core datasets:
positions shape=(100001, 4352, 3) dtype=float32 chunks=(16, 4352, 3) compression=gzip
properties:
atomic/shielding_tensors
shape: (100001, 4352, 3, 3)
dtype: float32
valid frames: 2183 / 100001
units: ppm
Lazy reading
from dump_duck import H5Trajectory

with H5Trajectory('trajectory.h5') as traj:
    atoms = traj[0]
    for atoms in traj.iter_frames(start=0, stop=1000, step=10):
        print(atoms.info['timestep'], atoms.positions.shape)
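Only the requested frame is returned. Because HDF5 reads compressed data one chunk at a time, the chunk that holds the frame is what is actually read and decompressed, so --chunk-frames sets the cost of a single-frame read.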
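How much data a strided read pulls in depends on the chunk size, since HDF5 decompresses whole chunks. The sketch below counts the chunks touched by the `iter_frames(start=0, stop=1000, step=10)` call above for a file written with 16 frames per chunk. It assumes `iter_frames` follows Python `range` semantics (exclusive `stop`), which is an assumption rather than something documented here, and `chunks_touched` is a hypothetical helper, not a dumpDUCK function.

```python
def chunks_touched(start, stop, step, chunk_frames):
    """Return the sorted chunk indices holding the frames range(start, stop, step)."""
    return sorted({frame // chunk_frames for frame in range(start, stop, step)})


chunk_frames = 16
frames = range(0, 1000, 10)
touched = chunks_touched(0, 1000, 10, chunk_frames)

print(len(frames), "frames requested")
print(len(touched), "chunks touched")
print(len(touched) * chunk_frames, "frames' worth of data decompressed")
```

When the step is smaller than the chunk size, every chunk in the range contains at least one requested frame, so a strided read costs about as much I/O as a sequential one.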
Incremental properties
Properties live under /properties/atomic/<name> or /properties/frame/<name>.
Each property has:
data # actual data
valid # bool mask saying which frames have been written
This allows sparse labelling: the property can exist for all frames, while only a subset has been computed.
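The data-plus-mask idea can be illustrated without HDF5 at all. The following is a minimal in-memory sketch, assuming plain Python lists in place of datasets; `SparseProperty` is a hypothetical class written for this illustration and does not mirror dumpDUCK's implementation.

```python
class SparseProperty:
    """In-memory stand-in for a property group: a data array plus a valid mask."""

    def __init__(self, n_frames, fill=None):
        self.data = [fill] * n_frames    # one slot per frame, allocated up front
        self.valid = [False] * n_frames  # True once a frame has been written

    def write(self, index, value):
        """Store a value and mark the frame as labelled."""
        self.data[index] = value
        self.valid[index] = True

    def missing(self):
        """Frame indices that still need to be computed."""
        return [i for i, done in enumerate(self.valid) if not done]


energy = SparseProperty(n_frames=6)
energy.write(0, -1.25)
energy.write(3, -1.31)

print(sum(energy.valid), "/", len(energy.valid), "frames valid")
print("still missing:", energy.missing())
```

A labelling job that is interrupted can restart from `missing()` instead of recomputing frames that are already marked valid.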
NMR shielding tensors, one frame at a time
from dump_duck import H5Trajectory

with H5Trajectory('trajectory.h5', mode='r+') as traj:
    if not traj.has_property('shielding_tensors', kind='atomic'):
        traj.create_property(
            'shielding_tensors',
            kind='atomic',
            frame_shape=(3, 3),
            dtype='float32',
            units='ppm',
            description='Per-atom NMR shielding tensors',
            compression='gzip',
            compression_level=6,
            chunk_frames=1,
        )
    for i, atoms in enumerate(traj.iter_frames()):
        # Skip frames that were labelled in an earlier run.
        if traj.property_valid('shielding_tensors', i, kind='atomic'):
            continue
        # `calculator` is your own model object; it is not provided by dumpDUCK.
        shielding = calculator.predict_shielding_tensors(atoms)  # shape: (n_atoms, 3, 3)
        traj.write_property('shielding_tensors', i, shielding, kind='atomic')
Chemical shifts
from dump_duck import H5Trajectory

with H5Trajectory('trajectory.h5', mode='r+') as traj:
    traj.create_property(
        'chemical_shifts',
        kind='atomic',
        frame_shape=(),
        dtype='float32',
        units='ppm',
        description='Per-atom NMR chemical shifts',
    )
    # `shifts` is an array you computed for frame 0, shape: (n_atoms,)
    traj.write_property('chemical_shifts', 0, shifts, kind='atomic')
Frame-wise energies
with H5Trajectory('trajectory.h5', mode='r+') as traj:
    traj.create_property('energy', kind='frame', dtype='float64', units='eV')
    traj.write_property('energy', 0, 123.4, kind='frame')
Extract frames
dumpduck extract trajectory.h5 frame_1000.xyz --index 1000
With valid properties included as ASE arrays/info:
dumpduck extract trajectory.h5 labelled.xyz --start 0 --stop 100 --step 10 --include-properties
Compression notes
Portable built-in options:
none, lzf, gzip
Optional plugin options with dumpduck[compression]:
zstd, blosc-zstd
For MD trajectories, a good default is:
gzip level 6, float32, chunk_frames 16
For single-frame random access, use smaller chunks. For better compression and sequential reading, use larger chunks such as 32 or 64.
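The trade-off can be put in numbers. The sketch below estimates the uncompressed size of one positions chunk and the number of chunks for the 100001-frame, 4352-atom example shown under "Inspect a file". `chunk_stats` is a hypothetical helper for this arithmetic only; actual on-disk sizes depend on the compressor and the data.

```python
import math

DTYPE_BYTES = {"float32": 4, "float64": 8}


def chunk_stats(n_frames, n_atoms, chunk_frames, dtype="float32"):
    """Uncompressed bytes per positions chunk and the number of chunks."""
    chunk_bytes = chunk_frames * n_atoms * 3 * DTYPE_BYTES[dtype]
    n_chunks = math.ceil(n_frames / chunk_frames)  # the last chunk may be partial
    return chunk_bytes, n_chunks


for chunk_frames in (1, 16, 64):
    chunk_bytes, n_chunks = chunk_stats(100001, 4352, chunk_frames)
    print(f"chunk_frames={chunk_frames:>2}: "
          f"{chunk_bytes / 1e6:.2f} MB per chunk, {n_chunks} chunks")
```

Reading a single frame decompresses one whole chunk, so the per-chunk size is roughly the price of each random access, while fewer, larger chunks give the compressor more data to work with.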
HDF5 layout
/
    atomic_numbers    (n_atoms,)
    ids               (n_atoms,)
    lammps_types      optional, (n_atoms,)
    mol_ids           optional, (n_atoms,)
    positions         (n_frames, n_atoms, 3)
    cells             (n_frames, 3, 3)
    pbc               (n_frames, 3)
    timesteps         (n_frames,)
    properties/
        atomic/
            <name>/
                data  (n_frames, n_atoms, *frame_shape)
                valid (n_frames,)
        frame/
            <name>/
                data  (n_frames, *frame_shape)
                valid (n_frames,)
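The shape rules in the layout above can be written out as a small function, which is handy for checking the `frame_shape` to pass when creating a property. `property_shape` is a hypothetical helper for illustration, not a dumpDUCK function.

```python
def property_shape(kind, n_frames, n_atoms, frame_shape=()):
    """Shape of a property's data array, following the layout above."""
    if kind == "atomic":
        return (n_frames, n_atoms, *frame_shape)
    if kind == "frame":
        return (n_frames, *frame_shape)
    raise ValueError(f"kind must be 'atomic' or 'frame', got {kind!r}")


# Per-atom 3x3 shielding tensors, per-atom scalar shifts, per-frame scalar energy.
print(property_shape("atomic", 100001, 4352, (3, 3)))
print(property_shape("atomic", 100001, 4352))
print(property_shape("frame", 100001, 4352))
```

The first result matches the `shape: (100001, 4352, 3, 3)` line in the example `dumpduck info` output; in every case the `valid` mask has shape `(n_frames,)`.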