
Compact, lazy-readable HDF5 trajectories with incremental atomistic property storage.


dumpDUCK

dumpDUCK stores atomistic trajectories as compact, lazy-readable HDF5 files. It is designed for large MD trajectories where you want to read one frame at a time, and for incremental labelling workflows where new properties are added after the trajectory already exists.

Installation

pip install -e .

Optional Zstandard/Blosc compression:

pip install -e '.[compression]'

Convert a trajectory

LAMMPS dump:

dumpduck convert 4-azif_hda_512FUs_seed4-quenched_1500K_to_300K_rate0.1Kps-equilibrated300K_5ns.dump 4-azif_hda_512FUs_seed4-quenched_1500K_to_300K_rate0.1Kps-equilibrated300K_5ns.h5 \
  --format lammpstrj \
  --type-map '1:C,2:H,3:N,4:Zn' \
  --compression gzip \
  --compression-level 6 \
  --float-dtype float32 \
  --chunk-frames 16

dumpduck convert azif_rmc_2010_nmr_300K_10fs.lammpstrj azif_rmc_2010_nmr_nvt_300K_10fs.h5 \
  --format lammpstrj \
  --type-map '1:C,2:H,3:N,4:Zn' \
  --compression blosc-zstd \
  --compression-level 9 \
  --float-dtype float32 \
  --chunk-frames 64

TYPE_MAP='1:Zn,2:Zn'
TYPE_MAP="${TYPE_MAP},3:H,4:H,5:H,6:H,7:H,8:H,9:H,10:H,11:H,12:H,13:H,14:H"
TYPE_MAP="${TYPE_MAP},15:C,16:C,17:C,18:C,19:C,20:C,21:C,22:C,23:C,24:C,25:C,26:C"
TYPE_MAP="${TYPE_MAP},27:N,28:N,29:N,30:N,31:N,32:N,33:N,34:N"

dumpduck convert \
  zif4_2x2x2_300K_nvt_nmr_nvt_300K_1ns_10fs.lammpstrj \
  zif4_2x2x2_300K_nvt_nmr_nvt_300K_1ns_10fs.h5 \
  --format lammpstrj \
  --type-map "${TYPE_MAP}" \
  --compression blosc-zstd \
  --compression-level 7 \
  --float-dtype float32 \
  --chunk-frames 100 \
  --n-frames 100000

dumpduck info zif4_2x2x2_300K_nvt_nmr_nvt_300K_1ns_10fs.h5
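
The multi-line TYPE_MAP above assigns contiguous ranges of LAMMPS type IDs to elements. As a rough sketch, the same string could be generated in Python instead of being typed by hand. `build_type_map` is a hypothetical helper, not part of dumpDUCK, and it assumes only the comma-separated `ID:Element` format shown in the commands above.

```python
def build_type_map(element_ranges):
    """Build a type-map string from (element, first_type, last_type) tuples.

    Hypothetical helper: ranges are inclusive on both ends.
    """
    pairs = []
    for element, first, last in element_ranges:
        for type_id in range(first, last + 1):
            pairs.append(f"{type_id}:{element}")
    return ",".join(pairs)


# Same mapping as the shell TYPE_MAP above: 2 Zn, 12 H, 12 C and 8 N types.
type_map = build_type_map([
    ("Zn", 1, 2),
    ("H", 3, 14),
    ("C", 15, 26),
    ("N", 27, 34),
])
print(type_map)
```

The printed string can be pasted into `--type-map`, or passed through an environment variable as in the shell version.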

ASE-readable trajectory:

dumpduck convert trajectory.xyz trajectory.h5 --chunk-frames 16

Inspect a file

dumpduck info trajectory.h5

Example output:

file: trajectory.h5
format: dumpduck-hdf5
version: 0.2.0
frames: 100001
atoms: 4352

core datasets:
  positions        shape=(100001, 4352, 3) dtype=float32 chunks=(16, 4352, 3) compression=gzip

properties:
  atomic/shielding_tensors
    shape: (100001, 4352, 3, 3)
    dtype: float32
    valid frames: 2183 / 100001
    units: ppm
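
The shapes and dtypes in this example output are enough to estimate uncompressed sizes, which helps when choosing chunk sizes and compression. The sketch below is plain arithmetic on the numbers printed above; `uncompressed_bytes` is an illustrative helper, not a dumpDUCK function.

```python
import math

# Bytes per element for the dtypes used in this README.
ITEMSIZE = {"float32": 4, "float64": 8}


def uncompressed_bytes(shape, dtype):
    """Size of a dense array with this shape and dtype, before compression."""
    return math.prod(shape) * ITEMSIZE[dtype]


positions = uncompressed_bytes((100001, 4352, 3), "float32")
one_chunk = uncompressed_bytes((16, 4352, 3), "float32")
shielding = uncompressed_bytes((100001, 4352, 3, 3), "float32")

print(f"positions, all frames: {positions / 1e9:.2f} GB")
print(f"positions, one chunk:  {one_chunk / 1e3:.1f} kB")
print(f"shielding, all frames: {shielding / 1e9:.2f} GB")
```

For this file the positions come to roughly 5.2 GB uncompressed, with each 16-frame chunk holding about 836 kB.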

Lazy reading

from dump_duck import H5Trajectory

with H5Trajectory('trajectory.h5') as traj:
    atoms = traj[0]

    for atoms in traj.iter_frames(start=0, stop=1000, step=10):
        print(atoms.info['timestep'], atoms.positions.shape)

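Only the requested frame is loaded into memory; the rest of the trajectory stays on disk. Because HDF5 reads and decompresses whole chunks, a single-frame read touches the chunk that contains that frame.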

Incremental properties

Properties live under /properties/atomic/<name> or /properties/frame/<name>. Each property has:

data   # actual data
valid  # bool mask saying which frames have been written

This allows sparse labelling: the property can exist for all frames, while only a subset has been computed.
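
To make the data/valid pairing concrete, here is a toy in-memory model of one property. It is only a sketch of the idea: `SparseProperty` is a hypothetical class that uses Python lists, whereas dumpDUCK stores both arrays as HDF5 datasets.

```python
class SparseProperty:
    """Toy stand-in for a property's `data` and `valid` arrays."""

    def __init__(self, n_frames):
        self.data = [None] * n_frames    # one slot per frame, initially empty
        self.valid = [False] * n_frames  # True once a frame has been written

    def write(self, frame, value):
        """Store a value and mark the frame as valid."""
        self.data[frame] = value
        self.valid[frame] = True

    def pending(self):
        """Frame indices that still need to be computed."""
        return [i for i, done in enumerate(self.valid) if not done]


prop = SparseProperty(n_frames=5)
prop.write(0, 1.5)
prop.write(3, 2.5)

n_valid = sum(prop.valid)
print(f"valid frames: {n_valid} / {len(prop.valid)}")
print("still to compute:", prop.pending())
```

Because the mask records what has been written, a labelling job that stops partway can be restarted and will skip the frames that are already valid.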

NMR shielding tensors, one frame at a time

from dump_duck import H5Trajectory

with H5Trajectory('trajectory.h5', mode='r+') as traj:
    if not traj.has_property('shielding_tensors', kind='atomic'):
        traj.create_property(
            'shielding_tensors',
            kind='atomic',
            frame_shape=(3, 3),
            dtype='float32',
            units='ppm',
            description='Per-atom NMR shielding tensors',
            compression='gzip',
            compression_level=6,
            chunk_frames=1,
        )

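    for i, atoms in enumerate(traj.iter_frames()):
        # Skip frames labelled in an earlier run, so the loop can be restarted.
        if traj.property_valid('shielding_tensors', i, kind='atomic'):
            continue

        # `calculator` is your own model; it is not provided by dumpDUCK.
        shielding = calculator.predict_shielding_tensors(atoms)  # shape: (n_atoms, 3, 3)
        traj.write_property('shielding_tensors', i, shielding, kind='atomic')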

Chemical shifts

from dump_duck import H5Trajectory

# `shifts` is your own array of per-atom chemical shifts for frame 0.
with H5Trajectory('trajectory.h5', mode='r+') as traj:
    traj.create_property(
        'chemical_shifts',
        kind='atomic',
        frame_shape=(),  # one scalar per atom
        dtype='float32',
        units='ppm',
        description='Per-atom NMR chemical shifts',
    )

    traj.write_property('chemical_shifts', 0, shifts, kind='atomic')  # shape: (n_atoms,)

Frame-wise energies

from dump_duck import H5Trajectory

with H5Trajectory('trajectory.h5', mode='r+') as traj:
    # One scalar per frame, so no frame_shape is needed.
    traj.create_property('energy', kind='frame', dtype='float64', units='eV')
    traj.write_property('energy', 0, 123.4, kind='frame')

Extract frames

dumpduck extract trajectory.h5 frame_1000.xyz --index 1000

To include valid properties in the output as ASE arrays/info entries:

dumpduck extract trajectory.h5 labelled.xyz --start 0 --stop 100 --step 10 --include-properties

Compression notes

Portable built-in options:

none, lzf, gzip

Optional plugin options with dumpduck[compression]:

zstd, blosc-zstd

For MD trajectories, a good default is:

gzip level 6, float32, chunk_frames 16

For single-frame random access, use smaller chunks. For better compression and sequential reading, use larger chunks such as 32 or 64.
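
The trade-off can be estimated by counting how many chunks a read pattern touches. The sketch below assumes chunks span whole frames along the frame axis, as in the `chunks=(16, 4352, 3)` example output, and ignores any chunk caching; `chunks_touched` is an illustrative helper, not part of dumpDUCK.

```python
def chunks_touched(frames, chunk_frames):
    """Number of distinct chunks that contain at least one requested frame."""
    return len({frame // chunk_frames for frame in frames})


single_frame = [1000]          # one random-access read
strided = range(0, 1000, 10)   # every 10th frame of the first 1000

for chunk_frames in (1, 16, 64):
    n_single = chunks_touched(single_frame, chunk_frames)
    n_strided = chunks_touched(strided, chunk_frames)
    print(
        f"chunk_frames={chunk_frames:>2}: "
        f"single read decompresses {n_single * chunk_frames} frame(s), "
        f"strided read decompresses {n_strided * chunk_frames} frames "
        f"in {n_strided} chunks"
    )
```

With `chunk_frames=64`, a single-frame read decompresses 64 frames to return one, while the strided read needs only 16 chunks; with `chunk_frames=1`, the strided read issues 100 separate chunk reads.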

HDF5 layout

/
  atomic_numbers        (n_atoms,)
  ids                   (n_atoms,)
  lammps_types          optional, (n_atoms,)
  mol_ids               optional, (n_atoms,)

  positions             (n_frames, n_atoms, 3)
  cells                 (n_frames, 3, 3)
  pbc                   (n_frames, 3)
  timesteps             (n_frames,)

  properties/
    atomic/
      <name>/
        data            (n_frames, n_atoms, *frame_shape)
        valid           (n_frames,)
    frame/
      <name>/
        data            (n_frames, *frame_shape)
        valid           (n_frames,)
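
The shape rules in this layout can be written as a small function. This is a sketch for checking expected shapes before creating a property; `property_shapes` is a hypothetical helper that only restates the layout above.

```python
def property_shapes(kind, n_frames, n_atoms, frame_shape=()):
    """Expected `data` and `valid` shapes for a property, per the layout above."""
    if kind == "atomic":
        data = (n_frames, n_atoms, *frame_shape)
    elif kind == "frame":
        data = (n_frames, *frame_shape)
    else:
        raise ValueError(f"kind must be 'atomic' or 'frame', got {kind!r}")
    return {"data": data, "valid": (n_frames,)}


n_frames, n_atoms = 100001, 4352

# Per-atom 3x3 tensors, per-atom scalars, and one scalar per frame.
shielding = property_shapes("atomic", n_frames, n_atoms, frame_shape=(3, 3))
shifts = property_shapes("atomic", n_frames, n_atoms)
energy = property_shapes("frame", n_frames, n_atoms)

print(shielding["data"])
print(shifts["data"])
print(energy["data"])
```

The first result matches the `atomic/shielding_tensors` shape in the `dumpduck info` example output.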
