Compact, lazy-readable HDF5 trajectories with incremental atomistic property storage.

dumpDUCK

dumpDUCK stores atomistic trajectories as compact, lazy-readable HDF5 files. It is designed for large MD trajectories where you want to read one frame at a time, and for incremental labelling workflows where new properties are added after the trajectory already exists.

Installation

pip install -e .

Optional Zstandard/Blosc compression:

pip install -e '.[compression]'

Convert a trajectory

LAMMPS dump:

dumpduck convert 4-azif_hda_512FUs_seed4-quenched_1500K_to_300K_rate0.1Kps-equilibrated300K_5ns.dump 4-azif_hda_512FUs_seed4-quenched_1500K_to_300K_rate0.1Kps-equilibrated300K_5ns.h5 \
  --format lammpstrj \
  --type-map '1:C,2:H,3:N,4:Zn' \
  --compression gzip \
  --compression-level 6 \
  --float-dtype float32 \
  --chunk-frames 16

The same conversion with Blosc/Zstandard compression (requires the optional compression extra):

dumpduck convert 4-azif_hda_512FUs_seed4-quenched_1500K_to_300K_rate0.1Kps-equilibrated300K_5ns.dump 4-azif_hda_512FUs_seed4-quenched_1500K_to_300K_rate0.1Kps-equilibrated300K_5ns.h5 \
  --format lammpstrj \
  --type-map '1:C,2:H,3:N,4:Zn' \
  --compression blosc-zstd \
  --compression-level 9 \
  --float-dtype float32 \
  --chunk-frames 221
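
The --type-map value pairs each LAMMPS integer type with an element symbol. The sketch below only illustrates that format. `parse_type_map`, the small `ATOMIC_NUMBERS` table, and the five-atom type column are hypothetical and are not dumpduck's own parser or data.

```python
# Illustrative only: a hypothetical parser for strings like '1:C,2:H,3:N,4:Zn'.
# This is not dumpduck's implementation.

# Atomic numbers for the four elements used in the convert example.
ATOMIC_NUMBERS = {"H": 1, "C": 6, "N": 7, "Zn": 30}


def parse_type_map(spec: str) -> dict[int, str]:
    """Turn comma-separated 'type:symbol' pairs into {type: symbol}."""
    mapping = {}
    for pair in spec.split(","):
        lammps_type, symbol = pair.split(":")
        mapping[int(lammps_type)] = symbol.strip()
    return mapping


type_map = parse_type_map("1:C,2:H,3:N,4:Zn")

# LAMMPS type column of a made-up five-atom frame.
lammps_types = [4, 3, 1, 2, 2]
symbols = [type_map[t] for t in lammps_types]
atomic_numbers = [ATOMIC_NUMBERS[s] for s in symbols]

print(symbols)         # ['Zn', 'N', 'C', 'H', 'H']
print(atomic_numbers)  # [30, 7, 6, 1, 1]
```

A LAMMPS dump records only integer types, so a mapping of this kind is what lets the converted file carry element information.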

ASE-readable trajectory:

dumpduck convert trajectory.xyz trajectory.h5 --chunk-frames 16

Inspect a file

dumpduck info trajectory.h5

Example output:

file: trajectory.h5
format: dumpduck-hdf5
version: 0.2.0
frames: 100001
atoms: 4352

core datasets:
  positions        shape=(100001, 4352, 3) dtype=float32 chunks=(16, 4352, 3) compression=gzip

properties:
  atomic/shielding_tensors
    shape: (100001, 4352, 3, 3)
    dtype: float32
    valid frames: 2183 / 100001
    units: ppm
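
The shapes and dtypes in this output are enough to estimate uncompressed sizes. The arithmetic below is a rough sketch and does not measure anything on disk. `nbytes` is a hypothetical helper, and actual file sizes depend on compression and on how HDF5 allocates chunks (it normally allocates a chunk only when it is first written).

```python
from math import prod


def nbytes(shape, itemsize):
    """Uncompressed size in bytes of an array with this shape and item size."""
    return prod(shape) * itemsize


FLOAT32 = 4  # bytes per float32 value

positions = nbytes((100001, 4352, 3), FLOAT32)     # whole positions dataset
one_chunk = nbytes((16, 4352, 3), FLOAT32)         # one (16, 4352, 3) chunk
tensor_frame = nbytes((4352, 3, 3), FLOAT32)       # shielding tensors, one frame
labelled = 2183 * tensor_frame                     # only the valid frames
dense = nbytes((100001, 4352, 3, 3), FLOAT32)      # if every frame were written

print(positions)  # 5222452224  (about 5.2 GB)
print(one_chunk)  # 835584      (about 0.8 MB)
print(labelled)   # 342014976   (about 0.34 GB)
print(dense)      # 15667356672 (about 15.7 GB)
```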

Lazy reading

from dump_duck import H5Trajectory

with H5Trajectory('trajectory.h5') as traj:
    atoms = traj[0]

    for atoms in traj.iter_frames(start=0, stop=1000, step=10):
        print(atoms.info['timestep'], atoms.positions.shape)

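Only the requested frame is loaded into memory. Because HDF5 reads and decompresses whole chunks, each access touches the chunk that contains the frame.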
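
To see how chunking interacts with strided reads, the sketch below counts the chunks that hold the frames requested in the loop above. It assumes `iter_frames` follows Python `range` semantics (stop exclusive) and that chunks start at frame 0. `chunks_touched` is a hypothetical helper, not part of dumpduck.

```python
def chunks_touched(start, stop, step, chunk_frames):
    """Sorted chunk indices that hold the frames range(start, stop, step)."""
    return sorted({frame // chunk_frames for frame in range(start, stop, step)})


frames = range(0, 1000, 10)

# Chunk sizes taken from the two convert examples: 16 and 221 frames.
small = chunks_touched(0, 1000, 10, chunk_frames=16)
large = chunks_touched(0, 1000, 10, chunk_frames=221)

print(len(frames))      # 100 frames requested
print(len(small))       # 62 chunks of 16 frames
print(len(small) * 16)  # 992 frames decompressed in total
print(len(large))       # 5 chunks of 221 frames
```

With a step smaller than the chunk length, every chunk in the range is touched, so a strided read decompresses nearly as much data as a sequential one.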

Incremental properties

Properties live under /properties/atomic/<name> or /properties/frame/<name>. Each property has:

data   # actual data
valid  # bool mask saying which frames have been written

This allows sparse labelling: the property can exist for all frames, while only a subset has been computed.
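
The data/valid pairing can be pictured with a small in-memory stand-in. This is a sketch of the idea only: `SparseProperty` is hypothetical and uses plain lists, whereas dumpduck keeps both arrays in the HDF5 file.

```python
class SparseProperty:
    """In-memory stand-in for a data/valid pair (not dumpduck's class)."""

    def __init__(self, n_frames):
        self.data = [None] * n_frames    # one slot per frame
        self.valid = [False] * n_frames  # True once a frame has been written

    def write(self, frame, value):
        self.data[frame] = value
        self.valid[frame] = True

    def pending(self):
        """Frame indices that still need to be computed."""
        return [i for i, done in enumerate(self.valid) if not done]


prop = SparseProperty(n_frames=6)
prop.write(0, 1.5)
prop.write(3, 2.5)

first_pass = prop.pending()
print(first_pass)  # [1, 2, 4, 5]

# A later run fills in only the missing frames.
for i in first_pass:
    prop.write(i, 0.0)

print(all(prop.valid))  # True
```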

NMR shielding tensors, one frame at a time

from dump_duck import H5Trajectory

with H5Trajectory('trajectory.h5', mode='r+') as traj:
    if not traj.has_property('shielding_tensors', kind='atomic'):
        traj.create_property(
            'shielding_tensors',
            kind='atomic',
            frame_shape=(3, 3),
            dtype='float32',
            units='ppm',
            description='Per-atom NMR shielding tensors',
            compression='gzip',
            compression_level=6,
            chunk_frames=1,
        )

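    for i, atoms in enumerate(traj.iter_frames()):
        # Skip frames labelled in an earlier run, so the loop can be resumed.
        if traj.property_valid('shielding_tensors', i, kind='atomic'):
            continue

        # `calculator` is a user-supplied model; it is not part of dumpduck.
        shielding = calculator.predict_shielding_tensors(atoms)  # shape: (n_atoms, 3, 3)
        traj.write_property('shielding_tensors', i, shielding, kind='atomic')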

Chemical shifts

from dump_duck import H5Trajectory

with H5Trajectory('trajectory.h5', mode='r+') as traj:
    traj.create_property(
        'chemical_shifts',
        kind='atomic',
        frame_shape=(),  # one scalar per atom
        dtype='float32',
        units='ppm',
        description='Per-atom NMR chemical shifts',
    )

    # `shifts` is a user-supplied array with one value per atom.
    traj.write_property('chemical_shifts', 0, shifts, kind='atomic')  # shape: (n_atoms,)
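
The `shifts` array written above holds one scalar per atom. One common way to get such values from shielding tensors is to take the isotropic shielding (one third of the trace) and subtract it from a reference shielding. The sketch below assumes that convention. `SIGMA_REF` and the two tensors are made-up placeholder numbers, and none of this is claimed to be what dumpduck or any particular calculator does.

```python
def isotropic(tensor):
    """Isotropic shielding: one third of the trace of a 3x3 tensor."""
    return (tensor[0][0] + tensor[1][1] + tensor[2][2]) / 3.0


def chemical_shift(sigma_iso, sigma_ref):
    """Common approximation: delta = sigma_ref - sigma_iso, in ppm."""
    return sigma_ref - sigma_iso


SIGMA_REF = 30.0  # placeholder reference shielding in ppm, not a real value

# Two made-up per-atom shielding tensors in ppm.
tensors = [
    [[28.0, 0.5, 0.0], [0.5, 30.0, 0.0], [0.0, 0.0, 32.0]],
    [[24.0, 0.0, 0.0], [0.0, 25.0, 0.0], [0.0, 0.0, 26.0]],
]

shifts = [chemical_shift(isotropic(t), SIGMA_REF) for t in tensors]
print(shifts)  # [0.0, 5.0]
```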

Frame-wise energies

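from dump_duck import H5Trajectory

with H5Trajectory('trajectory.h5', mode='r+') as traj:
    # Frame-wise properties hold one value per frame, here a scalar energy.
    traj.create_property('energy', kind='frame', dtype='float64', units='eV')
    traj.write_property('energy', 0, 123.4, kind='frame')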

Extract frames

dumpduck extract trajectory.h5 frame_1000.xyz --index 1000

To include valid properties as ASE arrays/info entries:

dumpduck extract trajectory.h5 labelled.xyz --start 0 --stop 100 --step 10 --include-properties

Compression notes

Portable built-in options:

none, lzf, gzip

Optional plugin options, available after installing dumpduck[compression]:

zstd, blosc-zstd

For MD trajectories, a good default is:

gzip level 6, float32, chunk_frames 16

For single-frame random access, use smaller chunks. For better compression and sequential reading, use larger chunks such as 32 or 64.
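
The float32 default halves the raw size of every stored value compared with float64, at the cost of rounding. The sketch below uses only the standard library to show the size of that rounding for a single made-up coordinate. `as_float32` is a hypothetical helper, and whether the precision is acceptable depends on your analysis.

```python
import struct


def as_float32(x):
    """Round a Python float (float64) to the nearest float32 and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


coord = 123.456789  # made-up coordinate
stored = as_float32(coord)
error = abs(stored - coord)

print(struct.calcsize("<f"), struct.calcsize("<d"))  # 4 8 (bytes per value)
print(error < 1e-5)                                  # True
```

For values between 64 and 128, adjacent float32 numbers are 2**-17 (about 7.6e-6) apart, so the rounding error stays below about 4e-6 in the same units as the coordinate.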

HDF5 layout

/
  atomic_numbers        (n_atoms,)
  ids                   (n_atoms,)
  lammps_types          optional, (n_atoms,)
  mol_ids               optional, (n_atoms,)

  positions             (n_frames, n_atoms, 3)
  cells                 (n_frames, 3, 3)
  pbc                   (n_frames, 3)
  timesteps             (n_frames,)

  properties/
    atomic/
      <name>/
        data            (n_frames, n_atoms, *frame_shape)
        valid           (n_frames,)
    frame/
      <name>/
        data            (n_frames, *frame_shape)
        valid           (n_frames,)
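
The shape rules in this layout can be written as a short function. `property_shapes` is a hypothetical helper for illustration and is not part of dumpduck. The frame and atom counts are taken from the example output of dumpduck info.

```python
def property_shapes(kind, n_frames, n_atoms, frame_shape=()):
    """Dataset shapes for one property, following the layout above."""
    if kind == "atomic":
        data = (n_frames, n_atoms, *frame_shape)
    elif kind == "frame":
        data = (n_frames, *frame_shape)
    else:
        raise ValueError(f"kind must be 'atomic' or 'frame', got {kind!r}")
    return {"data": data, "valid": (n_frames,)}


tensors = property_shapes("atomic", 100001, 4352, frame_shape=(3, 3))
shifts = property_shapes("atomic", 100001, 4352)
energy = property_shapes("frame", 100001, 4352)

print(tensors["data"])  # (100001, 4352, 3, 3)
print(shifts["data"])   # (100001, 4352)
print(energy["data"])   # (100001,)
print(energy["valid"])  # (100001,)
```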
