
Create, Run and Benchmark DVC Pipelines in Python

ZnTrack: Make Your Python Code Reproducible!

ZnTrack (zɪŋk træk) is a lightweight and easy-to-use Python package for converting your existing Python code into reproducible workflows. By structuring your code as a directed graph with well-defined inputs and outputs, ZnTrack ensures reproducibility, scalability, and ease of collaboration.

Key Features

  • Reproducible Workflows: Convert Python scripts into reproducible workflows with minimal effort.
  • Parameter, Output, and Metric Tracking: Easily track parameters, outputs, and metrics in your Python code.
  • Shareable and Collaborative: Collaborate with your team through Git. Share your workflows, reuse parts of them in other projects, or distribute them as Python packages.
  • DVC Integration: ZnTrack is built on top of DVC for version control and experiment management and seamlessly integrates into the DVC ecosystem.

Example: Molecular Dynamics Workflow

Let’s take a workflow that constructs a periodic, atomistic system of ethanol and runs a geometry optimization using MACE-MP-0.

Original Workflow

from ase.optimize import LBFGS
from mace.calculators import mace_mp
from rdkit2ase import pack, smiles2conformers

model = mace_mp()

frames = smiles2conformers(smiles="CCO", numConfs=32)
box = pack(data=[frames], counts=[32], density=789)

box.calc = model

dyn = LBFGS(box, trajectory="optim.traj")
dyn.run(fmax=0.5)

Dependencies

For this example to work, you will need:
  • https://github.com/ACEsuit/mace
  • https://github.com/m3g/packmol
  • https://github.com/zincware/rdkit2ase
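
As a rough plausibility check on the values used above, the following sketch estimates the box size implied by 32 ethanol molecules at `density=789`. It assumes the density is given in kg/m^3 (789 kg/m^3 is close to the density of liquid ethanol) and that the box is cubic. The helper `cubic_box_edge` is hypothetical and is not part of rdkit2ase or ZnTrack.

```python
# Back-of-the-envelope estimate of the box size implied by the example values.
# Assumptions: `density` is in kg/m^3 and the simulation box is cubic.
AVOGADRO = 6.02214076e23  # 1/mol
ETHANOL_MOLAR_MASS = 46.069  # g/mol for C2H6O


def cubic_box_edge(n_molecules: int, molar_mass: float, density: float) -> float:
    """Return the edge length in Angstrom of a cubic box with the given density."""
    mass_kg = n_molecules * molar_mass / AVOGADRO / 1000.0  # g -> kg
    volume_m3 = mass_kg / density
    return (volume_m3 * 1e30) ** (1.0 / 3.0)  # 1 m^3 = 1e30 Angstrom^3


edge = cubic_box_edge(n_molecules=32, molar_mass=ETHANOL_MOLAR_MASS, density=789)
print(f"{edge:.1f} Angstrom")
```

Under these assumptions the edge length comes out at roughly 14.6 Angstrom, a small periodic cell that is cheap to optimize.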

Converted Workflow with ZnTrack

To make this workflow reproducible, we convert it into a directed graph structure where each step is represented as a Node. Nodes define their inputs, outputs, and the computational logic to execute. Here's the graph structure for our example:

flowchart LR

Smiles2Conformers --> Pack --> StructureOptimization
MACE_MP --> StructureOptimization
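
To illustrate what the graph structure implies for execution order, here is a minimal sketch using only the Python standard library (`graphlib`, Python 3.9+). It is not how ZnTrack or DVC schedule work internally; it only shows that every Node can run once all of its dependencies have finished.

```python
from graphlib import TopologicalSorter

# Map each node to the set of nodes it depends on (its predecessors),
# mirroring the flowchart above.
graph = {
    "Smiles2Conformers": set(),
    "MACE_MP": set(),
    "Pack": {"Smiles2Conformers"},
    "StructureOptimization": {"Pack", "MACE_MP"},
}

# static_order() yields the nodes so that dependencies always come first.
order = list(TopologicalSorter(graph).static_order())
print(order)
```

The relative order of independent nodes (here `Smiles2Conformers` and `MACE_MP`) is not fixed by the graph, but `StructureOptimization` always comes last.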

Node Definitions

In ZnTrack, each Node is defined as a Python class. The class attributes define the inputs (parameters and dependencies) and outputs, while the run method contains the computational logic to be executed.

Note: ZnTrack uses Python dataclasses under the hood, providing an automatic __init__ method. Starting from Python 3.11, most IDEs should reliably provide type hints for ZnTrack Nodes.
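
For readers unfamiliar with dataclasses, the following standard-library sketch shows the behaviour this refers to: annotated class attributes become keyword arguments of a generated `__init__`. The class `ConformerSettings` is a hypothetical stand-in and not a ZnTrack Node; ZnTrack's field functions such as `zntrack.params()` add tracking on top of this mechanism.

```python
from dataclasses import dataclass, fields


@dataclass
class ConformerSettings:  # hypothetical stand-in, not a ZnTrack Node
    smiles: str  # required: no default value
    numConfs: int = 32  # optional: falls back to the default


# The generated __init__ accepts the fields as keyword arguments.
settings = ConformerSettings(smiles="CCO")
print(settings)
print([f.name for f in fields(settings)])
```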

Tip: For files produced during the run method, ZnTrack provides a unique Node Working Directory (zntrack.nwd). Always store such files in this directory to ensure reproducibility and avoid conflicts between Nodes.

from dataclasses import dataclass
from pathlib import Path

import ase.io
from ase.optimize import LBFGS
from mace.calculators import mace_mp
from rdkit2ase import pack, smiles2conformers

import zntrack


class Smiles2Conformers(zntrack.Node):
    smiles: str = zntrack.params()  # A required parameter
    numConfs: int = zntrack.params(32)  # A parameter with a default value

    frames_path: Path = zntrack.outs_path(zntrack.nwd / "frames.xyz")  # Output file path

    def run(self) -> None:
        frames = smiles2conformers(smiles=self.smiles, numConfs=self.numConfs)
        ase.io.write(self.frames_path, frames)

    @property
    def frames(self) -> list[ase.Atoms]:
        # Load the frames from the output file using the node's filesystem
        with self.state.fs.open(self.frames_path, "r") as f:
            return list(ase.io.iread(f, ":", format="extxyz"))


class Pack(zntrack.Node):
    data: list[list[ase.Atoms]] = zntrack.deps()  # Input dependency (list of lists of ASE Atoms)
    counts: list[int] = zntrack.params()  # Parameter (list of counts)
    density: float = zntrack.params()  # Parameter (density value)

    frames_path: Path = zntrack.outs_path(zntrack.nwd / "frames.xyz")  # Output file path

    def run(self) -> None:
        box = pack(data=self.data, counts=self.counts, density=self.density)
        ase.io.write(self.frames_path, box)

    @property
    def frames(self) -> list[ase.Atoms]:
        # Load the packed structure from the output file
        with self.state.fs.open(self.frames_path, "r") as f:
            return list(ase.io.iread(f, ":", format="extxyz"))


# We could hardcode the MACE_MP model into the StructureOptimization Node, but we
# can also define it as a dependency. Since the model doesn't require a `run` method,
# we define it as a `@dataclass`.


@dataclass
class MACE_MP:
    model: str = "medium"  # Default model type

    def get_calculator(self, **kwargs):
        # Forward any extra keyword arguments to the MACE-MP calculator
        return mace_mp(model=self.model, **kwargs)


class StructureOptimization(zntrack.Node):
    model: MACE_MP = zntrack.deps()  # Dependency (MACE_MP model)
    data: list[ase.Atoms] = zntrack.deps()  # Dependency (list of ASE Atoms)
    data_id: int = zntrack.params()  # Parameter (index of the structure to optimize)
    fmax: float = zntrack.params(0.05)  # Parameter (force convergence threshold)

    frames_path: Path = zntrack.outs_path(zntrack.nwd / "frames.traj")  # Output file path

    def run(self) -> None:
        atoms = self.data[self.data_id]
        atoms.calc = self.model.get_calculator()
        dyn = LBFGS(atoms, trajectory=self.frames_path.as_posix())
        dyn.run(fmax=self.fmax)  # Use the tracked parameter, not a hardcoded value

    @property
    def frames(self) -> list[ase.Atoms]:
        # Load the optimization trajectory from the output file
        with self.state.fs.open(self.frames_path, "rb") as f:
            return list(ase.io.iread(f, ":", format="traj"))

Building and Running the Workflow

Now that we’ve defined all the necessary Nodes, we can build and execute the workflow. Follow these steps:

  1. Initialize a new directory for your project:

    git init
    dvc init
    
  2. Create a Python module for the Node definitions:

    • Create a file src/__init__.py and place the Node definitions inside it.
  3. Define and execute the workflow in a main.py file:

     from src import MACE_MP, Pack, Smiles2Conformers, StructureOptimization
    
     import zntrack
    
     # Initialize the ZnTrack project
     project = zntrack.Project()
    
     # Define the MACE-MP model
     model = MACE_MP()
    
     # Build the workflow graph
     with project:
         etoh = Smiles2Conformers(smiles="CCO", numConfs=32)
         box = Pack(data=[etoh.frames], counts=[32], density=789)
         optm = StructureOptimization(model=model, data=box.frames, data_id=-1, fmax=0.5)
    
     # Execute the workflow
     project.repro()
    

Tip: If you don’t want to execute the graph immediately, use project.build() instead. You can run the graph later using dvc repro or the paraffin package.

Accessing Results

Once the workflow has been executed, the results are stored in the respective files. For example, the optimized trajectory is saved in nodes/StructureOptimization/frames.traj.
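
The path above suggests a layout of one directory per Node below `nodes/`. The sketch below only spells out that pattern. The helper `node_output_path` is hypothetical, and the actual location is determined by `zntrack.nwd`, which may differ from this simple pattern in other setups.

```python
from pathlib import Path


def node_output_path(node_name: str, filename: str, root: str = "nodes") -> Path:
    """Hypothetical helper: build `<root>/<node_name>/<filename>`."""
    return Path(root) / node_name / filename


path = node_output_path("StructureOptimization", "frames.traj")
print(path.as_posix())
```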

You can load the results directly using ZnTrack, without worrying about file paths or formats:

import zntrack

# Load the StructureOptimization Node
optm = zntrack.from_rev(name="StructureOptimization")
# You can also pass `remote: str` and `rev: str` to access data from
# a remote repository or a different commit.

# Access the optimization trajectory
print(optm.frames)

More Examples

For additional examples and advanced use cases, check out these packages built on top of ZnTrack:

  • mlipx - Machine Learned Interatomic Potential eXploration.
  • IPSuite - Machine Learned Interatomic Potential Tools.

References

If you use ZnTrack in your research, please cite us:

@misc{zillsZnTrackDataCode2024,
  title = {{{ZnTrack}} -- {{Data}} as {{Code}}},
  author = {Zills, Fabian and Sch{\"a}fer, Moritz and Tovey, Samuel and K{\"a}stner, Johannes and Holm, Christian},
  year = {2024},
  eprint = {2401.10603},
  archivePrefix = {arXiv},
}

Copyright

This project is distributed under the Apache License Version 2.0.


Similar Tools

Here’s a list of other projects that either work together with ZnTrack or achieve similar results with slightly different goals or programming languages:

  • DVC - Main dependency of ZnTrack for Data Version Control.
  • dvthis - Introduce DVC to R.
  • DAGsHub Client - Logging parameters from within Python.
  • MLFlow - A Machine Learning Lifecycle Platform.
  • Metaflow - A framework for real-life data science.
  • Hydra - A framework for elegantly configuring complex applications.
  • Snakemake - Workflow management system for reproducible and scalable data analyses.
