A framework for integrating Hydra/DVC/MLflow for reproducible ML experiments.

Framework Project Documentation (zendag/README.md)

ZenDag

ZenDag is a Python framework designed to streamline Machine Learning experimentation workflows by integrating:

Configuration Management: Hydra and Hydra-Zen for modular, reusable, and composable configuration-as-code.
Pipeline Orchestration & Versioning: DVC for defining experiment pipelines (DAGs) and versioning data, artifacts, and models.
Experiment Tracking: MLflow for logging parameters, metrics, artifacts, and comparing runs.

The core idea is to drive the DVC pipeline definition directly from your Hydra configurations, minimizing redundancy and ensuring consistency between your code, configuration, and the execution pipeline.

Core Concepts

Configuration as Code: Define all aspects of your experiment (data sources, preprocessing steps, model architecture, training parameters, evaluation metrics, logger settings) using Python code via Hydra-Zen and store them in a structured way (e.g., using hydra_zen.ZenStore).
Stage-Based Pipelines: Structure your ML workflow into logical stages (e.g., data_prep, feature_eng, train, evaluate, deploy). Each stage corresponds to a node in the DVC pipeline graph.
Automatic DAG Generation: ZenDag automatically generates the dvc.yaml file. It discovers dependencies (deps) and outputs (outs) by inspecting your Hydra configurations during a resolution step. You declare these using ${deps:...} and ${outs:...} interpolations directly within your configuration values (e.g., file paths).
Integrated Experiment Tracking: A simple decorator (@zendag.mlflow_run) wraps your stage execution functions to automatically handle MLflow setup, log parameters from the Hydra config, capture artifacts (including logs and the config itself), and manage nested runs within a parent pipeline run.
Environment & Task Management: While ZenDag itself is framework-agnostic regarding environment management, it's designed to work seamlessly with tools like Pixi or Conda/Poetry. A Cookiecutter template is provided to quickly set up a project using Pixi.

Installation

pip install zendag # Or install from source/git if needed

API Reference

zendag.core.configure_pipeline(...)

def configure_pipeline(
    store: hydra_zen.ZenStore,
    stage_groups: List[str],
    stage_dir_fn: Callable[[str, str], str] = default_stage_dir_fn,
    configs_dir_fn: Callable[[str], str] = default_configs_dir_fn,
    dvc_filename: str = "dvc.yaml",
    run_script: str = "zendag.run",
    config_root: Optional[str] = None,
) -> None:
    ...  # Body elided; see the function's docstring for the full description.

Purpose: The main function to generate the dvc.yaml file.
How it works:
- Iterates through specified stage_groups in the hydra_zen.ZenStore.
- For each configuration (name) within a stage group (stage):
  - Composes the full Hydra config (e.g., hydra.compose(overrides=[f"+{stage}={name}"])).
  - Writes the composed config to <configs_dir_fn(stage)>/.yaml. This file is tracked as a param by DVC.
  - Registers temporary Hydra resolvers for ${deps:...} and ${outs:...}.
  - Calls OmegaConf.resolve(cfg). During resolution, any ${deps:path} or ${outs:path} encountered trigger the resolvers, which append the path to internal lists (side-effect).
  - Collects the unique dependencies and outputs discovered during resolution.
  - Defines a DVC stage entry in a dictionary (e.g., stages['stage/name'] = {...}). The cmd calls the specified run_script using the composed config. deps, outs, and params are populated.
- Writes the complete stage dictionary to the dvc_filename.
Logging: Provides INFO and DEBUG level logs about the process, including discovered deps/outs. Configure Python's logging to see these.
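
As a rough illustration of the last two steps, the sketch below assembles one stage entry from the paths collected during resolution. It is not ZenDag's actual code: the helper name build_stage_entry, the artifacts/<stage>/<name>.yaml config location (borrowed from the recommended project layout), and the shape of the cmd string are all assumptions made for the example.

```python
from typing import Dict, List


def build_stage_entry(
    stage: str,
    name: str,
    deps: List[str],
    outs: List[str],
    run_script: str = "zendag.run",
) -> Dict[str, object]:
    """Assemble one DVC stage entry (hypothetical helper, not ZenDag's API)."""
    # Assumed location of the composed config, mirroring the project layout.
    config_path = f"artifacts/{stage}/{name}.yaml"
    return {
        # Assumed command shape; the command ZenDag really emits may differ.
        "cmd": f"python -m {run_script} --config {config_path}",
        # dict.fromkeys drops duplicates while keeping discovery order.
        "deps": list(dict.fromkeys(deps)),
        "outs": list(dict.fromkeys(outs)),
        # A file name mapped to an empty value tracks the whole params file.
        "params": [{config_path: None}],
    }


stages = {
    "training/config_b": build_stage_entry(
        "training",
        "config_b",
        # The same dependency discovered twice collapses to one entry.
        deps=["artifacts/data_prep/config_a/processed_data.parquet"] * 2,
        outs=["artifacts/training/config_b/model.onnx"],
    )
}
```

Serializing a dictionary of this shape under a top-level stages key is what would produce the dvc.yaml content.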

zendag.config_utils.deps_path(...) & zendag.config_utils.outs_path(...)

def deps_path(s: str, input_stage: Optional[str] = None, input_name: Optional[str] = None, stage_dir_fn=None) -> str:
    ...  # body elided

def outs_path(s: str) -> str:
    ...  # body elided

Purpose: These functions format strings suitable for Hydra interpolation to declare DVC dependencies and outputs within your configuration values.
Mechanism: They return strings like "${deps:path/to/dependency}" or "${outs:path/to/output}". When configure_pipeline calls OmegaConf.resolve, the registered resolvers receive the text after the deps: or outs: prefix and record that path (path/to/dependency or path/to/output) for dvc.yaml generation. Each resolver also returns the path it was given (the k in lambda k: current_list.append(k) or k), so the config value itself resolves to the intended path after interpolation (relative to the stage's output directory for outs).
Usage: Use these inside your Hydra-Zen configurations where file paths are defined:

    from zendag.config_utils import deps_path, outs_path
    from hydra_zen import builds

    DataConfig = builds(
        MyDataset,
        data_file=deps_path("raw_data.csv", input_stage="data_fetch", input_name="fetch_europe"),
        processed_file=outs_path("processed_data.parquet"),
        # Need stage_dir_fn for deps_path resolution if using input_stage/name
        zen_meta=dict(stage_dir_fn=my_stage_dir_function) # Or rely on default/global
    )
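
To make the mechanism concrete, here is a minimal sketch of what such helpers could look like. The sketch_ prefix marks these as illustrative re-implementations rather than ZenDag's own functions, and the default stage directory artifacts/<stage>/<name> is an assumption taken from the recommended project layout; the real helpers may build paths differently.

```python
from typing import Callable, Optional


def default_stage_dir(stage: str, name: str) -> str:
    # Assumed layout: artifacts/<stage>/<name>, as in the recommended project tree.
    return f"artifacts/{stage}/{name}"


def sketch_outs_path(s: str) -> str:
    """Wrap a path in an ${outs:...} interpolation string."""
    return "${outs:" + s + "}"


def sketch_deps_path(
    s: str,
    input_stage: Optional[str] = None,
    input_name: Optional[str] = None,
    stage_dir_fn: Callable[[str, str], str] = default_stage_dir,
) -> str:
    """Wrap a path in a ${deps:...} interpolation string."""
    if input_stage is not None and input_name is not None:
        # Point at a file inside another stage's output directory.
        s = f"{stage_dir_fn(input_stage, input_name)}/{s}"
    return "${deps:" + s + "}"


print(sketch_deps_path("raw_data.csv", "data_fetch", "fetch_europe"))
print(sketch_outs_path("processed_data.parquet"))
```

The helpers only format strings; nothing is recorded until the interpolation is resolved later.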

@zendag.mlflow_utils.mlflow_run(...)

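import os
from omegaconf import DictConfig
from zendag.mlflow_utils import mlflow_run
@mlflow_run(project_name=os.environ.get("MLFLOW_PROJECT_NAME", "DefaultProject"))  # the value shown is the default
def my_training_stage(cfg: DictConfig):
    ...  # stage logic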

Purpose: Decorator for your main stage functions.
Functionality:
- Sets the MLflow experiment.
- Handles parent/nested MLflow runs using .pipeline_id and DVC_STAGE env var.
- If run via DVC (DVC_STAGE is set), loads the corresponding composed Hydra config (artifacts//.yaml).
- Logs parameters from the resolved Hydra config to the nested MLflow run.
- Logs the composed config .yaml file as an artifact.
- Executes the decorated function.
- Logs the run.log file from the Hydra output directory as an artifact on success or failure.
- Manages exceptions and MLflow run states.
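
The order of operations listed above can be sketched with the standard library alone. The block below is not ZenDag's implementation and does not call MLflow: RunRecorder is a hypothetical stand-in for a tracking client, and flatten_params and tracked_run are illustrative names. It assumes nested config values are flattened to dotted keys before being logged, and it shows how a finally block lets the log file be attached whether the stage succeeds or fails.

```python
import functools
from typing import Any, Callable, Dict, List, Tuple


class RunRecorder:
    """Hypothetical stand-in for a tracking client; it only records calls."""

    def __init__(self) -> None:
        self.events: List[Tuple[str, Any]] = []

    def log(self, kind: str, payload: Any) -> None:
        self.events.append((kind, payload))


def flatten_params(cfg: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten a nested config into dotted keys such as "model.lr"."""
    flat: Dict[str, Any] = {}
    for key, value in cfg.items():
        full_key = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_params(value, prefix=f"{full_key}."))
        else:
            flat[full_key] = value
    return flat


def tracked_run(recorder: RunRecorder) -> Callable:
    """Decorator factory that logs params, runs the stage, then logs the outcome."""

    def decorator(fn: Callable[[Dict[str, Any]], Any]) -> Callable:
        @functools.wraps(fn)
        def wrapper(cfg: Dict[str, Any]) -> Any:
            recorder.log("params", flatten_params(cfg))
            status = "FAILED"
            try:
                result = fn(cfg)
                status = "FINISHED"
                return result
            finally:
                # Runs on success and on failure, so the log is always attached.
                recorder.log("artifact", "run.log")
                recorder.log("status", status)

        return wrapper

    return decorator


recorder = RunRecorder()


@tracked_run(recorder)
def train(cfg: Dict[str, Any]) -> float:
    return cfg["model"]["lr"] * 2


result = train({"model": {"lr": 0.01, "layers": 3}, "seed": 7})

failing = RunRecorder()


@tracked_run(failing)
def broken(cfg: Dict[str, Any]) -> None:
    raise ValueError("stage failed")


try:
    broken({})
except ValueError:
    pass  # The exception still propagates after the outcome is recorded.
```

In both cases the recorder ends with an artifact event followed by a status event, which is the behavior the decorator is described as providing for run.log.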

Recommended Project Structure (See Cookiecutter Template)

my_project/
├── artifacts/             # DVC-managed outputs (configs, logs, models...)
│   ├── data_prep/
│   │   ├── config_a.yaml
│   │   └── config_a/      # Stage output dir
│   │       └── run.log
│   └── training/
│       ├── config_b.yaml
│       └── config_b/
│           ├── checkpoints/
│           ├── model.onnx
│           └── run.log
├── configs/               # Hydra-Zen config definitions (structured)
│   ├── __init__.py
│   ├── common.py
│   ├── data.py
│   ├── model.py
│   └── training.py
├── data/                  # Raw data (potentially DVC-managed)
├── src/                   # Project source code
│   └── my_project_pkg/
│       ├── __init__.py
│       ├── stages/        # Stage logic functions (decorated)
│       │   ├── __init__.py
│       │   ├── data_prep.py
│       │   └── train.py
│       └── utils.py       # Utility functions
├── tests/                 # Unit/integration tests
├── .dvc/                  # DVC internal files
├── .dvcignore
├── .gitignore
├── .pipeline_id           # Stores current parent MLflow run ID (auto-managed)
├── configure.py           # Script to run zendag.configure_pipeline
├── dvc.yaml               # Generated by configure.py (defines pipeline)
├── pixi.toml              # Environment and task definitions (Pixi)
└── README.md

configs/: Organize your Hydra-Zen builds calls here, grouped by functionality (data, model, trainer, logger, etc.). Use hydra_zen.make_custom_builds_fn for brevity. Import these into configure.py.
src/my_project_pkg/stages/: Implement the core logic for each pipeline stage here. Decorate the main function for each stage with @zendag.mlflow_run. These functions typically accept the Hydra DictConfig as an argument.
configure.py: The script that:
- Imports configs from configs/.
- Populates a hydra_zen.ZenStore.
- Defines the list of stage_groups to process.
- Calls zendag.core.configure_pipeline(store, stage_groups, ...).
pixi.toml: Defines the environment (dependencies like python, dvc, mlflow, hydra-core, hydra-zen, zendag, your src package) and tasks (configure, pipeline, save, etc.).

How Automatic DAG Generation Works Internally

The key is the interaction between configure_pipeline, OmegaConf.resolve, and the custom deps/outs resolvers:

1. configure_pipeline registers temporary resolvers for deps and outs just before calling OmegaConf.resolve(cfg) for a specific stage config.
2. The resolvers are simple lambdas, e.g., lambda k: my_list.append(k) or k.
3. OmegaConf.resolve encounters ${deps:some/path} within the config structure:
   - It calls the deps resolver with k = "some/path".
   - The resolver appends "some/path" to the current_deps list (side-effect).
   - The resolver returns k ("some/path").
   - OmegaConf uses this returned value to replace the ${deps:some/path} interpolation.
4. The same happens for ${outs:other/path}.
5. After OmegaConf.resolve(cfg) finishes, the current_deps and current_outs lists contain all paths discovered via these interpolations for that specific stage configuration.
6. These lists are then used to populate the deps and outs fields in the generated dvc.yaml.

This avoids manual duplication of paths between the config where they are used and the DVC pipeline definition.
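
The side-effect trick can be reproduced in a few lines. The sketch below deliberately avoids OmegaConf so that it runs with the standard library only: toy_resolve is a hypothetical, regex-based stand-in for OmegaConf.resolve that understands just the ${name:arg} form, and the config contents are invented for illustration.

```python
import re
from typing import Any, Callable, Dict, List

# Matches ${name:argument}; a toy subset of OmegaConf's interpolation grammar.
_INTERPOLATION = re.compile(r"\$\{(\w+):([^}]*)\}")


def toy_resolve(node: Any, resolvers: Dict[str, Callable[[str], str]]) -> Any:
    """Recursively replace ${name:arg} with resolvers[name](arg)."""
    if isinstance(node, dict):
        return {key: toy_resolve(value, resolvers) for key, value in node.items()}
    if isinstance(node, list):
        return [toy_resolve(item, resolvers) for item in node]
    if isinstance(node, str):
        return _INTERPOLATION.sub(lambda m: resolvers[m.group(1)](m.group(2)), node)
    return node


current_deps: List[str] = []
current_outs: List[str] = []

# list.append returns None, so "append(k) or k" records k and then returns it.
resolvers = {
    "deps": lambda k: current_deps.append(k) or k,
    "outs": lambda k: current_outs.append(k) or k,
}

cfg = {
    "data": {
        "data_file": "${deps:artifacts/data_fetch/fetch_europe/raw_data.csv}",
        "processed_file": "${outs:processed_data.parquet}",
    },
    "batch_size": 32,
}

resolved = toy_resolve(cfg, resolvers)
```

After the call, current_deps and current_outs hold the collected paths, while resolved contains plain path strings in place of the interpolations.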

4. Cookiecutter Template Documentation (`README.md` for template users)

License

Apache 2.0

Release history

- 0.1.5: Jul 4, 2025
- 0.1.4: Jun 3, 2025
- 0.1.3: May 22, 2025
- 0.1.2: May 22, 2025
- 0.1.0: May 20, 2025 (this version)

Download files

Source Distribution

zendag-0.1.0.tar.gz (16.4 kB)

Uploaded May 20, 2025 Source

Built Distribution

zendag-0.1.0-py3-none-any.whl (20.3 kB)

Uploaded May 20, 2025 Python 3

File details

Details for the file zendag-0.1.0.tar.gz.

File metadata

Download URL: zendag-0.1.0.tar.gz
Upload date: May 20, 2025
Size: 16.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for zendag-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`8e270548a15b04de8f47d187499f5b71b3b11ec7f061c6ff0c749a9c94f55269`
MD5	`3e4c382ddc012206c110138656af41f2`
BLAKE2b-256	`00a1e28d40f94a4bf85c4331b22072011077fa2b3ac50cf858e3253fe4913ee9`

File details

Details for the file zendag-0.1.0-py3-none-any.whl.

File metadata

Download URL: zendag-0.1.0-py3-none-any.whl
Upload date: May 20, 2025
Size: 20.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for zendag-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`44f62545935ac42c12bdb34b3a1823f17da4d2735cfc372de03cce2b8f6fabbf`
MD5	`5e3b1d1c54870f7121cdc7eb37db8341`
BLAKE2b-256	`a805cfc67b90bcba979867e41dba77a4e8a4e20cb844cc6821ca967c82941655`
