A framework for integrating Hydra/DVC/MLflow for reproducible ML experiments.
ZenDag
ZenDag is a Python framework designed to streamline Machine Learning experimentation workflows by integrating:
- Configuration Management: Hydra and Hydra-Zen for modular, reusable, and composable configuration-as-code.
- Pipeline Orchestration & Versioning: DVC for defining experiment pipelines (DAGs) and versioning data, artifacts, and models.
- Experiment Tracking: MLflow for logging parameters, metrics, artifacts, and comparing runs.
The core idea is to drive the DVC pipeline definition directly from your Hydra configurations, minimizing redundancy and ensuring consistency between your code, configuration, and the execution pipeline.
Core Concepts
- Configuration as Code: Define all aspects of your experiment (data sources, preprocessing steps, model architecture, training parameters, evaluation metrics, logger settings) using Python code via Hydra-Zen and store them in a structured way (e.g., using hydra_zen.ZenStore).
- Stage-Based Pipelines: Structure your ML workflow into logical stages (e.g., data_prep, feature_eng, train, evaluate, deploy). Each stage corresponds to a node in the DVC pipeline graph.
- Automatic DAG Generation: ZenDag automatically generates the dvc.yaml file. It discovers dependencies (deps) and outputs (outs) by inspecting your Hydra configurations during a resolution step. You declare these using ${deps:...} and ${outs:...} interpolations directly within your configuration values (e.g., file paths).
- Integrated Experiment Tracking: A simple decorator (@zendag.mlflow_run) wraps your stage execution functions to automatically handle MLflow setup, log parameters from the Hydra config, capture artifacts (including logs and the config itself), and manage nested runs within a parent pipeline run.
- Environment & Task Management: While ZenDag itself is framework-agnostic regarding environment management, it's designed to work seamlessly with tools like Pixi or Conda/Poetry. A Cookiecutter template is provided to quickly set up a project using Pixi.
Installation
pip install zendag # Or install from source/git if needed
API Reference
zendag.core.configure_pipeline(...)
def configure_pipeline(
    store: hydra_zen.ZenStore,
    stage_groups: List[str],
    stage_dir_fn: Callable[[str, str], str] = default_stage_dir_fn,
    configs_dir_fn: Callable[[str], str] = default_configs_dir_fn,
    dvc_filename: str = "dvc.yaml",
    run_script: str = "zendag.run",
    config_root: Optional[str] = None,
) -> None:
    # ... (Full signature in docstring above) ...
- Purpose: The main function to generate the dvc.yaml file.
- How it works:
  - Iterates through specified stage_groups in the hydra_zen.ZenStore.
  - For each configuration (name) within a stage group (stage):
    * Composes the full Hydra config (e.g., hydra.compose(overrides=[f"+{stage}={name}"])).
    * Writes the composed config to <configs_dir_fn(stage)>/.yaml. This file is tracked as a param by DVC.
    * Registers temporary Hydra resolvers for ${deps:...} and ${outs:...}.
    * Calls OmegaConf.resolve(cfg). During resolution, any ${deps:path} or ${outs:path} encountered trigger the resolvers, which append the path to internal lists (side-effect).
    * Collects the unique dependencies and outputs discovered during resolution.
    * Defines a DVC stage entry in a dictionary (e.g., stages['stage/name'] = {...}). The cmd calls the specified run_script using the composed config. deps, outs, and params are populated.
  - Writes the complete stage dictionary to the dvc_filename.
- Logging: Provides INFO and DEBUG level logs about the process, including discovered deps/outs. Configure Python's logging to see these.
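The final steps above (deduplicating the discovered paths and assembling one DVC stage entry per configuration) can be sketched with plain dictionaries. This is a minimal illustration, not ZenDag's implementation: build_stage_entry, the configs_dir_fn layout, the `python -m ...` command line, and the example paths are all assumptions made for the sketch.

```python
from typing import Dict, List


def configs_dir_fn(stage: str) -> str:
    # Assumption: composed configs live under artifacts/<stage>/,
    # as in the recommended project structure.
    return f"artifacts/{stage}"


def build_stage_entry(
    stage: str,
    name: str,
    deps: List[str],
    outs: List[str],
    run_script: str = "zendag.run",
) -> Dict[str, object]:
    """Hypothetical helper: assemble one dvc.yaml stage entry."""
    config_path = f"{configs_dir_fn(stage)}/{name}.yaml"
    return {
        # Hypothetical command line; the real cmd format is not shown here.
        "cmd": f"python -m {run_script} {config_path}",
        # dict.fromkeys drops duplicates while keeping discovery order.
        "deps": list(dict.fromkeys(deps)),
        "outs": list(dict.fromkeys(outs)),
        # A file mapped to None asks DVC to track the whole file as params.
        "params": [{config_path: None}],
    }


# Paths as they might be collected by the deps/outs resolvers (made up).
discovered = {
    ("data_prep", "config_a"): (
        ["data/raw.csv", "data/raw.csv"],
        ["artifacts/data_prep/config_a/clean.parquet"],
    ),
    ("training", "config_b"): (
        ["artifacts/data_prep/config_a/clean.parquet"],
        ["artifacts/training/config_b/model.onnx"],
    ),
}

stages = {
    f"{stage}/{name}": build_stage_entry(stage, name, deps, outs)
    for (stage, name), (deps, outs) in discovered.items()
}
pipeline = {"stages": stages}
```

Because the output of one stage appears as a dependency of the next, DVC can infer the edge between data_prep/config_a and training/config_b without any explicit ordering in the dictionary.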
zendag.config_utils.deps_path(...) & zendag.config_utils.outs_path(...)
def deps_path(s: str, input_stage: Optional[str] = None, input_name: Optional[str] = None, stage_dir_fn=None) -> str:
    # ...
def outs_path(s: str) -> str:
    # ...
- Purpose: These functions format strings suitable for Hydra interpolation to declare DVC dependencies and outputs within your configuration values.
- Mechanism: They return strings like "${deps:path/to/dependency}" or "${outs:path/to/output}". When configure_pipeline calls OmegaConf.resolve, OmegaConf passes the path (path/to/dependency or path/to/output) to the registered deps or outs resolver, which records it for the dvc.yaml generation. The resolver also returns the path (the k in lambda k: current_list.append(k) or k), so the config value itself resolves to the intended path after interpolation (relative to the stage's output directory for outs).
- Usage: Use these inside your Hydra-Zen configurations where file paths are defined:
from zendag.config_utils import deps_path, outs_path
from hydra_zen import builds

# MyDataset and my_stage_dir_function are user-defined and must be importable here.
DataConfig = builds(
    MyDataset,
    data_file=deps_path("raw_data.csv", input_stage="data_fetch", input_name="fetch_europe"),
    processed_file=outs_path("processed_data.parquet"),
    # Need stage_dir_fn for deps_path resolution if using input_stage/name
    zen_meta=dict(stage_dir_fn=my_stage_dir_function),  # Or rely on default/global
)
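To make the string formatting concrete, here is a minimal sketch of what such helpers could look like. It is not the zendag.config_utils source: the _sketch function names, the artifacts/<stage>/<name> directory layout, and the rule for joining the upstream stage directory with the file name are assumptions for illustration.

```python
from typing import Callable, Optional


def default_stage_dir_fn(stage: str, name: str) -> str:
    # Assumption: stage outputs live under artifacts/<stage>/<name>,
    # as in the recommended project structure.
    return f"artifacts/{stage}/{name}"


def outs_path_sketch(s: str) -> str:
    """Wrap a path in an ${outs:...} interpolation string."""
    return "${outs:" + s + "}"


def deps_path_sketch(
    s: str,
    input_stage: Optional[str] = None,
    input_name: Optional[str] = None,
    stage_dir_fn: Callable[[str, str], str] = default_stage_dir_fn,
) -> str:
    """Wrap a path in a ${deps:...} interpolation string.

    When both input_stage and input_name are given, the path is assumed
    to be relative to that upstream stage's output directory.
    """
    if input_stage is not None and input_name is not None:
        s = f"{stage_dir_fn(input_stage, input_name)}/{s}"
    return "${deps:" + s + "}"


upstream = deps_path_sketch(
    "raw_data.csv", input_stage="data_fetch", input_name="fetch_europe"
)
plain = deps_path_sketch("data/raw_data.csv")
produced = outs_path_sketch("processed_data.parquet")
print(upstream)
print(plain)
print(produced)
```

The helpers only build strings. Nothing is recorded until the config is resolved, which is why the same config object can be composed and inspected without touching the pipeline definition.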
@zendag.mlflow_utils.mlflow_run(...)
@mlflow_run(project_name: str = os.environ.get("MLFLOW_PROJECT_NAME", "DefaultProject"))
def my_training_stage(cfg: DictConfig):
    # ... stage logic ...
- Purpose: Decorator for your main stage functions.
- Functionality:
  - Sets the MLflow experiment.
  - Handles parent/nested MLflow runs using .pipeline_id and DVC_STAGE env var.
  - If run via DVC (DVC_STAGE is set), loads the corresponding composed Hydra config (artifacts//.yaml).
  - Logs parameters from the resolved Hydra config to the nested MLflow run.
  - Logs the composed config .yaml file as an artifact.
  - Executes the decorated function.
  - Logs the run.log file from the Hydra output directory as an artifact on success or failure.
  - Manages exceptions and MLflow run states.
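The control flow described above (start a run, log parameters, execute the stage, then log run.log and close the run whether or not the stage raised) can be sketched without MLflow installed. RecordingTracker below is an in-memory stand-in invented for this example, not the MLflow API, and tracked_stage is a hypothetical simplification of the decorator; only the DVC_STAGE environment variable and the run.log file name come from the description above.

```python
import functools
import os
from typing import Any, Callable, Dict, List, Tuple


class RecordingTracker:
    """In-memory stand-in for an experiment tracker (not the MLflow API)."""

    def __init__(self) -> None:
        self.events: List[Tuple[Any, ...]] = []

    def start_run(self, name: str, nested: bool) -> None:
        self.events.append(("start", name, nested))

    def log_params(self, params: Dict[str, Any]) -> None:
        self.events.append(("params", dict(params)))

    def log_artifact(self, path: str) -> None:
        self.events.append(("artifact", path))

    def end_run(self, status: str) -> None:
        self.events.append(("end", status))


def tracked_stage(tracker: RecordingTracker, log_path: str = "run.log") -> Callable:
    """Hypothetical decorator mirroring the steps listed above."""

    def decorator(fn: Callable[[Dict[str, Any]], Any]) -> Callable:
        @functools.wraps(fn)
        def wrapper(cfg: Dict[str, Any]) -> Any:
            # Assumption: a run is nested when DVC launched the stage.
            stage = os.environ.get("DVC_STAGE")
            tracker.start_run(name=stage or fn.__name__, nested=stage is not None)
            tracker.log_params(cfg)
            status = "FAILED"
            try:
                result = fn(cfg)
                status = "FINISHED"
                return result
            finally:
                # Runs on success and on failure; exceptions still propagate.
                tracker.log_artifact(log_path)
                tracker.end_run(status)

        return wrapper

    return decorator


tracker = RecordingTracker()


@tracked_stage(tracker)
def train(cfg: Dict[str, Any]) -> float:
    if cfg["lr"] <= 0:
        raise ValueError("lr must be positive")
    return cfg["lr"] * 2


train({"lr": 0.1})
try:
    train({"lr": 0.0})
except ValueError:
    pass

statuses = [event[1] for event in tracker.events if event[0] == "end"]
artifacts = [event[1] for event in tracker.events if event[0] == "artifact"]
```

Putting the artifact upload in a finally block is what makes the log available for failed runs too, which is usually when it is needed most.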
Recommended Project Structure (See Cookiecutter Template)
my_project/
├── artifacts/                # DVC-managed outputs (configs, logs, models...)
│   ├── data_prep/
│   │   ├── config_a.yaml
│   │   └── config_a/         # Stage output dir
│   │       └── run.log
│   └── training/
│       ├── config_b.yaml
│       └── config_b/
│           ├── checkpoints/
│           ├── model.onnx
│           └── run.log
├── configs/                  # Hydra-Zen config definitions (structured)
│   ├── __init__.py
│   ├── common.py
│   ├── data.py
│   ├── model.py
│   └── training.py
├── data/                     # Raw data (potentially DVC-managed)
├── src/                      # Project source code
│   └── my_project_pkg/
│       ├── __init__.py
│       ├── stages/           # Stage logic functions (decorated)
│       │   ├── __init__.py
│       │   ├── data_prep.py
│       │   └── train.py
│       └── utils.py          # Utility functions
├── tests/                    # Unit/integration tests
├── .dvc/                     # DVC internal files
├── .dvcignore
├── .gitignore
├── .pipeline_id              # Stores current parent MLflow run ID (auto-managed)
├── configure.py              # Script to run zendag.configure_pipeline
├── dvc.yaml                  # Generated by configure.py (defines pipeline)
├── pixi.toml                 # Environment and task definitions (Pixi)
└── README.md
- configs/: Organize your Hydra-Zen builds calls here, grouped by functionality (data, model, trainer, logger, etc.). Use hydra_zen.make_custom_builds_fn for brevity. Import these into configure.py.
- src/my_project_pkg/stages/: Implement the core logic for each pipeline stage here. Decorate the main function for each stage with @zendag.mlflow_run. These functions typically accept the Hydra DictConfig as an argument.
- configure.py: The script that:
  - Imports configs from configs/.
  - Populates a hydra_zen.ZenStore.
  - Defines the list of stage_groups to process.
  - Calls zendag.core.configure_pipeline(store, stage_groups, ...).
- pixi.toml: Defines the environment (dependencies like python, dvc, mlflow, hydra-core, hydra-zen, zendag, your src package) and tasks (configure, pipeline, save, etc.).
How Automatic DAG Generation Works Internally
The key is the interaction between configure_pipeline, OmegaConf.resolve, and the custom deps/outs resolvers:
- configure_pipeline registers temporary resolvers for deps and outs just before calling OmegaConf.resolve(cfg) for a specific stage config.
- The resolvers are simple lambdas, e.g., lambda k: my_list.append(k) or k. Because list.append returns None, the expression always evaluates to k.
- When OmegaConf.resolve encounters ${deps:some/path} within the config structure:
  - It calls the deps resolver with k = "some/path".
  - The resolver appends "some/path" to the current_deps list (side-effect).
  - The resolver returns k ("some/path").
  - OmegaConf uses this returned value to replace the ${deps:some/path} interpolation.
- The same happens for ${outs:other/path}.
- After OmegaConf.resolve(cfg) finishes, the current_deps and current_outs lists contain all paths discovered via these interpolations for that specific stage configuration.
- These lists are then used to populate the deps and outs fields in the generated dvc.yaml.
This avoids manual duplication of paths between the config where they are used and the DVC pipeline definition.
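The side-effect pattern can be demonstrated with the standard library alone. In the sketch below, a small regex-based resolve function stands in for OmegaConf.resolve so that the example runs without OmegaConf installed; it is not how OmegaConf parses interpolations, and the config keys and paths are made up. The two lambdas have the same shape as the resolvers described above.

```python
import re
from typing import Any, Callable, Dict, List

# Matches ${name:argument}; a simplified stand-in for OmegaConf's grammar.
INTERPOLATION = re.compile(r"\$\{(\w+):([^}]*)\}")


def resolve(node: Any, resolvers: Dict[str, Callable[[str], str]]) -> Any:
    """Recursively replace ${name:arg} with resolvers[name](arg)."""
    if isinstance(node, dict):
        return {key: resolve(value, resolvers) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve(value, resolvers) for value in node]
    if isinstance(node, str):
        return INTERPOLATION.sub(
            lambda match: resolvers[match.group(1)](match.group(2)), node
        )
    return node


current_deps: List[str] = []
current_outs: List[str] = []

# list.append returns None, so "append(k) or k" records k and returns it.
resolvers = {
    "deps": lambda k: current_deps.append(k) or k,
    "outs": lambda k: current_outs.append(k) or k,
}

cfg = {
    "data": {"train_file": "${deps:data/raw.csv}", "batch_size": 32},
    "model": {"checkpoint": "${outs:model.onnx}"},
}

resolved = resolve(cfg, resolvers)
print(resolved["data"]["train_file"])
print(current_deps, current_outs)
```

After resolution the config holds plain paths, while the two lists hold exactly the paths that were declared through the interpolations, ready to be written into the deps and outs fields of a stage.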
License
Apache 2.0