mlflowhelper
A set of tools for working with mlflow (see https://mlflow.org)
Features
- managed artifact logging and loading
- automatic artifact logging and cleanup
- no overwriting files when running scripts in parallel
- loading artifacts from previous runs
- central configuration of logging and loading behavior
- log all function parameters and locals with a simple call to mlflowhelper.log_vars()
Documentation
pip install mlflowhelper
Managed artifact logging and loading
General functionality

    from matplotlib import pyplot as plt
    import mlflowhelper

    with mlflowhelper.start_run():
        with mlflowhelper.managed_artifact("plot.png") as artifact:
            fig = plt.figure()
            plt.plot([1,2,3], [1,2,3])
            fig.savefig(artifact.get_path())

This code snippet automatically logs the created artifact (plot.png).
At the same time, it creates the artifact in a temporary folder so that you don't have to worry about
overwriting it when running your scripts in parallel.
By default, this also cleans up the artifact and the temporary folder after logging.
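To make the temporary-folder idea concrete, here is a minimal standard-library sketch of a context manager that behaves along the lines described above. It is not the mlflowhelper implementation: `temp_artifact`, `log_fn`, and `record` are hypothetical names, and `record` merely stands in for a real logger such as `mlflow.log_artifact`.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def temp_artifact(filename, log_fn, cleanup=True):
    """Yield a path inside a fresh temporary directory, then log the file.

    Hypothetical helper for illustration only.
    """
    tmp_dir = tempfile.mkdtemp()  # unique per call, so parallel scripts never collide
    path = os.path.join(tmp_dir, filename)
    try:
        yield path
        if os.path.exists(path):
            log_fn(path)  # with MLflow this could be mlflow.log_artifact
    finally:
        if cleanup:
            shutil.rmtree(tmp_dir, ignore_errors=True)


logged = []


def record(path):
    # Stand-in logger (assumption): remember the file name and its content.
    with open(path) as handle:
        logged.append((os.path.basename(path), handle.read()))


with temp_artifact("notes.txt", record) as path:
    with open(path, "w") as handle:
        handle.write("hello")
    used_path = path
```

Because every call gets its own temporary directory, two scripts writing `notes.txt` at the same time never touch the same file, and the `finally` clause removes the directory even if the body raises.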
You can also manage artifacts on a directory level:

    from matplotlib import pyplot as plt
    import mlflowhelper

    with mlflowhelper.start_run():
        with mlflowhelper.managed_artifact_dir("plots") as artifact_dir:

            # plot 1
            fig = plt.figure()
            plt.plot([1,2,3], [1,2,3])
            fig.savefig(artifact_dir.get_path("plot1.png"))

            # plot 2
            fig = plt.figure()
            plt.plot([1,2,3], [1,2,3])
            fig.savefig(artifact_dir.get_path("plot2.png"))

Artifact loading
You may want to run experiments but reuse some precomputed artifact from a different run (such as preprocessed data, trained models, etc.). This can be done as follows:

    import mlflowhelper
    import pandas as pd

    with mlflowhelper.start_run():
        # activate loading from previous run
        mlflowhelper.set_load(run_id="e1363f760b1e4ab3a9e93f856f2e9341", stages=["load_data"])
        with mlflowhelper.managed_artifact("data.csv", stage="load_data") as artifact:
            if artifact.loaded:
                # load artifact
                data = pd.read_csv(artifact.get_path())
            else:
                # create and save artifact
                data = pd.read_csv("/shared/dir/data.csv").sample(frac=1)
                data.to_csv(artifact.get_path())

Similarly, this works for directories of course:

    import mlflowhelper
    import pandas as pd

    # activate loading from previous run
    mlflowhelper.set_load(run_id="e1363f760b1e4ab3a9e93f856f2e9341", stages=["load_data"])
    with mlflowhelper.start_run():
        with mlflowhelper.managed_artifact_dir("data", stage="load_data") as artifact_dir:
            train_path = artifact_dir.get_path("train.csv")
            test_path = artifact_dir.get_path("test.csv")
            if artifact_dir.loaded:
                # load artifacts
                train = pd.read_csv(train_path)
                test = pd.read_csv(test_path)
            else:
                data = pd.read_csv("/shared/dir/data.csv").sample(frac=1)
                train = data.iloc[:100,:]
                test = data.iloc[100:,:]
                # save artifacts
                train.to_csv(train_path)
                test.to_csv(test_path)

Note: The stage parameter must be set in mlflowhelper.managed_artifact or mlflowhelper.managed_artifact_dir to enable loading.
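The load-or-create pattern used in the snippets above can be illustrated without MLflow. The sketch below is only an assumption about how such a helper might behave: `LoadableArtifact`, `source_dir`, and `load_or_create` are hypothetical, and a plain directory stands in for the artifacts of a previous run.

```python
import os
import shutil
import tempfile


class LoadableArtifact:
    """Hypothetical load-or-create helper; not the mlflowhelper implementation."""

    def __init__(self, filename, source_dir=None):
        self.filename = filename
        self.source_dir = source_dir  # stands in for a previous run's artifacts
        self.loaded = False

    def __enter__(self):
        self._tmp_dir = tempfile.mkdtemp()
        self._path = os.path.join(self._tmp_dir, self.filename)
        if self.source_dir is not None:
            source = os.path.join(self.source_dir, self.filename)
            if os.path.exists(source):
                # Copy the precomputed file into place and flag it as loaded.
                shutil.copy(source, self._path)
                self.loaded = True
        return self

    def get_path(self):
        return self._path

    def __exit__(self, exc_type, exc, tb):
        shutil.rmtree(self._tmp_dir, ignore_errors=True)
        return False  # never swallow exceptions from the body


def load_or_create(source_dir):
    with LoadableArtifact("data.txt", source_dir) as artifact:
        if artifact.loaded:
            with open(artifact.get_path()) as handle:
                return "loaded", handle.read()
        with open(artifact.get_path(), "w") as handle:
            handle.write("fresh")
        return "created", "fresh"


# Simulate a previous run that already produced the artifact.
previous_run = tempfile.mkdtemp()
with open(os.path.join(previous_run, "data.txt"), "w") as handle:
    handle.write("cached")

first = load_or_create(None)           # nothing to load from: create
second = load_or_create(previous_run)  # precomputed file found: load
shutil.rmtree(previous_run)
```

The calling code stays identical in both cases; only the `loaded` flag decides which branch runs.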
Central logging and loading behavior management
Logging and loading behavior can be managed in a central way:

    import mlflowhelper
    import pandas as pd

    with mlflowhelper.start_run():

        # activate loading the stage `load_data` from previous run `e1363f760b1e4ab3a9e93f856f2e9341`
        mlflowhelper.set_load(run_id="e1363f760b1e4ab3a9e93f856f2e9341", stages=["load_data"])

        # deactivate logging the stage `load_data`, in this case for example because it was loaded from a previous run
        mlflowhelper.set_skip_log(stages=["load_data"])

        with mlflowhelper.managed_artifact_dir("data", stage="load_data") as artifact_dir:
            train_path = artifact_dir.get_path("train.csv")
            test_path = artifact_dir.get_path("test.csv")
            if artifact_dir.loaded:
                # load artifacts
                train = pd.read_csv(train_path)
                test = pd.read_csv(test_path)
            else:
                data = pd.read_csv("/shared/dir/data.csv").sample(frac=1)
                train = data.iloc[:100,:]
                test = data.iloc[100:,:]
                # save artifacts
                train.to_csv(train_path)
                test.to_csv(test_path)

Note: For central management, the stage parameter must be set in mlflowhelper.managed_artifact or mlflowhelper.managed_artifact_dir.
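One way to picture why the stage name matters is a small registry keyed by stage, which each managed artifact consults when it is entered. The sketch below is an assumption about the general idea, not mlflowhelper's internals; `StageConfig` and `resolve` are hypothetical names.

```python
class StageConfig:
    """Hypothetical central registry of per-stage loading and logging rules."""

    def __init__(self):
        self._load_from = {}     # stage name -> run id to load artifacts from
        self._skip_log = set()   # stage names whose artifacts are not logged

    def set_load(self, run_id, stages):
        for stage in stages:
            self._load_from[stage] = run_id

    def set_skip_log(self, stages):
        self._skip_log.update(stages)

    def resolve(self, stage):
        """Return (run_id_to_load_from_or_None, should_log) for a stage."""
        if stage is None:
            # Without a stage name there is nothing to look up,
            # so the artifact is always created and logged.
            return None, True
        return self._load_from.get(stage), stage not in self._skip_log


config = StageConfig()
config.set_load(run_id="e1363f760b1e4ab3a9e93f856f2e9341", stages=["load_data"])
config.set_skip_log(stages=["load_data"])

load_data_rule = config.resolve("load_data")
train_rule = config.resolve("train")
unnamed_rule = config.resolve(None)
```

In this sketch an artifact without a stage can never match a rule, which mirrors the note above.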
Easy parameter logging
mlflowhelper helps you never forget to log parameters again by making it easy to log all existing variables using mlflowhelper.log_vars.

    import mlflowhelper

    def main(param1, param2, param3="defaultvalue", verbose=0, *args, **kwargs):
        some_variable = "x"
        with mlflowhelper.start_run():  # mlflow.start_run() is also OK here
            mlflowhelper.log_vars(exclude=["verbose"])

    if __name__ == '__main__':
        main("a", "b", something_else=6)

This will log:

    {
        "param1": "a",
        "param2": "b",
        "param3": "defaultvalue",
        "something_else": 6
    }

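For readers curious how a helper can see its caller's parameters at all, the following sketch uses the standard `inspect` module to reproduce the output shown above. It is only an illustration of the technique; `collect_params` is a hypothetical function, and mlflowhelper.log_vars may work differently.

```python
import inspect


def collect_params(exclude=()):
    """Return the calling function's parameters as a flat dict.

    Hypothetical helper: named parameters are read from the caller's frame,
    and the contents of **kwargs are merged in at the top level.
    """
    frame = inspect.currentframe().f_back  # frame of the function that called us
    try:
        info = inspect.getargvalues(frame)
        params = {name: info.locals[name] for name in info.args}
        if info.keywords:
            params.update(info.locals[info.keywords])
        return {key: value for key, value in params.items() if key not in exclude}
    finally:
        del frame  # avoid keeping a reference cycle through the frame


def main(param1, param2, param3="defaultvalue", verbose=0, *args, **kwargs):
    some_variable = "x"
    return collect_params(exclude=["verbose"])


result = main("a", "b", something_else=6)
```

Here `result` holds `param1`, `param2`, `param3`, and `something_else`; `verbose` is excluded, and the local `some_variable` is not a parameter, so this sketch does not pick it up.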
Persistent dictionary
mlflowhelper provides a dictionary-like implementation that persists elements to MLflow.

    from mlflowhelper.tracking.collections import MlflowDict

    d = MlflowDict()  # you can also provide tracking URI or an MlflowClient
    d["a"] = 5
    del d

    d = MlflowDict()  # you can also provide tracking URI or an MlflowClient
    print(d["a"])  # will give you 5

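The general idea of a persistent dictionary can be sketched with `collections.abc.MutableMapping` and a backing store. The example below persists to a JSON file rather than to MLflow, purely as an assumption for illustration; `JsonFileDict` is a hypothetical class and says nothing about how MlflowDict stores its data.

```python
import json
import os
import tempfile
from collections.abc import MutableMapping


class JsonFileDict(MutableMapping):
    """Hypothetical dict-like object that writes every change to a JSON file."""

    def __init__(self, path):
        self._path = path

    def _read(self):
        if not os.path.exists(self._path):
            return {}
        with open(self._path) as handle:
            return json.load(handle)

    def _write(self, data):
        with open(self._path, "w") as handle:
            json.dump(data, handle)

    def __getitem__(self, key):
        return self._read()[key]

    def __setitem__(self, key, value):
        data = self._read()
        data[key] = value
        self._write(data)

    def __delitem__(self, key):
        data = self._read()
        del data[key]
        self._write(data)

    def __iter__(self):
        return iter(self._read())

    def __len__(self):
        return len(self._read())


store_path = os.path.join(tempfile.mkdtemp(), "store.json")

d = JsonFileDict(store_path)
d["a"] = 5
del d  # the Python object is gone, the stored data is not

d = JsonFileDict(store_path)
value = d["a"]
```

Because the state lives in the backing store and not in the object, a fresh instance pointing at the same store sees the earlier writes.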
Other
There are a few more convenience functions included in mlflowhelper.
TODOs / Ideas
- [ ] check if loading works across experiments
- [ ] purge local artifacts (check via API which runs are marked as deleted and delete their artifacts)
- [ ] support nested runs by creating subdirectories based on experiment and run
- [ ] support loading from central cache instead of from runs
- [ ] automatically log from where and what has been loaded
- [ ] set tags for logged stages (to check for artifacts before loading them)
- [ ] consider loading extensions:
  - [ ] does nested loading make sense (different loads for certain nested runs)?
  - [ ] does mixed loading make sense (loading artifacts from different runs for different stages)?
Note
This project has been set up using PyScaffold 3.2.1. For details and usage information on PyScaffold see https://pyscaffold.org/.