A set of tools for working with mlflow (see https://mlflow.org)
mlflowhelper
Features
- managed artifact logging and loading
  - automatic artifact logging and cleanup
  - no overwriting files when running scripts in parallel
  - loading artifacts from previous runs
  - central configuration of logging and loading behavior
- log all function parameters and locals with a simple call to mlflowhelper.log_vars()
Documentation
pip install mlflowhelper
Managed artifact logging and loading
General functionality
from matplotlib import pyplot as plt
import mlflowhelper

with mlflowhelper.start_run():
    with mlflowhelper.managed_artifact("plot.png") as artifact:
        fig = plt.figure()
        plt.plot([1,2,3], [1,2,3])
        fig.savefig(artifact.get_path())
This code snippet automatically logs the created artifact (plot.png).
At the same time, it creates the artifact in a temporary folder so that you don't have to worry about
overwriting it when running your scripts in parallel.
By default, this also cleans up the artifact and the temporary folder after logging.
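The temporary-folder behavior described above can be illustrated with a small standard-library sketch. This is not mlflowhelper's implementation: `temp_artifact`, `log_fn`, and `fake_log` are hypothetical names, and `fake_log` stands in for a real logging call such as `mlflow.log_artifact`.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temp_artifact(filename, log_fn):
    """Yield a path in a fresh temporary directory, then log it and clean up."""
    # mkdtemp() returns a unique directory per call, so parallel runs never collide.
    tmp_dir = Path(tempfile.mkdtemp())
    path = tmp_dir / filename
    try:
        yield path
        if path.exists():
            log_fn(str(path))  # in a real run this could be mlflow.log_artifact
    finally:
        # Remove the artifact and its temporary folder, even if the body failed.
        shutil.rmtree(tmp_dir, ignore_errors=True)


logged = []  # stand-in for an artifact store


def fake_log(path):
    """Hypothetical logger: records the file name and content."""
    logged.append((Path(path).name, Path(path).read_text()))


with temp_artifact("note.txt", fake_log) as artifact_path:
    artifact_path.write_text("hello")
```

After the `with` block, the file has been passed to the logger and the temporary directory no longer exists.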
You can also manage artifacts on a directory level:
from matplotlib import pyplot as plt
import mlflowhelper

with mlflowhelper.start_run():
    with mlflowhelper.managed_artifact_dir("plots") as artifact_dir:
        # plot 1
        fig = plt.figure()
        plt.plot([1,2,3], [1,2,3])
        fig.savefig(artifact_dir.get_path("plot1.png"))

        # plot 2
        fig = plt.figure()
        plt.plot([1,2,3], [1,2,3])
        fig.savefig(artifact_dir.get_path("plot2.png"))
Artifact loading
You may want to run experiments but reuse some precomputed artifact from a different run (such as preprocessed data, trained models, etc.). This can be done as follows:
import mlflowhelper
import pandas as pd

with mlflowhelper.start_run():
    mlflowhelper.set_load(run_id="e1363f760b1e4ab3a9e93f856f2e9341", stages=["load_data"])  # activate loading from previous run
    # a single file is managed with managed_artifact (not managed_artifact_dir)
    with mlflowhelper.managed_artifact("data.csv", stage="load_data") as artifact:
        if artifact.loaded:
            # load artifact
            data = pd.read_csv(artifact.get_path())
        else:
            # create and save artifact
            data = pd.read_csv("/shared/dir/data.csv").sample(frac=1)
            data.to_csv(artifact.get_path())
This works for directories as well:
import mlflowhelper
import pandas as pd

mlflowhelper.set_load(run_id="e1363f760b1e4ab3a9e93f856f2e9341", stages=["load_data"])  # activate loading from previous run
with mlflowhelper.start_run():
    with mlflowhelper.managed_artifact_dir("data", stage="load_data") as artifact_dir:
        train_path = artifact_dir.get_path("train.csv")
        test_path = artifact_dir.get_path("test.csv")
        if artifact_dir.loaded:
            # load artifacts
            train = pd.read_csv(train_path)
            test = pd.read_csv(test_path)
        else:
            data = pd.read_csv("/shared/dir/data.csv").sample(frac=1)
            train = data.iloc[:100,:]
            test = data.iloc[100:,:]
            # save artifacts
            train.to_csv(train_path)
            test.to_csv(test_path)
Note: The stage parameter must be set in mlflowhelper.managed_artifact(_dir) to enable loading.
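The load-or-create pattern used in the examples above can be sketched with the standard library alone. This is only an illustration of the control flow, not how mlflowhelper works internally: `load_or_create`, `store_dir`, and `get_data` are hypothetical names, and a plain directory stands in for the artifacts of a previous run.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from types import SimpleNamespace


@contextmanager
def load_or_create(filename, store_dir, load=False):
    """Yield an object with `path` and `loaded`, mirroring the pattern above."""
    stored = Path(store_dir) / filename
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / filename
        loaded = load and stored.exists()
        if loaded:
            shutil.copy(stored, path)  # reuse the previously stored artifact
        yield SimpleNamespace(path=path, loaded=loaded)
        if not loaded and path.exists():
            shutil.copy(path, stored)  # "log" the newly created artifact


def get_data(store_dir, load):
    """Return (content, was_loaded) using the load-or-create pattern."""
    with load_or_create("data.txt", store_dir, load=load) as artifact:
        if artifact.loaded:
            return artifact.path.read_text(), True
        artifact.path.write_text("expensive result")
        return "expensive result", False


store = tempfile.mkdtemp()            # stand-in for a previous run's artifacts
first = get_data(store, load=False)   # creates and stores the artifact
second = get_data(store, load=True)   # reuses the stored artifact
shutil.rmtree(store)
```

The caller's code is the same in both cases; only the `loaded` flag decides whether the expensive step runs.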
Central logging and loading behavior management
Logging and loading behavior can be managed centrally:
import mlflowhelper
import pandas as pd

with mlflowhelper.start_run():

    # activate loading the stage `load_data` from previous run `e1363f760b1e4ab3a9e93f856f2e9341`
    mlflowhelper.set_load(run_id="e1363f760b1e4ab3a9e93f856f2e9341", stages=["load_data"])

    # deactivate logging the stage `load_data`, in this case for example because it was loaded from a previous run
    mlflowhelper.set_skip_log(stages=["load_data"])

    with mlflowhelper.managed_artifact_dir("data", stage="load_data") as artifact_dir:
        train_path = artifact_dir.get_path("train.csv")
        test_path = artifact_dir.get_path("test.csv")
        if artifact_dir.loaded:
            # load artifacts
            train = pd.read_csv(train_path)
            test = pd.read_csv(test_path)
        else:
            data = pd.read_csv("/shared/dir/data.csv").sample(frac=1)
            train = data.iloc[:100,:]
            test = data.iloc[100:,:]
            # save artifacts
            train.to_csv(train_path)
            test.to_csv(test_path)
Note: For central management, the stage parameter must be set in mlflowhelper.managed_artifact(_dir).
Easy parameter logging
mlflowhelper helps you never forget to log parameters again by making it easy to log all existing variables
using mlflowhelper.log_vars.
import mlflowhelper

def main(param1, param2, param3="defaultvalue", verbose=0, *args, **kwargs):
    some_variable = "x"
    with mlflowhelper.start_run():  # mlflow.start_run() is also OK here
        mlflowhelper.log_vars(exclude=["verbose"])

if __name__ == '__main__':
    main("a", "b", something_else=6)
This will log:
{
"param1": "a",
"param2": "b",
"param3": "defaultvalue",
"something_else": 6
}
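For readers curious how a helper can see its caller's arguments, here is a minimal sketch using the standard `inspect` module. It is an assumption about the general technique, not mlflowhelper's actual code: `collect_params` is a hypothetical function, and it only gathers named parameters and flattens `**kwargs`, which is enough to reproduce the dictionary shown above.

```python
import inspect


def collect_params(exclude=()):
    """Return the calling function's parameters as a flat dict."""
    frame = inspect.currentframe().f_back  # frame of the caller
    try:
        info = inspect.getargvalues(frame)
        # Named parameters (positional and keyword) of the caller.
        params = {name: info.locals[name] for name in info.args}
        # info.keywords is the name of the **kwargs parameter, if there is one.
        if info.keywords:
            params.update(info.locals[info.keywords])
        return {k: v for k, v in params.items() if k not in exclude}
    finally:
        del frame  # avoid keeping a reference cycle alive


def main(param1, param2, param3="defaultvalue", verbose=0, *args, **kwargs):
    some_variable = "x"  # a plain local; this sketch does not collect it
    return collect_params(exclude=["verbose"])


result = main("a", "b", something_else=6)
```

In a real run the resulting dictionary could then be passed to `mlflow.log_params`.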
Persistent dictionary
mlflowhelper provides a dictionary-like implementation that persists elements to MLflow.
from mlflowhelper.tracking.collections import MlflowDict
d = MlflowDict() # you can also provide tracking URI or an MlflowClient
d["a"] = 5
del d
d = MlflowDict() # you can also provide tracking URI or an MlflowClient
print(d["a"]) # will give you 5
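How MlflowDict stores its entries is not described here. As a rough illustration of the general idea (a mapping whose contents outlive the Python object), the following sketch uses a JSON file as a stand-in backend. `JsonFileDict` is a hypothetical class and does not talk to MLflow.

```python
import json
import os
import tempfile
from collections.abc import MutableMapping


class JsonFileDict(MutableMapping):
    """Dictionary-like object that reads and writes a JSON file on every access."""

    def __init__(self, path):
        self._path = path

    def _read(self):
        if not os.path.exists(self._path):
            return {}
        with open(self._path) as f:
            return json.load(f)

    def _write(self, data):
        with open(self._path, "w") as f:
            json.dump(data, f)

    def __getitem__(self, key):
        return self._read()[key]

    def __setitem__(self, key, value):
        data = self._read()
        data[key] = value
        self._write(data)

    def __delitem__(self, key):
        data = self._read()
        del data[key]
        self._write(data)

    def __iter__(self):
        return iter(self._read())

    def __len__(self):
        return len(self._read())


path = os.path.join(tempfile.mkdtemp(), "store.json")
d = JsonFileDict(path)
d["a"] = 5
del d                   # drop the Python object; the data stays in the file
d = JsonFileDict(path)
value = d["a"]          # read back from the backend
```

Subclassing `MutableMapping` means only five methods need to be written; `get`, `items`, `update`, and the rest come for free.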
Other
There are a few more convenience functions included in mlflowhelper.
TODOs / Ideas
- check if loading works across experiments
- purge local artifacts (check via API which runs are marked as deleted and delete their artifacts)
- support nested runs by creating subdirectories based on experiment and run
- support loading from central cache instead of from runs
- automatically log from where and what has been loaded
- set tags for logged stages (to check for artifacts before loading them)
- consider loading extensions:
  - does nested loading make sense (different loads for certain nested runs)?
  - does mixed loading make sense (loading artifacts from different runs for different stages)?
Note
This project has been set up using PyScaffold 3.2.1. For details and usage information on PyScaffold see https://pyscaffold.org/.