A kedro-plugin to use mlflow in your kedro projects

Introduction

kedro-mlflow is a kedro-plugin that integrates mlflow capabilities inside kedro projects. Its core functionalities are:

  • versioning: you can effortlessly register your parameters or your datasets with minimal configuration in a kedro run. Later, you will be able to browse your runs in the mlflow UI, and retrieve the runs you want. This is directly linked to Mlflow Tracking
  • model packaging: kedro-mlflow offers a convenient API to register a pipeline as a model in the mlflow sense. Consequently, you can API-fy your kedro pipeline with one line of code, or share a model without worrying about the preprocessing needed for further use. This is directly linked to Mlflow Models

Release history

The release history is available here.

Coming soon:

  • enhanced documentation, especially with detailed tutorials for PipelineML class and advanced versioning parametrisation
  • better integration to Mlflow Projects
  • better integration to Mlflow Model Registry
  • better CLI experience and bug fixes
  • ability to retrieve parameters / re-run a former run for reproducibility / collaboration

Getting started

Installation

Pre-requisites

I strongly recommend using conda (a package manager) to create an environment in order to avoid version conflicts between packages. This is especially important because this package uses the develop version of kedro, which is very likely not the one you use in your projects.

I also recommend reading the kedro installation guide.

Installation guide

First, install the develop version of kedro from github, which is the only one compatible with the current kedro-mlflow version:

pip install --upgrade git+https://github.com/quantumblacklabs/kedro.git@develop

Second, since the package is not on PyPI yet, you must install it from source:

pip install git+https://github.com/Galileo-Galilei/kedro-mlflow.git

Note: with this develop version of kedro, you need to install the extra dependencies by hand. You will very likely need:

pip install kedro[pandas]

otherwise, check the documentation and install the dependencies you need.

Check the installation

Type kedro info in a terminal to check the installation. If it has succeeded, you should see the following ASCII art:

 _            _
| | _____  __| |_ __ ___
| |/ / _ \/ _` | '__/ _ \
|   <  __/ (_| | | | (_) |
|_|\_\___|\__,_|_|  \___/
v0.15.9

kedro allows teams to create analytics
projects. It is developed as part of
the Kedro initiative at QuantumBlack.

Installed plugins:
kedro_mlflow: 0.1.0 (hooks:global,project)

The version 0.1.0 of the plugin is installed and has both global and project commands.

That's it! You are now ready to go!

A "hello world" example

Step 1 : Create a kedro project

If you do not have a kedro project yet, you can create it with kedro new command. See the kedro docs for a tutorial.

For this tutorial, if you do not have a real-world project, I strongly suggest that you accept to include the proposed example so that you can demo this plugin out of the box.

Step 2 : update the template

Position yourself at the project root (i.e. the folder with the .kedro.yml file):

$ cd path/to/your/project

Run the init command:

$ kedro mlflow init

Note: If the warning "You have not updated your template yet. This is mandatory to use 'kedro-mlflow' plugin. Please run the following command before you can access to other commands : '$ kedro mlflow init'" is raised, this is a bug that will be corrected, and you can safely ignore it.

  • If you have never modified your run.py manually, it should run smoothly.
  • If you have modified your run.py manually since the creation of the project, you must set up the hooks manually or use:
$ kedro mlflow init --force

to override your run.py (be cautious if you do this in an existing project!).

Run the project

You can now run

$ kedro run

Open and check what is logged

Run the following command:

$ kedro mlflow ui

and open your browser at http://localhost:5000. You will reach the mlflow UI.

Click on your project name, then on the last recorded run, and enjoy what is logged!

Going further

Tutorial

TO BE DONE... It will contain more explanations and examples of the pipeline_ml function.

Plugin overview

The current version of kedro-mlflow provides the following items:

New cli commands:

  1. kedro mlflow init: this command is needed to initialize your project. You cannot run any other commands before you run this one once. It performs 2 actions:
    • creates a mlflow.yml configuration file in your conf/base folder
    • replaces the src/PYTHON_PACKAGE/run.py file by an updated version of the template. If your template has been modified since project creation, a warning will be raised. You can either run kedro mlflow init --force to ignore this warning (but this will erase your run.py) or set up the hooks manually.
  2. kedro mlflow ui: this command opens the mlflow UI (basically launches the mlflow ui command with the configuration of your mlflow.yml file)

New DataSet:

MlflowDataSet is a wrapper for any AbstractDataSet which logs the dataset automatically in mlflow as an artifact when its save method is called. It can be used both with the YAML API:

my_dataset_to_version:
    type: kedro_mlflow.io.MlflowDataSet
    data_set:
        type: pandas.CSVDataSet  # or any valid kedro DataSet
        filepath: /path/to/a/local/destination/file.csv

or with additional parameters:

my_dataset_to_version:
    type: kedro_mlflow.io.MlflowDataSet
    data_set:
        type: pandas.CSVDataSet  # or any valid kedro DataSet
        filepath: /path/to/a/local/destination/file.csv
        load_args:
            sep: ;
        save_args:
            sep: ;
        # ... any other valid arguments for data_set
    run_id: 13245678910111213  # a valid mlflow run to log in. If None, default to active run
    artifact_path: reporting  # relative path where the artifact must be stored. if None, saved in root folder.

or with the Python API:

import pandas as pd

from kedro.extras.datasets.pandas import CSVDataSet
from kedro_mlflow.io import MlflowDataSet

csv_dataset = MlflowDataSet(data_set={"type": CSVDataSet,
                                      "filepath": r"/path/to/a/local/destination/file.csv"})
csv_dataset.save(data=pd.DataFrame({"a": [1, 2], "b": [3, 4]}))
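
As the run_id parameter above suggests, when no run_id is provided the dataset is logged to the currently active mlflow run. A minimal sketch of that usage (the filepath is a hypothetical local path, and a default local tracking store is assumed):

import mlflow
import pandas as pd

from kedro.extras.datasets.pandas import CSVDataSet
from kedro_mlflow.io import MlflowDataSet

# "data/08_reporting/file.csv" is a placeholder path
csv_dataset = MlflowDataSet(data_set={"type": CSVDataSet,
                                      "filepath": "data/08_reporting/file.csv"})

with mlflow.start_run():
    # the csv file is written locally, then logged as an artifact of the active run
    csv_dataset.save(data=pd.DataFrame({"a": [1, 2], "b": [3, 4]}))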

New Hooks

This package provides 2 new hooks:

  1. The MlflowPipelineHook:
    1. manages mlflow settings at the beginning and the end of the run (run start / end).
    2. logs useful information for reproducibility as mlflow tags (including the kedro Journal information and the commands used to launch the run).
    3. registers the pipeline as a valid mlflow model if it is a PipelineML instance
  2. The MlflowNodeHook:
    1. must be used with the MlflowPipelineHook
    2. autologs node parameters each time the pipeline is run (with kedro run or programmatically; see the sketch after the registration example below).

These hooks need to be registered in the run.py file. You can either run kedro mlflow init (which updates run.py for you, as described above) or declare them manually:

from kedro.framework.context import KedroContext  # import location on the kedro develop branch
from kedro_mlflow.framework.hooks import MlflowNodeHook, MlflowPipelineHook
from pk.pipeline import create_pipelines

class ProjectContext(KedroContext):
    """Users can override the remaining methods from the parent class here,
    or create new ones (e.g. as required by plugins)
    """

    project_name = "YOUR PROJECT NAME"
    project_version = "0.15.9"
    hooks = (
        MlflowNodeHook(flatten_dict_params=False),
        MlflowPipelineHook(model_name="YOUR_PYTHON_PACKAGE",
                           conda_env="src/requirements.txt")
    )
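
Once the hooks are registered, the node parameter autologging described above also applies when the pipeline is run programmatically through the project context instead of the kedro run command. A minimal sketch, assuming the develop-branch API where load_context returns a context whose run method accepts a pipeline name:

from kedro.framework.context import load_context

context = load_context(".")            # loads the ProjectContext defined in run.py
context.run(pipeline_name="training")  # the mlflow hooks fire during this run as well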

New Pipeline

PipelineML is a new class which extends Pipeline and enables you to bind two pipelines (one for training, one for inference) together. This class comes with a KedroPipelineModel class for logging it in mlflow. A pipeline logged as an mlflow model can be served using the mlflow models serve and mlflow models predict commands.

The PipelineML class is not intended to be used directly. A pipeline_ml factory is provided as a user-friendly interface.

Example within kedro template:

# in src/PYTHON_PACKAGE/pipeline.py

from typing import Dict

from kedro.pipeline import Pipeline
from kedro_mlflow.pipeline import pipeline_ml  # the factory described above

from PYTHON_PACKAGE.pipelines import data_engineering as de
from PYTHON_PACKAGE.pipelines import data_science as ds


def create_pipelines(**kwargs) -> Dict[str, Pipeline]:
    data_engineering_pipeline = de.create_pipeline()
    data_science_pipeline = ds.create_pipeline()
    training_pipeline = pipeline_ml(training=data_science_pipeline.only_nodes_with_tags("training"),  # or whatever your logic is for filtering
                                    inference=data_science_pipeline.only_nodes_with_tags("inference"))

    return {
        "ds": data_science_pipeline,
        "training": training_pipeline,
        "__default__": data_engineering_pipeline + data_science_pipeline,
    }

Now each time you run kedro run --pipeline=training (provided you registered MlflowPipelineHook in your run.py), the full inference pipeline will be registered as an mlflow model (with all the outputs produced by training as artifacts: the machine learning model, but also the scaler, vectorizer, imputer, or whatever object fitted on data that you create in training and that is used in inference).
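
As mentioned above, such a model can be served with mlflow models serve, or loaded back in Python through the standard pyfunc API. A sketch where the run id, artifact path and input columns are placeholders that depend on your project:

import mlflow.pyfunc
import pandas as pd

# "runs:/<RUN_ID>/model" is a placeholder URI: use the id of your training run
# and the artifact path under which the model was logged
model = mlflow.pyfunc.load_model("runs:/<RUN_ID>/model")

# the input columns are purely illustrative; they must match your inference pipeline inputs
predictions = model.predict(pd.DataFrame({"feature_1": [0.1], "feature_2": [0.2]}))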

Note: If you want to log a PipelineML object in mlflow programmatically, you can use the following code snippet:

from pathlib import Path

import mlflow
from kedro.framework.context import load_context
from kedro_mlflow.mlflow import KedroPipelineModel

# pipeline_training is your PipelineML object, created as previously
catalog = load_context(".").io

# artifacts are all the inputs of the inference pipeline that are persisted in the catalog
pipeline_catalog = pipeline_training.extract_pipeline_catalog(catalog)
artifacts = {name: Path(dataset._filepath).resolve().as_uri()
             for name, dataset in pipeline_catalog._data_sets.items()
             if name != pipeline_training.model_input_name}

mlflow.pyfunc.log_model(artifact_path="model",
                        python_model=KedroPipelineModel(pipeline_ml=pipeline_training,
                                                        catalog=pipeline_catalog),
                        artifacts=artifacts,
                        conda_env={"python": "3.7.0"})
