A kedro-plugin to use mlflow in your kedro projects

Introduction

kedro-mlflow is a kedro-plugin that integrates mlflow capabilities inside kedro projects. Its core functionalities are:

  • versioning: you can effortlessly register your parameters or your datasets with minimal configuration in a kedro run. Later, you will be able to browse your runs in the mlflow UI, and retrieve the runs you want. This is directly linked to Mlflow Tracking
  • model packaging: kedro-mlflow offers a convenient API to register a pipeline as a model in the mlflow sense. Consequently, you can API-fy your kedro pipeline with one line of code, or share a model without worrying about the preprocessing needed for further use. This is directly linked to Mlflow Models

Release history

The release history is available here.

Coming soon:

  • enhanced documentation, especially with detailed tutorials for PipelineML class and advanced versioning parametrisation
  • better integration to Mlflow Projects
  • better integration to Mlflow Model Registry
  • better CLI experience and bug fixes
  • ability to retrieve parameters / re-run a former run for reproducibility / collaboration

Getting started

Installation

Pre-requisites

I strongly recommend using conda (a package manager) to create an environment in order to avoid version conflicts between packages. This is especially important because this package uses the develop version of kedro, which is very likely not the one you use in your projects.

I also recommend reading the kedro installation guide.

Installation guide

First, install the develop version of kedro from github, which is the only one compatible with the current kedro-mlflow version:

pip install --upgrade git+https://github.com/quantumblacklabs/kedro.git@develop

Second, since the package is not on PyPI yet, you must install it from source:

pip install git+https://github.com/Galileo-Galilei/kedro-mlflow.git

Note: with this develop version of kedro, you need to install the extra dependencies by hand. You will very likely need:

pip install kedro[pandas]

otherwise, check the documentation and install the dependencies you need.

Check the installation

Type kedro info in a terminal to check the installation. If it has succeeded, you should see the following ASCII art:

 _            _
| | _____  __| |_ __ ___
| |/ / _ \/ _` | '__/ _ \
|   <  __/ (_| | | | (_) |
|_|\_\___|\__,_|_|  \___/
v0.15.9

kedro allows teams to create analytics
projects. It is developed as part of
the Kedro initiative at QuantumBlack.

Installed plugins:
kedro_mlflow: 0.1.0 (hooks:global,project)

The version 0.1.0 of the plugin is installed and has both global and project commands.

That's it! You are now ready to go!

A "hello world" example

Step 1 : Create a kedro project

If you do not have a kedro project yet, you can create it with kedro new command. See the kedro docs for a tutorial.

For this tutorial, if you do not have a real-world project, I strongly suggest that you accept to include the proposed example so that you can demo this plugin out of the box.

Step 2 : update the template

Position yourself at the project root (i.e. the folder with the .kedro.yml file):

$ cd path/to/your/project

Run the init command:

$ kedro mlflow init

Note: If the warning "You have not updated your template yet. This is mandatory to use 'kedro-mlflow' plugin. Please run the following command before you can access to other commands : '$ kedro mlflow init'" is raised, this is a bug that will be corrected, and you can safely ignore it.

  • If you have never modified your run.py manually, it should run smoothly.
  • If you have modified your run.py manually since the creation of the project, you must set up the hooks manually or use:
$ kedro mlflow init --force

to override your run.py (be cautious if you do this in an existing project!).

Run the project

You can now run

$ kedro run

Open and check what is logged

Run the following command:

$ kedro mlflow ui

and open your browser at http://localhost:5000. You will reach the mlflow UI.

Click on your project name, then on the last recorded run, and enjoy what is logged!

Going further

Tutorial

TO BE DONE... It will contain more explanations and examples of the pipeline_ml function.

Plugin overview

The current version of kedro-mlflow provides the following items:

New cli commands:

  1. kedro mlflow init: this command is needed to initialize your project. You cannot run any other commands before you run this one once. It performs 2 actions:
    • creates a mlflow.yml configuration file in your conf/base folder
    • replaces the src/PYTHON_PACKAGE/run.py file by an updated version of the template. If your template has been modified since project creation, a warning will be raised. You can either run kedro mlflow init --force to ignore this warning (but this will erase your run.py) or set up the hooks manually.
  2. kedro mlflow ui: this command opens the mlflow UI (basically launches the mlflow ui command with the configuration of your mlflow.yml file)

New DataSet:

MlflowDataSet is a wrapper for any AbstractDataSet which logs the dataset automatically in mlflow as an artifact when its save method is called. It can be used both with the YAML API:

my_dataset_to_version:
    type: kedro_mlflow.io.MlflowDataSet
    data_set:
        type: pandas.CSVDataSet  # or any valid kedro DataSet
        filepath: /path/to/a/local/destination/file.csv

or with additional parameters:

my_dataset_to_version:
    type: kedro_mlflow.io.MlflowDataSet
    data_set:
        type: pandas.CSVDataSet  # or any valid kedro DataSet
        filepath: /path/to/a/local/destination/file.csv
        load_args:
            sep: ;
        save_args:
            sep: ;
        # ... any other valid arguments for data_set
    run_id: 13245678910111213  # a valid mlflow run to log in. If None, default to active run
    artifact_path: reporting  # relative path where the artifact must be stored. if None, saved in root folder.

or with the Python API:

import pandas as pd

from kedro.extras.datasets.pandas import CSVDataSet
from kedro_mlflow.io import MlflowDataSet

csv_dataset = MlflowDataSet(data_set={"type": CSVDataSet,
                                      "filepath": r"/path/to/a/local/destination/file.csv"})
csv_dataset.save(data=pd.DataFrame({"a": [1, 2], "b": [3, 4]}))
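
As the run_id parameter above suggests, when no run_id is provided the dataset is logged to the currently active mlflow run. A minimal sketch of that usage (the filepath is a hypothetical local path, and a default local tracking store is assumed):

import mlflow
import pandas as pd

from kedro.extras.datasets.pandas import CSVDataSet
from kedro_mlflow.io import MlflowDataSet

# "data/08_reporting/file.csv" is a placeholder path
csv_dataset = MlflowDataSet(data_set={"type": CSVDataSet,
                                      "filepath": "data/08_reporting/file.csv"})

with mlflow.start_run():
    # the csv file is written locally, then logged as an artifact of the active run
    csv_dataset.save(data=pd.DataFrame({"a": [1, 2], "b": [3, 4]}))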

New Hooks

This package provides 2 new hooks:

  1. The MlflowPipelineHook:
    1. manages mlflow settings at the beginning and the end of the run (run start / end).
    2. logs useful information for reproducibility as mlflow tags (including the kedro Journal information and the commands used to launch the run).
    3. registers the pipeline as a valid mlflow model if it is a PipelineML instance
  2. The MlflowNodeHook:
    1. must be used with the MlflowPipelineHook
    2. autologs node parameters each time the pipeline is run (with kedro run or programmatically; see the sketch after the registration example below).

These hooks need to be registered in the run.py file. You can either run kedro mlflow init (which updates run.py for you, as described above) or declare them manually:

from kedro.framework.context import KedroContext  # import location on the kedro develop branch
from kedro_mlflow.framework.hooks import MlflowNodeHook, MlflowPipelineHook
from pk.pipeline import create_pipelines

class ProjectContext(KedroContext):
    """Users can override the remaining methods from the parent class here,
    or create new ones (e.g. as required by plugins)
    """

    project_name = "YOUR PROJECT NAME"
    project_version = "0.15.9"
    hooks = (
        MlflowNodeHook(flatten_dict_params=False),
        MlflowPipelineHook(model_name="YOUR_PYTHON_PACKAGE",
                           conda_env="src/requirements.txt")
    )
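
Once the hooks are registered, the node parameter autologging described above also applies when the pipeline is run programmatically through the project context instead of the kedro run command. A minimal sketch, assuming the develop-branch API where load_context returns a context whose run method accepts a pipeline name:

from kedro.framework.context import load_context

context = load_context(".")            # loads the ProjectContext defined in run.py
context.run(pipeline_name="training")  # the mlflow hooks fire during this run as well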

New Pipeline

PipelineML is a new class which extends Pipeline and enables you to bind two pipelines (one for training, one for inference) together. This class comes with a KedroPipelineModel class for logging it in mlflow. A pipeline logged as an mlflow model can be served using the mlflow models serve and mlflow models predict commands.

The PipelineML class is not intended to be used directly. A pipeline_ml factory is provided as a user-friendly interface.

Example within kedro template:

# in src/PYTHON_PACKAGE/pipeline.py

from typing import Dict

from kedro.pipeline import Pipeline
from kedro_mlflow.pipeline import pipeline_ml  # the factory described above

from PYTHON_PACKAGE.pipelines import data_engineering as de
from PYTHON_PACKAGE.pipelines import data_science as ds


def create_pipelines(**kwargs) -> Dict[str, Pipeline]:
    data_engineering_pipeline = de.create_pipeline()
    data_science_pipeline = ds.create_pipeline()
    training_pipeline = pipeline_ml(training=data_science_pipeline.only_nodes_with_tags("training"),  # or whatever your logic is for filtering
                                    inference=data_science_pipeline.only_nodes_with_tags("inference"))

    return {
        "ds": data_science_pipeline,
        "training": training_pipeline,
        "__default__": data_engineering_pipeline + data_science_pipeline,
    }

Now each time you run kedro run --pipeline=training (provided you registered MlflowPipelineHook in your run.py), the full inference pipeline will be registered as an mlflow model (with all the outputs produced by training as artifacts: the machine learning model, but also the scaler, vectorizer, imputer, or whatever object fitted on data that you create in training and that is used in inference).
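
As mentioned above, such a model can be served with mlflow models serve, or loaded back in Python through the standard pyfunc API. A sketch where the run id, artifact path and input columns are placeholders that depend on your project:

import mlflow.pyfunc
import pandas as pd

# "runs:/<RUN_ID>/model" is a placeholder URI: use the id of your training run
# and the artifact path under which the model was logged
model = mlflow.pyfunc.load_model("runs:/<RUN_ID>/model")

# the input columns are purely illustrative; they must match your inference pipeline inputs
predictions = model.predict(pd.DataFrame({"feature_1": [0.1], "feature_2": [0.2]}))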

Note: If you want to log a PipelineML object in mlflow programmatically, you can use the following code snippet:

from pathlib import Path

import mlflow
from kedro.framework.context import load_context
from kedro_mlflow.mlflow import KedroPipelineModel

# pipeline_training is your PipelineML object, created as previously
catalog = load_context(".").io

# artifacts are all the inputs of the inference pipeline that are persisted in the catalog
pipeline_catalog = pipeline_training.extract_pipeline_catalog(catalog)
artifacts = {name: Path(dataset._filepath).resolve().as_uri()
             for name, dataset in pipeline_catalog._data_sets.items()
             if name != pipeline_training.model_input_name}

mlflow.pyfunc.log_model(artifact_path="model",
                        python_model=KedroPipelineModel(pipeline_ml=pipeline_training,
                                                        catalog=pipeline_catalog),
                        artifacts=artifacts,
                        conda_env={"python": "3.7.0"})
