MLflow Export Import
This package provides tools to export and import MLflow objects (runs, experiments or registered models) from one MLflow tracking server or Databricks workspace to another.
For more details on MLflow objects, see the Databricks MLflow Object Relationships slide deck.
Overview
Why use MLflow Export Import?
- Share and collaborate with other data scientists in the same or another tracking server.
  - For example, clone a favorite experiment from another user into your own workspace.
- Migrate experiments to another tracking server.
  - For example, promote a registered model version (and its associated run) from the development tracking server to the test tracking server and then to the production tracking server.
- Back up your MLflow objects.
- Disaster recovery.
MLflow Export Import scenarios
Source tracking server | Destination tracking server | Note |
---|---|---|
Open source | Open source | common |
Open source | Databricks | less common |
Databricks | Databricks | common |
Databricks | Open source | rare |
Two sets of tools
- Open source MLflow Python scripts.
- Databricks notebooks that invoke the Python scripts.
Tools Overview
Python Scripts
There are two sets of Python scripts:
- Individual tools. Use these tools to export and import individual MLflow objects between tracking servers. They allow you to specify a different destination object name. For example, if you want to clone the experiment /Mary/Experiments/Iris under a new name, you can specify the target experiment name as /John/Experiments/Iris (see the sketch after this list).
- Collection tools. High-level tools to copy an entire tracking server or a collection of MLflow objects (runs and experiments). Full object referential integrity is maintained, as well as the original MLflow object names.
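As a minimal sketch of that clone workflow, the snippet below shells out to the export-experiment and import-experiment console scripts described under Running tools. The flag names (--experiment, --output-dir, --experiment-name, --input-dir) are assumptions; run each script with --help to confirm the exact options.

import subprocess

# Export the source experiment to a scratch directory, then import it
# under a new name. All flag names below are assumptions -- verify with --help.
subprocess.run(
    ["export-experiment",
     "--experiment", "/Mary/Experiments/Iris",   # source experiment (assumed flag)
     "--output-dir", "/tmp/iris-export"],        # scratch directory (assumed flag)
    check=True)
subprocess.run(
    ["import-experiment",
     "--experiment-name", "/John/Experiments/Iris",  # new destination name (assumed flag)
     "--input-dir", "/tmp/iris-export"],             # assumed flag
    check=True)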
Databricks notebooks
Databricks notebooks simply invoke their corresponding Python scripts.
Note that only Individual notebooks are currently available.
See README.
Limitations
General Limitations
- Nested runs are only supported when importing an experiment; importing an individual run with nested runs is still a TODO.
- If the run linked to a registered model version does not exist (i.e., it has been deleted), the version is not exported, since MlflowClient.create_model_version requires a run ID at import time (see the sketch below).
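A hedged pre-export check illustrating this limitation; the model name sklearn_wine is purely illustrative, and the calls are standard MlflowClient APIs.

import mlflow
from mlflow.exceptions import MlflowException

client = mlflow.tracking.MlflowClient()
# Flag versions whose backing run is deleted or missing; such versions
# cannot be re-created on import and are therefore skipped by the export.
for version in client.search_model_versions("name='sklearn_wine'"):
    try:
        run = client.get_run(version.run_id)
        if run.info.lifecycle_stage != "active":
            print(f"version {version.version}: run {version.run_id} is deleted - will be skipped")
    except MlflowException:
        print(f"version {version.version}: run {version.run_id} does not exist - will be skipped")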
Databricks Limitations
Exporting Notebook Revisions
- The notebook revision associated with the run can be exported. It is stored as an artifact in the run's notebooks artifact directory.
- You can save the notebook in the supported SOURCE, HTML, JUPYTER and DBC formats.
- Examples: notebooks/notebook.dbc or notebooks/notebook.source (see the sketch after this list).
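As a minimal sketch, assuming a run whose notebook revision was exported as described above, the snippet below lists the saved copies; the run ID reuses the example value from the tag tables later in this README.

import mlflow

client = mlflow.tracking.MlflowClient()
run_id = "826a19c33d8c461ebf91aa90c25a5dd8"  # example run ID from the tag tables below
# Each exported format appears under the run's notebooks artifact directory
# as notebook.{format}, e.g. notebooks/notebook.dbc or notebooks/notebook.source.
for artifact in client.list_artifacts(run_id, "notebooks"):
    print(artifact.path)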
Importing Notebooks
- Partial functionality due to Databricks REST API limitations.
- The Databricks REST API does not support:
- Importing a notebook with its revision history.
- Linking an imported run with the imported notebook.
- When you import a run, the link to its source notebook revision ID will be a dead link and you cannot access the notebook from the MLflow UI.
- As a convenience, the import tools allow you to import the exported notebook into Databricks.
- You must export a notebook in the SOURCE format for the notebook to be imported.
User ID
- When importing a run or experiment, for open source MLflow you can specify the user owner.
- OSS MLflow: the destination run's mlflow.user tag can be the same as the source mlflow.user tag, since OSS MLflow allows you to set this tag (see the sketch below).
- Databricks MLflow: you cannot set the mlflow.user tag. The mlflow.user will be based on the personal access token (PAT) of the importing user.
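A minimal sketch of the OSS behavior, assuming a local tracking server; the experiment ID and user name are purely illustrative.

import mlflow

# On open source MLflow the mlflow.user tag is writable, so an import can
# carry the source user over. On Databricks this call is rejected and the
# user is derived from your personal access token (PAT).
client = mlflow.tracking.MlflowClient()
run = client.create_run(experiment_id="0")               # "0" is the default experiment
client.set_tag(run.info.run_id, "mlflow.user", "mary")   # illustrative user name
client.set_terminated(run.info.run_id)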
Common options details
notebook-formats
- If exporting a Databricks run, the run's notebook revision can be saved in the specified formats (comma-delimited argument). Each format is saved in the notebooks folder of the run's artifact root directory as notebook.{format}. Supported formats are SOURCE, HTML, JUPYTER and DBC. See the Databricks Export Format documentation.
use-src-user-id
- Set the destination user ID to the source user ID. Source user ID is ignored when importing into Databricks since the user is automatically picked up from your Databricks access token.
export-source-tags
- Exports source information under the mlflow_export_import tag prefix. See the section below for details.
MLflow Export Import Source Run Tags - mlflow_export_import
For governance purposes, original source run information is saved under the mlflow_export_import tag prefix. When you import a run, the values of RunInfo are auto-generated for you, as are some tags. Preserving the source values is useful for provenance and auditing in regulated industries such as finance and HLS (health care and life sciences).
If the export-source-tags option is set on an export tool, three sets of source run tags will be saved under the mlflow_export_import prefix (a sketch showing how to inspect them follows the list below).
- MLflow system tags. Prefix: mlflow_export_import.mlflow. Saves all source MLflow system tags starting with mlflow.
  - For example, the source tag mlflow.source.type is saved as the destination tag mlflow_export_import.mlflow.source.type.
- RunInfo field tags. Prefix: mlflow_export_import.run_info. Saves all source RunInfo fields.
  - For example, RunInfo.run_id is stored as mlflow_export_import.run_info.run_id.
  - Note that since MLflow tag values must be strings, non-string RunInfo fields such as start_time (an int) are cast to strings.
- Metadata tags. Prefix: mlflow_export_import.metadata. Tags indicating source export metadata information.
  - Example: mlflow_export_import.metadata.tracking_uri.
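As a minimal sketch of how to inspect these tags after an import, the snippet below filters a run's tags on the mlflow_export_import prefix; the run ID reuses the example value from the tables below.

import mlflow

client = mlflow.tracking.MlflowClient()
run = client.get_run("826a19c33d8c461ebf91aa90c25a5dd8")  # example run ID from the tables below
# Print only the provenance tags written by mlflow-export-import.
for key, value in sorted(run.data.tags.items()):
    if key.startswith("mlflow_export_import."):
        print(f"{key}: {value}")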
Open Source MLflow Export Import Tags
See sample run tags.
MLflow system tags
Tag | Value |
---|---|
mlflow_export_import.mlflow.log-model.history | [{"run_id": "eb66c160957d4a28b11d3f1b968df9cd" ... |
mlflow_export_import.mlflow.runName | train.sh |
mlflow_export_import.mlflow.source.git.commit | 67fb8f823ec794902cdbb67be653a6155a0b5172 |
mlflow_export_import.mlflow.source.name | /Users/andre/git/mlflow-examples/python/sklearn/wine_quality/train.py |
mlflow_export_import.mlflow.source.type | LOCAL |
mlflow_export_import.mlflow.user | andre |
MLflow RunInfo tags
Tag | Value |
---|---|
mlflow_export_import.run_info.artifact_uri | /opt/mlflow/server/mlruns/2/eb66c160957d4a28b11d3f1b968df9cd/artifacts |
mlflow_export_import.run_info.end_time | 1655004883611 |
mlflow_export_import.run_info.experiment_id | 2 |
mlflow_export_import.run_info.lifecycle_stage | active |
mlflow_export_import.run_info.run_id | eb66c160957d4a28b11d3f1b968df9cd |
mlflow_export_import.run_info.start_time | 1655004878844 |
mlflow_export_import.run_info.status | FINISHED |
mlflow_export_import.run_info.user_id | andre |
Metadata tags
Tag | Value | Description |
---|---|---|
mlflow_export_import.metadata.mlflow_version | 1.26.1 | MLflow version |
mlflow_export_import.metadata.tracking_uri | http://127.0.0.1:5020 | Source tracking server URI |
mlflow_export_import.metadata.experiment_name | sklearn_wine | Name of experiment |
mlflow_export_import.metadata.timestamp | 1655007510 | Time when run was exported |
mlflow_export_import.metadata.timestamp_nice | 2022-06-12 04:18:30 | Time when run was exported (human-readable) |
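As a small worked example, timestamp is epoch seconds and timestamp_nice is its UTC rendering; the snippet below reproduces the row above.

from datetime import datetime, timezone

ts = 1655007510  # mlflow_export_import.metadata.timestamp from the table above
nice = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
print(nice)  # 2022-06-12 04:18:30 -- matches metadata.timestamp_nice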
Databricks MLflow source tags
See sample run tags.
MLflow system tags
Tag | Value |
---|---|
mlflow_export_import.mlflow.databricks.cluster.id | 0318-151752-abed99 |
mlflow_export_import.mlflow.databricks.cluster.info | {"cluster_name":"Shared Autoscaling Americas","spark_version":"10.2.x-cpu-ml-scala2.12","node_type_id":"i3.2xl |
mlflow_export_import.mlflow.databricks.cluster.libraries | {"installable":[],"redacted":[]} |
mlflow_export_import.mlflow.databricks.notebook.commandID | 6101304639030907941_7207589773925520000_e458c0ed7c5c4e52b020b1b92d39b308 |
mlflow_export_import.mlflow.databricks.notebookID | 3532228 |
mlflow_export_import.mlflow.databricks.notebookPath | /Users/andre@mycompany.com/mlflow/02a_Sklearn_Train_Predict |
mlflow_export_import.mlflow.databricks.notebookRevisionID | 1647395473565 |
mlflow_export_import.mlflow.databricks.webappURL | https://demo.cloud.databricks.com |
mlflow_export_import.mlflow.log-model.history | [{"artifact_path":"sklearn-model","signature":{"inputs":"[{"name": "fixed acidity", "type": "double"}, |
mlflow_export_import.mlflow.runName | sklearn |
mlflow_export_import.mlflow.source.name | /Users/andre@mycompany.com/mlflow/02a_Sklearn_Train_Predict |
mlflow_export_import.mlflow.source.type | NOTEBOOK |
mlflow_export_import.mlflow.user | andre@mycompany.com |
MLflow RunInfo tags
Tag | Value |
---|---|
mlflow_export_import.run_info.artifact_uri | dbfs:/databricks/mlflow/3532228/826a19c33d8c461ebf91aa90c25a5dd8/artifacts |
mlflow_export_import.run_info.end_time | 1647395473410 |
mlflow_export_import.run_info.experiment_id | 3532228 |
mlflow_export_import.run_info.lifecycle_stage | active |
mlflow_export_import.run_info.run_id | 826a19c33d8c461ebf91aa90c25a5dd8 |
mlflow_export_import.run_info.run_uuid | 826a19c33d8c461ebf91aa90c25a5dd8 |
mlflow_export_import.run_info.start_time | 1647395462575 |
mlflow_export_import.run_info.status | FINISHED |
mlflow_export_import.run_info.user_id |
Metadata tags
Tag | Value |
---|---|
mlflow_export_import.metadata.mlflow_version | 1.26.1 |
mlflow_export_import.metadata.tracking_uri | databricks |
mlflow_export_import.metadata.experiment_name | /Users/andre@mycompany.com/mlflow/02a_Sklearn_Train_Predict |
mlflow_export_import.metadata.timestamp | 1655148303 |
mlflow_export_import.metadata.timestamp_nice | 2022-06-13 19:25:03 |
Setup
Supports Python 3.7.6 or above.
Local setup
First create a virtual environment.
python -m venv mlflow-export-import
source mlflow-export-import/bin/activate
There are two different ways to install the package.
1. Install from GitHub directly
pip install git+https://github.com/mlflow/mlflow-export-import/#egg=mlflow-export-import
2. Install from GitHub clone
git clone https://github.com/mlflow/mlflow-export-import
cd mlflow-export-import
pip install -e .
Databricks setup
There are two different ways to install the package.
1. Install package in notebook
Install notebook-scoped libraries with %pip.
%pip install git+https://github.com/mlflow/mlflow-export-import/#egg=mlflow-export-import
2. Install package as a wheel on cluster
Build the wheel artifact, upload it to DBFS and then install it on your cluster.
python setup.py bdist_wheel
databricks fs cp dist/mlflow_export_import-1.0.0-py3-none-any.whl {MY_DBFS_PATH}
Databricks MLflow usage
To run the tools externally (from your laptop) against a Databricks tracking server (workspace) set the following environment variables.
export MLFLOW_TRACKING_URI=databricks
export DATABRICKS_HOST=https://mycompany.cloud.databricks.com
export DATABRICKS_TOKEN=MY_TOKEN
For full details see Access the MLflow tracking server from outside Databricks.
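Once these variables are set, a quick hedged smoke test is to list a few experiments with the standard MLflow client (list_experiments is the MLflow 1.x call; MLflow 2.x replaced it with search_experiments).

import mlflow

# Verify the environment variables are picked up by listing a few
# experiments in the Databricks workspace.
mlflow.set_tracking_uri("databricks")
client = mlflow.tracking.MlflowClient()
for experiment in client.list_experiments()[:5]:
    print(experiment.experiment_id, experiment.name)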
Running tools
The main tool scripts can be executed either as a standard Python script or console script.
Python console scripts (such as export-run, import-run, etc.) are provided as a convenience. For a list of scripts see setup.py.
This allows you to use:
export-experiment --help
instead of:
python -u -m mlflow_export_import.experiment.export_experiment --help
Testing
Two types of tests exist: open source and Databricks tests. See tests/README.
Workflow API
- README.md
- The WorkflowApiClient is a Python wrapper around the Databricks REST API that executes job runs in a synchronous polling manner (see the sketch after this list).
- Although a generic tool, within mlflow-export-import it is used for testing Databricks notebook jobs.
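This is not the WorkflowApiClient's actual interface, but a minimal sketch of the synchronous polling pattern it implements, calling the Databricks Jobs 2.1 REST API directly with the environment variables from the setup section.

import os
import time

import requests

host = os.environ["DATABRICKS_HOST"]
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

def wait_for_run(run_id, poll_seconds=10):
    # Poll the job run until it reaches a terminal lifecycle state,
    # then return its result state (e.g. SUCCESS or FAILED).
    while True:
        resp = requests.get(f"{host}/api/2.1/jobs/runs/get",
                            headers=headers, params={"run_id": run_id})
        resp.raise_for_status()
        state = resp.json()["state"]
        if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
            return state.get("result_state")
        time.sleep(poll_seconds)

# Example: result = wait_for_run(12345)  # 12345 is an illustrative job run ID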