Deploy Kedro project to Dataiku
Project description
Kedro to Dataiku
Convert Kedro project to Dataiku project in minutes
This is a tool to enable one to deploy a Kedro (>=0.16.5) project on Dataiku DSS instance without modifying the original Kedro project at all.
- Automatic
- Fast
- Flexible
Features
- Create Dataiku datasets automatically based on Kedro dataset catalog
- Convert Kedro nodes into Dataiku recipes
- Convert Kedro pipelines into Dataiku flow
- Create flow zones in Dataiku project based on Kedro pipeline segmentation
- Load all raw input data for the Kedro project into corresponding datasets in Dataiku proeject
- Support PySpark through PySpark recipes in Dataiku
- Support source code clone from git repository
- Enable code editing via Dataiku project library
Adaption of Kedro project
-
As Dataiku flow is basically pandas dataset based, and every single Kedro node will be converted to Dataiku recipe, it is recommended to make inputs and ouputs of Kedro nodes in dataframe format. Pandas DataFrame, PySpark DataFrame, dictionary of Pandas DataFrame will be saeved into Dataiku datasets which can be previewed, while other types of inputs/outputs (array, string, dictionary, etc.) will be saved in managed folders as pickle object.
-
Nodes in Kedro must have distinct function names.
-
As Dataiku recipe must have at least one output, it is recommened to make sure that each Kedro node has at least one output too. However, just in case some nodes in Kedro do not explicitly have any output, this tool will automatically create dummy dataset outputs which are actually not meaningful.
-
If there are local data files, it is required to define the file paths in data catalog configuration either in static format or in templated format (https://kedro.readthedocs.io/en/stable/kedro.config.TemplatedConfigLoader.html) with a macro named as ${data_prefix}. Example:
### data location data --01_raw --sales.csv ### In conf/base/catalog.yml: sales: <<: *csv_tab filepath: ${data_prefix}/01_raw/sales.csv ### in conf/local/globals.yml: data_prefix: data/ ### or simply sales: <<: *csv_tab filepath: data/01_raw/sales.csv
-
In order to create Dataiku zones in flow automatically, there must be pipepline segmentation defined in context.pipelines. The keys of context.pipelines can be used to define zones. For example, if there are pipelines defined as the following, then we can use ["int","primary","master","modeling"] to define the zones.
return { "int": int, "primary": primary, "master": master, "modeling": modeling, "__default__": ( int + primary + master + modeling ), }
Installation
As the package depends on dataiku which is internal module in DSS instance, it is recommneded to install and use this package inside Dataiku DSS.
Install it in Dataiku DSS code enviroment like any other pip packages, or install in Jupyter notebook by
%pip install kedro_to_dataiku
The required packages "dataiku" and "kedro" (>=0.16.5) will be the ones already exist in the Dataiku DSS environment.
Instead of installing the package, one can also upload the kernel file https://github.com/ppvastar/kedro_to_dataiku/blob/main/kedro_to_dataiku/kedro_to_dataiku.py to Dataiku project library
lib/python
so that one can
from kedro_to_dataiku import *
in project code.
Usage
- Create a managed folder in Dataiku project. Let us suppose it to be "workspace".
- Compress (into zip) and upload the whole Kedro project root folder (containing subfolders like data, conf, src, etc) into the managed folder, and uncompress it there.
- Open Jupyter notebook in Dataiku, follow the following steps:
-
Initial set up
import dataiku from kedro_to_dataiku import * ### the absolut path to the Kedro project root folder in Dataiku DSS filesystem. kedro_project_path=dataiku.Folder("workspace").get_path()+"[relative path of the kedro project root folder]" ### package_name: name of the folder in "[kedro project root folder]/src/" which contains "nodes" and "pipelines" subfolders package_name="[Kedro project package name]" ### set dataset connection (location). Or any other established connections (like S3) in Dataiku DSS. connection="filesystem_managed" [or any other established connections (like S3) in Dataiku DSS] ### data foramt in Dataiku dataset format_type="csv" ### define recipe type. Or use "pyspark" if want to create pyspark recipes. recipe_type="python" ### use source code residing in kedro_project_path+"/src". Otherwise, if True, will use source code imported as Dataiku python library -- this option will enable us to edit the soruce code residing in library. src_in_lib=False ### a list of zones to be created. They are from the keys of context.pipelines in the Kedro project. Example: ["int","primary","master","master_ds","modeling"]. Or just keep it as None so that no zones will be created automatically. zone_list=None ### if want to load the raw input data to Dataiku datasets. load_data=False ### if some inputs/outputs of Kedro projects are not Pandas dataframe/Spark dataframe/dictionary of Pandas dataframe format, they will be saved in managed folders instead of Dataiku datasets. This is critical to clarity. folder_list=None
-
Fast creation and clean
## fast creation and clean ### one command to create the projects create_all(kedro_project_path, package_name, connection, recipe_type,folder_list,zone_list,load_data,format_type,src_in_lib) ### one command to clean the projects. Make sure not to delete the managed folder hosting the Kedro project. delete_all(excluded=["workspace"]) ``
-
Create the project step by step
### update data catalog configuration update_catalog_conf(kedro_project_path) ### create datasets input_list,dataset_list=create_datasets(kedro_project_path, package_name,connection,folder_list,format_type,src_in_lib) ### create recipes create_recipes(kedro_project_path, package_name,folder_list,recipe_type,src_in_lib) ### create zones create_zones(zone_list,folder_list,kedro_project_path, package_name,src_in_lib) ### load raw input datasets load_input_datasets(input_list,kedro_project_path, package_name,src_in_lib)
-
Try some other tools:
### list all datastes act_on_project(target="dataset",cmd="list") ### clear data in all datastes act_on_project(target="dataset",cmd="clear",excluded=None) ### delete all datastes act_on_project(target="dataset",cmd="delete",excluded=None) ### return all recipes act_on_project(target="recipe",cmd="list") ### delete all recipes act_on_project(target="recipe",cmd="delete",excluded=None) ### return all zones act_on_project(target="zone",cmd="list") ### delete all zones except the "Default". Caution: do not delete this Default zone otherwise the project flow will corrupt. act_on_project(target="zone",cmd="delete") ### return all folders act_on_project(target="folder",cmd="list") ### delete all folders act_on_project(target="folder",cmd="delete",excluded=['workspace'])
-
In Dataiku, the src code in managed folder is not editable. If one want to do simple and fast edit on code within dataiku after deployment, one can import the source code to project library (https://doc.dataiku.com/dss/latest/python/reusing-code.html) which is editable. To do this, just load (one can use git) the folder in "[kedro project root folder]/src/" which contains "nodes" and "pipelines" subfolders into the lib/python path (keep the module name as the kedro package name), and then set
src_in_lib=True
in previouly mentioned steps.
By doing so, the soruce code (nodes, pipelines, etc) in this library "lib/python/[package_name]" will be used instead of the orgin one under "[kedro project root folder]/src/[package_name]"
-
One can also clone Kedro project from git repository to the managed folder we created previously.
git_url="[git repository URL]" kedro_project_path_in_git="[relative path of Kedro project root folder on git repository]" ### Keep existing data folder under the kedro_project_path in Dataiku managed folder clone_from_git(kedro_project_path,git_url,kedro_project_path_in_git)
-
When code has been edited under path lib/python/[package_name], one may want to copy it back to the managed folder (for further download or other operations). To do this, one can use copy_lib function:
### overwrite=False, module lib/python/[package_name] will be copied to a new folder: [kedro_project_path]/src/[package_name]_lib copy_lib(kedro_project_path,package_name,overwrite=False) ### overwrite=True, [kedro_project_path]/src/[package_name] in managed folder will just be overwritten with the lib/python/[package_name] copy_lib(kedro_project_path,package_name,overwrite=True)
Example
Taking the Kedro 0.17.0 IRIS starter project as an example (https://github.com/ppvastar/kedro_example_iris)
One can create a "workspace" managed folder on dataiku, and then upload the project root folder "iris" into the "workspace".
Then run code like the following in Dataiku project jupyter notebook:
from kedro_to_dataiku import *
import dataiku
kedro_project_path=dataiku.Folder("workspace").get_path()+"/iris"
package_name="iris"
## change the connection according to actual situation
connection="S3_DSS"
recipe_type="python"
src_in_lib=False
load_data=True
format_type="csv"
folder_list=["example_model","example_predictions"]
zone_list=["ds","de"]
create_all(kedro_project_path, package_name, connection, recipe_type,folder_list,zone_list,load_data,format_type,src_in_lib)
As a magic, the Dataiku flow is created and raw input data is loaded in seconds:
The flow is already ready for execution. Note that this simple Iris example is not a typical example, as at the final end there is actually no output (except some log information to report the model accuracy). As a resullt, the end of the Dataiku low is a dummy dataset called as "report_accurary_dummy_output" which will not be utilized.
To clear everything just now generated,
delete_all(excluded=["workspace"])
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.