A plugin to run Kedro pipelines on Databricks.
kedro-databricks
Kedro plugin to develop Kedro pipelines for Databricks. This plugin strives to provide the ultimate developer experience when using Kedro on Databricks. The plugin provides three main features:
- Initialization: Transform your local Kedro project into a Databricks Asset Bundle project with a single command.
- Generation: Generate Asset Bundle resource definitions with a single command.
- Deployment: Deploy your Kedro project to Databricks with a single command.
Overview
The plugin provides a new `kedro databricks` CLI command group with the following commands:
- `kedro databricks init`: Initialize a Kedro project for Databricks.
- `kedro databricks bundle`: Generate Asset Bundle resource definitions.
- `kedro databricks deploy`: Deploy a Kedro project to Databricks.
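In practice the three commands are typically run in sequence from the root of an existing Kedro project (a sketch of the workflow; prompts and output may vary by version):

```bash
# From the root of a Kedro project
kedro databricks init     # scaffold the Databricks Asset Bundle configuration
kedro databricks bundle   # generate Asset Bundle resource definitions
kedro databricks deploy   # deploy the project to Databricks
```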
Prerequisites
The Databricks CLI must be installed and configured. See the Databricks CLI documentation for more information.
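As a quick sanity check you can confirm the CLI is available and set up authentication from a shell. This is only an illustrative flow; the exact prompts depend on your Databricks CLI version:

```bash
# Confirm the Databricks CLI is installed and on PATH
databricks --version

# Configure authentication against your workspace (prompts for host and token)
databricks configure
```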
Installation
pip install kedro-databricks
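After installing, you can optionally confirm that Kedro picks up the plugin; `kedro info` lists installed plugins, so `kedro-databricks` should appear in its output:

```bash
# List Kedro version and installed plugins
kedro info
```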
Usage
Initialization
To initialize a Kedro project for Databricks, run:
kedro databricks init
This command will create the following files:
├── databricks.yml            # Databricks Asset Bundle configuration
├── conf/
│   └── base/
│       └── databricks.yml    # Workflow overrides
The `databricks.yml` file is the main configuration file for the Databricks Asset Bundle. The `conf/base/databricks.yml` file is used to override the Kedro workflow configuration for Databricks.

Override the Kedro workflow configuration for Databricks in the `conf/base/databricks.yml` file:
# conf/base/databricks.yml
default:  # will be applied to all workflows
  job_clusters:
    - job_cluster_key: default
      new_cluster:
        spark_version: 7.3.x-scala2.12
        node_type_id: Standard_DS3_v2
        num_workers: 2
        spark_env_vars:
          KEDRO_LOGGING_CONFIG: /dbfs/FileStore/<package-name>/conf/logging.yml
  tasks:  # will be applied to all tasks in each workflow
    - task_key: default
      job_cluster_key: default

<workflow-name>:  # will only be applied to the workflow with the specified name
  job_clusters:
    - job_cluster_key: high-concurrency
      new_cluster:
        spark_version: 7.3.x-scala2.12
        node_type_id: Standard_DS3_v2
        num_workers: 2
        spark_env_vars:
          KEDRO_LOGGING_CONFIG: /dbfs/FileStore/<package-name>/conf/logging.yml
  tasks:
    - task_key: default  # will be applied to all tasks in the specified workflow
      job_cluster_key: high-concurrency
    - task_key: <my-task>  # will only be applied to the specified task in the specified workflow
      job_cluster_key: high-concurrency
The plugin loads all configuration files that match `conf/databricks*` or `conf/databricks/*`.
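In other words, the overrides do not have to live in a single `conf/base/databricks.yml`; they can be split across several files as long as the names match those patterns. A purely illustrative layout, assuming the patterns are resolved per environment as in standard Kedro config loading (the file name `clusters.yml` is arbitrary):

```yaml
# conf/base/databricks/clusters.yml -- illustrative: picked up via the databricks/* pattern
default:
  job_clusters:
    - job_cluster_key: default
      new_cluster:
        spark_version: 7.3.x-scala2.12
        node_type_id: Standard_DS3_v2
        num_workers: 2
```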
Generation
To generate Asset Bundle resource definitions, run:
kedro databricks bundle
This command will generate the following files:
├── resources/
│   ├── <project>.yml             # Asset Bundle resources definition corresponding to `kedro run`
│   └── <project-pipeline>.yml    # Asset Bundle resources definition per pipeline, corresponding to `kedro run --pipeline <pipeline-name>`
The generated resource definition files describe the Databricks workflows, tasks, and job clusters required to run the Kedro pipelines on Databricks.
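The exact content depends on your project, pipelines, and plugin version, but the files follow the Databricks Asset Bundle `resources.jobs` schema. A rough, hypothetical sketch (the job key, task names, and parameters below are illustrative, not the plugin's literal output):

```yaml
# resources/<project>.yml -- hypothetical sketch of a generated job definition
resources:
  jobs:
    my_project:                         # illustrative job key
      name: my_project
      tasks:
        - task_key: preprocess_node     # illustrative task, one per pipeline node
          job_cluster_key: default
          python_wheel_task:
            package_name: my_project
            entry_point: my_project
            parameters: ["--nodes", "preprocess_node"]
```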
Deployment
To deploy a Kedro project to Databricks, run:
kedro databricks deploy
This command will deploy the Kedro project to Databricks. The deployment process includes the following steps:
- Package the Kedro project for a specific environment
- Generate Asset Bundle resource definitions for that environment
- Upload environment-specific `/conf` files to Databricks
- Upload `/data/raw/*` and ensure the other `/data` directories are created
- Deploy the Asset Bundle to Databricks
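After a successful deployment, the generated workflows exist as jobs inside the Asset Bundle, so you can validate and trigger them with the Databricks CLI's bundle commands (assuming a CLI version with Asset Bundle support; `<project>` below stands for the job resource key from the generated `resources/<project>.yml`):

```bash
# Validate the Asset Bundle definition
databricks bundle validate

# Trigger a run of a deployed workflow
databricks bundle run <project>
```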
File details
Details for the file `kedro_databricks-0.1.8.tar.gz`.
File metadata
- Download URL: kedro_databricks-0.1.8.tar.gz
- Upload date:
- Size: 20.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.1.0 CPython/3.12.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | 4bfad18da09963085c8802490533a89f75ffda2fa51fcf8da67885c5668ae21a
MD5 | 219dc5b7efb4ec66b3d26976b1ec9d3d
BLAKE2b-256 | d78627ebf0a6c3fec0ed9ed8862707b7bea2c0ac85fec900238965414ed7eb85
File details
Details for the file `kedro_databricks-0.1.8-py3-none-any.whl`.
File metadata
- Download URL: kedro_databricks-0.1.8-py3-none-any.whl
- Upload date:
- Size: 14.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.1.0 CPython/3.12.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | 7283fe8b22dd4b343fe4d2fc8f10310d6eb373c8ac97517f43d41daefc3a7d16
MD5 | 38a3a917bbddb83a67b24c6910a8d267
BLAKE2b-256 | b1376deae2ab4dcf841bd8e6c2b09d6b1845dca29b337e77536e98320829b924