
A plugin to run Kedro pipelines on Databricks.


kedro-databricks


Kedro plugin to develop Kedro pipelines for Databricks. This plugin strives to provide the ultimate developer experience when using Kedro on Databricks. The plugin provides three main features:

  1. Initialization: Transform your local Kedro project into a Databricks Asset Bundle project with a single command.
  2. Generation: Generate Asset Bundle resources definition with a single command.
  3. Deployment: Deploy your Kedro project to Databricks with a single command.

Installation

To install the plugin, simply run:

pip install kedro-databricks

Now you can use the plugin to develop Kedro pipelines for Databricks.

How to get started

Prerequisites:

Before you begin, ensure that the Databricks CLI is installed and configured. For more information on installation and configuration, please refer to the Databricks CLI documentation.
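Once configured, the CLI stores its connection details in `~/.databrickscfg`. A minimal profile looks like the following (the host and token values are placeholders; use your own workspace URL and credentials):

```ini
# ~/.databrickscfg -- written by `databricks configure` (placeholder values)
[DEFAULT]
host  = https://<your-workspace>.cloud.databricks.com
token = <personal-access-token>
```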

Creating a new project

To create a project based on the databricks-iris starter, ensure you have installed Kedro into a virtual environment. Then use the following command:

pip install kedro

Soon you will be able to initialize the databricks-iris starter with the following command:

kedro new --starter="databricks-iris"

After the project is created, navigate to the newly created project directory:

cd <my-project-name>  # change directory

Install the required dependencies:

pip install -r requirements.txt
pip install kedro-databricks

Now you can initialize the Databricks asset bundle:

kedro databricks init

Next, generate the Asset Bundle resources definition:

kedro databricks bundle

Finally, deploy the Kedro project to Databricks:

kedro databricks deploy

That's it! Your pipelines are now deployed to Databricks as a workflow named [dev <user>] <project_name>. Try running the workflow to see the results.

Commands

kedro databricks init

To initialize a Kedro project for Databricks, run:

kedro databricks init

This command will create the following files:

├── databricks.yml # Databricks Asset Bundle configuration
├── conf/
│   └── base/
│       └── databricks.yml # Workflow overrides

The databricks.yml file is the main configuration file for the Databricks Asset Bundle. The conf/base/databricks.yml file is used to override the Kedro workflow configuration for Databricks.

Override the Kedro workflow configuration for Databricks in the conf/base/databricks.yml file:

# conf/base/databricks.yml

default: # will be applied to all workflows
    job_clusters:
        - job_cluster_key: default
          new_cluster:
            spark_version: 7.3.x-scala2.12
            node_type_id: Standard_DS3_v2
            num_workers: 2
            spark_env_vars:
                KEDRO_LOGGING_CONFIG: /dbfs/FileStore/<package-name>/conf/logging.yml
    tasks: # will be applied to all tasks in each workflow
        - task_key: default
          job_cluster_key: default

<workflow-name>: # will only be applied to the workflow with the specified name
    job_clusters:
        - job_cluster_key: high-concurrency
          new_cluster:
            spark_version: 7.3.x-scala2.12
            node_type_id: Standard_DS3_v2
            num_workers: 2
            spark_env_vars:
                KEDRO_LOGGING_CONFIG: /dbfs/FileStore/<package-name>/conf/logging.yml
    tasks:
        - task_key: default # will be applied to all tasks in the specified workflow
          job_cluster_key: high-concurrency
        - task_key: <my-task> # will only be applied to the specified task in the specified workflow
          job_cluster_key: high-concurrency

The plugin loads all configuration files whose paths match conf/databricks* or conf/databricks/*.
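To illustrate how such overrides can be applied, the sketch below deep-merges a workflow-specific block over the `default` block, so that workflow keys win and untouched defaults survive. This mirrors the override behaviour described above, but it is only an illustration; the plugin's actual merge logic may differ.

```python
# Illustrative sketch of merging a workflow-specific override block over the
# `default` block -- the plugin's real merge logic may differ.

def merge_overrides(default: dict, workflow: dict) -> dict:
    """Return a config where workflow-specific keys override the defaults."""
    merged = dict(default)
    for key, value in workflow.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested mappings are merged recursively.
            merged[key] = merge_overrides(merged[key], value)
        else:
            # Scalars and lists from the workflow block replace the default.
            merged[key] = value
    return merged


default = {
    "job_clusters": [{"job_cluster_key": "default"}],
    "tasks": [{"task_key": "default", "job_cluster_key": "default"}],
}
workflow = {
    "tasks": [{"task_key": "default", "job_cluster_key": "high-concurrency"}],
}

print(merge_overrides(default, workflow))
```

With these inputs, the workflow's `tasks` list replaces the default one, while the default `job_clusters` entry is kept.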

kedro databricks bundle

To generate Asset Bundle resources definition, run:

kedro databricks bundle

This command will generate the following files:

├── resources/
│   ├── <project>.yml # Asset Bundle resources definition corresponds to `kedro run`
│   └── <project-pipeline>.yml # Asset Bundle resources definition for each pipeline corresponds to `kedro run --pipeline <pipeline-name>`

The generated files define the Databricks resources (workflows and their tasks) required to run the Kedro pipelines on Databricks.
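As a rough illustration, each generated file is a Databricks Asset Bundle jobs definition. The sketch below uses hypothetical names (my_project, node_name); the actual structure and keys produced by the plugin for your project will differ:

```yaml
# resources/<project>.yml -- illustrative sketch only; names are hypothetical
resources:
  jobs:
    my_project:
      name: my_project
      tasks:
        - task_key: node_name
          job_cluster_key: default
```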

kedro databricks deploy

To deploy a Kedro project to Databricks, run:

kedro databricks deploy

This command will deploy the Kedro project to Databricks. The deployment process includes the following steps:

  1. Package the Kedro project for a specific environment
  2. Generate the Asset Bundle resources definition for that environment
  3. Upload environment-specific /conf files to Databricks
  4. Upload /data/raw/* and ensure the other /data directories are created
  5. Deploy the Asset Bundle to Databricks
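The steps above can be sketched as the ordered commands a deployment for one environment boils down to. The `-e` environment flag is an assumption for illustration (check `kedro databricks deploy --help` for the real options), and steps 3-4 are handled internally by the plugin rather than by a separate command:

```python
# Illustrative sketch: the deployment steps expressed as shell commands.
# The `-e` flag is an assumption for illustration, not a documented option.

def deploy_steps(env: str) -> list[str]:
    """Build the ordered commands mirroring the deployment steps above."""
    return [
        "kedro package",                          # 1. package the project
        f"kedro databricks bundle -e {env}",      # 2. generate resources for env
        # 3-4. conf/ and data/ uploads are performed by the plugin itself
        f"kedro databricks deploy -e {env}",      # 5. deploy the Asset Bundle
    ]


for cmd in deploy_steps("dev"):
    print(cmd)
```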
