
A plugin to run Kedro pipelines on Databricks.

Project description

kedro-databricks


A Kedro plugin for developing Kedro pipelines for Databricks. This plugin aims to provide a smooth developer experience when using Kedro on Databricks, and offers three main features:

  1. Initialization: Transform your local Kedro project into a Databricks Asset Bundle project with a single command.
  2. Generation: Generate Asset Bundle resource definitions with a single command.
  3. Deployment: Deploy your Kedro project to Databricks with a single command.

Overview

The plugin provides a new kedro databricks CLI command group with the following commands:

  • kedro databricks init: Initialize a Kedro project for Databricks.
  • kedro databricks bundle: Generate Asset Bundle resource definitions.
  • kedro databricks deploy: Deploy a Kedro project to Databricks.

Prerequisites

  • Databricks CLI (v0.205 or later), installed and authenticated against your Databricks workspace. Asset Bundle commands are built on the CLI.

Installation

pip install kedro-databricks

Usage

Initialization

To initialize a Kedro project for Databricks, run:

kedro databricks init

This command will create the following files:

├── databricks.yml # Databricks Asset Bundle configuration
├── conf/
│   └── base/
│       └── databricks.yml # Workflow overrides

The databricks.yml file is the main configuration file for the Databricks Asset Bundle. The conf/base/databricks.yml file is used to override the Kedro workflow configuration for Databricks.
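For orientation, a minimal Asset Bundle databricks.yml typically has the following shape. This is an illustrative sketch only, not the plugin's exact output; the bundle name and workspace host are placeholders:

```yaml
# databricks.yml — minimal Asset Bundle shape (illustrative only)
bundle:
  name: my_kedro_project  # hypothetical project name

targets:
  dev:
    default: true
    workspace:
      host: https://example.cloud.databricks.com  # your workspace URL
```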

Override the Kedro workflow configuration for Databricks in the conf/base/databricks.yml file:

# conf/base/databricks.yml

default: # will be applied to all workflows
    job_clusters:
        - job_cluster_key: default
          new_cluster:
            spark_version: 7.3.x-scala2.12
            node_type_id: Standard_DS3_v2
            num_workers: 2
            spark_env_vars:
                KEDRO_LOGGING_CONFIG: /dbfs/FileStore/<package-name>/conf/logging.yml
    tasks: # will be applied to all tasks in each workflow
        - task_key: default
          job_cluster_key: default

<workflow-name>: # will only be applied to the workflow with the specified name
    job_clusters:
        - job_cluster_key: high-concurrency
          new_cluster:
            spark_version: 7.3.x-scala2.12
            node_type_id: Standard_DS3_v2
            num_workers: 2
            spark_env_vars:
                KEDRO_LOGGING_CONFIG: /dbfs/FileStore/<package-name>/conf/logging.yml
    tasks:
        - task_key: default # will be applied to all tasks in the specified workflow
          job_cluster_key: high-concurrency
        - task_key: <my-task> # will only be applied to the specified task in the specified workflow
          job_cluster_key: high-concurrency

The plugin loads all configuration files matching conf/databricks* or conf/databricks/*.
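The override semantics above can be pictured as a recursive dictionary merge: keys under `default` apply everywhere, and keys under a named workflow replace or extend them. The sketch below is illustrative only (the plugin's actual merge logic may differ); the `merge` function and the sample dictionaries are hypothetical:

```python
from copy import deepcopy


def merge(base: dict, override: dict) -> dict:
    """Recursively merge ``override`` into ``base``.

    Nested dicts are merged key by key; any other value
    (lists, scalars) in ``override`` replaces the base value.
    """
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


# `default` applies to all workflows; a named workflow adds or overrides keys.
default = {
    "job_clusters": [{"job_cluster_key": "default"}],
    "tasks": [{"task_key": "default", "job_cluster_key": "default"}],
}
my_workflow = {
    "tasks": [{"task_key": "default", "job_cluster_key": "high-concurrency"}],
}

resolved = merge(default, my_workflow)
print(resolved["job_clusters"])  # inherited from `default`
print(resolved["tasks"])         # replaced by the workflow entry
```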

Generation

To generate Asset Bundle resource definitions, run:

kedro databricks bundle

This command will generate the following files:

├── resources/
│   ├── <project>.yml # resource definitions corresponding to `kedro run`
│   └── <project-pipeline>.yml # resource definitions for each pipeline, corresponding to `kedro run --pipeline <pipeline-name>`

The generated files define the Databricks resources (jobs, job clusters, and tasks) required to run the Kedro pipelines on Databricks.
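The generated files follow the Databricks Asset Bundle jobs schema. As a rough, hypothetical illustration (project, cluster, and task names are invented and do not show the plugin's exact output):

```yaml
# resources/<project>.yml — hypothetical shape of a generated job
resources:
  jobs:
    my_project:
      name: my_project
      job_clusters:
        - job_cluster_key: default
          new_cluster:
            spark_version: 7.3.x-scala2.12
            node_type_id: Standard_DS3_v2
            num_workers: 2
      tasks:
        - task_key: first_node
          job_cluster_key: default
          python_wheel_task:
            package_name: my_project
            entry_point: my_project  # hypothetical entry point
```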

Deployment

To deploy a Kedro project to Databricks, run:

kedro databricks deploy

This command will deploy the Kedro project to Databricks. The deployment process includes the following steps:

  1. Package the Kedro project for a specific environment
  2. Generate Asset Bundle resources definition for that environment
  3. Upload environment-specific /conf files to Databricks
  4. Upload /data/raw/* and ensure other /data directories are created
  5. Deploy Asset Bundle to Databricks



Download files

Download the file for your platform.

Source Distribution

kedro_databricks-0.1.8.tar.gz (20.1 kB)

Built Distribution

kedro_databricks-0.1.8-py3-none-any.whl (14.0 kB)

File details

Details for the file kedro_databricks-0.1.8.tar.gz.

File metadata

  • Download URL: kedro_databricks-0.1.8.tar.gz
  • Size: 20.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.0 CPython/3.12.4

File hashes

Hashes for kedro_databricks-0.1.8.tar.gz

  • SHA256: 4bfad18da09963085c8802490533a89f75ffda2fa51fcf8da67885c5668ae21a
  • MD5: 219dc5b7efb4ec66b3d26976b1ec9d3d
  • BLAKE2b-256: d78627ebf0a6c3fec0ed9ed8862707b7bea2c0ac85fec900238965414ed7eb85

File details

Details for the file kedro_databricks-0.1.8-py3-none-any.whl.

File hashes

Hashes for kedro_databricks-0.1.8-py3-none-any.whl

  • SHA256: 7283fe8b22dd4b343fe4d2fc8f10310d6eb373c8ac97517f43d41daefc3a7d16
  • MD5: 38a3a917bbddb83a67b24c6910a8d267
  • BLAKE2b-256: b1376deae2ab4dcf841bd8e6c2b09d6b1845dca29b337e77536e98320829b924
