
A CI/CD tool for testing and deploying to Databricks


Databricks CI/CD


This is a tool for building CI/CD pipelines for Databricks. It is a Python package that works in conjunction with a Git repository (or a simple file structure) to validate and deploy content to Databricks. Currently, it can handle the following content:

  • Workspace - a collection of notebooks written in Scala, Python, R or SQL
  • Jobs - list of Databricks jobs
  • Clusters
  • Instance Pools
  • DBFS - an arbitrary collection of files that may be deployed on a Databricks workspace

Installation

pip install databricks-cicd

Requirements

To use this tool, you need a source directory structure (preferably as a private GIT repository) that has the following structure:

any_local_folder_or_git_repo/
├── workspace/
│   ├── some_notebooks_subdir
│   │   └── Notebook 1.py
│   ├── Notebook 2.sql
│   ├── Notebook 3.r
│   └── Notebook 4.scala
├── jobs/
│   ├── My first job.json
│   └── Side gig.json
├── clusters/
│   ├── orion.json
│   └── Another cluster.json
├── instance_pools/
│   ├── Pool 1.json
│   └── Pool 2.json
└── dbfs/
    ├── strawberry_jam.jar
    ├── subdir
    │   └── some_other.jar
    ├── some_python.egg
    └── Ice cream.jpeg

Note: All folder names represent the default and can be configured. This is just a sample.

Usage

For the latest options and commands run:

cicd -h

A sample command could be:

cicd deploy \
   -w sample_12432.7.azuredatabricks.net \
   -u john.smith@domain.com \
   -t dapi_sample_token_0d5-2 \
   -lp '~/git/my-private-repo' \
   -tp /blabla \
   -c DEV.ini \
   --verbose

Note: Paths on Windows need to be enclosed in double quotes

The default configuration is defined in default.ini and can be overridden with a custom .ini file using the -c option, typically one config file per target environment (see the sample config in the repository).

Create content

Notebooks:

  1. Add a notebook to source
    1. On the Databricks UI, go to your notebook.
    2. Click File -> Export -> Source file.
    3. Add that file to the workspace folder of this repo, keeping the file name unchanged.

Jobs:

  1. Add a job to source
    1. Get the source of the job and write it to a file. You need to have the Databricks CLI and jq installed. On Windows, it is easiest to rename jq-win64.exe to jq.exe and place it in the C:\Windows\System32 folder. Then, on Windows, Linux, or macOS:

      databricks jobs get --job-id 74 | jq .settings > Job_Name.json
      

      This downloads the source JSON of the job from the Databricks server, extracts only the settings from it, and writes them to a file.

      Note: The file name should match the job name inside the JSON file. Please avoid spaces in names.

    2. Add that file to the jobs folder
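For reference, a pared-down job file produced this way might look like the following. The values are illustrative only; the exact fields depend on your job and mirror the settings object of the Databricks Jobs API 2.0:

```json
{
  "name": "Job_Name",
  "new_cluster": {
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2
  },
  "notebook_task": {
    "notebook_path": "/some_notebooks_subdir/Notebook 1"
  },
  "max_concurrent_runs": 1
}
```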

Clusters:

  1. Add a cluster to source
    1. Get the source of the cluster and write it to a file.
      databricks clusters get --cluster-name orion > orion.json
      
      Note: The file name should match the cluster name inside the JSON file. Please avoid spaces in names.
    2. Add that file to the clusters folder
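As a sketch, a minimal orion.json could contain fields like these (illustrative values, drawn from the Clusters API 2.0; note that the actual `databricks clusters get` output also includes additional runtime fields such as the cluster state):

```json
{
  "cluster_name": "orion",
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "autoscale": {
    "min_workers": 1,
    "max_workers": 4
  },
  "autotermination_minutes": 60
}
```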

Instance pools:

  1. Add an instance pool to source
    1. Follow the same steps as for clusters, using the instance-pools CLI commands instead of clusters (note that `databricks instance-pools get` fetches a pool by `--instance-pool-id`, which you can look up with `databricks instance-pools list`). Add the resulting file to the instance_pools folder.
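Spelling out the analogy, an instance pool file such as Pool 1.json would carry fields from the Instance Pools API 2.0 (illustrative values only):

```json
{
  "instance_pool_name": "Pool 1",
  "node_type_id": "Standard_DS3_v2",
  "min_idle_instances": 1,
  "max_capacity": 10,
  "idle_instance_autotermination_minutes": 30
}
```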

DBFS:

  1. Add a file to dbfs
    1. Just add the file to the dbfs folder.
