
DatabricksTools CLI

A simple command-line application to keep your local filesystem in sync with databricks. This project uses the databricks workspace api.

Key features:

  • List and MkDir commands.
  • Download databricks notebooks as HTML, Jupyter, or Source format.
  • Upload file from your filesystem (source format) to create/overwrite databricks notebooks.
  • Automatically transform markdown files to source!
    • Using markdown syntax instead of source is more convenient.
    • Use codeblocks to change between languages (e.g. python, scala, sql, sh).
    • Integrate local development and execution with databricks notebooks.
$ databrickstools ls --path /Shared
$ databrickstools upload markdown \
    --from-path exploration.md \
    --to-path /Shared/merchants/exploration
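The markdown-to-source transform works roughly like this: prose becomes %md cells, code fences in the notebook's base language become plain code cells, and fences in other languages become %magic cells. A minimal sketch of that idea, assuming the standard databricks SOURCE format for a Python notebook (this is an illustration, not the project's actual implementation, and the function name is hypothetical):

```python
import re

COMMAND_SEP = "# COMMAND ----------"

def markdown_to_source(md_text: str, base_language: str = "python") -> str:
    """Convert a markdown document into a databricks SOURCE-format notebook.

    Prose becomes %md cells, fences in the base language become plain code
    cells, and fences in other languages become %magic cells.
    """
    cells = []
    # re.split with two capture groups yields [text, lang, code, text, ...]
    parts = re.split(r"```(\w*)\n(.*?)```", md_text, flags=re.DOTALL)
    for i in range(0, len(parts), 3):
        prose = parts[i].strip()
        if prose:
            magic = "\n".join("# MAGIC " + line for line in prose.splitlines())
            cells.append("# MAGIC %md\n" + magic)
        if i + 2 < len(parts):
            lang = parts[i + 1] or base_language
            code = parts[i + 2].strip()
            if lang == base_language:
                cells.append(code)
            else:
                magic = "\n".join("# MAGIC " + line for line in code.splitlines())
                cells.append(f"# MAGIC %{lang}\n" + magic)
    sep = f"\n\n{COMMAND_SEP}\n\n"
    return "# Databricks notebook source\n" + sep.join(cells)
```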

Setup

Clone the repository and install the commandline application!

  1. Clone: git clone https://github.com/rhdzmota/databrickstools-cli
  2. Install: pip install -e databrickstools-cli

Note: you'll need Python 3.6 or greater.

Create the .env file containing your environment variables.

  1. Create .env following the example .env.example:
    • cp .env.example .env
  2. Load the variables into the environment:
    • export $(cat .env | xargs)

Note: the .env file can live anywhere on your machine. The commandline application just needs those variables to be present in the environment.
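For reference, a .env might look like the fragment below. The two DATABRICKSTOOLS_DEFAULT_* names appear later in this README; the host and token variable names here are assumptions, so copy the real names from .env.example:

```shell
# Hypothetical .env fragment: take the exact variable names from .env.example.
DATABRICKSTOOLS_HOST=https://<your-workspace>.cloud.databricks.com
DATABRICKSTOOLS_TOKEN=<personal-access-token>
DATABRICKSTOOLS_DEFAULT_FORMAT=SOURCE
DATABRICKSTOOLS_DEFAULT_LANGUAGE=python
```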

Test with a simple command:

$ databrickstools ls --path /Shared

Use case!

Assume you have a file named exploration.md with the following markdown content:

# Exploratory Data Analysis

This is a file containing the initial EDA work.

## Setup

```python
import pandas as pd
from sqlalchemy import create_engine 

connection_string = "..."
engine = create_engine(connection_string) 
```

## Getting the data

Consider the following invalid merchant data:

```python
SQL_INVALID_MERCHANTS = "SELECT * FROM merchants WHERE invalid = 1"

df = pd.read_sql(SQL_INVALID_MERCHANTS, engine)
df.head()
```

Location distribution: 

```python
df.groupby("state").size().to_frame()
```
...

You could use rmarkdown + reticulate to develop and run that report locally, and then use databrickstools to transform and deploy the markdown file as a databricks notebook!

$ databrickstools upload markdown \
    --from-path exploration.md \
    --to-path /Shared/merchants/exploration \
    --base-language python \
    --overwrite

Usage

The CLI is organized into groups and commands. A command is a method call that receives zero or more arguments. A group contains one or more commands with common functionality.

Available groups:

  • download
  • upload
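To make the group/command structure concrete, here is a sketch of how such a layout can be wired with Python's standard argparse (an illustration only, not how this project actually builds its CLI; only a couple of the documented commands are shown):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """A group is a subparser with its own commands; top-level
    commands sit directly under the root parser."""
    parser = argparse.ArgumentParser(prog="databrickstools")
    groups = parser.add_subparsers(dest="group", required=True)

    # Top-level command: ls
    ls = groups.add_parser("ls", help="List databricks resources.")
    ls.add_argument("--path", required=True)

    # Group: download, with its own commands
    download = groups.add_parser("download", help="Download resources.")
    download_cmds = download.add_subparsers(dest="command", required=True)
    html = download_cmds.add_parser("html")
    html.add_argument("--from-path", required=True)
    html.add_argument("--to-path", required=True)
    return parser
```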

Top Level Commands

cmd: ls

List the databricks resources under a given path.

$ databrickstools ls --path <value>

Where:

--path (string)

Path to folder inside databricks.

Example:

$ databrickstools ls --path /Shared

cmd: mkdir

Create a directory on databricks.

$ databrickstools mkdir --path <value>

Where:

--path (string)

Path to folder inside databricks.

Example:

$ databrickstools mkdir --path /Shared/temp

Group: Download

cmd: file

Download a given file from databricks!

$ databrickstools download file \
    --from-path <value> \
    --to-path <value> \
    --file-format <value>

Where:

--from-path (string)

Path to file inside databricks.

--to-path (string)

Path in local machine.

--file-format (string)

Format used to download the file. Default: DATABRICKSTOOLS_DEFAULT_FORMAT

Example:

$ databrickstools download file \
    --from-path /Shared/example \
    --to-path example \
    --file-format SOURCE

cmd: html

Download a notebook as an HTML file.

$ databrickstools download html \
    --from-path <value> \
    --to-path <value>

Where:

--from-path (string)

Path to file inside databricks.

--to-path (string)

Path in local machine.

Example:

$ databrickstools download html \
    --from-path /Shared/example \
    --to-path example.html

cmd: ipynb

Download a notebook as a Jupyter file. Only works for python-based notebooks.

$ databrickstools download ipynb \
    --from-path <value> \
    --to-path <value>

Where:

--from-path (string)

Path to file inside databricks.

--to-path (string)

Path in local machine.

Example:

$ databrickstools download ipynb \
    --from-path /Shared/example \
    --to-path example.ipynb

cmd: source

Download a notebook as a source file. This can be either a .py file for Python or a .sc file for Scala.

$ databrickstools download source \
    --from-path <value> \
    --to-path <value>

Where:

--from-path (string)

Path to file inside databricks.

--to-path (string)

Path in local machine.

Example:

$ databrickstools download source \
    --from-path /Shared/example \
    --to-path example.py

Group: Upload

cmd: markdown

Upload a markdown file to databricks as a notebook.

$ databrickstools upload markdown \
    --from-path <value> \
    --to-path <value> \
    --base-language <value> \
    --overwrite

Where:

--from-path (string)

Path to file.md in your local machine.

--to-path (string)

Path to notebook on databricks.

--base-language (string)

The markdown file might contain multiple languages, but the notebook needs a defined base language. If this flag is not provided, the CLI will try to infer the base language from the file extension, or fall back to DATABRICKSTOOLS_DEFAULT_LANGUAGE.

--overwrite (flag)

If present, the new file will overwrite the current one on databricks.

Example:

$ databrickstools upload markdown \
    --from-path markdown-file.md \
    --to-path /Shared/test \
    --overwrite
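The base-language inference described above can be sketched as follows (the extension mapping and the function name are assumptions for illustration, not the project's actual code):

```python
import os

# Assumed extension-to-language mapping; the CLI's real table may differ.
EXTENSION_LANGUAGES = {
    ".py": "python",
    ".sc": "scala",
    ".scala": "scala",
    ".sql": "sql",
    ".r": "r",
}

def infer_base_language(path: str, default: str = "python") -> str:
    """Pick the notebook base language from the file extension,
    falling back to a default (e.g. DATABRICKSTOOLS_DEFAULT_LANGUAGE)."""
    _, ext = os.path.splitext(path.lower())
    return EXTENSION_LANGUAGES.get(ext, default)
```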

cmd: source

Same as markdown but with the SOURCE format.

Recommendations

To get the most out of markdown, please consider taking a look at:

