
Remote Jupyter Lab kernel for Databricks


Local JupyterLab connecting to Databricks via SSH

This package allows you to connect a locally running JupyterLab to a remote Databricks cluster.

1 Prerequisites

  1. Anaconda installation: a recent version of Anaconda with Python >= 3.5. The conda tool must be newer than 4.7.5.

  2. Databricks CLI

    Install Databricks CLI and configure profile(s) for your cluster(s)

    Note:

    • Whenever $PROFILE is used in this documentation, it refers to a valid Databricks CLI profile name, stored in a shell environment variable.
  3. SSH access to the Databricks cluster

    Configure your Databricks clusters to allow ssh access:

    • AWS: SSH Access to the cluster
    • Azure: You need an Azure Databricks cluster that is deployed into your Azure Virtual Network (see VNet Injection). For these clusters the SSH configuration described for AWS is available. You additionally have to open port 2200 in the Network Security Group of your cluster.

    Note:

    • Only clusters with valid ssh configuration can be accessed by databrickslabs_jupyterlab.
    • Creating the ssh key and updating the cluster configuration for SSH access can also be done with databrickslabs_jupyterlab; see below.
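
The Databricks CLI stores profiles as INI sections in ~/.databrickscfg, so a profile's host and token can be inspected with a few lines of Python. This is only a sketch: read_profile is a hypothetical helper (not part of databrickslabs_jupyterlab), and the token value in the usage example below is made up.

```python
import configparser
import os

def read_profile(profile, path="~/.databrickscfg"):
    """Return (host, token) for a Databricks CLI profile.

    The Databricks CLI stores profiles as INI sections in ~/.databrickscfg.
    """
    cfg = configparser.ConfigParser()
    cfg.read(os.path.expanduser(path))
    section = cfg[profile]  # raises KeyError if the profile does not exist
    return section["host"], section["token"]
```

For example, with a `[demo]` section containing `host` and `token` keys, `read_profile("demo")` returns both values as a tuple.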

2 Installation

  • Create a new conda environment and install databrickslabs_jupyterlab with the following commands:

    (base)$ conda create -n db-jlab python=3.6
    (base)$ conda activate db-jlab
    (db-jlab)$ pip install --upgrade databrickslabs-jupyterlab==1.0.2-rc4
    
  • Bootstrap the environment for databrickslabs_jupyterlab with the following command:

    (db-jlab)$ databrickslabs-jupyterlab -b
    

    It finishes with an overview of the usage.

3 Usage

3.1 Configure ssh access to the cluster

If the ssh connection with the cluster is not already configured, get the cluster ID from the cluster URL:

Select menu entry Clusters and then click on the cluster of choice. The URL in the browser address window should look like:

  • AWS: https://$PROFILE.cloud.databricks.com/#/setting/clusters/$CLUSTER_ID/configuration
  • Azure: https://$PROFILE.azuredatabricks.net/?o=$ORG_ID#/setting/clusters/$CLUSTER_ID/configuration

and call:

(db-jlab)$ databrickslabs-jupyterlab $PROFILE -s -i $CLUSTER_ID
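
The cluster ID can also be extracted from the URL programmatically. A small sketch based on the two URL patterns above (cluster_id_from_url is a hypothetical helper, not part of the package):

```python
import re

def cluster_id_from_url(url):
    """Extract the cluster ID from a Databricks cluster configuration URL.

    Works for both the AWS and Azure URL patterns shown above, where the
    cluster ID sits between '/clusters/' and '/configuration'.
    """
    m = re.search(r"/clusters/([^/]+)/configuration", url)
    if m is None:
        raise ValueError(f"no cluster ID found in {url!r}")
    return m.group(1)
```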

3.2 Starting Jupyter Lab

  • Activate the conda environment for databrickslabs-jupyterlab with the following command:

    (base)$ conda activate db-jlab
    
  • Create a Jupyter kernel specification for a Databricks CLI profile ($PROFILE) with the following command:

    (db-jlab)$ databrickslabs-jupyterlab $PROFILE -k -f
    
  • Start Jupyter Lab the usual way:

    (db-jlab)$ jupyter lab
    

Note: A new kernel is available in the kernel change menu. The kernel name has the following structure: SSH $CLUSTER_ID $PROFILE:$CLUSTER_NAME ($LOCAL_CONDA_ENV_NAME), where ($LOCAL_CONDA_ENV_NAME) is omitted if $LOCAL_CONDA_ENV_NAME == $CLUSTER_NAME.

Examples:

  • SSH 0806-143104-skirt84 demo:bernhard-5.5-ml (db-jlab)

    • Workspace profile name: demo
    • Cluster ID: 0806-143104-skirt84
    • Cluster Name: bernhard-5.5-ml
    • Local conda environment: db-jlab
  • SSH 0806-143104-skirt84 demo:bernhard-5.5-ml

    • Workspace profile name: demo
    • Cluster ID: 0806-143104-skirt84
    • Cluster Name: bernhard-5.5-ml
    • Local conda environment: bernhard-5.5-ml
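
The naming rule illustrated by the two examples above can be sketched as a small function (kernel_display_name is hypothetical, for illustration only):

```python
def kernel_display_name(cluster_id, profile, cluster_name, local_env):
    """Build the kernel name as described above; the conda environment
    suffix is omitted when the local environment has the same name as
    the cluster."""
    name = f"SSH {cluster_id} {profile}:{cluster_name}"
    if local_env != cluster_name:
        name += f" ({local_env})"
    return name
```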

3.3 Using Spark in the Notebook

Getting a remote Spark Session in the notebook

When the cluster is already running, the status bar of JupyterLab should show

kernel ready

To connect to the remote Spark context, enter the following two lines into a notebook cell:

[1] from databrickslabs_jupyterlab.connect import dbcontext, is_remote
    dbcontext()

This will prompt for the personal access token of the Databricks CLI profile:

    Fri Aug  9 09:58:04 2019 py4j imported
    Enter personal access token for profile 'demo' |_____________________________|

After pressing Enter, you will see

    Gateway created for cluster '0806-143104-skirt84' ... connected
    The following global variables have been created:
    - spark       Spark session
    - sc          Spark context
    - sqlContext  Hive Context
    - dbutils     Databricks utilities


Note: databrickslabs-jupyterlab $PROFILE -c lets you quickly copy the token to the clipboard so that you can simply paste it into the input box.

Switching kernels

Kernels can be switched via the JupyterLab Kernel Change dialog. However, when switching to a remote kernel, the local connection context might get out of sync and the notebook cannot be used. In this case, (1) shut down the kernel and (2) select the remote kernel again from the JupyterLab Kernel Change dialog. A simple kernel restart in JupyterLab will not work, since it does not refresh the connection context.

Restart after cluster auto-termination

Should the cluster auto-terminate while the notebook is connected, the status bar will change to

  • kernel disconnected

Clicking on the status bar entry as indicated by the message will open a dialog box to confirm that the remote cluster should be started again. During restart the following status messages will be shown in this order:

  • cluster-starting
  • installing-cluster-libs
  • checking-driver-libs
  • installing-driver-libs

If the cluster is up and running but cannot be reached via ssh (e.g. the VPN is not running), one will see

  • cluster unreachable

In this case, check connectivity, e.g. by calling ssh <cluster_id> in a terminal window.

After successful start the status would again show:

  • kernel ready

4 Creating a mirror of a remote Databricks cluster

For the specific use case when the same notebook should run locally and remotely, a local mirror of the remote libraries and versions is needed. There are two ways to achieve this:

  • White list: the mirrored packages are filtered via a white list of data-science-focused libraries (if a package is installed on the remote cluster and is on the white list, it will be installed in the local mirror). The list can be printed with

    databrickslabs-jupyterlab -W
    
  • Black list: the mirrored packages are filtered via a black list of generic libraries (if a package is installed on the remote cluster and is on the black list, it will not be installed in the local mirror). The list can be printed with

    databrickslabs-jupyterlab -B
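
The two filter modes can be sketched as follows (select_packages is a hypothetical helper, not the package's actual implementation):

```python
def select_packages(remote_packages, filter_list, use_whitelist):
    """Decide which remote packages go into the local mirror.

    With a white list, keep only remote packages that are on the list;
    with a black list, keep only remote packages that are NOT on the list.
    """
    if use_whitelist:
        return [p for p in remote_packages if p in filter_list]
    return [p for p in remote_packages if p not in filter_list]
```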
    

A local mirror can be created via databrickslabs_jupyterlab with the following command:

(base)$ conda activate db-jlab
(db-jlab)$ databrickslabs-jupyterlab $PROFILE -m     # filter via black list
# OR
(db-jlab)$ databrickslabs-jupyterlab $PROFILE -m -w  # filter via white list

The command will

  • ask for the cluster to mirror

    Valid version of conda detected: 4.7.11
    
    * Getting host and token from .databrickscfg
    
    * Select remote cluster
    [?] Which cluster to connect to?: 0: bernhard-5.5-ml (id: 0806-143104-skirt84, state: RUNNING, scale: 2-4)
    > 0: bernhard-5.5-ml (id: 0815-32415-abcde42, state: RUNNING, scale: 2-4)
    
    => Selected cluster: bernhard-5.5-ml (ec2-xxx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com)
    
  • configure ssh access

    * Configuring ssh config for remote cluster
    => Added ssh config entry or modified IP address:
    
        Host 0815-32415-abcde42
            HostName ec2-xxx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com
            User ubuntu
            Port 2200
            IdentityFile ~/.ssh/id_demo
            ServerAliveInterval 300
    
    => Testing whether cluster can be reached
    
  • retrieve the necessary libraries to install locally.

    * Installation of local environment to mirror a remote Databricks cluster
    
        Library versions being installed:
        - hyperopt==0.1.2
        - Keras==2.2.4
        - Keras-Applications==1.0.8
        - Keras-Preprocessing==1.1.0
        - matplotlib==2.2.2
        - mleap==0.8.1
        ...
        - tensorflow-estimator==1.13.0
        - torch==1.1.0
        - torchvision==0.3.0
    
  • ask for an environment name (default is the remote cluster name):

        => Provide a conda environment name (default = bernhard-5.5-ml):
    
  • and finally install the new environment:

    * Installing conda environment bernhard-5.5-ml
    ...
    

After switching into this environment via

conda activate bernhard-5.5-ml

follow the usage guide in section 3.
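
The ssh config entry shown in the mirroring output above has a fixed shape; as a sketch (ssh_config_entry is a hypothetical helper, not part of the package), it could be rendered like this:

```python
def ssh_config_entry(cluster_id, hostname, keyfile):
    """Render an ssh config Host block with the fields shown in the
    mirroring output above (User ubuntu, Port 2200, keep-alive)."""
    return (
        f"Host {cluster_id}\n"
        f"    HostName {hostname}\n"
        "    User ubuntu\n"
        "    Port 2200\n"
        f"    IdentityFile {keyfile}\n"
        "    ServerAliveInterval 300\n"
    )
```

An entry like this in ~/.ssh/config is what makes a plain `ssh <cluster_id>` work for connectivity checks.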

5 Details

  • Show help

    (db-jlab)$ databrickslabs-jupyterlab -h
    
    usage: databrickslabs-jupyterlab [-h] [-b] [-m] [-c] [-f] [-i CLUSTER_ID] [-k]
                                    [-l] [-o ORGANISATION] [-p] [-r] [-s] [-v]
                                    [-V {all,diff,same}] [-w] [-W] [-B]
                                    [profile]
    
    Configure remote Databricks access with Jupyter Lab
    
    positional arguments:
    profile               A databricks-cli profile
    
    optional arguments:
    -h, --help            show this help message and exit
    -b, --bootstrap       Bootstrap the local databrickslabs-jupyterlab
                            environment
    -m, --mirror          Mirror a remote Databricks environment
    -c, --clipboard       Copy the personal access token to the clipboard
    -f, --force           Force remote installation of databrickslabs_jupyterlab
                            package
    -i CLUSTER_ID, --id CLUSTER_ID
                            The cluster_id to avoid manual selection
    -k, --kernelspec      Create a kernel specification
    -l, --lab             Safely start Jupyter Lab
    -o ORGANISATION, --organisation ORGANISATION
                            The organisation for Azure Databricks
    -p, --profiles        Show all databricks cli profiles and check SSH key
    -r, --reconfigure     Reconfigure cluster with id cluster_id
    -s, --ssh-config      Configure SSH access for a cluster
    -v, --version         Check version of databrickslabs-jupyterlab
    -V {all,diff,same}, --versioncheck {all,diff,same}
                            Check version of local env with remote env
    -w, --whitelist       Use a whitelist (include pkg) of packages to install
                            instead of blacklist (exclude pkg)
    -W, --print-whitelist
                            Print whitelist (include pkg) of packages to install
    -B, --print-blacklist
                            Print blacklist (exclude pkg) of packages to install
    
  • Show currently available profiles (databrickslabs-jupyterlab -p):

    (db-jlab)$ databrickslabs-jupyterlab -p
    
    Valid version of conda detected: 4.7.10
    
    PROFILE       HOST                                    SSH KEY
    eastus2       https://eastus2.azuredatabricks.net     MISSING
    demo          https://demo.cloud.databricks.com       OK
    

    Note: If the column SSH KEY, e.g. for PROFILE "demo", says "MISSING", use

    (db-jlab)$ ssh-keygen -f ~/.ssh/id_demo -N ""
    

    and add ~/.ssh/id_demo.pub to the SSH config of the respective cluster and restart it.
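
The OK/MISSING column reflects whether a key file named id_$PROFILE exists in ~/.ssh. A sketch of that check (ssh_key_status is a hypothetical helper, not the package's actual implementation):

```python
import os

def ssh_key_status(profile, ssh_dir="~/.ssh"):
    """Report OK/MISSING like the -p table above, assuming the private key
    for a profile is expected at ~/.ssh/id_$PROFILE."""
    key = os.path.join(os.path.expanduser(ssh_dir), f"id_{profile}")
    return "OK" if os.path.exists(key) else "MISSING"
```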

  • Create jupyter kernel for remote cluster

    • Databricks on AWS:

      (db-jlab)$ databrickslabs-jupyterlab $PROFILE -k [-i <cluster id>]
      
    • Azure:

      (db-jlab)$ databrickslabs-jupyterlab $PROFILE -k -o <organisation> [-i <cluster id>]
      

    This will execute the following steps:

    • Gets host and token from .databrickscfg for the given profile
    • If -i is not used, shows a list of clusters that have the correct SSH key (id_$PROFILE) configured
    • Installs databrickslabs_jupyterlab and ipywidgets on the remote driver
    • Creates the remote kernel specification
  • Safely start Jupyter Lab

    While you can start Jupyter Lab via jupyter lab, it is recommended to use the wrapper

    (db-jlab)$ databrickslabs-jupyterlab $PROFILE -l [-i <cluster id>]
    

    It checks whether the remote cluster is up and running, updates the ssh info, and checks the availability of the relevant libraries before starting Jupyter Lab.

  • Copy Personal Access token for databricks workspace to the clipboard

    This is the same command on AWS and Azure

    (db-jlab)$ databrickslabs-jupyterlab $PROFILE -c
    
  • Compare local and remote library versions (uses the locally activated conda environment)

    (db-jlab)$ databrickslabs-jupyterlab $PROFILE -V all|diff|same [-i <cluster id>]
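
The three comparison modes can be sketched as follows (compare_versions is a hypothetical helper operating on package-to-version dicts, not the package's actual implementation):

```python
def compare_versions(local, remote, mode="all"):
    """Sketch of the -V {all,diff,same} comparison.

    local and remote map package names to version strings; 'same' returns
    matching versions, 'diff' returns (local, remote) pairs that differ,
    'all' returns every package from either side.
    """
    common = sorted(set(local) & set(remote))
    if mode == "same":
        return {p: local[p] for p in common if local[p] == remote[p]}
    if mode == "diff":
        return {p: (local[p], remote[p]) for p in common if local[p] != remote[p]}
    return {p: (local.get(p), remote.get(p)) for p in sorted(set(local) | set(remote))}
```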
    

6 Test notebooks

To work with the test notebooks in ./examples the remote cluster needs to have the following libraries installed:

  • mlflow==1.0.0
  • spark-sklearn
