Remote Jupyter Lab kernel for Databricks
Local JupyterLab connecting to Databricks via SSH

This package allows you to connect to a remote Databricks cluster from a locally running JupyterLab.
1 Prerequisites

- Operating system: macOS or Linux. Windows is currently not supported.

- Anaconda installation: a recent version of Anaconda with Python >= 3.5. The `conda` tool must be newer than 4.7.5.

- Databricks CLI: to install the Databricks CLI and configure profile(s) for your cluster(s), please refer to AWS / Azure. Whenever `$PROFILE` is used in this documentation, it refers to a valid Databricks CLI profile name, stored in a shell environment variable.

- SSH access to the Databricks cluster: configure your Databricks clusters to allow SSH access, see Configure SSH access. Only clusters with a valid SSH configuration are visible to databrickslabs_jupyterlab.
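As background on the profiles above: the Databricks CLI stores them as INI sections in `~/.databrickscfg`, and each section name is a valid value for `$PROFILE`. The sketch below uses placeholder `host` and `token` values and shows how the available profile names could be listed with Python's standard `configparser`:

```python
# Sketch: list Databricks CLI profile names from the ~/.databrickscfg
# INI format. The host/token values below are placeholders.
from configparser import ConfigParser

sample = """
[DEFAULT]
host = https://dbc-XXXXXXXX.cloud.databricks.com
token = dapiXXXXXXXXXXXXXXXX

[demo]
host = https://westeurope.azuredatabricks.net
token = dapiXXXXXXXXXXXXXXXX
"""

config = ConfigParser()
config.read_string(sample)

# Section names are the profile names usable as $PROFILE.
# configparser treats [DEFAULT] specially, so it is not listed here.
profiles = config.sections()
print(profiles)  # ['demo']
```

In a real setup you would call `config.read(os.path.expanduser("~/.databrickscfg"))` instead of parsing a sample string.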
2 Installation

- Create a new conda environment and install databrickslabs_jupyterlab with the following commands:

  ```bash
  (base)$ conda create -n db-jlab python=3.6
  (base)$ conda activate db-jlab
  (db-jlab)$ pip install --upgrade databrickslabs-jupyterlab==1.0.2-rc6
  ```

  The prefix `(db-jlab)$` for all command examples in this document assumes that the databrickslabs_jupyterlab conda environment `db-jlab` is activated.

- Bootstrap the environment for databrickslabs_jupyterlab with the following command:

  ```bash
  (db-jlab)$ databrickslabs-jupyterlab -b
  ```

  It finishes with an overview of the usage.
3 Usage

Ensure that SSH access is correctly configured, see Configure SSH access.

3.1 Starting JupyterLab

- Create a Jupyter kernel specification for a Databricks CLI profile `$PROFILE` and start JupyterLab with the following command:

  ```bash
  (db-jlab)$ databrickslabs-jupyterlab $PROFILE -l
  ```

Notes:

- The command with `-l` is a shortcut for

  ```bash
  (db-jlab)$ databrickslabs-jupyterlab $PROFILE -k
  (db-jlab)$ jupyter lab
  ```

  and ensures that the kernel specification is updated (the first step can be omitted if the kernel specification is already up to date).

- A new kernel is available in the kernel change menu (see here for an explanation of the kernel name structure).
3.2 Using Spark in the Notebook

Getting a remote Spark session in the notebook

When the cluster is already running, the status bar of JupyterLab should show the connected state.

To connect to the remote Spark context, enter the following two lines into a notebook cell:

```python
from databrickslabs_jupyterlab.connect import dbcontext
dbcontext()
```

This will prompt you to enter the personal access token for the selected profile:

```text
Fri Aug  9 09:58:04 2019 py4j imported
Enter personal access token for profile 'demo' |_____________________________|
```

After pressing Enter, you will see:

```text
Gateway created for cluster '0806-143104-skirt84' ... connected
The following global variables have been created:
- spark       Spark session
- sc          Spark context
- sqlContext  Hive Context
- dbutils     Databricks utilities
```

Note: `databrickslabs-jupyterlab $PROFILE -c` lets you quickly copy the token to the clipboard so that you can simply paste it into the input box.
Switching kernels

Kernels can be switched via the JupyterLab Kernel Change dialog. However, when switching to a remote kernel, the local connection context might get out of sync and the notebook cannot be used. In this case:

- Shut down the kernel.
- Select the remote kernel again from the JupyterLab Kernel Change dialog.

A simple kernel restart from JupyterLab will not work, since it does not refresh the connection context!
Restart after cluster auto-termination

Should the cluster auto-terminate while the notebook is connected, or should the network connection go down, the status bar will change to show the disconnected state. Additionally, a dialog to confirm that the remote cluster should be started again will open in JupyterLab.

Notes:

- One can check connectivity beforehand, e.g. by calling `ssh <cluster_id>` in a terminal window.
- After cancelling the dialog, clicking on the status bar entry as indicated by the message will open the dialog box again.

During restart, the corresponding status messages will be shown in order in the status bar; after a successful start, it will again show the connected state.
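The connectivity check above (`ssh <cluster_id>` in a terminal) can also be scripted. The helper below is a hypothetical sketch, not part of databrickslabs_jupyterlab; it builds a non-interactive `ssh` probe using only standard OpenSSH options (`BatchMode` prevents hanging on a password prompt, `ConnectTimeout` fails fast when the cluster is unreachable):

```python
# Hypothetical helper (not part of databrickslabs_jupyterlab): probe
# whether a cluster host configured for SSH access is reachable.
import subprocess

def ssh_probe_cmd(cluster_id, timeout=5):
    """Build a non-interactive ssh command that exits quickly."""
    return [
        "ssh",
        "-o", "BatchMode=yes",              # never prompt for a password
        "-o", f"ConnectTimeout={timeout}",  # fail fast if unreachable
        cluster_id,
        "exit",                             # run nothing, just test login
    ]

def cluster_reachable(cluster_id, timeout=5):
    """True if a passwordless ssh login to the cluster succeeds."""
    result = subprocess.run(ssh_probe_cmd(cluster_id, timeout),
                            capture_output=True)
    return result.returncode == 0

print(ssh_probe_cmd("0806-143104-skirt84"))
```

`cluster_reachable("0806-143104-skirt84")` would then return True only if the passwordless login set up in the Configure SSH access step actually works.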
4 Advanced topics
- Creating a mirror of a remote Databricks cluster
- Detailed databrickslabs_jupyterlab command overview
- How it works
5 Test notebooks

To work with the test notebooks in ./examples, the remote cluster needs to have the following libraries installed:
- mlflow==1.0.0
- spark-sklearn