
Helper to connect to CERN's Spark Clusters


SparkMonitor

SparkMonitor is an extension for Jupyter that enables the live monitoring of Apache Spark Jobs spawned from a notebook. The extension provides several features to monitor and debug a Spark job from within the notebook interface itself.


It was originally developed as part of Google Summer of Code by @krishnan-r. The original repo can be seen here: https://github.com/krishnan-r/sparkmonitor

This extension is composed of a Python package named sparkmonitor, which installs the nbextension and the kernel extension, and an NPM package named @swan-cern/sparkmonitor for the JupyterLab extension (still under development).

Requirements

  • JupyterLab >= 2.0
  • PySpark on Apache Spark version 2.1.1 or higher
  • Jupyter Notebook version 4.4.0 or higher
  • SBT to compile the Scala listener

Install

Note: You will need NodeJS to install the extension.

pip install sparkmonitor
jupyter nbextension install sparkmonitor --py
jupyter nbextension enable  sparkmonitor --py
jupyter serverextension enable --py --system sparkmonitor # this should happen automatically
jupyter lab build

To enable the kernel extension, create the default profile configuration files (skip this step if the config file already exists) and configure the kernel to load the extension on startup. The setting is added to the configuration files in the user's home directory.

ipython profile create
echo "c.InteractiveShellApp.extensions.append('sparkmonitor.kernelextension')" >>  $(ipython profile locate default)/ipython_kernel_config.py
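The echo command above appends the line every time it is run, so repeating the install can leave duplicate entries. A minimal Python sketch of an idempotent variant follows. The helper name `enable_kernel_extension` is hypothetical and not part of sparkmonitor, and the path passed in is assumed to be the `ipython_kernel_config.py` inside the directory printed by `ipython profile locate default`.

```python
from pathlib import Path

# The exact line the README appends to the kernel config.
EXTENSION_LINE = "c.InteractiveShellApp.extensions.append('sparkmonitor.kernelextension')"


def enable_kernel_extension(config_path):
    """Append the sparkmonitor extension line unless it is already present.

    Hypothetical helper: returns True if the file was modified, False otherwise.
    """
    path = Path(config_path)
    existing = path.read_text() if path.exists() else ""
    if any(line.strip() == EXTENSION_LINE for line in existing.splitlines()):
        return False
    # Make sure the appended line starts on its own line.
    prefix = "" if existing == "" or existing.endswith("\n") else "\n"
    with path.open("a") as handle:
        handle.write(prefix + EXTENSION_LINE + "\n")
    return True
```

Calling the helper twice on the same file should leave exactly one copy of the line, which is the only behaviour this sketch is meant to illustrate.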

Usage

To use the extension, it is necessary to set the monitor in the Spark configuration, like so:

spark.extraListeners = sparkmonitor.listener.JupyterSparkMonitorListener

# Pick one of the following:
# For Spark 2
spark.driver.extraClassPath = /usr/local/lib/sparkmonitor/listener_2.11.jar # lives inside the sparkmonitor module
# For Spark 3
spark.driver.extraClassPath = /usr/local/lib/sparkmonitor/listener_2.12.jar # lives inside the sparkmonitor module
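As a rough illustration of the choice above, the two settings can be assembled in Python from the Spark major version. This is only a sketch: the function name `listener_settings` and its parameters are hypothetical, and the module directory is assumed to be wherever the sparkmonitor package is installed (the `/usr/local/lib/sparkmonitor` path above is one example).

```python
# Listener class named in the configuration above.
LISTENER_CLASS = "sparkmonitor.listener.JupyterSparkMonitorListener"

# Spark major version -> jar suffix, as listed above (Spark 2: 2.11, Spark 3: 2.12).
JAR_SUFFIX = {2: "2.11", 3: "2.12"}


def listener_settings(spark_major, module_dir):
    """Build the two Spark settings needed by the monitor (hypothetical helper).

    spark_major: 2 or 3.
    module_dir: directory of the installed sparkmonitor module (an assumption
    the caller must supply; it is not discovered automatically here).
    """
    if spark_major not in JAR_SUFFIX:
        raise ValueError(f"unsupported Spark major version: {spark_major}")
    jar = f"{module_dir.rstrip('/')}/listener_{JAR_SUFFIX[spark_major]}.jar"
    return {
        "spark.extraListeners": LISTENER_CLASS,
        "spark.driver.extraClassPath": jar,
    }
```

With PySpark available, the resulting dictionary could be applied with `SparkConf().setAll(list(settings.items()))` before the context is created.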

To simplify configuration, the kernel extension, when correctly installed, makes the variable swan_spark_conf available inside your notebook with everything already set. To use it, pass it to SparkContext like so:

SparkContext.getOrCreate(conf=swan_spark_conf)

Complete example:

from pyspark import SparkContext
sc = SparkContext.getOrCreate(conf=swan_spark_conf)  # Start the Spark context
rdd = sc.parallelize([1, 2, 4, 8])
rdd.count()

Troubleshoot

Check if the server and nb extension are correctly installed:

jupyter nbextension list
jupyter serverextension list

If the problem is with the kernel extension, check the logs to see whether it was loaded or whether there was a problem with the IPython profile.
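One quick check on the profile side is whether the kernel config file still contains an active (uncommented) line that loads the extension. The sketch below is a hypothetical helper, not part of sparkmonitor. It only inspects the text of `ipython_kernel_config.py` and says nothing about whether the kernel actually loaded the extension, so the logs remain the authoritative source.

```python
EXTENSION_NAME = "sparkmonitor.kernelextension"


def extension_configured(config_text):
    """Return True if an uncommented line mentions the kernel extension.

    Naive check: everything after the first '#' on a line is treated as a
    comment and ignored.
    """
    for line in config_text.splitlines():
        code = line.split("#", 1)[0]  # drop the comment part of the line
        if EXTENSION_NAME in code:
            return True
    return False
```

The text to pass in would be the contents of the config file in the directory reported by `ipython profile locate default`.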

If you are not seeing the frontend JupyterLab extension, check if it's installed:

jupyter labextension list

If it is installed, try:

jupyter lab clean
jupyter lab build

Contributing

Install

The jlpm command is JupyterLab's pinned version of yarn that is installed with JupyterLab. You may use yarn or npm in lieu of jlpm below.

# Clone the repo to your local environment
# Move to sparkmonitor directory

# Install server extension
# This will also build the js code
pip install -e .

# Install and enable the nbextension
jupyter nbextension install sparkmonitor --py --sys-prefix
jupyter nbextension enable  sparkmonitor --py --sys-prefix

# Link your development version of the extension with JupyterLab
jupyter labextension link .
# Rebuild JupyterLab after making any changes
jupyter lab build

# Rebuild Typescript source after making changes
jlpm build
# Rebuild JupyterLab after making any changes
jupyter lab build

You can watch the source directory and run JupyterLab in watch mode to watch for changes in the extension's source and automatically rebuild the extension and application.

# Watch the source directory in one terminal tab
jlpm watch
# Run JupyterLab in watch mode in another terminal tab
jupyter lab --watch

Uninstall

pip uninstall sparkmonitor
jupyter labextension uninstall @swan-cern/sparkmonitor

Download files

Source Distribution: sparkmonitor-1.1.0.tar.gz (3.2 MB)

Built Distribution: sparkmonitor-1.1.0-py3-none-any.whl (3.3 MB, Python 3)
