SparkMonitor
SparkMonitor is an extension for Jupyter that enables the live monitoring of Apache Spark Jobs spawned from a notebook. The extension provides several features to monitor and debug a Spark job from within the notebook interface itself.
It was originally developed as part of Google Summer of Code by @krishnan-r. The original repo can be seen here: https://github.com/krishnan-r/sparkmonitor
This extension is composed of a Python package named sparkmonitor, which installs the nbextension and the kernel extension, and an NPM package named @swan-cern/sparkmonitor for the JupyterLab extension (still under development).
Requirements
- JupyterLab >= 2.0
- PySpark on Apache Spark version 2.1.1 or higher
- Jupyter Notebook version 4.4.0 or higher
- SBT to compile the Scala listener
Install
Note: You will need NodeJS to install the extension.
pip install sparkmonitor
jupyter nbextension install sparkmonitor --py
jupyter nbextension enable sparkmonitor --py
jupyter serverextension enable --py --system sparkmonitor # this should happen automatically
jupyter lab build
To enable the kernel extension, create the default IPython profile configuration files (skip this step if they already exist) and configure the kernel to load the extension on startup. The configuration lives in the user's home directory:
ipython profile create
echo "c.InteractiveShellApp.extensions.append('sparkmonitor.kernelextension')" >> $(ipython profile locate default)/ipython_kernel_config.py
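The echo command above can also be expressed in Python. The helper below is illustrative (it is not part of sparkmonitor) and assumes you pass it the path to your ipython_kernel_config.py; it appends the extension line only if it is not already there, so it is safe to run more than once:

```python
from pathlib import Path

# The line the echo command above appends. The config path is supplied by
# the caller; in practice it is
# $(ipython profile locate default)/ipython_kernel_config.py.
EXT_LINE = "c.InteractiveShellApp.extensions.append('sparkmonitor.kernelextension')"

def enable_kernel_extension(config_path):
    """Append EXT_LINE to the config file unless it is already present."""
    path = Path(config_path)
    existing = path.read_text() if path.exists() else ""
    if EXT_LINE in existing:
        return False  # already configured, nothing to do
    with path.open("a") as f:
        f.write(EXT_LINE + "\n")
    return True
```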
Usage
To use the extension, you need to register the monitor in the Spark configuration, like so:
spark.extraListeners = sparkmonitor.listener.JupyterSparkMonitorListener
# Pick one of the following:
# For Spark 2
spark.driver.extraClassPath = /usr/local/lib/sparkmonitor/listener_2.11.jar # lives inside the sparkmonitor module
# For Spark 3
spark.driver.extraClassPath = /usr/local/lib/sparkmonitor/listener_2.12.jar # lives inside the sparkmonitor module
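If you prefer to build this configuration from Python rather than a spark-defaults.conf file, the same two properties can be sketched as a plain dictionary. This is an illustrative sketch, not part of sparkmonitor; the jar path is an assumption (the jar ships inside the installed sparkmonitor module, and the 2.11 vs 2.12 suffix depends on your Spark version):

```python
# The two properties from the configuration above, as a plain dict.
# The jar path is illustrative: pick listener_2.11.jar for Spark 2 or
# listener_2.12.jar for Spark 3, both shipped inside the sparkmonitor module.
monitor_conf = {
    "spark.extraListeners": "sparkmonitor.listener.JupyterSparkMonitorListener",
    "spark.driver.extraClassPath": "/usr/local/lib/sparkmonitor/listener_2.12.jar",
}

def to_spark_defaults(conf):
    """Render the properties in spark-defaults.conf style: one 'key value' per line."""
    return "\n".join(f"{key} {value}" for key, value in sorted(conf.items()))

print(to_spark_defaults(monitor_conf))
```

With pyspark available, the same pairs can be applied before creating the context, e.g. via SparkConf().setAll(monitor_conf.items()).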
To ease the configuration, and if the kernel extension is correctly installed, the variable swan_spark_conf should be available inside your notebook with everything already set. To use it, configure the SparkContext like so:
SparkContext.getOrCreate(conf=swan_spark_conf)
Complete example:
from pyspark import SparkContext
sc = SparkContext.getOrCreate(conf=swan_spark_conf)  # Start the Spark context
rdd = sc.parallelize([1, 2, 4, 8])
rdd.count()
Troubleshoot
Check whether the server extension and the nbextension are correctly installed:
jupyter nbextension list
jupyter serverextension list
If the problem is with the kernel extension, check the kernel logs to see whether it was loaded, or whether there was a problem with the IPython profile.
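A quick way to separate packaging problems from Jupyter wiring problems is to check whether the Python modules are importable at all. The module names below come from this package; the helper itself is an illustrative sketch:

```python
import importlib.util

def importable(name):
    """Return True if the module can be located, without actually importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:  # parent package missing entirely
        return False

for mod in ("sparkmonitor", "sparkmonitor.kernelextension"):
    print(mod, "found" if importable(mod) else "MISSING")
```

If either module prints MISSING, the pip install itself failed, and enabling the nbextension or server extension will not help until that is fixed.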
If you are not seeing the frontend JupyterLab extension, check if it's installed:
jupyter labextension list
If it is installed, try:
jupyter lab clean
jupyter lab build
Contributing
Install
The jlpm command is JupyterLab's pinned version of yarn that is installed with JupyterLab. You may use yarn or npm in lieu of jlpm below.
# Clone the repo to your local environment
# Move to sparkmonitor directory
# Install server extension
# This will also build the js code
pip install -e .
# Install and enable the nbextension
jupyter nbextension install sparkmonitor --py --sys-prefix
jupyter nbextension enable sparkmonitor --py --sys-prefix
# Link your development version of the extension with JupyterLab
jupyter labextension link .
# Rebuild Typescript source after making changes
jlpm build
# Rebuild JupyterLab after making any changes
jupyter lab build
You can watch the source directory and run JupyterLab in watch mode to watch for changes in the extension's source and automatically rebuild the extension and application.
# Watch the source directory in another terminal tab
jlpm watch
# Run jupyterlab in watch mode in one terminal tab
jupyter lab --watch
Uninstall
pip uninstall sparkmonitor
jupyter labextension uninstall @swan-cern/sparkmonitor
Hashes for sparkmonitor-1.1.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | ebdf7545866ccde66c058a37229d4bbbdb13311205308c524a0368b61eb8aea6
MD5 | da314cbb28caa94b9ad15772ca3e5925
BLAKE2b-256 | 280b1332eb97cbaa74d635b90e2282e74b2bd1227c0cd5c6721fc25bee4cc1f2