A pyspark management framework

A small module that, once imported, acts as a singleton object to manage Spark-related functionality.

Installation

Directly via pip on the command line, in a virtualenv:

pip install https://github.com/matz-e/sparkmanager/tarball/master

or for the current user:

pip install --user https://github.com/matz-e/sparkmanager/tarball/master

Usage

The module itself acts as a mediator to Spark:

import sparkmanager as sm

# Create a new application
sm.create("My fancy name",
          [("spark.executor.cores", 4), ("spark.executor.memory", "8g")])

data = sm.spark.range(5)
# Will show up in the UI with the name "broadcasting some data"
with sm.jobgroup("broadcasting some data"):
    data = sm.broadcast(data.collect())

The Spark session can be accessed via sm.spark and the Spark context via sm.sc. Both attributes are instantiated once the create method is called. Methods of either that are unambiguous can also be called directly on the SparkManager object:

# The following two calls are equivalent
c = sm.parallelize(range(5))
d = sm.sc.parallelize(range(5))
assert c.collect() == d.collect()

Cluster support scripts

Environment setup

To create a self-contained Spark environment, the script provided in examples/env.sh can be used. It is currently tuned to the requirements of the bbpviz cluster. A usage example:

SPARK_ROOT=/path/to/my/spark/installation SM_WORKDIR=/path/to/a/work/directory examples/env.sh

The working directory will contain:

  • A Python virtual environment

  • A basic Spark configuration pointing to directories within the working directory

  • An environment script to establish the setup

To use the resulting working environment:

. /path/to/a/work/directory/env.sh

Spark deployment on allocations

Within a cluster allocation, the sm_cluster script can be used to start a Spark cluster. The script is installed automatically by pip. To use it, either pass a working directory that contains an environment script, or specify the working directory and environment script separately:

sm_cluster startup $WORKDIR
sm_cluster startup $WORKDIR /path/to/some/env.sh

Similarly, to stop a cluster (not necessary with SLURM):

sm_cluster shutdown $WORKDIR
sm_cluster shutdown $WORKDIR /path/to/some/env.sh

Spark applications can then connect to the master, whose address is stored in the working directory:

cat $WORKDIR/spark_master
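
In Python, picking up that address might look like the following sketch (it assumes the file contains a single line such as spark://node001:7077; the helper name read_master and the host name are hypothetical):

```python
from pathlib import Path

def read_master(workdir):
    """Return the Spark master URL that sm_cluster writes to
    WORKDIR/spark_master, assumed to be a single line such as
    'spark://node001:7077'."""
    return Path(workdir, "spark_master").read_text().strip()
```

The returned URL could then be passed as the master setting when building a Spark session.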

TL;DR on BlueBrain 5

Set up a Spark environment in your current shell and point WORKDIR to a shared directory. SPARK_HOME needs to be set in your environment and point to your Spark installation. By default, only a file with the Spark master address and the cluster launch script will be copied to WORKDIR. Then submit a cluster job with:

sbatch -A proj16 -t 24:00:00 -N4 --exclusive -C nvme $(which sm_cluster) startup $WORKDIR
