
A PySpark management framework

Project description

A small module that loads as a singleton class object to manage Spark-related functionality.

Installation

Directly via pip on the command line, in a virtualenv:

pip install https://github.com/matz-e/sparkmanager/tarball/master

or for the current user:

pip install --user https://github.com/matz-e/sparkmanager/tarball/master

Usage

The module itself acts as a mediator to Spark:

import sparkmanager as sm

# Create a new application
sm.create("My fancy name",
          [("spark.executor.cores", 4), ("spark.executor.memory", "8g")])

data = sm.spark.range(5)
# Will show up in the UI with the name "broadcasting some data"
with sm.jobgroup("broadcasting some data"):
    data = sm.broadcast(data.collect())

The Spark session can be accessed via sm.spark, and the Spark context via sm.sc. Both attributes are instantiated once the create method has been called. Unambiguous methods of either can also be called directly on the SparkManager object:

# The following two calls are equivalent
c = sm.parallelize(range(5))
d = sm.sc.parallelize(range(5))
assert c.collect() == d.collect()

Cluster support scripts

Environment setup

To create a self-contained Spark environment, the script provided in examples/env.sh can be used. It is currently tuned to the requirements of the bbpviz cluster. A usage example:

SPARK_ROOT=/path/to/my/spark/installation SM_WORKDIR=/path/to/a/work/directory examples/env.sh

The working directory will contain:

  • A Python virtual environment

  • A basic Spark configuration pointing to directories within the working directory

  • An environment script to establish the setup

To use the resulting working environment:

. /path/to/a/work/directory/env.sh

Spark deployment on allocations

Within a cluster allocation, the script sm_cluster can be used to start a Spark cluster. The script is installed automatically by pip. To use it, either pass a working directory that contains an environment script, or specify the working directory and the environment script separately:

sm_cluster startup $WORKDIR
sm_cluster startup $WORKDIR /path/to/some/env.sh

Similarly, to stop a cluster (not necessary with SLURM):

sm_cluster shutdown $WORKDIR
sm_cluster shutdown $WORKDIR /path/to/some/env.sh

Spark applications can then connect to the master, whose URL can be found via:

cat $WORKDIR/spark_master
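
For example, an application can read this file and hand the URL to Spark through the standard spark.master configuration property. A minimal sketch, using placeholder paths and an arbitrary application name, and assuming sm.create passes configuration pairs through to Spark as in the example above:

import sparkmanager as sm

# Read the master URL written by sm_cluster (placeholder path)
with open("/path/to/a/work/directory/spark_master") as fd:
    master = fd.read().strip()

# Point the application at the cluster via the standard spark.master property
sm.create("My application", [("spark.master", master)])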

TL;DR on BlueBrain 5

Set up a Spark environment in your current shell, and point WORKDIR to a shared directory. SPARK_HOME needs to be in your environment and point to your Spark installation. By default, only a file with the Spark master and the cluster launch script will be copied to WORKDIR. Then submit a cluster job with:

sbatch -A proj16 -t 24:00:00 -N4 --exclusive -C nvme $(which sm_cluster) startup $WORKDIR
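
The command above assumes both variables are already exported in the submitting shell, for example (placeholder paths):

export SPARK_HOME=/path/to/my/spark/installation
export WORKDIR=/path/to/a/shared/work/directory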
