A PySpark management framework
A small module that loads as a singleton class object to manage Spark-related concerns: the session, the context, and common configuration.
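The "module as singleton" idea can be sketched in plain Python. This is an illustrative reconstruction, not the actual sparkmanager source: a class instance is registered in `sys.modules`, so every subsequent import yields the same object.

```python
import sys

class SparkManager:
    """Minimal stand-in; the real class wraps the Spark session and context."""
    def __init__(self):
        self.spark = None

    def create(self, name, config=None):
        # The real method builds a SparkSession; here we only record the name.
        self.spark = "<session %s>" % name
        return self.spark

# Registering the instance in sys.modules makes every subsequent
# `import sparkmanager as sm` return this same object.
sys.modules["sparkmanager"] = SparkManager()

import sparkmanager as sm
import sparkmanager as sm2

assert sm is sm2          # one shared instance everywhere
sm.create("My fancy name")
print(sm2.spark)          # <session My fancy name>
```

Because both imports resolve to the same instance, state set up by `create` in one part of an application is visible everywhere else.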
Installation
Directly via pip on the command line, in a virtualenv:
pip install https://github.com/matz-e/sparkmanager/tarball/master
or for the current user:
pip install --user https://github.com/matz-e/sparkmanager/tarball/master
Usage
The module itself acts as a mediator to Spark:
import sparkmanager as sm
# Create a new application
sm.create("My fancy name",
          [("spark.executor.cores", 4), ("spark.executor.memory", "8g")])
data = sm.spark.range(5)
# Will show up in the UI with the name "broadcasting some data"
with sm.jobgroup("broadcasting some data"):
    data = sm.broadcast(data.collect())
The Spark session can be accessed via sm.spark, the Spark context via sm.sc. Both attributes are instantiated once the create method is called. Methods that are unambiguous between the two can also be called directly on the sparkmanager object:
# The following two calls are equivalent
c = sm.parallelize(range(5))
d = sm.sc.parallelize(range(5))
assert c.collect() == d.collect()
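One way such forwarding can work (an illustrative sketch only; the `Manager`, `FakeSession`, and `FakeContext` names are hypothetical and this is not the actual sparkmanager implementation) is attribute delegation via `__getattr__`, refusing names that exist on both objects:

```python
class Manager:
    """Forward attribute lookups to whichever of session/context owns them."""
    def __init__(self, spark, sc):
        self.spark = spark
        self.sc = sc

    def __getattr__(self, name):
        # Only invoked for names not found on the manager itself.
        on_spark = hasattr(self.spark, name)
        on_sc = hasattr(self.sc, name)
        if on_spark and not on_sc:
            return getattr(self.spark, name)
        if on_sc and not on_spark:
            return getattr(self.sc, name)
        raise AttributeError("ambiguous or unknown attribute: %s" % name)

# Stand-ins for the session and context, just to exercise the forwarding
class FakeSession:
    def range(self, n):
        return list(range(n))

class FakeContext:
    def parallelize(self, it):
        return list(it)

m = Manager(FakeSession(), FakeContext())
assert m.parallelize(range(5)) == m.sc.parallelize(range(5))
print(m.range(3))  # [0, 1, 2]
```

Raising on ambiguous names keeps calls like the `parallelize` example above safe: only methods with a single obvious target are forwarded.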
Cluster support scripts
Environment setup
To create a self-contained Spark environment, the script provided in examples/env.sh can be used. It is currently tuned to the requirements of the bbpviz cluster. A usage example:
SPARK_ROOT=/path/to/my/spark/installation SM_WORKDIR=/path/to/a/work/directory examples/env.sh
The working directory will contain:
A Python virtual environment
A basic Spark configuration pointing to directories within the working directory
An environment script to establish the setup
To use the resulting working environment:
. /path/to/a/work/directory/env.sh
Spark deployment on allocations
Within a cluster allocation, the sm_cluster script can be used to start a Spark cluster. The script is installed automatically by pip. To use it, either pass a working directory that contains an environment script, or specify the working directory and environment script separately:
sm_cluster startup $WORKDIR
sm_cluster startup $WORKDIR /path/to/some/env.sh
Similarly, to stop a cluster (not necessary with Slurm):
sm_cluster shutdown $WORKDIR
sm_cluster shutdown $WORKDIR /path/to/some/env.sh
Spark applications can then connect to the master, whose URL can be found via:
cat $WORKDIR/spark_master
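From Python, an application might read that file and pass the URL to Spark. The following is a minimal sketch under assumptions: the file contents (`spark://node001:7077`) and the temporary directory stand in for a real $WORKDIR, and the SparkSession call is only indicated in a comment:

```python
import os
import tempfile

# Simulate a work directory in which sm_cluster has written the master URL
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "spark_master"), "w") as f:
    f.write("spark://node001:7077\n")

# An application reads the URL back, stripping the trailing newline
with open(os.path.join(workdir, "spark_master")) as f:
    master = f.read().strip()

print(master)  # spark://node001:7077
# ...then e.g.: SparkSession.builder.master(master).getOrCreate()
```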
TL;DR on BlueBrain 5
Set up a Spark environment in your current shell, and point WORKDIR to a shared directory. SPARK_HOME needs to be set in your environment and point to your Spark installation. By default, only a file with the Spark master URL and the cluster launch script will be copied to WORKDIR. Then submit a cluster job with:
sbatch -A proj16 -t 24:00:00 -N4 --exclusive -C nvme $(which sm_cluster) startup $WORKDIR
File details
Details for the file sparkmanager-0.7.3.tar.gz
File metadata
- Download URL: sparkmanager-0.7.3.tar.gz
- Upload date:
- Size: 15.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.4.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | bf952e31d9fc6c4945613caae64558625b792596985920bcd4c5fa8b73a97a78
MD5 | 87daeeac0b216399c03492a9c7ffa4d1
BLAKE2b-256 | 2f638a2a4e0230d133aa5e7d78a88b715c0231cd5784aaf82f67a833cfe6c4f4