
Spark Management Consolidated
=============================

A small module that loads as a singleton class object to manage Spark-related
resources and configuration.

Installation
------------

Directly via ``pip`` on the command line, in a `virtualenv`:

.. code:: shell

   pip install https://github.com/matz-e/sparkmanager/tarball/master

or for the current user:

.. code:: shell

   pip install --user https://github.com/matz-e/sparkmanager/tarball/master

Usage
-----

The module itself acts as a mediator to Spark:

.. code:: python

   import sparkmanager as sm

   # Create a new application
   sm.create("My fancy name",
             [("spark.executor.cores", 4), ("spark.executor.memory", "8g")])

   data = sm.spark.range(5)
   # Will show up in the UI with the name "broadcasting some data"
   with sm.jobgroup("broadcasting some data"):
       data = sm.broadcast(data.collect())

The Spark session can be accessed via ``sm.spark``, the Spark context via
``sm.sc``. Both attributes are instantiated once the ``create`` method is
called; methods whose names are unambiguous between the two can also be
called directly on the :py:class:`SparkManager` object:

.. code:: python

   # The following two calls are equivalent
   c = sm.parallelize(range(5))
   d = sm.sc.parallelize(range(5))
   assert c.collect() == d.collect()
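
The same forwarding should hold for session methods; a minimal sketch,
assuming ``SparkSession``-only methods such as ``createDataFrame`` count as
unambiguous and are forwarded in the same way:

.. code:: python

   # ``createDataFrame`` exists only on the session, so the name is unambiguous
   e = sm.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
   f = sm.spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
   assert e.collect() == f.collect()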

Cluster support scripts
-----------------------

.. note::

   Scripts to run on the cluster are still somewhat experimental and should
   be used with caution!

Environment setup
~~~~~~~~~~~~~~~~~

To create a self-contained Spark environment, the script provided in
``examples/env.sh`` can be used. It is currently tuned to the requirements of
the `bbpviz` cluster. A usage example:

.. code:: shell

   SPARK_ROOT=/path/to/my/spark/installation SM_WORKDIR=/path/to/a/work/directory examples/env.sh

The working directory will contain:

* A Python virtual environment
* A basic Spark configuration pointing to directories within the working
  directory
* An environment script to establish the setup

To use the resulting working environment:

.. code:: shell

   . /path/to/a/work/directory/env.sh

Spark deployment on allocations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Within a cluster allocation, the ``sm_cluster`` script can be used to start
a Spark cluster. The script is installed automatically by `pip`. To use it,
either pass a working directory containing an environment script, or specify
the working directory and environment script separately:

.. code:: shell

   sm_cluster startup $WORKDIR
   sm_cluster startup $WORKDIR /path/to/some/env.sh

Similarly, to stop a cluster (not necessary with SLURM):

.. code:: shell

   sm_cluster shutdown $WORKDIR
   sm_cluster shutdown $WORKDIR /path/to/some/env.sh

Spark applications can then connect to the master, whose address can be
obtained via:

.. code:: shell

   cat $WORKDIR/spark_master
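
From Python, the same file can be read and its content handed to ``create``;
a minimal sketch, assuming the configuration-pair form of ``create`` shown
above and a placeholder path for the working directory:

.. code:: python

   import sparkmanager as sm

   # Read the master URL written by sm_cluster (e.g. spark://node001:7077);
   # the path is a placeholder for your actual working directory
   with open("/path/to/a/work/directory/spark_master") as fd:
       master = fd.read().strip()

   # Pass the master via the standard ``spark.master`` setting
   sm.create("My cluster application", [("spark.master", master)])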

TL;DR on BlueBrain 5
~~~~~~~~~~~~~~~~~~~~

Set up a Spark environment in your current shell and point `WORKDIR` to a
shared directory. `SPARK_HOME` needs to be set in your environment and point
to your Spark installation. By default, only a file with the Spark master
address and the cluster launch script will be copied to `WORKDIR`. Then
submit a cluster with:

.. code:: shell

   sbatch -A proj16 -t 24:00:00 -N4 --exclusive -C nvme $(which sm_cluster) startup $WORKDIR

