Set up and manage an Apache Spark cluster in EC2

Project Description

The CGCloud plugin for Spark lets you set up a fully configured Apache Spark cluster in EC2 in just minutes, regardless of the number of nodes. Apache Spark already comes with a script called spark-ec2 that lets you build a cluster in EC2, but CGCloud Spark differs from spark-ec2 in the following ways:

  • Neither Tachyon nor YARN is included.
  • Setup time does not grow with the number of nodes. Setting up a 100-node cluster takes just as long as setting up a 10-node cluster (2-3 min, as opposed to 45 min with spark-ec2). This is made possible by baking all required software into a single AMI. All slave nodes boot up concurrently and join the cluster autonomously in just a few minutes.
  • Unlike with spark-ec2, the cluster can be stopped and started via the EC2 API or the EC2 console, without involvement of cgcloud.
  • The Spark services (master and worker) run as an unprivileged user, not as root as they do with spark-ec2. The same applies to the HDFS services (namenode, datanode and secondarynamenode).
  • The Spark and Hadoop services are started automatically as the instance boots up, via a regular init script.
  • Nodes can be added easily, simply by booting up new instances from the AMI. They will join the cluster automatically. HDFS may have to be rebalanced after that.
  • You can customize the AMI that cluster nodes boot from by subclassing the SparkMaster and SparkSlave classes.
  • CGCloud Spark uses the CGCloud Agent, which maintains a list of authorized keypairs on each node.
  • CGCloud Spark is based on the official Ubuntu 14.04 LTS (Trusty) image, not the Amazon Linux AMI.
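
The HDFS rebalancing mentioned in the list above could be done with Hadoop's stock balancer. The following is a minimal sketch, assuming that the hdfs command is on the PATH of the node it is run on and that it is invoked as the user that owns the HDFS services; this document confirms neither. The function name rebalance_hdfs and the HDFS override variable are hypothetical.

```shell
# Hypothetical helper, not part of CGCloud: run the HDFS balancer.
# $1 is the allowed deviation in disk usage between datanodes, in percent
# (the balancer's -threshold option); 10 is used here if it is omitted.
# HDFS is an assumed override for the hdfs executable (defaults to "hdfs").
rebalance_hdfs() {
    "${HDFS:-hdfs}" balancer -threshold "${1:-10}"
}
```

After new slaves have joined, calling rebalance_hdfs 5 would ask the balancer to bring every datanode within 5 percentage points of the cluster-wide average usage.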

Prerequisites

The cgcloud-spark package requires that the cgcloud-core package and its prerequisites are present.

Installation

Read the entire section before pasting any commands and ensure that all prerequisites are installed. It is recommended to install this plugin into the virtualenv you created for CGCloud:

source ~/cgcloud/bin/activate
pip install cgcloud-spark

If you get DistributionNotFound: No distributions matching the version for cgcloud-spark, try running pip install --pre cgcloud-spark.
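
The fallback described above can be wrapped in a small shell function. This is only a sketch: the function name install_cgcloud_spark and the PIP override variable are hypothetical, and the function retries with --pre on any failure of the first attempt, not just on DistributionNotFound.

```shell
# Hypothetical helper, not shipped with CGCloud: try the stable release
# first and retry with pre-releases enabled if that attempt fails for
# any reason.
# PIP is an assumed override for the pip executable (defaults to "pip").
install_cgcloud_spark() {
    "${PIP:-pip}" install cgcloud-spark ||
        "${PIP:-pip}" install --pre cgcloud-spark
}
```

Run install_cgcloud_spark inside the activated virtualenv so that the package lands next to cgcloud-core.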

Be sure to configure cgcloud-core before proceeding.

Configuration

Modify your .profile or .bash_profile by adding the following line:

export CGCLOUD_PLUGINS="cgcloud.spark:$CGCLOUD_PLUGINS"
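
If you prefer to script this step, the line can be appended only when it is not already present, so that repeated runs do not duplicate it. A minimal sketch; the function name add_spark_plugin is hypothetical, and the default of ~/.profile is an assumption about your shell setup.

```shell
# Hypothetical helper: append the CGCLOUD_PLUGINS export to a profile file
# unless exactly that line is already there.
# $1 is the profile file to edit (assumed default: ~/.profile).
add_spark_plugin() {
    profile="${1:-$HOME/.profile}"
    # Single quotes keep $CGCLOUD_PLUGINS unexpanded so it is written literally.
    line='export CGCLOUD_PLUGINS="cgcloud.spark:$CGCLOUD_PLUGINS"'
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    grep -qxF -- "$line" "$profile" 2>/dev/null ||
        printf '%s\n' "$line" >> "$profile"
}
```

For example, add_spark_plugin ~/.bash_profile edits .bash_profile instead of .profile.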

Log out and back in (or, on OS X, start a new Terminal tab/window).

Verify the installation by running:

cgcloud list-roles

The output should include the spark-box role.
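
The check can also be automated. The sketch below assumes that cgcloud list-roles prints one role name per line, which this document does not confirm; the function name has_spark_box_role is hypothetical.

```shell
# Hypothetical helper: succeed if the role list read from stdin contains
# a line that is exactly "spark-box".
# Assumes one role name per line with no surrounding whitespace.
has_spark_box_role() {
    grep -qxF 'spark-box'
}
```

Used as cgcloud list-roles | has_spark_box_role, it returns a zero exit status when the plugin has been picked up and a non-zero status otherwise.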

Usage

Create a single t2.micro box to serve as the template for the cluster nodes:

cgcloud create -IT spark-box

The -I option stops the box once it is fully set up and takes an image (AMI) of it. The -T option terminates the box after that.

Now create a cluster by booting a master and the slaves from that AMI:

cgcloud create-cluster spark -s 2 -t m3.large

This will launch a master and two slaves using the m3.large instance type.

SSH into the master:

cgcloud ssh spark-master

… or the first slave:

cgcloud ssh -o 0 spark-slave

… or the second slave:

cgcloud ssh -o 1 spark-slave
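
As the two commands above suggest, the ordinal passed to -o counts slaves from zero. For larger clusters the commands can be generated rather than typed; a minimal sketch, in which the function name slave_ssh_commands is hypothetical:

```shell
# Hypothetical helper: print the ssh command for every slave of a cluster.
# $1 is the number of slaves; ordinals passed to -o start at 0.
slave_ssh_commands() {
    count="$1"
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "cgcloud ssh -o $i spark-slave"
        i=$((i + 1))
    done
}
```

For the two-slave cluster created earlier, slave_ssh_commands 2 prints exactly the two commands shown above.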

Download Files

File Name                      Size     Python Version  File Type  Upload Date
cgcloud_spark-1.6.0-py2.7.egg  22.3 kB  2.7             Egg        Nov 22, 2016
cgcloud-spark-1.6.0.tar.gz     10.0 kB  -               Source     Nov 22, 2016
