sparksteps

Workflow tool to launch Spark jobs on AWS EMR

Project description

SparkSteps lets you configure an EMR cluster and upload your Spark script and its dependencies via AWS S3. All you need to do is define an S3 bucket.

Install

pip install sparksteps

CLI Options

Prompt parameters:
  app:                          main Spark script to submit with spark-submit (required)
  app-args:                     arguments passed to main spark script
  app-list:                     space-delimited list of applications to be installed on the EMR cluster (Default: Hadoop Spark)
  aws-region:                   AWS region name
  bid-price:                    specify bid price for task nodes
  bootstrap-script:             include a bootstrap script (s3 path)
  cluster-id:                   job flow id of existing cluster to submit to
  debug:                        allow debugging of cluster
  defaults:                     cluster configurations of the form "<classification1> key1=val1 key2=val2 ..."
  dynamic-pricing-master:       use spot pricing for the master nodes.
  dynamic-pricing-core:         use spot pricing for the core nodes.
  dynamic-pricing-task:         use spot pricing for the task nodes.
  ebs-volume-size-core:         size of the EBS volume to attach to core nodes in GiB.
  ebs-volume-type-core:         type of the EBS volume to attach to core nodes (supported: [standard, gp2, io1]).
  ebs-volumes-per-core:         the number of EBS volumes to attach per core node.
  ebs-optimized-core:           whether to use EBS optimized volumes for core nodes.
  ebs-volume-size-task:         size of the EBS volume to attach to task nodes in GiB.
  ebs-volume-type-task:         type of the EBS volume to attach to task nodes.
  ebs-volumes-per-task:         the number of EBS volumes to attach per task node.
  ebs-optimized-task:           whether to use EBS optimized volumes for task nodes.
  ec2-key:                      name of the Amazon EC2 key pair
  ec2-subnet-id:                Amazon VPC subnet id
  help (-h):                    argparse help
  jobflow-role:                 Amazon EC2 instance profile name to use (Default: EMR_EC2_DefaultRole)
  service-role:                 AWS IAM service role to use for EMR (Default: EMR_DefaultRole)
  keep-alive:                   whether to keep the EMR cluster alive when there are no steps
  log-level (-l):               logging level (default=INFO)
  instance-type-master:         instance type of the master host (default='m4.large')
  instance-type-core:           instance type of the core nodes, must be set when num-core > 0
  instance-type-task:           instance type of the task nodes, must be set when num-task > 0
  maximize-resource-allocation: sets the maximizeResourceAllocation property for the cluster to true when supplied.
  name:                         specify cluster name
  num-core:                     number of core nodes
  num-task:                     number of task nodes
  release-label:                EMR release label
  s3-bucket:                    name of s3 bucket to upload spark file (required)
  s3-path:                      path within s3-bucket to use when writing assets
  s3-dist-cp:                   s3-dist-cp step after spark job is done
  submit-args:                  arguments passed to spark-submit
  tags:                         EMR cluster tags of the form "key1=value1 key2=value2"
  uploads:                      files to upload to /home/hadoop/ in master instance
  wait:                         poll until all steps are complete (or error)
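
The tags and defaults options take "key=value" strings. The sketch below shows one way such values could be turned into the Tags and Configurations shapes that the EMR API expects. It is an illustration, not sparksteps' own parser: the function names parse_tags and parse_defaults are hypothetical, and it assumes the shell has already split the arguments into a list and that defaults holds a single classification.

```python
def _split_pair(item):
    """Split "key=value" on the first "=" and reject malformed items."""
    key, sep, value = item.partition("=")
    if not sep or not key:
        raise ValueError(f"expected key=value, got {item!r}")
    return key, value


def parse_tags(raw_tags):
    """Turn ["key1=value1", "key2=value2"] into EMR-style tag dicts."""
    return [
        {"Key": key, "Value": value}
        for key, value in (_split_pair(item) for item in raw_tags)
    ]


def parse_defaults(raw_defaults):
    """Turn ["<classification>", "key1=val1", ...] into one configuration dict.

    Assumption: exactly one classification, given as the first item.
    """
    if not raw_defaults:
        raise ValueError("expected a classification followed by key=value pairs")
    classification, *pairs = raw_defaults
    properties = dict(_split_pair(item) for item in pairs)
    return {"Classification": classification, "Properties": properties}


if __name__ == "__main__":
    # Mirrors --tags Application="Spark Steps" from the example in this README.
    print(parse_tags(["Application=Spark Steps"]))
    print(parse_defaults(["spark-defaults", "spark.executor.memory=4g"]))
```

Splitting on the first "=" only means a value may itself contain "=" characters.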

Example

AWS_S3_BUCKET=<insert-s3-bucket>
cd sparksteps/
sparksteps examples/episodes.py \
  --s3-bucket $AWS_S3_BUCKET \
  --aws-region us-east-1 \
  --release-label emr-4.7.0 \
  --uploads examples/lib examples/episodes.avro \
  --submit-args="--deploy-mode client --jars /home/hadoop/lib/spark-avro_2.10-2.0.2-custom.jar" \
  --app-args="--input /home/hadoop/episodes.avro" \
  --tags Application="Spark Steps" \
  --debug

The above example creates a single-node EMR cluster with the default instance type m4.large, uploads the PySpark script episodes.py and its dependencies to the specified S3 bucket, and copies the files from S3 to the cluster. Each operation is defined as an EMR "step" that you can monitor in EMR. The final step runs the Spark application with submit args that include a custom spark-avro package and the app arg "--input".
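
To make the "step" idea concrete, here is a minimal sketch of how such a sequence could be expressed as EMR step definitions. It is not the code sparksteps actually generates: the helper names (copy_step, spark_step, build_steps) and the default S3 prefix "sparksteps" are assumptions for illustration, and it handles single files only, not directories such as examples/lib. The dict layout and command-runner.jar follow the EMR API's step format.

```python
import posixpath
import shlex

CLUSTER_HOME = "/home/hadoop/"


def copy_step(s3_uri, local_dir=CLUSTER_HOME):
    """Step that copies one object from S3 onto the master node."""
    return {
        "Name": f"Copy {posixpath.basename(s3_uri)}",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["aws", "s3", "cp", s3_uri, local_dir],
        },
    }


def spark_step(app_path, submit_args=(), app_args=()):
    """Step that runs spark-submit on a script already on the cluster."""
    return {
        "Name": "Run Spark application",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", *submit_args, app_path, *app_args],
        },
    }


def build_steps(bucket, app, uploads=(), submit_args="", app_args="",
                s3_path="sparksteps"):
    """Copy steps for every uploaded file, then one final spark-submit step."""
    steps = []
    for local_file in [*uploads, app]:
        name = posixpath.basename(local_file)
        steps.append(copy_step(f"s3://{bucket}/{s3_path}/{name}"))
    app_on_cluster = CLUSTER_HOME + posixpath.basename(app)
    steps.append(
        spark_step(app_on_cluster, shlex.split(submit_args), shlex.split(app_args))
    )
    return steps


if __name__ == "__main__":
    for step in build_steps(
        "my-bucket",
        "examples/episodes.py",
        uploads=["examples/episodes.avro"],
        submit_args="--deploy-mode client",
        app_args="--input /home/hadoop/episodes.avro",
    ):
        print(step["Name"], step["HadoopJarStep"]["Args"])
```

Keeping each copy and the final spark-submit as separate steps is what makes every operation individually visible in the EMR console.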

Run Spark Job on Existing Cluster

Use the --cluster-id option to upload and run the Spark job on an existing cluster instead of launching a new one. This is especially helpful for debugging.
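
The difference between the two modes can be sketched as follows. This assumes a boto3 EMR client (boto3.client("emr")), whose add_job_flow_steps and run_job_flow methods are used here. The function submit_steps and its parameters are hypothetical and do not describe sparksteps' internals.

```python
def submit_steps(emr_client, steps, cluster_id=None, cluster_config=None):
    """Submit steps to an existing cluster, or launch a new one.

    emr_client:     object with the boto3 EMR client interface (assumed)
    steps:          list of EMR step definitions
    cluster_id:     job flow id of an existing cluster, if any
    cluster_config: extra keyword arguments for run_job_flow (Name,
                    ReleaseLabel, Instances, ...) when launching a cluster
    Returns (job_flow_id, step_ids); step_ids is empty for a new cluster
    because run_job_flow does not return them.
    """
    if cluster_id:
        # Existing cluster: only append the steps.
        response = emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=steps)
        return cluster_id, response["StepIds"]
    # No cluster given: create one that starts with these steps.
    response = emr_client.run_job_flow(Steps=steps, **(cluster_config or {}))
    return response["JobFlowId"], []
```

Reusing a running cluster skips provisioning time, which is why this path suits quick debugging iterations.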

Dynamic Pricing

Use the CLI option --dynamic-pricing-<instance-group> (master, core, or task) to let sparksteps dynamically determine the best bid price for the EMR instances in that instance group.

Currently the algorithm looks back at spot history over the last 12 hours and calculates min(0.8 * on_demand_price, 1.2 * max_spot_price) to determine bid price. That said, if the current spot price is over 80% of the on-demand cost, then on-demand instances are used to be conservative.
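
The rule described above can be written out as a short function. This is a sketch of the stated formula only, not the sparksteps source. The name choose_bid_price is hypothetical, spot_prices is assumed to hold the last 12 hours of spot prices ordered oldest to newest, and falling back to on-demand when no history is available is an added assumption.

```python
def choose_bid_price(on_demand_price, spot_prices):
    """Return a spot bid price, or None to signal "use on-demand instead".

    on_demand_price: hourly on-demand price for the instance type
    spot_prices:     recent spot prices, oldest first (assumed ordering)
    """
    if not spot_prices:
        # Assumption: without any spot history, stay on on-demand.
        return None

    ceiling = 0.8 * on_demand_price
    current_spot = spot_prices[-1]
    if current_spot > ceiling:
        # Spot is already above 80% of on-demand: be conservative.
        return None

    # min(0.8 * on_demand_price, 1.2 * max_spot_price)
    return min(ceiling, 1.2 * max(spot_prices))


if __name__ == "__main__":
    print(choose_bid_price(0.10, [0.02, 0.03, 0.025]))  # bid near 0.036
    print(choose_bid_price(0.10, [0.05, 0.09]))         # None: use on-demand
```

With an on-demand price of 0.10 and a 12-hour spot maximum of 0.03, the bid comes out at about 0.036, which leaves 20% headroom over the recent peak while staying well below the 0.08 ceiling.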

Testing

make test

Blog

Read more about sparksteps in our blog post here: https://www.jwplayer.com/blog/sparksteps/

License

Apache License 2.0

Release history

3.0.1    Dec 23, 2020 (this version)
3.0.0    Aug 20, 2020
2.2.1    Nov 4, 2019
2.2.0    Sep 24, 2019
2.1.0    Aug 29, 2019
2.0.0    Jul 31, 2019
1.1.1    Jul 22, 2019
1.1.0    Jul 12, 2019
1.0.0    Jul 3, 2019
0.3.4    Feb 13, 2019
0.3.3    Oct 17, 2018
0.3.2    Oct 9, 2018
0.3.1    Oct 3, 2018
0.3.0    Oct 2, 2018
0.2.5    Sep 27, 2018
0.2.4    Aug 24, 2017
0.2.3    Jul 20, 2017
0.2.2    Jan 24, 2017
0.2.1    Jan 3, 2017
0.2.0    Jan 3, 2017

Download files

Source Distribution

sparksteps-3.0.1.tar.gz (39.1 kB)

Uploaded Dec 23, 2020 Source

Built Distribution

sparksteps-3.0.1-py2.py3-none-any.whl (19.0 kB)

Uploaded Dec 23, 2020 Python 2, Python 3

File details

Details for the file sparksteps-3.0.1.tar.gz.

File metadata

Download URL: sparksteps-3.0.1.tar.gz
Upload date: Dec 23, 2020
Size: 39.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/44.0.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.8.5

File hashes

Hashes for sparksteps-3.0.1.tar.gz
Algorithm	Hash digest
SHA256	`9fab7f2f5c2f3f92ff1c6fe704302eefff8b037e846bc0df85cdc155d9e6e3cf`
MD5	`ca5df7400ec0b889092f80a75f0dae1b`
BLAKE2b-256	`83f763512f50ec7b1216415672566bb96cd071044f8b58054ea64ab5fd951aff`

File details

Details for the file sparksteps-3.0.1-py2.py3-none-any.whl.

File metadata

Download URL: sparksteps-3.0.1-py2.py3-none-any.whl
Upload date: Dec 23, 2020
Size: 19.0 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/44.0.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.8.5

File hashes

Hashes for sparksteps-3.0.1-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`916415da96d7fd447b9c019e82e7e8bc8e6c763bb547dfef2187e2ea3da6688d`
MD5	`4688b3cde9da01bc3f70e461804036c2`
BLAKE2b-256	`c21458ecbe6ad59f7cac6bf5ae9d79205cf9c3127512f15abf1851263996efe8`
