# Spark-EMR

[![Build Status](https://api.travis-ci.org/delijati/spark-emr.svg?branch=master)](https://travis-ci.org/delijati/spark-emr)

Run a Python package on AWS EMR.

## Install

Development install:

    $ pip install -e .

Testing:

    $ pip install tox
    $ tox

## Setup

The easiest way to get EMR up and running is to go through the web interface:
create an SSH key and start a cluster by hand. This creates the SSH key,
subnet and default EMR roles that the config below refers to.
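
If you prefer scripting over the console, the SSH key can also be created with boto3. This is only a sketch and assumes AWS credentials and permissions are already configured; the subnet and the default EMR roles still come from starting a cluster once (or from `aws emr create-default-roles`):

    import boto3

    ec2 = boto3.client("ec2", region_name="eu-central-1")

    # Create the key pair referenced as ssh_key in the config below.
    key = ec2.create_key_pair(KeyName="spark-emr")
    with open("spark-emr.pem", "w") as fh:
        fh.write(key["KeyMaterial"])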

## Config yaml file

Create a `config.yaml` per project, or place a default in
`~/.config/spark-emr.yaml`:

    bootstrap_uri: s3://foo/bar
    master:
        instance_type: m4.large
        size_in_gb: 100
    core:
        instance_type: m4.large
        instance_count: 2
        size_in_gb: 100
    ssh_key: XXXXX
    subnet_id: subnet-XXXXXX
    python_version: python36
    emr_version: emr-5.20.0
    consistent: false
    optimization: false
    region: eu-central-1
    job_flow_role: EMR_EC2_DefaultRole
    service_role: EMR_DefaultRole
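
The `--config` flag on the CLI is optional; as a rough illustration of the per-project-then-default lookup implied above (a sketch only, spark-emr's actual resolution logic may differ):

    import os

    import yaml  # PyYAML


    def load_config(project_config="config.yaml"):
        """Return the first config found: project-local, then the user default."""
        for path in (project_config,
                     os.path.expanduser("~/.config/spark-emr.yaml")):
            if os.path.exists(path):
                with open(path) as fh:
                    return yaml.safe_load(fh)
        raise FileNotFoundError("no spark-emr config found")


    config = load_config()
    print(config["master"]["instance_type"])  # -> m4.large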

## CLI-Interface

### Start

To run Python code on EMR you need to build a proper Python package, i.e. a
`setup.py` with `console_scripts`. The script name needs to end in `.py`,
otherwise YARN won't be able to execute it |-(
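
A minimal package of that shape might look like this (a sketch; the `etl` package name and the `etl.main:main` module path are assumptions, not something spark-emr prescribes):

    # setup.py
    from setuptools import find_packages, setup

    setup(
        name="etl",
        version="0.1.0",
        packages=find_packages(),
        install_requires=["pyspark"],
        entry_points={
            "console_scripts": [
                # the installed script name must end in .py for YARN
                "etl.py = etl.main:main",
            ]
        },
    )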

Bootstrap a cluster, install the Python package, execute the given command
line, poll the cluster until it has finished, and stop the cluster:

    $ spark-emr start \
        [--config config.yaml] \
        --name "Spark-ETL" \
        --bid-master 0.04 \
        --bid-core 0.04 \
        --cmdline "etl.py --input s3://in/in.csv --output s3://out/out.csv" \
        --tags foo 2 bar 4 \
        --poll \
        --yarn-log \
        --package "../"
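
The `etl.py` entry point invoked via `--cmdline` could be a small PySpark job along these lines (a hypothetical sketch; the argument names match the example above, the transformation itself is made up):

    # etl/main.py -- exposed as the etl.py console script
    import argparse

    from pyspark.sql import SparkSession


    def main():
        parser = argparse.ArgumentParser(description="example ETL job")
        parser.add_argument("--input", required=True)
        parser.add_argument("--output", required=True)
        args = parser.parse_args()

        spark = SparkSession.builder.appName("Spark-ETL").getOrCreate()
        df = spark.read.csv(args.input, header=True)
        # placeholder transformation: keep only complete rows
        df.dropna().write.csv(args.output, header=True, mode="overwrite")
        spark.stop()


    if __name__ == "__main__":
        main()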

Running with a released package version from PyPI (pip):

    $ spark-emr start \
        ... \
        --package pip+etl_pypackage

### Status

Returns the status of a cluster (including terminated ones):

    $ spark-emr status --cluster-id j-XXXXX

### List

List all clusters and optionally filter by tag:

    $ spark-emr list [--config config.yaml] [--filter somekey somevalue]

### Stop

Stop a running cluster:

    $ spark-emr stop --cluster-id j-XXXXX

### Spot price check

Returns the current spot price for the configured instance types in all regions:

    $ spark-emr spot

# Appendix

### Running commands on EMR

The generated command can also be run directly on the master node:

    $ /usr/bin/spark-submit \
        --deploy-mode cluster \
        --master yarn \
        --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python35 \
        --conf spark.executorEnv.PYSPARK_PYTHON=python35 \
        /usr/local/bin/etl.py --input s3://in/in.csv --output s3://out/out.csv

### Running commands on docker

To check whether our Spark job runs as expected, we can run it locally in Docker.

    $ git clone https://github.com/delijati/spark-docker
    $ cd spark-docker
    $ docker build . --pull -t spark

Now we can run our Spark job locally:

    $ docker run --rm -ti -v `pwd`/test/dummy:/app/work spark \
        bash -c "cd /app/work && pip3 install -e . && spark_emr_dummy.py 10"


# CHANGES

0.1.2 (2019-03-10)
------------------

- Add spot price check cli.
- Add spot BidPrice.
- Show estimated cost.
- Filter by tag for list cli.


0.1.1 (2019-02-21)
------------------

- Fixed url in setup.py.


0.1.0 (2019-02-21)
------------------

- Initial release.



