
# Spark-EMR

[![Build Status](https://api.travis-ci.org/delijati/spark-emr.svg?branch=master)](https://travis-ci.org/delijati/spark-emr)

Run a Python package on AWS EMR.

## Install

Development install:

    $ pip install -e .

Testing:

    $ pip install tox
    $ tox

## Setup

The easiest way to get EMR up and running is to go through the web interface:
create an SSH key and start a cluster by hand. This creates the subnet and the
default EMR roles that spark-emr needs.

## Config yaml file

Create a `config.yaml` per project, or place a default in
`~/.config/spark-emr.yaml`:

    bootstrap_uri: s3://foo/bar
    master:
      instance_type: m4.large
      size_in_gb: 100
    core:
      instance_type: m4.large
      instance_count: 2
      size_in_gb: 100
    ssh_key: XXXXX
    subnet_id: subnet-XXXXXX
    python_version: python36
    emr_version: emr-5.20.0
    consistent: false
    optimization: false
    region: eu-central-1
    job_flow_role: EMR_EC2_DefaultRole
    service_role: EMR_DefaultRole

## CLI-Interface

### Start

To run Python code on EMR you need to build a proper Python package, i.e. a
`setup.py` that defines `console_scripts`. The script name needs to end in
`.py`, or YARN won't be able to execute it.
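A minimal `setup.py` satisfying this could look like the following sketch; the package and script names are illustrative, chosen to match the `etl.py` example below:

```python
# Hypothetical setup.py for a package runnable via spark-emr.
# The console script name ends in ".py" so YARN can execute it.
from setuptools import setup, find_packages

setup(
    name="etl-pypackage",      # illustrative package name
    version="0.1.0",
    packages=find_packages(),
    install_requires=[],       # add your runtime dependencies here
    entry_points={
        "console_scripts": [
            # installed as etl.py on the cluster's PATH
            "etl.py = etl.main:main",
        ]
    },
)
```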

Bootstrap a cluster, install the pypackage, execute the task in cmdline, poll
cluster until finished, stop cluster:

    $ spark-emr start \
        [--config config.yaml] \
        --name "Spark-ETL" \
        --cmdline "etl.py --input s3://in/in.csv --output s3://out/out.csv" \
        --tags foo 2 bar 4 \
        --poll \
        --yarn-log \
        --package "../"

Running with a released package version (via pip):

    $ spark-emr start \
        ... \
        --package pip+etl_pypackage

### Status

Returns the status of a cluster (also terminated ones):

    $ spark-emr status --cluster-id j-XXXXX

### List

List all clusters:

    $ spark-emr list [--config config.yaml] [--namespace spark_emr]

### Stop

Stop a running cluster:

    $ spark-emr stop --cluster-id j-XXXXX

## Appendix

### Running commands on EMR

The generated command can also be run directly on the master node:

    $ /usr/bin/spark-submit \
        --deploy-mode cluster \
        --master yarn \
        --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python35 \
        --conf spark.executorEnv.PYSPARK_PYTHON=python35 \
        /usr/local/bin/etl.py --input s3://in/in.csv --output s3://out/out.csv
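The `etl.py` above is only an example name. A minimal sketch of such a console script could look like this; the argument names match the examples in this README, and the actual Spark logic is only indicated in comments:

```python
# Hypothetical minimal "etl.py" console script matching the spark-submit
# call above; the actual PySpark work is only sketched in comments.
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Example ETL job")
    parser.add_argument("--input", required=True, help="e.g. s3://in/in.csv")
    parser.add_argument("--output", required=True, help="e.g. s3://out/out.csv")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # In the real job you would create a SparkSession here, e.g.:
    #   from pyspark.sql import SparkSession
    #   spark = SparkSession.builder.appName("etl").getOrCreate()
    #   spark.read.csv(args.input).write.csv(args.output)
    return args


if __name__ == "__main__":
    main()
```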

### Running commands in Docker

    $ docker run --rm -ti -v `pwd`/test/dummy:/app/work spark-base \
        bash -c "cd /app/work && pip3 install -e . && spark_emr_dummy.py 10"


# CHANGES

0.1.1 (2019-02-21)
------------------

- Fixed url in setup.py.


0.1.0 (2019-02-21)
------------------

- Initial release.

