
A library to handle Spark job submission to a YARN cluster in different environments

Project description

A Python library that submits Spark jobs to a Spark YARN cluster using the REST API

Note: It currently supports CDH (5.6.1) and HDP (2.3.2.0-2950, 2.4.0.0-169)
The library is inspired by github.com/bernhard-42/spark-yarn-rest-api
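
Under the hood, submission through the YARN ResourceManager REST API is a two-step handshake: first request a new application id, then POST the application spec. Below is a minimal sketch of that handshake, not the library's internal code; the host placeholder and the abbreviated am-container-spec are illustrative assumptions:

import requests

RM = "http://<resource_manager_ip>:8088"  # ResourceManager web address

# Step 1: ask the ResourceManager for a new application id.
app = requests.post("%s/ws/v1/cluster/apps/new-application" % RM).json()
app_id = app["application-id"]

# Step 2: submit the application; the body describes the
# ApplicationMaster container (launch command, jars, environment).
spec = {
    "application-id": app_id,
    "application-name": "test_local_job_submit",
    "application-type": "SPARK",
    "am-container-spec": {
        # launch command, local resources, environment, etc.
    },
}
requests.post("%s/ws/v1/cluster/apps" % RM, json=spec)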

Getting Started:

Use the library

import logging

# Import the SparkJobHandler
from spark_job_handler import SparkJobHandler

...

logger = logging.getLogger('TestLocalJobSubmit')
# Create a Spark job
# job_name:          name of the Spark job
# jar:               location of the jar (local/hdfs)
# run_class:         entry class of the application
# hadoop_rm:         Hadoop ResourceManager host IP
# hadoop_web_hdfs:   Hadoop WebHDFS IP
# hadoop_nn:         Hadoop NameNode IP (normally the same as web_hdfs)
# env_type:          env type, CDH or HDP
# local_jar:         flag marking the jar as local (a local jar gets uploaded to HDFS)
# spark_properties:  custom properties that need to be set
sparkJob = SparkJobHandler(logger=logger, job_name="test_local_job_submit",
            jar="./simple-project/target/scala-2.10/simple-project_2.10-1.0.jar",
            run_class="IrisApp", hadoop_rm='rma', hadoop_web_hdfs='nn', hadoop_nn='nn',
            env_type="CDH", local_jar=True, spark_properties=None)
trackingUrl = sparkJob.run()
print("Job Tracking URL: %s" % trackingUrl)
The above code starts a Spark application using the local jar (./simple-project/target/scala-2.10/simple-project_2.10-1.0.jar)
For more examples, see test_spark_job_handler.py

Build the simple-project

$ cd simple-project
$ sbt package
$ cd ..

The above steps create the target jar at ./simple-project/target/scala-2.10/simple-project_2.10-1.0.jar

Update the node IPs in the tests:

Add the node IPs for the Hadoop ResourceManager and NameNode in the test cases (a hypothetical sketch follows this list):
* rm: ResourceManager
* nn: NameNode
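
For instance, the tests might read the hosts from module-level constants; the names below are hypothetical and may not match test_spark_job_handler.py:

# Hypothetical constants -- replace with your cluster's addresses;
# the actual names in test_spark_job_handler.py may differ.
HADOOP_RM = "10.0.0.1"  # rm: ResourceManager host IP
HADOOP_NN = "10.0.0.2"  # nn: NameNode / WebHDFS host IP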

Load the data and make it available on HDFS:

$ wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

Upload the data to HDFS:

$ python upload_to_hdfs.py <name_node_ip> iris.data /tmp/iris.data
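
upload_to_hdfs.py performs a WebHDFS upload, which is a two-step REST exchange: the NameNode answers a CREATE request with a redirect to a DataNode, and the file content is then streamed to that DataNode. A minimal sketch of the same flow, assuming the default WebHDFS port 50070 and the requests package (this is not the shipped script):

import requests

def upload(name_node_ip, local_path, hdfs_path, port=50070):
    """Upload a local file to HDFS via the WebHDFS REST API."""
    # Step 1: ask the NameNode to create the file; it replies with a
    # 307 redirect to a DataNode (hence allow_redirects=False).
    url = ("http://%s:%d/webhdfs/v1%s?op=CREATE&overwrite=true"
           % (name_node_ip, port, hdfs_path))
    resp = requests.put(url, allow_redirects=False)
    datanode_url = resp.headers["Location"]

    # Step 2: stream the file content to the DataNode.
    with open(local_path, "rb") as f:
        requests.put(datanode_url, data=f)

# upload("<name_node_ip>", "iris.data", "/tmp/iris.data")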

Run the test cases:

Make the simple-project jar available in HDFS to test submission with a remote jar:

$ python upload_to_hdfs.py <name_node_ip> simple-project/target/scala-2.10/simple-project_2.10-1.0.jar /tmp/test_data/simple-project_2.10-1.0.jar
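
With the jar staged in HDFS, a remote-jar submission should differ from the earlier example only in the jar path and the local_jar flag. A sketch based on the SparkJobHandler signature shown above; the exact HDFS path form the library expects is an assumption, so check test_spark_job_handler.py:

# Same handler as before, but pointing at the jar already in HDFS,
# so local_jar=False and no upload takes place.
sparkJob = SparkJobHandler(logger=logger, job_name="test_remote_job_submit",
            jar="/tmp/test_data/simple-project_2.10-1.0.jar",
            run_class="IrisApp", hadoop_rm='rma', hadoop_web_hdfs='nn', hadoop_nn='nn',
            env_type="CDH", local_jar=False, spark_properties=None)
trackingUrl = sparkJob.run()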

Run the test:

$ python test_spark_job_handler.py

Utility:

  • upload_to_hdfs.py: uploads a local file to the HDFS file system

Notes:

The library is still at an early stage and needs testing, bug fixes, and documentation.
Before running, follow the steps below:
* Update the ResourceManager, NameNode, and WebHDFS ports in settings.py if required
* Make the Spark assembly jar available in HDFS as hdfs:/user/spark/share/lib/spark-assembly.jar, for example as shown below
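
One way to stage the assembly jar, assuming the hdfs CLI is available and the spark-assembly.jar from your Spark distribution is in the current directory:

$ hdfs dfs -mkdir -p /user/spark/share/lib
$ hdfs dfs -put spark-assembly.jar /user/spark/share/lib/spark-assembly.jar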
To contribute, please create an issue and a corresponding PR.
