
A library to handle Spark job submission to a YARN cluster across different environments

Project description

A Python library that can submit Spark jobs to a Spark-on-YARN cluster using the REST API.

Note: it currently supports CDH (5.6.1) and HDP (2.3.2.0-2950, 2.4.0.0-169).
The library is inspired by: github.com/bernhard-42/spark-yarn-rest-api

Getting Started:

Use the library

import logging

# Import the SparkJobHandler
from spark_job_handler import SparkJobHandler

...

logger = logging.getLogger('TestLocalJobSubmit')
# Create a Spark job
# job_name:          name of the Spark job
# jar:               location of the jar (local/HDFS)
# run_class:         entry class of the application
# hadoop_rm:         Hadoop ResourceManager host/IP
# hadoop_web_hdfs:   Hadoop WebHDFS host/IP
# hadoop_nn:         Hadoop NameNode host/IP (normally the same as web_hdfs)
# env_type:          environment type, CDH or HDP
# local_jar:         flag marking the jar as local (a local jar gets uploaded to HDFS)
# spark_properties:  custom Spark properties that need to be set
sparkJob = SparkJobHandler(logger=logger, job_name="test_local_job_submit",
            jar="./simple-project/target/scala-2.10/simple-project_2.10-1.0.jar",
            run_class="IrisApp", hadoop_rm='rma', hadoop_web_hdfs='nn', hadoop_nn='nn',
            env_type="CDH", local_jar=True, spark_properties=None)
trackingUrl = sparkJob.run()
print "Job Tracking URL: %s" % trackingUrl
The above code starts a Spark application using the local jar (simple-project/target/scala-2.10/simple-project_2.10-1.0.jar).
For more examples, see test_spark_job_handler.py.
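The same constructor can also point at a jar that already lives in HDFS by passing local_jar=False. The snippet below is only a sketch based on the arguments documented above: it reuses the HDFS path from the test setup further down, assumes spark_properties accepts a dict of Spark property name/value pairs, and the exact form of the remote jar path follows the library's own convention.

# Sketch: submit a jar that is already stored in HDFS (local_jar=False)
# and pass custom Spark properties.
# Assumption: spark_properties takes a dict of property name/value pairs.
import logging

from spark_job_handler import SparkJobHandler

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('TestRemoteJobSubmit')

spark_properties = {
    'spark.executor.memory': '1g',
    'spark.executor.cores': '1',
}

sparkJob = SparkJobHandler(logger=logger, job_name="test_remote_job_submit",
            jar="/tmp/test_data/simple-project_2.10-1.0.jar",
            run_class="IrisApp", hadoop_rm='rma', hadoop_web_hdfs='nn', hadoop_nn='nn',
            env_type="CDH", local_jar=False, spark_properties=spark_properties)
trackingUrl = sparkJob.run()
print("Job Tracking URL: %s" % trackingUrl)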

Build the simple-project

$ cd simple-project
$ sbt package; cd ..

The above steps create the target jar at ./simple-project/target/scala-2.10/simple-project_2.10-1.0.jar

Update the node IPs in the tests:

Add the node IPs for the Hadoop ResourceManager and NameNode in the test cases, as sketched below:
* rm: ResourceManager
* nn: NameNode
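For example, if the tests keep the hosts as plain constants (the names below are hypothetical, not the actual variables in test_spark_job_handler.py), the change is just:

# Hypothetical sketch only: the constant names are assumptions.
HADOOP_RM = '10.0.0.1'   # rm: ResourceManager host/IP
HADOOP_NN = '10.0.0.2'   # nn: NameNode host/IP (also used for WebHDFS)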

Load the data and make it available in HDFS:

$ wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

Upload the data to HDFS:

$ python upload_to_hdfs.py <name_node_ip> iris.data /tmp/iris.data
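For reference, a WebHDFS upload is a two-step REST exchange: a CREATE request to the NameNode that answers with a redirect, followed by a PUT of the file bytes to the returned DataNode URL. The sketch below illustrates that protocol with requests; it is not necessarily how upload_to_hdfs.py is implemented, and port 50070 is only the usual WebHDFS default.

# Illustrative WebHDFS upload (assumes the default WebHDFS port 50070).
import requests

def webhdfs_upload(name_node_ip, local_path, hdfs_path, user='spark'):
    # Step 1: ask the NameNode where to write; it replies with a 307 redirect
    # whose Location header points at a DataNode.
    create_url = ('http://%s:50070/webhdfs/v1%s?op=CREATE&overwrite=true&user.name=%s'
                  % (name_node_ip, hdfs_path, user))
    resp = requests.put(create_url, allow_redirects=False)
    datanode_url = resp.headers['Location']

    # Step 2: send the file content to the DataNode location.
    with open(local_path, 'rb') as f:
        requests.put(datanode_url, data=f)

# Example usage:
# webhdfs_upload('<name_node_ip>', 'iris.data', '/tmp/iris.data')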

Run the test cases:

Make the simple-project jar available in HDFS to test the remote jar:

$ python upload_to_hdfs.py <name_node_ip> simple-project/target/scala-2.10/simple-project_2.10-1.0.jar /tmp/test_data/simple-project_2.10-1.0.jar

Run the test:

$ python test_spark_job_handler.py

Utility:

  • upload_to_hdfs.py: upload a local file to the HDFS file system

Notes:

The library is still at an early stage and needs testing, bug fixing, and documentation.
Before running, follow the steps below:
* Update the ResourceManager, NameNode, and WebHDFS ports in settings.py if required (a sketch of typical defaults follows below)
* Make the Spark assembly jar available in HDFS as: hdfs:/user/spark/share/lib/spark-assembly.jar
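As a sketch, the values that typically need checking look like the following; the variable names are assumptions and 8088/50070/8020 are only the usual ResourceManager, WebHDFS, and NameNode defaults on CDH/HDP, not necessarily what settings.py actually contains:

# Hypothetical settings.py values; names and ports are assumptions.
RESOURCE_MANAGER_PORT = 8088   # YARN ResourceManager REST port
WEB_HDFS_PORT = 50070          # WebHDFS port
NAME_NODE_PORT = 8020          # NameNode RPC port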
For contributions, please create an issue and a corresponding PR.


Download files


Source Distributions

No source distribution files are available for this release.

Built Distribution

spark_yarn_submit-1.0.0-py2.py3-none-any.whl (12.3 kB)

Uploaded: Python 2, Python 3
