A library to handle Spark job submission to a YARN cluster in different environments
Project description
A Python library that can submit Spark jobs to a Spark-on-YARN cluster using the REST API
Note: it currently supports CDH (5.6.1) and
HDP (2.3.2.0-2950, 2.4.0.0-169)
The library is inspired by:
github.com/bernhard-42/spark-yarn-rest-api
Getting Started:
Use the library
# Import the SparkJobHandler
from spark_job_handler import SparkJobHandler

...

logger = logging.getLogger('TestLocalJobSubmit')

# Create a Spark job
# job_name: name of the Spark job
# jar: location of the jar (local/hdfs)
# run_class: entry class of the application
# hadoop_rm: hadoop resource manager host ip
# hadoop_web_hdfs: hadoop web hdfs ip
# hadoop_nn: hadoop name node ip (normally the same as web_hdfs)
# env_type: env type is CDH or HDP
# local_jar: flag to define whether the jar is local (a local jar gets uploaded to hdfs)
# spark_properties: custom properties that need to be set
sparkJob = SparkJobHandler(logger=logger,
                           job_name="test_local_job_submit",
                           jar="./simple-project/target/scala-2.10/simple-project_2.10-1.0.jar",
                           run_class="IrisApp",
                           hadoop_rm='rma',
                           hadoop_web_hdfs='nn',
                           hadoop_nn='nn',
                           env_type="CDH",
                           local_jar=True,
                           spark_properties=None)
trackingUrl = sparkJob.run()
print "Job Tracking URL: %s" % trackingUrl
The above code starts a Spark application using the local jar
(simple-project/target/scala-2.10/simple-project_2.10-1.0.jar)
For more examples, see
test_spark_job_handler.py
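The handler can also run a jar that is already available on HDFS. Below is a minimal sketch, assuming the same SparkJobHandler signature as in the example above and the HDFS path used by the remote-jar test step further down; local_jar=False tells the handler not to upload anything:

import logging
from spark_job_handler import SparkJobHandler

logger = logging.getLogger('TestRemoteJobSubmit')

# The jar is assumed to already sit on HDFS (see the upload step below),
# so local_jar=False and no upload takes place before submission.
sparkJob = SparkJobHandler(logger=logger,
                           job_name="test_remote_job_submit",
                           jar="/tmp/test_data/simple-project_2.10-1.0.jar",
                           run_class="IrisApp",
                           hadoop_rm='rma',
                           hadoop_web_hdfs='nn',
                           hadoop_nn='nn',
                           env_type="CDH",
                           local_jar=False,
                           spark_properties=None)
print "Job Tracking URL: %s" % sparkJob.run()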
Build the simple-project
$ cd simple-project
$ sbt package; cd ..
The above steps will create the target jar as: ./simple-project/target/scala-2.10/simple-project_2.10-1.0.jar
Update the node IPs in the tests:
Add the IPs of the Hadoop ResourceManager and NameNode in the
test cases:
* rm: ResourceManager
* nn: NameNode
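A minimal sketch of the kind of values involved (the addresses below are purely hypothetical; the tests pass them as the hadoop_rm, hadoop_web_hdfs and hadoop_nn arguments shown in the example above):

# Hypothetical cluster addresses -- replace with your own nodes
rm = '192.168.10.11'  # ResourceManager host, used as hadoop_rm
nn = '192.168.10.12'  # NameNode / WebHDFS host, used as hadoop_web_hdfs and hadoop_nn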
Load the data and make it available in HDFS:
$ wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
Upload the data to HDFS:
$ python upload_to_hdfs.py <name_node_ip> iris.data /tmp/iris.data
Run the test cases:
Make the simple-project jar available in HDFS to test remote jar:
$ python upload_to_hdfs.py <name_node_ip> simple-project/target/scala-2.10/simple-project_2.10-1.0.jar /tmp/test_data/simple-project_2.10-1.0.jar
Run the test:
$ python test_spark_job_handler.py
Utility:
- upload_to_hdfs.py: uploads a local file to the HDFS file system
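The upload goes over HDFS's WebHDFS REST API. The snippet below is not the bundled script, only a sketch of the two-step WebHDFS CREATE exchange it presumably relies on (function name, port and paths are assumptions; CDH/HDP typically expose WebHDFS on port 50070):

import requests

def webhdfs_upload(name_node_ip, local_path, hdfs_path, port=50070):
    # Step 1: ask the NameNode for a write location; it answers with a
    # 307 redirect pointing at a DataNode.
    url = 'http://%s:%s/webhdfs/v1%s' % (name_node_ip, port, hdfs_path)
    resp = requests.put(url, params={'op': 'CREATE', 'overwrite': 'true'},
                        allow_redirects=False)
    datanode_url = resp.headers['Location']

    # Step 2: stream the file content to the DataNode returned in the redirect.
    with open(local_path, 'rb') as f:
        requests.put(datanode_url, data=f)

webhdfs_upload('nn', 'iris.data', '/tmp/iris.data')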
Notes:
The library is still at an early stage and needs testing, bug fixing and
documentation.
Before running, follow the steps below:
* Update the ResourceManager, NameNode and WebHDFS ports if required in
settings.py
* Make the Spark assembly jar available in HDFS as:
hdfs:/user/spark/share/lib/spark-assembly.jar (see the example below)
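If the assembly jar only exists on the local filesystem (Spark 1.x distributions ship it under their lib/ directory, though the exact file name varies), the bundled upload_to_hdfs.py utility described above can be used to push it to the expected path, for example:

$ python upload_to_hdfs.py <name_node_ip> <path_to_spark_assembly.jar> /user/spark/share/lib/spark-assembly.jar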
For contributions, please create an issue and a corresponding PR.