A library to handle Spark job submission to a YARN cluster in different environments
Project description
A Python library that submits Spark jobs to a Spark YARN cluster using the REST API.
Note: it currently supports CDH (5.6.1) and
HDP (2.3.2.0-2950, 2.4.0.0-169).
The library is inspired by:
github.com/bernhard-42/spark-yarn-rest-api
Getting Started:
Use the library
# Import the SparkJobHandler
import logging

from spark_job_handler import SparkJobHandler

...
logger = logging.getLogger('TestLocalJobSubmit')
# Create a Spark job
# job_name: name of the Spark job
# jar: location of the jar (local/HDFS)
# run_class: entry class of the application
# hadoop_rm: Hadoop ResourceManager host IP
# hadoop_web_hdfs: Hadoop WebHDFS IP
# hadoop_nn: Hadoop NameNode IP (normally the same as web_hdfs)
# env_type: environment type, CDH or HDP
# local_jar: flag marking the jar as local (a local jar gets uploaded to HDFS)
# spark_properties: custom properties that need to be set
sparkJob = SparkJobHandler(logger=logger, job_name="test_local_job_submit",
                           jar="./simple-project/target/scala-2.10/simple-project_2.10-1.0.jar",
                           run_class="IrisApp", hadoop_rm='rm', hadoop_web_hdfs='nn', hadoop_nn='nn',
                           env_type="CDH", local_jar=True, spark_properties=None)
trackingUrl = sparkJob.run()
print("Job Tracking URL: %s" % trackingUrl)
The above code starts a Spark application using the local jar
(simple-project/target/scala-2.10/simple-project_2.10-1.0.jar).
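A jar already stored in HDFS can be submitted without the upload step. This is a hedged sketch, assuming local_jar=False skips the upload and that the HDFS path matches the one used in the test setup below:

# Sketch: submit a jar that is already present in HDFS (path taken from
# the test setup below); local_jar=False is assumed to skip the upload.
sparkJob = SparkJobHandler(logger=logger, job_name="test_remote_job_submit",
                           jar="hdfs:/tmp/test_data/simple-project_2.10-1.0.jar",
                           run_class="IrisApp", hadoop_rm='rm', hadoop_web_hdfs='nn',
                           hadoop_nn='nn', env_type="CDH", local_jar=False,
                           spark_properties=None)
trackingUrl = sparkJob.run()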
For more examples, see
test_spark_job_handler.py
Build the simple-project:
$ cd simple-project
$ sbt package; cd ..
The above steps create the target jar at ./simple-project/target/scala-2.10/simple-project_2.10-1.0.jar
Update the node IPs in the tests:
Add the node IPs for the Hadoop ResourceManager and NameNode in the
test cases, as shown in the sketch below:
* rm: ResourceManager
* nn: NameNode
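For illustration only, filling in the placeholders might look like this; the variable names here are hypothetical, so check test_spark_job_handler.py for the actual ones:

# Hypothetical example: replace the placeholder hosts with your
# cluster's real IPs (actual names live in test_spark_job_handler.py).
hadoop_rm = '10.0.0.10'        # ResourceManager (rm)
hadoop_web_hdfs = '10.0.0.11'  # WebHDFS (nn)
hadoop_nn = '10.0.0.11'        # NameNode, normally the same host as WebHDFS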
Load the data and make it available to HDFS:
$ wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
Upload the data to HDFS:
$ python upload_to_hdfs.py <name_node_ip> iris.data /tmp/iris.data
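To confirm the file landed, WebHDFS can be queried directly. A minimal sketch, assuming the default WebHDFS port 50070 used by CDH 5.x / HDP 2.x clusters:

import requests

# Verify the upload via the WebHDFS REST API; GETFILESTATUS is a
# standard WebHDFS operation.
name_node_ip = 'nn'  # replace with your NameNode IP
r = requests.get('http://%s:50070/webhdfs/v1/tmp/iris.data' % name_node_ip,
                 params={'op': 'GETFILESTATUS'})
print(r.json())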
Run the test cases:
Make the simple-project jar available in HDFS to test the remote jar:
$ python upload_to_hdfs.py <name_node_ip> simple-project/target/scala-2.10/simple-project_2.10-1.0.jar /tmp/test_data/simple-project_2.10-1.0.jar
Run the tests:
$ python test_spark_job_handler.py
Utility:
upload_to_hdfs.py: uploads a local file to the HDFS file system
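The usual WebHDFS upload is a two-step CREATE: ask the NameNode for a write location, then PUT the bytes to the DataNode it redirects to. A sketch of that flow, not necessarily the script's exact implementation, assuming the default WebHDFS port:

import requests

def upload_to_hdfs(name_node, local_path, hdfs_path, port=50070):
    """Two-step WebHDFS CREATE: NameNode redirect, then DataNode write."""
    url = 'http://%s:%s/webhdfs/v1%s' % (name_node, port, hdfs_path)
    # Step 1: the NameNode answers with a 307 redirect to a DataNode
    r = requests.put(url, params={'op': 'CREATE', 'overwrite': 'true'},
                     allow_redirects=False)
    datanode_url = r.headers['Location']
    # Step 2: PUT the actual file content to the DataNode
    with open(local_path, 'rb') as f:
        r = requests.put(datanode_url, data=f)
    r.raise_for_status()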
Notes:
The library is still at an early stage and needs testing, bug fixing, and
documentation.
Before running, follow the steps below:
* Update the ResourceManager, NameNode and WebHDFS ports if required in
settings.py (see the sketch after this list)
* Make the Spark assembly jar available in HDFS as:
hdfs:/user/spark/share/lib/spark-assembly.jar
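For illustration, settings.py presumably holds these service ports; the names and values below are a hypothetical sketch using the stock Hadoop defaults of that era, so verify them against the actual file:

# Hypothetical settings.py values, using the stock defaults for
# CDH 5.x / HDP 2.x clusters; adjust to your deployment.
RM_PORT = 8088         # YARN ResourceManager REST API
WEB_HDFS_PORT = 50070  # WebHDFS REST API
SPARK_JAR = 'hdfs:/user/spark/share/lib/spark-assembly.jar'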
For contributions, please create an issue and a corresponding PR.
Download files
Download the file for your platform.
Source Distributions
No source distribution files are available for this release.
Built Distribution
File details
Details for the file spark_yarn_submit-1.0.0-py2.py3-none-any.whl.
File metadata
- Download URL: spark_yarn_submit-1.0.0-py2.py3-none-any.whl
- Upload date:
- Size: 12.3 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | b757b2d7b3a47997dc803609eb40df3ae93026618c83febdf3e66ada3bd15fcd
MD5 | fa51e6c96cfa71c5657a3a521438cd8c
BLAKE2b-256 | d25df8b9747498ebbcf36c82e6d17f0c8e3964e2a5eb238588c077360e54c8f9