# Spark-submit

TL;DR: Python manager for spark-submit jobs

## Description
This package allows for the submission and management of Spark jobs from Python scripts via Apache Spark's spark-submit functionality.
## Installation

The easiest way to install is via pip:

```bash
pip install spark-submit
```

To install from source:

```bash
git clone https://github.com/PApostol/spark-submit.git
cd spark-submit
python setup.py install
```

For usage details, check `help(spark_submit)`.
## Usage Examples

Spark arguments can be provided either as keyword arguments or as an unpacked dictionary.

A simple example:

```python
from spark_submit import SparkJob

app = SparkJob('/path/some_file.py', master='local', name='simple-test')
app.submit()

print(app.get_state())
```
A more involved example:

```python
from spark_submit import SparkJob

spark_args = {
    'master': 'spark://some.spark.master:6066',
    'deploy_mode': 'cluster',
    'name': 'spark-submit-app',
    'class': 'main.Class',
    'executor_memory': '2G',
    'executor_cores': '1',
    'total_executor_cores': '2',
    'verbose': True,
    'conf': ["spark.foo.bar='baz'", "spark.x.y='z'"],
    'main_file_args': '--foo arg1 --bar arg2'
}
app = SparkJob('s3a://bucket/path/some_file.jar', **spark_args)
print(app.get_submit_cmd(multiline=True))

# poll state in the background every x seconds with `poll_time=x`
app.submit(use_env_vars=True,
           extra_env_vars={'PYTHONPATH': '/some/path/'},
           poll_time=10
           )

print(app.get_state())  # 'SUBMITTED'

while not app.concluded:
    # do other stuff...
    print(app.get_state())  # 'RUNNING'

print(app.get_state())  # 'FINISHED'
```
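Since `concluded` and `get_state()` expose everything needed to track a job, a blocking wait with a timeout is easy to build on top of them. Below is a minimal sketch; `wait_for_job`, `timeout`, and `poll_every` are illustrative names, not part of the package:

```python
import time

from spark_submit import SparkJob

def wait_for_job(app: SparkJob, timeout: float = 600.0, poll_every: float = 5.0) -> str:
    """Block until the job concludes or the timeout expires, then return the last observed state."""
    deadline = time.time() + timeout
    while not app.concluded and time.time() < deadline:
        time.sleep(poll_every)
    return app.get_state()
```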
### Examples of translating spark-submit commands to `spark_args` dictionaries

A client deploy mode example:
```bash
~/spark_home/bin/spark-submit \
--master spark://some.spark.master:7077 \
--name spark-submit-job \
--total-executor-cores 8 \
--executor-cores 4 \
--executor-memory 4G \
--driver-memory 2G \
--py-files /some/utils.zip \
--files /some/file.json \
/path/to/pyspark/file.py --data /path/to/data.csv
```
becomes
```python
from spark_submit import SparkJob

spark_args = {
    'master': 'spark://some.spark.master:7077',
    'name': 'spark-submit-job',
    'total_executor_cores': '8',
    'executor_cores': '4',
    'executor_memory': '4G',
    'driver_memory': '2G',
    'py_files': '/some/utils.zip',
    'files': '/some/file.json',
    'main_file_args': '--data /path/to/data.csv'
}
main_file = '/path/to/pyspark/file.py'
app = SparkJob(main_file, **spark_args)
```
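As in the earlier example, you can print the command the package will actually run and compare it with the original:

```python
print(app.get_submit_cmd(multiline=True))
```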
A cluster deploy mode example:
```bash
~/spark_home/bin/spark-submit \
--master spark://some.spark.master:6066 \
--deploy-mode cluster \
--name spark_job_cluster \
--jars "s3a://mybucket/some/file.jar" \
--conf "spark.some.conf=foo" \
--conf "spark.some.other.conf=bar" \
--total-executor-cores 16 \
--executor-cores 4 \
--executor-memory 4G \
--driver-memory 2G \
--class my.main.Class \
--verbose \
s3a://mybucket/file.jar "positional_arg1" "positional_arg2"
```
becomes
```python
from spark_submit import SparkJob

spark_args = {
    'master': 'spark://some.spark.master:6066',
    'deploy_mode': 'cluster',
    'name': 'spark_job_cluster',
    'jars': 's3a://mybucket/some/file.jar',
    'conf': ["spark.some.conf='foo'", "spark.some.other.conf='bar'"],  # note the use of quotes
    'total_executor_cores': '16',
    'executor_cores': '4',
    'executor_memory': '4G',
    'driver_memory': '2G',
    'class': 'my.main.Class',
    'verbose': True,
    'main_file_args': '"positional_arg1" "positional_arg2"'
}
main_file = 's3a://mybucket/file.jar'
app = SparkJob(main_file, **spark_args)
```
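The translation of the flags themselves is mechanical: drop the leading dashes, replace the remaining dashes with underscores, map bare boolean flags to `True`, and collect repeated flags such as `--conf` into a list (arguments after the main file go into `main_file_args`). As a rough sketch of that convention only, here is a hypothetical converter that is not part of this package:

```python
def flag_to_key(flag):
    """Turn a spark-submit flag like '--executor-memory' into a spark_args key."""
    return flag.lstrip('-').replace('-', '_')

def to_spark_args(pairs):
    """Fold (flag, value) pairs into a spark_args dict, collecting repeats into lists."""
    args = {}
    for flag, value in pairs:
        key = flag_to_key(flag)
        if key in args:  # repeated flag, e.g. --conf
            current = args[key]
            args[key] = current + [value] if isinstance(current, list) else [current, value]
        else:
            args[key] = value
    return args

print(to_spark_args([('--deploy-mode', 'cluster'),
                     ('--conf', "spark.some.conf='foo'"),
                     ('--conf', "spark.some.other.conf='bar'"),
                     ('--verbose', True)]))
# {'deploy_mode': 'cluster', 'conf': ["spark.some.conf='foo'", "spark.some.other.conf='bar'"], 'verbose': True}
```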
## Testing

You can do some simple testing with local mode Spark after cloning the repo.

First install any additional requirements for running the tests:

```bash
pip install -r tests/requirements.txt
```

Then run:

```bash
pytest tests/
python tests/run_integration_test.py
```
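For reference, a smoke test in the same spirit might look like the sketch below; the script path and the assumption that `submit(poll_time=...)` returns while the job is polled in the background are illustrative, based on the usage example above:

```python
import time

from spark_submit import SparkJob

def test_local_job_finishes():
    # Submit a trivial PySpark script in local mode (path is illustrative)
    app = SparkJob('tests/resources/simple.py', master='local', name='smoke-test')
    app.submit(poll_time=2)

    while not app.concluded:
        time.sleep(1)

    assert app.get_state() == 'FINISHED'
    assert app.get_code() == 0
```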
## Additional methods

The package also exposes the following helpers (combined in the sketch after this list):

- `spark_submit.system_info()`: collects Spark-related system information, such as the versions of spark-submit, Scala, Java, PySpark, Python, and the OS
- `spark_submit.SparkJob.kill()`: kills the running Spark job (cluster mode only)
- `spark_submit.SparkJob.get_code()`: gets the spark-submit return code
- `spark_submit.SparkJob.get_output()`: gets the spark-submit stdout
- `spark_submit.SparkJob.get_id()`: gets the spark-submit submission ID
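A minimal sketch combining these methods; the file path is illustrative, and local mode is assumed:

```python
from spark_submit import SparkJob, system_info

# Spark-related system information: spark-submit, Scala, Java, PySpark, Python, OS
print(system_info())

app = SparkJob('/path/some_file.py', master='local', name='inspect-test')
app.submit()

print(app.get_code())    # spark-submit return code (e.g. 0 on success)
print(app.get_output())  # spark-submit stdout
print(app.get_id())      # submission ID
# app.kill() would terminate a running job, but only in cluster mode
```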
## License

Released under MIT by @PApostol.

- You can freely modify and reuse.
- The original license must be included with copies of this software.
- Please link back to this repo if you use a significant portion of the source code.