
Spark-submit


TL;DR: Python manager for spark-submit jobs

Description

This package allows for submission and management of Spark jobs in Python scripts via Apache Spark's spark-submit functionality.

Installation

The easiest way to install is using pip:

pip install spark-submit

To install from source:

git clone https://github.com/PApostol/spark-submit.git
cd spark-submit
python setup.py install

For usage details, see help(spark_submit).

Usage Examples

Spark arguments can be provided either as keyword arguments or as an unpacked dictionary.

Simple example:
from spark_submit import SparkJob

app = SparkJob('/path/some_file.py', master='local', name='simple-test')
app.submit()

print(app.get_state())

Another example:
from spark_submit import SparkJob

spark_args = {
    'master': 'spark://some.spark.master:6066',
    'deploy_mode': 'cluster',
    'name': 'spark-submit-app',
    'class': 'main.Class',
    'executor_memory': '2G',
    'executor_cores': '1',
    'total_executor_cores': '2',
    'verbose': True,
    'conf': ["spark.foo.bar='baz'", "spark.x.y='z'"],
    'main_file_args': '--foo arg1 --bar arg2'
    }

app = SparkJob('s3a://bucket/path/some_file.jar', **spark_args)
print(app.get_submit_cmd(multiline=True))

# poll state in the background every x seconds with `poll_time=x`
app.submit(use_env_vars=True,
           extra_env_vars={'PYTHONPATH': '/some/path/'},
           poll_time=10
           )

print(app.get_state()) # 'SUBMITTED'

while not app.concluded:
    # do other stuff...
    print(app.get_state()) # 'RUNNING'

print(app.get_state()) # 'FINISHED'
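
Once app.concluded is True, the outcome can be inspected with the helper methods listed under Additional methods below; a minimal sketch:

# inspect the result of the concluded job
print(app.get_code())    # spark-submit return code (0 indicates success)
print(app.get_id())      # submission ID reported by spark-submit
print(app.get_output())  # full spark-submit stdout, useful for debugging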

Examples of translating a spark-submit command to a spark_args dictionary:

A client example:
~/spark_home/bin/spark-submit \
--master spark://some.spark.master:7077 \
--name spark_job_client \
--total-executor-cores 8 \
--executor-cores 4 \
--executor-memory 4G \
--driver-memory 2G \
--py-files /some/utils.zip \
--files /some/file.json \
/path/to/pyspark/file.py --data /path/to/data.csv
becomes
spark_args = {
    'master': 'spark://some.spark.master:7077',
    'name': 'spark_job_client',
    'total_executor_cores': '8',
    'executor_cores': '4',
    'executor_memory': '4G',
    'driver_memory': '2G',
    'py_files': '/some/utils.zip',
    'files': '/some/file.json',
    'main_file_args': '--data /path/to/data.csv'
    }
main_file = '/path/to/pyspark/file.py'
app = SparkJob(main_file, **spark_args)

A cluster example:
~/spark_home/bin/spark-submit \
--master spark://some.spark.master:6066 \
--deploy-mode cluster \
--name spark_job_cluster \
--jars "s3a://mybucket/some/file.jar" \
--conf "spark.some.conf=foo" \
--conf "spark.some.other.conf=bar" \
--total-executor-cores 16 \
--executor-cores 4 \
--executor-memory 4G \
--driver-memory 2G \
--class my.main.Class \
--verbose \
s3a://mybucket/file.jar "positional_arg1" "positional_arg2"
becomes
spark_args = {
    'master': 'spark://some.spark.master:6066',
    'deploy_mode': 'cluster',
    'name': 'spark_job_cluster',
    'jars': 's3a://mybucket/some/file.jar',
    'conf': ["spark.some.conf='foo'", "spark.some.other.conf='bar'"], # note the use of quotes
    'total_executor_cores': '16',
    'executor_cores': '4',
    'executor_memory': '4G',
    'driver_memory': '2G',
    'class': 'my.main.Class',
    'verbose': True,
    'main_file_args': '"positional_arg1" "positional_arg2"'
    }
main_file = 's3a://mybucket/file.jar'
app = SparkJob(main_file, **spark_args)
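
Since this job runs in cluster mode, it can also be terminated early via kill() (see Additional methods below). A minimal sketch:

app.submit(poll_time=10)

# cluster mode only: terminate the job if it is still running
if not app.concluded:
    app.kill()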

Testing

You can do some simple testing with local-mode Spark after cloning the repo.

First, install any additional requirements for running the tests: pip install -r tests/requirements.txt

pytest tests/

python tests/run_integration_test.py

Additional methods

spark_submit.system_info(): Collects Spark-related system information, such as the versions of spark-submit, Scala, Java, PySpark, Python, and the OS

spark_submit.SparkJob.kill(): Kills the running Spark job (cluster mode only)

spark_submit.SparkJob.get_code(): Gets the spark-submit return code

spark_submit.SparkJob.get_output(): Gets the spark-submit stdout

spark_submit.SparkJob.get_id(): Gets the spark-submit submission ID
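
For example, system_info() can be printed directly; this sketch assumes it returns a human-readable summary and that Spark and Java are discoverable on the machine:

from spark_submit import system_info

# prints versions of spark-submit, Scala, Java, PySpark, Python and the OS
print(system_info())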

License

Released under MIT by @PApostol.

  • You can freely modify and reuse.
  • The original license must be included with copies of this software.
  • Please link back to this repo if you use a significant portion of the source code.
