
scopt


Spark Config Optimizer calculates optimal CPU core and memory values for Spark executors and the driver.

Installing

Install scopt from PyPI via pip:

pip install scopt

Usage

Basic

from scopt import SparkConfOptimizer
from scopt.instances import Instance

executor_instance = Instance(32, 250)  # 32 cores, 250 GB memory per node
num_nodes = 10
deploy_mode = 'client'

sco = SparkConfOptimizer(executor_instance, num_nodes, deploy_mode)
print(sco)

# spark.driver.cores: 5
# spark.driver.memory: 36
# spark.driver.memoryOverhead: 5
# spark.executor.cores: 5
# spark.executor.memory: 36
# spark.executor.memoryOverhead: 5
# spark.executor.instances: 60
# spark.default.parallelism: 600
# spark.sql.shuffle.partitions: 600
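
These values appear to follow the common heuristic of about 5 cores per executor: 32 cores per node gives 6 executors per node (60 across 10 nodes), each executor's share of the 250 GB is split roughly 90/10 between heap (36 GB) and memory overhead (5 GB), and parallelism is set to twice the total executor cores (60 x 5 x 2 = 600).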

Cluster mode is also supported.

deploy_mode = 'cluster'

sco = SparkConfOptimizer(executor_instance, num_nodes, deploy_mode)
print(sco)

# spark.driver.cores: 5
# spark.driver.memory: 36
# spark.driver.memoryOverhead: 5
# spark.executor.cores: 5
# spark.executor.memory: 36
# spark.executor.memoryOverhead: 5
# spark.executor.instances: 59
# spark.default.parallelism: 590
# spark.sql.shuffle.partitions: 590
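
Note that spark.executor.instances drops from 60 to 59, presumably because in cluster mode the driver runs on a worker node and takes up one executor slot; the parallelism values scale down accordingly (59 x 5 x 2 = 590).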

A different instance type for the driver node is also supported. Specifying a driver instance is only enabled in client mode.

executor_instance = Instance(32, 250)
driver_instance = Instance(4, 30)  # 4 cores, 30 GB memory
deploy_mode = 'client'

sco = SparkConfOptimizer(
    executor_instance,
    num_nodes,
    deploy_mode,
    driver_instance,
)
print(sco)

# spark.driver.cores: 3
# spark.driver.memory: 26
# spark.driver.memoryOverhead: 3
# spark.executor.cores: 5
# spark.executor.memory: 36
# spark.executor.memoryOverhead: 5
# spark.executor.instances: 60
# spark.default.parallelism: 600
# spark.sql.shuffle.partitions: 600
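
The driver properties are now sized from the smaller driver instance (4 cores, 30 GB), while the executor properties are unchanged.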

Setting properties on SparkConf

You can set properties on a SparkConf directly via the as_list method.

from pyspark import SparkConf
from scopt import SparkConfOptimizer
from scopt.instances import Instance

executor_instance = Instance(32, 250)
num_nodes = 10
deploy_mode = 'client'

sco = SparkConfOptimizer(executor_instance, num_nodes, deploy_mode)

conf = SparkConf()
print(conf.getAll())
# No properties have been set yet.
# dict_items([])

conf.setAll(sco.as_list())
print(conf.getAll())
# dict_items([
#     ('spark.driver.cores', '5'),
#     ('spark.driver.memory', '36'),
#     ('spark.driver.memoryOverhead', '5'),
#     ('spark.executor.cores', '5'),
#     ('spark.executor.memory', '36'),
#     ('spark.executor.memoryOverhead', '5'),
#     ('spark.executor.instances', '60'),
#     ('spark.default.parallelism', '600'),
#     ('spark.sql.shuffle.partitions', '600')
# ])
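
Once the properties are set, pass the SparkConf to a SparkSession builder as usual. A minimal sketch, using the standard PySpark API (nothing scopt-specific):

from pyspark.sql import SparkSession

# Create a session using the optimized configuration
spark = SparkSession.builder.config(conf=conf).getOrCreate()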
