Orchestrates Spark standalone clusters on HPCs.

Project description

sparkctl

This package implements configuration and orchestration of Spark clusters with standalone cluster managers. This is useful in environments like HPCs where the infrastructure implemented by cloud providers, such as AWS, is not available. It is particularly helpful when users want to deploy Spark but do not have administrative control of the servers.

Example usage

There are two main ways to use this package. In both cases, first allocate compute nodes. For example, with Slurm (1 compute node for the Spark master and 4 compute nodes for Spark workers):

$ salloc -t 01:00:00 -n4 --partition=shared --mem=30G : -N4 --account=<your-account> --mem=240G
  1. Configure a Spark cluster and run Spark jobs with spark-submit or pyspark.
$ sparkctl configure
$ sparkctl start
$ spark-submit --master spark://$(hostname):7077 my-job.py
$ sparkctl stop
  2. Run Spark jobs in a Python script using the sparkctl library to manage the cluster.
from sparkctl import ClusterManager, make_default_spark_config

config = make_default_spark_config()
mgr = ClusterManager(config)
with mgr.managed_cluster() as spark:
    df = spark.createDataFrame([(x, x + 1) for x in range(1000)], ["a", "b"])
    df.show()

Refer to the user documentation for a description of features and detailed usage instructions.

Project Status

The package is actively maintained and used at the National Renewable Energy Laboratory (NREL). The software is primarily geared toward HPCs that use Slurm. It also supports a generic list of servers, as long as they share a filesystem and are reachable via passwordless SSH.

It would be straightforward to extend the functionality to support other HPC resource managers. Please open an issue or start a discussion if you are interested in this package but need that support.

Contributions are welcome.

License

sparkctl is released under a BSD 3-Clause license.

Software Record

This package is developed under NREL Software Record SWR-25-109.

Download files

Download the file for your platform.

Source Distribution

sparkctl-0.3.0.tar.gz (29.4 kB)

Uploaded Source

Built Distribution

sparkctl-0.3.0-py3-none-any.whl (32.6 kB)

Uploaded Python 3

File details

Details for the file sparkctl-0.3.0.tar.gz.

File metadata

  • Download URL: sparkctl-0.3.0.tar.gz
  • Upload date:
  • Size: 29.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for sparkctl-0.3.0.tar.gz
Algorithm Hash digest
SHA256 f1544c22aa9c2bd11137bd530dc4c7e78abec8b9503618adaa8bd8b29e450260
MD5 c567fec0db0b019c0bbd10989bd2d2ee
BLAKE2b-256 fb98fbfb065a7db3ae2ddfc864241e2c6bc5dbd8260b39a78fb3b7134ec4c17b


Provenance

The following attestation bundles were made for sparkctl-0.3.0.tar.gz:

Publisher: publish_to_pypi.yml on NREL/sparkctl

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sparkctl-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: sparkctl-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 32.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for sparkctl-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 5d00e27658da2455eb415e587083c2a1ce10a20c94a5241ec0a02833727cc978
MD5 a42d7a09959bce412f191c5691457c02
BLAKE2b-256 644f87ee88ae5449a1f910ffc82c73485e8062fdd5e0a46e25cb3c5dea63b06a


Provenance

The following attestation bundles were made for sparkctl-0.3.0-py3-none-any.whl:

Publisher: publish_to_pypi.yml on NREL/sparkctl

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
