Skip to main content

Interactions between Dask and Spark

Project description

Launch Dask from Spark and Spark from Dask. This project is not mature.

Examples

Create Spark cluster from a Dask cluster

>>> from dask.distributed import Client
>>> client = Client('scheduler-address:8786')
>>> client
<Client: scheduler='tcp://scheduler-address:8786' processes=8 cores=64>

>>> from dask_spark import dask_to_spark
>>> sc = dask_to_spark(client)
>>> sc
<pyspark.context.SparkContext at 0x7f62fa4bb550>

Create Dask cluster from a Spark cluster

>>> import pyspark
>>> sc = pyspark.SparkContext('local[4]')
<pyspark.context.SparkContext at 0x7f8b908b0128>

>>> from dask_spark import spark_to_dask
>>> client = spark_to_dask(sc)
>>> client
<Client: scheduler="'tcp://127.0.0.1:8786'">

Requirements and How this Works

This depends on a relatively recent version of Dask.distributed.

For starting Spark from Dask this assumes that you have Spark installed and that the start-master.sh and start-slave.sh Spark scripts are available on the PATH of the workers. This starts a long-running Spark master process on the Dask Scheduler and starts long running Spark slaves on Dask workers. There will only be one slave per worker. We set the number of cores and the amount of memory to match the Dask workers and available memory.

When starting Dask from Spark this will block the Spark cluster. We start a scheduler on the local machine and then run a long-running function that starts up a Dask worker using RDD.mapPartitions.

TODO

  • [ ] This almost certainly fails in non-trivial situations

  • [ ] Enable user specification of Java flags for memory and core use

  • [ ] Support multiple spark clusters per Dask cluster

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dask-spark-0.0.1.tar.gz (3.5 kB view details)

Uploaded Source

Built Distribution

dask_spark-0.0.1-py2.py3-none-any.whl (5.2 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file dask-spark-0.0.1.tar.gz.

File metadata

  • Download URL: dask-spark-0.0.1.tar.gz
  • Upload date:
  • Size: 3.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for dask-spark-0.0.1.tar.gz
Algorithm Hash digest
SHA256 73235fbca522b3d933fdc257d52663c0798ccb7036e216357fba92a37541555a
MD5 7140d36017a628be3f91ffebb77fbc27
BLAKE2b-256 d34ba58624623d9bc86f66e0be31c81324a4ab62dde80832269bb1f5a32837ad

See more details on using hashes here.

File details

Details for the file dask_spark-0.0.1-py2.py3-none-any.whl.

File metadata

File hashes

Hashes for dask_spark-0.0.1-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 76d9f344356742994da1884cbd5509c1e70e1cae4e7bdd2566eebd803233e739
MD5 00e95e016c443e4b7bc1c13ab1ccb7d0
BLAKE2b-256 53dc0e234f60469840d63f5748298c0495c0c2c3558b845c5fa3c414b6eb66fd

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page