
Interactions between Dask and Spark


Launch Dask from Spark and Spark from Dask. This project is not mature.

Examples

pip install dask-spark

Create Spark cluster from a Dask cluster

>>> from dask.distributed import Client
>>> client = Client('scheduler-address:8786')
>>> client
<Client: scheduler='tcp://scheduler-address:8786' processes=8 cores=64>

>>> from dask_spark import dask_to_spark
>>> sc = dask_to_spark(client)
>>> sc
<pyspark.context.SparkContext at 0x7f62fa4bb550>

Create Dask cluster from a Spark cluster

>>> import pyspark
>>> sc = pyspark.SparkContext('local[4]')
>>> sc
<pyspark.context.SparkContext at 0x7f8b908b0128>

>>> from dask_spark import spark_to_dask
>>> client = spark_to_dask(sc)
>>> client
<Client: scheduler="'tcp://127.0.0.1:8786'">

Requirements and How this Works

This depends on a relatively recent version of Dask.distributed.

Starting Spark from Dask assumes that Spark is installed and that the start-master.sh and start-slave.sh Spark scripts are available on the PATH of the workers. It starts a long-running Spark master process on the Dask scheduler and a long-running Spark slave on each Dask worker, with only one slave per worker. The number of cores and the amount of memory given to each slave are set to match the Dask worker's cores and available memory.
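The two requirements above (scripts on the PATH, Spark workers sized like Dask workers) can be sketched in plain Python. This is a minimal illustration, not the project's actual implementation: the helper names are hypothetical, the master URL in the usage lines is a placeholder, and the assumption is that sizing is passed through the `--cores` and `--memory` options of the Spark standalone worker.

```python
import shutil

# Spark standalone launch scripts named in the text above.
REQUIRED_SCRIPTS = ("start-master.sh", "start-slave.sh")


def missing_spark_scripts(path=None):
    """Return the required Spark scripts that cannot be found on PATH."""
    return [name for name in REQUIRED_SCRIPTS
            if shutil.which(name, path=path) is None]


def spark_memory_string(nbytes):
    """Format a byte count as whole mebibytes, in Spark's '1000M' style."""
    return "%dM" % max(1, nbytes // 2**20)


def spark_worker_command(master_url, ncores, memory_bytes):
    """Build the argv for one Spark worker sized like one Dask worker.

    --cores and --memory are Spark standalone worker options; whether
    dask-spark passes them exactly this way is an assumption of this sketch.
    """
    return [
        "start-slave.sh",
        master_url,
        "--cores", str(ncores),
        "--memory", spark_memory_string(memory_bytes),
    ]


if __name__ == "__main__":
    # Hypothetical values: an 8-core Dask worker with 16 GiB of memory.
    print(missing_spark_scripts())
    print(spark_worker_command("spark://scheduler-address:7077", 8, 16 * 2**30))
```

Running the check on each worker before launching anything gives a clear error when Spark is not installed there, instead of a failure inside the launch script.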

Starting Dask from Spark blocks the Spark cluster for as long as the Dask workers are running. We start a scheduler on the local machine and then use RDD.mapPartitions to run a long-running function that starts a Dask worker.
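The mechanism described above can be sketched as follows. This is an illustrative reconstruction rather than the code in dask-spark: the function names are hypothetical, and it assumes a reasonably recent `distributed` in which `Worker` can be used as an async context manager. Each Spark partition runs one function that does not return until its Dask worker exits, which is why the Spark cluster stays occupied.

```python
import asyncio


def run_dask_worker(scheduler_address):
    """Start one Dask worker in this process and block until it shuts down."""
    # Imported lazily so the import happens on the Spark executor.
    from distributed import Worker

    async def serve():
        async with Worker(scheduler_address) as worker:
            await worker.finished()

    asyncio.run(serve())


def occupy_partitions(sc, scheduler_address, n_workers,
                      run_worker=run_dask_worker):
    """Run one long-lived worker task per Spark partition.

    The call does not return until every worker has exited, so the Spark
    cluster is blocked in the meantime.
    """
    def start(partition):
        run_worker(scheduler_address)
        yield scheduler_address  # reached only after the worker stops

    # One element per partition, so each partition hosts exactly one worker.
    rdd = sc.range(n_workers, numSlices=n_workers)
    return rdd.mapPartitions(start).count()
```

Because `occupy_partitions` blocks the caller as well, a driver that wants to hand a client back to the user would presumably call it from a background thread. That detail is an assumption here, not something the text above states.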

TODO

  • [ ] This almost certainly fails in non-trivial situations

  • [ ] Enable user specification of Java flags for memory and core use

  • [ ] Support multiple Spark clusters per Dask cluster


