Skip to main content

A proof of concept asynchronous actions for PySpark using concurent.futures

Project description

Build Status PyPI version

A proof of concept asynchronous actions for PySpark using concurent.futures Originally developed as proof-of-concept solution for SPARK-20347

How does it work?

The package patches RDD, DataFrame and DataFrameWriter classes by adding thin wrappers to the commonly used action methods.

Methods are patched by retrieving shared ThreadPoolExecutor (attached to SparkContext) and applying its submit method:

def async_action(f):
    def async_action_(self, *args, **kwargs):
        executor = get_context(self)._get_executor()
        return executor.submit(f, self, *args, **kwargs)
    return async_action_

The naming convention for the patched methods is methodNameAsync, for example:

  • RDD.countRDD.countAsync

  • DataFrame.takeRDD.takeAsync

  • DataFrameWriter.saveDataFrameWriter.saveAsync

Number of threads is determined as follows:

  • spark.driver.cores is set.

  • 2 otherwise.

Usage

To patch existing classes just import the package:

>>> import asyncactions
>>> from pyspark.sql import SparkSession
>>>
>>> spark = SparkSession.builder.getOrCreate()

All *Async methods return concurrent.futures._base.Future:

>>> rdd = spark.sparkContext.range(100)
>>> f = rdd.countAsync()
>>> f
<Future at ... state=running>
>>> type(f)
concurrent.futures._base.Future

and can be used when Future is expected.

Installation

The package is available on PYPI:

pip install pyspark-asyncactions

Installation is required only on the driver node.

Dependencies

The package supports Python 3.5 or later with a common codebase and requires no external dependencies.

It is also possible, but not supported, to use it with Python 2.7, using concurent.futures backport.

Disclaimer

Apache Spark, Spark, Apache, and the Spark logo are trademarks of The Apache Software Foundation. This project is not owned, endorsed, or sponsored by The Apache Software Foundation.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyspark-asyncactions-0.0.1.tar.gz (5.4 kB view hashes)

Uploaded Source

Built Distribution

pyspark_asyncactions-0.0.1-py3-none-any.whl (8.3 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page