Connect Pyspark to remote clusters

Pyspark Gateway is a library for seamlessly connecting to remote Spark clusters.

Quick Start

Install the pysparkgateway package on both the remote Spark cluster you are connecting to and the local machine.

pip install pysparkgateway

Start the Pyspark Gateway server on the cluster.

pyspark-gateway start

Pyspark Gateway communicates over three ports: 25000, 25001, and 25002. Currently the client only supports connecting to these ports on localhost, so you’ll need to tunnel them over SSH.

ssh myuser@foo.bar.cluster.com -L 25000:localhost:25000 -L 25001:localhost:25001 -L 25002:localhost:25002
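Before connecting, it can help to confirm that the tunnel is actually up. The following is a minimal sketch and not part of Pyspark Gateway: the helper name `check_tunnel_ports` is hypothetical, and it assumes only that the three ports listed above are forwarded to localhost. It reports which ports accept a TCP connection.

```python
import socket

# Ports Pyspark Gateway communicates over, per the Quick Start.
GATEWAY_PORTS = (25000, 25001, 25002)


def check_tunnel_ports(ports=GATEWAY_PORTS, host="localhost", timeout=1.0):
    """Return {port: bool}, True where a TCP connection succeeds.

    Hypothetical helper for illustration; not part of the pysparkgateway API.
    """
    status = {}
    for port in ports:
        try:
            # create_connection raises OSError if nothing is listening
            # or the attempt times out.
            with socket.create_connection((host, port), timeout=timeout):
                status[port] = True
        except OSError:
            status[port] = False
    return status
```

Calling `check_tunnel_ports()` with the SSH session open should report all three ports as `True`. A `False` entry suggests the corresponding `-L` forward is missing or the server is not running on the cluster.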

Now you’re ready to connect. The main thing to keep in mind is that the Pyspark Gateway import needs to come before any other import, because Pyspark Gateway needs to patch your local pyspark in order to function properly.
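One way to catch an import-order mistake early is to check whether pyspark has already been loaded. This is only a sketch: the helper `pyspark_already_imported` is hypothetical, and it assumes the ordering requirement concerns pyspark and its submodules specifically.

```python
import sys


def pyspark_already_imported(modules=None):
    """Return True if pyspark or one of its submodules is already loaded.

    Hypothetical helper for illustration; not part of the pysparkgateway API.
    """
    # Default to the interpreter's module table; a mapping can be passed for testing.
    modules = sys.modules if modules is None else modules
    # Match "pyspark" and "pyspark.*" but not "pyspark_gateway".
    return any(name == "pyspark" or name.startswith("pyspark.") for name in modules)
```

Calling it just before the Pyspark Gateway import and stopping if it returns `True` turns a subtle ordering problem into an immediate, visible one.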

Your local Python connects to the remote cluster via a custom py4j gateway. Pyspark Gateway creates and configures it automatically; you just need to pass it into the SparkContext options.

Also, for all pyspark functions to work, spark.io.encryption.enabled needs to be set to true.

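# This import comes first!
from pyspark_gateway import PysparkGateway
pg = PysparkGateway()

from pyspark import SparkContext, SparkConf

# Encryption must be enabled for all pyspark functions to work.
conf = SparkConf().set('spark.io.encryption.enabled', 'true')
# Pass the gateway created by Pyspark Gateway into the SparkContext.
sc = SparkContext(gateway=pg.gateway, conf=conf)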

Now you have a working Spark context connected to a remote cluster.
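A quick way to confirm the context works end to end is to run a tiny job through it. The sketch below is illustrative only: `smoke_test` is a hypothetical helper, and `sc` is assumed to be the SparkContext created above. It uses the standard RDD calls `parallelize`, `map`, and `sum`.

```python
def smoke_test(sc):
    """Run a tiny job on the given SparkContext and return its result.

    Hypothetical helper for illustration; not part of the pysparkgateway API.
    """
    # Distribute 0..9, square each element, and add the results together.
    return sc.parallelize(range(10)).map(lambda x: x * x).sum()
```

If the connection is healthy, `smoke_test(sc)` should return 285, the sum of the squares of 0 through 9.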
