Connect Pyspark to remote clusters
Project description
Pyspark Gateway is a library for seamlessly connecting to remote Spark clusters.
Quick Start
Install the pysparkgateway package on both the remote Spark cluster you are connecting to and the local machine.
pip install pysparkgateway
Start the Pyspark Gateway server on the cluster.
pyspark-gateway start
Pyspark Gateway communicates over three ports: 25000, 25001, and 25002. The client currently supports connecting to these ports only on localhost, so you'll need to tunnel them.
ssh myuser@foo.bar.cluster.com -L 25000:localhost:25000 -L 25001:localhost:25001 -L 25002:localhost:25002
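Before moving on, it can help to confirm that the tunnel is actually listening. The following is a minimal sketch, not part of Pyspark Gateway: `tunnel_command` and `closed_ports` are hypothetical helper names, and the port numbers are the ones listed above. Note that a successful check only shows that something accepts connections on the local ports (normally the ssh tunnel); it does not prove that the gateway server is running on the cluster.

```python
import socket

# Ports taken from the text above.
GATEWAY_PORTS = (25000, 25001, 25002)


def tunnel_command(user, host, ports=GATEWAY_PORTS):
    """Build the ssh command that forwards each port to the same port on localhost."""
    forwards = " ".join(f"-L {p}:localhost:{p}" for p in ports)
    return f"ssh {user}@{host} {forwards}"


def closed_ports(ports=GATEWAY_PORTS, host="localhost", timeout=1.0):
    """Return the ports that do not accept a TCP connection on `host`."""
    closed = []
    for port in ports:
        try:
            # Open and immediately close a connection; we only care that it succeeds.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            closed.append(port)
    return closed
```

Calling `tunnel_command("myuser", "foo.bar.cluster.com")` reproduces the ssh command shown above, and an empty list from `closed_ports()` suggests all three forwarded ports are reachable locally.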
Now you're ready to connect. The main thing to keep in mind is that the Pyspark Gateway import must come before any other import, because Pyspark Gateway needs to patch your local pyspark in order to function properly.
Your local Python connects to the remote cluster through a custom py4j gateway. Pyspark Gateway creates and configures it automatically; you just need to pass it into the SparkContext options.
Also, for all pyspark functions to work, spark.io.encryption.enabled needs to be set to true.
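# This import comes first!
from pyspark_gateway import PysparkGateway
pg = PysparkGateway()
from pyspark import SparkContext, SparkConf
# Enable I/O encryption so that all pyspark functions work
conf = SparkConf().set('spark.io.encryption.enabled', 'true')
# Pass the custom py4j gateway created by Pyspark Gateway
sc = SparkContext(gateway=pg.gateway, conf=conf)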
Now you have a working Spark context connected to a remote cluster.
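One way to check that the connection really works end to end is to run a small job and compare its result with the same computation done locally. This is only a sketch: `smoke_test` is a hypothetical helper, not part of Pyspark Gateway, and it assumes the `sc` object created in the example above.

```python
def smoke_test(sc, n=100):
    """Sum the squares of 0..n-1 through `sc` and compare with a local result."""
    # parallelize, map and sum are standard RDD operations in pyspark.
    remote = sc.parallelize(range(n)).map(lambda x: x * x).sum()
    local = sum(x * x for x in range(n))
    return remote == local
```

Calling `smoke_test(sc)` should return `True` if the job ran on the cluster and the result came back intact.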
Running Tests
Build the Docker image
docker build -t pyspark_gateway_3_7 -f docker/3_7_Dockerfile .
Run tests
docker run -it -e CI=true pyspark_gateway_3_7 python tests/test_pyspark_gateway.py
Download files
Source Distribution
File details
Details for the file PysparkGateway-0.0.22.tar.gz.
File metadata
- Download URL: PysparkGateway-0.0.22.tar.gz
- Upload date:
- Size: 9.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.22.0 setuptools/40.6.2 requests-toolbelt/0.9.1 tqdm/4.58.0 CPython/3.7.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | 08e21b0fe3e623fdea13257ade2d19e2d1120910c430ffcca519b58a4b0ccdb9
MD5 | c9e0f1bdde382375ed47fa0713265427
BLAKE2b-256 | cd7d5396903f94d19fcf24e15acc87617ef840efca3ffe2b6127fa2daba61005