spark-connect-proxy

A reverse proxy server which allows secure connectivity to a Spark Connect server.

Why?

Because Spark Connect does NOT provide authentication or TLS encryption out of the box. This project provides a reverse proxy server that sits in front of a Spark Connect server and secures the connection to it.

Setup (to run locally)

Install Python package

You can install spark-connect-proxy from PyPI or from source.

Option 1 - from PyPI

# Create the virtual environment
python3 -m venv .venv

# Activate the virtual environment
. .venv/bin/activate

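# Install Spark Connect Proxy with the client extras (quoted so that shells such as zsh do not expand the brackets)
pip install "spark-connect-proxy[client]"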

Option 2 - from source - for development

git clone https://github.com/gizmodata/spark-connect-proxy

cd spark-connect-proxy

# Create the virtual environment
python3 -m venv .venv

# Activate the virtual environment
. .venv/bin/activate

# Upgrade pip, setuptools, and wheel
pip install --upgrade pip setuptools wheel

# Install Spark Connect Proxy - in editable mode with client and dev dependencies
pip install --editable ".[client,dev]"

Note

For the following commands - if you are running from source in --editable mode (for development purposes) - you will need to set the PYTHONPATH environment variable as follows:

export PYTHONPATH=$(pwd)/src

Usage

This repo contains scripts that provision an AWS EMR Spark cluster with a secure Spark Connect Proxy server, so that you can connect to the cluster securely from a remote machine.

First - you'll need to open a port for public access to the AWS EMR Spark cluster, in addition to the SSH port (22). Add port 50051.

[!NOTE]
Even though you are opening this port to the public, the Spark Connect Proxy will secure it with TLS and JWT Authentication.
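
To illustrate what JWT authentication involves, here is a minimal sketch of signing and verifying a token with Python's standard library. It assumes an HS256 (shared secret) token with an optional `exp` claim; the algorithm, claims, and secret handling that Spark Connect Proxy actually uses are not described here, and the names `sign_token` and `verify_token` are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as the JWT compact format requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Decode base64url text, restoring the padding that JWTs strip."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _signature(signing_input: str, secret: bytes) -> bytes:
    """HMAC-SHA256 over 'header.payload' (the HS256 algorithm)."""
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def sign_token(claims: dict, secret: bytes) -> str:
    """Build a compact JWT of the form header.payload.signature."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    return f"{signing_input}.{_b64url(_signature(signing_input, secret))}"


def verify_token(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Return the claims if signature and expiry are valid, else raise ValueError."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("token must have three dot-separated parts")
    header_b64, payload_b64, signature_b64 = parts
    expected = _signature(f"{header_b64}.{payload_b64}", secret)
    # compare_digest avoids leaking information through timing differences.
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    header = json.loads(_b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    claims = json.loads(_b64url_decode(payload_b64))
    current = time.time() if now is None else now
    if "exp" in claims and current >= claims["exp"]:
        raise ValueError("token expired")
    return claims
```

A request carrying a token that fails this kind of check would be rejected before it reaches the Spark Connect server, which is why the open port is not usable without a valid token.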

The scripts use the AWS CLI to provision the EMR Spark cluster - so you will need to have the AWS CLI installed and configured with your AWS credentials.

You can create a file called .env in your local copy of the scripts directory with the following contents:

export AWS_ACCESS_KEY_ID="put value from AWS here"
export AWS_SECRET_ACCESS_KEY="put value from AWS here"
export AWS_SESSION_TOKEN="put value from AWS here"
export AWS_REGION="us-east-2"
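
Before provisioning, it can help to confirm that these variables are actually set in your shell. The following is a small sketch of such a check; it assumes all four variables listed above are needed (AWS_SESSION_TOKEN typically applies to temporary credentials), and the helper name `missing_aws_vars` is hypothetical rather than part of this project.

```python
import os

# The variable names come from the .env example above.
REQUIRED_AWS_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
    "AWS_REGION",
)
# The placeholder text used in the .env example; treated as "not set".
PLACEHOLDER = "put value from AWS here"


def missing_aws_vars(env=None):
    """Return the names of required variables that are unset, empty, or still placeholders."""
    env = os.environ if env is None else env
    return [
        name
        for name in REQUIRED_AWS_VARS
        if not env.get(name) or env[name] == PLACEHOLDER
    ]


if __name__ == "__main__":
    missing = missing_aws_vars()
    if missing:
        print("Missing or placeholder AWS settings: " + ", ".join(missing))
    else:
        print("All AWS settings from .env appear to be set.")
```

Run it in the same shell session in which you sourced the .env file, since the values are read from the process environment.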

To provision the EMR Spark cluster - run the following command from the root directory of this repo:

scripts/provision_emr_spark_cluster.sh

That will output several files (which will be git ignored for security reasons):

  • tls/ca.crt - the TLS certificate generated by the EMR Spark cluster. Your PySpark client needs it in order to trust the Spark Connect Proxy server, because the certificate is self-signed
  • scripts/output/instance_details.txt - shows the ssh command for connecting to the master node of the EMR Spark cluster
  • scripts/output/spark_connect_proxy_details.log - shows how to run a PySpark Ibis client example, which connects securely from your local computer to the remote EMR Spark cluster. Example command:
spark-connect-proxy-ibis-client-example \
  --host ec2-01-01-01-01.us-east-2.compute.amazonaws.com \
  --port 50051 \
  --use-tls \
  --tls-roots tls/ca.crt \
  --token honey.badger.dontcare

[!IMPORTANT]
You must have installed the spark-connect-proxy package with the [client] extras onto the client computer to run the spark-connect-proxy-ibis-client-example command.
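
For readers who want to connect from their own PySpark code, the host, port, and token from the example command map onto a Spark Connect connection string of the form `sc://host:port/;param=value`. The sketch below only builds that string; the helper name `spark_connect_url` is hypothetical, and how the bundled example client wires these options internally is not shown here. It also does not cover trusting the self-signed tls/ca.crt, which the example command handles through --tls-roots.

```python
from urllib.parse import quote


def spark_connect_url(host: str, port: int, token: str, use_tls: bool = True) -> str:
    """Build a Spark Connect connection string with TLS and a bearer token."""
    params = []
    if use_tls:
        # use_ssl asks the client to open a TLS-secured gRPC channel.
        params.append("use_ssl=true")
    # Percent-encode the token so characters such as ';' or '=' cannot break the parameter list.
    params.append(f"token={quote(token, safe='')}")
    return f"sc://{host}:{port}/;" + ";".join(params)


url = spark_connect_url(
    host="ec2-01-01-01-01.us-east-2.compute.amazonaws.com",
    port=50051,
    token="honey.badger.dontcare",
)
print(url)
# sc://ec2-01-01-01-01.us-east-2.compute.amazonaws.com:50051/;use_ssl=true;token=honey.badger.dontcare
```

With PySpark installed, a string like this would normally be passed to `SparkSession.builder.remote(url)`. Because the proxy certificate is self-signed, the client must also be configured to trust tls/ca.crt, so treat this as a starting point rather than a complete client.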

Handy development commands

Version management

Bump the version of the application (you must have installed from source with the [dev] extras):
bumpver update --patch
