
Google client library for Spark Connect


# Google Spark Connect Client

A wrapper around the Apache [Spark Connect](https://spark.apache.org/spark-connect/) client that adds functionality for communicating with a remote Dataproc Spark cluster over the Spark Connect protocol, without requiring additional setup steps.

## Install

pip install google_spark_connect
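
To confirm which version ended up in the active environment, the standard library can be queried directly. This is a minimal sketch; the helper name `installed_version` is made up for illustration and is not part of this package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None


if __name__ == "__main__":
    # Prints the installed version, or None if the package is missing.
    print(installed_version("google_spark_connect"))
```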

## Uninstall

pip uninstall google_spark_connect

## Setup

This client requires permissions to manage [Dataproc sessions and session templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam). If you are running the client outside of Google Cloud, you must set the following environment variables:

## Usage

  1. Install the latest versions of the Dataproc Python client and Google Spark Connect modules:

    pip install google_cloud_dataproc --force-reinstall
    pip install google_spark_connect --force-reinstall
  2. Add the required import into your PySpark application or notebook:

    from google.cloud.spark_connect import GoogleSparkSession
  3. There are two ways to create a Spark session:

    1. Start a Spark session using properties defined in DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG:

      spark = GoogleSparkSession.builder.getOrCreate()
    2. Start a Spark session with the following code instead of using a config file:

      from google.cloud.dataproc_v1 import SparkConnectConfig
      from google.cloud.dataproc_v1 import Session
      google_session_config = Session()
      google_session_config.spark_connect_session = SparkConnectConfig()
      google_session_config.environment_config.execution_config.subnetwork_uri = "<subnet>"
      google_session_config.runtime_config.version = '3.0'
      spark = GoogleSparkSession.builder.googleSessionConfig(google_session_config).getOrCreate()
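
The two options above can be combined into one helper that uses the default configuration when it is available and otherwise builds the session config in code. This is a sketch under assumptions: it treats `DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG` as an environment variable pointing at a config file, and the helper names `has_default_config` and `create_session` are hypothetical. Only the `GoogleSparkSession`, `Session`, and `SparkConnectConfig` calls come from the steps above.

```python
import os

# Assumption: the default session config is located through this environment variable.
ENV_VAR = "DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG"


def has_default_config(environ=None):
    """Return True if the default-config variable is set to a non-empty value."""
    environ = os.environ if environ is None else environ
    return bool(environ.get(ENV_VAR, "").strip())


def create_session(subnet, runtime_version="3.0", environ=None):
    """Create a session with option 1 if a default config is set, else option 2."""
    # Imported lazily so this module loads even where the client is not installed.
    from google.cloud.spark_connect import GoogleSparkSession

    if has_default_config(environ):
        # Option 1: properties come from the default config.
        return GoogleSparkSession.builder.getOrCreate()

    # Option 2: build the session config explicitly.
    from google.cloud.dataproc_v1 import Session, SparkConnectConfig

    config = Session()
    config.spark_connect_session = SparkConnectConfig()
    config.environment_config.execution_config.subnetwork_uri = subnet
    config.runtime_config.version = runtime_version
    return GoogleSparkSession.builder.googleSessionConfig(config).getOrCreate()
```

Calling `create_session("<subnet>")` then returns a session in either case. Checking the environment in a separate function keeps that decision testable without contacting Dataproc.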

## Billing

Because this client runs the Spark workload on Dataproc, your project is billed according to [Dataproc Serverless Pricing](https://cloud.google.com/dataproc-serverless/pricing). This applies even if you run the client from a non-GCE instance.

## Contributing

### Building and Deploying SDK

  1. Install the requirements in a virtual environment.

    pip install -r requirements.txt
  2. Build the code.

    python setup.py sdist bdist_wheel
  3. Copy the generated .whl file to Cloud Storage. Use the version specified in the setup.py file.

    export VERSION=<version>
    gsutil cp dist/google_spark_connect-${VERSION}-py2.py3-none-any.whl gs://<your_bucket_name>
  4. Download the new SDK on Vertex, then uninstall the old version and install the new one.

    %%bash
    export VERSION=<version>
    gsutil cp gs://<your_bucket_name>/google_spark_connect-${VERSION}-py2.py3-none-any.whl .
    pip uninstall -y google_spark_connect
    pip install google_spark_connect-${VERSION}-py2.py3-none-any.whl
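
Steps 3 and 4 both depend on the same wheel file name, so a typo in the version shows up only as a failed copy. The naming pattern can be sketched as below; the helper names `wheel_filename` and `gcs_destination` are hypothetical, and the bucket name in the test values is a placeholder.

```python
def wheel_filename(version, package="google_spark_connect"):
    """Build the py2.py3 wheel file name used in steps 3 and 4."""
    if not version or any(ch.isspace() for ch in version):
        raise ValueError("version must be a non-empty string without whitespace")
    return f"{package}-{version}-py2.py3-none-any.whl"


def gcs_destination(bucket, version):
    """Return the gs:// object path of the wheel at the root of a bucket."""
    return f"gs://{bucket.strip('/')}/{wheel_filename(version)}"


if __name__ == "__main__":
    print(wheel_filename("0.5.0"))
    print(gcs_destination("<your_bucket_name>", "0.5.0"))
```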
