
# Dataproc Spark Connect Client

A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/) client with additional functionalities that allow applications to communicate with a remote Dataproc Spark cluster using the Spark Connect protocol without requiring additional steps.

## Install

```console
pip install dataproc_spark_connect
```

## Uninstall

```console
pip uninstall dataproc_spark_connect
```

## Setup

This client requires permissions to manage [Dataproc sessions and session templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam). If you are running the client outside of Google Cloud, you must also set the required Google Cloud environment variables.
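The exact variable names are not listed in this document; as an illustrative sketch only, a typical setup outside Google Cloud exports the standard project, region, and credential variables (all names and values below are hypothetical placeholders, not confirmed by this README):

```shell
# Illustrative only: commonly used Google Cloud environment variables.
# Confirm the exact names required against the project documentation.
export GOOGLE_CLOUD_PROJECT="my-project-id"   # hypothetical project ID
export GOOGLE_CLOUD_REGION="us-central1"      # hypothetical region
# Service account key for authentication when running outside Google Cloud
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/service-account.json"
```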

## Usage

  1. Install the latest versions of the Dataproc Python client and Dataproc Spark Connect modules:

     ```console
     pip install google_cloud_dataproc --force-reinstall
     pip install dataproc_spark_connect --force-reinstall
     ```

  2. Add the required import into your PySpark application or notebook:

     ```python
     from google.cloud.dataproc_spark_connect import DataprocSparkSession
     ```

  3. There are two ways to create a Spark session:

     1. Start a Spark session using the properties defined in `DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG`:

        ```python
        spark = DataprocSparkSession.builder.getOrCreate()
        ```

    2. Start a Spark session with the following code instead of using a config file:

        ```python
        from google.cloud.dataproc_v1 import SparkConnectConfig
        from google.cloud.dataproc_v1 import Session

        dataproc_session_config = Session()
        dataproc_session_config.spark_connect_session = SparkConnectConfig()
        dataproc_session_config.environment_config.execution_config.subnetwork_uri = "<subnet>"
        dataproc_session_config.runtime_config.version = '3.0'

        spark = (
            DataprocSparkSession.builder
            .dataprocSessionConfig(dataproc_session_config)
            .getOrCreate()
        )
        ```

## Billing

As this client runs the Spark workload on Dataproc, your project will be billed per [Dataproc Serverless Pricing](https://cloud.google.com/dataproc-serverless/pricing). This applies even if you are running the client from a non-GCE instance.

## Contributing

### Building and Deploying the SDK

  1. Install the requirements in a virtual environment:

     ```console
     pip install -r requirements-dev.txt
     ```

  2. Build the code.

     ```console
     python setup.py sdist bdist_wheel
     ```

  3. Copy the generated `.whl` file to Cloud Storage, using the version specified in `setup.py`:

     ```sh
     VERSION=<version>
     gsutil cp dist/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl gs://<your_bucket_name>
     ```

  4. Download the new SDK on Vertex, then uninstall the old version and install the new one.

     ```sh
     %%bash
     export VERSION=<version>
     gsutil cp gs://<your_bucket_name>/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl .
     yes | pip uninstall dataproc_spark_connect
     pip install dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl
     ```

