# Dataproc Spark Connect Client
A wrapper of the Apache [Spark Connect](https://spark.apache.org/spark-connect/) client with additional functionalities that allow applications to communicate with a remote Dataproc Spark cluster using the Spark Connect protocol without requiring additional steps.
## Install
```console
pip install dataproc_spark_connect
```
## Uninstall
```console
pip uninstall dataproc_spark_connect
```
## Setup

This client requires permissions to manage [Dataproc sessions and session templates](https://cloud.google.com/dataproc-serverless/docs/concepts/iam). If you are running the client outside of Google Cloud, you must set the following environment variables:
- `GOOGLE_CLOUD_PROJECT` - The Google Cloud project you use to run Spark workloads.
- `GOOGLE_CLOUD_REGION` - The Compute Engine [region](https://cloud.google.com/compute/docs/regions-zones#available) where you run the Spark workload.
- `GOOGLE_APPLICATION_CREDENTIALS` - Your [Application Credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc).
- `DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG` (Optional) - The config location, such as `tests/integration/resources/session.textproto`.
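As a hedged sketch (this helper is not part of the library), a quick pre-flight check that the required variables are set before building a session:

```python
import os

# Variables the client needs when running outside Google Cloud.
# DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG is optional, so it is
# deliberately not listed here.
REQUIRED_ENV_VARS = [
    "GOOGLE_CLOUD_PROJECT",
    "GOOGLE_CLOUD_REGION",
    "GOOGLE_APPLICATION_CREDENTIALS",
]


def missing_env_vars(required=REQUIRED_ENV_VARS):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]


if missing_env_vars():
    print(f"Set these environment variables first: {missing_env_vars()}")
```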
## Usage
Install the latest version of Dataproc Python client and Dataproc Spark Connect modules:
```console
pip install google_cloud_dataproc --force-reinstall
pip install dataproc_spark_connect --force-reinstall
```
Add the required import into your PySpark application or notebook:
```python
from google.cloud.dataproc_spark_connect import DataprocSparkSession
```
There are two ways to create a Spark session.
Start a Spark session using properties defined in `DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG`:
```python
spark = DataprocSparkSession.builder.getOrCreate()
```
Start a Spark session with the following code instead of using a config file:
```python
from google.cloud.dataproc_v1 import SparkConnectConfig
from google.cloud.dataproc_v1 import Session

dataproc_session_config = Session()
dataproc_session_config.spark_connect_session = SparkConnectConfig()
dataproc_session_config.environment_config.execution_config.subnetwork_uri = "<subnet>"
dataproc_session_config.runtime_config.version = "3.0"
spark = DataprocSparkSession.builder.dataprocSessionConfig(dataproc_session_config).getOrCreate()
```
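The two approaches can be tied together in a small wrapper; as a sketch (`build_spark` is a hypothetical helper, not part of the library, though `dataprocSessionConfig` and `getOrCreate` are the builder calls shown above):

```python
def build_spark(builder, dataproc_session_config=None):
    """Return a Spark session from a DataprocSparkSession builder.

    Uses an explicit Session config when one is given; otherwise falls back
    to the properties in DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG.
    """
    if dataproc_session_config is not None:
        builder = builder.dataprocSessionConfig(dataproc_session_config)
    return builder.getOrCreate()
```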
## Billing

As this client runs the Spark workload on Dataproc, your project will be billed per [Dataproc Serverless Pricing](https://cloud.google.com/dataproc-serverless/pricing). This applies even if you are running the client from a non-GCE instance.
## Contributing

### Building and Deploying the SDK
Install the requirements in a virtual environment:
```console
pip install -r requirements-dev.txt
```
Build the code:
```console
python setup.py sdist bdist_wheel
```
Copy the generated `.whl` file to Cloud Storage, using the version specified in the `setup.py` file:
```sh
VERSION=<version>
gsutil cp dist/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl gs://<your_bucket_name>
```
Download the new SDK on Vertex, then uninstall the old version and install the new one:
```sh
%%bash
export VERSION=<version>
gsutil cp gs://<your_bucket_name>/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl .
yes | pip uninstall dataproc_spark_connect
pip install dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl
```
File details
Details for the file dataproc_spark_connect-0.6.0.tar.gz.
File metadata
- Download URL: dataproc_spark_connect-0.6.0.tar.gz
- Upload date:
- Size: 17.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `36da1b83ab0cd2781e5ab6c8eecffd33d52d77a7cba7afb6e22c86272e411efd` |
| MD5 | `b0bf48efd075150ebb3dc7321c44316c` |
| BLAKE2b-256 | `47e4920c57830255d3eced8a69ee8d37bbd203a388be9cc91f5aaccc29dda9da` |
File details
Details for the file dataproc_spark_connect-0.6.0-py2.py3-none-any.whl.
File metadata
- Download URL: dataproc_spark_connect-0.6.0-py2.py3-none-any.whl
- Upload date:
- Size: 21.5 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `87a750c0f339af1658e31d0369b248a6eb2eb2e5645ea0297826429cab7fa9cf` |
| MD5 | `9fd2c2b5462c5ec208c8c2cd46610ad5` |
| BLAKE2b-256 | `8a0da6ee16f773163032dcc4c6c2bcb121b1f281729ffb3bd04f78049a4a9dff` |