Dataproc Spark Connect Client

A wrapper around the Apache Spark Connect client that adds functionality allowing applications to communicate with a remote Dataproc Spark session over the Spark Connect protocol, without requiring additional setup.

Install

pip install dataproc-spark-connect

Uninstall

pip uninstall dataproc-spark-connect

Setup

This client requires permissions to manage Dataproc Sessions and Session Templates.

If you are running the client outside of Google Cloud, you must provide authentication credentials: set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of your Application Default Credentials file.

You can specify the project and region either via environment variables or directly in your code using the builder API:

  • Environment variables: GOOGLE_CLOUD_PROJECT and GOOGLE_CLOUD_REGION
  • Builder API: .projectId() and .location() methods (recommended)
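
As a sketch of the environment-variable route (the project ID and region below are placeholders, not real values):

```python
import os

# Point the client at a project and region before creating a session.
# Both values are hypothetical; substitute your own.
os.environ['GOOGLE_CLOUD_PROJECT'] = 'my-project'
os.environ['GOOGLE_CLOUD_REGION'] = 'us-central1'

print(os.environ['GOOGLE_CLOUD_PROJECT'])  # my-project
print(os.environ['GOOGLE_CLOUD_REGION'])   # us-central1
```

With these set, a plain `DataprocSparkSession.builder.getOrCreate()` can resolve the project and region without explicit `.projectId()` and `.location()` calls.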

Usage

  1. Install the latest version of Dataproc Spark Connect:

    pip install -U dataproc-spark-connect
    
  2. Add the required imports into your PySpark application or notebook and start a Spark session using the fluent API:

    from google.cloud.dataproc_spark_connect import DataprocSparkSession
    spark = DataprocSparkSession.builder.getOrCreate()
    
  3. You can configure Spark properties using the .config() method:

    from google.cloud.dataproc_spark_connect import DataprocSparkSession
    spark = (
        DataprocSparkSession.builder
        .config('spark.executor.memory', '4g')
        .config('spark.executor.cores', '2')
        .getOrCreate()
    )
    
  4. For advanced configuration, you can use the Session class to customize settings like subnetwork or other environment configurations:

    from google.cloud.dataproc_spark_connect import DataprocSparkSession
    from google.cloud.dataproc_v1 import Session
    session_config = Session()
    session_config.environment_config.execution_config.subnetwork_uri = '<subnet>'
    session_config.runtime_config.version = '3.0'
    spark = (
        DataprocSparkSession.builder
        .projectId('my-project')
        .location('us-central1')
        .dataprocSessionConfig(session_config)
        .getOrCreate()
    )
    

Builder Configuration

The DataprocSparkSession.builder provides a fluent API to configure the session. Below is a list of available methods:

  • config(key, value): Sets a Spark configuration property.
  • dataprocSessionConfig(dataproc_config): Sets the Dataproc Session configuration object.
  • dataprocSessionId(session_id): Sets a custom session ID for creating or reusing sessions.
  • idleTtl(duration): Sets the idle time-to-live (idle TTL) for the session, as a datetime.timedelta.
  • label(key, value): Adds a single label to the session.
  • labels(labels): Adds multiple labels to the session.
  • location(location): Sets the Google Cloud region.
  • projectId(project_id): Sets the Google Cloud project ID.
  • runtimeVersion(version): Sets the Dataproc runtime version (e.g., "3.0").
  • serviceAccount(account): Sets the service account for the session.
  • sessionTemplate(template): Sets the session template to use.
  • subnetwork(subnet): Sets the subnetwork URI for the session.
  • ttl(duration): Sets the time-to-live (TTL) for the session, as a datetime.timedelta.
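
As a sketch of the TTL-related methods, which take datetime.timedelta values: the chained builder call in the comment is illustrative only, and the project ID, region, and label are hypothetical.

```python
from datetime import timedelta

# Values to pass to the TTL-related builder methods.
session_ttl = timedelta(hours=4)          # total lifetime, for .ttl()
session_idle_ttl = timedelta(minutes=30)  # idle timeout, for .idleTtl()

# A full builder chain would look like this (requires a GCP environment):
# spark = (DataprocSparkSession.builder
#          .projectId('my-project')       # hypothetical project ID
#          .location('us-central1')
#          .ttl(session_ttl)
#          .idleTtl(session_idle_ttl)
#          .label('team', 'data-eng')     # hypothetical label
#          .getOrCreate())

print(session_ttl.total_seconds())       # 14400.0
print(session_idle_ttl.total_seconds())  # 1800.0
```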

Reusing Named Sessions Across Notebooks

Named sessions allow you to share a single Spark session across multiple notebooks, improving efficiency by avoiding repeated session startup times and reducing costs.

To create or connect to a named session:

  1. Create a session with a custom ID in your first notebook:

    from google.cloud.dataproc_spark_connect import DataprocSparkSession
    session_id = 'my-ml-pipeline-session'
    spark = DataprocSparkSession.builder.dataprocSessionId(session_id).getOrCreate()
    df = spark.createDataFrame([(1, 'data')], ['id', 'value'])
    df.show()
    
  2. Reuse the same session in another notebook by specifying the same session ID:

    from google.cloud.dataproc_spark_connect import DataprocSparkSession
    session_id = 'my-ml-pipeline-session'
    spark = DataprocSparkSession.builder.dataprocSessionId(session_id).getOrCreate()
    df = spark.createDataFrame([(2, 'more-data')], ['id', 'value'])
    df.show()
    
  3. Session IDs must be 4-63 characters long, start with a lowercase letter, contain only lowercase letters, numbers, and hyphens, and not end with a hyphen.

  4. Named sessions persist until explicitly terminated or reach their configured TTL.

  5. A session with a given ID that is in a TERMINATED state cannot be reused. It must be deleted before a new session with the same ID can be created.
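
The ID rules in step 3 can be sketched as a regular expression. This helper is not part of the library's API (the service performs its own validation); it is only a local illustration of the stated constraints.

```python
import re

# 4-63 chars total: a lowercase letter, then 2-61 lowercase letters,
# digits, or hyphens, ending with a lowercase letter or digit.
_SESSION_ID_RE = re.compile(r'^[a-z][a-z0-9-]{2,61}[a-z0-9]$')

def is_valid_session_id(session_id: str) -> bool:
    """Check the documented session-ID constraints locally."""
    return bool(_SESSION_ID_RE.fullmatch(session_id))

print(is_valid_session_id('my-ml-pipeline-session'))  # True
print(is_valid_session_id('abc'))                     # False: too short
print(is_valid_session_id('Session1'))                # False: uppercase
print(is_valid_session_id('ends-with-hyphen-'))       # False
```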

Using Spark SQL Magic Commands (Jupyter Notebooks)

The package supports the sparksql-magic library for executing Spark SQL queries directly in Jupyter notebooks.

Installation: To use magic commands, install the required dependencies manually:

pip install dataproc-spark-connect
pip install IPython sparksql-magic

  1. Load the magic extension:

    %load_ext sparksql_magic
    
  2. Configure default settings (optional):

    %config SparkSql.limit=20
    
  3. Execute SQL queries:

    %%sparksql
    SELECT * FROM your_table
    
  4. Advanced usage with options. This cell caches the result, creates a temporary view named result_view, and stores the DataFrame in the variable df (note that %%sparksql must be the first line of the cell, so place any comments below it):

    %%sparksql --cache --view result_view df
    SELECT * FROM your_table WHERE condition = true
    

Available options:

  • --cache / -c: Cache the DataFrame
  • --eager / -e: Cache with eager loading
  • --view VIEW / -v VIEW: Create a temporary view
  • --limit N / -l N: Override default row display limit
  • variable_name: Store result in a variable

See the sparksql-magic documentation for more examples.

Note: Magic commands are optional. If you only need basic DataprocSparkSession functionality without Jupyter magic support, install only the base package:

pip install dataproc-spark-connect

Developing

For development instructions, see the development guide.

Contributing

We'd love to accept your patches and contributions to this project. There are just a few small guidelines you need to follow.

Contributor License Agreement

Contributions to this project must be accompanied by a Contributor License Agreement. You (or your employer) retain the copyright to your contribution; this simply gives us permission to use and redistribute your contributions as part of the project. Head over to https://cla.developers.google.com to see your current agreements on file or to sign a new one.

You generally only need to submit a CLA once, so if you've already submitted one (even if it was for a different project), you probably don't need to do it again.

Code reviews

All submissions, including submissions by project members, require review. We use GitHub pull requests for this purpose. Consult GitHub Help for more information on using pull requests.
