Google Dataproc templates written in Python

Project description

Build Status Python Integration Test Status Python Dataproc Cluster Integration Tests Status

Dataproc Templates (Python - PySpark)

AzureBlobStorageToBigQuery
BigQueryToGCS (blogpost link)
CassandraToBigquery
CassandraToGCS (blogpost link)
ElasticsearchToBigQuery
ElasticsearchToBigtable
ElasticsearchToGCS
GCSToBigQuery (blogpost link)
GCSToBigTable (blogpost link)
GCSToGCS(blogpost link)
GCSToJDBC (blogpost link)
GCSToMongo (blogpost link)
HbaseToGCS
HiveToBigQuery (blogpost link)
HiveToGCS(blogpost link)
JDBCToBigQuery (blogpost link)
JDBCToGCS (blogpost link)
JDBCToJDBC (blogpost link)
KafkaToGCS
KafkaToBigQuery
MongoToGCS(blogpost link)
MongoToBigQuery
PubSubLiteToBigtable Deprecated and will be removed in Q1 2025
RedshiftToGCS(blogpost link) Deprecated and will be removed in Q1 2025
S3ToBigQuery
SnowflakeToGCS(blogpost link)
TextToBigQuery Deprecated and will be removed in Q1 2025

Dataproc Templates (Python - PySpark) supports submitting jobs to both Dataproc Serverless using batches submit pyspark and Dataproc Cluster using jobs submit pyspark

Run using PyPi package

In this README, you see instructions on how to run the templates.
Currently, 3 options are described:

Using bin/start.sh
Using gcloud CLI
Using Vertex AI

Those 3 options require you to clone this repo and start running the templates.
The Dataproc Templates PyPi package is a 4th option to run templates from a PySpark environment directly (Dataproc or local/another).
Example:

!pip3 install --user google-dataproc-templates==0.0.3

from dataproc_templates.bigquery.bigquery_to_gcs import BigQueryToGCSTemplate
from pyspark.sql import SparkSession

args = dict()
args["bigquery.gcs.input.table"] = "<bq_dataset>.<bq_table>"
args["bigquery.gcs.input.location"] = "<location>"
args["bigquery.gcs.output.format"] = "<format>"
args["bigquery.gcs.output.mode"] = "<mode>"
args["bigquery.gcs.output.location"] = "gs://<bucket_name/path>"

spark = SparkSession.builder \
        .appName("BIGQUERYTOGCS") \
        .enableHiveSupport() \
        .getOrCreate()

template = BigQueryToGCSTemplate()
template.run(spark, args)

Pro Tip: Start a Dataproc Serverless Spark sessions in a Vertex AI managed notebook, and leverage a serverless Spark session, in which your job will run using Dataproc Serverless, instead of your local PySpark environment.

While this provides an easy way to get started, remember that the bin/start.sh already provides an easy way for you to, for example, specify required .jar dependencies. Using the PyPi package, you need to configure your PySpark sessions in accordance with the requirements of your specific template. You would need to, for example, specify the spark.driver.extraClassPath configuration:

spark = SparkSession.builder \
        ... \
        .config('spark.driver.extraClassPath', '<template_required_dependency>.jar')
        ... \
        .getOrCreate()

Setting up the local environment

It is recommended to use a virtual environment when setting up the local environment. This setup is not required for submitting templates, only for running and developing locally.

# Create a virtual environment, activate it and install requirements
mkdir venv
python -m venv venv/
source venv/bin/activate
pip install -r requirements.txt

Running unit tests

Unit tests are developed using pytest.

To run all unit tests, simply run pytest:

pytest

To generate a coverage report, run the tests using coverage

coverage run \
  --source=dataproc_templates \
  --module pytest \
  --verbose \
  test

coverage report --show-missing

Running Templates

The Dataproc Templates (Python - PySpark) support both serverless and cluster modes. By default, serverless mode is used. To run these templates use the gcloud CLI directly or the provided start.sh shell script.

Serverless Mode (Default)

Submits job to Dataproc Serverless using the batches submit pyspark command.

Cluster Mode

Submits job to a Dataproc Standard cluster using the jobs submit pyspark command.

To run the templates on an existing cluster, you must additionally specify the JOB_TYPE=CLUSTER and CLUSTER=<full clusterId> environment variables. For example:

export GCP_PROJECT=my-gcp-project
export REGION=gcp-region
export GCS_STAGING_LOCATION=gs://my-bucket/temp
export JOB_TYPE=CLUSTER
export CLUSTER=${DATAPROC_CLUSTER_NAME}
./bin/start.sh \
-- --template HIVETOBIGQUERY

Note: Some HBase templates that require a custom image to execute are not yet supported in CLUSTER mode.

Submitting templates

A shell script is provided to:

Build the python package
Set Dataproc parameters based on environment variables
Submit the desired template to Dataproc with the provided template parameters

When submitting, there are 3 types of properties/parameters for the user to provide.

Spark properties: Refer to this documentation to see the available spark properties.
Each template's specific parameters: refer to each template's README.
Common arguments: --template_name and --log_level
- The --log_level parameter is optional, it defaults to INFO.
  - Possible choices are the Spark log levels: ["ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN"].

bin/start.sh usage:

# Set required environment variables
export GCP_PROJECT=<project_id>
export REGION=<region>
export GCS_STAGING_LOCATION=<gs://path>

# Set optional environment variables
export SUBNET=<subnet>
export JARS="gs://additional/dependency.jar"
export HISTORY_SERVER_CLUSTER=projects/{projectId}/regions/{regionId}/clusters/{clusterId}
export METASTORE_SERVICE=projects/{projectId}/locations/{regionId}/services/{serviceId}

# Submit to Dataproc passing template parameters
./bin/start.sh [--properties=<spark.something.key>=<value>] \
               -- --template=TEMPLATENAME \
                  --log_level=INFO \
                  --my.property="<value>" \
                  --my.other.property="<value>"
                  (etc...)

gcloud CLI usage:

It is also possible to submit jobs using the gcloud CLI directly. That can be achieved by:

Building the dataproc_templates package into an .egg

PACKAGE_EGG_FILE=dist/dataproc_templates_distribution.egg
python setup.py bdist_egg --output=${PACKAGE_EGG_FILE}

Submitting the job

The main.py file should be the main python script
The .egg file for the package must be bundled using the --py-files flag

gcloud dataproc batches submit pyspark \
      --region=<region> \
      --project=<project_id> \
      --jars="<required_jar_dependencies>" \
      --deps-bucket=<gs://path> \
      --subnet=<subnet> \
      --py-files=${PACKAGE_EGG_FILE} \
      [--properties=<spark.something.key>=<value>] \
      main.py \
      -- --template=TEMPLATENAME \
         --log_level=INFO \
         --<my.property>="<value>" \
         --<my.other.property>="<value>"
         (etc...)

Vertex AI usage:

Follow Dataproc Templates (Jupyter Notebooks) README to submit Dataproc Templates from a Vertex AI notebook.

Project details

Release history Release notifications | RSS feed

This version

1.0.2b0 pre-release

Oct 24, 2024

1.0.1b0 pre-release

Oct 24, 2024

1.0.0b0 pre-release

Oct 24, 2024

0.6.0b0 pre-release

Sep 5, 2024

0.5.0b0 pre-release

Mar 4, 2024

0.4.0b0 pre-release

Aug 7, 2023

0.3.0b0 pre-release

May 12, 2023

0.2.0b0 pre-release

Apr 11, 2023

0.1.0b0 pre-release

Mar 17, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

google_dataproc_templates-1.0.2b0.tar.gz (63.2 kB view details)

Uploaded Oct 24, 2024 Source

Built Distribution

google_dataproc_templates-1.0.2b0-py2.py3-none-any.whl (92.8 kB view details)

Uploaded Oct 24, 2024 Python 2 Python 3

File details

Details for the file google_dataproc_templates-1.0.2b0.tar.gz.

File metadata

Download URL: google_dataproc_templates-1.0.2b0.tar.gz
Upload date: Oct 24, 2024
Size: 63.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for google_dataproc_templates-1.0.2b0.tar.gz
Algorithm	Hash digest
SHA256	`39ec5c571b910417a40822bca083f5d30d69c8f971734e4ba85e9169dbca230f`
MD5	`e47d47975559aab4ff6e1f99985f0285`
BLAKE2b-256	`642a037008f918639e1f825b2450f9802d1da3167ace55bade6e080508142864`

See more details on using hashes here.

File details

Details for the file google_dataproc_templates-1.0.2b0-py2.py3-none-any.whl.

File metadata

Download URL: google_dataproc_templates-1.0.2b0-py2.py3-none-any.whl
Upload date: Oct 24, 2024
Size: 92.8 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for google_dataproc_templates-1.0.2b0-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`87cb4835e57eeb14b8ed5d33ddaa865a49da05fa5a09e6cc36c6fad79b35fcf3`
MD5	`812cbaa66af709932cac0cfcbeaf7207`
BLAKE2b-256	`6df4a95c10d98adcf96424a6c4a7ae7c2ce06bf6cf3bbb2de64fcc16d5b5ff6c`