Spark Structured Streaming connector utility for GCP BigQuery, using the Storage API services.

Project description

bigquery-streaming-connector

This library provides a BigQuery streaming connector for PySpark Structured Streaming.

The underlying connector uses the BigQuery Storage API services to pull BigQuery table data at scale using Spark workers.

The Storage API services are cheaper and faster than the traditional BigQuery query API services, enabling faster and cheaper BigQuery migration, incrementally and in a continuous fashion.

Pre-requisites

Requires Spark 4.0.0, or Databricks Runtime version 15.3 and above.

pip install bigquery-spark-streaming-connector

PySpark usage:

from streaming_connector import bq_stream_register

query = (spark.readStream.format("bigquery-streaming")
    .option("project_id", <bq_project_id>)
    .option("incremental_checkpoint_field", <table_incremental_ts_based_col>)
    .option("dataset", <bq_dataset_name>)
    .option("table", <bq_table_name>)
    .option("service_auth_json_file_name", <service_account_json_file_name>)
    .option("max_parallel_conn", <max_parallel_threads_to_pull_data>)  # defaults to a max of 1000
    .load()
)
## The above ingests table data incrementally using the provided timestamp-based field;
## the latest value is checkpointed using offset semantics.
## Without the incremental input field, a full table ingestion is done.
## The service account JSON file needs to be available to every Spark executor worker in the
## '/home' folder, either via the --files /home/<file_name>.json option or via an init script
## (when using Databricks Spark).

(query.writeStream.trigger(processingTime='30 seconds')
    .option("checkpointLocation", "checkpoint_path")
    .foreachBatch(writeToTable)  # your target table write function
    .start()
)
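The writeToTable callback above is user-supplied, not part of this connector. A minimal sketch of such a foreachBatch writer is shown below; the target table name and the append write mode are illustrative assumptions:

```python
# Minimal sketch of a user-supplied foreachBatch writer.
# The target table name and the write mode are illustrative assumptions.
def writeToTable(batch_df, batch_id):
    # batch_df is the micro-batch DataFrame; batch_id is a monotonically
    # increasing id that Spark assigns to each micro-batch.
    (batch_df.write
        .mode("append")  # append each micro-batch to the target
        .saveAsTable("my_catalog.my_schema.my_target_table"))  # hypothetical target table
```

Because foreachBatch hands you a plain batch DataFrame, any batch sink (Delta, JDBC, files) can be used inside the function.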


Source Distributions

No source distribution files are available for this release.

Built Distribution

File details and hashes for bigquery_spark_streaming_connector-0.6.0-py3-none-any.whl:

Algorithm    Hash digest
SHA256       29ead9c9252d64a7d43ad394c35cdbf139452447ad0cad2dfbfb3b182d6bc755
MD5          aa232ad0dd78aa1cc263ed12db324d9f
BLAKE2b-256  74772b3f243e1893e050808ca1f2ac4a00a91216359c49c7adef6c67b4537ff2

