Dask-BigQuery
Read data from Google BigQuery with Dask.
This package uses the BigQuery Storage API. Please refer to the data extraction pricing table for the costs associated with using Dask-BigQuery.
Installation
dask-bigquery can be installed with pip:
pip install dask-bigquery
or with conda:
conda install -c conda-forge dask-bigquery
Authentication
Default credentials can be provided by setting the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of the credentials file:
$ export GOOGLE_APPLICATION_CREDENTIALS=/home/<username>/google.json
For information on obtaining the credentials, see the Google API documentation.
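To confirm that default credentials resolve before calling dask-bigquery, you can load them with the google-auth library. This is an illustrative check, not part of the dask-bigquery API:

import google.auth

# Resolve application default credentials; raises DefaultCredentialsError
# if none are configured
credentials, project_id = google.auth.default()
print(project_id)  # the project inferred from the credentials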
Example: read from BigQuery
dask-bigquery assumes that you are already authenticated.
import dask_bigquery
ddf = dask_bigquery.read_gbq(
    project_id="your_project_id",
    dataset_id="your_dataset",
    table_id="your_table",
)
ddf.head()
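read_gbq can also push column selection and row filtering down to the BigQuery Storage API. A minimal sketch, assuming the optional columns and row_filter parameters (the column names below are placeholders):

import dask_bigquery

# Fetch only two columns and filter rows server-side
ddf = dask_bigquery.read_gbq(
    project_id="your_project_id",
    dataset_id="your_dataset",
    table_id="your_table",
    columns=["name", "value"],  # subset of columns to read
    row_filter="value > 0",     # predicate applied by the Storage API
)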
Example: write to BigQuery
With default credentials:
import dask
import dask_bigquery
ddf = dask.datasets.timeseries(freq="1min")
res = dask_bigquery.to_gbq(
    ddf,
    project_id="my_project_id",
    dataset_id="my_dataset_id",
    table_id="my_table_name",
)
With explicit credentials:
import dask_bigquery
from google.oauth2.service_account import Credentials

# Build credentials from a service account info dict (values elided)
creds_dict = {"type": ..., "project_id": ..., "private_key_id": ...}
credentials = Credentials.from_service_account_info(info=creds_dict)

# Reuses the ddf from the previous example
res = dask_bigquery.to_gbq(
    ddf,
    project_id="my_project_id",
    dataset_id="my_dataset_id",
    table_id="my_table_name",
    credentials=credentials,
)
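If the service account key is stored as a JSON file, google-auth can load it directly. from_service_account_file is part of the google-auth library, not dask-bigquery, and the path below is a placeholder:

from google.oauth2.service_account import Credentials

# Load credentials from a service account key file on disk
credentials = Credentials.from_service_account_file("/home/<username>/google.json")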
Before loading data into BigQuery, to_gbq writes intermediary Parquet files to a Google Cloud Storage bucket. The default bucket name is dask-bigquery-tmp. You can provide a different bucket name by setting the parameter bucket="my-gs-bucket". After the job is done, the intermediary data is deleted.
If you're using a persistent bucket, we recommend configuring a retention policy that ensures the data is cleaned up even in case of job failures.
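A sketch of a write that stages its intermediary data in a custom bucket, using the bucket parameter described above (the bucket name is a placeholder):

res = dask_bigquery.to_gbq(
    ddf,
    project_id="my_project_id",
    dataset_id="my_dataset_id",
    table_id="my_table_name",
    bucket="my-gs-bucket",  # staging bucket for intermediary Parquet
)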
Run tests locally
To run the tests locally, you need to be authenticated and have a project created in that account. If you're using a service account, select the "BigQuery Admin" role under "Grant this service account access to project" when creating it.
You can run the tests with
$ pytest dask_bigquery
if your default gcloud project is set, or manually specify the project ID with

$ DASK_BIGQUERY_PROJECT_ID=<project_id> pytest dask_bigquery
History
This project stems from the discussion in this Dask issue and this initial implementation developed by Brett Naul, Jacob Hayes, and Steven Soojin Kim.
License