Dask + BigQuery integration

Project description

Dask-BigQuery

Read/write data from/to Google BigQuery with Dask.

This package uses the BigQuery Storage API. Please refer to the data extraction pricing table for the costs associated with using Dask-BigQuery.

Installation

dask-bigquery can be installed with pip:

pip install dask-bigquery

or with conda:

conda install -c conda-forge dask-bigquery

Google Cloud permissions

For reading from BigQuery, the following roles need to be enabled on the account:

  • BigQuery Read Session User
  • BigQuery Data Viewer, BigQuery Data Editor, or BigQuery Data Owner

Alternatively, BigQuery Admin gives you full access to sessions and data.

For writing to BigQuery, the following roles are sufficient:

  • BigQuery Data Editor
  • Storage Object Creator

The minimal set of roles that covers both reading and writing:

  • BigQuery Data Editor
  • BigQuery Read Session User
  • Storage Object Creator

Authentication

By default, dask-bigquery uses Application Default Credentials. When running code locally, you can set these to your user credentials by running:

$ gcloud auth application-default login

User credentials require an interactive login. For settings where this isn't possible, you'll need to create a service account. You can point the Application Default Credentials at the service account key file using the GOOGLE_APPLICATION_CREDENTIALS environment variable:

$ export GOOGLE_APPLICATION_CREDENTIALS=/home/<username>/google.json

For more information on obtaining credentials, see the Google Cloud documentation.
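
To confirm which credentials and project the Application Default Credentials resolve to, here is a minimal sketch using the google-auth library (a dependency of the BigQuery client libraries):

import google.auth

# Resolve Application Default Credentials; raises
# google.auth.exceptions.DefaultCredentialsError if none are configured.
credentials, project = google.auth.default()
print(f"Authenticated for project: {project}")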

Example: read from BigQuery

dask-bigquery assumes that you are already authenticated.

import dask_bigquery

ddf = dask_bigquery.read_gbq(
    project_id="your_project_id",
    dataset_id="your_dataset",
    table_id="your_table",
)

ddf.head()
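
read_gbq also accepts optional arguments to limit what is read. The sketch below uses the row_filter and columns parameters to push selection down to the BigQuery Storage API; these parameter names are assumptions, so check the signature of your installed version:

import dask_bigquery

ddf = dask_bigquery.read_gbq(
    project_id="your_project_id",
    dataset_id="your_dataset",
    table_id="your_table",
    row_filter="value > 0",     # assumed: predicate pushed to the Storage API
    columns=["name", "value"],  # assumed: subset of columns to read
)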

Example: write to BigQuery

Write to BigQuery with default credentials

Assuming that the client and workers are already provisioned with default credentials:

import dask
import dask_bigquery

ddf = dask.datasets.timeseries(freq="1min")

res = dask_bigquery.to_gbq(
    ddf,
    project_id="my_project_id",
    dataset_id="my_dataset_id",
    table_id="my_table_name",
)

Before loading the data into BigQuery, to_gbq writes intermediary Parquet files to a Google Cloud Storage bucket. The default bucket name is <your_project_id>-dask-bigquery. You can provide a different bucket name via the bucket parameter, for example bucket="my-gs-bucket"; see the sketch below. After the job is done, the intermediary data is deleted.
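
For example, staging the intermediary data in a bucket you manage (the bucket name is illustrative):

res = dask_bigquery.to_gbq(
    ddf,
    project_id="my_project_id",
    dataset_id="my_dataset_id",
    table_id="my_table_name",
    bucket="my-gs-bucket",  # intermediary Parquet files are staged here, then deleted
)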

Write to BigQuery with explicit (non-default) credentials

# service account credentials as a dict (fields elided)
creds_dict = {"type": ..., "project_id": ..., "private_key_id": ...}

res = dask_bigquery.to_gbq(
    ddf,
    project_id="my_project_id",
    dataset_id="my_dataset_id",
    table_id="my_table_name",
    credentials=creds_dict,
)
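
Rather than writing key material inline, you can load the service account key file (its JSON fields match the dict above) with the standard library; the file path below is hypothetical:

import json

# Load the service account key file into a dict (hypothetical path),
# then pass it to to_gbq via credentials=creds_dict as above.
with open("/path/to/service-account-key.json") as f:
    creds_dict = json.load(f)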

Run tests locally

To run the tests locally, you need to be authenticated and have a project created on that account. If you're using a service account, select the "BigQuery Admin" role in the "Grant this service account access to project" section when creating it.

You can run the tests with

$ pytest dask_bigquery

if your default gcloud project is set, or manually specify the project ID with

$ DASK_BIGQUERY_PROJECT_ID=<project_id> pytest dask_bigquery

History

This project stems from the discussion in this Dask issue and this initial implementation developed by Brett Naul, Jacob Hayes, and Steven Soojin Kim.

License

BSD-3


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dask_bigquery-2024.7.0.tar.gz (12.5 kB view details)

Uploaded Source

Built Distribution

dask_bigquery-2024.7.0-py3-none-any.whl (11.7 kB view details)

Uploaded Python 3

File details

Details for the file dask_bigquery-2024.7.0.tar.gz.

File metadata

  • Download URL: dask_bigquery-2024.7.0.tar.gz
  • Upload date:
  • Size: 12.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.9

File hashes

Hashes for dask_bigquery-2024.7.0.tar.gz
Algorithm Hash digest
SHA256 2807bbdd934627d0ed47c58d0e7b6075af9cba7e6f2d2e4a1e2173e20e76a023
MD5 77514f951fcfb0eec4ed79931384f88b
BLAKE2b-256 9f978a0a6b5b59ee1334542e9f674cfa2d97598e3b49e1ae6843406a8b820542

See more details on using hashes here.
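
To verify a downloaded file against the published digest, a short sketch using Python's hashlib:

import hashlib

# Hash the downloaded sdist in chunks and compare to the published SHA256.
sha256 = hashlib.sha256()
with open("dask_bigquery-2024.7.0.tar.gz", "rb") as f:
    for chunk in iter(lambda: f.read(8192), b""):
        sha256.update(chunk)

assert sha256.hexdigest() == (
    "2807bbdd934627d0ed47c58d0e7b6075af9cba7e6f2d2e4a1e2173e20e76a023"
)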

File details

Details for the file dask_bigquery-2024.7.0-py3-none-any.whl.

File metadata

File hashes

Hashes for dask_bigquery-2024.7.0-py3-none-any.whl
Algorithm Hash digest
SHA256 9c1a9c913fdb144c57e8895df1f1a3ba960a2975496aaed6d5ea3aad2e27ba4f
MD5 9981ceadcd482cf6c5b8849c1c2d66a5
BLAKE2b-256 d1689f6adde67f7d225c013fb7e5ee54c73a172b8f76582a2c98c7f4b7b20ed4

See more details on using hashes here.
