
Project description

BigQuery Pipeline

Utility class for building data pipelines in BigQuery.

Provides methods for running queries, copying tables, deleting tables, and exporting tables to GCS.

Supports Jinja2 templated SQL.

Getting Started

Check out the codelab!

Usage

Create an instance of BQPipeline. By setting query_project, default_project and default_dataset, you can omit project and dataset from table references in your SQL statements.

default_project is the project used when a tablespec does not specify a project.

default_dataset is the dataset used when a tablespec does not specify project or dataset.
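For example (names here are illustrative): with default_project='myproject' and default_dataset='mydataset', the bare tablespec 'mytable' resolves to 'myproject.mydataset.mytable', and 'otherdataset.mytable' resolves to 'myproject.otherdataset.mytable'.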

Each SQL file should contain a single BigQuery Standard SQL statement.

Note that if you reference a project with a '-' in the name, you must backtick the tablespec in your SQL: `my-project.dataset_id.table_id`

Writing scripts to be easily portable between environments

  • Use {{ project }} in all your SQL queries, as in the sketch below.
  • In your replacements dictionary, set 'project' to the value of BQPipeline.infer_project(). This infers the project from the credentials: in your local shell it will use your Application Default Credentials, and on the cron box it will use the project of the cron box's Service Account.
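For example, a templated query file might look like this (the file path, table, and column names are illustrative):

-- ../sql/q1.sql
SELECT user_id, event_ts
FROM `{{ project }}.{{ dataset }}.events`

Jinja2 substitutes the values from your replacements dictionary before the statement is sent to BigQuery. A minimal pipeline script using such replacements: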
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from ox_bqpipeline.bqpipeline import BQPipeline

# default_dataset lets tablespecs omit the dataset; credentials are
# loaded from the given JSON key file.
bq = BQPipeline(job_name='myjob',
                default_dataset='mydataset',
                json_credentials_path='credentials.json')

# Values substituted into the Jinja2-templated SQL files.
replacements = {
    'project': bq.infer_project(),
    'dataset': 'mydataset'
}

bq.copy_table('source_table', 'dest_table')

# A bare path runs the query as-is; a (path, table) tuple writes the
# result to the named destination table.
bq.run_queries(['../sql/q1.sql', ('../sql/q2.sql', 'tmp_table_1')], **replacements)

bq.export_csv_to_gcs('tmp_table_2', 'gs://my-bucket/path/to/tmp_table_2-*.csv')

# Clean up intermediate tables.
bq.delete_tables(['tmp_table_1', 'tmp_table_2'])

Note that the run_queries method provided by this utility can alternatively take a list of tuples, where the first entry is the SQL path and the second is a destination table. You can see an example of this in example_pipeline.py.

For detailed documentation about the methods provided by this utility class see docs.md.

Writing scripts with parameterized queries

BigQuery Standard SQL supports parameterized queries.

To run parameterized queries with bqpipeline from the command line, use the following syntax:

python3 ox_bqpipeline/bqpipeline.py --query_file query.sql --gcs_destination gs://bucket_path --query_params '{"int_param": 1, "str_param": "one"}'
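BigQuery Standard SQL refers to named parameters with an @ prefix. Assuming the JSON keys above map to named query parameters, a hypothetical query.sql might look like this (table and column names are illustrative):

SELECT *
FROM `my-project.mydataset.mytable`
WHERE int_col = @int_param
  AND str_col = @str_param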

To invoke the BQPipeline.run_queries method from within your own Python module, use the following pattern:

        bqp = bqpipeline.BQPipeline(
            job_name='testjob', default_project='project_name',
            default_dataset='dataset_name')
        # Each tuple: (path to SQL file, destination table or GCS URI,
        # dict of query parameters).
        qj_list = bqp.run_queries(
            [(<query1_path>, <table_or_gcs_dest>, {'query1_param': 'value'}),
             (<query2_path>, <table_or_gcs_dest>, {'query2_param': 'value'}),
            ],
            batch=False, overwrite=False,
            dry_run=False)

Creating Service Account JSON Credentials

  1. Visit the Service Account Console
  2. Select a service account
  3. Select "Create Key"
  4. Select "JSON"
  5. Click "Create" to download the file
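The downloaded file is what the json_credentials_path argument shown earlier points at. Alternatively, the Google Cloud client libraries will pick the key up as Application Default Credentials if you export the standard environment variable (the path is illustrative):

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json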

Installation

Optional: Install in virtualenv

python3 -m virtualenv venv
source venv/bin/activate

Install with pip

python3 -m pip install -r requirements.txt

or

pip3 install -r requirements.txt

Install with pipenv

pipenv install --python 3

Run test suite

python3 -m unittest discover

Run test suite using pipenv

pipenv run python -m unittest discover

Requirements

You'll need:

  • Python 3.4 or later
  • Google Cloud Python Client

Disclaimer

This is not an official Google project.

References

  • Python Example Code
  • google-cloud-bigquery
  • Jinja2
  • ox_bqpipeline Reference

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ox_bqpipeline-0.0.4.tar.gz (12.1 kB)


Built Distribution

ox_bqpipeline-0.0.4-py3-none-any.whl (14.0 kB)


File details

Details for the file ox_bqpipeline-0.0.4.tar.gz.

File metadata

  • Download URL: ox_bqpipeline-0.0.4.tar.gz
  • Size: 12.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.0.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.7.6

File hashes

Hashes for ox_bqpipeline-0.0.4.tar.gz:

  • SHA256: b60706616981076cb429e507a88171c3c30877557b80e3132fa643f93c0d0c71
  • MD5: 68c5b25d32461110e37fdd6893998b46
  • BLAKE2b-256: bbb8d1c8e793f3c80418aa8379263e04cd50431e263ebaeb396555b09d2b8efd


File details

Details for the file ox_bqpipeline-0.0.4-py3-none-any.whl.

File metadata

  • Download URL: ox_bqpipeline-0.0.4-py3-none-any.whl
  • Size: 14.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.0.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.7.6

File hashes

Hashes for ox_bqpipeline-0.0.4-py3-none-any.whl:

  • SHA256: 1c24d99d98471cefdebdcaeb5b5dd803091f8191c9ebfa4d25ce8077f272ee88
  • MD5: 78500b2fa87c9faae4ebf7a032aebcd5
  • BLAKE2b-256: 046c108df2bbe1e287ae26a21f7d4b41979302088a9adcc387f02e835e8249fb

