
A Google BigQuery target for the singer.io framework.


target-bigquery

ANELEN's implementation of target-bigquery.

This is a Singer target that loads JSON-formatted data following the Singer spec to Google BigQuery.

Installation

Step 0: Acknowledge LICENSE and TERMS

Please note especially that the authors of target-bigquery are not responsible for any cost (including but not limited to BigQuery cost) incurred by running this program.

Step 1: Activate the Google BigQuery API

(originally found in the Google API docs)

  1. Use this wizard to create or select a project in the Google Developers Console and activate the BigQuery API. Click Continue, then Go to credentials.
  2. On the Add credentials to your project page, click the Cancel button.
  3. At the top of the page, select the OAuth consent screen tab. Select an Email address, enter a Product name if not already set, and click the Save button.
  4. Select the Credentials tab, click the Create credentials button and select OAuth client ID.
  5. Select the application type Other, enter the name "Singer BigQuery Tap", and click the Create button.
  6. Click OK to dismiss the resulting dialog.
  7. Click the Download button to the right of the client ID.
  8. Move this file to your working directory and rename it client_secrets.json.

Export the location of the secret file:

export GOOGLE_APPLICATION_CREDENTIALS="./client_secrets.json"

For other authentication methods, please see the Authentication section.

Step 2: Install

First, make sure Python 3 is installed on your system or follow these installation instructions for Mac or Ubuntu.

pip install -U target-bigquery-partition

Or you can install the latest development version from GitHub:

pip install --no-cache-dir https://github.com/anelendata/target-bigquery/archive/master.tar.gz#egg=target-bigquery

Run

Step 1: Configure

Create a file called target_config.json in your working directory, following config.sample.json:

{
    "project_id": "your-gcp-project-id",
    "dataset_id": "your-bigquery-dataset",
    "table_prefix": "optional_table_prefix",
    "table_ext": "optional_table_ext",
    "partition_by": "optional_column_name",
    "partition_type": "day",
    "partition_exp_ms": null,
    "stream": false
}

Notes:

  • The table name is set from the stream name sent by the tap. You can add a prefix and extension to the name.
  • Optionally, you can set partition_by to create a partitioned table. Many production-quality taps implement an ingestion timestamp, and it is recommended to use that column here to partition the table. It will increase query performance and lower BigQuery costs. partition_type can be hour, day, month, or year; the default is day. partition_exp_ms sets the partition expiration in milliseconds. The default is null (never expire).
  • stream: Set this to true to run streaming updates to BigQuery. Note that batch update performs better, so keep this option false unless you need streaming.
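Since target_config.json must be strict JSON (no comments, no trailing commas), it can help to validate the file before running the target. A quick sketch, assuming python3 is on the PATH; the project/dataset values and the _sdc_extracted_at partition column are placeholders, not values the project prescribes:

```shell
# Write a minimal example config for a day-partitioned table
# and check that it parses as valid JSON.
cat <<'EOF' > target_config.example.json
{
    "project_id": "your-gcp-project-id",
    "dataset_id": "your-bigquery-dataset",
    "partition_by": "_sdc_extracted_at",
    "partition_type": "day",
    "partition_exp_ms": null,
    "stream": false
}
EOF
python3 -m json.tool target_config.example.json > /dev/null \
  && echo "config is valid JSON"
```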

Step 2: Run

target-bigquery can be run with any Singer tap. As an example, let's use tap-exchangeratesapi.

pip install tap-exchangeratesapi

Run:

tap-exchangeratesapi | target-bigquery -c target_config.json
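To see what flows through that pipe: per the Singer spec, a tap emits newline-delimited JSON messages (SCHEMA, RECORD, STATE) on stdout, and target-bigquery reads them on stdin. A hand-written sample for illustration, not real tap-exchangeratesapi output:

```shell
# Each line is one standalone JSON message, as a Singer tap would emit.
cat <<'EOF' > sample_stream.jsonl
{"type": "SCHEMA", "stream": "exchange_rate", "schema": {"properties": {"date": {"type": "string"}, "USD": {"type": "number"}}}, "key_properties": ["date"]}
{"type": "RECORD", "stream": "exchange_rate", "record": {"date": "2024-01-02", "USD": 1.0}}
{"type": "STATE", "value": {"start_date": "2024-01-02"}}
EOF
# Verify every line parses as JSON on its own:
while IFS= read -r line; do
  printf '%s\n' "$line" | python3 -m json.tool > /dev/null
done < sample_stream.jsonl
echo "all messages valid"
```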

Authentication

It is recommended to use target-bigquery with a service account.

  • Download the client_secrets.json file for your service account, and place it on the machine where target-bigquery will be executed.
  • Set a GOOGLE_APPLICATION_CREDENTIALS environment variable on the machine, where the value is the fully qualified path to client_secrets.json

In a testing environment, you can also authenticate manually before running the tap. In this case you do not need GOOGLE_APPLICATION_CREDENTIALS defined:

gcloud auth application-default login

You may also have to set the project:

gcloud config set project <project-id>

Though not tested, it should also be possible to use the OAuth flow to authenticate to GCP:

  • target-bigquery will attempt to open a new window or tab in your default browser. If this fails, copy the URL from the console and manually open it in your browser.
  • If you are not already logged into your Google account, you will be prompted to log in.
  • If you are logged into multiple Google accounts, you will be asked to select one account to use for the authorization.
  • Click the Accept button to allow target-bigquery to access your Google BigQuery table.
  • You can close the tab after the signup flow is complete.

Schema change considerations

Typically, a change in schema is detected by target-bigquery's schema validation or when BigQuery rejects the input. Here are some ideas and options for handling schema changes.

Mapping column names

Modifying existing columns in a data warehouse is costly. One way to handle a source data type change is to create a new column (e.g. original column name: price, new: price_) and let the downstream (e.g. dbt) reconcile the old and new column types.

In the config, you can use column_map to map the source field name to the target column name:

{
    "project_id": "your-gcp-project-id",
    "dataset_id": "your-bigquery-dataset",
    ...
    "column_map": {
      "<stream_name>": {
        "<source_field_name>": "<target_col_name>",
        ...
      },
      ...
    }
}

Note: The schema is validated against the pre-mapped names. The column names are then swapped, where applicable, just before being written to BigQuery, so the input stream (tap) doesn't have to modify its schema message.
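The renaming itself is conceptually just a key substitution on each record. A toy sketch of the idea with hypothetical data, not target-bigquery's actual code:

```shell
# Rename record keys per a column_map before loading
# (price -> price_, everything else unchanged).
echo '{"price": 9.99, "sku": "A-1"}' | python3 -c '
import json, sys
column_map = {"price": "price_"}  # source field -> target column
record = json.load(sys.stdin)
print(json.dumps({column_map.get(k, k): v for k, v in record.items()}))
'
# prints {"price_": 9.99, "sku": "A-1"}
```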

Ignore unknown columns

BigQuery will reject the load if the data contains an undefined column. This is disruptive to daily operations that don't depend on the new columns.

To ignore the unknown columns, add this to config:

{
    "project_id": "your-gcp-project-id",
    "dataset_id": "your-bigquery-dataset",
    ...
    "exclude_unknown_columns": true,
    ...
}

(Default is false)

Warning logs are written out when new columns are detected.
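Conceptually, this option filters each record down to the known columns and warns about the rest. A toy sketch with a hypothetical schema, not target-bigquery's actual code:

```shell
# Drop columns that are not in the table schema, warning on stderr.
echo '{"id": 1, "name": "x", "new_col": "surprise"}' | python3 -c '
import json, sys
known = {"id", "name"}  # columns defined in the BigQuery table
record = json.load(sys.stdin)
unknown = set(record) - known
if unknown:
    print("WARNING: ignoring unknown columns:", sorted(unknown), file=sys.stderr)
print(json.dumps({k: v for k, v in record.items() if k in known}))
'
# prints {"id": 1, "name": "x"}
```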

Add new columns

Use --schema or -s followed by the updated catalog file to automatically detect and add new columns to the BigQuery table:

target-bigquery -c files/target_config.json -s files/catalog.json --dryrun -t <table1,table2,...>

  • Use the --dryrun switch to do a dry run: no change is made to the BigQuery table, and the result is previewed in the log.
  • Use --table (or -t) followed by comma-separated (no spaces) table names to execute only for the listed tables.

Original repo

https://github.com/anelendata/target-bigquery

About this project

This project is developed by ANELEN and friends. Please check out ANELEN's open innovation philosophy and other projects.


Copyright © 2020~ Anelen Co., LLC
