
macrometa-source-bigquery

Extract data from BigQuery tables.

This is a Macrometa source connector for BigQuery that produces JSON-formatted data following the Singer spec.

This tap:

  • Pulls data from Google BigQuery tables/views.
  • Infers the schema for each resource and produces a catalog file.

Installation

Step 1: Activate the Google BigQuery API

(originally found in the Google API docs)

  1. Use this wizard to create or select a project in the Google Developers Console and activate the BigQuery API. Click Continue, then Go to credentials.
  2. On the Add credentials to your project page, click the Cancel button.
  3. At the top of the page, select the OAuth consent screen tab. Select an Email address, enter a Product name if not already set, and click the Save button.
  4. Select the Credentials tab, click the Create credentials button and select OAuth client ID.
  5. Select the application type Other, enter the name "Macrometa Source BigQuery", and click the Create button.
  6. Click OK to dismiss the resulting dialog.
  7. Click the Download button to the right of the client ID.
  8. Move this file to your working directory and rename it client_secrets.json.

Export the location of the secret file:

export GOOGLE_APPLICATION_CREDENTIALS="./client_secrets.json"

For other authentication methods, see the Authentication section.

Step 2: Install

First, make sure Python 3 is installed on your system or follow these installation instructions for Mac or Ubuntu.

pip install -U macrometa-source-bigquery
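
To confirm the install, you can check the package metadata and print the CLI usage. The --help flag is an assumption here; Singer-style taps typically provide one:

# Confirm the package is installed
pip show macrometa-source-bigquery

# Print usage (assumes a standard --help flag)
macrometa-source-bigquery --help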

Run

Step 1: Configure

Create a file called tap_config.json in your working directory, following config.sample.json:

{
  "streams": [
      {"name": "<some_schema_name>",
       "table": "`<project>.<dataset>.<table>`",
       "columns": ["<col_name_0>", "<col_name_1>", "<col_name_2>"],
       "datetime_key": "<your_key>",
       "filters": ["country='us'", "state='CA'",
                   "registered_on>=DATE_ADD(current_date, INTERVAL -7 day)"
                  ]
      }
    ],
  "start_datetime": "2017-01-01T00:00:00Z",
  "end_datetime": "2017-02-01T00:00:00Z",
  "limit": 100,
  "start_always_inclusive": false
}
  • The required parameter is at least one stream (one BigQuery table or view) to copy.
    • It is not a recommended BigQuery practice to use * to specify the columns, as it may blow up the cost for a table with a large number of columns.
    • filters are optional and are joined into the WHERE clause of the generated query. We strongly recommend using them on a large partitioned table to control the cost. (The authors of this connector are not responsible for the cost incurred by running this program. Always test thoroughly with a small data set first.)
  • start_datetime must be set either in the config file or as a command-line argument (see the next step). end_datetime is optional.
  • limit caps the number of rows returned, but it does not reduce the query cost.
  • start_always_inclusive is optional and defaults to false (see the State section).

The table or view is expected to have a column that indicates the creation or update datetime, so the tap adds ORDER BY on that column (datetime_key) to the query and uses it to record the bookmark (see the State section).
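
Before running against a large table, you can sanity-check the cost with a dry run of a query of roughly the shape the connector generates. This is only a sketch: the project, dataset, table, and column names are placeholders, and the exact SQL the connector emits may differ.

# Estimate bytes scanned without executing the query (names are placeholders)
bq query --use_legacy_sql=false --dry_run \
  "SELECT col_name_0, col_name_1, col_name_2
   FROM \`my-project.my_dataset.my_table\`
   WHERE country='us' AND state='CA'
     AND registered_on >= '2017-01-01' AND registered_on < '2017-02-01'
   ORDER BY registered_on
   LIMIT 100"

bq reports the estimated bytes processed without billing you, which is a cheap way to catch an unfiltered scan of a large partitioned table.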

Step 2: Create catalog

Run macrometa-source-bigquery in discovery mode to let it create a catalog file containing a JSON schema for each stream; in the next step you will run the tap and a target together, piping the output of macrometa-source-bigquery to target-csv:

macrometa-source-bigquery -c tap_config.json -d > catalog.json
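
It is worth a quick look at what discovery produced. Assuming jq is installed and the catalog follows the standard Singer layout (a streams array whose entries carry a schema), you can list the streams and their inferred columns:

# List each discovered stream and its column names (assumes the standard Singer catalog layout)
jq '.streams[] | {stream: .stream, columns: (.schema.properties | keys)}' catalog.json

If the inferred types look wrong, edit catalog.json by hand before the next step (see also the Notes below).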

Step 3: Run

macrometa-source-bigquery can be run with any Singer target. As an example, let's use target-csv.

pip install target-csv

Run:

macrometa-source-bigquery -c tap_config.json \
    --catalog catalog.json --start_datetime '2020-08-01T00:00:00Z' \
    --end_datetime '2020-08-02T01:00:00Z' | target-csv --config target_config.json \
    > state.json

This example should create a CSV file in the same directory. state.json should contain the state (bookmark) after the run (see the State section).

Notes:

  • Start and end datetimes accept the ISO 8601 format and can be dates only. The start datetime is inclusive; the end datetime is not.
  • It is recommended to inspect the catalog file and fix the auto-type assignment if necessary.
  • target-csv's target_config.json is optional (a sample sketch follows these notes).
  • macrometa-source-bigquery can produce nested records, but whether they are written successfully is up to the target. In this example with target-csv, the table is expected to be flat.
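
If you do pass a config to target-csv, a minimal target_config.json might look like the following. The delimiter and quotechar keys are the commonly documented options for the singer-io target-csv; treat this as a sketch and check the target's own README for the authoritative list.

{
  "delimiter": ",",
  "quotechar": "\""
}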

Authentication

It is recommended to use macrometa-source-bigquery with a service account.

  • Download the client_secrets.json file for your service account, and place it on the machine where macrometa-source-bigquery will be executed.
  • Set a GOOGLE_APPLICATION_CREDENTIALS environment variable on the machine, whose value is the fully qualified path to client_secrets.json.
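
If you do not have a key file yet, one way to create one is with the gcloud CLI, assuming the service account already exists. The account and file names below are hypothetical placeholders:

# Create a JSON key for an existing service account (names are placeholders)
gcloud iam service-accounts keys create ./client_secrets.json \
    --iam-account=my-tap@my-project.iam.gserviceaccount.com

# Point the connector at the key file
export GOOGLE_APPLICATION_CREDENTIALS="./client_secrets.json"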

In a testing environment, you can also authenticate manually before running the tap. In this case you do not need GOOGLE_APPLICATION_CREDENTIALS defined:

gcloud auth application-default login

You may also have to set the project:

gcloud config set project <project-id>

Though not tested, it should also be possible to authenticate to GCP using the OAuth flow:

  • macrometa-source-bigquery will attempt to open a new window or tab in your default browser. If this fails, copy the URL from the console and manually open it in your browser.
  • If you are not already logged into your Google account, you will be prompted to log in.
  • If you are logged into multiple Google accounts, you will be asked to select one account to use for the authorization.
  • Click the Accept button to allow macrometa-source-bigquery to access your Google BigQuery table.
  • You can close the tab after the signup flow is complete.

State

This source connector emits state. The command also takes a state file as input with the --state <file-name> option. If the state is set, the start_datetime config and command-line argument are ignored, and the datetime value from the last_update key is used as the resuming point.

To avoid data duplication, the start datetime is exclusive (start_datetime < datetime_column) when the source connector runs with the state option. If you fear data loss because of this, use the --start_datetime option instead of state, or set start_always_inclusive: true in the configuration.

The source connector itself does not output a state file. It expects the target program or a downstream process to finalize the state safely and produce a state file.
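
A resumed run might therefore look like the sketch below. The shape of state.json shown here is an assumption based on the last_update key described above; use whatever state your own pipeline finalized, and write the new state to a separate file so you do not clobber the input mid-run.

# Hypothetical bookmark finalized by a previous run
# state.json: {"last_update": "2020-08-02T01:00:00Z"}

macrometa-source-bigquery -c tap_config.json --catalog catalog.json \
    --state state.json | target-csv --config target_config.json \
    > new_state.json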

Original repo

https://github.com/Macrometacorp/macrometa-source-bigquery

