Singer.io tap for extracting data from BigQuery
Project description
tap-bigquery
Extract data from BigQuery tables.
This is a Singer tap that produces JSON-formatted data following the Singer spec.
This tap:
- Pulls data from Google BigQuery tables/views with datetime field.
- Infers the schema for each resource and produce catalog file.
- Incrementally pulls data based on the input state.
Installation
Step 0: Acknowledge LICENSE and TERMS
Please especially note that the author(s) of tap-bigquery is not responsible for the cost, including but not limited to BigQuery cost) incurred by running this program.
Step 1: Activate the Google BigQuery API
(originally found in the Google API docs)
- Use this wizard to create or select a project in the Google Developers Console and activate the BigQuery API. Click Continue, then Go to credentials.
- On the Add credentials to your project page, click the Cancel button.
- At the top of the page, select the OAuth consent screen tab. Select an Email address, enter a Product name if not already set, and click the Save button.
- Select the Credentials tab, click the Create credentials button and select OAuth client ID.
- Select the application type Other, enter the name "Singer BigQuery Tap", and click the Create button.
- Click OK to dismiss the resulting dialog.
- Click the Download button to the right of the client ID.
- Move this file to your working directory and rename it client_secrets.json.
Export the location of the secret file:
export GOOGLE_APPLICATION_CREDENTIALS="./client_secret.json"
For other authentication method, please see Authentication section.
Step 2: Install
First, make sure Python 3 is installed on your system or follow these installation instructions for Mac or Ubuntu.
pip install -U tap-bigquery
Or you can install the lastest development version from GitHub:
pip install --no-cache-dir https://github.com/anelendata/tap-bigquery/archive/master.tar.gz#egg=tap-bigquery
Run
Step 1: Configure
Create a file called tap_config.json in your working directory, following config.sample.json:
{
"streams": [
{"name": "<some_schema_name>",
"table": "`<project>.<dataset>.<table>`",
"columns": ["<col_name_0>", "<col_name_1>", "<col_name_2>"],
"datetime_key": "<your_key>",
"filters": ["country='us'", "state='CA'",
"registered_on>=DATE_ADD(current_date, INTERVAL -7 day)"
] // also optional: these are parsed in 'WHERE' clause
}
],
"start_datetime": "2017-01-01T00:00:00Z", // This can be set at the command line argument
"end_datetime": "2017-02-01T00:00:00Z", // end_datetime is optional
"limit": 100,
"start_always_inclusive": false // default is false, optional
}
- The required parameters is at least one stream (one bigquery table/view) to copy.
- It is not a recommended BigQuery practice to use
*
to specify the columns as it may blow up the cost for a table with a large number of columns. filters
are optional but we strongly recommend using this over a large partitioned table to control the cost. LIMIT (The authors of tap-bigquery is not responsible for the cost incurred by running this program. Always test thoroughly with small data set first.)
- It is not a recommended BigQuery practice to use
start_datetime
must also be set in the config file or as the command line argument (See the next step).limit
will limit the number of results, but it does not result in reduce the query cost.
The table/view is expected to have a column to indicate the creation or
update date and time so the tap sends the query with ORDER BY
and use
the column to record the bookmark (See State section).
Step 2: Create catalog
Run tap-bigquery in discovery mode to let it create json schema file and then run them together, piping the output of tap-bigquery to target-csv:
tap-bigquery -c tap_config.json -d > catalog.json
Step 3: Run
tap-bigquery can be run with any Singer Target. As example, let use target-csv.
pip install target-csv
Run:
tap-bigquery -c tap_config.json \
--catalog tap_catalog.json --start_datetime '2020-08-01T00:00:00Z' \
--end_datetime '2020-08-02T01:00:00Z' | target-csv --config target_config.json \
> state.json
This example should create a csv file in the same directory.
state.json
should contain a state (bookmark) after the run. (See State section).
Notes:
- start and end datetimes accept ISO 8601 format, can be date only. start datetime is inclusive, end datetime is not.
- It is recommended to inspect the catalog file and fix the auto-type assignment if necessary.
- target-csv's target_config.json is optinal.
- tap-bigquery can produce nested records but it's up to target if the data writing will be successful. In this example with target-csv, the table is expected to be flat.
Authentication
It is recommended to use tap-bigquery
with a service account.
- Download the client_secrets.json file for your service account, and place it
on the machine where
tap-bigquery
will be executed. - Set a
GOOGLE_APPLICATION_CREDENTIALS
environment variable on the machine, where the value is the fully qualified path to client_secrets.json
In the testing environment, you can also manually authenticate before runnig
the tap. In this case you do not need GOOGLE_APPLICATION_CREDENTIALS
defined:
gcloud auth application-default login
Though not tested, it should also be possible to use the OAuth flow to authenticate to GCP as well:
tap-bigquery
will attempt to open a new window or tab in your default browser. If this fails, copy the URL from the console and manually open it in your browser.- If you are not already logged into your Google account, you will be prompted to log in.
- If you are logged into multiple Google accounts, you will be asked to select one account to use for the authorization.
- Click the Accept button to allow
tap-bigquery
to access your Google BigQuery table. - You can close the tab after the signup flow is complete.
State
This tap emits state.
The command also takes a state file input with --state <file-name>
option.
If the state is set, start_datetime config and command line argument are
ignored and the datetime value from last_update key is used as the resuming
point.
To avoid the data duplication, start datetime is exclusive
start_datetime < datetime_column
when the tap runs with state option. If
you fear a data loss because of this, just use the --start_datetime
option
instead of state. Or set start_always_inclusive: true
in configuration.
The tap itself does not output a state file. It anticipate the target program or a downstream process to fianlize the state safetly and produce a state file.
Original repo
https://github.com/anelendata/tap-bigquery
Copyright © 2020- Anelen Data
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for tap_bigquery-0.3.0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 97c06562b7c843bb443b9fbffd2cc6a789f30b6475261fcce7f48e1c119aa0e6 |
|
MD5 | 098efadf06fe87b6797ba00698bd1a54 |
|
BLAKE2b-256 | 988126ad60789eebd3eacfb7b07e1688ac26647a18f39b61f98de46040703a65 |