jx-bigquery - JSON Expressions for BigQuery
Project description
jx-bigquery
JSON Expressions for BigQuery
Status
June 2020 - Can insert JSON documents into BigQuery while managing the schema. Queries are not supported yet.
Overview
The library is intended to manage multiple BigQuery tables to give the illusion of one table with a dynamically managed schema.
Definitions
partition
- Big data is split into separate containers based on age. This allows queries on recent data to use less resources, and allows old data to be dropped quicklycluster
- another name for the sorted order of the data in a partition. Sorting by the most common used lookup will make queries fasterid
- The set of columns that identifies the document
Configuration
table
- Any name you wish to give to this table seriestop_level_fields
- BigQuery demands that control columns are top-level. Define them here.partition
-field
- The dot-delimited field used to partition the tables (must betime
datatype)expire
- When BigQuery will automatically drop your data.
id
- The identification of documentsfield
- the set of columns to uniquely identify this documentversion
- column used to determine age of a document; replacing newer with older
cluster
- Columns used to sort the partitionsschema
- {name: type} dictionary - needed when there is no data; BigQuery demands column definitionssharded
- boolean - set totrue
if you allow this library to track multiple tables. It allows for schema migration (expansion only), and for faster insert from a multitude of machinesaccount_info
- The information BigQuery provides to connect
Example
This is a complicated example. See tests/config.json for a minimal example.
{
"table": "my_table_name",
"top_level_fields": {},
"partition": {
"field": "submit_time",
"expire": "2year"
},
"id": {
"field": "id",
"version": "last_modified"
},
"cluster": [
"id",
"last_modified"
],
"schema": {
"id": "integer",
"submit_time": "time",
"last_modified": "time"
},
"sharded": true,
"account_info": {
"private_key_id": {
"$ref": "env://BIGQUERY_PRIVATE_KEY_ID"
},
"private_key": {
"$ref": "env://BIGQUERY_PRIVATE_KEY"
},
"type": "service_account",
"project_id": "my-project-id",
"client_email": "me@my_project.iam.gserviceaccount.com",
"client_id": "12345",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/my-project.iam.gserviceaccount.com"
}
}
Usage
Setup Dataset
with an application name
dataset = bigquery.Dataset(
dataset=application_name,
kwargs=settings
)
Create a table
destination = dataset.get_or_create_table(settings.destination)
Insert documents as you please
destination.extend(documents)
Request a merge when done
destination.merge()
Running tests
Fork and clone this repo.
git clone https://github.com/klahnakoski/jx-bigquery.git
cd jx-bigquery
pip install -r requirements.txt
You will require a Google API key to run tests. The website will allow you to generate one and download a JSON file with the key. Update the tests/config.json to point to that file:
# contents of tests/config.json
{
"destination": {
"account_info": {
"$ref": "file:///e:/moz-fx-dev-ekyle-treeherder-a838a7718652.json"
}
},
"constants": {},
"debug": {
"trace": true
}
}
Then you can run the tests
python -m unittest discover tests
NOTE - the tests will create a
testing
dataset and generate/drop tables
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file jx-bigquery-3.85.20198.tar.gz
.
File metadata
- Download URL: jx-bigquery-3.85.20198.tar.gz
- Upload date:
- Size: 33.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/45.1.0 requests-toolbelt/0.8.0 tqdm/4.43.0 CPython/3.7.6
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 4f34274e99b601ef8cac1ae9f9864502a172155c35d9abb0765b197cfc917c7d |
|
MD5 | 55ed095b30e6bfc4d4d440cd93add3a9 |
|
BLAKE2b-256 | 2779e29a12747c50c861f5f502957594c20cbee441a3d9bc46763e9b738aa723 |