Define your BigQuery tables as dataclasses.

bq-schema

Motivation

At limehome we are heavy users of Python and BigQuery. This library was created mainly to solve the following issues:

Define table schemas in code and have a migration script to apply changes.
On deploy make sure that all schemas were applied, otherwise abort.
Guarantee that when we try to write data to a table, the data matches the schema of the table (required / optional, datatypes)
Version our tables and enable migrations to a new schema

Additionally, this library aims to help its users through Python type hints:

Specify your schema as a python dataclass
Our migration script converts the data class into a bigquery schema definition
Deserialize rows into a dataclass instance, while reading from a table
Serialize a dataclass instance into a dictionary and write it to the table.

The main benefit of combining these features is that we can verify that our code matches the table schemas before we deploy to production.

Quickstart

Since this library makes use of newer Python features, you need at least Python 3.7.

Install the package

pip install bq_schema

Create a schema and a table definition in my_table.py

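from dataclasses import dataclass, field
from typing import List, Optional

# Import path assumed; adjust it if your installed version exposes the class elsewhere.
from bq_schema import BigqueryTable


@dataclass
class Schema:
    string_field: str = field(metadata={"description": "This is a STRING field."})
    int_field: Optional[int]
    some_floats: List[float]
    bool_field: bool


class MyTable(BigqueryTable):
    name = "my_table_name"
    schema = Schema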
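
To make the mapping from type hints to columns more concrete, here is a minimal, standard-library-only sketch of how a dataclass could be inspected. It is not the library's implementation: `describe_schema` and the `BQ_TYPES` mapping are hypothetical names introduced for illustration, and the real type names and modes may differ.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional, Union, get_args, get_origin, get_type_hints

# Hypothetical mapping for illustration; the library's real mapping may differ.
BQ_TYPES = {str: "STRING", int: "INTEGER", float: "FLOAT", bool: "BOOLEAN"}


def describe_schema(schema_cls):
    """Return (name, bq_type, mode, description) tuples for a dataclass."""
    hints = get_type_hints(schema_cls)
    result = []
    for f in fields(schema_cls):
        python_type, mode = hints[f.name], "REQUIRED"
        origin, args = get_origin(python_type), get_args(python_type)
        if origin is Union and type(None) in args:
            # Optional[X] is Union[X, None]: the column may be NULL.
            python_type = next(a for a in args if a is not type(None))
            mode = "NULLABLE"
        elif origin is list:
            # List[X] maps to a repeated column of X.
            python_type, mode = args[0], "REPEATED"
        result.append(
            (f.name, BQ_TYPES[python_type], mode, f.metadata.get("description"))
        )
    return result


@dataclass
class Schema:
    string_field: str = field(metadata={"description": "This is a STRING field."})
    int_field: Optional[int]
    some_floats: List[float]
    bool_field: bool


for column in describe_schema(Schema):
    print(column)
```

This sketch needs Python 3.8 or newer because of `get_origin` and `get_args`, and it only handles flat schemas with the four scalar types listed in `BQ_TYPES`.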

If you already have tables in your account, you can use the convert-table script to generate a schema from them.

Create your table

Hint: Make sure to have your credentials set:

export GOOGLE_APPLICATION_CREDENTIALS=your_auth.json

Alternatively, you can set the contents of the service file as an environment variable:

export GOOGLE_SERVICE_FILE='{"type": "service_account", ...}'

Now create your table

migrate-tables --project my_project --dataset my_dataset --module-path my_table --apply

Write a row

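from google.cloud import bigquery

# Import paths assumed; adjust them to your project layout and installed version.
from bq_schema import RowTransformer
from my_table import Schema

row = Schema(string_field="foo", int_field=1, some_floats=[1.0, 2.0], bool_field=True)
row_transformer = RowTransformer[Schema](Schema)
# Serialize the dataclass instance into a dictionary that BigQuery accepts.
serialized_row = row_transformer.dataclass_instance_to_bq_row(row)

bigquery_client = bigquery.Client()
table = bigquery_client.get_table("project.dataset.my_table_name")
bigquery_client.insert_rows(table, [serialized_row])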

Validate your code with a type checker like mypy

mypy my_table.py

Read a row

query = "SELECT * FROM project.dataset.my_table_name"
for row in bigquery_client.query(query=query):
    deserialized_row = row_transformer.bq_row_to_dataclass_instance(row)
    assert isinstance(deserialized_row, Schema)
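
The round trip between dataclass instances and row dictionaries can be sketched with the standard library alone. The following is an illustrative approximation, not the code behind `RowTransformer`: `to_row`, `from_row`, `Guest`, and `Address` are hypothetical names, and rows are modelled as plain dictionaries.

```python
from dataclasses import asdict, dataclass, fields, is_dataclass
from typing import List, Optional, get_args, get_type_hints


@dataclass
class Address:
    city: str


@dataclass
class Guest:
    name: str
    address: Address
    scores: List[float]
    age: Optional[int] = None


def to_row(instance):
    """Serialize a dataclass instance (including nested ones) into a dict."""
    return asdict(instance)


def from_row(schema_cls, row):
    """Rebuild a dataclass instance from a dict-like row, recursing into nested dataclasses."""
    hints = get_type_hints(schema_cls)
    kwargs = {}
    for f in fields(schema_cls):
        value = row.get(f.name)
        # Look at the hint itself and at its type arguments (Optional[X], List[X])
        # to find a nested dataclass type, if there is one.
        candidates = [hints[f.name], *get_args(hints[f.name])]
        nested = next((c for c in candidates if is_dataclass(c)), None)
        if nested is not None and isinstance(value, dict):
            value = from_row(nested, value)
        elif nested is not None and isinstance(value, list):
            value = [from_row(nested, item) for item in value]
        kwargs[f.name] = value
    return schema_cls(**kwargs)


guest = Guest(name="Ada", address=Address(city="Berlin"), scores=[1.0, 2.0])
assert from_row(Guest, to_row(guest)) == guest
```

Because the result of `from_row` is typed by the dataclass, a type checker can follow field access on deserialized rows in the same way as on hand-built instances.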

Documentation

Schema definitions

For a full list of supported types check the following schema:

from dataclasses import dataclass, field
from datetime import date, datetime, time
from decimal import Decimal
from typing import List, Optional

from bq_schema.types.type_mapping import Timestamp


@dataclass
class RequiredNestedField:
    int_field: int = field(metadata={"description": "This field is an INT field."})


@dataclass
class RequiredSchema:
    string_field: str = field(metadata={"description": "This field is a STRING field."})
    # Use an annotation (":"), not an assignment ("="), so this becomes a dataclass field.
    string_field_optional: Optional[str]
    bytes_field: bytes
    int_field: int
    float_field: float
    numeric_field: Decimal
    bool_field: bool
    timestamp_field: Timestamp
    date_field: date
    time_field: time
    datetime_field: datetime
    required_nested_field: RequiredNestedField = field(metadata={"description": "This field is a STRUCT field."})
    optional_nested_field: Optional[RequiredNestedField]
    repeated_nested_field: List[RequiredNestedField]

Timestamps

Timestamps are deserialized into datetime objects, due to the nature of the underlying BigQuery client library. To distinguish between DATETIME and TIMESTAMP columns, annotate timestamp fields with the Timestamp type from bq_schema.types.type_mapping. Usage:

from bq_schema.types.type_mapping import Timestamp
from datetime import datetime

the_timestamp = Timestamp(datetime.utcnow())
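
Why a separate type is needed can be shown with a small sketch. It assumes, purely for illustration, that a timestamp type can be modelled with `typing.NewType`; the `Timestamp` and `COLUMN_TYPES` names below are hypothetical stand-ins and say nothing about how the library defines its own type.

```python
from datetime import datetime, timezone
from typing import NewType

# Hypothetical stand-in for bq_schema.types.type_mapping.Timestamp.
Timestamp = NewType("Timestamp", datetime)

# At runtime the wrapped value is still a plain datetime object ...
created_at = Timestamp(datetime(2021, 1, 18, tzinfo=timezone.utc))
assert type(created_at) is datetime

# ... so the column type has to be derived from the annotation, not from the value.
COLUMN_TYPES = {datetime: "DATETIME", Timestamp: "TIMESTAMP"}
print(COLUMN_TYPES[Timestamp])
```

The point of the sketch is that two fields holding identical `datetime` values can still map to different column types, as long as their annotations differ.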

Table definitions

The BigqueryTable class is used for:

Recursive table discovery by our migrate-tables script
Defining table properties such as name and schema

Required properties

name: The name of the table
schema: table schema either as dataclass or a list of schema fields

Optional properties

project: name of the project, can be overridden by the migrate-tables script
dataset: name of the dataset, can be overridden by the migrate-tables script

Versioning tables

Since BigQuery does not allow backwards-incompatible schema changes, you might want to version your schemas.

class MyTable(BigqueryTable):
    name = "my_table_name"
    schema = Schema
    version = "1"

By default, the version is appended to the table name, like so: my_table_name_v1. If you want to override this behaviour, you can implement the full_table_name method.
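
The naming rule can be sketched as follows. `TableDefinition` is a hypothetical stand-in for `BigqueryTable`, and whether the real `full_table_name` is a classmethod or an instance method is an assumption here; check the library source before overriding it.

```python
from typing import Optional


class TableDefinition:
    """Hypothetical stand-in for BigqueryTable, for illustration only."""

    name: str = ""
    version: Optional[str] = None

    @classmethod
    def full_table_name(cls) -> str:
        # Default naming described above: append "_v<version>" when a version is set.
        return f"{cls.name}_v{cls.version}" if cls.version else cls.name


class MyTable(TableDefinition):
    name = "my_table_name"
    version = "1"


class MyArchiveTable(MyTable):
    @classmethod
    def full_table_name(cls) -> str:
        # Custom naming scheme that replaces the default suffix.
        return f"{cls.name}_{cls.version}_archive"


print(MyTable.full_table_name())
print(MyArchiveTable.full_table_name())
```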

Time partitioning

Define time partitioning for your table:

from google.cloud.bigquery import TimePartitioning, TimePartitioningType

class MyTable(BigqueryTable):
    name = "my_table_name"
    schema = Schema
    # Partition the table by day on the given column.
    time_partitioning = TimePartitioning(
        type_=TimePartitioningType.DAY, field="some_column"
    )

Scripts

migrate-tables

This script has two uses:

Check if locally defined schemas are in sync with the schemas in bigquery
If a difference is detected, we try to apply the changes

The script finds all defined tables recursively for a given Python module.
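
Conceptually, the sync check compares the locally defined columns with the ones that exist remotely. The sketch below shows one way such a comparison could look. It is not the script's actual logic: `diff_schemas` and the `(type, mode)` tuples are hypothetical, chosen only to illustrate the idea.

```python
from typing import Dict, List, Tuple

Field = Tuple[str, str]  # (BigQuery type, mode), e.g. ("STRING", "REQUIRED")


def diff_schemas(local: Dict[str, Field], remote: Dict[str, Field]) -> List[str]:
    """List human-readable differences between a local and a remote schema."""
    changes = []
    for name in sorted(local.keys() - remote.keys()):
        changes.append(f"add column {name} {local[name]}")
    for name in sorted(remote.keys() - local.keys()):
        changes.append(f"column {name} exists remotely but not locally")
    for name in sorted(local.keys() & remote.keys()):
        if local[name] != remote[name]:
            changes.append(f"column {name} changed: {remote[name]} -> {local[name]}")
    return changes


local_schema = {"id": ("INTEGER", "REQUIRED"), "note": ("STRING", "NULLABLE")}
remote_schema = {"id": ("INTEGER", "REQUIRED")}
for change in diff_schemas(local_schema, remote_schema):
    print(change)
```

An empty result means both sides agree. Adding a nullable column, as in the example, is the kind of change BigQuery can usually accept, while removed or retyped columns are where table versioning becomes relevant.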

Note: If you have not defined your project and/or dataset in code, you will have to pass them as parameters to the script.

Show the help:

migrate-tables --help

Check if tables are in sync. List all changes.

migrate-tables --module-path module/

If you want the script to fail on a change, add the validate flag. Useful for running inside your CI:

migrate-tables --module-path module/ --validate
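
If your deploy pipeline is driven from Python, the validation step could be wrapped roughly like this. The sketch assumes that migrate-tables exits with a non-zero status when `--validate` detects a change, which is how "fail" is read here. `build_validate_command` and `validate_schemas` are hypothetical helper names.

```python
import subprocess
import sys
from typing import List


def build_validate_command(module_path: str) -> List[str]:
    """Build the migrate-tables call that only validates and applies nothing."""
    return ["migrate-tables", "--module-path", module_path, "--validate"]


def validate_schemas(module_path: str = "module/") -> int:
    """Run the validation and return the exit code of the script."""
    result = subprocess.run(build_validate_command(module_path))
    if result.returncode != 0:
        # Assumed behaviour: a non-zero exit code signals schema drift.
        print("Schemas are out of sync; aborting deploy.", file=sys.stderr)
    return result.returncode


# In a CI entry point you would call: sys.exit(validate_schemas("module/"))
```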

Apply changes

migrate-tables --module-path src/jobs/ --apply

convert-table

If you already have tables in BigQuery, this script prints the corresponding dataclass for you.

Show the help:

convert-table --help

Print a table:

convert-table --project project --dataset scraper --table-name table_name >> schema.py
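
The kind of output to expect can be illustrated by rendering a dataclass definition from a list of column descriptions. This is a simplified sketch, not the script's real code: `render_dataclass` and the `PYTHON_TYPES` mapping are hypothetical, and only a few scalar types are covered.

```python
from typing import List, Tuple

# Hypothetical mapping for illustration; the script supports more types.
PYTHON_TYPES = {"STRING": "str", "INTEGER": "int", "FLOAT": "float", "BOOLEAN": "bool"}


def render_dataclass(class_name: str, columns: List[Tuple[str, str, str]]) -> str:
    """Render dataclass source code from (name, bq_type, mode) tuples."""
    lines = ["@dataclass", f"class {class_name}:"]
    for name, bq_type, mode in columns:
        annotation = PYTHON_TYPES[bq_type]
        if mode == "NULLABLE":
            annotation = f"Optional[{annotation}]"
        elif mode == "REPEATED":
            annotation = f"List[{annotation}]"
        lines.append(f"    {name}: {annotation}")
    return "\n".join(lines)


print(
    render_dataclass(
        "Schema",
        [
            ("string_field", "STRING", "REQUIRED"),
            ("int_field", "INTEGER", "NULLABLE"),
            ("some_floats", "FLOAT", "REPEATED"),
        ],
    )
)
```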

Development

Setting up your dev environment

Clone the project.
Navigate into the cloned project.
Create a virtual environment with Python >= 3.7, using either pipenv or virtualenv:

pipenv --python PYTHON_VERSION
```
$ pipenv --python 3.7
```
or

virtualenv -p /PATH_TO_PYTHON/ /DESIRED_PATH/VENV_NAME
```
$ virtualenv -p /usr/bin/python3.7 placeholder
```
Install flit via pip
```
$ pip install flit
```
Install packages
```
$ flit install --symlink
```

Code quality

Run all code quality checks:

inv check-all

Test

inv test

Lint

inv lint

Types

inv type-check

Code format

inv format-code

Validate code is correctly formatted:

inv check-code-format
