Lightweight data build tool

dbt-light is a library for developing transformations inside a data warehouse. It handles the T in ELT processes by turning select queries templated with Jinja2 into tables and views, without the need to write and apply DDL: object creation and the addition of new columns are handled automatically.

The purpose of a DWH is to store history, and dbt-light does that using highly configurable snapshots that implement SCD2. They can handle full and delta CDC loads, events, and can store relationships.

Example

dbt-light has a simple API that allows creating an object by its name in the project directory:

>>> from dbt_light.model import Model
>>> from dbt_light.snapshot import Snapshot
>>> from dbt_light.seed import Seed

>>> Model('customer_data').materialize()
>>> Seed('customer_ref').materialize()
>>> Snapshot('customer_snapshot').materialize()

Installation

Use pip:

pip install dbt-light

Then execute the module to create a new dbt project:

python -m dbt-light 

This will create ~/.dbt-light/profiles.yaml if it doesn't exist already. This file lists all dbt projects with the paths to their folders as well as connection parameters for the target database (only Postgres is supported at the moment).

Example of profiles.yaml file:

dbt-light_project:
  path: /home/user/dbt-light_project
  adapter: postgres
  dbname: postgres
  host: localhost
  pass: pass
  port: 5432
  user: postgres

This will also create a template dbt-light project folder in your current directory with the following structure:

├── dbt-light_project
│   ├── models
│   ├── incremental_models
│   ├── snapshots
│   ├── seeds
│   ├── macros

Configuration

All configuration is done in YAML files, separately for each type of object (models, snapshots, seeds, sources), in the project directory. There is no need to create these files for object types you don't use. There are three ways to define configuration properties: for all objects at the top of the YAML file, for objects whose names match a pattern (the pattern_name key), and for a specific object (the name key). These methods can be combined; in that case the most specific property wins.

For example, if there's a models.yaml file in the project directory such as:

materialization: table
models:
  - pattern_name: "vw_.*"
    materialization: view
  - name: vw_model
    materialization: table

All models will be created as tables; models whose names match the pattern "vw_.*" will be created as views, except for the model named "vw_model", which will be created as a table.

Basic classes

There are three main classes: Model, Snapshot and Seed. An object is created by passing its name, and if more than one project is defined in profiles.yaml you will also need to pass the dbt_project parameter with the name of the project. All object types have a full_refresh constructor parameter, which is False by default.
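
For example, a minimal sketch (the project name matches the profiles.yaml example above; the model name is an assumption):

>>> from dbt_light.model import Model
>>> # pick the project explicitly and force a full rebuild (DROP CASCADE, see below)
>>> Model('customer_data', dbt_project='dbt-light_project', full_refresh=True).materialize()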

These classes have a materialize method that loads the object into the database. By default, the object (and its schema, if it doesn't exist) is created on the first run, and an insert (sometimes preceded by a truncate) is performed on all subsequent runs. If full_refresh is passed as True, DROP CASCADE is used, which also drops all objects that depend on it. Models with view materialization are dropped with cascade on every run, even if full_refresh is False.

dbt-light handles the addition of new fields without requiring full_refresh to be set to True. full_refresh can therefore be used to recreate objects in order to drop fields that are no longer used (by default, null is inserted into them), or to recreate objects that hold state, such as snapshots and incremental models, although in that case all history will be lost.

Jinja variables can be used to reference another object by its name. dbt-light will interpolate the schema when rendering the object.

Models

Model configuration is specified in models.yaml, but some properties are implied. For example, the target schema is defined by the subfolder of models/ or incremental_models/ in which the SQL query resides, and whether a model is incremental is defined by the parent folder name (models or incremental_models). By default, a model is created as a table unless the materialization property is set to view in models.yaml.
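
A hypothetical layout illustrating this (schema and model names are assumptions):

├── dbt-light_project
│   ├── models
│   │   ├── stg                       # target schema "stg"
│   │   │   └── customer_data.sql     # regular model, created as a table by default
│   ├── incremental_models
│   │   ├── dds                       # target schema "dds"
│   │   │   └── customer_orders.sql   # incremental model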

For incremental models there is a way to filter the new rows added to them. The Jinja variable model_exist can be used to determine whether the model already exists, and the model itself can be referenced with the Jinja variable this. Incremental models can have a sequence key, created when the incr_key property is defined for the model, which can be used to generate integer surrogate keys; however, hash keys are usually a better choice.

For example:

select * from {{ some_model }}
{% if model_exist %}
	where customer_id not in (select customer_id from {{ this }})
{% endif %}

Snapshots

Snapshots implement SCD2 (Slowly Changing Dimensions) to track how rows change over time. If you're building a snapshot on top of another object, you can set the model property in snapshots.yaml; that way you don't need to create a .sql file for that snapshot. Because of this, unlike models, the target schema is defined in the configuration file with the target_schema key rather than implied by the subfolder name. The unique key or keys should be defined with key_fields for each snapshot. target_schema and key_fields are the only two required properties, but a .sql file must exist in snapshots/ or model must be defined, otherwise NoInputSpecifiedError is raised.
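
A minimal snapshots.yaml sketch, assuming the same layout as models.yaml (the names are examples, and key_fields may also accept a list of columns):

snapshots:
  - name: customer_snapshot
    target_schema: dds
    key_fields: customer_id
    model: customer_data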

The materialize method can be used to load a snapshot, as can delta_calc and delta_apply if there is a need to calculate and apply changes in separate steps (see the usage sketch after the list of diff types below). delta_calc creates a delta table containing all changes, and delta_apply loads the changes from that table. To use these methods, delta_schema and delta_table should be defined for the snapshot; otherwise a temporary table is used when calculating the delta, and it will no longer exist when delta_apply is called. The delta table has some additional columns, including processed_dttm, hash_diff (if there are data columns) and diff_type, which indicates the type of change. There are four diff types:

  • NO - new object
  • NV - new value of existing object
  • DO - deleted object
  • DU - direct update (when the source timestamp hasn't changed but the hash sum has)
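
A minimal usage sketch, assuming a snapshot named customer_snapshot with delta_schema and delta_table configured:

>>> from dbt_light.snapshot import Snapshot
>>> snapshot = Snapshot('customer_snapshot')
>>> snapshot.delta_calc()   # builds the delta table with a diff_type for every change
>>> snapshot.delta_apply()  # applies the calculated changes to the snapshot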

There are two ways of detecting changes: check (the default) and timestamp.

The timestamp strategy detects changes by comparing the updated_at_field timestamp, which indicates whether a row has changed in the source system. It gives better performance but should be used only if the timestamp is reliable: if changes were made but updated_at_field didn't change, a new version won't be created, and if the timestamp changed but no actual changes were made, a new version will be created anyway.

The check strategy detects changes by comparing hash sums of the data columns. By default all data columns are used, but you can list columns in data_fields or set ignored_data_fields to exclude some of them. The hash sum is stored in a column named hash_diff, but the name can be changed with hash_diff_field. dbt-light uses md5 to calculate the hash sum, but you can calculate it in advance using your own method: if hash_diff_field is present in the input table, it will be used instead of calculating a new one.
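
For instance, a hypothetical staging query could precompute the hash itself (column names and the concatenation scheme are assumptions; any deterministic scheme works as long as it stays consistent between loads):

select
    customer_id,
    customer_name,
    customer_status,
    md5(coalesce(customer_name, '') || '#' || coalesce(customer_status, '')) as hash_diff
from {{ some_model }}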

There can be no data columns at all. If you're modeling a many-to-many relationship, you may find it useful to create a snapshot just to see when the relationship started and ended (i.e. when the row was deleted from the source). In this case hash_diff_field will not be created.

The check strategy can also be used together with updated_at_field. In this case you avoid some of the problems with unreliable timestamps that occur when using the timestamp strategy: a new version will not be created if updated_at_field changed but the data columns did not, and if changes were made but updated_at_field didn't change, the row is updated directly without creating a new version (the DU diff type in the delta table).

If updated_at_field is not specified, then processed_field (processed_dttm by default) is used to build date intervals from the current timestamp. If you want to use another timestamp, for example the time a row was extracted from the source to a staging area, you can specify it in the processed_field property and it will be used instead of the current datetime. updated_at_field can also be used for this purpose, but that is less clear since updated_at_field is meant for source timestamps.

Rows that are absent from the input table but exist in the snapshot are invalidated by default, unless invalidate_hard_deletes is set to False. processed_field or updated_at_field will be used if deleted_flg is defined, which tells us exactly when a row was deleted from the source. null is used for effective records, but you can specify a timestamp in max_dttm instead. The fields that hold date intervals are called effective_from_dttm and effective_to_dttm, but that can be changed with start_field and end_field.
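
A snapshots.yaml sketch overriding the interval fields (the names and the max_dttm timestamp format are assumptions):

snapshots:
  - name: customer_snapshot
    target_schema: dds
    key_fields: customer_id
    start_field: valid_from_dttm
    end_field: valid_to_dttm
    max_dttm: '5999-12-31 23:59:59'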

The timestamp strategy can be used to load data from a CDC source. If you load data from a table containing Change Data Capture events like Insert, Update and Delete, dbt-light will handle situations such as multiple versions of the same row in one portion of data, and deletions based on Delete events rather than on the absence of a row. For this to work you must specify deleted_flg, where dbt-light will look for Delete events. By default it checks for 'Delete' values, but that can be changed with the deleted_flg_val property. You might also need to set invalidate_hard_deletes to False so that rows are not invalidated based on their absence from the data portion. deleted_flg can be used not only for CDC data but also for invalidating rows based on some value: for example, if you have an is_active column where the value N means the row was deleted from the source, it should be invalidated in the snapshot instead of creating a new version.
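
A hypothetical snapshots.yaml entry for a CDC source (names and values are assumptions):

snapshots:
  - name: customer_cdc_snapshot
    target_schema: dds
    key_fields: customer_id
    updated_at_field: cdc_event_dttm
    deleted_flg: cdc_operation
    deleted_flg_val: D
    invalidate_hard_deletes: False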

Seeds

Seeds are csv files containing static data that rarely changes. Configuration is specified in seeds.yaml, but the schema is implied by the subfolder in seeds/. By default a comma is used as the delimiter, but it can be changed with the delimiter property.

Column names and types should be specified under columns for each seed:

seeds:
  - name: seed
    columns:
      - name: country_code
        type: varchar(2)
      - name: country_name
        type: varchar(32)

Sources

Sources are tables containing data loaded by your DI tools. They are used to interpolate the source schema in models that reference them. Each source has a name, schema and tables. Optionally, quoting can be specified either for the source or for a specific table; it is False by default.

With the configuration below, you can use {{ some_source.some_table }} to reference the source:

sources:
  - name: some_source
    schema: etl_stg
    tables:
      - name: some_table
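
A hypothetical model referencing that source (column names are assumptions):

select some_id, some_value
from {{ some_source.some_table }}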

Tests

Tests are assertions that you make about your data. They are used instead of constraints to alert you that something went wrong and, optionally, to roll back the load. Tests are executed right after the object is materialized; if on_test_fail is error_with_rollback a rollback is performed, and if it is error, no rollback is done.

Tests can be defined for Models, Snapshots and Seeds under the columns key:

models:
  - name: some_table
    on_test_fail: error_with_rollback
    columns:
      - name: some_id
        tests:
          - not_null
          - unique
      - name: some_type
        tests:
          - accepted_values:
              values: ["1type", "2type"]
      - name: some_fk
        tests:
          - relationships:
              to: other_model
              field: other_id

All existing tests are shown in the example above. Custom tests are not yet supported.

Macros

You can use Jinja macros in models and snapshots. They are defined as ordinary Jinja macros and stored in macros/:

{% macro surrogate_key(bus_key, sur_key) -%}
    case when {{ bus_key }} is not null then source_system_cd || '_' || row_id else null end as {{ sur_key }}
{%- endmacro %}
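
A hypothetical call from a model might look like this (it assumes the input table has customer_id, source_system_cd and row_id columns):

select
    {{ surrogate_key('customer_id', 'customer_key') }},
    customer_name
from {{ some_model }}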

Statement

The statement function is a way to run a select query from a template and return the results back to the Jinja context. statement returns either a list, if only one column was selected, or a list of tuples where each tuple represents a row:

{% set country_codes = statement('select country_code from {{ some_seed }}') %}

select * from {{ some_model }} where country_code in (
{% for cc in country_codes %}
	'{{ cc }}'
	{{ ', ' if not loop.last }}
{% endfor %})
