The Greenplum adapter plugin for dbt (data build tool)
Project description
dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
dbt is the T in ELT. Organize, cleanse, denormalize, filter, rename, and pre-aggregate the raw data in your warehouse so that it's ready for analysis.
dbt-greenplum
The dbt-greenplum package contains the code enabling dbt to work with Greenplum. The adapter is based on the Postgres adapter, with a few differences for Greenplum-specific features.
Installation
The easiest way to start using dbt-greenplum is to install it with pip:
pip install dbt-greenplum==<version>
where <version> matches your dbt version.
Available versions:
- 0.19.2
- 1.0.4
- 1.2.0
- 1.4.0
- 1.5.0
- 1.8.0
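For example, for dbt 1.8.0:

pip install dbt-greenplum==1.8.0

Once installed, the adapter is configured through a regular dbt connection profile. A minimal profiles.yml sketch, assuming the adapter registers the greenplum connection type and reuses the standard Postgres connection fields (all host, credential, and schema values below are illustrative placeholders):

my_greenplum_project:
  target: dev
  outputs:
    dev:
      type: greenplum                        # connection type registered by dbt-greenplum (assumption)
      host: gp-master.example.com            # hypothetical Greenplum master host
      port: 5432
      user: dbt_user
      password: "{{ env_var('DBT_PASSWORD') }}"
      dbname: analytics
      schema: dbt_models
      threads: 4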
Supported Features
You can specify the following settings:
- Storage type:
  - heap
  - appendoptimized
- Distribution:
  - distributed randomly by default
  - distributed by (column, [ ... ] ) by setting the distributed_by parameter in the model config
  - distributed replicated by setting the distributed_replicated=true parameter in the model config (see the sketch after this list)
- Table orientation:
  - orientation=column by default
  - orientation=row by setting the orientation parameter to row in the model config
- Compress type, level, and blocksize, with default values compresstype=ZLIB, compresslevel=1, blocksize=32768. You can also specify blocksize, compresstype, and compresslevel in the model config. The appendoptimized preference is true by default; you can override it by setting the appendoptimized field in the model config
- Partitions (see the "Partitions" chapter below)
- Additional incremental strategy truncate+insert, by setting the incremental_strategy='truncate+insert' parameter in the model config or +incremental_strategy: truncate_insert in dbt_project.yml
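The examples in this README cover random and column-based distribution; replicated distribution has no example, so here is a minimal sketch, assuming distributed_replicated=true is honored exactly as described above:

{{
    config(
        materialized='table',
        distributed_replicated=true
    )
}}
select 1 as "id"

which should presumably render the create table statement with a DISTRIBUTED REPLICATED clause instead of DISTRIBUTED RANDOMLY.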
Heap table example
To create a heap table, set the appendoptimized parameter to false:
{{
config(
...
materialized='table',
appendoptimized=false
...
)
}}
select 1 as "id"
will produce the following SQL code:
create table "<db_name>"."<schema_name>"."<table_name>"
with (
appendoptimized=false
) as (
select 1 as "id"
)
DISTRIBUTED RANDOMLY;
Appendoptimized table example
You can use appendoptimized or appendonly (legacy) to create an appendoptimized table.
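For the legacy alias, a minimal sketch, assuming appendonly is accepted in the same position as appendoptimized:

{{
    config(
        materialized='table',
        appendonly=true
    )
}}
select 1 as "id"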
The following model definition
{{
config(
materialized='table',
distributed_by='id',
appendoptimized=true,
orientation='column',
compresstype='ZLIB',
compresslevel=1,
blocksize=32768
)
}}
with source_data as (
select 1 as id
union all
select null as id
)
select *
from source_data
will produce the following SQL code:
create table "dvault"."dv"."my_first_dbt_model__dbt_tmp"
with (
appendoptimized=true,
blocksize=32768,
orientation=column,
compresstype=ZLIB,
compresslevel=1
)
as (
with source_data as (
select 1 as id
union all
select null as id
)
select *
from source_data
)
distributed by (id);
alter table "dvault"."dv"."my_first_dbt_model__dbt_tmp" rename to "my_first_dbt_model";
Partitions
Greenplum does not support partitions with the create table as construction, so you need to build the model in two steps:
- create the table schema
- insert the data
To implement partitions in your dbt model, you need to specify the following config parameters:
- fields_string - definition of column names, types, and constraints
- one of the following ways to configure partitions:
  - raw_partition by default
  - partition_type, partition_column, partition_spec
  - partition_type, partition_column, partition_start, partition_end, partition_every
  - partition_type, partition_column, partition_values
- default_partition_name - name of the default partition ('other' by default)
Let's consider examples of defining a model with partitions:
- using the raw_partition parameter:

{% set fields_string %}
    id int4 null,
    incomingdate timestamp NULL
{% endset %}

{% set raw_partition %}
    PARTITION BY RANGE (incomingdate)
    (
        START ('2021-01-01'::timestamp) INCLUSIVE
        END ('2023-01-01'::timestamp) EXCLUSIVE
        EVERY (INTERVAL '1 day'),
        DEFAULT PARTITION extra
    );
{% endset %}

{{
    config(
        materialized='table',
        distributed_by='id',
        appendoptimized=true,
        orientation='column',
        compresstype='ZLIB',
        compresslevel=1,
        blocksize=32768,
        fields_string=fields_string,
        raw_partition=raw_partition,
        default_partition_name='other_data'
    )
}}

with source_data as (
    select 1 as id, '2022-02-22'::timestamp as incomingdate
    union all
    select null as id, '2022-02-25'::timestamp as incomingdate
)
select *
from source_data
will produce the following SQL code:

create table if not exists "database"."schema"."my_first_dbt_model__dbt_tmp"
(
    id int4 null,
    incomingdate timestamp NULL
)
with (
    appendoptimized=true,
    blocksize=32768,
    orientation=column,
    compresstype=ZLIB,
    compresslevel=1
)
DISTRIBUTED BY (id)
PARTITION BY RANGE (incomingdate)
(
    START ('2021-01-01'::timestamp) INCLUSIVE
    END ('2023-01-01'::timestamp) EXCLUSIVE
    EVERY (INTERVAL '1 day'),
    DEFAULT PARTITION extra
);

insert into "database"."schema"."my_first_dbt_model__dbt_tmp" (
    with source_data as (
        select 1 as id, '2022-02-22'::timestamp as incomingdate
        union all
        select null as id, '2022-02-25'::timestamp as incomingdate
    )
    select *
    from source_data
);

alter table "dvault"."dv"."my_first_dbt_model" rename to "my_first_dbt_model__dbt_backup";
drop table if exists "dvault"."dv"."my_first_dbt_model__dbt_backup" cascade;
alter table "database"."schema"."my_first_dbt_model__dbt_tmp" rename to "my_first_dbt_model";
- You can get the same result using the partition_type, partition_column, and partition_spec parameters:

{% set fields_string %}
    id int4 null,
    incomingdate timestamp NULL
{% endset %}

{%- set partition_type = 'RANGE' -%}
{%- set partition_column = 'incomingdate' -%}
{% set partition_spec %}
    START ('2021-01-01'::timestamp) INCLUSIVE
    END ('2023-01-01'::timestamp) EXCLUSIVE
    EVERY (INTERVAL '1 day'),
    DEFAULT PARTITION extra
{% endset %}

{{
    config(
        materialized='table',
        distributed_by='id',
        appendoptimized=true,
        orientation='column',
        compresstype='ZLIB',
        compresslevel=1,
        blocksize=32768,
        fields_string=fields_string,
        partition_type=partition_type,
        partition_column=partition_column,
        partition_spec=partition_spec,
        default_partition_name='other_data'
    )
}}

with source_data as (
    select 1 as id, '2022-02-22'::timestamp as incomingdate
    union all
    select null as id, '2022-02-25'::timestamp as incomingdate
)
select *
from source_data
- Also, you can use a third way:

{% set fields_string %}
    id int4 null,
    incomingdate timestamp NULL
{% endset %}

{%- set partition_type = 'RANGE' -%}
{%- set partition_column = 'incomingdate' -%}
{%- set partition_start = "'2021-01-01'::timestamp" -%}
{%- set partition_end = "'2022-01-01'::timestamp" -%}
{%- set partition_every = '1 day' -%}

{{
    config(
        materialized='table',
        distributed_by='id',
        appendoptimized=true,
        orientation='column',
        compresstype='ZLIB',
        compresslevel=1,
        blocksize=32768,
        fields_string=fields_string,
        partition_type=partition_type,
        partition_column=partition_column,
        partition_start=partition_start,
        partition_end=partition_end,
        partition_every=partition_every,
        default_partition_name='other_data'
    )
}}

with source_data as (
    select 1 as id, '2022-02-22'::timestamp as incomingdate
    union all
    select null as id, '2022-02-25'::timestamp as incomingdate
)
select *
from source_data
- An example of partition_type LIST is coming soon.
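Until that example is published, here is a hypothetical sketch of what a LIST-partitioned model might look like, assuming partition_values accepts the list values as a string (both the parameter format and the rendered DDL are assumptions, not confirmed adapter behavior):

{% set fields_string %}
    id int4 null,
    country text NULL
{% endset %}

{{
    config(
        materialized='table',
        distributed_by='id',
        fields_string=fields_string,
        partition_type='LIST',
        partition_column='country',
        partition_values="'us', 'ru', 'de'",
        default_partition_name='other_data'
    )
}}
select 1 as id, 'us' as country

which would presumably render DDL of the form PARTITION BY LIST (country) ( ... , DEFAULT PARTITION other_data).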
Table partition hints
To check the generated SQL script, use the -d option:
dbt -d run <...> -m <models>
If you want to implement complex partition logic with subpartitions or anything else, use the raw_partition parameter.
Truncate+insert incremental strategy
You can use this incremental strategy to safely reload models without cascade-dropping dependent objects, thanks to Greenplum's transactional implementation of the TRUNCATE operation.
Model definition example:
{{
config(
...
materialized='incremental',
incremental_strategy='truncate+insert',
...
)
}}
select *
from source_data
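Alternatively, as noted in the feature list, the strategy can be enabled project-wide in dbt_project.yml; a minimal sketch (the project and folder names are illustrative):

models:
  my_project:
    staging:
      +materialized: incremental
      +incremental_strategy: truncate_insert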
Getting started
- Install dbt
- Read the introduction and viewpoint
Join the dbt Community
- Be part of the conversation in the dbt Community Slack
- Read more on the dbt Community Discourse
Reporting bugs and contributing code
- Want to report a bug or request a feature? Let us know on Slack, or open an issue
- Want to help us build dbt? Check out the Contributing Guide
Code of Conduct
Everyone interacting in the dbt project's codebases, issue trackers, chat rooms, and mailing lists is expected to follow the dbt Code of Conduct.