The ByteHouse plugin for dbt (data build tool)


Introduction

dbt (data build tool) is an open-source tool that enables data analysts and engineers to transform data in their warehouses simply by writing SELECT statements. dbt performs the T (Transform) of ELT and allows companies to write transformations as queries and orchestrate them in a more efficient way. The ByteHouse dbt connector is a plugin enabling users to build their data warehouse ecosystem with dbt and ByteHouse.

Requirements

Make sure you have dbt & Python installed on your machine. If not, you can follow this installation guide: https://docs.getdbt.com/docs/get-started/installation

  • dbt v1.7.0 or greater
  • python v3.7 or greater
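You can check the installed versions from your terminal:

dbt --version
python --version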

Creating ByteHouse Account

You need to create a ByteHouse account in order to use the bytehouse-dbt connector. You can simply create a free account with the process mentioned in our official website documentation: https://docs.bytehouse.cloud/en/docs/quick-start

You can also create a ByteHouse account through Volcano Engine by ByteDance: https://www.volcengine.com/product/bytehouse-cloud

Installation

Create a new directory where we will instantiate a Python virtual environment.

mkdir dbt_bytehouse_demo
cd dbt_bytehouse_demo

python -m venv venv
source venv/bin/activate

The latest release can be installed from PyPI:

pip install dbt-bytehouse

The current development version can be installed from GitHub:

pip install git+https://github.com/bytehouse-cloud/bytehouse-dbt@master#egg=dbt-bytehouse

Check whether the installation was successful by verifying that bytehouse appears under Plugins in the output of:

dbt --version


dbt Project Setup

dbt_project.yml

Every dbt project needs a dbt_project.yml file; this is how dbt knows a directory is a dbt project. The dbt_project.yml file holds the context of your project and tells dbt how to build your data sets. Some common configurations for dbt_project.yml are:

YAML key         Value
name             Your project's name in snake case
version          Version of your project
profile          The profile dbt uses to connect to ByteHouse
model-paths      Directories where your model and source files live
seed-paths       Directories where your seed files live
test-paths       Directories where your test files live
snapshot-paths   Directories where your snapshots live
docs-paths       Directories where your docs blocks live
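
For example, a minimal dbt_project.yml for this walkthrough might look like the following (the project and profile names are placeholders; align them with your own setup):

name: 'dbt_bytehouse_demo'
version: '1.0.0'
profile: 'bytehouse_profile'

model-paths: ["models"]
seed-paths: ["seeds"]
test-paths: ["tests"]
snapshot-paths: ["snapshots"]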

profiles.yml

When you invoke dbt from the command line, dbt parses your dbt_project.yml and obtains the profile name. dbt then checks your profiles.yml file for a profile with the same name. A profile contains all the details/credentials required to connect to ByteHouse. dbt will search the current working directory for the profiles.yml file and will default to the ~/.dbt/ directory if not found.

<profile-name>:
  target: <target-name>
  outputs:
    <target-name>:
      type: bytehouse
      schema: <database-name>
      user: bytehouse
      password: <bytehouse-api-key>
      driver: native
      #optional fields
      host: <hostname>
      port: <port>
      region: <region-name>
      warehouse: <warehouse-name>
      retries: 1
      secure: True
      connect_timeout: 10
      send_receive_timeout: 300
      custom_settings: <empty>
YAML key               Value
<profile-name>         Name of the profile; must match the profile name indicated in your dbt_project.yml file
target                 Default target your dbt project will use; must be one of the targets defined in your profile
type                   Must be set to bytehouse
schema                 Database name
user                   Must be set to bytehouse
password               ByteHouse API token
driver                 Must be set to native
host                   [Optional] The host name of the connection
port                   [Optional] The port number of the host server
region                 [Optional] The name of the region to connect to
warehouse              [Optional] The name of the virtual warehouse to use for this session
retries                [Optional] Number of times to retry the initial connection attempt if the error appears to be recoverable
secure                 [Optional] Whether the connection is secured by TLS; recommended to set to True
connect_timeout        [Optional] Connection timeout in seconds; the default is 10 seconds
send_receive_timeout   [Optional] Timeout for receiving data from or sending data to ByteHouse; the default is 5 minutes (300 seconds)
custom_settings        [Optional] A mapping of ByteHouse-specific user settings to use with the connection

Connection & Authentication Configurations

Host Address & API Key Configuration

Required parameters: host, port, user, password

bytehouse_profile:
  target: dev
  outputs:
    dev:
      type: bytehouse
      driver: native

      # database
      schema: $DATABASE_NAME

      # target server address
      host: $HOST_ADDRESS  
      port: $PORT_NUMBER

      # account credentials
      user: bytehouse
      password: $API_KEY

      # additional settings
      secure: True

Project Initialization

The dbt init command will prompt for a project name & a database adapter, where you have to select bytehouse. This will create a new folder with your project name, sample files & a dbt_project.yml config file, enough to get you started with dbt.

dbt init

Update your profiles.yml with the required authentication & target server credentials.

nano ~/.dbt/profiles.yml

As dbt has created a new folder with the same name as your project, change your current directory to the project folder.

cd dbt_bytehouse_demo

Update your dbt_project.yml file so that its profile matches the profile name in ~/.dbt/profiles.yml.

nano dbt_project.yml


Test Warehouse Connection

Use the dbt debug command to verify the required dependencies & the warehouse connection. On success, it will show "All checks passed!"

dbt debug


Dataset Ingestion

To showcase different dbt functionalities, we will ingest a small IMDB movie dataset with the following schema. The DDL & insertion queries can be found at https://github.com/bytehouse-cloud/bytehouse-dbt/examples/data_loading.sql. You can use a ByteHouse SQL worksheet to create the schema & insert the dataset. To verify that dataset preparation was successful, we will execute the following query, which summarizes each actor along with their total number of movie appearances.

SELECT id,
       any(actor_name) AS name,
       uniqExact(movie_id) AS num_movies,
       avg(rating) AS avg_rating,
       max(created_at) AS updated_at
FROM (
  SELECT actors.id AS id,
         concat(actors.first_name, ' ', actors.last_name) AS actor_name,
         movies.id AS movie_id,
         movies.rating AS rating,
         created_at
  FROM imdb.actors
       JOIN imdb.roles ON roles.actor_id = actors.id
       LEFT OUTER JOIN imdb.movies ON movies.id = roles.movie_id
)
GROUP BY id

The result set should look like this:

dbt Models

In a dbt project, a model is a SQL file located inside the models/ directory which contains a SELECT statement representing a transformation. The name of the model file becomes the name of the resulting table/view after dbt execution. When we execute the dbt run command, dbt builds this model directly in ByteHouse by wrapping it in a CREATE TABLE / CREATE VIEW materialization. The materialization type of your model determines the actual SQL that dbt uses to create the model in the warehouse.

schema.yml

The schema.yml file defines our tables & columns by referring to them through alias names. These schemas can later be used in different models/macros via the source() function. Create a schema.yml file under the models/ directory:

touch models/schema.yml

Define our source schema like this:

version: 2

sources:
- name: imdb
  tables:
  - name: actors
  - name: roles
  - name: movies
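
Models & macros can then reference these tables through the source() function. For example:

SELECT * FROM {{ source('imdb', 'actors') }}

compiles to a query against imdb.actors.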

Materialization types of Models

View Materializations

In the case of view materialization, a model is recreated as a view on each run via a CREATE VIEW AS statement in ByteHouse. View materializations don't store the actual data, so they are slower to query than table materializations. Let's create models/actors_insight.sql as a view materialization.

touch models/actors_insight.sql
{{ config(materialized='view') }}

SELECT id,
       any(actor_name) AS name,
       uniqExact(movie_id) AS num_movies,
       avg(rating) AS avg_rating,
       max(created_at) AS updated_at
FROM (
  SELECT actors.id AS id,
         concat(actors.first_name, ' ', actors.last_name) AS actor_name,
         movies.id AS movie_id,
         movies.rating AS rating,
         created_at
  FROM {{ source('imdb', 'actors') }}
       JOIN {{ source('imdb', 'roles') }} ON roles.actor_id = actors.id
       LEFT OUTER JOIN {{ source('imdb', 'movies') }} ON movies.id = roles.movie_id
)
GROUP BY id

Let's execute the dbt run command to build this model in ByteHouse:
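dbt run

Querying this view, we can replicate the results of our earlier query with a simpler syntax: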

SELECT * FROM imdb.actors_insight ORDER BY num_movies DESC;


Table Materializations

In the case of table materialization, your model is rebuilt as a table on each dbt run via a CREATE TABLE AS statement.

touch models/actors_insight_table.sql

We can reuse our previous view materialization SQL, changing only the config line for table materialization:

{{ config(order_by='(updated_at, id, name)', materialized='table') }}
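
After updating the model file, rebuild it:

dbt run -m models/actors_insight_table.sql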

We can verify that both view & table materializations generate the same response.

SELECT * FROM imdb.actors_insight_table ORDER BY num_movies DESC;

Incremental Materializations

For our previous table materialization, dbt will construct a table every time to materialize the model. For larger or complex transformations, this would be redundant and costly in terms of computing power. Incremental materializations solve this problem.
The first time a model is run, the table is built by transforming all rows of source data. On subsequent runs, dbt transforms only the rows in your source data that you tell dbt to filter for, inserting them into the target table which is the table that has already been built.
To tell dbt which rows it should transform on an incremental run, wrap valid SQL that filters for these rows in the is_incremental() macro. Your is_incremental() code will check for rows created or modified since the last time dbt ran this model.

touch models/actors_insight_incremental.sql
{{ config(order_by='(updated_at, id, name)', materialized='incremental') }}

SELECT id,
       any(actor_name) AS name,
       uniqExact(movie_id) AS num_movies,
       avg(rating) AS avg_rating,
       max(created_at) AS updated_at
FROM (
  SELECT actors.id AS id,
         concat(actors.first_name, ' ', actors.last_name) AS actor_name,
         movies.id AS movie_id,
         movies.rating AS rating,
         created_at
  FROM {{ source('imdb', 'actors') }}
       JOIN {{ source('imdb', 'roles') }} ON roles.actor_id = actors.id
       LEFT OUTER JOIN {{ source('imdb', 'movies') }} ON movies.id = roles.movie_id

  {% if is_incremental() %}

  -- this filter will only be applied on an incremental run
  WHERE actors.id > (SELECT max(id) FROM {{ this }}) OR created_at > (SELECT max(updated_at) FROM {{ this }})

  {% endif %}
)
GROUP BY id

We can verify that view, table & incremental materializations all generate the same response.

dbt run -m models/actors_insight_incremental.sql
SELECT * FROM imdb.actors_insight_incremental ORDER BY num_movies DESC;

Let's insert a few more rows to demonstrate the power of incremental materializations. After the incremental run, the actor with the most appearances should be 'Chris Pratt'.

INSERT INTO imdb.movies VALUES (9, 'Passengers', 2016, 7);
INSERT INTO imdb.movies VALUES (10, 'The Tomorrow War', 2021, 6.5);

INSERT INTO imdb.roles (actor_id, movie_id, role_name) VALUES(4, 9, 'Jim Preston');
INSERT INTO imdb.roles (actor_id, movie_id, role_name) VALUES(4, 10, 'Dan Forester');
dbt run -m models/actors_insight_incremental.sql
SELECT * FROM imdb.actors_insight_incremental ORDER BY num_movies DESC;


How it works

  1. dbt first creates a temporary table named actors_insight_incremental_tmp & inserts all rows that pass our is_incremental() filter.
  2. A new table actors_insight_incremental_new is created & the rows from the old table actors_insight_incremental are ingested into it. dbt makes sure that the unique_key constraint (if one is declared in the config) is maintained, by skipping rows that have the same unique_key as rows in the temporary table.
  3. The rows from the temporary table are then ingested into the new table.
  4. The previous table (actors_insight_incremental) & the new table (actors_insight_incremental_new) are exchanged, as sketched below.
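
In ClickHouse-flavored SQL, the sequence roughly resembles this sketch (the actual statements are generated by the adapter and may differ in detail):

-- 1. capture the rows that pass the is_incremental() filter
CREATE TABLE actors_insight_incremental_tmp AS
SELECT ...  -- the model SQL with the incremental filter applied

-- 2. create a new table with the same structure & copy the existing rows
CREATE TABLE actors_insight_incremental_new AS actors_insight_incremental
INSERT INTO actors_insight_incremental_new SELECT * FROM actors_insight_incremental

-- 3. append the freshly transformed rows from the temporary table
INSERT INTO actors_insight_incremental_new SELECT * FROM actors_insight_incremental_tmp

-- 4. swap the new table into place under the original name
EXCHANGE TABLES actors_insight_incremental_new AND actors_insight_incremental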

Direct Write to CNCH Unique Key Table

Requirements

  1. The dbt-bytehouse connector version must be at least 1.7.1.
  2. Set the config name cnch_unique_key instead of unique_key.
  3. Set the config incremental_strategy to append.
  4. Append the virtual warehouse id at the end of the dbt model with a SQL SETTINGS expression, for example SETTINGS virtual_warehouse='vw-id', virtual_warehouse_write='vw-id'.

Example

{{
    config(
        materialized='incremental',
        cnch_unique_key='id',
        incremental_strategy='append'    
    )
}}

SELECT
    id,
    transaction_date,
    user_id,
    amount,
    status
FROM 
    dbt_unique.transactions 

{% if is_incremental() %}

  WHERE transaction_date >= (SELECT max(transaction_date) FROM {{ this }})

{% endif %}

SETTINGS virtual_warehouse='vw-id', virtual_warehouse_write='vw-id'

Project Documentation

dbt provides a way to generate documentation for your dbt project and render it as a website. Create models/actors_insight_incremental.yml to generate documentation for our models.

version: 2

models:
  - name: actors_insight_incremental
    description: Actor appearance summary based on roles
    columns:
      - name: id
        description: The id of the actor
      - name: name
        description: The name of the actor
      - name: num_movies
        description: The number of movies the actor has appeared in
      - name: avg_rating
        description: Average rating
      - name: updated_at
        description: Latest update time

Use dbt docs generate to generate the documentation for your models & dbt docs serve to serve the documentation in your local browser on port 8000.
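
That is:

dbt docs generate
dbt docs serve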

Local Development

Update the tests/integration/conftest.py file to include your connection credentials. For running tests locally, follow these steps:

pip install -r dev_requirements.txt
python -m pytest

Original Author

ByteHouse wants to thank ClickHouse for the original contribution to this connector.

License

This project is distributed under the terms of the Apache License (Version 2.0).
