The ByteHouse plugin for dbt (data build tool)
Introduction
dbt (data build tool) is an open-source tool that enables data analysts and engineers to transform data in their warehouses simply by writing SELECT statements. dbt performs the T (Transform) of ETL and allows companies to write transformations as queries and orchestrate them in a more efficient way. The ByteHouse dbt connector is a plugin enabling users to build their data warehouse ecosystem with dbt and ByteHouse.
Table of Contents
- Introduction
- Requirements
- Creating ByteHouse Account
- Installation
- dbt Project Setup
- Dataset Ingestion
- dbt Models
- schema.yml
- Materialization types of Models
- Project Documentation
- Local Development
- Original Author
- License
Requirements
Make sure you have dbt & python installed on your machine. If not, you can follow this guide: https://docs.getdbt.com/docs/get-started/installation
- dbt v1.7.0 or greater
- python v3.7 or greater
Creating ByteHouse Account
You need to create a ByteHouse account in order to use the bytehouse-dbt connector. You can simply create a free account with the process mentioned in our official website documentation: https://docs.bytehouse.cloud/en/docs/quick-start
You can also create a ByteHouse account through Volcano Engine by ByteDance: https://www.volcengine.com/product/bytehouse-cloud
Installation
Create a new directory where we will instantiate a Python virtual environment.
mkdir dbt_bytehouse_demo
cd dbt_bytehouse_demo
python -m venv venv
source venv/bin/activate
The latest release version can be installed from PyPI:
pip install dbt-bytehouse
The current development version can be installed from GitHub:
pip install git+https://github.com/bytehouse-cloud/bytehouse-dbt@master#egg=dbt-bytehouse
Check whether the installation was successful by verifying that bytehouse is available under Plugins in the output of:
dbt --version
dbt Project Setup
dbt_project.yml
Every dbt project needs a dbt_project.yml file; this is how dbt knows a directory is a dbt project. The dbt_project.yml file holds the context of your project and tells dbt how to build your data sets. Some common configurations for dbt_project.yml are:
| YAML key | Value |
|---|---|
| name | Your project's name in snake case |
| version | Version of your project |
| profile | The profile dbt uses to connect to ByteHouse |
| model-paths | Directories where your model and source files live |
| seed-paths | Directories where your seed files live |
| test-paths | Directories where your test files live |
| snapshot-paths | Directories where your snapshots live |
| docs-paths | Directories where your docs blocks live |
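For illustration, a minimal dbt_project.yml for this demo might look like the following sketch (the project and profile names are assumptions that match the examples used later in this guide):

name: dbt_bytehouse_demo
version: '1.0.0'
profile: bytehouse_profile

model-paths: ["models"]
seed-paths: ["seeds"]
test-paths: ["tests"]
snapshot-paths: ["snapshots"]
docs-paths: ["docs"]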
profiles.yml
When you invoke dbt from the command line, dbt parses your dbt_project.yml and obtains the profile name. dbt then checks your profiles.yml file for a profile with the same name. A profile contains all the details/credentials required to connect to ByteHouse. dbt will search the current working directory for the profiles.yml file and will default to the ~/.dbt/ directory if not found.
<profile-name>:
  target: <target-name>
  outputs:
    <target-name>:
      type: bytehouse
      schema: <database-name>
      user: bytehouse
      password: <bytehouse-api-key>
      driver: native
      # optional fields
      host: <hostname>
      port: <port>
      region: <region-name>
      warehouse: <warehouse-name>
      retries: 1
      secure: True
      connect_timeout: 10
      send_receive_timeout: 300
      custom_settings: <empty>
| YAML key | Value |
|---|---|
| <profile-name> | Name of the profile. Has to be the same name as the profile indicated in your dbt_project.yml file |
| target | Default target your dbt project will use. It must be one of the targets you define in your profile |
| type | Must be set to bytehouse |
| schema | Database name |
| user | Must be set to bytehouse |
| password | ByteHouse API Token |
| driver | Must be set to native |
| host | [Optional] The host name of the connection |
| port | [Optional] The port number of the host server |
| warehouse | [Optional] The name of the virtual warehouse that you want to use for this session |
| retries | [Optional] Number of times to retry the initial connection attempt if the error appears to be recoverable |
| secure | [Optional] Whether the connection is secured by TLS. Suggested to set it to True |
| connect_timeout | [Optional] Connection timeout in seconds. Default is 10 seconds |
| send_receive_timeout | [Optional] Timeout for receiving data from or sending data to ByteHouse. Default is 5 minutes (300 seconds) |
| custom_settings | [Optional] A mapping of ByteHouse-specific user settings to use with the connection |
Connection & Authentication Configurations
Host Address & API Key Configuration
Required parameters: host, port, user, password
bytehouse_profile:
  target: dev
  outputs:
    dev:
      type: bytehouse
      driver: native

      # database
      schema: $DATABASE_NAME

      # target server address
      host: $HOST_ADDRESS
      port: $PORT_NUMBER

      # account credentials
      user: bytehouse
      password: $API_KEY

      # additional settings
      secure: True
Project Initialization
The dbt init command will prompt for a project name & database adapter, where you have to select bytehouse. This will create a new folder with your project name, sample files & a dbt_project.yml config file, enough to get you started with dbt.
dbt init
Update your profiles.yml with the required authentication & target server credentials.
nano ~/.dbt/profiles.yml
As dbt has created a new folder with the same name as your project, change your current directory to the project folder.
cd dbt_bytehouse_demo
Update your dbt_project.yml file so that its profile matches the profile name in ~/.dbt/profiles.yml.
nano dbt_project.yml
Test Warehouse Connection
Use the dbt debug command to verify the required dependencies & the warehouse connection. On success, it will show "All checks passed!".
dbt debug
Dataset Ingestion
To showcase different dbt functionalities, we will ingest a small IMDB movie dataset consisting of three tables: actors, roles & movies. The DDL & insertion queries can be found here: https://github.com/bytehouse-cloud/bytehouse-dbt/examples/data_loading.sql. You can use a ByteHouse SQL worksheet to create the schema & insert the dataset.
To verify that the dataset preparation was successful, we will execute the following query to summarize each actor along with their total number of movie appearances.
SELECT id,
any(actor_name) as name,
uniqExact(movie_id) as num_movies,
avg(rating) as avg_rating,
max(created_at) as updated_at
FROM (
SELECT actors.id as id,
concat(actors.first_name, ' ', actors.last_name) as actor_name,
movies.id as movie_id,
movies.rating as rating,
created_at
FROM imdb.actors
JOIN imdb.roles ON roles.actor_id = actors.id
LEFT OUTER JOIN imdb.movies ON movies.id = roles.movie_id
)
GROUP BY id
The result set should list each actor with their name, number of movies, average rating & latest update time.
dbt Models
In a dbt project, a model is a SQL file located inside the models/ directory which contains a SELECT statement referring to a transformation. The name of the model file becomes the name of the future table/view after dbt execution. When we execute the dbt run command, dbt builds this model directly into ByteHouse by wrapping it in a CREATE TABLE / CREATE VIEW materialization. The materialization type of your model determines the actual SQL that dbt will use to create the model in the warehouse.
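For instance, a minimal hypothetical model (the file name and the movies column names are assumptions for illustration) is nothing more than a SELECT statement with an optional config header:

-- models/recent_movies.sql (hypothetical example)
{{ config(materialized='table') }}

SELECT id,
       name,
       year
FROM imdb.movies
WHERE year >= 2000

-- On dbt run, with materialized='table', dbt wraps this roughly as:
--   CREATE TABLE <schema>.recent_movies AS SELECT ...

The source() function introduced in the next section is the preferred way to reference the imdb tables, rather than hardcoding them as above.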
schema.yml
The schema.yml file defines our tables & columns by alias name. These sources can later be used in different models/macros via the source() function.
Create the schema.yml file under the models/ directory:
touch models/schema.yml
Define our model schema like this:
version: 2

sources:
  - name: imdb
    tables:
      - name: actors
      - name: roles
      - name: movies
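With these sources declared, models & macros can reference the tables through the source() function rather than hardcoding database names. For example, this illustrative expression:

SELECT count(*) FROM {{ source('imdb', 'actors') }}

compiles to SELECT count(*) FROM imdb.actors.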
Materialization types of Models
View Materializations
In the case of view materialization, a model is transformed into a view on each run via a CREATE VIEW AS statement in ByteHouse. View materializations don't store the actual data, so they are slower to query than table materializations.
Let's create models/actors_insight.sql as a view materialization.
touch models/actors_insight.sql
{{ config(materialized='view') }}
SELECT id,
any(actor_name) as name,
uniqExact(movie_id) as num_movies,
avg(rating) as avg_rating,
max(created_at) as updated_at
FROM (
SELECT actors.id as id,
concat(actors.first_name, ' ', actors.last_name) as actor_name,
movies.id as movie_id,
movies.rating as rating,
created_at
FROM {{ source('imdb', 'actors') }}
JOIN {{ source('imdb', 'roles') }} ON roles.actor_id = actors.id
LEFT OUTER JOIN {{ source('imdb', 'movies') }} ON movies.id = roles.movie_id
)
GROUP BY id
Let's execute the dbt run command to build this model in ByteHouse. Querying this view, we can replicate the results of our earlier query with a simpler syntax:
SELECT * FROM imdb.actors_insight ORDER BY num_movies DESC;
Table Materializations
In the case of table materialization, your model is rebuilt as a table on each dbt run via a CREATE TABLE AS statement.
touch models/actors_insight_table.sql
We can reuse our previous view materialization SQL, changing only the config line:
{{ config(order_by='(updated_at, id, name)', materialized='table') }}
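The rest of models/actors_insight_table.sql repeats the SELECT from actors_insight.sql unchanged:

SELECT id,
       any(actor_name) as name,
       uniqExact(movie_id) as num_movies,
       avg(rating) as avg_rating,
       max(created_at) as updated_at
FROM (
    SELECT actors.id as id,
           concat(actors.first_name, ' ', actors.last_name) as actor_name,
           movies.id as movie_id,
           movies.rating as rating,
           created_at
    FROM {{ source('imdb', 'actors') }}
    JOIN {{ source('imdb', 'roles') }} ON roles.actor_id = actors.id
    LEFT OUTER JOIN {{ source('imdb', 'movies') }} ON movies.id = roles.movie_id
)
GROUP BY id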
After another dbt run, we can verify that both view & table materializations generate the same response.
SELECT * FROM imdb.actors_insight_table ORDER BY num_movies DESC;
Incremental Materializations
For our previous table materialization, dbt constructs a table every time to materialize the model. For larger or more complex transformations, this would be redundant and costly in terms of computing power. Incremental materializations solve this problem.
The first time a model is run, the table is built by transforming all rows of source data. On subsequent runs, dbt transforms only the rows in your source data that you tell dbt to filter for, inserting them into the target table, i.e. the table that has already been built.
To tell dbt which rows it should transform on an incremental run, wrap valid SQL that filters for these rows in the is_incremental() macro. Your is_incremental() code will check for rows created or modified since the last time dbt ran this model.
touch models/actors_insight_incremental.sql
{{ config(order_by='(updated_at, id, name)', materialized='incremental') }}
SELECT id,
       any(actor_name) as name,
       uniqExact(movie_id) as num_movies,
       avg(rating) as avg_rating,
       max(created_at) as updated_at
FROM (
    SELECT actors.id as id,
           concat(actors.first_name, ' ', actors.last_name) as actor_name,
           movies.id as movie_id,
           movies.rating as rating,
           created_at
    FROM {{ source('imdb', 'actors') }}
    JOIN {{ source('imdb', 'roles') }} ON roles.actor_id = actors.id
    LEFT OUTER JOIN {{ source('imdb', 'movies') }} ON movies.id = roles.movie_id
    {% if is_incremental() %}
    -- this filter will only be applied on an incremental run;
    -- it keeps source rows newer than what is already in the target table
    WHERE actors.id > (SELECT max(id) FROM {{ this }})
       OR created_at > (SELECT max(updated_at) FROM {{ this }})
    {% endif %}
)
GROUP BY id
We can verify that the view, table & incremental materializations all generate the same response.
dbt run -m models/actors_insight_incremental.sql
SELECT * FROM imdb.actors_insight_incremental ORDER BY num_movies DESC;
Let's insert a few more rows to demonstrate the power of incremental materializations. After rebuilding the model, the actor with the most appearances should be 'Chris Pratt'.
INSERT INTO imdb.movies VALUES (9, 'Passengers', 2016, 7);
INSERT INTO imdb.movies VALUES (10, 'The Tomorrow War', 2021, 6.5);
INSERT INTO imdb.roles (actor_id, movie_id, role_name) VALUES(4, 9, 'Jim Preston');
INSERT INTO imdb.roles (actor_id, movie_id, role_name) VALUES(4, 10, 'Dan Forester');
dbt run -m models/actors_insight_incremental.sql
SELECT * FROM imdb.actors_insight_incremental ORDER BY num_movies DESC;
How it works
- dbt will first create a temporary table named actors_insight_incremental_tmp & insert all rows which pass our is_incremental() filter.
- A new table actors_insight_incremental_new will be created & rows from the old table actors_insight_incremental will be ingested here. dbt will make sure the unique_key constraint (if any is declared in the config) is maintained, by excluding rows which have the same unique_key as the temporary table.
- The rows from the temporary table will be ingested into the new table.
- Our previous table (actors_insight_incremental) & the new table (actors_insight_incremental_new) will be exchanged, as sketched below.
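A rough SQL equivalent of these steps, assuming a unique_key of id had been configured (illustrative only; the adapter generates the actual statements, which may differ):

-- 1. Stage the rows that pass the is_incremental() filter
CREATE TABLE imdb.actors_insight_incremental_tmp AS imdb.actors_insight_incremental;
INSERT INTO imdb.actors_insight_incremental_tmp
SELECT * FROM imdb.actors_insight_incremental  -- stand-in for the model's filtered SELECT
WHERE updated_at > (SELECT max(updated_at) FROM imdb.actors_insight_incremental);

-- 2. Copy existing rows into a new table, skipping rows whose unique_key
--    (id here) also appears in the temporary table
CREATE TABLE imdb.actors_insight_incremental_new AS imdb.actors_insight_incremental;
INSERT INTO imdb.actors_insight_incremental_new
SELECT * FROM imdb.actors_insight_incremental
WHERE id NOT IN (SELECT id FROM imdb.actors_insight_incremental_tmp);

-- 3. Append the staged rows
INSERT INTO imdb.actors_insight_incremental_new
SELECT * FROM imdb.actors_insight_incremental_tmp;

-- 4. Swap the old & new tables
EXCHANGE TABLES imdb.actors_insight_incremental AND imdb.actors_insight_incremental_new;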
Direct Write to CNCH Unique Key Table
Requirements
- The dbt-bytehouse connector version must be at least 1.7.1.
- Set the config cnch_unique_key instead of unique_key.
- Set the config incremental_strategy as append.
- Append the virtual warehouse id at the end of the dbt model via a SQL SETTINGS expression, for example: SETTINGS virtual_warehouse='vw-id', virtual_warehouse_write='vw-id'
Example
{{
    config(
        materialized='incremental',
        cnch_unique_key='id',
        incremental_strategy='append'
    )
}}
SELECT
    id,
    transaction_date,
    user_id,
    amount,
    status
FROM
    dbt_unique.transactions
{% if is_incremental() %}
WHERE transaction_date >= (SELECT max(transaction_date) FROM dbt_unique.transactions)
{% endif %}
SETTINGS virtual_warehouse='vw-id', virtual_warehouse_write='vw-id'
Project Documentation
dbt provides a way to generate documentation for your dbt project and render it as a website.
Create models/actors_insight_incremental.yml to generate documentation for our models.
version: 2

models:
  - name: actors_insight_incremental
    description: Actor appearance summary based on roles
    columns:
      - name: id
        description: The id number of the actor
      - name: name
        description: The name of the actor
      - name: num_movies
        description: The number of movies the actor has appeared in
      - name: avg_rating
        description: Average rating
      - name: updated_at
        description: Latest update time
Use dbt docs generate to generate the documentation for your models & dbt docs serve to serve the documentation in your local browser on port 8000.
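From the project directory (the --port flag is optional; shown here only to match the port mentioned above):

dbt docs generate
dbt docs serve --port 8000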
Local Development
Update the tests/integration/conftest.py file to include your connection credentials. For running tests locally, follow these steps:
pip install -r dev_requirements.txt
python -m pytest
Original Author
ByteHouse wants to thank ClickHouse for the original contribution to this connector.
License
This project is distributed under the terms of the Apache License (Version 2.0).