Sync dbt sources and models from cloud warehouses to duckdb

Project description

Quacker

Quacker is a streamlined command-line interface (CLI) tool designed to replicate dbt sources as tables from a Cloud Data Warehouse into a local DuckDB database. This allows for faster and more cost-effective local development with dbt.

Quacker currently support syncing from either

Snowflake
BigQuery

Features

Replicates dbt sources to local DuckDB files.
Simplifies local dbt development and testing.
Supports flexible project and manifest directory paths.
Supports flexible dbt targets.
Supports multiple databases from the same warehouse.
(Optional) syncs table copies of selected dbt models to a DuckDB file as well.

Quick terminology

warehouse - A Cloud Data Warehouse e.g. Snowflake, BigQuery
database - The highest level of data organisation in a warehouse. In BigQuery a database is called a project.
schema - A logical grouping of tables within a database. In BigQuery a schema is called a dataset.

Getting Started

Prerequisites

Before using Quacker, you need to have the following set up:

Python installed
[Recommended] A venv virtual environment
A valid dbt project with a warehouse target profile
A valid target for your warehouse in your dbt profiles.yml file
The following adapters installed:
- dbt-duckdb
- dbt-<warehouse> e.g. dbt-bigquery
(Optional) environment variables loaded if you are using them in your dbt project.

DuckDB Profile

An example duckdb target profile in a profiles.yml is seen below.

In this example, my dbt sources exist in two snowflake databases

fivetran_database
snowflake

Arbitrarily I have chosenfivetran_database as the name of the database all dbt output will materialize in, but any of the warehouse database names could have been used here. All others need to be attached to the main database (here: the database of the name snowflake).

    dev_duckdb:
      type: duckdb
      path: data_duckdb/fivetran_database.duckdb
      attach:
        - path: data_duckdb/snowflake.duckdb
      schema: "{{ env_var('SNOWFLAKE_SCHEMA') }}"

Environment Variables

If your dbt project uses environment variables, you will need to load them before running quack sync. This is because

quack sync reads your profile to find the connection details of your warehouse.
some dbt setups use environment variables to store these connection details.

Here is an example of a target which uses environment variables to store the connection details to the warehouse database.

    dev_snowflake:
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      database: "{{ env_var('SNOWFLAKE_DATABASE') }}"
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      role: "{{ env_var('SNOWFLAKE_ROLE') }}"
      schema: "{{ env_var('SNOWFLAKE_SCHEMA') }}"
      threads: 24
      type: snowflake
      user: "{{ env_var('SNOWFLAKE_USERNAME') }}"
      warehouse: "{{ env_var('SNOWFLAKE_WAREHOUSE') }}"

[Optional] Conditionally persist docs based on context

If you have the below in your dbt_project.yml, you will receive the following error when running against duckdb ERROR: alter_column_comment macro not implemented for adapter duckdb:

models:
  +persist_docs:
      relation: true
      columns: true

Therefore, you need to have it set to only persist when running against your non-duckdb target. E.g.:

  +persist_docs:
    relation: "{{ target.name == 'dev_snowflake' }}"
    columns: "{{ target.name == 'dev_snowflake' }}"

[Optional] Make your dbt models agnostic

Some sql syntax is not compatible with duckdb. For example, offset() is not supported in duckdb. Therefore, if you have a model that uses offset(), you will need to make it agnostic to the target type. This can be done using jinja and if statements. For example, the below code will work for both duckdb and bigquery targets:

{% if target.type == 'bigquery' -%}
    split(hubspot_contact_email_address, '@')[
        offset(1)
    ] as hubspot_contact_email_domain_extracted,
{%- elif target.type == 'duckdb' -%}
    split_part(hubspot_contact_email_address, '@', 2) as hubspot_contact_email_domain_extracted
{%- endif %}

Note If you cannot make your model agnostic to the target type, you can add it to the models_to_ignore list in the quacker_config.yml file. See Optional Configuration for more details.

Installation for use

pip3 install quacker

[Optional] Configuration; `quacker_config.yml`

Quacker can be configured using a quacker_config.yml file. This file should be placed in the same location as your dbt_project.yml file. However, you don't need to create this file if you are happy with the default configuration. Below are the things you can configure in the quacker_config.yml file.

`models_to_ignore`

A list of dbt models to "ignore". This is useful for models that are not compatible with DuckDB. For example, models that use UNNEST (BigQuery) or Python dbt models.

While we are "ignoring" these dbt models during we still need to be able to run dbt against DuckDB after the sync. To do this, the ignored dbt models are replicated as tables in the main DuckDB file during the sync.

Example quacker_config.yml:

models_to_ignore:
  - stg_shopify__customers # dbt model 1
  - int_core__customers # dbt model 2

Note: When subsequently running dbt against DuckDB, you will need to pass the --exclude argument during your dbt runs to avoid materialising these models in your duckdb database (you need to "ignore" them). For example, to "ignore" the two models during dbt operations: dbt run --exclude stg_shopify__customers int_core__customers.

If you are configuring models_to_ignore, you will also need this setting in quacker_config.yml so Quacker knows where to sync ignored models to. There are plans to remove this requirement in the future.

main_duckdb_database_name:
  - fivetran_database

Note: If you are using models_to_ignore, make sure that your duckdb target has the same schema name as the warehouse target. Otherwise your subsequent dbt runs might fail because some of the tables that your models reference with ref() are in a differently named schema than expected.

Usage

To start using Quacker, run the quack sync command with the appropriate optional flags:

quack sync \
--project-dir <relative-path-to-your-dbt-project-directory> \
--profiles-dir <full-path-to-your-profiles-directory> \
--manifest-dir <relative-path-to-your-manifest-directory> \
--compile-target <dbt-target-profile-name>

Optional Argument Flags

--project-dir: Relative path to the directory containing your dbt_project.yml file. If not specified evaluates to the current working directory.
--profiles-dir: Full path to the directory containing your profiles.yml file. If not specified, path is resolved using dbt's method.
--manifest-dir: Relative path to the directory containing your manifest.json file. If not specified, assumed to be in target/ relative to the project-dir.
--compile-target: The dbt target name to sync the data from. It's also the target Quacker uses when running dbt compile before extracting identifiers. If not specified, Quacker uses the default profile.

Example Usage

quack sync \
--project-dir . \
--profiles-dir /Users/username/path/to/profiles \
--manifest-dir ../poc_duckdb_for_local_dev/target \
--compile-target dev_snowflake

Help

To see the full list of available commands and arguments, run quack <subcommand> --help e.g.

quack sync --help

How It Works

any quack command performs the following steps:

Reads and parses the quacker_config.yml file if it exists.

quack sync performs the following extra steps:

Checks for the existence of a folder named data_duckdb/ and creates it if necessary. This is where the duckdb files will be created.
Compiles the dbt project.
Parses the manifest.json file to find identifiers of all dbt sources and, optionally, dbt models specified quacker_config.yml.
Queries the warehouse database for all sources and models_to_ignore with a cap of 10,000 rows. If any of the sources or models_to_ignore have more than 10,000 rows, Quacker will randomly sample 10,000 rows.
Saves the queried data into DuckDB files. For sources, one duckdb file is generated with the same name as the database in the warehouse instance. Formodels_to_ignore, the main DuckDB file (matching one of the source databases) is updated with the model's data. Here is how the main file is configured in a dbt profile.
After quack sync, the duckdb files will be stored in a folder named data_duckdb/. The folder structure will look something like this:
```
data_duckdb/
├── fivetran_database.duckdb
├── snowflake.duckdb
└── ...
```

Limitations

Table names with reserved SQL keywords

Some table names may be reserved and cannot be used within DuckDB e.g. order, select, table, view, where, with. If you encounter this issue, you can either:

rename the table in the source database
add it to models_to_ignore.

Otherwise, the error raised will be something like

duckdb.duckdb.CatalogException: Catalog Error: Table with name ORDER does not exist!
Did you mean ""ORDER""?
LINE 1: ...an_ra_shopify."ORDER" as select * from "ORDER"

Concurrent access to DuckDB files

You should not attempt to query the DuckDB database files while Quacker is running a sync operation. Two processes cannot connect to the same duckdb file at the same time.

The error raised will be something like

duckdb.duckdb.IOException: IO Error: Could not set lock on file "/Users/amir/Projects/poc_duckdb_for_local_dev/data_duckdb/fivetran_database.duckdb": Resource temporarily unavailable

Possible future enhancements

More `quacker_config.yml` settings

Customise row limit

Currently, Quacker limits the number of rows it queries from the source database to 10,000. This is to avoid accidentally querying large tables and slowing down the sync process. In the future, we could allow users to customise this limit in the quacker_config.yml file.

Customise name of folder where duckdb files are stored

Currently, Quacker uses the hardcoded name 'data_duckdb'

DuckDB profile check and creation

If no suitable duckdb target profile exists, we could create one. This would involve:

finding all duckdb targets for the project's profile
checking if any of the existing targets have all the source databases and use attach if sources are split across multiple databases
if none of the targets are suitable, creating a new duckdb target in the project's profile
returning a message to the user to inform them of the new duckdb target profile

Gifs for README

Add gifs to the README to show how to use Quacker and what it does.

Emphasise side-by-side comparison of running dbt against warehouse vs duckdb after Quacker sync

Error messages

Target issues

When the --compile-target is not found in the profiles.yml file
When the --compile-target is not of a supported warehouse type

Profiles.yml issues

When the profiles.yml file is not found in the expected location

TODO write up more relevant error messages to implement as they are discovered

Extract the main DuckDB database name from the duckdb dbt profile

This would remove the need for the main_duckdb_database_name setting in quacker_config.yml.

Dealing with non-compatible dbt models

phase 1 solution (done Jan 2024)
- Allow users to specify dbt models to ignore in a config file.
- During quack sync, replicate the ignored dbt models as tables in the main DuckDB file.
phase 2 solution
- Temporary until phase 4 solution is implemented
- add subcommand quack recommend, which from the config generates the --exclude argument so they can manually use it when running dbt against duckdb
phase 3 solution
- Add to quack recommend: suggest models to ignore based on incompatible syntax, such as UNNEST (BigQuery) or Python dbt models.
phase 4 solution
- 3- add a quack dbt command with args (for ignore functionality)
  - Run dbt command passed in argument e.g. run, build
    - To make this seamless use positional arguments if possible. E.g., command is quack dbt run --full-refresh. In this example, both run and --full-refresh are positional arguments.
    - Might need to use some sort of (*args, **kwargs) solution here?
  - Pass as an --exclude(?) on all the models which are ignored in the config
    - example dbt command generated and executed by Quacker: dbt run --full-refresh --exclude model1,model5
  - Stop and inform user if they try to
    - Pass --exclude themselves down to dbt
    - Run a dbt command which doesn't accept --exclude arg
    - dbt command fails
phase 5 solution
- 0- quack debug
  - TODO should this functionality be added to quack recommend or quack sync instead of a new command?
  - Among others, warn if any of the ignored models are not tables in duckdb and suggest re-run of 'quack sync'
Long-term solution
- quack recommend could generate DuckDB syntax equivalents wrapped in Jinja based on the target. For syntax not yet translated, it would fall back on the ignore recommendation.

Debugging tools

Develop quack debug to warn users of any ignored models that are not tables in DuckDB and suggest re-running quack sync.

Support for other databases

As needed, we could add support for other databases in addition to the ones we currently support:

Snowflake added Dec 2023
BigQuery added Jan 2024

Completed enhancements

Retrieve connection details from target profile (done Jan 2024)

Extract the Snowflake credentials directly from the dbt profile, avoiding the need for separate environment variables.
Investigate dbt's source code or dbt power users' methods for retrieving these credentials. failed
Instead, manually code to replicate the order in which dbt searches for the profiles.yml file (exact name). relevant dbt profile documentation:

Specified using the --profiles-dir runtime argument
Environment Variable DBT_PROFILES_DIR: If you have set the DBT_PROFILES_DIR environment variable, dbt will use the directory specified in this variable to look for the profiles.yml file
Current Working Directory: The current working directory is the directory from which you are running the dbt command (where dbt_project.yml is)
Default Directory ~/.dbt/

Simplifying sync (done Jan 2024)

Specify compile sync target, such as snowflake-prod or dev_snwflk with --sync_target argument.

this would

Development and Contribution

We welcome contributions and feedback on our tool! Please reach out to me if you have any questions or would like to contribute: amir.jab.93+quacker@gmail.com

PyPI

Installation for development

Clone the Quacker repository and install it using pip (ideally in a venv virtual environment):

git clone https://github.com/<your_username>/quacker.git
cd quacker
pip3 install -e .

The dot . represents the current directory. It can be replaced with a path to the Quacker repository if you cloned it elsewhere. No matter where Quacker is stored, you can run installed versions of it from anywhere on your machine as long as you are in the same virtual environment that you installed it in.

The -e flag is optional and is used to install Quacker in editable mode, which is useful during development of Quacker itself as it allows changes to be immediately effective without reinstallation. You don't need to use this flag if you are just using Quacker.

Contributors

Amir Jaber | GitHub | LinkedIn

Support

If you encounter any issues or have questions, please open an issue in the project's GitHub repository.

License

Quacker is released under the MIT License.

Project details

Release history Release notifications | RSS feed

0.7.2

Feb 1, 2024

0.7.0

Jan 14, 2024

0.6.5

Jan 13, 2024

This version

0.6.4

Jan 12, 2024

0.6.3

Jan 10, 2024

0.6.2

Jan 10, 2024

0.6.1

Jan 9, 2024

0.5.1

Jan 8, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

quacker-0.6.4.tar.gz (22.9 kB view details)

Uploaded Jan 12, 2024 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

quacker-0.6.4-py3-none-any.whl (21.0 kB view details)

Uploaded Jan 12, 2024 Python 3

File details

Details for the file quacker-0.6.4.tar.gz.

File metadata

Download URL: quacker-0.6.4.tar.gz
Upload date: Jan 12, 2024
Size: 22.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.3

File hashes

Hashes for quacker-0.6.4.tar.gz
Algorithm	Hash digest
SHA256	`68fa618c6b75b07b35e63ee78c272f8d12a06e6b5de88454dd9c28b47c42acd2`
MD5	`dda184a6b07eaf7e11baa60df7621d16`
BLAKE2b-256	`372986554b8d77bf5068e9a55f029fb4711fa7c75bc18f023145c63f8bce4a0c`

See more details on using hashes here.

File details

Details for the file quacker-0.6.4-py3-none-any.whl.

File metadata

Download URL: quacker-0.6.4-py3-none-any.whl
Upload date: Jan 12, 2024
Size: 21.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.3

File hashes

Hashes for quacker-0.6.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`064045b38ff1c73f33239246c7865c0959765997c0f3f7c64da3fcc85ae11e5b`
MD5	`7021108954176bf01569b7dd10c39164`
BLAKE2b-256	`09253b77db7f6a0671500ca15f906234eac8d0e985bb78110309c37a362690e2`

See more details on using hashes here.

quacker 0.6.4

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

Quacker

Features

Quick terminology

Getting Started

Prerequisites

DuckDB Profile

Environment Variables

[Optional] Conditionally persist docs based on context

[Optional] Make your dbt models agnostic

Installation for use

[Optional] Configuration; quacker_config.yml

models_to_ignore

Usage

Optional Argument Flags

Example Usage

Help

How It Works

Limitations

Table names with reserved SQL keywords

Concurrent access to DuckDB files

Possible future enhancements

More quacker_config.yml settings

DuckDB profile check and creation

Gifs for README

Error messages

Extract the main DuckDB database name from the duckdb dbt profile

Dealing with non-compatible dbt models

Debugging tools

Support for other databases

Completed enhancements

Retrieve connection details from target profile (done Jan 2024)

Simplifying sync (done Jan 2024)

Development and Contribution

PyPI

Installation for development

Contributors

Support

License

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

[Optional] Configuration; `quacker_config.yml`

`models_to_ignore`

More `quacker_config.yml` settings