Sync dbt sources and models from cloud warehouses to duckdb
Project description
Quacker
Quacker is a streamlined command-line interface (CLI) tool designed to replicate dbt sources as tables from a Cloud Data Warehouse into a local DuckDB database. This allows for faster and more cost-effective local development with dbt.
Quacker currently support syncing from either
- Snowflake
- BigQuery
Features
- Replicates dbt
sourcesto localDuckDBfiles. - Simplifies local
dbtdevelopment and testing. - Supports flexible
projectandmanifestdirectory paths. - Supports flexible dbt
targets. - Supports multiple
databasesfrom the samewarehouse. - (Optional) syncs
tablecopies of selected dbtmodelsto a DuckDB file as well.
Quick terminology
warehouse- A Cloud Data Warehouse e.g.Snowflake,BigQuerydatabase- The highest level of data organisation in awarehouse. In BigQuery adatabaseis called aproject.schema- A logical grouping oftableswithin adatabase. In BigQuery aschemais called adataset.
Getting Started
Prerequisites
Before using Quacker, you need to have the following set up:
Pythoninstalled- [Recommended] A
venvvirtual environment - A valid
dbtproject with awarehousetargetprofile - A valid
targetfor yourwarehousein yourdbtprofiles.ymlfile - The following adapters installed:
dbt-duckdbdbt-<warehouse>e.g.dbt-bigquery
- (Optional)
environment variablesloaded if you are using them in your dbt project.
DuckDB Profile
An example duckdb target profile in a profiles.yml is seen below.
In this example, my dbt sources exist in two snowflake databases
fivetran_databasesnowflake
Arbitrarily I have chosenfivetran_database as the name of the database all dbt output will materialize in, but any of the warehouse database names could have been used here. All others need to be attached to the main database (here: the database of the name snowflake).
dev_duckdb:
type: duckdb
path: data_duckdb/fivetran_database.duckdb
attach:
- path: data_duckdb/snowflake.duckdb
schema: "{{ env_var('SNOWFLAKE_SCHEMA') }}"
Environment Variables
If your dbt project uses environment variables, you will need to load them before running quack sync. This is because
quack syncreads yourprofileto find the connection details of yourwarehouse.- some
dbtsetups useenvironment variablesto store these connection details.
Here is an example of a target which uses environment variables to store the connection details to the warehouse database.
dev_snowflake:
account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
database: "{{ env_var('SNOWFLAKE_DATABASE') }}"
password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
role: "{{ env_var('SNOWFLAKE_ROLE') }}"
schema: "{{ env_var('SNOWFLAKE_SCHEMA') }}"
threads: 24
type: snowflake
user: "{{ env_var('SNOWFLAKE_USERNAME') }}"
warehouse: "{{ env_var('SNOWFLAKE_WAREHOUSE') }}"
[Optional] Conditionally persist docs based on context
If you have the below in your dbt_project.yml, you will receive the following error when running against duckdb
ERROR: alter_column_comment macro not implemented for adapter duckdb:
models:
+persist_docs:
relation: true
columns: true
Therefore, you need to have it set to only persist when running against your non-duckdb target. E.g.:
+persist_docs:
relation: "{{ target.name == 'dev_snowflake' }}"
columns: "{{ target.name == 'dev_snowflake' }}"
[Optional] Make your dbt models agnostic
Some sql syntax is not compatible with duckdb. For example, offset() is not supported in duckdb. Therefore, if you have a model that uses offset(), you will need to make it agnostic to the target type. This can be done using jinja and if statements. For example, the below code will work for both duckdb and bigquery targets:
{% if target.type == 'bigquery' -%}
split(hubspot_contact_email_address, '@')[
offset(1)
] as hubspot_contact_email_domain_extracted,
{%- elif target.type == 'duckdb' -%}
split_part(hubspot_contact_email_address, '@', 2) as hubspot_contact_email_domain_extracted
{%- endif %}
Note If you cannot make your
modelagnostic to thetargettype, you can add it to themodels_to_ignorelist in thequacker_config.ymlfile. See Optional Configuration for more details.
Installation for use
pip3 install quacker
[Optional] Configuration; quacker_config.yml
Quacker can be configured using a quacker_config.yml file. This file should be placed in the same location as your dbt_project.yml file. However, you don't need to create this file if you are happy with the default configuration. Below are the things you can configure in the quacker_config.yml file.
models_to_ignore
A list of dbt models to "ignore". This is useful for models that are not compatible with DuckDB. For example, models that use UNNEST (BigQuery) or Python dbt models.
While we are "ignoring" these dbt models during we still need to be able to run dbt against DuckDB after the sync. To do this, the ignored dbt models are replicated as tables in the main DuckDB file during the sync.
Example quacker_config.yml:
models_to_ignore:
- stg_shopify__customers # dbt model 1
- int_core__customers # dbt model 2
Note: When subsequently running
dbtagainstDuckDB, you will need to pass the--excludeargument during yourdbt runsto avoid materialising thesemodelsin yourduckdbdatabase(you need to "ignore" them). For example, to "ignore" the two models during dbt operations:dbt run --exclude stg_shopify__customers int_core__customers.
If you are configuring models_to_ignore, you will also need this setting in quacker_config.yml so Quacker knows where to sync ignored models to. There are plans to remove this requirement in the future.
main_duckdb_database_name:
- fivetran_database
Note: If you are using
models_to_ignore, make sure that yourduckdbtargethas the sameschemaname as thewarehousetarget. Otherwise your subsequentdbt runsmight fail because some of thetablesthat yourmodelsreference withref()are in a differently namedschemathan expected.
Usage
To start using Quacker, run the quack sync command with the appropriate optional flags:
quack sync \
--project-dir <relative-path-to-your-dbt-project-directory> \
--profiles-dir <full-path-to-your-profiles-directory> \
--manifest-dir <relative-path-to-your-manifest-directory> \
--compile-target <dbt-target-profile-name>
Optional Argument Flags
--project-dir: Relative path to the directory containing yourdbt_project.ymlfile. If not specified evaluates to the current working directory.--profiles-dir: Full path to the directory containing yourprofiles.ymlfile. If not specified, path is resolved using dbt's method.--manifest-dir: Relative path to the directory containing yourmanifest.jsonfile. If not specified, assumed to be intarget/relative to theproject-dir.--compile-target: The dbt target name to sync the data from. It's also the target Quacker uses when runningdbt compilebefore extracting identifiers. If not specified, Quacker uses thedefaultprofile.
Example Usage
quack sync \
--project-dir . \
--profiles-dir /Users/username/path/to/profiles \
--manifest-dir ../poc_duckdb_for_local_dev/target \
--compile-target dev_snowflake
Help
To see the full list of available commands and arguments, run quack <subcommand> --help e.g.
quack sync --help
How It Works
any quack command performs the following steps:
- Reads and parses the
quacker_config.ymlfile if it exists.
quack sync performs the following extra steps:
- Checks for the existence of a folder named
data_duckdb/and creates it if necessary. This is where theduckdbfiles will be created. - Compiles the dbt project.
- Parses the
manifest.jsonfile to find identifiers of all dbtsourcesand, optionally, dbtmodelsspecifiedquacker_config.yml. - Queries the
warehousedatabasefor allsourcesandmodels_to_ignorewith a cap of 10,000 rows. If any of thesourcesormodels_to_ignorehave more than 10,000 rows, Quacker will randomly sample 10,000 rows. - Saves the queried data into
DuckDBfiles. Forsources, oneduckdbfile is generated with the same name as thedatabasein thewarehouseinstance. Formodels_to_ignore, themainDuckDBfile (matching one of thesourcedatabases) is updated with themodel's data. Here is how the main file is configured in a dbt profile. - After quack sync, the
duckdbfiles will be stored in a folder nameddata_duckdb/. The folder structure will look something like this:data_duckdb/ ├── fivetran_database.duckdb ├── snowflake.duckdb └── ...
Limitations
Table names with reserved SQL keywords
Some table names may be reserved and cannot be used within DuckDB e.g. order, select, table, view, where, with. If you encounter this issue, you can either:
- rename the table in the source database
- add it to models_to_ignore.
Otherwise, the error raised will be something like
duckdb.duckdb.CatalogException: Catalog Error: Table with name ORDER does not exist!
Did you mean ""ORDER""?
LINE 1: ...an_ra_shopify."ORDER" as select * from "ORDER"
Concurrent access to DuckDB files
You should not attempt to query the DuckDB database files while Quacker is running a sync operation. Two processes cannot connect to the same duckdb file at the same time.
The error raised will be something like
duckdb.duckdb.IOException: IO Error: Could not set lock on file "/Users/amir/Projects/poc_duckdb_for_local_dev/data_duckdb/fivetran_database.duckdb": Resource temporarily unavailable
Possible future enhancements
More quacker_config.yml settings
Customise row limit
- Currently, Quacker limits the number of rows it queries from the source database to 10,000. This is to avoid accidentally querying large tables and slowing down the sync process. In the future, we could allow users to customise this limit in the
quacker_config.ymlfile.
Customise name of folder where duckdb files are stored
- Currently, Quacker uses the hardcoded name 'data_duckdb'
DuckDB profile check and creation
If no suitable duckdb target profile exists, we could create one. This would involve:
- finding all
duckdbtargets for theproject'sprofile - checking if any of the existing targets have all the
sourcedatabasesand useattachif sources are split across multipledatabases - if none of the
targetsare suitable, creating a newduckdbtargetin theproject'sprofile - returning a message to the user to inform them of the new
duckdbtargetprofile
Gifs for README
Add gifs to the README to show how to use Quacker and what it does.
- Emphasise side-by-side comparison of running dbt against warehouse vs duckdb after Quacker sync
Error messages
Target issues
- When the
--compile-targetis not found in theprofiles.ymlfile - When the
--compile-targetis not of a supportedwarehousetype
Profiles.yml issues
- When the
profiles.ymlfile is not found in the expected location
TODO write up more relevant error messages to implement as they are discovered
Extract the main DuckDB database name from the duckdb dbt profile
- This would remove the need for the
main_duckdb_database_namesetting inquacker_config.yml.
Dealing with non-compatible dbt models
-
phase 1 solution (done Jan 2024)
- Allow users to specify dbt models to ignore in a config file.
- During
quack sync, replicate the ignored dbt models as tables in the main DuckDB file.
-
phase 2 solution
- Temporary until phase 4 solution is implemented
- add subcommand
quack recommend, which from the config generates the--excludeargument so they can manually use it when running dbt against duckdb
-
phase 3 solution
- Add to
quack recommend: suggest models to ignore based on incompatible syntax, such as UNNEST (BigQuery) or Python dbt models.
- Add to
-
phase 4 solution
- 3- add a
quack dbtcommand with args (for ignore functionality)- Run dbt command passed in argument e.g.
run,build- To make this seamless use positional arguments if possible. E.g., command is
quack dbt run --full-refresh. In this example, bothrunand--full-refreshare positional arguments. - Might need to use some sort of (*args, **kwargs) solution here?
- To make this seamless use positional arguments if possible. E.g., command is
- Pass as an
--exclude(?) on all the models which are ignored in the config- example dbt command generated and executed by Quacker:
dbt run --full-refresh --exclude model1,model5
- example dbt command generated and executed by Quacker:
- Stop and inform user if they try to
- Pass
--excludethemselves down to dbt - Run a dbt command which doesn't accept
--excludearg - dbt command fails
- Pass
- Run dbt command passed in argument e.g.
- 3- add a
-
phase 5 solution
- 0-
quack debug- TODO should this functionality be added to
quack recommendorquack syncinstead of a new command? - Among others, warn if any of the ignored models are not tables in duckdb and suggest re-run of 'quack sync'
- TODO should this functionality be added to
- 0-
-
Long-term solution
quack recommendcould generate DuckDB syntax equivalents wrapped in Jinja based on the target. For syntax not yet translated, it would fall back on the ignore recommendation.
Debugging tools
- Develop
quack debugto warn users of any ignored models that are not tables in DuckDB and suggest re-runningquack sync.
Support for other databases
As needed, we could add support for other databases in addition to the ones we currently support:
- Snowflake added Dec 2023
- BigQuery added Jan 2024
Completed enhancements
Retrieve connection details from target profile (done Jan 2024)
- Extract the Snowflake credentials directly from the dbt profile, avoiding the need for separate environment variables.
- Investigate dbt's source code or dbt power users' methods for retrieving these credentials. failed
- Instead, manually code to replicate the order in which dbt searches for the
profiles.ymlfile (exact name). relevant dbt profile documentation:
- Specified using the
--profiles-dirruntime argument - Environment Variable
DBT_PROFILES_DIR: If you have set the DBT_PROFILES_DIR environment variable, dbt will use the directory specified in this variable to look for the profiles.yml file - Current Working Directory: The current working directory is the directory from which you are running the dbt command (where dbt_project.yml is)
- Default Directory
~/.dbt/
Simplifying sync (done Jan 2024)
Specify compile sync target, such as snowflake-prod or dev_snwflk with --sync_target argument.
- this would
Development and Contribution
We welcome contributions and feedback on our tool! Please reach out to me if you have any questions or would like to contribute: amir.jab.93+quacker@gmail.com
PyPI
Installation for development
Clone the Quacker repository and install it using pip (ideally in a venv virtual environment):
git clone https://github.com/<your_username>/quacker.git
cd quacker
pip3 install -e .
The dot . represents the current directory. It can be replaced with a path to the Quacker repository if you cloned it elsewhere. No matter where Quacker is stored, you can run installed versions of it from anywhere on your machine as long as you are in the same virtual environment that you installed it in.
The -e flag is optional and is used to install Quacker in editable mode, which is useful during development of Quacker itself as it allows changes to be immediately effective without reinstallation. You don't need to use this flag if you are just using Quacker.
Contributors
Support
If you encounter any issues or have questions, please open an issue in the project's GitHub repository.
License
Quacker is released under the MIT License.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file quacker-0.6.4.tar.gz.
File metadata
- Download URL: quacker-0.6.4.tar.gz
- Upload date:
- Size: 22.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
68fa618c6b75b07b35e63ee78c272f8d12a06e6b5de88454dd9c28b47c42acd2
|
|
| MD5 |
dda184a6b07eaf7e11baa60df7621d16
|
|
| BLAKE2b-256 |
372986554b8d77bf5068e9a55f029fb4711fa7c75bc18f023145c63f8bce4a0c
|
File details
Details for the file quacker-0.6.4-py3-none-any.whl.
File metadata
- Download URL: quacker-0.6.4-py3-none-any.whl
- Upload date:
- Size: 21.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
064045b38ff1c73f33239246c7865c0959765997c0f3f7c64da3fcc85ae11e5b
|
|
| MD5 |
7021108954176bf01569b7dd10c39164
|
|
| BLAKE2b-256 |
09253b77db7f6a0671500ca15f906234eac8d0e985bb78110309c37a362690e2
|