Skip to main content

A tool to keep dbt projects on rails

Project description

PyPI - Version PyPi Python Versions

dbt-opiner

Tool for keeping dbt standards aligned across dbt projects.

Disclaimer: This is an early stage project. Check the License for more information.

Table of Contents

  1. Installation and usage
    1. CLI
    2. Pre-commit hook
    3. Usage in CI pipelines
    4. Important notes and additional configurations
  2. Opinions
    1. Ignoring opinions (noqa)
    2. Default opinions
      1. O001 model must have a description
      2. O002 model description must have keywords
      3. O003 all columns must have description
      4. O004 All columns in model must be explicitly named at least once
      5. O005 model should have unique key
    3. Adding custom opinions
  3. Why?
  4. Contributing

Installation and usage

CLI

It can be installed as a python package (with pip pip install dbt-opiner, poetry poetry add dbt-opiner, etc.) and used as a cli. Run dbt-opiner -h to see the available commands and options.

Pre-commit hook

Also, it can be used as a pre-commit hook. To use it as a pre-commit hook, add the following to your .pre-commit-config.yaml file:

repos:
  - repo: https://github.com/dbt-opiner/dbt-opiner
    rev: 0.1.0
    hooks:
      - id: dbt-opiner-lint
        args: [-f]
        additional_dependencies: [dbt-duckdb == 1.8.2]

Make sure to list in additional dependencies the dbt connector you are using. In this case, duckdb.

Usage in CI pipelines

The tool can be used in CI pipelines. It will return a non-zero exit code if any opinion with severity Must is not met.

An environment variable DBT_TARGET can be set to specify the target to use when compiling the dbt manifest. If not set, the default target will be used.

Important notes and additional configurations

The tool expects all the linted files to belong to a git repository: it won't work with files that are not part of a git repository.

Extra configs can be set using a dbt-opiner.yaml file, that should be anywhere in the repository. If more than one dbt-opiner.yaml file is found, the tool will use the one in the highest path of the repository. If file is not provided, empty configuration will be used and default behaviour will be applied.

The dbt-opiner.yaml file should have the following structure:

sqlglot_dialect: duckdb # The dialect to use when parsing the sql files with sqlglot.

opinions_config: # Extra config for opinions. Check the opinions documentation for more info.
  ignore_opinions: # To ignore some opinions list the opinion codes. Optional.
    - O001
  ignore_files:
    # The opinion is ignored in all the files that match the regex.
    # Use the opinion code as key and a regex as value.
    - O002: ".*/models/dimensions/.*"
  extra_opinions_config:
    O002_keywords:
      - summary
      - granularity
  custom_opinions: # To set custom opinions. Optional.
    source: git # git or local. If local, it will load opinions from "the directory where this config file is"/custom_opinion/ directory.
    repository: https://github.com/dbt-opiner/dbt-opiner-custom-opinions.git # Only required if the source is git. TODO: Add support for private repositories.
    rev: 0.1.0 # Tag or commit sha, not branches. Revision is optional but encouraged. If not provided default main branch will be used.

files: #Regex to match the files to lint. Optional.
  sql: ".*/(models|macros|tests)/.*" #
  #yaml:
  #md:

TODO: add link to example repo.

Opinions

The opinions are defined in the opinions directory. They apply to certain dbt nodes (models, macros, or tests) and/or type of files (yaml, sql, md). The opinions have a code, a description, a severity, and a configuration. The severity levels are: Must (it's mandatory) and Should (it's highly recommended). The configuration is optional and can be used to set extra parameters for the opinion.

Ignoring opinions (noqa)

Opinions can be ignored at the global level, at the node level, or by regex matching file paths. To ignore an opinion at the global level, add the opinion code to the ignore_opinions list in the dbt-opiner.yaml file. To ignore an opinion at the node (model, macro, or test) level, add a comment with the format: noqa: dbt-opiner OXXX at the beginning of the sql or yaml file, where OXXX is the opinion code. Use a comma separated list if you want to ignore more than one opinion (e.g. noqa: dbt-opiner O001, O002). You can also ignore all opinions in a node by using noqa: dbt-opiner all. Note that if multiple nodes are defined in the same yaml file, the noqa comment will apply to all the nodes defined in that file. To ignore opinions for certaing regex matching file paths, add the opinion code as key and a regex as value to the ignore_files list in the dbt-opiner.yaml file.

TODO: add link to example repo.

Default opinions

O001 model must have a description [source]

Applies to: dbt models when either sql or yaml files are changed. Models must have descriptions. Empty descriptions are not allowed.

Descriptions are important for documentation and understanding the purpose of the model. A good description makes data more obvious. Include a description for the model in a yaml file or config block.

O002 model description must have keywords [source]

Applies to: dbt models when either sql or yaml files are changed. Models descriptions must have keywords.

Keywords help standarizing the description of the models, and ensure that all the important information is present in the description. Make sure the description of the model has all the required keywords.

The keywords can be set in the configuration like this:

opinions_config:
  O002_keywords:
    - summary
    - granularity

Keywords are case insensitive.

O003 all columns must have description [source]

Applies to: dbt models when either sql or yaml files are changed All columns in the model should have a description. Empty descriptions are not allowed.

Descriptions are important for documentation and understanding the purpose of the columns. A good description desambiguates the content of a column and helps making data more obvious.

This opinion has some caveats. The only way of really knowning the columns of a model is by running the model and checking the columns in the database or catalog.json. However we don't want to depend on the execution of the model.

This opinion checks if:

  • all defined columns in yaml files have a non empty description in the manifests.json.
  • the columns extracted by the sqlglot parser have a description in the manifest.json.

If the model is constructed in a way that not all columns are extracted by the sqlglot parser, this opinion will omit those columns from the check. Rule O004 will check against this condition and will fail if unresolved select * are found.

O004 All columns in model must be explicitly named at least once [source]

Applies to: dbt models when sql files are changed The final columns of the model must be explicitly named at least once.

This makes it easier to understand what columns are being selected at the end of the model, without needing to check the lineage and the sources columns.

Ideally, all the columns should be named in the final select statement or CTE. This rule doesn't check for that condition, but it checks if there is any unresolved select * statement in the model.

** For example **

  • Good: with customers as (select * from {{ref('customers')}}), orders as (select * from {{ref('orders')}}), joined as ( select customers.customer_id, customers.customer_name, orders.order_id, orders.order_date from customers join orders on customers.customer_id = orders.customer_id ) select * from joined

  • Also good: with customers as (select customer_id, customer_name from {{ref('customers')}}), orders as (select * from {{ref('orders')}}), joined as ( select customers.*, -- These are named in the CTE orders.order_id, orders.order_date from customers join orders on customers.customer_id = orders.customer_id ) select * from joined

  • Bad: with customers as (select * from {{ref('customers')}}), orders as (select * from {{ref('orders')}}), joined as ( select customers.*, -- We can't know the final columns that this model uses without checking dim_customers table orders.order_id, orders.order_date from customers join orders on customers.customer_id = orders.customer_id ) select * from joined

O005 model should have unique key [source]

Applies to: dbt models when sql files are changed Models should have a unique key defined in the config block of the model. This is useful to enforce the uniqueness of the model and to make the granularity of the model explicit.

Adding custom opinions

If you want to add your own opinions, you can do so by creating a new opinion classes in a custom_opinion directory in the same directory where the dbt-opiner.yaml file is located or in a github repository (see for example: this repo)

The opinion class should inherit from dbt_opiner.opinions.BaseOpinion and implement the _eval method. This method should check for file and node types (to avoid running the opinion in the wrong files or nodes) and evaluate the opinion and it can return a single or a list of dbt_opiner.linter.LintResult

If the custom opinion is in a repository and requires extra dependencies, define a class variable required_dependencies with the required dependencies (e.g. "numpy==2.0.1").

The custom opinion can use the configuration set in the dbt-opiner.yaml file. The config dictionary is injected when the class is instantiated. To access it, define a __init__ method with a config parameter (see for example [this])

All the configurations for ignoring opinions (noqa) will also apply to the custom opinions. Make sure you don't create conflicting opinion codes. As a best practice, use a prefix for the opinion codes that are specific to your organization (e.g. C001).

Why?

Inspired in Benn Stancil's blog post where he says:

My suspicion is that dbt [...] needs something that aggressively imposes [...] opinions on its users. It needs dbt on Rails: A framework that builds a project’s scaffolding, and tells us how to expand it—and not through education, but through functional requirements. Show where to put which bits of logic. Prevent me from doing circular things. Blare warnings at me when I use macros incorrectly. Force me to rigorously define “production.”

Far for being a dbt framework, this less ambitious tool aims to help enforcing standards and best practices in dbt projects at scale, as part of the federated data governance principle of a data mesh architecture.

Although other similar tools exist, they fall short in some aspects:

  • dbt-checkpoint doesn't work well with multi projects repositories and doesn't provide a way to define custom opinions.
  • dbt-score doesn't work as a pre-commit hook and it's oriented to check mainly the metadata of the nodes.

dbt-opiner tries to fill this gap. It can be used as a cli, as a pre-commit hook, and in CI pipelines, and it allows to define custom opinions that check multiple aspects of dbt projects such as:

  • Naming conventions
  • Data and unit tests checks
  • Documentation checks
  • Dependency checks
  • Security/Privacy checks
  • etc.

dbt-opiner is designed to be extensible and easy to configure, so it can be adapted to the specific needs of each organization.

This tool doesn't replace linters and formaters for sql code (SQLFluff, sqlftm) that are highly encouraged.

Contributing

Check the CONTRIBUTING.md file for more information.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dbt_opiner-0.1.0b0.tar.gz (25.4 kB view hashes)

Uploaded Source

Built Distribution

dbt_opiner-0.1.0b0-py3-none-any.whl (28.5 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page