
The StarRocks adapter plugin for dbt

Project description

dbt-starrocks


This project is under development.

The dbt-starrocks package contains all the code to enable dbt to work with StarRocks.

This is an experimental plugin:

  • We have not tested it extensively
  • Requires StarRocks version 2.5.0 or higher
    • version 3.1.x is recommended
    • StarRocks versions 2.4 and below are no longer supported

Installation

This plugin can be installed via pip:

$ pip install dbt-starrocks

Supported features

Feature availability depends on the StarRocks version (<= 2.5, 2.5 ~ 3.1, >= 3.1, >= 3.4); see the Notice below for the version-specific constraints.

  • Table materialization
  • View materialization
  • Materialized View materialization
  • Incremental materialization
  • Primary Key Model
  • Sources
  • Custom data tests
  • Docs generate
  • Expression Partition
  • Kafka
  • Dynamic Overwrite
  • Submit task (see Notice 4)
  • Microbatch (Insert Overwrite)
  • Microbatch (Dynamic Overwrite)

Notice

  1. When the StarRocks version is < 2.5, Create table as can only set engine='OLAP' and table_type='DUPLICATE'.
  2. When the StarRocks version is >= 2.5, Create table as also supports table_type='PRIMARY'.
  3. When the StarRocks version is < 3.1, distributed_by is required (see the example after this list).
  4. Verify the specific SUBMIT TASK support for your StarRocks version; see the SUBMIT TASK documentation.
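
As an illustration of notice 3, a model targeting a StarRocks version older than 3.1 must declare an explicit distribution. This is a minimal sketch; the column name and bucket count are placeholders:

{{ config(materialized='table', engine='OLAP', table_type='DUPLICATE', distributed_by=['id'], buckets=8) }}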

Profile Configuration

Example entry for profiles.yml:

starrocks:
  target: dev
  outputs:
    dev:
      type: starrocks
      host: localhost
      port: 9030
      schema: analytics
      username: your_starrocks_username
      password: your_starrocks_password
  • type (required): The specific adapter to use, e.g. starrocks
  • host (required): The hostname to connect to, e.g. 192.168.100.28
  • port (required): The port to use, e.g. 9030
  • schema (required): The schema (database) to build models into, e.g. analytics
  • username (required): The username used to connect to the server, e.g. dbt_admin
  • password (required): The password used to authenticate to the server, e.g. correct-horse-battery-staple
  • version (optional): Tells the plugin to behave compatibly with the given StarRocks version, e.g. 3.1.0
  • use_pure (optional): Set to "true" to use C extensions, e.g. true
  • is_async (optional): Set to "true" to submit suitable statements as ETL tasks, e.g. true
  • async_query_timeout (optional): Sets the query_timeout value when submitting a task to StarRocks, e.g. 300

More details about setting use_pure and other connection arguments here
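
For illustration, a profile that sets the optional arguments might look like the following; the host, credentials, and option values are placeholders:

starrocks:
  target: dev
  outputs:
    dev:
      type: starrocks
      host: localhost
      port: 9030
      schema: analytics
      username: your_starrocks_username
      password: your_starrocks_password
      version: 3.1.0            # optional: target compatibility with this StarRocks version
      use_pure: true            # optional: see the note on connection arguments above
      is_async: true            # optional: submit suitable statements as ETL tasks
      async_query_timeout: 300  # optional: query_timeout applied to submitted tasks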

Example

dbt seed properties (yml):

Complete configuration:

models:
  materialized: table                   # table, view, materialized_view or incremental
  engine: 'OLAP'
  keys: ['id', 'name', 'some_date']
  table_type: 'PRIMARY'                 # PRIMARY or DUPLICATE or UNIQUE
  distributed_by: ['id']
  buckets: 3                            # leave empty for auto bucketing
  indexs: [{'columns': 'idx_column'}]
  partition_by: ['some_date']
  partition_by_init: ["PARTITION p1 VALUES [('1971-01-01 00:00:00'), ('1991-01-01 00:00:00')),PARTITION p1972 VALUES [('1991-01-01 00:00:00'), ('1999-01-01 00:00:00'))"]
  partition_type: 'RANGE'               # RANGE, LIST or Expr; must be used together with the partition_by configuration
                                        # the Expr partition type requires an expression (e.g. date_trunc) in partition_by
  order_by: ['some_column']             # only for PRIMARY table_type
  properties: {"replication_num":"1", "in_memory": "true"}
  refresh_method: 'async'               # only for materialized views; defaults to manual

  # For 'materialized=incremental' in version >= 3.4
  incremental_strategy: 'dynamic_overwrite' # Supported values: ['default', 'insert_overwrite', 'dynamic_overwrite']

  # For 'materialized=incremental' and 'incremental_strategy=microbatch'
  event_time: 'some_timestamp_column'     # The column name of the event time
  begin: '2025-01-01'                     # The start time of the incremental data
  lookback: 1                             # The lookback window of each incremental run
  batch_size: 'day'                       # The batch size. Supported values: ['year', 'month', 'day', 'hour']
  microbatch_use_dynamic_overwrite: true  # Whether to use dynamic_overwrite in version >= 3.4

dbt run config:

Example configuration:

{{ config(materialized='view') }}
{{ config(materialized='table', engine='OLAP', buckets=32, distributed_by=['id']) }}
{{ config(materialized='table', indexs=[{ 'columns': 'idx_column' }]) }}
{{ config(materialized='table', partition_by=['date_trunc("day", first_order)'], partition_type='Expr') }}
{{ config(materialized='table', table_type='PRIMARY', keys=['customer_id'], order_by=['first_name', 'last_name']) }}
{{ config(materialized='incremental', table_type='PRIMARY', engine='OLAP', buckets=32, distributed_by=['id']) }}
{{ config(materialized='incremental', partition_by=['my_partition_key'], partition_type='Expr', incremental_strategy='dynamic_overwrite') }}
{{ config(materialized='incremental', partition_by=['my_partition_key'], partition_type='Expr', incremental_strategy='microbatch', event_time='report_day', begin='2025-01-01', lookback=1, batch_size='day') }}
{{ config(materialized='incremental', partition_by=['my_partition_key'], partition_type='Expr', incremental_strategy='microbatch', event_time='report_day', begin='2025-01-01', lookback=1, batch_size='day', microbatch_use_dynamic_overwrite=true) }}
{{ config(materialized='materialized_view') }}
{{ config(materialized='materialized_view', properties={"storage_medium":"SSD"}) }}
{{ config(materialized='materialized_view', refresh_method="ASYNC START('2022-09-01 10:00:00') EVERY (interval 1 day)") }}

Materialized views only support the partition_by, buckets, distributed_by, properties, and refresh_method configurations.

Read From Catalog

First, you need to add the catalog to StarRocks. The following is an example for Hive.

CREATE EXTERNAL CATALOG `hive_catalog`
PROPERTIES (
    "hive.metastore.uris"  =  "thrift://127.0.0.1:8087",
    "type"="hive"
);

How to add other types of catalogs is covered in the documentation: https://docs.starrocks.io/en-us/latest/data_source/catalog/catalog_overview. Then write the sources.yaml file.

sources:
  - name: external_example
    schema: hive_catalog.hive_db
    tables:
      - name: hive_table_name

Finally, reference the external table with the source macro:

{{ source('external_example', 'hive_table_name') }}
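
For example, a model can select from the external table like this; the model file name is hypothetical:

-- models/external_orders.sql (hypothetical model name)
select *
from {{ source('external_example', 'hive_table_name') }}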

Dynamic Overwrite (StarRocks >= 3.4)

The incremental_strategy config property supports the following values:

  • default (or omitted): Standard inserts without overwrite.
  • insert_overwrite: Will apply overwrite with dynamic_overwrite = false to the inserts.
  • dynamic_overwrite: Will apply overwrite with dynamic_overwrite = true to the inserts.

For more details on the different behaviors, see StarRocks' documentation for INSERT.
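
Conceptually, the three strategies correspond to statements along the following lines. This is an illustrative sketch that assumes the StarRocks dynamic_overwrite session variable and placeholder table names (my_model, my_staging_model); it is not the exact SQL emitted by the adapter:

-- default: plain insert, no overwrite
INSERT INTO my_model SELECT * FROM my_staging_model;

-- insert_overwrite: full overwrite, dynamic_overwrite disabled
SET dynamic_overwrite = false;
INSERT OVERWRITE my_model SELECT * FROM my_staging_model;

-- dynamic_overwrite: only the partitions touched by the new rows are overwritten
SET dynamic_overwrite = true;
INSERT OVERWRITE my_model SELECT * FROM my_staging_model;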

Submittable ETL tasks

The implementation of the submittable ETL tasks is located in the impl.py file.

Setting is_async: true in your profiles.yml will enable submitting suitable ETL tasks using the submit task feature of StarRocks.

The SUBMIT TASK wrapper is applied automatically to any statement that supports submission; setting it manually per model is currently not supported by the adapter.

The following statements will be submitted automatically:

  • CREATE AS ... SELECT
  • INSERT INTO|OVERWRITE
  • CACHE SELECT ...

See StarRocks' documentation on SUBMIT TASK
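
As an illustration, with is_async enabled a model build that would normally run a CREATE TABLE AS statement is wrapped roughly like this; the table names and timeout are placeholders, and the exact statement generated by the adapter may differ:

submit /*+set_var(query_timeout=300)*/ task as
create table my_model as
select * from my_staging_model;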

Task Polling

Once the task has been submitted, the adapter will periodically poll StarRocks' information_schema.task_runs to retrieve the task status.

The polling is implemented with exponential backoff, with a maximum delay of 10 minutes. The adapter's connection to the StarRocks cluster is not kept open during the waiting period; it is re-opened right before the next polling attempt.
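
The status check is a query against the task_runs view; a hypothetical equivalent of what the adapter runs, where the task name is a placeholder and the exact columns and filter may differ, is:

SELECT state, error_message
FROM information_schema.task_runs
WHERE task_name = 'submitted_task_name';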

Controlling the task timeout

Setting the async_query_timeout property in profiles.yml controls the value of query_timeout when submitting a task.

It is injected into the SQL statement submitted to StarRocks:

submit /*+set_var(query_timeout={async_query_timeout})*/ task ...

Example profiles.yml configuration

my_profile:
  target: dev
  outputs:
    dev:
      type: starrocks
      host: host
      port: 9030
      schema: schema
      username: username
      password: password
      is_async: true
      async_query_timeout: 3600 # 1 hour

Test Adapter

Run the following:

python3 -m pytest tests/functional

For more details, consult the project repository.

Contributing

We welcome you to contribute to dbt-starrocks. Please see the Contributing Guide for more information.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dbt_starrocks-1.10.0.tar.gz (26.4 kB)

Uploaded Source

Built Distribution

dbt_starrocks-1.10.0-py3-none-any.whl (38.1 kB)

Uploaded Python 3

File details

Details for the file dbt_starrocks-1.10.0.tar.gz.

File metadata

  • Download URL: dbt_starrocks-1.10.0.tar.gz
  • Upload date:
  • Size: 26.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.9.21

File hashes

Hashes for dbt_starrocks-1.10.0.tar.gz
  • SHA256: b59fdedea94acd40b5dcc665ab374b1cfd24b4b97441a18bbdc4d96315d5a908
  • MD5: 1b5c6a7eac47cc2bce6280ba0f1b6c1b
  • BLAKE2b-256: def224d6ac8e9bb2fb6bef84a5119c22958f82b28bd70ed5bd0a8df2277c2b07

See more details on using hashes here.

File details

Details for the file dbt_starrocks-1.10.0-py3-none-any.whl.

File metadata

  • Download URL: dbt_starrocks-1.10.0-py3-none-any.whl
  • Upload date:
  • Size: 38.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.9.21

File hashes

Hashes for dbt_starrocks-1.10.0-py3-none-any.whl
  • SHA256: 2eb95882f24652e4df88367e488a6c198e154400a0c7e9fd4b2d8c857dec20c0
  • MD5: c2eb8153b66553ac6ed2a02350c4b703
  • BLAKE2b-256: 708db1b77b8dac4a434d7cfe7a0f57a4ef9d5deb40eba30879084f63a8768d3a

See more details on using hashes here.
