The SparkSQL plugin for dbt (data build tool)

dbt-spark

Documentation

For more information on using Spark with dbt, consult the dbt documentation.

Installation

This plugin can be installed via pip:

$ pip install dbt-spark
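
To pin a specific release instead (for example, 0.13.0):

$ pip install dbt-spark==0.13.0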

Configuring your profile

Connection Method

Connections can be made to Spark in two different modes: the http mode is used when connecting to a managed service, such as Databricks, that provides an HTTP endpoint, while the thrift mode is used to connect directly to the master node of a cluster (either on-premises or in the cloud).

A dbt profile can be configured to run against Spark using the following configuration:

Option | Description | Required? | Example
------ | ----------- | --------- | -------
method | Specify the connection method (thrift or http) | Required | http
schema | Specify the schema (database) to build models into | Required | analytics
host | The hostname to connect to | Required | yourorg.sparkhost.com
port | The port to connect to the host on | Optional (default: 443 for http, 10001 for thrift) | 443
token | The token to use for authenticating to the cluster | Required for http | abc123
cluster | The name of the cluster to connect to | Required for http | 01234-23423-coffeetime
user | The username to use to connect to the cluster | Optional | hadoop
connect_timeout | The number of seconds to wait before retrying to connect to a Pending Spark cluster | Optional (default: 10) | 60
connect_retries | The number of times to try connecting to a Pending Spark cluster before giving up | Optional (default: 0) | 5

Usage with Amazon EMR

To connect to Spark running on an Amazon EMR cluster, you will need to run sudo /usr/lib/spark/sbin/start-thriftserver.sh on the master node of the cluster to start the Thrift server (see https://aws.amazon.com/premiumsupport/knowledge-center/jdbc-connection-emr/ for further context). You will also need to connect on port 10001, which is the Spark backend Thrift server; port 10000 instead connects to a Hive backend, which will not work correctly with dbt.
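
For example, on the EMR master node (the --hiveconf flag is optional and only needed if the Thrift server in your setup does not already listen on port 10001):

$ sudo /usr/lib/spark/sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001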

Example profiles.yml entries:

# http connection (e.g. Databricks):
your_profile_name:
  target: dev
  outputs:
    dev:
      method: http
      type: spark
      schema: analytics
      host: yourorg.sparkhost.com
      port: 443
      token: abc123
      cluster: 01234-23423-coffeetime
      connect_retries: 5
      connect_timeout: 60

# thrift connection (e.g. Amazon EMR):
your_profile_name:
  target: dev
  outputs:
    dev:
      method: thrift
      type: spark
      schema: analytics
      host: 127.0.0.1
      port: 10001
      user: hadoop
      connect_retries: 5
      connect_timeout: 60
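
With a profile in place, dbt's built-in debug command will attempt a connection using the active target and report any problems:

$ dbt debug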

Usage Notes

Model Configuration

The following configurations can be supplied to models run with the dbt-spark plugin:

Option | Description | Required? | Example
------ | ----------- | --------- | -------
file_format | The file format to use when creating tables | Optional | parquet
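
For example, a plain table model could pin its storage format in its config block (a minimal sketch; the model body and the referenced model name are illustrative):

{{ config(
    materialized='table',
    file_format='parquet'
) }}

select * from {{ ref('events') }}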

Incremental Models

Spark does not natively support delete, update, or merge statements, so this plugin implements incremental models differently from most other dbt adapters. To use incremental models, specify a partition_by clause in your model config. On each run, dbt will use an insert overwrite query to replace the partitions included in the query's results, so be sure to re-select all of the relevant data for every partition the model touches.

{{ config(
    materialized='incremental',
    partition_by=['date_day'],
    file_format='parquet'
) }}

/*
  Every partition returned by this query will be overwritten
  when this model runs
*/

select
    date_day,
    count(*) as users

from {{ ref('events') }}
where cast(date_day as date) >= '2019-01-01'
group by 1
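
Conceptually, a run of this model executes a dynamic-partition insert overwrite. The statement below is a rough sketch, not the adapter's exact generated SQL; the table names (analytics.user_counts, analytics.events) are illustrative:

insert overwrite table analytics.user_counts
partition (date_day)
select
    count(*) as users,
    date_day  -- Spark requires partition columns last in the select list
from analytics.events
where cast(date_day as date) >= '2019-01-01'
group by date_day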

Reporting bugs and contributing code

  • Want to report a bug or request a feature? Let us know on Slack, or open an issue.

Code of Conduct

Everyone interacting in the dbt project's codebases, issue trackers, chat rooms, and mailing lists is expected to follow the PyPA Code of Conduct.

