The SparkSQL plugin for dbt (data build tool)
Project description
dbt-spark
Documentation
For more information on using Spark with dbt, consult the dbt documentation.
Installation
This plugin can be installed via pip:
# Install prerequisites:
$ pip install pyhive[hive]
# Install dbt-spark:
$ pip install dbt-spark
Configuring your profile
Connection Method
Connections can be made to Spark in two different modes. The http
mode is used when connecting to a managed service such as Databricks, which provides an HTTP endpoint; the thrift
mode is used to connect directly to the master node of a cluster (either on-premise or in the cloud).
A dbt profile can be configured to run against Spark using the following configuration:
Option | Description | Required? | Example |
---|---|---|---|
method | Specify the connection method (thrift or http ) |
Required | http |
schema | Specify the schema (database) to build models into | Required | analytics |
host | The hostname to connect to | Required | yourorg.sparkhost.com |
port | The port to connect to the host on | Optional (default: 443 for http , 10001 for thrift ) |
443 |
token | The token to use for authenticating to the cluster | Required for http |
abc123 |
cluster | The name of the cluster to connect to | Required for http |
01234-23423-coffeetime |
user | The username to use to connect to the cluster | Optional | hadoop |
connect_timeout | The number of seconds to wait before retrying to connect to a Pending Spark cluster | Optional (default: 10) | 60 |
connect_retries | The number of times to try connecting to a Pending Spark cluster before giving up | Optional (default: 0) | 5 |
Usage with Amazon EMR
To connect to Spark running on an Amazon EMR cluster, you will need to run sudo /usr/lib/spark/sbin/start-thriftserver.sh
on the master node of the cluster to start the Thrift server (see https://aws.amazon.com/premiumsupport/knowledge-center/jdbc-connection-emr/ for further context). You will also need to connect to port 10001
, which will connect to the Spark backend Thrift server; port 10000
will instead connect to a Hive backend, which will not work correctly with dbt.
Example profiles.yml entries:
your_profile_name:
target: dev
outputs:
dev:
method: http
type: spark
schema: analytics
host: yourorg.sparkhost.com
port: 443
token: abc123
cluster: 01234-23423-coffeetime
connect_retries: 5
connect_timeout: 60
your_profile_name:
target: dev
outputs:
dev:
method: thrift
type: spark
schema: analytics
host: 127.0.0.1
port: 10001
user: hadoop
connect_retries: 5
connect_timeout: 60
Usage Notes
Model Configuration
The following configurations can be supplied to models run with the dbt-spark plugin:
Option | Description | Required? | Example |
---|---|---|---|
file_format | The file format to use when creating tables | Optional | parquet |
Incremental Models
Spark does not natively support delete
, update
, or merge
statements. As such, incremental models
are implemented differently than usual in this plugin. To use incremental models, specify a partition_by
clause in your model config.
dbt will use an insert overwrite
query to overwrite the partitions included in your query. Be sure to re-select all of the relevant
data for a partition when using incremental models.
{{ config(
materialized='incremental',
partition_by=['date_day'],
file_format='parquet'
) }}
/*
Every partition returned by this query will be overwritten
when this model runs
*/
select
date_day,
count(*) as users
from {{ ref('events') }}
where date_day::date >= '2019-01-01'
group by 1
Reporting bugs and contributing code
Code of Conduct
Everyone interacting in the dbt project's codebases, issue trackers, chat rooms, and mailing lists is expected to follow the PyPA Code of Conduct.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for dbt_spark-0.14.3-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 019ee2f1d81831f74d7c78ff151811811ac897b277cf8b7617fd6c7edc290bd2 |
|
MD5 | 2086bf50c5cfd9f85a84f1e4b4eb0aa4 |
|
BLAKE2b-256 | 6f37f2e676ca33c19f9881252ac1e83960d1125035d8521d1076f0f6041e22f4 |