The Apache Iceberg adapter plugin for dbt with spark-submit and spark-sql CLI support
Project description
dbt
dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
dbt is the T in ELT. Organize, cleanse, denormalize, filter, rename, and pre-aggregate the raw data in your warehouse so that it's ready for analysis.
dbt-spark
dbt-spark enables dbt to work with Apache Spark.
For more information on using dbt with Spark, consult the docs.
Getting started
Start with the repository README.md; most of the information there also applies to dbt-spark.
Running locally
A docker-compose environment starts three services:
| Service | Description | Port |
|---|---|---|
| dbt-spark3-thrift | Spark Thrift Server (HiveThriftServer2), a long-running JDBC endpoint | 10000 |
| dbt-spark-history | Spark History Server, a UI for completed queries and jobs | 18080 |
| dbt-hive-metastore | Postgres-backed Hive Metastore | internal |
Note: dbt-spark now supports Spark 4.1.0.
The following command starts all containers:
```shell
docker-compose up -d
```
The services take a little while to start; check the container logs to monitor progress. If the instance doesn't come up cleanly, run the complete reset command listed below and try again.
Create a profile using the spark_sql method in thrift server mode:
```yaml
spark_testing:
  target: local
  outputs:
    local:
      type: spark
      method: spark_sql
      host: 127.0.0.1
      port: 10000
      user: dbt
      schema: analytics
      connect_retries: 5
      connect_timeout: 60
      retry_all: true
      spark_history_server: http://localhost:18080
```
Or using the thrift method directly:
```yaml
spark_testing:
  target: local
  outputs:
    local:
      type: spark
      method: thrift
      host: 127.0.0.1
      port: 10000
      user: dbt
      schema: analytics
      connect_retries: 5
      connect_timeout: 60
      retry_all: true
```
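The retry settings in these profiles behave like a bounded retry loop around opening the thrift connection. Below is a minimal sketch of the semantics, assuming `connect_timeout` is the sleep between attempts; the helper name is illustrative, not the adapter's actual internals:

```python
import time

def connect_with_retries(open_connection, connect_retries=5,
                         connect_timeout=60, retry_all=True):
    """Try to open a connection, retrying on failure.

    open_connection: a zero-argument callable that returns a connection
    or raises on failure. connect_timeout is the delay in seconds
    between attempts; retry_all retries on any exception rather than
    only transient ones.
    """
    last_error = None
    for attempt in range(1 + connect_retries):
        try:
            return open_connection()
        except Exception as exc:
            last_error = exc
            if not retry_all:
                raise
            if attempt < connect_retries:
                time.sleep(connect_timeout)
    raise last_error
```

With `connect_retries: 5` the connection is attempted up to six times in total before the last error is re-raised.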
Connecting to the local Spark instance:
- Spark UI (active queries): http://localhost:4040/sqlserver/
- Spark History Server (completed queries): http://localhost:18080
- Thrift endpoint: `jdbc:hive2://localhost:10000` (credentials `dbt`/`dbt`)
Note that the Hive metastore data is persisted under ./.hive-metastore/, Spark warehouse data under ./.spark-warehouse/, and event logs under ./.spark-events/. To completely reset your environment run:
```shell
docker-compose down
rm -rf ./.hive-metastore/ ./.spark-warehouse/ ./.spark-events/
```
Additional configuration for macOS
If installing on macOS, use Homebrew to install the required dependencies:
```shell
brew install unixodbc
```
Configuring spark-submit and spark-sql methods
The `spark_submit` and `spark_sql` connection methods run SQL and Python models using either a long-running Spark Thrift Server or the Spark CLI tools directly.
Requirements
- A working Spark installation accessible via `SPARK_HOME` or on `PATH` (CLI mode), or a running `HiveThriftServer2` (thrift server mode).
- For `spark_submit`: `spark-submit` must be executable.
- For `spark_sql` CLI mode: `spark-sql` must be executable.
- For `spark_sql` thrift server mode: `pyhive` must be installed (`pip install "dbt-spark[PyHive]"`).
spark-sql — thrift server mode (recommended)
When `host` is set, `spark_sql` connects to a long-running Spark Thrift Server (HiveThriftServer2) via the Thrift protocol. This avoids the JVM startup cost of spawning a new Spark application for every dbt invocation, and lets you share a single Spark context across many dbt runs. The Spark UI and History Server remain accessible throughout.
```yaml
my_spark_project:
  target: dev
  outputs:
    dev:
      type: spark
      method: spark_sql
      host: 127.0.0.1
      port: 10000
      schema: analytics
      user: dbt
      connect_retries: 5
      connect_timeout: 60
      retry_all: true
      # Optional: URL of the Spark History Server UI (informational, logged at startup)
      spark_history_server: http://127.0.0.1:18080
      # Optional: server-side Spark configuration applied to each session
      server_side_parameters:
        spark.executor.memory: "4g"
        spark.eventLog.enabled: "true"
        spark.eventLog.dir: /tmp/spark-events
```
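Server-side parameters are applied per session on the thrift server. One way to picture this is as a series of `SET key=value` statements issued right after the connection opens; this is an illustrative sketch, and the adapter may instead pass a configuration mapping when the session is created:

```python
def session_setup_statements(server_side_parameters):
    """Render each server-side parameter as a Spark SQL SET statement,
    to be executed once per session after the thrift connection opens."""
    return [f"SET {key}={value}"
            for key, value in server_side_parameters.items()]
```

For the profile above this would yield `SET spark.executor.memory=4g` and so on, scoped to that session only.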
Start a thrift server locally using the bundled docker-compose:
```shell
docker-compose up -d
```
Or start one manually:
```shell
$SPARK_HOME/sbin/start-thriftserver.sh \
  --master local[*] \
  --conf spark.sql.ansi.enabled=false \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=/tmp/spark-events
```
spark-sql — CLI mode
When `host` is not set, dbt executes each statement with `spark-sql -e '...'` (or `-f` for batches). A new Spark application is launched per dbt invocation.
```yaml
my_spark_project:
  target: dev
  outputs:
    dev:
      type: spark
      method: spark_sql
      schema: analytics
      # Optional: explicit path to SPARK_HOME if not set in the environment
      spark_home: /opt/spark
      # Optional: extra CLI flags passed to spark-sql
      spark_sql_args:
        - "--master"
        - "local[*]"
        - "--conf"
        - "spark.executor.memory=4g"
```
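The CLI invocation described above amounts to assembling an argument list from the profile and shelling out once per invocation. A hedged sketch of that assembly (the helper name is hypothetical, not the adapter's actual code):

```python
def build_spark_sql_command(sql, spark_sql_args=None):
    """Assemble the argv for one CLI-mode statement: the spark-sql
    binary, any profile-level extra flags, then the statement via -e."""
    return ["spark-sql", *(spark_sql_args or []), "-e", sql]

# dbt would then run something like:
#   subprocess.run(build_spark_sql_command(...), check=True)
```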
spark-submit
Use `spark_submit` when your project includes Python models. dbt executes Python models via `spark-submit`. SQL statements (e.g. catalog introspection) fall back to thrift when `host` is set, or to the `spark-sql` CLI otherwise.
With a thrift server (recommended):
```yaml
my_spark_project:
  target: dev
  outputs:
    dev:
      type: spark
      method: spark_submit
      host: 127.0.0.1
      port: 10000
      schema: analytics
      user: dbt
      spark_history_server: http://127.0.0.1:18080
      # Extra CLI flags for spark-submit (Python models only)
      spark_submit_args:
        - "--master"
        - "local[*]"
        - "--conf"
        - "spark.executor.memory=4g"
      spark_submit_timeout: 3600
```
CLI only (no thrift server):
```yaml
my_spark_project:
  target: dev
  outputs:
    dev:
      type: spark
      method: spark_submit
      schema: analytics
      spark_home: /opt/spark
      spark_submit_args:
        - "--master"
        - "local[*]"
        - "--conf"
        - "spark.executor.memory=4g"
      spark_sql_args:
        - "--master"
        - "local[*]"
      spark_submit_timeout: 3600
```
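For Python models, the assembled `spark-submit` argv follows the same shape, with `spark_submit_timeout` mapping naturally onto a subprocess timeout. A sketch under those assumptions (the helper name and model path are hypothetical):

```python
def build_spark_submit_command(script_path, spark_submit_args=None):
    """Assemble the argv for running one compiled Python model via
    spark-submit: the binary, any profile-level flags, then the script."""
    return ["spark-submit", *(spark_submit_args or []), script_path]

# spark_submit_timeout then corresponds to something like:
#   subprocess.run(cmd, check=True, timeout=3600)
# which raises subprocess.TimeoutExpired if the job outlives the limit.
```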
Profile fields
| Field | Required | Default | Description |
|---|---|---|---|
| `method` | Yes | — | `spark_sql` or `spark_submit` |
| `schema` | Yes | — | The default schema (database) to use |
| `host` | No | — | Thrift server hostname. When set, `spark_sql` and `spark_submit` connect via Thrift instead of the CLI |
| `port` | No | `443` | Thrift server port (typically `10000`) |
| `user` | No | — | Username for thrift server authentication |
| `auth` | No | — | Auth mechanism for thrift (e.g. `NONE`, `LDAP`, `KERBEROS`) |
| `use_ssl` | No | `false` | Use SSL/TLS for the thrift connection |
| `spark_history_server` | No | — | URL of the Spark History Server UI (informational) |
| `spark_home` | No | `$SPARK_HOME` env var | Path to your Spark installation (CLI mode) |
| `spark_sql_args` | No | `[]` | Extra CLI flags for `spark-sql` (CLI mode only) |
| `spark_submit_args` | No | `[]` | Extra CLI flags for `spark-submit` (Python models only) |
| `spark_submit_timeout` | No | `null` (no timeout) | Max seconds to wait for a `spark-submit` job |
| `poll_interval` | No | `5` | Seconds between status polls for thrift queries |
| `query_timeout` | No | `null` (no timeout) | Max seconds for a single thrift query |
| `query_retries` | No | `1` | Times to retry on connection loss during query polling |
| `server_side_parameters` | No | `{}` | Spark configuration sent to the thrift server per session |
Notes
- Thrift server mode (`host` set) requires `pip install "dbt-spark[PyHive]"` for both `spark_sql` and `spark_submit`.
- `spark_submit` only invokes `spark-submit` for Python models. SQL statements use thrift when `host` is set, otherwise the `spark-sql` CLI.
- `spark_sql_args` is only used in CLI mode; it is ignored when `host` is set.
- If `spark_home` is not set in the profile, dbt falls back to the `SPARK_HOME` environment variable, then to `spark-sql`/`spark-submit` on `PATH`.
- The Spark History Server requires event logging on the thrift server (`spark.eventLog.enabled=true`). The bundled docker-compose configures this automatically.
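The binary-resolution order in the notes above can be expressed as a small resolver (a hypothetical helper; `shutil.which` performs the `PATH` lookup):

```python
import os
import shutil

def resolve_spark_binary(name, spark_home=None, env=None):
    """Resolve a Spark CLI binary such as "spark-sql" or "spark-submit":
    profile-level spark_home first, then the SPARK_HOME environment
    variable, then a plain PATH lookup (None if not found)."""
    env = os.environ if env is None else env
    home = spark_home or env.get("SPARK_HOME")
    if home:
        return os.path.join(home, "bin", name)
    return shutil.which(name)
```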
Contribute
- Want to help us build `dbt-spark`? Check out the Contributing Guide.
Project details
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file dbt_iceberg-1.0.4.tar.gz.
File metadata
- Download URL: dbt_iceberg-1.0.4.tar.gz
- Upload date:
- Size: 85.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `d561101b84cd106a7b8efc337529972c4eb4a44aa5197b6ef1e225f7e3355325` |
| MD5 | `84d7c8ff772d1e7b5bad7d5d8349921b` |
| BLAKE2b-256 | `6834a9dd63c6da99ce0d0729a6e1c00895408994a21d8dc9942002ebd397de56` |
Provenance
The following attestation bundles were made for dbt_iceberg-1.0.4.tar.gz:
Publisher: `publish-dbt-iceberg.yml` on `theserverkid/dbt-adapters`
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: `dbt_iceberg-1.0.4.tar.gz`
- Subject digest: `d561101b84cd106a7b8efc337529972c4eb4a44aa5197b6ef1e225f7e3355325`
- Sigstore transparency entry: 1003625711
- Sigstore integration time:
- Permalink: `theserverkid/dbt-adapters@75054593390721f837ef50c13cac71e632580ca7`
- Branch / Tag: `refs/heads/iceberg`
- Owner: https://github.com/theserverkid
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: `publish-dbt-iceberg.yml@75054593390721f837ef50c13cac71e632580ca7`
- Trigger Event: push
File details
Details for the file dbt_iceberg-1.0.4-py3-none-any.whl.
File metadata
- Download URL: dbt_iceberg-1.0.4-py3-none-any.whl
- Upload date:
- Size: 57.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `3fed0efc30a427175a14a442eccbbd13359b8ef56e75244712810b0825a50eff` |
| MD5 | `3c294d6eb21e234e4ade52466ef753db` |
| BLAKE2b-256 | `8564763da0c5de997096b37ce09a2ee8255a90a97eaba2da088e6ce689a0b41a` |
Provenance
The following attestation bundles were made for dbt_iceberg-1.0.4-py3-none-any.whl:
Publisher: `publish-dbt-iceberg.yml` on `theserverkid/dbt-adapters`
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: `dbt_iceberg-1.0.4-py3-none-any.whl`
- Subject digest: `3fed0efc30a427175a14a442eccbbd13359b8ef56e75244712810b0825a50eff`
- Sigstore transparency entry: 1003625719
- Sigstore integration time:
- Permalink: `theserverkid/dbt-adapters@75054593390721f837ef50c13cac71e632580ca7`
- Branch / Tag: `refs/heads/iceberg`
- Owner: https://github.com/theserverkid
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: `publish-dbt-iceberg.yml@75054593390721f837ef50c13cac71e632580ca7`
- Trigger Event: push