The Apache Iceberg adapter plugin for dbt with spark-submit and spark-sql CLI support


dbt

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.

dbt is the T in ELT. Organize, cleanse, denormalize, filter, rename, and pre-aggregate the raw data in your warehouse so that it's ready for analysis.

dbt-spark

dbt-spark enables dbt to work with Apache Spark. For more information on using dbt with Spark, consult the docs.

Getting started

Review the repository README.md; most of that information also applies to dbt-spark.

Running locally

A docker-compose environment starts three services:

Service             Description                                                             Port
dbt-spark3-thrift   Spark Thrift Server (HiveThriftServer2), a long-running JDBC endpoint  10000
dbt-spark-history   Spark History Server, a UI for completed queries and jobs              18080
dbt-hive-metastore  Postgres-backed Hive Metastore                                          internal

Note: dbt-spark now supports Spark 4.1.0.

The following command starts all containers:

docker-compose up -d

The containers take a little while to start; check the container logs if things look stuck. If the instance doesn't come up correctly, run the complete reset commands listed below and try again.
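
For example, to check container status and follow the thrift server's logs (service name taken from the table above):

docker-compose ps
docker-compose logs -f dbt-spark3-thrift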

Create a profile using the spark_sql method in thrift server mode:

spark_testing:
  target: local
  outputs:
    local:
      type: spark
      method: spark_sql
      host: 127.0.0.1
      port: 10000
      user: dbt
      schema: analytics
      connect_retries: 5
      connect_timeout: 60
      retry_all: true
      spark_history_server: http://localhost:18080

Or using the thrift method directly:

spark_testing:
  target: local
  outputs:
    local:
      type: spark
      method: thrift
      host: 127.0.0.1
      port: 10000
      user: dbt
      schema: analytics
      connect_retries: 5
      connect_timeout: 60
      retry_all: true
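
With either profile saved, you can verify the connection using dbt's built-in check (target name local as in the examples above):

dbt debug --target local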

Persistence and resetting

Hive metastore data is persisted under ./.hive-metastore/, Spark warehouse data under ./.spark-warehouse/, and event logs under ./.spark-events/. To completely reset your environment, run:

docker-compose down
rm -rf ./.hive-metastore/ ./.spark-warehouse/ ./.spark-events/

Additional configuration for macOS

If you are installing on macOS, use Homebrew to install the required dependencies:

brew install unixodbc

Configuring spark-submit and spark-sql methods

The spark_submit and spark_sql connection methods run SQL and Python models using either a long-running Spark Thrift Server or the Spark CLI tools directly.

Requirements

  • A working Spark installation accessible via SPARK_HOME or on PATH (CLI mode), or a running HiveThriftServer2 (thrift server mode).
  • For spark_submit: spark-submit must be executable.
  • For spark_sql CLI mode: spark-sql must be executable.
  • For spark_sql thrift server mode: pyhive must be installed (pip install dbt-spark[PyHive]).
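
A quick sanity check that these prerequisites are met; the pyhive import matters only for thrift server mode:

spark-submit --version
spark-sql --version
python -c "import pyhive"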

spark-sql — thrift server mode (recommended)

When host is set, spark_sql connects to a long-running Spark Thrift Server (HiveThriftServer2) via the Thrift protocol. This avoids the JVM startup cost of spawning a new Spark application for every dbt invocation, and lets you share a single Spark context across many dbt runs. The Spark UI and History Server remain accessible throughout.

my_spark_project:
  target: dev
  outputs:
    dev:
      type: spark
      method: spark_sql
      host: 127.0.0.1
      port: 10000
      schema: analytics
      user: dbt
      connect_retries: 5
      connect_timeout: 60
      retry_all: true
      # Optional: URL of the Spark History Server UI (informational — logged at startup)
      spark_history_server: http://127.0.0.1:18080
      # Optional: server-side Spark configuration applied to each session
      server_side_parameters:
        spark.executor.memory: "4g"
        spark.eventLog.enabled: "true"
        spark.eventLog.dir: /tmp/spark-events

Start a thrift server locally using the bundled docker-compose:

docker-compose up -d

Or start one manually:

$SPARK_HOME/sbin/start-thriftserver.sh \
  --master local[*] \
  --conf spark.sql.ansi.enabled=false \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=/tmp/spark-events
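
To confirm the server is accepting connections independently of dbt, you can run a trivial query with the beeline client bundled with Spark (host, port, and user as in the profile above):

$SPARK_HOME/bin/beeline -u jdbc:hive2://127.0.0.1:10000 -n dbt -e 'SHOW DATABASES;'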

spark-sql — CLI mode

When host is not set, dbt executes each statement with spark-sql -e '...' (or -f for batches). A new Spark application is launched per dbt invocation.

my_spark_project:
  target: dev
  outputs:
    dev:
      type: spark
      method: spark_sql
      schema: analytics
      # Optional: explicit path to SPARK_HOME if not set in environment
      spark_home: /opt/spark
      # Optional: extra CLI flags passed to spark-sql
      spark_sql_args:
        - "--master"
        - "local[*]"
        - "--conf"
        - "spark.executor.memory=4g"

spark-submit

Use spark_submit when your project includes Python models. dbt executes Python models via spark-submit, while SQL statements (e.g. catalog introspection) run over thrift when host is set, or through the spark-sql CLI otherwise.

With a thrift server (recommended):

my_spark_project:
  target: dev
  outputs:
    dev:
      type: spark
      method: spark_submit
      host: 127.0.0.1
      port: 10000
      schema: analytics
      user: dbt
      spark_history_server: http://127.0.0.1:18080
      # Extra CLI flags for spark-submit (Python models only)
      spark_submit_args:
        - "--master"
        - "local[*]"
        - "--conf"
        - "spark.executor.memory=4g"
      spark_submit_timeout: 3600

CLI only (no thrift server):

my_spark_project:
  target: dev
  outputs:
    dev:
      type: spark
      method: spark_submit
      schema: analytics
      spark_home: /opt/spark
      spark_submit_args:
        - "--master"
        - "local[*]"
        - "--conf"
        - "spark.executor.memory=4g"
      spark_sql_args:
        - "--master"
        - "local[*]"
      spark_submit_timeout: 3600
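
For each Python model, dbt hands the compiled model script to spark-submit along with your spark_submit_args; a rough manual equivalent, with a hypothetical script path (dbt generates and manages the real one internally):

spark-submit --master 'local[*]' --conf spark.executor.memory=4g /tmp/compiled_model.py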

Profile fields

Field                   Required  Default              Description
method                  Yes       -                    spark_sql or spark_submit
schema                  Yes       -                    The default schema (database) to use
host                    No        -                    Thrift server hostname; when set, spark_sql and spark_submit connect via Thrift instead of the CLI
port                    No        443                  Thrift server port (typically 10000)
user                    No        -                    Username for thrift server authentication
auth                    No        -                    Auth mechanism for thrift (e.g. NONE, LDAP, KERBEROS)
use_ssl                 No        false                Use SSL/TLS for the thrift connection
spark_history_server    No        -                    URL of the Spark History Server UI (informational)
spark_home              No        $SPARK_HOME env var  Path to your Spark installation (CLI mode)
spark_sql_args          No        []                   Extra CLI flags for spark-sql (CLI mode only)
spark_submit_args       No        []                   Extra CLI flags for spark-submit (Python models only)
spark_submit_timeout    No        null (no timeout)    Max seconds to wait for a spark-submit job
poll_interval           No        5                    Seconds between status polls for thrift queries
query_timeout           No        null (no timeout)    Max seconds for a single thrift query
query_retries           No        1                    Times to retry on connection loss during query polling
server_side_parameters  No        {}                   Spark configuration sent to the thrift server per session

Notes

  • Thrift server mode (host set) requires pip install dbt-spark[PyHive] for both spark_sql and spark_submit.
  • spark_submit only invokes spark-submit for Python models. SQL statements use thrift when host is set, otherwise spark-sql CLI.
  • spark_sql_args is only used in CLI mode; it is ignored when host is set.
  • If spark_home is not set in the profile, dbt falls back to the SPARK_HOME environment variable, then to spark-sql/spark-submit on PATH.
  • The Spark History Server requires event logging on the thrift server (spark.eventLog.enabled=true). The bundled docker-compose configures this automatically.
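
Once event logging is enabled, completed runs can also be listed through the History Server's REST API (port as in the docker-compose setup above):

curl -s http://localhost:18080/api/v1/applications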

Contribute

