The Apache Iceberg adapter plugin for dbt with spark-submit and spark-sql CLI support

Project description

dbt

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.

dbt is the T in ELT. Organize, cleanse, denormalize, filter, rename, and pre-aggregate the raw data in your warehouse so that it's ready for analysis.

dbt-spark

dbt-spark enables dbt to work with Apache Spark. For more information on using dbt with Spark, consult the docs.

Getting started

Review the repository README.md, as most of the information there also applies to dbt-spark.

Running locally

A docker-compose environment starts three services:

Service Description Port
dbt-spark3-thrift Spark Thrift Server (HiveThriftServer2) — long-running JDBC endpoint 10000
dbt-spark-history Spark History Server — UI for completed queries and jobs 18080
dbt-hive-metastore Postgres-backed Hive Metastore internal

Note: dbt-spark now supports Spark 4.1.0.

The following command starts all containers:

docker-compose up -d

The containers take a little while to start; check the container logs to monitor progress. If the instance doesn't start correctly, run the complete reset command listed below and then try again.

Create a profile using the spark_sql method in thrift server mode:

spark_testing:
  target: local
  outputs:
    local:
      type: spark
      method: spark_sql
      host: 127.0.0.1
      port: 10000
      user: dbt
      schema: analytics
      connect_retries: 5
      connect_timeout: 60
      retry_all: true
      spark_history_server: http://localhost:18080

Or using the thrift method directly:

spark_testing:
  target: local
  outputs:
    local:
      type: spark
      method: thrift
      host: 127.0.0.1
      port: 10000
      user: dbt
      schema: analytics
      connect_retries: 5
      connect_timeout: 60
      retry_all: true

Resetting the local Spark instance:

Note that the Hive metastore data is persisted under ./.hive-metastore/, Spark warehouse data under ./.spark-warehouse/, and event logs under ./.spark-events/. To completely reset your environment run:

docker-compose down
rm -rf ./.hive-metastore/ ./.spark-warehouse/ ./.spark-events/

Additional Configuration for macOS

If installing on macOS, use Homebrew to install the required dependencies.

brew install unixodbc

Configuring spark-submit and spark-sql methods

The spark_submit and spark_sql connection methods run SQL and Python models using either a long-running Spark Thrift Server or the Spark CLI tools directly.

Requirements

  • A working Spark installation accessible via SPARK_HOME or on PATH (CLI mode), or a running HiveThriftServer2 (thrift server mode).
  • For spark_submit: spark-submit must be executable.
  • For spark_sql CLI mode: spark-sql must be executable.
  • For spark_sql thrift server mode: pyhive must be installed (pip install dbt-spark[PyHive]).

spark-sql — thrift server mode (recommended)

When host is set, spark_sql connects to a long-running Spark Thrift Server (HiveThriftServer2) via the Thrift protocol. This avoids the JVM startup cost of spawning a new Spark application for every dbt invocation, and lets you share a single Spark context across many dbt runs. The Spark UI and History Server remain accessible throughout.

my_spark_project:
  target: dev
  outputs:
    dev:
      type: spark
      method: spark_sql
      host: 127.0.0.1
      port: 10000
      schema: analytics
      user: dbt
      connect_retries: 5
      connect_timeout: 60
      retry_all: true
      # Optional: URL of the Spark History Server UI (informational — logged at startup)
      spark_history_server: http://127.0.0.1:18080
      # Optional: server-side Spark configuration applied to each session
      server_side_parameters:
        spark.executor.memory: "4g"
        spark.eventLog.enabled: "true"
        spark.eventLog.dir: /tmp/spark-events
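A common way to apply per-session settings over a Thrift connection is to issue SET statements after connecting. The sketch below shows how a server_side_parameters mapping could be translated into such statements; build_set_statements is an illustrative helper, not the adapter's actual API:

```python
def build_set_statements(server_side_parameters):
    """Translate a server_side_parameters mapping into SET statements
    to be executed once at the start of each Thrift session."""
    return [f"SET {key}={value}" for key, value in server_side_parameters.items()]

params = {
    "spark.executor.memory": "4g",
    "spark.eventLog.enabled": "true",
}
statements = build_set_statements(params)
# Each statement would then be executed on a fresh session, e.g. via a
# pyhive cursor: cursor.execute(statement)
```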

Start a thrift server locally using the bundled docker-compose:

docker-compose up -d

Or start one manually:

$SPARK_HOME/sbin/start-thriftserver.sh \
  --master local[*] \
  --conf spark.sql.ansi.enabled=false \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=/tmp/spark-events

spark-sql — CLI mode

When host is not set, dbt executes each statement with spark-sql -e '...' (or -f for batches). A new Spark application is launched per dbt invocation.

my_spark_project:
  target: dev
  outputs:
    dev:
      type: spark
      method: spark_sql
      schema: analytics
      # Optional: explicit path to SPARK_HOME if not set in environment
      spark_home: /opt/spark
      # Optional: extra CLI flags passed to spark-sql
      spark_sql_args:
        - "--master"
        - "local[*]"
        - "--conf"
        - "spark.executor.memory=4g"
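Conceptually, CLI mode shells out to spark-sql with the profile's extra flags followed by -e and the statement. A rough sketch of how that argv might be assembled (build_spark_sql_command is a hypothetical helper, not the adapter's real internals):

```python
import os
import shlex

def build_spark_sql_command(sql, spark_home=None, spark_sql_args=None):
    """Assemble an argv list for `spark-sql -e '<sql>'`, resolving the
    binary from spark_home when given, otherwise relying on PATH."""
    binary = os.path.join(spark_home, "bin", "spark-sql") if spark_home else "spark-sql"
    return [binary, *(spark_sql_args or []), "-e", sql]

cmd = build_spark_sql_command(
    "SELECT 1",
    spark_home="/opt/spark",
    spark_sql_args=["--master", "local[*]"],
)
print(shlex.join(cmd))  # shell-quoted form of the command
```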

spark-submit

Use spark_submit when your project includes Python models. dbt executes Python models via spark-submit. SQL statements (e.g. catalog introspection) fall back to thrift when host is set, or to the spark-sql CLI otherwise.

With a thrift server (recommended):

my_spark_project:
  target: dev
  outputs:
    dev:
      type: spark
      method: spark_submit
      host: 127.0.0.1
      port: 10000
      schema: analytics
      user: dbt
      spark_history_server: http://127.0.0.1:18080
      # Extra CLI flags for spark-submit (Python models only)
      spark_submit_args:
        - "--master"
        - "local[*]"
        - "--conf"
        - "spark.executor.memory=4g"
      spark_submit_timeout: 3600

CLI only (no thrift server):

my_spark_project:
  target: dev
  outputs:
    dev:
      type: spark
      method: spark_submit
      schema: analytics
      spark_home: /opt/spark
      spark_submit_args:
        - "--master"
        - "local[*]"
        - "--conf"
        - "spark.executor.memory=4g"
      spark_sql_args:
        - "--master"
        - "local[*]"
      spark_submit_timeout: 3600
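The dispatch of a Python model through spark-submit, with spark_submit_timeout enforced, can be sketched as follows. The function names and the model path are illustrative assumptions, not the adapter's actual entry points:

```python
import subprocess

def build_spark_submit_command(model_path, spark_submit_args=None):
    """Assemble the argv for running a compiled Python model via spark-submit."""
    return ["spark-submit", *(spark_submit_args or []), model_path]

def run_python_model(model_path, spark_submit_args=None, timeout=3600):
    """Run the model, enforcing the timeout via subprocess.

    subprocess.run raises TimeoutExpired if the job exceeds `timeout` seconds.
    """
    cmd = build_spark_submit_command(model_path, spark_submit_args)
    return subprocess.run(cmd, timeout=timeout, check=True)

cmd = build_spark_submit_command(
    "target/run/my_model.py",  # hypothetical compiled-model path
    ["--master", "local[*]", "--conf", "spark.executor.memory=4g"],
)
```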

spark-auto — automatic routing (recommended for mixed environments)

Use spark_auto when you want a single profile that works whether or not a Spark Thrift Server is running. At connection time the adapter tries to reach the Thrift server; if it is unavailable it transparently falls back to the spark-sql CLI. Python models always use spark-submit regardless of which SQL path is active.

my_spark_project:
  target: dev
  outputs:
    dev:
      type: iceberg
      method: spark_auto
      host: 127.0.0.1       # Thrift server — tried first; CLI used if unreachable
      port: 10000
      user: dbt
      schema: analytics
      connect_retries: 5
      connect_timeout: 60
      retry_all: true
      spark_history_server: http://127.0.0.1:18080
      # Spark config applied when connecting via Thrift
      server_side_parameters:
        spark.sql.extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
        spark.sql.defaultCatalog: my_catalog
      # Extra flags passed to spark-sql CLI (fallback path)
      spark_sql_args:
        - "--conf"
        - "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
      # Extra flags for spark-submit (Python models)
      spark_submit_args:
        - "--conf"
        - "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
      spark_submit_timeout: 3600
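The fallback decision in spark_auto can be illustrated with a small probe of the Thrift port. This is a sketch of the routing idea only, not the adapter's actual logic (the real implementation may also honor connect_retries and connect_timeout):

```python
import socket

def choose_sql_path(host, port, probe=None, timeout=2.0):
    """Return "thrift" if the Thrift port accepts a TCP connection,
    otherwise "spark_sql_cli". `probe` is injectable for testing."""
    if probe is None:
        def probe():
            # Attempt a plain TCP connection to the Thrift endpoint
            with socket.create_connection((host, port), timeout=timeout):
                return True
    try:
        probe()
        return "thrift"
    except OSError:
        return "spark_sql_cli"
```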

Profile fields

Field Required Default Description
method Yes — spark_sql, spark_submit, or spark_auto
schema Yes — The default schema (database) to use
host No — Thrift server hostname. When set, spark_sql, spark_submit, and spark_auto connect via Thrift instead of the CLI. spark_auto falls back to the CLI if the server is unreachable.
port No 443 Thrift server port (typically 10000)
user No — Username for thrift server authentication
auth No — Auth mechanism for thrift (e.g. NONE, LDAP, KERBEROS)
use_ssl No false Use SSL/TLS for the thrift connection
spark_history_server No — URL of the Spark History Server UI (informational)
spark_home No $SPARK_HOME env var Path to your Spark installation (CLI mode)
spark_sql_args No [] Extra CLI flags for spark-sql (CLI/fallback mode only)
spark_submit_args No [] Extra CLI flags for spark-submit (Python models only)
spark_submit_timeout No null (no timeout) Max seconds to wait for a spark-submit job
poll_interval No 5 Seconds between status polls for thrift queries
query_timeout No null (no timeout) Max seconds for a single thrift query
query_retries No 1 Times to retry on connection loss during query polling
server_side_parameters No {} Spark configuration sent to the thrift server per session

Notes

  • Thrift server mode (host set) requires pip install dbt-spark[PyHive] for spark_sql, spark_submit, and spark_auto.
  • spark_submit and spark_auto only invoke spark-submit for Python models. SQL statements use thrift when host is set (or reachable for spark_auto), otherwise spark-sql CLI.
  • spark_sql_args is only used in CLI mode; it is ignored when connected via Thrift.
  • spark_auto evaluates thrift availability at connection time — if the server comes up mid-run, subsequent connections will use Thrift.
  • If spark_home is not set in the profile, dbt falls back to the SPARK_HOME environment variable, then to spark-sql/spark-submit on PATH.
  • The Spark History Server requires event logging on the thrift server (spark.eventLog.enabled=true). The bundled docker-compose configures this automatically.
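The binary resolution order described above (profile spark_home, then the SPARK_HOME environment variable, then PATH) can be sketched as follows; resolve_spark_binary is a hypothetical helper, not the adapter's actual function:

```python
import os
import shutil

def resolve_spark_binary(name, spark_home=None, env=None):
    """Resolve a Spark CLI binary (e.g. "spark-sql" or "spark-submit"):
    profile spark_home first, then the SPARK_HOME environment variable,
    then a plain PATH lookup."""
    env = env if env is not None else os.environ
    for home in (spark_home, env.get("SPARK_HOME")):
        if home:
            candidate = os.path.join(home, "bin", name)
            if os.access(candidate, os.X_OK):
                return candidate
    return shutil.which(name)  # None if the binary is nowhere on PATH
```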

Download files

Download the file for your platform.

Source Distribution

dbt_iceberg-1.0.15.tar.gz (87.8 kB)


Built Distribution


dbt_iceberg-1.0.15-py3-none-any.whl (59.6 kB)


File details

Details for the file dbt_iceberg-1.0.15.tar.gz.

File metadata

  • Download URL: dbt_iceberg-1.0.15.tar.gz
  • Size: 87.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dbt_iceberg-1.0.15.tar.gz
Algorithm Hash digest
SHA256 b3c3be79851f655f33246688dbda24c7c5cbf0011b8943e60b12c23f103126cb
MD5 056e0bb48b60d15c89ce5fa1784da454
BLAKE2b-256 a73bba3df1459164f30e979bb8550e5e58f3531d6b93d00cf389ee0f98c07ada


Provenance

The following attestation bundles were made for dbt_iceberg-1.0.15.tar.gz:

Publisher: publish-dbt-iceberg.yml on theserverkid/dbt-adapters

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file dbt_iceberg-1.0.15-py3-none-any.whl.

File metadata

  • Download URL: dbt_iceberg-1.0.15-py3-none-any.whl
  • Size: 59.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dbt_iceberg-1.0.15-py3-none-any.whl
Algorithm Hash digest
SHA256 12d991ac57aebf729e96ae5f1b6b88390a0dfb164b232942c5d03a8de01895bb
MD5 7e880dab269e0efbdcccf46cf7bbcf01
BLAKE2b-256 604a4ffd12ebc51b880b4c44a47b545cfae1fea9a0c0f74f0537d214889e7a94


Provenance

The following attestation bundles were made for dbt_iceberg-1.0.15-py3-none-any.whl:

Publisher: publish-dbt-iceberg.yml on theserverkid/dbt-adapters

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
