The Apache Iceberg adapter plugin for dbt with spark-submit and spark-sql CLI support

Project description

dbt

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.

dbt is the T in ELT. Organize, cleanse, denormalize, filter, rename, and pre-aggregate the raw data in your warehouse so that it's ready for analysis.

dbt-spark

dbt-spark enables dbt to work with Apache Spark. For more information on using dbt with Spark, consult the docs.

Getting started

Review the repository README.md, as most of the information there also applies to dbt-spark.

Running locally

A docker-compose environment starts three services:

Service Description Port
dbt-spark3-thrift Spark Thrift Server (HiveThriftServer2) — long-running JDBC endpoint 10000
dbt-spark-history Spark History Server — UI for completed queries and jobs 18080
dbt-hive-metastore Postgres-backed Hive Metastore internal

Note: dbt-spark now supports Spark 4.1.0.

The following command starts all containers:

docker-compose up -d

The containers take a little while to start; check the container logs to monitor progress. If the instance doesn't start correctly, run the complete reset command listed below and then try again.

Create a profile using the spark_sql method in thrift server mode:

spark_testing:
  target: local
  outputs:
    local:
      type: spark
      method: spark_sql
      host: 127.0.0.1
      port: 10000
      user: dbt
      schema: analytics
      connect_retries: 5
      connect_timeout: 60
      retry_all: true
      spark_history_server: http://localhost:18080

Or using the thrift method directly:

spark_testing:
  target: local
  outputs:
    local:
      type: spark
      method: thrift
      host: 127.0.0.1
      port: 10000
      user: dbt
      schema: analytics
      connect_retries: 5
      connect_timeout: 60
      retry_all: true

Resetting the local Spark instance:

Note that the Hive metastore data is persisted under ./.hive-metastore/, Spark warehouse data under ./.spark-warehouse/, and event logs under ./.spark-events/. To completely reset your environment run:

docker-compose down
rm -rf ./.hive-metastore/ ./.spark-warehouse/ ./.spark-events/

Additional Configuration for macOS

If installing on macOS, use Homebrew to install the required dependencies.

brew install unixodbc

Configuring spark-submit and spark-sql methods

The spark_submit and spark_sql connection methods run SQL and Python models using either a long-running Spark Thrift Server or the Spark CLI tools directly.

Requirements

  • A working Spark installation accessible via SPARK_HOME or on PATH (CLI mode), or a running HiveThriftServer2 (thrift server mode).
  • For spark_submit: spark-submit must be executable.
  • For spark_sql CLI mode: spark-sql must be executable.
  • For spark_sql thrift server mode: pyhive must be installed (pip install dbt-spark[PyHive]).

spark-sql — thrift server mode (recommended)

When host is set, spark_sql connects to a long-running Spark Thrift Server (HiveThriftServer2) via the Thrift protocol. This avoids the JVM startup cost of spawning a new Spark application for every dbt invocation, and lets you share a single Spark context across many dbt runs. The Spark UI and History Server remain accessible throughout.

my_spark_project:
  target: dev
  outputs:
    dev:
      type: spark
      method: spark_sql
      host: 127.0.0.1
      port: 10000
      schema: analytics
      user: dbt
      connect_retries: 5
      connect_timeout: 60
      retry_all: true
      # Optional: URL of the Spark History Server UI (informational — logged at startup)
      spark_history_server: http://127.0.0.1:18080
      # Optional: server-side Spark configuration applied to each session
      server_side_parameters:
        spark.executor.memory: "4g"
        spark.eventLog.enabled: "true"
        spark.eventLog.dir: /tmp/spark-events
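A common way to apply per-session settings over a Thrift connection is to issue SET statements after connecting. The sketch below shows how a server_side_parameters mapping could be translated into such statements; build_set_statements is an illustrative helper, not the adapter's actual API:

```python
def build_set_statements(server_side_parameters):
    """Translate a server_side_parameters mapping into SET statements
    to be executed once at the start of each Thrift session."""
    return [f"SET {key}={value}" for key, value in server_side_parameters.items()]

params = {
    "spark.executor.memory": "4g",
    "spark.eventLog.enabled": "true",
}
statements = build_set_statements(params)
# Each statement would then be executed on a fresh session, e.g. via a
# pyhive cursor: cursor.execute(statement)
```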

Start a thrift server locally using the bundled docker-compose:

docker-compose up -d

Or start one manually:

$SPARK_HOME/sbin/start-thriftserver.sh \
  --master local[*] \
  --conf spark.sql.ansi.enabled=false \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=/tmp/spark-events

spark-sql — CLI mode

When host is not set, dbt executes each statement with spark-sql -e '...' (or -f for batches). A new Spark application is launched per dbt invocation.

my_spark_project:
  target: dev
  outputs:
    dev:
      type: spark
      method: spark_sql
      schema: analytics
      # Optional: explicit path to SPARK_HOME if not set in environment
      spark_home: /opt/spark
      # Optional: extra CLI flags passed to spark-sql
      spark_sql_args:
        - "--master"
        - "local[*]"
        - "--conf"
        - "spark.executor.memory=4g"
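Conceptually, CLI mode shells out to spark-sql with the profile's extra flags followed by -e and the statement. A rough sketch of how that argv might be assembled (build_spark_sql_command is a hypothetical helper, not the adapter's real internals):

```python
import os
import shlex

def build_spark_sql_command(sql, spark_home=None, spark_sql_args=None):
    """Assemble an argv list for `spark-sql -e '<sql>'`, resolving the
    binary from spark_home when given, otherwise relying on PATH."""
    binary = os.path.join(spark_home, "bin", "spark-sql") if spark_home else "spark-sql"
    return [binary, *(spark_sql_args or []), "-e", sql]

cmd = build_spark_sql_command(
    "SELECT 1",
    spark_home="/opt/spark",
    spark_sql_args=["--master", "local[*]"],
)
print(shlex.join(cmd))  # shell-quoted form of the command
```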

spark-submit

Use spark_submit when your project includes Python models. dbt executes Python models via spark-submit. SQL statements (e.g. catalog introspection) fall back to thrift when host is set, or to the spark-sql CLI otherwise.

With a thrift server (recommended):

my_spark_project:
  target: dev
  outputs:
    dev:
      type: spark
      method: spark_submit
      host: 127.0.0.1
      port: 10000
      schema: analytics
      user: dbt
      spark_history_server: http://127.0.0.1:18080
      # Extra CLI flags for spark-submit (Python models only)
      spark_submit_args:
        - "--master"
        - "local[*]"
        - "--conf"
        - "spark.executor.memory=4g"
      spark_submit_timeout: 3600

CLI only (no thrift server):

my_spark_project:
  target: dev
  outputs:
    dev:
      type: spark
      method: spark_submit
      schema: analytics
      spark_home: /opt/spark
      spark_submit_args:
        - "--master"
        - "local[*]"
        - "--conf"
        - "spark.executor.memory=4g"
      spark_sql_args:
        - "--master"
        - "local[*]"
      spark_submit_timeout: 3600
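The dispatch of a Python model through spark-submit, with spark_submit_timeout enforced, can be sketched as follows. The function names and the model path are illustrative assumptions, not the adapter's actual entry points:

```python
import subprocess

def build_spark_submit_command(model_path, spark_submit_args=None):
    """Assemble the argv for running a compiled Python model via spark-submit."""
    return ["spark-submit", *(spark_submit_args or []), model_path]

def run_python_model(model_path, spark_submit_args=None, timeout=3600):
    """Run the model, enforcing the timeout via subprocess.

    subprocess.run raises TimeoutExpired if the job exceeds `timeout` seconds.
    """
    cmd = build_spark_submit_command(model_path, spark_submit_args)
    return subprocess.run(cmd, timeout=timeout, check=True)

cmd = build_spark_submit_command(
    "target/run/my_model.py",  # hypothetical compiled-model path
    ["--master", "local[*]", "--conf", "spark.executor.memory=4g"],
)
```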

spark-auto — automatic routing (recommended for mixed environments)

Use spark_auto when you want a single profile that works whether or not a Spark Thrift Server is running. At connection time the adapter tries to reach the Thrift server; if it is unavailable it transparently falls back to the spark-sql CLI. Python models always use spark-submit regardless of which SQL path is active.

my_spark_project:
  target: dev
  outputs:
    dev:
      type: iceberg
      method: spark_auto
      host: 127.0.0.1       # Thrift server — tried first; CLI used if unreachable
      port: 10000
      user: dbt
      schema: analytics
      connect_retries: 5
      connect_timeout: 60
      retry_all: true
      spark_history_server: http://127.0.0.1:18080
      # Spark config applied when connecting via Thrift
      server_side_parameters:
        spark.sql.extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
        spark.sql.defaultCatalog: my_catalog
      # Extra flags passed to spark-sql CLI (fallback path)
      spark_sql_args:
        - "--conf"
        - "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
      # Extra flags for spark-submit (Python models)
      spark_submit_args:
        - "--conf"
        - "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
      spark_submit_timeout: 3600
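The fallback decision in spark_auto can be illustrated with a small probe of the Thrift port. This is a sketch of the routing idea only, not the adapter's actual logic (the real implementation may also honor connect_retries and connect_timeout):

```python
import socket

def choose_sql_path(host, port, probe=None, timeout=2.0):
    """Return "thrift" if the Thrift port accepts a TCP connection,
    otherwise "spark_sql_cli". `probe` is injectable for testing."""
    if probe is None:
        def probe():
            # Attempt a plain TCP connection to the Thrift endpoint
            with socket.create_connection((host, port), timeout=timeout):
                return True
    try:
        probe()
        return "thrift"
    except OSError:
        return "spark_sql_cli"
```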

Profile fields

Field Required Default Description
method Yes — spark_sql, spark_submit, or spark_auto
schema Yes — The default schema (database) to use
host No — Thrift server hostname. When set, spark_sql, spark_submit, and spark_auto connect via Thrift instead of the CLI. spark_auto falls back to the CLI if the server is unreachable.
port No 443 Thrift server port (typically 10000)
user No — Username for thrift server authentication
auth No — Auth mechanism for thrift (e.g. NONE, LDAP, KERBEROS)
use_ssl No false Use SSL/TLS for the thrift connection
spark_history_server No — URL of the Spark History Server UI (informational)
spark_home No $SPARK_HOME env var Path to your Spark installation (CLI mode)
spark_sql_args No [] Extra CLI flags for spark-sql (CLI/fallback mode only)
spark_submit_args No [] Extra CLI flags for spark-submit (Python models only)
spark_submit_timeout No null (no timeout) Max seconds to wait for a spark-submit job
poll_interval No 5 Seconds between status polls for thrift queries
query_timeout No null (no timeout) Max seconds for a single thrift query
query_retries No 1 Times to retry on connection loss during query polling
server_side_parameters No {} Spark configuration sent to the thrift server per session

Notes

  • Thrift server mode (host set) requires pip install dbt-spark[PyHive] for spark_sql, spark_submit, and spark_auto.
  • spark_submit and spark_auto only invoke spark-submit for Python models. SQL statements use thrift when host is set (or reachable for spark_auto), otherwise spark-sql CLI.
  • spark_sql_args is only used in CLI mode; it is ignored when connected via Thrift.
  • spark_auto evaluates thrift availability at connection time — if the server comes up mid-run, subsequent connections will use Thrift.
  • If spark_home is not set in the profile, dbt falls back to the SPARK_HOME environment variable, then to spark-sql/spark-submit on PATH.
  • The Spark History Server requires event logging on the thrift server (spark.eventLog.enabled=true). The bundled docker-compose configures this automatically.
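The binary resolution order described above (profile spark_home, then the SPARK_HOME environment variable, then PATH) can be sketched as follows; resolve_spark_binary is a hypothetical helper, not the adapter's actual function:

```python
import os
import shutil

def resolve_spark_binary(name, spark_home=None, env=None):
    """Resolve a Spark CLI binary (e.g. "spark-sql" or "spark-submit"):
    profile spark_home first, then the SPARK_HOME environment variable,
    then a plain PATH lookup."""
    env = env if env is not None else os.environ
    for home in (spark_home, env.get("SPARK_HOME")):
        if home:
            candidate = os.path.join(home, "bin", name)
            if os.access(candidate, os.X_OK):
                return candidate
    return shutil.which(name)  # None if the binary is nowhere on PATH
```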

Download files

Download the file for your platform.

Source Distribution

dbt_iceberg-1.0.15.tar.gz (87.8 kB)


Built Distribution


dbt_iceberg-1.0.15-py3-none-any.whl (59.6 kB)


File details

Details for the file dbt_iceberg-1.0.15.tar.gz.

File metadata

  • Download URL: dbt_iceberg-1.0.15.tar.gz
  • Size: 87.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dbt_iceberg-1.0.15.tar.gz
Algorithm Hash digest
SHA256 b3c3be79851f655f33246688dbda24c7c5cbf0011b8943e60b12c23f103126cb
MD5 056e0bb48b60d15c89ce5fa1784da454
BLAKE2b-256 a73bba3df1459164f30e979bb8550e5e58f3531d6b93d00cf389ee0f98c07ada


Provenance

The following attestation bundles were made for dbt_iceberg-1.0.15.tar.gz:

Publisher: publish-dbt-iceberg.yml on theserverkid/dbt-adapters

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file dbt_iceberg-1.0.15-py3-none-any.whl.

File metadata

  • Download URL: dbt_iceberg-1.0.15-py3-none-any.whl
  • Size: 59.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dbt_iceberg-1.0.15-py3-none-any.whl
Algorithm Hash digest
SHA256 12d991ac57aebf729e96ae5f1b6b88390a0dfb164b232942c5d03a8de01895bb
MD5 7e880dab269e0efbdcccf46cf7bbcf01
BLAKE2b-256 604a4ffd12ebc51b880b4c44a47b545cfae1fea9a0c0f74f0537d214889e7a94


Provenance

The following attestation bundles were made for dbt_iceberg-1.0.15-py3-none-any.whl:

Publisher: publish-dbt-iceberg.yml on theserverkid/dbt-adapters

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
