Skip to main content

Run any-dialect dbt models on Spark unchanged — transpiles each model's SQL to Spark at dbt compile time via sqlglot.

Project description

dbt-polyglot

Run a dbt project written in another SQL dialect (Snowflake, BigQuery, Redshift, …) on Spark — unchanged. Each model's SQL is transpiled to Spark with sqlglot at dbt's compile phase, so the SQL dbt actually executes (and what lands in target/compiled/) is already Spark.

The only changes are configuration — your model .sql files are never edited. Drop the package into any existing dbt repo, point profiles.yml at Spark, declare the source dialect in dbt_project.yml, and dbt build.

Why this exists: Spark has no QUALIFY clause ([PARSE_SYNTAX_ERROR] … near 'QUALIFY'), plus dozens of smaller dialect gaps (IFF, NVL, :: casts, DATEADD, null ordering, …). A portable/Snowflake-style model fails on Spark until its SQL is translated. This package does that translation transparently, in-place, at compile time.


Install

It is a normal Python package — install it into the same virtualenv your dbt runs in. Installation auto-activates the patch (via a .pth file that imports the module on interpreter start-up; see Installation: why pip, not dbt deps).

pip install dbt-polyglot

From a git checkout (bleeding edge):

pip install "git+https://github.com/SaketKumar/dbt-polyglot.git"

Local / editable (developing the package):

pip install -e path/to/dbt-polyglot

You also need a Spark adapter for dbt (this package does not pull one in, so you can choose your connection method):

pip install "dbt-spark[PyHive]"     # Thrift/HiveServer2, used in the examples below

Configure (the only changes you make)

1. profiles.yml — point the output at Spark

your_profile:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      host: "{{ env_var('DBT_SPARK_HOST', 'localhost') }}"
      port: "{{ env_var('DBT_SPARK_PORT', 10000) | int }}"
      schema: analytics

2. dbt_project.yml — declare your models' source dialect

models:
  your_project:
    +transpile_from: snowflake     # the dialect your models are written in
    # +transpile_to: spark         # optional, default 'spark'

transpile_from accepts any dialect sqlglot understands — snowflake, bigquery, redshift, tsql, postgres, duckdb, presto, trino, … transpile_to defaults to spark and rarely needs changing.

You can scope it to a subtree (models.your_project.staging.+transpile_from: …) or override it per model — a per-model config beats the project default:

-- models/marts/latest_order.sql  (written in Snowflake SQL, runs on Spark)
{{ config(materialized='table', transpile_from='snowflake') }}

select *
from {{ ref('orders') }}
qualify row_number() over (partition by customer_id order by ordered_at desc) = 1

That's it. dbt build now runs your existing models on Spark, no model edits.


How it works

At dbt compile, the package wraps dbt.compilation.Compiler._compile_code and runs an extra step on each opted-in model's compiled SQL body:

parse(read=transpile_from)  →  apply SPARK_FIXUPS  →  generate(transpile_to, pretty=True)

Because the rewrite happens on the model body before dbt wraps it in the materialization DDL (create table … as …), both target/compiled/ and the SQL sent to Spark are pure Spark — there is no mixed-dialect string and no separate output directory.

The fix-up layer (what makes it trustable)

sqlglot's output is occasionally valid in its model of Spark but rejected by Spark's real parser. The classic case: x NOT IN (subquery), which sqlglot's Snowflake reader canonicalizes to the unsupported x <> ALL (subquery). The SPARK_FIXUPS registry is a list of small AST transforms applied to the parsed tree before Spark SQL is generated; the first one rewrites quantified-subquery comparisons (<> ALL / = ANY (subq)) back to NOT x IN / x IN (subq). The registry is extensible — one EXPLAIN-verified transform per gap discovered.

Trust model — verified, or fails loud (never silently wrong)

A model is either converted to valid Spark SQL or it fails loudly with a clear dbt/Spark error naming the model. It never silently emits a wrong result from an un-converted construct:

  • Fail-soft + loud. If sqlglot can't parse the SQL as the source dialect, or produces empty/multi-statement output, the patch logs a WARNING (visible in the dbt run) and passes the original SQL through unchanged. Spark then either runs it (it was already valid) or rejects it loudly — so the failure surfaces, it is never hidden.

To certify a whole repo upfront — before a heavy run — use dbt's own native validation. No extra tooling: dbt already runs SQL through your profiles.yml adapter, against whatever warehouse you target.

dbt build --empty              # build every model with 0 input rows (DAG-ordered)
dbt build --empty --select marts.*   # any dbt selector works
dbt show --limit 0 -s my_model # read-only: validate the SELECT without materializing

--empty limits every ref/source to zero rows, so dbt executes each model's real SQL against the warehouse — moving no data — and fails loudly, naming the model, if the transpiled SQL is invalid. Because it builds in dependency order, there is no "upstream not built" ambiguity. That makes dbt build --empty a drop-in CI gate (it exits non-zero on the first invalid model). dbt show --limit 0 is the non-destructive variant when the target role can't create objects.

Scope

Every opted-in model is transpiled — the full sqlglot breadth (IFFIF, NVLCOALESCE, ::CAST, DATEADDDATE_ADD, QUALIFY→windowed subquery, …). To transpile only part of a project, scope +transpile_from to a folder/model subtree (or set it per model) — the dbt-native way — rather than a global on/off.

No-op guarantee

If transpile_from is unset, or equals transpile_to (you're already writing Spark SQL), the model is never touchedsqlglot is not even called and nothing is reformatted.

A note on NULLS LAST in the output (intentional)

Snowflake and Spark have opposite default null ordering (Snowflake sorts NULLs largest → last; Spark sorts them smallest → first). When translating a Snowflake ORDER BY x, sqlglot appends an explicit … NULLS LAST to preserve Snowflake semantics — without it, a QUALIFY ROW_NUMBER() … = 1 top-N pick could choose a different row. It is added only on a true cross-dialect translation, and is semantically required — do not strip it.


Installation: why pip, not dbt deps

dbt deps cannot install this — you must pip install it. They do different things:

  • dbt deps installs dbt packages: bundles of dbt macros, models, seeds, and tests (the things listed in packages.yml / dependencies.yml). It pulls SQL/Jinja assets into dbt_packages/ and never installs or runs Python code.
  • dbt-polyglot is a Python package. It works by monkeypatching a dbt-core function at runtime, and it activates through a .pth file that Python executes on interpreter start-up. Both of those are Python-installer concerns — only pip (or uv, poetry, etc.) places a .pth into site-packages and registers the dependency.

So it is installed exactly like dbt-core or an adapter, into the same environment as your dbt. It does not appear in packages.yml.


Package contents

A standard src-layout package — src/dbt_polyglot/ holds the import package, plus a .pth that activates it on start-up:

File Role
src/dbt_polyglot/__init__.py Import-time activation: patches the dbt Compiler.
src/dbt_polyglot/transpile.py The compile-phase patch (patch_compiler) + core spark_safe_transpile.
src/dbt_polyglot/fixups.py The SPARK_FIXUPS registry of AST transforms.
dbt_polyglot.pth One line (import dbt_polyglot); auto-activates on start-up. Installed into site-packages by the build_py shim in setup.py.
pyproject.toml / setup.py PEP 517 metadata; setup.py exists only to place the .pth into purelib.
LICENSE Apache-2.0.

This package is intentionally limited to transpilation. Validating the result is left to dbt's native dbt build --empty (see Trust model above); catalog routing (mapping file_format → a Spark catalog) and seed re-runnability are separate concerns and are not bundled here.


Compatibility & caveats

  • dbt-core private method. The patch wraps dbt.compilation.Compiler._compile_code, a private dbt-core method. It forwards *args/**kwargs to tolerate signature drift and is fully import-guarded (if dbt-core or sqlglot aren't importable, or the seam moves, the patch does nothing rather than breaking the interpreter). Still, pin a supported dbt-core range when depending on this in production, and re-verify after major dbt upgrades.
  • sqlglot coverage. sqlglot maps a large surface but not everything. Exotic dialect features — Snowflake LATERAL FLATTEN, VARIANT/OBJECT/ARRAY semantics, : path access, LISTAGG, and similar — may not translate cleanly. Those surface via the fail-soft WARNING and dbt build --empty, by design, rather than silently.
  • Self-contained. The module imports nothing from any host project, so it can be lifted into its own repo unchanged.

License

Apache-2.0 — see LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dbt_polyglot-0.1.0.tar.gz (17.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

dbt_polyglot-0.1.0-py3-none-any.whl (12.6 kB view details)

Uploaded Python 3

File details

Details for the file dbt_polyglot-0.1.0.tar.gz.

File metadata

  • Download URL: dbt_polyglot-0.1.0.tar.gz
  • Upload date:
  • Size: 17.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.1

File hashes

Hashes for dbt_polyglot-0.1.0.tar.gz
Algorithm Hash digest
SHA256 f892bbf203563f85ebb70f306044d897f15f0dc65704ec4889e551bd258279c6
MD5 57b0ed7ded6a2d5c7cf145710398942b
BLAKE2b-256 f1049a833ce1dd2de3856428e76544a00cf0170bbfe4fd8a485e5a6c22fa0b49

See more details on using hashes here.

File details

Details for the file dbt_polyglot-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for dbt_polyglot-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e6e86a6a4bd705ac2b040a003d74ed77781022efd399394781c2cf5c0df92478
MD5 bb72e16c382980b714e1e1ec5cd06d09
BLAKE2b-256 3d6c092ad451b597fa175cbc2e18ca1337b14253a299ed7b6686c1d88cd10697

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page