Skip to main content

Compile-time SQL-dialect transpiler for dbt (via sqlglot): run models written in one dialect on another warehouse — Spark-first.

Project description

dbt-polyglot

Run dbt models written in one SQL dialect on a warehouse that speaks another — unchanged. Each model's SQL is transpiled with sqlglot at dbt's compile phase, so the SQL dbt actually executes (and what lands in target/compiled/) is already in your target dialect. Your model .sql files are never edited.

You declare two things in config: the dialect your models are written in (transpile_from) and the dialect of the warehouse you run on (transpile_to, default spark). Drop the package into any existing dbt repo, point profiles.yml at your warehouse, add one flag to dbt_project.yml, and dbt build.

Why this exists: dialects diverge. Spark, for example, has no QUALIFY clause ([PARSE_SYNTAX_ERROR] … near 'QUALIFY'), plus dozens of smaller gaps (IFF, NVL, :: casts, DATEADD, null ordering, …) — a Snowflake-style model simply fails there until its SQL is translated. This package does that translation transparently, in-place, at compile time.

Spark is the first-class target today — it carries the correctness fix-up layer and the validation story (see Targets). Any other sqlglot dialect works as a target too, best-effort.


Install

It is a normal Python package — install it into the same virtualenv your dbt runs in. Installation auto-activates the patch (via a .pth file that imports the module on interpreter start-up; see Installation: why pip, not dbt deps).

pip install dbt-polyglot

From a git checkout (bleeding edge):

pip install "git+https://github.com/SaketKumar/dbt-polyglot.git"

Local / editable (developing the package):

pip install -e path/to/dbt-polyglot

You also need a Spark adapter for dbt (this package does not pull one in, so you can choose your connection method):

pip install "dbt-spark[PyHive]"     # Thrift/HiveServer2, used in the examples below

Configure (the only changes you make)

1. profiles.yml — point the output at your warehouse (Spark shown)

your_profile:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      host: "{{ env_var('DBT_SPARK_HOST', 'localhost') }}"
      port: "{{ env_var('DBT_SPARK_PORT', 10000) | int }}"
      schema: analytics

2. dbt_project.yml — declare the source dialect (and target, if not Spark)

models:
  your_project:
    +transpile_from: snowflake     # the dialect your models are WRITTEN in
    # +transpile_to: spark         # your WAREHOUSE's dialect (default: spark)

Both accept any dialect sqlglot understands — snowflake, bigquery, redshift, tsql, postgres, duckdb, databricks, presto, trino, … transpile_to defaults to spark; set it to match the warehouse profiles.yml connects to. It must agree with your dbt adapter, since dbt executes the transpiled SQL there.

You can scope it to a subtree (models.your_project.staging.+transpile_from: …) or override it per model — a per-model config beats the project default:

-- models/marts/latest_order.sql  (written in Snowflake SQL, runs on Spark)
{{ config(materialized='table', transpile_from='snowflake') }}

select *
from {{ ref('orders') }}
qualify row_number() over (partition by customer_id order by ordered_at desc) = 1

That's it. dbt build now runs your existing models on Spark, no model edits.


How it works

At dbt compile, the package wraps dbt.compilation.Compiler._compile_code and runs an extra step on each opted-in model's compiled SQL body:

parse(read=transpile_from)  →  apply fix-ups (Spark target only)  →  generate(transpile_to, pretty=True)

Because the rewrite happens on the model body before dbt wraps it in the materialization DDL (create table … as …), both target/compiled/ and the SQL sent to Spark are pure Spark — there is no mixed-dialect string and no separate output directory.

The fix-up layer (what makes it trustable)

sqlglot's output is occasionally valid in its model of Spark but rejected by Spark's real parser. The classic case: x NOT IN (subquery), which sqlglot's Snowflake reader canonicalizes to the unsupported x <> ALL (subquery). The SPARK_FIXUPS registry is a list of small AST transforms applied to the parsed tree before Spark SQL is generated; the first one rewrites quantified-subquery comparisons (<> ALL / = ANY (subq)) back to NOT x IN / x IN (subq). The registry is extensible — one EXPLAIN-verified transform per gap discovered.

Targets

At the engine level the transpile is N×N — any sqlglot source dialect to any target, chosen by transpile_from / transpile_to. In practice there's a maturity gradient:

  • Spark — first-class. The SPARK_FIXUPS correctness layer runs only when transpile_to=spark, and Spark is the target both the fix-ups and the dbt build --empty validation story are tested against. This is the production-trustworthy path.
  • Any other target — best-effort. You get raw sqlglot output with no repair layer. Often correct, but sqlglot can emit constructs the real engine rejects with nothing to catch them — e.g. Snowflake x NOT IN (subquery) transpiled to BigQuery becomes the unsupported x <> ALL (subquery) (the very case SPARK_FIXUPS repairs for Spark).

Promoting another target to first-class is a bounded extension: add a <TARGET>_FIXUPS registry beside SPARK_FIXUPS and key fix-up selection on transpile_to. Either way, transpile_to must match your dbt adapter — dbt runs the output against that warehouse.

Trust model — verified, or fails loud (never silently wrong)

A model is either converted to valid Spark SQL or it fails loudly with a clear dbt/Spark error naming the model. It never silently emits a wrong result from an un-converted construct:

  • Fail-soft + loud. If sqlglot can't parse the SQL as the source dialect, or produces empty/multi-statement output, the patch logs a WARNING (visible in the dbt run) and passes the original SQL through unchanged. Spark then either runs it (it was already valid) or rejects it loudly — so the failure surfaces, it is never hidden.

To certify a whole repo upfront — before a heavy run — use dbt's own native validation. No extra tooling: dbt already runs SQL through your profiles.yml adapter, against whatever warehouse you target.

dbt build --empty              # build every model with 0 input rows (DAG-ordered)
dbt build --empty --select marts.*   # any dbt selector works
dbt show --limit 0 -s my_model # read-only: validate the SELECT without materializing

--empty limits every ref/source to zero rows, so dbt executes each model's real SQL against the warehouse — moving no data — and fails loudly, naming the model, if the transpiled SQL is invalid. Because it builds in dependency order, there is no "upstream not built" ambiguity. That makes dbt build --empty a drop-in CI gate (it exits non-zero on the first invalid model). dbt show --limit 0 is the non-destructive variant when the target role can't create objects.

Scope

Every opted-in model is transpiled — the full sqlglot breadth (IFFIF, NVLCOALESCE, ::CAST, DATEADDDATE_ADD, QUALIFY→windowed subquery, …). To transpile only part of a project, scope +transpile_from to a folder/model subtree (or set it per model) — the dbt-native way — rather than a global on/off.

No-op guarantee

If transpile_from is unset, or equals transpile_to (you're already writing Spark SQL), the model is never touchedsqlglot is not even called and nothing is reformatted.

A note on NULLS LAST in the output (intentional)

Snowflake and Spark have opposite default null ordering (Snowflake sorts NULLs largest → last; Spark sorts them smallest → first). When translating a Snowflake ORDER BY x, sqlglot appends an explicit … NULLS LAST to preserve Snowflake semantics — without it, a QUALIFY ROW_NUMBER() … = 1 top-N pick could choose a different row. It is added only on a true cross-dialect translation, and is semantically required — do not strip it.


Installation: why pip, not dbt deps

dbt deps cannot install this — you must pip install it. They do different things:

  • dbt deps installs dbt packages: bundles of dbt macros, models, seeds, and tests (the things listed in packages.yml / dependencies.yml). It pulls SQL/Jinja assets into dbt_packages/ and never installs or runs Python code.
  • dbt-polyglot is a Python package. It works by monkeypatching a dbt-core function at runtime, and it activates through a .pth file that Python executes on interpreter start-up. Both of those are Python-installer concerns — only pip (or uv, poetry, etc.) places a .pth into site-packages and registers the dependency.

So it is installed exactly like dbt-core or an adapter, into the same environment as your dbt. It does not appear in packages.yml.


Package contents

A standard src-layout package — src/dbt_polyglot/ holds the import package, plus a .pth that activates it on start-up:

File Role
src/dbt_polyglot/__init__.py Import-time activation: patches the dbt Compiler.
src/dbt_polyglot/transpile.py The compile-phase patch (patch_compiler) + core spark_safe_transpile.
src/dbt_polyglot/fixups.py The SPARK_FIXUPS registry of AST transforms.
dbt_polyglot.pth One line (import dbt_polyglot); auto-activates on start-up. Installed into site-packages by the build_py shim in setup.py.
pyproject.toml / setup.py PEP 517 metadata; setup.py exists only to place the .pth into purelib.
LICENSE Apache-2.0.

This package is intentionally limited to transpilation. Validating the result is left to dbt's native dbt build --empty (see Trust model above); catalog routing (mapping file_format → a Spark catalog) and seed re-runnability are separate concerns and are not bundled here.


Compatibility & caveats

  • dbt-core private method. The patch wraps dbt.compilation.Compiler._compile_code, a private dbt-core method. It forwards *args/**kwargs to tolerate signature drift and is fully import-guarded (if dbt-core or sqlglot aren't importable, or the seam moves, the patch does nothing rather than breaking the interpreter). Still, pin a supported dbt-core range when depending on this in production, and re-verify after major dbt upgrades.
  • sqlglot coverage. sqlglot maps a large surface but not everything. Exotic dialect features — Snowflake LATERAL FLATTEN, VARIANT/OBJECT/ARRAY semantics, : path access, LISTAGG, and similar — may not translate cleanly. Those surface via the fail-soft WARNING and dbt build --empty, by design, rather than silently.
  • Self-contained. The module imports nothing from any host project, so it can be lifted into its own repo unchanged.

License

Apache-2.0 — see LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dbt_polyglot-0.1.1.tar.gz (18.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

dbt_polyglot-0.1.1-py3-none-any.whl (13.4 kB view details)

Uploaded Python 3

File details

Details for the file dbt_polyglot-0.1.1.tar.gz.

File metadata

  • Download URL: dbt_polyglot-0.1.1.tar.gz
  • Upload date:
  • Size: 18.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.11.24 {"installer":{"name":"uv","version":"0.11.24","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for dbt_polyglot-0.1.1.tar.gz
Algorithm Hash digest
SHA256 f1e004c118b2fd1e79eddf67f2caeb6b88350281eeb3d37a7d50c7c7c319db95
MD5 894b1f0b7bfdc8ae713080793a9f6eee
BLAKE2b-256 872689bbfebe6da330d6219af8f2592def628226f30e7dad31a70df58d7869c8

See more details on using hashes here.

File details

Details for the file dbt_polyglot-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: dbt_polyglot-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 13.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.11.24 {"installer":{"name":"uv","version":"0.11.24","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for dbt_polyglot-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 a209c4b72ae201a8804d65eb33ff0acf47673ee4108ed4fbc173d25572071fb8
MD5 9f72abdf4909e2eb5728802fdb3cbbd5
BLAKE2b-256 b937706817e82476d768fb5c135ca20bab392e530d4bdcd5f96d493a25533986

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page