A Databricks dbt factory library for creating Databricks job definitions in which individual dbt models run as separate tasks.
Databricks dbt factory
Databricks dbt Factory is a lightweight library that generates a Databricks Workflow from a dbt project. It creates individual Databricks Workflow tasks based on your dbt manifest for each dbt object type, covering dbt models, tests, seeds, and snapshots.
The tool creates a new job specification, such as a Databricks Asset Bundles (DAB) job definition, or updates an existing one.
Motivation
By default, dbt's integration with Databricks Workflows treats an entire dbt project as a single execution unit — a black box.
Databricks dbt Factory changes that by updating Databricks Workflow specs to run dbt objects (models, tests, seeds, snapshots) as individual tasks.
Benefits
✅ Faster execution — Speed up dbt project execution on Databricks.
✅ Visibility & Simplified troubleshooting — Quickly pinpoint and fix issues at the model level.
✅ Enhanced logging & notifications — Gain detailed logs and precise error alerts for faster debugging.
✅ Improved retriability — Retry only the failed model tasks without rerunning the full project.
✅ Seamless testing — Automatically run dbt data tests on tables right after each model finishes, enabling faster validation and feedback.
How it works
The tool reads the dbt manifest file and the existing DAB workflow definition, and generates a new definition.
Installation
```shell
pip install databricks-dbt-factory
```
Usage
The factory reads a job template (a minimal DAB-style YAML with an empty tasks list) and a dbt manifest, then outputs a complete job definition with one task per dbt node.
Job template
Create a minimal job template YAML. This is the skeleton the factory injects tasks into:
```yaml
resources:
  jobs:
    my_dbt_job:
      name: my_dbt_job
      queue:
        enabled: true
      environments:
        - environment_key: Default
          spec:
            client: '1'
            dependencies:
              - dbt-databricks
```
To use a workspace base environment instead of inline dependencies (recommended for notebook tasks on serverless — requires Databricks CLI >= 0.292.0):
```yaml
environments:
  - environment_key: Default
    spec:
      base_environment: "/Workspace/Shared/envs/my_base_env.yaml"
```
Note: `client` and `base_environment` are mutually exclusive — use one or the other.
Generating native dbt tasks within Databricks Workflows
```shell
databricks_dbt_factory \
  --dbt-manifest-path target/manifest.json \
  --input-job-spec-path job_template.yaml \
  --target-job-spec-path job_definition.yaml \
  --source GIT \
  --target dev
```
This generates `dbt_task` entries — the native Databricks dbt task type.
Note that `--input-job-spec-path` and `--target-job-spec-path` can be the same file, in which case the job spec is updated in place.
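For illustration, here is roughly what a generated native dbt task can look like. The task and model names below are hypothetical, and the exact fields depend on the flags you pass (e.g. warehouse, catalog, and schema only appear when provided):

```yaml
resources:
  jobs:
    my_dbt_job:
      tasks:
        # hypothetical task running a single dbt model
        - task_key: model_my_project_customers
          environment_key: Default
          dbt_task:
            commands:
              - dbt run --select customers --target dev
            source: GIT
          depends_on:
            # upstream staging model must finish first
            - task_key: model_my_project_stg_customers
```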
Generating notebook tasks within Databricks Workflows (recommended for best performance)
This is the recommended way to run dbt on Databricks, as it gives much faster task start times.
It uses a pre-cached base environment where dbt-databricks is already installed and ready for each new task, saving roughly 30 seconds of pip-install time per task. The native dbt_task on serverless has to install dbt fresh every time.
How it works. A small runner notebook (shipped with this package) triggers dbt for each task. dbt is lightweight — it parses your project, figures out what SQL to run, and sends that SQL to your SQL warehouse. The actual model transformation runs in the warehouse, not in the notebook. The notebook (and whatever compute runs it, serverless or a cluster) is just the trigger — it doesn't crunch any data itself.
```shell
databricks_dbt_factory \
  --dbt-manifest-path target/manifest.json \
  --input-job-spec-path job_template.yaml \
  --target-job-spec-path job_definition.yaml \
  --task-type notebook \
  --source GIT \
  --target dev
```
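For illustration, a generated notebook task might look roughly like the fragment below. The parameter names are hypothetical; inspect the generated spec for the factory's actual contract:

```yaml
tasks:
  # hypothetical notebook task wrapping one dbt model run
  - task_key: model_my_project_customers
    environment_key: Default
    notebook_task:
      notebook_path: ./run_dbt_command.py
      base_parameters:
        dbt_command: "dbt run --select customers --target dev"
        project_directory: /Workspace/Users/you@example.com/my_dbt_project
    depends_on:
      - task_key: model_my_project_stg_customers
```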
The packaged runner notebook (`run_dbt_command.py`) is copied next to the generated job spec automatically. The `databricks bundle deploy` command then uploads it to the workspace along with the job.
Pass `--notebook-path <path>` if you want to pin the notebook elsewhere and manage it yourself.
When pinning `--notebook-path`, always provide `--project-directory` as an absolute workspace path to make sure the dbt project directory is resolved correctly. In auto-copy mode the factory places the runner at the project root and rewrites paths accordingly. When you pin the notebook somewhere else, the factory can't know where your project lives relative to it — only an absolute `--project-directory` (e.g. `/Workspace/Users/you@example.com/my_dbt_project`) is guaranteed to work at runtime.
If your dbt project lives in the workspace instead of git (`--source WORKSPACE`), also pass `--project-directory` and `--profiles-directory` pointing at the absolute workspace paths of the uploaded project, e.g.:
```shell
databricks_dbt_factory ... \
  --task-type notebook \
  --source WORKSPACE \
  --project-directory /Workspace/Users/you@example.com/my_dbt_project \
  --profiles-directory /Workspace/Users/you@example.com/my_dbt_project
```
Providing your own cluster (non-serverless mode)
To trigger tasks from a dedicated job cluster instead of serverless, use `--job-cluster-key` and define the cluster in your job template. The cluster only runs dbt's lightweight orchestration step (parse, compile, dispatch) — the actual SQL still executes on the SQL warehouse configured in your `profiles.yml`. A small cluster is enough.
```yaml
resources:
  jobs:
    my_dbt_job:
      name: my_dbt_job
      job_clusters:
        - job_cluster_key: dbt_cluster
          new_cluster:
            spark_version: 16.2.x-scala2.12
            num_workers: 1
            node_type_id: i3.xlarge
```
```shell
databricks_dbt_factory \
  --dbt-manifest-path target/manifest.json \
  --input-job-spec-path job_template.yaml \
  --target-job-spec-path job_definition.yaml \
  --task-type notebook \
  --job-cluster-key dbt_cluster \
  --source GIT \
  --target dev
```
Arguments
- `--new-job-name` (type: str, optional, default: None): Optional job name. If provided, the existing job name in the job spec is updated.
- `--dbt-manifest-path` (type: str, required): Path to the dbt manifest file.
- `--input-job-spec-path` (type: str, required): Path to the input job spec file (the job template).
- `--target-job-spec-path` (type: str, required): Path to the target job spec file.
- `--target` (type: str, optional): dbt target to use. If not provided, the default target from the dbt profile will be used.
- `--source` (type: str, optional, default: None): Project source (`GIT` or `WORKSPACE`). If not provided, `WORKSPACE` will be used.
- `--task-type` (type: str, optional, default: "dbt"): Task type to generate — `dbt` for native dbt_task, `notebook` for notebook_task wrapper.
- `--notebook-path` (type: str, optional): Path to the dbt runner notebook used when `--task-type notebook`. If omitted, the packaged runner notebook is copied next to the generated job spec and referenced relatively, so `databricks bundle deploy` uploads it automatically. When provided, also pass `--project-directory` as an absolute workspace path — see the note in Generating notebook tasks.
- `--warehouse_id` (type: str, optional): SQL Warehouse ID. Only used with native dbt_task.
- `--schema` (type: str, optional): Metastore schema. Only used with native dbt_task.
- `--catalog` (type: str, optional): Metastore catalog. Only used with native dbt_task.
- `--profiles-directory` (type: str, optional): Path to the profiles directory.
- `--project-directory` (type: str, optional): Path to the dbt project directory.
- `--environment-key` (type: str, optional, default: Default): Key of the serverless environment. Mutually exclusive with `--job-cluster-key`.
- `--job-cluster-key` (type: str, optional): Job cluster key for running tasks on job compute instead of serverless. Mutually exclusive with `--environment-key`.
- `--extra-dbt-command-options` (type: str, optional, default: ""): Additional dbt command options to include.
- `--no-run-tests` (flag, default: tests enabled): Skip generating dbt test tasks. Tests are included by default.
- `--bundle-tests` (flag, default: disabled): Performance boost — bundle single-model tests per resource into one `dbt test --select <resource>` task. Fewer Databricks tasks means fewer task startups, fewer dbt cold starts, and noticeably faster end-to-end runtime for projects with many tests. Downstream models/seeds/snapshots gate on the upstream's `tests_<resource>` task so failing tests still halt the DAG. Cross-model tests are emitted as their own tasks with multi-resource deps. See Test handling.
- `--enable-dbt-deps` (flag, default: disabled): Run `dbt deps` before each task.
- `--dbt-tasks-deps` (type: str, optional, default: None): Comma-separated list of tasks for which dbt deps should be run (e.g. "diamonds_prices,second_dbt_model"). Only in effect if `--enable-dbt-deps` is set.
- `--dry-run` (flag, default: disabled): Print generated tasks without updating the job spec file.
You can also check all input arguments by running `databricks_dbt_factory --help`.
Test handling
The factory produces tasks for dbt tests from the manifest by default (pass `--no-run-tests`
to skip them). Two modes are available, controlled by `--bundle-tests`:
Per-test (default)
One Databricks task per dbt test node, running `dbt test --select <test_name>`. Each test task's `depends_on` includes every model/seed/snapshot the test references, so multi-model tests (e.g. `relationships`) only run after all their endpoints are built. Downstream models are gated only on error-severity tests: every model/seed/snapshot task depends on the `severity: error` test tasks attached to its upstream resources, so a failing error test skips the downstream task. This matches `dbt build` semantics. `severity: warn` tests still run as their own tasks but are kept out of downstream `depends_on` — they surface findings without cluttering the DAG or blocking anything.
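As a sketch with hypothetical task names, a cross-model `relationships` test and an error-gated downstream model would be wired roughly like this:

```yaml
tasks:
  # cross-model test: waits for both endpoints it references
  - task_key: test_relationships_orders_customer_id
    dbt_task:
      commands:
        - dbt test --select relationships_orders_customer_id
    depends_on:
      - task_key: model_my_project_orders
      - task_key: model_my_project_customers
  # downstream model: gated on its upstream model and that
  # model's error-severity tests, but not on warn-severity ones
  - task_key: model_my_project_order_facts
    depends_on:
      - task_key: model_my_project_orders
      - task_key: test_not_null_orders_id  # severity: error
```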
- Pros: per-test failures are individually visible in the Databricks UI; downstream execution halts on error-severity test failure just like `dbt build`; cross-model tests wait for every endpoint they reference; warn tests stay informational, with no DAG gating.
- Cons: larger DAG (one task per test, and dbt projects routinely have many more tests than models); each downstream model's `depends_on` list grows with error-severity upstream tests.
Bundled (`--bundle-tests`) — recommended for performance
This is the faster mode. For projects with many tests (most real-world projects have far more tests than models), bundling dramatically reduces end-to-end runtime by cutting down on:
- Task startup overhead. Every Databricks task pays a cold-start tax. Going from N test tasks per resource down to one means N−1 fewer cold starts per resource.
- Repeated dbt initialization. Each `dbt test` invocation parses the manifest, connects to the warehouse, and sets up the adapter. Bundling reduces this from once-per-test to once-per-resource.
- DAG coordination. Fewer tasks means less scheduler pressure on the job run.

For a 100-model project with ~5 tests per model, that's ~500 test tasks collapsing to ~100 — typically a large wall-clock win.
The factory classifies each dbt test node into one of two buckets based on its depends_on:
- Single-model tests (most tests: `unique`, `not_null`, `accepted_values`, column-level checks, …) — collapsed into one Databricks task per tested resource, with task key `tests_<resource_task_key>` (e.g. `tests_model_my_project_customers`) that runs all the resource's single-model tests together.
- Cross-model tests (e.g. `relationships`, custom tests that reference multiple models) — emitted as their own tasks, one per test node, with deps on every resource the test references. These run in parallel with the bundled tasks; they don't fit inside a bundle because their correctness requires all their endpoints to be built first.
Downstream models/seeds/snapshots that depend on a tested resource are rewired to depend on the upstream's `tests_<resource>` task, so data only flows downstream after its upstream single-model tests pass. Cross-model test tasks don't gate downstream execution — they run as leaf assertions.
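The resulting wiring looks roughly like this (task names hypothetical):

```yaml
tasks:
  # bundled single-model tests for the customers model
  - task_key: tests_model_my_project_customers
    dbt_task:
      commands:
        - dbt test --select customers
    depends_on:
      - task_key: model_my_project_customers
  # downstream model gates on the bundled test task,
  # not on the individual tests
  - task_key: model_my_project_customer_facts
    depends_on:
      - task_key: tests_model_my_project_customers
```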
Severity handling: warn-severity test failures exit 0 in dbt, so the bundled `tests_<resource>` task is green and downstream still runs. Error-severity failures exit non-zero, the `tests_<resource>` task goes red, and downstream is skipped. Same end result as per-test mode (warn ≠ blocking, error = blocking), just via dbt's exit code rather than our dep-graph filtering.
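Severity itself is configured in the dbt project as usual, e.g. in a model's `schema.yml`:

```yaml
models:
  - name: customers
    columns:
      - name: id
        tests:
          - not_null            # defaults to severity: error, blocks downstream
          - unique:
              config:
                severity: warn  # reported, but never blocks downstream
```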
- Pros: faster — fewer task startups and fewer dbt invocations translate directly into shorter end-to-end run times; smaller, cleaner DAG in the UI.
- Cons: per-test failure visibility is lost inside a bundle — a failure shows up as one red `tests_<resource>` task rather than a specific red `<test_name>` task in the UI; drill into the task logs to see which individual test(s) failed. (Cross-model test tasks retain their per-test visibility because they aren't bundled.)
Task types
The factory supports two task types, controlled by `--task-type`:
dbt (default)
Generates native Databricks dbt_task entries. This is the standard approach that
uses Databricks' built-in dbt integration. Works with both classic compute and serverless.
Limitations on serverless: native dbt tasks do not support workspace base environments (so dependencies are installed on every task) or environment variables. If you need either of these, use the notebook task type instead.
notebook
Generates notebook_task entries that wrap dbt execution via the dbtRunner Python API.
Each task calls a shared runner notebook (run_dbt_command.py) with parameterized dbt commands.
Advantages over native dbt_task:
- Faster execution by avoiding the cold-start problem — all dependencies can be pre-cached inside a `base_environment`.
- Supports running the dbt process on job compute via `--job-cluster-key` (SQL execution still uses the warehouse in `profiles.yml`).
- More flexibility — the runner notebook is editable. Want to load secrets from a scope before dbt runs? Run dbt, then call a Python API with the result? Emit a Slack message on failure? Tag the run with the Git SHA? Add a few lines to the runner notebook.
End-to-end example
A complete working project is available at mwojtyczka/dbt-demo. The steps below walk through running it end-to-end.
1. Clone the demo project:

   ```shell
   git clone https://github.com/mwojtyczka/dbt-demo.git
   cd dbt-demo
   ```

2. Install dependencies:

   ```shell
   pip install dbt-databricks databricks-dbt-factory
   ```

   Install the Databricks CLI:

   ```shell
   brew install databricks
   ```

3. Set auth environment variables. The demo's `profiles.yml` reads these to connect to Databricks:

   ```shell
   export DBT_HOST="https://<your-workspace>.cloud.databricks.com"
   export DBT_ACCESS_TOKEN="<your-pat>"
   ```

4. Compile the dbt project to produce the dbt manifest file (`target/manifest.json`), which the factory reads:

   ```shell
   dbt compile
   ```

5. Create the Databricks Workflow. This reads the manifest and the job template (`resources/dbt_sql_job.yml`) and writes a new, fully-expanded job spec to `resources/dbt_sql_job_explicit_tasks.yml` — one task per dbt node, wired up with the right dependencies:

   ```shell
   databricks_dbt_factory \
     --dbt-manifest-path target/manifest.json \
     --input-job-spec-path resources/dbt_sql_job.yml \
     --target-job-spec-path resources/dbt_sql_job_explicit_tasks.yml \
     --target '${bundle.target}' \
     --project-directory ../ \
     --profiles-directory . \
     --environment-key Default \
     --new-job-name dbt_sql_job_explicit_tasks
   ```

   For best performance, add `--task-type notebook` to the command above — it routes dbt execution through the packaged runner notebook (pre-cached base environments, faster cold starts). See Generating notebook tasks for the full rationale.

6. Authenticate the Databricks CLI to your workspace. The `databricks.yml` in the demo references a specific profile (e.g. `FIELD-ENG`) under each target. Log in so that profile resolves:

   ```shell
   databricks auth login --host https://<your-workspace>.cloud.databricks.com
   ```

   You can verify with `databricks auth profiles`. If your `databricks.yml` uses a different profile name, pass `--profile <name>` on the login command to match.

7. Deploy and run the bundle:

   ```shell
   databricks bundle deploy --target dev
   databricks bundle run dbt_sql_job_explicit_tasks
   ```

   Open the run URL the CLI prints to watch the generated task graph execute in the Databricks UI.
Contribution
See contribution guidance here.
License
databricks-dbt-factory is distributed under the terms of the MIT license.