A Microsoft Fabric Spark adapter plugin for dbt
Reason this release was yanked: experimental
Project description
dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
dbt is the T in ELT. Organize, cleanse, denormalize, filter, rename, and pre-aggregate the raw data in your warehouse so that it's ready for analysis.
dbt-fabricspark
The dbt-fabricspark package contains all of the code enabling dbt to work with Apache Spark in Microsoft Fabric. This adapter connects to Fabric Lakehouses via Livy endpoints and supports both schema-enabled and non-schema Lakehouse configurations.
Key Features
- Livy session management with session reuse across dbt runs
- Lakehouse with schema support — auto-detects schema-enabled lakehouses and uses three-part naming (`lakehouse.schema.table`)
- Lakehouse without schema — standard two-part naming (`lakehouse.table`)
- Materializations: table, view, incremental (append, merge, insert_overwrite), seed, snapshot
- Fabric Environment support via `environmentId` configuration
- Security: credential masking, UUID validation, HTTPS + domain validation, thread-safe token refresh
- Resilience: HTTP 5xx retry with exponential backoff, bounded polling with configurable timeouts
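To make the resilience behavior concrete, HTTP 5xx retry with exponential backoff can be sketched in a few lines of plain Python. This is our illustration, not the adapter's actual code; the function and parameter names are ours.

```python
import time

def with_backoff(call, max_retries=3, base_delay=1.0):
    """Retry `call` on HTTP 5xx responses, doubling the delay each attempt.

    `call` returns an object with a `.status_code` attribute; a 5xx status
    triggers a retry, anything else is returned immediately.
    """
    delay = base_delay
    for attempt in range(max_retries + 1):
        resp = call()
        if resp.status_code < 500 or attempt == max_retries:
            return resp
        time.sleep(delay)   # bounded wait before the next attempt
        delay *= 2          # exponential backoff: 1s, 2s, 4s, ...
```

Bounded polling works the same way, except the delay stays fixed (`poll_wait`) and the loop is capped by a total timeout rather than an attempt count.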
Getting started
- Install dbt
- Read the introduction and viewpoint
Installation
```shell
pip install dbt-fabricspark
```
Configuration
The adapter connects to Apache Spark in Microsoft Fabric through a Livy endpoint, configured in your profiles.yml.
Connection Modes
The adapter supports two connection modes via the livy_mode setting:
- Local mode (`livy_mode: local`) — Connects to a self-hosted Spark instance running in a Docker container (contributed by @mdrakiburrahman). This mode supports the `reuse_session` flag and does not require Fabric compute, making it ideal for offline development and testing.
- Fabric mode (`livy_mode: fabric`, default) — Connects to Apache Spark in Microsoft Fabric via the Fabric Livy API. For development workflows, enable `reuse_session: true` to persist the Livy session ID to a local file (configured via `session_id_file`, defaults to `./livy-session-id.txt`). On subsequent `dbt` runs, the adapter reuses the existing session from the persisted file instead of creating a new one. If the file does not exist or the session has expired, a new session is created automatically.
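The reuse flow described above can be pictured with a small sketch (ours, not the adapter's internals): read the persisted session ID if present, validate it, and fall back to creating a fresh session. `is_alive` and `create_session` stand in for the Livy API calls.

```python
import os

def resolve_session(session_id_file, is_alive, create_session):
    """Return a usable Livy session ID, reusing the persisted one when valid.

    `is_alive(session_id)` checks the session against the Livy API;
    `create_session()` starts a new session and returns its ID.
    """
    if os.path.exists(session_id_file):
        with open(session_id_file) as f:
            session_id = f.read().strip()
        if session_id and is_alive(session_id):
            return session_id  # reuse the existing session
    # No file, empty file, or expired session: create and persist a new one
    session_id = create_session()
    with open(session_id_file, "w") as f:
        f.write(session_id)
    return session_id
```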
Lakehouse without Schema
For standard Lakehouses (schema not enabled), use two-part naming. The schema field is set to the lakehouse name:
```yaml
fabric-spark-test:
  target: fabricspark-dev
  outputs:
    fabricspark-dev:
      # Connection
      type: fabricspark
      method: livy
      endpoint: https://api.fabric.microsoft.com/v1
      workspaceid: <your-workspace-id>
      lakehouseid: <your-lakehouse-id>
      lakehouse: my_lakehouse
      schema: my_lakehouse
      threads: 1
      # Authentication (CLI for local dev, SPN for CI/CD)
      authentication: CLI
      # client_id: <your-client-id>         # Required for SPN
      # tenant_id: <your-tenant-id>         # Required for SPN
      # client_secret: <your-client-secret> # Required for SPN
      # Fabric Environment (optional)
      # environmentId: <your-environment-id>
      # Session management
      reuse_session: true
      session_idle_timeout: "30m"
      # session_id_file: ./livy-session-id.txt # Default path
      # Timeouts
      connect_retries: 1
      connect_timeout: 10
      http_timeout: 120            # Seconds per HTTP request
      session_start_timeout: 600   # Max wait for session start (10 min)
      statement_timeout: 3600      # Max wait for statement result (1 hour)
      poll_wait: 10                # Seconds between session start polls
      poll_statement_wait: 5       # Seconds between statement result polls
      # Retry & Shortcuts
      retry_all: true
      # create_shortcuts: false
      # shortcuts_json_str: '<json-string>'
      # Spark configuration (optional)
      # spark_config:
      #   name: "my-spark-session"
      #   spark.executor.memory: "4g"
```
In this mode:
- Tables are referenced as `lakehouse.table_name`
- The `schema` field should match the `lakehouse` name
- All objects are created directly under the lakehouse
Lakehouse with Schema (Schema-Enabled)
For schema-enabled Lakehouses, you can organize tables into schemas within the lakehouse. The adapter auto-detects whether a lakehouse has schemas enabled via the Fabric REST API (properties.defaultSchema):
```yaml
fabric-spark-test:
  target: fabricspark-dev
  outputs:
    fabricspark-dev:
      # Connection
      type: fabricspark
      method: livy
      endpoint: https://api.fabric.microsoft.com/v1
      workspaceid: <your-workspace-id>
      lakehouseid: <your-lakehouse-id>
      lakehouse: my_lakehouse
      schema: my_schema            # Different from lakehouse name
      threads: 1
      # Authentication (CLI for local dev, SPN for CI/CD)
      authentication: CLI
      # client_id: <your-client-id>         # Required for SPN
      # tenant_id: <your-tenant-id>         # Required for SPN
      # client_secret: <your-client-secret> # Required for SPN
      # Fabric Environment (optional)
      # environmentId: <your-environment-id>
      # Session management
      reuse_session: true
      session_idle_timeout: "30m"
      # session_id_file: ./livy-session-id.txt # Default path
      # Timeouts
      connect_retries: 1
      connect_timeout: 10
      http_timeout: 120            # Seconds per HTTP request
      session_start_timeout: 600   # Max wait for session start (10 min)
      statement_timeout: 3600      # Max wait for statement result (1 hour)
      poll_wait: 10                # Seconds between session start polls
      poll_statement_wait: 5       # Seconds between statement result polls
      # Retry & Shortcuts
      retry_all: true
      # create_shortcuts: false
      # shortcuts_json_str: '<json-string>'
      # Spark configuration (optional)
      # spark_config:
      #   name: "my-spark-session"
      #   spark.executor.memory: "4g"
```
In this mode:
- Tables are referenced using three-part naming: `lakehouse.schema.table_name`
- The `schema` field specifies the target schema within the lakehouse
- dbt's `generate_schema_name` and `generate_database_name` macros are lakehouse-aware
- Schemas are created automatically via `CREATE DATABASE IF NOT EXISTS lakehouse.schema`
- Incremental models use persisted staging tables (instead of temp views) to work around Spark's `REQUIRES_SINGLE_PART_NAMESPACE` limitation
Schema Detection
The adapter detects whether a lakehouse has schemas enabled using two complementary mechanisms:
- Runtime detection (Fabric REST API): During `connection.open()`, the adapter calls the Fabric REST API to fetch lakehouse properties. If the response contains `defaultSchema`, the lakehouse is treated as schema-enabled and three-part naming is used.
- Parse-time detection (profile heuristic): During manifest parsing (before any connection is opened), the adapter checks whether `schema` differs from `lakehouse` in your profile. When they differ (e.g., `lakehouse: bronze`, `schema: dbo`), the adapter infers schema-enabled mode. This ensures correct schema resolution at compile time.
Important: For schema-enabled lakehouses, always set `schema` to a value different from `lakehouse` in your profile (e.g., `schema: dbo`). If `schema` equals `lakehouse`, the adapter cannot distinguish schema-enabled from non-schema mode at parse time, and the lakehouse name will be used as the schema name instead.
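Both detection paths reduce to a one-line check each; here is a sketch (ours, not the adapter's code) of the two mechanisms:

```python
def schema_enabled_at_parse_time(profile: dict) -> bool:
    """Profile heuristic: differing `schema` and `lakehouse` imply schemas."""
    return profile.get("schema") != profile.get("lakehouse")

def schema_enabled_at_runtime(lakehouse_properties: dict) -> bool:
    """REST-API check: a `defaultSchema` property implies schemas."""
    return "defaultSchema" in lakehouse_properties
```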
| Lakehouse Type | `lakehouse` | `schema` | Naming |
|---|---|---|---|
| Without schema | `my_lakehouse` | `my_lakehouse` | `my_lakehouse.table_name` |
| With schema | `my_lakehouse` | `dbo` | `my_lakehouse.dbo.table_name` |
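The two naming schemes in the table can be expressed as a small helper (a sketch for illustration, not the adapter's relation class):

```python
def relation_name(lakehouse: str, schema: str, table: str) -> str:
    """Two-part name when schema equals the lakehouse, else three-part."""
    if schema == lakehouse:
        return f"{lakehouse}.{table}"
    return f"{lakehouse}.{schema}.{table}"
```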
Cross-Lakehouse Writes
A single profile can write to multiple lakehouses using the database config on individual models. The profile's lakehouse is the default target; set database on a model to redirect writes to a different lakehouse in the same workspace.
```yaml
# profiles.yml — profile targets the "bronze" lakehouse
fabric-spark:
  type: fabricspark
  lakehouse: bronze
  schema: dbo
  # ... other settings
```

```sql
-- models/silver/silver_orders.sql — writes to the "silver" lakehouse
{{ config(
    materialized='table',
    database='silver',
    schema='dbo'
) }}

select * from {{ ref('bronze_orders') }}
```
In this example:
- Seeds and bronze models write to `bronze.dbo.*` (the default lakehouse)
- Silver models write to `silver.dbo.*` via `database='silver'`
- Gold models write to `gold.dbo.*` via `database='gold'`
- All three lakehouses must exist in the same Fabric workspace and have schemas enabled
Configuration Reference
| Option | Type | Default | Description |
|---|---|---|---|
| `type` | string | — | Must be `fabricspark` |
| `method` | string | `livy` | Connection method |
| `endpoint` | string | `https://api.fabric.microsoft.com/v1` | Fabric API endpoint URL |
| `workspaceid` | string | — | Fabric workspace UUID |
| `lakehouseid` | string | — | Lakehouse UUID |
| `lakehouse` | string | — | Lakehouse name |
| `schema` | string | — | Schema name. Must equal `lakehouse` for non-schema lakehouses; must differ from `lakehouse` for schema-enabled (e.g., `dbo`) |
| `threads` | int | `1` | Number of threads for parallel execution |
| **Authentication** | | | |
| `authentication` | string | `CLI` | Auth method: `CLI`, `SPN`, or `fabric_notebook` |
| `client_id` | string | — | Service principal client ID (SPN only) |
| `tenant_id` | string | — | Azure AD tenant ID (SPN only) |
| `client_secret` | string | — | Service principal secret (SPN only) |
| `accessToken` | string | — | Direct access token (optional) |
| **Environment** | | | |
| `environmentId` | string | — | Fabric Environment ID for Spark configuration |
| `spark_config` | dict | `{}` | Spark session configuration (must include `name` key) |
| **Session Management** | | | |
| `reuse_session` | bool | `false` | Keep Livy sessions alive for reuse across runs |
| `session_id_file` | string | `./livy-session-id.txt` | Path to file storing session ID for reuse |
| `session_idle_timeout` | string | `30m` | Livy session idle timeout (e.g. `30m`, `1h`) |
| **Timeouts & Polling** | | | |
| `connect_retries` | int | `1` | Number of connection retries |
| `connect_timeout` | int | `10` | Connection timeout in seconds |
| `http_timeout` | int | `120` | Seconds per HTTP request to Fabric API |
| `session_start_timeout` | int | `600` | Max seconds to wait for session start |
| `statement_timeout` | int | `3600` | Max seconds to wait for statement result |
| `poll_wait` | int | `10` | Seconds between session start polls |
| `poll_statement_wait` | int | `5` | Seconds between statement result polls |
| **Other** | | | |
| `retry_all` | bool | `false` | Retry all operations on failure |
| `create_shortcuts` | bool | `false` | Enable Fabric shortcut creation |
| `shortcuts_json_str` | string | — | JSON string defining shortcuts |
| `livy_mode` | string | `fabric` | `fabric` for Fabric cloud, `local` for local Livy |
| `livy_url` | string | `http://localhost:8998` | Local Livy URL (local mode only) |
Authentication Modes
| Mode | Value | Use Case | Required Fields |
|---|---|---|---|
| Azure CLI | `CLI` | Local development. Uses `az login` credentials. | None (run `az login` first) |
| Service Principal | `SPN` | CI/CD and automation. Uses Azure AD app registration. | `client_id`, `tenant_id`, `client_secret` |
| Fabric Notebook | `fabric_notebook` | Running dbt inside a Fabric notebook. Uses `notebookutils.credentials`. | None (runs in Fabric runtime) |
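For CI/CD with a service principal, a profile might pull the credentials from environment variables via dbt's built-in `env_var` function. This is a sketch under assumptions: the environment variable names are placeholders, not anything this adapter requires.

```yaml
fabric-spark-ci:
  target: prod
  outputs:
    prod:
      type: fabricspark
      method: livy
      endpoint: https://api.fabric.microsoft.com/v1
      workspaceid: <your-workspace-id>
      lakehouseid: <your-lakehouse-id>
      lakehouse: my_lakehouse
      schema: dbo
      authentication: SPN
      # Variable names are illustrative; use whatever your CI provides
      client_id: "{{ env_var('FABRIC_CLIENT_ID') }}"
      tenant_id: "{{ env_var('FABRIC_TENANT_ID') }}"
      client_secret: "{{ env_var('FABRIC_CLIENT_SECRET') }}"
```

Keeping the secret out of profiles.yml this way is standard dbt practice for any adapter.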
Materialized Lake Views
Materialized lake views are a Fabric-native construct that materializes a SQL query as a Delta table in your lakehouse, with automatic lineage-based refresh managed by Fabric.
Prerequisites
- Schema-enabled lakehouse
- Fabric Runtime 1.3+
- Source tables must be Delta tables
Basic Usage
```sql
-- models/silver/silver_cleaned_orders.sql
{{ config(
    materialized='materialized_lake_view',
    database='silver',
    schema='dbo'
) }}

SELECT
    o.order_id,
    o.product_id,
    p.product_name,
    o.quantity,
    p.price,
    o.quantity * p.price AS revenue
FROM {{ ref('bronze_orders') }} o
JOIN {{ ref('bronze_products') }} p
    ON o.product_id = p.product_id
```
Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
| `materialized` | string | — | Must be `'materialized_lake_view'` |
| `database` | string | target lakehouse | Target lakehouse for cross-lakehouse writes |
| `schema` | string | target schema | Target schema within the lakehouse |
| `partitioned_by` | list | — | Columns to partition the MLV by |
| `mlv_comment` | string | — | Description stored with the MLV definition |
| `mlv_constraints` | list | `[]` | Data quality constraints (see below) |
| `tblproperties` | dict | — | Key-value metadata properties |
| `enable_cdf` | bool | `true` | Auto-enable Change Data Feed on source tables |
| `mlv_on_demand` | bool | `false` | Trigger immediate refresh after creation |
| `mlv_schedule` | dict | — | Schedule config for periodic refresh (see below) |
Data Quality Constraints
```sql
{{ config(
    materialized='materialized_lake_view',
    mlv_constraints=[
        {"name": "valid_quantity", "expression": "quantity > 0", "on_mismatch": "DROP"},
        {"name": "valid_price", "expression": "price >= 0", "on_mismatch": "FAIL"}
    ]
) }}
```
Each constraint has:
- `name` — Constraint identifier
- `expression` — Boolean expression each row must satisfy
- `on_mismatch` — `DROP` (silently remove violating rows) or `FAIL` (stop refresh with error, default)
Change Data Feed
The adapter automatically enables Change Data Feed (CDF) on all upstream source tables referenced via ref() before creating the MLV. This enables optimal incremental refresh. To disable:
```sql
{{ config(
    materialized='materialized_lake_view',
    enable_cdf=false
) }}
```
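If you disable the automatic behavior and manage CDF yourself, the standard Delta Lake statement looks like this (the table name below is illustrative):

```sql
ALTER TABLE bronze.dbo.bronze_orders
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
```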
On-Demand Refresh
Trigger an immediate MLV lineage refresh after creation:
```sql
{{ config(
    materialized='materialized_lake_view',
    mlv_on_demand=true
) }}
```
This calls the Fabric Job Scheduler API:

```
POST /v1/workspaces/{workspaceId}/lakehouses/{lakehouseId}/jobs/RefreshMaterializedLakeViews/instances
```
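The same refresh can be triggered outside of dbt by calling that endpoint directly. A minimal sketch (ours; token acquisition is elided, and the helper names are not part of the adapter):

```python
import urllib.request

FABRIC_API = "https://api.fabric.microsoft.com/v1"

def refresh_mlv_url(workspace_id: str, lakehouse_id: str) -> str:
    """Build the Job Scheduler endpoint that triggers an MLV lineage refresh."""
    return (f"{FABRIC_API}/workspaces/{workspace_id}"
            f"/lakehouses/{lakehouse_id}/jobs/RefreshMaterializedLakeViews/instances")

def trigger_refresh(workspace_id: str, lakehouse_id: str, token: str) -> int:
    """POST to the endpoint with a bearer token; returns the HTTP status code."""
    req = urllib.request.Request(
        refresh_mlv_url(workspace_id, lakehouse_id),
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
        data=b"",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return resp.status
```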
Scheduled Refresh
Create or update a periodic refresh schedule. The adapter uses the Fabric Job Scheduler API to manage schedules. Only one active schedule per lakehouse lineage is supported — the adapter automatically updates an existing schedule if one is found.
Cron schedule (interval in minutes):
```sql
{{ config(
    materialized='materialized_lake_view',
    mlv_schedule={
        "enabled": true,
        "configuration": {
            "startDateTime": "2026-04-10T00:00:00",
            "endDateTime": "2026-12-31T23:59:59",
            "localTimeZoneId": "Central Standard Time",
            "type": "Cron",
            "interval": 10
        }
    }
) }}
```
Daily schedule (specific times):
```sql
{{ config(
    materialized='materialized_lake_view',
    mlv_schedule={
        "enabled": true,
        "configuration": {
            "startDateTime": "2026-04-10T00:00:00",
            "endDateTime": "2026-12-31T23:59:59",
            "localTimeZoneId": "Central Standard Time",
            "type": "Daily",
            "times": ["06:00", "18:00"]
        }
    }
) }}
```
Weekly schedule (specific days and times):
```sql
{{ config(
    materialized='materialized_lake_view',
    mlv_schedule={
        "enabled": true,
        "configuration": {
            "startDateTime": "2026-04-10T00:00:00",
            "endDateTime": "2026-12-31T23:59:59",
            "localTimeZoneId": "Central Standard Time",
            "type": "Weekly",
            "weekdays": ["Monday", "Wednesday", "Friday"],
            "times": ["08:00"]
        }
    }
) }}
```
Full Example with All Options
```sql
{{ config(
    materialized='materialized_lake_view',
    database='gold',
    schema='dbo',
    partitioned_by=['product_type'],
    mlv_comment='Product sales summary with quality checks',
    mlv_constraints=[
        {"name": "positive_revenue", "expression": "total_revenue >= 0", "on_mismatch": "DROP"}
    ],
    tblproperties={"quality_tier": "gold"},
    enable_cdf=true,
    mlv_on_demand=true
) }}

SELECT
    product_id,
    product_name,
    product_type,
    SUM(quantity) AS total_quantity_sold,
    SUM(revenue) AS total_revenue
FROM {{ ref('silver_order_items') }}
GROUP BY product_id, product_name, product_type
```
Generated SQL:
```sql
CREATE OR REPLACE MATERIALIZED LAKE VIEW gold.dbo.product_sales_summary
(
    CONSTRAINT positive_revenue CHECK (total_revenue >= 0) ON MISMATCH DROP
)
PARTITIONED BY (product_type)
COMMENT 'Product sales summary with quality checks'
TBLPROPERTIES ("quality_tier"="gold")
AS
SELECT ...
```
Limitations
- No ALTER on definition — Changing the SELECT query, constraints, or partitioning requires drop + recreate. The adapter uses `CREATE OR REPLACE`, which handles this automatically.
- Only RENAME via ALTER — `ALTER MATERIALIZED LAKE VIEW ... RENAME TO ...` is the only supported ALTER operation.
- No DML — `INSERT`, `UPDATE`, `DELETE` are not supported on MLVs.
- No UDFs — User-defined functions are not supported in the SELECT query.
- No time-travel — `VERSION AS OF` / `TIMESTAMP AS OF` syntax is not supported.
- No temp views as sources — The SELECT query can reference tables and other MLVs, but not temporary views.
- Schedule is per-lakehouse — One active schedule per lakehouse lineage, not per MLV.
Reporting bugs and contributing code
- Want to report a bug or request a feature? Let us know on Slack, or open an issue
- Want to help us build dbt? Check out the Contributing Guide
Join the dbt Community
- Be part of the conversation in the dbt Community Slack
- Read more on the dbt Community Discourse
Code of Conduct
Everyone interacting in the dbt project's codebases, issue trackers, chat rooms, and mailing lists is expected to follow the dbt Code of Conduct.
File details
Details for the file vd_dbt_fabricspark-1.9.14.tar.gz.
File metadata
- Download URL: vd_dbt_fabricspark-1.9.14.tar.gz
- Size: 349.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `a9ec2bd90f60a963521d60e6f6c2caf551a68ce1f4c8e60df637a8109627ee71` |
| MD5 | `5035c4a1a3c9cabe803d37114387c704` |
| BLAKE2b-256 | `5b45fec3e5492a801ab37a914a4b9dc2c6c80f3434db5a25434f6d9993ab14ea` |
File details
Details for the file vd_dbt_fabricspark-1.9.14-py3-none-any.whl.
File metadata
- Download URL: vd_dbt_fabricspark-1.9.14-py3-none-any.whl
- Size: 94.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `c2d791ac2dc2759646217409d86e6cbd4ed7b571a5ee936c239741b243f497a3` |
| MD5 | `8453a73b4d744fb710c36d8c504892ab` |
| BLAKE2b-256 | `57348aed1130a0bfe4c7bc444d94b0a482e65d80e8200b13a1996c8b91302046` |