A Microsoft Fabric Spark adapter plugin for dbt

Reason this release was yanked: experimental



dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.

dbt is the T in ELT. Organize, cleanse, denormalize, filter, rename, and pre-aggregate the raw data in your warehouse so that it's ready for analysis.

dbt-fabricspark

The dbt-fabricspark package contains all of the code enabling dbt to work with Apache Spark in Microsoft Fabric. This adapter connects to Fabric Lakehouses via Livy endpoints and supports both schema-enabled and non-schema Lakehouse configurations.

Key Features

  • Livy session management with session reuse across dbt runs
  • Lakehouse with schema support — auto-detects schema-enabled lakehouses and uses three-part naming (lakehouse.schema.table)
  • Lakehouse without schema — standard two-part naming (lakehouse.table)
  • Materializations: table, view, incremental (append, merge, insert_overwrite), seed, snapshot
  • Fabric Environment support via environmentId configuration
  • Security: credential masking, UUID validation, HTTPS + domain validation, thread-safe token refresh
  • Resilience: HTTP 5xx retry with exponential backoff, bounded polling with configurable timeouts
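
The retry behaviour listed under Resilience can be pictured with a minimal sketch. This is not the adapter's code: `call_with_backoff`, its parameters, and the 1s/2s/4s delay series are hypothetical names and values chosen for illustration, and the sender is any callable that returns an HTTP status code.

```python
import time
from typing import Callable


def call_with_backoff(
    send: Callable[[], int],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a non-5xx status or attempts run out.

    Hypothetical helper: the delay doubles after every failed attempt
    (base_delay, 2 * base_delay, 4 * base_delay, ...).
    """
    status = send()
    for attempt in range(1, max_attempts):
        if status < 500:
            break  # success or a client error: retrying would not help
        sleep(base_delay * 2 ** (attempt - 1))
        status = send()
    return status
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a production version would typically also cap the delay and add jitter.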

Getting started

Installation

pip install dbt-fabricspark

Configuration

The adapter connects to Apache Spark in Microsoft Fabric through a Livy endpoint, which you configure in your profiles.yml.

Connection Modes

The adapter supports two connection modes via the livy_mode setting:

  • Local mode (livy_mode: local) — Connects to a self-hosted Spark instance running in a Docker container (contributed by @mdrakiburrahman). This mode supports the reuse_session flag and does not require Fabric compute, making it ideal for offline development and testing.

  • Fabric mode (livy_mode: fabric, default) — Connects to Apache Spark in Microsoft Fabric via the Fabric Livy API. For development workflows, enable reuse_session: true to persist the Livy session ID to a local file (configured via session_id_file, defaults to ./livy-session-id.txt). On subsequent dbt runs, the adapter reuses the existing session from the persisted file instead of creating a new one. If the file does not exist or the session has expired, a new session is created automatically.
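
The session-reuse flow described for Fabric mode can be sketched as follows. This is an illustration under assumptions, not the adapter's implementation: `get_session_id`, `is_alive`, and `create_session` are hypothetical names, and the two callables stand in for the Livy calls that check and create sessions.

```python
from pathlib import Path
from typing import Callable


def get_session_id(
    session_id_file: str,
    is_alive: Callable[[str], bool],
    create_session: Callable[[], str],
) -> str:
    """Reuse the persisted Livy session ID if it is still usable.

    Falls back to creating a new session when the file is missing,
    empty, or points at an expired session, then persists the new ID.
    """
    path = Path(session_id_file)
    if path.exists():
        saved = path.read_text().strip()
        if saved and is_alive(saved):
            return saved
    new_id = create_session()
    path.write_text(new_id)
    return new_id
```

The first run writes the file, later runs return the stored ID, and an expired session silently triggers a replacement, which matches the three cases the paragraph above lists.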

Lakehouse without Schema

For standard Lakehouses (schema not enabled), use two-part naming. The schema field is set to the lakehouse name:

fabric-spark-test:
  target: fabricspark-dev
  outputs:
    fabricspark-dev:
        # Connection
        type: fabricspark
        method: livy
        endpoint: https://api.fabric.microsoft.com/v1
        workspaceid: <your-workspace-id>
        lakehouseid: <your-lakehouse-id>
        lakehouse: my_lakehouse
        schema: my_lakehouse
        threads: 1

        # Authentication (CLI for local dev, SPN for CI/CD)
        authentication: CLI
        # client_id: <your-client-id>        # Required for SPN
        # tenant_id: <your-tenant-id>        # Required for SPN
        # client_secret: <your-client-secret> # Required for SPN

        # Fabric Environment (optional)
        # environmentId: <your-environment-id>

        # Session management
        reuse_session: true
        session_idle_timeout: "30m"
        # session_id_file: ./livy-session-id.txt  # Default path

        # Timeouts
        connect_retries: 1
        connect_timeout: 10
        http_timeout: 120                   # Seconds per HTTP request
        session_start_timeout: 600          # Max wait for session start (10 min)
        statement_timeout: 3600             # Max wait for statement result (1 hour)
        poll_wait: 10                       # Seconds between session start polls
        poll_statement_wait: 5              # Seconds between statement result polls

        # Retry & Shortcuts
        retry_all: true
        # create_shortcuts: false
        # shortcuts_json_str: '<json-string>'

        # Spark configuration (optional)
        # spark_config:
        #   name: "my-spark-session"
        #   spark.executor.memory: "4g"

In this mode:

  • Tables are referenced as lakehouse.table_name
  • The schema field should match the lakehouse name
  • All objects are created directly under the lakehouse

Lakehouse with Schema (Schema-Enabled)

For schema-enabled Lakehouses, you can organize tables into schemas within the lakehouse. The adapter auto-detects whether a lakehouse has schemas enabled via the Fabric REST API (properties.defaultSchema):

fabric-spark-test:
  target: fabricspark-dev
  outputs:
    fabricspark-dev:
        # Connection
        type: fabricspark
        method: livy
        endpoint: https://api.fabric.microsoft.com/v1
        workspaceid: <your-workspace-id>
        lakehouseid: <your-lakehouse-id>
        lakehouse: my_lakehouse
        schema: my_schema              # Different from lakehouse name
        threads: 1

        # Authentication (CLI for local dev, SPN for CI/CD)
        authentication: CLI
        # client_id: <your-client-id>        # Required for SPN
        # tenant_id: <your-tenant-id>        # Required for SPN
        # client_secret: <your-client-secret> # Required for SPN

        # Fabric Environment (optional)
        # environmentId: <your-environment-id>

        # Session management
        reuse_session: true
        session_idle_timeout: "30m"
        # session_id_file: ./livy-session-id.txt  # Default path

        # Timeouts
        connect_retries: 1
        connect_timeout: 10
        http_timeout: 120                   # Seconds per HTTP request
        session_start_timeout: 600          # Max wait for session start (10 min)
        statement_timeout: 3600             # Max wait for statement result (1 hour)
        poll_wait: 10                       # Seconds between session start polls
        poll_statement_wait: 5              # Seconds between statement result polls

        # Retry & Shortcuts
        retry_all: true
        # create_shortcuts: false
        # shortcuts_json_str: '<json-string>'

        # Spark configuration (optional)
        # spark_config:
        #   name: "my-spark-session"
        #   spark.executor.memory: "4g"

In this mode:

  • Tables are referenced using three-part naming: lakehouse.schema.table_name
  • The schema field specifies the target schema within the lakehouse
  • dbt's generate_schema_name and generate_database_name macros are lakehouse-aware
  • Schemas are created automatically via CREATE DATABASE IF NOT EXISTS lakehouse.schema
  • Incremental models use persisted staging tables (instead of temp views) to work around Spark's REQUIRES_SINGLE_PART_NAMESPACE limitation

Schema Detection

The adapter detects whether a lakehouse has schemas enabled using two complementary mechanisms:

  1. Runtime detection (Fabric REST API): During connection.open(), the adapter calls the Fabric REST API to fetch lakehouse properties. If the response contains defaultSchema, the lakehouse is treated as schema-enabled and three-part naming is used.

  2. Parse-time detection (profile heuristic): During manifest parsing (before any connection is opened), the adapter checks whether schema differs from lakehouse in your profile. When they differ (e.g., lakehouse: bronze, schema: dbo), the adapter infers schema-enabled mode. This ensures correct schema resolution at compile time.

Important: For schema-enabled lakehouses, always set schema to a value different from lakehouse in your profile (e.g., schema: dbo). If schema equals lakehouse, the adapter cannot distinguish schema-enabled from non-schema mode at parse time, and the lakehouse name will be used as the schema name instead.

| Lakehouse Type | lakehouse | schema | Naming |
| --- | --- | --- | --- |
| Without schema | my_lakehouse | my_lakehouse | my_lakehouse.table_name |
| With schema | my_lakehouse | dbo | my_lakehouse.dbo.table_name |
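
The two detection mechanisms and the resulting naming can be summarised in a short sketch. The function names are hypothetical and the logic is a simplified reading of the rules above: the runtime check only looks for a `defaultSchema` key under `properties` in an already-parsed response, and the parse-time check is the plain comparison of `schema` and `lakehouse`.

```python
from typing import Any, Mapping


def schema_enabled_at_runtime(lakehouse_response: Mapping[str, Any]) -> bool:
    """Runtime check: the lakehouse payload carries properties.defaultSchema."""
    properties = lakehouse_response.get("properties") or {}
    return "defaultSchema" in properties


def schema_enabled_at_parse_time(lakehouse: str, schema: str) -> bool:
    """Parse-time heuristic: schema differs from lakehouse in the profile."""
    return schema != lakehouse


def relation_name(lakehouse: str, schema: str, table: str) -> str:
    """Render two-part or three-part naming from the profile values."""
    if schema_enabled_at_parse_time(lakehouse, schema):
        return f"{lakehouse}.{schema}.{table}"
    return f"{lakehouse}.{table}"
```

Calling `relation_name("my_lakehouse", "dbo", "table_name")` yields the three-part form from the table, while passing the lakehouse name as the schema yields the two-part form.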

Cross-Lakehouse Writes

A single profile can write to multiple lakehouses using the database config on individual models. The profile's lakehouse is the default target; set database on a model to redirect writes to a different lakehouse in the same workspace.

# profiles.yml — profile targets the "bronze" lakehouse
fabric-spark:
  type: fabricspark
  lakehouse: bronze
  schema: dbo
  # ... other settings

-- models/silver/silver_orders.sql: writes to the "silver" lakehouse
{{ config(
    materialized='table',
    database='silver',
    schema='dbo'
) }}

select * from {{ ref('bronze_orders') }}

In this example:

  • Seeds and bronze models write to bronze.dbo.* (the default lakehouse)
  • Silver models write to silver.dbo.* via database='silver'
  • Gold models write to gold.dbo.* via database='gold'
  • All three lakehouses must exist in the same Fabric workspace and have schemas enabled
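
How a model-level database overrides the profile default can be sketched like this. `target_relation` is a hypothetical helper, and it deliberately ignores dbt's `generate_schema_name` and `generate_database_name` macros, so treat it as a simplified model of the outcome rather than the adapter's resolution logic.

```python
from typing import Mapping, Optional


def target_relation(
    profile: Mapping[str, str],
    model_config: Mapping[str, Optional[str]],
    model_name: str,
) -> str:
    """Resolve lakehouse.schema.table for a schema-enabled lakehouse.

    A model-level database/schema wins; otherwise the profile's
    lakehouse and schema are used as defaults.
    """
    database = model_config.get("database") or profile["lakehouse"]
    schema = model_config.get("schema") or profile["schema"]
    return f"{database}.{schema}.{model_name}"


profile = {"lakehouse": "bronze", "schema": "dbo"}
bronze = target_relation(profile, {}, "bronze_orders")
silver = target_relation(profile, {"database": "silver", "schema": "dbo"}, "silver_orders")
```

Here `bronze` resolves to the profile's lakehouse and `silver` to the lakehouse named in the model config, mirroring the first two bullets above.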

Configuration Reference

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| type | string | | Must be fabricspark |
| method | string | livy | Connection method |
| endpoint | string | https://api.fabric.microsoft.com/v1 | Fabric API endpoint URL |
| workspaceid | string | | Fabric workspace UUID |
| lakehouseid | string | | Lakehouse UUID |
| lakehouse | string | | Lakehouse name |
| schema | string | | Schema name. Must equal lakehouse for non-schema lakehouses, must differ from lakehouse for schema-enabled (e.g., dbo) |
| threads | int | 1 | Number of threads for parallel execution |
| **Authentication** | | | |
| authentication | string | CLI | Auth method: CLI, SPN, or fabric_notebook |
| client_id | string | | Service principal client ID (SPN only) |
| tenant_id | string | | Azure AD tenant ID (SPN only) |
| client_secret | string | | Service principal secret (SPN only) |
| accessToken | string | | Direct access token (optional) |
| **Environment** | | | |
| environmentId | string | | Fabric Environment ID for Spark configuration |
| spark_config | dict | {} | Spark session configuration (must include name key) |
| **Session Management** | | | |
| reuse_session | bool | false | Keep Livy sessions alive for reuse across runs |
| session_id_file | string | ./livy-session-id.txt | Path to file storing session ID for reuse |
| session_idle_timeout | string | 30m | Livy session idle timeout (e.g. 30m, 1h) |
| **Timeouts & Polling** | | | |
| connect_retries | int | 1 | Number of connection retries |
| connect_timeout | int | 10 | Connection timeout in seconds |
| http_timeout | int | 120 | Seconds per HTTP request to Fabric API |
| session_start_timeout | int | 600 | Max seconds to wait for session start |
| statement_timeout | int | 3600 | Max seconds to wait for statement result |
| poll_wait | int | 10 | Seconds between session start polls |
| poll_statement_wait | int | 5 | Seconds between statement result polls |
| **Other** | | | |
| retry_all | bool | false | Retry all operations on failure |
| create_shortcuts | bool | false | Enable Fabric shortcut creation |
| shortcuts_json_str | string | | JSON string defining shortcuts |
| livy_mode | string | fabric | fabric for Fabric cloud, local for local Livy |
| livy_url | string | http://localhost:8998 | Local Livy URL (local mode only) |
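
To show how `session_start_timeout` and `poll_wait` interact, here is a minimal bounded-polling sketch. `wait_for_session` is a hypothetical function, and the state names ("idle" for ready, "error", "dead", "killed" for failure) are taken from Apache Livy's session states; the Fabric Livy API may report different values.

```python
import time
from typing import Callable

FAILED_STATES = {"error", "dead", "killed"}  # assumed terminal failure states


def wait_for_session(
    get_state: Callable[[], str],
    session_start_timeout: float = 600,
    poll_wait: float = 10,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the session is idle, failing after session_start_timeout."""
    deadline = clock() + session_start_timeout
    while True:
        state = get_state()
        if state == "idle":
            return state
        if state in FAILED_STATES:
            raise RuntimeError(f"session entered state {state!r}")
        if clock() >= deadline:
            raise TimeoutError(f"session not ready after {session_start_timeout}s")
        sleep(poll_wait)
```

The same loop shape would apply to `statement_timeout` and `poll_statement_wait`. The deadline guarantees the loop ends even if the session never leaves a starting state.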

Authentication Modes

| Mode | Value | Use Case | Required Fields |
| --- | --- | --- | --- |
| Azure CLI | CLI | Local development. Uses az login credentials. | None (run az login first) |
| Service Principal | SPN | CI/CD and automation. Uses Azure AD app registration. | client_id, tenant_id, client_secret |
| Fabric Notebook | fabric_notebook | Running dbt inside a Fabric notebook. Uses notebookutils.credentials. | None (runs in Fabric runtime) |
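
The required fields per mode can be checked before running dbt with a small sketch like the one below. `missing_auth_fields` and `REQUIRED_FIELDS` are hypothetical names that simply encode the table above; the adapter's own validation may differ.

```python
from typing import Any, List, Mapping

# Required profile fields per authentication value, as listed in the table.
REQUIRED_FIELDS = {
    "CLI": (),
    "SPN": ("client_id", "tenant_id", "client_secret"),
    "fabric_notebook": (),
}


def missing_auth_fields(profile: Mapping[str, Any]) -> List[str]:
    """Return the required fields that are absent or empty in the profile."""
    mode = profile.get("authentication", "CLI")  # CLI is the documented default
    if mode not in REQUIRED_FIELDS:
        raise ValueError(f"unknown authentication mode: {mode!r}")
    return [name for name in REQUIRED_FIELDS[mode] if not profile.get(name)]
```

An empty result means the profile carries everything the chosen mode needs; for SPN, the list names exactly which secrets still have to be supplied.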

Materialized Lake Views

Materialized lake views are a Fabric-native construct that materializes a SQL query as a Delta table in your lakehouse, with automatic lineage-based refresh managed by Fabric.

Prerequisites

  • Schema-enabled lakehouse
  • Fabric Runtime 1.3+
  • Source tables must be Delta tables

Basic Usage

-- models/silver/silver_cleaned_orders.sql
{{ config(
    materialized='materialized_lake_view',
    database='silver',
    schema='dbo'
) }}

SELECT
    o.order_id,
    o.product_id,
    p.product_name,
    o.quantity,
    p.price,
    o.quantity * p.price AS revenue
FROM {{ ref('bronze_orders') }} o
JOIN {{ ref('bronze_products') }} p
    ON o.product_id = p.product_id

Configuration Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| materialized | string | | Must be 'materialized_lake_view' |
| database | string | target lakehouse | Target lakehouse for cross-lakehouse writes |
| schema | string | target schema | Target schema within the lakehouse |
| partitioned_by | list | | Columns to partition the MLV by |
| mlv_comment | string | | Description stored with the MLV definition |
| mlv_constraints | list | [] | Data quality constraints (see below) |
| tblproperties | dict | | Key-value metadata properties |
| enable_cdf | bool | true | Auto-enable Change Data Feed on source tables |
| mlv_on_demand | bool | false | Trigger immediate refresh after creation |
| mlv_schedule | dict | | Schedule config for periodic refresh (see below) |

Data Quality Constraints

{{ config(
    materialized='materialized_lake_view',
    mlv_constraints=[
        {"name": "valid_quantity", "expression": "quantity > 0", "on_mismatch": "DROP"},
        {"name": "valid_price", "expression": "price >= 0", "on_mismatch": "FAIL"}
    ]
) }}

Each constraint has:

  • name: Constraint identifier
  • expression: Boolean expression each row must satisfy
  • on_mismatch: DROP (silently remove violating rows) or FAIL (stop refresh with error, default)
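
The difference between DROP and FAIL can be simulated in plain Python. This sketch only models the semantics described above: real enforcement happens in Fabric during refresh, `apply_constraint` is a hypothetical function, and a Python predicate stands in for the SQL expression.

```python
from typing import Any, Callable, Iterable, List, Mapping

Row = Mapping[str, Any]


def apply_constraint(
    rows: Iterable[Row],
    name: str,
    predicate: Callable[[Row], bool],
    on_mismatch: str = "FAIL",
) -> List[Row]:
    """Keep rows that satisfy the predicate.

    DROP discards violating rows silently; FAIL (the default) aborts
    on the first violation.
    """
    kept = []
    for row in rows:
        if predicate(row):
            kept.append(row)
        elif on_mismatch == "DROP":
            continue
        else:
            raise ValueError(f"constraint {name} violated by {dict(row)}")
    return kept


orders = [{"quantity": 2}, {"quantity": 0}]
valid = apply_constraint(orders, "valid_quantity", lambda r: r["quantity"] > 0, "DROP")
```

With DROP the zero-quantity row disappears from `valid`; the same call with FAIL raises instead of returning a partial result.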

Change Data Feed

The adapter automatically enables Change Data Feed (CDF) on all upstream source tables referenced via ref() before creating the MLV. This enables optimal incremental refresh. To disable:

{{ config(
    materialized='materialized_lake_view',
    enable_cdf=false
) }}
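
As a rough picture of what enabling CDF on upstream tables involves, the sketch below builds one statement per source table. The `delta.enableChangeDataFeed` table property is standard Delta Lake syntax, but whether the adapter issues exactly these statements is an assumption, and `cdf_statements` is a hypothetical helper.

```python
from typing import Iterable, List


def cdf_statements(upstream_tables: Iterable[str], enable_cdf: bool = True) -> List[str]:
    """Build ALTER TABLE statements that turn on Change Data Feed.

    Returns an empty list when enable_cdf is false, mirroring the
    config option above.
    """
    if not enable_cdf:
        return []
    return [
        f"ALTER TABLE {table} SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
        for table in upstream_tables
    ]
```

For a model that joins `bronze.dbo.bronze_orders` and `bronze.dbo.bronze_products`, this produces two statements, one per referenced table.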

On-Demand Refresh

Trigger an immediate MLV lineage refresh after creation:

{{ config(
    materialized='materialized_lake_view',
    mlv_on_demand=true
) }}

This calls the Fabric Job Scheduler API:

POST /v1/workspaces/{workspaceId}/lakehouses/{lakehouseId}/jobs/RefreshMaterializedLakeViews/instances
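
The request URL can be assembled from profile values as in this sketch. `refresh_url` is a hypothetical helper that only builds and validates the URL and sends nothing; the HTTPS and UUID checks are a simplified stand-in for the validation listed under Key Features.

```python
import uuid


def refresh_url(endpoint: str, workspace_id: str, lakehouse_id: str) -> str:
    """Build the on-demand refresh URL after basic input validation."""
    if not endpoint.startswith("https://"):
        raise ValueError("endpoint must use HTTPS")
    for value in (workspace_id, lakehouse_id):
        uuid.UUID(value)  # raises ValueError for a malformed UUID
    return (
        f"{endpoint.rstrip('/')}/workspaces/{workspace_id}"
        f"/lakehouses/{lakehouse_id}"
        "/jobs/RefreshMaterializedLakeViews/instances"
    )
```

With the default endpoint `https://api.fabric.microsoft.com/v1`, the result matches the path shown above, since the `/v1` prefix comes from the endpoint itself. Note that `uuid.UUID` also accepts some non-hyphenated forms, so a stricter check may be preferable in practice.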

Scheduled Refresh

Create or update a periodic refresh schedule. The adapter uses the Fabric Job Scheduler API to manage schedules. Only one active schedule per lakehouse lineage is supported — the adapter automatically updates an existing schedule if one is found.

Cron schedule (interval in minutes):

{{ config(
    materialized='materialized_lake_view',
    mlv_schedule={
        "enabled": true,
        "configuration": {
            "startDateTime": "2026-04-10T00:00:00",
            "endDateTime": "2026-12-31T23:59:59",
            "localTimeZoneId": "Central Standard Time",
            "type": "Cron",
            "interval": 10
        }
    }
) }}

Daily schedule (specific times):

{{ config(
    materialized='materialized_lake_view',
    mlv_schedule={
        "enabled": true,
        "configuration": {
            "startDateTime": "2026-04-10T00:00:00",
            "endDateTime": "2026-12-31T23:59:59",
            "localTimeZoneId": "Central Standard Time",
            "type": "Daily",
            "times": ["06:00", "18:00"]
        }
    }
) }}

Weekly schedule (specific days and times):

{{ config(
    materialized='materialized_lake_view',
    mlv_schedule={
        "enabled": true,
        "configuration": {
            "startDateTime": "2026-04-10T00:00:00",
            "endDateTime": "2026-12-31T23:59:59",
            "localTimeZoneId": "Central Standard Time",
            "type": "Weekly",
            "weekdays": ["Monday", "Wednesday", "Friday"],
            "times": ["08:00"]
        }
    }
) }}
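
The three schedule shapes differ only in their type-specific keys, which a small pre-flight check can make explicit. This is a sketch inferred from the examples above: `schedule_problems` is hypothetical, and treating the shared keys as mandatory is an assumption rather than documented API behaviour.

```python
from typing import Any, List, Mapping

COMMON_KEYS = ("startDateTime", "endDateTime", "localTimeZoneId", "type")
KEYS_BY_TYPE = {
    "Cron": ("interval",),           # interval in minutes
    "Daily": ("times",),             # e.g. ["06:00", "18:00"]
    "Weekly": ("weekdays", "times"),
}


def schedule_problems(mlv_schedule: Mapping[str, Any]) -> List[str]:
    """List the keys an mlv_schedule dict is missing for its type."""
    config = mlv_schedule.get("configuration", {})
    problems = [f"missing {key}" for key in COMMON_KEYS if key not in config]
    kind = config.get("type")
    if kind not in KEYS_BY_TYPE:
        problems.append(f"unknown type {kind!r}")
        return problems
    problems += [f"missing {key}" for key in KEYS_BY_TYPE[kind] if key not in config]
    return problems
```

An empty list means the dict has the same shape as one of the three examples. In Python the `enabled` flag is written `True`, whereas the Jinja config above uses `true`.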

Full Example with All Options

{{ config(
    materialized='materialized_lake_view',
    database='gold',
    schema='dbo',
    partitioned_by=['product_type'],
    mlv_comment='Product sales summary with quality checks',
    mlv_constraints=[
        {"name": "positive_revenue", "expression": "total_revenue >= 0", "on_mismatch": "DROP"}
    ],
    tblproperties={"quality_tier": "gold"},
    enable_cdf=true,
    mlv_on_demand=true
) }}

SELECT
    product_id,
    product_name,
    product_type,
    SUM(quantity) AS total_quantity_sold,
    SUM(revenue) AS total_revenue
FROM {{ ref('silver_order_items') }}
GROUP BY product_id, product_name, product_type

Generated SQL:

CREATE OR REPLACE MATERIALIZED LAKE VIEW gold.dbo.product_sales_summary
(
    CONSTRAINT positive_revenue CHECK (total_revenue >= 0) ON MISMATCH DROP
)
PARTITIONED BY (product_type)
COMMENT 'Product sales summary with quality checks'
TBLPROPERTIES ("quality_tier"="gold")
AS
SELECT ...
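
The mapping from config options to the generated statement can be reproduced with a short sketch. `render_mlv_ddl` is a hypothetical function written to match the output shown above; it is not the adapter's macro, and its quote escaping is deliberately simplistic.

```python
from typing import Mapping, Optional, Sequence


def render_mlv_ddl(
    relation: str,
    select_sql: str,
    constraints: Sequence[Mapping[str, str]] = (),
    partitioned_by: Sequence[str] = (),
    comment: Optional[str] = None,
    tblproperties: Optional[Mapping[str, str]] = None,
) -> str:
    """Assemble a CREATE OR REPLACE MATERIALIZED LAKE VIEW statement."""
    parts = [f"CREATE OR REPLACE MATERIALIZED LAKE VIEW {relation}"]
    if constraints:
        rendered = ",\n".join(
            f"    CONSTRAINT {c['name']} CHECK ({c['expression']}) "
            f"ON MISMATCH {c.get('on_mismatch', 'FAIL')}"  # FAIL is the default
            for c in constraints
        )
        parts.append(f"(\n{rendered}\n)")
    if partitioned_by:
        parts.append(f"PARTITIONED BY ({', '.join(partitioned_by)})")
    if comment:
        escaped = comment.replace("'", "\\'")  # naive escaping, sketch only
        parts.append(f"COMMENT '{escaped}'")
    if tblproperties:
        props = ", ".join(f'"{k}"="{v}"' for k, v in tblproperties.items())
        parts.append(f"TBLPROPERTIES ({props})")
    parts += ["AS", select_sql]
    return "\n".join(parts)
```

Each optional clause is emitted only when its option is set, so a model with no constraints, partitioning, comment, or properties renders as just the CREATE line, AS, and the query.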

Limitations

  • No ALTER on definition: Changing the SELECT query, constraints, or partitioning requires drop + recreate. The adapter uses CREATE OR REPLACE which handles this automatically.
  • Only RENAME via ALTER: ALTER MATERIALIZED LAKE VIEW ... RENAME TO ... is the only supported ALTER operation.
  • No DML: INSERT, UPDATE, DELETE are not supported on MLVs.
  • No UDFs: User-defined functions are not supported in the SELECT query.
  • No time-travel: VERSION AS OF / TIMESTAMP AS OF syntax is not supported.
  • No temp views as sources: The SELECT query can reference tables and other MLVs, but not temporary views.
  • Schedule is per-lakehouse: One active schedule per lakehouse lineage, not per MLV.

Reporting bugs and contributing code

Join the dbt Community

Code of Conduct

Everyone interacting in the dbt project's codebases, issue trackers, chat rooms, and mailing lists is expected to follow the dbt Code of Conduct.
