A Microsoft Fabric Spark adapter plugin for dbt

Project description



dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.

dbt is the T in ELT. Organize, cleanse, denormalize, filter, rename, and pre-aggregate the raw data in your warehouse so that it's ready for analysis.

dbt-fabricspark

The dbt-fabricspark package contains all of the code enabling dbt to work with Apache Spark in Microsoft Fabric. This adapter connects to Fabric Lakehouses via Livy endpoints and supports both schema-enabled and non-schema Lakehouse configurations.

Current version: 1.9.4

Key Features

  • Livy session management with session reuse across dbt runs
  • Lakehouse with schema support — auto-detects schema-enabled lakehouses and uses three-part naming (lakehouse.schema.table)
  • Lakehouse without schema — standard two-part naming (lakehouse.table)
  • Materializations: table, view, incremental (append, merge, insert_overwrite), seed, snapshot (an incremental-strategy sketch follows this list)
  • Fabric Environment support via environmentId configuration
  • Security: credential masking, UUID validation, HTTPS + domain validation, thread-safe token refresh
  • Resilience: HTTP 5xx retry with exponential backoff, bounded polling with configurable timeouts
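
As a minimal illustration of selecting one of the incremental strategies above, the strategy can be set per model folder in dbt_project.yml (the project name my_project and folder events are hypothetical):

models:
  my_project:
    events:
      +materialized: incremental
      +incremental_strategy: merge   # or append / insert_overwrite
      +unique_key: event_id          # rows matching this key are updated on merge

With merge, unique_key identifies the rows to update; with append, new rows are simply inserted.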

Getting started

Installation

pip install dbt-fabricspark

Configuration

Connect to Apache Spark in Microsoft Fabric through a Livy endpoint by configuring your profiles.yml as shown below.

Lakehouse without Schema

For standard Lakehouses (schemas not enabled), use two-part naming and set the schema field to the lakehouse name:

fabric-spark-test:
  target: fabricspark-dev
  outputs:
    fabricspark-dev:
        # Connection
        type: fabricspark
        method: livy
        endpoint: https://api.fabric.microsoft.com/v1
        workspaceid: <your-workspace-id>
        lakehouseid: <your-lakehouse-id>
        lakehouse: my_lakehouse
        schema: my_lakehouse
        threads: 1

        # Authentication (CLI for local dev, SPN for CI/CD)
        authentication: CLI
        # client_id: <your-client-id>        # Required for SPN
        # tenant_id: <your-tenant-id>        # Required for SPN
        # client_secret: <your-client-secret> # Required for SPN

        # Fabric Environment (optional)
        # environmentId: <your-environment-id>

        # Session management
        reuse_session: true
        session_idle_timeout: "30m"
        # session_id_file: ./livy-session-id.txt  # Default path

        # Timeouts
        connect_retries: 1
        connect_timeout: 10
        http_timeout: 120                   # Seconds per HTTP request
        session_start_timeout: 600          # Max wait for session start (10 min)
        statement_timeout: 3600             # Max wait for statement result (1 hour)
        poll_wait: 10                       # Seconds between session start polls
        poll_statement_wait: 5              # Seconds between statement result polls

        # Retry & Shortcuts
        retry_all: true
        # create_shortcuts: false
        # shortcuts_json_str: '<json-string>'

        # Spark configuration (optional)
        # spark_config:
        #   name: "my-spark-session"
        #   spark.executor.memory: "4g"

In this mode:

  • Tables are referenced as lakehouse.table_name
  • The schema field should match the lakehouse name
  • All objects are created directly under the lakehouse

Lakehouse with Schema (Schema-Enabled)

For schema-enabled Lakehouses, you can organize tables into schemas within the lakehouse. The adapter auto-detects whether a lakehouse has schemas enabled via the Fabric REST API (properties.defaultSchema):

fabric-spark-test:
  target: fabricspark-dev
  outputs:
    fabricspark-dev:
        # Connection
        type: fabricspark
        method: livy
        endpoint: https://api.fabric.microsoft.com/v1
        workspaceid: <your-workspace-id>
        lakehouseid: <your-lakehouse-id>
        lakehouse: my_lakehouse
        schema: my_schema              # Different from lakehouse name
        threads: 1

        # Authentication (CLI for local dev, SPN for CI/CD)
        authentication: CLI
        # client_id: <your-client-id>        # Required for SPN
        # tenant_id: <your-tenant-id>        # Required for SPN
        # client_secret: <your-client-secret> # Required for SPN

        # Fabric Environment (optional)
        # environmentId: <your-environment-id>

        # Session management
        reuse_session: true
        session_idle_timeout: "30m"
        # session_id_file: ./livy-session-id.txt  # Default path

        # Timeouts
        connect_retries: 1
        connect_timeout: 10
        http_timeout: 120                   # Seconds per HTTP request
        session_start_timeout: 600          # Max wait for session start (10 min)
        statement_timeout: 3600             # Max wait for statement result (1 hour)
        poll_wait: 10                       # Seconds between session start polls
        poll_statement_wait: 5              # Seconds between statement result polls

        # Retry & Shortcuts
        retry_all: true
        # create_shortcuts: false
        # shortcuts_json_str: '<json-string>'

        # Spark configuration (optional)
        # spark_config:
        #   name: "my-spark-session"
        #   spark.executor.memory: "4g"

In this mode:

  • Tables are referenced using three-part naming: lakehouse.schema.table_name
  • The schema field specifies the target schema within the lakehouse
  • dbt's generate_schema_name and generate_database_name macros are lakehouse-aware
  • Schemas are created automatically via CREATE DATABASE IF NOT EXISTS lakehouse.schema
  • Incremental models use persisted staging tables (instead of temp views) to work around Spark's REQUIRES_SINGLE_PART_NAMESPACE limitation

Schema Detection

The adapter detects whether a lakehouse has schemas enabled using two complementary mechanisms:

  1. Runtime detection (Fabric REST API): During connection.open(), the adapter calls the Fabric REST API to fetch lakehouse properties. If the response contains defaultSchema, the lakehouse is treated as schema-enabled and three-part naming is used.

  2. Parse-time detection (profile heuristic): During manifest parsing (before any connection is opened), the adapter checks whether schema differs from lakehouse in your profile. When they differ (e.g., lakehouse: bronze, schema: dbo), the adapter infers schema-enabled mode, ensuring correct schema resolution at compile time (see the sketch below).
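
A minimal sketch of the two profile shapes the parse-time heuristic distinguishes (values are illustrative):

# Inferred as non-schema: schema mirrors the lakehouse name
lakehouse: my_lakehouse
schema: my_lakehouse

# Inferred as schema-enabled: schema differs from the lakehouse name
lakehouse: my_lakehouse
schema: dbo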

Important: For schema-enabled lakehouses, always set schema to a value different from lakehouse in your profile (e.g., schema: dbo). If schema equals lakehouse, the adapter cannot distinguish schema-enabled from non-schema mode at parse time, and the lakehouse name will be used as the schema name instead.

| Lakehouse Type | lakehouse | schema | Naming |
| --- | --- | --- | --- |
| Without schema | my_lakehouse | my_lakehouse | my_lakehouse.table_name |
| With schema | my_lakehouse | dbo | my_lakehouse.dbo.table_name |

Cross-Lakehouse Writes

A single profile can write to multiple lakehouses using the database config on individual models. The profile's lakehouse is the default target; set database on a model to redirect writes to a different lakehouse in the same workspace.

# profiles.yml: the profile targets the "bronze" lakehouse
fabric-spark:
  target: dev
  outputs:
    dev:
      type: fabricspark
      lakehouse: bronze
      schema: dbo
      # ... other settings

-- models/silver/silver_orders.sql: writes to the "silver" lakehouse
{{ config(
    materialized='table',
    database='silver',
    schema='dbo'
) }}

select * from {{ ref('bronze_orders') }}

In this example:

  • Seeds and bronze models write to bronze.dbo.* (the default lakehouse)
  • Silver models write to silver.dbo.* via database='silver'
  • Gold models write to gold.dbo.* via database='gold' (per-folder defaults are sketched after this list)
  • All three lakehouses must exist in the same Fabric workspace and have schemas enabled
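
Instead of repeating database in every model's config(), the same routing could be declared once per folder in dbt_project.yml; a sketch assuming a project named my_project (name hypothetical) with silver and gold model folders:

models:
  my_project:
    silver:
      +database: silver   # all models in models/silver/ write to the silver lakehouse
    gold:
      +database: gold     # all models in models/gold/ write to the gold lakehouse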

Configuration Reference

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| type | string | | Must be fabricspark |
| method | string | livy | Connection method |
| endpoint | string | https://api.fabric.microsoft.com/v1 | Fabric API endpoint URL |
| workspaceid | string | | Fabric workspace UUID |
| lakehouseid | string | | Lakehouse UUID |
| lakehouse | string | | Lakehouse name |
| schema | string | | Schema name. Must equal lakehouse for non-schema lakehouses; must differ from lakehouse for schema-enabled ones (e.g., dbo) |
| threads | int | 1 | Number of threads for parallel execution |
| **Authentication** | | | |
| authentication | string | CLI | Auth method: CLI, SPN, or fabric_notebook |
| client_id | string | | Service principal client ID (SPN only) |
| tenant_id | string | | Azure AD tenant ID (SPN only) |
| client_secret | string | | Service principal secret (SPN only) |
| accessToken | string | | Direct access token (optional) |
| **Environment** | | | |
| environmentId | string | | Fabric Environment ID for Spark configuration |
| spark_config | dict | {} | Spark session configuration (must include a name key) |
| **Session Management** | | | |
| reuse_session | bool | false | Keep Livy sessions alive for reuse across runs |
| session_id_file | string | ./livy-session-id.txt | Path to the file storing the session ID for reuse |
| session_idle_timeout | string | 30m | Livy session idle timeout (e.g., 30m, 1h) |
| **Timeouts & Polling** | | | |
| connect_retries | int | 1 | Number of connection retries |
| connect_timeout | int | 10 | Connection timeout in seconds |
| http_timeout | int | 120 | Seconds per HTTP request to the Fabric API |
| session_start_timeout | int | 600 | Max seconds to wait for session start |
| statement_timeout | int | 3600 | Max seconds to wait for a statement result |
| poll_wait | int | 10 | Seconds between session start polls |
| poll_statement_wait | int | 5 | Seconds between statement result polls |
| **Other** | | | |
| retry_all | bool | false | Retry all operations on failure |
| create_shortcuts | bool | false | Enable Fabric shortcut creation |
| shortcuts_json_str | string | | JSON string defining shortcuts |
| livy_mode | string | fabric | fabric for Fabric cloud, local for a local Livy server |
| livy_url | string | http://localhost:8998 | Local Livy URL (local mode only) |
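
As an example of the spark_config option (shown commented out in the profiles above), the name key is required and any further entries are passed as Spark properties; the executor setting here is illustrative:

spark_config:
  name: "my-spark-session"       # required session name
  spark.executor.memory: "4g"    # standard Spark property (illustrative value)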

Authentication Modes

| Mode | Value | Use Case | Required Fields |
| --- | --- | --- | --- |
| Azure CLI | CLI | Local development. Uses az login credentials. | None (run az login first) |
| Service Principal | SPN | CI/CD and automation. Uses Azure AD app registration. | client_id, tenant_id, client_secret |
| Fabric Notebook | fabric_notebook | Running dbt inside a Fabric notebook. Uses notebookutils.credentials. | None (runs in Fabric runtime) |
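
For SPN in CI/CD, a common pattern is to inject credentials through environment variables with dbt's env_var function rather than committing them to profiles.yml (the variable names are arbitrary):

authentication: SPN
client_id: "{{ env_var('FABRIC_CLIENT_ID') }}"
tenant_id: "{{ env_var('FABRIC_TENANT_ID') }}"
client_secret: "{{ env_var('FABRIC_CLIENT_SECRET') }}"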

Reporting bugs and contributing code

Join the dbt Community

Code of Conduct

Everyone interacting in the dbt project's codebases, issue trackers, chat rooms, and mailing lists is expected to follow the dbt Code of Conduct.

Download files

Source Distribution

dbt_fabricspark-1.9.4.tar.gz (215.6 kB)

Built Distribution

dbt_fabricspark-1.9.4-py3-none-any.whl (66.2 kB)

File details: dbt_fabricspark-1.9.4.tar.gz

  • Size: 215.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.6 (Ubuntu 24.04, CI)

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 46a1a179972077a763f588c13298297e9933b43b9b132a02de8b5065a770220e |
| MD5 | 4d32e39230e99d6d76df7aeccae19854 |
| BLAKE2b-256 | 3d9436ad579305db248365356d312cdd792804217bfe33276978ec433b177b0f |

File details: dbt_fabricspark-1.9.4-py3-none-any.whl

  • Size: 66.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.6 (Ubuntu 24.04, CI)

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 4a887df030566bae86b572847e18c6dc4f7c0e4542453808fae8399467647838 |
| MD5 | be2ddc9db101542a46c2a407203a3475 |
| BLAKE2b-256 | 1a795e367b7138240db026b5f0e83415ef4550e9a1143a4f20fb11ecacde13bc |
