Package for Fabric Engineers
FabricEngineer Package
Description
FabricEngineer is a Python package for Microsoft Fabric developers that streamlines data transformation workflows and automates recurring ETL tasks. It provides ready-made services for building data pipelines in a Lakehouse with minimal boilerplate code.
Key Features
🚀 Silver Layer Data Ingestion Services
- Insert-Only Pattern: Efficient data ingestion with support for schema evolution and historization
- SCD Type 2 (Slowly Changing Dimensions): Complete implementation of Type 2 SCD with automatic history tracking
- Delta Load Support: Optimized incremental data processing with broadcast join capabilities
- Schema Evolution: Automatic handling of schema changes with backward compatibility
📊 Materialized Lake Views (MLV)
- Automated MLV Generation: Create and manage materialized views with SQL generation
- Schema-aware Operations: Intelligent handling of schema changes and column evolution
- Lakehouse Integration: Seamless integration with Microsoft Fabric Lakehouse architecture
🔧 Advanced Data Engineering Features
- Configurable Transformations: Flexible transformation pipelines with custom business logic
- Data Quality Controls: Built-in validation and data quality checks
- Performance Optimization: Broadcast joins, partition strategies, and optimized query patterns
- Comprehensive Logging: Integrated logging and performance monitoring with TimeLogger
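The "Configurable Transformations" feature is driven by a mapping from table name to function, where the key "*" marks a function that applies to every table. The sketch below shows one way such a mapping could be resolved. It uses plain lists of dicts instead of Spark DataFrames, the helper `apply_transformations` is hypothetical, and the order (wildcard first, then the table-specific function) is an assumption rather than documented package behavior.

```python
from typing import Callable, Optional

Rows = list[dict]
Transform = Callable[[Rows, Optional[object]], Rows]


def apply_transformations(rows: Rows, table: str,
                          transformations: dict[str, Transform],
                          etl: Optional[object] = None) -> Rows:
    """Run the "*" transform (if present), then the one registered for `table`."""
    for key in ("*", table):
        fn = transformations.get(key)
        if fn is not None:
            rows = fn(rows, etl)
    return rows


def add_constant(rows: Rows, etl) -> Rows:
    # Wildcard transform: adds a constant column to every row
    return [{**r, "data": "values"} for r in rows]


def upper_name(rows: Rows, etl) -> Rows:
    # Table-specific transform: only registered for "projects"
    return [{**r, "name": r["name"].upper()} for r in rows]


transformations = {"*": add_constant, "projects": upper_name}
projects = apply_transformations([{"name": "alpha"}], "projects", transformations)
users = apply_transformations([{"name": "bob"}], "users", transformations)
```

Here `projects` passes through both functions, while `users` only receives the wildcard transform.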
Installation
pip install fabricengineer-py
Quick Start Guide
Prerequisites
- Microsoft Fabric workspace with Lakehouse
- PySpark environment
- Python 3.11+
Usage Examples
Silver Layer Data Ingestion
Insert-Only Pattern
The Insert-Only service is ideal for append-only scenarios where you need to track all changes while maintaining performance.
from pyspark.sql import DataFrame, functions as F
from fabricengineer.logging import TimeLogger
from fabricengineer.transform.lakehouse import LakehouseTable
from fabricengineer.transform import SilverIngestionInsertOnly
def transform_projects(df: DataFrame, etl) -> DataFrame:
    # Parse the "dtime" column into a timestamp
    df = df.withColumn("dtime", F.to_timestamp("dtime"))
    return df
def transform_all(df: DataFrame, etl) -> DataFrame:
    # Add a constant "data" column
    df = df.withColumn("data", F.lit("values"))
    return df
# Initialize performance monitoring
timer = TimeLogger()
# Define table-specific transformations
transformations = {
    "*": transform_all,  # Applied to all tables
    "projects": transform_projects  # Applied only to projects table
}
# Configure source and destination tables
source_table = LakehouseTable(
    lakehouse="BronzeLakehouse",
    schema="schema",
    table="projects"
)
destination_table = LakehouseTable(
    lakehouse="SilverLakehouse",
    schema=source_table.schema,
    table=source_table.table
)
# Initialize and configure the ETL service
# spark and notebookutils are provided by the Fabric notebook runtime;
# the upper-case names are placeholders that you define beforehand.
etl = SilverIngestionInsertOnly()
etl.init(
    spark_=spark,
    notebookutils_=notebookutils,
    source_table=source_table,
    destination_table=destination_table,
    nk_columns=NK_COLUMNS,
    constant_columns=CONSTANT_COLUMNS,
    is_delta_load=IS_DELTA_LOAD,
    delta_load_use_broadcast=DELTA_LOAD_USE_BROADCAST,
    transformations=transformations,
    exclude_comparing_columns=EXCLUDE_COLUMNS_FROM_COMPARING,
    include_comparing_columns=INCLUDE_COLUMNS_AT_COMPARING,
    historize=HISTORIZE,
    partition_by_columns=PARTITION_BY_COLUMNS,
    df_bronze=None,
    create_historized_mlv=True
)
timer.start().log()
etl.run()
timer.end().log()
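To make the insert-only idea concrete, here is a minimal pure-Python sketch of the pattern itself, not of the package's implementation. Rows are only ever appended: an incoming row is written when its natural key is new or when one of the compared columns differs from the latest stored row for that key. The function `insert_only` and the `loaded_at` column are hypothetical names chosen for illustration.

```python
from datetime import datetime, timezone


def insert_only(history, incoming, nk_columns, compare_columns, loaded_at):
    """Return history plus the incoming rows that are new or changed."""
    latest = {}
    for row in history:  # history is assumed to be in load order
        latest[tuple(row[c] for c in nk_columns)] = row
    appended = []
    for row in incoming:
        key = tuple(row[c] for c in nk_columns)
        prev = latest.get(key)
        changed = prev is None or any(prev[c] != row[c] for c in compare_columns)
        if changed:
            new_row = {**row, "loaded_at": loaded_at}
            appended.append(new_row)
            latest[key] = new_row
    return history + appended  # existing rows are never updated or deleted


t1 = datetime(2024, 1, 1, tzinfo=timezone.utc)
t2 = datetime(2024, 1, 2, tzinfo=timezone.utc)
history = insert_only([], [{"id": 1, "budget": 100}], ["id"], ["budget"], t1)
history = insert_only(history, [{"id": 1, "budget": 100}], ["id"], ["budget"], t2)
history = insert_only(history, [{"id": 1, "budget": 150}], ["id"], ["budget"], t2)
```

After these three loads, `history` holds two rows for `id` 1: the original budget and the changed one. The unchanged second load adds nothing.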
SCD Type 2 (Slowly Changing Dimensions)
The SCD2 service implements Type 2 Slowly Changing Dimensions with automatic history tracking and current record management.
from pyspark.sql import DataFrame, functions as F
from fabricengineer.logging import TimeLogger
from fabricengineer.transform.lakehouse import LakehouseTable
from fabricengineer.transform import SilverIngestionSCD2Service
def transform_projects(df: DataFrame, etl) -> DataFrame:
    df = df.withColumn("dtime", F.to_timestamp("dtime"))
    return df
def transform_all(df: DataFrame, etl) -> DataFrame:
    df = df.withColumn("data", F.lit("values"))
    return df
# Initialize performance monitoring
timer = TimeLogger()
# Define table-specific transformations
transformations = {
    "*": transform_all,  # Applied to all tables
    "projects": transform_projects  # Applied only to projects table
}
# Configure source and destination tables
source_table = LakehouseTable(
    lakehouse="BronzeLakehouse",
    schema="schema",
    table="projects"
)
destination_table = LakehouseTable(
    lakehouse="SilverLakehouse",
    schema=source_table.schema,
    table=source_table.table
)
# Initialize and configure the ETL service
# spark and notebookutils are provided by the Fabric notebook runtime;
# the upper-case names are placeholders that you define beforehand.
etl = SilverIngestionSCD2Service()
etl.init(
    spark_=spark,
    notebookutils_=notebookutils,
    source_table=source_table,
    destination_table=destination_table,
    nk_columns=NK_COLUMNS,
    constant_columns=CONSTANT_COLUMNS,
    is_delta_load=IS_DELTA_LOAD,
    delta_load_use_broadcast=DELTA_LOAD_USE_BROADCAST,
    transformations=transformations,
    exclude_comparing_columns=EXCLUDE_COLUMNS_FROM_COMPARING,
    include_comparing_columns=INCLUDE_COLUMNS_AT_COMPARING,
    historize=HISTORIZE,
    partition_by_columns=PARTITION_BY_COLUMNS,
    df_bronze=None
)
timer.start().log()
etl.run()
timer.end().log()
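For readers new to Type 2 dimensions, the following pure-Python sketch illustrates what "history tracking and current record management" means in general. It is not the package's implementation: the function `scd2_merge` and the columns `valid_from`, `valid_to`, and `is_current` are assumptions made for this example, and deleted source records are not handled.

```python
from datetime import datetime, timezone


def scd2_merge(dimension, incoming, nk_columns, compare_columns, now):
    """Close changed current rows and append their new versions."""
    result = [dict(r) for r in dimension]  # work on copies of the stored rows
    current = {
        tuple(r[c] for c in nk_columns): r
        for r in result if r["is_current"]
    }
    for row in incoming:
        key = tuple(row[c] for c in nk_columns)
        prev = current.get(key)
        if prev is not None and all(prev[c] == row[c] for c in compare_columns):
            continue  # unchanged: keep the current version as it is
        if prev is not None:
            prev["valid_to"] = now  # close the outdated version
            prev["is_current"] = False
        new_row = {**row, "valid_from": now, "valid_to": None, "is_current": True}
        result.append(new_row)
        current[key] = new_row
    return result


t1 = datetime(2024, 1, 1, tzinfo=timezone.utc)
t2 = datetime(2024, 1, 2, tzinfo=timezone.utc)
dim = scd2_merge([], [{"id": 1, "budget": 100}], ["id"], ["budget"], t1)
dim = scd2_merge(dim, [{"id": 1, "budget": 150}], ["id"], ["budget"], t2)
```

After the second merge, the first version of `id` 1 is closed at `t2` and the new version is the only current row. Unlike the insert-only pattern, the existing row is updated in place to mark the end of its validity.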
Materialized Lake Views Management
Prerequisites
Configure a Utils Lakehouse as your default Lakehouse. The generated view SQL code will be saved as .sql.txt files in the lakehouse under /Files/mlv/{lakehouse}/{schema}/{table}.sql.txt.
from fabricengineer.mlv import MaterializedLakeView
# Initialize the Materialized Lake View manager
mlv = MaterializedLakeView(
    lakehouse="SilverBusinessLakehouse",
    schema="schema",
    table="projects"
)
print(mlv.to_dict())
# Define your custom SQL query
sql = """
SELECT
p.id
,p.projectname
,p.budget
,u.name AS projectlead
FROM dbo.projects p
LEFT JOIN users u
ON p.projectlead_id = u.id
"""
# Create or replace the materialized view
result = mlv.create_or_replace(sql)
display(result)
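As a small aid for locating the generated SQL afterwards, the path template from the prerequisites can be filled in as follows. The helper `mlv_sql_path` is hypothetical, and the sketch assumes that the placeholders are taken from the view's own lakehouse, schema, and table names.

```python
def mlv_sql_path(lakehouse: str, schema: str, table: str) -> str:
    """Fill in the /Files/mlv/{lakehouse}/{schema}/{table}.sql.txt template."""
    return f"/Files/mlv/{lakehouse}/{schema}/{table}.sql.txt"


path = mlv_sql_path("SilverBusinessLakehouse", "schema", "projects")
print(path)  # /Files/mlv/SilverBusinessLakehouse/schema/projects.sql.txt
```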
Remote Module Import for Fabric Notebooks
Import specific package modules directly into your Fabric notebooks from GitHub releases:
# Cell 1:
import requests
VERSION = "0.1.0"
url = f"https://raw.githubusercontent.com/enricogoerlitz/fabricengineer-py/refs/tags/{VERSION}/src/fabricengineer/import_module/import_module.py"
resp = requests.get(url)
resp.raise_for_status()  # Fail early on HTTP errors
code = resp.text
# Sanity-check the downloaded source before executing it
assert code.startswith("import requests")
exec(code, globals())  # This provides the 'import_module' function
# Cell 2
mlv_module = import_module("transform.mlv", VERSION)
scd2_module = import_module("transform.silver.scd2", VERSION)
insertonly_module = import_module("transform.silver.insertonly", VERSION)
# Cell 3 - Use mlv module
exec(mlv_module, globals()) # Provides MaterializedLakeView class and mlv instance
mlv.init(
    lakehouse="SilverBusinessLakehouse",
    schema="schema",
    table="projects"
)
print(mlv.to_dict())
# Cell 4 - Use scd2 module
exec(scd2_module, globals()) # Provides an instantiated etl object
etl.init(...)
print(str(etl))
# Cell 5 - Use insertonly module
exec(insertonly_module, globals()) # Provides an instantiated etl object
etl.init(...)
print(str(etl))
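Because the cells above pass downloaded text to exec, it may be worth pinning the expected content more strictly than a prefix check. The sketch below shows one possible approach using a SHA-256 digest recorded in advance. The helper `verify_source` is hypothetical and not part of the package, and the sample string stands in for the downloaded module source.

```python
import hashlib


def verify_source(code: str, expected_sha256: str) -> str:
    """Return the code unchanged if its SHA-256 digest matches, else raise."""
    digest = hashlib.sha256(code.encode("utf-8")).hexdigest()
    if digest != expected_sha256:
        raise ValueError(f"checksum mismatch: got {digest}")
    return code


# Stand-in for resp.text; in practice the digest would be recorded once
# from a version of the file that you have reviewed.
sample = "import requests\n"
pinned = hashlib.sha256(sample.encode("utf-8")).hexdigest()
checked = verify_source(sample, pinned)  # safe to pass on to exec(...)
```

If the file behind the tag ever changed, `verify_source` would raise before anything is executed.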
Advanced Features
Performance Optimization
- Broadcast Joins: Automatically optimize small table joins
- Partition Strategies: Intelligent partitioning for better query performance
- Schema Evolution: Handle schema changes without breaking existing pipelines
- Delta Load Processing: Efficient incremental data processing
Data Quality & Validation
- Automatic Validation: Built-in checks for data consistency and quality
- Type Safety: Comprehensive type annotations for better development experience
- Error Handling: Robust error handling and recovery mechanisms
Monitoring & Logging
from fabricengineer.logging import TimeLogger, logger
# Performance monitoring
timer = TimeLogger()
timer.start().log()
# Your ETL operations here
etl.run()
timer.end().log()
# Custom fabricengineer logging
logger.info("Custom log message")
logger.error("Error occurred during processing")
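If a run can fail, it helps to make sure the duration is still recorded and the error is logged. The following is a standard-library sketch of that idea and does not use TimeLogger, whose internals are not described here. The context manager `timed`, the logger name, and the `durations` dict are all hypothetical.

```python
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("etl_timing_sketch")


@contextmanager
def timed(label: str, sink: dict):
    """Record the elapsed seconds for `label` in `sink`, even on failure."""
    start = time.perf_counter()
    try:
        yield
    except Exception:
        log.error("%s failed", label)
        raise  # re-raise so the caller still sees the error
    finally:
        sink[label] = time.perf_counter() - start


durations = {}
with timed("demo_run", durations):
    sum(range(1000))  # stand-in for etl.run()
```

The `finally` clause is what guarantees that a duration is stored whether the body succeeds or raises.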
Download files
File details
Details for the file fabricengineer_py-0.1.4.tar.gz.
File metadata
- Download URL: fabricengineer_py-0.1.4.tar.gz
- Upload date:
- Size: 86.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fac56fa6d4c5406ff686aec980963650868be7a0b68e2449f786555bf4a01389 |
| MD5 | 16d90efceeedf7c160d85edac54792e8 |
| BLAKE2b-256 | 942eb69913febedc4699d3d854784e21145526af0875d7ee89c39dbdc1316839 |
Provenance
The following attestation bundles were made for fabricengineer_py-0.1.4.tar.gz:
Publisher: release.yml on enricogoerlitz/fabricengineer-py
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: fabricengineer_py-0.1.4.tar.gz
- Subject digest: fac56fa6d4c5406ff686aec980963650868be7a0b68e2449f786555bf4a01389
- Sigstore transparency entry: 362309373
- Sigstore integration time:
- Permalink: enricogoerlitz/fabricengineer-py@e7df2b0d686d84029b133ca81e66d68ad6e1e639
- Branch / Tag: refs/tags/0.1.4
- Owner: https://github.com/enricogoerlitz
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@e7df2b0d686d84029b133ca81e66d68ad6e1e639
- Trigger Event: push
File details
Details for the file fabricengineer_py-0.1.4-py3-none-any.whl.
File metadata
- Download URL: fabricengineer_py-0.1.4-py3-none-any.whl
- Upload date:
- Size: 24.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9757b26e3ff4f14593e06ab1da115e465fc4546317e4268af966b93208e28d09 |
| MD5 | 443357a85161a72379605b3cd92bc81c |
| BLAKE2b-256 | ad88a7a257a049ddbb74503843530ad91bb89861a62669eab6abdd745b7b7a3d |
Provenance
The following attestation bundles were made for fabricengineer_py-0.1.4-py3-none-any.whl:
Publisher: release.yml on enricogoerlitz/fabricengineer-py
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: fabricengineer_py-0.1.4-py3-none-any.whl
- Subject digest: 9757b26e3ff4f14593e06ab1da115e465fc4546317e4268af966b93208e28d09
- Sigstore transparency entry: 362309377
- Sigstore integration time:
- Permalink: enricogoerlitz/fabricengineer-py@e7df2b0d686d84029b133ca81e66d68ad6e1e639
- Branch / Tag: refs/tags/0.1.4
- Owner: https://github.com/enricogoerlitz
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@e7df2b0d686d84029b133ca81e66d68ad6e1e639
- Trigger Event: push