fabrictools
User-friendly PySpark helpers for Microsoft Fabric — read, write, and merge Lakehouses and Warehouses with a single function call.
Features
- Auto-resolved paths — pass a Lakehouse or Warehouse name, no ABFS URL configuration required
- Auto-detected SparkSession — uses `SparkSession.builder.getOrCreate()`, works seamlessly inside Fabric notebooks
- Auto-detected format on read — tries Delta → Parquet → CSV automatically
- Delta merge (upsert) — one-liner upsert into any Lakehouse Delta table
- Generic data cleaning — standard cleaning with one helper function
- Data quality scan — detect nulls, blank strings, duplicates, and naming collisions
- Built-in logging — every operation logs its resolved path, detected format, and row/column count (see the logging snippet after this list)
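
To see those log messages in a notebook or a local run, standard `logging` configuration is enough. A minimal sketch; the logger name `fabrictools` is an assumption, so check the package's actual logger name if nothing appears:

```python
import logging

# Show INFO-level messages, including fabrictools' per-operation logs
# (resolved path, detected format, row/column count).
logging.basicConfig(level=logging.INFO)

# Optional: target only fabrictools (logger name assumed, not confirmed).
logging.getLogger("fabrictools").setLevel(logging.INFO)
```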
Requirements
- Microsoft Fabric Spark runtime (provides `notebookutils`, `pyspark`, and `delta-spark`)
- Python >= 3.9

Local development: install the `spark` extras to get PySpark and delta-spark. `notebookutils` is only available inside Fabric — functions that resolve paths will raise a clear `ValueError` outside Fabric.
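
Outside Fabric, the DataFrame-only helpers can still be exercised against a local session. A minimal sketch, assuming `clean_data` and `scan_data_errors` need nothing beyond a PySpark DataFrame (their signatures in the API reference take no Lakehouse arguments):

```python
from pyspark.sql import SparkSession

import fabrictools as ft

# Local SparkSession for development; inside Fabric the session is
# auto-detected and this block is unnecessary.
spark = SparkSession.builder.appName("fabrictools-local").getOrCreate()

df = spark.createDataFrame(
    [(1, " Alice ", None), (1, " Alice ", None), (2, "", "2024-01-01")],
    ["id", "name", "updated_at"],
)

clean_df = ft.clean_data(df)                             # normalize, trim, de-duplicate
report = ft.scan_data_errors(df, include_samples=True)   # nulls, blanks, duplicates
print(report["duplicate_row_count"], report["null_counts"])
```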
Installation
# Inside a Fabric notebook or pipeline
pip install fabrictools
# Local development (includes PySpark + delta-spark)
pip install "fabrictools[spark]"
Quick start
import fabrictools as ft
Read a Lakehouse dataset
# Auto-detects Delta → Parquet → CSV
df = ft.read_lakehouse("BronzeLakehouse", "sales/2024")
Write to a Lakehouse
ft.write_lakehouse(
df,
lakehouse_name="SilverLakehouse",
relative_path="sales_clean",
mode="overwrite",
partition_by=["year", "month"], # optional
)
Merge (upsert) into a Delta table
ft.merge_lakehouse(
source_df=new_df,
lakehouse_name="SilverLakehouse",
relative_path="sales_clean",
merge_condition="src.id = tgt.id",
# update_set and insert_set are optional:
# omit them to update/insert all columns automatically
)
Clean data (generic)
clean_df = ft.clean_data(df)
By default it:
- normalizes columns to unique `snake_case`
- trims string values
- converts blank strings to `null`
- removes exact duplicates
- drops rows where all fields are `null`

The last two steps correspond to the `drop_duplicates` and `drop_all_null_rows` keyword arguments; see the sketch after this list.
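
A minimal sketch of opting out of those last two steps. The keyword names `drop_duplicates` and `drop_all_null_rows` come from the API reference below; treating them as boolean flags is an assumption:

```python
import fabrictools as ft

# Keep duplicates and all-null rows, but still normalize column names,
# trim strings, and convert blanks to null.
clean_df = ft.clean_data(
    df,
    drop_duplicates=False,      # assumed boolean toggle
    drop_all_null_rows=False,   # assumed boolean toggle
)
```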
Scan data quality issues
report = ft.scan_data_errors(df, include_samples=True)
print(report["duplicate_row_count"])
print(report["null_counts"])
Read -> clean -> write in one call
clean_df = ft.clean_and_write_data(
source_lakehouse_name="RawLakehouse",
source_relative_path="sales/raw",
target_lakehouse_name="CuratedLakehouse",
target_relative_path="sales/clean",
mode="overwrite",
partition_by=["year"], # optional
)
With explicit column mappings:
ft.merge_lakehouse(
source_df=new_df,
lakehouse_name="SilverLakehouse",
relative_path="sales_clean",
merge_condition="src.id = tgt.id",
update_set={"amount": "src.amount", "updated_at": "src.updated_at"},
insert_set={"id": "src.id", "amount": "src.amount", "updated_at": "src.updated_at"},
)
Read from a Warehouse
df = ft.read_warehouse("MyWarehouse", "SELECT * FROM dbo.sales WHERE year = 2024")
Write to a Warehouse
ft.write_warehouse(
df,
warehouse_name="MyWarehouse",
table="dbo.sales_clean",
mode="overwrite", # or "append"
batch_size=10_000, # optional, default 10 000
)
API reference
Lakehouse
| Function | Description |
|---|---|
| `read_lakehouse(lakehouse_name, relative_path, spark=None)` | Read a dataset — auto-detects Delta / Parquet / CSV |
| `write_lakehouse(df, lakehouse_name, relative_path, mode, partition_by, format, spark=None)` | Write a DataFrame (default: Delta, overwrite); see the sketch after this table |
| `merge_lakehouse(source_df, lakehouse_name, relative_path, merge_condition, update_set, insert_set, spark=None)` | Upsert via Delta merge |
| `clean_data(df, drop_duplicates, drop_all_null_rows)` | Apply standard generic cleaning to a DataFrame |
| `scan_data_errors(df, include_samples)` | Report common data-quality issues |
| `clean_and_write_data(source_lakehouse_name, source_relative_path, target_lakehouse_name, target_relative_path, mode, partition_by, spark=None)` | Read, clean, and write in one helper |
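
The `format` argument of `write_lakehouse` is not shown in the quick start. A minimal sketch, assuming it accepts the same names as the read auto-detection (`delta`, `parquet`, `csv`):

```python
import fabrictools as ft

# Write the dataset as Parquet instead of the default Delta table.
# Assumption: format takes "parquet" / "csv" in addition to "delta".
ft.write_lakehouse(
    df,
    lakehouse_name="SilverLakehouse",
    relative_path="sales_parquet",
    mode="overwrite",
    format="parquet",
)
```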
Warehouse
| Function | Description |
|---|---|
| `read_warehouse(warehouse_name, query, spark=None)` | Run a SQL query, return a DataFrame |
| `write_warehouse(df, warehouse_name, table, mode, batch_size, spark=None)` | Write to a Warehouse table via JDBC |
How path resolution works
lakehouse_name="BronzeLakehouse"
│
▼
notebookutils.lakehouse.get("BronzeLakehouse")
│
▼
lh.properties.abfsPath
= "abfss://bronze@<account>.dfs.core.windows.net"
│
▼
full_path = abfsPath + "/" + relative_path
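
Expressed as code, the resolution amounts to roughly the following. This is a sketch that mirrors the diagram above, not the library's actual implementation, and it assumes you are inside Fabric where `notebookutils.lakehouse.get` is available:

```python
import notebookutils  # provided by the Fabric Spark runtime only

def resolve_lakehouse_path(lakehouse_name: str, relative_path: str) -> str:
    """Sketch of the resolution shown in the diagram above."""
    lh = notebookutils.lakehouse.get(lakehouse_name)  # look up the Lakehouse by name
    abfs_root = lh.properties.abfsPath                # ABFS root, as in the diagram
    return f"{abfs_root}/{relative_path}"             # full dataset path

# resolve_lakehouse_path("BronzeLakehouse", "sales/2024")
# -> "abfss://bronze@<account>.dfs.core.windows.net/sales/2024"
```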
Running the tests
pip install "fabrictools[dev]"
pytest
Publishing to PyPI
See docs/PYPI_PUBLISH.md for a step-by-step guide.
License
MIT