Local and Fabric lakehouse abstraction for modular, testable data code

laken

The missing local development workflow for Microsoft Fabric.

laken lets you develop Python code for Fabric locally, using the tools you already trust.

Write code on your machine, run it against real Fabric lakehouse data.

When you're ready, laken deploy packages your project, publishes it to Fabric, and makes it available to your Fabric notebooks.

Your code stays modular. Your notebooks stay thin. And your local workflow survives contact with the platform.

Why “laken”?

Laken, pronounced LAH-kuhn, is Dutch for “cloth.” If you're feeling generous, it's a pun on Fabric and data lakes.


Installation

Install uv if needed, then add laken to your project:

uv add laken

Or install it with pip:

pip install laken

Deploy uses uv to build your wheel before publishing to a Fabric environment.


Quickstart

Write lakehouse code on your laptop against real Fabric data, package it, and run the same code in a notebook.

1. Credentials — create a .env in your project root (see Environment variables for the full list):

AZURE_TENANT_ID=...
AZURE_CLIENT_ID=...
AZURE_CLIENT_SECRET=...
FABRIC_WORKSPACE_NAME=MyWorkspace
FABRIC_LAKEHOUSE_NAME=MyLakehouse
FABRIC_WORKSPACE_ID=...
FABRIC_LAKEHOUSE_ID=...

2. Develop locally — write your logic against real tables. On your laptop the first read pulls from Fabric into .laken/; in a notebook the same code runs against your attached lakehouse:

from laken import Lakehouse

lh = Lakehouse()
df = lh.read_table("customers", frame_type="pandas")
# ...
lh.write_table(df, "customer_analytics")

3. Package and deploy — move that code into a normal Python package and publish it to a Fabric Environment (FABRIC_ENVIRONMENT_ID in .env):

customer_analytics/
├── pyproject.toml
└── src/customer_analytics/
    └── pipeline.py

# src/customer_analytics/pipeline.py
from laken import Lakehouse

def create_analytics(lh: Lakehouse) -> None:
    df = lh.read_table("customers", frame_type="pandas")
    # ...
    lh.write_table(df, "customer_analytics")

laken deploy

4. Run in a notebook — after the environment publish finishes:

from laken import Lakehouse
from customer_analytics.pipeline import create_analytics

lh = Lakehouse()
create_analytics(lh)

Develop against your Fabric lakehouse

Put your credentials and your workspace and lakehouse identifiers in a .env file at your project root (or export them in your shell). Lakehouse() and the laken CLI load that file from the current working directory; variables already set in the environment are not overwritten.

AZURE_TENANT_ID=...
AZURE_CLIENT_ID=...
AZURE_CLIENT_SECRET=...
FABRIC_WORKSPACE_NAME=MyWorkspace
FABRIC_LAKEHOUSE_NAME=MyLakehouse
FABRIC_WORKSPACE_ID=...
FABRIC_LAKEHOUSE_ID=...
FABRIC_ENVIRONMENT_ID=...

from laken import Lakehouse

lh = Lakehouse()
products = lh.read_table("marketing.products", frame_type="pandas")

lh.write_table(products, "staging.products_snapshot")

Lakehouse detects when it is running locally and when it is running inside Fabric.

Locally, the first read_table for a Fabric table pulls from OneLake and caches it under .laken/ as Delta; later reads use the cache. In a Fabric notebook, the same code reads from your attached lakehouse.

Local writes stay under .laken/ and do not sync to Fabric; in Fabric, writes persist to tables on the attached lakehouse.
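
How laken tells the two runtimes apart is not described here. As a rough sketch only, one plausible check is whether the notebookutils module (which Fabric notebooks provide) can be imported. The function names below are hypothetical and are not part of laken's API.

```python
import importlib.util


def running_in_fabric() -> bool:
    """Guess whether this process is a Fabric notebook session.

    Assumption: notebookutils is importable only inside Fabric. laken's
    real detection logic may use a different signal.
    """
    return importlib.util.find_spec("notebookutils") is not None


def describe_backend() -> str:
    # Mirrors the behavior described above: the attached lakehouse in
    # Fabric, the .laken/ cache everywhere else.
    return "attached lakehouse" if running_in_fabric() else ".laken/ cache"


print(describe_backend())
```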


Deploy to Fabric

Structure your local code as a Python project using the standard src layout:

myapp/
├── pyproject.toml          # [project] name = "myapp"
├── src/
│   └── myapp/
│       ├── __init__.py
│       └── pipeline.py
└── .env

Add laken to your project dependencies.

See the Python packaging guide if you are setting this up for the first time.

# src/myapp/pipeline.py
import pandas as pd

from laken import Lakehouse


def run_pipeline(lh: Lakehouse) -> None:
    products = lh.read_table("marketing.products", frame_type="pandas")
    summary = products.groupby("category", as_index=False)["amount"].sum()
    lh.write_table(summary, "staging.product_summary")
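
Because the pipeline receives the lakehouse as an argument, it can be unit-tested without Fabric or a local cache by passing a stand-in object. The sketch below is an illustration under assumptions: InMemoryLakehouse and summarize_by_category are hypothetical names, and rows are plain dicts rather than a pandas frame so that the example runs with the standard library alone.

```python
from typing import Any, Dict, Optional


class InMemoryLakehouse:
    """Hypothetical test double exposing only read_table and write_table."""

    def __init__(self, tables: Optional[Dict[str, Any]] = None) -> None:
        self.tables: Dict[str, Any] = dict(tables or {})

    def read_table(self, name: str, frame_type: str = "pandas") -> Any:
        # frame_type is accepted for signature compatibility and ignored.
        return self.tables[name]

    def write_table(self, frame: Any, name: str, mode: str = "overwrite") -> None:
        if mode != "overwrite":
            raise NotImplementedError("this stub only supports overwrite")
        self.tables[name] = frame


def summarize_by_category(lh: Any) -> None:
    """Toy counterpart of run_pipeline: total the amount per category."""
    totals: Dict[str, int] = {}
    for row in lh.read_table("marketing.products"):
        totals[row["category"]] = totals.get(row["category"], 0) + row["amount"]
    lh.write_table(totals, "staging.product_summary")


def test_summarize_by_category() -> None:
    lh = InMemoryLakehouse({"marketing.products": [
        {"category": "toys", "amount": 40},
        {"category": "tools", "amount": 120},
        {"category": "toys", "amount": 10},
    ]})
    summarize_by_category(lh)
    assert lh.tables["staging.product_summary"] == {"toys": 50, "tools": 120}


test_summarize_by_category()
```

The same pattern should carry over to a pandas-based pipeline: seed the stub with a DataFrame and compare the frame it ends up holding.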

When you are ready, laken deploy builds your package and loads it into your specified Fabric Environment.

Deploy uses the same .env (or shell variables):

AZURE_TENANT_ID=...
AZURE_CLIENT_ID=...
AZURE_CLIENT_SECRET=...
FABRIC_WORKSPACE_ID=...
FABRIC_ENVIRONMENT_ID=...

From the repo root:

laken deploy

In a Fabric notebook:

from laken import Lakehouse
from myapp.pipeline import run_pipeline

lh = Lakehouse()
run_pipeline(lh)

Reference

Lakehouse

from laken import Lakehouse

lh = Lakehouse()

For tests or scripts that must pin a backend:

from laken import FabricLakehouse, LocalLakehouse

Tables — use schema.table to target a schema; a bare name is passed through to Spark and Fabric resolves it (typically the default dbo schema on a schema-enabled lakehouse). mode is "overwrite" or "append".

lh.write_table(df, "products")
lh.write_table(df, "marketing.products", mode="append")

df = lh.read_table("products")                    # pandas locally, Spark in Fabric
df = lh.read_table("products", frame_type="spark")       # Spark (Fabric runtime)
df = lh.read_table("marketing.products", frame_type="polars")

lh.list_tables()
lh.table_exists("marketing.products")
lh.drop_table("marketing.products")
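
To make the naming rule concrete, here is a minimal sketch of splitting a table reference into schema and table. split_table_name is a hypothetical helper, not a laken function, and rejecting names with more than one dot is an assumption of this sketch.

```python
from typing import Optional, Tuple


def split_table_name(name: str) -> Tuple[Optional[str], str]:
    """Split "schema.table" into (schema, table); bare names get schema None."""
    parts = name.split(".")
    if len(parts) == 1 and parts[0]:
        # Bare name: schema resolution is left to the backend.
        return None, parts[0]
    if len(parts) == 2 and all(parts):
        return parts[0], parts[1]
    raise ValueError(f"expected 'table' or 'schema.table', got {name!r}")


print(split_table_name("marketing.products"))
print(split_table_name("products"))
```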

Files — local paths under .laken/workspace/Files; in Fabric, under the lakehouse Files/ area.

lh.write_file(df, "exports/summary.parquet")
lh.write_file(raw_bytes, "exports/notes.txt")
raw = lh.read_file("exports/summary.parquet")
lh.file_exists("exports/summary.parquet")
lh.delete_file("exports/summary.parquet")

Warehouse tables — Spark synapsesql in Fabric notebooks only.

lh.load_table_from_warehouse("SalesOrderHeader", "SalesWarehouse", frame_type="pandas")

read_file returns raw bytes (from local cache or OneLake). Parse into pandas, polars, or Spark outside Lakehouse if you need a dataframe.
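
Since read_file hands back bytes, parsing is up to the caller. The sketch below assumes the file is CSV and uses only the standard library; the literal bytes stand in for a read_file result, and rows_from_csv_bytes is a hypothetical helper. For Parquet, pandas.read_parquet(io.BytesIO(raw)) is the analogous call.

```python
import csv
import io
from typing import Dict, List


def rows_from_csv_bytes(raw: bytes, encoding: str = "utf-8") -> List[Dict[str, str]]:
    """Decode CSV bytes into a list of dict rows keyed by the header line."""
    text = io.StringIO(raw.decode(encoding), newline="")
    return list(csv.DictReader(text))


# Stand-in for: raw = lh.read_file("exports/summary.csv")
raw = b"category,amount\ntoys,40\ntools,120\n"
rows = rows_from_csv_bytes(raw)
print(rows)
```

Every value comes back as a string, so numeric columns need an explicit conversion.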

Other lakehouses — defaults come from notebook context in Fabric; override locally or in notebooks:

lh = Lakehouse(lakehouse="Sales_LH")
lh.read_table("marketing.products", frame_type="pandas")

CLI

laken deploy [--workspace-id <id>] [--environment-id <id>]
laken status
laken refresh <table>
laken reset <table>

laken deploy builds the wheel from your repo's pyproject.toml, uploads it to a Fabric Environment, and submits a publish. Fabric rebuilds the Environment asynchronously; notebooks can import your package after that publish completes.

laken status, laken refresh, and laken reset manage the local .laken/ cache on your laptop. They do not run inside Fabric notebooks.

laken status lists what is in .laken/ (full copy, row sample, or local-only) and whether your cache may be behind Fabric.

laken refresh <table> downloads the table from Fabric again. Local-only tables are unchanged.

laken reset <table> throws away local edits and downloads from Fabric again. The table must have come from Fabric originally.

Logging

laken logs to stderr when you use Lakehouse or the CLI. Default level is INFO. To see more detail:

import logging

logging.getLogger("laken").setLevel(logging.DEBUG)

Environment variables

Root .env is loaded when you construct Lakehouse or LocalLakehouse, or run the laken CLI. Shell and CI variables take precedence over .env values. Call load_environment() yourself only if you need env vars before creating a lakehouse instance.
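
The precedence rule can be pictured with a small sketch: values from the file are applied only for variables that are not already set. This is an illustration, not laken's loader. load_env_file is a hypothetical name, LAKEN_SKETCH_ONLY is a made-up variable, and quoting and escaping rules of real .env parsers are ignored.

```python
import os
import tempfile
from pathlib import Path
from typing import Dict


def load_env_file(path: Path) -> Dict[str, str]:
    """Apply KEY=VALUE lines from path without overriding existing variables.

    Returns only the variables that were actually set from the file.
    """
    applied: Dict[str, str] = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key not in os.environ:  # shell and CI values win
            os.environ[key] = value
            applied[key] = value
    return applied


with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text(
        "FABRIC_WORKSPACE_NAME=FromFile\nLAKEN_SKETCH_ONLY=1\n", encoding="utf-8"
    )
    os.environ["FABRIC_WORKSPACE_NAME"] = "FromShell"  # simulate a shell export
    os.environ.pop("LAKEN_SKETCH_ONLY", None)
    applied = load_env_file(env_path)

print(applied)
```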

Variable                 Purpose
AZURE_TENANT_ID          Auth (fetch + deploy)
AZURE_CLIENT_ID          Auth (fetch + deploy)
AZURE_CLIENT_SECRET      Auth (fetch + deploy)
FABRIC_WORKSPACE_NAME    Local Fabric fetch (all four name/ID vars required)
FABRIC_LAKEHOUSE_NAME    Local Fabric fetch
FABRIC_WORKSPACE_ID      OneLake paths; required for deploy
FABRIC_LAKEHOUSE_ID      OneLake paths; required for local Fabric fetch
FABRIC_ENVIRONMENT_ID    Deploy target

AZURE_TENANT_ID, AZURE_CLIENT_ID, and AZURE_CLIENT_SECRET are credentials from an Azure service principal.

FABRIC_WORKSPACE_NAME, FABRIC_LAKEHOUSE_NAME, FABRIC_WORKSPACE_ID, FABRIC_LAKEHOUSE_ID, and FABRIC_ENVIRONMENT_ID can be read from a Fabric notebook with notebookutils:

import notebookutils

context = notebookutils.runtime.context

FABRIC_WORKSPACE_NAME = context['currentWorkspaceName']
FABRIC_LAKEHOUSE_NAME = context.get('defaultLakehouseName')
FABRIC_WORKSPACE_ID = context['currentWorkspaceId']
FABRIC_LAKEHOUSE_ID = context.get('defaultLakehouseId')
FABRIC_ENVIRONMENT_ID = context.get('environmentId')

print(f"FABRIC_WORKSPACE_NAME={FABRIC_WORKSPACE_NAME}")
print(f"FABRIC_LAKEHOUSE_NAME={FABRIC_LAKEHOUSE_NAME}")
print(f"FABRIC_WORKSPACE_ID={FABRIC_WORKSPACE_ID}")
print(f"FABRIC_LAKEHOUSE_ID={FABRIC_LAKEHOUSE_ID}")
print(f"FABRIC_ENVIRONMENT_ID={FABRIC_ENVIRONMENT_ID}")

Deploy expects pyproject.toml at the repo root, a buildable application wheel, and a Fabric environment with a compatible Python/Spark runtime.

Local vs Fabric

Lakehouse

  • Where: auto-detects notebook context
  • Storage: Fabric if available, else .laken/ Delta
  • Reads: locally, Fabric → cache; in Fabric, attached lakehouse
  • Writes: locally, .laken/ only; in Fabric, attached lakehouse

LocalLakehouse

  • Where: laptop / CI
  • Storage: .laken/workspace/
  • Reads: cached Delta and local tables
  • Writes: local only; not pushed to Fabric

FabricLakehouse

  • Where: Fabric notebook
  • Storage: attached lakehouse
  • Reads: Spark/Delta on attached lakehouse
  • Writes: Delta tables on attached lakehouse

Local Fabric cache

The first time you read_table a Fabric-backed name locally, laken downloads it into .laken/ as Delta. Later reads use that copy until you refresh it.

Defaults

  • Tables 100 MB or smaller on Fabric (file sizes from the Delta log) are cached in full.
  • Larger tables cache the first 10,000 rows only, enough for local development without pulling the whole table.
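
The two defaults amount to a simple size check. The sketch below restates that rule in code; it is not laken's implementation. CachePlan and plan_cache are hypothetical names, and treating 1 MB as 1,000,000 bytes is an assumption, since the unit is not stated above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CachePlan:
    mode: str                 # "full" or "sample"
    row_limit: Optional[int]  # None means every row is cached


def plan_cache(
    table_bytes: int,
    max_mirror_mb: float = 100,
    max_sample_rows: int = 10_000,
) -> CachePlan:
    """Decide between a full copy and a row sample from the table size."""
    # Assumption: 1 MB = 1,000,000 bytes.
    if table_bytes <= max_mirror_mb * 1_000_000:
        return CachePlan("full", None)
    return CachePlan("sample", max_sample_rows)


print(plan_cache(42_000_000))
print(plan_cache(3_500_000_000))
```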

Change the limits

lh = Lakehouse(max_mirror_mb=200, max_sample_rows=5_000)
lh.read_table("dbo.big_fact", max_mirror_mb=500)

max_mirror_mb and max_sample_rows on Lakehouse(...) apply to laken refresh and laken reset. If you pass them to read_table instead, they apply only the first time that table is downloaded; after that, reads use the cached copy.

When Fabric changes

If someone updates the table in Fabric after you cached it, laken prints a warning and keeps using your local copy. Run laken refresh <table> to pull the latest version.
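
As an illustration of that behavior, the sketch below warns when the remote table is ahead of the cache but still signals that the cached copy is used. Comparing Delta table version numbers is an assumption of this sketch; how laken actually detects changes is not stated here. The function and logger names are hypothetical.

```python
import logging

logger = logging.getLogger("laken.sketch")  # hypothetical logger name


def cache_is_behind(table: str, cached_version: int, remote_version: int) -> bool:
    """Return True when Fabric is ahead of the cache, logging a warning.

    The caller keeps reading the cached copy either way; nothing is
    downloaded here.
    """
    behind = remote_version > cached_version
    if behind:
        logger.warning(
            "%s: cache is at version %d, Fabric is at version %d; "
            "run `laken refresh %s` to update",
            table, cached_version, remote_version, table,
        )
    return behind


print(cache_is_behind("dbo.big_fact", cached_version=3, remote_version=5))
```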


Development

Contributions are welcome. To work on this package:

uv sync
uv run pytest
uv run ruff check
