Laktory
A DataOps framework for building a Databricks lakehouse.
Okube Company
Okube is committed to developing open-source data and ML engineering tools. This is an open space, and contributions are more than welcome.
Help
TODO: Build full help documentation
Installation
Install with pip:
pip install laktory
TODO: Full installation instructions
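As a quick sanity check, the install can be verified from a Python shell (a sketch only, assuming the package exposes a __version__ attribute, as most Python packages do):

import laktory

# Assumption: a __version__ attribute is exposed; confirms the package imports cleanly
print(laktory.__version__)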
pyspark
Optionally, you can also install Spark locally to test your custom functions; a minimal sketch is given below.
TODO: Add pyspark instructions https://www.machinelearningplus.com/pyspark/install-pyspark-on-mac/
On macOS with Homebrew, for example:
- JAVA_HOME=/opt/homebrew/opt/java
- SPARK_HOME=/opt/homebrew/Cellar/apache-spark/3.5.0/libexec
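A minimal sketch of such a local test, assuming pyspark and a local Java runtime are installed (the column names and the greatest-of-open/close function are illustrative only, not part of Laktory):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Start a local Spark session for quick, cluster-free testing
spark = SparkSession.builder.appName("laktory-local-test").getOrCreate()

# Illustrative stock-price rows, mirroring the example events further below
df = spark.createDataFrame(
    [("GOOGL", 130.25, 132.33), ("GOOGL", 132.00, 134.12)],
    ["symbol", "open", "close"],
)

# A custom transformation under test: daily high as the greater of open and close
df = df.withColumn("high", F.greatest("open", "close"))
df.show()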
A Basic Example
This example demonstrates how to send data events to a data lake and how to define a data pipeline that builds the tables' transformation layers.
Generate data events
A data event class defines the specifications of an event and provides methods for writing that event directly to cloud storage or through a Databricks volume or mount.
from laktory import models
from datetime import datetime

events = [
    models.DataEvent(
        name="stock_price",
        producer={
            "name": "yahoo-finance",
        },
        data={
            "created_at": datetime(2023, 8, 23),
            "symbol": "GOOGL",
            "open": 130.25,
            "close": 132.33,
        },
    ),
    models.DataEvent(
        name="stock_price",
        producer={
            "name": "yahoo-finance",
        },
        data={
            "created_at": datetime(2023, 8, 24),
            "symbol": "GOOGL",
            "open": 132.00,
            "close": 134.12,
        },
    ),
]

for event in events:
    event.to_databricks()
Define data pipeline and data tables
A YAML file defines the configuration for a data pipeline, including the transformations of raw data events into curated (silver) and consumption (gold) layers.
name: pl-stock-prices
catalog: ${var.env}
target: default

clusters:
  - name: default
    node_type_id: Standard_DS3_v2
    autoscale:
      min_workers: 1
      max_workers: 2

libraries:
  - notebook:
      path: /pipelines/dlt_template_brz.py
  - notebook:
      path: /pipelines/dlt_template_slv.py

permissions:
  - group_name: account users
    permission_level: CAN_VIEW
  - group_name: role-engineers
    permission_level: CAN_RUN

# --------------------------------------------------------------------------- #
# Tables                                                                      #
# --------------------------------------------------------------------------- #

tables:
  - name: brz_stock_prices
    timestamp_key: data.created_at
    event_source:
      name: stock_price
      producer:
        name: yahoo-finance
    zone: BRONZE

  - name: slv_stock_prices
    table_source:
      catalog_name: ${var.env}
      schema_name: finance
      name: brz_stock_prices
    zone: SILVER
    columns:
      - name: created_at
        type: timestamp
        spark_func_name: coalesce
        spark_func_args:
          - data._created_at
      - name: open
        type: double
        spark_func_name: coalesce
        spark_func_args:
          - data.open
      - name: close
        type: double
        spark_func_name: coalesce
        spark_func_args:
          - data.close
      - name: high
        type: double
        sql_expression: GREATEST(data.open, data.close)
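For intuition, the slv_stock_prices column definitions above translate to Spark expressions roughly like the following sketch (the intent only, not the code Laktory generates; the df_brz argument name is an assumption standing in for the bronze table):

import pyspark.sql.functions as F

def to_silver(df_brz):
    # Sketch only: approximate Spark equivalent of the silver column definitions
    return (
        df_brz
        .withColumn("created_at", F.coalesce("data._created_at").cast("timestamp"))
        .withColumn("open", F.coalesce("data.open").cast("double"))
        .withColumn("close", F.coalesce("data.close").cast("double"))
        .withColumn("high", F.expr("GREATEST(data.open, data.close)"))
    )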
Deploy your configuration
Laktory currently supports Pulumi for cloud deployment; more engines (Terraform, Databricks CLI, etc.) will be added in the future.
import os
import pulumi
from laktory import models

# Read configuration file
with open("pipeline.yaml", "r") as fp:
    pipeline = models.Pipeline.model_validate_yaml(fp)

# Set variables
pipeline.vars = {
    "env": os.getenv("ENV"),
}

# Deploy
pipeline.deploy_with_pulumi()
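Assuming the script above is used as the entry point of a Pulumi project (for example its __main__.py), the deployment is then triggered with pulumi up, with the ENV environment variable selecting the target catalog.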
A full DataOps template
A comprehensive template on how to deploy a lakehouse as code using Laktory is maintained here: https://github.com/okube-ai/lakehouse-as-code.
In this template, four Pulumi projects are used to:
- {cloud_provider}_infra: deploy the required resources on your cloud provider
- unity-catalog: set up users, groups, catalogs and schemas, and manage grants
- workspace-conf: set up secrets, clusters and warehouses
- workspace: define the data workflows to build your lakehouse