
Laktory


A DataOps framework for building a Databricks lakehouse.

Okube Company

Okube is committed to developing open-source data and ML engineering tools. This is an open space; contributions are more than welcome.

Help

TODO: Build full help documentation

Installation

Install using pip:

pip install laktory

TODO: Full installation instructions

A Basic Example

This example demonstrates how to send data events to a data lake and how to define a data pipeline that specifies the transformation layers for tables.

Generate data events

A data event class defines the specifications of an event and provides methods for writing that event directly to cloud storage or through a Databricks volume or mount.

from laktory import models
from datetime import datetime


events = [
    models.DataEvent(
        name="stock_price",
        producer={
            "name": "yahoo-finance",
        },
        data={
            "created_at": datetime(2023, 8, 23),
            "symbol": "GOOGL",
            "open": 130.25,
            "close": 132.33,
        },
    ),
    models.DataEvent(
        name="stock_price",
        producer={
            "name": "yahoo-finance",
        },
        data={
            "created_at": datetime(2023, 8, 24),
            "symbol": "GOOGL",
            "open": 132.00,
            "close": 134.12,
        },
    )
]

for event in events:
    event.to_databricks()

Define data pipeline and data tables

A YAML file defines the configuration for a data pipeline, including the transformations of raw data events into curated (silver) and consumption (gold) layers.

name: pl-stock-prices

catalog: ${var.env}
target: default

clusters:
  - name: default
    node_type_id: Standard_DS3_v2
    autoscale:
      min_workers: 1
      max_workers: 2

libraries:
  - notebook:
      path: /pipelines/dlt_template_brz.py
  - notebook:
      path: /pipelines/dlt_template_slv.py

permissions:
  - group_name: account users
    permission_level: CAN_VIEW
  - group_name: role-engineers
    permission_level: CAN_RUN

# --------------------------------------------------------------------------- #
# Tables                                                                      #
# --------------------------------------------------------------------------- #

tables:
  - name: brz_stock_prices
    timestamp_key: data.created_at
    event_source:
      name: stock_price
      producer:
        name: yahoo-finance
    zone: BRONZE


  - name: slv_stock_prices
    table_source:
      catalog_name: dev
      schema_name: finance
      name: brz_stock_prices
    zone: SILVER
    columns:
      - name: created_at
        type: timestamp
        func_name: coalesce
        input_cols:
          - data._created_at

      - name: open
        type: double
        func_name: coalesce
        input_cols:
          - data.open

      - name: close
        type: double
        func_name: coalesce
        input_cols:
          - data.close
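
Each column spec above applies a function (here coalesce) to the listed input columns to build the output column. As a rough illustration of the coalesce semantics, here is a pure-Python sketch; this is not Laktory's implementation, and the backfill column in the example row is hypothetical:

```python
# Illustrative sketch (not Laktory's actual code): a column spec such as
#   func_name: coalesce, input_cols: [data.open]
# produces the first non-null value among its input columns.
def coalesce(*values):
    """Return the first value that is not None, else None."""
    for v in values:
        if v is not None:
            return v
    return None


# Hypothetical source row with a missing primary value:
row = {"data.open": None, "data.open_backfill": 130.25}
print(coalesce(row["data.open"], row["data.open_backfill"]))  # 130.25
```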

Deploy your configuration

Laktory currently supports Pulumi for cloud deployment; more engines will be added in the future (Terraform, Databricks CLI, etc.).

import os
import pulumi
from laktory import models

# Read configuration file
with open("pipeline.yaml", "r") as fp:
    pipeline = models.Pipeline.model_validate_yaml(fp)

# Set variables
pipeline.vars = {
    "env": os.getenv("ENV"),
}
    
# Deploy
pipeline.deploy_with_pulumi()
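
The catalog field in the pipeline YAML uses a ${var.env} placeholder that gets resolved from pipeline.vars at deployment time. A minimal sketch of that kind of substitution, assuming a simple token-replacement scheme (the regex and resolution logic here are illustrative, not Laktory's actual mechanism):

```python
import re


# Hypothetical resolver: replace ${var.name} tokens with values from a dict,
# mirroring how pipeline.vars feeds placeholders such as catalog: ${var.env}.
def resolve_vars(text: str, variables: dict) -> str:
    return re.sub(
        r"\$\{var\.(\w+)\}",
        lambda m: str(variables[m.group(1)]),
        text,
    )


print(resolve_vars("catalog: ${var.env}", {"env": "dev"}))  # catalog: dev
```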

A full Data Ops template

A comprehensive template on how to deploy a lakehouse as code using Laktory is maintained here: https://github.com/okube-ai/lakehouse-as-code.

In this template, four Pulumi projects are used to:

  • {cloud_provider}_infra: deploy the required resources on your cloud provider
  • unity-catalog: set up users, groups, catalogs, and schemas, and manage grants
  • workspace-conf: set up secrets, clusters, and warehouses
  • workspace: deploy the data workflows that build your lakehouse

