A DataOps framework for building a lakehouse
Laktory
An open-source DataOps and dataframe-centric ETL framework for building lakehouses.
Laktory is your all-in-one solution for defining both data transformations and Databricks resources. Imagine if Terraform, Databricks Asset Bundles, and dbt combined forces—that’s essentially Laktory.
This open-source framework simplifies the creation, deployment, and execution of data pipelines while adhering to essential DevOps practices like version control, code reviews, and CI/CD integration. With Apache Spark and Polars driving its data transformation, Laktory ensures reliable and scalable data processing. Its modular, flexible approach allows you to seamlessly combine SQL statements with DataFrame operations.
Since Laktory pipelines are built on top of Spark and Polars, they can run in any environment that supports Python, from your local machine to a Kubernetes cluster. They can also be deployed and orchestrated as Databricks Jobs or Delta Live Tables, offering a simple, fully managed, and low-maintenance solution.
But Laktory goes beyond data pipelines. It lets you define and deploy your entire Databricks data platform, from Unity Catalog and access grants to compute and quality monitoring, providing a complete, modern solution for data platform management. This empowers your data team to take full ownership of the solution, eliminating the need to juggle multiple technologies. Say goodbye to relying on external Terraform experts to handle compute, workspace configuration, and Unity Catalog while your data engineers and analysts try to combine Databricks Asset Bundles and dbt to build data pipelines. Laktory consolidates these functions, simplifying the entire process and reducing the overall cost.
Help
See the documentation for more details.
Installation
Install using pip:
pip install laktory
For more installation options, see the Install section in the documentation.
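A quick way to confirm the installation from Python (assuming the __version__ attribute exposed by recent releases):

import laktory

# Print the installed Laktory version
print(laktory.__version__)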
A Basic Example
from laktory import models

# Bronze node: reads raw stock prices from parquet files
node_brz = models.PipelineNode(
    name="brz_stock_prices",
    source={
        "format": "PARQUET",
        "path": "./data/brz_stock_prices/",
    },
    transformer={
        "nodes": [],
    },
)

# Silver node: reads the bronze node output, selects columns with a SQL
# statement, de-duplicates with a Spark DataFrame function and writes to parquet
node_slv = models.PipelineNode(
    name="slv_stock_prices",
    source={
        "node_name": "brz_stock_prices",
    },
    sink={
        "path": "./data/slv_stock_prices",
        "mode": "OVERWRITE",
        "format": "PARQUET",
    },
    transformer={
        "nodes": [
            # SQL Transformation
            {
                "sql_expr": """
                    SELECT
                        data.created_at AS created_at,
                        data.symbol AS symbol,
                        data.open AS open,
                        data.close AS close,
                        data.high AS high,
                        data.low AS low,
                        data.volume AS volume
                    FROM
                        {df}
                """
            },
            # Spark Transformation
            {
                "func_name": "drop_duplicates",
                "func_kwargs": {
                    "subset": ["created_at", "symbol"],
                },
            },
        ],
    },
)

# Assemble the nodes into a pipeline
pipeline = models.Pipeline(
    name="stock_prices",
    nodes=[node_brz, node_slv],
)

print(pipeline)
#> resource_name_=None options=ResourceOptions(variables={}, depends_on=[], provider=None, aliases=None, delete_before_replace=True, ignore_changes=None, import_=None, parent=None, replace_on_changes=None) variables={} databricks_job=None dlt=None name='stock_prices' nodes=[PipelineNode(variables={}, add_layer_columns=True, dlt_template='DEFAULT', description=None, drop_duplicates=None, drop_source_columns=False, transformer=SparkChain(variables={}, nodes=[SparkChainNode(variables={}, allow_missing_column_args=False, column=None, spark_func_args=[SparkFuncArg(variables={}, value='symbol'), SparkFuncArg(variables={}, value='timestamp'), SparkFuncArg(variables={}, value='open'), SparkFuncArg(variables={}, value='close')], spark_func_kwargs={}, spark_func_name='select', sql_expression=None)]), expectations=[], layer='BRONZE', name='brz_stock_prices', primary_key=None, sink=None, source=FileDataSource(variables={}, as_stream=False, broadcast=False, cdc=None, dataframe_type='SPARK', drops=None, filter=None, mock_df=None, renames=None, selects=None, watermark=None, format='PARQUET', header=True, multiline=False, path='./data/brz_stock_prices/', read_options={}, schema_location=None), timestamp_key=None), PipelineNode(variables={}, add_layer_columns=True, dlt_template='DEFAULT', description=None, drop_duplicates=None, drop_source_columns=True, transformer=SparkChain(variables={}, nodes=[SparkChainNode(variables={}, allow_missing_column_args=False, column=None, spark_func_args=[], spark_func_kwargs={'subset': SparkFuncArg(variables={}, value=['timestamp', 'symbol'])}, spark_func_name='drop_duplicates', sql_expression=None)]), expectations=[], layer='SILVER', name='slv_stock_prices', primary_key=None, sink=FileDataSink(variables={}, mode='OVERWRITE', checkpoint_location=None, format='PARQUET', path='./data/slv_stock_prices', write_options={}), source=PipelineNodeDataSource(variables={}, as_stream=False, broadcast=False, cdc=None, dataframe_type='SPARK', drops=None, filter=None, mock_df=None, renames=None, selects=None, watermark=None, node_name='brz_stock_prices', node=PipelineNode(variables={}, add_layer_columns=True, dlt_template='DEFAULT', description=None, drop_duplicates=None, drop_source_columns=False, transformer=SparkChain(variables={}, nodes=[SparkChainNode(variables={}, allow_missing_column_args=False, column=None, spark_func_args=[SparkFuncArg(variables={}, value='symbol'), SparkFuncArg(variables={}, value='timestamp'), SparkFuncArg(variables={}, value='open'), SparkFuncArg(variables={}, value='close')], spark_func_kwargs={}, spark_func_name='select', sql_expression=None)]), expectations=[], layer='BRONZE', name='brz_stock_prices', primary_key=None, sink=None, source=FileDataSource(variables={}, as_stream=False, broadcast=False, cdc=None, dataframe_type='SPARK', drops=None, filter=None, mock_df=None, renames=None, selects=None, watermark=None, format='PARQUET', header=True, multiline=False, path='./data/brz_stock_prices/', read_options={}, schema_location=None), timestamp_key=None)), timestamp_key=None)] orchestrator=None udfs=[]
pipeline.execute(spark=spark)
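The execute call above assumes an active Spark session bound to the name spark, as you would have in a Databricks notebook or job. For a quick local run, a minimal sketch for creating one with PySpark (the application name is arbitrary):

from pyspark.sql import SparkSession

# Create (or reuse) a local Spark session before calling pipeline.execute()
spark = SparkSession.builder.appName("laktory-demo").getOrCreate()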
To get started with a more useful example, jump into the Quickstart.
A Lakehouse DataOps Template
A comprehensive template showing how to deploy a lakehouse as code using Laktory is maintained here: https://github.com/okube-ai/lakehouse-as-code.
In this template, four Pulumi projects are used to:

{cloud_provider}_infra: Deploy the required resources on your cloud provider
unity-catalog: Set up users, groups, catalogs, and schemas, and manage grants
workspace: Set up secrets, clusters, warehouses, and common files/notebooks
workflows: The data workflows to build your lakehouse
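Each of these is a regular Pulumi project. Purely as an illustration of the pattern (not code taken from the template), a minimal Pulumi Python sketch declaring a Unity Catalog catalog and schema with the pulumi-databricks provider might look like this:

import pulumi_databricks as databricks

# Illustrative only: declare a catalog and a schema with the Pulumi
# Databricks provider; run from inside a Pulumi project with pulumi up
catalog = databricks.Catalog(
    "prod-catalog",
    name="prod",
    comment="Production catalog",
)

schema = databricks.Schema(
    "finance-schema",
    catalog_name=catalog.name,
    name="finance",
)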
Okube Company
Okube is dedicated to building open-source frameworks, known as the kubes, that empower businesses to build, deploy, and operate highly scalable data platforms and AI models.