
yet (another spark) etl framework


YETL

pip install yetl-framework

Website & Docs: Yet (another Apache Spark) ETL Framework

Example:

Define a dataflow

from yetl.flow import (
    yetl_flow, 
    IDataflow, 
    IContext, 
    Timeslice, 
    TimesliceUtcNow, 
    OverwriteSave, 
    Save
)
from pyspark.sql.functions import date_format
from typing import Type

@yetl_flow(project="demo")
def landing_to_raw(
    context: IContext,
    dataflow: IDataflow,
    timeslice: Timeslice = TimesliceUtcNow(),
    save: Type[Save] = None,
) -> dict:
    """Load the demo customer data as is into a raw delta hive registered table.

        the config for this dataflow has 2 landing sources that are joined
        and written to delta table
        delta tables are automatically created and if configured schema exceptions
        are loaded syphened into a schema exception table
    """

    df_cust = dataflow.source_df(f"{context.project}_landing.customer")
    df_prefs = dataflow.source_df(f"{context.project}_landing.customer_preferences")

    df = df_cust.join(df_prefs, "id", "inner")
    df = df.withColumn(
        "_partition_key", date_format("_timeslice", "yyyyMMdd").cast("integer")
    )

    dataflow.destination_df(f"{context.project}_raw.customer", df, save=save)

Run an incremental load:

timeslice = Timeslice(year=2022, month=7, day=12)
results = landing_to_raw(timeslice=timeslice)

Run a full load for Year 2022:

results = landing_to_raw(
    timeslice=Timeslice(year=2022, month='*', day='*'),
    save=OverwriteSave
)

Dependencies & Setup

This is a Spark application with Delta Lake. It requires the following dependencies installed in order to run locally:

Ensure that SPARK_HOME is set and added to your path, e.g.:

export SPARK_HOME="$HOME/opt/spark-3.3.2-bin-hadoop3"

Enable Delta Lake by copying the default configuration template:

cp $SPARK_HOME/conf/spark-defaults.conf.template  $SPARK_HOME/conf/spark-defaults.conf

Add the following to spark-defaults.conf:

spark.jars.packages               io.delta:delta-core_2.12:2.1.1
spark.sql.extensions              io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog   org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.sql.catalogImplementation   hive
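
With those defaults in place, a quick way to check the local setup is to start a PySpark session and round-trip a small Delta table. This is only a sanity check, not part of YETL, and the path below is arbitrary:

from pyspark.sql import SparkSession

# Picks up spark-defaults.conf from $SPARK_HOME/conf, including the Delta extensions.
spark = SparkSession.builder.appName("delta-smoke-test").getOrCreate()

# Write and read back a tiny Delta table to confirm the delta-core package resolved.
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta_smoke_test")
spark.read.format("delta").load("/tmp/delta_smoke_test").show()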

Python Project Setup

Create a virtual environment and install the dependencies for local development:

python -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
pip install --editable .

Build

Build python wheel:

python setup.py sdist bdist_wheel

There is a CI build configured for this repo that builds from the main branch and publishes to PyPI.

Releases

Version: 0.0.27

  • Bumped Spark version to 3.3.2
  • Bumped Delta Lake version to 2.1.1

Version: 0.0.26

  • Fixed a bug where loads failed when schema exceptions were not configured
  • Refactored the table node out of the dataset configuration in order to simplify it
  • Standardized yetl properties to a capitalised naming convention
  • Removed a reader bug that added the context_id by default
  • Removed reader lineage columns from schema creation
  • Extended the parallel process to take the save type injection

Version: 0.0.25

  • Added typer dependency

Version: 0.0.24

  • Added a CLI init command to initialise a yetl project directory
  • Added a maxparallel parameter to the prototype for multithreaded loading
  • Fixed a partition bug on initial load that caused a failure when the schema exists but there is no data
  • Refactored SQLReader SQL files into the project pipeline directory

Version: 0.0.23

  • Added metadata lineage configuration into sources and destinations for context, dataflow and dataset ids
  • Moved the spark logging argument from the decorator to the config, since this allows the context to be more abstract and is less confusing
  • Added a workflow module and prototype for multithreaded loading (see the sketch below)
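
The workflow module itself isn't shown here; the sketch below only illustrates the general pattern of fanning dataflow calls out over a thread pool, assuming the landing_to_raw dataflow from the example above and a made-up list of timeslices:

from concurrent.futures import ThreadPoolExecutor, as_completed

from yetl.flow import Timeslice

# Hypothetical timeslices to load in parallel; adjust to the data you actually have.
timeslices = [Timeslice(year=2022, month=7, day=d) for d in range(1, 8)]

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(landing_to_raw, timeslice=ts): ts for ts in timeslices}
    for future in as_completed(futures):
        print(f"timeslice {futures[future]} finished: {future.result()}")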

Version: 0.0.22

  • Introduced a YETL optimize table property, since there are still reasons to optimise on Databricks
  • Regression tested and fixed the SQL Reader
  • Fixed a bug where lineage columns were missing from automatic schema table creation
  • Adjusted table creation on the delta writer so that, when the schema is inferred, the table is created afterwards to avoid schema partition synchronisation errors
  • Added configuration for putting the file origin into the source dataframes
  • Added configuration to add _corrupt_record on schema creation
  • Auto-generate SQL schemas on schema creation
  • Added options for dynamic template loading from a single function so that it can be re-used across tables
  • Deprecated custom timeslice parsing features in favour of Jinja templating
  • Added a new CLI lib for building out templates and maintenance tasks using typer

Version: 0.0.21

  • Added Jinja for variable replacement as a more robust solution than simple string replacement (see the sketch below)
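
How YETL wires Jinja into its configuration isn't shown here, but the difference from naive string replacement is easy to illustrate with jinja2 directly (the placeholder names below are made up):

from jinja2 import Template

# {{ }} placeholders are rendered by Jinja rather than patched with string.replace,
# so missing variables and escaping are handled consistently.
sql = Template(
    "select * from {{ project }}_landing.customer where _timeslice = '{{ timeslice }}'"
).render(project="demo", timeslice="20220712")
print(sql)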

Version: 0.0.20

  • Fixed missing packages in build

Version: 0.0.19

  • Fixed missing packages in build

Version: 0.0.18

  • Major cleanup and refactor of datasets for the future road map
  • Sources and destinations have the same auto_io lifecycle in the dataflow; auto is called on retrieval from the dataflow collections
  • Added a SQLReader dataset type so we can define SQL sources from any hive table in dataflows that write to destinations (e.g. delta lake tables)
  • Fixed audit error trapping

Version: 0.0.17

  • Integration testing with Databricks
  • Refactored configuration so that there is more re-use across environments
  • Dataset types are now specifically declared in the configuration to reduce complexity when adding more types of datasets

Version: 0.0.16

  • Refactored the context into an interface to allow future expansion into engines other than Spark

Version: 0.0.15

  • Raise errors and warnings from threshold configurations
  • Refactored audit logging and added comprehensive data flow auditing

Version: 0.0.14

  • Started building in integration tests
  • Refactored Destination save using class composition
  • Refactored save dependency injection down to the dataset level
  • Added support for Merge save using Delta Lake (see the sketch below)
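
The Merge save is injected the same way as OverwriteSave in the examples above. Under the hood, a Delta Lake merge looks roughly like the following; this is a minimal sketch using the delta-spark API directly rather than YETL's wrapper, and it assumes an existing SparkSession (spark), a source dataframe (df) and the demo_raw.customer table from the example:

from delta.tables import DeltaTable

# Upsert the incoming rows into the registered delta table on the id key.
target = DeltaTable.forName(spark, "demo_raw.customer")
(
    target.alias("t")
    .merge(df.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)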

Version: 0.0.13

  • Added support for default schema creation via etl.schema.createIfNotExists
  • Refactored and cleaned up the basic reader
  • Added consistent validation and consistent property settings to the basic reader
  • Added reader skipping features based on configuration settings

Version: 0.0.12

  • Added support for multi-column Z-ordering (see the sketch below)
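
Z-ordering is a Delta Lake operation; expressed directly against a table it looks like the line below. This is only a sketch, the table and columns are illustrative, and OPTIMIZE ... ZORDER BY needs Delta Lake 2.0+ or Databricks:

# Co-locate the data for both columns in the same files to speed up filtered reads.
spark.sql("OPTIMIZE demo_raw.customer ZORDER BY (id, _partition_key)")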

Version: 0.0.11

  • Upgraded development to Spark 3.3 and Delta Lake 2.1
  • Added _timeslice metadata column parsing into the destination dataset so that it can be used for partitioning; this works even if the read path is wildcarded with '*'
  • Added support for partition-based optimization on writes
  • Added support for multi-column partitioning

Version: 0.0.10

  • Fixed a YAML schema format error when dataflow retries are set to 0 (dictionary extraction bug when setting retries and retry_wait to zero)
  • Added overwrite schema save
  • Added partition SQL support
  • Fixed constraints synchronisation to drop and create more efficiently
  • Refined, refactored and fixed lineage
  • Added file lineage logging
  • Detect Spark and Databricks versions to determine whether to auto-optimise and compact

Version: 0.0.9

  • Clean up bad-records JSON files: automatically remove the JSON schema exception files created by the badRecordsPath exception handler after they are loaded into a delta table (see the sketch below)
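
badRecordsPath is the Spark reader option (a Databricks feature) that this exception handling builds on. Used directly it looks like the sketch below; the schema and paths are illustrative, not YETL configuration:

# Rows that fail to parse against the schema are written out as JSON files under
# badRecordsPath instead of failing the read; YETL loads and then cleans these up.
df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("badRecordsPath", "/mnt/landing/_bad_records/customer")
    .schema(customer_schema)  # hypothetical StructType for the customer feed
    .load("/mnt/landing/customer/*.csv")
)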

Version: 0.0.8

  • Including all packages in distribution.

Version: 0.0.7

Version: 0.0.6


