A package for implementing hybrid SCD1 and SCD2 operations using Delta Tables in Databricks

Hybrid SCD1 and SCD2 Implementation

This package provides a hybrid implementation of Slowly Changing Dimensions (SCD) Type 1 and Type 2 on Delta tables in Databricks. Changes in the columns you designate are handled as SCD2 (history is kept), and changes in all other columns are handled as SCD1 (values are overwritten in place).

Features

  1. Hybrid SCD1 and SCD2: Both SCD1 and SCD2 logic are applied to the same target table.
  2. Column-based SCD2: SCD2 is applied when a value changes in any of the specified SCD2 columns.
  3. Column-based SCD1: SCD1 is applied when a value changes in any column other than the specified SCD2 columns.
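
The column-based rule above can be sketched in plain Python for a single pair of rows that share a primary key. This is only an illustration of the decision, not the package's implementation: the function name classify_change is hypothetical, and giving SCD2 precedence when both kinds of column change is an assumption drawn from the feature descriptions.

```python
from typing import Any, Mapping, Sequence


def classify_change(
    current: Mapping[str, Any],
    incoming: Mapping[str, Any],
    scd2_cols: Sequence[str],
) -> str:
    """Return "scd2", "scd1", or "none" for two rows with the same primary key."""
    # A change in any tracked column calls for a new versioned row (SCD2).
    if any(current.get(c) != incoming.get(c) for c in scd2_cols):
        return "scd2"
    # Otherwise, a change in any remaining column is overwritten in place (SCD1).
    other_cols = [c for c in incoming if c not in scd2_cols]
    if any(current.get(c) != incoming.get(c) for c in other_cols):
        return "scd1"
    return "none"


current_row = {"id": 1, "stock_name": "BTC", "balance": 0, "platform": "Binance"}
balance_changed = classify_change(current_row, {**current_row, "balance": 5}, ["balance"])
platform_changed = classify_change(current_row, {**current_row, "platform": "Kite"}, ["balance"])
print(balance_changed, platform_changed)
```

With balance as the only SCD2 column, the first comparison is classified as SCD2 and the second as SCD1.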

Usage

apply_scd Function

The apply_scd function handles the implementation of SCD based on the specified columns. This function is designed for Delta tables in Databricks and requires the target table to have the following columns: record_status, effective_from, effective_to, dw_inserted_at, dw_updated_at, scd_key, and upd_key.
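
Because the target table must already contain these columns, it may help to check for them before calling apply_scd. The helper below is a minimal sketch and is not part of the package: missing_scd_columns is a hypothetical name, and the case-insensitive comparison assumes Spark's default column-name handling.

```python
from typing import Iterable, List

# Column names required on the target table, as listed in the description above.
REQUIRED_SCD_COLUMNS = (
    "record_status",
    "effective_from",
    "effective_to",
    "dw_inserted_at",
    "dw_updated_at",
    "scd_key",
    "upd_key",
)


def missing_scd_columns(table_columns: Iterable[str]) -> List[str]:
    """Return the required SCD columns that are absent from table_columns."""
    # Assumption: column names are compared case-insensitively.
    present = {c.lower() for c in table_columns}
    return [c for c in REQUIRED_SCD_COLUMNS if c not in present]


example_columns = ["id", "stock_name", "balance", "record_status", "effective_from"]
print(missing_scd_columns(example_columns))
```

In a Databricks notebook the column list could come from spark.table(target_table).columns; an empty result means every required column is present.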

SCD Handler Example

This example demonstrates how to use scd_handler from the delta_hybrid_scd package to apply the hybrid SCD logic with PySpark.

1. Prepare Data

from datetime import datetime
from delta_hybrid_scd import scd_handler

incremental_data = [
    (1, "Google", 0, "Kite", datetime(2015, 12, 25, 10, 5, 30)),
    (1, "BTC", 0, "Binance", datetime(2016, 12, 25, 11, 5, 30)),
    (3, "ETH", 20, "Binance", datetime(2016, 12, 26, 12, 7, 35))
]

schema = ["id", "stock_name", "balance", "platform", "last_modify_ts"]
# `spark` is the SparkSession that Databricks notebooks provide by default.
df = spark.createDataFrame(incremental_data, schema)

2. Apply SCD

# catalog_name and silver_schema must already be defined for your workspace.
target_table = f"{catalog_name}.{silver_schema}.account_scd2"
pk_col = ["id", "stock_name"]          # Primary key columns
skey_col = ["balance"]                 # Columns to track SCD2 changes on
effective_from_col = "last_modify_ts"  # Timestamp column to log changes
select_col_list = ["id", "stock_name", "balance", "platform"]

scd_handler.apply_scd(
    df,
    skey_col,
    pk_col,
    target_table,
    select_col_list,
    effective_from_col
)
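
Since apply_scd takes several column lists, a typo in any of them is easy to miss. The following is a hypothetical pre-flight check, not something the package provides: find_unknown_columns is an assumed name, and it only verifies that every referenced column exists in the incoming DataFrame.

```python
from typing import Dict, List, Sequence


def find_unknown_columns(
    df_columns: Sequence[str],
    pk_col: Sequence[str],
    skey_col: Sequence[str],
    select_col_list: Sequence[str],
    effective_from_col: str,
) -> Dict[str, List[str]]:
    """Map each argument name to the columns it references that df_columns lacks."""
    available = set(df_columns)
    groups = {
        "pk_col": list(pk_col),
        "skey_col": list(skey_col),
        "select_col_list": list(select_col_list),
        "effective_from_col": [effective_from_col],
    }
    problems: Dict[str, List[str]] = {}
    for name, cols in groups.items():
        unknown = [c for c in cols if c not in available]
        if unknown:
            problems[name] = unknown
    return problems


# Same column names as the example above; in a notebook, pass df.columns instead.
incoming_columns = ["id", "stock_name", "balance", "platform", "last_modify_ts"]
problems = find_unknown_columns(
    incoming_columns,
    pk_col=["id", "stock_name"],
    skey_col=["balance"],
    select_col_list=["id", "stock_name", "balance", "platform"],
    effective_from_col="last_modify_ts",
)
print(problems)
```

An empty dictionary means every column named in the arguments exists in the incoming data.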
