A package for implementing hybrid SCD1 and SCD2 operations using Delta Tables in Databricks

Hybrid SCD1 and SCD2 Implementation

This package provides a hybrid implementation of Slowly Changing Dimensions (SCD) Type 1 and Type 2 for Delta tables in Databricks. Changes in the columns you specify are handled as SCD2, and changes in all other columns are handled as SCD1.

Features

  1. Hybrid SCD1 and SCD2: Both SCD Type 1 and SCD Type 2 logic are applied to the same target table.
  2. Column-based SCD2: SCD2 is applied when a value changes in any of the specified SCD2 columns.
  3. Column-based SCD1: SCD1 is applied when a value changes in any column other than the specified SCD2 columns.
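
The column-based rules above can be sketched in plain Python. This is only an illustration of the decision, not code from the package: classify_change is a hypothetical helper, rows are modeled as dicts, and giving SCD2 precedence when both kinds of column change is an assumption.

```python
def classify_change(current, incoming, scd2_cols, pk_cols):
    """Return "scd2", "scd1", or "none" for two rows that share a primary key.

    Hypothetical helper for illustration; not part of delta_hybrid_scd.
    """
    # Any change in a tracked column is treated as SCD2
    # (assumption: SCD2 takes precedence if other columns changed too).
    if any(current[c] != incoming[c] for c in scd2_cols):
        return "scd2"
    # Otherwise, a change in any remaining non-key column is SCD1.
    other_cols = [c for c in incoming if c not in scd2_cols and c not in pk_cols]
    if any(current.get(c) != incoming[c] for c in other_cols):
        return "scd1"
    return "none"


current = {"id": 1, "stock_name": "BTC", "balance": 0, "platform": "Binance"}
pk = ["id", "stock_name"]
tracked = ["balance"]

print(classify_change(current, {**current, "balance": 10}, tracked, pk))      # scd2
print(classify_change(current, {**current, "platform": "Kite"}, tracked, pk))  # scd1
print(classify_change(current, dict(current), tracked, pk))                    # none
```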

Usage

apply_scd Function

The apply_scd function applies the hybrid SCD logic to a target table based on the columns you specify. It is designed for Delta tables in Databricks and requires the target table to have the following columns: record_status, effective_from, effective_to, dw_inserted_at, dw_updated_at, scd_key, and upd_key.
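
Because the target table must carry these columns, it can help to check for them before calling apply_scd. The following is a minimal sketch; missing_scd_columns is a hypothetical helper, not part of the package, and the case-insensitive comparison assumes Spark's default case-insensitive column resolution.

```python
# Column names taken from the requirement listed above.
REQUIRED_SCD_COLUMNS = (
    "record_status",
    "effective_from",
    "effective_to",
    "dw_inserted_at",
    "dw_updated_at",
    "scd_key",
    "upd_key",
)


def missing_scd_columns(table_columns):
    """Return the required SCD columns that are absent from table_columns.

    Hypothetical helper for illustration; not part of delta_hybrid_scd.
    """
    present = {c.lower() for c in table_columns}
    return [c for c in REQUIRED_SCD_COLUMNS if c not in present]


# Example with a plain list of column names:
print(missing_scd_columns(["id", "balance", "record_status", "effective_from"]))
```

In a Databricks notebook the column list could come from spark.table(target_table).columns.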

SCD Handler Example

This example demonstrates how to use scd_handler from the delta_hybrid_scd module to apply the hybrid SCD Type 1 and Type 2 logic using PySpark.

1. Prepare Data

from datetime import datetime
from delta_hybrid_scd import scd_handler

incremental_data = [
    (1, "Google", 0, "Kite", datetime(2015, 12, 25, 10, 5, 30)),
    (1, "BTC", 0, "Binance", datetime(2016, 12, 25, 11, 5, 30)),
    (3, "ETH", 20, "Binance", datetime(2016, 12, 26, 12, 7, 35))
]

schema = ["id", "stock_name", "balance", "platform", "last_modify_ts"]
# `spark` is the SparkSession that Databricks notebooks provide automatically
df = spark.createDataFrame(incremental_data, schema)

2. Apply SCD

# catalog_name and silver_schema must already be defined in the notebook
target_table = f"{catalog_name}.{silver_schema}.account_scd2"
pk_col = ["id", "stock_name"]          # Primary key columns
skey_col = ["balance"]                 # Columns to track SCD2 changes on
effective_from_col = "last_modify_ts"  # Timestamp column to log changes
select_col_list = ["id", "stock_name", "balance", "platform"]

scd_handler.apply_scd(
    df,
    skey_col,
    pk_col,
    target_table,
    select_col_list,
    effective_from_col
)
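
To make the intended effect of such a call concrete, here is a small in-memory simulation of hybrid SCD over a list of dict rows. It is a sketch under stated assumptions, not the package's implementation: simulate_hybrid_scd is a hypothetical name, an open row is marked by effective_to being None, and the other bookkeeping columns (record_status, dw_inserted_at, dw_updated_at, scd_key, upd_key) are left out.

```python
from datetime import datetime


def simulate_hybrid_scd(target, batch, pk_cols, scd2_cols, ts_col):
    """Apply a batch of incoming rows to `target`, a list of dict rows.

    Hypothetical helper for illustration; not part of delta_hybrid_scd.
    """
    for row in batch:
        key = tuple(row[c] for c in pk_cols)
        attrs = {c: v for c, v in row.items() if c != ts_col}
        # Find the open (current) version for this key, if there is one.
        current = next(
            (t for t in target
             if t["effective_to"] is None
             and tuple(t[c] for c in pk_cols) == key),
            None,
        )
        new_version = {**attrs, "effective_from": row[ts_col], "effective_to": None}
        if current is None:
            # New key: insert the first version.
            target.append(new_version)
        elif any(current[c] != row[c] for c in scd2_cols):
            # SCD2: close the current version and open a new one.
            current["effective_to"] = row[ts_col]
            target.append(new_version)
        else:
            # SCD1: overwrite attributes in place, keeping the history as is.
            current.update(attrs)
    return target


rows = []
opts = dict(pk_cols=["id", "stock_name"], scd2_cols=["balance"], ts_col="last_modify_ts")
base = {"id": 3, "stock_name": "ETH", "balance": 20, "platform": "Binance"}

simulate_hybrid_scd(rows, [{**base, "last_modify_ts": datetime(2016, 12, 26)}], **opts)
simulate_hybrid_scd(rows, [{**base, "platform": "Kite", "last_modify_ts": datetime(2016, 12, 27)}], **opts)
simulate_hybrid_scd(rows, [{**base, "platform": "Kite", "balance": 35, "last_modify_ts": datetime(2016, 12, 28)}], **opts)

for r in rows:
    print(r)
```

After the three batches, the list holds two versions of the ETH row: the platform change overwrote the first version in place, and the balance change closed it and opened a second one.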
