
This project uses faker-pyspark to generate random schemas and DataFrames that mimic data table snapshots. These snapshots are then processed and applied to a Delta table destination using the SCD2 (Slowly Changing Dimension Type 2) pattern.


Demo PySpark Delta Table SCD2 implementation



Source of inspiration for the SCD2 pattern: https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/BDB-2547/glue/scd-deltalake-employee-etl-job.py

Installation

Install with pip:

pip install pyspark-delta-scd2 delta-spark faker-pyspark

Please note that this package does not enforce specific versions of delta-spark, PySpark, or faker-pyspark.

This is deliberate: when you use this example in an AWS Glue environment, enforced versions would conflict with the versions provided by the target environment.
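Because versions are not enforced, it can help to check which versions are actually installed in the target environment before running the demo. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of this package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {distribution name: installed version, or None if missing}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


# Report the versions of the three dependencies named in the install command.
for name, ver in installed_versions(["pyspark", "delta-spark", "faker-pyspark"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

Comparing this output against the versions shipped with your AWS Glue runtime should show whether the environment's own packages are being picked up.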

Generate incremental updates to a DataFrame and apply SCD2

>>> from pyspark_delta_scd2 import get_spark, PySparkDeltaScd2
>>> spark = get_spark()
>>> demo  = PySparkDeltaScd2(spark=spark)
>>> # initial load
>>> df1   = demo.process()
>>> # incremental update
>>> df2   = demo.process()
>>> # df2 should have some deletes, updates and inserts


Source Distribution

pyspark_delta_scd2-0.4.1.tar.gz (4.8 kB)

Built Distribution

pyspark_delta_scd2-0.4.1-py3-none-any.whl (5.8 kB, Python 3)
