yet (another Apache Spark) ETL framework

YETL

```shell
pip install yetl-framework
```
Website & Docs: Yet (another Apache Spark) ETL Framework
Example:
Define a dataflow:
```python
from typing import Optional, Type

from yetl_flow import (
    yetl_flow,
    IDataflow,
    Context,
    Timeslice,
    TimesliceUtcNow,
    OverwriteSave,
    Save,
)


@yetl_flow(log_level="ERROR")
def customer_landing_to_rawdb_csv(
    context: Context,
    dataflow: IDataflow,
    timeslice: Timeslice = TimesliceUtcNow(),
    save_type: Optional[Type[Save]] = None,
) -> dict:
    """Load the demo customer data as-is into a raw, Hive-registered Delta table."""

    # The config for this dataflow has 2 landing sources that are joined
    # and written to a Delta table. Delta tables are created automatically
    # and, if configured, rows that raise schema exceptions are siphoned
    # into a schema exception table.
    df_cust = dataflow.source_df("landing.customer")
    df_prefs = dataflow.source_df("landing.customer_preferences")

    context.log.info("Joining customers with customer_preferences")
    df = df_cust.join(df_prefs, "id", "inner")

    dataflow.destination_df("raw.customer", df)
```
Run an incremental load:

```python
timeslice = Timeslice(2022, 7, 12)
results = customer_landing_to_rawdb_csv(timeslice=timeslice)
```
Run a full load:

```python
results = customer_landing_to_rawdb_csv(
    timeslice=Timeslice(2022, "*", "*"),
    save_type=OverwriteSave,
)
```
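The `Timeslice` arguments above select which landing partitions to load: concrete values for an incremental load, `"*"` wildcards for a full load. As a rough illustration only (this is not the yetl implementation; the helper name and path layout are hypothetical), a (year, month, day) slice might expand into a partition path pattern like this:

```python
def timeslice_path(year, month="*", day="*", root="landing/customer"):
    """Render a (year, month, day) timeslice as a partition path pattern.

    Numeric parts are zero-padded; "*" is kept as a glob wildcard, so
    (2022, "*", "*") selects a whole year and (2022, 7, 12) a single day.
    """
    def fmt(part, width):
        return part if part == "*" else str(part).zfill(width)

    return f"{root}/{fmt(year, 4)}/{fmt(month, 2)}/{fmt(day, 2)}"


print(timeslice_path(2022, 7, 12))     # landing/customer/2022/07/12
print(timeslice_path(2022, "*", "*"))  # landing/customer/2022/*/*
```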
Dependencies & Setup
This is a Spark application using Delta Lake; it requires the following dependencies to be installed in order to run locally.

Ensure that SPARK_HOME is set and that Spark is added to your PATH, e.g.:

```shell
export SPARK_HOME="$HOME/opt/spark-3.2.2-bin-hadoop3.3"
```
Enable Delta Lake by:

```shell
cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
```

Add the following to spark-defaults.conf:

```
spark.jars.packages               io.delta:delta-core_2.12:2.0.0
spark.sql.extensions              io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog   org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.sql.catalogImplementation   hive
```
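Where editing spark-defaults.conf is inconvenient (tests, notebooks), the same four settings can be applied programmatically when the session is built. A minimal sketch, assuming pyspark is installed (the `build_spark` helper is illustrative, not part of yetl):

```python
# Same settings as spark-defaults.conf, expressed as a dict so they can
# be applied programmatically on the SparkSession builder.
DELTA_CONF = {
    "spark.jars.packages": "io.delta:delta-core_2.12:2.0.0",
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    "spark.sql.catalogImplementation": "hive",
}


def build_spark(app_name="yetl-local"):
    """Build a Delta-enabled SparkSession (requires pyspark installed)."""
    from pyspark.sql import SparkSession  # imported lazily: pyspark is optional here

    builder = SparkSession.builder.appName(app_name)
    for key, value in DELTA_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```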
Python Project Setup

Create a virtual environment and install the dependencies for local development:

```shell
python -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
pip install --editable .
```
Build

Build the python wheel:

```shell
python setup.py sdist bdist_wheel
```

There is a CI build configured for this repo that builds on the main branch and publishes to PyPI.
Change Log

Version: 0.0.6

- Fix Reader Bad Records: support exception handling for a badRecordsPath defined in the configuration, e.g. landing.customer.read.badRecordsPath. Only supported in the Databricks runtime environment.
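To illustrate what bad-record handling does conceptually (a plain-Python sketch only; the real feature relies on Spark's badRecordsPath reader option, and the function below is hypothetical): rows that fail the expected schema are siphoned into an exceptions collection instead of aborting the load.

```python
import csv
import io


def split_bad_records(csv_text, expected_columns=3):
    """Partition CSV rows into good rows and schema exceptions.

    Mirrors the idea behind badRecordsPath: malformed rows are captured
    alongside the load instead of failing it.
    """
    good, bad = [], []
    for row in csv.reader(io.StringIO(csv_text)):
        (good if len(row) == expected_columns else bad).append(row)
    return good, bad


rows = "1,alice,gold\n2,bob\n3,carol,silver\n"
good, bad = split_bad_records(rows)
# good -> [['1', 'alice', 'gold'], ['3', 'carol', 'silver']]
# bad  -> [['2', 'bob']]
```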
Download files
Source Distribution: yetl-framework-0.0.6.post2.tar.gz (11.4 kB)

Built Distribution: yetl_framework-0.0.6.post2-py3-none-any.whl
Hashes for yetl-framework-0.0.6.post2.tar.gz

Algorithm | Hash digest
---|---
SHA256 | 38558416bc14650a6b334bb71db38f6ca8035e74017a164213b529f81f7139c9
MD5 | bd88b689fbe87f0aee181f4c97f68e55
BLAKE2b-256 | 4045e3ecd16b91cfff2ffcf0c235ba9074535ba84df8e56849cf3ee6d3b98c38
Hashes for yetl_framework-0.0.6.post2-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | a4e5e6c77d2218d7b30855ab488ffb062d0c86982c6287f9510796f79d4797a0
MD5 | d9a842437c95615b17adda3e8c5a7aa6
BLAKE2b-256 | 81f79b7e9120c8e9fad8d01e2879513733d0058bb21d86cb8adc820d50611d6c