Documentation: https://charlesdarkwind.github.io/ETL-lib.github.io/html/

For development and analysis work, the library saves time by mounting volumes only when necessary or when explicitly asked to, even after a notebook's state has been cleared.
Volume accessibility is always checked at the last possible moment to reduce the chance of using outdated credentials.

Table data locations are abstracted away.

Standard / low abstraction modules
  • raw
  • curated
  • trusted
  • raw_control
  • curated_control

These modules provide utility functions for read, write and delete operations. They enforce conventional naming, file and folder locations, and other standards to keep things organised.

They also write extensive debug information to a log file located at:

  • windows: C://logs
  • data lake: raw-zone/logs
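
As a rough sketch of per-platform debug logging, the following uses only the Python standard library. The two locations are the ones listed above, but how the library resolves them (in particular the data lake path) is not shown in the source, so `pick_log_location` and `build_debug_logger` are hypothetical helpers and the example writes to a temporary directory instead of the real locations.

```python
import logging
import os
import platform
import tempfile

# Locations as listed above; treating them as plain strings is an assumption.
LOG_LOCATIONS = {"windows": "C://logs", "data lake": "raw-zone/logs"}


def pick_log_location(system_name):
    # Hypothetical rule: Windows hosts use the local folder, anything else the lake.
    key = "windows" if system_name == "Windows" else "data lake"
    return LOG_LOCATIONS[key]


def build_debug_logger(log_dir, name="etl_sketch"):
    # Standard-library file logger at DEBUG level, one file per logger name.
    os.makedirs(log_dir, exist_ok=True)
    log_path = os.path.join(log_dir, name + ".log")
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger, log_path


# The sketch writes to a temporary directory rather than the real locations.
logger, log_path = build_debug_logger(tempfile.mkdtemp())
logger.debug("resolved location: %s", pick_log_location(platform.system()))
```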

 

Utility modules
  • utils
  • config
  • dbfs_utils
  • json_utils
  • delta_utils

These modules offer more flexibility for building pipelines that cover other use cases.
The higher-order / standard modules listed earlier are all built on functions from these utility modules.

utils is the only module whose functions can be imported and used anywhere without Spark or a Databricks connection.

Installation

# From a shell:
pip install ETL-lib

# From a Databricks notebook:
dbutils.library.installPyPI('ETL-lib')

Examples

from pyspark.sql.functions import to_timestamp, col
from pyspark.sql.types import TimestampType

from yammer_params import params
from ETL import *

config = Config(params)


def parse_date(df):
  # 'MM' is month-of-year in Spark datetime patterns; lowercase 'mm' is minute-of-hour
  return df.withColumn(
    'created_at', to_timestamp(col('created_at').cast(TimestampType()), "yyyy-MM-dd'T'HH:mm:ss"))


# Read all raw-zone data for table "Messages" and overwrite the curated-zone delta table:
curated.write(config, 'Messages', transformation=parse_date)

# Read only new raw-zone folders and merge them into the curated-zone:
curated.merge(config, 'Messages', transformation=parse_date, incremental=True)

# Since raw data can only go to curated, ETL.curated.merge() and ETL.curated.write() read it implicitly


# Example of the same merge, written more explicitly with other functions from the library
def raw_to_curated(table, transformation=None, incremental=True):

  # Read raw data, also retrieve potential control table updates
  df, short_paths = raw.read(config, table, incremental=incremental)

  # Clean it
  if transformation and not utils.df_empty(df):
    df = transformation(df)

  # Merge into curated table
  curated.merge(config, table, df, incremental=incremental)

  # Update control table
  raw_control.insert(config, short_paths)
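
The control-table pattern used by raw_to_curated above can be illustrated without Spark. The following is a minimal sketch under stated assumptions, not the library's code: `read_new_folders` and `merge_rows` are hypothetical stand-ins for raw.read and curated.merge, the control table is a plain set of folder paths, the curated table is a dict keyed by an assumed `id` column, and the folder names and rows are made up.

```python
def read_new_folders(raw_folders, control):
    """Return rows from folders not yet recorded in the control set."""
    new_paths = sorted(p for p in raw_folders if p not in control)
    rows = [row for p in new_paths for row in raw_folders[p]]
    return rows, new_paths


def merge_rows(target, rows, key="id"):
    """Upsert rows into target (dict keyed by `key`): update matches, insert the rest."""
    for row in rows:
        target[row[key]] = row
    return target


# Hypothetical data: folder path -> rows landed in that folder.
raw_folders = {
    "Messages/2020/01/01": [{"id": 1, "body": "hello"}],
    "Messages/2020/01/02": [{"id": 1, "body": "hello (edited)"},
                            {"id": 2, "body": "hi"}],
}
control = {"Messages/2020/01/01"}   # folders already processed
curated_table = {1: {"id": 1, "body": "hello"}}

rows, new_paths = read_new_folders(raw_folders, control)
merge_rows(curated_table, rows)
control.update(new_paths)           # record folders only after the merge has run
```

Updating the control set last means a failed merge leaves the folders unrecorded, so the next incremental run picks them up again.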
