An ELT-with-Airflow helper module: Ewah
Ewah: ELT With Airflow Helper - classes and functions to make life with Apache Airflow easier.
Pre-alpha: currently used by myself for specific use cases.
Goal: Have functions to create all DAGs required for ELT using only a simple config file. Use this as a basis to build a GUI on top of it.
DWHs Implemented
- Snowflake
- PostgreSQL
DWHs Planned
- BigQuery
Operators Implemented
- PostgreSQL
- MySQL
- OracleSQL
- Google Analytics (incremental only)
- S3 (for JSON files stored in an S3 bucket, e.g. from Kinesis Firehose)
- FX Rates (from Yahoo Finance)
- Facebook (partially, so far: ads insights; incremental only)
- Google Sheets
- MongoDB
Philosophy
This package strictly follows an ELT philosophy:
- Business value is created by infusing business logic into the data and making great analyses and usable data available to stakeholders, not by building data pipelines
- Airflow solely orchestrates loading raw data into a central DWH
- Data is loaded either as a full refresh (all data at every load) or incrementally, exploiting Airflow's catchup and execution logic
- The only additional DAGs are dbt DAGs and utility DAGs
- Within that DWH, each data source lives in its own schema (e.g. `raw_salesforce`)
- Irrespective of full refresh or incremental loading, DAGs always load into a separate schema (e.g. `raw_salesforce_next`) and, at the end, replace the schema containing the old data with the schema containing the new data, to avoid data corruption due to errors in DAG execution (see the sketch after this list)
- Any data transformation is defined using SQL, ideally using dbt
- Seriously, dbt is awesome, give it a shot!
- (Non-SQL) code contains no transformations
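The schema swap at the end of a load can be a single atomic rename inside the DWH. A minimal sketch of that pattern, assuming a PostgreSQL DWH and an illustrative `psycopg2` connection (ewah performs this step internally as part of the DAG; this is only to show the idea):

```python
# Illustrative sketch of the drop-and-replace schema swap; ewah does this
# internally at the end of a successful load. Connection details are assumed.
import psycopg2

conn = psycopg2.connect("dbname=dwh user=etl")  # hypothetical credentials
with conn:  # one transaction: readers never see a half-swapped state
    with conn.cursor() as cur:
        # the DAG's tasks have already loaded fresh data into raw_salesforce_next
        cur.execute("DROP SCHEMA IF EXISTS raw_salesforce CASCADE;")
        cur.execute("ALTER SCHEMA raw_salesforce_next RENAME TO raw_salesforce;")
```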
Usage
In your Airflow DAGs folder, define the DAGs by invoking either the incremental loading or the full refresh DAG factory. The incremental loading DAG factory returns three DAGs in a tuple; make sure to call it like so:

```python
dag1, dag2, dag3 = dag_factory_incremental_loading()
```

or add the DAGs to your namespace like so:

```python
dags = dag_factory_incremental_loading()
for dag in dags:
    globals()[dag._dag_id] = dag
```

Otherwise, Airflow will not recognize the DAGs. Most arguments should be self-explanatory. The two noteworthy arguments are `el_operator` and `operator_config`. The former must be a child class of `ewah.operators.base_operator.EWAHBaseOperator`. Ideally, the required operator is already available for your use case. Please feel free to fork and commit your own operators to this project! The latter is a dictionary containing the entire configuration of the operator: this is where you define which tables to load, how to load them, whether to load specific columns only, and any other detail of your EL job.
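If no existing operator fits your source, a custom operator is a small subclass. The sketch below is purely illustrative: the method `ewah_execute` and the helper `upload_data` are assumptions about the base class's extension points, so check the actual `EWAHBaseOperator` source before relying on them.

```python
# Hypothetical custom operator sketch -- the method and helper names are
# assumptions; verify against the real EWAHBaseOperator before use.
from ewah.operators.base_operator import EWAHBaseOperator

class EWAHMyApiOperator(EWAHBaseOperator):
    def __init__(self, api_endpoint, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.api_endpoint = api_endpoint  # operator-specific kwarg from operator_config

    def ewah_execute(self, context):  # assumed hook called per table load
        rows = [{"id": 1, "value": "example"}]  # fetch rows from your source here
        self.upload_data(rows)  # assumed helper that writes rows to the target schema
```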
Full refresh factory
A `filename.py` file in your `airflow/dags` folder may look something like this:

```python
from ewah.ewah_utils.dag_factory_full_refresh import dag_factory_drop_and_replace
from ewah.constants import EWAHConstants as EC
from ewah.operators.postgres_operator import EWAHPostgresOperator

from datetime import datetime, timedelta

dag = dag_factory_drop_and_replace(
    dag_name='EL_production_postgres_database',  # name of the DAG
    dwh_engine=EC.DWH_ENGINE_POSTGRES,  # implemented DWH engine
    dwh_conn_id='dwh',  # airflow connection ID with connection details to the DWH
    el_operator=EWAHPostgresOperator,  # ewah operator (or custom child class of EWAHBaseOperator)
    target_schema_name='raw_production',  # name of the raw schema where data will end up in the DWH
    target_schema_suffix='_next',  # suffix of the temporary loading schema that replaces the production schema at the end
    # target_database_name='raw',  # Snowflake only
    start_date=datetime(2019, 10, 23),  # as per airflow standard
    schedule_interval=timedelta(hours=1),  # only timedelta is allowed!
    default_args={  # default args for the DAG as per airflow standard
        'owner': 'Data Engineering',
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
        'email_on_retry': False,
        'email_on_failure': True,
        'email': ['email@address.com'],
    },
    operator_config={
        'general_config': {
            'source_conn_id': 'production_postgres',
            'source_schema_name': 'public',
        },
        'tables': {
            'table_name': {},
            # ...
            # additional optional kwargs at the table level:
            #   columns_definition
            #   update_on_columns
            #   primary_key_column_name
            #   + any operator-specific arguments
        },
    },
)
```
Any kwarg set in the operator config's general_config can be overwritten for an individual table by supplying the same kwarg at the table level.
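For example, a sketch using the factory config from above: a `source_schema_name` set at the table level takes precedence over the one in `general_config` (the table names are illustrative):

```python
operator_config={
    'general_config': {
        'source_conn_id': 'production_postgres',
        'source_schema_name': 'public',
    },
    'tables': {
        'users': {},  # inherits source_schema_name='public'
        'transactions': {
            'source_schema_name': 'transaction_schema',  # overrides general_config
        },
    },
},
```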
Configure all DAGs in a single YAML file
Standard data loading DAGs should be just a configuration. Thus, you can configure the DAGs using a simple YAML file. Your `dags.py` file in your `$AIRFLOW_HOME/dags` folder may then look like this, and nothing more:

```python
import os

from airflow import DAG  # this module must be imported for airflow to see the DAGs
from airflow.configuration import conf

from ewah.dag_factories import dags_from_yml_file

folder = os.environ.get('AIRFLOW__CORE__DAGS_FOLDER', None)
folder = folder or conf.get("core", "dags_folder")

dags = dags_from_yml_file(folder + os.sep + 'dags.yml', True, True)
for dag in dags:  # must add the individual DAGs to the global namespace
    globals()[dag._dag_id] = dag
```
And the YAML file may look like this:
```yaml
---
base_config:  # applied to all DAGs unless overwritten
  dwh_engine: postgres
  dwh_conn_id: dwh
  airflow_conn_id: airflow
  start_date: 2019-10-23 00:00:00+00:00
  schedule_interval: !!python/object/apply:datetime.timedelta
    - 0     # days
    - 3600  # seconds
  schedule_interval_backfill: !!python/object/apply:datetime.timedelta
    - 7
  schedule_interval_future: !!python/object/apply:datetime.timedelta
    - 0
    - 3600
  additional_task_args:
    retries: 1
    retry_delay: !!python/object/apply:datetime.timedelta
      - 0
      - 300
    email_on_retry: False
    email_on_failure: True
    email: ['me+airflowerror@mail.com']
el_dags:
  EL_Production:  # equals the name of the DAG
    incremental: False
    el_operator: postgres
    target_schema_name: raw_production
    operator_config:
      general_config:
        source_conn_id: production_postgres
        source_schema_name: public
      tables:
        users:
          source_table_name: Users
        transactions:
          source_table_name: UserTransactions
          source_schema_name: transaction_schema  # overwrite general_config args as needed
  EL_Facebook:
    incremental: True
    el_operator: fb
    start_date: 2019-07-01 00:00:00+00:00
    target_schema_name: raw_facebook
    operator_config:
      general_config:
        source_conn_id: facebook
        account_ids:
          - 123
          - 987
        data_from: '{{ execution_date }}'  # some fields allow airflow templating, depending on the operator
        data_until: '{{ next_execution_date }}'
        level: ad
      tables:
        ads_data_age_gender:
          insight_fields:
            - adset_id
            - adset_name
            - campaign_name
            - campaign_id
            - spend
          breakdowns:
            - age
            - gender
...
```
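A note on the `!!python/object/apply:datetime.timedelta` tags: these are PyYAML-specific Python object tags, so the file must be parsed with a loader that permits them. A minimal sketch of how such a value resolves (assuming PyYAML; `dags_from_yml_file` handles the actual loading for you):

```python
# Sketch: how the timedelta tags in the YAML resolve, assuming PyYAML.
import datetime
import yaml

snippet = """
schedule_interval: !!python/object/apply:datetime.timedelta
- 0     # days
- 3600  # seconds
"""

# python/object tags are rejected by SafeLoader/FullLoader; the unsafe
# loader (trusted input only!) constructs the actual timedelta object.
cfg = yaml.load(snippet, Loader=yaml.UnsafeLoader)
assert cfg["schedule_interval"] == datetime.timedelta(days=0, seconds=3600)
```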
Using EWAH with Astronomer
To avoid devops trouble altogether, it is particularly easy to use EWAH with Astronomer. Your Astronomer project requires the following:
- add `ewah` to the `requirements.txt`
- add `libstdc++` to the `packages.txt`
- have a `dags.py` file and a `dags.yml` file in your dags folder
- in production, you may need to request your Airflow metadata Postgres database password from the support team for incremental loading DAGs