
yetl

yet (another spark) etl framework

Install

pip install yetl-framework

A configuration framework for Databricks pipelines. Define your configuration and table dependencies in YAML, then retrieve the table mapping config model in your pipeline code:

Define your tables:

landing: # this is the landing stage in the delta lakehouse
  read: # this is the type of spark asset that the pipeline needs to read
    landing_dbx_patterns:
      customer_details_1: null
      customer_details_2: null

raw: # this is the bronze stage in the delta lakehouse
  delta_lake: # this is the type of spark asset that the pipeline needs to read and write to
    raw_dbx_patterns: # this is the database name
      customers: # this is a table name and its properties
        ids: id
        depends_on:
          - landing.landing_dbx_patterns.customer_details_1
          - landing.landing_dbx_patterns.customer_details_2
        warning_thresholds:
          invalid_ratio: 0.1
          invalid_rows: 0
          max_rows: 100
          min_rows: 5
        exception_thresholds:
          invalid_ratio: 0.2
          invalid_rows: 2
          max_rows: 1000
          min_rows: 0
        custom_properties:
          process_group: 1

base: # this is the silver stage in the delta lakehouse
  delta_lake: # this is the type of spark asset that the pipeline needs to read and write to
    # delta table properties can be set at stage level or table level
    delta_properties:
      delta.appendOnly: true
      delta.autoOptimize.autoCompact: true    
      delta.autoOptimize.optimizeWrite: true  
      delta.enableChangeDataFeed: false
    base_dbx_patterns: # this is a database name
      customer_details_1: # this is a table name and its properties
        ids: id
        depends_on:
          - raw.raw_dbx_patterns.customers
        # delta table properties can be set at stage level or table level
        # table level properties will override stage level properties
        delta_properties:
            delta.enableChangeDataFeed: true
      customer_details_2: # this is a table name and its properties
        ids: id
        depends_on:
          - raw.raw_dbx_patterns.customers
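
The depends_on entries above form a dependency graph across the stages; yetl resolves these dependencies for you. Purely as an illustration of what is being declared, a standalone sketch (plain PyYAML, not part of yetl) can list the declared edges like this:

import yaml

# Standalone sketch: print each dependency edge declared in tables.yaml
# as "<stage>.<database>.<table> <- <dependency>".
with open("tables.yaml") as f:
    stages = yaml.safe_load(f)

for stage, assets in stages.items():
    for asset_type, databases in assets.items():
        for database, tables in databases.items():
            if not isinstance(tables, dict):  # skip stage-level delta_properties
                continue
            for table, props in tables.items():
                if not isinstance(props, dict):  # tables declared as null
                    continue
                for dep in props.get("depends_on", []):
                    print(f"{stage}.{database}.{table} <- {dep}")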

Define your load configuration:

version: 1.0.0
tables: ./tables.yaml

landing: # this is the landing stage in the delta lakehouse
  read: # this is the type of spark asset that the pipeline needs to read from
    trigger: customerdetailscomplete-{{filename_date_format}}*.flg
    trigger_type: file
    container: datalake
    root: "/mnt/{{container}}/data/landing/dbx_patterns/{{table}}/{{path_date_format}}"
    filename: "{{table}}-{{filename_date_format}}*.csv"
    filename_date_format: "%Y%m%d"
    path_date_format: "%Y%m%d"
    format: cloudFiles
    spark_schema: ../schema/{{table.lower()}}.yaml
    options:
      # autoloader
      cloudFiles.format: csv
      cloudFiles.schemaLocation:  /mnt/{{container}}/checkpoint/{{checkpoint}}
      cloudFiles.useIncrementalListing: auto
      # schema
      inferSchema: false
      enforceSchema: true
      columnNameOfCorruptRecord: _corrupt_record
      # csv
      header: false
      mode: PERMISSIVE
      encoding: windows-1252
      delimiter: ","
      escape: '"'
      nullValue: ""
      quote: '"'
      emptyValue: ""
    

raw: # this is the bronze stage in the delta lakehouse
  delta_lake: # this is the type of spark asset that the pipeline needs to read and write to
    # delta table properties can be set at stage level or table level
    delta_properties:
      delta.appendOnly: true
      delta.autoOptimize.autoCompact: true    
      delta.autoOptimize.optimizeWrite: true  
      delta.enableChangeDataFeed: false
    managed: false
    create_table: true
    container: datalake
    root: /mnt/{{container}}/data/raw
    path: "{{database}}/{{table}}"
    options:
      checkpointLocation: /mnt/{{container}}/checkpoint/{{database}}_{{table}}
      mergeSchema: true
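
The double-brace placeholders ({{container}}, {{database}}, {{table}} and the date formats) are substituted per table and per timeslice when yetl renders this configuration. As a rough illustration only, using plain Python string formatting rather than yetl's renderer, the landing root and filename for customer_details_1 on 2023-01-01 resolve along these lines:

from datetime import date

# Illustration of how the landing `root` and `filename` templates resolve.
# yetl does this substitution internally; "%Y%m%d" mirrors the
# path_date_format / filename_date_format settings above.
container = "datalake"
table = "customer_details_1"
timeslice = date(2023, 1, 1)

root = f"/mnt/{container}/data/landing/dbx_patterns/{table}/{timeslice:%Y%m%d}"
filename = f"{table}-{timeslice:%Y%m%d}*.csv"

print(root)      # /mnt/datalake/data/landing/dbx_patterns/customer_details_1/20230101
print(filename)  # customer_details_1-20230101*.csv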

Import the config objects into your pipeline:

from yetl import Config, Timeslice, StageType

pipeline = "auto_load_schema"
project = "test_project"
timeslice = Timeslice(day="*", month="*", year="*")
config = Config(
    project=project, pipeline=pipeline
)
table_mapping = config.get_table_mapping(
    timeslice=timeslice, stage=StageType.raw, table="customers"
)

print(table_mapping)
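
The wildcard Timeslice above matches every period. To target a single day, the same constructor takes concrete values in place of the "*" wildcard (a minimal sketch, assuming integer arguments are accepted as the signature suggests):

from yetl import Config, Timeslice, StageType

# Pin the load to one day rather than using "*" wildcards.
timeslice = Timeslice(year=2023, month=1, day=1)

config = Config(project="test_project", pipeline="auto_load_schema")
table_mapping = config.get_table_mapping(
    timeslice=timeslice, stage=StageType.raw, table="customers"
)
print(table_mapping)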

Or, with even less code, use the decorator pattern:

# assumption: yetl_flow and TableMapping are importable from the top-level yetl package
from yetl import yetl_flow, StageType, TableMapping


@yetl_flow(
    project="test_project",
    stage=StageType.raw
)
def auto_load_schema(table_mapping: TableMapping):
    # << ADD YOUR PIPELINE LOGIC HERE - USING TABLE MAPPING CONFIG >>
    return table_mapping  # return whatever you want here


result = auto_load_schema(table="customers")
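
Because the wrapped function simply returns the TableMapping it was given, result here is that mapping resolved for the customers table; in a real pipeline you would return whatever your load logic produces.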

Development Setup

pip install -r requirements.txt

Unit Tests

To run the unit tests with a coverage report:

pip install -e .
pytest test/unit --junitxml=junit/test-results.xml --cov=yetl --cov-report=xml --cov-report=html

Integration Tests

To run the integration tests with a coverage report:

pip install -e .
pytest test/integration --junitxml=junit/test-results.xml --cov=yetl --cov-report=xml --cov-report=html

Build

python setup.py sdist bdist_wheel

Publish

twine upload dist/*
