dbxconfig

A configuration framework for Databricks pipelines. Define your configuration and table dependencies in YAML, then retrieve the table mappings as a config model.
Define your tables:

```yaml
landing:
  landing_dbx_patterns:
    customer_details_1: null
    customer_details_2: null
raw:
  raw_dbx_patterns:
    customers:
      ids: id
      depends_on:
        - landing.landing_dbx_patterns.customer_details_1
        - landing.landing_dbx_patterns.customer_details_2
      warning_thresholds:
        invalid_ratio: 0.1
        invalid_rows: 0
        max_rows: 100
        min_rows: 5
      exception_thresholds:
        invalid_ratio: 0.2
        invalid_rows: 2
        max_rows: 1000
        min_rows: 0
base:
  base_dbx_patterns:
    customer_details_1:
      ids: id
      depends_on:
        - raw.raw_dbx_patterns.customers
    customer_details_2:
      ids: id
      depends_on:
        - raw.raw_dbx_patterns.customers
```
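The `depends_on` entries above describe a dependency graph between tables, which implies a load order: landing before raw, raw before base. A minimal sketch (not part of dbxconfig itself) of how that order falls out of the config, using the standard library's topological sorter on a dict that mirrors the YAML:

```python
from graphlib import TopologicalSorter

# fully qualified table name -> its depends_on list, as in the YAML above
depends_on = {
    "landing.landing_dbx_patterns.customer_details_1": [],
    "landing.landing_dbx_patterns.customer_details_2": [],
    "raw.raw_dbx_patterns.customers": [
        "landing.landing_dbx_patterns.customer_details_1",
        "landing.landing_dbx_patterns.customer_details_2",
    ],
    "base.base_dbx_patterns.customer_details_1": [
        "raw.raw_dbx_patterns.customers",
    ],
    "base.base_dbx_patterns.customer_details_2": [
        "raw.raw_dbx_patterns.customers",
    ],
}

# static_order() yields each table only after all of its dependencies,
# so landing tables come first, then raw, then base
load_order = list(TopologicalSorter(depends_on).static_order())
```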
Define your load configuration:

```yaml
tables: ./tables.yaml

landing:
  trigger: customerdetailscomplete-{{filename_date_format}}*.flg
  trigger_type: file
  database: landing_dbx_patterns
  table: "{{table}}"
  container: datalake
  root: "/mnt/{{container}}/data/landing/dbx_patterns/{{table}}/{{path_date_format}}"
  filename: "{{table}}-{{filename_date_format}}*.csv"
  filename_date_format: "%Y%m%d"
  path_date_format: "%Y%m%d"
  format: cloudFiles
  spark_schema: ../Schema/{{table.lower()}}.yaml
  options:
    # autoloader
    cloudFiles.format: csv
    cloudFiles.schemaLocation: "/mnt/{{container}}/checkpoint/{{checkpoint}}"
    cloudFiles.useIncrementalListing: auto
    # schema
    inferSchema: false
    enforceSchema: true
    columnNameOfCorruptRecord: _corrupt_record
    # csv
    header: false
    mode: PERMISSIVE
    encoding: windows-1252
    delimiter: ","
    escape: '"'
    nullValue: ""
    quote: '"'
    emptyValue: ""

raw:
  database: raw_dbx_patterns
  table: "{{table}}"
  container: datalake
  root: /mnt/{{container}}/data/raw
  path: "{{database}}/{{table}}"
  checkpoint_location: /mnt/{{container}}/checkpoint/{{checkpoint}}
  options:
    mergeSchema: true
```
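The `{{...}}` placeholders are rendered by the framework at load time. As an illustration only (this is not dbxconfig's rendering code), here is what the landing `root` and `filename` templates resolve to for one table and one load date, assuming straightforward substitution of `table`, `container`, and the two date formats:

```python
from datetime import date

# hypothetical inputs for one slice load
table = "customer_details_1"
container = "datalake"
load_date = date(2023, 1, 31)

# both date placeholders are "%Y%m%d" in the config above
path_date = load_date.strftime("%Y%m%d")
filename_date = load_date.strftime("%Y%m%d")

# rendered equivalents of the landing root and filename templates
root = f"/mnt/{container}/data/landing/dbx_patterns/{table}/{path_date}"
filename = f"{table}-{filename_date}*.csv"

# root     -> /mnt/datalake/data/landing/dbx_patterns/customer_details_1/20230131
# filename -> customer_details_1-20230131*.csv
```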
Import the config objects into your pipeline:

```python
from dbxconfig import Config, Timeslice, StageType

# build the path to the configuration files
pattern = "auto_load_schema"
config_path = "../Config"

# create a timeslice object for slice loading. Use * for all time
# (supports hours, minutes, seconds and sub-second).
timeslice = Timeslice(day="*", month="*", year="*")

# parse and create a config object
config = Config(config_path=config_path, pattern=pattern)

# get the configuration for a table mapping to load
table_mapping = config.get_table_mapping(
    timeslice=timeslice,
    stage=StageType.raw,
    table="customers"
)

print(table_mapping)
```
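The `warning_thresholds` and `exception_thresholds` in the tables config suggest row-count and invalid-row checks applied to each load. A hypothetical sketch of how such thresholds could be evaluated (the `check_thresholds` function is an assumption for illustration, not dbxconfig's implementation):

```python
def check_thresholds(total_rows: int, invalid_rows: int, thresholds: dict) -> bool:
    """Return True if the load breaches any threshold in `thresholds`."""
    ratio = invalid_rows / total_rows if total_rows else 0.0
    return (
        total_rows < thresholds["min_rows"]
        or total_rows > thresholds["max_rows"]
        or invalid_rows > thresholds["invalid_rows"]
        or ratio > thresholds["invalid_ratio"]
    )

# the customers thresholds from the tables config above
warning = {"invalid_ratio": 0.1, "invalid_rows": 0, "max_rows": 100, "min_rows": 5}
exception = {"invalid_ratio": 0.2, "invalid_rows": 2, "max_rows": 1000, "min_rows": 0}

# 50 rows with 1 invalid breaches the warning thresholds (invalid_rows > 0)
# but stays inside the exception thresholds
assert check_thresholds(50, 1, warning) is True
assert check_thresholds(50, 1, exception) is False
```

The two tiers make the intent of the config readable: a warning breach might log and continue, while an exception breach would fail the load.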
Development Setup

```bash
pip install -r requirements.txt
```

Unit Tests

To run the unit tests with a coverage report:

```bash
pip install -e .
pytest test/unit --junitxml=junit/test-results.xml --cov=dbxconfig --cov-report=xml --cov-report=html
```

Build

```bash
python setup.py sdist bdist_wheel
```

Publish

```bash
twine upload dist/*
```