Databricks Configuration Framework

Project description

dbxconfig

Configuration framework for Databricks pipelines. Define your configuration and table dependencies in YAML config, then get the table mappings config model:

Define your tables:

landing:
  read:
    landing_dbx_patterns:
      customer_details_1: null
      customer_details_2: null

raw:
  delta_lake:
    raw_dbx_patterns:
      customers:
        ids: id
        depends_on:
          - landing.landing_dbx_patterns.customer_details_1
          - landing.landing_dbx_patterns.customer_details_2
        warning_thresholds:
          invalid_ratio: 0.1
          invalid_rows: 0
          max_rows: 100
          min_rows: 5
        exception_thresholds:
          invalid_ratio: 0.2
          invalid_rows: 2
          max_rows: 1000
          min_rows: 0
        custom_properties:
          process_group: 1

base:
  delta_lake:
    # delta table properties can be set at stage level or table level
    delta_properties:
      delta.appendOnly: true
      delta.autoOptimize.autoCompact: true    
      delta.autoOptimize.optimizeWrite: true  
      delta.enableChangeDataFeed: false
    base_dbx_patterns:
      customer_details_1:
        ids: id
        depends_on:
          - raw.raw_dbx_patterns.customers
        # delta table properties can be set at stage level or table level
        # table-level properties will override stage-level properties
        delta_properties:
          delta.enableChangeDataFeed: true
      customer_details_2:
        ids: id
        depends_on:
          - raw.raw_dbx_patterns.customers
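
The table-level override noted in the comments above amounts to a shallow merge of the stage-level and table-level property maps, with table-level values winning. A minimal sketch of that behaviour (illustrative only, not the library's implementation):

# Illustrative only: table-level delta_properties override stage-level ones.
stage_properties = {
    "delta.appendOnly": True,
    "delta.autoOptimize.autoCompact": True,
    "delta.autoOptimize.optimizeWrite": True,
    "delta.enableChangeDataFeed": False,
}
table_properties = {"delta.enableChangeDataFeed": True}

# table values win where keys overlap
effective_properties = {**stage_properties, **table_properties}
print(effective_properties["delta.enableChangeDataFeed"])  # True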

Define your load configuration:

tables: ./tables.yaml

landing:
  read:
    trigger: customerdetailscomplete-{{filename_date_format}}*.flg
    trigger_type: file
    database: landing_dbx_patterns
    table: "{{table}}"
    container: datalake
    root: "/mnt/{{container}}/data/landing/dbx_patterns/{{table}}/{{path_date_format}}"
    filename: "{{table}}-{{filename_date_format}}*.csv"
    filename_date_format: "%Y%m%d"
    path_date_format: "%Y%m%d"
    format: cloudFiles
    spark_schema: ../Schema/{{table.lower()}}.yaml
    options:
      # autoloader
      cloudFiles.format: csv
      cloudFiles.schemaLocation:  /mnt/{{container}}/checkpoint/{{checkpoint}}
      cloudFiles.useIncrementalListing: auto
      # schema
      inferSchema: false
      enforceSchema: true
      columnNameOfCorruptRecord: _corrupt_record
      # csv
      header: false
      mode: PERMISSIVE
      encoding: windows-1252
      delimiter: ","
      escape: '"'
      nullValue: ""
      quote: '"'
      emptyValue: ""
    

raw:
  delta_lake:
    # delta table properties can be set at stage level or table level
    delta_properties:
      delta.appendOnly: true
      delta.autoOptimize.autoCompact: true    
      delta.autoOptimize.optimizeWrite: true  
      delta.enableChangeDataFeed: false
    database: raw_dbx_patterns
    table: "{{table}}"
    container: datalake
    root: /mnt/{{container}}/data/raw
    path: "{{database}}/{{table}}"
    options:
      checkpointLocation: /mnt/{{container}}/checkpoint/{{database}}_{{table}}
      mergeSchema: true
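
The double-brace placeholders ({{table}}, {{container}}, {{path_date_format}}, and so on) are rendered from the table name and the timeslice when the configuration is loaded. As a rough illustration of that substitution (assumed behaviour, not the library's rendering code), the landing root and filename for customer_details_1 on 2023-01-31 would resolve roughly as follows:

# Assumed example of how the templated landing fields resolve for one
# table and one timeslice; the real rendering is done by the library.
from datetime import date

table = "customer_details_1"
container = "datalake"
slice_date = date(2023, 1, 31)

path_date = slice_date.strftime("%Y%m%d")   # path_date_format
file_date = slice_date.strftime("%Y%m%d")   # filename_date_format

root = f"/mnt/{container}/data/landing/dbx_patterns/{table}/{path_date}"
filename = f"{table}-{file_date}*.csv"

print(root)      # /mnt/datalake/data/landing/dbx_patterns/customer_details_1/20230131
print(filename)  # customer_details_1-20230131*.csv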

Import the config objects into your pipeline:

from dbxconfig import Config, Timeslice, StageType

# build path to configuration file
pattern = "auto_load_schema"
config_path = "../Config"

# create a timeslice object for slice loading. Use * for all time (supports hours, minutes, seconds and sub-seconds).
timeslice = Timeslice(day="*", month="*", year="*")

# parse and create the config object
config = Config(config_path=config_path, pattern=pattern)

# get the configuration for a table mapping to load.
table_mapping = config.get_table_mapping(
    timeslice=timeslice, 
    stage=StageType.raw, 
    table="customers"
)

print(table_mapping)
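
To load a single slice rather than all time, narrow the Timeslice to concrete values. The snippet below assumes concrete year/month/day values are accepted in the same way as the wildcards above:

# Assumed usage: a concrete timeslice instead of the "*" wildcards.
timeslice_20230131 = Timeslice(year=2023, month=1, day=31)

table_mapping_slice = config.get_table_mapping(
    timeslice=timeslice_20230131,
    stage=StageType.raw,
    table="customers",
)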

Development Setup

pip install -r requirements.txt

Unit Tests

To run the unit tests with a coverage report:

pip install -e .
pytest test/unit --junitxml=junit/test-results.xml --cov=dbxconfig --cov-report=xml --cov-report=html

Build

python setup.py sdist bdist_wheel

Publish

twine upload dist/*


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dbxconfig-5.0.7.tar.gz (17.9 kB)

Built Distribution

dbxconfig-5.0.7-py3-none-any.whl (22.9 kB)

File details

Details for the file dbxconfig-5.0.7.tar.gz.

File metadata

  • Download URL: dbxconfig-5.0.7.tar.gz
  • Upload date:
  • Size: 17.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.16

File hashes

Hashes for dbxconfig-5.0.7.tar.gz

  • SHA256: 737836a5f74c2d6f0d90e8ec7ebee86d63ce192c75d71b3d0b284f1e19333032
  • MD5: c1c3b22301a98c0b54a82a6a5aba40bd
  • BLAKE2b-256: 4a44fa05bcaa61a06202dd291a7a4aa6e0ca2be0864dad68e4999adc6336d61a

See more details on using hashes here.

File details

Details for the file dbxconfig-5.0.7-py3-none-any.whl.

File metadata

  • Download URL: dbxconfig-5.0.7-py3-none-any.whl
  • Upload date:
  • Size: 22.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.16

File hashes

Hashes for dbxconfig-5.0.7-py3-none-any.whl

  • SHA256: e70ee6e03988ee1d35bac861a6b286b34bb5ca9e752b4bef265a2ee10e96695e
  • MD5: b420e530d5bec4e58bd1c9a9a42e93b6
  • BLAKE2b-256: 49bb217355efa204694420a5d29129656d32e39147eaa485767d46c27c630a19

See more details on using hashes here.
