Databricks Configuration Framework
dbxconfig
Configuration framework for Databricks pipelines. Define your configuration and table dependencies in YAML, then retrieve the table mapping configuration model in your pipeline.
Define your tables:
landing:
  read:
    landing_dbx_patterns:
      customer_details_1: null
      customer_details_2: null

raw:
  delta_lake:
    raw_dbx_patterns:
      customers:
        ids: id
        depends_on:
          - landing.landing_dbx_patterns.customer_details_1
          - landing.landing_dbx_patterns.customer_details_2
        warning_thresholds:
          invalid_ratio: 0.1
          invalid_rows: 0
          max_rows: 100
          min_rows: 5
        exception_thresholds:
          invalid_ratio: 0.2
          invalid_rows: 2
          max_rows: 1000
          min_rows: 0
        custom_properties:
          process_group: 1

base:
  delta_lake:
    # delta table properties can be set at stage level or table level
    delta_properties:
      delta.appendOnly: true
      delta.autoOptimize.autoCompact: true
      delta.autoOptimize.optimizeWrite: true
      delta.enableChangeDataFeed: false
    base_dbx_patterns:
      customer_details_1:
        ids: id
        depends_on:
          - raw.raw_dbx_patterns.customers
        # delta table properties can be set at stage level or table level
        # table level properties will override stage level properties
        delta_properties:
          delta.enableChangeDataFeed: true
      customer_details_2:
        ids: id
        depends_on:
          - raw.raw_dbx_patterns.customers
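
The table definitions above carry two pieces of logic: `depends_on` implies a load order, and table-level `delta_properties` override stage-level ones. The following is a minimal sketch of both ideas, not dbxconfig's own implementation. It assumes the YAML has already been parsed into a plain dictionary (trimmed here to the keys that matter), and the helper names `iter_tables`, `load_order` and `effective_delta_properties` are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical in-memory form of the tables definition, as a YAML parser
# would return it, trimmed to the keys used in this sketch.
tables = {
    "landing": {"read": {"landing_dbx_patterns": {
        "customer_details_1": None,
        "customer_details_2": None,
    }}},
    "raw": {"delta_lake": {"raw_dbx_patterns": {
        "customers": {"depends_on": [
            "landing.landing_dbx_patterns.customer_details_1",
            "landing.landing_dbx_patterns.customer_details_2",
        ]},
    }}},
    "base": {"delta_lake": {
        "delta_properties": {
            "delta.appendOnly": True,
            "delta.enableChangeDataFeed": False,
        },
        "base_dbx_patterns": {
            "customer_details_1": {
                "depends_on": ["raw.raw_dbx_patterns.customers"],
                "delta_properties": {"delta.enableChangeDataFeed": True},
            },
            "customer_details_2": {
                "depends_on": ["raw.raw_dbx_patterns.customers"],
            },
        },
    }},
}


def iter_tables(tables):
    """Yield (stage, stage_settings, database, table, table_settings)."""
    for stage, table_types in tables.items():
        for stage_settings in table_types.values():
            for database, db_tables in stage_settings.items():
                if database == "delta_properties":
                    continue  # stage-level setting, not a database
                for table, settings in db_tables.items():
                    # tables declared as null have no settings
                    yield stage, stage_settings, database, table, settings or {}


def load_order(tables):
    """Return stage.database.table names with dependencies listed first."""
    graph = {
        f"{stage}.{database}.{table}": set(settings.get("depends_on", []))
        for stage, _, database, table, settings in iter_tables(tables)
    }
    return list(TopologicalSorter(graph).static_order())


def effective_delta_properties(tables, name):
    """Merge stage-level properties with table-level ones; the table wins."""
    for stage, stage_settings, database, table, settings in iter_tables(tables):
        if f"{stage}.{database}.{table}" == name:
            return {
                **stage_settings.get("delta_properties", {}),
                **settings.get("delta_properties", {}),
            }
    raise KeyError(name)


order = load_order(tables)
props = effective_delta_properties(
    tables, "base.base_dbx_patterns.customer_details_1"
)
print(order)
print(props)
```

In this sketch the landing tables sort before `raw.raw_dbx_patterns.customers`, which in turn sorts before both base tables, and `customer_details_1` ends up with `delta.enableChangeDataFeed` enabled while keeping the stage-level `delta.appendOnly` setting.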
Define your load configuration:
tables: ./tables.yaml

landing:
  read:
    trigger: customerdetailscomplete-{{filename_date_format}}*.flg
    trigger_type: file
    database: landing_dbx_patterns
    table: "{{table}}"
    container: datalake
    root: "/mnt/{{container}}/data/landing/dbx_patterns/{{table}}/{{path_date_format}}"
    filename: "{{table}}-{{filename_date_format}}*.csv"
    filename_date_format: "%Y%m%d"
    path_date_format: "%Y%m%d"
    format: cloudFiles
    spark_schema: ../Schema/{{table.lower()}}.yaml
    options:
      # autoloader
      cloudFiles.format: csv
      cloudFiles.schemaLocation: /mnt/{{container}}/checkpoint/{{checkpoint}}
      cloudFiles.useIncrementalListing: auto
      # schema
      inferSchema: false
      enforceSchema: true
      columnNameOfCorruptRecord: _corrupt_record
      # csv
      header: false
      mode: PERMISSIVE
      encoding: windows-1252
      delimiter: ","
      escape: '"'
      nullValue: ""
      quote: '"'
      emptyValue: ""

raw:
  delta_lake:
    # delta table properties can be set at stage level or table level
    delta_properties:
      delta.appendOnly: true
      delta.autoOptimize.autoCompact: true
      delta.autoOptimize.optimizeWrite: true
      delta.enableChangeDataFeed: false
    database: raw_dbx_patterns
    table: "{{table}}"
    container: datalake
    root: /mnt/{{container}}/data/raw
    path: "{{database}}/{{table}}"
    options:
      checkpointLocation: /mnt/{{container}}/checkpoint/{{database}}_{{table}}
      mergeSchema: true
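
The load configuration relies on `{{name}}` placeholders that are filled in per table and per date. As a rough illustration of what that substitution produces, the sketch below renders three of the templates above with the standard library only. It is an assumption-laden stand-in rather than the framework's renderer: the `render` helper and the slice date are hypothetical, and expressions such as `{{table.lower()}}` are deliberately left untouched because this sketch only handles plain names.

```python
import re
from datetime import datetime

# Matches simple placeholders such as {{table}}; expressions such as
# {{table.lower()}} are not matched and are left in place.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, values):
    """Replace each known {{name}} with its value; keep unknown ones as-is."""
    return PLACEHOLDER.sub(
        lambda match: str(values.get(match.group(1), match.group(0))),
        template,
    )


load_date = datetime(2023, 1, 15)  # hypothetical slice date
values = {
    "table": "customer_details_1",
    "container": "datalake",
    "database": "raw_dbx_patterns",
    # the two date formats in the configuration are both "%Y%m%d"
    "path_date_format": load_date.strftime("%Y%m%d"),
    "filename_date_format": load_date.strftime("%Y%m%d"),
}

root = render(
    "/mnt/{{container}}/data/landing/dbx_patterns/{{table}}/{{path_date_format}}",
    values,
)
filename = render("{{table}}-{{filename_date_format}}*.csv", values)
checkpoint = render(
    "/mnt/{{container}}/checkpoint/{{database}}_{{table}}", values
)

print(root)        # /mnt/datalake/data/landing/dbx_patterns/customer_details_1/20230115
print(filename)    # customer_details_1-20230115*.csv
print(checkpoint)  # /mnt/datalake/checkpoint/raw_dbx_patterns_customer_details_1
```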
Import the config objects into your pipeline:
from dbxconfig import Config, Timeslice, StageType

# build path to configuration file
pattern = "auto_load_schema"
config_path = "../Config"

# create a timeslice object for slice loading. Use * for all time (supports hrs, mins, seconds and sub-second).
timeslice = Timeslice(day="*", month="*", year="*")

# parse the configuration and create a config object
config = Config(config_path=config_path, pattern=pattern)

# get the configuration for a table mapping to load.
table_mapping = config.get_table_mapping(
    timeslice=timeslice,
    stage=StageType.raw,
    table="customers"
)

print(table_mapping)
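
To give a feel for how a wildcard timeslice could interact with the `%Y%m%d` date formats in the load configuration, here is a small standalone sketch. It does not use or reproduce the `Timeslice` class. The `format_slice` function and its behaviour (emitting `*` for wildcard parts so the result works as a glob pattern) are assumptions made for illustration only.

```python
import re


def format_slice(fmt, year="*", month="*", day="*"):
    """Fill %Y, %m and %d in fmt, writing * for any wildcard part."""
    parts = {"%Y": (year, 4), "%m": (month, 2), "%d": (day, 2)}
    result = fmt
    for directive, (value, width) in parts.items():
        text = "*" if value == "*" else str(value).zfill(width)
        result = result.replace(directive, text)
    # adjacent wildcards match the same files as a single one
    return re.sub(r"\*+", "*", result)


print(format_slice("%Y%m%d", year=2023, month=1, day=15))  # 20230115
print(format_slice("%Y%m%d", year=2023, month=1))          # 202301*
print(format_slice("%Y%m%d"))                              # *
```

Under these assumptions, a fully wildcarded slice such as the one in the pipeline example would match every dated folder or file, while fixing the year and month narrows the match to a single month.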
Development Setup
pip install -r requirements.txt
Unit Tests
Run the unit tests with a coverage report:
pip install -e .
pytest test/unit --junitxml=junit/test-results.xml --cov=dbxconfig --cov-report=xml --cov-report=html
Build
python setup.py sdist bdist_wheel
Publish
twine upload dist/*