DataLake tables management bundle for the Bricksflow Framework
Datalake bundle
Table & schema management for your Databricks-based data lake (house).
Provides console commands to simplify table creation, update/migration and deletion.
Installation
Install the bundle via Poetry:
$ poetry add datalake-bundle
Add the bundle to your application's kernel to activate it:
from pyfony.kernel.BaseKernel import BaseKernel
from datalakebundle.DataLakeBundle import DataLakeBundle

class Kernel(BaseKernel):

    def _registerBundles(self):
        return [
            # ...
            DataLakeBundle(),
            # ...
        ]
Usage
Defining tables
Add the following configuration to your app:
parameters:
  datalakebundle:
    tables:
      customer.my_table:
        schemaPath: 'myapp.client.schema'
        targetPath: '/data/clients.delta'
      product.another_table:
        schemaPath: 'myapp.product.schema'
        targetPath: '/data/products.delta'
        partitionBy: ['date'] # optional table partitioning customization
Customizing table names
Table naming can be customized to match your company naming conventions.
By default, all tables are prefixed with the environment name (dev/test/prod/...):
parameters:
  datalakebundle:
    table:
      nameTemplate: '%kernel.environment%_{identifier}'
The {identifier} placeholder is resolved to the table identifier defined in the datalakebundle.tables configuration (see above).
By changing the nameTemplate option, you may add a prefix or suffix to the database name, the table name, or both:
parameters:
  datalakebundle:
    table:
      nameTemplate: '%kernel.environment%_{dbIdentifier}.tableprefix_{tableIdentifier}_tablesufix'
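To make the placeholder substitution concrete, here is a minimal sketch of how a nameTemplate could be expanded for one table identifier. The helper resolve_table_name is hypothetical and not part of the bundle. The sketch assumes that %kernel.environment% is replaced before the curly-brace placeholders are filled in, and that an identifier always has the form database.table.

```python
def resolve_table_name(name_template: str, identifier: str, environment: str) -> str:
    """Hypothetical helper: expand a nameTemplate for one table identifier."""
    # The identifier has the form "<database>.<table>", e.g. "customer.my_table".
    db_identifier, table_identifier = identifier.split(".", 1)

    # Assumption: %kernel.environment% is substituted first,
    # then the curly-brace placeholders are filled in.
    template = name_template.replace("%kernel.environment%", environment)

    return template.format(
        identifier=identifier,
        dbIdentifier=db_identifier,
        tableIdentifier=table_identifier,
    )


default_name = resolve_table_name(
    "%kernel.environment%_{identifier}", "customer.my_table", "dev"
)
custom_name = resolve_table_name(
    "%kernel.environment%_{dbIdentifier}.tableprefix_{tableIdentifier}_tablesufix",
    "customer.my_table",
    "dev",
)
print(default_name)  # dev_customer.my_table
print(custom_name)   # dev_customer.tableprefix_my_table_tablesufix
```

With the default template only the database part gets the environment prefix, because the prefix is placed in front of the whole identifier.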
Parsing fields from the table identifier
Sometimes your table names may contain additional flags to explicitly emphasize some meta-information about the data stored in that particular table.
Imagine that you have the following tables:
customer_e.my_table
product_p.another_table
The e/p suffixes indicate whether the table contains encrypted or plain data. What if we need to use that information in our code?
You may always define the attribute manually in the tables configuration:
parameters:
  datalakebundle:
    table:
      nameTemplate: '{identifier}'
    tables:
      customer_e.my_table:
        schemaPath: 'datalakebundle.test.TestSchema'
        encrypted: True
      product_p.another_table:
        schemaPath: 'datalakebundle.test.AnotherSchema'
        encrypted: False
If you don't want to duplicate the configuration, use the defaults config option to parse the encrypted/plain flag into a new boolean table attribute named encrypted:
parameters:
  datalakebundle:
    table:
      nameTemplate: '{identifier}'
      defaults:
        encrypted: !expr 'dbIdentifier[-1:] == "e"'
    tables:
      customer_e.my_table:
        schemaPath: 'datalakebundle.test.TestSchema'
      product_p.another_table:
        schemaPath: 'datalakebundle.test.AnotherSchema'
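The expression used for the encrypted default is ordinary Python slicing, so its result can be checked in isolation. The sketch below is only an illustration: parse_encrypted_flag is a hypothetical helper, and it assumes dbIdentifier is the part of the table identifier before the first dot.

```python
def parse_encrypted_flag(identifier: str) -> bool:
    """Hypothetical helper: evaluate the defaults expression for one identifier."""
    # Assumption: dbIdentifier is everything before the first dot.
    db_identifier, _table_identifier = identifier.split(".", 1)

    # Same expression as in the defaults configuration: the last character
    # of the database identifier decides the flag.
    return db_identifier[-1:] == "e"


flags = {
    identifier: parse_encrypted_flag(identifier)
    for identifier in ["customer_e.my_table", "product_p.another_table"]
}
print(flags)
```

Note that the expression only looks at the last character, so a database named warehouse would also be flagged as encrypted. A stricter expression such as dbIdentifier.endswith("_e") avoids that if your naming convention allows it.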
For more complex cases, you may also use a custom resolver to create a new table attribute:
from box import Box
from datalakebundle.table.identifier.ValueResolverInterface import ValueResolverInterface

class TargetPathResolver(ValueResolverInterface):

    def __init__(self, basePath: str):
        self.__basePath = basePath

    def resolve(self, rawTableConfig: Box):
        encryptedString = 'encrypted' if rawTableConfig.encrypted is True else 'plain'

        return self.__basePath + '/' + rawTableConfig.dbIdentifier + '/' + encryptedString + '/' + rawTableConfig.tableIdentifier + '.delta'

parameters:
  datalakebundle:
    table:
      nameTemplate: '{identifier}'
      defaults:
        targetPath:
          resolverClass: 'datalakebundle.test.TargetPathResolver'
          resolverArguments:
            - '%datalake.basePath%'
    tables:
      customer_e.my_table:
        schemaPath: 'datalakebundle.test.TestSchema'
      product_p.another_table:
        schemaPath: 'datalakebundle.test.AnotherSchema'
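To see which paths such a resolver would produce, the following sketch runs the same resolve logic outside the bundle. It is a stand-in, not the bundle's own code: the class does not extend ValueResolverInterface, types.SimpleNamespace replaces Box, and '/datalake' is a hypothetical value for %datalake.basePath%. It also assumes the encrypted attribute is already present on the raw table config, for example through the defaults expression shown earlier.

```python
from types import SimpleNamespace


class TargetPathResolver:
    """Stand-in mirroring the resolver above, without the bundle imports."""

    def __init__(self, basePath: str):
        self.__basePath = basePath

    def resolve(self, rawTableConfig):
        encryptedString = 'encrypted' if rawTableConfig.encrypted is True else 'plain'

        return (
            self.__basePath + '/' + rawTableConfig.dbIdentifier + '/'
            + encryptedString + '/' + rawTableConfig.tableIdentifier + '.delta'
        )


# Hypothetical base path standing in for %datalake.basePath%.
resolver = TargetPathResolver('/datalake')

# SimpleNamespace gives attribute access similar to Box for this sketch.
customerConfig = SimpleNamespace(
    dbIdentifier='customer_e', tableIdentifier='my_table', encrypted=True
)
productConfig = SimpleNamespace(
    dbIdentifier='product_p', tableIdentifier='another_table', encrypted=False
)

customerPath = resolver.resolve(customerConfig)
productPath = resolver.resolve(productConfig)
print(customerPath)  # /datalake/customer_e/encrypted/my_table.delta
print(productPath)   # /datalake/product_p/plain/another_table.delta
```

Because the check is `is True`, any value other than the boolean True (including a missing flag replaced by None) ends up in the plain directory.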
Console commands provided by this bundle
- datalake:table:create [table identifier] - Creates a metastore table based on its YAML definition (name, schema, data path, ...)
- datalake:table:delete [table identifier] - Deletes a metastore table, including its data on HDFS
- datalake:table:create-missing - Creates newly defined tables that do not exist in the metastore yet
- datalake:table:optimize-all - Runs the OPTIMIZE command on all defined tables (Delta only)
Hashes for datalake_bundle-0.4.0-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | effe7d58b899f92d2ef85cf4494bea2ed528d159120246a6e7d773d9122de98c |
MD5 | a8847cf2d698f372a616ca1da67e8d17 |
BLAKE2b-256 | a28addf541fccadf05696c10e72e74e487b52d798b44854fed35ecc9f9b16aa3 |