# DTSTOOLS: Help with your Delta Lake

Library created by Data In Action.
## Table of Contents

- Introduction
- How to use dtstools
- How to use tableSize
- How to use tableMaintenance
- How to use lastMaintenance
- Future implementations
- Notes
- References
## Introduction

version | date | description
---|---|---
v0.0.4 | 2023-06-17 | Basic features
v0.0.9 | 2023-07-06 | Introduce lastMaintenance function
This package aims to provide functionality for working with Delta Lake, making it easy to see the actual storage size of your Delta tables and to run maintenance on them (VACUUM and OPTIMIZE).

The steps below are performed, in order of execution (a minimal sketch follows the list):

- Runs DESCRIBE DETAIL to get the table's current location and snapshot size
- Recursively scans the storage folders with dbutils.fs.ls to calculate the space actually used in storage, excluding _delta_log; from this it estimates how much space a VACUUM could release
- Returns a DataFrame
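As a minimal sketch of that scan (an illustration of the approach, not the package's actual implementation), assuming a Databricks notebook where dbutils and spark are available and a hypothetical table db_demo.myTable:

```python
# Sketch only: illustrates the scan described above, not dtstools' real code.
# Assumes a Databricks notebook (dbutils and spark in scope); db_demo.myTable is hypothetical.

def folder_size(path: str) -> int:
    """Recursively sum file sizes under `path`, skipping the _delta_log folder."""
    total = 0
    for entry in dbutils.fs.ls(path):
        if "_delta_log" in entry.path:
            continue  # the transaction log is excluded from the storage total
        if entry.path.endswith("/"):  # directories are listed with a trailing slash
            total += folder_size(entry.path)
        else:
            total += entry.size
    return total

detail = spark.sql("DESCRIBE DETAIL db_demo.myTable").collect()[0]
snapshot_bytes = detail["sizeInBytes"]           # size of the current table snapshot
storage_bytes = folder_size(detail["location"])  # everything actually sitting in storage

# Files present in storage but no longer in the current snapshot are what VACUUM can reclaim.
print(f"Estimated reclaimable by VACUUM: {storage_bytes - snapshot_bytes} bytes")
```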
## How to use dtstools

First, install the package from PyPI:

```bash
pip install --upgrade dtstools
```
Import the dtsTable module into your context:

```python
from dtstools import dtsTable
```
Use dtsTable.Help() to see usage examples and a summary of each function:

```python
dtsTable.Help()
```
## How to use tableSize

Find out the true size of your table.

Call the tableSize function passing the database and table name, and use display() to see the results. This function returns a DataFrame.

```python
dtsTable.tableSize(databaseName, tableName).display()
```
Save the result in a Delta table for monitoring and as a baseline:

```python
dtsTable.tableSize(databaseName, tableName) \
    .write.format('delta') \
    .mode('append') \
    .saveAsTable("db_demo.tableSize", path='abfss://container@storage.dfs.core.windows.net/bronze/tableSize')
```
Get the size of all tables in your database:

```python
for tb in spark.sql("show tables from db_demo").collect():
    try:
        print(">> Collecting data... Table:", tb.tableName)
        dtsTable.tableSize(tb.database, tb.tableName) \
            .write.format('delta') \
            .mode('append') \
            .saveAsTable("db_demo.tableSize", path='abfss://container@storage.dfs.core.windows.net/bronze/tableSize')
    except Exception as e:
        print(f"###### Error loading table {tb.tableName} - {e} ######")
```
## How to use tableMaintenance

Apply maintenance (OPTIMIZE and VACUUM) to a table.

Parameter | Description | Type
---|---|---
schemaName | Name of the database where the table is created | string
tableName | Name of the table that maintenance will be applied to | string
vacuum | True: run VACUUM; False: skip VACUUM | bool
optimize | True: run OPTIMIZE; False: skip OPTIMIZE | bool
zorderColumns | If provided and optimize is True, Z-ORDER is applied to the list of columns, separated by commas (,) | string
vacuumRetention | Retention period, in hours, used when VACUUM runs | int
debug | Only print the result on screen | bool
Apply it to a single table:

```python
dtsTable.tableMaintenance(schemaName="Database", tableName="tableName", zorderColumns='none', vacuumRetention=168, vacuum=True, optimize=True, debug=False)
```
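To clarify what the parameters map to, a call like the one above presumably boils down to the equivalent of the following Spark SQL commands. This is a sketch of the underlying Delta commands, not the package's actual code; db_demo.myTable and the column id are hypothetical, with zorderColumns='id' instead of 'none':

```python
# Hypothetical equivalent of tableMaintenance for db_demo.myTable with zorderColumns='id'.
spark.sql("OPTIMIZE db_demo.myTable ZORDER BY (id)")  # optimize=True + zorderColumns
spark.sql("VACUUM db_demo.myTable RETAIN 168 HOURS")  # vacuum=True + vacuumRetention=168
```

With zorderColumns='none', the OPTIMIZE step would run without the ZORDER BY clause.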
Apply maintenance to all tables in your database:

```python
for tb in spark.sql("show tables from db_demo").collect():
    dtsTable.tableMaintenance(schemaName=tb.database, tableName=tb.tableName, zorderColumns='none', vacuumRetention=168, vacuum=True, optimize=True, debug=False)
```
## How to use lastMaintenance

Use dtsTable.lastMaintenance() to see a summary of the VACUUM and OPTIMIZE operations recorded for a table:

```python
dtsTable.lastMaintenance('database', 'tableName')
```
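If you want the raw data behind such a summary, the Delta table history exposes it. The query below is an assumption about the kind of data involved, not the package's implementation; db_demo.myTable is a placeholder:

```python
from pyspark.sql import functions as F

# Maintenance operations are recorded in the Delta transaction history.
history = spark.sql("DESCRIBE HISTORY db_demo.myTable")
(history
    .filter(F.col("operation").isin("OPTIMIZE", "VACUUM END"))
    .select("timestamp", "operation", "operationMetrics")
    .orderBy(F.col("timestamp").desc())
    .display())
```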
## Future implementations

- Support Unity Catalog
- Run for all databases
- Minimize dbutils.fs.ls costs by reading sizes directly from the transaction log
## Notes

- For tables with many partitions, execution can take longer, so monitor the first runs closely. Use it at your own responsibility: no risk has been mapped so far, but the recursive scan does generate extra transactions against your storage account (see the worked example below).
- Cost of Azure Storage transactions: read operations cost $0.0258 per 10,000 (roughly two and a half cents per 10,000 operations; estimated price as of 2023-04-21).
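As a rough worked example of that cost (the operation count below is made up for illustration):

```python
read_ops = 250_000     # hypothetical number of list/read operations issued by a scan
cost_per_10k = 0.0258  # USD per 10,000 read operations (rate quoted above)
print(f"Estimated cost: ${read_ops / 10_000 * cost_per_10k:.2f}")  # ~$0.65
```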
## References

- https://github.com/reginaldosilva27/dtstools

Author: Reginaldo Silva