
This package aims to provide features for working with Delta Lake.

Project description

DTSTOOLS: Help with your Delta Lake

Library created by Data In Action

#####    ######    #####   ######     ###      ###    ####      #####   
 ## ##     ##     ##   ##    ##      ## ##    ## ##    ##      ##   ##  
 ##  ##    ##     ##         ##     ##   ##  ##   ##   ##      ##       
 ##  ##    ##      #####     ##     ##   ##  ##   ##   ##       #####   
 ##  ##    ##          ##    ##     ##   ##  ##   ##   ##           ##  
 ## ##     ##     ##   ##    ##      ## ##    ## ##    ##      ##   ##  
#####      ##      #####     ##       ###      ###    #######   #####   

Table of Contents

Introduction

Version  Date        Description
v0.0.4   2023-06-17  Basic features
v0.0.9   2023-07-06  Introduce lastMaintenance function

This package aims to provide functionality to work with Delta Lake.

It makes it easy to see the actual size (in storage) of your Delta tables and to run their maintenance (VACUUM and OPTIMIZE).

Below are the steps, in order of execution:

  1. A DESCRIBE DETAIL is executed to get the table's current location and size
  2. A recursive scan (dbutils.fs.ls) of the storage folders calculates the space actually used, excluding the _delta_log folder; from this we can estimate how much space a VACUUM could release
  3. A DataFrame with the results is returned
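The recursive scan in step 2 can be sketched in plain Python. The storage listing is abstracted behind a callable (`list_dir`, a hypothetical stand-in for `dbutils.fs.ls`; the entry shape is an assumption, not the package's actual code), and `_delta_log` folders are skipped so the total reflects only data files:

```python
from typing import Callable, List, Tuple

# Each entry: (path, is_dir, size_in_bytes). On Databricks this information
# would come from dbutils.fs.ls(); here list_dir is an injectable stand-in.
Entry = Tuple[str, bool, int]

def storage_size(path: str, list_dir: Callable[[str], List[Entry]]) -> int:
    """Recursively sum file sizes under `path`, excluding _delta_log."""
    total = 0
    for entry_path, is_dir, size in list_dir(path):
        if is_dir:
            if entry_path.rstrip("/").endswith("_delta_log"):
                continue  # the transaction log is not table data
            total += storage_size(entry_path, list_dir)
        else:
            total += size
    return total
```

Comparing this storage total with the size reported by DESCRIBE DETAIL gives an estimate of how much space a VACUUM could reclaim.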

How to use dtstools

First, install the package from PyPI:

pip install --upgrade dtstools

Import the dtsTable module into your context

from dtstools import dtsTable

Use the function dtsTable.Help() to see usage examples and a summary of each function.

dtsTable.Help()

How to use tableSize

Find out the true size of your table

Call the tableSize function, passing the database and table names; use display() to see the results.

This function returns a Dataframe.

dtsTable.tableSize(databaseName,tableName).display()

Save the result in a Delta Table for monitoring and baseline

dtsTable.tableSize(databaseName,tableName) \
 .write.format('delta') \
 .mode('append') \
 .saveAsTable("db_demo.tableSize",path='abfss://container@storage.dfs.core.windows.net/bronze/tableSize')

Get the size of all tables in your database

for tb in spark.sql("show tables from db_demo").collect():
    try:
        print(">> Collecting data... Table:", tb.tableName)
        dtsTable.tableSize(tb.database, tb.tableName) \
            .write.format('delta') \
            .mode('append') \
            .saveAsTable("db_demo.tableSize", path='abfss://container@storage.dfs.core.windows.net/bronze/tableSize')
    except Exception as e:
        print(f"###### Error loading table {tb.tableName} - {e} ######")

How to use tableMaintenance

Apply maintenance (OPTIMIZE and VACUUM) to a table.

Parameter        Description                                                          Type
schemaName       Name of the database where the table is created                      string
tableName        Name of the table to be maintained                                   string
vacuum           True: run VACUUM; False: skip VACUUM                                 bool
optimize         True: run OPTIMIZE; False: skip OPTIMIZE                             bool
zorderColumns    Comma-separated column list for ZORDER (applied when optimize=True)  string
vacuumRetention  Retention period, in hours, passed to VACUUM                         int
debug            Only print the result on screen                                      bool
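The SQL that these parameters imply can be sketched as follows. This is a hypothetical helper written for illustration, not the package's actual implementation, and it assumes the standard Delta Lake OPTIMIZE/ZORDER/VACUUM syntax:

```python
from typing import List

def maintenance_sql(schema_name: str, table_name: str, vacuum: bool = True,
                    optimize: bool = True, zorder_columns: str = "none",
                    vacuum_retention: int = 168) -> List[str]:
    """Build the maintenance statements implied by the parameter table above."""
    table = f"{schema_name}.{table_name}"
    statements = []
    if optimize:
        stmt = f"OPTIMIZE {table}"
        if zorder_columns.lower() != "none":
            # ZORDER only applies together with OPTIMIZE
            stmt += f" ZORDER BY ({zorder_columns})"
        statements.append(stmt)
    if vacuum:
        statements.append(f"VACUUM {table} RETAIN {vacuum_retention} HOURS")
    return statements
```

Note the default retention of 168 hours (7 days): shorter retentions risk deleting files still needed by readers or by time travel.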

Apply to a single table:

dtsTable.tableMaintenance(schemaName="Database", tableName="tableName", zorderColumns='none', vacuumRetention=168, vacuum=True, optimize=True, debug=False)

Apply maintenance to all tables in your database:

for tb in spark.sql("show tables from db_demo").collect():
  dtsTable.tableMaintenance(schemaName=tb.database, tableName=tb.tableName, zorderColumns='none', vacuumRetention=168, vacuum=True, optimize=True, debug=False)

Last Maintenance

Use the function dtsTable.lastMaintenance() to see a summary of the last VACUUM and OPTIMIZE operations.

dtsTable.lastMaintenance('database','tableName')

Future implementations

  1. Use Unity Catalog
  2. Run for all databases
  3. Minimize costs with dbutils.fs.ls by looking directly into the transaction log
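Item 3 above refers to the fact that a Delta table's file inventory is already recorded in its transaction log, so sizes can be summed from the log's JSON actions instead of listing storage. A simplified sketch (it ignores checkpoint files and other action types, which a real implementation would also have to handle):

```python
import json
from typing import Iterable

def table_size_from_log(log_lines: Iterable[str]) -> int:
    """Sum live data size from Delta transaction-log actions.

    Each commit file holds one JSON action per line: 'add' actions reference
    data files that are part of the table, 'remove' actions mark files as
    no longer live (deletable by a later VACUUM).
    """
    live = 0
    for line in log_lines:
        action = json.loads(line)
        if "add" in action:
            live += action["add"]["size"]
        elif "remove" in action:
            live -= action["remove"].get("size", 0)
    return live
```

Because this reads a handful of small JSON files instead of listing every data file, it avoids the per-file storage transactions of the dbutils.fs.ls approach.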

Notes

  • For tables with many partitions, execution can take longer, so monitor the first runs closely. Use at your own risk: no risks have been mapped so far, but the scan can generate additional transactions against your storage account
  • Cost of Azure Storage transactions: read operations (per 10,000) - $0.0258, i.e. about 2.6 cents per 10,000 operations (estimated price as of 2023-04-21)

References

https://github.com/reginaldosilva27/dtstools

Author: Reginaldo Silva
