
Databricks utility to identify which column to use for z-ordering and partitioning.


Databricks optimize utility

Library that makes it easier to generate OPTIMIZE statements based on existing data.

Build from source

Building the wheel file is enough to run the library:

python -m build --wheel

Table optimizer

This object is initialized with a table name in the three-level namespace ("catalog.schema.table"). The pre_optimization method computes statistics to identify the best column for z-ordering (the column with the highest cardinality) and for re-partitioning (which depends on the size of the table and the size of the files in the generated partitions). These statistics are used to build and print the optimization query. To run the optimization itself there are two additional methods, run_optimize and run_partition. Both can be called on their own, without calling pre_optimization first; in that case each method computes the statistics and generates its statement itself.

from dbks_optimize.optimizer import TableOptimizer

# `spark` is the active SparkSession (predefined in Databricks notebooks)
opt = TableOptimizer(spark, "testing.bakehouse.sales_customers", force_partition_on_col="customerID")

opt.pre_optimization()  # print the optimization statement
opt.run_optimizer()  # run OPTIMIZE on the table
opt.run_partition()  # run the partitioning query (clones the data into a new table and overwrites the existing one)
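
The selection logic described above can be illustrated with a small standalone sketch. This is not the library's implementation: the function names (pick_zorder_column, build_optimize_statement, should_partition) are hypothetical, the distinct-value counts are made up for the example, and the 1 TiB / 1 GiB thresholds are assumptions borrowed from general Databricks partitioning guidance rather than from this package.

```python
from typing import Mapping, Optional

TIB = 1024 ** 4
GIB = 1024 ** 3


def pick_zorder_column(cardinalities: Mapping[str, int]) -> Optional[str]:
    """Return the column with the most distinct values, or None if there are no columns."""
    if not cardinalities:
        return None
    # Iterate over sorted names so that ties are broken alphabetically.
    return max(sorted(cardinalities), key=lambda col: cardinalities[col])


def build_optimize_statement(table: str, zorder_col: Optional[str]) -> str:
    """Build a Databricks SQL OPTIMIZE statement, with ZORDER BY when a column is given."""
    if zorder_col is None:
        return f"OPTIMIZE {table}"
    return f"OPTIMIZE {table} ZORDER BY ({zorder_col})"


def should_partition(
    table_bytes: int,
    partition_count: int,
    min_table_bytes: int = TIB,      # assumed threshold, not taken from the library
    min_partition_bytes: int = GIB,  # assumed threshold, not taken from the library
) -> bool:
    """Heuristic: partition only large tables whose partitions would stay reasonably big."""
    if partition_count <= 0:
        return False
    average_partition_bytes = table_bytes / partition_count
    return table_bytes >= min_table_bytes and average_partition_bytes >= min_partition_bytes


# Hypothetical distinct-value counts for three columns of one table.
stats = {"customerID": 300, "state": 12, "gender": 3}
column = pick_zorder_column(stats)
print(build_optimize_statement("testing.bakehouse.sales_customers", column))
```

In a real workspace the cardinalities would come from the table itself (for example from an approximate distinct count per column), and the partitioning decision would use the table size reported by the metastore.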

Schema optimizer

This object leverages the existing single-table class and iterates over the list of tables available in the schema, computing statistics for each table. pre_optimization generates the optimize statements, while run_db_optimization runs the actual optimization.

from dbks_optimize.optimizer import SchemaOptimizer

opt = SchemaOptimizer(spark, "testing.bakehouse")

opt.pre_optimization()

opt.run_db_optimization()
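
The schema-level iteration can be pictured as a loop that applies a per-table step to every table name. The sketch below is only an illustration under stated assumptions: schema_statements and make_statement are hypothetical names, the table list is hard-coded for the example, and the per-table step is a stand-in for whatever the single-table class does.

```python
from typing import Callable, Dict, Iterable


def schema_statements(
    schema: str,
    table_names: Iterable[str],
    make_statement: Callable[[str], str],
) -> Dict[str, str]:
    """Map each fully qualified table name in a schema to its generated statement."""
    statements: Dict[str, str] = {}
    for name in table_names:
        full_name = f"{schema}.{name}"  # e.g. "catalog.schema.table"
        statements[full_name] = make_statement(full_name)
    return statements


# Hypothetical table list; on Databricks it could come from `SHOW TABLES IN <schema>`.
tables = ["sales_customers", "sales_transactions"]
result = schema_statements("testing.bakehouse", tables, lambda t: f"OPTIMIZE {t}")
for statement in result.values():
    print(statement)
```

Passing the per-table step as a callable keeps the loop independent of how the statistics are computed, so the same loop works for both statement generation and execution.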

Catalog optimizer

Using the same logic, this object iterates over all schemas in the catalog and, for each one, over all of its tables to generate the statements.

from dbks_optimize.optimizer import CatalogOptimizer

opt = CatalogOptimizer(spark, "testing")

opt.pre_optimization()

opt.run_catalog_optimization()
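
The nested traversal (catalog, then schemas, then tables) can be sketched as a generator that yields three-level table names. This is a minimal illustration, not the library's code: iter_catalog_tables is a hypothetical helper, and the catalog layout is supplied as a plain mapping instead of being read from the metastore.

```python
from typing import Iterator, Mapping, Sequence


def iter_catalog_tables(
    catalog: str,
    schemas: Mapping[str, Sequence[str]],
) -> Iterator[str]:
    """Yield "catalog.schema.table" for every table, visiting schemas in sorted order."""
    for schema in sorted(schemas):
        for table in schemas[schema]:
            yield f"{catalog}.{schema}.{table}"


# Hypothetical layout; on Databricks the schema names could come from `SHOW SCHEMAS IN <catalog>`.
layout = {"bakehouse": ["sales_customers"], "archive": ["old_orders", "old_customers"]}
for full_name in iter_catalog_tables("testing", layout):
    print(full_name)
```

Each yielded name has the form expected by the single-table optimizer, so the catalog level adds only the outer loop.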

