Skip to main content

DiscoverX - Map and Search your Lakehouse

Project description

DiscoverX

Your Swiss-Army-knife for Lakehouse administration.

DiscoverX automates administration tasks that require inspecting or applying operations to a large number of Lakehouse assets.

Multi-table operations with SQL templates

You can execute a SQL template against multiple tables with

Multi-table operations with SQL template

DisocoverX will concurrently execute the SQL template against all Delta tables matching the selection pattern and return a Spark DataFrame with the union of all results.

Some useful SQL templates are

  • Describe details: DESCRIBE DETAIL {full_table_name}
  • Show delta history: SHOW HISTORY {full_table_name}
  • Deep clone: CREATE TABLE IF NOT EXISTS {table_catalog}.{table_schema}_clone.{table_name} DEEP CLONE {full_table_name}
  • Create an empty copy: CREATE TABLE IF NOT EXISTS {table_catalog}.{table_schema}.{table_name}_empty_copy LIKE {full_table_name}
  • Tag: ALTER TABLE {full_table_name} SET TAGS ('tag_name' = 'tag_value')
  • Change owner: ALTER TABLE {full_table_name} SET OWNER TO principal
  • Show partitions: SHOW PARTITIONS {full_table_name}
  • Select a sample row as joson from each table: SELECT to_json(struct(*)) AS row FROM {full_table_name} LIMIT 1
  • Select all pivoted string columns: SELECT {stack_string_columns} AS (column_name, string_value) FROM {full_table_name}
  • Select all pivoted columns casted to string: SELECT {stack_all_columns_as_string} AS (column_name, string_value) FROM {full_table_name}
  • Apply liquid clustering: ALTER TABLE {full_table_name} CLUSTER BY (column1, column2)
  • Vacuum: VACUUM {full_table_name}
  • Optimize: OPTIMIZE {full_table_name}

The available variables to use in the SQL templates are

  • {full_table_name} - The full table name (catalog.schema.table)
  • {table_catalog} - The catalog name
  • {table_schema} - The schema name
  • {table_name} - Teh table name
  • {stack_string_columns} - A SQL expression stack(N, 'col1', `col1`, ... , 'colN', `colN` ) for all N columns of type string
  • {stack_all_columns_as_string} - A SQL expression stack(N, 'col1', cast(`col1` AS string), ... , 'colN', cast(`colN` AS string) for all N columns

A more advanced SQL example

You can filter tables that only contain a specific column name, and them use the column name in the queries.

Multi-table operations with SQL template

Multi-table operations with python functions

DiscoverX can concurrently apply python funcitons to multiple assets

Multi-table operations with python functions

The properties available in table_info are

  • catalog - The catalog name
  • schema - The schema name
  • table - The table name
  • columns - A list of ColumnInfo, with name, data_type, and partition_index
  • tags - A list of TagsInfo, with column_tags, table_tags, schema_tags, and catalog_tags. Tags are only populated if the from_tables(...) operation is followed by .with_tags(True)

Example Notebooks

Getting started

Install DiscoverX, in Databricks notebook type

%pip install dbl-discoverx

Get started

from discoverx import DX
dx = DX(locale="US")

You can now run operations across multiple tables.

Available functionality

The available dx functions are

  • from_tables("<catalog>.<schema>.<table>") selects tables based on the specified pattern (use * as a wildcard). Returns a DataExplorer object with methods
    • having_columns restricts the selection to tables that have the specified columns
    • with_concurrency defines how many queries are executed concurrently (10 by defailt)
    • with_sql applies a SQL template to all tables. After this command you can apply an action. See in-depth documentation here.
    • unpivot_string_columns returns a melted (unpivoted) dataframe with all string columns from the selected tables. After this command you can apply an action
    • scan (experimental) scans the lakehouse with regex expressions defined by the rules and to power the semantic classification.
  • intro gives an introduction to the library
  • scan [deprecated] scans the lakehouse with regex expressions defined by the rules and to power the semantic classification. Documentation
  • display_rules shows the rules available for semantic classification
  • search [deprecated] searches the lakehouse content for by leveraging the semantic classes identified with scan (eg. email, ip address, etc.). Documentation
  • select_by_class [deprecated] selects data from the lakehouse content by semantic class. Documentation
  • delete_by_class [deprecated] deletes from the lakehouse by semantic class. Documentation

from_tables Actions

After a with_sql or unpivot_string_columns command, you can apply the following actions:

  • explain explains the queries that would be executed
  • display executes the queries and shows the first 1000 rows of the result in a unioned dataframe
  • apply returns a unioned dataframe with the result from the queries

Requirements

Project Support

Please note that all projects in the /databrickslabs github account are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.

Any issues discovered through the use of this project should be filed as GitHub Issues on the Repo. They will be reviewed as time permits, but there are no formal SLAs for support.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dbl_discoverx-0.0.9.tar.gz (31.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

dbl_discoverx-0.0.9-py3-none-any.whl (36.9 kB view details)

Uploaded Python 3

File details

Details for the file dbl_discoverx-0.0.9.tar.gz.

File metadata

  • Download URL: dbl_discoverx-0.0.9.tar.gz
  • Upload date:
  • Size: 31.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for dbl_discoverx-0.0.9.tar.gz
Algorithm Hash digest
SHA256 eaf9362fa3e109140a0cabbe3fa88f9773b27a82894e815bb4e30911b5e49dcf
MD5 e51f3d37a3c56b4ada92b7c85e7928f3
BLAKE2b-256 2a2f80387d102064458f6f7cf11d5c6b03f6772205f2555da0e840222e97db46

See more details on using hashes here.

Provenance

The following attestation bundles were made for dbl_discoverx-0.0.9.tar.gz:

Publisher: pypi_release.yml on databrickslabs/discoverx

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file dbl_discoverx-0.0.9-py3-none-any.whl.

File metadata

  • Download URL: dbl_discoverx-0.0.9-py3-none-any.whl
  • Upload date:
  • Size: 36.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for dbl_discoverx-0.0.9-py3-none-any.whl
Algorithm Hash digest
SHA256 b651f8808e91d5748570548a496dd8a0cb1aaf0f1d2d2eceb9155453b0d60213
MD5 5fe79304069744b676fc1f8799963ead
BLAKE2b-256 2abe5e895916a8989b0d7afd3caa998eb1f47f84d66b7779746f248dfc89a745

See more details on using hashes here.

Provenance

The following attestation bundles were made for dbl_discoverx-0.0.9-py3-none-any.whl:

Publisher: pypi_release.yml on databrickslabs/discoverx

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page