DiscoverX - Map and Search your Lakehouse
Project description
DiscoverX
Your Swiss-Army-knife for Lakehouse administration.
DiscoverX automates administration tasks that require inspecting or applying operations to a large number of Lakehouse assets.
Multi-table operations with SQL templates
You can execute a SQL template against multiple tables with `from_tables(...).with_sql(...)`. DiscoverX will concurrently execute the SQL template against all Delta tables matching the selection pattern and return a Spark DataFrame with the union of all results.
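As a minimal sketch of how a template is used (the catalog name `dev` below is hypothetical, and the final chained call is commented out because it needs a Databricks notebook with dbl-discoverx installed):

```python
# A SQL template using one of DiscoverX's placeholder variables
template = "DESCRIBE DETAIL {full_table_name}"

# DiscoverX renders the template once per matched table, e.g.:
rendered = template.format(full_table_name="dev.sales.orders")
# -> "DESCRIBE DETAIL dev.sales.orders"

# It then runs all rendered queries concurrently and unions the results:
# from discoverx import DX
# dx = DX()
# df = dx.from_tables("dev.*.*").with_sql(template).apply()
```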
Some useful SQL templates are:

- Describe details: `DESCRIBE DETAIL {full_table_name}`
- Show Delta history: `DESCRIBE HISTORY {full_table_name}`
- Deep clone: `CREATE TABLE IF NOT EXISTS {table_catalog}.{table_schema}_clone.{table_name} DEEP CLONE {full_table_name}`
- Create an empty copy: `CREATE TABLE IF NOT EXISTS {table_catalog}.{table_schema}.{table_name}_empty_copy LIKE {full_table_name}`
- Tag: `ALTER TABLE {full_table_name} SET TAGS ('tag_name' = 'tag_value')`
- Change owner: `ALTER TABLE {full_table_name} SET OWNER TO principal`
- Show partitions: `SHOW PARTITIONS {full_table_name}`
- Select a sample row as JSON from each table: `SELECT to_json(struct(*)) AS row FROM {full_table_name} LIMIT 1`
- Select all pivoted string columns: `SELECT {stack_string_columns} AS (column_name, string_value) FROM {full_table_name}`
- Select all pivoted columns cast to string: `SELECT {stack_all_columns_as_string} AS (column_name, string_value) FROM {full_table_name}`
- Apply liquid clustering: `ALTER TABLE {full_table_name} CLUSTER BY (column1, column2)`
- Vacuum: `VACUUM {full_table_name}`
- Optimize: `OPTIMIZE {full_table_name}`
The available variables to use in the SQL templates are:

- `{full_table_name}` - The full table name (catalog.schema.table)
- `{table_catalog}` - The catalog name
- `{table_schema}` - The schema name
- `{table_name}` - The table name
- `{stack_string_columns}` - A SQL expression ``stack(N, 'col1', `col1`, ... , 'colN', `colN`)`` for all N columns of type string
- `{stack_all_columns_as_string}` - A SQL expression ``stack(N, 'col1', cast(`col1` AS string), ... , 'colN', cast(`colN` AS string))`` for all N columns
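To illustrate the shape of the two `stack` variables (this is an illustration only, not DiscoverX internals; the column names are hypothetical):

```python
# How a {stack_string_columns} expression is shaped for three string columns
cols = ["name", "email", "city"]
stack_expr = "stack({}, {})".format(
    len(cols),
    ", ".join("'{0}', `{0}`".format(c) for c in cols),
)
# -> stack(3, 'name', `name`, 'email', `email`, 'city', `city`)
```

`{stack_all_columns_as_string}` follows the same pattern, with each backticked column wrapped in `cast(... AS string)`.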
A more advanced SQL example
You can filter to tables that contain a specific column name, and then use that column name in the queries.
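For example (a sketch: the column name `user_id` and catalog `dev` are hypothetical, and the chained call is commented out because it needs a Databricks session):

```python
# Tables without a `user_id` column are excluded by having_columns, so the
# column name can safely appear in the template itself.
template = "SELECT `user_id`, count(*) AS cnt FROM {full_table_name} GROUP BY `user_id`"

# df = (
#     dx.from_tables("dev.*.*")
#     .having_columns("user_id")
#     .with_sql(template)
#     .apply()
# )
```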
Multi-table operations with python functions
DiscoverX can concurrently apply Python functions to multiple assets.
The properties available in table_info are
- `catalog` - The catalog name
- `schema` - The schema name
- `table` - The table name
- `columns` - A list of `ColumnInfo`, with `name`, `data_type`, and `partition_index`
- `tags` - A list of `TagsInfo`, with `column_tags`, `table_tags`, `schema_tags`, and `catalog_tags`. Tags are only populated if the `from_tables(...)` operation is followed by `.with_tags(True)`
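A sketch of such a function, using a stand-in object with the properties listed above (the `SimpleNamespace` stand-in and the `.map(...)` call are illustrative; the real `table_info` comes from DiscoverX):

```python
from types import SimpleNamespace

# Hypothetical stand-in for DiscoverX's table_info
table_info = SimpleNamespace(
    catalog="dev",
    schema="sales",
    table="orders",
    columns=[SimpleNamespace(name="id", data_type="bigint", partition_index=None)],
)

def full_name(info):
    """A function of table_info that builds the full table name."""
    return f"{info.catalog}.{info.schema}.{info.table}"

full_name(table_info)  # -> 'dev.sales.orders'

# With DiscoverX, such a function is applied across all matched tables, e.g.:
# dx.from_tables("dev.*.*").map(full_name)
```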
Example Notebooks
- Maintenance
- VACUUM all tables (example notebook)
- Detect tables having too many small files (example notebook)
- Delta housekeeping analysis (example notebook), which provides:
- stats (size of tables and number of files, timestamps of latest OPTIMIZE & VACUUM operations, stats of OPTIMIZE)
- recommendations on tables that need to be OPTIMIZED/VACUUM'ed
- whether tables are OPTIMIZED/VACUUM'ed often enough
- tables that have small files / tables for which ZORDER is not being effective
- Deep clone a catalog (example notebook)
- Governance
- PII detection with Presidio (example notebook)
- Text Analysis with MosaicML and Databricks MLflow (example notebook)
- Text Analysis with OpenAI GPT (example notebook)
- GDPR right of access: extract user data from all tables at once
- GDPR right of erasure: delete user data from all tables at once
- Search in any column
- Update Owner of Data Objects (example notebook)
- Semantic classification
- Semantic classification of columns by semantic class: email, phone number, IP address, etc.
- Select data based on semantic classes
- Delete data based on semantic classes
- Custom
- Create Mlflow gateway routes for MosaicML and OpenAI (example notebook)
- Scan using User Specified Data Source Formats (example notebook)
Getting started
Install DiscoverX: in a Databricks notebook, run
%pip install dbl-discoverx
Get started
from discoverx import DX
dx = DX(locale="US")
You can now run operations across multiple tables.
Available functionality
The available `dx` functions are:

- `from_tables("<catalog>.<schema>.<table>")` selects tables based on the specified pattern (use `*` as a wildcard). Returns a `DataExplorer` object with the methods:
  - `having_columns` restricts the selection to tables that have the specified columns
  - `with_concurrency` defines how many queries are executed concurrently (10 by default)
  - `with_sql` applies a SQL template to all tables. After this command you can apply an action. See the in-depth documentation here.
  - `unpivot_string_columns` returns a melted (unpivoted) dataframe with all string columns from the selected tables. After this command you can apply an action
  - `scan` (experimental) scans the lakehouse with regex expressions defined by the rules to power the semantic classification
- `intro` gives an introduction to the library
- `scan` [deprecated] scans the lakehouse with regex expressions defined by the rules to power the semantic classification. Documentation
- `display_rules` shows the rules available for semantic classification
- `search` [deprecated] searches the lakehouse content by leveraging the semantic classes identified with scan (e.g. email, IP address, etc.). Documentation
- `select_by_class` [deprecated] selects data from the lakehouse content by semantic class. Documentation
- `delete_by_class` [deprecated] deletes from the lakehouse by semantic class. Documentation
from_tables Actions
After a with_sql or unpivot_string_columns command, you can apply the following actions:
- `explain` explains the queries that would be executed
- `display` executes the queries and shows the first 1000 rows of the result in a unioned dataframe
- `apply` returns a unioned dataframe with the result from the queries
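A sketch of how the three actions differ (the catalog `dev` is hypothetical, and the chained calls are commented out because they need a Databricks session):

```python
template = "SHOW TBLPROPERTIES {full_table_name}"

# q = dx.from_tables("dev.*.*").with_sql(template)
# q.explain()     # print the per-table queries without executing them
# q.display()     # execute and show the first 1000 rows of the unioned result
# df = q.apply()  # execute and return the unioned Spark DataFrame
```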
Requirements
Project Support
Please note that all projects in the /databrickslabs github account are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.
Any issues discovered through the use of this project should be filed as GitHub Issues on the Repo. They will be reviewed as time permits, but there are no formal SLAs for support.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file dbl_discoverx-0.0.9.tar.gz.
File metadata
- Download URL: dbl_discoverx-0.0.9.tar.gz
- Upload date:
- Size: 31.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | eaf9362fa3e109140a0cabbe3fa88f9773b27a82894e815bb4e30911b5e49dcf |
| MD5 | e51f3d37a3c56b4ada92b7c85e7928f3 |
| BLAKE2b-256 | 2a2f80387d102064458f6f7cf11d5c6b03f6772205f2555da0e840222e97db46 |
Provenance
The following attestation bundles were made for dbl_discoverx-0.0.9.tar.gz:
Publisher: pypi_release.yml on databrickslabs/discoverx

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dbl_discoverx-0.0.9.tar.gz
- Subject digest: eaf9362fa3e109140a0cabbe3fa88f9773b27a82894e815bb4e30911b5e49dcf
- Sigstore transparency entry: 206011871
- Sigstore integration time:
- Permalink: databrickslabs/discoverx@849c17d44c6b7e3db8b65550b60174e74b20d48f
- Branch / Tag: refs/tags/v0.0.9
- Owner: https://github.com/databrickslabs
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi_release.yml@849c17d44c6b7e3db8b65550b60174e74b20d48f
- Trigger Event: push
File details
Details for the file dbl_discoverx-0.0.9-py3-none-any.whl.
File metadata
- Download URL: dbl_discoverx-0.0.9-py3-none-any.whl
- Upload date:
- Size: 36.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b651f8808e91d5748570548a496dd8a0cb1aaf0f1d2d2eceb9155453b0d60213 |
| MD5 | 5fe79304069744b676fc1f8799963ead |
| BLAKE2b-256 | 2abe5e895916a8989b0d7afd3caa998eb1f47f84d66b7779746f248dfc89a745 |
Provenance
The following attestation bundles were made for dbl_discoverx-0.0.9-py3-none-any.whl:
Publisher: pypi_release.yml on databrickslabs/discoverx

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dbl_discoverx-0.0.9-py3-none-any.whl
- Subject digest: b651f8808e91d5748570548a496dd8a0cb1aaf0f1d2d2eceb9155453b0d60213
- Sigstore transparency entry: 206011873
- Sigstore integration time:
- Permalink: databrickslabs/discoverx@849c17d44c6b7e3db8b65550b60174e74b20d48f
- Branch / Tag: refs/tags/v0.0.9
- Owner: https://github.com/databrickslabs
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi_release.yml@849c17d44c6b7e3db8b65550b60174e74b20d48f
- Trigger Event: push