DiscoverX - Map and Search your Lakehouse
Project description
DiscoverX
Multi-table operations over the lakehouse.
Run a single command to execute operations across many tables.
Operations examples
Operations are applied concurrently across multiple tables.

- Maintenance
  - VACUUM all tables (example notebook; see also the sketch after this list)
  - OPTIMIZE with z-order on tables having specified columns
  - Detect tables having too many small files (example notebook)
  - Visualise the quantity of data written per table per period
- Governance
  - PII detection with Presidio (example notebook)
  - Text Analysis with MosaicML and Databricks MLflow (example notebook)
  - Text Analysis with OpenAI GPT (example notebook)
  - GDPR right of access: extract user data from all tables at once
  - GDPR right of erasure: delete user data from all tables at once
  - Search in any column
  - Update Owner of Data Objects (example notebook)
- Semantic classification
  - Semantic classification of columns by semantic class: email, phone number, IP address, etc.
  - Select data based on semantic classes
  - Delete data based on semantic classes
- Custom
  - Arbitrary SQL template execution across multiple tables
  - Create MLflow gateway routes for MosaicML and OpenAI (example notebook)
  - Scan using user-specified data source formats (example notebook)
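As a minimal sketch of a maintenance operation, the snippet below runs VACUUM across all matching tables using the with_sql template mechanism documented under "Available functionality". The catalog pattern prod_* is an illustrative placeholder, not part of the library.

from discoverx import DX

dx = DX(locale="US")

# Run VACUUM on every table in catalogs starting with "prod_".
# {full_table_name} is substituted with each table's fully qualified name.
dx.from_tables("prod_*.*.*")\
  .with_sql("VACUUM {full_table_name}")\
  .apply()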
Getting started
To install DiscoverX, run the following in a Databricks notebook:
%pip install dbl-discoverx
Get started
from discoverx import DX
dx = DX(locale="US")
You can now run operations across multiple tables.
For example, suppose you want to retrieve a single row from each table in catalogs that start with "dev_" and whose names contain "sample". The following code uses the dx.from_tables function to select those tables and applies a SQL template that extracts each row as JSON:
dx.from_tables("dev_*.*.*sample*")\
.with_sql("SELECT to_json(struct(*)) AS row FROM {full_table_name} LIMIT 1")\
.apply()
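The apply() call at the end returns the results of all per-table queries as a single unioned dataframe, so this example yields one JSON-encoded row per matching table.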
Available functionality
The available dx functions are:

- from_tables("<catalog>.<schema>.<table>") selects tables based on the specified pattern (use * as a wildcard). Returns a DataExplorer object with the methods:
  - having_columns restricts the selection to tables that have the specified columns (see the sketch after this list)
  - with_concurrency defines how many queries are executed concurrently (10 by default)
  - with_sql applies a SQL template to all tables. After this command you can apply an action. See the in-depth documentation here.
  - unpivot_string_columns returns a melted (unpivoted) dataframe with all string columns from the selected tables. After this command you can apply an action.
  - scan (experimental) scans the lakehouse with regex expressions defined by the rules to power the semantic classification
- intro gives an introduction to the library
- scan scans the lakehouse with regex expressions defined by the rules to power the semantic classification. Documentation
- display_rules shows the rules available for semantic classification
- search searches the lakehouse content by leveraging the semantic classes identified with scan (e.g. email, IP address, etc.). Documentation
- select_by_class selects data from the lakehouse content by semantic class. Documentation
- delete_by_class deletes from the lakehouse by semantic class. Documentation
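A minimal sketch combining these selection methods, under the assumption that having_columns accepts column names as plain string arguments; the catalog pattern and the "email" column are illustrative only:

# Select tables that contain an "email" column, run at most 5 queries
# at a time, and return all string columns in melted (long) form.
dx.from_tables("prod_*.*.*")\
  .having_columns("email")\
  .with_concurrency(5)\
  .unpivot_string_columns()\
  .apply()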
from_tables Actions
After a with_sql or unpivot_string_columns command, you can apply the following actions:

- explain explains the queries that would be executed (see the sketch below)
- display executes the queries and shows the first 1000 rows of the result in a unioned dataframe
- apply returns a unioned dataframe with the result from the queries
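For instance, ending a chain with explain previews the per-table SQL without executing it; the count query below is a sketch, not part of the library:

# Print the queries that would run against each selected table.
dx.from_tables("dev_*.*.*sample*")\
  .with_sql("SELECT COUNT(*) AS n_rows FROM {full_table_name}")\
  .explain()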
Requirements
Project Support
Please note that all projects in the /databrickslabs github account are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.
Any issues discovered through the use of this project should be filed as GitHub Issues on the Repo. They will be reviewed as time permits, but there are no formal SLAs for support.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file dbl_discoverx-0.0.8.tar.gz.
File metadata
- Download URL: dbl_discoverx-0.0.8.tar.gz
- Upload date:
- Size: 26.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 79112410ed29fe6115a00ee06df07d20ffd269e43c06d293b0d5e4ee5510825f
MD5 | d930742c9b04e66dbf13b38b3460732e
BLAKE2b-256 | 316a190c2da9f842f7ba943eb0812f0f6d9a2ec0ff03390cf3c3012ef0daae90
File details
Details for the file dbl_discoverx-0.0.8-py3-none-any.whl.
File metadata
- Download URL: dbl_discoverx-0.0.8-py3-none-any.whl
- Upload date:
- Size: 31.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 3e441cf2cfe7707f0cb77adc883891640766a414b44673a55b84c3990a09f0e3
MD5 | 6a0a8cebd034eadf9a83ad5554db7b9f
BLAKE2b-256 | 8287a519791d2bf5803eef48cce0f3462ad20f31f988621600efff88e0a2b964