DiscoverX - Map and Search your Lakehouse
Scan, Classify, and Discover the content of your Lakehouse
Requirements
Getting started
Install DiscoverX. In a Databricks notebook, run:
%pip install dbl-discoverx
Get started
from discoverx import DX
dx = DX(locale="US")
Scan & classify
You can now scan the content of any set of tables for:
- IP addresses (v4 and v6)
- Email addresses
- URLs
- ... and many more
See the full list of rules with
dx.display_rules()
You can also provide your own custom matching rules (see Custom rules below).
The scan will automatically classify columns.
Example
Scan all tables (samples 10k rows from each table):
dx.scan(from_tables="*.*.*")
Check out the scan parameters.
The result is a dataset with a score column, which gives, for each rule, the fraction of records that matched out of the total records scanned.
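For intuition, the score for a given column and rule is the fraction of sampled values that match the rule. A minimal, self-contained sketch in plain Python (the regex and sample values are illustrative only, not DiscoverX's actual email rule):

```python
import re

# Illustrative sample of values from a single column
values = ["alice@example.com", "bob@example.com", "not-an-email", "carol@example.com"]

# Simplified email pattern, for demonstration only
email_rule = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

matches = sum(1 for v in values if email_rule.match(v))
score = matches / len(values)  # 3 of 4 values match -> 0.75
```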
The full scan result can be displayed using
dx.scan_result
Save & Load the Scan Results
After a scan, you can save the scan results in a Delta table of your choice.
dx.save(full_table_name="<your-table-name>")
To load the saved results at a later time or in a different session, use
dx.load(full_table_name="<your-table-name>")
Cross-table queries
After a scan, you can leverage the classified column classes to run cross-table search, delete_by_class, and select_by_classes actions.
Search
Search for a specific value across multiple tables.
dx.search("example_email@databricks.com", from_tables="*.*.*")
The search will automatically try to classify the search term and restrict the search to columns classified with the matching rule's classes.
You can also specify the classes where the search should be performed explicitly:
dx.search("example_email@databricks.com", from_tables="*.*.*", by_classes=["dx_email"])
If you want to limit the search to columns with a minimum classification score, provide it as a parameter, e.g.
dx.search("example_email@databricks.com", from_tables="*.*.*", min_score=0.95)
The score is the fraction of rows in the respective column that matched the rule during the scan.
Delete
Preview delete statements
dx.delete_by_class(from_tables="*.*.*", by_class="dx_email", values=['example_email@databricks.com'], yes_i_am_sure=False, min_score=0.95)
Execute delete statements
dx.delete_by_class(from_tables="*.*.*", by_class="dx_email", values=['example_email@databricks.com'], yes_i_am_sure=True, min_score=0.95)
Note: you need to regularly VACUUM your Delta tables to remove all traces of deleted rows.
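For example, a hedged sketch of vacuuming one table (the table name is hypothetical; note that by default Delta Lake's VACUUM only removes files older than the 7-day retention threshold, so deleted data may linger until that window passes):

```python
# Build the VACUUM statement for a hypothetical Delta table.
# In a Databricks notebook you would execute it with spark.sql(vacuum_stmt).
table = "my_catalog.my_schema.my_table"  # hypothetical table name
vacuum_stmt = f"VACUUM {table}"
# spark.sql(vacuum_stmt)  # uncomment inside a Databricks notebook
```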
Select
Select all columns classified with specified classes from multiple tables
dx.select_by_classes(from_tables="*.*.*", by_classes=["dx_iso_date", "dx_email"], min_score=None)
You can apply further transformations to build your summary tables. E.g., count the occurrences of each IP address per day across multiple tables and columns:
from pyspark.sql import functions as func  # needed for func.count below

df = (dx.select_by_classes(from_tables="*.*.*", by_classes=["dx_iso_date", "dx_ip_v4"])
    .groupby(["table_catalog", "table_schema", "table_name", "classified_columns.dx_iso_date.column", "classified_columns.dx_iso_date.value", "classified_columns.dx_ip_v4.column"])
    .agg(func.count("classified_columns.dx_ip_v4.value").alias("count"))
)
Configuration
Scan parameters
You can define the following scan parameters:
dx.scan(
catalogs="*", # Catalog filter ('*' is a wildcard)
schemas="*", # Database filter ('*' is a wildcard)
tables="*", # Table filter ('*' is a wildcard)
rules="*", # Rule filter ('*' is a wildcard) or list[string]
sample_size=10000, # Number of rows to sample, use None for a full table scan
what_if=False # If `True` it prints the SQL that would be executed
)
Custom rules
You can provide your own custom scanning rules based on regular expressions.
custom_rules = [
{
'name': 'resource_request_id',
'type': 'regex',
'description': 'Resource request ID',
'definition': r'^AR-\d{9}$',
'match_example': ['AR-123456789'],
'nomatch_example': ['CD-123456789']
}
]
dx = DX(custom_rules=custom_rules)
Your custom rules should now appear alongside the default ones:
dx.display_rules()
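As a sanity check before registering a custom rule, you can verify the regex against the rule's own match/no-match examples with plain Python:

```python
import re

custom_rule = {
    "name": "resource_request_id",
    "definition": r"^AR-\d{9}$",
    "match_example": ["AR-123456789"],
    "nomatch_example": ["CD-123456789"],
}

pattern = re.compile(custom_rule["definition"])

# Every match example should match; every no-match example should not
assert all(pattern.fullmatch(v) for v in custom_rule["match_example"])
assert not any(pattern.fullmatch(v) for v in custom_rule["nomatch_example"])
```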
Project Support
Please note that all projects in the /databrickslabs github account are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.
Any issues discovered through the use of this project should be filed as GitHub Issues on the Repo. They will be reviewed as time permits, but there are no formal SLAs for support.