DiscoverX - Map and Search your Lakehouse

These details have not been verified by PyPI

Project links

Homepage

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

DiscoverX

Multi-table operations over the lakehouse.

Run a single command to execute operations across many tables. Eg.

Maintenance operations (vacuum, optimize, cleanup, etc.)
Scan & classify content for patterns (phone numbers, emails, etc.)
Data transformations based on semantic types

Getting started

Install DiscoverX, in Databricks notebook type

%pip install dbl-discoverx

Get started

from discoverx import DX
dx = DX(locale="US")

Maintenance operations

You can run arbitrary SQL operations on multiple tables. For example, to vacuum all the tables in "default" catalog:

dx.from_tables("default.*.*")\
  .apply_sql("VACUUM {full_table_name}")\
  .execute()

That will apply the SQL template VACUUM {full_table_name} to all tables matched by the pattern default.*.*.

The SQL template has the following variables available:

full_table_name - The table name in format catalog.schama.table
table_catalog - Name of the catalog
table_schema - Name of the schema
table_name - Name of the table

You can use the explain() command to see the SQL that would be executed.

dx.from_tables("default.*.*")\
  .apply_sql("VACUUM {full_table_name}")\
  .explain()

You can also filter tables that have a specific column name. Eg.

dx.from_tables("default.*.*")\
  .having_columns("device_id")\
  .apply_sql("OPTIMIZE {full_table_name} ZORDER BY (`device_id`)")\
  .execute()

Scan & classify

You can scan a sample of 10k rows from each table with

dx.scan(from_tables="*.*.*")

Check out the scan parameters for more details.

The scan result is a dataset with a score column, which defines the fraction of matched records against the total records scanned for each rule.

Available classes

The supported classes are:

IP v4 address
IP v6 address
Email address
URL
fqdn (Fully qualified domain name)
Credit card number
Credit card expiration date
Iso date
Iso date time
Mac address
Integer number as string
Decimal number as string

US locale specific classes

us_mailing_address
us_phone_number
us_social_security_number
us_state
us_state_abbreviation
us_zip_code

See the list of available classification rules with

dx.display_rules()

You can also provide your custom matching rules.

Save & Load the Scan Results

After a scan you can save the scan results in a delta table of your choice.

dx.save(full_table_name=<your-table-name>)

To load the saved results at a later time or in a different session use

dx.load(full_table_name=<your-table-name>)

Cross-table queries

After a scan you can leverage the classified column classes to run cross-table search, delete_by_class and select_by_classes actions.

Search

Search for a specific value across multiple tables.

dx.search("example_email@databricks.com", from_tables="*.*.*")

The search will automatically try to classify the search term and restrict the search to columns that match that rule classes.

You can also specify the classes where the search should be performed explicitly:

dx.search("example_email@databricks.com", from_tables="*.*.*", by_classes=["dx_email"])

If you want to limit the search to columns with a specific classification score you need to provide it as parameter, i.e.

dx.search("example_email@databricks.com", from_tables="*.*.*", min_score=0.95)

The score refers to the frequency of matching rules during the scan for the respective column.

Delete

Preview delete statements

dx.delete_by_class(from_tables="*.*.*", by_class="dx_email", values=['example_email@databricks.com'], yes_i_am_sure=False, min_score=0.95)

Execute delete statements

dx.delete_by_class(from_tables="*.*.*", by_class="dx_email", values=['example_email@databricks.com'], yes_i_am_sure=True, min_score=0.95)

Note: You need to regularly vacuum all your delta tables to remove all traces of your deleted rows.

Select

Select all columns classified with specified classes from multiple tables

dx.select_by_classes(from_tables="*.*.*", by_classes=["dx_iso_date", "dx_email"], min_score=None)

You can apply further transformations to build your summary tables. Eg. Count the occurrence of each IP address per day across multiple tables and columns

df = (dx.select_by_classes(from_tables="*.*.*", by_classes=["dx_iso_date", "dx_ip_v4"])
    .groupby(["table_catalog", "table_schema", "table_name", "classified_columns.dx_iso_date.column", "classified_columns.dx_iso_date.value", "classified_columns.dx_ip_v4.column"])
    .agg(func.count("classified_columns.dx_ip_v4.value").alias("count"))
)

Configuration

Scan parameters

You can define

dx.scan(
    from_tables="*.*.*", # Table pattern in form of <catalog>.<schema>.<table> ('*' is a wildcard)
    rules="*",           # Rule filter ('*' is a wildcard)
    sample_size=10000,   # Number of rows to sample, use None for a full table scan
    what_if=False        # If `True` it prints the SQL that would be executed
)

Custom rules

You can provide your custom scanning rules based on regex expressions.

from discoverx.rules import RegexRule
from discoverx import DX

custom_rules = [
  RegexRule(
    name = "resource_request_id",
    description = "Resource request ID",
    definition = r"^AR-\d{9}$",
    match_example = ["AR-123456789"],
    nomatch_example = ["R-123"],
  )
]

dx = DX(custom_rules=custom_rules)

You should now see your rules added to the default ones with

dx.display_rules()

Requirements

Project Support

Please note that all projects in the /databrickslabs github account are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.

Any issues discovered through the use of this project should be filed as GitHub Issues on the Repo. They will be reviewed as time permits, but there are no formal SLAs for support.

Project details

These details have not been verified by PyPI

Project links

Homepage

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Release history Release notifications | RSS feed

0.0.8

Jan 15, 2024

0.0.7

Nov 13, 2023

0.0.6

Oct 2, 2023

This version

0.0.5

Aug 29, 2023

0.0.4

Jul 28, 2023

0.0.3

Jul 5, 2023

0.0.2

Jun 27, 2023

0.0.1

Jun 20, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dbl_discoverx-0.0.5.tar.gz (24.3 kB view hashes)

Uploaded Aug 29, 2023 Source

Built Distribution

dbl_discoverx-0.0.5-py3-none-any.whl (26.5 kB view hashes)

Uploaded Aug 29, 2023 Python 3

Hashes for dbl_discoverx-0.0.5.tar.gz

Hashes for dbl_discoverx-0.0.5.tar.gz
Algorithm	Hash digest
SHA256	`2bfaa36e553c5db43f49cf8854597f52dee0deeb20e5640f1238043ad6fac136`
MD5	`cdfc80f0adde596011b2b50eb2366e65`
BLAKE2b-256	`6a22f1496a50690fbdad52c95e88c7baf830fcd53a415924f5a008c84fd05046`

Hashes for dbl_discoverx-0.0.5-py3-none-any.whl

Hashes for dbl_discoverx-0.0.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8a1b42aadeb41d731d4d1e08f6b8a4db27f206325607a39a9a189ee3a48165f5`
MD5	`23c21736e2cb6527f9d2161ddcbe4c7a`
BLAKE2b-256	`db3976d34cca80c69b16077a5a20d6e892f3db43df2c56efcfbb16264dace250`