DiscoverX - Map and Search your Lakehouse

DiscoverX

Multi-table operations over the lakehouse.

Multi-table operations

Run a single command to execute operations across many tables.

Operations examples

Operations are applied concurrently across multiple tables

Getting started

To install DiscoverX, run the following in a Databricks notebook:

%pip install dbl-discoverx

Get started

from discoverx import DX
dx = DX(locale="US")

You can now run operations across multiple tables.

For example, suppose you need one row from every table whose name contains "sample", in any schema of any catalog whose name starts with "dev_". The following code uses dx.from_tables to select those tables and with_sql to run a SQL query against each of them, returning the row as JSON:

dx.from_tables("dev_*.*.*sample*")\
  .with_sql("SELECT to_json(struct(*)) AS row FROM {full_table_name} LIMIT 1")\
  .apply()
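
To make the pattern and template semantics concrete, here is a minimal sketch in plain Python. It is not DiscoverX's implementation: it assumes that `*` behaves like a shell-style glob within each dot-separated part of the pattern and that `{full_table_name}` is replaced by the fully qualified table name. The table names and the helpers `matches` and `render_queries` are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical fully qualified table names (catalog.schema.table).
TABLES = [
    "dev_sales.core.orders_sample",
    "dev_sales.core.orders",
    "dev_hr.people.sample_employees",
    "prod_sales.core.orders_sample",
]

SQL_TEMPLATE = "SELECT to_json(struct(*)) AS row FROM {full_table_name} LIMIT 1"


def matches(pattern, full_table_name):
    """Match catalog, schema and table separately against the pattern parts."""
    pattern_parts = pattern.split(".")
    name_parts = full_table_name.split(".")
    if len(pattern_parts) != 3 or len(name_parts) != 3:
        return False
    return all(fnmatchcase(name, pat) for name, pat in zip(name_parts, pattern_parts))


def render_queries(pattern, tables, template):
    """Return one rendered SQL string per table that matches the pattern."""
    return [template.format(full_table_name=t) for t in tables if matches(pattern, t)]


queries = render_queries("dev_*.*.*sample*", TABLES, SQL_TEMPLATE)
for query in queries:
    print(query)
```

Only the two `dev_` tables with "sample" in their names produce a query; the `prod_` catalog and the plain `orders` table are skipped.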

Available functionality

The available dx functions are:

  • from_tables("<catalog>.<schema>.<table>") selects tables based on the specified pattern (use * as a wildcard). Returns a DataExplorer object with methods
    • having_columns restricts the selection to tables that have the specified columns
    • with_concurrency defines how many queries are executed concurrently (10 by default)
    • with_sql applies a SQL template to all tables. After this command you can apply an action. See in-depth documentation here.
    • unpivot_string_columns returns a melted (unpivoted) dataframe with all string columns from the selected tables. After this command you can apply an action
    • scan (experimental) scans the selected tables with the regular expressions defined by the rules; the results power the semantic classification.
  • intro gives an introduction to the library
  • scan scans the lakehouse with the regular expressions defined by the rules; the results power the semantic classification. Documentation
  • display_rules shows the rules available for semantic classification
  • search searches the lakehouse content by leveraging the semantic classes identified with scan (e.g. email, IP address). Documentation
  • select_by_class selects data from the lakehouse content by semantic class. Documentation
  • delete_by_class deletes from the lakehouse by semantic class. Documentation
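
The effect of with_concurrency can be pictured as a bounded worker pool in which at most N queries are in flight at once. The sketch below illustrates that idea with Python's standard ThreadPoolExecutor. The `run_query` and `run_all` functions and the table names are hypothetical stand-ins, and DiscoverX's own scheduling may differ.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

DEFAULT_CONCURRENCY = 10  # mirrors the documented default of 10

_lock = threading.Lock()
_in_flight = 0
peak_in_flight = 0  # highest number of queries observed running at once


def run_query(table):
    """Hypothetical stand-in for executing one query against one table."""
    global _in_flight, peak_in_flight
    with _lock:
        _in_flight += 1
        peak_in_flight = max(peak_in_flight, _in_flight)
    time.sleep(0.01)  # simulate query latency
    with _lock:
        _in_flight -= 1
    return (table, "ok")


def run_all(tables, concurrency=DEFAULT_CONCURRENCY):
    """Run run_query for every table with at most `concurrency` workers."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map returns results in the same order as the input tables.
        return list(pool.map(run_query, tables))


tables = [f"dev_catalog.schema.table_{i}" for i in range(25)]
results = run_all(tables, concurrency=4)
print(len(results), "queries finished; peak concurrency:", peak_in_flight)
```

A lower limit reduces load on the cluster at the cost of longer total run time; a higher limit does the opposite.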

from_tables Actions

After a with_sql or unpivot_string_columns command, you can apply the following actions:

  • explain explains the queries that would be executed
  • display executes the queries and shows the first 1000 rows of the result in a unioned dataframe
  • apply returns a unioned dataframe with the result from the queries
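
One way to picture the difference between these actions is a small plan object that holds the rendered queries. The sketch below is an illustration only: `QueryPlan` and `fake_runner` are hypothetical names, rows are plain dictionaries rather than Spark dataframes, and `explain` returns the queries instead of printing them.

```python
from itertools import chain, islice


class QueryPlan:
    """Hypothetical stand-in for the object you get after with_sql."""

    def __init__(self, queries, runner):
        self.queries = list(queries)
        self.runner = runner  # callable: SQL string -> list of rows

    def explain(self):
        # Dry run: report the SQL that would run, without executing anything.
        return list(self.queries)

    def apply(self):
        # Execute every query and concatenate (union) the resulting rows.
        return list(chain.from_iterable(self.runner(q) for q in self.queries))

    def display(self, limit=1000):
        # Execute the queries and show at most `limit` rows of the union.
        for row in islice(self.apply(), limit):
            print(row)


executed = []  # records which queries were actually run


def fake_runner(sql):
    executed.append(sql)
    return [{"query": sql, "row": "{}"}]


plan = QueryPlan(["SELECT 1 FROM a", "SELECT 1 FROM b"], fake_runner)
explained = plan.explain()
ran_during_explain = len(executed)  # nothing has been executed yet
rows = plan.apply()
```

Calling an explain-style action first is a cheap way to check which tables a pattern selects before running anything against them.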

Requirements

Project Support

Please note that all projects in the /databrickslabs GitHub account are provided for your exploration only and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS, and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.

Any issues discovered through the use of this project should be filed as GitHub Issues on the Repo. They will be reviewed as time permits, but there are no formal SLAs for support.
