A library for data quality checks in Microsoft Fabric using Great Expectations

Project description

FabricDataGuard

FabricDataGuard is a Python library that simplifies data quality checks in Microsoft Fabric using Great Expectations. It provides an easy-to-use interface for data scientists and engineers to perform data quality checks without the need for extensive Great Expectations setup.

Purpose

The main purpose of FabricDataGuard is to:

  • Streamline the process of setting up and running data quality checks in Microsoft Fabric
  • Provide a wrapper around Great Expectations for easier integration with Fabric workflows
  • Enable quick and efficient data validation with minimal setup

Installation

To install FabricDataGuard, use pip:

pip install fabric-data-guard

Usage

Here's a basic example of how to use FabricDataGuard:

from fabric_data_guard import FabricDataGuard
import great_expectations as gx

# Initialize FabricDataGuard
fdg = FabricDataGuard(
    datasource_name="MyDataSourceName",
    data_asset_name="MyDataAssetName",
    #project_root_dir="/lakehouse/default/Files" # Optional parameter. Defaults to your lakehouse file store
)

# Define data quality checks
fdg.add_expectation([
    gx.expectations.ExpectColumnValuesToNotBeNull(column="UserId"),
    gx.expectations.ExpectColumnPairValuesAToBeGreaterThanB(
        column_A="UpdateDatetime", 
        column_B="CreationDatetime"
    ),
    gx.expectations.ExpectColumnValueLengthsToEqual(
        column="PostalCode", 
        value=5
    ),
])

# Read your data from the lakehouse as a PySpark DataFrame
df = spark.sql("SELECT * FROM MyLakehouseName.MyDataAssetName")

# Run validation
results = fdg.run_validation(df, unexpected_identifiers=['UserId'])
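For intuition, the three expectations above amount to row-level predicates. The following plain-Python sketch (illustrative only, not part of FabricDataGuard or Great Expectations, which evaluate these rules over a DataFrame) shows what each rule checks:

```python
# Illustrative only: plain-Python equivalents of the three expectations above.

def check_row(row: dict) -> list[str]:
    """Return the names of the rules this row violates."""
    failures = []
    if row.get("UserId") is None:  # ExpectColumnValuesToNotBeNull
        failures.append("UserId is null")
    if not row["UpdateDatetime"] > row["CreationDatetime"]:  # A > B
        failures.append("UpdateDatetime not after CreationDatetime")
    if len(row["PostalCode"]) != 5:  # value length == 5
        failures.append("PostalCode length != 5")
    return failures

rows = [
    {"UserId": 1, "UpdateDatetime": "2024-02-01",
     "CreationDatetime": "2024-01-01", "PostalCode": "75001"},
    {"UserId": None, "UpdateDatetime": "2024-01-01",
     "CreationDatetime": "2024-02-01", "PostalCode": "123"},
]
violations = {i: check_row(r) for i, r in enumerate(rows) if check_row(r)}
```

Here the first row passes all three rules, while the second fails all of them; FabricDataGuard additionally uses unexpected_identifiers (here UserId) to tell you which rows failed.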

Customizing Validation Run

The run_validation function accepts several keyword arguments that allow you to customize its behavior:

1. Display HTML Results:

results = fdg.run_validation(df, display_html=True)

Set display_html=False to suppress the HTML output (default is True).

2. Custom Target Table:

results = fdg.run_validation(df, table_name="MyCustomResultsTable")

Specify a custom name for the table where results will be stored.

3. Custom Workspace and Lakehouse:

results = fdg.run_validation(df, workspace_name="MyWorkspace", lakehouse_name="MyLakehouse")

By default, it uses the workspace and lakehouse attached to the running notebook. Use these parameters to specify different locations.

4. Notification Settings:

Below is an example. See checkpoint.py for the required arguments for your use case (Microsoft Teams, Slack, or email):

results = fdg.run_validation(df, 
                             slack_notification=True, 
                             slack_webhook="https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK",
                             email_notification=True,
                             email_to="user@example.com",
                             teams_notification=True,
                             teams_webhook="https://outlook.office.com/webhook/YOUR/TEAMS/WEBHOOK")
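Under the hood, a Slack notification boils down to POSTing a JSON payload to the incoming-webhook URL. The sketch below is illustrative only; the helper names and message format are assumptions, not FabricDataGuard's actual implementation (see checkpoint.py for that):

```python
import json
from urllib import request

def build_slack_payload(suite: str, success: bool, n_failed: int) -> dict:
    """Assemble a minimal Slack incoming-webhook message (hypothetical
    format; FabricDataGuard's checkpoint.py builds its own)."""
    status = "passed" if success else f"FAILED ({n_failed} expectations)"
    return {"text": f"Data quality run for '{suite}' {status}"}

def post_to_slack(webhook_url: str, payload: dict) -> None:
    """POST the JSON payload to a Slack incoming webhook."""
    req = request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    request.urlopen(req)  # network call; not executed in this sketch

payload = build_slack_payload("MyDataAssetName", success=False, n_failed=2)
```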

You can combine these options as needed:

results = fdg.run_validation(df, 
                             display_html=True,
                             table_name="MyCustomResultsTable",
                             workspace_name="MyWorkspace",
                             lakehouse_name="MyLakehouse",
                             slack_notification=True,
                             slack_webhook="https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK",
                             unexpected_identifiers=['UserId', 'TransactionId'])

This flexibility allows you to tailor the validation process to your specific needs and integrate it seamlessly with your existing data quality workflows.
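In a pipeline, you typically gate downstream steps on the validation outcome. Assuming the returned results expose an overall success flag, as Great Expectations validation results do (verify against the actual return type of run_validation), a minimal guard might look like:

```python
def fail_pipeline_on_errors(results) -> None:
    """Raise if validation failed, so an orchestrator marks the run as failed.
    Assumes `results` exposes a boolean `success` attribute or key, as Great
    Expectations validation results do; adapt to the actual return type."""
    success = getattr(results, "success", None)
    if success is None and isinstance(results, dict):
        success = results.get("success")
    if not success:
        raise RuntimeError("Data quality validation failed; aborting downstream steps")

# Example with a stand-in object; in a notebook you would pass the
# value returned by fdg.run_validation(df, ...).
class FakeResults:
    success = True

fail_pipeline_on_errors(FakeResults())  # no exception when checks succeed
```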

Contributing

Contributions to FabricDataGuard are welcome! If you'd like to contribute:

  1. Fork the repository
  2. Create a new branch for your feature
  3. Implement your changes
  4. Write or update tests as necessary
  5. Submit a pull request

Please ensure your code adheres to the project's coding standards and includes appropriate tests.

Project details


Download files

Download the file for your platform.

Source Distribution

fabric_data_guard-0.0.3.tar.gz (9.8 kB)

Uploaded Source

Built Distribution

fabric_data_guard-0.0.3-py3-none-any.whl (10.9 kB)

Uploaded Python 3

File details

Details for the file fabric_data_guard-0.0.3.tar.gz.

File metadata

  • Download URL: fabric_data_guard-0.0.3.tar.gz
  • Upload date:
  • Size: 9.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.10.2 Windows/10

File hashes

Hashes for fabric_data_guard-0.0.3.tar.gz
Algorithm Hash digest
SHA256 46feedd66650bb8f185482b575eecf291e7bbdd4f443f6fbf577444ff9f35c80
MD5 7d0e522f906ba95b66a14409df1e121e
BLAKE2b-256 475f690a5937909b51a95bdadc1478c85cfd809d7b4b4d3585826cbc426efad7


File details

Details for the file fabric_data_guard-0.0.3-py3-none-any.whl.

File metadata

File hashes

Hashes for fabric_data_guard-0.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 f96540f370e9bcaf1ff2dac695ed4dd5df6810cb8442177b32219af5d4bc7cdb
MD5 84035b3ef1e7ef154490dcda878845cc
BLAKE2b-256 2ef46353b124b5871f148e9d8ca1727f8b0a5122f8d8fc353f9ba4154ce9cc3e

