
A data quality check module for Spark


dq_check

Overview

dq_check is a Python package that provides a data quality check function encapsulated in the DQCheck class. It allows you to perform data quality checks on tables using SQL queries and save the results into a Delta table for auditing purposes.

Features

  • Perform data quality checks on specified tables using SQL queries.
  • Save audit logs of data quality checks into a Delta table.
  • Handle aggregation checks and basic data quality metrics.
  • Support PySpark and Pandas integration.

Installation

You can install dq_check from PyPI using pip:

pip install dq_check

Usage

Here's an example of how to use the DQCheck class from the dq_check package:

from pyspark.sql import SparkSession

from dq_check import DQCheck

# Initialize Spark session
spark = SparkSession.builder.appName("DQCheckExample").getOrCreate()

# Create an instance of DQCheck
dq_checker = DQCheck(spark, audit_table_name)  # audit table name should include catalog and schema

spark (SparkSession): The Spark session.

audit_table_name (str): Default is audit_log. The name of the Delta table in which audit logs are stored; should include catalog and schema.

azure_sql_client: Default is None. Required for asql tables; create an azure_sql_client by passing a scope and secret to AzureSQLClient.

run_id: Default is -999. The run ID of the ADF pipeline.
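
For asql sources, the client is created first and passed to the DQCheck constructor. Below is a minimal sketch; the AzureSQLClient import path, its scope/secret argument names, and all values shown are assumptions, not confirmed package API:

from dq_check import DQCheck
from dq_check import AzureSQLClient  # assumed import path -- adjust to your installation

# Assumed signature: a secret scope and secret name holding the SQL credentials
azure_sql_client = AzureSQLClient(scope="my-secret-scope", secret="my-sql-secret")

dq_checker = DQCheck(
    spark,
    audit_table_name="catalog.schema.audit_log",  # must include catalog and schema
    azure_sql_client=azure_sql_client,            # needed only for table_type="asql"
    run_id=12345,                                 # hypothetical ADF pipeline run ID
)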

# Define the data quality check parameters
table_type = "delta"  # Type of the table ('delta' or 'asql')
table_name = "your_table_name"  # Should include catalog and schema for delta, and schema for asql
primary_keys = ["your_primary_key"]  # List of primary key columns
sql_query = "SELECT * FROM your_table WHERE condition"  # DQ check query; table name should include catalog and schema
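
The features list also mentions aggregation checks. As a purely illustrative example (the table and column names are placeholders), a duplicate-key check can be phrased as an aggregation query:

# Hypothetical aggregation check: return primary keys that appear more than once
sql_query = """
SELECT your_primary_key, COUNT(*) AS cnt
FROM catalog.schema.your_table
GROUP BY your_primary_key
HAVING COUNT(*) > 1
"""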

# Perform the data quality check
dq_checker.perform_dq_check(
    table_type,
    table_name,
    primary_keys,
    sql_query,
    where_clause=None,  # Optional where clause for sample data
    quality_threshold_percentage=5,  # Optional quality threshold percentage
    chunk_size=200,  # Optional chunk size for the primary-key list
)
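
Each check appends its outcome to the audit Delta table, so results can be reviewed with an ordinary Spark read. A minimal sketch, assuming the audit table name used above (the audit schema itself is defined by the package):

# Inspect the audit log written by perform_dq_check;
# the table name must match the audit_table_name given to DQCheck
audit_df = spark.table("catalog.schema.audit_log")
audit_df.show(truncate=False)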

Configuration

Adjust the parameters passed to the perform_dq_check method based on your requirements.
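
For example, to sample only recent rows and tolerate at most 1% failing records, the optional arguments can be tightened (same placeholders as above; the filter column is hypothetical):

# Tuned invocation: sample recent rows, allow at most 1% bad records
dq_checker.perform_dq_check(
    table_type,
    table_name,
    primary_keys,
    sql_query,
    where_clause="event_date >= '2024-01-01'",  # hypothetical sample filter
    quality_threshold_percentage=1,
    chunk_size=500,
)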

Dependencies

  • PySpark
  • Pandas

Contributing

Contributions are welcome! Please feel free to submit issues and pull requests on the GitHub repository.

License

No license is specified.

Contact

For questions or feedback, please open a GitHub issue.
