A data quality check module for Spark
Project description
dq_check
Overview
dq_check is a Python package that provides a data quality check function encapsulated in the DQCheck class. It lets you run data quality checks on tables using SQL queries and saves the results into a Delta table for auditing purposes.
Features
- Perform data quality checks on specified tables using SQL queries.
- Save audit logs of data quality checks into a Delta table.
- Handle aggregation checks and basic data quality metrics.
- Supports PySpark and Pandas integration.
Installation
You can install dq_check from PyPI using pip:

```bash
pip install dq_check
```
Usage
Here's an example of how to use the DQCheck class from the dq_check package:

```python
from pyspark.sql import SparkSession
from dq_check import DQCheck

# Initialize the Spark session
spark = SparkSession.builder.appName("DQCheckExample").getOrCreate()

# Create an instance of DQCheck
# audit_table_name should include catalog and schema
dq_checker = DQCheck(spark, audit_table_name)
```

Constructor parameters:

- spark (SparkSession): The Spark session.
- audit_table_name (str): The name of the Delta table that stores audit logs. Defaults to "audit_log".
- azure_sql_client: Required for ASQL tables; create it by providing a scope and secret to AzureSQLClient. Defaults to None.
- run_id: The run ID of the ADF pipeline. Defaults to -999.

```python
# Define the data quality check parameters
table_type = "delta"  # Type of the table ('delta' or 'asql')
table_name = "your_table_name"  # Include catalog and schema for Delta, schema for ASQL
primary_keys = ["your_primary_key"]  # List of primary key columns
sql_query = "SELECT * FROM your_table WHERE condition"  # Check query; reference the table with catalog and schema

# Perform the data quality check
dq_checker.perform_dq_check(
    table_type,
    table_name,
    primary_keys,
    sql_query,
    where_clause=None,  # Optional WHERE clause for sample data
    quality_threshold_percentage=5,  # Optional quality threshold percentage
    chunk_size=200,  # Optional chunk size for the primary-key list
)
```
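To illustrate how a threshold like quality_threshold_percentage is typically applied, here is a hypothetical sketch (not the package's actual implementation): a check passes when the share of failing rows stays at or below the threshold.

```python
def passes_quality_threshold(failed_rows, total_rows, quality_threshold_percentage=5):
    """Hypothetical illustration: the check passes when the percentage of
    failing rows is at or below the configured threshold."""
    if total_rows == 0:
        return True  # nothing to check, treat as passing
    failure_pct = 100.0 * failed_rows / total_rows
    return failure_pct <= quality_threshold_percentage

# 4 failing rows out of 100 is 4% <= 5%, so the check passes
print(passes_quality_threshold(4, 100))  # True
# 6 failing rows out of 100 is 6% > 5%, so the check fails
print(passes_quality_threshold(6, 100))  # False
```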
Configuration
Adjust the parameters passed to the perform_dq_check method based on your requirements.
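For instance, the chunk_size parameter implies the primary-key list is processed in batches. A minimal illustration of such chunking (hypothetical, not the package's internal code):

```python
def chunk_primary_keys(primary_keys, chunk_size=200):
    """Split a list of primary-key values into fixed-size batches."""
    return [primary_keys[i:i + chunk_size]
            for i in range(0, len(primary_keys), chunk_size)]

# Five keys with chunk_size=2 yield batches of 2, 2, and 1
print(chunk_primary_keys(["k1", "k2", "k3", "k4", "k5"], chunk_size=2))
# [['k1', 'k2'], ['k3', 'k4'], ['k5']]
```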
Dependencies
- PySpark
- Pandas
Contributing
Contributions are welcome! Please feel free to submit issues and pull requests on the GitHub repository.
License
No license specified.
Contact
For any questions or feedback, please open a GitHub issue.
Project details
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file dq_check-0.3.3.tar.gz.
File metadata
- Download URL: dq_check-0.3.3.tar.gz
- Upload date:
- Size: 7.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.9.12
File hashes
Algorithm | Hash digest
---|---
SHA256 | f8b788e3137c0fa8dd0456d185d86cde921d28c6d61e5ade59f43d29eca675f1
MD5 | 3be0e02145c7bc290bbc191d8cb6225b
BLAKE2b-256 | 6da2a3f605d93692fb82ac5efa6668dd13cd793e449220fc4c18137c45906543
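To verify a downloaded file against the SHA256 digest published above, you can compute its hash locally with the Python standard library (a small sketch; the filename is the one listed on this page):

```python
import hashlib

def sha256_of_file(path, chunk_size=8192):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

# Compare against the published digest, e.g.:
# expected = "f8b788e3137c0fa8dd0456d185d86cde921d28c6d61e5ade59f43d29eca675f1"
# assert sha256_of_file("dq_check-0.3.3.tar.gz") == expected
```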
File details
Details for the file dq_check-0.3.3-py3-none-any.whl.
File metadata
- Download URL: dq_check-0.3.3-py3-none-any.whl
- Upload date:
- Size: 6.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.9.12
File hashes
Algorithm | Hash digest
---|---
SHA256 | 24cc461e10a1c8959761ebbcd69ac76a5fc187727472051f81e20f039e4299c2
MD5 | 452ec5b7cde307321e28ef544bef7e0d
BLAKE2b-256 | 475599e41f0a3812b5aa545b2dcaabb461381a80501108173dff58c85672b932