A data quality check module for Spark
dq_check
Overview
dq_check is a Python package that provides a data quality check function encapsulated in the DQCheck class. It allows you to perform data quality checks on tables using SQL queries and save the results into a Delta table for auditing purposes.
Features
- Perform data quality checks on specified tables using SQL queries.
- Save audit logs of data quality checks into a Delta table.
- Handle aggregation checks and basic data quality metrics.
- Support PySpark and Pandas integration.
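As a rough illustration of the SQL-driven check described in the features above, the sketch below uses Python's built-in sqlite3 module as a stand-in for Spark. The orders table, the null-amount rule, and the helper name failure_percentage are hypothetical. It also assumes that a check fails when the share of rows returned by the query exceeds the threshold, which is a guess about the intent of quality_threshold_percentage and not documented behaviour of dq_check.

```python
import sqlite3


def failure_percentage(conn, table_name, check_query):
    """Return the percentage of rows in table_name that check_query returns.

    Hypothetical helper: check_query is assumed to select the rows that
    violate a data quality rule.
    """
    total = conn.execute(f"SELECT COUNT(*) FROM {table_name}").fetchone()[0]
    failed = conn.execute(f"SELECT COUNT(*) FROM ({check_query})").fetchone()[0]
    if total == 0:
        return 0.0
    return 100.0 * failed / total


# In-memory database standing in for a real table (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO orders (order_id, amount) VALUES (?, ?)",
    [(1, 10.0), (2, None), (3, 25.5), (4, 7.25)],
)

# Rule: amount must not be NULL; the query returns the violating rows.
pct = failure_percentage(conn, "orders", "SELECT * FROM orders WHERE amount IS NULL")
threshold = 5  # assumed meaning: maximum tolerated percentage of failing rows
passed = pct <= threshold
print(f"{pct:.1f}% of rows failed; check passed: {passed}")
```

Here one of four rows violates the rule, so the failure rate is 25% and the assumed 5% threshold is exceeded.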
Installation
You can install dq_check from PyPI using pip:
pip install dq_check
Usage
Here's an example of how to use the DQCheck class from the dq_check package:
from pyspark.sql import SparkSession
from dq_check import DQCheck

# Initialize Spark session
spark = SparkSession.builder.appName("DQCheckExample").getOrCreate()

# Create an instance of DQCheck
# The audit table name should include catalog and schema (placeholder value below)
audit_table = "your_catalog.your_schema.your_audit_table"
dq_checker = DQCheck(spark, audit_table)

# Define the data quality check parameters
table_type = "delta"  # Type of the table ('delta' or 'asql')
table_name = "your_table_name"  # Include catalog and schema for delta, schema for asql
primary_keys = ["your_primary_key"]  # List of primary key columns
# Data quality check query; the table name should include catalog and schema
sql_query = "SELECT * FROM your_table WHERE condition"

# Perform the data quality check
dq_checker.perform_dq_check(
    table_type,
    table_name,
    primary_keys,
    sql_query,
    scope=None,  # Optional, required for asql only
    secret=None,  # Optional, required for asql only
    data_batch_identifier_name=None,  # Optional batch identifier name
    data_batch_identifier_value=None,  # Optional batch identifier value
    quality_threshold_percentage=5,  # Quality threshold percentage
)
Configuration
Adjust the parameters passed to the perform_dq_check method based on your requirements.
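Since the usage example asks for table names that carry a catalog and schema for delta tables and a schema for asql tables, it may help to validate names before calling perform_dq_check. The helper below is a minimal sketch and not part of dq_check. The name is_fully_qualified is hypothetical, and it assumes delta names look like catalog.schema.table and asql names look like schema.table.

```python
def is_fully_qualified(table_name, table_type):
    """Check that table_name has the expected number of dot-separated parts.

    Assumption: 'delta' names look like catalog.schema.table and 'asql'
    names look like schema.table.
    """
    expected_parts = {"delta": 3, "asql": 2}
    if table_type not in expected_parts:
        raise ValueError(f"Unknown table_type: {table_type!r}")
    parts = table_name.split(".")
    # Every part must be non-empty, so "main..orders" is rejected.
    return len(parts) == expected_parts[table_type] and all(parts)


# Example names are placeholders, not real tables.
print(is_fully_qualified("main.sales.orders", "delta"))  # True
print(is_fully_qualified("orders", "delta"))             # False
print(is_fully_qualified("dbo.orders", "asql"))          # True
```

A check like this only inspects the shape of the name; it does not confirm that the table exists.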
Dependencies
- PySpark
- Pandas
Contributing
Contributions are welcome! Please feel free to submit issues and pull requests on the GitHub repository.
License
None.
Contact
For any questions or feedback, open a GitHub issue.