A data quality check module for Spark

dq_check

Overview

dq_check is a Python package that provides a data quality check function encapsulated in the DQCheck class. It allows you to perform data quality checks on tables using SQL queries and save the results into a Delta table for auditing purposes.

Features

  • Perform data quality checks on specified tables using SQL queries.
  • Save audit logs of data quality checks into a Delta table.
  • Handle aggregation checks and basic data quality metrics.
  • Integrate with PySpark and Pandas.
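
To make the audit-log feature more concrete, here is a minimal sketch of what one audit record for a single check might contain. This is not the schema dq_check writes: the AuditRecord class, the build_audit_record helper, and every field name are hypothetical and chosen only for illustration. The sketch uses plain Python so it runs without Spark.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class AuditRecord:
    """Hypothetical audit row; the real dq_check audit schema may differ."""
    table_name: str
    sql_query: str
    total_rows: int
    failed_rows: int
    passed: bool
    checked_at: str  # ISO 8601 timestamp in UTC


def build_audit_record(table_name, sql_query, total_rows, failed_rows, passed):
    """Bundle the outcome of one check with the time it was recorded."""
    return AuditRecord(
        table_name=table_name,
        sql_query=sql_query,
        total_rows=total_rows,
        failed_rows=failed_rows,
        passed=passed,
        checked_at=datetime.now(timezone.utc).isoformat(),
    )


record = build_audit_record(
    "your_table_name",
    "SELECT * FROM your_table WHERE condition",
    total_rows=20,
    failed_rows=1,
    passed=True,
)
row = asdict(record)  # a plain dict, ready to append to an audit table
```

Keeping the query text and row counts in each record would let an auditor see which check ran and how many rows it flagged without rerunning it.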

Installation

You can install dq_check from PyPI using pip:

pip install dq_check

Usage

Here's an example of how to use the DQCheck class from the dq_check package:

from pyspark.sql import SparkSession
from dq_check import DQCheck

# Initialize Spark session
spark = SparkSession.builder.appName("DQCheckExample").getOrCreate()

# Create an instance of DQCheck
dq_checker = DQCheck(spark)

# Define the data quality check parameters
table_type = "delta"  # Type of the table ('delta' or 'asql')
table_name = "your_table_name"  # Name of the table
primary_keys = ["your_primary_key"]  # List of primary key columns
sql_query = "SELECT * FROM your_table WHERE condition"  # Data quality check query

# Perform the data quality check
dq_checker.perform_dq_check(
    table_type,
    table_name,
    primary_keys,
    sql_query,
    quality_threshold_percentage=5,  # Quality threshold percentage
    data_batch_identifier_name=None,  # Optional batch identifier name
    data_batch_identifier_value=None,  # Optional batch identifier value
    audit_table_name="your_audit_log_table",  # Name of the Delta table to store audit logs
)

Configuration

Adjust the parameters passed to the perform_dq_check method based on your requirements.
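
When choosing a value for quality_threshold_percentage, it can help to see how a percentage threshold is commonly applied. The sketch below is an illustration only and does not reproduce dq_check's internals: it assumes the check query returns the failing rows and that a check passes when the share of failing rows is at or below the threshold. The helper names failure_percentage and passes_threshold and the orders table are hypothetical, and the standard-library sqlite3 module stands in for Spark so the example runs anywhere.

```python
import sqlite3


def failure_percentage(conn, table_name, failing_rows_query):
    """Return the share of rows (0 to 100) that the check query flags."""
    total = conn.execute(f"SELECT COUNT(*) FROM {table_name}").fetchone()[0]
    if total == 0:
        return 0.0  # an empty table has nothing to fail
    failed = conn.execute(
        f"SELECT COUNT(*) FROM ({failing_rows_query})"
    ).fetchone()[0]
    return 100.0 * failed / total


def passes_threshold(failure_pct, quality_threshold_percentage):
    """Assumed rule: pass when failing rows do not exceed the threshold."""
    return failure_pct <= quality_threshold_percentage


# Build a small in-memory table: 20 rows, one with a missing amount.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
rows = [(i, 10.0) for i in range(1, 20)] + [(20, None)]
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

pct = failure_percentage(
    conn, "orders", "SELECT * FROM orders WHERE amount IS NULL"
)
print(pct, passes_threshold(pct, 5))
```

Under these assumptions, one failing row out of 20 sits exactly at a threshold of 5, so the check passes; a stricter threshold of 4 would fail it. Confirm the actual semantics against the package source before relying on a particular value.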

Dependencies

  • PySpark
  • Pandas

Contributing

Contributions are welcome! Please feel free to submit issues and pull requests on the GitHub repository.

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Contact

For questions or feedback, open a GitHub issue.
