A library for data quality validation using PyDeequ.
Data Quality Validation
This is the repository for the Data Quality Validation library.
Author: Ketan Kirange
Contributors: Ketan Kirange, Rahul Ajay, Ruth Mifsud
This repository contains tools and utilities for performing data quality checks on data files in Pandas, Dask, and PySpark formats, leveraging libraries such as PyDeequ and SODA utilities.
These checks help ensure the integrity, accuracy, and completeness of the data, essential for robust data-driven decision-making processes.
Importance of Data Quality
Data quality plays a pivotal role in any engineering project, especially in data science, reporting, and analysis.
Here's why ensuring high data quality is crucial:
1. Reliable Insights
High-quality data leads to reliable and trustworthy insights.
When the data is accurate, complete, and consistent, data scientists and analysts can make informed decisions confidently.
2. Trustworthy Models
Data quality directly impacts the performance and reliability of machine learning models.
Models trained on low-quality data may produce biased or inaccurate predictions, leading to unreliable outcomes.
3. Effective Reporting
Quality data is fundamental for generating accurate reports and visualizations.
Analysts and stakeholders rely on these reports for understanding trends, identifying patterns, and making strategic decisions.
Poor data quality can lead to misleading reports and flawed interpretations.
4. Regulatory Compliance
In many industries, compliance with regulations such as GDPR, HIPAA, or industry-specific standards is mandatory.
Ensuring data quality is essential for meeting these regulatory requirements and avoiding potential legal consequences.
Data Quality Validation Tools
This repository provides a set of tools and utilities to perform comprehensive data quality validation on various data formats:
- Pandas: Data quality checks for data stored in Pandas DataFrames, including checks for missing values, data types, and statistical summaries.
- Dask: Scalable data quality checks for large-scale datasets using Dask, ensuring consistency and accuracy across distributed computing environments.
- PySpark with PyDeequ: Integration with PyDeequ, enabling data quality validation on data processed using PySpark, including checks for schema validation, data distribution, and anomaly detection.
- SODA Utilities: Utilities for validating data quality using the SODA framework, allowing for automated quality checks and anomaly detection.
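As an illustration of the kind of Pandas checks listed above (missing values, data types, and statistical summaries), here is a minimal sketch using plain pandas. It is not this repository's API: the function name `basic_quality_report` and the sample data are hypothetical and exist only for this example.

```python
import pandas as pd


def basic_quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise dtype, missing values, and distinct counts per column.

    Hypothetical helper for illustration; not part of this library.
    """
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "missing": df.isna().sum(),
            "missing_pct": (df.isna().mean() * 100).round(2),
            "distinct": df.nunique(dropna=True),
        }
    )


# Hypothetical sample data, used only to exercise the helper.
sample = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 4],
        "email": ["a@example.com", None, "c@example.com", "c@example.com"],
        "amount": [10.5, 20.0, None, 7.25],
    }
)

report = basic_quality_report(sample)
print(report)

# Statistical summary of the numeric columns.
print(sample.describe())
```

The report has one row per column, so a threshold rule (for example, failing a run when `missing_pct` exceeds an agreed limit) can be applied with ordinary DataFrame filtering.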
Getting Started
To get started with data quality validation using this repository, follow the instructions in the respective documentation for each tool:
- Pandas Data Quality Validation Guide
- Dask Data Quality Validation Guide
- PySpark with PyDeequ Guide
- SODA Utilities Guide
Contributing
We welcome contributions from the community to enhance and expand the capabilities of this data quality validation repository.
Please refer to the contribution guidelines for more information on how to contribute.
Prerequisites:
- Step 1: Download Java, Python, and Apache Spark. Having the appropriate versions is essential to run the code on a local system.
  - Java: Java 1.8 Archive Downloads
  - Python: Python 3.9.18 Release
  - Apache Spark: Apache Spark 3.3.0 Release
- Step 2: If you encounter an error such as "PyDeequ module is not installed on the machine", install PyDeequ from the terminal with the following command:
      pip install pydeequ
- Step 3: Install the "Data Quality Validation" Python library from the terminal:
      pip install data-quality-validation-pydeequ
- Step 4: To run the Data Quality Validation function, import the library as follows:
      from dqv.dqv_pydeequ import DqvPydeequ
- Step 5: Create a config file in a folder, listing the columns that need to be validated. Name the file as you wish, but remember to pass the same name to the DqvPydeequ function.
- Step 6: Upload your data to S3, or save it in a new directory if you are running locally.
- Step 7: Pass your config file and your source and target file paths to the DqvPydeequ function:
      DqvPydeequ(
          "",  # config_file
          "",  # source_data_path
          "",  # target_data_path
      )
- Step 8: Run the file to validate.
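Before running the steps above, it can help to confirm that the prerequisites are visible from Python. The following is a rough sketch using only the standard library. The function name `check_prerequisites` is hypothetical, and the Python check simply mirrors the 3.9 release named in Step 1; whether other versions work is not stated here.

```python
import importlib.util
import shutil
import sys


def check_prerequisites() -> dict:
    """Report which prerequisites appear to be available locally.

    Hypothetical helper for illustration; not part of this library.
    """
    return {
        # Step 1 names Python 3.9.18, so only the 3.9 series is checked.
        "python_is_3_9": sys.version_info[:2] == (3, 9),
        # A `java` executable on PATH; the Java version is not inspected.
        "java_on_path": shutil.which("java") is not None,
        # find_spec returns None when a top-level package is not installed.
        "pyspark_importable": importlib.util.find_spec("pyspark") is not None,
        "pydeequ_importable": importlib.util.find_spec("pydeequ") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'OK' if ok else 'MISSING'}")
```

A `MISSING` entry points back to the matching step: Step 1 for Java, Python, and Spark, and Step 2 for PyDeequ.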