A library for data quality validation using PyDeequ.

Data Quality Validation

This repository hosts the data quality validation library.

Author: Ketan Kirange

Contributors: Ketan Kirange, Rahul Ajay, Ruth Mifsud

This repository contains tools and utilities for performing data quality checks on data held in

  • Pandas,
  • Dask, and
  • PySpark

DataFrames, leveraging libraries such as PyDeequ and the SODA utilities.

These checks help ensure the integrity, accuracy, and completeness of the data, essential for robust data-driven decision-making processes.

Importance of Data Quality

Data quality plays a pivotal role in any engineering project, especially in data science, reporting, and analysis.
Here's why ensuring high data quality is crucial:

1. Reliable Insights

High-quality data leads to reliable and trustworthy insights.
When the data is accurate, complete, and consistent, data scientists and analysts can make informed decisions confidently.

2. Trustworthy Models

Data quality directly impacts the performance and reliability of machine learning models.
Models trained on low-quality data may produce biased or inaccurate predictions, leading to unreliable outcomes.

3. Effective Reporting

Quality data is fundamental for generating accurate reports and visualizations.
Analysts and stakeholders rely on these reports for understanding trends, identifying patterns, and making strategic decisions.
Poor data quality can lead to misleading reports and flawed interpretations.

4. Regulatory Compliance

In many industries, compliance with regulations such as GDPR, HIPAA, or industry-specific standards is mandatory.
Ensuring data quality is essential for meeting these regulatory requirements and avoiding potential legal consequences.

Data Quality Validation Tools

This repository provides a set of tools and utilities to perform comprehensive data quality validation on various data formats:

  • Pandas: Data quality checks for data stored in Pandas DataFrames, including checks for missing values, data types, and statistical summaries.
  • Dask: Scalable data quality checks for large-scale datasets using Dask, ensuring consistency and accuracy across distributed computing environments.
  • PySpark with PyDeequ: Integration with PyDeequ, enabling data quality validation on data processed using PySpark, including checks for schema validation, data distribution, and anomaly detection.
  • SODA Utilities: Utilities for validating data quality using the SODA framework, allowing for automated quality checks and anomaly detection.
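
As a rough illustration of the kind of Pandas checks listed above (missing values, data types, and statistical summaries), the sketch below uses plain pandas only. It does not show this library's own API: the function name `run_basic_checks`, the sample columns, and the expected dtypes are all hypothetical and chosen purely for the example.

```python
import pandas as pd


def run_basic_checks(df: pd.DataFrame, expected_dtypes: dict) -> dict:
    """Hypothetical helper: report missing values, dtype mismatches, and summary stats."""
    # Number of missing values in each column.
    missing = {column: int(count) for column, count in df.isna().sum().items()}

    # Columns whose actual dtype differs from the expected one, or that are absent.
    dtype_mismatches = {}
    for column, expected in expected_dtypes.items():
        actual = str(df[column].dtype) if column in df.columns else "missing column"
        if actual != expected:
            dtype_mismatches[column] = {"expected": expected, "actual": actual}

    # Statistical summary (count, mean, std, min, quartiles, max) of numeric columns.
    summary = df.describe().to_dict()

    return {
        "missing": missing,
        "dtype_mismatches": dtype_mismatches,
        "summary": summary,
    }


# Illustrative data only; not taken from the project.
sample = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "amount": [10.0, None, 7.5, 12.5],
        "country": ["MT", "IN", None, "UK"],
    }
)

# "amount" is deliberately declared as int64 to trigger a dtype mismatch.
report = run_basic_checks(sample, {"order_id": "int64", "amount": "int64"})

print(report["missing"])
print(report["dtype_mismatches"])
```

In this sample, `amount` and `country` each contain one missing value, and `amount` is flagged because it is stored as `float64` rather than the declared `int64`. The same idea carries over to Dask or PySpark, where the counts would be computed lazily or on a cluster instead of in memory.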

Getting Started

To get started with data quality validation using this repository, follow the instructions in the documentation for the tool you are using.

Contributing

We welcome contributions from the community to enhance and expand the capabilities of this data quality validation repository.
Please refer to the contribution guidelines for more information on how to contribute.
