HDF Data Quality Framework for PySpark DataFrames using Great Expectations

Project description

HDF DQ Framework

A powerful Data Quality Framework for PySpark DataFrames using Great Expectations validation rules, designed for the HDF Data Pipeline ecosystem.

Overview

The DQ Framework provides a simple and efficient way to filter DataFrames based on data quality rules. It separates qualified data from bad data, allowing you to handle data quality issues systematically in your data pipelines.

Key Features

  • Easy Integration: Simple API that works with existing PySpark workflows
  • Great Expectations: Leverages the power of Great Expectations for data validation
  • Flexible Rules: Support for JSON string, dictionary, or list-based rule configuration
  • Dual Output: Returns both qualified and bad rows as separate DataFrames
  • Detailed Validation: Optional validation details for debugging and monitoring
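The "Flexible Rules" point means rules may arrive as a JSON string, a single dictionary, or a list of dictionaries. A minimal sketch of how such inputs could be normalized to a list (the `normalize_rules` helper is illustrative only, not part of the library's API):

```python
import json
from typing import Any, Dict, List, Union

def normalize_rules(
    rules: Union[str, Dict[str, Any], List[Dict[str, Any]]]
) -> List[Dict[str, Any]]:
    """Accept a JSON string, a single rule dict, or a list of rule dicts,
    and always return a list of rule dicts."""
    if isinstance(rules, str):
        rules = json.loads(rules)   # JSON string -> dict or list
    if isinstance(rules, dict):
        rules = [rules]             # single rule -> one-element list
    if not isinstance(rules, list):
        raise TypeError("quality_rules must be a JSON string, dict, or list")
    return rules

# All three input forms yield the same normalized list:
rule = {"expectation_type": "expect_column_values_to_not_be_null",
        "kwargs": {"column": "name"}}
assert normalize_rules(rule) == [rule]
assert normalize_rules([rule]) == [rule]
assert normalize_rules(json.dumps(rule)) == [rule]
```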

Quick Start

from pyspark.sql import SparkSession
from dq_framework import DQFramework

# Initialize Spark session
spark = SparkSession.builder.appName("DQ_Example").getOrCreate()

# Create sample data
data = [
    (1, "John", 25, "john@email.com"),
    (2, "Jane", -5, "invalid-email"),  # Bad data: negative age, invalid email
    (3, "Bob", 30, "bob@email.com"),
    (4, None, 35, "alice@email.com"),  # Bad data: null name
]
columns = ["id", "name", "age", "email"]
df = spark.createDataFrame(data, columns)

# Define quality rules
quality_rules = [
    {
        "expectation_type": "expect_column_values_to_not_be_null",
        "kwargs": {"column": "name"}
    },
    {
        "expectation_type": "expect_column_values_to_be_between",
        "kwargs": {"column": "age", "min_value": 0, "max_value": 120}
    },
    {
        "expectation_type": "expect_column_values_to_match_regex",
        "kwargs": {"column": "email", "regex": r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"}
    }
]

# Initialize DQ Framework
dq = DQFramework()

# Filter data
qualified_df, bad_df = dq.filter_dataframe(
    dataframe=df,
    quality_rules=quality_rules,
    include_validation_details=True
)

# Show results
print("Qualified Data:")
qualified_df.show()

print("Bad Data:")
bad_df.show()
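The email rule above relies on a regular expression. It can be sanity-checked outside Spark with Python's `re` module before wiring it into a pipeline:

```python
import re

# Same pattern as the quality rule in the Quick Start
EMAIL_REGEX = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
pattern = re.compile(EMAIL_REGEX)

# Matches the sample data above:
assert pattern.match("john@email.com")     # qualified row
assert not pattern.match("invalid-email")  # bad row: no @ or domain part
```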

API Reference

DQFramework

The main class for data quality processing.

Methods

  • filter_dataframe(dataframe, quality_rules, columns=None, include_validation_details=False)
    • Filters a DataFrame based on quality rules
    • Returns tuple of (qualified_df, bad_df)

RuleProcessor

Handles the processing of Great Expectations rules.
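The page does not document `RuleProcessor`'s interface. As an illustration only, a processor of this kind typically checks that each rule carries the fields Great Expectations needs (`expectation_type` and `kwargs`); the `check_rule` function below is a hypothetical sketch, not the library's implementation:

```python
from typing import Any, Dict, List

def check_rule(rule: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a single rule dict (empty if valid).
    Hypothetical validation logic, not the library's actual RuleProcessor."""
    problems = []
    if "expectation_type" not in rule:
        problems.append("missing 'expectation_type'")
    if not isinstance(rule.get("kwargs"), dict):
        problems.append("'kwargs' must be a dict")
    return problems

good = {"expectation_type": "expect_column_values_to_not_be_null",
        "kwargs": {"column": "name"}}
bad = {"kwargs": "name"}  # missing expectation_type, kwargs not a dict

assert check_rule(good) == []
assert check_rule(bad) == ["missing 'expectation_type'", "'kwargs' must be a dict"]
```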

Dependencies

Core Dependencies

  • PySpark ^3.0.0: For DataFrame operations
  • Great Expectations ^0.15.0: For validation logic
  • typing-extensions ^4.0.0: For enhanced type hints
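The caret constraints above follow Poetry's notation (`^3.0.0` means `>=3.0.0,<4.0.0`). As a sketch only, the corresponding dependency section of a consuming project's `pyproject.toml` might look like this (versions copied from the list above; the Python constraint is an assumption, since this page does not state a minimum Python version):

```toml
[tool.poetry.dependencies]
python = "^3.8"  # assumption: minimum Python not stated on this page
pyspark = "^3.0.0"
great-expectations = "^0.15.0"
typing-extensions = "^4.0.0"
```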

Download files

Download the file for your platform.

Source Distribution

hdf_dq_framework-0.3.0.tar.gz (109.4 kB)

Uploaded Source

Built Distribution

hdf_dq_framework-0.3.0-py3-none-any.whl (184.1 kB)

Uploaded Python 3

File details

Details for the file hdf_dq_framework-0.3.0.tar.gz.

File metadata

  • Download URL: hdf_dq_framework-0.3.0.tar.gz
  • Upload date:
  • Size: 109.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.0.0 CPython/3.13.3 Darwin/24.5.0

File hashes

Hashes for hdf_dq_framework-0.3.0.tar.gz
Algorithm Hash digest
SHA256 f9b9ab4a836dea5bd4ca97bf9fa164522fbe358a58fa72d186e1ac81c4425249
MD5 4a12964e5defa17fe051a958d146158a
BLAKE2b-256 3945272636b08bd1deb515a66df9e4ac7944258845087ee82f459c8312bd0fa6

File details

Details for the file hdf_dq_framework-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: hdf_dq_framework-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 184.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.0.0 CPython/3.13.3 Darwin/24.5.0

File hashes

Hashes for hdf_dq_framework-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 2fc69a07321e72d1941d7d76a81d06ff4b60965dea36a2cecb83b21747c8df9f
MD5 015f2f82835c742f27029e5e5e5a4ce2
BLAKE2b-256 96d66a05be27942e83b7458d5d522c3fa81c5b52ef6c85a0cebd74a881cc111b
