HDF Data Quality Framework for PySpark DataFrames using Great Expectations
HDF DQ Framework
A powerful Data Quality Framework for PySpark DataFrames using Great Expectations validation rules, designed for the HDF Data Pipeline ecosystem.
Overview
The DQ Framework provides a simple and efficient way to filter DataFrames based on data quality rules. It separates qualified data from bad data, allowing you to handle data quality issues systematically in your data pipelines.
Key Features
- Easy Integration: Simple API that works with existing PySpark workflows
- Great Expectations: Leverages the power of Great Expectations for data validation
- Flexible Rules: Support for JSON string, dictionary, or list-based rule configuration
- Dual Output: Returns both qualified and bad rows as separate DataFrames
- Detailed Validation: Optional validation details for debugging and monitoring
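The three rule formats mentioned above (JSON string, dictionary, or list) are interchangeable. A minimal sketch of how such inputs can be normalized to a single list-of-dicts form (`normalize_rules` is an illustration of the idea, not the framework's actual internals):

```python
import json

# The same rule expressed three ways: JSON string, single dict, list of dicts.
rules_as_json = '[{"expectation_type": "expect_column_values_to_not_be_null", "kwargs": {"column": "name"}}]'
rule_as_dict = {
    "expectation_type": "expect_column_values_to_not_be_null",
    "kwargs": {"column": "name"},
}
rules_as_list = [rule_as_dict]

def normalize_rules(rules):
    """Normalize any accepted rule format into a list of rule dicts."""
    if isinstance(rules, str):
        rules = json.loads(rules)      # JSON string -> parsed object
    if isinstance(rules, dict):
        rules = [rules]                # single rule -> one-element list
    return rules

# All three forms normalize to the same list of rule dicts.
assert normalize_rules(rules_as_json) == normalize_rules(rule_as_dict) == rules_as_list
```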
Quick Start
```python
from pyspark.sql import SparkSession

from dq_framework import DQFramework

# Initialize Spark session
spark = SparkSession.builder.appName("DQ_Example").getOrCreate()

# Create sample data
data = [
    (1, "John", 25, "john@email.com"),
    (2, "Jane", -5, "invalid-email"),   # Bad data: negative age, invalid email
    (3, "Bob", 30, "bob@email.com"),
    (4, None, 35, "alice@email.com"),   # Bad data: null name
]
columns = ["id", "name", "age", "email"]
df = spark.createDataFrame(data, columns)

# Define quality rules
quality_rules = [
    {
        "expectation_type": "expect_column_values_to_not_be_null",
        "kwargs": {"column": "name"},
    },
    {
        "expectation_type": "expect_column_values_to_be_between",
        "kwargs": {"column": "age", "min_value": 0, "max_value": 120},
    },
    {
        "expectation_type": "expect_column_values_to_match_regex",
        "kwargs": {"column": "email", "regex": r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"},
    },
]

# Initialize DQ Framework
dq = DQFramework()

# Filter data
qualified_df, bad_df = dq.filter_dataframe(
    dataframe=df,
    quality_rules=quality_rules,
    include_validation_details=True,
)

# Show results
print("Qualified Data:")
qualified_df.show()

print("Bad Data:")
bad_df.show()
```
API Reference
DQFramework
The main class for data quality processing.
Methods
filter_dataframe(dataframe, quality_rules, columns=None, include_validation_details=False)
- Filters a DataFrame based on the given quality rules
- Returns a tuple of (qualified_df, bad_df)
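Conceptually, the dual output partitions rows by whether they pass every rule. The following pure-Python sketch illustrates that split semantics only; the framework itself operates on Spark DataFrames with Great Expectations validations, not on plain lists of dicts:

```python
# Illustrative only: partition rows the way filter_dataframe conceptually
# does. The lambdas are simplified stand-ins for Great Expectations rules.
rows = [
    {"id": 1, "name": "John", "age": 25},
    {"id": 2, "name": "Jane", "age": -5},   # fails the age-range check
    {"id": 4, "name": None, "age": 35},     # fails the not-null check
]

checks = [
    lambda r: r["name"] is not None,   # ~ expect_column_values_to_not_be_null
    lambda r: 0 <= r["age"] <= 120,    # ~ expect_column_values_to_be_between
]

# A row is qualified only if it passes every check; otherwise it is bad.
qualified = [r for r in rows if all(check(r) for check in checks)]
bad = [r for r in rows if not all(check(r) for check in checks)]
```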
RuleProcessor
Handles the processing of Great Expectations rules.
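The page does not document RuleProcessor's interface. As an illustration of how expectation-style rule dicts can be dispatched to concrete checks, here is a hypothetical row-level dispatcher; the helper names (check_not_null, check_between, check_regex, passes) are inventions for this sketch, not the framework's API:

```python
import re

# Hypothetical helpers: each mirrors one expectation_type as a row predicate.
def check_not_null(row, column):
    return row.get(column) is not None

def check_between(row, column, min_value, max_value):
    value = row.get(column)
    return value is not None and min_value <= value <= max_value

def check_regex(row, column, regex):
    value = row.get(column)
    return value is not None and re.match(regex, value) is not None

# Map expectation_type names onto the predicates above.
DISPATCH = {
    "expect_column_values_to_not_be_null": check_not_null,
    "expect_column_values_to_be_between": check_between,
    "expect_column_values_to_match_regex": check_regex,
}

def passes(row, rule):
    """Apply one rule dict to one row by looking up its expectation_type."""
    predicate = DISPATCH[rule["expectation_type"]]
    return predicate(row, **rule["kwargs"])
```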
Dependencies
Core Dependencies
- PySpark ^3.0.0: For DataFrame operations
- Great Expectations ^0.15.0: For validation logic
- typing-extensions ^4.0.0: For enhanced type hints
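Assuming the package is published to PyPI under the name hdf-dq-framework (inferred from the distribution file names below, not stated explicitly on this page), installation would look like:

```shell
# Install from PyPI (package name inferred from the distribution file names)
pip install hdf-dq-framework

# Or with Poetry, matching the ^-style version constraints above
poetry add hdf-dq-framework
```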
Project details
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file hdf_dq_framework-0.3.0.tar.gz.
File metadata
- Download URL: hdf_dq_framework-0.3.0.tar.gz
- Upload date:
- Size: 109.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.0.0 CPython/3.13.3 Darwin/24.5.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f9b9ab4a836dea5bd4ca97bf9fa164522fbe358a58fa72d186e1ac81c4425249 |
| MD5 | 4a12964e5defa17fe051a958d146158a |
| BLAKE2b-256 | 3945272636b08bd1deb515a66df9e4ac7944258845087ee82f459c8312bd0fa6 |
File details
Details for the file hdf_dq_framework-0.3.0-py3-none-any.whl.
File metadata
- Download URL: hdf_dq_framework-0.3.0-py3-none-any.whl
- Upload date:
- Size: 184.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.0.0 CPython/3.13.3 Darwin/24.5.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2fc69a07321e72d1941d7d76a81d06ff4b60965dea36a2cecb83b21747c8df9f |
| MD5 | 015f2f82835c742f27029e5e5e5a4ce2 |
| BLAKE2b-256 | 96d66a05be27942e83b7458d5d522c3fa81c5b52ef6c85a0cebd74a881cc111b |