HDF Data Quality Framework for PySpark DataFrames using Great Expectations
DQ Framework
A powerful Data Quality Framework for PySpark DataFrames using Great Expectations validation rules.
Overview
The DQ Framework provides a simple and efficient way to filter DataFrames based on data quality rules. It separates qualified data from bad data, allowing you to handle data quality issues systematically in your data pipelines.
Key Features
- Easy Integration: Simple API that works with existing PySpark workflows
- Great Expectations: Leverages the power of Great Expectations for data validation
- Flexible Rules: Support for JSON string, dictionary, or list-based rule configuration
- Dual Output: Returns both qualified and bad rows as separate DataFrames
- Detailed Validation: Optional validation details for debugging and monitoring
Installation
Prerequisites
Install Poetry if you haven't already:
# Install Poetry (if not already installed)
curl -sSL https://install.python-poetry.org | python3 -
# Or using pip
pip install poetry
Versioning
The package version is taken from the Git tag at build time, so releasing a new version is a matter of committing, tagging, and building:
git add .
git commit -m "Add awesome new feature"
# Create a new release tag
git tag v0.2.0
git push --tags
# Build automatically uses the tag version
poetry build
Option 4: Build and Install Distribution
# Build the package
poetry build
# This creates:
# - dist/dq_framework-0.1.0.tar.gz (source distribution)
# - dist/dq_framework-0.1.0-py3-none-any.whl (wheel)
# Install the built wheel
pip install dist/dq_framework-0.1.0-py3-none-any.whl
Option 5: Install from Source
# Install directly from the source directory (replace the path below with your local checkout)
pip install /Users/nchungkham/Workspace/hdf-data-pipeline/spark_etl_framework/dq_framework
Quick Start
from pyspark.sql import SparkSession
from dq_framework import DQFramework
# Initialize Spark session
spark = SparkSession.builder.appName("DQ_Example").getOrCreate()
# Create sample data
data = [
    (1, "John", 25, "john@email.com"),
    (2, "Jane", -5, "invalid-email"),  # Bad data: negative age, invalid email
    (3, "Bob", 30, "bob@email.com"),
    (4, None, 35, "alice@email.com"),  # Bad data: null name
]
columns = ["id", "name", "age", "email"]
df = spark.createDataFrame(data, columns)
# Define quality rules
quality_rules = [
    {
        "expectation_type": "expect_column_values_to_not_be_null",
        "kwargs": {"column": "name"}
    },
    {
        "expectation_type": "expect_column_values_to_be_between",
        "kwargs": {"column": "age", "min_value": 0, "max_value": 120}
    },
    {
        "expectation_type": "expect_column_values_to_match_regex",
        "kwargs": {"column": "email", "regex": r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"}
    }
]
# Initialize DQ Framework
dq = DQFramework()
# Filter data
qualified_df, bad_df = dq.filter_dataframe(
    dataframe=df,
    quality_rules=quality_rules,
    include_validation_details=True
)
# Show results
print("Qualified Data:")
qualified_df.show()
print("Bad Data:")
bad_df.show()
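In a pipeline, a common follow-up is to route the two outputs to separate locations and quarantine the bad rows for review. This is a minimal sketch using only standard PySpark writers; the output paths and the rejection threshold are illustrative placeholders, not part of the framework.
# Persist qualified rows to the curated zone and bad rows to a quarantine area
# (both paths are illustrative placeholders)
qualified_df.write.mode("overwrite").parquet("/data/curated/customers")
bad_df.write.mode("overwrite").parquet("/data/quarantine/customers")

# Optionally fail the job when too many rows are rejected (5% is an example threshold)
bad_ratio = bad_df.count() / max(df.count(), 1)
if bad_ratio > 0.05:
    raise ValueError(f"Data quality check failed: {bad_ratio:.1%} of rows rejected")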
Poetry Commands Reference
Development Workflow
# Install dependencies
poetry install
# Add a new dependency
poetry add package-name
# Add a development dependency
poetry add --group dev package-name
# Update dependencies
poetry update
# Show current dependencies
poetry show
# Show dependency tree
poetry show --tree
# Activate virtual environment
poetry shell
# Run commands in the poetry environment
poetry run python script.py
poetry run pytest
poetry run black .
poetry run mypy .
Building and Publishing
# Build the package
poetry build
# Check the package before publishing
poetry check
# Publish to PyPI (requires configuration)
poetry publish
# Publish to a private repository
poetry publish --repository my-repo
Virtual Environment Management
# Show virtual environment info
poetry env info
# Show path to virtual environment
poetry env info --path
# Remove virtual environment
poetry env remove python
# Use specific Python version
poetry env use python3.9
API Reference
DQFramework
The main class for data quality processing.
Methods
filter_dataframe(dataframe, quality_rules, columns=None, include_validation_details=False)
- Filters a DataFrame based on quality rules
- Returns a tuple of (qualified_df, bad_df)
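For example, validation can be limited to a subset of columns by passing columns explicitly. This is a sketch based only on the signature above; it assumes columns restricts which columns are validated, so check the method's docstring to confirm the exact behavior.
# Validate only the name and age columns
# (assumes `columns` narrows the validation scope)
qualified_df, bad_df = dq.filter_dataframe(
    dataframe=df,
    quality_rules=quality_rules,
    columns=["name", "age"],
)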
RuleProcessor
Handles the processing of Great Expectations rules.
DQConfigExamples
Provides example configurations for common data quality scenarios.
Configuration Examples
The framework includes extensive configuration examples in DQConfigExamples. See the detailed documentation for more information.
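As noted under Key Features, rules can also be supplied as a JSON string instead of a Python list. A minimal sketch, assuming the JSON string encodes the same rule structure shown in the Quick Start (the accepted shapes are handled by RuleProcessor):
import json

# The Quick Start rules, serialized as a JSON string
quality_rules_json = json.dumps([
    {
        "expectation_type": "expect_column_values_to_not_be_null",
        "kwargs": {"column": "name"}
    }
])

qualified_df, bad_df = dq.filter_dataframe(
    dataframe=df,
    quality_rules=quality_rules_json
)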
Dependencies
Core Dependencies
- PySpark ^3.0.0: For DataFrame operations
- Great Expectations ^0.15.0: For validation logic
- typing-extensions ^4.0.0: For enhanced type hints
Optional Dependencies
- pandas ^1.3.0: For DataFrame interoperability (install with --extras enhanced)
- numpy ^1.20.0: For numerical operations (install with --extras enhanced)
Development Dependencies
- pytest ^6.0.0: Testing framework
- pytest-cov ^2.0.0: Coverage reporting
- black ^21.0.0: Code formatting
- flake8 ^3.8.0: Linting
- mypy ^0.900: Type checking
- pre-commit ^2.15.0: Git hooks
Development
Setup Development Environment
# Install with development dependencies
poetry install --with dev
# Set up pre-commit hooks
poetry run pre-commit install
# Run tests
poetry run pytest
# Run tests with coverage
poetry run pytest --cov=dq_framework
# Code formatting
poetry run black .
# Type checking
poetry run mypy .
# Linting
poetry run flake8 .
Package Structure
dq_framework/
├── __init__.py # Package initialization
├── dq_framework.py # Main DQFramework class
├── rule_processor.py # Rule processing logic
├── config_examples.py # Configuration examples
├── requirements.txt # Legacy requirements (for reference)
├── readme.md # Detailed technical documentation
├── pyproject.toml # Poetry configuration
└── README.md # This file
Contributing
- Follow PEP 8 style guidelines (use poetry run black .)
- Add type hints to all functions (check with poetry run mypy .)
- Include docstrings for all public methods
- Add tests for new functionality (poetry run pytest)
- Update documentation as needed
License
[Add your license information here]
Support
For questions and support, please create an issue in the repository.