
HDF Data Quality Framework for PySpark DataFrames using Great Expectations


DQ Framework

A powerful Data Quality Framework for PySpark DataFrames using Great Expectations validation rules.

Overview

The DQ Framework provides a simple and efficient way to filter DataFrames based on data quality rules. It separates qualified data from bad data, allowing you to handle data quality issues systematically in your data pipelines.

Key Features

  • Easy Integration: Simple API that works with existing PySpark workflows
  • Great Expectations: Leverages the power of Great Expectations for data validation
  • Flexible Rules: Support for JSON string, dictionary, or list-based rule configuration
  • Dual Output: Returns both qualified and bad rows as separate DataFrames
  • Detailed Validation: Optional validation details for debugging and monitoring
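
The "Flexible Rules" feature means the same expectation suite can be supplied in any of three equivalent shapes. A minimal sketch of how such inputs could be normalized, runnable without Spark (the exact parsing behavior and the `normalize_rules` helper are assumptions based on the feature list, not the framework's documented API):

```python
import json

# One Great Expectations-style rule, expressed three equivalent ways.
rule_dict = {
    "expectation_type": "expect_column_values_to_not_be_null",
    "kwargs": {"column": "name"},
}
rule_list = [rule_dict]            # list of rule dicts
rule_json = json.dumps(rule_list)  # the same rules as a JSON string

def normalize_rules(rules):
    """Coerce a JSON string, a single dict, or a list into a list of dicts."""
    if isinstance(rules, str):
        rules = json.loads(rules)
    if isinstance(rules, dict):
        rules = [rules]
    return rules

assert normalize_rules(rule_json) == normalize_rules(rule_dict) == rule_list
```

All three shapes end up as the same list of rule dicts, which is the form the Quick Start below uses.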

Installation

Prerequisites

Install Poetry if you haven't already:

# Install Poetry (if not already installed)
curl -sSL https://install.python-poetry.org | python3 -

# Or using pip
pip install poetry

Version

Releases are driven by git tags; the build derives the package version from the most recent tag:

git add .
git commit -m "Add awesome new feature"

# Create a new release tag
git tag v0.2.0
git push --tags

# Build automatically uses the tag version
poetry build

Option 4: Build and Install Distribution

# Build the package
poetry build

# This creates:
# - dist/dq_framework-0.1.0.tar.gz (source distribution)
# - dist/dq_framework-0.1.0-py3-none-any.whl (wheel)

# Install the built wheel
pip install dist/dq_framework-0.1.0-py3-none-any.whl

Option 5: Install from Source

# Install directly from the source directory
pip install /path/to/hdf-data-pipeline/spark_etl_framework/dq_framework

Quick Start

from pyspark.sql import SparkSession
from dq_framework import DQFramework

# Initialize Spark session
spark = SparkSession.builder.appName("DQ_Example").getOrCreate()

# Create sample data
data = [
    (1, "John", 25, "john@email.com"),
    (2, "Jane", -5, "invalid-email"),  # Bad data: negative age, invalid email
    (3, "Bob", 30, "bob@email.com"),
    (4, None, 35, "alice@email.com"),  # Bad data: null name
]
columns = ["id", "name", "age", "email"]
df = spark.createDataFrame(data, columns)

# Define quality rules
quality_rules = [
    {
        "expectation_type": "expect_column_values_to_not_be_null",
        "kwargs": {"column": "name"}
    },
    {
        "expectation_type": "expect_column_values_to_be_between",
        "kwargs": {"column": "age", "min_value": 0, "max_value": 120}
    },
    {
        "expectation_type": "expect_column_values_to_match_regex",
        "kwargs": {"column": "email", "regex": r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"}
    }
]

# Initialize DQ Framework
dq = DQFramework()

# Filter data
qualified_df, bad_df = dq.filter_dataframe(
    dataframe=df,
    quality_rules=quality_rules,
    include_validation_details=True
)

# Show results
print("Qualified Data:")
qualified_df.show()

print("Bad Data:")
bad_df.show()
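
The email rule above can be sanity-checked without a Spark session: the regex from the Quick Start accepts the three well-formed addresses in the sample data and rejects the malformed one.

```python
import re

# Same pattern as the expect_column_values_to_match_regex rule above.
EMAIL_RE = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"

samples = {
    "john@email.com": True,
    "invalid-email": False,   # no @ or domain, so it lands in bad_df
    "bob@email.com": True,
    "alice@email.com": True,
}
for email, expected in samples.items():
    assert bool(re.match(EMAIL_RE, email)) is expected
```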

Poetry Commands Reference

Development Workflow

# Install dependencies
poetry install

# Add a new dependency
poetry add package-name

# Add a development dependency
poetry add --group dev package-name

# Update dependencies
poetry update

# Show current dependencies
poetry show

# Show dependency tree
poetry show --tree

# Activate virtual environment
poetry shell

# Run commands in the poetry environment
poetry run python script.py
poetry run pytest
poetry run black .
poetry run mypy .

Building and Publishing

# Build the package
poetry build

# Check the package before publishing
poetry check

# Publish to PyPI (requires configuration)
poetry publish

# Publish to a private repository
poetry publish --repository my-repo

Virtual Environment Management

# Show virtual environment info
poetry env info

# Show path to virtual environment
poetry env info --path

# Remove virtual environment
poetry env remove python

# Use specific Python version
poetry env use python3.9

API Reference

DQFramework

The main class for data quality processing.

Methods

  • filter_dataframe(dataframe, quality_rules, columns=None, include_validation_details=False)
    • Filters a DataFrame based on quality rules
    • Returns tuple of (qualified_df, bad_df)

RuleProcessor

Handles the processing of Great Expectations rules.

DQConfigExamples

Provides example configurations for common data quality scenarios.

Configuration Examples

The framework includes extensive configuration examples in DQConfigExamples. See the detailed documentation for more information.
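
The exact contents of DQConfigExamples are not reproduced here. As an illustration only, a typical rule set for a customer table might be grouped like this (the expectation types are standard Great Expectations names; the grouping and the constant names are hypothetical):

```python
# Hypothetical configs in the style of DQConfigExamples.
COMPLETENESS_RULES = [
    {"expectation_type": "expect_column_values_to_not_be_null",
     "kwargs": {"column": col}}
    for col in ("id", "name", "email")
]

VALIDITY_RULES = [
    {"expectation_type": "expect_column_values_to_be_between",
     "kwargs": {"column": "age", "min_value": 0, "max_value": 120}},
    {"expectation_type": "expect_column_values_to_be_in_set",
     "kwargs": {"column": "status", "value_set": ["active", "inactive"]}},
]

# Rule lists are plain Python lists, so suites compose by concatenation
# before being handed to DQFramework.filter_dataframe.
ALL_RULES = COMPLETENESS_RULES + VALIDITY_RULES
assert len(ALL_RULES) == 5
```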

Dependencies

Core Dependencies

  • PySpark ^3.0.0: For DataFrame operations
  • Great Expectations ^0.15.0: For validation logic
  • typing-extensions ^4.0.0: For enhanced type hints

Optional Dependencies

  • pandas ^1.3.0: For DataFrame interoperability (install with --extras enhanced)
  • numpy ^1.20.0: For numerical operations (install with --extras enhanced)

Development Dependencies

  • pytest ^6.0.0: Testing framework
  • pytest-cov ^2.0.0: Coverage reporting
  • black ^21.0.0: Code formatting
  • flake8 ^3.8.0: Linting
  • mypy ^0.900: Type checking
  • pre-commit ^2.15.0: Git hooks

Development

Setup Development Environment

# Install with development dependencies
poetry install --with dev

# Set up pre-commit hooks
poetry run pre-commit install

# Run tests
poetry run pytest

# Run tests with coverage
poetry run pytest --cov=dq_framework

# Code formatting
poetry run black .

# Type checking
poetry run mypy .

# Linting
poetry run flake8 .

Package Structure

dq_framework/
├── __init__.py              # Package initialization
├── dq_framework.py          # Main DQFramework class
├── rule_processor.py        # Rule processing logic
├── config_examples.py       # Configuration examples
├── requirements.txt         # Legacy requirements (for reference)
├── readme.md               # Detailed technical documentation
├── pyproject.toml          # Poetry configuration
└── README.md              # This file

Contributing

  1. Follow PEP 8 style guidelines (use poetry run black .)
  2. Add type hints to all functions (check with poetry run mypy .)
  3. Include docstrings for all public methods
  4. Add tests for new functionality (poetry run pytest)
  5. Update documentation as needed

License

[Add your license information here]

Support

For questions and support, please create an issue in the repository.

Download files

Download the file for your platform.

Source Distribution

hdf_dq_framework-0.2.0.tar.gz (19.0 kB)

Built Distribution

hdf_dq_framework-0.2.0-py3-none-any.whl (38.9 kB)

File details

Details for the file hdf_dq_framework-0.2.0.tar.gz.

File metadata

  • Download URL: hdf_dq_framework-0.2.0.tar.gz
  • Size: 19.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.0.0 CPython/3.13.3 Darwin/24.5.0

File hashes

Hashes for hdf_dq_framework-0.2.0.tar.gz:

  • SHA256: 25d541ac40dda3e128e9050ce61c5066941c4a94b9862c489795a32885d3f459
  • MD5: 4cb7d465a2bdcd12ef290690df99af06
  • BLAKE2b-256: 8784d033ff23da20fd2039e5e8a7307540bf29755d55492a25116cd63234ce02


File details

Details for the file hdf_dq_framework-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: hdf_dq_framework-0.2.0-py3-none-any.whl
  • Size: 38.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.0.0 CPython/3.13.3 Darwin/24.5.0

File hashes

Hashes for hdf_dq_framework-0.2.0-py3-none-any.whl:

  • SHA256: 24c279d956cc6aadead6925450d512112820f703a3760dbfe6b97f79d877c4a5
  • MD5: 14859fab6e600d3d8cc8b4823b91622c
  • BLAKE2b-256: 003f6d8a226661b519645428c25b1eb2a39b5d290b5775299a71e8b58e44c80d

