
Reusable Data Validation Framework for data migration, ETL validation, and cross-database reconciliation

Project description

DataQE Framework - Data Quality and Equality Testing

A powerful Python framework for validating data quality and ensuring data consistency between source and target databases. Designed for data migration projects, ETL validation, and cross-database reconciliation.

Version: 0.0.1

Overview

DataQE Framework enables organizations to:

  • Validate data migration quality between different database systems
  • Ensure data consistency across source and target environments
  • Run comprehensive test suites with flexible comparison modes
  • Generate detailed reports for compliance and audit trails
  • Support dynamic dataset replacement for multi-release environments

Key Features

Multi-Database Support

  • MySQL - Relational database validation
  • Google BigQuery - Cloud data warehouse validation
  • Extensible connector architecture for adding more databases

Flexible Test Configuration

  • YAML-based test definitions
  • Single-source validation with expected conditions
  • Source vs Target equality checks
  • Threshold-based comparisons (percentage and absolute)
  • Support for multiple test cases in a single execution

Dynamic Dataset Replacement

  • Replace dataset placeholders with actual release names
  • Centralized configuration for dataset mappings
  • Support for multiple sources with different release versions

Comprehensive Reporting

  • ExecutionReport.html - Full test results with detailed execution times
  • FailedExecutionReport.html - Failed tests or confirmation of all tests passing
  • ExecutionReport.csv - Structured test results for further analysis
  • AutomationData.csv - CI/CD integration data
  • Real-time console output with progress tracking

Enterprise Features

  • PHI data protection with KMS encryption support
  • Detailed execution timing metrics
  • Environment-based configuration
  • Flexible credential management

Installation

Prerequisites

  • Python 3.8+
  • pip

Install from Source

git clone <repository-url>
cd dataqe-framework
pip install -e .

Verify Installation

dataqe-run --help

Quick Start

1. Create Configuration File

Create config.yml:

config_block_validation:
  source:
    database_type: mysql
    mysql:
      host: source-db.example.com
      port: 3306
      user: db_user
      password: db_password
      database: source_db

  target:
    database_type: gcpbq
    gcp:
      project_id: my-gcp-project
      dataset_id: target_dataset
      credentials_path: /path/to/credentials.json

  other:
    validation_script: test_suite.yml
    preprocessor_queries: preprocessor_queries.yml

2. Create Test Suite

Create test_suite.yml:

- test_row_count:
    severity: critical
    source:
      query: |
        SELECT COUNT(*) as value FROM users
    target:
      query: |
        SELECT COUNT(*) as value FROM users
    comparisons:
      comment: "User count must match between source and target"

- test_with_threshold:
    severity: high
    source:
      query: |
        SELECT SUM(amount) as value FROM transactions
    target:
      query: |
        SELECT SUM(amount) as value FROM transactions
    comparisons:
      threshold:
        value: percentage
        limit: 1
      comment: "Transaction amounts must match within 1%"

3. Run Validation

dataqe-run --config config.yml

Check output directory for reports:

./output/ExecutionReport.html
./output/ExecutionReport.csv
./output/FailedExecutionReport.html

Configuration

Config Block Structure

config_block_<name>:
  source:
    database_type: mysql|gcpbq
    mysql: {...}
    gcp: {...}
    config_query_key: optional_query_key
    source_name: optional_source_name

  target:
    database_type: mysql|gcpbq
    mysql: {...}
    gcp: {...}
    config_query_key: optional_query_key
    source_name: optional_source_name

  other:
    validation_script: path/to/test_suite.yml
    preprocessor_queries: path/to/preprocessor_queries.yml

Database Configuration

MySQL

mysql:
  host: hostname
  port: 3306
  user: username
  password: password
  database: database_name

Google BigQuery

gcp:
  project_id: my-project
  dataset_id: my-dataset
  credentials_path: /path/to/service-account.json
  location: us-central1
  use_encryption: false

See CONFIGURATION.md for detailed configuration options.

Test Suite Definition

Each test case has the following structure:

- test_name:
    severity: critical|high|medium|low

    source:
      query: |
        SELECT COUNT(*) as value FROM table
      config_query_key: optional_key
      source_name: optional_source_name

    target:
      query: |
        SELECT COUNT(*) as value FROM table
      config_query_key: optional_key
      source_name: optional_source_name

    comparisons:
      expected: optional_expected_value
      threshold:
        value: percentage|absolute
        limit: number
      comment: "Description of this test"

Comparison Modes

1. Source vs Target Equality

comparisons:
  comment: "Values must match exactly"

2. Expected Value Check

comparisons:
  expected: ">=1000"
  comment: "Count must be at least 1000"

3. Percentage Threshold

comparisons:
  threshold:
    value: percentage
    limit: 5
  comment: "Target can vary up to 5% from source"

4. Absolute Difference

comparisons:
  threshold:
    value: absolute
    limit: 100
  comment: "Target can differ by max 100 units"
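Taken together, the four modes above amount to a small decision procedure. The sketch below is an illustrative reimplementation under assumed semantics (e.g. that expected strings such as ">=1000" carry an optional comparison operator), not the framework's actual comparator.py:

```python
import operator
import re

def compare(source, target=None, expected=None, threshold=None):
    """Illustrative comparison logic for the four modes above;
    assumed semantics, not the framework's comparator.py."""
    if expected is not None:
        # Expected value check: optional operator prefix, e.g. ">=1000".
        ops = {">=": operator.ge, "<=": operator.le, "==": operator.eq,
               ">": operator.gt, "<": operator.lt}
        match = re.match(r"\s*(>=|<=|==|>|<)?\s*(-?\d+(?:\.\d+)?)", str(expected))
        op = ops[match.group(1) or "=="]
        return op(source, float(match.group(2)))
    if threshold is not None:
        diff = abs(source - target)
        if threshold["value"] == "percentage":
            # Target may deviate from source by at most `limit` percent.
            return diff <= abs(source) * threshold["limit"] / 100
        # Absolute mode: at most `limit` units of difference.
        return diff <= threshold["limit"]
    # Default: source vs target equality.
    return source == target
```

For instance, compare(100, 104, threshold={"value": "percentage", "limit": 5}) passes, while a 6-unit difference on the same limit would not.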

Dynamic Dataset Replacement

Replace dataset placeholders with actual release names:

1. Create Preprocessor Queries File

Create preprocessor_queries.yml:

get_releases: |
  SELECT source, current_release, previous_release
  FROM release_metadata
  WHERE is_active = TRUE

get_bcbsa_releases: |
  SELECT 'bcbsa' as source, 'bcbsa_export1' as current_release, 'bcbsa_export3' as previous_release

2. Update Configuration

Add to config.yml:

other:
  validation_script: test_suite.yml
  preprocessor_queries: preprocessor_queries.yml

3. Update Test Suite

Use placeholders in queries and specify the preprocessor key:

- test_current_release:
    source:
      query: |
        SELECT COUNT(*) as value FROM BCBSA_CURR_WEEK.users
      config_query_key: get_bcbsa_releases
      source_name: bcbsa

The framework will:

  1. Execute get_bcbsa_releases query
  2. Get current_release value (bcbsa_export1)
  3. Replace BCBSA_CURR_WEEK with bcbsa_export1
  4. Run the modified query

See PREPROCESSOR.md for detailed examples.

Report Generation

ExecutionReport.html

Full test execution report with:

  • Test results (PASS/FAIL)
  • Source and target values
  • Execution timestamps
  • Query execution times
  • Comparison methods

FailedExecutionReport.html

Summary of failed tests, or confirmation that all tests passed.

ExecutionReport.csv

Structured test results for import into analysis tools:

  • Test name
  • Status
  • Severity
  • Source/Target values
  • Execution time
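Because the report is structured CSV, failed tests can be filtered with the standard library alone. The column name used below ("Status") is an assumption about the report layout, not a documented schema:

```python
import csv

def failed_tests(report_path):
    """Collect failed rows from ExecutionReport.csv. Assumes a
    "Status" column with values like "PASS"/"FAIL"; adjust to the
    actual report header."""
    with open(report_path, newline="") as f:
        return [row for row in csv.DictReader(f) if row.get("Status") == "FAIL"]
```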

AutomationData.csv

CI/CD integration data:

  • App name
  • Branch
  • Platform
  • Owner
  • Test report path

Environment Variables

Configure framework behavior using environment variables:

# Output directory for reports (default: ./output)
export DATAQE_OUTPUT_DIR=/path/to/output

# CI/CD metadata (used in AutomationData.csv)
export DATAQE_APP_NAME=my-app
export DATAQE_BRANCH=main
export DATAQE_PLATFORM=kubernetes
export DATAQE_OWNER=team-name
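On the consuming side, these variables would typically be read with defaults. A hedged sketch of that lookup (the framework's actual key handling and defaults may differ):

```python
import os

# Hypothetical consumption of the variables above; illustrative only.
output_dir = os.environ.get("DATAQE_OUTPUT_DIR", "./output")
automation_row = {
    "app": os.environ.get("DATAQE_APP_NAME", ""),
    "branch": os.environ.get("DATAQE_BRANCH", ""),
    "platform": os.environ.get("DATAQE_PLATFORM", ""),
    "owner": os.environ.get("DATAQE_OWNER", ""),
}
```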

Command Line Usage

Basic Execution

dataqe-run --config /path/to/config.yml

With Custom Output Directory

export DATAQE_OUTPUT_DIR=/custom/output
dataqe-run --config /path/to/config.yml

CI/CD Integration

export DATAQE_APP_NAME=ecommerce-platform
export DATAQE_BRANCH=feature-branch
export DATAQE_PLATFORM=kubernetes
export DATAQE_OWNER=data-team

dataqe-run --config /path/to/config.yml

Project Structure

dataqe-framework/
├── src/dataqe_framework/
│   ├── __init__.py
│   ├── cli.py                 # Command-line interface
│   ├── config_loader.py       # YAML config loading
│   ├── executor.py            # Test execution engine
│   ├── preprocessor.py        # Query preprocessing
│   ├── reporter.py            # Report generation
│   ├── comparison/
│   │   ├── comparator.py      # Comparison logic
│   │   └── threshold.py       # Threshold calculations
│   └── connectors/
│       ├── base_connector.py  # Base connector interface
│       ├── mysql_connector.py # MySQL implementation
│       └── bigquery_connector.py # BigQuery implementation
├── example_preprocessor_config.yml
├── example_preprocessor_queries.yml
├── example_preprocessor_test_script.yml
├── README.md
├── CONFIGURATION.md
├── PREPROCESSOR.md
└── pyproject.toml
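The connectors/ package is the extension point for additional databases. A hedged sketch of what the base interface might look like (method names are illustrative, not base_connector.py's actual API):

```python
from abc import ABC, abstractmethod

class BaseConnector(ABC):
    """Illustrative connector interface; the real base_connector.py
    may define different method names and signatures."""

    @abstractmethod
    def connect(self):
        """Open a connection from the config block for this database."""

    @abstractmethod
    def run_query(self, query):
        """Execute a query and return its result rows."""

    @abstractmethod
    def close(self):
        """Release the connection."""

class InMemoryConnector(BaseConnector):
    """Toy backend returning canned results, to show the shape a new
    connector (e.g. a future PostgreSQL implementation) would take."""
    def __init__(self, canned):
        self.canned = canned
    def connect(self):
        pass
    def run_query(self, query):
        return self.canned[query]
    def close(self):
        pass
```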

Examples

Example 1: Simple Row Count Validation

Test if row counts match between MySQL and BigQuery:

- users_row_count:
    severity: critical
    source:
      query: SELECT COUNT(*) as value FROM users
    target:
      query: SELECT COUNT(*) as value FROM users
    comparisons:
      comment: "User count must match exactly"

Example 2: Multi-Release Dataset Validation

Validate current and previous release datasets:

- current_release_sales:
    severity: high
    source:
      query: |
        SELECT SUM(amount) as value FROM BCBSA_CURR_WEEK.sales
      config_query_key: get_bcbsa_releases
      source_name: bcbsa

- previous_release_sales:
    severity: medium
    source:
      query: |
        SELECT SUM(amount) as value FROM BCBSA_PREV_WEEK.sales
      config_query_key: get_bcbsa_releases
      source_name: bcbsa

Example 3: Threshold-Based Comparison

Allow data variations within acceptable ranges:

- transaction_amounts:
    severity: high
    source:
      query: SELECT SUM(amount) as value FROM transactions
    target:
      query: SELECT SUM(amount) as value FROM transactions
    comparisons:
      threshold:
        value: percentage
        limit: 2
      comment: "Amounts must match within 2%"

Troubleshooting

Connection Issues

MySQL Connection Refused

# Check connectivity
mysql -h <host> -u <user> -p<password> <database>

# Verify in config.yml:
# - host is correct
# - port is 3306 (or custom port)
# - user/password are correct

BigQuery Authentication Failed

# Verify credentials file
gcloud auth application-default print-access-token

# Check in config.yml:
# - credentials_path points to valid service account JSON
# - credentials file has BigQuery permissions

Query Execution Issues

Query Timeout

  • Increase database timeout settings
  • Optimize query performance
  • Check database load

Dataset Not Found

  • For preprocessor queries: verify config_query_key matches a key in preprocessor_queries.yml
  • For dynamic replacement: verify placeholder format matches expected convention

Report Generation Issues

Output directory not writable

chmod -R 755 ./output

No output files generated

  • Check logs for errors
  • Verify DATAQE_OUTPUT_DIR has write permissions
  • Ensure test suite has valid queries

Performance Considerations

  • Large result sets: Memory usage scales with query result size
  • Many tests: Execution time is cumulative
  • Database load: Run during off-peak hours for production databases
  • Network latency: BigQuery queries may take longer than MySQL
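For the first point, memory can be bounded by fetching in batches rather than materializing the whole result set. A generic DB-API 2.0 sketch (illustrative; the framework's connectors may or may not already do this):

```python
def stream_rows(cursor, query, batch_size=10_000):
    """Fetch rows in bounded batches instead of fetchall(), so memory
    use stays proportional to batch_size rather than result size.
    Works with any DB-API 2.0 cursor."""
    cursor.execute(query)
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            return
        yield from batch
```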

Security

Sensitive Data Handling

  • Never commit credentials files
  • Use environment variables for secrets
  • Enable KMS encryption for PHI data in BigQuery

Best Practices

  • Use dedicated read-only database accounts
  • Limit query timeout duration
  • Monitor execution logs for suspicious patterns
  • Review generated reports for sensitive data exposure

Contributing

For bug reports and feature requests, please open an issue on the repository.

Installation via pip

From PyPI (Coming Soon)

pip install dataqe-framework

From GitHub

pip install git+https://github.com/ShaikKhadarmohiddin/dataqe-framework.git

From Source

git clone https://github.com/ShaikKhadarmohiddin/dataqe-framework.git
cd dataqe-framework
pip install -e .

Author

Khadar Shaik

License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License - You are free to use this project for personal, educational, or commercial purposes.

Support

For support and questions:

  • Check documentation in the project repository
  • Open an issue on GitHub Issues
  • Review troubleshooting section in GETTING_STARTED.md
  • Consult test output and logs for error details

Version History

0.0.1 (Initial Release)

  • Multi-database support (MySQL, BigQuery)
  • YAML-based test configuration
  • Flexible comparison modes
  • Dynamic dataset replacement
  • Comprehensive reporting
  • PHI data protection
  • CI/CD integration support



Download files

Download the file for your platform.

Source Distribution

dataqe_framework-0.2.0.tar.gz (26.1 kB)


Built Distribution


dataqe_framework-0.2.0-py3-none-any.whl (26.2 kB)


File details

Details for the file dataqe_framework-0.2.0.tar.gz.

File metadata

  • Download URL: dataqe_framework-0.2.0.tar.gz
  • Size: 26.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.7

File hashes

Hashes for dataqe_framework-0.2.0.tar.gz:

  • SHA256: b75c2bcafcc00498369a10329356f7f5df8386fe5cd221a012defafc796e81b0
  • MD5: cad0cf8e39653bd4993293cbf5bff001
  • BLAKE2b-256: 4d0f77536da5f96b52b317cb68e56bced9a836a62dc51cfad7c1f2ff89b6e2fc


File details

Details for the file dataqe_framework-0.2.0-py3-none-any.whl.

File hashes

Hashes for dataqe_framework-0.2.0-py3-none-any.whl:

  • SHA256: 477f9533f0c5714fd8606828d89b2f53c8119f3bb203b6c7c40f17ea4a987c0f
  • MD5: 676413a6ea9548f5dddcd550f47b9766
  • BLAKE2b-256: 2486e697b797aa4f43589c57a12655ff1d4ccff45b00d19984ffeaf7bb8b59fe

