Skip to main content

Reusable Data Validation Framework for data migration, ETL validation, and cross-database reconciliation

Project description

DataQE Framework - Data Quality and Equality Testing

A powerful Python framework for validating data quality and ensuring data consistency between source and target databases. Designed for data migration projects, ETL validation, and cross-database reconciliation.

Version: 0.3.2

Overview

DataQE Framework enables organizations to:

  • Validate data migration quality between different database systems
  • Ensure data consistency across source and target environments
  • Run comprehensive test suites with flexible comparison modes
  • Generate detailed reports for compliance and audit trails
  • Support dynamic dataset replacement for multi-release environments

Key Features

Multi-Database Support

  • MySQL - Relational database validation
  • Google BigQuery - Cloud data warehouse validation
  • Extensible connector architecture for adding more databases

Flexible Test Configuration

  • YAML-based test definitions
  • Single-source validation with expected conditions
  • Source vs Target equality checks
  • Threshold-based comparisons (percentage and absolute)
  • Support for multiple test cases in a single execution

Dynamic Dataset Replacement

  • Replace dataset placeholders with actual release names
  • Centralized configuration for dataset mappings
  • Support for multiple sources with different release versions

Comprehensive Reporting

  • ExecutionReport.html - Full test results with detailed execution times
  • FailedExecutionReport.html - Failed tests or confirmation of all tests passing
  • ExecutionReport.csv - Structured test results for further analysis
  • AutomationData.csv - CI/CD integration data
  • Real-time console output with progress tracking

Enterprise Features

  • PHI data protection with KMS encryption support
  • Detailed execution timing metrics
  • Environment-based configuration
  • Flexible credential management

Installation

Prerequisites

  • Python 3.8+
  • pip

Install from Source

git clone <repository-url>
cd dataqe-framework
pip install -e .

Verify Installation

dataqe-run --help

Quick Start

1. Create Configuration File

Create config.yml:

config_block_validation:
  source:
    database_type: mysql
    mysql:
      host: source-db.example.com
      port: 3306
      user: db_user
      password: db_password
      database: source_db

  target:
    database_type: gcpbq
    gcp:
      project_id: my-gcp-project
      dataset_id: target_dataset
      credentials_path: /path/to/credentials.json

  other:
    validation_script: test_suite.yml
    preprocessor_queries: preprocessor_queries.yml

2. Create Test Suite

Create test_suite.yml:

- test_row_count:
    severity: critical
    source:
      query: |
        SELECT COUNT(*) as value FROM users
    target:
      query: |
        SELECT COUNT(*) as value FROM users
    comparisons:
      comment: "User count must match between source and target"

- test_with_threshold:
    severity: high
    source:
      query: |
        SELECT SUM(amount) as value FROM transactions
    target:
      query: |
        SELECT SUM(amount) as value FROM transactions
    comparisons:
      threshold:
        value: percentage
        limit: 1
      comment: "Transaction amounts must match within 1%"

3. Run Validation

dataqe-run --config config.yml

Check output directory for reports:

./output/ExecutionReport.html
./output/ExecutionReport.csv
./output/FailedExecutionReport.html

Multi-Block Configuration Support

The framework supports running multiple configuration blocks in a single execution. This allows you to validate different database pairs or environments without creating separate config files.

What is a Block?

A configuration block is a complete validation setup with source database, target database, and test scripts. Each block starts with config_block_ followed by a unique identifier.

Execution Modes

1. Single Block (Default - Backward Compatible)

Executes the first valid block found:

dataqe-run --config config.yml

2. Specific Block

Execute a particular block by name:

dataqe-run --config config.yml --block validation_qa

3. All Blocks

Execute all valid blocks sequentially:

dataqe-run --config config.yml --all-blocks

Multi-Block Example

Create a single config file with multiple blocks for different environments:

# Validation for QA environment
config_block_validation_qa:
  source:
    database_type: mysql
    mysql:
      host: qa-mysql.example.com
      port: 3306
      user: db_user
      password: db_password
      database: qa_db

  target:
    database_type: gcpbq
    gcp:
      project_id: my-gcp-qa
      dataset_id: qa_dataset
      credentials_path: /path/to/credentials.json

  other:
    validation_script: tests/qa_tests.yml

# Validation for Production environment
config_block_validation_prod:
  source:
    database_type: mysql
    mysql:
      host: prod-mysql.example.com
      port: 3306
      user: db_user
      password: db_password
      database: prod_db

  target:
    database_type: gcpbq
    gcp:
      project_id: my-gcp-prod
      dataset_id: prod_dataset
      credentials_path: /path/to/credentials.json

  other:
    validation_script: tests/prod_tests.yml

Then execute:

# Run only QA validation
dataqe-run --config config.yml --block validation_qa

# Run only Production validation
dataqe-run --config config.yml --block validation_prod

# Run both sequentially
dataqe-run --config config.yml --all-blocks

Block Validation

Each block must contain:

  • source section with database configuration
  • target section with database configuration
  • other section with validation_script path

Invalid or incomplete blocks are automatically skipped with a warning.

Configuration

Config Block Structure

config_block_<name>:
  source:
    database_type: mysql|gcpbq
    mysql: {...}
    gcp: {...}
    config_query_key: optional_query_key
    source_name: optional_source_name

  target:
    database_type: mysql|gcpbq
    mysql: {...}
    gcp: {...}
    config_query_key: optional_query_key
    source_name: optional_source_name

  other:
    validation_script: path/to/test_suite.yml
    preprocessor_queries: path/to/preprocessor_queries.yml

Database Configuration

MySQL

mysql:
  host: hostname
  port: 3306
  user: username
  password: password
  database: database_name

Google BigQuery

gcp:
  project_id: my-project
  dataset_id: my-dataset
  credentials_path: /path/to/service-account.json
  location: us-central1
  use_encryption: false
  # Optional: Replace dataset placeholders (e.g., EDW_PRCD_PROJECT)
  replace_dataset:
    EDW_PRCD_PROJECT: actual-project-id-edw
    PD_CDW_METADATA: actual-project-id-pd

Replace Dataset Placeholders

Replace dataset placeholders with actual BigQuery project IDs. Two formats are supported:

Format 1: Configuration-driven lookup (Recommended)

Use project_name and dataset_name to auto-resolve project IDs from configuration:

config_block_example:
  source:
    database_type: gcpbq
    gcp:
      project_id: my-project
      dataset_id: staging
      credentials_path: /path/to/creds.json
      # Auto-resolve from castlight config
      replace_dataset:
        - project_name: "pd"
          dataset_name: "cdw_prcd_metadata"
        - project_name: "pd"
          dataset_name: "cdw_metadata"

  target:
    database_type: gcpbq
    gcp:
      project_id: my-project
      dataset_id: analytics
      credentials_path: /path/to/creds.json
      replace_dataset:
        - project_name: "edw"
          dataset_name: "prcd_metadata"

This generates placeholders from project_name and dataset_name:

  • project_name: "pd" + dataset_name: "cdw_prcd_metadata" → looks for PD_CDW_PRCD_METADATA in queries
  • project_name: "edw" + dataset_name: "prcd_metadata" → looks for EDW_PRCD_METADATA in queries

The actual project IDs are retrieved from: config_details.data['bigquery'][project_name]['datasets'][dataset_name]['project_id']

Format 2: Direct mapping (Legacy)

Manually specify placeholder to project ID mappings:

replace_dataset:
  EDW_PRCD_PROJECT: "data-warehouse-prod"
  PD_CDW_METADATA: "analytics-prod"

Test query example:

SELECT COUNT(*) as value
FROM PD_CDW_PRCD_METADATA.customers
JOIN EDW_PRCD_METADATA.addresses ON customers.id = addresses.customer_id

The placeholders are automatically replaced at runtime with the actual project IDs. Format 1 is preferred as it eliminates manual project ID maintenance.

See CONFIGURATION.md for detailed configuration options.

Test Suite Definition

Each test case has the following structure:

- test_name:
    severity: critical|high|medium|low

    source:
      query: |
        SELECT COUNT(*) as value FROM table
      config_query_key: optional_key
      source_name: optional_source_name

    target:
      query: |
        SELECT COUNT(*) as value FROM table
      config_query_key: optional_key
      source_name: optional_source_name

    comparisons:
      expected: optional_expected_value
      threshold:
        value: percentage|absolute
        limit: number
      comment: "Description of this test"

Comparison Modes

1. Source vs Target Equality

comparisons:
  comment: "Values must match exactly"

2. Expected Value Check

comparisons:
  expected: ">=1000"
  comment: "Count must be at least 1000"

3. Percentage Threshold

comparisons:
  threshold:
    value: percentage
    limit: 5
  comment: "Target can vary up to 5% from source"

4. Absolute Difference

comparisons:
  threshold:
    value: absolute
    limit: 100
  comment: "Target can differ by max 100 units"

Dynamic Dataset Replacement

Replace dataset placeholders with actual release names:

1. Create Preprocessor Queries File

Create preprocessor_queries.yml:

get_releases: |
  SELECT source, current_release, previous_release
  FROM release_metadata
  WHERE is_active = TRUE

get_bcbsa_releases: |
  SELECT 'bcbsa' as source, 'bcbsa_export1' as current_release, 'bcbsa_export3' as previous_release

2. Update Configuration

Add to config.yml:

other:
  validation_script: test_suite.yml
  preprocessor_queries: preprocessor_queries.yml

3. Update Test Suite

Use placeholders in queries and specify the preprocessor key:

- test_current_release:
    source:
      query: |
        SELECT COUNT(*) as value FROM BCBSA_CURR_WEEK.users
      config_query_key: get_bcbsa_releases
      source_name: bcbsa

The framework will:

  1. Execute get_bcbsa_releases query
  2. Get current_release value (bcbsa_export1)
  3. Replace BCBSA_CURR_WEEKbcbsa_export1
  4. Run the modified query

See PREPROCESSOR.md for detailed examples.

Report Generation

ExecutionReport.html

Full test execution report with:

  • Test results (PASS/FAIL)
  • Source and target values
  • Execution timestamps
  • Query execution times
  • Comparison methods

FailedExecutionReport.html

Summary of failed tests or confirmation of all tests passing

ExecutionReport.csv

Structured test results for import into analysis tools:

  • Test name
  • Status
  • Severity
  • Source/Target values
  • Execution time

AutomationData.csv

CI/CD integration data:

  • App name
  • Branch
  • Platform
  • Owner
  • Test report path

Environment Variables

Configure the framework behavior using environment variables:

# Output directory for reports (default: ./output)
export DATAQE_OUTPUT_DIR=/path/to/output

# CI/CD metadata (used in AutomationData.csv)
export DATAQE_APP_NAME=my-app
export DATAQE_BRANCH=main
export DATAQE_PLATFORM=kubernetes
export DATAQE_OWNER=team-name

Command Line Usage

Basic Execution (First Block)

dataqe-run --config /path/to/config.yml

Multi-Block Execution

# Execute a specific block by name
dataqe-run --config /path/to/config.yml --block validation_qa

# Execute all blocks in the config file
dataqe-run --config /path/to/config.yml --all-blocks

With Custom Output Directory

dataqe-run --config /path/to/config.yml --output-dir /custom/output

Or use environment variable:

export DATAQE_OUTPUT_DIR=/custom/output
dataqe-run --config /path/to/config.yml

With Variable Replacement

Replace variables in test scripts dynamically:

# Replace a single variable
dataqe-run --config /path/to/config.yml --replace "@employerID,5"

# Replace multiple variables
dataqe-run --config /path/to/config.yml \
  --replace "@employerID,5" \
  --replace "@storeID,10" \
  --replace "status,active"

# Set ENVIRONMENT (SPRING_PROFILES_ACTIVE env var, defaults to 'gcpqa')
export SPRING_PROFILES_ACTIVE=production
dataqe-run --config /path/to/config.yml

Variable Replacement Features:

  • ENVIRONMENT - Automatically replaced with SPRING_PROFILES_ACTIVE env var (default: gcpqa)
  • Custom variables - Use @variableName,value format in --replace
  • Multiple replacements - Use --replace multiple times
  • Works in queries - Replacements apply to all SQL queries in test scripts

Example test script with variables:

- test_count_by_environment:
    source:
      query: |
        SELECT COUNT(*) as value
        FROM MySQL_ENVIRONMENT.insurance_companies
        WHERE employer_id = @employerID

Running: dataqe-run --config config.yml --replace "@employerID,123"

Results in:

SELECT COUNT(*) as value
FROM MySQL_ENVIRONMENT.insurance_companies
WHERE employer_id = 123

With Version Check

dataqe-run --version

CI/CD Integration

export DATAQE_APP_NAME=ecommerce-platform
export DATAQE_BRANCH=feature-branch
export DATAQE_PLATFORM=kubernetes
export DATAQE_OWNER=data-team

dataqe-run --config /path/to/config.yml

Project Structure

dataqe-framework/
├── src/dataqe_framework/
│   ├── __init__.py
│   ├── cli.py                 # Command-line interface
│   ├── config_loader.py       # YAML config loading
│   ├── executor.py            # Test execution engine
│   ├── preprocessor.py        # Query preprocessing
│   ├── reporter.py            # Report generation
│   ├── comparison/
│   │   ├── comparator.py      # Comparison logic
│   │   └── threshold.py       # Threshold calculations
│   └── connectors/
│       ├── base_connector.py  # Base connector interface
│       ├── mysql_connector.py # MySQL implementation
│       └── bigquery_connector.py # BigQuery implementation
├── example_preprocessor_config.yml
├── example_preprocessor_queries.yml
├── example_preprocessor_test_script.yml
├── README.md
├── CONFIGURATION.md
├── PREPROCESSOR.md
└── pyproject.toml

Examples

Example 1: Simple Row Count Validation

Test if row counts match between MySQL and BigQuery:

- users_row_count:
    severity: critical
    source:
      query: SELECT COUNT(*) as value FROM users
    target:
      query: SELECT COUNT(*) as value FROM users
    comparisons:
      comment: "User count must match exactly"

Example 2: Multi-Release Dataset Validation

Validate current and previous release datasets:

- current_release_sales:
    severity: high
    source:
      query: |
        SELECT SUM(amount) as value FROM BCBSA_CURR_WEEK.sales
      config_query_key: get_bcbsa_releases
      source_name: bcbsa

- previous_release_sales:
    severity: medium
    source:
      query: |
        SELECT SUM(amount) as value FROM BCBSA_PREV_WEEK.sales
      config_query_key: get_bcbsa_releases
      source_name: bcbsa

Example 3: Threshold-Based Comparison

Allow data variations within acceptable ranges:

- transaction_amounts:
    severity: high
    source:
      query: SELECT SUM(amount) as value FROM transactions
    target:
      query: SELECT SUM(amount) as value FROM transactions
    comparisons:
      threshold:
        value: percentage
        limit: 2
      comment: "Amounts must match within 2%"

Troubleshooting

Connection Issues

MySQL Connection Refused

# Check connectivity
mysql -h <host> -u <user> -p<password> <database>

# Verify in config.yml:
# - host is correct
# - port is 3306 (or custom port)
# - user/password are correct

BigQuery Authentication Failed

# Verify credentials file
gcloud auth application-default print-access-token

# Check in config.yml:
# - credentials_path points to valid service account JSON
# - credentials file has BigQuery permissions

Query Execution Issues

Query Timeout

  • Increase database timeout settings
  • Optimize query performance
  • Check database load

Dataset Not Found

  • For preprocessor queries: verify config_query_key matches a key in preprocessor_queries.yml
  • For dynamic replacement: verify placeholder format matches expected convention

Report Generation Issues

Output directory not writable

chmod -R 755 ./output

No output files generated

  • Check logs for errors
  • Verify DATAQE_OUTPUT_DIR has write permissions
  • Ensure test suite has valid queries

Performance Considerations

  • Large result sets: Memory usage scales with query result size
  • Many tests: Execution time is cumulative
  • Database load: Run during off-peak hours for production databases
  • Network latency: BigQuery queries may take longer than MySQL

Security

Sensitive Data Handling

  • Never commit credentials files
  • Use environment variables for secrets
  • Enable KMS encryption for PHI data in BigQuery

Best Practices

  • Use dedicated read-only database accounts
  • Limit query timeout duration
  • Monitor execution logs for suspicious patterns
  • Review generated reports for sensitive data exposure

Contributing

For bug reports and feature requests, please open an issue on the repository.

Installation via pip

From PyPI (Coming Soon)

pip install dataqe-framework

From GitHub

pip install git+https://github.com/ShaikKhadarmohiddin/dataqe-framework.git

From Source

git clone https://github.com/ShaikKhadarmohiddin/dataqe-framework.git
cd dataqe-framework
pip install -e .

Author

Khadar Shaik

License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License - You are free to use this project for personal, educational, or commercial purposes.

Support

For support and questions:

  • Check documentation in the project repository
  • Open an issue on GitHub Issues
  • Review troubleshooting section in GETTING_STARTED.md
  • Consult test output and logs for error details

Version History

0.2.8 (Variable Replacement & Output Directory)

  • NEW: Variable replacement in test scripts (--replace option)
  • NEW: Automatic ENVIRONMENT replacement from SPRING_PROFILES_ACTIVE (default: gcpqa)
  • NEW: Custom output directory via CLI (--output-dir)
  • IMPROVED: Path resolution for validation scripts (from current working directory)
  • IMPROVED: Path resolution for preprocessor queries (from current working directory)
  • Support for multiple variable replacements in single execution
  • Backward compatible with existing configurations

0.2.7 (Multi-Block Configuration Support)

  • NEW: Execute multiple configuration blocks in a single run
  • NEW: CLI options for block selection (--block, --all-blocks)
  • NEW: Flexible execution modes (default, specific, all)
  • NEW: Backward compatible (default executes first block)
  • Enhanced block validation and error handling
  • Improved logging for block execution

0.2.6-0.2.5 (Preprocessor & Configuration Enhancements)

  • Automatic query preprocessor with config-level settings
  • Multi-release dataset validation
  • Enhanced configuration flexibility

0.2.4-0.0.1 (Foundation & Stability)

  • Multi-database support (MySQL, BigQuery)
  • YAML-based test configuration
  • Flexible comparison modes
  • Dynamic dataset replacement
  • Comprehensive reporting
  • PHI data protection
  • CI/CD integration support

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dataqe_framework-0.3.2.tar.gz (46.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

dataqe_framework-0.3.2-py3-none-any.whl (38.3 kB view details)

Uploaded Python 3

File details

Details for the file dataqe_framework-0.3.2.tar.gz.

File metadata

  • Download URL: dataqe_framework-0.3.2.tar.gz
  • Upload date:
  • Size: 46.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.7

File hashes

Hashes for dataqe_framework-0.3.2.tar.gz
Algorithm Hash digest
SHA256 b905383220428d88fece8d75aa5812591b799c49836f90842ee964ca9ea0ac59
MD5 cf3104aa673bf22db6030116c95c53cd
BLAKE2b-256 f894a8c4086028c17fce8a2a97dd6a376e1275d172f7aaed2b33e8609d45bd84

See more details on using hashes here.

File details

Details for the file dataqe_framework-0.3.2-py3-none-any.whl.

File metadata

File hashes

Hashes for dataqe_framework-0.3.2-py3-none-any.whl
Algorithm Hash digest
SHA256 092634d117f52011a1f682ef168d04c272435ef494721cabeeb27e4a39094108
MD5 2c9eb978a9d951d360e49dcb8aa84002
BLAKE2b-256 a33b193729b28f09780b69915d31e45afcf98b1cb79d69ef152121f64a649160

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page