
Reusable Data Validation Framework for data migration, ETL validation, and cross-database reconciliation

Project description

DataQE Framework - Data Quality and Equality Testing

A powerful Python framework for validating data quality and ensuring data consistency between source and target databases. Designed for data migration projects, ETL validation, and cross-database reconciliation.

Version: 0.0.1

Overview

DataQE Framework enables organizations to:

  • Validate data migration quality between different database systems
  • Ensure data consistency across source and target environments
  • Run comprehensive test suites with flexible comparison modes
  • Generate detailed reports for compliance and audit trails
  • Support dynamic dataset replacement for multi-release environments

Key Features

Multi-Database Support

  • MySQL - Relational database validation
  • Google BigQuery - Cloud data warehouse validation
  • Extensible connector architecture for adding more databases

Flexible Test Configuration

  • YAML-based test definitions
  • Single-source validation with expected conditions
  • Source vs Target equality checks
  • Threshold-based comparisons (percentage and absolute)
  • Support for multiple test cases in a single execution

Dynamic Dataset Replacement

  • Replace dataset placeholders with actual release names
  • Centralized configuration for dataset mappings
  • Support for multiple sources with different release versions

Comprehensive Reporting

  • ExecutionReport.html - Full test results with detailed execution times
  • FailedExecutionReport.html - Failed tests or confirmation of all tests passing
  • ExecutionReport.csv - Structured test results for further analysis
  • AutomationData.csv - CI/CD integration data
  • Real-time console output with progress tracking

Enterprise Features

  • PHI data protection with KMS encryption support
  • Detailed execution timing metrics
  • Environment-based configuration
  • Flexible credential management

Installation

Prerequisites

  • Python 3.8+
  • pip

Install from Source

git clone <repository-url>
cd dataqe-framework
pip install -e .

Verify Installation

dataqe-run --help

Quick Start

1. Create Configuration File

Create config.yml:

config_block_validation:
  source:
    database_type: mysql
    mysql:
      host: source-db.example.com
      port: 3306
      user: db_user
      password: db_password
      database: source_db

  target:
    database_type: gcpbq
    gcp:
      project_id: my-gcp-project
      dataset_id: target_dataset
      credentials_path: /path/to/credentials.json

  other:
    validation_script: test_suite.yml
    preprocessor_queries: preprocessor_queries.yml

2. Create Test Suite

Create test_suite.yml:

- test_row_count:
    severity: critical
    source:
      query: |
        SELECT COUNT(*) as value FROM users
    target:
      query: |
        SELECT COUNT(*) as value FROM users
    comparisons:
      comment: "User count must match between source and target"

- test_with_threshold:
    severity: high
    source:
      query: |
        SELECT SUM(amount) as value FROM transactions
    target:
      query: |
        SELECT SUM(amount) as value FROM transactions
    comparisons:
      threshold:
        value: percentage
        limit: 1
      comment: "Transaction amounts must match within 1%"

3. Run Validation

dataqe-run --config config.yml

Check output directory for reports:

./output/ExecutionReport.html
./output/ExecutionReport.csv
./output/FailedExecutionReport.html

Configuration

Config Block Structure

config_block_<name>:
  source:
    database_type: mysql|gcpbq
    mysql: {...}
    gcp: {...}
    config_query_key: optional_query_key
    source_name: optional_source_name

  target:
    database_type: mysql|gcpbq
    mysql: {...}
    gcp: {...}
    config_query_key: optional_query_key
    source_name: optional_source_name

  other:
    validation_script: path/to/test_suite.yml
    preprocessor_queries: path/to/preprocessor_queries.yml

Database Configuration

MySQL

mysql:
  host: hostname
  port: 3306
  user: username
  password: password
  database: database_name

Google BigQuery

gcp:
  project_id: my-project
  dataset_id: my-dataset
  credentials_path: /path/to/service-account.json
  location: us-central1
  use_encryption: false

See CONFIGURATION.md for detailed configuration options.

Test Suite Definition

Each test case has the following structure:

- test_name:
    severity: critical|high|medium|low

    source:
      query: |
        SELECT COUNT(*) as value FROM table
      config_query_key: optional_key
      source_name: optional_source_name

    target:
      query: |
        SELECT COUNT(*) as value FROM table
      config_query_key: optional_key
      source_name: optional_source_name

    comparisons:
      expected: optional_expected_value
      threshold:
        value: percentage|absolute
        limit: number
      comment: "Description of this test"

Comparison Modes

1. Source vs Target Equality

comparisons:
  comment: "Values must match exactly"

2. Expected Value Check

comparisons:
  expected: ">=1000"
  comment: "Count must be at least 1000"

3. Percentage Threshold

comparisons:
  threshold:
    value: percentage
    limit: 5
  comment: "Target can vary up to 5% from source"

4. Absolute Difference

comparisons:
  threshold:
    value: absolute
    limit: 100
  comment: "Target can differ by max 100 units"
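Taken together, the four modes above amount to a small decision procedure. The sketch below is an illustrative reimplementation under assumed semantics (e.g. that expected strings such as ">=1000" carry an optional comparison operator), not the framework's actual comparator.py:

```python
import operator
import re

def compare(source, target=None, expected=None, threshold=None):
    """Illustrative comparison logic for the four modes above;
    assumed semantics, not the framework's comparator.py."""
    if expected is not None:
        # Expected value check: optional operator prefix, e.g. ">=1000".
        ops = {">=": operator.ge, "<=": operator.le, "==": operator.eq,
               ">": operator.gt, "<": operator.lt}
        match = re.match(r"\s*(>=|<=|==|>|<)?\s*(-?\d+(?:\.\d+)?)", str(expected))
        op = ops[match.group(1) or "=="]
        return op(source, float(match.group(2)))
    if threshold is not None:
        diff = abs(source - target)
        if threshold["value"] == "percentage":
            # Target may deviate from source by at most `limit` percent.
            return diff <= abs(source) * threshold["limit"] / 100
        # Absolute mode: at most `limit` units of difference.
        return diff <= threshold["limit"]
    # Default: source vs target equality.
    return source == target
```

For instance, compare(100, 104, threshold={"value": "percentage", "limit": 5}) passes, while a 6-unit difference on the same limit would not.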

Dynamic Dataset Replacement

Replace dataset placeholders with actual release names:

1. Create Preprocessor Queries File

Create preprocessor_queries.yml:

get_releases: |
  SELECT source, current_release, previous_release
  FROM release_metadata
  WHERE is_active = TRUE

get_bcbsa_releases: |
  SELECT 'bcbsa' as source, 'bcbsa_export1' as current_release, 'bcbsa_export3' as previous_release

2. Update Configuration

Add to config.yml:

other:
  validation_script: test_suite.yml
  preprocessor_queries: preprocessor_queries.yml

3. Update Test Suite

Use placeholders in queries and specify the preprocessor key:

- test_current_release:
    source:
      query: |
        SELECT COUNT(*) as value FROM BCBSA_CURR_WEEK.users
      config_query_key: get_bcbsa_releases
      source_name: bcbsa

The framework will:

  1. Execute get_bcbsa_releases query
  2. Get current_release value (bcbsa_export1)
  3. Replace BCBSA_CURR_WEEK with bcbsa_export1
  4. Run the modified query

See PREPROCESSOR.md for detailed examples.

Report Generation

ExecutionReport.html

Full test execution report with:

  • Test results (PASS/FAIL)
  • Source and target values
  • Execution timestamps
  • Query execution times
  • Comparison methods

FailedExecutionReport.html

Summary of failed tests, or confirmation that all tests passed.

ExecutionReport.csv

Structured test results for import into analysis tools:

  • Test name
  • Status
  • Severity
  • Source/Target values
  • Execution time
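Because the report is structured CSV, failed tests can be filtered with the standard library alone. The column name used below ("Status") is an assumption about the report layout, not a documented schema:

```python
import csv

def failed_tests(report_path):
    """Collect failed rows from ExecutionReport.csv. Assumes a
    "Status" column with values like "PASS"/"FAIL"; adjust to the
    actual report header."""
    with open(report_path, newline="") as f:
        return [row for row in csv.DictReader(f) if row.get("Status") == "FAIL"]
```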

AutomationData.csv

CI/CD integration data:

  • App name
  • Branch
  • Platform
  • Owner
  • Test report path

Environment Variables

Configure framework behavior using environment variables:

# Output directory for reports (default: ./output)
export DATAQE_OUTPUT_DIR=/path/to/output

# CI/CD metadata (used in AutomationData.csv)
export DATAQE_APP_NAME=my-app
export DATAQE_BRANCH=main
export DATAQE_PLATFORM=kubernetes
export DATAQE_OWNER=team-name
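On the consuming side, these variables would typically be read with defaults. A hedged sketch of that lookup (the framework's actual key handling and defaults may differ):

```python
import os

# Hypothetical consumption of the variables above; illustrative only.
output_dir = os.environ.get("DATAQE_OUTPUT_DIR", "./output")
automation_row = {
    "app": os.environ.get("DATAQE_APP_NAME", ""),
    "branch": os.environ.get("DATAQE_BRANCH", ""),
    "platform": os.environ.get("DATAQE_PLATFORM", ""),
    "owner": os.environ.get("DATAQE_OWNER", ""),
}
```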

Command Line Usage

Basic Execution

dataqe-run --config /path/to/config.yml

With Custom Output Directory

export DATAQE_OUTPUT_DIR=/custom/output
dataqe-run --config /path/to/config.yml

CI/CD Integration

export DATAQE_APP_NAME=ecommerce-platform
export DATAQE_BRANCH=feature-branch
export DATAQE_PLATFORM=kubernetes
export DATAQE_OWNER=data-team

dataqe-run --config /path/to/config.yml

Project Structure

dataqe-framework/
├── src/dataqe_framework/
│   ├── __init__.py
│   ├── cli.py                 # Command-line interface
│   ├── config_loader.py       # YAML config loading
│   ├── executor.py            # Test execution engine
│   ├── preprocessor.py        # Query preprocessing
│   ├── reporter.py            # Report generation
│   ├── comparison/
│   │   ├── comparator.py      # Comparison logic
│   │   └── threshold.py       # Threshold calculations
│   └── connectors/
│       ├── base_connector.py  # Base connector interface
│       ├── mysql_connector.py # MySQL implementation
│       └── bigquery_connector.py # BigQuery implementation
├── example_preprocessor_config.yml
├── example_preprocessor_queries.yml
├── example_preprocessor_test_script.yml
├── README.md
├── CONFIGURATION.md
├── PREPROCESSOR.md
└── pyproject.toml
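The connectors/ package is the extension point for additional databases. A hedged sketch of what the base interface might look like (method names are illustrative, not base_connector.py's actual API):

```python
from abc import ABC, abstractmethod

class BaseConnector(ABC):
    """Illustrative connector interface; the real base_connector.py
    may define different method names and signatures."""

    @abstractmethod
    def connect(self):
        """Open a connection from the config block for this database."""

    @abstractmethod
    def run_query(self, query):
        """Execute a query and return its result rows."""

    @abstractmethod
    def close(self):
        """Release the connection."""

class InMemoryConnector(BaseConnector):
    """Toy backend returning canned results, to show the shape a new
    connector (e.g. a future PostgreSQL implementation) would take."""
    def __init__(self, canned):
        self.canned = canned
    def connect(self):
        pass
    def run_query(self, query):
        return self.canned[query]
    def close(self):
        pass
```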

Examples

Example 1: Simple Row Count Validation

Test if row counts match between MySQL and BigQuery:

- users_row_count:
    severity: critical
    source:
      query: SELECT COUNT(*) as value FROM users
    target:
      query: SELECT COUNT(*) as value FROM users
    comparisons:
      comment: "User count must match exactly"

Example 2: Multi-Release Dataset Validation

Validate current and previous release datasets:

- current_release_sales:
    severity: high
    source:
      query: |
        SELECT SUM(amount) as value FROM BCBSA_CURR_WEEK.sales
      config_query_key: get_bcbsa_releases
      source_name: bcbsa

- previous_release_sales:
    severity: medium
    source:
      query: |
        SELECT SUM(amount) as value FROM BCBSA_PREV_WEEK.sales
      config_query_key: get_bcbsa_releases
      source_name: bcbsa

Example 3: Threshold-Based Comparison

Allow data variations within acceptable ranges:

- transaction_amounts:
    severity: high
    source:
      query: SELECT SUM(amount) as value FROM transactions
    target:
      query: SELECT SUM(amount) as value FROM transactions
    comparisons:
      threshold:
        value: percentage
        limit: 2
      comment: "Amounts must match within 2%"

Troubleshooting

Connection Issues

MySQL Connection Refused

# Check connectivity
mysql -h <host> -u <user> -p<password> <database>

# Verify in config.yml:
# - host is correct
# - port is 3306 (or custom port)
# - user/password are correct

BigQuery Authentication Failed

# Verify credentials file
gcloud auth application-default print-access-token

# Check in config.yml:
# - credentials_path points to valid service account JSON
# - credentials file has BigQuery permissions

Query Execution Issues

Query Timeout

  • Increase database timeout settings
  • Optimize query performance
  • Check database load

Dataset Not Found

  • For preprocessor queries: verify config_query_key matches a key in preprocessor_queries.yml
  • For dynamic replacement: verify placeholder format matches expected convention

Report Generation Issues

Output directory not writable

chmod -R 755 ./output

No output files generated

  • Check logs for errors
  • Verify DATAQE_OUTPUT_DIR has write permissions
  • Ensure test suite has valid queries

Performance Considerations

  • Large result sets: Memory usage scales with query result size
  • Many tests: Execution time is cumulative
  • Database load: Run during off-peak hours for production databases
  • Network latency: BigQuery queries may take longer than MySQL
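For the first point, memory can be bounded by fetching in batches rather than materializing the whole result set. A generic DB-API 2.0 sketch (illustrative; the framework's connectors may or may not already do this):

```python
def stream_rows(cursor, query, batch_size=10_000):
    """Fetch rows in bounded batches instead of fetchall(), so memory
    use stays proportional to batch_size rather than result size.
    Works with any DB-API 2.0 cursor."""
    cursor.execute(query)
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            return
        yield from batch
```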

Security

Sensitive Data Handling

  • Never commit credentials files
  • Use environment variables for secrets
  • Enable KMS encryption for PHI data in BigQuery

Best Practices

  • Use dedicated read-only database accounts
  • Limit query timeout duration
  • Monitor execution logs for suspicious patterns
  • Review generated reports for sensitive data exposure

Contributing

For bug reports and feature requests, please open an issue on the repository.

Installation via pip

From PyPI (Coming Soon)

pip install dataqe-framework

From GitHub

pip install git+https://github.com/ShaikKhadarmohiddin/dataqe-framework.git

From Source

git clone https://github.com/ShaikKhadarmohiddin/dataqe-framework.git
cd dataqe-framework
pip install -e .

Author

Khadar Shaik

License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License - You are free to use this project for personal, educational, or commercial purposes.

Support

For support and questions:

  • Check documentation in the project repository
  • Open an issue on GitHub Issues
  • Review troubleshooting section in GETTING_STARTED.md
  • Consult test output and logs for error details

Version History

0.0.1 (Initial Release)

  • Multi-database support (MySQL, BigQuery)
  • YAML-based test configuration
  • Flexible comparison modes
  • Dynamic dataset replacement
  • Comprehensive reporting
  • PHI data protection
  • CI/CD integration support



Download files

Download the file for your platform.

Source Distribution

dataqe_framework-0.2.0.tar.gz (26.1 kB)


Built Distribution


dataqe_framework-0.2.0-py3-none-any.whl (26.2 kB)


File details

Details for the file dataqe_framework-0.2.0.tar.gz.

File metadata

  • Download URL: dataqe_framework-0.2.0.tar.gz
  • Size: 26.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.7

File hashes

Hashes for dataqe_framework-0.2.0.tar.gz:

  • SHA256: b75c2bcafcc00498369a10329356f7f5df8386fe5cd221a012defafc796e81b0
  • MD5: cad0cf8e39653bd4993293cbf5bff001
  • BLAKE2b-256: 4d0f77536da5f96b52b317cb68e56bced9a836a62dc51cfad7c1f2ff89b6e2fc


File details

Details for the file dataqe_framework-0.2.0-py3-none-any.whl.

File hashes

Hashes for dataqe_framework-0.2.0-py3-none-any.whl:

  • SHA256: 477f9533f0c5714fd8606828d89b2f53c8119f3bb203b6c7c40f17ea4a987c0f
  • MD5: 676413a6ea9548f5dddcd550f47b9766
  • BLAKE2b-256: 2486e697b797aa4f43589c57a12655ff1d4ccff45b00d19984ffeaf7bb8b59fe

