Skip to main content

Data Quality Automation Framework with ML-powered anomaly detection

Project description

2QC+ Data Quality Automation Framework

Python 3.8+ License: MIT Tests

A comprehensive, production-ready framework for automated data quality control with machine learning-powered anomaly detection.

๐ŸŽฏ Overview

2QC+ is an open-source Python framework that provides two levels of data quality automation:

  • Level 1: Business rule validation (constraints, formats, statistical thresholds)
  • Level 2: ML-powered anomaly detection (correlations, temporal patterns, multivariate analysis, distributions)

Inspired by dbt's approach, 2QC+ offers a familiar CLI experience while providing advanced ML capabilities for comprehensive data quality monitoring.

โœจ Key Features

๐Ÿ”ง Easy to Use

  • dbt-like CLI: Familiar commands and workflow
  • YAML Configuration: Simple model and test definitions
  • Multi-Environment: Dev, staging, prod support
  • Auto-Discovery: Automatic model and test discovery

๐Ÿ—„๏ธ Multi-Database Support

  • PostgreSQL
  • Snowflake
  • BigQuery
  • Redshift
  • MySQL (experimental)

๐Ÿ“Š Comprehensive Testing

  • Level 1 Tests: Unique, not-null, email format, foreign keys, statistical thresholds
  • Level 2 ML Analysis: Correlation analysis, temporal patterns, multivariate outliers, distribution changes

๐Ÿšจ Smart Alerting

  • Multi-Channel: Email, Slack, Microsoft Teams
  • Severity-Based: Critical alerts for immediate attention
  • Rich Formatting: HTML emails, Slack cards, Teams notifications

๐Ÿ“ˆ Power BI Integration

  • Auto-Created Tables: quality_test_results, quality_run_summary, quality_anomalies
  • Historical Tracking: Trend analysis and reporting
  • Executive Dashboards: Ready-to-use Power BI templates

๐Ÿš€ Quick Start

Installation

pip install qc2plus

Initialize a Project

qc2plus init my_data_quality_project
cd my_data_quality_project

Configure Database Connection

Edit profiles.yml:

my_data_quality_project:
  target: dev
  outputs:
    dev:
      type: postgresql
      host: localhost
      port: 5432
      user: your_username
      password: your_password
      dbname: your_database
      schema: public
    prod:
      type: snowflake
      account: your_account
      user: your_username
      password: your_password
      role: your_role
      database: your_database
      warehouse: your_warehouse
      schema: public

Define Your First Model

Create models/customers.yml:

models:
  - name: customers
    description: Customer data quality tests
    qc2plus_tests:
      level1:
        - unique:
            column_name: customer_id
            severity: critical
        - not_null:
            column_name: email
            severity: critical
        - email_format:
            column_name: email
            severity: medium
        - statistical_threshold:
            metric: count
            threshold_type: relative
            threshold_value: 2.0
            severity: medium
      
      level2:
        correlation_analysis:
          variables: [daily_registrations, daily_activations]
          expected_correlation: 0.8
          threshold: 0.2
        
        temporal_analysis:
          date_column: created_at
          metrics: [count, avg_revenue]
          seasonality_check: true
        
        distribution_analysis:
          segments: [country, customer_type]
          metrics: [revenue, orders_count]

Run Quality Tests

# Test database connection
qc2plus test-connection --target dev

# Run all tests
qc2plus run --target dev

# Run specific model
qc2plus run --models customers --target dev

# Run only Level 1 tests
qc2plus run --level 1 --target dev

# Run with parallel execution
qc2plus run --threads 4 --target dev

๐Ÿ“‹ Available Test Types

Level 1 (Business Rules)

Test Type Description Parameters
unique Ensures column values are unique column_name, severity
not_null Ensures no null values column_name, severity
email_format Validates email format column_name, severity
foreign_key Checks referential integrity column_name, reference_table, reference_column, severity
future_date Ensures dates are not in the future column_name, severity
statistical_threshold Statistical anomaly detection metric, threshold_type, threshold_value, window_days, severity
accepted_values Validates against allowed values column_name, accepted_values, severity
range_check Validates numeric ranges column_name, min_value, max_value, severity

Level 2 (ML-Powered)

Analyzer Description Use Cases
Correlation Analysis Detects changes in variable relationships Revenue vs. marketing spend correlation breaks
Temporal Analysis Identifies time series anomalies Seasonal pattern disruptions, trend breaks
Multivariate Analysis Finds outliers in multi-dimensional space Complex fraud detection, system anomalies
Distribution Analysis Compares distributions across segments Geographic shifts, demographic changes

๐Ÿ”ง Configuration Examples

Statistical Threshold Test

statistical_threshold:
  column_name: daily_registrations
  metric: count
  threshold_type: relative  # or 'absolute'
  threshold_value: 2.0      # 2 standard deviations
  window_days: 30
  severity: medium

Correlation Analysis

correlation_analysis:
  variables: [ad_spend, conversions, revenue]
  expected_correlation: 0.7
  threshold: 0.2
  correlation_type: pearson  # or 'spearman'

Multivariate Analysis

multivariate_analysis:
  features: [revenue, orders, session_duration, page_views]
  contamination: 0.1
  algorithms: [isolation_forest, lof, pca]
  min_samples: 100

๐Ÿšจ Alerting Configuration

Add to your project's qc2plus_project.yml:

alerting:
  enabled_channels: [email, slack, teams]
  
  thresholds:
    critical_failure_threshold: 1
    failure_rate_threshold: 0.2
    individual_alerts: [critical]
    summary_alerts: [high, medium, low]
  
  email:
    enabled: true
    smtp_server: smtp.gmail.com
    smtp_port: 587
    username: your_email@gmail.com
    password: your_app_password
    to_emails: [team@company.com, alerts@company.com]
  
  slack:
    enabled: true
    webhook_url: https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK
  
  teams:
    enabled: true
    webhook_url: https://company.webhook.office.com/webhookb2/YOUR/TEAMS/WEBHOOK

๐Ÿ“Š Power BI Integration

2QC+ automatically creates three tables for Power BI reporting:

1. quality_test_results

Individual test results with details about failures and execution times.

2. quality_run_summary

High-level summary of each test run including overall success rates and execution duration.

3. quality_anomalies

Detailed information about Level 2 ML-detected anomalies with severity scores.

Connect Power BI to your database and use these tables to create:

  • Executive quality dashboards
  • Trend analysis reports
  • Anomaly investigation views
  • Model performance tracking

๐Ÿ—๏ธ Architecture

qc2plus/
โ”œโ”€โ”€ qc2plus/
โ”‚   โ”œโ”€โ”€ cli.py              # Command-line interface
โ”‚   โ”œโ”€โ”€ core/
โ”‚   โ”‚   โ”œโ”€โ”€ project.py      # Project management
โ”‚   โ”‚   โ”œโ”€โ”€ connection.py   # Multi-database support
โ”‚   โ”‚   โ””โ”€โ”€ runner.py       # Test orchestration
โ”‚   โ”œโ”€โ”€ level1/
โ”‚   โ”‚   โ”œโ”€โ”€ engine.py       # Business rule engine
โ”‚   โ”‚   โ””โ”€โ”€ macros.py       # SQL templates
โ”‚   โ”œโ”€โ”€ level2/
โ”‚   โ”‚   โ”œโ”€โ”€ correlation.py  # Correlation analysis
โ”‚   โ”‚   โ”œโ”€โ”€ temporal.py     # Time series analysis
โ”‚   โ”‚   โ”œโ”€โ”€ multivariate.py # Multivariate outlier detection
โ”‚   โ”‚   โ””โ”€โ”€ distribution.py # Distribution comparison
โ”‚   โ”œโ”€โ”€ alerting/
โ”‚   โ”‚   โ””โ”€โ”€ alerts.py       # Multi-channel alerting
โ”‚   โ””โ”€โ”€ output/
โ”‚       โ””โ”€โ”€ persistence.py  # Database persistence
โ”œโ”€โ”€ setup.py
โ””โ”€โ”€ requirements.txt

๐Ÿ”ฌ Advanced Usage

Custom SQL Tests

custom_sql:
  sql: |
    SELECT customer_id, COUNT(*) as violation_count
    FROM customers 
    WHERE email IS NULL OR email NOT LIKE '%@%'
    HAVING COUNT(*) > 0
  severity: critical

Parallel Execution

# Run tests in parallel for faster execution
qc2plus run --threads 8 --target prod

Environment-Specific Configuration

# qc2plus_project.yml
vars:
  dev:
    statistical_threshold_sensitivity: 3.0
  prod:
    statistical_threshold_sensitivity: 2.0

Compilation Without Execution

# Compile tests to SQL for review
qc2plus compile

๐Ÿงช Testing Your Setup

# Test alert configurations
qc2plus test-alerts

# Validate project configuration
qc2plus validate

# List all available models
qc2plus list-models

๐Ÿ“ˆ Performance Tips

  1. Use Parallel Execution: Set --threads based on your database capabilities
  2. Optimize Windows: Adjust window_days for statistical tests based on data volume
  3. Segment Wisely: Choose segments that balance granularity with performance
  4. Index Optimization: Ensure proper indexing on date and key columns
  5. Batch Scheduling: Run during low-traffic periods for production systems

๐Ÿ› Troubleshooting

Common Issues

Database Connection Fails

qc2plus test-connection --target dev

Tests Not Found

qc2plus list-models --output-format table

Memory Issues with Large Datasets

  • Reduce window_days for statistical tests
  • Increase min_samples threshold for ML tests
  • Use sampling in your models

Slow Execution

  • Enable parallel execution with --threads
  • Optimize database queries and indexes
  • Consider pre-aggregated tables for large datasets

Debug Mode

# Run with detailed logging
QC2PLUS_LOG_LEVEL=DEBUG qc2plus run --models customers

๐Ÿค Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Setup

git clone https://github.com/qc2plus/qc2plus.git
cd qc2plus
pip install -e ".[dev]"
pytest tests/

๐Ÿ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™‹โ€โ™€๏ธ Support

๐ŸŽ‰ Acknowledgments

  • Inspired by dbt for the CLI and project structure approach
  • Built on top of excellent open-source libraries: SQLAlchemy, scikit-learn, pandas
  • Thanks to the data quality community for feedback and contributions

Made with โค๏ธ by the 2QC+ Team

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

qc2plus-1.0.0.tar.gz (94.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

qc2plus-1.0.0-py3-none-any.whl (66.1 kB view details)

Uploaded Python 3

File details

Details for the file qc2plus-1.0.0.tar.gz.

File metadata

  • Download URL: qc2plus-1.0.0.tar.gz
  • Upload date:
  • Size: 94.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for qc2plus-1.0.0.tar.gz
Algorithm Hash digest
SHA256 74d33e3a7e1556f217741adde568aa5b11d7b5e81f1ba0b570b6fcc3ab5bfb88
MD5 9626d29855aebe99759aa361e0a7b809
BLAKE2b-256 99723b11330c36cac8979afa615fc1740223e30b9fd97a2e7bb057622dd9b1aa

See more details on using hashes here.

File details

Details for the file qc2plus-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: qc2plus-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 66.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for qc2plus-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 3c5ac636fe298fb85db859980f75274cdcdd195228d5ad815aa5e03ebfd74ec8
MD5 d14acb5513d1d8c2a0a3892591aebda4
BLAKE2b-256 e58f2bc8eb267c15994d73834357d7bc9fe89f05f8e533d70a4fc9f674044acd

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page