Data Quality Automation Framework with ML-powered anomaly detection

QC2Plus - Advanced Data Quality Framework

Python 3.8+ | License: MIT | Code style: black

Production-ready data quality framework with ML-powered anomaly detection


🎯 What is QC2Plus?

QC2Plus is an open-source Python framework for automated data quality testing, combining traditional SQL-based validation with advanced machine learning anomaly detection.

Two-Level Quality Approach

Level 1: SQL-Based Validation 🔍

  • Business rules (unique, not_null, relationship)
  • Format validation (email, phone, dates)
  • Statistical thresholds (detect metric anomalies)
  • Custom SQL tests

Level 2: ML-Based Anomaly Detection 🤖

  • Correlation shifts between variables
  • Temporal pattern changes
  • Distribution drift across segments
  • Smart contextual filtering

Why QC2Plus?

| Feature | Traditional Tools | QC2Plus |
|---|---|---|
| Setup Time | Hours to days | Minutes |
| Anomaly Detection | Rule-based only | ML-powered |
| Alerting | Basic notifications | Multi-channel with context |
| Monitoring | Standalone | Power BI integration |
| Learning Curve | Steep | dbt-like CLI |

✨ Features

🚀 Easy to Use

  • dbt-inspired CLI: Familiar qc2plus run, qc2plus test commands
  • YAML Configuration: Simple model and test definitions
  • Auto-Discovery: Automatically finds models in your project
  • Multi-Environment: Separate configs for dev, staging, prod

🗄️ Database Support

| Database | Support Level | Installation |
|---|---|---|
| PostgreSQL | ✅ Stable | Included |
| Snowflake | ✅ Stable | pip install qc2plus[snowflake] |
| BigQuery | ✅ Stable | pip install qc2plus[bigquery] |
| Redshift | ⚠️ Beta | pip install qc2plus[redshift] |

📊 Comprehensive Testing

Level 1 Tests (8 built-in types):

  • unique, not_null, accepted_values
  • relationship, range_check
  • email_format, future_date
  • statistical_threshold (ML-powered)

Level 2 Analyzers (3 ML algorithms):

  • Correlation Analyzer: Detect relationship changes
  • Temporal Analyzer: Find time series anomalies
  • Distribution Analyzer: Monitor segment shifts
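To make "segment shifts" concrete, here is a minimal Python sketch that compares how rows are spread across segments in a baseline period and a current period. The source does not describe the algorithm the Distribution Analyzer uses, so total variation distance is an assumption chosen for illustration, and `segment_shares` and `total_variation_distance` are hypothetical names, not QC2Plus functions.

```python
from collections import Counter


def segment_shares(values):
    """Return the share of rows falling into each segment."""
    counts = Counter(values)
    total = sum(counts.values())
    return {segment: n / total for segment, n in counts.items()}


def total_variation_distance(baseline, current):
    """Half the sum of absolute share differences (0 = identical, 1 = disjoint)."""
    p = segment_shares(baseline)
    q = segment_shares(current)
    segments = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in segments)


# Illustrative data: the country mix moves from mostly FR to mostly US.
baseline = ["FR"] * 50 + ["US"] * 30 + ["DE"] * 20
current = ["FR"] * 20 + ["US"] * 60 + ["DE"] * 20
shift = total_variation_distance(baseline, current)
```

A monitoring job could compare `shift` against a tolerance of its choosing and record an anomaly when the tolerance is exceeded.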

🔔 Smart Alerting

  • Channels: Email (SMTP), Slack, Microsoft Teams
  • Severity Levels: Critical, High, Medium, Low
  • Smart Routing: Individual alerts for critical, summaries for others
  • Rich Formatting: HTML emails, Slack cards, Teams adaptive cards
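The routing rule above (individual alerts for critical findings, one summary for everything else) can be sketched as follows. This only illustrates the idea; `Finding` and `route_alerts` are hypothetical names and the message wording is made up, since the source does not show how QC2Plus implements routing.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Finding:
    test_name: str
    severity: str  # one of: critical, high, medium, low


def route_alerts(findings: List[Finding]) -> Tuple[List[str], str]:
    """Return (one message per critical finding, one summary for the rest)."""
    individual = [
        f"CRITICAL: {f.test_name} failed"
        for f in findings
        if f.severity == "critical"
    ]
    # Count the non-critical findings per severity for the summary line.
    counts = {}
    for f in findings:
        if f.severity != "critical":
            counts[f.severity] = counts.get(f.severity, 0) + 1
    summary = ", ".join(f"{n} {sev}" for sev, n in sorted(counts.items()))
    return individual, summary


findings = [
    Finding("unique_customer_id", "critical"),
    Finding("email_format_email", "high"),
    Finding("accepted_values_status", "medium"),
    Finding("range_check_age", "high"),
]
individual, summary = route_alerts(findings)
```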

📈 Power BI Ready

Three auto-created tables for instant dashboards:

  • quality_test_results - Individual test outcomes
  • quality_run_summary - Run-level metrics
  • quality_anomalies - ML-detected anomalies with details

📦 Installation

pip install qc2plus

🏁 Quick Start

1. Initialize Project

qc2plus init my_quality_project
cd my_quality_project

This creates:

my_quality_project/
├── qc2plus_project.yml    # Project config
├── profiles.yml            # Database connections
├── models/                 # Test definitions
│   └── customers.yml       # Example model
└── README.md               # Getting started guide

2. Configure Database

Edit profiles.yml:

my_quality_project:
  target: dev
  outputs:
    dev:
      data_source:              # Where your data lives
        type: postgresql
        host: localhost
        port: 5432
        user: ${DB_USER}        # Use env variables!
        password: ${DB_PASSWORD}
        dbname: analytics
        schema: public
      
      quality_output:            # Where results are stored
        type: postgresql
        host: localhost
        port: 5432
        dbname: quality_db
        schema: qc2plus

Security Best Practice: Use environment variables for credentials!
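For readers curious how `${DB_USER}`-style placeholders are typically resolved, here is a minimal Python sketch. It is not QC2Plus's own loader (the source does not show it); `expand_env` is a hypothetical helper that fails loudly when a variable is missing, rather than silently connecting with an empty password.

```python
import os
import re

# Matches ${NAME} where NAME is a valid environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value, env=None):
    """Replace every ${NAME} in value with env[NAME]; raise if NAME is unset."""
    env = os.environ if env is None else env

    def replace(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(replace, value)


# Illustrative values only; real credentials come from the process environment.
example_env = {"DB_USER": "analyst", "DB_PASSWORD": "s3cret"}
expanded = expand_env("${DB_USER}:${DB_PASSWORD}@localhost", example_env)
```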

3. Define Tests

Edit models/customers.yml:

models:
  - name: customers
    description: Customer data quality tests
    
    qc2plus_tests:
      # Level 1: Business Rules
      level1:
        - unique:
            column_name: customer_id
            severity: critical
        
        - not_null:
            column_name: email
            severity: critical
        
        - email_format:
            column_name: email
            severity: high
        
        - accepted_values:
            column_name: status
            accepted_values: ['active', 'inactive', 'churned']
            severity: medium
        
        - statistical_threshold:
            metric: count
            threshold_type: relative
            threshold_value: 2.0     # 2 std deviations
            window_days: 30
            severity: high
      
      # Level 2: ML Anomaly Detection
      level2:
        correlation_analysis:
          variables: [lifetime_value, order_count, avg_order_value]
          expected_correlation: 0.8
          threshold: 0.2
        
        temporal_analysis:
          date_column: created_at
          metrics: [count, avg_lifetime_value]
          seasonality_check: true
        
        distribution_analysis:
          segments: [country, customer_type]
          metrics: [lifetime_value, order_count]
          date_column: date_order
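The `statistical_threshold` settings above are commented as "2 std deviations" over a 30-day window. The exact calculation inside QC2Plus is not shown in this README, so the following Python sketch assumes a plain z-score against the window's history; `exceeds_relative_threshold` is a hypothetical name.

```python
from statistics import mean, stdev


def exceeds_relative_threshold(history, today, threshold_value=2.0):
    """Flag today's metric if it lies more than threshold_value
    sample standard deviations away from the mean of the history window."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all counts as a deviation.
        return today != mu
    return abs(today - mu) / sigma > threshold_value


# Illustrative daily row counts (a short stand-in for a 30-day window).
daily_counts = [100, 104, 98, 101, 97, 103, 99, 102]
spike_flagged = exceeds_relative_threshold(daily_counts, 150)
normal_flagged = exceeds_relative_threshold(daily_counts, 101)
```

A lower `threshold_value` makes the check more sensitive; the examples later in this README use 3.0 for order totals, which tolerates more day-to-day variation.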

4. Run Tests

# Test connection
qc2plus test-connection

# Run all tests
qc2plus run --target dev

# Run specific model
qc2plus run --models customers --target dev

# Run only Level 1
qc2plus run --level 1

# Parallel execution (4 threads)
qc2plus run --threads 4

# Production run with fail-fast
qc2plus run --target prod --fail-fast
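If you trigger runs from a Python scheduler instead of a shell, a thin wrapper can assemble the same commands. This sketch uses only the flags shown above; `build_run_command` and `run_quality_checks` are hypothetical helpers, and the README does not state which exit codes `qc2plus run` returns, so the wrapper returns the code without interpreting it.

```python
import subprocess
from typing import List, Optional


def build_run_command(
    target: str,
    model: Optional[str] = None,
    level: Optional[int] = None,
    threads: Optional[int] = None,
    fail_fast: bool = False,
) -> List[str]:
    """Assemble the argv list for `qc2plus run` from the documented flags."""
    cmd = ["qc2plus", "run", "--target", target]
    if model is not None:
        cmd += ["--models", model]
    if level is not None:
        cmd += ["--level", str(level)]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if fail_fast:
        cmd.append("--fail-fast")
    return cmd


def run_quality_checks(**kwargs) -> int:
    """Run the CLI (qc2plus must be on PATH) and return its exit code."""
    return subprocess.run(build_run_command(**kwargs)).returncode


prod_cmd = build_run_command("prod", fail_fast=True)
dev_cmd = build_run_command("dev", model="customers", threads=4)
```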

🏁 Quick Start With Docker

1. Clone the repository

git clone https://github.com/kheopsys/qc2plus
cd qc2plus

2. Start all services

docker-compose up -d

Expected output:

 Container qc2plus-postgres         Started
 Container qc2plus-postgres-results Started
 Container qc2plus-runner           Started

3. Verify services are running

docker-compose ps

4. Access the QC2Plus container

docker exec -it qc2plus-runner bash

5. Inside the container, run quality checks

cd examples/advanced
qc2plus run --models customers --target demo

6. View results in PostgreSQL

docker exec -it qc2plus-postgres-results psql -U qc2plus -d qc2plus_results \
  -c "SELECT model_name, test_type, status, failed_rows 
      FROM quality_test_results 
      ORDER BY execution_time DESC 
      LIMIT 10;"

📚 Documentation

📋 Test Reference

Level 1 Tests

| Test | Use Case | Example |
|---|---|---|
| unique | Primary keys, unique identifiers | customer_id, email |
| not_null | Required fields | email, created_at |
| email_format | Email validation | Email addresses |
| relationship | Referential integrity | customer_id → customers.id |
| accepted_values | Enum/status fields | status in ['active', 'inactive'] |
| range_check | Numeric boundaries | age between 0 and 120 |
| future_date | Date validation | Birth dates, creation dates |
| statistical_threshold | Metric anomalies | Daily registrations, revenue |

See documentation for complete parameter reference.
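To give a feel for what SQL-based validation means for tests such as `unique` and `not_null`, here is a small Python sketch that builds comparable queries and runs them against an in-memory SQLite table. The SQL that QC2Plus actually generates is not shown in this README, so the query shapes and the helper names `unique_sql` and `not_null_sql` are assumptions.

```python
import sqlite3


def unique_sql(table, column):
    """Count the distinct values that appear more than once in the column."""
    # Identifiers are interpolated directly: use only trusted config values.
    return (
        f"SELECT COUNT(*) FROM ("
        f"SELECT {column} FROM {table} GROUP BY {column} HAVING COUNT(*) > 1)"
    )


def not_null_sql(table, column):
    """Count the rows where the column is NULL."""
    return f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (2, "b@example.com"), (3, "c@example.com")],
)

duplicate_ids = conn.execute(unique_sql("customers", "customer_id")).fetchone()[0]
null_emails = conn.execute(not_null_sql("customers", "email")).fetchone()[0]
```

A test passes when its count is zero; here both counts are non-zero because the sample rows contain one repeated `customer_id` and one missing `email`.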

Level 2 Analyzers

| Analyzer | Detects | Example Scenario |
|---|---|---|
| Correlation | Relationship changes | Marketing spend vs revenue decoupling |
| Temporal | Time series anomalies | Unexpected spike in daily signups |
| Distribution | Segment shifts | Geographic distribution change |
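The marketing-spend scenario can be sketched in Python, assuming the Correlation analyzer compares an observed Pearson correlation against `expected_correlation` and flags a shift when the gap exceeds `threshold` (both parameter names come from the Quick Start config). That comparison rule is an assumption, and `pearson` and `correlation_shifted` are hypothetical helpers.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def correlation_shifted(xs, ys, expected_correlation, threshold):
    """True when the observed correlation is too far from the expected one."""
    return abs(pearson(xs, ys) - expected_correlation) > threshold


# Illustrative data: revenue first tracks spend, then stops responding to it.
spend = [10, 20, 30, 40, 50]
revenue_coupled = [105, 198, 310, 395, 502]
revenue_decoupled = [300, 290, 310, 295, 305]

coupled_flag = correlation_shifted(spend, revenue_coupled, 0.9, 0.2)
decoupled_flag = correlation_shifted(spend, revenue_decoupled, 0.9, 0.2)
```

Note that with a high expected value and a tight threshold, a correlation that is stronger than expected can also be flagged, so the two parameters should be tuned together.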

🔔 Alerting Example

Configure in qc2plus_project.yml:

alerting:
  enabled_channels: [slack, email]
  
  thresholds:
    critical_failure_threshold: 1    # Alert on 1+ critical failures
    failure_rate_threshold: 0.15     # Alert if >15% tests fail
  
  slack:
    enabled: true
    webhook_url: ${SLACK_WEBHOOK_URL}
  
  email:
    enabled: true
    smtp_server: smtp.gmail.com
    smtp_port: 587
    username: ${EMAIL_USERNAME}
    password: ${EMAIL_APP_PASSWORD}
    from_email: qc2plus@company.com
    to_emails:
      - data-team@company.com
      - alerts@company.com
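As a reading aid for the two thresholds, this Python sketch applies them the way the YAML comments describe: alert on one or more critical failures, or when more than 15% of tests fail. `should_alert` is a hypothetical function, and treating the two conditions as an either/or is an assumption.

```python
def should_alert(
    critical_failures,
    failed_tests,
    total_tests,
    critical_failure_threshold=1,
    failure_rate_threshold=0.15,
):
    """Return True if either alerting threshold is crossed."""
    if critical_failures >= critical_failure_threshold:
        return True
    if total_tests == 0:
        # Nothing ran, so there is no failure rate to evaluate.
        return False
    return failed_tests / total_tests > failure_rate_threshold


one_critical = should_alert(critical_failures=1, failed_tests=1, total_tests=100)
low_failure_rate = should_alert(critical_failures=0, failed_tests=10, total_tests=100)
high_failure_rate = should_alert(critical_failures=0, failed_tests=20, total_tests=100)
```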

📊 Power BI Integration

QC2Plus automatically creates three tables in your quality database:

1. quality_test_results

Individual test results with full details.

SELECT 
  model_name,
  test_name,
  test_type,
  level,
  severity,
  status,
  failed_rows,
  total_rows,
  execution_time
FROM qc2plus.quality_test_results
WHERE execution_time >= CURRENT_DATE - INTERVAL '30 days'
ORDER BY execution_time DESC;

2. quality_run_summary

High-level run metrics for trend analysis.

SELECT 
  run_id,
  execution_time,
  target_environment,
  total_tests,
  passed_tests,
  failed_tests,
  critical_failures,
  execution_duration_seconds
FROM qc2plus.quality_run_summary
ORDER BY execution_time DESC;
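For a quick trend check outside Power BI, the summary columns can be turned into a pass rate per run. The Python sketch below uses an in-memory SQLite table as a stand-in for `qc2plus.quality_run_summary`, with a subset of the columns listed above and made-up rows; `pass_rates` is a hypothetical helper.

```python
import sqlite3

# Stand-in for the real results table, which lives in your quality database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE quality_run_summary ("
    "run_id TEXT, total_tests INTEGER, passed_tests INTEGER, failed_tests INTEGER)"
)
conn.executemany(
    "INSERT INTO quality_run_summary VALUES (?, ?, ?, ?)",
    [("run_1", 40, 38, 2), ("run_2", 40, 30, 10)],
)


def pass_rates(connection):
    """Map each run_id to passed_tests / total_tests, skipping empty runs."""
    rows = connection.execute(
        "SELECT run_id, passed_tests, total_tests "
        "FROM quality_run_summary ORDER BY run_id"
    ).fetchall()
    return {run_id: passed / total for run_id, passed, total in rows if total}


rates = pass_rates(conn)
```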

3. quality_anomalies

ML-detected anomalies with severity scores.

SELECT 
  model_name,
  analyzer_type,
  anomaly_type,
  anomaly_score,
  affected_columns,
  detection_time,
  severity
FROM qc2plus.quality_anomalies
WHERE detection_time >= CURRENT_DATE - INTERVAL '7 days'
ORDER BY anomaly_score DESC;

🎯 Examples

E-commerce Data Quality

models:
  - name: orders
    qc2plus_tests:
      level1:
        - not_null:
            column_name: order_id
            severity: critical
        - relationship:
            column_name: customer_id
            reference_table: customers
            reference_column: id
            severity: critical
        - range_check:
            column_name: order_total
            min_value: 0
            severity: high
        - statistical_threshold:
            metric: sum
            column_name: order_total
            threshold_type: relative
            threshold_value: 3.0
            severity: high
      
      level2:
        correlation_analysis:
          variables: [order_total, item_count, shipping_cost]
          expected_correlation: 0.7
          threshold: 0.25
        
        temporal_analysis:
          date_column: order_date
          metrics: [count, sum_order_total, avg_order_total]
          seasonality_check: true

SaaS Metrics Monitoring

models:
  - name: daily_metrics
    qc2plus_tests:
      level1:
        - statistical_threshold:
            metric: count
            column_name: new_signups
            threshold_type: relative
            threshold_value: 2.0
            window_days: 30
            severity: high
        
        - statistical_threshold:
            metric: sum
            column_name: mrr
            threshold_type: absolute
            threshold_value: 100000
            severity: critical
      
      level2:
        correlation_analysis:
          variables: [new_signups, trial_starts, paid_conversions]
          expected_correlation: 0.85
          threshold: 0.15
        
        temporal_analysis:
          date_column: metric_date
          metrics: [new_signups, churn_count, mrr]
          seasonality_check: true
          window_days: 180

🏗️ Architecture

┌─────────────────────────────────────────────┐
│         QC2Plus Architecture                │
├─────────────────────────────────────────────┤
│                                             │
│  ┌──────────────┐     ┌──────────────┐     │
│  │   Level 1    │     │   Level 2    │     │
│  │  SQL Tests   │────▶│  ML Anomaly  │     │
│  │              │     │  Detection   │     │
│  └──────────────┘     └──────────────┘     │
│         │                     │             │
│         ▼                     ▼             │
│  ┌────────────────────────────────────┐    │
│  │      Results Persistence           │    │
│  │  (PostgreSQL/BigQuery/Snowflake)   │    │
│  └────────────────────────────────────┘    │
│         │                                   │
│         ├──▶ Power BI Dashboards            │
│         └──▶ Multi-Channel Alerts           │
│              (Slack/Email/Teams)            │
└─────────────────────────────────────────────┘

🚀 Performance Tips

  1. Parallel Execution: Use --threads based on DB capacity

    qc2plus run --threads 4  # Good for most setups
    
  2. Optimize Windows: Adjust based on data volume

    window_days: 30  # Fast, less history
    window_days: 90  # Balanced
    window_days: 180  # Comprehensive, slower
    
  3. Index Critical Columns: Especially date columns

    CREATE INDEX idx_created_at ON customers(created_at);
    
  4. Set a Minimum Sample Size: Skip ML analysis on tables that are too small

    min_samples: 1000  # ML tests skip if < 1000 rows
    
  5. Schedule Wisely: Run during low-traffic periods

    # Crontab example: Daily at 2 AM
    0 2 * * * cd /path/to/project && qc2plus run --target prod
    

🐛 Troubleshooting

Connection Issues

# Test database connection
qc2plus test-connection --target dev

# Enable debug logging
export QC2PLUS_LOG_LEVEL=DEBUG
qc2plus run

Tests Not Found

# List all models
qc2plus list-models

Performance Issues

# Reduce window for testing
statistical_threshold:
  window_days: 7  # Instead of 30

# Increase minimum samples
level2:
  temporal_analysis:
    min_samples: 100  # Skip analysis if < 100 rows

Memory Errors

# Reduce parallel threads
qc2plus run --threads 1

# Or increase Docker memory (if using Docker)
docker run --memory=4g qc2plus

🤝 Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Quick Start:

git clone https://github.com/kheopsys/qc2plus.git
cd qc2plus
pip install -e ".[dev]"
pytest tests/

Areas We Need Help:

  • 📝 Documentation improvements
  • 🧪 Additional test types
  • 🗄️ New database adapters
  • 🎨 Power BI templates
  • 🌐 Translations

📄 License

MIT License - see LICENSE for details.


🙏 Contributors & Acknowledgments

Main Contributors

This project is maintained by:

Ikrame Ettiache

Creator & Maintainer
🤖 💻 📊

Abdoul Raoufou Gambo

Creator & Maintainer
💻 🐛 📖

Yasser Sokri

Creator & Maintainer
🤖 💻 📊

📧 Support & Community


⭐ Star us on GitHub if QC2Plus helps your data quality! ⭐

Made with ❤️ by the QC2Plus Team
