Data Quality Automation Framework with ML-powered anomaly detection
Project description
2QC+ Data Quality Automation Framework
A comprehensive, production-ready framework for automated data quality control with machine learning-powered anomaly detection.
๐ฏ Overview
2QC+ is an open-source Python framework that provides two levels of data quality automation:
- Level 1: Business rule validation (constraints, formats, statistical thresholds)
- Level 2: ML-powered anomaly detection (correlations, temporal patterns, multivariate analysis, distributions)
Inspired by dbt's approach, 2QC+ offers a familiar CLI experience while providing advanced ML capabilities for comprehensive data quality monitoring.
โจ Key Features
๐ง Easy to Use
- dbt-like CLI: Familiar commands and workflow
- YAML Configuration: Simple model and test definitions
- Multi-Environment: Dev, staging, prod support
- Auto-Discovery: Automatic model and test discovery
๐๏ธ Multi-Database Support
- PostgreSQL
- Snowflake
- BigQuery
- Redshift
- MySQL (experimental)
๐ Comprehensive Testing
- Level 1 Tests: Unique, not-null, email format, foreign keys, statistical thresholds
- Level 2 ML Analysis: Correlation analysis, temporal patterns, multivariate outliers, distribution changes
๐จ Smart Alerting
- Multi-Channel: Email, Slack, Microsoft Teams
- Severity-Based: Critical alerts for immediate attention
- Rich Formatting: HTML emails, Slack cards, Teams notifications
๐ Power BI Integration
- Auto-Created Tables: quality_test_results, quality_run_summary, quality_anomalies
- Historical Tracking: Trend analysis and reporting
- Executive Dashboards: Ready-to-use Power BI templates
๐ Quick Start
Installation
pip install qc2plus
Initialize a Project
qc2plus init my_data_quality_project
cd my_data_quality_project
Configure Database Connection
Edit profiles.yml:
my_data_quality_project:
target: dev
outputs:
dev:
type: postgresql
host: localhost
port: 5432
user: your_username
password: your_password
dbname: your_database
schema: public
prod:
type: snowflake
account: your_account
user: your_username
password: your_password
role: your_role
database: your_database
warehouse: your_warehouse
schema: public
Define Your First Model
Create models/customers.yml:
models:
- name: customers
description: Customer data quality tests
qc2plus_tests:
level1:
- unique:
column_name: customer_id
severity: critical
- not_null:
column_name: email
severity: critical
- email_format:
column_name: email
severity: medium
- statistical_threshold:
metric: count
threshold_type: relative
threshold_value: 2.0
severity: medium
level2:
correlation_analysis:
variables: [daily_registrations, daily_activations]
expected_correlation: 0.8
threshold: 0.2
temporal_analysis:
date_column: created_at
metrics: [count, avg_revenue]
seasonality_check: true
distribution_analysis:
segments: [country, customer_type]
metrics: [revenue, orders_count]
Run Quality Tests
# Test database connection
qc2plus test-connection --target dev
# Run all tests
qc2plus run --target dev
# Run specific model
qc2plus run --models customers --target dev
# Run only Level 1 tests
qc2plus run --level 1 --target dev
# Run with parallel execution
qc2plus run --threads 4 --target dev
๐ Available Test Types
Level 1 (Business Rules)
| Test Type | Description | Parameters |
|---|---|---|
unique |
Ensures column values are unique | column_name, severity |
not_null |
Ensures no null values | column_name, severity |
email_format |
Validates email format | column_name, severity |
foreign_key |
Checks referential integrity | column_name, reference_table, reference_column, severity |
future_date |
Ensures dates are not in the future | column_name, severity |
statistical_threshold |
Statistical anomaly detection | metric, threshold_type, threshold_value, window_days, severity |
accepted_values |
Validates against allowed values | column_name, accepted_values, severity |
range_check |
Validates numeric ranges | column_name, min_value, max_value, severity |
Level 2 (ML-Powered)
| Analyzer | Description | Use Cases |
|---|---|---|
| Correlation Analysis | Detects changes in variable relationships | Revenue vs. marketing spend correlation breaks |
| Temporal Analysis | Identifies time series anomalies | Seasonal pattern disruptions, trend breaks |
| Multivariate Analysis | Finds outliers in multi-dimensional space | Complex fraud detection, system anomalies |
| Distribution Analysis | Compares distributions across segments | Geographic shifts, demographic changes |
๐ง Configuration Examples
Statistical Threshold Test
statistical_threshold:
column_name: daily_registrations
metric: count
threshold_type: relative # or 'absolute'
threshold_value: 2.0 # 2 standard deviations
window_days: 30
severity: medium
Correlation Analysis
correlation_analysis:
variables: [ad_spend, conversions, revenue]
expected_correlation: 0.7
threshold: 0.2
correlation_type: pearson # or 'spearman'
Multivariate Analysis
multivariate_analysis:
features: [revenue, orders, session_duration, page_views]
contamination: 0.1
algorithms: [isolation_forest, lof, pca]
min_samples: 100
๐จ Alerting Configuration
Add to your project's qc2plus_project.yml:
alerting:
enabled_channels: [email, slack, teams]
thresholds:
critical_failure_threshold: 1
failure_rate_threshold: 0.2
individual_alerts: [critical]
summary_alerts: [high, medium, low]
email:
enabled: true
smtp_server: smtp.gmail.com
smtp_port: 587
username: your_email@gmail.com
password: your_app_password
to_emails: [team@company.com, alerts@company.com]
slack:
enabled: true
webhook_url: https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK
teams:
enabled: true
webhook_url: https://company.webhook.office.com/webhookb2/YOUR/TEAMS/WEBHOOK
๐ Power BI Integration
2QC+ automatically creates three tables for Power BI reporting:
1. quality_test_results
Individual test results with details about failures and execution times.
2. quality_run_summary
High-level summary of each test run including overall success rates and execution duration.
3. quality_anomalies
Detailed information about Level 2 ML-detected anomalies with severity scores.
Connect Power BI to your database and use these tables to create:
- Executive quality dashboards
- Trend analysis reports
- Anomaly investigation views
- Model performance tracking
๐๏ธ Architecture
qc2plus/
โโโ qc2plus/
โ โโโ cli.py # Command-line interface
โ โโโ core/
โ โ โโโ project.py # Project management
โ โ โโโ connection.py # Multi-database support
โ โ โโโ runner.py # Test orchestration
โ โโโ level1/
โ โ โโโ engine.py # Business rule engine
โ โ โโโ macros.py # SQL templates
โ โโโ level2/
โ โ โโโ correlation.py # Correlation analysis
โ โ โโโ temporal.py # Time series analysis
โ โ โโโ multivariate.py # Multivariate outlier detection
โ โ โโโ distribution.py # Distribution comparison
โ โโโ alerting/
โ โ โโโ alerts.py # Multi-channel alerting
โ โโโ output/
โ โโโ persistence.py # Database persistence
โโโ setup.py
โโโ requirements.txt
๐ฌ Advanced Usage
Custom SQL Tests
custom_sql:
sql: |
SELECT customer_id, COUNT(*) as violation_count
FROM customers
WHERE email IS NULL OR email NOT LIKE '%@%'
HAVING COUNT(*) > 0
severity: critical
Parallel Execution
# Run tests in parallel for faster execution
qc2plus run --threads 8 --target prod
Environment-Specific Configuration
# qc2plus_project.yml
vars:
dev:
statistical_threshold_sensitivity: 3.0
prod:
statistical_threshold_sensitivity: 2.0
Compilation Without Execution
# Compile tests to SQL for review
qc2plus compile
๐งช Testing Your Setup
# Test alert configurations
qc2plus test-alerts
# Validate project configuration
qc2plus validate
# List all available models
qc2plus list-models
๐ Performance Tips
- Use Parallel Execution: Set
--threadsbased on your database capabilities - Optimize Windows: Adjust
window_daysfor statistical tests based on data volume - Segment Wisely: Choose segments that balance granularity with performance
- Index Optimization: Ensure proper indexing on date and key columns
- Batch Scheduling: Run during low-traffic periods for production systems
๐ Troubleshooting
Common Issues
Database Connection Fails
qc2plus test-connection --target dev
Tests Not Found
qc2plus list-models --output-format table
Memory Issues with Large Datasets
- Reduce
window_daysfor statistical tests - Increase
min_samplesthreshold for ML tests - Use sampling in your models
Slow Execution
- Enable parallel execution with
--threads - Optimize database queries and indexes
- Consider pre-aggregated tables for large datasets
Debug Mode
# Run with detailed logging
QC2PLUS_LOG_LEVEL=DEBUG qc2plus run --models customers
๐ค Contributing
We welcome contributions! Please see our Contributing Guide for details.
Development Setup
git clone https://github.com/qc2plus/qc2plus.git
cd qc2plus
pip install -e ".[dev]"
pytest tests/
๐ License
This project is licensed under the MIT License - see the LICENSE file for details.
๐โโ๏ธ Support
- Documentation: https://qc2plus.readthedocs.io/
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: support@qc2plus.org
๐ Acknowledgments
- Inspired by dbt for the CLI and project structure approach
- Built on top of excellent open-source libraries: SQLAlchemy, scikit-learn, pandas
- Thanks to the data quality community for feedback and contributions
Made with โค๏ธ by the 2QC+ Team
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file qc2plus-1.0.0.tar.gz.
File metadata
- Download URL: qc2plus-1.0.0.tar.gz
- Upload date:
- Size: 94.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
74d33e3a7e1556f217741adde568aa5b11d7b5e81f1ba0b570b6fcc3ab5bfb88
|
|
| MD5 |
9626d29855aebe99759aa361e0a7b809
|
|
| BLAKE2b-256 |
99723b11330c36cac8979afa615fc1740223e30b9fd97a2e7bb057622dd9b1aa
|
File details
Details for the file qc2plus-1.0.0-py3-none-any.whl.
File metadata
- Download URL: qc2plus-1.0.0-py3-none-any.whl
- Upload date:
- Size: 66.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3c5ac636fe298fb85db859980f75274cdcdd195228d5ad815aa5e03ebfd74ec8
|
|
| MD5 |
d14acb5513d1d8c2a0a3892591aebda4
|
|
| BLAKE2b-256 |
e58f2bc8eb267c15994d73834357d7bc9fe89f05f8e533d70a4fc9f674044acd
|