Data Quality Automation Framework with ML-powered anomaly detection
QC2Plus - Advanced Data Quality Framework
Production-ready data quality framework with ML-powered anomaly detection
🎯 What is QC2Plus?
QC2Plus is an open-source Python framework for automated data quality testing, combining traditional SQL-based validation with advanced machine learning anomaly detection.
Two-Level Quality Approach
Level 1: SQL-Based Validation 🔍
- Business rules (unique, not_null, relationship)
- Format validation (email, phone, dates)
- Statistical thresholds (detect metric anomalies)
- Custom SQL tests
Level 2: ML-Based Anomaly Detection 🤖
- Correlation shifts between variables
- Temporal pattern changes
- Distribution drift across segments
- Smart contextual filtering
Why QC2Plus?
| Feature | Traditional Tools | QC2Plus |
|---|---|---|
| Setup Time | Hours to days | Minutes |
| Anomaly Detection | Rule-based only | ML-powered |
| Alerting | Basic notifications | Multi-channel with context |
| Monitoring | Standalone | Power BI integration |
| Learning Curve | Steep | dbt-like CLI |
✨ Features
🚀 Easy to Use
- dbt-inspired CLI: Familiar qc2plus run and qc2plus test commands
- YAML Configuration: Simple model and test definitions
- Auto-Discovery: Automatically finds models in your project
- Multi-Environment: Separate configs for dev, staging, prod
🗄️ Database Support
| Database | Support Level | Installation |
|---|---|---|
| PostgreSQL | ✅ Stable | Included |
| Snowflake | ✅ Stable | pip install qc2plus[snowflake] |
| BigQuery | ✅ Stable | pip install qc2plus[bigquery] |
| Redshift | ⚠️ Beta | pip install qc2plus[redshift] |
📊 Comprehensive Testing
Level 1 Tests (8 built-in types):
- unique, not_null, accepted_values
- relationship, range_check
- email_format, future_date
- statistical_threshold (ML-powered)
Level 2 Analyzers (3 ML algorithms):
- Correlation Analyzer: Detect relationship changes
- Temporal Analyzer: Find time series anomalies
- Distribution Analyzer: Monitor segment shifts
🔔 Smart Alerting
- Channels: Email (SMTP), Slack, Microsoft Teams
- Severity Levels: Critical, High, Medium, Low
- Smart Routing: Individual alerts for critical, summaries for others
- Rich Formatting: HTML emails, Slack cards, Teams adaptive cards
📈 Power BI Ready
Three auto-created tables for instant dashboards:
- quality_test_results - Individual test outcomes
- quality_run_summary - Run-level metrics
- quality_anomalies - ML-detected anomalies with details
📦 Installation
pip install qc2plus
🏁 Quick Start
1. Initialize Project
qc2plus init my_quality_project
cd my_quality_project
This creates:
my_quality_project/
├── qc2plus_project.yml # Project config
├── profiles.yml # Database connections
├── models/ # Test definitions
│ └── customers.yml # Example model
└── README.md # Getting started guide
2. Configure Database
Edit profiles.yml:
my_quality_project:
  target: dev
  outputs:
    dev:
      data_source: # Where your data lives
        type: postgresql
        host: localhost
        port: 5432
        user: ${DB_USER} # Use env variables!
        password: ${DB_PASSWORD}
        dbname: analytics
        schema: public
      quality_output: # Where results are stored
        type: postgresql
        host: localhost
        port: 5432
        dbname: quality_db
        schema: qc2plus
Security Best Practice: Use environment variables for credentials!
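This README does not describe how QC2Plus resolves the ${DB_USER} style placeholders internally. As an illustration of the general technique only, the sketch below expands such placeholders with the standard library's string.Template. The expand_env helper and the sample values are hypothetical.

```python
import os
from string import Template


def expand_env(text, env=None):
    """Replace ${NAME} placeholders with values from env (default: os.environ).

    Raises KeyError when a placeholder has no value, so a missing credential
    fails loudly instead of silently becoming an empty string.
    """
    mapping = os.environ if env is None else env
    return Template(text).substitute(mapping)


if __name__ == "__main__":
    sample = "user: ${DB_USER}\npassword: ${DB_PASSWORD}"
    # Hypothetical values; in practice these come from the process environment.
    print(expand_env(sample, {"DB_USER": "analyst", "DB_PASSWORD": "s3cret"}))
```

Failing on a missing variable is a deliberate choice here: a connection attempt with an empty password is usually harder to diagnose than an early KeyError.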
3. Define Tests
Edit models/customers.yml:
models:
  - name: customers
    description: Customer data quality tests
    qc2plus_tests:
      # Level 1: Business Rules
      level1:
        - unique:
            column_name: customer_id
            severity: critical
        - not_null:
            column_name: email
            severity: critical
        - email_format:
            column_name: email
            severity: high
        - accepted_values:
            column_name: status
            accepted_values: ['active', 'inactive', 'churned']
            severity: medium
        - statistical_threshold:
            metric: count
            threshold_type: relative
            threshold_value: 2.0 # 2 std deviations
            window_days: 30
            severity: high
      # Level 2: ML Anomaly Detection
      level2:
        correlation_analysis:
          variables: [lifetime_value, order_count, avg_order_value]
          expected_correlation: 0.8
          threshold: 0.2
        temporal_analysis:
          date_column: created_at
          metrics: [count, avg_lifetime_value]
          seasonality_check: true
        distribution_analysis:
          segments: [country, customer_type]
          metrics: [lifetime_value, order_count]
          date_column: date_order
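The statistical_threshold test with threshold_type: relative is annotated above as "2 std deviations". The exact computation QC2Plus performs is not spelled out in this README; one common interpretation is a z-score of the current metric against a trailing window. The sketch below assumes that interpretation, and the function name and sample data are hypothetical.

```python
from statistics import mean, stdev


def exceeds_relative_threshold(history, current, threshold_value=2.0):
    """Return True if `current` lies more than `threshold_value` sample
    standard deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current != mu  # flat history: any change is a deviation
    return abs(current - mu) / sigma > threshold_value


if __name__ == "__main__":
    # Hypothetical daily row counts from the trailing window.
    daily_counts = [100, 102, 98, 101, 99, 100, 103, 97]
    print(exceeds_relative_threshold(daily_counts, 101))  # close to the mean
    print(exceeds_relative_threshold(daily_counts, 150))  # far from the mean
```

A longer window_days gives a more stable estimate of the mean and spread, at the cost of reacting more slowly to a genuine change in level.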
4. Run Tests
# Test connection
qc2plus test-connection
# Run all tests
qc2plus run --target dev
# Run specific model
qc2plus run --models customers --target dev
# Run only Level 1
qc2plus run --level 1
# Parallel execution (4 threads)
qc2plus run --threads 4
# Production run with fail-fast
qc2plus run --target prod --fail-fast
🏁 Quick Start With Docker
1. Clone the repository
git clone https://github.com/kheopsys/qc2plus
cd qc2plus
2. Start all services
docker-compose up -d
Expected output:
Container qc2plus-postgres Started
Container qc2plus-postgres-results Started
Container qc2plus-runner Started
3. Verify services are running
docker-compose ps
4. Access the QC2Plus container
docker exec -it qc2plus-runner bash
5. Inside the container, run quality checks
cd examples/advanced
qc2plus run --models customers --target demo
6. View results in PostgreSQL
docker exec -it qc2plus-postgres-results psql -U qc2plus -d qc2plus_results \
-c "SELECT model_name, test_type, status, failed_rows
FROM quality_test_results
ORDER BY execution_time DESC
LIMIT 10;"
📚 Documentation
📖 Complete Guides
- QC2PLUS Documentation - Complete parameter reference
- Examples - Real-world use cases
📋 Test Reference
Level 1 Tests
| Test | Use Case | Example |
|---|---|---|
| unique | Primary keys, unique identifiers | customer_id, email |
| not_null | Required fields | email, created_at |
| email_format | Email validation | Email addresses |
| relationship | Referential integrity | customer_id → customers.id |
| accepted_values | Enum/status fields | status in ['active', 'inactive'] |
| range_check | Numeric boundaries | age between 0 and 120 |
| future_date | Date validation | Birth dates, creation dates |
| statistical_threshold | Metric anomalies | Daily registrations, revenue |
See QC2PLUS_DOCUMENTATION.md for complete parameter reference.
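Conceptually, each Level 1 test is a SQL query that counts violating rows. The queries QC2Plus actually generates are not shown in this README, so the following is only a sketch of the idea, run against an in-memory SQLite table. The table contents, the SQL text, and the failed_rows helper are all assumptions made for illustration.

```python
import sqlite3

# Each check is a query that returns the number of violating rows.
# The SQL is hypothetical and not taken from QC2Plus.
CHECKS = {
    "unique(customer_id)": """
        SELECT COALESCE(SUM(n - 1), 0) FROM (
            SELECT COUNT(*) AS n FROM customers
            GROUP BY customer_id HAVING COUNT(*) > 1
        )""",
    "not_null(email)": "SELECT COUNT(*) FROM customers WHERE email IS NULL",
    "accepted_values(status)": """
        SELECT COUNT(*) FROM customers
        WHERE status NOT IN ('active', 'inactive', 'churned')""",
    "range_check(age)": """
        SELECT COUNT(*) FROM customers WHERE age NOT BETWEEN 0 AND 120""",
}


def failed_rows(conn):
    """Run every check and return {check_name: violating_row_count}."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in CHECKS.items()}


def demo_connection():
    """Build a small in-memory table with one violation per check."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers "
        "(customer_id INTEGER, email TEXT, status TEXT, age INTEGER)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?, ?)",
        [
            (1, "a@example.com", "active", 34),
            (2, None, "inactive", 51),              # NULL email
            (2, "c@example.com", "paused", 29),     # duplicate id, bad status
            (3, "d@example.com", "churned", 130),   # age out of range
        ],
    )
    return conn


if __name__ == "__main__":
    for name, count in failed_rows(demo_connection()).items():
        print(f"{name}: {count} failing row(s)")
```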
Level 2 Analyzers
| Analyzer | Detects | Example Scenario |
|---|---|---|
| Correlation | Relationship changes | Marketing spend vs revenue decoupling |
| Temporal | Time series anomalies | Unexpected spike in daily signups |
| Distribution | Segment shifts | Geographic distribution change |
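For the Correlation analyzer, the configuration elsewhere in this README pairs an expected_correlation with a threshold. One plausible reading, and it is only an assumption, is that an anomaly is raised when the observed Pearson correlation drifts further than threshold from the expected value. The sketch below implements that reading with the standard library; the function names and the data are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)


def correlation_shifted(xs, ys, expected_correlation, threshold):
    """True if the observed correlation is further than `threshold`
    from `expected_correlation`."""
    return abs(pearson(xs, ys) - expected_correlation) > threshold


if __name__ == "__main__":
    spend = [10, 20, 30, 40, 50]
    coupled = [12, 19, 33, 41, 48]    # revenue still tracks spend
    decoupled = [30, 10, 50, 20, 40]  # revenue no longer tracks spend
    print(correlation_shifted(spend, coupled, 0.8, 0.2))
    print(correlation_shifted(spend, decoupled, 0.8, 0.2))
```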
🔔 Alerting Example
Configure in qc2plus_project.yml:
alerting:
  enabled_channels: [slack, email]
  thresholds:
    critical_failure_threshold: 1 # Alert on 1+ critical failure
    failure_rate_threshold: 0.15 # Alert if >15% tests fail
  slack:
    enabled: true
    webhook_url: ${SLACK_WEBHOOK_URL}
  email:
    enabled: true
    smtp_server: smtp.gmail.com
    smtp_port: 587
    username: ${EMAIL_USERNAME}
    password: ${EMAIL_APP_PASSWORD}
    from_email: qc2plus@company.com
    to_emails:
      - data-team@company.com
      - alerts@company.com
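The two thresholds in this configuration can be read as a simple decision rule: alert when the number of critical failures reaches critical_failure_threshold, or when the share of failed tests exceeds failure_rate_threshold. The sketch below encodes that reading of the inline comments. It is not QC2Plus code, and the RunStats class and should_alert function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    total_tests: int
    failed_tests: int
    critical_failures: int


def should_alert(stats, critical_failure_threshold=1, failure_rate_threshold=0.15):
    """Alert on enough critical failures, or on too high a failure rate."""
    if stats.critical_failures >= critical_failure_threshold:
        return True
    if stats.total_tests == 0:
        return False  # nothing ran, so there is no failure rate to evaluate
    return stats.failed_tests / stats.total_tests > failure_rate_threshold


if __name__ == "__main__":
    print(should_alert(RunStats(total_tests=20, failed_tests=1, critical_failures=0)))
    print(should_alert(RunStats(total_tests=20, failed_tests=4, critical_failures=0)))
    print(should_alert(RunStats(total_tests=20, failed_tests=1, critical_failures=1)))
```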
📊 Power BI Integration
QC2Plus automatically creates three tables in your quality database:
1. quality_test_results
Individual test results with full details.
SELECT
model_name,
test_name,
test_type,
level,
severity,
status,
failed_rows,
total_rows,
execution_time
FROM qc2plus.quality_test_results
WHERE execution_time >= CURRENT_DATE - INTERVAL '30 days'
ORDER BY execution_time DESC;
2. quality_run_summary
High-level run metrics for trend analysis.
SELECT
run_id,
execution_time,
target_environment,
total_tests,
passed_tests,
failed_tests,
critical_failures,
execution_duration_seconds
FROM qc2plus.quality_run_summary
ORDER BY execution_time DESC;
3. quality_anomalies
ML-detected anomalies with severity scores.
SELECT
model_name,
analyzer_type,
anomaly_type,
anomaly_score,
affected_columns,
detection_time,
severity
FROM qc2plus.quality_anomalies
WHERE detection_time >= CURRENT_DATE - INTERVAL '7 days'
ORDER BY anomaly_score DESC;
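The same tables can also be read outside Power BI. As a rough sketch, the snippet below computes a per-run pass rate from quality_run_summary using a subset of the columns listed above. It runs against an in-memory SQLite stand-in with invented rows; in a real project you would connect to the database configured under quality_output instead. The pass_rates and demo_connection helpers are hypothetical.

```python
import sqlite3


def pass_rates(conn):
    """Return [(run_id, pass_rate)] ordered from newest to oldest run.

    pass_rate is None for a run that executed no tests.
    """
    rows = conn.execute(
        """
        SELECT run_id, passed_tests, total_tests
        FROM quality_run_summary
        ORDER BY execution_time DESC
        """
    ).fetchall()
    return [
        (run_id, passed / total if total else None)
        for run_id, passed, total in rows
    ]


def demo_connection():
    """In-memory stand-in for the results database, with invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE quality_run_summary "
        "(run_id TEXT, execution_time TEXT, total_tests INTEGER, "
        "passed_tests INTEGER, failed_tests INTEGER)"
    )
    conn.executemany(
        "INSERT INTO quality_run_summary VALUES (?, ?, ?, ?, ?)",
        [
            ("run-1", "2024-01-01T02:00:00", 20, 18, 2),
            ("run-2", "2024-01-02T02:00:00", 20, 20, 0),
            ("run-3", "2024-01-03T02:00:00", 0, 0, 0),
        ],
    )
    return conn


if __name__ == "__main__":
    for run_id, rate in pass_rates(demo_connection()):
        print(run_id, "n/a" if rate is None else f"{rate:.0%}")
```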
🎯 Examples
E-commerce Data Quality
models:
  - name: orders
    qc2plus_tests:
      level1:
        - not_null:
            column_name: order_id
            severity: critical
        - relationship:
            column_name: customer_id
            reference_table: customers
            reference_column: id
            severity: critical
        - range_check:
            column_name: order_total
            min_value: 0
            severity: high
        - statistical_threshold:
            metric: sum
            column_name: order_total
            threshold_type: relative
            threshold_value: 3.0
            severity: high
      level2:
        correlation_analysis:
          variables: [order_total, item_count, shipping_cost]
          expected_correlation: 0.7
          threshold: 0.25
        temporal_analysis:
          date_column: order_date
          metrics: [count, sum_order_total, avg_order_total]
          seasonality_check: true
SaaS Metrics Monitoring
models:
  - name: daily_metrics
    qc2plus_tests:
      level1:
        - statistical_threshold:
            metric: count
            column_name: new_signups
            threshold_type: relative
            threshold_value: 2.0
            window_days: 30
            severity: high
        - statistical_threshold:
            metric: sum
            column_name: mrr
            threshold_type: absolute
            threshold_value: 100000
            severity: critical
      level2:
        correlation_analysis:
          variables: [new_signups, trial_starts, paid_conversions]
          expected_correlation: 0.85
          threshold: 0.15
        temporal_analysis:
          date_column: metric_date
          metrics: [new_signups, churn_count, mrr]
          seasonality_check: true
          window_days: 180
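How QC2Plus models seasonality when seasonality_check is enabled is not described in this README. To give a feel for what a seasonality-aware temporal check can look like, the sketch below assumes weekly seasonality and compares each day with the other observations on the same weekday (leave-one-out). The function name, the z_threshold parameter, and the generated data are all hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean, stdev


def weekday_anomalies(series, z_threshold=3.0):
    """series: list of (date, value) pairs.

    Flag dates whose value deviates by more than `z_threshold` standard
    deviations from the other values observed on the same weekday.
    """
    by_weekday = defaultdict(list)
    for day, value in series:
        by_weekday[day.weekday()].append(value)

    flagged = []
    for day, value in series:
        peers = list(by_weekday[day.weekday()])
        peers.remove(value)  # leave the point itself out of its baseline
        if len(peers) < 2:
            continue  # too little history for this weekday
        mu, sigma = mean(peers), stdev(peers)
        if sigma == 0:
            if value != mu:
                flagged.append(day)
        elif abs(value - mu) / sigma > z_threshold:
            flagged.append(day)
    return flagged


def demo_series():
    """Eight weeks of signups: busy weekdays, quiet weekends, one spike."""
    start = date(2024, 1, 1)  # a Monday
    series = []
    for i in range(56):
        day = start + timedelta(days=i)
        base = 40 if day.weekday() >= 5 else 100
        series.append((day, base + i % 3))  # small deterministic noise
    series[30] = (series[30][0], 400)  # inject a spike on 2024-01-31
    return series


if __name__ == "__main__":
    print(weekday_anomalies(demo_series()))
```

Comparing like with like (weekday against weekday) keeps the regular weekend dip from being reported as an anomaly, which a single global mean would likely do.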
🏗️ Architecture
┌─────────────────────────────────────────────┐
│            QC2Plus Architecture             │
├─────────────────────────────────────────────┤
│                                             │
│  ┌──────────────┐     ┌──────────────┐      │
│  │   Level 1    │     │   Level 2    │      │
│  │  SQL Tests   │────▶│  ML Anomaly  │      │
│  │              │     │  Detection   │      │
│  └──────────────┘     └──────────────┘      │
│         │                    │              │
│         ▼                    ▼              │
│  ┌───────────────────────────────────┐      │
│  │        Results Persistence        │      │
│  │  (PostgreSQL/BigQuery/Snowflake)  │      │
│  └───────────────────────────────────┘      │
│         │                                   │
│         ├──▶ Power BI Dashboards            │
│         └──▶ Multi-Channel Alerts           │
│              (Slack/Email/Teams)            │
└─────────────────────────────────────────────┘
🚀 Performance Tips
- Parallel Execution: Use --threads based on DB capacity
  qc2plus run --threads 4 # Good for most setups
- Optimize Windows: Adjust based on data volume
  window_days: 30 # Fast, less history
  window_days: 90 # Balanced
  window_days: 180 # Comprehensive, slower
- Index Critical Columns: Especially date columns
  CREATE INDEX idx_created_at ON customers(created_at);
- Use Sampling: For exploratory analysis
  min_samples: 1000 # ML tests skip if < 1000 rows
- Schedule Wisely: Run during low-traffic periods
  # Crontab example: Daily at 2 AM
  0 2 * * * cd /path/to/project && qc2plus run --target prod
🐛 Troubleshooting
Connection Issues
# Test database connection
qc2plus test-connection --target dev
# Enable debug logging
export QC2PLUS_LOG_LEVEL=DEBUG
qc2plus run
Tests Not Found
# List all models
qc2plus list-models
Performance Issues
# Reduce window for testing
statistical_threshold:
  window_days: 7 # Instead of 30
# Increase minimum samples
level2:
  temporal_analysis:
    min_samples: 100 # Skip analysis if < 100 rows
Memory Errors
# Reduce parallel threads
qc2plus run --threads 1
# Or increase Docker memory (if using Docker)
docker run --memory=4g qc2plus
🤝 Contributing
We welcome contributions! See CONTRIBUTING.md for guidelines.
Quick Start:
git clone https://github.com/kheopsys/qc2plus.git
cd qc2plus
pip install -e ".[dev]"
pytest tests/
Areas We Need Help:
- 📝 Documentation improvements
- 🧪 Additional test types
- 🗄️ New database adapters
- 🎨 Power BI templates
- 🌐 Translations
📄 License
MIT License - see LICENSE for details.
🙏 Contributors & Acknowledgments
Main Contributors
This project is maintained by:
- Ikrame Ettiache, Creator & Maintainer 🤖 💻 📊
- Abdoul Raoufou Gambo, Creator & Maintainer 💻 🐛 📖
- Yasser Sokri, Creator & Maintainer 🤖 💻 📊
Special Thanks
- Inspired by dbt for the elegant CLI approach
- Built with SQLAlchemy, scikit-learn, pandas
- Thanks to everyone who reported bugs and suggested features!
Sponsor
If QC2Plus helps your organization, consider:
📧 Support & Community
- 📖 Documentation: QC2PLUS Documentation (complete parameter reference)
- 🐛 Issues: GitHub Issues
- 💬 Discussions: GitHub Discussions
- 🐦 Twitter: @qc2plus
- 💼 LinkedIn: QC2Plus
⭐ Star us on GitHub if QC2Plus helps your data quality! ⭐
Made with ❤️ by the QC2Plus Team