DataSentry
A Professional Data Quality Framework for ML Pipelines
Overview
DataSentry is a comprehensive Python library designed to detect and remediate data quality issues in machine learning pipelines. It provides a unified interface for identifying and fixing common data problems including class imbalance, label noise, data leakage, missing values, outliers, feature redundancy, and data distribution shift.
Features
Data Quality Detection
- Imbalance Detection: Identify class imbalance with customizable thresholds
- Label Noise Detection: Find potentially mislabeled samples using confident learning
- Data Leakage Detection: Detect target leakage, duplicates, and train-test contamination
- Missing Value Detection: Analyze missing value patterns and completeness
- Outlier Detection: Identify outliers using IQR, Z-score, Isolation Forest, and LOF
- Redundancy Detection: Find correlated and duplicate features
- Shift Detection: Detect distribution drift between train and test sets
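As a rough illustration of two of the outlier rules named above (IQR and Z-score), here is a minimal NumPy sketch. The function names, the `k` and `threshold` defaults, and the sample values are assumptions made for this example. They are not DataSentry's implementation or defaults.

```python
import numpy as np

def iqr_outlier_mask(x, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def zscore_outlier_mask(x, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    std = x.std()
    if std == 0:
        # A constant column has no outliers under this rule.
        return np.zeros(x.shape, dtype=bool)
    return np.abs((x - x.mean()) / std) > threshold

# Eight values near 10 plus one extreme value.
values = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 25.0])
iqr_flags = iqr_outlier_mask(values)
z_flags = zscore_outlier_mask(values, threshold=2.5)
```

With only nine samples, the extreme value inflates the standard deviation so much that its Z-score stays below 3, which is why the example lowers the threshold to 2.5. The IQR rule is less affected by this masking effect.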
Data Quality Remediation
- Imbalance Fixer: SMOTE, ADASYN, undersampling, and class weights
- Label Noise Fixer: Remove, relabel, or weight noisy samples
- Data Leakage Fixer: Remove leaky features and duplicates
- Missing Value Fixer: Mean, median, mode, KNN, and iterative imputation
- Outlier Fixer: Remove, cap, transform, or winsorize outliers
- Redundancy Fixer: Remove features or apply PCA
- Shift Fixer: Standardize and normalize distributions
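To make two of the imbalance remedies listed above concrete, the sketch below computes "balanced" class weights and performs random undersampling with plain NumPy. The helper names and the toy data are assumptions for illustration. DataSentry's ImbalanceFixer may work differently.

```python
import numpy as np

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

def random_undersample(X, y, seed=42):
    """Keep as many samples of every class as the smallest class has."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

# Toy data: 90 samples of class 0 and 10 samples of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100, dtype=float).reshape(-1, 1)

weights = balanced_class_weights(y)
X_small, y_small = random_undersample(X, y)
```

Class weights leave the data untouched and shift the cost of errors instead, while undersampling discards majority-class rows. Which trade-off fits depends on how much data can be spared.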
Visualization
- Interactive plots for all data quality issues
- Distribution comparisons
- Correlation heatmaps
- Missing value patterns
- Outlier visualizations
Installation
From PyPI (Recommended)
pip install datasentry
With Optional Dependencies
# Quote the extras so shells such as zsh do not treat the brackets as a glob pattern
# For advanced imbalance handling (SMOTE, ADASYN)
pip install "datasentry[imblearn]"
# For all optional features
pip install "datasentry[all]"
# For development
pip install "datasentry[dev]"
From Source
git clone https://github.com/010Ankushsharma/datasentry.git
cd datasentry
pip install -e .
Quick Start
from datasentry import DataSentry
import numpy as np
# Generate sample data
np.random.seed(42)
X_train = np.random.randn(1000, 10)
y_train = np.random.randint(0, 3, 1000)
X_test = np.random.randn(200, 10)
# Initialize DataSentry
ds = DataSentry(random_state=42, verbose=True)
# Generate comprehensive report
report = ds.generate_full_report(X_train, y_train, X_test=X_test)
# View health score
print(f"Health Score: {report['report_metadata']['health_score']:.2%}")
print(f"Overall Status: {report['report_metadata']['overall_status']}")
# Fix all detected issues
X_clean, y_clean = ds.fix_all(
    X_train, y_train,
    fix_config={
        'missing_values': {'strategy': 'mean'},
        'outliers': {'method': 'cap'},
        'imbalance': {'method': 'smote'},
    }
)
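The sample data in the Quick Start is clean Gaussian noise with roughly balanced labels, so there is little for the detectors to find. The NumPy-only sketch below builds a dataset with deliberately injected problems. All proportions and column choices are arbitrary assumptions, and whether each issue is flagged depends on DataSentry's thresholds, which this README does not specify.

```python
import numpy as np

rng = np.random.default_rng(42)

# Imbalanced labels: roughly 90% class 0, 8% class 1, 2% class 2.
y_train = rng.choice([0, 1, 2], size=1000, p=[0.90, 0.08, 0.02])
X_train = rng.normal(size=(1000, 10))

# Missing values: blank out about 5% of the cells.
missing_mask = rng.random(X_train.shape) < 0.05
X_train[missing_mask] = np.nan

# Outliers: push 10 values in column 0 far from the bulk.
outlier_rows = rng.choice(1000, size=10, replace=False)
X_train[outlier_rows, 0] = 50.0

# Redundancy: make column 9 a near-copy of column 1.
X_train[:, 9] = X_train[:, 1] + rng.normal(scale=0.01, size=1000)

# Distribution shift: the test set has a shifted mean in column 2.
X_test = rng.normal(size=(200, 10))
X_test[:, 2] += 2.0
```

These arrays can be passed to generate_full_report and fix_all in place of the random data used above.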
Detailed Usage
Individual Detection
from datasentry import DataSentry
ds = DataSentry()
# X, y, and X_test are your own feature matrix, labels, and held-out features
# Detect specific issues
imbalance_result = ds.detect_imbalance(X, y)
missing_result = ds.detect_missing_values(X)
outlier_result = ds.detect_outliers(X)
leakage_result = ds.detect_data_leakage(X, y, X_test=X_test)
# Check if issues were detected
if imbalance_result.issue_detected:
    print(f"Imbalance ratio: {imbalance_result.details['imbalance_ratio']}")
    print(f"Severity: {imbalance_result.severity}")
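For intuition about the imbalance ratio printed above, one common definition is the size of the largest class divided by the size of the smallest. The sketch below uses that definition as an assumption. DataSentry may compute its `imbalance_ratio` differently.

```python
import numpy as np

def imbalance_ratio(y):
    """Largest class count divided by smallest class count."""
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.min()

# 900 samples of class 0, 80 of class 1, 20 of class 2.
y = np.array([0] * 900 + [1] * 80 + [2] * 20)
ratio = imbalance_ratio(y)  # 900 / 20 = 45.0
```

A ratio of 1.0 means perfectly balanced classes under this definition, and larger values mean stronger imbalance.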
Individual Fixing
# Fix specific issues
from datasentry import MissingValueFixer, OutlierFixer
# Fix missing values
missing_fixer = MissingValueFixer(strategy='knn')
result = missing_fixer.fix(X, y)
X_fixed = result.X_transformed
# Fix outliers
outlier_fixer = OutlierFixer(method='winsorize')
result = outlier_fixer.fix(X, y)
X_fixed = result.X_transformed
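To show what imputation and winsorizing do to the data, here is a NumPy-only sketch of median imputation followed by percentile clipping. The helper names and the 5th/95th percentile limits are assumptions for this example. They do not describe how MissingValueFixer or OutlierFixer are implemented.

```python
import numpy as np

def impute_median(X):
    """Replace each NaN with the median of its column."""
    X = np.asarray(X, dtype=float).copy()
    medians = np.nanmedian(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = medians[cols]
    return X

def winsorize(X, lower=5.0, upper=95.0):
    """Clip each column to its [lower, upper] percentile range."""
    lo, hi = np.percentile(X, [lower, upper], axis=0)
    return np.clip(X, lo, hi)

# Column 0 contains an extreme value, column 1 a missing value.
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [100.0, 20.0]])
X_imputed = impute_median(X)
X_fixed = winsorize(X_imputed)
```

Imputing before winsorizing matters here because np.percentile would return NaN for any column that still contains missing values.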
Visualization
# Visualize data quality issues
import matplotlib.pyplot as plt
# Class imbalance
fig = ds.visualize_imbalance(y, plot_type='both')
plt.show()
# Missing values
fig = ds.visualize_missing_values(X, plot_type='matrix')
plt.show()
# Outliers
fig = ds.visualize_outliers(X, plot_type='box')
plt.show()
# Correlation heatmap
fig = ds.visualize_redundancy(X, plot_type='heatmap')
plt.show()
# Distribution shift
fig = ds.visualize_shift(X_train, X_test, plot_type='comparison')
plt.show()
Report Generation
# Build the full report (indexed like a dictionary in Quick Start)
report = ds.generate_full_report(X, y, X_test=X_test)
# Run all detectors and pass the collected values to ReportGenerator
from datasentry.core.report import ReportGenerator
detectors = ds.detect_all(X, y, X_test=X_test)
report_gen = ReportGenerator(list(detectors.values()))
# Save as HTML
report_gen.save_html('data_quality_report.html')
# Save as JSON
report_gen.save_json('data_quality_report.json')
API Reference
Main Classes
DataSentry
Main orchestrator class for data quality management.
DataSentry(
    random_state: int = 42,
    verbose: bool = True
)
Methods:
- detect_all(X, y, X_test, y_test) - Run all detectors
- detect_imbalance(X, y) - Detect class imbalance
- detect_label_noise(X, y) - Detect label noise
- detect_data_leakage(X, y, X_test) - Detect data leakage
- detect_missing_values(X, y) - Detect missing values
- detect_outliers(X, y) - Detect outliers
- detect_redundancy(X, y) - Detect feature redundancy
- detect_shift(X, y, X_test, y_test) - Detect distribution shift
- fix_all(X, y, X_test, fix_config) - Fix all issues
- generate_full_report(X, y, X_test, y_test) - Generate comprehensive report
- visualize_* - Various visualization methods
Detectors
All detectors inherit from BaseDetector and return a DetectionResult.
from datasentry import (
    ImbalanceDetector,
    LabelNoiseDetector,
    DataLeakageDetector,
    MissingValueDetector,
    OutlierDetector,
    RedundancyDetector,
    ShiftDetector,
)
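The base-class pattern described above can be pictured with the following self-contained sketch. The `DetectionResult` and `BaseDetector` classes here are hypothetical stand-ins written for this example, not imports from DataSentry. Only the attribute names `issue_detected`, `severity`, and `details` come from the usage shown earlier. The `detect` method name and the `ConstantFeatureDetector` class are assumptions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict

import numpy as np

@dataclass
class DetectionResult:  # hypothetical stand-in, not the library's class
    issue_detected: bool
    severity: str
    details: Dict[str, Any] = field(default_factory=dict)

class BaseDetector(ABC):  # hypothetical stand-in, not the library's class
    @abstractmethod
    def detect(self, X, y=None) -> DetectionResult:
        """Inspect the data and describe what was found."""

class ConstantFeatureDetector(BaseDetector):
    """Hypothetical detector: flags columns holding a single unique value."""

    def detect(self, X, y=None) -> DetectionResult:
        X = np.asarray(X)
        constant = [j for j in range(X.shape[1])
                    if np.unique(X[:, j]).size == 1]
        return DetectionResult(
            issue_detected=bool(constant),
            severity="high" if constant else "none",
            details={"constant_columns": constant},
        )

# Column 1 never changes, so it should be reported.
X = np.array([[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]])
result = ConstantFeatureDetector().detect(X)
```

Returning one result type from every detector is what lets calling code check `issue_detected` and `severity` uniformly, whichever detector produced the result.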
Fixers
All fixers inherit from BaseFixer and return a FixResult.
from datasentry import (
    ImbalanceFixer,
    LabelNoiseFixer,
    DataLeakageFixer,
    MissingValueFixer,
    OutlierFixer,
    RedundancyFixer,
    ShiftFixer,
)
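A fixer following the same pattern might look like the sketch below. `FixResult` and `BaseFixer` are hypothetical stand-ins defined for this example rather than the library's own classes. The `fix(X, y)` call and the `X_transformed` attribute mirror the usage shown under Individual Fixing, while `y_transformed`, `details`, and `DropConstantFeatures` are assumptions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

import numpy as np

@dataclass
class FixResult:  # hypothetical stand-in, not the library's class
    X_transformed: np.ndarray
    y_transformed: Optional[np.ndarray] = None
    details: Dict[str, Any] = field(default_factory=dict)

class BaseFixer(ABC):  # hypothetical stand-in, not the library's class
    @abstractmethod
    def fix(self, X, y=None) -> FixResult:
        """Return a transformed copy of the data."""

class DropConstantFeatures(BaseFixer):
    """Hypothetical fixer: removes columns holding a single unique value."""

    def fix(self, X, y=None) -> FixResult:
        X = np.asarray(X)
        keep = [j for j in range(X.shape[1])
                if np.unique(X[:, j]).size > 1]
        return FixResult(
            X_transformed=X[:, keep],
            y_transformed=y,
            details={"kept_columns": keep},
        )

# Column 1 never changes, so only column 0 survives.
X = np.array([[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]])
result = DropConstantFeatures().fix(X)
X_fixed = result.X_transformed
```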
Examples
See the examples/ directory for more detailed examples:
- basic_example.py - Basic usage of DataSentry
- advanced_example.py - Advanced features and customization
- pipeline_integration.py - Integration with sklearn pipelines
Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Changelog
See CHANGELOG.md for version history.
Support
- Documentation: https://datasentry.readthedocs.io
- Issue Tracker: GitHub Issues
- Discussions: GitHub Discussions
Acknowledgments
- Built with scikit-learn
- Inspired by data quality best practices in ML pipelines