A Professional Data Quality Framework for ML Pipelines

DataSentry

Overview

DataSentry is a comprehensive Python library designed to detect and remediate data quality issues in machine learning pipelines. It provides a unified interface for identifying and fixing common data problems including class imbalance, label noise, data leakage, missing values, outliers, feature redundancy, and data distribution shift.

Features

Data Quality Detection

  • Imbalance Detection: Identify class imbalance with customizable thresholds
  • Label Noise Detection: Find potentially mislabeled samples using confident learning
  • Data Leakage Detection: Detect target leakage, duplicates, and train-test contamination
  • Missing Value Detection: Analyze missing value patterns and completeness
  • Outlier Detection: Identify outliers using IQR, Z-score, Isolation Forest, and LOF
  • Redundancy Detection: Find correlated and duplicate features
  • Shift Detection: Detect distribution drift between train and test sets
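As an illustration of one technique named above, the IQR rule for outliers can be sketched with NumPy alone. This is a minimal sketch and not DataSentry's implementation: the function name `iqr_outlier_mask` and the multiplier `k=1.5` are assumptions chosen for the example.

```python
import numpy as np

def iqr_outlier_mask(X, k=1.5):
    """Mark values outside [Q1 - k*IQR, Q3 + k*IQR], computed per column.

    Hypothetical helper for illustration; not part of the DataSentry API.
    """
    X = np.asarray(X, dtype=float)
    q1 = np.nanpercentile(X, 25, axis=0)
    q3 = np.nanpercentile(X, 75, axis=0)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Comparisons against NaN are False, so missing values are never flagged
    return (X < lower) | (X > upper)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
X_demo[0, 0] = 25.0  # inject one obvious outlier

mask = iqr_outlier_mask(X_demo)
print("Outliers per column:", mask.sum(axis=0))
```

The mask has the same shape as the input, so it can be used to count, inspect, or filter flagged values per feature.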

Data Quality Remediation

  • Imbalance Fixer: SMOTE, ADASYN, undersampling, and class weights
  • Label Noise Fixer: Remove, relabel, or weight noisy samples
  • Data Leakage Fixer: Remove leaky features and duplicates
  • Missing Value Fixer: Mean, median, mode, KNN, and iterative imputation
  • Outlier Fixer: Remove, cap, transform, or winsorize outliers
  • Redundancy Fixer: Remove features or apply PCA
  • Shift Fixer: Standardize and normalize distributions
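Two of the remediation strategies listed above, median imputation and capping, can be sketched in plain NumPy. This is an illustrative sketch under stated assumptions, not the library's code: the helper names and the 1st/99th percentile bounds are hypothetical.

```python
import numpy as np

def impute_median(X):
    """Replace NaN entries with the median of their column (returns a copy)."""
    X = np.array(X, dtype=float)  # np.array copies, so the input is untouched
    medians = np.nanmedian(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = medians[cols]
    return X

def cap_to_percentiles(X, lower=1.0, upper=99.0):
    """Clip each column to its [lower, upper] percentile range."""
    X = np.asarray(X, dtype=float)
    lo = np.percentile(X, lower, axis=0)
    hi = np.percentile(X, upper, axis=0)
    return np.clip(X, lo, hi)

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 2))
X_demo[3, 1] = np.nan   # one missing value
X_demo[5, 0] = 50.0     # one extreme value

X_imputed = impute_median(X_demo)
X_capped = cap_to_percentiles(X_imputed)
print("NaNs left:", int(np.isnan(X_capped).sum()))
print("Max of column 0 after capping:", X_capped[:, 0].max())
```

Imputing before capping matters here because `np.percentile` returns NaN for any column that still contains missing values.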

Visualization

  • Interactive plots for all data quality issues
  • Distribution comparisons
  • Correlation heatmaps
  • Missing value patterns
  • Outlier visualizations

Installation

From PyPI (Recommended)

pip install datasentry

With Optional Dependencies

# Quote the extras so shells such as zsh do not treat the brackets as a glob pattern

# For advanced imbalance handling (SMOTE, ADASYN)
pip install "datasentry[imblearn]"

# For all optional features
pip install "datasentry[all]"

# For development
pip install "datasentry[dev]"

From Source

git clone https://github.com/010Ankushsharma/datasentry.git
cd datasentry
pip install -e .

Quick Start

from datasentry import DataSentry
import numpy as np

# Generate sample data
np.random.seed(42)
X_train = np.random.randn(1000, 10)
y_train = np.random.randint(0, 3, 1000)
X_test = np.random.randn(200, 10)

# Initialize DataSentry
ds = DataSentry(random_state=42, verbose=True)

# Generate comprehensive report
report = ds.generate_full_report(X_train, y_train, X_test=X_test)

# View health score
print(f"Health Score: {report['report_metadata']['health_score']:.2%}")
print(f"Overall Status: {report['report_metadata']['overall_status']}")

# Fix all detected issues
# Note: SMOTE is listed under the optional imblearn extra (see Installation)
X_clean, y_clean = ds.fix_all(
    X_train, y_train,
    fix_config={
        'missing_values': {'strategy': 'mean'},
        'outliers': {'method': 'cap'},
        'imbalance': {'method': 'smote'},
    }
)
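The Quick Start arrays are clean Gaussian noise, so they contain few quality problems to find. A dataset with deliberately injected issues makes the report easier to explore. The sketch below uses only NumPy; the proportions and magnitudes are arbitrary assumptions, and which issues DataSentry flags for this data is not stated in this README.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 1000, 10

X_train = rng.normal(size=(n_samples, n_features))

# Class imbalance: roughly 90% / 8% / 2% across three classes
y_train = rng.choice([0, 1, 2], size=n_samples, p=[0.90, 0.08, 0.02])

# Missing values: 50 NaN entries in column 0
missing_rows = rng.choice(n_samples, size=50, replace=False)
X_train[missing_rows, 0] = np.nan

# Outliers: shift the first 10 rows of column 1 far from the bulk
X_train[:10, 1] += 20.0

# Redundancy: column 9 becomes a near-copy of column 2
X_train[:, 9] = X_train[:, 2] + rng.normal(scale=0.01, size=n_samples)

# Distribution shift: test features drawn with a different mean
X_test = rng.normal(loc=0.5, size=(200, n_features))

print("NaNs in column 0:", int(np.isnan(X_train[:, 0]).sum()))
print("Class counts:", np.bincount(y_train, minlength=3))
```

These arrays have the same shapes as the Quick Start arrays, so they can be passed to `generate_full_report` and `fix_all` in their place.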

Detailed Usage

Individual Detection

from datasentry import DataSentry

ds = DataSentry()

# X, y, and X_test are your feature matrix, labels, and held-out features,
# e.g. X_train, y_train, and X_test from Quick Start

# Detect specific issues
imbalance_result = ds.detect_imbalance(X, y)
missing_result = ds.detect_missing_values(X)
outlier_result = ds.detect_outliers(X)
leakage_result = ds.detect_data_leakage(X, y, X_test=X_test)

# Check if issues were detected
if imbalance_result.issue_detected:
    print(f"Imbalance ratio: {imbalance_result.details['imbalance_ratio']}")
    print(f"Severity: {imbalance_result.severity}")
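This README does not define how `imbalance_ratio` is computed. One common convention, shown here purely as an assumption, is the size of the largest class divided by the size of the smallest; the helper name below is hypothetical.

```python
import numpy as np

def imbalance_ratio(y):
    """Majority-class count divided by minority-class count.

    Assumed definition for illustration; DataSentry may compute it differently.
    """
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.min()

y_demo = np.array([0] * 90 + [1] * 10)
print(imbalance_ratio(y_demo))  # 9.0
```

Under this convention a perfectly balanced dataset has a ratio of 1.0, and larger values indicate stronger imbalance.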

Individual Fixing

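# Fix specific issues
from datasentry import MissingValueFixer, OutlierFixer

# X, y: feature matrix and labels, e.g. X_train and y_train from Quick Start.
# Each fixer runs on the original X; results are kept in separate variables.

# Fix missing values with KNN imputation
missing_fixer = MissingValueFixer(strategy='knn')
missing_fix = missing_fixer.fix(X, y)
X_imputed = missing_fix.X_transformed

# Fix outliers by winsorizing
outlier_fixer = OutlierFixer(method='winsorize')
outlier_fix = outlier_fixer.fix(X, y)
X_winsorized = outlier_fix.X_transformed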

Visualization

# Visualize data quality issues
import matplotlib.pyplot as plt

# Class imbalance
fig = ds.visualize_imbalance(y, plot_type='both')
plt.show()

# Missing values
fig = ds.visualize_missing_values(X, plot_type='matrix')
plt.show()

# Outliers
fig = ds.visualize_outliers(X, plot_type='box')
plt.show()

# Correlation heatmap
fig = ds.visualize_redundancy(X, plot_type='heatmap')
plt.show()

# Distribution shift
fig = ds.visualize_shift(X_train, X_test, plot_type='comparison')
plt.show()

Report Generation

from datasentry.core.report import ReportGenerator

# Build the in-memory report (indexed by key, as in Quick Start)
report = ds.generate_full_report(X, y, X_test=X_test)

# To write a report to disk, run the detectors and pass their results to ReportGenerator
detectors = ds.detect_all(X, y, X_test=X_test)
report_gen = ReportGenerator(list(detectors.values()))

# Save as HTML
report_gen.save_html('data_quality_report.html')

# Save as JSON
report_gen.save_json('data_quality_report.json')

API Reference

Main Classes

DataSentry

Main orchestrator class for data quality management.

DataSentry(
    random_state: int = 42,
    verbose: bool = True
)

Methods:

  • detect_all(X, y, X_test, y_test) - Run all detectors
  • detect_imbalance(X, y) - Detect class imbalance
  • detect_label_noise(X, y) - Detect label noise
  • detect_data_leakage(X, y, X_test) - Detect data leakage
  • detect_missing_values(X, y) - Detect missing values
  • detect_outliers(X, y) - Detect outliers
  • detect_redundancy(X, y) - Detect feature redundancy
  • detect_shift(X, y, X_test, y_test) - Detect distribution shift
  • fix_all(X, y, X_test, fix_config) - Fix all issues
  • generate_full_report(X, y, X_test, y_test) - Generate comprehensive report
  • visualize_* - Various visualization methods

Detectors

All detectors inherit from BaseDetector and return a DetectionResult.

from datasentry import (
    ImbalanceDetector,
    LabelNoiseDetector,
    DataLeakageDetector,
    MissingValueDetector,
    OutlierDetector,
    RedundancyDetector,
    ShiftDetector,
)

Fixers

All fixers inherit from BaseFixer and return a FixResult.

from datasentry import (
    ImbalanceFixer,
    LabelNoiseFixer,
    DataLeakageFixer,
    MissingValueFixer,
    OutlierFixer,
    RedundancyFixer,
    ShiftFixer,
)

Examples

See the examples/ directory for more detailed examples:

  • basic_example.py - Basic usage of DataSentry
  • advanced_example.py - Advanced features and customization
  • pipeline_integration.py - Integration with sklearn pipelines

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Changelog

See CHANGELOG.md for version history.

Acknowledgments

  • Built with scikit-learn
  • Inspired by data quality best practices in ML pipelines
