DataSentry

A Professional Data Quality Framework for ML Pipelines

Overview

DataSentry is a comprehensive Python library designed to detect and remediate data quality issues in machine learning pipelines. It provides a unified interface for identifying and fixing common data problems including class imbalance, label noise, data leakage, missing values, outliers, feature redundancy, and data distribution shift.

Features

Data Quality Detection

Imbalance Detection: Identify class imbalance with customizable thresholds
Label Noise Detection: Find potentially mislabeled samples using confident learning
Data Leakage Detection: Detect target leakage, duplicates, and train-test contamination
Missing Value Detection: Analyze missing value patterns and completeness
Outlier Detection: Identify outliers using IQR, Z-score, Isolation Forest, and LOF
Redundancy Detection: Find correlated and duplicate features
Shift Detection: Detect distribution drift between train and test sets
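As an illustration of the IQR rule named in the outlier entry above, here is a minimal standard-library sketch. It is not DataSentry's implementation: the function name `iqr_outlier_indices` and the multiplier `k=1.5` are assumptions chosen for the example, and the library's own thresholds may differ.

```python
import statistics


def iqr_outlier_indices(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# One feature column with a single obviously extreme value at the end.
feature = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0]
outliers = iqr_outlier_indices(feature)
print(outliers)
```

Only the last value falls outside the fences here. Z-score, Isolation Forest, and LOF take different approaches and are not covered by this sketch.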

Data Quality Remediation

Imbalance Fixer: SMOTE, ADASYN, undersampling, and class weights
Label Noise Fixer: Remove, relabel, or weight noisy samples
Data Leakage Fixer: Remove leaky features and duplicates
Missing Value Fixer: Mean, median, mode, KNN, and iterative imputation
Outlier Fixer: Remove, cap, transform, or winsorize outliers
Redundancy Fixer: Remove features or apply PCA
Shift Fixer: Standardize and normalize distributions
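To make the median strategy from the missing value entry above concrete, the sketch below fills NaN entries column by column using only the standard library. The helper name `impute_median` is hypothetical and this is not the library's code; it only shows the idea.

```python
import math
import statistics


def impute_median(rows):
    """Replace NaN entries with the median of the observed values in each column."""
    n_cols = len(rows[0])
    medians = []
    for c in range(n_cols):
        # Use only the values that are actually present in this column.
        observed = [row[c] for row in rows if not math.isnan(row[c])]
        medians.append(statistics.median(observed))
    return [
        [medians[c] if math.isnan(v) else v for c, v in enumerate(row)]
        for row in rows
    ]


nan = float("nan")
X = [[1.0, 10.0], [nan, 20.0], [3.0, nan], [5.0, 40.0]]
X_imputed = impute_median(X)
print(X_imputed)
```

In a train/test setting the medians would normally be computed on the training rows only and then reused for the test rows, so that imputation itself does not leak test information.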

Visualization

Interactive plots for all data quality issues
Distribution comparisons
Correlation heatmaps
Missing value patterns
Outlier visualizations

Installation

From PyPI (Recommended)

pip install datasentry

With Optional Dependencies

# Quote the extras so shells such as zsh do not try to expand the brackets

# For advanced imbalance handling (SMOTE, ADASYN)
pip install "datasentry[imblearn]"

# For all optional features
pip install "datasentry[all]"

# For development
pip install "datasentry[dev]"
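Before requesting SMOTE or ADASYN it can help to check whether the optional dependency is present. The sketch below assumes that the `imblearn` extra installs the imbalanced-learn package, whose import name is `imblearn`; the helper `is_installed` is hypothetical and not part of DataSentry.

```python
from importlib.util import find_spec


def is_installed(module_name):
    """Return True if a top-level module can be imported in this environment."""
    return find_spec(module_name) is not None


# Assumption: the imblearn extra provides the imbalanced-learn package.
smote_available = is_installed("imblearn")
print(f"SMOTE/ADASYN support available: {smote_available}")
```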

From Source

git clone https://github.com/010Ankushsharma/datasentry.git
cd datasentry
pip install -e .

Quick Start

from datasentry import DataSentry
import numpy as np

# Generate sample data
np.random.seed(42)
X_train = np.random.randn(1000, 10)
y_train = np.random.randint(0, 3, 1000)
X_test = np.random.randn(200, 10)

# Initialize DataSentry
ds = DataSentry(random_state=42, verbose=True)

# Generate comprehensive report
report = ds.generate_full_report(X_train, y_train, X_test=X_test)

# View health score
print(f"Health Score: {report['report_metadata']['health_score']:.2%}")
print(f"Overall Status: {report['report_metadata']['overall_status']}")

# Fix all detected issues
X_clean, y_clean = ds.fix_all(
    X_train, y_train,
    fix_config={
        'missing_values': {'strategy': 'mean'},
        'outliers': {'method': 'cap'},
        'imbalance': {'method': 'smote'},  # SMOTE needs the optional imblearn extra
    }
)

Detailed Usage

Individual Detection

from datasentry import DataSentry
import numpy as np

# Sample data; any feature matrix and label vector can be used instead
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))
y = rng.integers(0, 3, size=1000)
X_test = rng.normal(size=(200, 10))

ds = DataSentry()

# Detect specific issues
imbalance_result = ds.detect_imbalance(X, y)
missing_result = ds.detect_missing_values(X)
outlier_result = ds.detect_outliers(X)
leakage_result = ds.detect_data_leakage(X, y, X_test=X_test)

# Check if issues were detected
if imbalance_result.issue_detected:
    print(f"Imbalance ratio: {imbalance_result.details['imbalance_ratio']}")
    print(f"Severity: {imbalance_result.severity}")
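The source does not define how `imbalance_ratio` is computed. One common definition is the majority class count divided by the minority class count, sketched below with the standard library. The function name and the threshold of 3.0 are assumptions for illustration, not DataSentry defaults.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Majority class count divided by minority class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# 90 samples of class 0 and 10 samples of class 1.
y = [0] * 90 + [1] * 10
ratio = imbalance_ratio(y)

# Hypothetical threshold: flag the dataset when the ratio exceeds 3.0.
is_imbalanced = ratio > 3.0
print(ratio, is_imbalanced)
```

A perfectly balanced dataset has a ratio of 1.0 under this definition, and larger values indicate stronger imbalance.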

Individual Fixing

from datasentry import MissingValueFixer, OutlierFixer

# X and y are the feature matrix and labels used in the detection example

# Fix missing values with KNN imputation
missing_fixer = MissingValueFixer(strategy='knn')
missing_fix = missing_fixer.fix(X, y)
X_imputed = missing_fix.X_transformed

# Fix outliers by winsorizing
outlier_fixer = OutlierFixer(method='winsorize')
outlier_fix = outlier_fixer.fix(X, y)
X_winsorized = outlier_fix.X_transformed
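For readers unfamiliar with winsorizing, the sketch below clips a single feature to its 5th and 95th percentiles using the standard library. The percentile choices and the function name `winsorize` are assumptions; the source does not state which limits `OutlierFixer` applies.

```python
import statistics


def winsorize(values, lower_pct=5, upper_pct=95):
    """Clip values to the given percentiles (integers from 1 to 99)."""
    # quantiles(n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]


# Nineteen ordinary values followed by one extreme value.
feature = [float(v) for v in range(1, 20)] + [100.0]
capped = winsorize(feature)
print(capped[0], capped[-1])
```

Unlike removal, winsorizing keeps every row and only pulls the extreme values in toward the chosen percentiles.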

Visualization

# Visualize data quality issues
import matplotlib.pyplot as plt

# Class imbalance
fig = ds.visualize_imbalance(y, plot_type='both')
plt.show()

# Missing values
fig = ds.visualize_missing_values(X, plot_type='matrix')
plt.show()

# Outliers
fig = ds.visualize_outliers(X, plot_type='box')
plt.show()

# Correlation heatmap
fig = ds.visualize_redundancy(X, plot_type='heatmap')
plt.show()

# Distribution shift
fig = ds.visualize_shift(X_train, X_test, plot_type='comparison')
plt.show()
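The source does not say which statistic DataSentry uses to quantify shift. One widely used choice for a single numeric feature is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical distribution functions. The sketch below is a standard-library illustration under that assumption, with a hypothetical function name.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)

    def ecdf(sorted_sample, x):
        # Fraction of the sample that is less than or equal to x.
        return bisect_right(sorted_sample, x) / len(sorted_sample)

    # The gap can only change at observed points, so checking those is enough.
    return max(abs(ecdf(a_sorted, x) - ecdf(b_sorted, x))
               for x in a_sorted + b_sorted)


train_feature = list(range(100))
test_same = list(range(100))
test_shifted = [v + 50 for v in train_feature]

d_same = ks_statistic(train_feature, test_same)
d_shifted = ks_statistic(train_feature, test_shifted)
print(d_same, d_shifted)
```

Identical samples give a statistic of zero, while shifting the test feature by half of its range gives a value of about 0.5. Turning the statistic into a decision needs a threshold or a p-value, which this sketch leaves out.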

Report Generation

from datasentry.core.report import ReportGenerator

# Build the full report (indexed like a dictionary in Quick Start)
report = ds.generate_full_report(X, y, X_test=X_test)

# Run all detectors and build a report generator from the returned values
detectors = ds.detect_all(X, y, X_test=X_test)
report_gen = ReportGenerator(list(detectors.values()))

# Save as HTML
report_gen.save_html('data_quality_report.html')

# Save as JSON
report_gen.save_json('data_quality_report.json')

API Reference

Main Classes

`DataSentry`

Main orchestrator class for data quality management.

DataSentry(
    random_state: int = 42,
    verbose: bool = True
)

Methods:

detect_all(X, y, X_test, y_test) - Run all detectors
detect_imbalance(X, y) - Detect class imbalance
detect_label_noise(X, y) - Detect label noise
detect_data_leakage(X, y, X_test) - Detect data leakage
detect_missing_values(X, y) - Detect missing values
detect_outliers(X, y) - Detect outliers
detect_redundancy(X, y) - Detect feature redundancy
detect_shift(X, y, X_test, y_test) - Detect distribution shift
fix_all(X, y, X_test, fix_config) - Fix all issues
generate_full_report(X, y, X_test, y_test) - Generate comprehensive report
visualize_* - Various visualization methods

Detectors

All detectors inherit from BaseDetector and return a DetectionResult.

from datasentry import (
    ImbalanceDetector,
    LabelNoiseDetector,
    DataLeakageDetector,
    MissingValueDetector,
    OutlierDetector,
    RedundancyDetector,
    ShiftDetector,
)
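The usage examples earlier read three attributes from a detection result: `issue_detected`, `severity`, and `details`. The sketch below mimics that shape with a stand-in dataclass and a toy redundancy check based on Pearson correlation. Everything here is hypothetical (`SketchResult`, `detect_redundancy_sketch`, the 0.95 threshold, the severity labels); it is not `BaseDetector` or `DetectionResult`, whose definitions are not shown in the source.

```python
import math
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class SketchResult:
    """Stand-in with the three attributes used in the examples above."""
    issue_detected: bool
    severity: str
    details: dict = field(default_factory=dict)


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (norm_a * norm_b)


def detect_redundancy_sketch(columns, threshold=0.95):
    """Flag pairs of feature columns whose absolute correlation is high."""
    pairs = [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(columns), 2)
        if abs(pearson(a, b)) >= threshold
    ]
    return SketchResult(
        issue_detected=bool(pairs),
        severity="high" if pairs else "none",
        details={"correlated_pairs": pairs},
    )


# Column 1 is almost exactly twice column 0; column 2 is unrelated.
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.1],
    [5.0, 1.0, 4.0, 2.0, 3.0],
]
result = detect_redundancy_sketch(columns)
if result.issue_detected:
    print(result.severity, result.details["correlated_pairs"])
```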

Fixers

All fixers inherit from BaseFixer and return a FixResult.

from datasentry import (
    ImbalanceFixer,
    LabelNoiseFixer,
    DataLeakageFixer,
    MissingValueFixer,
    OutlierFixer,
    RedundancyFixer,
    ShiftFixer,
)

Examples

See the examples/ directory for more detailed examples:

basic_example.py - Basic usage of DataSentry
advanced_example.py - Advanced features and customization
pipeline_integration.py - Integration with sklearn pipelines

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Changelog

See CHANGELOG.md for version history.

Support

Documentation: https://datasentry.readthedocs.io
Issue Tracker: GitHub Issues
Discussions: GitHub Discussions

Acknowledgments

Built with scikit-learn
Inspired by data quality best practices in ML pipelines

Release history

1.0.0 (this version): Feb 22, 2026
0.1.1: Feb 12, 2026
0.1.0: Feb 12, 2026

Download files

Source Distribution

datasentry-1.0.0.tar.gz (68.0 kB), uploaded Feb 22, 2026, Source

Built Distribution

datasentry-1.0.0-py3-none-any.whl (85.6 kB), uploaded Feb 22, 2026, Python 3

File details

Details for the file datasentry-1.0.0.tar.gz.

File metadata

Download URL: datasentry-1.0.0.tar.gz
Upload date: Feb 22, 2026
Size: 68.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for datasentry-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`8fbda020e932cb6b1651d4ce3cc6b58e274c968271144282baea60629e0ef4d2`
MD5	`693df2e4f375311fb573e8313e1289e3`
BLAKE2b-256	`b80ceaaf03d0da7238c79b503ba7f16a346c8b0af9e29861840ac6b807deabe7`
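A downloaded archive can be checked against the SHA256 digest listed above with Python's standard library. This is a minimal sketch: the helper names are hypothetical, and the local file path is an assumption about where the download was saved.

```python
import hashlib
import hmac
import os

# SHA256 digest published above for datasentry-1.0.0.tar.gz.
PUBLISHED_SHA256 = "8fbda020e932cb6b1651d4ce3cc6b58e274c968271144282baea60629e0ef4d2"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare the file's digest with the expected hex string."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.lower())


if __name__ == "__main__":
    # Assumed location of the downloaded source distribution.
    archive = "datasentry-1.0.0.tar.gz"
    if os.path.exists(archive):
        print("hash ok:", matches_published_hash(archive, PUBLISHED_SHA256))
```

pip can perform the same comparison at install time through its hash-checking mode, using `--require-hashes` together with `--hash=sha256:...` entries in a requirements file.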

File details

Details for the file datasentry-1.0.0-py3-none-any.whl.

File metadata

Download URL: datasentry-1.0.0-py3-none-any.whl
Upload date: Feb 22, 2026
Size: 85.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for datasentry-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3bced2e2237e7d7ceab7a1a4739cc78d6d8f9ce92209deed83509151b86db56f`
MD5	`29510f33bbec09541e6b995864b13792`
BLAKE2b-256	`4e9c8cf39ae3b21eb3606e95cb8710a5beaf8e3b8d97e239446e7cd18c9f4336`
