Data-centric ML inspection and auto-remediation toolkit

🚀 DataSentry

Proactive dataset validation for reliable machine learning systems.

DataSentry is a lightweight, production-focused Python library that detects structural dataset issues before model training begins.
It helps prevent silent performance degradation, misleading validation metrics, and costly deployment failures.

Because in machine learning, bad data breaks good models.


🚩 Why DataSentry?

Modern ML pipelines frequently suffer from hidden dataset issues:

Problem                  Risk
⚖️ Class Imbalance       Biased predictions
🏷 Label Noise           Reduced generalization
🔓 Data Leakage          Inflated validation accuracy
📉 Outliers              Distorted feature space
🔄 Distribution Shift    Poor real-world performance

DataSentry automatically identifies these risks early in your workflow.


🧠 Core Features

⚖️ Class Imbalance Detection

Evaluates label distribution and computes normalized imbalance metrics.
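
As an illustration of what a normalized imbalance metric can look like, the sketch below computes a majority-to-minority ratio and a normalized entropy score. The function name `imbalance_summary` and both metrics are assumptions chosen for this example; they are not taken from DataSentry's source and may differ from what the library computes.

```python
from collections import Counter
import math


def imbalance_summary(labels):
    """Hypothetical helper: summarize how unevenly the labels are distributed."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    total = sum(counts.values())

    # Ratio of the largest class to the smallest class (1.0 means balanced).
    ratio = max(counts.values()) / min(counts.values())

    # Normalized Shannon entropy: 1.0 is perfectly balanced, 0.0 is a single class.
    if len(counts) < 2:
        balance = 0.0
    else:
        entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
        balance = entropy / math.log(len(counts))

    return {"counts": dict(counts), "ratio": ratio, "balance": balance}


y = [0] * 90 + [1] * 10
summary = imbalance_summary(y)
print(summary["ratio"])  # 9.0
```

A ratio threshold is easy to explain to stakeholders, while the entropy score stays comparable across datasets with different numbers of classes.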

🏷 Label Noise Detection

Flags suspicious label inconsistencies affecting model learning.
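
One simple kind of label inconsistency is a set of identical feature rows that carry different labels. The sketch below finds such groups; it is only an illustration of the idea, the helper name `conflicting_label_groups` is hypothetical, and DataSentry's own detector may use a different technique.

```python
from collections import defaultdict


def conflicting_label_groups(rows, labels):
    """Hypothetical helper: return index groups of identical rows with differing labels."""
    labels_seen = defaultdict(set)
    members = defaultdict(list)
    for index, (row, label) in enumerate(zip(rows, labels)):
        key = tuple(row)  # make the feature row hashable
        labels_seen[key].add(label)
        members[key].append(index)
    # Keep only the groups where more than one distinct label was observed.
    return [members[key] for key in members if len(labels_seen[key]) > 1]


rows = [(1.0, 2.0), (3.0, 4.0), (1.0, 2.0), (5.0, 6.0)]
labels = [0, 1, 1, 0]
conflicts = conflicting_label_groups(rows, labels)
print(conflicts)  # [[0, 2]]
```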

🔓 Data Leakage Detection

Identifies features overly correlated with target variables.
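
A minimal sketch of a correlation-based leakage check follows, assuming Pearson correlation and an absolute-value cutoff. The helpers `pearson` and `suspicious_features`, the column names, and the reading of the 0.9 cutoff are assumptions made for illustration, not DataSentry's documented behavior.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def suspicious_features(columns, target, threshold=0.9):
    """Hypothetical helper: map feature name -> correlation for features at or above the cutoff."""
    flagged = {}
    for name, values in columns.items():
        r = pearson(values, target)
        if abs(r) >= threshold:
            flagged[name] = r
    return flagged


target = [0, 0, 1, 1, 0, 1]
columns = {
    "age": [23, 45, 31, 52, 38, 27],
    "leaky_copy": [0.1, 0.0, 0.9, 1.0, 0.1, 0.95],  # nearly a copy of the target
}
flagged = suspicious_features(columns, target)
print(sorted(flagged))  # ['leaky_copy']
```

Linear correlation only catches the most direct leaks; a feature that encodes the target non-linearly would pass this check.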

📉 Outlier Detection

Detects anomalous samples that may distort training.
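
To make the idea concrete, here is a sketch of the classic interquartile-range (IQR) rule applied to a single feature. The IQR rule is a stand-in chosen for this example; the README does not say which method DataSentry uses, and `iqr_outlier_indices` is a hypothetical name.

```python
import statistics


def iqr_outlier_indices(values, k=1.5):
    """Hypothetical helper: indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lower or v > upper]


values = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 42.0, 9.7]
print(iqr_outlier_indices(values))  # [6]
```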

🔄 Distribution Shift Detection

Compares feature distributions to detect dataset drift.
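
One common way to compare two samples of the same feature is the two-sample Kolmogorov-Smirnov statistic, the largest gap between their empirical cumulative distribution functions. The sketch below implements it with the standard library only; it illustrates the general technique and is not a description of DataSentry's internals. The name `ks_statistic` is hypothetical.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Hypothetical helper: two-sample KS statistic in [0, 1]; larger means more shift."""
    ref = sorted(reference)
    cur = sorted(current)
    largest_gap = 0.0
    # The largest gap between two step-function ECDFs occurs at an observed value.
    for value in ref + cur:
        cdf_ref = bisect_right(ref, value) / len(ref)
        cdf_cur = bisect_right(cur, value) / len(cur)
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_cur))
    return largest_gap


train = [0.1 * i for i in range(50)]
shifted = [x + 10 for x in train]  # same shape, moved far to the right

print(ks_statistic(train, train))    # 0.0
print(ks_statistic(train, shifted))  # 1.0
```

In practice the statistic is usually paired with a p-value or a fixed cutoff per feature, for example via `scipy.stats.ks_2samp`.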


📦 Installation

Install from PyPI:

pip install datasentry

Or install locally:

pip install .

⚡ Quick Start

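import numpy as np
from datasentry import analyze

# Synthetic dataset: 100 samples, 5 features, and a 90/10 class split
X = np.random.randn(100, 5)
y = np.array([0] * 90 + [1] * 10)

# Analyze the dataset and print the resulting report
report = analyze(X=X, y=y)
print(report)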

⚙ Advanced Usage

# X and y as defined in Quick Start
report = analyze(
    X=X,
    y=y,
    imbalance_threshold=3.0,
    outlier_threshold=0.1,
    leakage_threshold=0.9,
)

🏗 Architecture

datasentry/
│
├── detectors/
│   ├── imbalance.py
│   ├── label_noise.py
│   ├── leakage.py
│   ├── outliers.py
│   └── shift.py
│
├── analyzer.py
├── config.py
├── report.py
├── utils.py
└── fixer.py

🧪 Running Tests

pytest

🗺 Roadmap

  • CLI interface
  • Visualization dashboard
  • Advanced statistical leakage detection
  • Automated remediation suggestions
  • scikit-learn pipeline integration
  • Production drift monitoring

🤝 Contributing

Contributions are welcome. Please ensure:

  • Clear documentation
  • Proper unit test coverage
  • Consistent coding standards
  • Descriptive commit messages

📄 License

MIT License


💡 Philosophy

Reliable machine learning begins with reliable data.

DataSentry focuses on structural dataset validation to reduce debugging effort, improve model robustness, and increase production stability.

v0.1.1

  • Unified detector output schema
  • Fixed imbalance threshold logic
  • Refactored health score computation
  • Fixed AutoFixer compatibility
  • Updated test suite
  • Improved package structure
