Data-centric ML inspection and auto-remediation toolkit
DataSentry
Proactive dataset validation for reliable machine learning systems.
DataSentry is a lightweight, production-focused Python library that
detects structural dataset issues before model training begins.
It helps prevent silent performance degradation, misleading validation
metrics, and costly deployment failures.
Because in machine learning, bad data breaks good models.
Why DataSentry?
Modern ML pipelines frequently suffer from hidden dataset issues:
| Problem | Risk |
|---|---|
| Class Imbalance | Biased predictions |
| Label Noise | Reduced generalization |
| Data Leakage | Inflated validation accuracy |
| Outliers | Distorted feature space |
| Distribution Shift | Poor real-world performance |
DataSentry automatically identifies these risks early in your workflow.
Core Features
Class Imbalance Detection
Evaluates label distribution and computes normalized imbalance metrics.
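As a rough illustration of what a normalized imbalance metric can look like, the sketch below computes the majority-to-minority ratio and the normalized Shannon entropy of the label counts. Both metrics and the helper name `imbalance_metrics` are assumptions chosen for illustration; the metric DataSentry actually computes may differ.

```python
import math
from collections import Counter


def imbalance_metrics(labels):
    """Return (majority/minority ratio, normalized entropy) for a list of labels.

    Hypothetical helper for illustration; not part of the DataSentry API.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes")
    total = sum(counts.values())
    ratio = max(counts.values()) / min(counts.values())
    # Shannon entropy divided by its maximum, log(k): 1.0 means perfectly balanced
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return ratio, entropy / math.log(len(counts))


# Same 90/10 split as the Quick Start labels
ratio, balance = imbalance_metrics([0] * 90 + [1] * 10)
print(f"ratio={ratio:.1f}, balance={balance:.3f}")
```

A ratio well above 1 or a balance score well below 1 both point to a skewed label distribution.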
Label Noise Detection
Flags suspicious label inconsistencies affecting model learning.
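One simple kind of label inconsistency is the same feature vector appearing with different labels. The sketch below flags such rows; it is an assumed heuristic for illustration, and `conflicting_labels` is a hypothetical name, not a DataSentry function.

```python
from collections import defaultdict


def conflicting_labels(X, y):
    """Return indices of rows whose exact feature vector occurs with more than one label."""
    labels_by_row = defaultdict(set)
    for row, label in zip(X, y):
        labels_by_row[tuple(row)].add(label)
    return [i for i, row in enumerate(X) if len(labels_by_row[tuple(row)]) > 1]


X = [[1.0, 2.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y = [0, 1, 0, 1]
# The first two rows are identical but carry different labels
print(conflicting_labels(X, y))
```

Exact-duplicate matching only catches the most obvious conflicts; real label noise detection usually also considers near-duplicates or model disagreement.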
Data Leakage Detection
Identifies features overly correlated with target variables.
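A correlation-based leakage check can be sketched as follows: compute the Pearson correlation between each feature column and the target, then flag columns whose absolute correlation exceeds a threshold. The helper names and the reading of `0.9` as a correlation cutoff are assumptions for illustration, not confirmed DataSentry behavior.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant column: correlation is undefined, treat as no signal
    return cov / math.sqrt(var_a * var_b)


def leaky_features(columns, target, threshold=0.9):
    """Return indices of feature columns with |correlation to target| above threshold."""
    return [i for i, col in enumerate(columns) if abs(pearson(col, target)) > threshold]


target = [0, 0, 0, 1, 1, 1]
columns = [
    [0.1, 0.0, 0.2, 0.9, 1.0, 1.1],    # nearly a copy of the target
    [0.5, -1.2, 0.3, 0.4, -0.9, 0.6],  # only weakly related to the target
]
print(leaky_features(columns, target))
```

Linear correlation misses non-linear leakage, so a check like this is a first filter rather than a guarantee.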
Outlier Detection
Detects anomalous samples that may distort training.
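One common way to detect anomalous values in a single feature is the modified z-score, which uses the median and the median absolute deviation (MAD) so that the outliers themselves do not inflate the scale estimate. This is a swapped-in textbook technique shown for illustration; the README does not say which method DataSentry uses, and `mad_outliers` is a hypothetical name.

```python
import statistics


def mad_outliers(values, cutoff=3.5):
    """Return indices whose modified z-score (median/MAD based) exceeds cutoff."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # more than half the values are identical; the score is undefined
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score
    return [i for i, v in enumerate(values) if 0.6745 * abs(v - median) / mad > cutoff]


values = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 25.0]
# Only the last value sits far from the rest
print(mad_outliers(values))
```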
Distribution Shift Detection
Compares feature distributions to detect dataset drift.
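Comparing two samples of the same feature can be done with the two-sample Kolmogorov-Smirnov statistic, the largest gap between their empirical cumulative distribution functions. The sketch below is one possible approach under that assumption; the README does not name the test DataSentry applies, and `ks_statistic` is a hypothetical helper.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    # The largest gap between two step functions occurs at an observed value
    for v in ref + cur:
        cdf_ref = bisect_right(ref, v) / len(ref)
        cdf_cur = bisect_right(cur, v) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


train = [0.1, 0.4, 0.5, 0.7, 0.9]
similar = [0.2, 0.3, 0.6, 0.8, 1.0]
shifted = [2.1, 2.4, 2.5, 2.7, 2.9]
print(ks_statistic(train, similar), ks_statistic(train, shifted))
```

The statistic ranges from 0 (identical empirical distributions) to 1 (no overlap), so a larger value for a feature suggests stronger drift.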
Installation
Install from PyPI:
pip install datasentry
Or install locally:
pip install .
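Quick Start
import numpy as np
from datasentry import analyze

# Synthetic data: 100 samples, 5 features, and a 90/10 class split
X = np.random.randn(100, 5)
y = np.array([0] * 90 + [1] * 10)

# Analyze the dataset and print the resulting report
report = analyze(X=X, y=y)
print(report)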
Advanced Usage
report = analyze(
    X=X,
    y=y,
    imbalance_threshold=3.0,
    outlier_threshold=0.1,
    leakage_threshold=0.9,
)
Architecture
datasentry/
│
├── detectors/
│   ├── imbalance.py
│   ├── label_noise.py
│   ├── leakage.py
│   ├── outliers.py
│   └── shift.py
│
├── analyzer.py
├── config.py
├── report.py
├── utils.py
└── fixer.py
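The layout suggests one module per detector plus an analyzer that combines their results. A minimal sketch of that pattern follows; every name in it (`DetectorResult`, `run_detectors`, `toy_imbalance_detector`) is hypothetical and not taken from the DataSentry source.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DetectorResult:
    """Hypothetical shared result shape; not DataSentry's actual schema."""
    name: str
    flagged: bool
    details: Dict[str, float] = field(default_factory=dict)


def run_detectors(detectors: List[Callable[..., DetectorResult]], X, y) -> List[DetectorResult]:
    """Call each detector with the same inputs and collect the results."""
    return [detector(X, y) for detector in detectors]


def toy_imbalance_detector(X, y) -> DetectorResult:
    """Flag the dataset when the majority/minority ratio exceeds 3.0 (assumed cutoff)."""
    counts = {label: y.count(label) for label in set(y)}
    ratio = max(counts.values()) / min(counts.values())
    return DetectorResult("imbalance", flagged=ratio > 3.0, details={"ratio": ratio})


results = run_detectors([toy_imbalance_detector], X=[[0.0]] * 10, y=[0] * 9 + [1])
for result in results:
    print(result.name, result.flagged, result.details)
```

Giving every detector the same return type keeps the combining step a plain loop, which is one way a report could be assembled from independent checks.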
Running Tests
pytest
Roadmap
- CLI interface
- Visualization dashboard
- Advanced statistical leakage detection
- Automated remediation suggestions
- scikit-learn pipeline integration
- Production drift monitoring
Contributing
Contributions are welcome. Please ensure:
- Clear documentation
- Proper unit test coverage
- Consistent coding standards
- Descriptive commit messages
License
MIT License
Philosophy
Reliable machine learning begins with reliable data.
DataSentry focuses on structural dataset validation to reduce debugging effort, improve model robustness, and increase production stability.
v0.1.1
- Unified detector output schema
- Fixed imbalance threshold logic
- Refactored health score computation
- Fixed AutoFixer compatibility
- Updated test suite
- Improved package structure