The unified data repair, validation, drift detection, and failure tracing library for production ML
Why datamend?
Real-world data is never clean. Nulls sneak in. Distributions shift. Models fail silently on corrupted inputs.
datamend is the single library that catches, fixes, validates, monitors, and traces every data quality issue, automatically, so your ML pipeline never breaks from bad data again.
| WITHOUT datamend | WITH datamend |
|---|---|
| Nulls → model crashes | Auto-imputed before fit |
| Drift undetected | PSI + KS test every batch |
| Contract violations | Schema enforced at the gate |
| Hours debugging | Row-level failure attribution |
| 5 different libraries | One unified API |
Installation
# Core (repair, contract, drift, trace)
pip install datamend
# With scikit-learn + XGBoost support
pip install "datamend[sklearn,xgboost]"
# With experiment tracking
pip install "datamend[mlflow,wandb]"
# Everything
pip install "datamend[all]"
Requires: Python ≥ 3.9 · pandas ≥ 1.5 · numpy ≥ 1.23 · scipy ≥ 1.9
60-Second Demo
import pandas as pd
import datamend
df = pd.read_csv("production_data.csv") # messy real-world data
# train_df (clean training data), model (a fitted estimator), and predictions
# (the model's outputs for the repaired rows) are assumed to be defined already.
# ── Pillar 1: Auto-repair everything ──
repaired, report = datamend.repair(df)
print(report.summary())
# → Fixed 247 nulls · Removed 31 duplicates · Clipped 19 outliers
# → MendScore: 54.2 → 96.8 (+42.6 pts)
# ── Pillar 2: Enforce your data contract ──
contract = datamend.contract(train_df)
violations = datamend.validate(repaired, contract)
# → 0 violations · Contract PASSED
# ── Pillar 3: Detect drift vs training data ──
drift = datamend.drift(train_df, repaired)
print(drift.summary())
# → 'income' drifted PSI=0.38 KS p=0.001
# ── Pillar 4: Trace model failures to root columns ──
trace = datamend.trace(model, repaired, predictions)
print(trace.summary())
# → Top suspicious rows: [1042, 887, 3310] Top column: 'income'
The Four Pillars of datamend
AutoRepair (Pillar 1) → DataContract (Pillar 2) → DriftRadar (Pillar 3)
↓ (all three feed into)
FailureTrace (Pillar 4)
↓
MendScore: 96.8/100
Pillar 1: AutoRepair
"Tell datamend to fix it. It will."
AutoRepair is an 8-phase intelligent repair engine that detects and heals over 15 distinct categories of data corruption using statistics-driven algorithms, with no configuration needed.
The 8-Phase Detection Pipeline
Input: raw DataFrame
Phase 1: NULL DETECTION & IMPUTATION
- Numeric, skewness > 1.0: median imputation
- Numeric, otherwise: mean imputation
- Categorical: mode imputation
Phase 2: OUTLIER DETECTION (Modified Z-Score / MAD)
- MAD = median(|Xi − median(X)|)
- Modified Z = 0.6745 × (Xi − median) / MAD
- |Z| > 3.5: IQR clip to [Q1 − 1.5×IQR, Q3 + 1.5×IQR]
Phase 3: TYPE MISMATCH DETECTION
- More than 80% of values match r"^\s*-?\d+(\.\d+)?\s*$": coerce column to float64
- More than 60% of values match ISO-8601 / common date patterns: coerce to datetime64
Phase 4: DUPLICATE DETECTION & REMOVAL
- Exact: pandas .duplicated(keep='first')
- Near-duplicate (Jaccard ≥ 0.85): token-set similarity across string columns
Phase 5: ENCODING CORRUPTION (Mojibake) REPAIR
- Regex: [\xc0-\xff][\x80-\xbf]{1,3}
- On match: encode latin-1, decode utf-8
Phase 6: CATEGORY NORMALISATION
- NFKD + lower + strip whitespace
- " New York " → "new york"
- "Nono" → "nono" (unicode canonical)
Phase 7: WHITESPACE & HIDDEN CHARACTER REMOVAL
- Remove: zero-width spaces, soft hyphens, BOM, \r, \t
- Strip invisible unicode control characters
Phase 8: UNIT MISMATCH DETECTION
- CV > 5.0 AND IQR ratio > 10: flag column as suspect unit mix
- (salary: 50000 mixed with 50.0 = same row anomaly)
Output: repaired DataFrame · RepairReport · MendScore
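As a rough illustration of Phases 1 and 2, the sketch below applies the skewness rule and the MAD-based modified Z-score to a plain list of numbers. It uses only the standard library and is not datamend's implementation: the helper names `skewness`, `impute_numeric` and `clip_outliers`, the population-skewness formula, and the inclusive quartile method are all assumptions made for this example.

```python
import statistics


def skewness(values):
    """Population skewness; 0.0 when the data has no spread."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return 0.0
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


def impute_numeric(values):
    """Fill None with the median if skewness > 1.0, otherwise with the mean."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    if skewness(present) > 1.0:
        fill = statistics.median(present)
    else:
        fill = statistics.fmean(present)
    return [fill if v is None else v for v in values]


def clip_outliers(values, z_threshold=3.5):
    """Clip points with |modified Z| above the threshold to the IQR fence."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return list(values)  # modified Z-score is undefined without spread
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    clipped = []
    for v in values:
        z = 0.6745 * (v - med) / mad
        clipped.append(min(max(v, low), high) if abs(z) > z_threshold else v)
    return clipped


# Hypothetical column: one missing value and one extreme value.
incomes = [40.0, 42.0, 41.0, None, 43.0, 39.0, 400.0]
filled = impute_numeric(incomes)   # skewed, so the gap is filled with the median
repaired = clip_outliers(filled)   # only the extreme value is pulled to the fence
```

Detecting with MAD but clipping to the IQR fence keeps the two roles separate: the robust score decides which points are suspicious, and the fence decides how far they are pulled back.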
What Each Detector Catches
| Phase | Issue Type | Detection Algorithm | Fix Strategy |
|---|---|---|---|
| 1 | Null / NaN values | Column-wise null rate | Mean / Median / Mode imputation |
| 2 | Outliers | Modified Z-score (MAD) | IQR-bounded clipping |
| 3 | Type mismatches | Regex coverage ≥ 80% | dtype coercion |
| 4 | Exact duplicates | pandas .duplicated() | Keep first, drop rest |
| 4 | Near-duplicates | Jaccard token similarity ≥ 0.85 | Drop near-clone rows |
| 5 | Mojibake encoding | [\xc0-\xff][\x80-\xbf] regex | latin-1 → utf-8 re-encode |
| 6 | Category noise | NFKD unicode normalisation | Lowercase canonical form |
| 7 | Whitespace / invisible chars | Unicode control char regex | Strip to clean string |
| 8 | Unit mismatch | CV > 5.0 + IQR ratio > 10 | Flag + warn |
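The string-oriented detectors (Phases 4 to 7) can be approximated with a few standard-library helpers. This is a minimal sketch under stated assumptions rather than the library's code: the function names are invented for the example, and the hidden-character set covers only the characters the table names explicitly.

```python
import re
import unicodedata

MOJIBAKE = re.compile(r"[\xc0-\xff][\x80-\xbf]{1,3}")
HIDDEN = re.compile("[\u200b\u00ad\ufeff\r\t]")  # zero-width space, soft hyphen, BOM, CR, tab


def fix_mojibake(text):
    """Undo a UTF-8-read-as-latin-1 mix-up when the pattern suggests one."""
    if not MOJIBAKE.search(text):
        return text
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not a clean round trip, so leave the value untouched


def strip_hidden(text):
    """Remove the invisible characters listed in HIDDEN."""
    return HIDDEN.sub("", text)


def normalise_category(text):
    """NFKD-normalise, lowercase, and strip surrounding whitespace."""
    return unicodedata.normalize("NFKD", text).lower().strip()


def jaccard(a, b):
    """Token-set Jaccard similarity between two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


repaired_text = fix_mojibake("caf\u00c3\u00a9")        # UTF-8 bytes misread as latin-1
visible_only = strip_hidden("a\u200bb\tc")
canonical = normalise_category("  New York ")
similarity = jaccard("acme corp ltd uk", "acme corp ltd usa")
```

With a 0.85 threshold, the two company names above (similarity 0.6) would not be treated as near-duplicates; only rows sharing almost all tokens would.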
Usage Examples
import datamend
# ── Simple one-liner ──
repaired, report = datamend.repair(df)
# ── With specific strategy ──
repaired, report = datamend.repair(df, strategy="median", verbose=True)
# ── For large datasets (10M+ rows, chunked processing) ──
from datamend import AutoRepair
engine = AutoRepair(strategy="auto", fast_mode=True)
repaired, report = engine.repair_chunked(df, chunk_size=500_000)
# ── Inspect what was fixed ──
for action in report.actions:
    print(f"[{action.column}] {action.issue_type}: {action.description}")
    print(f" Rows affected: {action.rows_affected}")
# ── Full repair report ──
print(report.summary())
print(f"MendScore: {report.mend_score_before:.1f} → {report.mend_score_after:.1f}")
MendScore: The Data Health Metric
datamend computes a composite MendScore (0–100) that summarises how healthy your data is:
MendScore = 100
    - 40 × null_rate ← nulls hurt the most
    - 20 × duplicate_rate ← dupes skew aggregations
    - 25 × outlier_rate ← outliers corrupt models
    - 15 × whitespace_rate ← silent model confusion
| Score Range | Health Grade | Interpretation |
|---|---|---|
| 95 – 100 | Excellent | Production-ready, no action needed |
| 85 – 94 | Good | Minor issues, acceptable for most models |
| 70 – 84 | Fair | Noticeable problems, repair recommended |
| 50 – 69 | Poor | Significant corruption, repair required |
| 0 – 49 | Critical | Severe data quality issues, stop pipeline |
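To make the formula concrete, here is a small sketch that evaluates it and maps the result to a grade. It assumes each rate is a fraction between 0 and 1 (the weights then sum to 100), clamps the result to the 0–100 range, and treats the lower bound of each grade band as inclusive; the function names `mend_score` and `health_grade` are hypothetical.

```python
def mend_score(null_rate, duplicate_rate, outlier_rate, whitespace_rate):
    """Weighted penalty score; rates are assumed to be fractions in [0, 1]."""
    score = (
        100
        - 40 * null_rate
        - 20 * duplicate_rate
        - 25 * outlier_rate
        - 15 * whitespace_rate
    )
    return max(0.0, min(100.0, score))


def health_grade(score):
    """Map a score to the grade bands in the table (lower bounds inclusive)."""
    if score >= 95:
        return "Excellent"
    if score >= 85:
        return "Good"
    if score >= 70:
        return "Fair"
    if score >= 50:
        return "Poor"
    return "Critical"


# Hypothetical dataset: 10% nulls, 5% duplicates, 2% outliers, no whitespace noise.
score = mend_score(0.10, 0.05, 0.02, 0.0)   # 100 - 4 - 1 - 0.5 - 0
grade = health_grade(score)
```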
Pillar 2: DataContract
"Define what clean data looks like. Enforce it forever."
DataContract learns the statistical fingerprint of your training data and validates every new batch against it, catching schema violations, null rate explosions, distribution shifts, and cardinality mismatches before they reach your model.
Contract Fitting & Validation Flow
Input: training data (clean)
DataContract.fit(train_df) learns, for each column:
- dtype: expected data type
- nullable: is null allowed?
- null_rate: acceptable null fraction
- min / max: numeric range bounds
- mean / std: distribution centre + spread
- percentiles: p5, p25, p50, p75, p95
- allowed_values: set of valid categories
- cardinality: number of unique values
- distribution: KS-ready empirical CDF
contract.save("contract.json") writes contract.json, which can be version-controlled.
DataContract.load("contract.json") restores the saved contract.
DataContract.validate(new_df) then runs seven checks:
| Check | Condition | Result | Severity |
|---|---|---|---|
| 1 | Missing columns? | FAIL | CRITICAL |
| 2 | Extra columns? | WARN | LOW |
| 3 | Null rate exceeded? | FAIL | HIGH |
| 4 | dtype mismatch? | FAIL | HIGH |
| 5 | Values out of range? | FAIL | MEDIUM |
| 6 | KS distribution? | FAIL | MEDIUM |
| 7 | Cardinality shifted? | WARN | LOW |
Output: ContractReport · violations[] · passed?
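The fit-then-validate idea can be sketched without pandas by treating a table as a dict of column lists. This is only an illustration of the flow above, not DataContract itself: `fit_contract` and `validate` are hypothetical names, only a subset of the checks is covered, and the MEDIUM severity for unexpected categories is an assumption.

```python
def fit_contract(columns, null_threshold=0.05):
    """Learn a small per-column profile from {name: list_of_values}."""
    contract = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        numeric = bool(present) and all(isinstance(v, (int, float)) for v in present)
        contract[name] = {
            "numeric": numeric,
            "max_null_rate": null_threshold,
            "min": min(present) if numeric else None,
            "max": max(present) if numeric else None,
            "allowed_values": None if numeric else set(present),
        }
    return contract


def validate(columns, contract):
    """Return (severity, column, message) tuples for every violated check."""
    violations = []
    for name, spec in contract.items():
        if name not in columns:
            violations.append(("CRITICAL", name, "missing column"))
            continue
        values = columns[name]
        present = [v for v in values if v is not None]
        null_rate = 1 - len(present) / len(values) if values else 0.0
        if null_rate > spec["max_null_rate"]:
            violations.append(("HIGH", name, "null rate exceeded"))
        if spec["numeric"]:
            if not all(isinstance(v, (int, float)) for v in present):
                violations.append(("HIGH", name, "dtype mismatch"))
            elif any(not spec["min"] <= v <= spec["max"] for v in present):
                violations.append(("MEDIUM", name, "values out of range"))
        elif any(v not in spec["allowed_values"] for v in present):
            violations.append(("MEDIUM", name, "unexpected category"))
    for name in columns:
        if name not in contract:
            violations.append(("LOW", name, "extra column"))
    return violations


# Hypothetical training and production batches.
train = {"age": [25, 31, 47], "city": ["paris", "rome", "paris"]}
prod = {"age": [29, None, 120], "extra": [1, 2, 3]}
contract = fit_contract(train)
problems = validate(prod, contract)
```

Here the production batch trips four checks at once: too many nulls and an out-of-range value in `age`, a missing `city` column, and an unexpected `extra` column.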
Usage Examples
import datamend
from datamend import DataContract
# ── Fit contract on clean training data ──
contract = datamend.contract(train_df)
contract.save("contracts/v1.json") # version control this!
# ── Load and validate production batch ──
contract = DataContract.load("contracts/v1.json")
report = datamend.validate(prod_df, contract)
if not report.passed:
    for v in report.violations:
        print(f"[{v.severity}] {v.column}: {v.message}")
        print(f" Expected: {v.expected} | Got: {v.observed}")
# ── Raise exception on violation (for strict pipelines) ──
try:
    datamend.validate(prod_df, contract, raise_on_failure=True)
except datamend.ContractViolationError as e:
    # Block the pipeline, alert the team
    alert_slack(str(e)) # alert_slack is your own notification helper
# ── Using DataContract class directly ──
contract = DataContract(null_threshold=0.02) # max 2% nulls allowed
contract.fit(train_df)
report = contract.validate(prod_df)
print(report.summary())
DataContract vs Great Expectations vs Pandera
| Feature | datamend | Great Expectations | Pandera |
|---|---|---|---|
| Auto-learn from data | โ | โ (manual) | โ (manual) |
| Statistical distribution check | โ KS-test | โ | โ |
| JSON persistence | โ | โ (JSON/YAML) | โ (YAML) |
| Setup lines of code | 2 | ~20 | ~10 |
| Integrated repair | โ | โ | โ |
| MendScore health metric | โ | โ | โ |
| Drift detection built-in | โ | โ | โ |
Pillar 3: DriftRadar
"Know before your model knows it's broken."
DriftRadar runs four statistical tests across your feature columns (PSI, the KS test for continuous columns, chi-square for categorical columns, and JSD) and combines them into a single drift verdict with severity scoring, giving you early warning before degraded model performance becomes visible.
Multi-Test Drift Detection Pipeline
Inputs: training data + production data
DriftRadar.detect() runs the following for each column:
Test 1: PSI (Population Stability Index)
- Build percentile-based bins on training data
- Count actual% and expected% per bin
- PSI = Σ (actual% − expected%) × ln(actual% / expected%)
- PSI < 0.10: stable
- PSI 0.10–0.25: slight shift (monitor)
- PSI > 0.25: significant drift (alert!)
Test 2: KS Test (Kolmogorov-Smirnov, continuous columns)
- D = max |F_train(x) − F_prod(x)| (max CDF distance)
- p-value < alpha (0.05): distributions differ
Test 3: Chi-Square (categorical columns)
- Compare observed vs expected category frequencies
- p-value < alpha: category distribution shifted
Test 4: JSD (Jensen-Shannon Divergence)
- JSD(P||Q) = 0.5 × KL(P||M) + 0.5 × KL(Q||M), where M = (P + Q) / 2
- 0 = identical · 1 = maximally different
Combined Drift Score = 0.40 × PSI + 0.25 × KS + 0.20 × JSD + 0.15 × χ²
Output: DriftReport · per-column results · MendScore
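The three distance measures above are short enough to write out by hand. The sketch below is a standard-library approximation, not DriftRadar's code: the function names, the `eps` floor for empty bins, and the choice of log base 2 for JSD (which is what bounds it by 1) are assumptions. It stops short of the combined score, because the text does not say how the four components are normalised before weighting.

```python
import math
from bisect import bisect_right


def bin_fractions(values, edges, eps=1e-6):
    """Fraction of values per bin, floored at eps so the logarithm stays finite."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return [max(c / len(values), eps) for c in counts]


def psi(train, prod, buckets=10):
    """Population Stability Index with percentile bins built on the training data."""
    ordered = sorted(train)
    edges = [ordered[int(len(ordered) * i / buckets)] for i in range(1, buckets)]
    expected = bin_fractions(train, edges)
    actual = bin_fractions(prod, edges)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


def ks_statistic(train, prod):
    """Maximum distance between the two empirical CDFs (the KS statistic D)."""
    a, b = sorted(train), sorted(prod)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions (log base 2)."""
    def kl(x, y):
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)

    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical feature: production values shifted upward by 50.
train = list(range(100))
shifted = [v + 50 for v in train]
psi_same, psi_shift = psi(train, train), psi(train, shifted)
d_shift = ks_statistic(train, shifted)
jsd_disjoint = jsd([1.0, 0.0], [0.0, 1.0])
```

Identical samples give a PSI of zero, while the shifted sample lands far above the 0.25 alert threshold; the KS distance for the same shift is 0.5.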
Drift Severity Thresholds
| PSI Value | Severity | Recommended Action |
|---|---|---|
| < 0.10 | None | No action needed |
| 0.10 – 0.20 | Low | Monitor closely |
| 0.20 – 0.25 | Medium | Investigate source |
| 0.25 – 0.50 | High | Retrain model soon |
| > 0.50 | Critical | Stop serving, retrain now |
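The table translates directly into a lookup. The sketch below assumes the lower bound of each band is inclusive (the table leaves the boundaries ambiguous) and uses lowercase labels; `psi_severity` is a hypothetical helper, not a datamend function.

```python
def psi_severity(psi_value):
    """Map a PSI value to a severity label using the thresholds in the table."""
    if psi_value < 0.10:
        return "none"
    if psi_value < 0.20:
        return "low"
    if psi_value < 0.25:
        return "medium"
    if psi_value <= 0.50:
        return "high"
    return "critical"


# The 'income' column from the 60-second demo reported PSI=0.38.
income_severity = psi_severity(0.38)
```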
Usage Examples
import datamend
# ── Basic drift detection ──
report = datamend.drift(train_df, prod_df)
print(report.summary())
# ── Only check specific columns ──
report = datamend.drift(train_df, prod_df, columns=["age", "income", "tenure"])
# ── Inspect each column's drift metrics ──
for col, result in report.column_results.items():
    if result.drifted:
        print(f"[DRIFT] {col}")
        print(f" PSI={result.psi:.3f} KS p={result.ks_pvalue:.4f}")
        print(f" JSD={result.jsd:.3f} Severity: {result.severity}")
# ── With custom significance level ──
from datamend import DriftRadar
radar = DriftRadar(psi_buckets=20, alpha=0.01, verbose=True)
report = radar.detect(train_df, prod_df)
# ── Only numeric or only categorical ──
numeric_cols = prod_df.select_dtypes("number").columns.tolist()
report = datamend.drift(train_df, prod_df, columns=numeric_cols)
DriftRadar vs Evidently vs NannyML
| Feature | datamend | Evidently | NannyML |
|---|---|---|---|
| PSI (numeric drift) | โ | โ | โ |
| KS test | โ | โ | โ |
| Chi-Square | โ | โ | โ |
| Jensen-Shannon Divergence | โ | โ | โ |
| Combined drift score | โ | โ | โ |
| Integrated repair pipeline | โ | โ | โ |
| HTML dashboard (offline) | โ | โ | โ |
| Zero server / zero cloud | โ | โ | โ |
| Setup complexity | 2 lines | ~10 lines | ~15 lines |
Pillar 4: FailureTrace
"Your model failed. Which rows? Which columns? Why?"
FailureTrace provides row-level and column-level attribution of model failures. It combines data-quality signals with model confidence estimates and surrogate model explanations to surface the exact rows and features causing predictions to go wrong.
Failure Attribution Pipeline
Input: model + DataFrame + predictions
Step 1: Feature Importance (Column Attribution)
- If the model exposes native importances, use them: sklearn .feature_importances_, xgboost .feature_importances_, lightgbm .feature_importances_, torch .weight.abs().mean()
- Otherwise fit a surrogate DecisionTreeRegressor(X, predictions) and extract its .feature_importances_
Step 2: Data Quality Score (Per Row)
- dq_score = 1.0 − 0.3 × has_any_null − 0.3 × is_outlier (modified Z-score) − 0.2 × has_encoding_issue − 0.2 × has_type_mismatch
- dq_suspicion = 1.0 − dq_score
Step 3: Model Confidence Score (Per Row)
- Classifier: confidence = max(predict_proba(row)) (low confidence = high suspicion)
- Regressor: confidence from normalized absolute residuals
- model_suspicion = 1.0 − confidence
Step 4: Composite Suspicion Score (Per Row)
- suspicion = 0.50 × dq_suspicion + 0.30 × weighted_anomaly_score + 0.20 × model_suspicion
- Top-K rows by suspicion score = "suspicious rows"
Step 5: Column Attribution Score (Per Column)
- col_score = 0.6 × model_importance + 0.4 × data_quality_contribution
- Sorted descending → top columns driving failures
Output: TraceReport · suspicious_rows[] · column_attributions{}
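The per-row and per-column scores in Steps 2, 4 and 5 are plain weighted sums, so they can be sketched in a few lines. This is an illustration under assumptions, not FailureTrace's code: the function names and the example rows are hypothetical, `confidence` is taken as an already-computed value in [0, 1], and `weighted_anomaly_score` is passed in as given because the text does not define how it is derived.

```python
def dq_score(has_null, is_outlier, has_encoding_issue, has_type_mismatch):
    """Per-row data quality score; each flag is True/False (or 1/0)."""
    return (
        1.0
        - 0.3 * has_null
        - 0.3 * is_outlier
        - 0.2 * has_encoding_issue
        - 0.2 * has_type_mismatch
    )


def suspicion(dq, weighted_anomaly_score, confidence):
    """Composite suspicion from data quality, anomaly score and model confidence."""
    dq_suspicion = 1.0 - dq
    model_suspicion = 1.0 - confidence
    return 0.50 * dq_suspicion + 0.30 * weighted_anomaly_score + 0.20 * model_suspicion


def column_score(model_importance, data_quality_contribution):
    """Per-column attribution score."""
    return 0.6 * model_importance + 0.4 * data_quality_contribution


def top_k(scores_by_row, k):
    """Row indices of the k highest suspicion scores, most suspicious first."""
    return sorted(scores_by_row, key=scores_by_row.get, reverse=True)[:k]


# Hypothetical rows: (flags..., weighted_anomaly_score, confidence)
rows = {
    1042: (True, True, False, False, 0.8, 0.55),
    887: (False, False, True, False, 0.3, 0.70),
    3310: (False, False, False, False, 0.1, 0.95),
}
scores = {
    idx: suspicion(dq_score(*flags[:4]), flags[4], flags[5])
    for idx, flags in rows.items()
}
ranked = top_k(scores, k=2)
```

Because data quality carries half the weight, a row with a null and an outlier ranks above a clean row even when the model is only mildly unsure about it.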
Usage Examples
import datamend
# ── Basic failure trace ──
report = datamend.trace(model, df, predictions)
print(report.summary())
# ── With ground truth (shows actual errors) ──
report = datamend.trace(model, df, predictions, ground_truth=y_true)
# ── Inspect suspicious rows ──
for row in report.suspicious_rows[:5]:
    print(f"Row {row.row_index} suspicion={row.suspicion_score:.3f}")
    print(f" Top cols: {row.top_columns}")
    print(f" DQ score: {row.data_quality_score:.3f}")
    print(f" Reason: {row.reason}")
# ── Inspect which columns drive failures ──
for col, attr in sorted(report.column_attributions.items(),
                        key=lambda x: -x[1].importance_score):
    print(f"{col}: importance={attr.importance_score:.3f} "
          f"anomaly_rate={attr.anomaly_rate:.3f}")
# ── Works with sklearn, XGBoost, LightGBM, PyTorch ──
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBRegressor
# rf_model and xgb_model are fitted instances of the classes imported above
report_sk = datamend.trace(rf_model, df, preds)
report_xgb = datamend.trace(xgb_model, df, preds)
FailureTrace vs SHAP vs LIME
| Feature | datamend | SHAP | LIME |
|---|---|---|---|
| Row-level suspicion score | โ | โ | โ |
| Data quality ร model signal | โ | โ | โ |
| Zero-configuration | โ | โ (needs tree explainer) | โ |
| Works on black-box models | โ | โ (KernelSHAP slow) | โ |
| Column attribution | โ | โ | โ |
| Integrated pipeline | โ | โ | โ |
| HTML dashboard output | โ | โ | โ |
MendPipeline: All Four Pillars, One Call
For production ML systems, MendPipeline chains all four pillars into a single, stateful object:
from datamend import MendPipeline
# ── Fit on clean training data (once) ──
pipeline = MendPipeline(
    repair_strategy="auto",
    null_threshold=0.05,
    drift_alpha=0.05,
    psi_buckets=10,
    top_k_trace=10,
    verbose=True,
)
pipeline.fit(train_df)
# ── Run on every production batch ──
result = pipeline.transform(
    prod_df,
    model=model,
    predictions=preds,
    ground_truth=y_true, # optional
)
# ── Full report ──
print(result.summary())
# =================================================================
# datamend MendPipeline - Full Health Report
# =================================================================
# Overall MendScore : 91.4/100
#
# [Pillar 1] AutoRepair
# Issues fixed : 142
# MendScore change : 54.2 → 96.8
#
# [Pillar 2] DataContract - PASSED
# Violations : 0
# MendScore : 98.0
#
# [Pillar 3] DriftRadar - STABLE
# Columns drifted : 0
# MendScore (drift) : 4.2
#
# [Pillar 4] FailureTrace
# Suspicious rows : 3
# MendScore : 87.1
# ── Export repaired data ──
result.repaired_df.to_parquet("clean_batch.parquet")
# ── Serialize to JSON ──
result.to_json()
Overall MendScore Formula
Overall MendScore =
    0.35 × repair_score_after
    + 0.30 × contract_score
    + 0.20 × (100 − drift_score) ← inverted: low drift = good
    + 0.15 × (100 − trace_score) ← inverted: low failures = good
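As a worked instance of the formula, the sketch below evaluates it for one set of made-up pillar scores. The function name `overall_mend_score` and the input values are hypothetical; all four inputs are assumed to be on a 0–100 scale.

```python
def overall_mend_score(repair_score_after, contract_score, drift_score, trace_score):
    """Weighted blend of the four pillar scores; drift and trace are inverted."""
    return (
        0.35 * repair_score_after
        + 0.30 * contract_score
        + 0.20 * (100 - drift_score)
        + 0.15 * (100 - trace_score)
    )


# Hypothetical batch: good repair, clean contract, mild drift, few failures.
overall = overall_mend_score(
    repair_score_after=90, contract_score=100, drift_score=10, trace_score=20
)
# 0.35*90 + 0.30*100 + 0.20*90 + 0.15*80 = 31.5 + 30 + 18 + 12
```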
HTML Dashboard
datamend generates a self-contained, single-file dark-mode HTML dashboard that needs no server, no internet, and no dependencies:
from datamend import MendReport
# Build report from individual pillar outputs
report = MendReport(
    repair_report=repair_report,
    contract_report=contract_report,
    drift_report=drift_report,
    trace_report=trace_report,
)
# Write dashboard to disk
report.to_html("dashboard.html")
# Or launch a live server in your browser
report.serve(port=8080, open_browser=True)
Dashboard sections:
- Header: datamend Dashboard with the overall MendScore (for example, 96)
- Summary cards: AutoRepair (Fixes: 142) · Contract (PASSED) · DriftRadar (STABLE) · FailureTrace (Rows: 3)
- Repair Actions Table (sortable, filterable)
- Contract Violations (severity colour-coded)
- Drift Results (per-column PSI/KS/JSD)
- Column Attribution (importance scores bar chart)
CLI Reference
datamend ships a full command-line interface:
# ── Repair ──
datamend repair data.csv -o repaired.csv --strategy median --verbose
datamend repair data.parquet -o clean.parquet --fast
# ── Validate against a contract ──
datamend validate data.csv --contract contracts/v1.json
datamend contract data.csv -o contracts/v1.json # fit contract
# ── Detect drift ──
datamend drift train.csv prod.csv --alpha 0.01 --columns age income
# ── Score data quality ──
datamend score data.csv # prints MendScore
# ── Generate HTML dashboard ──
datamend dashboard data.csv -o report.html --open
# ── List registered plugins ──
datamend plugins list
# ── Supported formats: CSV · Parquet · JSON · Excel (.xlsx) ──
datamend repair data.xlsx -o clean.xlsx
Plugin System
Build custom repair logic and plug it in with a decorator:
import datamend
import pandas as pd
from datamend.plugins.base import BaseRepairPlugin, register_plugin
from datamend.core.repair import RepairAction
@register_plugin
class ClipNegativePlugin(BaseRepairPlugin):
    name = "clip_negative"
    description = "Clips all negative values in numeric columns to 0"
    def repair(self, df: pd.DataFrame):
        df = df.copy() # never mutate the caller's frame
        actions = []
        for col in df.select_dtypes("number").columns:
            mask = df[col] < 0
            count = mask.sum()
            if count > 0:
                df.loc[mask, col] = 0
                actions.append(RepairAction(
                    column=col,
                    issue_type="NEGATIVE_VALUE",
                    description=f"Clipped {count} negative values to 0",
                    rows_affected=int(count),
                    before_sample=None, after_sample=None,
                    strategy="clip_negative",
                ))
        return df, actions
# ── Use your plugin ──
repaired, report = datamend.repair(df, plugins=[ClipNegativePlugin()])
Plugin auto-discovery via entry points:
# In your pyproject.toml
[project.entry-points."datamend.plugins"]
my_plugin = "my_package.plugins:MyPlugin"
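For readers unfamiliar with entry points, the sketch below shows how a host package could look up everything registered under a group using `importlib.metadata`. It illustrates the general mechanism only; how datamend performs discovery internally is not described here, and `discover_plugins` is a hypothetical helper.

```python
import sys
from importlib.metadata import entry_points


def discover_plugins(group="datamend.plugins"):
    """Return {entry_point_name: EntryPoint} for every plugin in the group."""
    if sys.version_info >= (3, 10):
        found = entry_points(group=group)          # selectable API, Python 3.10+
    else:
        found = entry_points().get(group, [])      # dict-style API on Python 3.9
    return {ep.name: ep for ep in found}


# Nothing is loaded until .load() is called on an entry point, so listing is cheap.
available = discover_plugins()
plugin_classes = {name: ep.load for name, ep in available.items()}
```

Calling `ep.load()` imports the target module and returns the object named after the colon, which for the `pyproject.toml` entry above would be the `MyPlugin` class.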
Integrations
MLflow
import datamend
import mlflow
from datamend.integrations.mlflow import log_repair, log_drift, log_pipeline_result
with mlflow.start_run():
    repaired, repair_report = datamend.repair(df)
    log_repair(repair_report) # logs MendScore, issue counts as metrics
    # pipeline is a fitted MendPipeline
    pipeline_result = pipeline.transform(prod_df, model=model, predictions=preds)
    log_pipeline_result(pipeline_result) # logs all 4 pillars + artifacts
Weights & Biases
import datamend
import wandb
from datamend.integrations.wandb import log_repair, log_drift
wandb.init(project="my-ml-project")
repaired, repair_report = datamend.repair(df)
log_repair(repair_report) # logs to current wandb run
drift_report = datamend.drift(train_df, prod_df)
log_drift(drift_report)
DVC
import datamend
from datamend.integrations.dvc import save_repair_metrics, save_pipeline_result
repaired, report = datamend.repair(df)
save_repair_metrics(report, path="metrics/repair.json") # git + dvc tracked
result = pipeline.transform(prod_df, model=model, predictions=preds)
save_pipeline_result(result, path="metrics/pipeline.json")
Advanced Usage
Async / Concurrent Processing
import asyncio
import datamend
async def process_batch(df):
    loop = asyncio.get_running_loop()
    # Run blocking repair in a thread pool
    repaired, report = await loop.run_in_executor(
        None, lambda: datamend.repair(df, verbose=False)
    )
    return repaired, report
async def main(batches):
    # Process multiple batches concurrently
    tasks = [process_batch(batch) for batch in batches]
    return await asyncio.gather(*tasks)
# batches is an iterable of DataFrames; await needs a running event loop
results = asyncio.run(main(batches))
Large Dataset: Chunked Mode
from datamend import AutoRepair
# Handles 50M+ rows without memory blowup
engine = AutoRepair(strategy="median", fast_mode=True)
repaired, report = engine.repair_chunked(
    df,
    chunk_size=1_000_000, # process 1M rows at a time
)
print(f"Total rows processed: {len(repaired):,}")
print(f"MendScore: {report.mend_score_after:.1f}")
Production-Safe Selective Repair
# Repair only specific columns (e.g., don't touch ID columns)
from datamend import AutoRepair
engine = AutoRepair(strategy="auto")
subset = df[["age", "income", "score"]].copy()
repaired_subset, report = engine.fit_transform(subset)
# Merge back into original frame. Assignment aligns on the index, so any rows
# the repair dropped (for example duplicates) end up as NaN in these columns.
df[["age", "income", "score"]] = repaired_subset
Selective Drift Monitoring
# Monitor only numeric features for drift (skip ID/timestamp cols)
numeric_cols = [c for c in prod_df.select_dtypes("number").columns
                if c not in ["id", "timestamp", "row_num"]]
report = datamend.drift(train_df, prod_df, columns=numeric_cols)
# Send alert if any column is critical
critical = [c for c, r in report.column_results.items()
            if r.severity == "critical"]
if critical:
    send_pagerduty_alert(f"Critical drift: {critical}") # your own alerting helper
Custom DataContract Rules
import json
from datamend import DataContract
# Strict contract: 0% nulls
contract = DataContract(
    null_threshold=0.0, # zero nulls allowed
)
contract.fit(train_df)
# Save with metadata
contract_dict = json.loads(contract.to_json())
contract_dict["version"] = "1.2.0"
contract_dict["fitted_on"] = "2024-01-15"
with open("contract_v1.2.json", "w") as f:
    json.dump(contract_dict, f, indent=2)
Benchmark
Measured on a 100,000-row, 20-column dataset (MacBook Pro M2, Python 3.11):
| Task | datamend | pandas manual | Great Expectations | Evidently | SHAP |
|---|---|---|---|---|---|
| Null imputation | 0.12s | 0.08s | N/A | N/A | N/A |
| Outlier detection + fix | 0.31s | ~1.2s manual | N/A | N/A | N/A |
| Duplicate removal | 0.09s | 0.07s | N/A | N/A | N/A |
| Full data repair | 0.61s | ~4s manual | N/A | N/A | N/A |
| Contract fit | 0.18s | N/A | ~2.1s | N/A | N/A |
| Contract validate | 0.11s | N/A | ~0.9s | N/A | N/A |
| Drift detection (10 cols) | 0.29s | N/A | N/A | ~0.8s | N/A |
| Failure trace (RF model) | 1.14s | N/A | N/A | N/A | ~8.2s |
| Full pipeline | 2.1s | ~7s+ combined | N/A | N/A | N/A |
Benchmarks are indicative. Performance varies by data shape, column types, and hardware.
Architecture & Project Structure
datamend/
│
├── datamend/                      ← library package
│   ├── __init__.py                ← top-level API (repair, contract, drift, trace)
│   ├── pipeline.py                ← MendPipeline (all 4 pillars unified)
│   ├── report.py                  ← MendReport + HTML dashboard generator
│   ├── cli.py                     ← Click CLI (repair/validate/drift/score/dashboard)
│   │
│   ├── core/
│   │   ├── repair.py              ← AutoRepair: 8-phase engine (15+ detectors)
│   │   ├── contract.py            ← DataContract: fit / validate / persist
│   │   ├── drift.py               ← DriftRadar: PSI + KS + chi² + JSD
│   │   └── trace.py               ← FailureTrace: row + column attribution
│   │
│   ├── plugins/
│   │   └── base.py                ← BaseRepairPlugin, PluginRegistry, @register_plugin
│   │
│   └── integrations/
│       ├── mlflow.py              ← MLflow metrics + artifact logging
│       ├── wandb.py               ← W&B metrics logging
│       └── dvc.py                 ← DVC-tracked JSON metrics
│
├── tests/                         ← 113 tests, 94% coverage
│   ├── conftest.py                ← shared fixtures
│   ├── test_repair.py             ← 32 tests
│   ├── test_contract.py           ← 22 tests
│   ├── test_drift.py              ← 19 tests
│   ├── test_trace.py              ← 11 tests
│   ├── test_pipeline.py           ← 12 tests
│   ├── test_report.py             ← 8 tests
│   └── test_plugins.py            ← 9 tests
│
├── .github/
│   ├── workflows/ci.yml           ← Tests: ubuntu/windows/macos × py3.9–3.12
│   └── workflows/publish.yml      ← PyPI trusted publish on v*.*.* tags
│
├── pyproject.toml
└── README.md
Running Tests
git clone https://github.com/vignesh2027/datamend.py.git
cd datamend.py
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
# Run all 113 tests with coverage
pytest tests/ -v --cov=datamend --cov-report=term-missing
# Run a single pillar
pytest tests/test_repair.py -v
pytest tests/test_drift.py -v
Time Saved Per Week
| Task | Manual time | With datamend | Saved |
|---|---|---|---|
| Null imputation per dataset | ~25 min | < 1 sec | 25 min |
| Outlier detection & fix | ~45 min | < 1 sec | 45 min |
| Schema validation setup | ~2 hours | 2 lines | 2 hours |
| Drift monitoring setup | ~3 hours | 1 line | 3 hours |
| Debugging model failures | ~4 hours | 2 sec | ~4 hours |
| Total per week | ~10+ hours | < 5 seconds | 10 hours |
Requirements
| Package | Version | Why |
|---|---|---|
| pandas | ≥ 1.5.0 | Core DataFrame operations |
| numpy | ≥ 1.23.0 | Numerical computations |
| scipy | ≥ 1.9.0 | KS test, chi-square, statistical tests |
| click | ≥ 8.0.0 | CLI framework |
| rich | ≥ 13.0.0 | Beautiful terminal output |
| jinja2 | ≥ 3.1.0 | HTML dashboard templating |
| pydantic | ≥ 2.0.0 | Data validation models |
Optional extras:
pip install "datamend[sklearn]" # scikit-learn integration
pip install "datamend[xgboost]" # XGBoost native importances
pip install "datamend[lightgbm]" # LightGBM native importances
pip install "datamend[torch]" # PyTorch layer attribution
pip install "datamend[mlflow]" # MLflow experiment tracking
pip install "datamend[wandb]" # Weights & Biases logging
pip install "datamend[dvc]" # DVC metric tracking
pip install "datamend[all]" # Everything
Roadmap
- AutoRepair: 8-phase repair engine
- DataContract: statistical contract learning
- DriftRadar: PSI + KS + chi² + JSD
- FailureTrace: surrogate row attribution
- MendPipeline: unified 4-pillar pipeline
- CLI: repair / validate / drift / score / dashboard
- HTML dashboard: self-contained dark-mode output
- MLflow / W&B / DVC integrations
- Plugin system with entry-point discovery
- PyPI release (0.1.0)
- Async native support (0.2.0)
- Polars DataFrame support (0.2.0)
- Time-series drift (CUSUM / ADWIN) (0.3.0)
- REST API server mode (0.3.0)
- Grafana plugin for MendScore dashboards (0.4.0)
- AutoML-style repair strategy search (0.5.0)
Contributing
Contributions are welcome! Please open an issue first to discuss the change, then submit a PR.
# Fork and clone
git clone https://github.com/<your-username>/datamend.py.git
# Install dev dependencies
pip install -e ".[dev]"
# Run the full test suite before submitting
pytest tests/ -v
ruff check datamend/
mypy datamend/
License
MIT. See LICENSE for details.
Built with care by Vignesh
File details
Details for the file datamend-1.1.1.tar.gz.
File metadata
- Download URL: datamend-1.1.1.tar.gz
- Upload date:
- Size: 90.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 264faf921e026861c801bfb018fce046a0c1d0c590f61b1096ef741218649fb2 |
| MD5 | abd4efb87a3f69154b5093c9e95ec6b3 |
| BLAKE2b-256 | 4fb63a06b2d30faebaffb69d981df8b7a605bae14c95539a6e35cebb29a11cdf |
File details
Details for the file datamend-1.1.1-py3-none-any.whl.
File metadata
- Download URL: datamend-1.1.1-py3-none-any.whl
- Upload date:
- Size: 59.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f1e8a31354ada6a1e6130b043a1032a4498a485d34cf6e522bc4d758c2dad36b |
| MD5 | a8053f8bc43ce657d45368e8a28eb4e6 |
| BLAKE2b-256 | de025ce8914937194d36c74aeba8be055c5f9550fbb1590a0b80b16989422f5e |