
The unified data repair, validation, drift detection, and failure tracing library for production ML

📖 Docs • 🚀 PyPI • 🐛 Issues • 💬 Discussions • 📝 Changelog


✦ Why datamend? ✦

Real-world data is never clean. Nulls sneak in. Distributions shift. Models fail silently on corrupted inputs.
datamend is a single library that repairs, validates, monitors, and traces data quality issues automatically, so bad data is caught before it breaks your ML pipeline.


┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│   WITHOUT datamend              WITH datamend                       │
│   ─────────────────             ──────────────                      │
│   ❌ Nulls → model crashes      ✅ Auto-imputed before fit          │
│   ❌ Drift undetected           ✅ PSI + KS test every batch        │
│   ❌ Contract violations        ✅ Schema enforced at the gate      │
│   ❌ Hours debugging            ✅ Row-level failure attribution    │
│   ❌ 5 different libraries      ✅ One unified API                  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

📦 Installation

# Core (repair, contract, drift, trace)
pip install datamend

# With scikit-learn + XGBoost support
pip install "datamend[sklearn,xgboost]"

# With experiment tracking
pip install "datamend[mlflow,wandb]"

# Everything
pip install "datamend[all]"

Requires: Python ≥ 3.9 · pandas ≥ 1.5 · numpy ≥ 1.23 · scipy ≥ 1.9


⚡ 60-Second Demo

import pandas as pd
import datamend

df = pd.read_csv("production_data.csv")   # messy real-world data
# train_df, model, and predictions are assumed to exist already:
# the training DataFrame, a fitted model, and that model's predictions.

# ── Pillar 1: Auto-repair everything ──────────────────────────────────
repaired, report = datamend.repair(df)
print(report.summary())
# ✔ Fixed 247 nulls · Removed 31 duplicates · Clipped 19 outliers
# ✔ MendScore: 54.2 → 96.8  (+42.6 pts)

# ── Pillar 2: Enforce your data contract ──────────────────────────────
contract = datamend.contract(train_df)
contract_report = datamend.validate(repaired, contract)
# ✔ 0 violations · Contract PASSED

# ── Pillar 3: Detect drift vs training data ───────────────────────────
drift = datamend.drift(train_df, repaired)
print(drift.summary())
# ⚠ 'income' drifted  PSI=0.38  KS p=0.001

# ── Pillar 4: Trace model failures to root columns ────────────────────
trace = datamend.trace(model, repaired, predictions)
print(trace.summary())
# ⚠ Top suspicious rows: [1042, 887, 3310]  Top column: 'income'

🏛️ The Four Pillars of datamend

╔══════════════════════════════════════════════════════════════════╗
║                                                                  ║
║   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐         ║
║   │  AutoRepair  │──▶│ DataContract │──▶│  DriftRadar  │──▶ 📊   ║
║   │  Pillar  1   │   │  Pillar  2   │   │  Pillar  3   │         ║
║   └──────────────┘   └──────────────┘   └──────────────┘         ║
║          │                  │                  │                 ║
║          └──────────────────┼──────────────────┘                 ║
║                             │                                    ║
║                             ▼                                    ║
║                   ┌──────────────────┐                           ║
║                   │  FailureTrace    │                           ║
║                   │   Pillar  4      │                           ║
║                   └──────────────────┘                           ║
║                             │                                    ║
║                             ▼                                    ║
║               MendScore  ▓▓▓▓▓▓▓▓▓▓▓▓▓  96.8/100                 ║
║                                                                  ║
╚══════════════════════════════════════════════════════════════════╝

🔧 Pillar 1: AutoRepair

"Tell datamend to fix it. It will."

AutoRepair is an 8-phase repair engine that detects and fixes over 15 distinct categories of data corruption using statistics-driven rules, with no configuration needed.


The 8-Phase Detection Pipeline

 RAW DATAFRAME IN
        │
        ▼
 ┌─────────────────────────────────────────────────────────┐
 │  Phase 1 ── NULL DETECTION & IMPUTATION                 │
 │                                                         │
 │   skewness > 1.0?  ──YES──▶  Median imputation          │
 │        │                                                │
 │        NO                                               │
 │        ▼                                                │
 │   Mean imputation  (for numeric)                        │
 │   Mode imputation  (for categorical)                    │
 └──────────────────────────┬──────────────────────────────┘
                            │
                            ▼
 ┌─────────────────────────────────────────────────────────┐
 │  Phase 2 ── OUTLIER DETECTION (Modified Z-Score / MAD)  │
 │                                                         │
 │   MAD = median(|Xi - median(X)|)                        │
 │   Modified Z = 0.6745 × (Xi - median) / MAD             │
 │                                                         │
 │   |Z| > 3.5?  ──YES──▶  IQR clip to [Q1-1.5×IQR,        │
 │                                      Q3+1.5×IQR]        │
 └──────────────────────────┬──────────────────────────────┘
                            │
                            ▼
 ┌─────────────────────────────────────────────────────────┐
 │  Phase 3 ── TYPE MISMATCH DETECTION                     │
 │                                                         │
 │   >80% match r"^\s*-?\d+(\.\d+)?\s*$"?                  │
 │        ──YES──▶  coerce column to float64               │
 │                                                         │
 │   >60% match ISO-8601 / common date patterns?           │
 │        ──YES──▶  coerce to datetime64                   │
 └──────────────────────────┬──────────────────────────────┘
                            │
                            ▼
 ┌─────────────────────────────────────────────────────────┐
 │  Phase 4 ── DUPLICATE DETECTION & REMOVAL               │
 │                                                         │
 │   Exact:  pandas .duplicated(keep='first')              │
 │                                                         │
 │   Near-duplicate (Jaccard ≥ 0.85):                      │
 │     token-set similarity across string columns          │
 └──────────────────────────┬──────────────────────────────┘
                            │
                            ▼
 ┌─────────────────────────────────────────────────────────┐
 │  Phase 5 ── ENCODING CORRUPTION (Mojibake) REPAIR       │
 │                                                         │
 │   Regex: [\xc0-\xff][\x80-\xbf]{1,3}                    │
 │        ──YES──▶  encode latin-1, decode utf-8           │
 └──────────────────────────┬──────────────────────────────┘
                            │
                            ▼
 ┌─────────────────────────────────────────────────────────┐
 │  Phase 6 ── CATEGORY NORMALISATION                      │
 │                                                         │
 │   NFKD + lower + strip whitespace                       │
 │   "  New York  " → "new york"                           │
 │   "Nono" → "nono"  (unicode canonical)                  │
 └──────────────────────────┬──────────────────────────────┘
                            │
                            ▼
 ┌─────────────────────────────────────────────────────────┐
 │  Phase 7 ── WHITESPACE & HIDDEN CHARACTER REMOVAL       │
 │                                                         │
 │   Remove: zero-width spaces, soft hyphens, BOM, \r, \t  │
 │   Strip invisible unicode control characters            │
 └──────────────────────────┬──────────────────────────────┘
                            │
                            ▼
 ┌─────────────────────────────────────────────────────────┐
 │  Phase 8 ── UNIT MISMATCH DETECTION                     │
 │                                                         │
 │   CV > 5.0  AND  IQR ratio > 10?                        │
 │        ──YES──▶  flag column as suspect unit mix        │
 │   (e.g. salary: 50000 mixed with 50.0 in one column)    │
 └──────────────────────────┬──────────────────────────────┘
                            │
                            ▼
   REPAIRED DATAFRAME  ·  RepairReport  ·  MendScore
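
The Phase 1 and Phase 2 rules in the diagram can be sketched in plain Python. This is a minimal illustration, not datamend's implementation: the function names `skewness`, `impute_numeric`, and `clip_outliers` are hypothetical, missing values are represented as `None`, and the quartiles come from `statistics.quantiles` with its default method, which may differ from what the library uses.

```python
import statistics


def skewness(values):
    """Population skewness (third standardised moment)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return 0.0
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


def impute_numeric(values):
    """Phase 1 rule: fill with the median if skewness > 1.0, else the mean."""
    present = [v for v in values if v is not None]
    if skewness(present) > 1.0:
        fill = statistics.median(present)
    else:
        fill = statistics.fmean(present)
    return [fill if v is None else v for v in values]


def clip_outliers(values, threshold=3.5):
    """Phase 2 rule: flag by modified Z-score, clip flagged values to IQR fences."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return list(values)  # modified Z-score is undefined; leave untouched
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    clipped = []
    for v in values:
        z = 0.6745 * (v - med) / mad
        clipped.append(min(max(v, low), high) if abs(z) > threshold else v)
    return clipped


# Hypothetical column: one missing value and one extreme value.
incomes = [30.0, 32.0, None, 31.0, 29.0, 33.0, 28.0, 30.0, 500.0]
filled = impute_numeric(incomes)   # skewed column, so the median is used
repaired = clip_outliers(filled)   # 500.0 is pulled back to the upper fence
```

Note that detection and clipping use different statistics here, as in the diagram: a value is flagged by its MAD-based score but clipped to the IQR fences.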

📊 What Each Detector Catches

| Phase | Issue Type | Detection Algorithm | Fix Strategy |
|---|---|---|---|
| 1 | Null / NaN values | Column-wise null rate | Mean / Median / Mode imputation |
| 2 | Outliers | Modified Z-score (MAD) | IQR-bounded clipping |
| 3 | Type mismatches | Regex coverage ≥ 80% | dtype coercion |
| 4 | Exact duplicates | pandas .duplicated() | Keep first, drop rest |
| 4 | Near-duplicates | Jaccard token similarity ≥ 0.85 | Drop near-clone rows |
| 5 | Mojibake encoding | [\xc0-\xff][\x80-\xbf] regex | latin-1 → utf-8 re-encode |
| 6 | Category noise | NFKD unicode normalisation | Lowercase canonical form |
| 7 | Whitespace / invisible chars | Unicode control char regex | Strip to clean string |
| 8 | Unit mismatch | CV > 5.0 + IQR ratio > 10 | Flag + warn |
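
The string-level fixes in rows 5 to 7 can be tried with the standard library alone. The sketch below is an assumption-laden illustration rather than the library's code: the helper names are hypothetical, the hidden-character set is a small sample, and hidden characters are stripped before trimming (a choice made here so that a leading BOM does not shield leading spaces from `strip()`).

```python
import re
import unicodedata

# Same pattern as the table: a lead byte followed by 1-3 continuation bytes,
# each showing up as a latin-1 character.
MOJIBAKE = re.compile(r"[\xc0-\xff][\x80-\xbf]{1,3}")
# Zero-width space and joiners, soft hyphen, BOM, carriage return, tab.
HIDDEN = re.compile("[\u200b\u200c\u200d\u00ad\ufeff\r\t]")


def fix_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as latin-1."""
    if not MOJIBAKE.search(text):
        return text
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not a clean round trip; leave the value alone


def strip_hidden(text):
    """Drop invisible characters that survive ordinary trimming."""
    return HIDDEN.sub("", text)


def normalise_category(text):
    """NFKD-normalise, lowercase, and trim surrounding whitespace."""
    return unicodedata.normalize("NFKD", text).lower().strip()


def clean(text):
    return normalise_category(strip_hidden(fix_mojibake(text)))
```

For example, `clean("\ufeff  New\u200b York  ")` gives `"new york"`. Keep in mind that NFKD decomposes accented letters into a base letter plus a combining mark, so normalised values are not byte-identical to their composed form.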

💡 Usage Examples

import datamend

# ── Simple one-liner ──────────────────────────────────────────────────
repaired, report = datamend.repair(df)

# ── With specific strategy ────────────────────────────────────────────
repaired, report = datamend.repair(df, strategy="median", verbose=True)

# ── For large datasets (10M+ rows, chunked processing) ────────────────
from datamend import AutoRepair
engine = AutoRepair(strategy="auto", fast_mode=True)
repaired, report = engine.repair_chunked(df, chunk_size=500_000)

# ── Inspect what was fixed ────────────────────────────────────────────
for action in report.actions:
    print(f"[{action.column}] {action.issue_type}: {action.description}")
    print(f"  Rows affected: {action.rows_affected}")

# ── Full repair report ────────────────────────────────────────────────
print(report.summary())
print(f"MendScore: {report.mend_score_before:.1f} → {report.mend_score_after:.1f}")

🧮 MendScore: The Data Health Metric

datamend computes a composite MendScore (0–100) that summarises data health as 100 minus weighted penalties for four issue rates:

MendScore = 100
   - 40 × null_rate          ← nulls hurt the most
   - 20 × duplicate_rate     ← dupes skew aggregations
   - 25 × outlier_rate       ← outliers corrupt models
   - 15 × whitespace_rate    ← silent model confusion

| Score Range | Health Grade | Interpretation |
|---|---|---|
| 95 – 100 | 🟢 Excellent | Production-ready, no action needed |
| 85 – 94 | 🟡 Good | Minor issues, acceptable for most models |
| 70 – 84 | 🟠 Fair | Noticeable problems, repair recommended |
| 50 – 69 | 🔴 Poor | Significant corruption, repair required |
| 0 – 49 | ⛔ Critical | Severe data quality issues, stop pipeline |
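
As a worked example of the formula and grade bands, here is a small sketch. It assumes each rate is a fraction between 0 and 1 (the weights sum to 100, which is consistent with that reading) and that each band's lower bound is inclusive; `mend_score` and `health_grade` are hypothetical helper names, not part of the datamend API.

```python
def mend_score(null_rate, duplicate_rate, outlier_rate, whitespace_rate):
    """100 minus the weighted penalties, clamped to the 0-100 range."""
    score = (100
             - 40 * null_rate
             - 20 * duplicate_rate
             - 25 * outlier_rate
             - 15 * whitespace_rate)
    return max(0.0, min(100.0, score))


def health_grade(score):
    """Map a score to a grade, treating each band's lower bound as inclusive."""
    if score >= 95:
        return "Excellent"
    if score >= 85:
        return "Good"
    if score >= 70:
        return "Fair"
    if score >= 50:
        return "Poor"
    return "Critical"


# 5% nulls, 2% duplicates, 1% outliers, no whitespace issues:
# 100 - 2.0 - 0.4 - 0.25 = 97.35
score = mend_score(0.05, 0.02, 0.01, 0.0)
grade = health_grade(score)
```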

📋 Pillar 2: DataContract

"Define what clean data looks like. Enforce it forever."

DataContract learns the statistical fingerprint of your training data and validates every new batch against it, catching schema violations, null rate explosions, distribution shifts, and cardinality mismatches before they reach your model.


Contract Fitting & Validation Flow

 TRAINING DATA (clean)
        │
        ▼
 ┌─────────────────────────────────────────────────────────┐
 │  DataContract.fit(train_df)                             │
 │                                                         │
 │  For each column, learns:                               │
 │    dtype          ← expected data type                  │
 │    nullable       ← is null allowed?                    │
 │    null_rate      ← acceptable null fraction            │
 │    min / max      ← numeric range bounds                │
 │    mean / std     ← distribution centre + spread        │
 │    percentiles    ← p5, p25, p50, p75, p95              │
 │    allowed_values ← set of valid categories             │
 │    cardinality    ← number of unique values             │
 │    distribution   ← KS-ready empirical CDF              │
 └──────────────────────────┬──────────────────────────────┘
                            │  contract.save("contract.json")
                            ▼
                    ┌───────────────┐
                    │ contract.json │  ← version-controlled
                    └───────┬───────┘
                            │  DataContract.load("contract.json")
                            ▼
 ┌─────────────────────────────────────────────────────────┐
 │  DataContract.validate(new_df)                          │
 │                                                         │
 │  Check 1: Missing columns?     ──FAIL──▶ CRITICAL       │
 │  Check 2: Extra columns?       ──WARN──▶ LOW            │
 │  Check 3: Null rate exceeded?  ──FAIL──▶ HIGH           │
 │  Check 4: dtype mismatch?      ──FAIL──▶ HIGH           │
 │  Check 5: Values out of range? ──FAIL──▶ MEDIUM         │
 │  Check 6: KS distribution?     ──FAIL──▶ MEDIUM         │
 │  Check 7: Cardinality shifted? ──WARN──▶ LOW            │
 └──────────────────────────┬──────────────────────────────┘
                            │
                            ▼
              ContractReport  ·  violations[]  ·  passed?
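
To make the fit-then-validate idea concrete, here is a heavily simplified sketch covering checks 1 to 5 only. It is not datamend's code: data is passed as a plain `{column: [values]}` dict instead of a DataFrame, `None` stands for a missing value, and `fit_contract` and `validate` are hypothetical functions that return simple tuples rather than a ContractReport.

```python
def fit_contract(columns, null_threshold=0.02):
    """Learn a small per-column profile from training data."""
    contract = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if not present:
            continue  # nothing to learn from an all-null column
        contract[name] = {
            "dtype": type(present[0]).__name__,
            "max_null_rate": null_threshold,
            "min": min(present),
            "max": max(present),
        }
    return contract


def validate(columns, contract):
    """Return (severity, column, message) tuples for checks 1-5."""
    violations = []
    for name in contract:
        if name not in columns:
            violations.append(("CRITICAL", name, "missing column"))
    for name in columns:
        if name not in contract:
            violations.append(("LOW", name, "unexpected extra column"))
    for name, spec in contract.items():
        values = columns.get(name)
        if not values:
            continue
        present = [v for v in values if v is not None]
        null_rate = 1 - len(present) / len(values)
        if null_rate > spec["max_null_rate"]:
            violations.append(("HIGH", name, f"null rate {null_rate:.2f} too high"))
        if any(type(v).__name__ != spec["dtype"] for v in present):
            violations.append(("HIGH", name, "dtype mismatch"))
            continue  # range comparison is meaningless across types
        if any(v < spec["min"] or v > spec["max"] for v in present):
            violations.append(("MEDIUM", name, "value outside learned range"))
    return violations


train = {"age": [25, 40, 31], "city": ["a", "b", "a"]}
prod = {"age": [30, None, 99], "extra": [1, 2, 3]}
violations = validate(prod, fit_contract(train))
passed = not violations
```

In this toy batch the missing `city` column, the unexpected `extra` column, the null in `age`, and the out-of-range `99` each produce one violation.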

💡 Usage Examples

import datamend
from datamend import DataContract

# ── Fit contract on clean training data ───────────────────────────────
contract = datamend.contract(train_df)
contract.save("contracts/v1.json")   # version control this!

# ── Load and validate production batch ────────────────────────────────
contract = DataContract.load("contracts/v1.json")
report = datamend.validate(prod_df, contract)

if not report.passed:
    for v in report.violations:
        print(f"[{v.severity}] {v.column}: {v.message}")
        print(f"  Expected: {v.expected}  |  Got: {v.observed}")

# ── Raise exception on violation (for strict pipelines) ───────────────
try:
    datamend.validate(prod_df, contract, raise_on_failure=True)
except datamend.ContractViolationError as e:
    # Block the pipeline, alert the team.
    # alert_slack is a placeholder for your own alerting hook.
    alert_slack(str(e))

# ── Using DataContract class directly ─────────────────────────────────
contract = DataContract(null_threshold=0.02)  # max 2% nulls allowed
contract.fit(train_df)
report = contract.validate(prod_df)
print(report.summary())

🆚 DataContract vs Great Expectations vs Pandera

| Feature | datamend | Great Expectations | Pandera |
|---|---|---|---|
| Auto-learn from data | ✅ | ❌ (manual) | ❌ (manual) |
| Statistical distribution check | ✅ KS-test | ❌ | ❌ |
| JSON persistence | ✅ | ✅ (JSON/YAML) | ✅ (YAML) |
| Setup lines of code | 2 | ~20 | ~10 |
| Integrated repair | ✅ | ❌ | ❌ |
| MendScore health metric | ✅ | ❌ | ❌ |
| Drift detection built-in | ✅ | ❌ | ❌ |

📡 Pillar 3: DriftRadar

"Know before your model knows it's broken."

DriftRadar runs up to four statistical tests on each feature column (PSI, a KS test for continuous columns, Chi-Square for categorical columns, and JSD) and combines them into a single drift verdict with a severity score, giving you early warning before model performance visibly degrades.


Multi-Test Drift Detection Pipeline

 TRAINING DATA  ───────────────────────────────────────────┐
                                                           │
 PRODUCTION DATA ──────────────────────────────────────────┤
                                                           │
                                                           ▼
 ┌─────────────────────────────────────────────────────────────────────┐
 │                  DriftRadar.detect()                                │
 │                                                                     │
 │   For each column:                                                  │
 │                                                                     │
 │   ┌──────────────────────────────────────────────────────────────┐  │
 │   │  Test 1: PSI  (Population Stability Index)                   │  │
 │   │                                                              │  │
 │   │   1. Build percentile-based bins on training data            │  │
 │   │   2. Count actual% and expected% per bin                     │  │
 │   │   3. PSI = Σ (actual% - expected%) × ln(actual%/expected%)   │  │
 │   │                                                              │  │
 │   │   PSI < 0.10  ──▶  Stable                                    │  │
 │   │   PSI 0.10–0.25  ──▶  Slight shift (monitor)                 │  │
 │   │   PSI > 0.25  ──▶  Significant drift (alert!)                │  │
 │   └──────────────────────────────────────────────────────────────┘  │
 │                                                                     │
 │   ┌──────────────────────────────────────────────────────────────┐  │
 │   │  Test 2: KS Test  (Kolmogorov-Smirnov, continuous columns)   │  │
 │   │                                                              │  │
 │   │   D = max|F_train(x) - F_prod(x)|   (max CDF distance)       │  │
 │   │   p-value < alpha (0.05)  ──▶  Distributions differ          │  │
 │   └──────────────────────────────────────────────────────────────┘  │
 │                                                                     │
 │   ┌──────────────────────────────────────────────────────────────┐  │
 │   │  Test 3: Chi-Square  (categorical columns)                   │  │
 │   │                                                              │  │
 │   │   Compare observed vs expected category frequencies          │  │
 │   │   p-value < alpha  ──▶  Category distribution shifted        │  │
 │   └──────────────────────────────────────────────────────────────┘  │
 │                                                                     │
 │   ┌──────────────────────────────────────────────────────────────┐  │
 │   │  Test 4: JSD  (Jensen-Shannon Divergence)                    │  │
 │   │                                                              │  │
 │   │   JSD(P||Q) = 0.5*KL(P||M) + 0.5*KL(Q||M), M = (P+Q)/2       │  │
 │   │   0 = identical  ·  1 = maximally different (log base 2)     │  │
 │   └──────────────────────────────────────────────────────────────┘  │
 │                                                                     │
 │   Combined Drift Score = 0.40×PSI + 0.25×KS + 0.20×JSD + 0.15×χ²    │
 │                                                                     │
 └──────────────────────────────────────────────┬──────────────────────┘
                                                │
                                                ▼
              DriftReport  ·  per-column results  ·  MendScore
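
The PSI and JSD formulas in the diagram can be reproduced in a few lines of dependency-free Python. This is a sketch under stated assumptions, not DriftRadar's implementation: `psi` and `jsd` are hypothetical helpers, empty bins are floored at a small `eps` to keep the logarithm defined, and `jsd` expects two already-binned probability vectors. For the KS test, `scipy.stats.ks_2samp` is the usual off-the-shelf choice.

```python
import bisect
import math


def _bin_edges(train, buckets):
    """Inner bin edges at evenly spaced percentiles of the training sample."""
    ordered = sorted(train)
    return [ordered[(len(ordered) * i) // buckets] for i in range(1, buckets)]


def _proportions(values, edges, eps=1e-6):
    """Share of values per bin, floored at eps so log() stays defined."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect.bisect_right(edges, v)] += 1
    return [max(c / len(values), eps) for c in counts]


def psi(train, prod, buckets=10):
    """Population Stability Index of prod against bins learned on train."""
    edges = _bin_edges(train, buckets)
    expected = _proportions(train, edges)
    actual = _proportions(prod, edges)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


def jsd(p, q):
    """Jensen-Shannon divergence with log base 2, so the result is in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


train = [float(v) for v in range(100)]
shifted = [v + 50 for v in train]   # production sample moved up by 50
stable_psi = psi(train, train)      # identical samples: no drift
drift_psi = psi(train, shifted)     # large shift: well above 0.25
```

Because the bins are learned on the training sample only, production values that fall outside the training range all land in the first or last bin, which is what drives the PSI up in the shifted case.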

📊 Drift Severity Thresholds

| PSI Value | Severity | Recommended Action |
|---|---|---|
| < 0.10 | ✅ None | No action needed |
| 0.10 – 0.20 | 🟡 Low | Monitor closely |
| 0.20 – 0.25 | 🟠 Medium | Investigate source |
| 0.25 – 0.50 | 🔴 High | Retrain model soon |
| > 0.50 | ⛔ Critical | Stop serving, retrain now |
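
The threshold table translates directly into a lookup. The sketch below uses a hypothetical `psi_severity` helper; since the table's ranges share endpoints, the handling of exact boundary values (0.20 counted as Medium, 0.25 as Medium, 0.50 as High) is an assumption.

```python
def psi_severity(psi):
    """Map a PSI value to the severity label from the threshold table."""
    if psi < 0.10:
        return "None"
    if psi < 0.20:
        return "Low"
    if psi <= 0.25:
        return "Medium"
    if psi <= 0.50:
        return "High"
    return "Critical"


# The 'income' column from the 60-second demo reported PSI=0.38.
income_severity = psi_severity(0.38)
```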

💡 Usage Examples

import datamend

# ── Basic drift detection ─────────────────────────────────────────────
report = datamend.drift(train_df, prod_df)
print(report.summary())

# ── Only check specific columns ───────────────────────────────────────
report = datamend.drift(train_df, prod_df, columns=["age", "income", "tenure"])

# ── Inspect each column's drift metrics ───────────────────────────────
for col, result in report.column_results.items():
    if result.drifted:
        print(f"[DRIFT] {col}")
        print(f"  PSI={result.psi:.3f}  KS p={result.ks_pvalue:.4f}")
        print(f"  JSD={result.jsd:.3f}  Severity: {result.severity}")

# ── With custom significance level ────────────────────────────────────
from datamend import DriftRadar
radar = DriftRadar(psi_buckets=20, alpha=0.01, verbose=True)
report = radar.detect(train_df, prod_df)

# ── Only numeric or only categorical ──────────────────────────────────
numeric_cols = prod_df.select_dtypes("number").columns.tolist()
report = datamend.drift(train_df, prod_df, columns=numeric_cols)

🆚 DriftRadar vs Evidently vs NannyML

| Feature | datamend | Evidently | NannyML |
|---|---|---|---|
| PSI (numeric drift) | ✅ | ✅ | ✅ |
| KS test | ✅ | ✅ | ✅ |
| Chi-Square | ✅ | ✅ | ❌ |
| Jensen-Shannon Divergence | ✅ | ❌ | ❌ |
| Combined drift score | ✅ | ❌ | ✅ |
| Integrated repair pipeline | ✅ | ❌ | ❌ |
| HTML dashboard (offline) | ✅ | ✅ | ✅ |
| Zero server / zero cloud | ✅ | ✅ | ❌ |
| Setup complexity | 2 lines | ~10 lines | ~15 lines |

🔬 Pillar 4: FailureTrace

"Your model failed. Which rows? Which columns? Why?"

FailureTrace provides row-level and column-level attribution of model failures. It combines data-quality signals with model confidence estimates and surrogate model explanations to surface the rows and features most likely to be causing bad predictions.


Failure Attribution Pipeline

 MODEL + DATAFRAME + PREDICTIONS
              โ”‚
              โ–ผ
 โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
 โ”‚  Step 1: Feature Importance (Column Attribution)                    โ”‚
 โ”‚                                                                     โ”‚
 โ”‚  Native importances?  โ”€โ”€YESโ”€โ”€โ–ถ  sklearn .feature_importances_       โ”‚
 โ”‚       โ”‚                         xgboost .feature_importances_       โ”‚
 โ”‚       โ”‚                         lightgbm .feature_importances_      โ”‚
 โ”‚       โ”‚                         torch .weight.abs().mean()          โ”‚
 โ”‚       NO                                                            โ”‚
 โ”‚       โ–ผ                                                             โ”‚
 โ”‚  Surrogate:  DecisionTreeRegressor(X, predictions)                  โ”‚
 โ”‚              โ†’ extract .feature_importances_                        โ”‚
 โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                            โ”‚
                            โ–ผ
 โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
 โ”‚  Step 2: Data Quality Score (Per Row)                               โ”‚
 โ”‚                                                                     โ”‚
 โ”‚  dq_score = 1.0                                                     โ”‚
 โ”‚    - 0.3 x has_any_null                                             โ”‚
 โ”‚    - 0.3 x is_outlier  (modified Z-score)                          โ”‚
 โ”‚    - 0.2 x has_encoding_issue                                       โ”‚
 โ”‚    - 0.2 x has_type_mismatch                                        โ”‚
 โ”‚                                                                     โ”‚
 โ”‚  dq_suspicion = 1.0 - dq_score                                     โ”‚
 โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                            โ”‚
                            โ–ผ
 โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
 โ”‚  Step 3: Model Confidence Score (Per Row)                           โ”‚
 โ”‚                                                                     โ”‚
 โ”‚  Classifier:  confidence = 1 - max(predict_proba(row))              โ”‚
 โ”‚               (low confidence = high suspicion)                     โ”‚
 โ”‚                                                                     โ”‚
 โ”‚  Regressor:   confidence from normalized absolute residuals         โ”‚
 โ”‚                                                                     โ”‚
 โ”‚  model_suspicion = 1.0 - confidence                                 โ”‚
 โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                            โ”‚
                            โ–ผ
 ┌─────────────────────────────────────────────────────────────────────┐
 │  Step 4: Composite Suspicion Score (Per Row)                        │
 │                                                                     │
 │  suspicion = 0.50 x dq_suspicion                                    │
 │            + 0.30 x weighted_anomaly_score                          │
 │            + 0.20 x model_suspicion                                 │
 │                                                                     │
 │  Top-K rows by suspicion score = "suspicious rows"                  │
 └──────────────────────────┬──────────────────────────────────────────┘
                            │
                            ▼
 ┌─────────────────────────────────────────────────────────────────────┐
 │  Step 5: Column Attribution Score (Per Column)                      │
 │                                                                     │
 │  col_score = 0.6 x model_importance                                 │
 │            + 0.4 x data_quality_contribution                        │
 │                                                                     │
 │  Sorted descending → top columns driving failures                   │
 └──────────────────────────┬──────────────────────────────────────────┘
                            │
                            ▼
         TraceReport  ·  suspicious_rows[]  ·  column_attributions{}
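
The scoring steps in the diagram reduce to a few weighted sums. The sketch below reproduces them in plain Python, assuming every input signal is already scaled to [0, 1]. The function names (`composite_suspicion`, `top_k_rows`, `column_score`) and the sample rows are illustrative only and are not part of datamend's API.

```python
from typing import List, Sequence


def composite_suspicion(dq_score: float, anomaly_score: float, confidence: float) -> float:
    """Step 4: blend the three per-row signals with the weights from the diagram."""
    dq_suspicion = 1.0 - dq_score        # Step 2: poor data quality raises suspicion
    model_suspicion = 1.0 - confidence   # Step 3: low model confidence raises suspicion
    return 0.50 * dq_suspicion + 0.30 * anomaly_score + 0.20 * model_suspicion


def top_k_rows(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k rows with the highest suspicion score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def column_score(model_importance: float, dq_contribution: float) -> float:
    """Step 5: per-column attribution score."""
    return 0.6 * model_importance + 0.4 * dq_contribution


# Three hypothetical rows: (dq_score, weighted_anomaly_score, model confidence)
rows = [(0.95, 0.05, 0.90), (0.40, 0.70, 0.55), (0.80, 0.20, 0.60)]
scores = [composite_suspicion(*r) for r in rows]
suspicious = top_k_rows(scores, k=2)
print(suspicious, [round(s, 2) for s in scores])
```

Because the three row-level weights sum to 1.0, the composite score stays in [0, 1] whenever its inputs do, which keeps rows comparable across batches.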

💡 Usage Examples

import datamend

# ── Basic failure trace ───────────────────────────────────────────
report = datamend.trace(model, df, predictions)
print(report.summary())

# ── With ground truth (shows actual errors) ───────────────────────
report = datamend.trace(model, df, predictions, ground_truth=y_true)

# ── Inspect suspicious rows ───────────────────────────────────────
for row in report.suspicious_rows[:5]:
    print(f"Row {row.row_index}  suspicion={row.suspicion_score:.3f}")
    print(f"  Top cols: {row.top_columns}")
    print(f"  DQ score: {row.data_quality_score:.3f}")
    print(f"  Reason: {row.reason}")

# ── Inspect which columns drive failures ──────────────────────────
for col, attr in sorted(report.column_attributions.items(),
                        key=lambda x: -x[1].importance_score):
    print(f"{col}: importance={attr.importance_score:.3f}  "
          f"anomaly_rate={attr.anomaly_rate:.3f}")

# ── Works with sklearn, XGBoost, LightGBM, PyTorch ────────────────
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBRegressor
report_sk = datamend.trace(rf_model, df, preds)
report_xgb = datamend.trace(xgb_model, df, preds)

🆚 FailureTrace vs SHAP vs LIME

| Feature | datamend | SHAP | LIME |
|---|---|---|---|
| Row-level suspicion score | ✅ | ❌ | ❌ |
| Data quality × model signal | ✅ | ❌ | ❌ |
| Zero-configuration | ✅ | ❌ (needs tree explainer) | ❌ |
| Works on black-box models | ✅ | ⚠ (KernelSHAP slow) | ✅ |
| Column attribution | ✅ | ✅ | ✅ |
| Integrated pipeline | ✅ | ❌ | ❌ |
| HTML dashboard output | ✅ | ❌ | ❌ |

🚀 MendPipeline: All Four Pillars, One Call

For production ML systems, MendPipeline chains all four pillars into a single, stateful object:

from datamend import MendPipeline

# ── Fit on clean training data (once) ─────────────────────────────
pipeline = MendPipeline(
    repair_strategy="auto",
    null_threshold=0.05,
    drift_alpha=0.05,
    psi_buckets=10,
    top_k_trace=10,
    verbose=True,
)
pipeline.fit(train_df)

# ── Run on every production batch ─────────────────────────────────
result = pipeline.transform(
    prod_df,
    model=model,
    predictions=preds,
    ground_truth=y_true,    # optional
)

# ── Full report ───────────────────────────────────────────────────
print(result.summary())
# =================================================================
# datamend MendPipeline - Full Health Report
# =================================================================
#   Overall MendScore   : 91.4/100
#
#   [Pillar 1] AutoRepair
#     Issues fixed      : 142
#     MendScore change  : 54.2 → 96.8
#
#   [Pillar 2] DataContract - PASSED
#     Violations        : 0
#     MendScore         : 98.0
#
#   [Pillar 3] DriftRadar - STABLE
#     Columns drifted   : 0
#     MendScore (drift) : 4.2
#
#   [Pillar 4] FailureTrace
#     Suspicious rows   : 3
#     MendScore         : 87.1

# ── Export repaired data ──────────────────────────────────────────
result.repaired_df.to_parquet("clean_batch.parquet")

# ── Serialize to JSON ─────────────────────────────────────────────
result.to_json()

Overall MendScore Formula

Overall MendScore =
    0.35 x repair_score_after
  + 0.30 x contract_score
  + 0.20 x (100 - drift_score)    ← inverted: low drift = good
  + 0.15 x (100 - trace_score)    ← inverted: low failures = good
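
As a worked example, the formula can be written as a small function. The four weights sum to 1.0, so the result stays on the same 0-100 scale as its inputs. The function name `overall_mend_score` and the pillar scores passed to it are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def overall_mend_score(repair_after: float, contract: float,
                       drift: float, trace: float) -> float:
    """Weighted blend from the formula above; all inputs are on a 0-100 scale."""
    return (
        0.35 * repair_after
        + 0.30 * contract
        + 0.20 * (100.0 - drift)   # inverted: low drift is good
        + 0.15 * (100.0 - trace)   # inverted: few failures is good
    )


# Hypothetical pillar scores: 31.5 + 30.0 + 18.0 + 12.0 = 91.5
score = overall_mend_score(repair_after=90.0, contract=100.0, drift=10.0, trace=20.0)
print(f"Overall MendScore: {score:.1f}/100")
```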

🖥️ HTML Dashboard

datamend generates a self-contained, single-file, dark-mode HTML dashboard that needs no server, no internet connection, and no extra dependencies:

from datamend import MendReport

# Build report from individual pillar outputs
report = MendReport(
    repair_report=repair_report,
    contract_report=contract_report,
    drift_report=drift_report,
    trace_report=trace_report,
)

# Write dashboard to disk
report.to_html("dashboard.html")

# Or launch a live server in your browser
report.serve(port=8080, open_browser=True)

Dashboard sections:

┌────────────────────────────────────────────────────────────┐
│  datamend Dashboard                           MendScore 96 │
├────────────┬────────────┬────────────┬─────────────────────┤
│ AutoRepair │  Contract  │ DriftRadar │  FailureTrace       │
│  Fixes: 142│  PASSED ✓  │  STABLE ✓  │  Rows: 3            │
├────────────┴────────────┴────────────┴─────────────────────┤
│  Repair Actions Table   (sortable, filterable)             │
├────────────────────────────────────────────────────────────┤
│  Contract Violations    (severity colour-coded)            │
├────────────────────────────────────────────────────────────┤
│  Drift Results          (per-column PSI/KS/JSD)            │
├────────────────────────────────────────────────────────────┤
│  Column Attribution     (importance scores bar chart)      │
└────────────────────────────────────────────────────────────┘

💻 CLI Reference

datamend ships a full command-line interface:

# ── Repair ────────────────────────────────────────────────────────
datamend repair data.csv -o repaired.csv --strategy median --verbose
datamend repair data.parquet -o clean.parquet --fast

# ── Validate against a contract ───────────────────────────────────
datamend validate data.csv --contract contracts/v1.json
datamend contract data.csv -o contracts/v1.json   # fit contract

# ── Detect drift ──────────────────────────────────────────────────
datamend drift train.csv prod.csv --alpha 0.01 --columns age income

# ── Score data quality ────────────────────────────────────────────
datamend score data.csv           # prints MendScore

# ── Generate HTML dashboard ───────────────────────────────────────
datamend dashboard data.csv -o report.html --open

# ── List registered plugins ───────────────────────────────────────
datamend plugins list

# ── Supported formats: CSV · Parquet · JSON · Excel (.xlsx) ───────
datamend repair data.xlsx -o clean.xlsx

🔌 Plugin System

Build custom repair logic and plug it in with a decorator:

import datamend
from datamend.plugins.base import BaseRepairPlugin, register_plugin
from datamend.core.repair import RepairAction

@register_plugin
class ClipNegativePlugin(BaseRepairPlugin):
    name = "clip_negative"
    description = "Clips all negative values in numeric columns to 0"

    def repair(self, df):
        df = df.copy()  # work on a copy so the caller's frame is untouched
        actions = []
        for col in df.select_dtypes("number").columns:
            mask = df[col] < 0
            count = int(mask.sum())
            if count > 0:
                df.loc[mask, col] = 0
                # Record the change so it appears in the repair report
                actions.append(RepairAction(
                    column=col,
                    issue_type="NEGATIVE_VALUE",
                    description=f"Clipped {count} negative values to 0",
                    rows_affected=count,
                    before_sample=None, after_sample=None,
                    strategy="clip_negative",
                ))
        return df, actions

# ── Use your plugin ───────────────────────────────────────────────
repaired, report = datamend.repair(df, plugins=[ClipNegativePlugin()])

Plugin auto-discovery via entry points:

# In your pyproject.toml
[project.entry-points."datamend.plugins"]
my_plugin = "my_package.plugins:MyPlugin"
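
Entry points declared this way can be enumerated with the standard library's `importlib.metadata`. The helper below is a sketch of how a host package might discover them. `discover_plugins` is a hypothetical name, and datamend's actual loader is not shown in this README.

```python
import sys
from importlib.metadata import entry_points


def discover_plugins(group: str = "datamend.plugins") -> dict:
    """Map entry-point names to EntryPoint objects for the given group."""
    if sys.version_info >= (3, 10):
        eps = entry_points(group=group)        # selection API, Python 3.10+
    else:
        eps = entry_points().get(group, [])    # dict-style API on older versions
    return {ep.name: ep for ep in eps}


found = discover_plugins()
for name, ep in found.items():
    # ep.value is the "module:object" string; ep.load() would import the plugin class
    print(name, "->", ep.value)
```

If no installed package declares the group, the helper returns an empty dict, so the loop simply prints nothing.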

🔗 Integrations

MLflow

import datamend
import mlflow
from datamend.integrations.mlflow import log_repair, log_drift, log_pipeline_result

with mlflow.start_run():
    repaired, repair_report = datamend.repair(df)
    log_repair(repair_report)           # logs MendScore, issue counts as metrics

    # `pipeline` is a fitted MendPipeline (see the MendPipeline section)
    pipeline_result = pipeline.transform(prod_df, model=model, predictions=preds)
    log_pipeline_result(pipeline_result)  # logs all 4 pillars + artifacts

Weights & Biases

import datamend
import wandb
from datamend.integrations.wandb import log_repair, log_drift

wandb.init(project="my-ml-project")

repaired, repair_report = datamend.repair(df)
log_repair(repair_report)      # logs to current wandb run

drift_report = datamend.drift(train_df, prod_df)
log_drift(drift_report)

DVC

import datamend
from datamend.integrations.dvc import save_repair_metrics, save_pipeline_result

repaired, report = datamend.repair(df)
save_repair_metrics(report, path="metrics/repair.json")    # git + dvc tracked

# `pipeline` is a fitted MendPipeline (see the MendPipeline section)
result = pipeline.transform(prod_df, model=model, predictions=preds)
save_pipeline_result(result, path="metrics/pipeline.json")

⚙️ Advanced Usage

🔹 Async / Concurrent Processing

import asyncio
import datamend

async def process_batch(df):
    loop = asyncio.get_running_loop()
    # Run the blocking repair call in the default thread pool
    repaired, report = await loop.run_in_executor(
        None, lambda: datamend.repair(df, verbose=False)
    )
    return repaired, report

async def main(batches):
    # Process multiple batches concurrently
    tasks = [process_batch(batch) for batch in batches]
    return await asyncio.gather(*tasks)

# `await` is only valid inside a coroutine, so drive it with asyncio.run
results = asyncio.run(main(batches))

🔹 Large Dataset: Chunked Mode

from datamend import AutoRepair

# Handles 50M+ rows without memory blowup
engine = AutoRepair(strategy="median", fast_mode=True)
repaired, report = engine.repair_chunked(
    df,
    chunk_size=1_000_000,   # process 1M rows at a time
)
print(f"Total rows processed: {len(repaired):,}")
print(f"MendScore: {report.mend_score_after:.1f}")

🔹 Production-Safe Selective Repair

# Repair only specific columns (e.g., don't touch ID columns)
from datamend import AutoRepair

engine = AutoRepair(strategy="auto")
subset = df[["age", "income", "score"]].copy()
repaired_subset, report = engine.fit_transform(subset)

# Merge back into original frame
df[["age", "income", "score"]] = repaired_subset

🔹 Selective Drift Monitoring

# Monitor only numeric features for drift (skip ID/timestamp cols)
numeric_cols = [c for c in prod_df.select_dtypes("number").columns
                if c not in ["id", "timestamp", "row_num"]]

report = datamend.drift(train_df, prod_df, columns=numeric_cols)

# Send alert if any column is critical
# (send_pagerduty_alert is a placeholder for your own alerting hook)
critical = [c for c, r in report.column_results.items()
            if r.severity == "critical"]
if critical:
    send_pagerduty_alert(f"Critical drift: {critical}")
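
For intuition about what a per-column drift statistic measures, here is a small, dependency-free sketch of the Population Stability Index (PSI) with equal-width buckets. It illustrates the general technique only: the bucketing scheme, the `eps` smoothing constant, and the function name `psi` are assumptions, not datamend's implementation.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        buckets: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference and a production sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / buckets or 1.0   # guard against a constant column

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * buckets
        for v in values:
            # Clamp so values outside the reference range land in an edge bucket
            idx = min(max(int((v - lo) / width), 0), buckets - 1)
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]   # eps avoids log(0)

    p, q = proportions(expected), proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


reference = [i / 100 for i in range(1000)]   # evenly spread over [0, 10)
shifted = [x + 3.0 for x in reference]       # same shape, shifted right by 3
print(f"no drift: {psi(reference, reference):.4f}")
print(f"shifted : {psi(reference, shifted):.4f}")
```

Identical samples give a PSI of exactly zero, and the value grows as mass moves between buckets. A common rule of thumb treats values above roughly 0.25 as significant drift, although the right threshold depends on the data.
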

🔹 Custom DataContract Rules

from datamend import DataContract

# Strict contract: no nulls tolerated in any column
contract = DataContract(
    null_threshold=0.0,        # zero nulls allowed
)
contract.fit(train_df)

# Save with metadata
import json
contract_dict = json.loads(contract.to_json())
contract_dict["version"] = "1.2.0"
contract_dict["fitted_on"] = "2024-01-15"
with open("contract_v1.2.json", "w") as f:
    json.dump(contract_dict, f, indent=2)

📊 Benchmark

Measured on a 100,000-row, 20-column dataset (MacBook Pro M2, Python 3.11):

| Task | datamend | pandas (manual) | Great Expectations | Evidently | SHAP |
|---|---|---|---|---|---|
| Null imputation | 0.12 s | 0.08 s | N/A | N/A | N/A |
| Outlier detection + fix | 0.31 s | ~1.2 s | N/A | N/A | N/A |
| Duplicate removal | 0.09 s | 0.07 s | N/A | N/A | N/A |
| Full data repair | 0.61 s | ~4 s | N/A | N/A | N/A |
| Contract fit | 0.18 s | N/A | ~2.1 s | N/A | N/A |
| Contract validate | 0.11 s | N/A | ~0.9 s | N/A | N/A |
| Drift detection (10 cols) | 0.29 s | N/A | N/A | ~0.8 s | N/A |
| Failure trace (RF model) | 1.14 s | N/A | N/A | N/A | ~8.2 s |
| Full pipeline | 2.1 s | ~7 s+ (combined) | N/A | N/A | N/A |

Benchmarks are indicative. Performance varies by data shape, column types, and hardware.


🏗️ Architecture & Project Structure

datamend/
│
├── datamend/                      ← library package
│   ├── __init__.py                ← top-level API (repair, contract, drift, trace)
│   ├── pipeline.py                ← MendPipeline (all 4 pillars unified)
│   ├── report.py                  ← MendReport + HTML dashboard generator
│   ├── cli.py                     ← Click CLI (repair/validate/drift/score/dashboard)
│   │
│   ├── core/
│   │   ├── repair.py              ← AutoRepair: 8-phase engine (15+ detectors)
│   │   ├── contract.py            ← DataContract: fit / validate / persist
│   │   ├── drift.py               ← DriftRadar: PSI + KS + chi² + JSD
│   │   └── trace.py               ← FailureTrace: row + column attribution
│   │
│   ├── plugins/
│   │   └── base.py                ← BaseRepairPlugin, PluginRegistry, @register_plugin
│   │
│   └── integrations/
│       ├── mlflow.py              ← MLflow metrics + artifact logging
│       ├── wandb.py               ← W&B metrics logging
│       └── dvc.py                 ← DVC-tracked JSON metrics
│
├── tests/                         ← 113 tests, 94% coverage
│   ├── conftest.py                ← shared fixtures
│   ├── test_repair.py             ← 32 tests
│   ├── test_contract.py           ← 22 tests
│   ├── test_drift.py              ← 19 tests
│   ├── test_trace.py              ← 11 tests
│   ├── test_pipeline.py           ← 12 tests
│   ├── test_report.py             ← 8 tests
│   └── test_plugins.py            ← 9 tests
│
├── .github/
│   ├── workflows/ci.yml           ← Tests: ubuntu/windows/macos × py3.9–3.12
│   └── workflows/publish.yml      ← PyPI trusted publish on v*.*.* tags
│
├── pyproject.toml
└── README.md

🧪 Running Tests

git clone https://github.com/vignesh2027/datamend.py.git
cd datamend.py

python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

# Run all 113 tests with coverage
pytest tests/ -v --cov=datamend --cov-report=term-missing

# Run a single pillar
pytest tests/test_repair.py -v
pytest tests/test_drift.py -v

โฑ๏ธ Time Saved Per Week

| Task | Manual time | With datamend | Saved |
|---|---|---|---|
| Null imputation per dataset | ~25 min | < 1 sec | 25 min |
| Outlier detection & fix | ~45 min | < 1 sec | 45 min |
| Schema validation setup | ~2 hours | 2 lines | 2 hours |
| Drift monitoring setup | ~3 hours | 1 line | 3 hours |
| Debugging model failures | ~4 hours | 2 sec | ~4 hours |
| Total per week | ~10+ hours | < 5 seconds | 10 hours |

📋 Requirements

| Package | Version | Why |
|---|---|---|
| pandas | ≥ 1.5.0 | Core DataFrame operations |
| numpy | ≥ 1.23.0 | Numerical computations |
| scipy | ≥ 1.9.0 | KS test, chi-square, statistical tests |
| click | ≥ 8.0.0 | CLI framework |
| rich | ≥ 13.0.0 | Beautiful terminal output |
| jinja2 | ≥ 3.1.0 | HTML dashboard templating |
| pydantic | ≥ 2.0.0 | Data validation models |

Optional extras:

pip install "datamend[sklearn]"   # scikit-learn integration
pip install "datamend[xgboost]"   # XGBoost native importances
pip install "datamend[lightgbm]"  # LightGBM native importances
pip install "datamend[torch]"     # PyTorch layer attribution
pip install "datamend[mlflow]"    # MLflow experiment tracking
pip install "datamend[wandb]"     # Weights & Biases logging
pip install "datamend[dvc]"       # DVC metric tracking
pip install "datamend[all]"       # Everything

🗺️ Roadmap

  • AutoRepair: 8-phase repair engine
  • DataContract: statistical contract learning
  • DriftRadar: PSI + KS + chi² + JSD
  • FailureTrace: surrogate row attribution
  • MendPipeline: unified 4-pillar pipeline
  • CLI: repair / validate / drift / score / dashboard
  • HTML dashboard: self-contained dark-mode output
  • MLflow / W&B / DVC integrations
  • Plugin system with entry-point discovery
  • PyPI release (0.1.0)
  • Async native support (0.2.0)
  • Polars DataFrame support (0.2.0)
  • Time-series drift (CUSUM / ADWIN) (0.3.0)
  • REST API server mode (0.3.0)
  • Grafana plugin for MendScore dashboards (0.4.0)
  • AutoML-style repair strategy search (0.5.0)

🤝 Contributing

Contributions are welcome! Please open an issue first to discuss the change, then submit a PR.

# Fork and clone
git clone https://github.com/<your-username>/datamend.py.git

# Install dev dependencies
pip install -e ".[dev]"

# Run the full test suite before submitting
pytest tests/ -v
ruff check datamend/
mypy datamend/

📄 License

MIT. See LICENSE for details.


Built with care by Vignesh
