The unified data repair, validation, drift detection, and failure tracing library for production ML
The Problem That Costs $78M Every Day
Every data scientist and ML engineer faces the same brutal cycle — every single day:
Raw data arrives → It's dirty → Days wasted cleaning manually
↓
Model trained → Works in notebook → Silently breaks in production
↓
No idea which rows failed → No idea which columns caused it → No idea why
↓
Start over. Repeat forever.
Pandas doesn't fix data. Great Expectations only validates. Evidently only detects drift. SHAP only explains outputs. Nothing does all four in one unified API. Until now.
datamend is the first library to solve all four problems together — in one line of code each.
The Five Lines That Replace Days of Work
import datamend
clean_df, repair_report = datamend.repair(df) # Pillar 1 — Fix everything
contract = datamend.contract(clean_df) # Pillar 2 — Define the standard
violations = datamend.validate(prod_df, contract) # Pillar 2 — Enforce in prod
drift_report = datamend.drift(clean_df, prod_df) # Pillar 3 — Catch distribution shift
failure_report = datamend.trace(model, prod_df, preds) # Pillar 4 — Diagnose failures
Or chain all four in a single production-ready pipeline:
from datamend import MendPipeline
pipeline = MendPipeline()
pipeline.fit(train_df) # Learn everything from training data
result = pipeline.transform( # Apply to any new batch
    prod_df,
    model=my_model,
    predictions=preds,
)
print(f"Overall health: {result.overall_mend_score:.1f}/100") # One number
result.repair_report.summary() # What was fixed
result.contract_report.summary() # What violated the schema
result.drift_report.summary() # What drifted and by how much
result.trace_report.summary() # Which rows and columns failed
Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ datamend API │
│ datamend.repair() datamend.contract() datamend.drift() datamend.trace() │
└──────────┬──────────────────┬──────────────────┬──────────────┬─────┘
│ │ │ │
┌──────▼──────┐ ┌───────▼──────┐ ┌───────▼──────┐ ┌───▼──────────┐
│ AutoRepair │ │ DataContract │ │ DriftRadar │ │FailureTrace │
│ │ │ │ │ │ │ │
│ • Null imp. │ │ • Schema gen │ │ • PSI │ │ • Feat. imp. │
│ • Outliers │ │ • Null rate │ │ • KS test │ │ • Surrogate │
│ • Type fix │ │ • Range chk │ │ • Chi-square │ │ • Row scores │
│ • Dupes │ │ • Cardinality│ │ • Jensen- │ │ • Col attrib │
│ • Encoding │ │ • Dist drift │ │ Shannon │ │ • DQ contrib │
│ • Categories│ │ • JSON save │ │ • MendScore │ │ • Model cont │
│ • Whitespace│ │ • JSON load │ │ • Severity │ │ │
│ • Units │ │ │ │ │ │ │
└──────┬──────┘ └───────┬──────┘ └───────┬──────┘ └───┬──────────┘
│ │ │ │
┌──────▼──────────────────▼──────────────────▼──────────────▼─────┐
│ MendPipeline │
│ fit(train_df) → transform(prod_df, model, preds) │
└──────────────────────────────┬──────────────────────────────────┘
│
┌──────────────────────────────▼──────────────────────────────────┐
│ MendReport + HTML Dashboard │
│ MendScore Reports Visualisations │
└─────────────────────────────────────────────────────────────────┘
The Four Pillars — Deep Dive
🔧 Pillar 1 — AutoRepair: Detect and Fix Everything Automatically
AutoRepair runs 8 detection phases in sequence, each feeding clean data to the next:
Input DataFrame
│
▼
Phase 1: Whitespace & Hidden Characters
│ Strips leading/trailing whitespace, zero-width spaces,
│ null bytes, and other invisible Unicode from all string columns
▼
Phase 2: Encoding Corruption (Mojibake)
│ Detects UTF-8 text that was mis-decoded as Latin-1 (regex match on
│ high-byte sequences) and reverses it by re-encoding
▼
Phase 3: Type Mismatch Coercion
│ Detects object columns that contain >80% numeric strings and
│ converts them. Detects date strings and parses to datetime64.
▼
Phase 4: Null Imputation
│ Numeric: auto-selects mean vs median based on skewness (>1.0 → median)
│ Categorical: mode imputation
│ Datetime: median imputation
▼
Phase 5: Outlier Detection & Clipping
│ Uses Modified Z-Score with MAD (robust to outliers themselves).
│ Clips to IQR bounds [Q1 - 1.5·IQR, Q3 + 1.5·IQR]
▼
Phase 6: Duplicate Removal
│ Exact: pandas duplicated()
│ Near-duplicate: Jaccard similarity on string bag-of-words (threshold 0.85)
▼
Phase 7: Category Normalisation
│ Groups variants via Unicode NFKD normalisation + lowercase + strip
│ Male / male / MALE → canonical form
▼
Phase 8: Community Plugins
Any registered BaseRepairPlugin instances run here
▼
Clean DataFrame + RepairReport
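The near-duplicate check in Phase 6 can be illustrated with a short standalone sketch. This is not datamend's implementation: `row_tokens`, `jaccard`, and `near_duplicate_indices` are hypothetical helpers, rows are plain dicts, and "bag of words" is assumed to mean the set of lowercase whitespace-separated tokens across all values in a row.

```python
def row_tokens(row):
    """Lowercase word set built from every value in the row."""
    tokens = set()
    for value in row.values():
        tokens.update(str(value).lower().split())
    return tokens


def jaccard(a, b):
    """|A intersect B| / |A union B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_indices(rows, threshold=0.85):
    """Indices of rows that are near-duplicates of an earlier kept row."""
    kept, dropped = [], []
    for i, row in enumerate(rows):
        tokens = row_tokens(row)
        if any(jaccard(tokens, k) >= threshold for k in kept):
            dropped.append(i)      # keep the first occurrence, drop this one
        else:
            kept.append(tokens)
    return dropped


rows = [
    {"name": "Acme Corp Ltd", "city": "New York City", "note": "priority customer since 2019"},
    {"name": "ACME Corp Ltd", "city": "new york city", "note": "priority customer since 2019"},
    {"name": "Globex", "city": "Springfield", "note": "new lead"},
]
dropped = near_duplicate_indices(rows)
```

The pairwise comparison is quadratic in the number of kept rows, which is one reason a sampling or chunked mode is useful on large inputs.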
clean_df, report = datamend.repair(df, strategy="auto", verbose=True)
# Every change is logged:
# [NULL] age — Imputed 47 nulls with median=34.0
# [OUTLIER] income — Clipped 3 outliers to IQR bounds [18k, 142k]
# [DUPLICATE] [ALL] — Removed 12 exact duplicate rows
# [INCONSISTENT_CATEGORY] gender — Normalised 3 variants to canonical form
# MendScore: 52.3 → 91.7
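The INCONSISTENT_CATEGORY line in the log corresponds to Phase 7. A minimal sketch of that grouping step follows, assuming the canonical form is the most frequent spelling in each group; `norm_key` and `canonical_map` are illustrative names, not datamend functions.

```python
import unicodedata
from collections import Counter, defaultdict


def norm_key(value):
    """NFKD-normalise, lowercase, and strip a category label."""
    return unicodedata.normalize("NFKD", value).lower().strip()


def canonical_map(values):
    """Map every variant to the most common spelling in its group."""
    groups = defaultdict(Counter)
    for v in values:
        groups[norm_key(v)][v] += 1
    mapping = {}
    for variants in groups.values():
        canonical = variants.most_common(1)[0][0]
        for v in variants:
            mapping[v] = canonical
    return mapping


values = ["male", "Male", "MALE ", "male", "female", "Female", "female"]
mapping = canonical_map(values)
normalised = [mapping[v] for v in values]
```

Only variants whose normalised keys are identical end up in the same group. An abbreviation such as "M" has a different key from "male", so merging it would need an explicit alias table on top of this step.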
Strategies supported:
| Strategy | When to use |
|---|---|
| "auto" (default) | Detects skewness: median for skewed (>1.0), mean otherwise |
| "mean" | Force mean imputation for all numeric nulls |
| "median" | Force median imputation for all numeric nulls |
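To make the "auto" rule concrete, here is a small sketch of skewness-driven imputation in plain Python. It is an illustration under assumptions, not the library's code: the skewness estimator (population moment coefficient) and the use of its absolute value are guesses, and `skewness` and `impute_auto` are hypothetical names.

```python
import statistics


def skewness(values):
    """Population skewness: third central moment over variance**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5


def impute_auto(column, skew_threshold=1.0):
    """Fill None with the median if the column is skewed, else with the mean."""
    observed = [x for x in column if x is not None]
    if abs(skewness(observed)) > skew_threshold:
        fill, method = statistics.median(observed), "median"
    else:
        fill, method = statistics.fmean(observed), "mean"
    return [fill if x is None else x for x in column], method


symmetric = [10, 12, None, 14, 16]
skewed = [20, 22, 21, None, 23, 24, 500]
sym_filled, sym_method = impute_auto(symmetric)
skw_filled, skw_method = impute_auto(skewed)
```

In the skewed column the single value of 500 drags the mean above 100, while the median stays at 22.5, which is why a skew check is a reasonable switch between the two.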
Production-safe mode — shows full repair plan and asks before applying:
clean_df, report = datamend.repair(df, confirm=True)
# → Apply all 47 repairs? [y/N]:
Large dataset support — chunked processing + fast mode:
engine = datamend.AutoRepair(fast_mode=True, chunk_size=50_000)
repaired, reports = engine.repair_chunked(huge_df) # one report per chunk
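The outlier phase described above (Modified Z-Score with MAD for detection, IQR fences for clipping) can be sketched with the standard library alone. The helper names are hypothetical, and the quartile method (`inclusive`, which matches linear interpolation) is an assumption about how the fences are computed.

```python
import statistics


def modified_z_scores(values):
    """0.6745 * (x - median) / MAD for every value."""
    med = statistics.median(values)
    mad = statistics.median(abs(x - med) for x in values)
    if mad == 0:
        return [0.0] * len(values)  # no spread: nothing can be flagged
    return [0.6745 * (x - med) / mad for x in values]


def clip_outliers(values, z_cutoff=3.5):
    """Clip values flagged by the modified Z-score to the IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    z = modified_z_scores(values)
    return [
        min(max(x, low), high) if abs(zi) > z_cutoff else x
        for x, zi in zip(values, z)
    ]


incomes = [30, 32, 35, 36, 38, 40, 41, 43, 45, 400]
clipped = clip_outliers(incomes)
```

Here only the value 400 exceeds the 3.5 cutoff, and it is pulled down to the upper fence Q3 + 1.5·IQR; every other value is left untouched.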
📋 Pillar 2 — DataContract: Define the Standard. Enforce It Forever.
DataContract captures the schema and statistical fingerprint of your clean training data in a JSON file. You can then validate any new DataFrame against it in milliseconds.
Training DataFrame (clean reference)
│
▼ datamend.contract(train_df)
┌───────────────────────────────────────┐
│ Per-column ColumnSpec: │
│ dtype : float64 │
│ nullable : False │
│ null_rate : 0.0 │
│ min / max : 18.0 / 79.0 │
│ mean / std : 41.3 / 15.7 │
│ percentiles: p5=22, p25=29... │
│ dist_params: μ=41.3, σ=15.7 │
│ cardinality: (for categoricals) │
│ allowed_values: [male, female] │
└──────────────┬────────────────────────┘
│ contract.save("my_contract.json")
▼
DataContract JSON
│
│ DataContract.load("my_contract.json")
▼
Production DataFrame → datamend.validate(prod_df, contract)
│
▼
ContractReport:
✗ [age] NULL_RATE — 12.3% nulls (threshold: 5%)
✗ [gender] CARDINALITY_VIOLATION — new value 'non-binary' not in contract
⚠ [income] DISTRIBUTION_DRIFT — KS=0.34, p=0.001
✓ [score] All checks passed
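The DISTRIBUTION_DRIFT line reports a KS statistic computed against the normal parameters stored in the contract. A hedged sketch of such a one-sample check follows; `ks_vs_normal` is a hypothetical helper, the data is simulated with a fixed seed, and the 0.10 cutoff simply reuses the `drift_threshold` value from the contract example.

```python
import math
import random


def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_vs_normal(sample, mu, sigma):
    """One-sample KS statistic: largest gap between empirical and normal CDFs."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = normal_cdf(x, mu, sigma)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


rng = random.Random(0)                      # fixed seed for repeatability
mu, sigma = 41.3, 15.7                      # dist_params stored in the contract
stable = [rng.gauss(mu, sigma) for _ in range(2000)]
shifted = [x + 10 for x in stable]          # simulated production shift

d_stable = ks_vs_normal(stable, mu, sigma)
d_shifted = ks_vs_normal(shifted, mu, sigma)
flag_stable = d_stable > 0.10
flag_shifted = d_shifted > 0.10
```

Storing only the mean and standard deviation keeps the contract small, at the cost of assuming the training column is roughly normal.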
# Generate and save the contract from training data
contract = datamend.contract(
    train_df,
    name="production_v1",
    null_threshold=0.05, # max 5% nulls allowed
    drift_threshold=0.10, # KS threshold for distribution warnings
)
contract.save("contracts/production_v1.json")
# In production — validate every incoming batch
contract = datamend.DataContract.load("contracts/production_v1.json")
report = datamend.validate(prod_df, contract)
if not report.passed:
    # Machine-readable JSON for alerting systems
    alert_payload = report.to_json()
# Hard gate: raise an exception and block the pipeline
datamend.validate(prod_df, contract, raise_on_failure=True)
Checks performed per column:
| Check | Description |
|---|---|
| Schema | Missing or extra columns detected |
| Null rate | Exceeds configured threshold |
| Dtype | Incompatible type change (float→object etc.) |
| Range | Min/max far outside training distribution |
| Distribution | KS test against fitted normal parameters |
| Cardinality | Unseen category values present |
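As a rough illustration of how a contract can drive the schema, null-rate, and cardinality checks in the table, the sketch below builds a tiny JSON-serialisable contract from dict rows and validates a second batch against it. `build_contract` and `check_rows` are hypothetical stand-ins, and the contract layout is invented for the example rather than taken from datamend's file format.

```python
import json


def build_contract(rows, null_threshold=0.05):
    """Record per-column null rate and the allowed values of text columns."""
    columns = {}
    for name in rows[0]:
        values = [r[name] for r in rows]
        present = [v for v in values if v is not None]
        spec = {"null_rate": 1 - len(present) / len(values)}
        if present and all(isinstance(v, str) for v in present):
            spec["allowed_values"] = sorted(set(present))
        columns[name] = spec
    return {"null_threshold": null_threshold, "columns": columns}


def check_rows(rows, contract):
    """Return a list of (column, violation_code) tuples."""
    violations = []
    seen = set(rows[0]) if rows else set()
    for name, spec in contract["columns"].items():
        if name not in seen:
            violations.append((name, "MISSING_COLUMN"))
            continue
        values = [r[name] for r in rows]
        null_rate = sum(v is None for v in values) / len(values)
        if null_rate > contract["null_threshold"]:
            violations.append((name, "NULL_RATE"))
        allowed = spec.get("allowed_values")
        if allowed is not None and any(
            v is not None and v not in allowed for v in values
        ):
            violations.append((name, "CARDINALITY_VIOLATION"))
    return violations


train = [{"age": 31, "gender": "male"}, {"age": 45, "gender": "female"}]
prod = [{"age": None, "gender": "male"}, {"age": 52, "gender": "non-binary"}]

contract = json.loads(json.dumps(build_contract(train)))  # JSON round trip
violations = check_rows(prod, contract)
```

Because the contract is plain JSON, it can be versioned alongside the model and loaded in any service that needs to gate incoming batches.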
📡 Pillar 3 — DriftRadar: Four Algorithms. One Score. Full Attribution.
DriftRadar applies four statistical tests (PSI, KS, and JSD for numeric columns; chi-square and JSD for categorical columns) and combines them into a single MendScore (0=stable, 100=critical drift):
Training Series (reference) Production Series (current)
│ │
└──────────────┬───────────────┘
│
┌─────▼──────────────────────────────┐
│ Numeric columns │
│ │
│ PSI = Σ (A%-E%) × ln(A%/E%) │
│ Population Stability Index │
│ <0.1=stable >0.2=drift │
│ │
│ KS = max|F₁(x) - F₂(x)| │
│ Kolmogorov-Smirnov test │
│ p-value < α → drift │
│ │
│ JSD = ½KL(P‖M) + ½KL(Q‖M) │
│ Jensen-Shannon Divergence │
│ 0=identical 1=disjoint │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Categorical columns │
│ │
│ χ² = Σ (O-E)²/E │
│ Chi-square goodness of fit │
│ │
│ JSD = on value frequency dists │
└─────────────────────────────────────┘
│
┌───────────────▼─────────────────────┐
│ Composite MendScore (0–100) │
│ = mean(PSI/0.5, KS, JSD, χ²_norm) │
│ × 100, per column │
│ │
│ Severity: │
│ 0–10% → none ████░░░░ green │
│ 10–20% → low ████████ yellow │
│ 20–25% → medium ████████ orange │
│ 25–50% → high ████████ red │
│ >50% → critical████████ crimson │
└──────────────────────────────────────┘
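The PSI formula in the diagram can be worked through in a few lines. This sketch assumes equal-width buckets over the reference range and a small floor (`eps`) for empty buckets; both are common choices but are assumptions here, and `psi` is an illustrative function rather than the DriftRadar implementation.

```python
import math


def psi(expected, actual, buckets=10, eps=1e-6):
    """Population Stability Index over equal-width bins of the reference range."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / buckets or 1.0       # guard against a constant column

    def shares(values):
        counts = [0] * buckets
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), buckets - 1)] += 1  # clamp out-of-range values
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [i % 100 for i in range(1000)]       # uniform on 0..99
same = list(reference)
shifted = [min(v + 30, 99) for v in reference]   # mass pushed upward
psi_same = psi(reference, same)
psi_shifted = psi(reference, shifted)
```

Identical distributions give a PSI of zero, while the shifted sample empties the lowest buckets and lands well above the 0.2 drift threshold quoted in the diagram.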
report = datamend.drift(train_df, prod_df, verbose=True)
# Output:
# MendScore (drift): 34.2/100 (0=stable, 100=critical)
# Columns drifted : 3/12
#
# [DRIFT] income: severity=critical, score=67.1, PSI=0.342, KS=0.41, JSD=0.38
# [DRIFT] age: severity=medium, score=23.4, PSI=0.198, KS=0.22, JSD=0.19
# [DRIFT] region: severity=low, score=11.2, JSD=0.14, χ²=18.4
# [ok] score: severity=none, score=2.1, PSI=0.024, KS=0.04, JSD=0.02
# Per-column PSI, KS, chi-square, JSD — all in one dict
report.to_dict()
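For readers who want to see how the remaining pieces fit together, the following sketch computes a two-sample KS statistic, a Jensen-Shannon divergence in bits, and a per-column composite with the severity bands from the diagram. It covers numeric columns only. Capping PSI/0.5 at 1 and the treatment of band boundaries are assumptions, and all function names are hypothetical.

```python
import math
from bisect import bisect_right


def ks_two_sample(ref, cur):
    """Two-sample KS statistic: max gap between the empirical CDFs."""
    ref, cur = sorted(ref), sorted(cur)
    return max(
        abs(bisect_right(ref, x) / len(ref) - bisect_right(cur, x) / len(cur))
        for x in ref + cur
    )


def jsd(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def column_score(psi_value, ks_value, jsd_value):
    """mean(PSI/0.5, KS, JSD) x 100; the cap on PSI/0.5 is an assumption."""
    parts = [min(psi_value / 0.5, 1.0), ks_value, jsd_value]
    return 100 * sum(parts) / len(parts)


def severity(score):
    """Severity bands applied to a 0-100 column score."""
    for limit, label in [(10, "none"), (20, "low"), (25, "medium"), (50, "high")]:
        if score < limit:
            return label
    return "critical"


ks = ks_two_sample([1, 2, 3, 4], [3, 4, 5, 6])
js_same = jsd([0.5, 0.5], [0.5, 0.5])
js_disjoint = jsd([1.0, 0.0], [0.0, 1.0])
score = column_score(psi_value=0.30, ks_value=0.40, jsd_value=0.35)
label = severity(score)
```

With base-2 logarithms the divergence is bounded by 1, which lets it be averaged with the KS statistic on the same 0 to 1 scale.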
🔍 Pillar 4 — FailureTrace: Know Exactly Which Rows and Columns Broke Your Model
FailureTrace combines model-level attribution with data-quality anomaly detection to pinpoint the root cause of prediction failures at the row and column level:
Model + Input DataFrame + Predictions
│
▼
Step 1: Feature Importance Extraction
┌─────────────────────────────────────────────────────┐
│ sklearn tree models → feature_importances_ │
│ sklearn linear models → |coef_| │
│ XGBoost / LightGBM → feature_importances_ │
│ Black-box / PyTorch → Surrogate DecisionTree │
│ (fits DecisionTreeRegressor on X→predictions │
│ and reads its feature_importances_ as proxy) │
└─────────────────────────────────────────────────────┘
│
▼
Step 2: Per-Column Anomaly Rates
┌─────────────────────────────────────────────────────┐
│ For each column: │
│ anomaly_rate = (nulls + outliers) / total_rows │
│ Outlier detection via Modified Z-Score (MAD) │
└─────────────────────────────────────────────────────┘
│
▼
Step 3: Per-Row Suspicion Scoring
┌─────────────────────────────────────────────────────┐
│ For each row: │
│ dq_suspicion = 1 - row_quality_score/100 │
│ model_suspicion= 1 - predict_proba.max() │
│ weighted_anomaly= Σ col_anomaly × feature_imp │
│ │
│ suspicion_score = ( │
│ 0.5 × dq_suspicion + │
│ 0.3 × weighted_anomaly + │
│ 0.2 × model_suspicion │
│ ) × 100 │
└─────────────────────────────────────────────────────┘
│
▼
Step 4: Column Attribution (sorted by importance)
┌─────────────────────────────────────────────────────┐
│ importance = 0.6 × model_contribution │
│ + 0.4 × data_quality_contribution │
└─────────────────────────────────────────────────────┘
│
▼
TraceReport:
Suspicious rows (sorted by suspicion score, top 50)
Column attributions (top-K, sorted by importance)
data_quality_failure_pct → % rows with DQ issues
model_failure_pct → % rows with low confidence
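The Step 3 weighting can be checked by hand with a small function. This is a sketch of the formula as written in the diagram, not the FailureTrace source: it assumes `col_anomaly` holds a per-row anomaly flag or rate between 0 and 1 for each column and that the feature importances sum to 1. All argument names are illustrative.

```python
def suspicion_score(row_quality_score, max_class_probability,
                    col_anomaly, feature_importance):
    """Weighted blend of the three suspicion terms, scaled to 0-100."""
    dq_suspicion = 1 - row_quality_score / 100
    model_suspicion = 1 - max_class_probability
    weighted_anomaly = sum(
        col_anomaly[c] * feature_importance.get(c, 0.0) for c in col_anomaly
    )
    return 100 * (
        0.5 * dq_suspicion + 0.3 * weighted_anomaly + 0.2 * model_suspicion
    )


importance = {"income": 0.7, "age": 0.3}      # assumed to sum to 1
clean_row = suspicion_score(100, 0.99, {"income": 0.0, "age": 0.0}, importance)
bad_row = suspicion_score(40, 0.55, {"income": 1.0, "age": 0.0}, importance)
```

A row with perfect data quality and a confident prediction scores close to zero, while a row with poor quality, an anomaly in the most important feature, and a near coin-flip prediction scores 60.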
report = datamend.trace(model, prod_df, predictions, ground_truth=y_true)
# Top failure columns:
# income: importance=78.3 dq_contrib=45.1 model_contrib=91.2 anomaly_rate=12.4%
# age: importance=31.2 dq_contrib=8.3 model_contrib=42.7 anomaly_rate=3.1%
# Most suspicious rows:
# Row 1847: score=94.1 reason='data quality issues; low model confidence'
# Row 392: score=87.3 reason='feature anomalies; low model confidence'
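The column ranking follows the Step 4 blend of model contribution and data-quality contribution. A minimal sketch, assuming both inputs are already on a 0 to 100 scale, is shown below; `rank_columns` and the input numbers are made up for illustration.

```python
def rank_columns(model_contrib, dq_contrib, top_k=5):
    """Blend per-column contributions (0.6 model, 0.4 data quality) and rank."""
    blended = {
        col: 0.6 * model_contrib[col] + 0.4 * dq_contrib.get(col, 0.0)
        for col in model_contrib
    }
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


ranking = rank_columns(
    model_contrib={"income": 90.0, "age": 40.0, "region": 10.0},
    dq_contrib={"income": 50.0, "age": 10.0, "region": 0.0},
    top_k=2,
)
```

Weighting the model term more heavily means a column the model barely uses cannot top the list on data-quality problems alone.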
How AutoRepair Detects Each Issue — Under the Hood
| Issue | Detection method | Fix strategy |
|---|---|---|
| Null values | df[col].isnull() | mean / median / mode (auto-selected by skewness) |
| Outliers | Modified Z-Score using MAD: z = 0.6745 × (x − median) / MAD, flag if \|z\| > 3.5 | IQR clipping to [Q1 − 1.5·IQR, Q3 + 1.5·IQR] |
| Type mismatch | >80% of object column values match the regex ^-?\d+(\.\d+)?$ or parse as a date format | pd.to_numeric() / pd.to_datetime() |
| Exact duplicates | df.duplicated() | df.drop_duplicates() |
| Near-duplicates | Jaccard(bag_of_words(row_i), bag_of_words(row_j)) ≥ 0.85 | Drop the duplicate row (keep first) |
| Encoding corruption (mojibake) | Regex [\xc0-\xff][\x80-\xbf]{1,3} | Encode latin-1, decode utf-8 |
| Inconsistent categories | Unicode NFKD normalise + lower + strip → group identical norms | Replace all variants with canonical (most common) form |
| Whitespace / hidden chars | r"^\s+\|\s+$" + hidden char regex [\x00-\x1f\x7f\xa0] | str.strip() + re.sub(hidden) |
| Unit mismatch (suspected) | CV = std / \|mean\| > 5.0 + IQR ratio (Q3/Q1) > 10 | Flag only: requires human domain confirmation |
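The mojibake row is worth a concrete example, because the direction of the fix is easy to get backwards: the corrupted text is UTF-8 bytes that were decoded as Latin-1, so the repair re-encodes as Latin-1 and decodes as UTF-8. The sketch below uses the regex from the table; `fix_mojibake` is a hypothetical helper, not the library function.

```python
import re

# Characters that look like a UTF-8 lead byte followed by continuation bytes
# after the bytes were wrongly decoded as Latin-1.
MOJIBAKE = re.compile("[\xc0-\xff][\x80-\xbf]{1,3}")


def fix_mojibake(text):
    """Re-encode as Latin-1 and decode as UTF-8 when the text looks corrupted."""
    if not MOJIBAKE.search(text):
        return text
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not reversible: leave the value untouched


original = "caf\u00e9"
corrupted = original.encode("utf-8").decode("latin-1")  # simulate the damage
repaired = fix_mojibake(corrupted)
untouched = fix_mojibake("na\u00efve")                  # valid text, no pattern match
```

Text damaged through Windows-1252 rather than Latin-1 may contain characters that cannot be encoded as Latin-1; in this sketch those values fall through the `except` branch unchanged.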
Installation
# Core (pandas + numpy + scipy + click + rich + jinja2 + pydantic)
pip install datamend
# With model integrations
pip install "datamend[sklearn]" # scikit-learn — enables full FailureTrace
pip install "datamend[xgboost]" # XGBoost
pip install "datamend[lightgbm]" # LightGBM
pip install "datamend[torch]" # PyTorch
# With experiment tracking
pip install "datamend[mlflow]" # MLflow integration
pip install "datamend[wandb]" # Weights & Biases
pip install "datamend[dvc]" # DVC
# Everything
pip install "datamend[all]"
# Verify
python -c "import datamend; print(datamend.__version__)"
System requirements: Python 3.9+, Windows / macOS / Linux (all tested in CI on every commit)
The MendScore — One Number for Data Health
Every datamend report carries a MendScore: a single number from 0 to 100 that summarises the result. For data-quality scores, higher means healthier data:
MendScore Interpretation
| Score | Colour | Meaning | Recommended action |
|---|---|---|---|
| 90–100 | GREEN | Excellent. Production-ready. | Deploy with confidence. |
| 70–89 | TEAL | Good. Minor issues. | Review repair report. |
| 50–69 | YELLOW | Moderate problems. | Repair before deploying. |
| 30–49 | ORANGE | Serious issues. | Do not deploy without review. |
| 0–29 | RED | Critical. Severe data rot. | Block deployment. Fix now. |
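The interpretation bands translate directly into a lookup. The sketch below is one way to do it; `interpret` is a hypothetical helper, and treating fractional scores such as 89.5 as belonging to the lower band is an assumption, since the bands are listed with whole-number edges.

```python
BANDS = [
    (90, "GREEN", "Deploy with confidence."),
    (70, "TEAL", "Review repair report."),
    (50, "YELLOW", "Repair before deploying."),
    (30, "ORANGE", "Do not deploy without review."),
    (0, "RED", "Block deployment. Fix now."),
]


def interpret(score):
    """Return (colour, recommended action) for a 0-100 quality score."""
    if not 0 <= score <= 100:
        raise ValueError("MendScore must be between 0 and 100")
    for floor, colour, action in BANDS:
        if score >= floor:
            return colour, action


colour, action = interpret(47.3)
```

A lookup like this is handy in CI, where the colour can decide whether a deployment step proceeds.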
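Each pillar produces its own MendScore. The repair and contract scores measure quality (higher is better), while the drift and trace scores measure severity (lower is better):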
| Report attribute | MendScore meaning |
|---|---|
| repair_report.mend_score_before | Quality score of raw input data |
| repair_report.mend_score_after | Quality score after AutoRepair |
| contract_report.mend_score | How many contract checks passed (100 = all pass) |
| drift_report.mend_score | Drift severity (0 = no drift, 100 = critical drift) |
| trace_report.mend_score | Failure severity (0 = no failures, 100 = widespread) |
| result.overall_mend_score | Weighted composite of all four pillars |
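Because two of the pillar scores count severity and two count quality, any composite has to put them on a common direction first. The sketch below shows one plausible way to do that. The equal weights and the `100 - score` inversion are placeholders chosen for illustration; the actual weighting used for `overall_mend_score` is not documented here.

```python
def overall_health(repair_after, contract, drift, trace, weights=None):
    """Blend four pillar scores into one 0-100 health score (higher is better)."""
    # Equal weights are a placeholder, not datamend's actual weighting.
    weights = weights or {
        "repair": 0.25, "contract": 0.25, "drift": 0.25, "trace": 0.25,
    }
    health = {
        "repair": repair_after,
        "contract": contract,
        "drift": 100 - drift,   # drift score: 0 = no drift, so invert
        "trace": 100 - trace,   # trace score: 0 = no failures, so invert
    }
    total = sum(weights.values())
    return sum(weights[k] * health[k] for k in health) / total


overall = overall_health(repair_after=92.0, contract=100.0, drift=34.0, trace=10.0)
```

Dividing by the weight total keeps the result on the 0 to 100 scale even if custom weights do not sum to 1.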
# One-liner MendScore from the CLI
$ datamend score production_data.csv
MendScore: 47.3/100 ← ORANGE: serious issues detected
Full Benchmark: datamend vs Every Alternative
| Capability | pandas | Great Expectations | Evidently | SHAP | datamend |
|---|---|---|---|---|---|
| Auto-repair nulls | ❌ | ❌ | ❌ | ❌ | ✅ smart imputation |
| Auto-repair outliers | ❌ | ❌ | ❌ | ❌ | ✅ MAD + IQR clip |
| Fix type mismatches | ❌ | ❌ | ❌ | ❌ | ✅ auto-coerce |
| Deduplicate (near-dupes) | Partial | ❌ | ❌ | ❌ | ✅ Jaccard similarity |
| Fix encoding corruption | ❌ | ❌ | ❌ | ❌ | ✅ mojibake repair |
| Normalise categories | ❌ | ❌ | ❌ | ❌ | ✅ NFKD normalise |
| Data contract generation | ❌ | ✅ | ❌ | ❌ | ✅ one line |
| Contract enforcement | ❌ | ✅ | ❌ | ❌ | ✅ + raise_on_failure |
| PSI drift detection | ❌ | ❌ | ✅ | ❌ | ✅ |
| KS + chi-square + JSD | ❌ | ❌ | Partial | ❌ | ✅ all four |
| Row-level failure attribution | ❌ | ❌ | ❌ | ❌ | ✅ |
| Column-level root cause | ❌ | ❌ | ❌ | Partial | ✅ DQ + model combined |
| Unified pipeline API | ❌ | ❌ | ❌ | ❌ | ✅ MendPipeline |
| Single health score | ❌ | ❌ | ❌ | ❌ | ✅ MendScore |
| HTML dashboard | ❌ | Partial | ✅ | ❌ | ✅ self-contained |
| CLI (no Python needed) | ❌ | ❌ | ❌ | ❌ | ✅ full CLI |
| Plugin / extension system | ❌ | Partial | ❌ | ❌ | ✅ 4 plugin types |
| MLflow / W&B / DVC hooks | ❌ | ❌ | Partial | ❌ | ✅ all three |
| Small core dependency set | ✅ | ❌ | ❌ | ❌ | ✅ pandas, numpy, scipy, click, rich, jinja2, pydantic |
| Framework-agnostic models | — | — | Partial | ✅ | ✅ any sklearn API |
| Chunked / large dataset | Partial | ❌ | ❌ | ❌ | ✅ repair_chunked() |
| Audit log / changelog | ❌ | ❌ | ❌ | ❌ | ✅ every change logged |
CLI Reference — No Python Required
datamend ships a complete CLI, so every pillar can be run without writing Python code. Point it at any file and get results.
# ── Repair any file ───────────────────────────────────────────────────────────
datamend repair data.csv
datamend repair data.csv -o clean.csv --strategy median
datamend repair data.csv --report repair.json --html dashboard.html
datamend repair data.csv --fast # sampling mode for large files
datamend repair data.csv --confirm # ask before applying (production safe)
# ── Generate a DataContract from your training data ───────────────────────────
datamend contract training.csv -o contract.json
datamend contract training.csv --name "v1_production" --null-threshold 0.02
# ── Validate production data against the contract ─────────────────────────────
datamend validate prod.csv contract.json
datamend validate prod.csv contract.json --fail-fast # exit code 1 on violations
datamend validate prod.csv contract.json --report violations.json --html report.html
# ── Detect drift between two datasets ─────────────────────────────────────────
datamend drift training.csv production.csv
datamend drift train.csv prod.csv --report drift.json --html drift.html --alpha 0.01
# ── Get a quick health score for any file ─────────────────────────────────────
datamend score mydata.csv
# MendScore: 47.3/100
# ── Serve a live HTML dashboard from any report JSON ─────────────────────────
datamend dashboard repair_report.json --port 8899
# ── List all installed plugins ────────────────────────────────────────────────
datamend plugins
HTML Dashboard — Self-Contained. Dark Mode. Zero Dependencies.
Every report exports as a single HTML file — no server, no external CSS, no JavaScript frameworks. Open it anywhere.
from datamend.report import MendReport
mr = MendReport(
repair=repair_report,
contract=contract_report,
drift=drift_report,
trace=trace_report,
title="Production Health — 2026-05-14",
)
mr.to_html("health_dashboard.html") # Save as self-contained file
mr.serve(port=8899) # Or serve live — opens browser automatically
From the CLI:
datamend repair data.csv --html dashboard.html
datamend drift train.csv prod.csv --html drift_dashboard.html
datamend dashboard report.json --port 9000
Integrations — Track Data Health Alongside Model Experiments
MLflow
import mlflow
import datamend
from datamend.integrations import mlflow as dm_mlflow
with mlflow.start_run():
    # Repair
    clean_df, repair_report = datamend.repair(df)
    dm_mlflow.log_repair(repair_report)
    # Logged: datamend.repair.mend_score_before/after, issues_found, rows_affected
    # Drift
    drift_report = datamend.drift(train_df, prod_df)
    dm_mlflow.log_drift(drift_report)
    # Logged: datamend.drift.mend_score, per-column PSI/KS/JSD
    # Full pipeline at once
    dm_mlflow.log_pipeline_result(pipeline_result)
Weights & Biases
import wandb
from datamend.integrations import wandb as dm_wandb
with wandb.init(project="my-ml-project"):
    dm_wandb.log_repair(repair_report, step=epoch)
    dm_wandb.log_drift(drift_report, step=epoch)
    dm_wandb.log_pipeline_result(result, step=epoch)
DVC
from datamend.integrations import dvc as dm_dvc
dm_dvc.save_pipeline_result(result, output_dir="datamend_metrics")
# Creates:
# datamend_metrics/repair_metrics.json
# datamend_metrics/drift_metrics.json
# datamend_metrics/drift_plots.json ← dvc plots show
# datamend_metrics/summary.json
dvc metrics show datamend_metrics/repair_metrics.json
dvc plots show datamend_metrics/drift_plots.json
Plugin System — Extend Every Pillar
datamend has four plugin types — one for each pillar. Write a class, register it, done.
from datamend.plugins.base import BaseRepairPlugin, register_plugin
from datamend.core.repair import RepairAction
import pandas as pd
import re
@register_plugin
class PhoneNormalisationPlugin(BaseRepairPlugin):
    """Normalise phone numbers to E.164 format."""
    name = "phone_normalise"
    description = "Strips non-digit characters and prepends + for phone columns."
    version = "1.0.0"
    author = "Your Name"

    def repair(self, df):
        df = df.copy()
        actions = []
        # "object" covers plain Python strings, "string" covers pandas' string dtype
        for col in df.select_dtypes(include=["object", "string"]).columns:
            if "phone" not in col.lower():
                continue
            count = int(df[col].notna().sum())
            # Keep the regex out of an f-string: a backslash inside an f-string
            # expression is a SyntaxError before Python 3.12
            df[col] = df[col].apply(
                lambda v: "+" + re.sub(r"\D", "", str(v)) if pd.notna(v) else v
            )
            actions.append(RepairAction(
                column=col, issue_type="PHONE_NORMALISE",
                description=f"Normalised {count} phone numbers to E.164",
                rows_affected=count, before_sample=None, after_sample=None,
                strategy="e164",
            ))
        return df, actions
# Use inline
clean_df, report = datamend.repair(df, plugins=[PhoneNormalisationPlugin()])
# Or register globally and it auto-runs in all repair() calls
# Publish as a package with entry-point: datamend.plugins → auto-discovered
The four plugin types:
| Base class | Pillar | Override method |
|---|---|---|
| BaseRepairPlugin | AutoRepair | repair(df) → (df, actions) |
| BaseValidatorPlugin | DataContract | validate(df, col, stats) → violations |
| BaseDriftDetectorPlugin | DriftRadar | detect(ref, cur, col) → result_dict |
| BaseTracerPlugin | FailureTrace | score_rows(model, df, preds) → rows |
Auto-discovery — publish a package with:
[project.entry-points."datamend.plugins"]
my_plugin = "my_package:MyRepairPlugin"
datamend finds it automatically when installed.
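Entry-point discovery of this kind is usually built on `importlib.metadata` from the standard library. The sketch below shows the general mechanism for the `datamend.plugins` group; whether datamend loads plugins in exactly this way is an assumption, and `discover_plugins` is a hypothetical helper.

```python
from importlib.metadata import entry_points


def discover_plugins(group="datamend.plugins"):
    """Load every object advertised under the given entry-point group."""
    eps = entry_points()
    if hasattr(eps, "select"):          # Python 3.10+
        selected = eps.select(group=group)
    else:                               # Python 3.9: dict of group -> entries
        selected = eps.get(group, [])
    return {ep.name: ep.load() for ep in selected}


plugins = discover_plugins()
```

With the `pyproject.toml` snippet above installed, the returned dict would map `my_plugin` to the `MyRepairPlugin` class; with no plugin packages installed it is empty.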
Advanced Usage
Large datasets — chunked processing:
engine = datamend.AutoRepair(chunk_size=50_000, fast_mode=True)
repaired_df, chunk_reports = engine.repair_chunked(huge_10M_row_df)
# Returns one RepairReport per chunk — merge as needed
Streaming (custom chunking):
import pandas as pd
import datamend
repaired_chunks = []
for chunk in pd.read_csv("huge_file.csv", chunksize=100_000):
    # Each chunk is repaired independently, so imputation statistics and
    # duplicate detection do not see rows from other chunks
    clean_chunk, _ = datamend.repair(chunk, verbose=False)
    repaired_chunks.append(clean_chunk)
repaired = pd.concat(repaired_chunks, ignore_index=True)
Hard production gate:
contract = datamend.DataContract.load("contract.json")
# Raises ContractViolationError and stops the pipeline
datamend.validate(prod_df, contract, raise_on_failure=True)
Selective drift check:
# Only check the features that matter most
report = datamend.drift(
    train_df, prod_df,
    columns=["income", "age", "credit_score"],
    alpha=0.01, # stricter significance level
)
MendPipeline with all options:
from datamend import MendPipeline
pipeline = MendPipeline(
    repair_strategy="median", # force median imputation
    null_threshold=0.02, # 2% max nulls in contract
    drift_alpha=0.01, # stricter drift detection
    psi_buckets=20, # finer PSI granularity
    top_k_trace=15, # top 15 failure columns
    enable_repair=True,
    enable_contract=True,
    enable_drift=True,
    enable_trace=True,
    fast_mode=True, # sampling for large data
    verbose=True, # rich terminal output
)
pipeline.fit(train_df)
result = pipeline.transform(prod_df, model=model, predictions=preds)
Why datamend Saves 8–20 Hours Per Engineer Per Week
Without datamend, a typical data team spends:
| Task | Hours/week |
|---|---|
| Manual data cleaning (custom scripts) | 3–8 hours |
| Debugging why a model failed on prod | 2–5 hours |
| Writing & maintaining validation rules | 2–4 hours |
| Checking for data drift after deploy | 1–3 hours |
| Total wasted per engineer | 8–20 hours |
| Total wasted per team (5 engineers) | 40–100 hours |
datamend automates all four. That is $78M/day saved globally across the industry.
Project Structure
datamend/
├── datamend/
│ ├── __init__.py ← Public API: repair(), contract(), validate(), drift(), trace()
│ ├── pipeline.py ← MendPipeline (unified 4-pillar pipeline)
│ ├── report.py ← MendReport + HTML dashboard generator
│ ├── cli.py ← Full Click-based CLI
│ ├── core/
│ │ ├── repair.py ← AutoRepair engine (8-phase detection + fix)
│ │ ├── contract.py ← DataContract generation + validation
│ │ ├── drift.py ← DriftRadar (PSI + KS + chi2 + JSD + MendScore)
│ │ └── trace.py ← FailureTrace (row + column attribution)
│ ├── plugins/
│ │ └── base.py ← BaseRepairPlugin, PluginRegistry, @register_plugin
│ └── integrations/
│ ├── mlflow.py ← MLflow logging hooks
│ ├── wandb.py ← Weights & Biases logging hooks
│ └── dvc.py ← DVC metrics + plots export
├── tests/ ← 113 tests, 90%+ coverage
├── docs/ ← MkDocs site (API + tutorials + plugin guide)
├── .github/workflows/
│ ├── ci.yml ← Tests on Windows/macOS/Linux, Python 3.9–3.12
│ └── publish.yml ← Auto-publish to PyPI on git tag
├── pyproject.toml
├── README.md
├── CONTRIBUTING.md
└── CHANGELOG.md
Contributing
datamend welcomes contributions of all kinds.
How to contribute:
- Bug reports: open an issue with a minimal reproducible example
- New repair strategy: subclass BaseRepairPlugin and open a PR
- New drift algorithm: subclass BaseDriftDetectorPlugin and open a PR
- New validator: subclass BaseValidatorPlugin and open a PR
- Docs, tests, examples: always welcome
git clone https://github.com/vignesh2027/datamend.py.git
cd datamend.py
pip install -e ".[dev]"
pytest # all 113 tests must pass
ruff check datamend/
See CONTRIBUTING.md for the full guide including how to publish your plugin as a standalone package.
License
MIT © Vignesh — Free to use in any project, commercial or otherwise.
Built to solve the single most painful and expensive problem in data science.
Every data scientist who finds it should never want to work without it again.
PyPI · GitHub · Issues · Discussions · Contributing
File details
Details for the file datamend-0.1.0.tar.gz.
File metadata
- Download URL: datamend-0.1.0.tar.gz
- Upload date:
- Size: 80.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a01a075fe9964bca8d251a201345a4c21e69e8514bc1dd6ba3938d98c480154b |
| MD5 | 0059bca670471ef53b3da23f154e59b6 |
| BLAKE2b-256 | 03f285beb5a29a2824529d6ff27dba9348b34ae28f2af4f1b8948c738e2191e3 |
File details
Details for the file datamend-0.1.0-py3-none-any.whl.
File metadata
- Download URL: datamend-0.1.0-py3-none-any.whl
- Upload date:
- Size: 57.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6dc182437a80d6c1cfbe2b6cfdb5e2484a0177775b6b7df0ddd07f84d46e362a |
| MD5 | 07ed585e029f1c3c933456dee3576aa1 |
| BLAKE2b-256 | 354a4a2e8f642fa87de412e5d2dfa323e828d325f29eaf6084e11ddc7e7e16ad |