puredatalib

Automatic data cleaning and silent incompatibility detection for Python

These details have not been verified by PyPI

██████╗ ██╗   ██╗██████╗ ███████╗██████╗  █████╗ ████████╗ █████╗
██╔══██╗██║   ██║██╔══██╗██╔════╝██╔══██╗██╔══██╗╚══██╔══╝██╔══██╗
██████╔╝██║   ██║██████╔╝█████╗  ██║  ██║███████║   ██║   ███████║
██╔═══╝ ██║   ██║██╔══██╗██╔══╝  ██║  ██║██╔══██║   ██║   ██╔══██║
██║     ╚██████╔╝██║  ██║███████╗██████╔╝██║  ██║   ██║   ██║  ██║
╚═╝      ╚═════╝ ╚═╝  ╚═╝╚══════╝╚═════╝ ╚═╝  ╚═╝   ╚═╝   ╚═╝  ╚═╝

Two problems. Solved perfectly. Forever.

Website · Install · Quickstart · AutoClean · DataWatch · API · CLI

The Problem That Wastes Your Life

If you work with data in Python — as a data scientist, ML engineer, analyst, or developer — you know exactly what happens every single project:

By common estimates, 60 to 80 percent of your time goes into two tasks that should not take that long:

Cleaning dirty data by hand. You write the same pandas code you always write. Null checks. Type coercions. Duplicate drops. Outlier filters. String normalization. Date parsing. Again. Every project. Every dataset. Every team. This is the most repeated, most hated task in all of Python data work. Every library you've ever used makes you do it manually.
Debugging silent data failures in production. Your training data gets cleaned. Your model learns. Then production data arrives — slightly different nulls, slightly different ranges, slightly different categories. Your model quietly makes wrong predictions. You have no idea why. You spend days investigating. By then the damage is done.

puredata eliminates both. Permanently. In one import.

Five-Line Demo

import puredata

# Problem 1: dirty data — fixed automatically in one line
clean_df, report = puredata.clean(dirty_df)
print(report.summary())      # shows every fix: what, where, why

# Problem 2: silent incompatibility — caught before it reaches your model
contract = puredata.watch(train_df)          # profile your training data once
result   = puredata.check(prod_df, contract) # validate anything new, instantly
print(result.summary())      # green ticks and red flags with exact row numbers

That is the entire API for 90% of use cases. Two lines to clean. Two lines to validate. One import.

Before vs After

Column	Raw (what you get)	After `puredata.clean()`
`gender`	`"Male"`, `"male"`, `"M"`, `"MALE"`, `"m"`, `"FEMALE"`, `"F"`	`"Male"`, `"Female"`
`date_joined`	`"15/01/2020"`, `"January 15, 2020"`, `"01-15-2020"`, `"2020-01-15"`	`"2020-01-15"`
`income`	`NaN` (18 % missing), `9999999.0` (outlier)	Imputed via KNN, outlier clipped
`age`	`"25"`, `"30"`, `"42"` (strings)	`25.0`, `30.0`, `42.0` (float64)
`weight`	`"70kg"`, `"154lbs"`, `"70000g"`	`70.0`, `69.85`, `70.0` (kg, SI)
`name`	`" Alice "`, `"alice"`, `"BOB"`	`"Alice"`, `"Alice"`, `"Bob"`
`description`	`"HelloWorld"` (with an embedded zero-width space), `"Title"` (with a leading BOM)	`"HelloWorld"`, `"Title"`
`created_at`	`NaT` (3 missing)	Forward-filled from neighbours

Why puredata? What Makes It Different?

There are other cleaning libraries. There are other validation frameworks. None of them does what puredata does.

Capability	puredata	pandas	pyjanitor	great_expectations	evidently
One-line automatic cleaning	✅	❌	❌	❌	❌
Context-aware null imputation	✅	❌	❌	❌	❌
Ensemble outlier detection	✅	❌	❌	❌	❌
Fuzzy category normalization	✅	❌	❌	❌	❌
200+ date format detection	✅	partial	❌	❌	❌
Unit normalization	✅	❌	❌	❌	❌
Encoding repair (BOM, mojibake)	✅	❌	❌	❌	❌
Full fix-by-fix repair log	✅	❌	❌	❌	❌
MendScore (0–100 health)	✅	❌	❌	❌	❌
Distribution drift (PSI+KS+JS)	✅	❌	❌	partial	✅
Schema + range + null violations	✅	❌	❌	✅	partial
Custom business rules	✅	❌	❌	✅	❌
sklearn pipeline compatible	✅	❌	❌	❌	❌
Works with polars/numpy/files	✅	❌	❌	❌	❌

pandas is a data manipulation library. It does not clean your data automatically — it gives you tools to clean it yourself. pyjanitor adds some convenience methods but still requires you to write the logic. great_expectations validates data but does not clean it and requires verbose configuration files. evidently detects drift but does not integrate into a cleaning pipeline. puredata does all of it, automatically, with zero configuration, in one import.

Architecture

                        ┌─────────────────────────────────────────┐
                        │              import puredata             │
                        └──────────────────┬──────────────────────┘
                                           │
              ┌────────────────────────────┴──────────────────────────┐
              │                                                        │
   ┌──────────▼──────────┐                              ┌─────────────▼──────────┐
   │    PILLAR 1          │                              │    PILLAR 2             │
   │    AutoClean         │                              │    DataWatch            │
   │    puredata.clean()  │                              │    puredata.watch()     │
   └──────────┬───────────┘                              │    puredata.check()     │
              │                                          └─────────────┬──────────┘
   ┌──────────▼───────────────────────────┐                            │
   │  Input Layer                         │             ┌──────────────▼────────────────────┐
   │  pandas · polars · numpy · CSV/Excel │             │  Fit Phase (watch)                │
   │  Parquet · JSON · file path          │             │  Per-column statistical profiling  │
   └──────────┬───────────────────────────┘             │  dtype · nulls · range · histogram│
              │                                         │  categories · percentiles          │
   ┌──────────▼───────────────────────────────────┐     └──────────────┬────────────────────┘
   │  Cleaning Pipeline (ordered)                 │                    │
   │  ┌─────────────────────────────────────────┐ │     ┌──────────────▼────────────────────┐
   │  │ 1. Encoding  BOM·zero-width·NFC norm    │ │     │  Check Phase (check)              │
   │  │ 2. Whitespace  strip·collapse·tab       │ │     │  Schema violations                │
   │  │ 3. Types  numeric-strings·dates         │ │     │  Range violations                 │
   │  │ 4. Dates  200+ formats → ISO 8601       │ │     │  Null rate spikes                 │
   │  │ 5. Duplicates  exact row dedup          │ │     │  New category values              │
   │  │ 6. Categories  fuzzy + prefix cluster   │ │     │  Distribution drift               │
   │  │ 7. Units  SI normalization              │ │     │  Custom business rules            │
   │  │ 8. Nulls  KNN + Iterative + mode        │ │     └──────────────┬────────────────────┘
   │  │ 9. Outliers  IQR+Zscore+IsoF+LOF vote  │ │                    │
   │  └─────────────────────────────────────────┘ │     ┌──────────────▼────────────────────┐
   └──────────┬───────────────────────────────────┘     │  WatchReport                      │
              │                                         │  compatibility_score 0-100         │
   ┌──────────▼───────────────────────────────────┐     │  per-check: PASS / WARN / FAIL    │
   │  CleanReport                                 │     │  HTML · JSON export               │
   │  mend_score 0-100                            │     └───────────────────────────────────┘
   │  per-fix: column · rows · before → after     │
   │  HTML · JSON · CSV export                    │              ┌──────────────────────┐
   └──────────────────────────────────────────────┘              │  MendPipeline        │
                                                                 │  AutoClean + Watch   │
              ┌─────────────────────────────────────────────┐    │  sklearn compatible  │
              │  CLI  puredata clean / watch / check         │    └──────────────────────┘
              │  Dashboard  puredata.dashboard(df)           │
              │  Integrations  MLflow · W&B · DVC            │
              │  Plugins  CleanerPlugin · ValidatorPlugin     │
              └─────────────────────────────────────────────┘

Pillar 1 — AutoClean

How it works: the nine-stage pipeline

Every input goes through a fixed, ordered sequence of cleaning stages. Each stage is independent and can be enabled or disabled. The output of each stage feeds the next. A complete per-fix log is maintained throughout.

Stage 1 — Encoding Repair

Detects and repairs:

BOM (byte-order mark) prepended to string values
Zero-width characters such as U+200B (zero-width space), U+200C (zero-width non-joiner), and U+200D (zero-width joiner)
Mojibake: UTF-8 bytes misread as latin-1 → corrected via Unicode NFC normalization
Invisible characters that silently break string comparisons
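
The repairs listed above can be sketched with the standard library alone. This is a minimal illustration, not puredata's implementation: `strip_invisible`, `repair_mojibake`, and the `INVISIBLE` set are hypothetical names, and the character list is an assumed, partial one. Note that NFC normalization only composes characters; undoing a UTF-8-read-as-latin-1 mix-up needs a byte round trip, which is what the second helper assumes.

```python
import unicodedata

# Assumed, partial list of invisible characters for this sketch.
INVISIBLE = {"\ufeff", "\u200b", "\u200c", "\u200d"}

def strip_invisible(text: str) -> str:
    """Remove BOM and zero-width characters, then apply NFC normalization."""
    cleaned = "".join(ch for ch in text if ch not in INVISIBLE)
    return unicodedata.normalize("NFC", cleaned)

def repair_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were decoded as latin-1; leave other text unchanged."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not mojibake of this kind, so keep the input as-is

print(strip_invisible("\ufeffTitle"))          # Title
print(strip_invisible("Hello\u200bWorld"))     # HelloWorld
print(repair_mojibake("caf\u00c3\u00a9"))      # café
```

The round trip is deliberately conservative: text that is already valid fails the decode step and is returned untouched.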

Stage 2 — Whitespace Normalization

On every string column:

Leading and trailing whitespace stripped
Internal runs of whitespace (tabs, repeated spaces, etc.) collapsed to a single space
Tab characters inside strings removed
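
A minimal sketch of the strip-and-collapse behaviour described above, assuming a single regular expression is enough; `normalize_whitespace` is a hypothetical helper, not part of puredata's API.

```python
import re

def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace to one space, then trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

print(repr(normalize_whitespace("  Alice \t  Smith\n")))  # 'Alice Smith'
```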

Stage 3 — Type Coercion

For each column, checks whether the stored type matches the semantic type:

Numeric strings: if > 80% of non-null values match ^\s*-?\d+(\.\d+)?\s*$, converts to float64 via pd.to_numeric
Works correctly with both object dtype and pandas 4 StringDtype
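
The 80% rule above can be illustrated on a plain list. This sketch reuses the regular expression from the text; `coerce_numeric` is a hypothetical helper, and turning non-matching values into `None` is an assumption (the source only says conversion goes through `pd.to_numeric`).

```python
import re

NUMERIC = re.compile(r"^\s*-?\d+(\.\d+)?\s*$")

def coerce_numeric(values, threshold=0.8):
    """Convert a column to floats when more than `threshold` of non-null values look numeric."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return list(values)
    hits = sum(1 for v in non_null if isinstance(v, str) and NUMERIC.match(v))
    if hits / len(non_null) <= threshold:
        return list(values)  # not numeric enough: leave the column untouched
    # Assumption: values that do not match become missing.
    return [float(v) if isinstance(v, str) and NUMERIC.match(v) else None
            for v in values]

print(coerce_numeric(["25", "30", "42"]))        # [25.0, 30.0, 42.0]
print(coerce_numeric(["25", "abc", "x"]))        # ['25', 'abc', 'x']
```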

Stage 4 — Date Format Normalization

For string columns where > 70% of values parse as dates:

Tests against 25 explicit strptime formats covering global conventions
Falls back to dateutil.parser for any remaining ambiguous formats
Pure numeric strings are excluded to prevent misidentification (e.g. "1.5" is not a date)
Output format: ISO 8601 by default (%Y-%m-%d) or any user-specified format
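
The explicit-format pass can be sketched as a loop over `strptime` patterns. The four formats below are an illustrative sample chosen to cover the Before vs After examples, not the library's actual list, and `to_iso` is a hypothetical helper. The dateutil fallback is left out to keep the sketch dependency-free.

```python
import re
from datetime import datetime

# Illustrative sample only; the order decides how ambiguous inputs resolve.
FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y", "%B %d, %Y"]
NUMERIC = re.compile(r"^\s*-?\d+(\.\d+)?\s*$")

def to_iso(value: str, output_format: str = "%Y-%m-%d"):
    """Return the date in `output_format`, or None if no format matches."""
    if NUMERIC.match(value):
        return None  # pure numbers such as "1.5" are never treated as dates
    for fmt in FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime(output_format)
        except ValueError:
            continue
    return None

for raw in ["15/01/2020", "January 15, 2020", "01-15-2020", "2020-01-15", "1.5"]:
    print(raw, "->", to_iso(raw))
```

Because the first matching format wins, an ambiguous value such as `"03/04/2020"` is read day-first here; a real implementation would need a rule for that case.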

Stage 5 — Duplicate Removal

Exact duplicate rows detected and removed. Index is reset after removal. Reported in the fix log.

Stage 6 — Category Normalization

For low-cardinality string columns (≤ 50 unique values):

Fuzzy clustering: RapidFuzz ratio ≥ 85 — catches "Male" → "male" → "MALE"
Prefix/abbreviation matching: if one value is a short (≤ 3 char) prefix of another → clusters "M" → "Male", "F" → "Female" without false positives
Canonical form: the most frequent value in each cluster
All mappings logged in the fix report
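
The clustering rules above can be sketched as follows. To stay dependency-free, this example swaps RapidFuzz for the standard library's `difflib.SequenceMatcher`, whose ratio runs from 0 to 1 and is not numerically identical to RapidFuzz's score. Lowercasing before comparison is an assumption, and `similar` and `normalize_categories` are hypothetical names.

```python
from collections import Counter
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.85, max_prefix_len: int = 3) -> bool:
    """True if two labels are near-identical or one is a short prefix of the other."""
    a, b = a.lower(), b.lower()  # assumption: comparison is case-insensitive
    if SequenceMatcher(None, a, b).ratio() >= threshold:
        return True
    short, long_ = sorted((a, b), key=len)
    return len(short) <= max_prefix_len and long_.startswith(short)

def normalize_categories(values):
    """Map every value to the most frequent member of its cluster."""
    counts = Counter(values)
    clusters = []  # each cluster starts with its most frequent value
    for value in sorted(counts, key=lambda v: (-counts[v], v)):
        for cluster in clusters:
            if similar(value, cluster[0]):
                cluster.append(value)
                break
        else:
            clusters.append([value])
    mapping = {v: cluster[0] for cluster in clusters for v in cluster}
    return [mapping[v] for v in values]

raw = ["Male", "male", "M", "MALE", "m", "Male", "FEMALE", "F", "Female", "Female"]
print(normalize_categories(raw))
```

Values are visited in descending frequency, so the first member of each cluster is also its canonical form.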

Stage 7 — Unit Normalization

For string columns containing mixed-unit numeric values (e.g. "70kg", "154lbs"):

Detects weight, distance, and temperature unit families by pattern matching
Normalizes each family to one canonical metric unit (kilograms, kilometers, degrees Celsius)
Result column becomes numeric float64
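
A minimal sketch of the weight family only, assuming a simple number-plus-suffix pattern. The `TO_KG` table and `to_kilograms` are hypothetical, and rounding to two decimals is a choice made here to match the Before vs After table.

```python
import re

# Conversion factors to kilograms (1 lb = 0.45359237 kg by definition).
TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237, "lbs": 0.45359237}
PATTERN = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*([a-zA-Z]+)\s*$")

def to_kilograms(value: str):
    """Parse strings like '154lbs' and return kilograms, or None if unrecognized."""
    match = PATTERN.match(value)
    if not match:
        return None
    number, unit = float(match.group(1)), match.group(2).lower()
    factor = TO_KG.get(unit)
    return None if factor is None else round(number * factor, 2)

print([to_kilograms(v) for v in ["70kg", "154lbs", "70000g"]])  # [70.0, 69.85, 70.0]
```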

Stage 8 — Null Imputation

For numeric columns:

Missing rate	Strategy	Why
0 % → 40 %	KNN Imputation	Preserves local correlation structure
40 % → 99 %	Iterative Imputation (MICE)	Models each feature as a function of others
100 %	Fill with 0	Column has no information

KNN finds the k nearest complete rows using Euclidean distance across all numeric features. Each missing value is replaced by the weighted mean of its neighbours' values:

x̂ᵢⱼ = (Σₖ wₖ · xₖⱼ) / (Σₖ wₖ)    where   wₖ = 1/d(xᵢ, xₖ)
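
The inverse-distance weighting can be sketched in pure Python. The weights are divided by their sum so that the result is a true weighted mean. `knn_impute` is a hypothetical helper that handles one missing cell at a time, and returning the neighbour's value directly at distance zero is an assumption made to avoid dividing by zero.

```python
import math

def knn_impute(rows, target_row, target_col, k=2):
    """Estimate target_row[target_col] from the k nearest complete rows."""
    features = [j for j in range(len(target_row)) if j != target_col]

    def distance(row):
        return math.sqrt(sum((row[j] - target_row[j]) ** 2 for j in features))

    neighbours = sorted(rows, key=distance)[:k]
    weights = []
    for row in neighbours:
        d = distance(row)
        if d == 0:
            return row[target_col]  # identical features: reuse that row's value
        weights.append(1.0 / d)
    weighted_sum = sum(w * row[target_col] for w, row in zip(weights, neighbours))
    return weighted_sum / sum(weights)

complete = [[1.0, 10.0], [3.0, 30.0], [10.0, 100.0]]
print(knn_impute(complete, [2.0, None], target_col=1))  # 20.0
```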

For categorical columns:

Missing rate	Fill value
≤ 50 %	Mode (most frequent category)
> 50 %	`"__unknown__"` special token

For datetime columns: forward fill then backward fill.

Stage 9 — Outlier Detection & Handling

Four independent detection methods run in parallel. A value is flagged as an outlier only when multiple methods agree, which reduces the false positives that single-method approaches tend to produce.

Method 1 — Interquartile Range (IQR)

Q1 = 25th percentile
Q3 = 75th percentile
IQR = Q3 − Q1
Lower fence = Q1 − 1.5 × IQR
Upper fence = Q3 + 1.5 × IQR

Values outside [Lower fence, Upper fence] cast one vote.
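
As a worked example of the fences, the sketch below uses `statistics.quantiles` with linear interpolation between data points. The source does not say which percentile method puredata uses, so that choice is an assumption; `iqr_fences` is a hypothetical helper.

```python
import statistics

def iqr_fences(values):
    """Return (lower fence, upper fence) using the 1.5 x IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [10, 12, 13, 15, 16, 18, 95]
lower, upper = iqr_fences(data)
print(lower, upper)                                   # 5.75 23.75
print([x for x in data if x < lower or x > upper])    # [95]
```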

Method 2 — Z-Score

z = (x − μ) / σ

Values with |z| > 3 cast one vote. Best suited to approximately normally distributed data. Computed with scipy.stats.zscore using nan_policy="omit".

Method 3 — Isolation Forest

Randomly partitions the feature space into trees. Anomalies are isolated in fewer splits:

score(x) = 2^(−E[h(x)] / c(n))

where E[h(x)] is the expected path length, H(i) is the i-th harmonic number, and c(n) = 2H(n−1) − 2(n−1)/n is the average path length of an unsuccessful search in a binary search tree of n points. Values predicted as -1 (anomaly) cast one vote.

Requires minimum 20 samples. Uses contamination=0.05 (expects 5% outliers).

Method 4 — Local Outlier Factor (LOF)

Compares the local density of each point to its k nearest neighbours:

LOF_k(x) = (Σ_{o∈N_k(x)} lrd_k(o)) / (|N_k(x)| · lrd_k(x))

where lrd_k(x) is the local reachability density. Values with LOF >> 1 are anomalies.

Voting threshold: by default, a value must be flagged by at least 50% of applicable methods to be considered an outlier. Configurable via outlier_threshold.
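
The voting step can be sketched independently of the detectors. Here each method contributes a list of booleans, and only the methods that actually ran are included. `vote_outliers` and the dictionary layout are hypothetical.

```python
def vote_outliers(flags_by_method, outlier_threshold=0.5):
    """Flag a value when the share of methods voting for it reaches the threshold."""
    methods = list(flags_by_method.values())
    n_values = len(methods[0])
    decisions = []
    for i in range(n_values):
        votes = sum(1 for flags in methods if flags[i])
        decisions.append(votes / len(methods) >= outlier_threshold)
    return decisions

flags = {
    "iqr":     [False, False, True,  True],
    "zscore":  [False, False, False, True],
    "iforest": [False, True,  False, True],
    "lof":     [False, False, True,  True],
}
print(vote_outliers(flags))  # [False, False, True, True]
```

With four methods and the default threshold, two agreeing votes are enough; raising `outlier_threshold` to 0.75 would require three.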

Three outlier actions:

"clip" (default): clip to [1st percentile, 99th percentile]
"remove": drop the entire row
"nan": replace with NaN (then re-imputed in stage 8)

Pillar 2 — DataWatch

The silent incompatibility problem

When you train a model, your cleaning code is calibrated to your training data. When production data arrives, it is never exactly the same. The differences are usually small enough that your code doesn't crash — but large enough to quietly destroy prediction quality. DataWatch catches every difference before it reaches your model.

Fit once. Check forever.

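import puredata
from puredata import DataContract  # assumed import path; adjust if DataContract lives elsewhere

# Day 1: profile your training data
contract = puredata.watch(train_df)
contract.save("contract.json")   # persist for production deployment

# Every day in production:
contract = DataContract.load("contract.json")
result = puredata.check(incoming_batch, contract)

if not result.passed:
    alert_team(result.summary())   # alert_team: your own notification hook
    rollback_pipeline()            # rollback_pipeline: your own recovery step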

The seven checks DataWatch runs automatically

1 — Schema Violations

Exact column-level schema comparison:

Missing columns (FAIL): training had revenue, production doesn't
Extra columns (WARN): production has new columns not seen in training
Type changes (FAIL): int64 in training, object in production

2 — Range Violations

Per-column numeric range derived from training:

[min_ref − tolerance, max_ref + tolerance]

where tolerance = (max_ref − min_ref) × range_tolerance_factor.

Any production value outside this range is flagged with the exact row index and value.
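
A minimal sketch of the range check on plain lists. The source does not state a default for `range_tolerance_factor`, so the 0.1 used here is an assumption, and `range_violations` is a hypothetical helper.

```python
def range_violations(reference, production, range_tolerance_factor=0.1):
    """Return (row index, value) pairs that fall outside the tolerated range."""
    min_ref, max_ref = min(reference), max(reference)
    tolerance = (max_ref - min_ref) * range_tolerance_factor
    low, high = min_ref - tolerance, max_ref + tolerance
    return [(i, v) for i, v in enumerate(production) if v < low or v > high]

train_ages = [18, 25, 40, 65]          # tolerated range is roughly [13.3, 69.7]
prod_ages = [20, 70, 12, 69]
print(range_violations(train_ages, prod_ages))  # [(1, 70), (2, 12)]
```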

3 — Null Rate Spikes

Δnull = null_rate_production − null_rate_reference

If Δnull > null_rate_tolerance (default 10 percentage points) → FAIL. A spike usually indicates breakage in the upstream data pipeline.

4 — New Category Values

For categorical columns, tracks the exact set of observed categories in training. Any new value in production is flagged:

new_values = set(production_categories) − set(training_categories)

If |new_values| / |training_categories| > cardinality_tolerance → FAIL.
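
The set difference and the tolerance ratio can be sketched directly. `new_category_check` is a hypothetical helper, and `cardinality_tolerance` is passed explicitly because the source does not give its default.

```python
def new_category_check(training, production, cardinality_tolerance):
    """Return (new values, ratio of new to known categories, failed flag)."""
    training_categories = set(training)
    new_values = set(production) - training_categories
    ratio = len(new_values) / len(training_categories)
    return sorted(new_values), ratio, ratio > cardinality_tolerance

print(new_category_check(["A", "B", "C", "D"], ["A", "E", "B", "E"], 0.1))
# (['E'], 0.25, True)
```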

5 — Distribution Drift

Three drift statistics computed simultaneously:

Population Stability Index (PSI):

PSI = Σᵢ (Aᵢ − Eᵢ) × ln(Aᵢ / Eᵢ)

where Aᵢ = production proportion in bin i, Eᵢ = reference proportion. Uses Laplace smoothing (count + 0.5) / (n + 0.5k) to handle empty bins.

PSI	Interpretation
< 0.10	No significant change
0.10 – 0.20	Slight change — monitor
> 0.20	Significant change — investigate
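
A minimal PSI sketch over pre-binned counts, applying the smoothing term (count + 0.5) / (n + 0.5k) to both samples. Binning is assumed to have been done already with the same edges for both samples; `psi` is a hypothetical helper.

```python
import math

def psi(reference_counts, production_counts):
    """Population Stability Index over matching bins, with smoothing for empty bins."""
    k = len(reference_counts)
    n_ref, n_prod = sum(reference_counts), sum(production_counts)
    total = 0.0
    for ref_count, prod_count in zip(reference_counts, production_counts):
        expected = (ref_count + 0.5) / (n_ref + 0.5 * k)   # E_i
        actual = (prod_count + 0.5) / (n_prod + 0.5 * k)   # A_i
        total += (actual - expected) * math.log(actual / expected)
    return total

print(psi([10, 20, 30], [10, 20, 30]))        # 0.0
print(psi([50, 50, 0], [10, 40, 50]) > 0.20)  # True
```

Every term is non-negative, so the index is zero only when the smoothed proportions agree in every bin.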

Kolmogorov-Smirnov Test:

D = sup_x |F_n(x) − F_m(x)|

D is the maximum absolute difference between the two empirical CDFs. The accompanying p-value accounts for sample size, and a difference is treated as statistically significant only when p < 0.05.
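
The statistic D can be computed without any dependency by evaluating both empirical CDFs at every observed point. This sketch covers D only; the p-value would normally come from a statistics library such as SciPy. `ks_statistic` is a hypothetical helper.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)   # share of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)   # share of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d

print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
```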

Jensen-Shannon Divergence:

JSD(P ∥ Q) = ½ KL(P ∥ M) + ½ KL(Q ∥ M)    where M = ½(P + Q)

Symmetric and bounded: in [0, 1] with base-2 logarithms, or in [0, ln 2] with natural logarithms.
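
The bound depends on the logarithm base, which the following sketch makes visible. `kl` and `jsd` are hypothetical helpers operating on discrete distributions that already sum to one.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q, base=2):
    """Jensen-Shannon divergence, expressed in the given logarithm base."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return (0.5 * kl(p, m) + 0.5 * kl(q, m)) / math.log(base)

disjoint_p, disjoint_q = [1.0, 0.0], [0.0, 1.0]
print(jsd(disjoint_p, disjoint_q))               # 1.0 in base 2
print(jsd(disjoint_p, disjoint_q, base=math.e))  # ln 2, about 0.693
```

Two distributions with no overlap reach the maximum, which is 1 in base 2 and ln 2 in natural units.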

puredata requires both PSI above threshold AND KS p < 0.05 to call a FAIL. This reduces false alarms on small samples, where PSI is inherently noisy.

Drift Score (0–100):

drift_score = min(100, (PSI / threshold) × 50 + KS_stat × 50)
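
The score and the two-condition FAIL rule can be written out as small functions. The PSI threshold of 0.20 is an assumption taken from the interpretation table; `drift_score` and `drift_failed` are hypothetical names.

```python
def drift_score(psi, ks_stat, psi_threshold=0.20):
    """Combine PSI and the KS statistic into a 0-100 score, capped at 100."""
    return min(100.0, (psi / psi_threshold) * 50 + ks_stat * 50)

def drift_failed(psi, ks_p_value, psi_threshold=0.20, alpha=0.05):
    """FAIL only when the shift is both large (PSI) and significant (KS p-value)."""
    return psi > psi_threshold and ks_p_value < alpha

print(drift_score(0.6, 0.9))                    # 100.0 (capped)
print(drift_failed(psi=0.3, ks_p_value=0.20))   # False: large but not significant
print(drift_failed(psi=0.3, ks_p_value=0.01))   # True
```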

6 — Custom Business Rules

contract.add_rule(
    lambda df: None if (df["revenue"] >= 0).all()
               else f"Negative revenue: {df.loc[df['revenue']<0, 'revenue'].tolist()}",
    name="revenue_non_negative"
)

Rules receive the full DataFrame and return None (pass) or an error string (fail). Exceptions are caught and reported as failures.
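
The contract described above (return None to pass, return a string to fail, exceptions count as failures) can be sketched with a small runner. This illustrates the behaviour only and is not puredata's internal code; `run_rules` is hypothetical, and a plain dict of lists stands in for the DataFrame.

```python
def run_rules(rules, data):
    """Run each rule; map its name to None (pass) or an error message (fail)."""
    results = {}
    for name, rule in rules.items():
        try:
            message = rule(data)
        except Exception as exc:  # a crashing rule is reported, not propagated
            message = f"rule raised {type(exc).__name__}: {exc}"
        results[name] = message
    return results

data = {"revenue": [120.0, -5.0, 40.0]}
rules = {
    "revenue_non_negative": lambda d: None if all(v >= 0 for v in d["revenue"])
        else f"Negative revenue: {[v for v in d['revenue'] if v < 0]}",
    "has_rows": lambda d: None if d["revenue"] else "no rows",
    "broken_rule": lambda d: d["missing_column"],
}
for name, outcome in run_rules(rules, data).items():
    print(name, "->", "PASS" if outcome is None else f"FAIL ({outcome})")
```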

7 — Validation Modes

Mode	Behaviour
`"warn"`	Emits Python `UserWarning` on failures, pipeline continues
`"strict"`	Raises `DataCompatibilityError`, pipeline stops immediately
`"silent"`	Logs to report only, no interruption

MendScore — Dataset Health at a Glance

Every dataset gets a single MendScore (0–100) measuring its production readiness:

cells_affected = Σ max(1, len(fix.rows)) for each fix
MendScore = (1 − cells_affected / total_cells) × 100

Score	Meaning	Action
90 – 100	Clean — production ready	Deploy with confidence
75 – 89	Minor issues — easily fixed	Review fix report, spot-check
50 – 74	Significant issues	Review before deploying
25 – 49	Severe data quality problems	Fix upstream pipeline
0 – 24	Critical — do not use	Data source investigation needed
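
As a worked example of the formula and the bands above, the sketch below represents each fix by the list of rows it touched. `mend_score` and `band` are hypothetical helpers; treating each band's lower bound as inclusive for fractional scores is an assumption, and the score is not clamped because the formula as given does not clamp it.

```python
def mend_score(fix_rows, total_cells):
    """Apply the MendScore formula; a fix with no row list still counts as one cell."""
    cells_affected = sum(max(1, len(rows)) for rows in fix_rows)
    return (1 - cells_affected / total_cells) * 100

BANDS = [(90, "Clean"), (75, "Minor issues"), (50, "Significant issues"),
         (25, "Severe data quality problems"), (0, "Critical")]

def band(score):
    """Return the label of the first band whose lower bound the score reaches."""
    for lower_bound, label in BANDS:
        if score >= lower_bound:
            return label
    return BANDS[-1][1]

# 100 rows x 5 columns = 500 cells; three fixes touching 3, 0 (counted as 1), and 2 rows
score = mend_score([[0, 1, 2], [], [4, 5]], total_cells=500)
print(round(score, 1), band(score))  # 98.8 Clean
```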

Installation

# Minimal (core only). The PyPI distribution is named puredatalib; the import name is puredata.
pip install puredatalib

# With polars support
pip install "puredatalib[polars]"

# With MLflow tracking
pip install "puredatalib[mlflow]"

# With Weights & Biases
pip install "puredatalib[wandb]"

# With DVC metrics
pip install "puredatalib[dvc]"

# Everything
pip install "puredatalib[all]"

Requirements: Python ≥ 3.9 — Windows, macOS, Linux

Core dependencies: pandas, numpy, scipy, scikit-learn, rapidfuzz, typer, rich, jinja2, python-dateutil, chardet, joblib

API Reference

`puredata.clean(data, *, config=None, target_col=None)`

Clean dirty data automatically.

import puredata
from puredata import AutoCleanConfig

clean_df, report = puredata.clean(
    dirty_df,
    config=AutoCleanConfig(
        fix_nulls=True,           # impute missing values
        fix_outliers=True,        # detect and handle outliers
        fix_types=True,           # coerce mistyped columns
        fix_duplicates=True,      # remove exact duplicates
        fix_encoding=True,        # repair encoding artefacts
        fix_categories=True,      # normalise inconsistent categories
        fix_dates=True,           # normalise date formats
        fix_whitespace=True,      # strip/collapse whitespace
        fix_units=True,           # normalise mixed units
        outlier_action="clip",    # "clip" | "remove" | "nan"
        outlier_threshold=0.5,    # fraction of methods required to agree
        date_output_format="%Y-%m-%d",
        n_neighbors=5,            # for KNN imputation
        n_jobs=-1,                # parallel jobs (-1 = all cores)
    ),
    target_col="label",           # protect this column from modification
)

# report attributes
report.mend_score               # float 0–100
report.fixes                    # List[Fix] — every change made
report.original_shape           # (rows, cols) before
report.cleaned_shape            # (rows, cols) after
report.duration_seconds         # float

# export
report.to_html("report.html")   # self-contained HTML page
report.to_json("report.json")   # machine-readable JSON
report.to_csv("report.csv")     # one row per fix

Accepts: pd.DataFrame, pl.DataFrame, np.ndarray, str (file path), Path (file path)
Returns: (pd.DataFrame, CleanReport)

`puredata.watch(reference, *, mode="warn", metadata=None)`

Profile reference data and create a DataContract.

from puredata import DataContract  # assumed import path, needed for DataContract.load below

contract = puredata.watch(
    train_df,
    mode="warn",           # "warn" | "strict" | "silent"
    metadata={"version": "1.0", "date": "2026-01-15"},
)

contract.save("contract.json")   # persist to disk
contract = DataContract.load("contract.json")   # reload later

Returns: DataContract

`puredata.check(new_data, contract, *, mode=None)`

Validate new data against a DataContract.

result = puredata.check(prod_df, contract)

# result attributes
result.passed                   # bool — True if no FAIL checks
result.compatibility_score      # float 0–100
result.n_passed                 # int
result.n_warned                 # int
result.n_failed                 # int
result.checks                   # List[CheckResult]

# react to failures
result.raise_if_failed()        # raises DataCompatibilityError

# export
result.to_html("watch.html")
result.to_json("watch.json")
result.summary()                # human-readable string

Returns: WatchReport

`MendPipeline` — full sklearn-compatible pipeline

from puredata import MendPipeline, AutoCleanConfig

pipeline = MendPipeline(
    clean_config=AutoCleanConfig(outlier_action="clip"),
    watch_mode="strict",        # raise on production violations
    target_col="churn",
)

# fit on training data (cleans + profiles)
pipeline.fit(train_df)
pipeline.save_contract("pipeline_contract.json")

# run on new data (cleans + validates in one call)
clean_df, clean_report, watch_report = pipeline.run(new_df)

# sklearn Pipeline integration
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier

ml_pipeline = Pipeline([
    ("puredata", MendPipeline(target_col="churn")),
    ("model",    GradientBoostingClassifier()),
])
ml_pipeline.fit(X_train, y_train)

`puredata.dashboard(df, ...)`

path = puredata.dashboard(
    df,
    clean_report=report,        # embed AutoClean summary
    watch_report=result,        # embed DataWatch results
    open_browser=True,          # open in browser automatically
    output_path="dashboard.html",
)

Command Line Interface

# Clean a dataset and save results
puredata clean data.csv -o clean.csv --report-html report.html --report-json report.json

# Clean with options
puredata clean data.csv --no-outliers --no-duplicates --target label

# Fit a data contract on training data
puredata watch train.csv --contract contract.json

# Validate production data (strict mode = exit code 1 on failure)
puredata check prod.csv contract.json --strict --report-html watch.html

# Open the interactive dashboard
puredata dashboard data.csv --no-browser -o dashboard.html

# Get the MendScore health score
puredata score data.csv

Integration Examples

MLflow

import mlflow, puredata
from puredata.integrations.mlflow import log_clean_report, log_watch_report

with mlflow.start_run():
    clean_df, clean_report = puredata.clean(raw_df)
    log_clean_report(clean_report)
    # Logged: mend_score, n_fixes, fix_counts_by_type, JSON artifact

    contract = puredata.watch(clean_df)
    result = puredata.check(prod_df, contract)
    log_watch_report(result)
    # Logged: compatibility_score, n_passed/warned/failed, per-check status

Weights & Biases

import wandb, puredata
from puredata.integrations.wandb import log_clean_report

wandb.init(project="my-ml-project")
clean_df, report = puredata.clean(raw_df)
log_clean_report(report)

DVC

from puredata.integrations.dvc import log_clean_report

clean_df, report = puredata.clean(raw_df)
log_clean_report(report, "metrics/data_quality.json")
# add to dvc.yaml:  metrics: [metrics/data_quality.json]

Polars

import polars as pl, puredata

df = pl.read_parquet("data.parquet")      # polars DataFrame
clean_df, report = puredata.clean(df)     # returns pandas DataFrame
contract = puredata.watch(df)             # polars input accepted

Any file format

# puredata reads files directly — no manual pd.read_csv needed
clean_df, report = puredata.clean("data.csv")
clean_df, report = puredata.clean("data.xlsx")
clean_df, report = puredata.clean("data.parquet")
clean_df, report = puredata.clean("data.json")

Plugin System

Extend puredata with your own cleaning strategies, validators, and drift detectors:

from puredata.plugins import CleanerPlugin, register_cleaner
from puredata.core.report import Fix, FixAction

@register_cleaner
class EmailNormalizer(CleanerPlugin):
    name = "email_normalizer"
    description = "Normalize email addresses to lowercase"
    version = "1.0.0"

    def clean(self, df, report):
        for col in df.columns:
            if "email" in col.lower():
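                original = df[col].copy()
                df[col] = df[col].str.lower().str.strip()
                # compare only non-null cells: NaN != NaN would otherwise flag every missing value
                changed = df.index[(df[col] != original) & original.notna()].tolist()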
                if changed:
                    report.add_fix(Fix(
                        column=col, action=FixAction.ENCODING,
                        rows=changed, details=f"Lowercased {len(changed)} emails"
                    ))
        return df, report

Distribute your plugin as a package with entry point:

[project.entry-points."puredata.plugins"]
my_plugin = "my_package.plugin:register"

Performance

Dataset size	Operation	Time
10 K rows × 20 cols	Full `puredata.clean()`	~0.4s
100 K rows × 20 cols	Full `puredata.clean()`	~2.1s
1 M rows × 20 cols	Full `puredata.clean()`	~18s
10 M rows × 5 cols	Full `puredata.clean()`	~95s
Any size	`puredata.check()`	< 0.5s
Cold import	`import puredata`	< 100ms

Parallelism: n_jobs=-1 in AutoCleanConfig uses all CPU cores for outlier detection.

Works With Everything

puredata is designed to work alongside every other Python data and ML library with zero glue code:

Data libraries: pandas, polars, numpy, dask
ML frameworks: scikit-learn, xgboost, lightgbm, catboost, torch, tensorflow, keras
Experiment tracking: MLflow, Weights & Biases, DVC, Neptune
Data platforms: Great Expectations, Feast, Tecton
Deployment: FastAPI, Flask, Streamlit, Gradio, BentoML

Design Philosophy

Two problems, solved perfectly. Not ten features. Not a framework. Two problems that every Python developer faces every day, solved so completely that every other approach becomes unnecessary.
One line is the right answer. If a user needs more than one line to clean data or more than two lines to validate it, the API has failed them.
Zero configuration is the default. puredata analyzes your data and makes intelligent decisions automatically. Configuration exists to override defaults, not to get started.
Fixes are reversible and auditable. Every change is logged with the exact rows, original values, and reasons. You can export the full repair log and replay or undo any individual fix.
Trust no single method. Outlier detection uses four methods and requires agreement. Drift detection uses three statistics and requires both a magnitude threshold and statistical significance. Defensive ensemble thinking is built into the core.

Repository Structure

puredata/
├── puredata/
│   ├── __init__.py          public API surface
│   ├── api.py               unified top-level entry points
│   ├── pipeline.py          MendPipeline (AutoClean + DataWatch)
│   ├── dashboard.py         live HTML dashboard
│   ├── cli.py               Typer CLI application
│   ├── core/
│   │   ├── clean.py         AutoClean engine (2200+ lines)
│   │   ├── watch.py         DataWatch engine (700+ lines)
│   │   └── report.py        HTML · JSON · CSV report generation
│   ├── plugins/
│   │   └── base.py          plugin base classes and registry
│   └── integrations/
│       ├── mlflow.py        MLflow metric logging
│       ├── wandb.py         W&B metric logging
│       └── dvc.py           DVC JSON metrics
├── tests/                   123 tests · 93.6% coverage
├── docs/                    MkDocs documentation
│   └── index.html           GitHub Pages website
├── .github/workflows/
│   ├── ci.yml               test matrix (3 OS × 4 Python versions)
│   └── publish.yml          PyPI trusted publishing + GitHub Pages
└── pyproject.toml           PEP 517 build with Hatch

Contributing

Read CONTRIBUTING.md first.

git clone https://github.com/vignesh2027/puredata.py.git
cd puredata.py
pip install -e ".[dev]"
pytest --cov

Guidelines:

All new code must have tests; coverage must not drop below 90%
No placeholder or stub implementations — every function must work in production
Type hints on all public functions, NumPy docstrings on all public classes
PRs must pass CI on all three platforms and all four Python versions

Roadmap

Async support for streaming pipelines
Time series-specific drift detection (CUSUM, ADWIN)
LLM-powered column semantic labeling
Native DuckDB and Arrow backend for 1B+ row datasets
Web UI for contract management and drift monitoring
Automatic feature type inference from column names
Changelog diffing: compare two contracts to track schema evolution

License

MIT — free for commercial and personal use.

puredata — because cleaning data should take one line, not one week.

⭐ Star on GitHub · 🌐 Website · 📦 PyPI

Release history

0.2.0

May 17, 2026

This version

0.1.0

May 15, 2026

Download files

Source Distribution

puredatalib-0.1.0.tar.gz (62.5 kB view details)

Uploaded May 15, 2026 Source

Built Distribution

puredatalib-0.1.0-py3-none-any.whl (50.6 kB view details)

Uploaded May 15, 2026 Python 3

File details

Details for the file puredatalib-0.1.0.tar.gz.

File metadata

Download URL: puredatalib-0.1.0.tar.gz
Upload date: May 15, 2026
Size: 62.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for puredatalib-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`44552c611c1e4cedeb8dab2bd252a043e3e96f7f8a995d388712ef8b5798de0f`
MD5	`f26c1ceef986f7115a293ff7c3051b20`
BLAKE2b-256	`5a554162a7e22ff5027d394a8d9cbb92dd9127d45a338ffc4360589edab973d6`

File details

Details for the file puredatalib-0.1.0-py3-none-any.whl.

File metadata

Download URL: puredatalib-0.1.0-py3-none-any.whl
Upload date: May 15, 2026
Size: 50.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for puredatalib-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5ddb32957d63db56d6acf8eedb070de9cd8263643302f35be7e1f8a81fdacd89`
MD5	`879c90ca422ee85f541219107e73ac13`
BLAKE2b-256	`1e9009a0c4e2712c97713e7bacf0526550caa7307c154a592afa9be03e285f5d`
