Automatic data cleaning and standardization for ML pipelines

DataCleaner

Automated data cleaning & standardization pipeline for ML projects.

DataCleaner takes raw CSV/Excel data and transforms it into production-ready ML features: it handles nulls, encodes categories, selects a scaler for each column, and packages everything into a reusable inference pipeline.

Features

Feature	Description
Input	CSV & Excel files or in-memory DataFrames
Column dropping	Pass a list of unwanted columns (IDs, timestamps, etc.)
Target column	Designate any column as the prediction target
Auto-detect problem type	Automatically detects classification vs regression from target column
Auto-drop useless columns	Removes zero-variance, high-cardinality, and duplicated columns
Null handling	Dynamic threshold: drop rows if nulls are few; if nulls are abundant, KNN-impute numeric columns and fill categorical columns with the mode
Outlier handling	IQR-based detection with clip or remove options
Encoding	Auto-detects binary vs multi-category columns; LabelEncoder for binary, OneHotEncoder for categorical
Auto-scaler	Tests each numeric column for normality & outliers, then picks the optimal scaler (Standard, Robust, MinMax, MaxAbs)
Feature engineering	Generates polynomial features (interactions, squares) for numeric columns
Imbalance handling	SMOTE oversampling for imbalanced classification datasets
Train/Val/Test split	Configurable split ratios
Pipeline export	Save & reload the full transformation pipeline for inference on new data
Summary	Quick overview of shape, dtypes, null counts & percentages
Date feature extraction	Expands datetime columns into year/month/day/dayofweek/weekend
Missing indicators	Adds `{col}_missing` binary columns for imputed nulls
Feature selection	Removes weak features via mutual information
Custom encoders/scalers	Pass your own sklearn encoders and scalers to `prepare()`
Statistical test suite	Integrated A/B testing, t-tests, z-tests, chi-square, ANOVA
Data profiling	Self-contained HTML report with distributions, correlations, quality warnings
Schema validation	Validate column existence and expected dtypes
Duplicate removal	Drop duplicate rows

Quick Start

Install

pip install clean-data-ml

With optional extras:

pip install clean-data-ml[plot]       # visualization (matplotlib, seaborn)
pip install clean-data-ml[imbalance]   # SMOTE oversampling support
pip install clean-data-ml[all]         # all optional features

For a development (editable) install from source:

git clone https://github.com/MohammadvHossein/clean-data-ml.git
cd clean-data-ml
pip install -e .
pip install -e .[all]                # including all extras

Minimal example

from clean_data_ml import DataCleaner
from sklearn.svm import SVC

dc = DataCleaner()
dc.load("data.csv")
dc.set_target("purchased")
dc.drop_columns(["ID", "timestamp"])

X_train, X_test, y_train, y_test = dc.prepare(test_size=0.2)

model = SVC()
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.2f}")

Full Example

1. Train & save pipeline

from clean_data_ml import DataCleaner
import pandas as pd
from sklearn.svm import SVC
import joblib

# Sample data
data = pd.DataFrame({
    "ID": range(100),
    "age": [25, 30, 35, None, 40, 45, 50, 55, 60, 65] * 10,
    "salary": [50000, 60000, None, 80000, 90000, 100000, 110000, 120000, None, 140000] * 10,
    "city": ["Tehran", "Shiraz", "Tehran", "Isfahan", None, "Tehran", "Shiraz", "Isfahan", "Tehran", "Shiraz"] * 10,
    "gender": ["M", "F", "M", "F", "M", "F", "M", "F", "M", "F"] * 10,
    "purchased": [1, 0, 1, 0, 1, 1, 0, 1, 0, 1] * 10,
})

dc = DataCleaner()
dc.load_df(data)
dc.set_target("purchased")
dc.drop_columns(["ID"])

X_train, X_test, y_train, y_test = dc.prepare(test_size=0.2)

model = SVC(probability=True)
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.2f}")

# Save model & pipeline for later inference
joblib.dump(model, "model.pkl")
dc.save_pipeline("my_pipeline.pkl")

2. Inference on new data

from clean_data_ml import DataCleaner
import pandas as pd
import joblib  # scikit-learn must still be installed to unpickle the saved SVC

dc = DataCleaner.load_pipeline("my_pipeline.pkl")
model = joblib.load("model.pkl")

new_data = pd.DataFrame({
    "age": [28, 42, 35],
    "salary": [65000, 95000, 78000],
    "city": ["Tehran", "Isfahan", "Shiraz"],
    "gender": ["F", "M", "F"],
})

processed = dc.transform(new_data)
predictions = model.predict(processed)
probabilities = model.predict_proba(processed)

for i, (pred, prob) in enumerate(zip(predictions, probabilities)):
    status = "Purchased" if pred == 1 else "Not Purchased"
    print(f"Customer {i+1}: {status} (confidence: {max(prob):.2%})")

API Reference

`DataCleaner(random_state=42)`

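Main class. Loading and configuration methods such as load(), load_df(), set_target(), and drop_columns() return self for chaining; methods such as prepare(), get_pipeline(), and summary() return their results instead.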

Method	Description
`.load(filepath)`	Load CSV or Excel file
`.load_df(df)`	Load from an existing pandas DataFrame
`.set_target(col)`	Set the target column
`.drop_columns(cols)`	Drop unwanted columns (IDs, etc.); append-only, safe to call multiple times
`.prepare(...)`	Execute the full pipeline (see parameters below)
`.get_pipeline()`	Returns the fitted `CleanPipeline` for transforming new data
`.save_pipeline(path)`	Save pipeline to disk
`.load_pipeline(path)`	Load a saved pipeline; called on the class (`DataCleaner.load_pipeline(path)`), returns a `DataCleaner` instance wrapping the pipeline
`.transform(df)`	Apply all cleaning steps to new raw data (same as `get_pipeline().transform(df)`)
`.export_cleaned(filepath, include_target=False)`	Export the fully cleaned dataset (features only, or with target if `True`) to CSV or Excel (.xlsx)
`.summary()`	Dict with shape, columns, dtypes, null counts
`.profile_report(filepath)`	Generate a self-contained HTML data profiling report with stats, distributions, and quality warnings
`.drop_duplicates(subset, keep)`	Remove duplicate rows
`.validate_schema(expected_schema, required_cols)`	Validate column existence and expected dtypes
`.auto_fix_dtypes()`	Auto-convert object columns to datetime or numeric where possible

`prepare()` Parameters

Parameter	Default	Description
`test_size`	`0.2`	Fraction of data for test set
`val_size`	`None`	If set, also creates a validation set
`handle_nulls`	`True`	Auto-detect and handle missing values
`auto_scale`	`True`	Auto-select and apply optimal scaler per column
`auto_encode`	`True`	Auto-encode binary (Label) and categorical (OneHot) columns
`null_drop_ratio`	`None`	Override dynamic null threshold
`auto_drop_useless`	`True`	Drop zero-variance and high-cardinality columns
`handle_outliers`	`None`	`"clip"` to cap outliers, `"remove"` to drop, `None` to skip
`feature_engineering`	`False`	Add polynomial features (interactions, squares)
`handle_imbalance`	`False`	Apply SMOTE oversampling on imbalanced classification data
`n_jobs`	`1`	Number of parallel jobs for scaler selection and outlier handling. `-1` uses all cores
`extract_date_features`	`False`	Expand datetime columns into year, month, day, dayofweek, weekend
`add_missing_indicators`	`False`	Add `{col}_missing` binary columns for imputed nulls
`feature_selection`	`None`	`"auto"` (median MI threshold) or a float threshold; removes features below threshold
`custom_encoders`	`None`	Dict of `{col: encoder_instance}` to override auto-encoding
`custom_scalers`	`None`	Dict of `{col: scaler_instance}` to override auto-scaling

`CleanPipeline` (internal)

Holds all fitted transformers. Can be used directly, but prefer DataCleaner for full functionality.

Method	Description
`.transform(df)`	Apply all cleaning steps to a raw DataFrame
`.save(path)`	Pickle to disk
`.load(path)`	Static method -- load from disk

Pipeline Attributes (accessible via `dc.pipeline.*`)

Attribute	Description
`.problem_type`	`"classification"` or `"regression"` (auto-detected)
`.dropped_useless_cols`	Columns auto-dropped by `auto_drop_useless`
`.outlier_bounds`	IQR bounds used for outlier handling (applied in transform)
`.scalers`	Dict of column -> fitted scaler
`.onehot_cols`	One-hot encoded column names
`.label_encoders`	Binary column -> mapping dict
`.feature_cols`	Ordered list of all feature columns after transformation
`.poly_features`	Fitted `PolynomialFeatures` transformer (if feature_engineering was enabled)
`.custom_encoders`	Dict of user-provided encoders
`.custom_scalers`	Dict of user-provided scalers
`.cat_impute_values`	Dict of categorical column -> mode used for imputation
`.feature_importances_`	Dict of column -> mutual information score (if feature_selection was used)

How Nulls Are Handled

The threshold for "drop vs impute" is dynamic -- it adapts to dataset size:

Dataset size	Drop threshold	Behavior
100 rows	25%	Very conservative -- prefers KNN imputation
1,000 rows	5%	Balanced approach
10,000+ rows	1%	More aggressive dropping (plentiful data)

Numeric columns with many nulls -- KNNImputer(n_neighbors=5)
Categorical columns with many nulls -- filled with mode
Any column with few nulls -- those rows are dropped

You can override this with prepare(null_drop_ratio=0.1).
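
The drop-versus-impute decision described above can be sketched with pandas and scikit-learn. This is only an illustration, not the library's code: handle_nulls and drop_ratio are hypothetical names, the threshold is passed in explicitly instead of being derived from dataset size, and running KNN over all numeric columns is an assumption.

```python
import pandas as pd
from sklearn.impute import KNNImputer


def handle_nulls(df: pd.DataFrame, drop_ratio: float) -> pd.DataFrame:
    """Hypothetical helper: drop rows for sparse nulls, impute heavy-null columns."""
    df = df.copy()
    null_ratio = df.isna().mean()
    light = [c for c in df.columns if 0 < null_ratio[c] <= drop_ratio]
    heavy = [c for c in df.columns if null_ratio[c] > drop_ratio]

    # Few nulls: drop the affected rows.
    if light:
        df = df.dropna(subset=light).reset_index(drop=True)

    # Many nulls, categorical: fill with the mode.
    for col in heavy:
        if not pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].mode().iloc[0])

    # Many nulls, numeric: KNN imputation (assumed here to use all numeric columns).
    numeric_heavy = [c for c in heavy if pd.api.types.is_numeric_dtype(df[c])]
    if numeric_heavy:
        numeric_cols = list(df.select_dtypes("number").columns)
        imputed = KNNImputer(n_neighbors=5).fit_transform(df[numeric_cols])
        df[numeric_cols] = pd.DataFrame(imputed, columns=numeric_cols, index=df.index)
    return df


raw = pd.DataFrame({
    "age": [25, 30, None, 40, None, 50, 55, None, 65, 70],          # 30% null
    "salary": [50, 60, 70, 80, 90, 100, 110, 120, 130, None],       # 10% null
    "city": ["A", None, "A", "B", None, "A", None, "B", "A", "A"],  # 30% null
})
clean = handle_nulls(raw, drop_ratio=0.2)
print(clean.shape, int(clean.isna().sum().sum()))
```

With drop_ratio=0.2, the single row with a missing salary is dropped, while age and city (30% null each) are imputed.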

Statistical Test Suite

The clean_data_ml.stats module provides a comprehensive set of statistical tests for data analysis:

Standalone Functions

Function	Description
`normality_test(series, method)`	Shapiro-Wilk, D'Agostino, or Anderson-Darling normality test
`correlation_test(x, y, method)`	Pearson, Spearman, or Kendall correlation
`ks_test(a, b)`	Kolmogorov-Smirnov (two-sample distribution test)
`chi_square_test(a, b)`	Chi-square test of independence
`variance_test(a, b, method)`	Levene, Bartlett, or Fligner test for equal variance
`anova_one_way(*groups)`	One-way ANOVA
`z_test_one_sample(series, pop_mean)`	One-sample z-test for mean
`z_test_two_sample(a, b)`	Two-sample z-test for mean
`z_test_proportion(successes, n, p)`	One-sample proportion z-test
`z_test_two_proportion(s1, n1, s2, n2)`	Two-sample proportion z-test
`t_test_one_sample(series, pop_mean)`	One-sample t-test
`t_test_independent(a, b)`	Independent two-sample t-test
`t_test_paired(a, b)`	Paired t-test
`ab_test_mean(control, treatment)`	A/B test on means (lift, CI, significance)
`ab_test_proportion(control, treatment)`	A/B test on proportions
`mutual_information(X, y)`	Mutual Information between features and target
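
For orientation, the statistic behind a two-sample proportion z-test can be written with the standard library alone. This is a textbook sketch, not the implementation in clean_data_ml.stats: the function name two_proportion_z_test and the keys of the returned dict are assumptions; only the argument order mirrors z_test_two_proportion(s1, n1, s2, n2).

```python
import math


def two_proportion_z_test(s1: int, n1: int, s2: int, n2: int) -> dict:
    """Two-sided z-test for the difference of two proportions (pooled standard error)."""
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    # Relative lift of the second group over the first.
    lift = (p2 - p1) / p1
    return {"p1": p1, "p2": p2, "lift": lift, "z": z, "p_value": p_value}


# Control: 100 conversions out of 1000; treatment: 130 out of 1000.
result = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {result['z']:.2f}, p = {result['p_value']:.3f}, lift = {result['lift']:.0%}")
```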

StatisticalTestSuite (integration with DataCleaner)

from clean_data_ml import DataCleaner, stats

# `data` is a pandas DataFrame, e.g. the sample frame from the Full Example.
# Column names such as "score", "converted", and "group" are placeholders;
# replace them with columns that exist in your data.
dc = DataCleaner()
dc.load_df(data).set_target("purchased")

suite = stats.StatisticalTestSuite(dc)
suite.test_normality()
suite.test_correlations(target_col="purchased")
suite.test_chi_square("gender", "city")
suite.test_anova("age", "city")
suite.test_z_one_sample("age", pop_mean=35)
suite.test_t_independent("age", "score")
suite.test_ab_by_group("converted", "group", "A", "B", metric_type="proportion")
print(suite.summary())

Visualization Module

The clean_data_ml.plotting module (requires the plot extra: pip install clean-data-ml[plot], or pip install -e .[plot] from a source checkout):

Function	Description
`plot_null_report(dc)`	Bar charts of null counts and percentages
`plot_distributions(dc, cols)`	Histograms + boxplots for numeric columns
`plot_correlation(dc)`	Correlation heatmap
`plot_before_after(dc)`	Compare raw vs cleaned distributions

Project Structure

clean_data_ml/
  __init__.py       Package exports
  cleaner.py        DataCleaner + CleanPipeline classes
  auto_scaler.py    Automatic scaler selection logic
  stats.py          Statistical test suite (t-test, z-test, AB test, etc.)
  plotting.py       Optional visualization module
setup.py                  Package metadata
pyproject.toml            Build configuration
MANIFEST.in               sdist inclusion rules
LICENSE                   MIT license
example_train.py          Training example
example_inference.py      Inference example
.pre-commit-config.yaml   Linting hooks (black, isort, flake8)
.gitignore                Ignored files
README.md                 This file
tests/
    conftest.py           Shared test fixtures
    test_cleaner.py       DataCleaner / CleanPipeline tests
    test_auto_scaler.py   Scaler selection tests
    test_stats.py         Statistical test suite tests
    test_plotting.py      Visualization module tests

Additional Features

Auto-Drop Useless Columns

The library automatically detects and removes:

Zero-variance columns -- columns with a single unique value
High-cardinality columns -- non-numeric columns where unique values exceed 90% of rows (e.g., free-text fields)

Disable this with prepare(auto_drop_useless=False).
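
The two rules above can be approximated in a few lines of pandas. This is a sketch under stated assumptions: find_useless_columns and cardinality_ratio are hypothetical names, and the library's exact checks may differ.

```python
import pandas as pd


def find_useless_columns(df: pd.DataFrame, cardinality_ratio: float = 0.9) -> list:
    """Hypothetical helper: list zero-variance and high-cardinality columns."""
    useless = []
    for col in df.columns:
        n_unique = df[col].nunique(dropna=True)
        if n_unique <= 1:
            # Zero variance: a single unique value carries no information.
            useless.append(col)
        elif (not pd.api.types.is_numeric_dtype(df[col])
              and n_unique > cardinality_ratio * len(df)):
            # Non-numeric column where almost every row is unique (e.g. free text).
            useless.append(col)
    return useless


df = pd.DataFrame({
    "const": [1] * 10,
    "note": [f"text {i}" for i in range(10)],
    "city": ["A", "B"] * 5,
    "age": list(range(10)),
})
to_drop = find_useless_columns(df)
print(to_drop)
```

Note that age is unique in every row but is kept, because the high-cardinality rule applies to non-numeric columns only.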

Outlier Handling (IQR)

After null handling, each numeric column is checked using the Interquartile Range method:

Lower bound: Q1 - 1.5 x IQR
Upper bound: Q3 + 1.5 x IQR

Two modes:

"clip" -- caps values at the bounds (preserves row count)
"remove" -- drops rows with outliers

Enable with prepare(handle_outliers="clip") or prepare(handle_outliers="remove").
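
The bounds and the two modes can be reproduced with plain pandas. A minimal sketch, assuming linear-interpolated quartiles (the pandas default); iqr_bounds is a hypothetical helper, not part of the library.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower, upper) bounds: Q1 - k * IQR and Q3 + k * IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


values = pd.Series([10, 12, 11, 13, 12, 11, 10, 95])
low, high = iqr_bounds(values)

# "clip": cap values at the bounds, row count unchanged.
clipped = values.clip(lower=low, upper=high)

# "remove": keep only rows inside the bounds.
kept = values[values.between(low, high)]

print(low, high, len(clipped), len(kept))
```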

Feature Engineering

Generates polynomial features (degree 2) for numeric columns with more than 2 unique values. Creates interaction terms and squared features automatically.

Enable with prepare(feature_engineering=True).

Date Feature Extraction

With prepare(extract_date_features=True), datetime columns are automatically expanded into numeric components:

{col}_year, {col}_month, {col}_day, {col}_dayofweek, {col}_weekend
The original datetime column is dropped afterward.

This happens early in the pipeline so the derived numeric columns benefit from all subsequent steps (encoding, scaling, feature engineering, etc.).
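
The expansion can be pictured with the pandas .dt accessor. In this sketch expand_datetime is a hypothetical helper, and treating Saturday and Sunday as the weekend is an assumption; the library may define the weekend differently.

```python
import pandas as pd


def expand_datetime(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Hypothetical helper: replace a datetime column with numeric components."""
    out = df.copy()
    dt = out[col].dt
    out[f"{col}_year"] = dt.year
    out[f"{col}_month"] = dt.month
    out[f"{col}_day"] = dt.day
    out[f"{col}_dayofweek"] = dt.dayofweek  # Monday=0 ... Sunday=6
    out[f"{col}_weekend"] = (dt.dayofweek >= 5).astype(int)  # assumed: Sat/Sun
    return out.drop(columns=[col])


orders = pd.DataFrame({
    "signup": pd.to_datetime(["2024-01-06", "2024-03-11"]),  # a Saturday, a Monday
    "amount": [10, 20],
})
features = expand_datetime(orders, "signup")
print(features)
```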

Missing Indicators

With prepare(add_missing_indicators=True), every column that receives KNN imputation (null ratio above the threshold) gets an additional binary column {col}_missing that flags which rows originally contained nulls. This lets the model learn patterns from the missingness itself.
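
The key point is that the flag must be computed before the nulls are filled. A minimal pandas sketch, where add_missing_indicators is a hypothetical helper and a median fill stands in for the KNN step:

```python
import pandas as pd


def add_missing_indicators(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Hypothetical helper: add a 0/1 `{col}_missing` flag for each listed column."""
    out = df.copy()
    for col in cols:
        out[f"{col}_missing"] = out[col].isna().astype(int)
    return out


raw = pd.DataFrame({"age": [25, None, 35, None], "salary": [50, 60, 70, 80]})
flagged = add_missing_indicators(raw, ["age"])

# Stand-in for imputation: the indicator still records where the nulls were.
flagged["age"] = flagged["age"].fillna(flagged["age"].median())
print(flagged)
```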

Feature Selection

Controlled by prepare(feature_selection="auto") or prepare(feature_selection=0.01).

After all transformations, Mutual Information is computed between each feature and the target. Features with MI below the threshold are dropped:

"auto" -- drops features below the median MI score
float (e.g., 0.01) -- drops features below that absolute threshold

Set to None (default) to skip feature selection entirely.
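
Both threshold modes can be sketched with scikit-learn's mutual information estimator. This is an illustration for a classification target (mutual_info_regression would be the counterpart for regression); select_by_mutual_info is a hypothetical helper and the synthetic data is made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif


def select_by_mutual_info(X: pd.DataFrame, y, threshold="auto", random_state=42):
    """Hypothetical helper: keep features whose MI score is not below the cutoff."""
    scores = pd.Series(
        mutual_info_classif(X, y, random_state=random_state), index=X.columns
    )
    # "auto" uses the median score; otherwise the value is an absolute cutoff.
    cutoff = scores.median() if threshold == "auto" else float(threshold)
    kept = scores[scores >= cutoff].index.tolist()
    return kept, scores


rng = np.random.default_rng(0)
y = pd.Series(rng.integers(0, 2, size=300))
X = pd.DataFrame({
    "signal": y + rng.normal(0, 0.1, size=300),  # strongly tied to the target
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
    "noise_c": rng.normal(size=300),
})

kept_auto, scores = select_by_mutual_info(X, y, threshold="auto")
kept_abs, _ = select_by_mutual_info(X, y, threshold=0.3)
print(scores.round(3).to_dict(), kept_auto, kept_abs)
```

MI estimates for pure noise are often exactly zero, so the median cutoff can be zero as well; an absolute threshold is the more predictable choice when many features are uninformative.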

Custom Encoders & Custom Scalers

Pass fitted or unfitted sklearn-compatible transformers to override auto-detection:

from sklearn.preprocessing import OrdinalEncoder, KBinsDiscretizer

dc.prepare(
    custom_encoders={"city": OrdinalEncoder()},
    custom_scalers={"salary": KBinsDiscretizer(n_bins=5, encode="ordinal")},
)

These are stored in dc.pipeline.custom_encoders / dc.pipeline.custom_scalers and applied during transform() as well.

Imbalanced Data (SMOTE)

When handle_imbalance=True and the problem is classification, SMOTE oversampling is applied to the training set after the train/test split. Requires imbalanced-learn:

pip install imbalanced-learn
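
For intuition, the core idea of SMOTE (interpolating between a minority sample and one of its nearest minority neighbors) fits in a few lines of NumPy. This is a simplified sketch, not what the library runs: the package depends on imbalanced-learn for the real algorithm, and smote_sketch and its parameters are hypothetical. As in the text above, only the training split is oversampled.

```python
import numpy as np


def smote_sketch(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 42) -> np.ndarray:
    """Simplified SMOTE: create n_new points between minority samples and neighbors."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbors per sample

    base = rng.integers(0, len(X_min), size=n_new)            # random minority samples
    picked = neighbors[base, rng.integers(0, k, size=n_new)]  # one random neighbor each
    gap = rng.random((n_new, 1))                              # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Training split only: 3 minority samples (class 1) vs 7 majority samples (class 0).
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [5.0, 5.0], [5.2, 4.9],
                    [4.8, 5.1], [5.1, 5.3], [4.9, 4.7], [5.3, 5.0], [5.0, 4.8]])
y_train = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])

minority = X_train[y_train == 1]
synthetic = smote_sketch(minority, n_new=4, k=2)  # k must be < number of minority samples
X_balanced = np.vstack([X_train, synthetic])
y_balanced = np.concatenate([y_train, np.ones(len(synthetic), dtype=int)])
print(X_balanced.shape, np.bincount(y_balanced))
```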

How Scaler Selection Works

For each numeric column, the library tests:

Normality -- Shapiro-Wilk test (p > 0.05 => treated as normal)
Outliers -- IQR method (1.5x IQR rule)
Bounds -- min >= 0 and max <= 1
Sparsity -- more than 40% zeros

Then assigns the optimal scaler:

Condition	Scaler
Normal + no outliers	`StandardScaler`
Has outliers	`RobustScaler`
Bounded [0, 1]	`MinMaxScaler`
Sparse	`MaxAbsScaler`
Default	`StandardScaler`

Note: Tree-based models (Random Forest, XGBoost, LightGBM, etc.) do not require scaling or normalization -- they split on thresholds and are invariant to monotonic transformations. Scaling is only needed for distance-based or gradient-based models (SVM, KNN, Neural Networks, Logistic Regression, etc.). You can skip auto-scaling with prepare(auto_scale=False) if using a tree-based model.
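
The four checks and the mapping table can be sketched with NumPy and SciPy. This README does not state which rule wins when several conditions hold, so the priority used here (sparse, then bounded, then outliers, then the default) is an assumption, and choose_scaler is a hypothetical helper that returns the scaler's name along with the check results.

```python
import numpy as np
from scipy import stats


def choose_scaler(values, alpha: float = 0.05) -> tuple:
    """Hypothetical helper: return (scaler_name, checks) for one numeric column."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]

    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    _, p_value = stats.shapiro(x)  # Shapiro-Wilk normality test

    checks = {
        "normal": bool(p_value > alpha),
        "outliers": bool(((x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)).any()),
        "bounded": bool(x.min() >= 0 and x.max() <= 1),
        "sparse": bool(np.mean(x == 0) > 0.40),
    }

    # Assumed priority order; the library's actual order may differ.
    if checks["sparse"]:
        name = "MaxAbsScaler"
    elif checks["bounded"]:
        name = "MinMaxScaler"
    elif checks["outliers"]:
        name = "RobustScaler"
    else:
        name = "StandardScaler"  # covers both "normal + no outliers" and the default
    return name, checks


columns = {
    "sparse": [0, 0, 0, 0, 0, 0, 3, 5, 8, 2],
    "bounded": np.linspace(0.05, 0.95, 20),
    "outlier": [10, 12, 11, 13, 12, 11, 10, 95],
    "even": np.linspace(10, 20, 30),
}
chosen = {name: choose_scaler(col)[0] for name, col in columns.items()}
print(chosen)
```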

Requirements

Python >= 3.8
pandas >= 1.3
numpy >= 1.21
scikit-learn >= 1.0
scipy >= 1.7
joblib
openpyxl (for Excel support)

License

MIT

This version

1.2.0

Jun 6, 2026
