
Automated ML data cleaning and preprocessing pipeline


AutoCleanML

Stop wasting hours cleaning data. Let AutoCleanML do it for you.

AutoCleanML automatically cleans and prepares your messy data for machine learning. Just give it your data and a target column - it handles the rest.


Why Use AutoCleanML?

Before AutoCleanML:

  • Spend hours handling missing values
  • Manually encode categorical variables
  • Figure out which scaling to use
  • Deal with imbalanced datasets
  • Wonder if you're doing it right

With AutoCleanML:

from autocleanml import AutoCleanML

cleaner = AutoCleanML(target="target_col")
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(df)

# Done! Your data is ready for any model

What It Does

  1. Fixes data types - Converts strings to numbers, handles dates
  2. Handles missing values - Smartly imputes using KNN, median, or mode
  3. Removes outliers - Detects and handles outliers intelligently
  4. Transforms skewed features - Applies log/power transforms for highly skewed data
  5. Engineers features - Creates useful features from text, dates, numbers
  6. Encodes categories - Handles categorical variables without exploding features
  7. Scales features - Chooses right scaling based on your model type
  8. Handles imbalance - Detects and suggests fixes for imbalanced classes
  9. Removes useless features - Gets rid of constants and highly correlated features

And it tells you WHY it made each decision.


Installation

pip install autocleanml

Or, for an editable install from a source checkout:

pip install -e .

Quick Start

Example 1: Predicting House Prices (Regression)

import pandas as pd
from autocleanml import AutoCleanML
from sklearn.ensemble import RandomForestRegressor

# Load your messy data
df = pd.read_csv("house_prices.csv")

# Method 1: Pass your model (AutoCleanML auto-detects optimal preprocessing)
model = RandomForestRegressor()
cleaner = AutoCleanML(target="price", model=model)
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(df)
#  Auto-detected: tree → skips scaling (trees don't need it!)

# Train
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Method 2: Or just specify model type
cleaner = AutoCleanML(target="price", model_type='tree')
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(df)

Example 2: Customer Churn (Classification)

# For classification, AutoCleanML detects imbalanced classes
cleaner = AutoCleanML(target="churned")
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(df)

# It tells you if your classes are imbalanced and what to do
print(report['imbalance'])
# Shows: Class weights to use, recommended strategy, reasoning

# Train with recommended class weights
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(class_weight=report['imbalance']['class_weights'])
model.fit(X_train, y_train)

Key Features

Smart Scaling Based on Model Type

AutoCleanML has TWO ways to be model-aware:

Method 1: Pass Your Model (Automatic Detection) ⭐ RECOMMENDED

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

# AutoCleanML detects the model type automatically!
model = RandomForestRegressor()
cleaner = AutoCleanML(target="price", model=model)
#  Auto-detected: tree → No scaling needed

model = LinearRegression()
cleaner = AutoCleanML(target="price", model=model)
#  Auto-detected: linear → StandardScaler + log transforms

model = MLPRegressor()
cleaner = AutoCleanML(target="price", model=model)
#  Auto-detected: nn → MinMaxScaler [0,1]

Model types supported by auto-detection:

  • 🌳 Tree-based: RandomForest, XGBoost, LightGBM, CatBoost, DecisionTree
  • 📊 Linear: LinearRegression, LogisticRegression, Ridge, Lasso, ElasticNet, SGD
  • 🧠 Neural Network: MLPClassifier, MLPRegressor, Keras, PyTorch models
  • 📍 Distance-based: KNN, SVM

Method 2: Specify Model Type Manually

# If you don't have the model object yet
cleaner = AutoCleanML(target="price", model_type='linear')
cleaner = AutoCleanML(target="price", model_type='tree')
cleaner = AutoCleanML(target="price", model_type='nn')
cleaner = AutoCleanML(target="price", model_type='auto')  # Let it guess

Automatic Transformations:

  • Highly skewed features (skewness > 1) → Log transform or Yeo-Johnson power transform
  • Features with outliers → RobustScaler (uses median, less sensitive)
  • Normal distribution → StandardScaler (zero mean, unit variance)
  • Neural networks → MinMaxScaler (0-1 bounded for activation functions)
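As an illustration of the skew rule, here is a minimal sketch (not AutoCleanML's actual code) of choosing between a log and a Yeo-Johnson transform at the skewness > 1 threshold, using scipy and scikit-learn:

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

def transform_skewed(col, threshold=1.0):
    """Pick a transform for a 1-D array based on its skewness."""
    s = skew(col)
    if abs(s) <= threshold:
        return col, "none"
    if s > threshold and (col > 0).all():
        return np.log1p(col), "log"     # right-skewed, strictly positive
    # Left skew or non-positive values: Yeo-Johnson handles both
    pt = PowerTransformer(method="yeo-johnson")
    return pt.fit_transform(col.reshape(-1, 1)).ravel(), "yeo-johnson"

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=1.0, size=1000)   # heavily right-skewed
transformed, method = transform_skewed(income)
# After the log transform, skewness drops close to 0
```

The threshold and function name here are illustrative; AutoCleanML applies the same kind of rule internally and records which transform it chose in the report.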

Example output:

Transformed 3 skewed features:
  - income: log transform (skewness was 2.34)
  - sales: yeo-johnson (skewness was -1.89)
  - amount: log transform (skewness was 3.12)
  
Scaling: StandardScaler
Reason: Linear model with clean data after transformation

Imbalanced Dataset Handling

For classification, it automatically:

  • Detects class imbalance
  • Recommends best strategy (class weights, SMOTE, etc.)
  • Provides ready-to-use class weights
  • Explains why it recommends that strategy

cleaner = AutoCleanML(target="fraud")  # Highly imbalanced dataset
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(df)

# Check imbalance report
if report['imbalance']['is_imbalanced']:
    print("Dataset is imbalanced!")
    print(f"Ratio: {report['imbalance']['imbalance_ratio']}")
    print(f"Recommended: {report['imbalance']['recommended_strategy']}")
    print(f"Reason: {report['imbalance']['reasoning']}")
    
    # Use recommended class weights
    model = RandomForestClassifier(
        class_weight=report['imbalance']['class_weights']
    )

Detailed Reporting

Every decision is explained:

# After cleaning
print(report['summary'])  # Overall summary
print(report['missing_values'])  # How missing values were handled
print(report['outliers'])  # Outlier detection details
print(report['scaling'])  # Why this scaling was chosen
print(report['imbalance'])  # Imbalance analysis (classification)
print(report['feature_engineering'])  # Features created

Example report:

Scaling: RobustScaler
Reason: Data has outliers (>3 columns), using RobustScaler (less sensitive to outliers)

Imbalance: SEVERE (ratio=0.12)
Recommended: class_weight
Reason: Severe imbalance (ratio=0.12) with large dataset, 
        using class_weight (efficient for tree-based models)
Class weights: {0: 1.0, 1: 7.33}
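The class weights in the report come from AutoCleanML itself; if you want to reproduce comparable weights by hand, scikit-learn's compute_class_weight uses the same idea (inverse class frequency, under a different normalization than the report above):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels with roughly the ratio from the report above: 880 negatives, 120 positives
y = np.array([0] * 880 + [1] * 120)

classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weights = dict(zip(classes, weights))
# The minority class (1) gets the larger weight: n_samples / (n_classes * count)
```

Either dictionary can be passed directly as `class_weight=` to scikit-learn classifiers.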

Configuration Options

cleaner = AutoCleanML(
    target="price",              # Required: your target column
    
    # Train/test split
    test_size=0.2,              # 80-20 split
    random_state=42,            # For reproducibility
    
    # Outlier handling
    outlier_method='auto',      # 'iqr', 'zscore', 'isolation_forest'
    outlier_action='cap',       # 'cap', 'remove', 'flag'
    
    # Feature engineering
    feature_extraction=True,    # Create new features
    max_features=100,          # Limit feature count
    
    # Model optimization
    model_type='auto',         # 'linear', 'tree', 'nn', 'auto'
    
    # Verbosity
    verbose=True               # Show progress
)
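To make the outlier options concrete, here is a rough sketch of what an IQR-based outlier_action='cap' amounts to (an illustration under an assumed 1.5×IQR fence, not AutoCleanML's internals):

```python
import pandas as pd

def cap_outliers_iqr(col: pd.Series, factor: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, q3 = col.quantile([0.25, 0.75])
    iqr = q3 - q1
    return col.clip(lower=q1 - factor * iqr, upper=q3 + factor * iqr)

s = pd.Series([10, 12, 11, 13, 12, 500])   # 500 is an obvious outlier
capped = cap_outliers_iqr(s)               # 500 is pulled down to the upper fence
```

'remove' would drop such rows instead, and 'flag' would keep them but add an indicator column.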

What Makes AutoCleanML Smart?

1. Context-Aware Missing Value Imputation

Not all missing values should be filled the same way:

  • Skewed data? → Uses median (not affected by outliers)
  • Correlated features? → Uses KNN (preserves relationships)
  • Normal distribution? → Uses mean
  • Categories? → Uses most frequent value
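A simplified version of that decision logic might look like this (the KNN branch for correlated features is omitted for brevity; the helper name is illustrative, not AutoCleanML's API):

```python
import numpy as np
import pandas as pd
from scipy.stats import skew
from sklearn.impute import SimpleImputer

def pick_imputer(col: pd.Series, skew_threshold: float = 1.0) -> SimpleImputer:
    """Choose an imputation strategy for one column, mirroring the rules above."""
    if col.dtype == object or isinstance(col.dtype, pd.CategoricalDtype):
        return SimpleImputer(strategy="most_frequent")
    if abs(skew(col.dropna())) > skew_threshold:
        return SimpleImputer(strategy="median")    # robust to skew and outliers
    return SimpleImputer(strategy="mean")

s = pd.Series([1.0, 2.0, 2.5, 3.0, np.nan, 100.0])  # the 100.0 skews the column
imp = pick_imputer(s)                               # skewed, so median wins
filled = imp.fit_transform(s.to_frame())
```

The point is that the strategy is chosen per column from the column's own statistics, not applied globally.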

2. Intelligent Scaling

Chooses scaling based on:

  • Your model type (tree models don't need scaling!)
  • Your data characteristics (outliers? → RobustScaler)
  • Task requirements (neural nets → MinMaxScaler)
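That selection logic can be sketched in a few lines (a hypothetical helper, not AutoCleanML's implementation; the 1.5×IQR outlier test is an assumption):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler

def pick_scaler(X: np.ndarray, model_type: str, iqr_factor: float = 1.5):
    """Return a scaler (or None) following the rules above."""
    if model_type == "tree":
        return None                      # trees split on thresholds; scaling is a no-op
    if model_type == "nn":
        return MinMaxScaler()            # bounded [0, 1] for activation functions
    # Linear / distance models: check for outliers with the IQR rule
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    outliers = (X < q1 - iqr_factor * iqr) | (X > q3 + iqr_factor * iqr)
    return RobustScaler() if outliers.any() else StandardScaler()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[0, 0] = 50.0                           # inject a gross outlier
scaler = pick_scaler(X, "linear")        # outlier present, so RobustScaler
```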

3. Imbalance Awareness

For classification:

  • Detects severity of imbalance
  • Considers dataset size
  • Recommends appropriate strategy
  • Provides ready-to-use class weights

4. No Data Leakage

Always:

  1. Splits data FIRST
  2. Fits transformations on training data ONLY
  3. Applies learned transformations to test data

You'll never accidentally leak information from test to train.
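The same split-then-fit discipline is easy to verify by hand with plain scikit-learn:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(100, dtype=float).reshape(-1, 1)
y = (X.ravel() > 50).astype(int)

# 1. Split FIRST
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Fit the transformation on training data ONLY
scaler = StandardScaler().fit(X_train)

# 3. Apply the learned statistics to both splits
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
# scaler.mean_ equals the mean of X_train, not of the full X
```

Fitting the scaler on all of X before splitting would let test-set statistics bleed into training, which is exactly the leakage AutoCleanML avoids.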

5. Guaranteed Clean Output

  • Zero NaN values - Triple-layer protection ensures no missing values
  • All features encoded - Everything is numeric and ready for models
  • Proper scaling - Features scaled appropriately for your model type

Common Use Cases

Use Case 1: Quick Model Baseline

# Get a clean baseline fast
cleaner = AutoCleanML(target="target")
X_train, X_test, y_train, y_test, _ = cleaner.fit_transform(df)

# Try multiple models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

for Model in [LinearRegression, RandomForestRegressor]:
    model = Model()
    model.fit(X_train, y_train)
    print(f"{Model.__name__}: {model.score(X_test, y_test):.3f}")

Use Case 2: Production Pipeline

# Save the cleaner for production
import pickle

cleaner = AutoCleanML(target="price")
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(train_df)

# Train model
model = RandomForestRegressor()
model.fit(X_train, y_train)

# Save both
pickle.dump(cleaner, open('cleaner.pkl', 'wb'))
pickle.dump(model, open('model.pkl', 'wb'))

# In production
cleaner = pickle.load(open('cleaner.pkl', 'rb'))
model = pickle.load(open('model.pkl', 'rb'))

# transform() expects the target column to be present, so add a placeholder
new_data['price'] = 0

# Clean new data the same way
new_data_clean = cleaner.transform(new_data)
new_data_clean = new_data_clean.drop(columns=['price'])
predictions = model.predict(new_data_clean)

Use Case 3: Kaggle Competitions

# Quick clean for competitions
cleaner = AutoCleanML(
    target="target",
    feature_extraction=True,    # Create extra features
    max_features=200,          # Keep more features
    model_type='tree'          # No scaling for XGBoost
)
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(train_df)

# Check what was done
print(f"Created {report['feature_engineering']['total_features_created']} new features")
print(f"Final feature count: {X_train.shape[1]}")

Requirements

  • Python 3.8+
  • pandas
  • numpy
  • scikit-learn
  • scipy

Install dependencies:

pip install pandas numpy scikit-learn scipy

Tips

Tip 1: Check the Report

Always look at the report to understand what was done:

cleaner = AutoCleanML(target="price")
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(df)

# See what happened
print(report['scaling'])  # Why this scaling?
print(report['imbalance'])  # Is data imbalanced?

Tip 2: Match Model Type

Tell AutoCleanML what model you'll use:

# For tree-based models (no scaling needed)
cleaner = AutoCleanML(target="price", model_type='tree')

# For linear models (needs scaling)
cleaner = AutoCleanML(target="price", model_type='linear')

Tip 3: Handle Imbalanced Data

For classification with imbalanced classes:

cleaner = AutoCleanML(target="fraud")
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(df)

# Use recommended class weights
if report['imbalance']['is_imbalanced']:
    model = RandomForestClassifier(
        class_weight=report['imbalance']['class_weights']
    )
    model.fit(X_train, y_train)

Troubleshooting

Q: Getting import errors?

cd AutoCleanML
python -m pip uninstall -y autocleanml
python -m pip install -e .

Q: Model performance seems off?

  • Check model_type matches your model
  • Review scaling report: print(report['scaling'])
  • For classification, check imbalance report

Q: Want more features?

cleaner = AutoCleanML(target="price", max_features=200)

Q: Want less processing?

cleaner = AutoCleanML(
    target="price",
    feature_extraction=False,  # Skip feature engineering
    model_type='tree'          # Skip scaling
)

What's Next?

After cleaning with AutoCleanML:

  1. Train models - Your data is ready for any sklearn model
  2. Tune hyperparameters - Use GridSearchCV or RandomizedSearchCV
  3. Deploy - Save the cleaner with your model for production

License

MIT License - Use it however you want!


Summary

AutoCleanML makes ML data preprocessing automatic and intelligent.

✅ One line to clean data
✅ Smart decisions based on data characteristics
✅ Model-aware preprocessing
✅ Handles imbalanced datasets
✅ Explains every decision
✅ Guaranteed clean output
✅ No data leakage

Stop cleaning data manually. Start using AutoCleanML.

from autocleanml import AutoCleanML

cleaner = AutoCleanML(target="your_target")
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(df)

# Done! Train your model now.
