
Automated ML data cleaning and preprocessing pipeline


AutoCleanML

Stop wasting hours cleaning data. Let AutoCleanML do it for you.

AutoCleanML automatically cleans and prepares your messy data for machine learning. Just give it your data and a target column - it handles the rest.


Why Use AutoCleanML?

Before AutoCleanML:

  • Spend hours handling missing values
  • Manually encode categorical variables
  • Figure out which scaling to use
  • Deal with imbalanced datasets
  • Wonder if you're doing it right

With AutoCleanML:

from autocleanml import AutoCleanML

cleaner = AutoCleanML(target="target_col")
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(df)

# Done! Your data is ready for any model

What It Does

  1. Fixes data types - Converts strings to numbers, handles dates
  2. Handles missing values - Smartly imputes using KNN, median, or mode
  3. Removes outliers - Detects and handles outliers intelligently
  4. Transforms skewed features - Applies log/power transforms for highly skewed data
  5. Engineers features - Creates useful features from text, dates, numbers
  6. Encodes categories - Handles categorical variables without exploding features
  7. Scales features - Chooses right scaling based on your model type
  8. Handles imbalance - Detects and suggests fixes for imbalanced classes
  9. Removes useless features - Gets rid of constants and highly correlated features

And it tells you WHY it made each decision.


Installation

pip install autocleanml

Or, for an editable install from a source checkout:

pip install -e .

Quick Start

Example 1: Predicting House Prices (Regression)

import pandas as pd
from autocleanml import AutoCleanML
from sklearn.ensemble import RandomForestRegressor

# Load your messy data
df = pd.read_csv("house_prices.csv")

# Method 1: Pass your model (AutoCleanML auto-detects optimal preprocessing)
model = RandomForestRegressor()
cleaner = AutoCleanML(target="price", model=model)
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(df)
#  Auto-detected: tree → skips scaling (trees don't need it!)

# Train
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Method 2: Or just specify model type
cleaner = AutoCleanML(target="price", model_type='tree')
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(df)

Example 2: Customer Churn (Classification)

# For classification, AutoCleanML detects imbalanced classes
cleaner = AutoCleanML(target="churned")
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(df)

# It tells you if your classes are imbalanced and what to do
print(report['imbalance'])
# Shows: Class weights to use, recommended strategy, reasoning

# Train with recommended class weights
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(class_weight=report['imbalance']['class_weights'])
model.fit(X_train, y_train)

Key Features

Smart Scaling Based on Model Type

AutoCleanML has TWO ways to be model-aware:

Method 1: Pass Your Model (Automatic Detection) ⭐ RECOMMENDED

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

# AutoCleanML detects the model type automatically!
model = RandomForestRegressor()
cleaner = AutoCleanML(target="price", model=model)
#  Auto-detected: tree → No scaling needed

model = LinearRegression()
cleaner = AutoCleanML(target="price", model=model)
#  Auto-detected: linear → StandardScaler + log transforms

model = MLPRegressor()
cleaner = AutoCleanML(target="price", model=model)
#  Auto-detected: nn → MinMaxScaler [0,1]

Model types supported by auto-detection:

  • 🌳 Tree-based: RandomForest, XGBoost, LightGBM, CatBoost, DecisionTree
  • 📊 Linear: LinearRegression, LogisticRegression, Ridge, Lasso, ElasticNet, SGD
  • 🧠 Neural Network: MLPClassifier, MLPRegressor, Keras, PyTorch models
  • 📍 Distance-based: KNN, SVM

Method 2: Specify Model Type Manually

# If you don't have the model object yet
cleaner = AutoCleanML(target="price", model_type='linear')
cleaner = AutoCleanML(target="price", model_type='tree')
cleaner = AutoCleanML(target="price", model_type='nn')
cleaner = AutoCleanML(target="price", model_type='auto')  # Let it guess

Automatic Transformations:

  • Highly skewed features (skewness > 1) → Log transform or Yeo-Johnson power transform
  • Features with outliers → RobustScaler (uses median, less sensitive)
  • Normal distribution → StandardScaler (zero mean, unit variance)
  • Neural networks → MinMaxScaler (0-1 bounded for activation functions)
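As an illustration of the skew rule, here is a minimal sketch (not AutoCleanML's actual code) of choosing between a log and a Yeo-Johnson transform at the skewness > 1 threshold, using scipy and scikit-learn:

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

def transform_skewed(col, threshold=1.0):
    """Pick a transform for a 1-D array based on its skewness."""
    s = skew(col)
    if abs(s) <= threshold:
        return col, "none"
    if s > threshold and (col > 0).all():
        return np.log1p(col), "log"     # right-skewed, strictly positive
    # Left skew or non-positive values: Yeo-Johnson handles both
    pt = PowerTransformer(method="yeo-johnson")
    return pt.fit_transform(col.reshape(-1, 1)).ravel(), "yeo-johnson"

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=1.0, size=1000)   # heavily right-skewed
transformed, method = transform_skewed(income)
# After the log transform, skewness drops close to 0
```

The threshold and function name here are illustrative; AutoCleanML applies the same kind of rule internally and records which transform it chose in the report.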

Example output:

Transformed 3 skewed features:
  - income: log transform (skewness was 2.34)
  - sales: yeo-johnson (skewness was -1.89)
  - amount: log transform (skewness was 3.12)
  
Scaling: StandardScaler
Reason: Linear model with clean data after transformation

Imbalanced Dataset Handling

For classification, it automatically:

  • Detects class imbalance
  • Recommends best strategy (class weights, SMOTE, etc.)
  • Provides ready-to-use class weights
  • Explains why it recommends that strategy

cleaner = AutoCleanML(target="fraud")  # Highly imbalanced dataset
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(df)

# Check imbalance report
if report['imbalance']['is_imbalanced']:
    print("Dataset is imbalanced!")
    print(f"Ratio: {report['imbalance']['imbalance_ratio']}")
    print(f"Recommended: {report['imbalance']['recommended_strategy']}")
    print(f"Reason: {report['imbalance']['reasoning']}")
    
    # Use recommended class weights
    model = RandomForestClassifier(
        class_weight=report['imbalance']['class_weights']
    )

Detailed Reporting

Every decision is explained:

# After cleaning
print(report['summary'])  # Overall summary
print(report['missing_values'])  # How missing values were handled
print(report['outliers'])  # Outlier detection details
print(report['scaling'])  # Why this scaling was chosen
print(report['imbalance'])  # Imbalance analysis (classification)
print(report['feature_engineering'])  # Features created

Example report:

Scaling: RobustScaler
Reason: Data has outliers (>3 columns), using RobustScaler (less sensitive to outliers)

Imbalance: SEVERE (ratio=0.12)
Recommended: class_weight
Reason: Severe imbalance (ratio=0.12) with large dataset, 
        using class_weight (efficient for tree-based models)
Class weights: {0: 1.0, 1: 7.33}
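The class weights in the report come from AutoCleanML itself; if you want to reproduce comparable weights by hand, scikit-learn's compute_class_weight uses the same idea (inverse class frequency, under a different normalization than the report above):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels with roughly the ratio from the report above: 880 negatives, 120 positives
y = np.array([0] * 880 + [1] * 120)

classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weights = dict(zip(classes, weights))
# The minority class (1) gets the larger weight: n_samples / (n_classes * count)
```

Either dictionary can be passed directly as `class_weight=` to scikit-learn classifiers.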

Configuration Options

cleaner = AutoCleanML(
    target="price",              # Required: your target column
    
    # Train/test split
    test_size=0.2,              # 80-20 split
    random_state=42,            # For reproducibility
    
    # Outlier handling
    outlier_method='auto',      # 'iqr', 'zscore', 'isolation_forest'
    outlier_action='cap',       # 'cap', 'remove', 'flag'
    
    # Feature engineering
    feature_extraction=True,    # Create new features
    max_features=100,          # Limit feature count
    
    # Model optimization
    model_type='auto',         # 'linear', 'tree', 'nn', 'auto'
    
    # Verbosity
    verbose=True               # Show progress
)
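To make the outlier options concrete, here is a rough sketch of what an IQR-based outlier_action='cap' amounts to (an illustration under an assumed 1.5×IQR fence, not AutoCleanML's internals):

```python
import pandas as pd

def cap_outliers_iqr(col: pd.Series, factor: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, q3 = col.quantile([0.25, 0.75])
    iqr = q3 - q1
    return col.clip(lower=q1 - factor * iqr, upper=q3 + factor * iqr)

s = pd.Series([10, 12, 11, 13, 12, 500])   # 500 is an obvious outlier
capped = cap_outliers_iqr(s)               # 500 is pulled down to the upper fence
```

'remove' would drop such rows instead, and 'flag' would keep them but add an indicator column.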

What Makes AutoCleanML Smart?

1. Context-Aware Missing Value Imputation

Not all missing values should be filled the same way:

  • Skewed data? → Uses median (not affected by outliers)
  • Correlated features? → Uses KNN (preserves relationships)
  • Normal distribution? → Uses mean
  • Categories? → Uses most frequent value
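A simplified version of that decision logic might look like this (the KNN branch for correlated features is omitted for brevity; the helper name is illustrative, not AutoCleanML's API):

```python
import numpy as np
import pandas as pd
from scipy.stats import skew
from sklearn.impute import SimpleImputer

def pick_imputer(col: pd.Series, skew_threshold: float = 1.0) -> SimpleImputer:
    """Choose an imputation strategy for one column, mirroring the rules above."""
    if col.dtype == object or isinstance(col.dtype, pd.CategoricalDtype):
        return SimpleImputer(strategy="most_frequent")
    if abs(skew(col.dropna())) > skew_threshold:
        return SimpleImputer(strategy="median")    # robust to skew and outliers
    return SimpleImputer(strategy="mean")

s = pd.Series([1.0, 2.0, 2.5, 3.0, np.nan, 100.0])  # the 100.0 skews the column
imp = pick_imputer(s)                               # skewed, so median wins
filled = imp.fit_transform(s.to_frame())
```

The point is that the strategy is chosen per column from the column's own statistics, not applied globally.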

2. Intelligent Scaling

Chooses scaling based on:

  • Your model type (tree models don't need scaling!)
  • Your data characteristics (outliers? → RobustScaler)
  • Task requirements (neural nets → MinMaxScaler)
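That selection logic can be sketched in a few lines (a hypothetical helper, not AutoCleanML's implementation; the 1.5×IQR outlier test is an assumption):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler

def pick_scaler(X: np.ndarray, model_type: str, iqr_factor: float = 1.5):
    """Return a scaler (or None) following the rules above."""
    if model_type == "tree":
        return None                      # trees split on thresholds; scaling is a no-op
    if model_type == "nn":
        return MinMaxScaler()            # bounded [0, 1] for activation functions
    # Linear / distance models: check for outliers with the IQR rule
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    outliers = (X < q1 - iqr_factor * iqr) | (X > q3 + iqr_factor * iqr)
    return RobustScaler() if outliers.any() else StandardScaler()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[0, 0] = 50.0                           # inject a gross outlier
scaler = pick_scaler(X, "linear")        # outlier present, so RobustScaler
```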

3. Imbalance Awareness

For classification:

  • Detects severity of imbalance
  • Considers dataset size
  • Recommends appropriate strategy
  • Provides ready-to-use class weights

4. No Data Leakage

Always:

  1. Splits data FIRST
  2. Fits transformations on training data ONLY
  3. Applies learned transformations to test data

You'll never accidentally leak information from test to train.
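The same split-then-fit discipline is easy to verify by hand with plain scikit-learn:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(100, dtype=float).reshape(-1, 1)
y = (X.ravel() > 50).astype(int)

# 1. Split FIRST
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Fit the transformation on training data ONLY
scaler = StandardScaler().fit(X_train)

# 3. Apply the learned statistics to both splits
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
# scaler.mean_ equals the mean of X_train, not of the full X
```

Fitting the scaler on all of X before splitting would let test-set statistics bleed into training, which is exactly the leakage AutoCleanML avoids.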

5. Guaranteed Clean Output

  • Zero NaN values - Triple-layer protection ensures no missing values
  • All features encoded - Everything is numeric and ready for models
  • Proper scaling - Features scaled appropriately for your model type

Common Use Cases

Use Case 1: Quick Model Baseline

# Get a clean baseline fast
cleaner = AutoCleanML(target="target")
X_train, X_test, y_train, y_test, _ = cleaner.fit_transform(df)

# Try multiple models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

for Model in [LinearRegression, RandomForestRegressor]:
    model = Model()
    model.fit(X_train, y_train)
    print(f"{Model.__name__}: {model.score(X_test, y_test):.3f}")

Use Case 2: Production Pipeline

# Save the cleaner for production
import pickle

cleaner = AutoCleanML(target="price")
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(train_df)

# Train model
model = RandomForestRegressor()
model.fit(X_train, y_train)

# Save both
pickle.dump(cleaner, open('cleaner.pkl', 'wb'))
pickle.dump(model, open('model.pkl', 'wb'))

# In production
cleaner = pickle.load(open('cleaner.pkl', 'rb'))
model = pickle.load(open('model.pkl', 'rb'))

# transform() expects the target column to be present, so add a placeholder
new_data['price'] = 0

# Clean new data the same way
new_data_clean = cleaner.transform(new_data)
new_data_clean = new_data_clean.drop(columns=['price'])
predictions = model.predict(new_data_clean)

Use Case 3: Kaggle Competitions

# Quick clean for competitions
cleaner = AutoCleanML(
    target="target",
    feature_extraction=True,    # Create extra features
    max_features=200,          # Keep more features
    model_type='tree'          # No scaling for XGBoost
)
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(train_df)

# Check what was done
print(f"Created {report['feature_engineering']['total_features_created']} new features")
print(f"Final feature count: {X_train.shape[1]}")

Requirements

  • Python 3.8+
  • pandas
  • numpy
  • scikit-learn
  • scipy

Install dependencies:

pip install pandas numpy scikit-learn scipy

Tips

Tip 1: Check the Report

Always look at the report to understand what was done:

cleaner = AutoCleanML(target="price")
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(df)

# See what happened
print(report['scaling'])  # Why this scaling?
print(report['imbalance'])  # Is data imbalanced?

Tip 2: Match Model Type

Tell AutoCleanML what model you'll use:

# For tree-based models (no scaling needed)
cleaner = AutoCleanML(target="price", model_type='tree')

# For linear models (needs scaling)
cleaner = AutoCleanML(target="price", model_type='linear')

Tip 3: Handle Imbalanced Data

For classification with imbalanced classes:

cleaner = AutoCleanML(target="fraud")
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(df)

# Use recommended class weights
if report['imbalance']['is_imbalanced']:
    model = RandomForestClassifier(
        class_weight=report['imbalance']['class_weights']
    )
    model.fit(X_train, y_train)

Troubleshooting

Q: Getting import errors?

cd AutoCleanML
python -m pip uninstall -y autocleanml
python -m pip install -e .

Q: Model performance seems off?

  • Check model_type matches your model
  • Review scaling report: print(report['scaling'])
  • For classification, check imbalance report

Q: Want more features?

cleaner = AutoCleanML(target="price", max_features=200)

Q: Want less processing?

cleaner = AutoCleanML(
    target="price",
    feature_extraction=False,  # Skip feature engineering
    model_type='tree'          # Skip scaling
)

What's Next?

After cleaning with AutoCleanML:

  1. Train models - Your data is ready for any sklearn model
  2. Tune hyperparameters - Use GridSearchCV or RandomizedSearchCV
  3. Deploy - Save the cleaner with your model for production

License

MIT License - Use it however you want!


Summary

AutoCleanML makes ML data preprocessing automatic and intelligent.

✅ One line to clean data
✅ Smart decisions based on data characteristics
✅ Model-aware preprocessing
✅ Handles imbalanced datasets
✅ Explains every decision
✅ Guaranteed clean output
✅ No data leakage

Stop cleaning data manually. Start using AutoCleanML.

from autocleanml import AutoCleanML

cleaner = AutoCleanML(target="your_target")
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(df)

# Done! Train your model now.
