Automated ML data cleaning and preprocessing pipeline
Project description
AutoCleanML
Stop wasting hours cleaning data. Let AutoCleanML do it for you.
AutoCleanML automatically cleans and prepares your messy data for machine learning. Just give it your data and target column - it handles the rest.
Why Use AutoCleanML?
Before AutoCleanML:
- Spend hours handling missing values
- Manually encode categorical variables
- Figure out which scaling to use
- Deal with imbalanced datasets
- Wonder if you're doing it right
With AutoCleanML:
from autocleanml import AutoCleanML
cleaner = AutoCleanML(target="target_col")
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(df)
# Done! Your data is ready for any model
What It Does
- Fixes data types - Converts strings to numbers, handles dates
- Handles missing values - Smartly imputes using KNN, median, or mode
- Removes outliers - Detects and handles outliers intelligently
- Transforms skewed features - Applies log/power transforms for highly skewed data
- Engineers features - Creates useful features from text, dates, numbers
- Encodes categories - Handles categorical variables without exploding features
- Scales features - Chooses the right scaling based on your model type
- Handles imbalance - Detects and suggests fixes for imbalanced classes
- Removes useless features - Gets rid of constants and highly correlated features
And it tells you WHY it made each decision.
Installation
pip install autocleanml
Or, for development from a local clone:
pip install -e .
Quick Start
Example 1: Predicting House Prices (Regression)
import pandas as pd
from autocleanml import AutoCleanML
from sklearn.ensemble import RandomForestRegressor
# Load your messy data
df = pd.read_csv("house_prices.csv")
# Method 1: Pass your model (AutoCleanML auto-detects optimal preprocessing)
model = RandomForestRegressor()
cleaner = AutoCleanML(target="price", model=model)
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(df)
# Auto-detected: tree → skips scaling (trees don't need it!)
# Train
model.fit(X_train, y_train)
predictions = model.predict(X_test)
# Method 2: Or just specify model type
cleaner = AutoCleanML(target="price", model_type='tree')
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(df)
Example 2: Customer Churn (Classification)
# For classification, AutoCleanML detects imbalanced classes
cleaner = AutoCleanML(target="churned")
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(df)
# It tells you if your classes are imbalanced and what to do
print(report['imbalance'])
# Shows: Class weights to use, recommended strategy, reasoning
# Train with recommended class weights
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(class_weight=report['imbalance']['class_weights'])
model.fit(X_train, y_train)
Key Features
Smart Scaling Based on Model Type
AutoCleanML has TWO ways to be model-aware:
Method 1: Pass Your Model (Automatic Detection) ⭐ RECOMMENDED
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
# AutoCleanML detects the model type automatically!
model = RandomForestRegressor()
cleaner = AutoCleanML(target="price", model=model)
# Auto-detected: tree → No scaling needed
model = LinearRegression()
cleaner = AutoCleanML(target="price", model=model)
# Auto-detected: linear → StandardScaler + log transforms
model = MLPRegressor()
cleaner = AutoCleanML(target="price", model=model)
# Auto-detected: nn → MinMaxScaler [0,1]
Supported model families for auto-detection:
- 🌳 Tree-based: RandomForest, XGBoost, LightGBM, CatBoost, DecisionTree
- 📊 Linear: LinearRegression, LogisticRegression, Ridge, Lasso, ElasticNet, SGD
- 🧠 Neural Network: MLPClassifier, MLPRegressor, Keras, PyTorch models
- 📍 Distance-based: KNN, SVM
Method 2: Specify Model Type Manually
# If you don't have the model object yet
cleaner = AutoCleanML(target="price", model_type='linear')
cleaner = AutoCleanML(target="price", model_type='tree')
cleaner = AutoCleanML(target="price", model_type='nn')
cleaner = AutoCleanML(target="price", model_type='auto') # Let it guess
Automatic Transformations:
- Highly skewed features (|skewness| > 1) → Log transform or Yeo-Johnson power transform
- Features with outliers → RobustScaler (uses median, less sensitive)
- Normal distribution → StandardScaler (zero mean, unit variance)
- Neural networks → MinMaxScaler (0-1 bounded for activation functions)
Example output:
Transformed 3 skewed features:
- income: log transform (skewness was 2.34)
- sales: yeo-johnson (skewness was -1.89)
- amount: log transform (skewness was 3.12)
Scaling: StandardScaler
Reason: Linear model with clean data after transformation
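The skew rule above can be sketched with plain scipy and scikit-learn. This is an illustrative approximation of the decision, not AutoCleanML's actual implementation; `transform_skewed` and its threshold are hypothetical names for this sketch:

```python
import numpy as np
import pandas as pd
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

def transform_skewed(s, threshold=1.0):
    """Transform a numeric Series when |skewness| exceeds the threshold."""
    sk = skew(s.dropna())
    if abs(sk) <= threshold:
        return s, "none"
    if sk > 0 and (s >= 0).all():
        # Right-skewed, non-negative data: log1p is simple and monotonic
        return np.log1p(s), "log"
    # Left-skewed or negative values: Yeo-Johnson accepts any real input
    pt = PowerTransformer(method="yeo-johnson")
    out = pt.fit_transform(s.to_frame())[:, 0]
    return pd.Series(out, index=s.index), "yeo-johnson"

# Lognormal data is heavily right-skewed, so the log branch fires
income = pd.Series(np.random.default_rng(0).lognormal(mean=10, sigma=1, size=1000))
transformed, method = transform_skewed(income)
print(method)  # log
```

After the transform, the skewness of `transformed` is close to zero, which is exactly what the scaler downstream wants.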
Imbalanced Dataset Handling
For classification, it automatically:
- Detects class imbalance
- Recommends best strategy (class weights, SMOTE, etc.)
- Provides ready-to-use class weights
- Explains why it recommends that strategy
cleaner = AutoCleanML(target="fraud") # Highly imbalanced dataset
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(df)
# Check imbalance report
if report['imbalance']['is_imbalanced']:
    print("Dataset is imbalanced!")
    print(f"Ratio: {report['imbalance']['imbalance_ratio']}")
    print(f"Recommended: {report['imbalance']['recommended_strategy']}")
    print(f"Reason: {report['imbalance']['reasoning']}")
# Use recommended class weights
model = RandomForestClassifier(
    class_weight=report['imbalance']['class_weights']
)
Detailed Reporting
Every decision is explained:
# After cleaning
print(report['summary']) # Overall summary
print(report['missing_values']) # How missing values were handled
print(report['outliers']) # Outlier detection details
print(report['scaling']) # Why this scaling was chosen
print(report['imbalance']) # Imbalance analysis (classification)
print(report['feature_engineering']) # Features created
Example report:
Scaling: RobustScaler
Reason: Data has outliers (>3 columns), using RobustScaler (less sensitive to outliers)
Imbalance: SEVERE (ratio=0.12)
Recommended: class_weight
Reason: Severe imbalance (ratio=0.12) with large dataset,
using class_weight (efficient for tree-based models)
Class weights: {0: 1.0, 1: 7.33}
Configuration Options
cleaner = AutoCleanML(
    target="price",            # Required: your target column
    # Train/test split
    test_size=0.2,             # 80-20 split
    random_state=42,           # For reproducibility
    # Outlier handling
    outlier_method='auto',     # 'iqr', 'zscore', 'isolation_forest'
    outlier_action='cap',      # 'cap', 'remove', 'flag'
    # Feature engineering
    feature_extraction=True,   # Create new features
    max_features=100,          # Limit feature count
    # Model optimization
    model_type='auto',         # 'linear', 'tree', 'nn', 'auto'
    # Verbosity
    verbose=True               # Show progress
)
What Makes AutoCleanML Smart?
1. Context-Aware Missing Value Imputation
Not all missing values should be filled the same way:
- Skewed data? → Uses median (not affected by outliers)
- Correlated features? → Uses KNN (preserves relationships)
- Normal distribution? → Uses mean
- Categories? → Uses most frequent value
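These rules can be approximated with scikit-learn's imputers. A simplified sketch (the KNN branch for correlated features is omitted for brevity, and `pick_imputer` is a hypothetical helper, not AutoCleanML's internal code):

```python
import numpy as np
import pandas as pd
from scipy.stats import skew
from sklearn.impute import SimpleImputer

def pick_imputer(s):
    """Choose an imputation strategy from simple column statistics."""
    if s.dtype == object or str(s.dtype) == "category":
        return SimpleImputer(strategy="most_frequent")  # categories → mode
    if abs(skew(s.dropna())) > 1:
        return SimpleImputer(strategy="median")         # skewed → outlier-robust median
    return SimpleImputer(strategy="mean")               # roughly normal → mean

skewed_col = pd.Series([1.0, 1.0, 1.0, 2.0, np.nan, 100.0])  # heavy right skew
print(pick_imputer(skewed_col).strategy)   # median

cat_col = pd.Series(["a", "b", "a", None])
print(pick_imputer(cat_col).strategy)      # most_frequent
```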
2. Intelligent Scaling
Chooses scaling based on:
- Your model type (tree models don't need scaling!)
- Your data characteristics (outliers? → RobustScaler)
- Task requirements (neural nets → MinMaxScaler)
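The decision table above boils down to a small dispatch. A sketch in plain scikit-learn (`pick_scaler` is an illustrative stand-in, not AutoCleanML's API):

```python
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

def pick_scaler(model_type, has_outliers):
    """Mirror the scaling rules: trees need none, NNs want bounded inputs."""
    if model_type == "tree":
        return None                # split-based models are scale-invariant
    if model_type == "nn":
        return MinMaxScaler()      # bounded [0, 1] suits activation functions
    if has_outliers:
        return RobustScaler()      # median/IQR-based, resists outliers
    return StandardScaler()        # zero mean, unit variance

print(pick_scaler("tree", has_outliers=True))                   # None
print(type(pick_scaler("linear", has_outliers=True)).__name__)  # RobustScaler
```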
3. Imbalance Awareness
For classification:
- Detects severity of imbalance
- Considers dataset size
- Recommends appropriate strategy
- Provides ready-to-use class weights
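Comparable ready-to-use weights can be computed with scikit-learn's "balanced" heuristic, which assigns each class `n_samples / (n_classes * class_count)`. AutoCleanML's exact formula may differ; this sketch just shows the idea on synthetic labels:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced labels: 880 negatives, 120 positives
y = np.array([0] * 880 + [1] * 120)

counts = np.bincount(y)
ratio = counts.min() / counts.max()
print(f"minority/majority ratio: {ratio:.2f}")   # 0.14

classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weights = {int(c): round(float(w), 2) for c, w in zip(classes, weights)}
print(class_weights)                             # {0: 0.57, 1: 4.17}
```

The resulting dict can be passed straight to `class_weight=` on any scikit-learn classifier.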
4. No Data Leakage
Always:
- Splits data FIRST
- Fits transformations on training data ONLY
- Applies learned transformations to test data
You'll never accidentally leak information from test to train.
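The split-first pattern is standard scikit-learn practice; a minimal sketch of what it looks like with a single scaler:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(100, dtype=float).reshape(-1, 1)

# 1. Split FIRST
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# 2. Fit the transformation on training data ONLY
scaler = StandardScaler().fit(X_train)

# 3. Apply the learned parameters to both splits
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The scaler's statistics come from the training split alone,
# so nothing about the test set leaks into preprocessing.
print(np.isclose(scaler.mean_[0], X_train.mean()))  # True
```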
5. Guaranteed Clean Output
- Zero NaN values - Triple-layer protection ensures no missing values
- All features encoded - Everything is numeric and ready for models
- Proper scaling - Features scaled appropriately for your model type
Common Use Cases
Use Case 1: Quick Model Baseline
# Get a clean baseline fast
cleaner = AutoCleanML(target="target")
X_train, X_test, y_train, y_test, _ = cleaner.fit_transform(df)
# Try multiple models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
for Model in [LinearRegression, RandomForestRegressor]:
    model = Model()
    model.fit(X_train, y_train)
    print(f"{Model.__name__}: {model.score(X_test, y_test):.3f}")
Use Case 2: Production Pipeline
# Save the cleaner for production
import pickle
cleaner = AutoCleanML(target="price")
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(train_df)
# Train model
model = RandomForestRegressor()
model.fit(X_train, y_train)
# Save both
pickle.dump(cleaner, open('cleaner.pkl', 'wb'))
pickle.dump(model, open('model.pkl', 'wb'))
# In production
cleaner = pickle.load(open('cleaner.pkl', 'rb'))
model = pickle.load(open('model.pkl', 'rb'))
# Add a placeholder target column (transform expects it to be present)
new_data['price'] = 0
# Clean new data the same way
new_data_clean = cleaner.transform(new_data)
new_data_clean = new_data_clean.drop(columns=['price'])
predictions = model.predict(new_data_clean)
Use Case 3: Kaggle Competitions
# Quick clean for competitions
cleaner = AutoCleanML(
    target="target",
    feature_extraction=True,   # Create extra features
    max_features=200,          # Keep more features
    model_type='tree'          # No scaling for XGBoost
)
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(train_df)
# Check what was done
print(f"Created {report['feature_engineering']['total_features_created']} new features")
print(f"Final feature count: {X_train.shape[1]}")
Requirements
- Python 3.8+
- pandas
- numpy
- scikit-learn
- scipy
Install dependencies:
pip install pandas numpy scikit-learn scipy
Tips
Tip 1: Check the Report
Always look at the report to understand what was done:
cleaner = AutoCleanML(target="price")
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(df)
# See what happened
print(report['scaling']) # Why this scaling?
print(report['imbalance']) # Is data imbalanced?
Tip 2: Match Model Type
Tell AutoCleanML what model you'll use:
# For tree-based models (no scaling needed)
cleaner = AutoCleanML(target="price", model_type='tree')
# For linear models (needs scaling)
cleaner = AutoCleanML(target="price", model_type='linear')
Tip 3: Handle Imbalanced Data
For classification with imbalanced classes:
cleaner = AutoCleanML(target="fraud")
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(df)
# Use recommended class weights
if report['imbalance']['is_imbalanced']:
    model = RandomForestClassifier(
        class_weight=report['imbalance']['class_weights']
    )
    model.fit(X_train, y_train)
Troubleshooting
Q: Getting import errors?
cd AutoCleanML
python -m pip uninstall -y autocleanml
python -m pip install -e .
Q: Model performance seems off?
- Check that model_type matches your model
- Review the scaling report: print(report['scaling'])
- For classification, check the imbalance report
Q: Want more features?
cleaner = AutoCleanML(target="price", max_features=200)
Q: Want less processing?
cleaner = AutoCleanML(
    target="price",
    feature_extraction=False,  # Skip feature engineering
    model_type='tree'          # Skip scaling
)
What's Next?
After cleaning with AutoCleanML:
- Train models - Your data is ready for any sklearn model
- Tune hyperparameters - Use GridSearchCV or RandomizedSearchCV
- Deploy - Save the cleaner with your model for production
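A minimal tuning sketch on synthetic data; in practice, substitute the X_train/y_train returned by cleaner.fit_transform:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Stand-in for the cleaned training data AutoCleanML produces
X_train, y_train = make_regression(n_samples=200, n_features=5, random_state=42)

grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=3,
)
grid.fit(X_train, y_train)
print(grid.best_params_)
```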
License
MIT License - Use it however you want!
Summary
AutoCleanML makes ML data preprocessing automatic and intelligent.
✅ One line to clean data
✅ Smart decisions based on data characteristics
✅ Model-aware preprocessing
✅ Handles imbalanced datasets
✅ Explains every decision
✅ Guaranteed clean output
✅ No data leakage
Stop cleaning data manually. Start using AutoCleanML.
from autocleanml import AutoCleanML
cleaner = AutoCleanML(target="your_target")
X_train, X_test, y_train, y_test, report = cleaner.fit_transform(df)
# Done! Train your model now.
Download files
Source Distribution
Built Distribution
File details
Details for the file autocleanml-0.1.0.tar.gz.
File metadata
- Download URL: autocleanml-0.1.0.tar.gz
- Upload date:
- Size: 45.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 852061bf9d23325b057298b0d6eddb4b10f6a0e284c8722ef7a4eb8c3e28189b |
| MD5 | 47faaa18b5182ad8991a6414b45762ea |
| BLAKE2b-256 | 92f09f04e3d42d674a6c9ccec21965e48e10a7ea4962624de9a4457ba786de04 |
File details
Details for the file autocleanml-0.1.0-py3-none-any.whl.
File metadata
- Download URL: autocleanml-0.1.0-py3-none-any.whl
- Upload date:
- Size: 46.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c4432504a6bdaed61d3fc8b1e1fcabeb0316b3cbb76b2322df71a83424155070 |
| MD5 | 608318606573fba8961fe272ee1b75b7 |
| BLAKE2b-256 | df30bbc174c1bfea65bae1f66d038aa9300aa2024fd09c572b44d9c1adbda0e3 |