Autocure

Automated data cleaning + AutoML model training with hyperparameter tuning.

Key features

  • Data scan

    • Missing values summary
    • Dtype inference
    • Outlier counts (IQR method)
    • Duplicate rows count
  • Data cleaning (cure)

    • Numeric imputation (median)
    • Categorical imputation (mode / fallback "Unknown")
    • Text preprocessing (applies only to text columns):
      • remove punctuation
      • lowercase
      • tokenize
      • remove stopwords
      • lemmatize
    • Drop duplicates
  • Pipeline & AutoML

    • Automatic ColumnTransformer (StandardScaler for numeric, OneHotEncoder for categorical)
    • Support for classification and regression
    • Multiple algorithms with GridSearchCV hyperparameter search
    • Model leaderboard and best-model selection
    • Save best model as a pickle into autocure/models/{model_name}_model.pickle
  • CLI

    • scan-data, fix, train-model commands for quick workflows
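
The scan and cleaning features listed above can be approximated with plain pandas. The sketch below is not Autocure's implementation: scan_sketch and cure_sketch are hypothetical names, the 1.5 x IQR fence is the conventional choice (the multiplier Autocure uses is not stated here), and the text-preprocessing step is left out.

```python
import pandas as pd


def scan_sketch(df: pd.DataFrame) -> dict:
    """Hypothetical stand-in for scan(): summarize common data-quality issues."""
    numeric = df.select_dtypes(include="number")
    # IQR rule (assumed 1.5 multiplier): flag values outside
    # [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] for each numeric column.
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    outliers = (numeric.lt(q1 - 1.5 * iqr) | numeric.gt(q3 + 1.5 * iqr)).sum()
    return {
        "missing": df.isna().sum().to_dict(),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "outliers": outliers.to_dict(),
        "duplicates": int(df.duplicated().sum()),
    }


def cure_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for cure(), without the text-preprocessing step."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Numeric columns: fill gaps with the column median.
            out[col] = out[col].fillna(out[col].median())
        else:
            # Other columns: fill with the mode, or "Unknown" if no mode exists.
            mode = out[col].mode(dropna=True)
            out[col] = out[col].fillna(mode.iloc[0] if not mode.empty else "Unknown")
    return out.drop_duplicates().reset_index(drop=True)
```

Calling scan_sketch(df) before and after cure_sketch(df) is a quick way to confirm that the missing-value and duplicate counts drop to zero.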

Installation

From project root (editable install for development):

python -m pip install -e .

Or install dependencies:

pip install -r requirements.txt
pip install nltk

cure() downloads the NLTK data it needs (punkt, stopwords, wordnet) at runtime, so the first call requires network access.

Quick usage

Python:

import autocure as ac
import pandas as pd

df = pd.read_csv("your_dataset.csv")

# 1) Inspect dataset
report = ac.scan(df)
print(report)

# 2) Clean dataset
clean = ac.cure(df)

# 3) Create pipeline (optional)
pipeline = ac.make_pipeline(clean, target="target_column")

# 4) Train models (returns a dict)
result = ac.train(clean, target="target_column")

print(result["best_model_name"])
print("score:", result["best_score"])
# Best model object for predict:
best_model = result["best_model"]
preds = best_model.predict(clean.drop(columns=["target_column"]).iloc[:5])
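
Step 3 above builds a preprocessing pipeline. Going by the feature list (StandardScaler for numeric columns, OneHotEncoder for categorical ones), a roughly equivalent pipeline can be assembled directly with scikit-learn. This is a sketch under assumptions: make_pipeline_sketch, the step names, and handle_unknown="ignore" are illustrative choices, not Autocure's actual code.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def make_pipeline_sketch(df, target, estimator):
    """Hypothetical equivalent of make_pipeline(): scale numeric, one-hot encode the rest."""
    features = df.drop(columns=[target])
    numeric_cols = features.select_dtypes(include="number").columns.tolist()
    categorical_cols = [c for c in features.columns if c not in numeric_cols]
    preprocess = ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), numeric_cols),
            # Assumption: ignore categories unseen during fit instead of raising.
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ]
    )
    return Pipeline([("preprocess", preprocess), ("model", estimator)])


# Tiny illustrative dataset (made up for this example).
df = pd.DataFrame({
    "size": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "color": ["r", "g", "r", "g", "r", "g"],
    "target": [0, 1, 0, 1, 0, 1],
})
pipe = make_pipeline_sketch(df, "target", LogisticRegression())
pipe.fit(df.drop(columns=["target"]), df["target"])
preds = pipe.predict(df.drop(columns=["target"]))
```

Keeping preprocessing and estimator in one Pipeline object means the same transformations are applied at fit time and at predict time.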

Returned keys from train():

  • best_model (fitted sklearn estimator / pipeline)
  • best_model_name
  • best_score
  • best_params
  • leaderboard (list/dict of model results)
  • model_path (path to saved pickle file under autocure/models/)

Train a custom model (pass model_name + model_params)

To train one specific model with fixed parameters, pass model_name and model_params to train(). When both are provided, train() sets the estimator parameters, fits on the data without a grid search, evaluates on the test split, and saves the fitted pipeline to a pickle file:

  • Saved path: autocure/models/{model_name}_model.pickle

Example:

from autocure import cure, train
import pandas as pd
from sklearn import datasets

data = datasets.load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["label"] = data.target

clean = cure(df)

result_custom = train(
    clean,
    target="label",
    model_name="random_forest_classifier",
    model_params={"n_estimators": 50, "max_depth": 5}
)

print("Best model:", result_custom["best_model_name"])
print("Score:", result_custom["best_score"])
print("Saved model path:", result_custom["model_path"])

Notes:

  • model_name must match one of the identifiers in train.py's base_models mapping (for example "random_forest_classifier" or "logistic_regression"). If you add new models to base_models, you can use the new identifier here.
  • model_params should be a dict of estimator parameters (the keys are the estimator's parameter names, not pipeline step names). train() calls set_params(**model_params) on the estimator before fitting.
  • The saved pickle contains the full pipeline (preprocessing + estimator) so it can be used directly for predict().
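
The notes above distinguish estimator parameter names from pipeline step names. The sketch below walks through the same sequence (set parameters, fit, evaluate on a test split, pickle the pipeline) using scikit-learn directly. The step names, the 80/20 split, the accuracy metric, and the temporary output directory are assumptions made for illustration; train() may differ on each of these.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model_params = {"n_estimators": 50, "max_depth": 5}

# Estimator-level names: applied to the estimator itself.
estimator = RandomForestClassifier(random_state=0)
estimator.set_params(**model_params)

pipeline = Pipeline([("scale", StandardScaler()), ("model", estimator)])
# Addressed through the pipeline, the same setting would need a step prefix:
# pipeline.set_params(model__n_estimators=50)

pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)  # mean accuracy on the held-out split

# Pickle the whole pipeline so preprocessing travels with the model.
model_path = Path(tempfile.mkdtemp()) / "random_forest_classifier_model.pickle"
with open(model_path, "wb") as f:
    pickle.dump(pipeline, f)
```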

Loading a saved model

import pickle
import pandas as pd

X = pd.read_csv("my_features.csv")  # ensure same columns as training input

with open("autocure/models/random_forest_classifier_model.pickle", "rb") as f:
    model = pickle.load(f)

preds = model.predict(X)
if hasattr(model, "predict_proba"):
    probs = model.predict_proba(X)
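
The comment in the snippet above asks for the same columns as the training input. One way to enforce that before calling predict() is a small check like the following. align_features is a hypothetical helper, not part of Autocure, and the caller is assumed to know the list of training columns.

```python
import pandas as pd


def align_features(X: pd.DataFrame, expected_columns: list[str]) -> pd.DataFrame:
    """Hypothetical helper: check and reorder columns to match the training input."""
    missing = [c for c in expected_columns if c not in X.columns]
    if missing:
        raise ValueError(f"Missing columns required by the model: {missing}")
    # Drop extra columns and restore the training-time column order.
    return X[expected_columns]
```

Failing early with the list of missing columns tends to be easier to debug than an error raised from inside the pipeline.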

CLI

Examples:

# Scan a CSV
autocure scan-data data.csv

# Clean and save
autocure fix data.csv --out cleaned.csv

# Train via CLI
autocure train-model data.csv --target target_column

Tips & notes

  • cure() applies text preprocessing only to columns detected as object/string dtype.
  • For custom high-cardinality categorical handling or advanced NLP, prepare features (vectorize / embed) before calling train() or extend the preprocessing in train.py.
  • To add custom models to the AutoML search, add them to train.py's base_models dict (include estimator instance and param grid).
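
As an illustration of the last tip, the sketch below shows one plausible shape for a base_models mapping and how a grid search over it can produce a leaderboard. The dictionary layout, the "estimator" and "param_grid" keys, and the "knn_classifier" identifier are assumptions; check train.py for the structure Autocure actually expects.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical layout: identifier -> estimator instance plus its parameter grid.
base_models = {
    "logistic_regression": {
        "estimator": LogisticRegression(max_iter=1000),
        "param_grid": {"C": [0.1, 1.0]},
    },
    # A newly added model would be registered the same way.
    "knn_classifier": {
        "estimator": KNeighborsClassifier(),
        "param_grid": {"n_neighbors": [3, 5]},
    },
}

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

leaderboard = []
for name, spec in base_models.items():
    # Cross-validated search over the grid, then refit on the full training split.
    search = GridSearchCV(spec["estimator"], spec["param_grid"], cv=3)
    search.fit(X_train, y_train)
    leaderboard.append({
        "model_name": name,
        "score": search.score(X_test, y_test),
        "best_params": search.best_params_,
        "model": search.best_estimator_,
    })

# Highest test score first; the top row is the selected model.
leaderboard.sort(key=lambda row: row["score"], reverse=True)
best = leaderboard[0]
```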

Contributing

Bug reports and pull requests are welcome. Follow the repository guidelines and add tests for new behavior.

License

MIT License
