Autocure

Automated data cleaning + AutoML model training with hyperparameter tuning.

Key features

  • Data scan

    • Missing values summary
    • Dtype inference
    • Outlier counts (IQR method)
    • Duplicate rows count
  • Data cleaning (cure)

    • Numeric imputation (median)
    • Categorical imputation (mode / fallback "Unknown")
    • Text preprocessing (applies only to text columns):
      • remove punctuation
      • lowercase
      • tokenize
      • remove stopwords
      • lemmatize
    • Drop duplicates
  • Pipeline & AutoML

    • Automatic ColumnTransformer (StandardScaler for numeric, OneHotEncoder for categorical)
    • Support for classification and regression
    • Multiple algorithms with GridSearchCV hyperparameter search
    • Model leaderboard and best-model selection
  • CLI

    • scan-data, fix, train-model commands for quick workflows
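The scan and cleaning steps listed above can be approximated with plain pandas. The following is a minimal sketch, not Autocure's actual implementation: the function names scan_sketch and cure_sketch, the report keys, and the 1.5 × IQR fence are assumptions made for illustration, and the text preprocessing steps are left out.

```python
import pandas as pd


def scan_sketch(df: pd.DataFrame) -> dict:
    """Summarize missing values, dtypes, IQR outliers and duplicate rows."""
    outliers = {}
    for col in df.select_dtypes(include="number").columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        # Assumed fence: values more than 1.5 * IQR beyond the quartiles.
        mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
        outliers[col] = int(mask.sum())
    return {
        "missing": df.isna().sum().to_dict(),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "outliers": outliers,
        "duplicate_rows": int(df.duplicated().sum()),
    }


def cure_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Impute numeric columns with the median, others with the mode."""
    out = df.copy()
    numeric_cols = set(out.select_dtypes(include="number").columns)
    for col in out.columns:
        if col in numeric_cols:
            out[col] = out[col].fillna(out[col].median())
        else:
            mode = out[col].mode(dropna=True)
            # Fall back to "Unknown" when the column has no observed value.
            fill = mode.iloc[0] if not mode.empty else "Unknown"
            out[col] = out[col].fillna(fill)
    # Rows that become identical after imputation are dropped as well.
    return out.drop_duplicates().reset_index(drop=True)
```

Imputing before dropping duplicates is a design choice of this sketch; the order used by cure() is not stated here.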

Installation

From the project root (editable install for development):

python -m pip install -e .

Or install dependencies:

pip install -r requirements.txt
pip install nltk

cure() downloads the NLTK data it needs (punkt, stopwords, wordnet) at runtime, so the first call requires network access.

Quick usage

Python:

import autocure as ac
import pandas as pd

df = pd.read_csv("your_dataset.csv")

# 1) Inspect dataset
report = ac.scan(df)
print(report)

# 2) Clean dataset
clean = ac.cure(df)

# 3) Create pipeline (optional)
pipeline = ac.make_pipeline(clean, target="target_column")

# 4) Train models (returns a dict)
result = ac.train(clean, target="target_column")

print(result["best_model_name"])
print("score:", result["best_score"])
# Best model object for predict:
best_model = result["best_model"]
preds = best_model.predict(clean.drop(columns=["target_column"]).iloc[:5])

Returned keys from train():

  • best_model (fitted sklearn estimator / pipeline)
  • best_model_name
  • best_score
  • best_params
  • leaderboard (list/dict of model results)
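To make the shape of that result concrete, here is a rough sketch of how such a dict could be assembled with scikit-learn. It is not the code behind train(): the function name train_sketch, the two candidate models, their parameter grids, and the leaderboard row format are all hypothetical, and only classification is covered.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier


def train_sketch(df: pd.DataFrame, target: str, cv: int = 3) -> dict:
    """Grid-search a few classifiers and return the best one plus a leaderboard."""
    X, y = df.drop(columns=[target]), df[target]
    numeric = list(X.select_dtypes(include="number").columns)
    categorical = [c for c in X.columns if c not in numeric]
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])
    # Hypothetical candidates and grids, chosen only for illustration.
    candidates = {
        "logistic_regression": (
            LogisticRegression(max_iter=1000),
            {"model__C": [0.1, 1.0]},
        ),
        "decision_tree": (
            DecisionTreeClassifier(random_state=0),
            {"model__max_depth": [2, 4]},
        ),
    }
    leaderboard, best = [], None
    for name, (estimator, grid) in candidates.items():
        pipe = Pipeline([("preprocess", preprocess), ("model", estimator)])
        search = GridSearchCV(pipe, grid, cv=cv).fit(X, y)
        leaderboard.append({
            "name": name,
            "score": search.best_score_,
            "params": search.best_params_,
        })
        # Keep the search with the highest cross-validated score.
        if best is None or search.best_score_ > best[1].best_score_:
            best = (name, search)
    leaderboard.sort(key=lambda row: row["score"], reverse=True)
    best_name, best_search = best
    return {
        "best_model": best_search.best_estimator_,
        "best_model_name": best_name,
        "best_score": best_search.best_score_,
        "best_params": best_search.best_params_,
        "leaderboard": leaderboard,
    }
```

Because preprocessing lives inside the pipeline, the scaler and encoder are refit on each cross-validation fold, and the returned best_model accepts raw feature columns in predict().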

CLI

Examples:

# Scan a CSV
autocure scan-data data.csv

# Clean and save
autocure fix data.csv --out cleaned.csv

# Train via CLI
autocure train-model data.csv --target target_column

Tips & notes

  • cure() applies text preprocessing only to columns detected as object/string dtype.
  • train() uses simple rules to detect low-cardinality categorical columns and to choose between classification and regression; to override them, prepare the DataFrame accordingly before calling train().
  • For large text datasets, consider vectorizing the text or using a specialized NLP pipeline before running an expensive GridSearchCV search.
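As an illustration of the vectorization tip, the sketch below turns raw text into TF-IDF features with scikit-learn and fits a single inexpensive baseline before any grid search is attempted. It is independent of Autocure: the function name text_baseline, the max_features cap, and the choice of logistic regression are assumptions made for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def text_baseline(texts, labels, max_features=5000):
    """Fit a TF-IDF + logistic regression baseline on raw text."""
    model = make_pipeline(
        # Cap the vocabulary so the feature matrix stays manageable.
        TfidfVectorizer(lowercase=True, stop_words="english",
                        max_features=max_features),
        LogisticRegression(max_iter=1000),
    )
    return model.fit(texts, labels)
```

A baseline like this gives a reference score cheaply; a hyperparameter search can then be limited to the settings that look worth tuning.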

Contributing

Bug reports and PRs welcome. Follow repository guidelines and add tests for new behavior.

License

MIT License
