Autocure
Automated data cleaning + AutoML model training with hyperparameter tuning.
Key features
- Data scan
  - Missing values summary
  - Dtype inference
  - Outlier counts (IQR method)
  - Duplicate rows count
- Data cleaning (cure)
  - Numeric imputation (median)
  - Categorical imputation (mode / fallback "Unknown")
  - Text preprocessing (applies only to text columns):
    - remove punctuation
    - lowercase
    - tokenize
    - remove stopwords
    - lemmatize
  - Drop duplicates
- Pipeline & AutoML
  - Automatic ColumnTransformer (StandardScaler for numeric, OneHotEncoder for categorical)
  - Support for classification and regression
  - Multiple algorithms with GridSearchCV hyperparameter search
  - Model leaderboard and best-model selection
- CLI
  - scan-data, fix, train-model commands for quick workflows
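To make the data-scan items concrete, here is a minimal pandas sketch of the same four checks. It is an approximation written for illustration, not autocure's own code: the function name `scan_sketch`, the report keys, and the 1.5 x IQR fence multiplier are all assumptions.

```python
import pandas as pd


def scan_sketch(df: pd.DataFrame) -> dict:
    """Rough approximation of the scan metrics; not autocure's implementation."""
    outliers = {}
    for col in df.select_dtypes(include="number").columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        # Tukey fences (assumed multiplier 1.5): values beyond them count as outliers.
        mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
        outliers[col] = int(mask.sum())
    return {
        "missing": df.isna().sum().to_dict(),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "outliers": outliers,
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Small made-up frame: one missing value per column, one extreme age, one repeated row.
df = pd.DataFrame(
    {
        "age": [21, 22, 23, 24, 25, 95, 22, None],
        "city": ["Paris", "Rome", None, "Rome", "Oslo", "Oslo", "Rome", "Rome"],
    }
)
report = scan_sketch(df)
print(report)
```

Missing values are ignored when computing the quartiles and never count as outliers, because comparisons against NaN evaluate to False.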
Installation
From the project root (editable install for development):
python -m pip install -e .
Or install the dependencies directly:
pip install -r requirements.txt
pip install nltk
cure() downloads the NLTK data it needs (punkt, stopwords, wordnet) at runtime, so its first call requires network access.
Quick usage
Python:
import autocure as ac
import pandas as pd
df = pd.read_csv("your_dataset.csv")
# 1) Inspect dataset
report = ac.scan(df)
print(report)
# 2) Clean dataset
clean = ac.cure(df)
# 3) Create pipeline (optional)
pipeline = ac.make_pipeline(clean, target="target_column")
# 4) Train models (returns a dict)
result = ac.train(clean, target="target_column")
print(result["best_model_name"])
print("score:", result["best_score"])
# Best model object for predict:
best_model = result["best_model"]
preds = best_model.predict(clean.drop(columns=["target_column"]).iloc[:5])
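Step 2 above relies on the imputation rules listed under Key features. The sketch below approximates them in plain pandas so the behavior is easier to reason about. It is not autocure's code: `impute_sketch` is a hypothetical helper, and treating every non-numeric column as categorical is an assumption.

```python
import pandas as pd


def impute_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Median/mode imputation plus de-duplication; an approximation only."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Numeric columns: fill gaps with the column median.
            out[col] = out[col].fillna(out[col].median())
        else:
            mode = out[col].mode(dropna=True)
            # Fall back to "Unknown" when the column has no observed value at all.
            fill = mode.iloc[0] if not mode.empty else "Unknown"
            out[col] = out[col].fillna(fill)
    return out.drop_duplicates().reset_index(drop=True)


# Made-up data: "note" is entirely empty, so it exercises the "Unknown" fallback.
raw = pd.DataFrame(
    {
        "age": [30, None, 50, 30],
        "city": ["Rome", None, "Rome", "Rome"],
        "note": [None, None, None, None],
    }
)
filled = impute_sketch(raw)
print(filled)
```

Duplicates are dropped after imputation here, so rows that become identical once their gaps are filled are also collapsed. Whether cure() uses the same order is not stated in this README.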
Returned keys from train():
- best_model (fitted sklearn estimator / pipeline)
- best_model_name
- best_score
- best_params
- leaderboard (list/dict of model results)
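For readers who want to see how a result with these keys could be assembled, the following scikit-learn sketch runs GridSearchCV over two classifiers behind a ColumnTransformer. It is an illustration under stated assumptions, not the code behind train(): the candidate models, the parameter grids, `cv=3`, and the leaderboard layout are all hypothetical, and only the classification case is shown.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def train_sketch(df: pd.DataFrame, target: str) -> dict:
    """Grid-search a few classifiers and return a train()-like result dict."""
    X, y = df.drop(columns=[target]), df[target]
    numeric = X.select_dtypes(include="number").columns.tolist()
    categorical = [c for c in X.columns if c not in numeric]
    preprocess = ColumnTransformer(
        [
            ("num", StandardScaler(), numeric),
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        ]
    )
    # Hypothetical candidates and grids, chosen only for illustration.
    candidates = {
        "logreg": (LogisticRegression(max_iter=1000), {"model__C": [0.1, 1.0]}),
        "forest": (RandomForestClassifier(random_state=0), {"model__n_estimators": [10, 30]}),
    }
    searches = {}
    for name, (estimator, grid) in candidates.items():
        pipe = Pipeline([("prep", preprocess), ("model", estimator)])
        searches[name] = GridSearchCV(pipe, grid, cv=3).fit(X, y)
    # Rank models by their best cross-validated score, highest first.
    ranked = sorted(searches.items(), key=lambda kv: kv[1].best_score_, reverse=True)
    best_name, best_search = ranked[0]
    return {
        "best_model": best_search.best_estimator_,
        "best_model_name": best_name,
        "best_score": best_search.best_score_,
        "best_params": best_search.best_params_,
        "leaderboard": [
            {"model": n, "score": s.best_score_, "params": s.best_params_} for n, s in ranked
        ],
    }


# Tiny made-up dataset with one numeric and one categorical feature.
data = pd.DataFrame(
    {
        "income": [20, 25, 30, 35, 40, 80, 85, 90, 95, 100, 22, 88],
        "segment": ["a", "a", "b", "a", "b", "c", "c", "b", "c", "c", "a", "c"],
        "target_column": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1],
    }
)
sketch_result = train_sketch(data, target="target_column")
sketch_preds = sketch_result["best_model"].predict(data.drop(columns=["target_column"]).iloc[:3])
print(sketch_result["best_model_name"], sketch_result["best_score"])
```

Because the preprocessing lives inside the Pipeline, each cross-validation fold fits the scaler and encoder on its own training split only. A regression variant would swap in regressors and a regression scoring metric.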
CLI
Examples:
# Scan a CSV
autocure scan-data data.csv
# Clean and save
autocure fix data.csv --out cleaned.csv
# Train via CLI
autocure train-model data.csv --target target_column
Tips & notes
- cure() applies text preprocessing only to columns detected as object/string dtype.
- train() uses simple rules to detect low-cardinality categorical columns and to choose between classification and regression; to override them, prepare the DataFrame accordingly before calling train().
- For large text datasets, consider vectorizing the text or using a specialized NLP pipeline before running an expensive GridSearchCV.
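As a rough picture of the text steps that cure() applies to object/string columns, the sketch below follows the order listed under Key features using only the standard library. It is an approximation, not autocure's implementation: `normalize_text` and the short `STOPWORDS` set are hypothetical, tokenization is plain whitespace splitting rather than NLTK's punkt, and lemmatization is left out because it needs the WordNet data.

```python
import string

# Tiny illustrative stopword list; NLTK's English list is much longer.
STOPWORDS = {"a", "an", "the", "is", "are", "and", "or", "of", "to", "in"}


def normalize_text(text: str) -> list[str]:
    """Remove punctuation, lowercase, tokenize on whitespace, drop stopwords."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.lower().split()
    return [tok for tok in tokens if tok not in STOPWORDS]


tokens = normalize_text("The cats are sitting in the garden, and the dog is barking!")
print(tokens)  # ['cats', 'sitting', 'garden', 'dog', 'barking']
```

A lemmatizer would additionally reduce inflected forms to their base form, which is the part that depends on downloaded NLTK data.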
Contributing
Bug reports and PRs are welcome. Follow the repository guidelines and add tests for new behavior.
License
MIT License
Download files
File details
Details for the file autocure-0.2.0.tar.gz.
File metadata
- Download URL: autocure-0.2.0.tar.gz
- Upload date:
- Size: 7.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8dbb6fc6ef7b7e7c542c36c15ba9076cae6045269963cd7517d710945c6655c6 |
| MD5 | 6a98360d1527283823794eaae0f96fd1 |
| BLAKE2b-256 | fff971d7712201efd4c057bd7bb992d8adc9109962f11e20ba425d36ed2e9cef |
File details
Details for the file autocure-0.2.0-py3-none-any.whl.
File metadata
- Download URL: autocure-0.2.0-py3-none-any.whl
- Upload date:
- Size: 7.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f5069856078b949e07af8f140fbf93a3f65c76e8ca9fd8909e30fd5aeeffd094 |
| MD5 | 55d0e115b9340b051eff934085f9d352 |
| BLAKE2b-256 | 12aec43b4c83245599d6d1e10fbd043af41a84079df5ac77d9c959475d690392 |