Automatically Build Variant Interpretable ML models fast - now with CatBoost!

Auto-ViML

Automatically Build Various Interpretable ML models fast!

Auto_ViML is pronounced as "auto vimal" (autovimal logo created by Sanket Ghanmare).

Update (Jan 2025)

Auto_ViML has been upgraded to version 0.2, which runs on Python 3.12 or greater and on pandas 2.0. This is a big improvement for anyone working in Colab, Kaggle and other up-to-date kernels. Check the `requirements.txt` file for the recommended versions.

Update (March 2023)

Auto_ViML has a new flag that speeds up processing using GPUs. Just set `GPU_flag=True` on Colab and other environments. Remember to set the runtime type to "GPU" when running on Colab; otherwise you will get an error.

Update (May 2022)

As of version 0.1.710, Auto_ViML uses a high-performance library called `imbalanced_ensemble` for imbalanced dataset problems. In my experience with many datasets, it produces a 5-10% boost in balanced_accuracy.
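To make the metric concrete, here is a minimal sketch of balanced accuracy using its standard definition (the mean of per-class recall). The function name and the toy data are illustrative and are not part of Auto_ViML.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall: every class counts equally, however rare."""
    total = defaultdict(int)    # samples seen per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# 95 majority-class rows and 5 rare-class rows; the model always predicts 0.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                            # 0.95
print(balanced_accuracy(y_true, y_pred))   # 0.5
```

Plain accuracy rewards the model for ignoring the rare class, while balanced accuracy does not, which is why it is the more useful score on imbalanced data.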

In addition, new features in this version are:

SULOV -> Uses the SULOV algorithm for removing highly correlated features automatically.
Auto_NLP -> Auto_ViML automatically detects text variables and runs NLP processing on them using Auto_NLP
Date Time -> It automatically detects date time variables and generates new features
`imbalanced_ensemble` library -> Uses imbalanced_ensemble library for imbalanced data. Just set Imbalanced_Flag = True in arguments
Feature Selection -> We use the same algorithm that featurewiz library uses: SULOV and Recursive XGBoost to select best features fast. See below.
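As an illustration of the date-time item above, the sketch below expands one date string into a few numeric features. The exact features Auto_ViML generates are not listed here, so the feature names and the `date_features` helper are assumptions made for this example.

```python
from datetime import datetime


def date_features(value, fmt="%Y-%m-%d"):
    """Expand one date string into a few numeric features (illustrative set)."""
    dt = datetime.strptime(value, fmt)
    return {
        "year": dt.year,
        "month": dt.month,
        "day": dt.day,
        "dayofweek": dt.weekday(),             # Monday=0 ... Sunday=6
        "is_weekend": int(dt.weekday() >= 5),  # 1 for Saturday or Sunday
    }


print(date_features("2025-01-30"))
```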

Background
Install
Usage
Tips for using Auto_ViML
API
Maintainers
Contributing
License

Background

Read this Medium article to learn how to use Auto_ViML effectively.

Auto_ViML was designed for building High Performance Interpretable Models with the fewest variables needed. The "V" in Auto_ViML stands for Variant because it tries multiple models with multiple features to find you the best performing model for your dataset. The "i" in Auto_ViML stands for "interpretable" since Auto_ViML selects the least number of features necessary to build a simpler, more interpretable model. In most cases, Auto_ViML builds models with 20%-99% fewer features than a similarly performing model that includes all features (this is based on my trials; your experience may vary).

Auto_ViML is every Data Scientist's model accelerator tool that:

Helps you with data cleaning: you can send in your entire dataframe as is and Auto_ViML will suggest changes to help with missing values, formatting variables, adding variables, etc. It loves dirty data. The dirtier the better!
Performs Feature Selection: Auto_ViML selects variables automatically by default. This is very helpful when you have hundreds if not thousands of variables since it can readily identify which of those are important variables vs which are unnecessary. You can turn this off as well (see API).

Removes highly correlated features automatically. If two variables are highly correlated in your dataset, which one should you remove and which one should you keep? The decision is not as easy as it looks. Auto_ViML uses the SULOV algorithm to remove highly correlated features. You can understand SULOV from this picture more intuitively.

[Picture: SULOV]

Generates performance results graphically. Just set verbose=1 or verbose=2 instead of 0 (silent). Higher verbosity gives you more detailed insights.
Handles text, date-time, structs (lists, dictionaries), numeric, boolean, factor and categorical variables all in one model using one straight process.

Auto_ViML is built using scikit-learn, numpy, pandas and matplotlib. It should run on most Python 3 Anaconda installations. You won't have to import any special libraries other than "XGBoost", "Imbalanced-Learn", "CatBoost", and "featuretools" library. We use "SHAP" library for interpretability.
But if you don't have these libraries, Auto_ViML will install those for you automatically.

Install

Prerequisites:

Anaconda

It is best to create a new environment before installing Auto_ViML and its dependencies.

To install from PyPI:

pip install autoviml --upgrade --ignore-installed

To install the latest code directly from GitHub:

pip install git+https://github.com/AutoViML/Auto_ViML.git

To install from source:

cd <AutoVIML_Destination>
git clone git@github.com:AutoViML/Auto_ViML.git
# or download and unzip https://github.com/AutoViML/Auto_ViML/archive/master.zip
conda create -n <your_env_name> python=3.12 anaconda
conda activate <your_env_name> # on older conda versions: `source activate <your_env_name>` (Linux/macOS) or `activate <your_env_name>` (Windows)
cd Auto_ViML
pip install -r requirements.txt

Usage

In the same directory, open a Jupyter Notebook and use this line to import the .py file:

from autoviml.Auto_ViML import Auto_ViML

Load a data set (any CSV or text file) into a Pandas dataframe and split it into Train and Test dataframes. If you don't have a test dataframe, you can simply assign the test variable below to '' (empty string):

model, features, trainm, testm = Auto_ViML(
    train,
    target,
    test,
    sample_submission,
    hyper_param="GS",
    feature_reduction=True,
    scoring_parameter="weighted-f1",
    KMeans_Featurizer=False,
    Boosting_Flag=False,
    Binning_Flag=False,
    Add_Poly=False,
    Stacking_Flag=False,
    Imbalanced_Flag=False,
    verbose=0,
)

Finally, it writes your submission file to the current directory as mysubmission.csv. This file is ready to show to clients or submit to competitions. If you give no submission file but do give a test file name, it will create a submission file for you named mySubmission.csv. Auto_ViML works on any Multi-Class, Multi-Label data set, so you can have many target labels. You don't have to tell Auto_ViML whether it is a Regression or Classification problem.
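The train/test split mentioned in the usage steps does not have to happen in pandas, because the API section says `train` and `test` may also be given as file paths. The sketch below splits one CSV file into two using only the standard library. The `split_csv` helper, its 20% default and its seed are assumptions for illustration, not part of Auto_ViML.

```python
import csv
import random


def split_csv(path, train_path, test_path, test_fraction=0.2, seed=42):
    """Shuffle the rows of a CSV file and write train/test files with the same header.

    Returns (number of train rows, number of test rows).
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible
    n_test = int(len(rows) * test_fraction)
    for out_path, part in ((test_path, rows[:n_test]), (train_path, rows[n_test:])):
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(part)
    return len(rows) - n_test, n_test
```

After a call such as `split_csv("data.csv", "train.csv", "test.csv")`, the two output paths can be passed as the `train` and `test` arguments.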

Tips for using Auto_ViML:

scoring_parameter: For Classification problems and imbalanced classes, choose scoring_parameter="balanced_accuracy". It works better.
Imbalanced_Flag: For Imbalanced Classes (<5% samples in rare class), choose Imbalanced_Flag=True. You can also set this flag to True for Regression problems where the target variable might have a skewed distribution.
target: For a Multi-Label dataset, the target variable can be sent in as a list of variables.
Boosting_Flag: First set Boosting_Flag=None to get a Linear model. Once you understand that, try Boosting_Flag=False to get a Random Forest model, then Boosting_Flag=True to get an XGBoost model. This is the order we recommend. As a last step, try Boosting_Flag="CatBoost" to get a complex but high performing model.
Binning_Flag: Binning_Flag=True improves a CatBoost model since it adds to the list of categorical vars in data
KMeans_Featurizer: KMeans_Featurizer=True works well in NLP and CatBoost models since it creates cluster variables
Add_Poly: Add_Poly=3 improves certain models where there are date-time or categorical and numeric variables
feature_reduction: feature_reduction=True is the default and works best. But when you have <10 features in data, set it to False
Stacking_Flag: Do not set Stacking_Flag=True with Linear models since your results may not look great.
Stacking_Flag: Use Stacking_Flag=True only for complex models and as a last step with Boosting_Flag=True or CatBoost
hyper_param: Leave hyper_param="RS" as input since it runs faster than GridSearchCV and gives better results, unless you have a small data set and can afford to spend time on hyperparameter tuning.
KMeans_Featurizer: KMeans_Featurizer=True does not work well for small data sets. Use it for data sets > 10,000 rows.
Final thoughts: Auto_ViML is meant to be a baseline or challenger solution to your data set. So use it for making quick models that you can compare against, or in Hackathons. It is not meant for production!
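Some of the tips above can be written down as a small helper that turns dataset facts into keyword arguments. This is only a sketch of those rules of thumb: `suggest_settings` and `BOOSTING_ORDER` are hypothetical names, and the actual call is left commented out because it needs autoviml and your data.

```python
# Order recommended in the tips: linear first, then bagging, then boosting.
BOOSTING_ORDER = [
    (None, "Linear model"),
    (False, "Random Forest"),
    (True, "XGBoost"),
    ("CatBoost", "CatBoost"),
]


def suggest_settings(n_rows, n_features, rare_class_fraction=None):
    """Translate some of the rules of thumb above into keyword arguments."""
    settings = {
        "hyper_param": "RS",                    # faster than grid search
        "feature_reduction": n_features >= 10,  # turn off below 10 features
        "Imbalanced_Flag": False,
        "KMeans_Featurizer": False,             # n_rows > 10,000 only makes it worth trying
    }
    if rare_class_fraction is not None and rare_class_fraction < 0.05:
        settings["Imbalanced_Flag"] = True
        settings["scoring_parameter"] = "balanced_accuracy"
    return settings


for flag, label in BOOSTING_ORDER:
    kwargs = dict(suggest_settings(50_000, 120, rare_class_fraction=0.03), Boosting_Flag=flag)
    print(label, kwargs)
    # model, features, trainm, testm = Auto_ViML(train, target, test, "", **kwargs)
```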

API

Arguments

train: could be a datapath+filename or a dataframe. It will detect which is which and load it.
test: could be a datapath+filename or a dataframe. If you don't have one, just leave it as "".
sample_submission: must be a datapath+filename. If you don't have one, just leave it as an empty string.
target: name of the target variable in the data set.
sep: if you have a separator in the file such as "," or "\t" mention it here. Default is ",".
scoring_parameter: if you want your own scoring parameter such as "f1" give it here. If not, it will assume the appropriate scoring param for the problem and it will build the model.
hyper_param: Tuning options are GridSearch ('GS') and RandomizedSearch ('RS'). Default is 'RS'.
feature_reduction: Default is True. Set it to False if you don't want automatic feature reduction; on image data sets like digits and MNIST, you get better results when features are not reduced automatically. You can always try both and see.
KMeans_Featurizer
- True: Adds a cluster label to features based on KMeans. Use for Linear.
- False (default) For Random Forests or XGB models, leave it False since it may overfit.
Boosting_Flag: you have 4 possible choices (default is False):
- None This will build a Linear Model
- False This will build a Random Forest or Extra Trees model (also known as Bagging)
- True This will build an XGBoost model
- "CatBoost" This will build a CatBoost model (provided you have CatBoost installed)
Add_Poly: Default is 0 which means do-nothing. But it has three interesting settings:
- 1 Add interaction variables only such as x1*x2, x2*x3, ... x9*x10 etc.
- 2 Add Interactions and Squared variables such as x1**2, x2**2, etc.
- 3 Adds both Interactions and Squared variables such as x1*x2, x1**2, x2*x3, x2**2, etc.
Stacking_Flag: Default is False. If set to True, it will add an additional feature which is derived from predictions of another model. This is used in some cases but may result in overfitting. So be careful turning this flag "on".
Binning_Flag: Default is False. If set to True, it will convert the top numeric variables into binned variables through a technique known as "Entropy" binning. This is very helpful for certain datasets (especially ones that are hard to build models for).
Imbalanced_Flag: Default is False. Uses imbalanced_ensemble library for imbalanced data. Just set Imbalanced_Flag = True in arguments
verbose: This has 3 possible states:
- 0 limited output. Great for running this silently and getting fast results.
- 1 more charts. Great for knowing how results were and making changes to flags in input.
- 2 lots of charts and output. Great for reproducing what Auto_ViML does on your own.
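To show what the interaction and squared variables mentioned under Add_Poly look like, here is a small sketch on a single row of numeric features. It only illustrates the two kinds of derived features; it does not reproduce how Auto_ViML builds them, and the `interactions` and `squares` helpers are hypothetical names.

```python
from itertools import combinations


def interactions(row):
    """Products of every pair of numeric features, e.g. x1*x2."""
    return {f"{a}*{b}": row[a] * row[b] for a, b in combinations(row, 2)}


def squares(row):
    """Square of every numeric feature, e.g. x1**2."""
    return {f"{a}**2": row[a] ** 2 for a in row}


row = {"x1": 2.0, "x2": 3.0, "x3": 5.0}
expanded = {**row, **interactions(row), **squares(row)}
print(expanded)
```

With n numeric features this adds n*(n-1)/2 interaction columns and n squared columns, so the feature count grows quickly on wide data sets.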

Return values

model: It will return your trained model
features: the smallest list of features the model needs to perform well
train_modified: this is the modified train dataframe after removing and adding features
test_modified: this is the modified test dataframe with the same transformations as train

Maintainers

Contributing

See the contributing file!

PRs accepted.

License

DISCLAIMER

This project is not an official Google project. It is not supported by Google and Google specifically disclaims all warranties as to its quality, merchantability, or fitness for a particular purpose.
