Package for fast exploration and experimentation of ML tasks

# Automated Tool for Optimized Modelling

Author: tvdboom

Email: m.524687@gmail.com

## Description

Automated Tool for Optimized Modelling (ATOM) is a Python package designed for fast exploration and experimentation of supervised machine learning tasks. With just a few lines of code, you can perform basic data cleaning steps, select features and compare the performance of multiple models on a given dataset. ATOM should be able to provide quick insights into which algorithms perform best for the task at hand and give an indication of the feasibility of the ML solution. This package supports binary classification, multiclass classification, and regression tasks.

NOTE: A data scientist with knowledge of the data will quickly outperform ATOM by applying use-case-specific feature engineering or data cleaning methods. Use ATOM only for a fast exploration of the problem! |
---|

Possible steps taken by the ATOM pipeline:

- Data cleaning
    - Handle missing values
    - Encode categorical features
    - Balance the dataset
    - Remove outliers
- Perform feature selection
    - Remove features with too high collinearity
    - Remove features with too low variance
    - Select best features according to a chosen strategy
- Fit all selected models (either directly or via successive halving)
    - Select hyperparameters using a Bayesian Optimization approach
    - Perform bagging to assess the robustness of the model
- Analyze the results using the provided plotting functions!

## Installation

Install ATOM easily using `pip`:

```
pip install atom-ml
```

## Usage

Call the `ATOMClassifier` or `ATOMRegressor` class and provide the data you want to use:

```
from atom import ATOMClassifier
atom = ATOMClassifier(X, y, log='atom_log', n_jobs=2, verbose=1)
```

ATOM has multiple data cleaning methods to help you prepare the data for modelling:

```
atom.impute(strat_num='knn', strat_cat='most_frequent', max_frac_rows=0.1)
atom.encode(max_onehot=10, frac_to_other=0.05)
atom.outliers(max_sigma=4)
atom.balance(oversample=0.8, n_neighbors=15)
atom.feature_selection(strategy='univariate', solver='chi2', max_features=0.9)
```

Fit the data to different models:

```
atom.fit(models=['LR', 'LDA', 'XGB', 'lSVM'],
         metric='f1',
         max_iter=10,
         max_time=1000,
         init_points=3,
         cv=4,
         bagging=10)
```

Make plots and analyze results:

```
atom.plot_bagging(filename='bagging_results.png')
atom.lSVM.plot_probabilities()
atom.lda.plot_confusion_matrix()
```

## API

**ATOMClassifier(X, y=None, percentage=100, test_size=0.3, log=None, n_jobs=1, warnings=False, verbose=0, random_state=None)**

ATOM class for classification tasks. When initializing the class, ATOM automatically applies some standard data cleaning steps to the data. These steps include transforming the input data into a pd.DataFrame (if it wasn't one already) that can be accessed through the class' attributes, removing columns with prohibited data types, removing categorical columns with maximal cardinality (the number of unique values is equal to the number of instances, which is usually the case for IDs, names, etc.), removing features with all the same value, removing duplicate rows and removing rows with missing values in the target column.

**X: list, np.array or pd.DataFrame**

Data features with shape=(n_samples, n_features).

**y: None, string, list, np.array or pd.Series, optional (default=None)**

- If None: the last column of X is selected as target column
- If string: name of the target column in X (X has to be a pd.DataFrame)
- Else: data target column with shape=(n_samples,)

**percentage: int or float, optional (default=100)**

Percentage of the data to use.

**test_size: float, optional (default=0.3)**

Split ratio of the train and test set.

**log: None or string, optional (default=None)**

Name of the log file. None to not save any log.

**n_jobs: int, optional (default=1)**

Number of cores to use for parallel processing.

- If -1, use all available cores
- If <-1, use available_cores - 1 + n_jobs

**warnings: bool, optional (default=False)**

Whether to show warnings when running the pipeline.

**verbose: int, optional (default=0)**

Verbosity level of the class. Possible values are:

- 0 to not print anything
- 1 to print minimum information
- 2 to print average information
- 3 to print maximum information

**random_state: None or int, optional (default=None)**

Seed used by the random number generator. If None, the random number generator is the RandomState instance used by `np.random`.

**ATOMRegressor(X, y=None, target=None, percentage=100, test_size=0.3, log=None, n_jobs=1, warnings=False, verbose=0, random_state=None)**

ATOM class for regression tasks. See `ATOMClassifier` for an explanation of the class' parameters.

## Methods

ATOM contains multiple methods for standard data cleaning and feature selection processes. Calling one of them automatically applies the method to the dataset in the class and updates the class' attributes accordingly.

TIP: Use the `report` method to examine the data and help you determine suitable parameters for the methods |
---|

**impute(strat_num='remove', strat_cat='remove', max_frac_rows=0.5, max_frac_cols=0.5, missing=[None, np.nan, np.inf, -np.inf, '', '?', 'NA', 'nan', 'inf'])**

Handle missing values according to the selected strategy. Also removes rows and columns with too many missing values.

**strat_num: int, float or string, optional (default='remove')**

Imputing strategy for numerical columns. Possible values are:

- 'remove': remove row
- 'mean': impute with mean of column
- 'median': impute with median of column
- 'knn': impute using k-Nearest Neighbors
- 'most_frequent': impute with most frequent value
- int or float: impute with provided numerical value

**strat_cat: string, optional (default='remove')**

Imputing strategy for categorical columns. Possible values are:

- 'remove': remove row
- 'most_frequent': impute with most frequent value
- string: impute with provided string

**max_frac_rows: float, optional (default=0.5)**

Minimum fraction of non-missing values in a row. If less, the row is removed.

**max_frac_cols: float, optional (default=0.5)**

Minimum fraction of non-missing values in a column. If less, the column is removed.

**missing: value or list of values, optional (default=[None, np.nan, np.inf, -np.inf, '', '?', 'NA', 'nan', 'inf'])**

List of values to consider as missing. None, np.nan, np.inf and -np.inf are always imputed since they are incompatible with the models.
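As a rough illustration of the two steps `impute` is described as taking (dropping rows with too few non-missing values, then imputing what remains), here is a minimal pure-Python sketch. It is not ATOM's implementation; the helper names `is_missing`, `drop_sparse_rows` and `impute_mean` and the toy data are assumptions made for this example.

```python
import math

# Values treated as missing in this sketch (mirrors the documented default list).
MISSING = (None, "", "?", "NA", "nan", "inf")


def is_missing(value):
    """Return True if the value counts as missing in this sketch."""
    if isinstance(value, float) and (math.isnan(value) or math.isinf(value)):
        return True
    return value in MISSING


def drop_sparse_rows(rows, max_frac_rows=0.5):
    """Keep rows whose fraction of non-missing values is at least max_frac_rows."""
    kept = []
    for row in rows:
        present = sum(not is_missing(v) for v in row)
        if present / len(row) >= max_frac_rows:
            kept.append(row)
    return kept


def impute_mean(rows, col):
    """Replace missing values in one numerical column with the column mean."""
    observed = [row[col] for row in rows if not is_missing(row[col])]
    mean = sum(observed) / len(observed)
    return [
        [mean if i == col and is_missing(v) else v for i, v in enumerate(row)]
        for row in rows
    ]


rows = [
    [1.0, 10.0, "a"],
    [None, 20.0, "b"],
    [None, float("nan"), "?"],  # no usable values, so the row is dropped
    [3.0, 40.0, "c"],
]
rows = drop_sparse_rows(rows, max_frac_rows=0.5)
rows = impute_mean(rows, col=0)
```

In this sketch the row filter runs first, so the column mean is computed only from rows that survive it.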

**encode(max_onehot=10, frac_to_other=0)**

Perform encoding of categorical features. The encoding type depends on the number of unique values in the column: label-encoding for n_unique=2, one-hot-encoding for 2 < n_unique <= max_onehot and target-encoding for n_unique > max_onehot. It can also replace classes with low occurrences with the value 'other' in order to prevent too high cardinality.

**max_onehot: None or int, optional (default=10)**

Maximum number of unique values in a feature to perform one-hot-encoding. If None, it will never perform one-hot-encoding.

**frac_to_other: float, optional (default=0)**

Classes with fewer instances than n_rows * frac_to_other are replaced with 'other'.
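The encoding rule above can be written down as a small decision function. The sketch below only restates the documented thresholds in plain Python; `choose_encoding` and `group_rare_classes` are hypothetical helper names, not part of ATOM, and the order in which the two steps run is an assumption of this example.

```python
from collections import Counter


def choose_encoding(values, max_onehot=10):
    """Pick the encoding type from the number of unique values in a column."""
    n_unique = len(set(values))
    if n_unique == 2:
        return "label"
    if max_onehot is not None and 2 < n_unique <= max_onehot:
        return "one-hot"
    # Remaining case. Columns with fewer than 2 unique values are not covered
    # by the documented rule; this sketch simply lets them fall through here.
    return "target"


def group_rare_classes(values, frac_to_other=0.05):
    """Replace classes with fewer than n_rows * frac_to_other instances by 'other'."""
    threshold = len(values) * frac_to_other
    counts = Counter(values)
    return [v if counts[v] >= threshold else "other" for v in values]


column = ["red"] * 50 + ["green"] * 45 + ["blue"] * 4 + ["pink"] * 1
grouped = group_rare_classes(column, frac_to_other=0.05)
encoding = choose_encoding(grouped, max_onehot=10)
```

Grouping rare classes first lowers the cardinality, which can move a column from target-encoding to one-hot-encoding.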

**outliers(max_sigma=3, include_target=False)**

Remove outliers from the training set.

**max_sigma: int or float, optional (default=3)**

Remove rows containing any value that lies more than max_sigma standard deviations from the mean of its column.

**include_target: bool, optional (default=False)**

Whether to include the target column when searching for outliers.
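One way to read the `max_sigma` rule is as a per-column z-score filter. The following sketch assumes that interpretation (population standard deviation, absolute deviation from the column mean); ATOM's exact computation may differ, and `remove_outliers` here is a stand-alone hypothetical function, not the method itself.

```python
from statistics import mean, pstdev


def remove_outliers(rows, max_sigma=3):
    """Drop rows with any value more than max_sigma std devs from its column mean."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    kept = []
    for row in rows:
        # Constant columns (std == 0) cannot contain outliers in this sketch.
        z_scores = [
            abs(v - m) / s if s > 0 else 0.0
            for v, m, s in zip(row, means, stds)
        ]
        if max(z_scores) <= max_sigma:
            kept.append(row)
    return kept


# Twenty ordinary rows plus one row with an extreme value in the first column.
rows = [[float(i), 1.0] for i in range(20)] + [[1000.0, 1.0]]
cleaned = remove_outliers(rows, max_sigma=3)
```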

**balance(oversample=None, undersample=None, n_neighbors=5)**

Balance the number of instances per target class. Only for classification tasks. Dependency: imbalanced-learn.

**oversample: None, float or string, optional (default=None)**

Oversampling strategy using ADASYN. Choose from:

- None: do not perform oversampling
- float: fraction minority/majority (only for binary classification)
- 'minority': resample only the minority class
- 'not minority': resample all but minority class
- 'not majority': resample all but majority class
- 'all': resample all classes

**undersample: None, float or string, optional (default=None)**

Undersampling strategy using NearMiss methods. Choose from:

- None: do not perform undersampling
- float: fraction minority/majority (only for binary classification)
- 'majority': resample only the majority class
- 'not minority': resample all but minority class
- 'not majority': resample all but majority class
- 'all': resample all classes

**n_neighbors: int, optional (default=5)**

Number of nearest neighbors used for any of the algorithms.
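To make the float form of `oversample` concrete: it is described as the desired minority/majority fraction. The sketch below only does the arithmetic of how many extra minority samples that target implies; it does not generate synthetic samples the way ADASYN does, and an actual resampler may end up with slightly different counts. `samples_needed` is a hypothetical helper.

```python
import math
from collections import Counter


def samples_needed(y, ratio):
    """Extra minority samples required so that minority / majority reaches ratio."""
    counts = Counter(y)
    if len(counts) != 2:
        raise ValueError("a float ratio only applies to binary targets")
    (_, n_major), (_, n_minor) = counts.most_common(2)
    target_minor = math.ceil(ratio * n_major)
    return max(0, target_minor - n_minor)


y = [0] * 100 + [1] * 20          # minority/majority starts at 0.2
extra = samples_needed(y, ratio=0.5)
```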

**feature_insertion(n_features=2, generations=20, population=500)**

Use a genetic algorithm to create new combinations of existing features and add them to the original dataset in order to capture the non-linear relations between the original features. A dataframe containing the description of the newly generated features and their scores can be accessed through the `genetic_features` attribute. The algorithm is implemented using the Symbolic Transformer method, which can be accessed through the `genetic_algorithm` attribute. It is advised to use this method only when fitting linear models. Dependency: gplearn.

**n_features: int, optional (default=2)**

Maximum number of newly generated features (no more than 1% of the population).

**generations: int, optional (default=20)**

Number of generations to evolve.

**population: int, optional (default=500)**

Number of entities in each generation.

**feature_selection(strategy=None, solver=None, max_features=None, threshold=-np.inf, min_variance_frac=1., max_correlation=0.98)**

Select best features according to the selected strategy. Ties between features with equal scores will be broken in an unspecified way. Also removes features with too low variance and too high collinearity.

**strategy: None or string, optional (default=None)**

Feature selection strategy to use. Choose from:

- None: do not perform any feature selection algorithm (it does still look for multicollinearity and variance)
- 'univariate': perform a univariate statistical test
- 'PCA': perform a principal component analysis
- 'SFM': select best features from an existing model
- 'RFE': recursive feature eliminator

**solver: None, string or callable (default=None)**

Solver or model to use for the feature selection strategy. See the sklearn documentation for an extended description of the choices. Select None for the default option per strategy (not applicable for SFM).

- for 'univariate', choose from:
    - 'f_classif' (default for classification tasks)
    - 'f_regression' (default for regression tasks)
    - 'mutual_info_classif'
    - 'mutual_info_regression'
    - 'chi2'
    - Any function taking two arrays (X, y), and returning arrays (scores, pvalues). See the sklearn documentation.
- for 'PCA', choose from:
    - 'auto' (default)
    - 'full'
    - 'arpack'
    - 'randomized'
- for 'SFM': choose a base estimator from which the transformer is built. The estimator must have either a feature_importances_ or coef_ attribute after fitting. This parameter has no default option.
- for 'RFE': choose a supervised learning estimator. The estimator must have either a feature_importances_ or coef_ attribute after fitting. This parameter has no default option.

**max_features: None, int or float, optional (default=None)**

Number of features to select.

- None: select all features
- if >= 1: number of features to select
- if < 1: fraction of features to select

**threshold: string, int or float, optional (default=-np.inf)**

Threshold value to attain when selecting the best features (only for strategy='SFM'). Features whose importance is greater than or equal to the threshold are kept while the others are discarded.

- if 'mean': set the mean of feature_importances as threshold
- if 'median': set the median of feature_importances as threshold

**min_variance_frac: None or float, optional (default=1.)**

Remove features with the same value in at least this fraction of the total. The default is to keep all features with non-zero variance, i.e. remove the features that have the same value in all samples. None to skip this step.

**max_correlation: None or float, optional (default=0.98)**

Minimum value of the Pearson correlation coefficient to identify correlated features. A dataframe of the removed features and their correlation values can be accessed through the `collinear` attribute. None to skip this step.
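The collinearity step controlled by `max_correlation` can be pictured as follows. This is a minimal sketch under stated assumptions: it compares the absolute Pearson coefficient against the threshold, keeps the first feature of each correlated pair, and expects constant columns to have been removed already (otherwise the division would fail). The function names and toy features are hypothetical and not taken from ATOM.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_collinear(features, max_correlation=0.98):
    """Drop each feature that correlates too strongly with an earlier kept one."""
    kept, removed = {}, {}
    for name, values in features.items():
        for kept_name, kept_values in kept.items():
            r = pearson(values, kept_values)
            if abs(r) >= max_correlation:
                removed[name] = (kept_name, r)  # remember why it was dropped
                break
        else:
            kept[name] = values
    return kept, removed


features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],  # nearly a rescaled copy
    "age": [30.0, 22.0, 41.0, 25.0, 38.0],
}
kept, removed = drop_collinear(features, max_correlation=0.98)
```

The `removed` dictionary plays a role similar to the `collinear` attribute: it records each dropped feature together with the feature it duplicated and the correlation value.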

**fit(models, metric, greater_is_better=True, needs_proba=False, successive_halving=False, skip_steps=0, max_iter=15, max_time=np.inf, eps=1e-08, batch_size=1, init_points=5, plot_bo=False, cv=3, bagging=None)**

Fit class to the selected models. The optimal hyperparameters per model are selected using a Bayesian Optimization (BO) algorithm with a Gaussian process as kernel. The resulting score of each step of the BO is computed either by cross-validation on the complete training set or by creating a validation set from the training set. This process creates some minimal leakage but ensures maximal use of the provided data. The test set, however, does not contain any leakage and is used to determine the final score of every model. Note that the best score on the BO can be consistently lower than the final score on the test set (despite the leakage) because the BO models are trained on considerably fewer instances. At the end of the pipeline, you can choose to evaluate the robustness of the model's performance on the test set by applying a bagging algorithm.

**models: string or list of strings**

List of models to fit on the data. If 'all', all available models are used. Use the predefined acronyms to select the models. Possible values are (case insensitive):

- 'GNB' for Gaussian Naïve Bayes (no hyperparameter tuning)
- 'MNB' for Multinomial Naïve Bayes
- 'BNB' for Bernoulli Naïve Bayes
- 'GP' for Gaussian Process classifier/regressor (no hyperparameter tuning)
- 'OLS' for Ordinary Least Squares (no hyperparameter tuning)
- 'Ridge' for Ridge Linear classifier/regressor
- 'Lasso' for Lasso Linear Regression
- 'EN' for ElasticNet Linear Regression
- 'BR' for Bayesian Regression (with ridge regularization)
- 'LR' for Logistic Regression
- 'LDA' for Linear Discriminant Analysis
- 'QDA' for Quadratic Discriminant Analysis
- 'KNN' for K-Nearest Neighbors classifier/regressor
- 'Tree' for a single Decision Tree classifier/regressor
- 'Bag' for Bagging classifier/regressor (with decision tree as base estimator)
- 'ET' for Extra-Trees classifier/regressor
- 'RF' for Random Forest classifier/regressor
- 'AdaB' for AdaBoost classifier/regressor (with decision tree as base estimator)
- 'GBM' for Gradient Boosting Machine classifier/regressor
- 'XGB' for XGBoost classifier/regressor (if package is available)
- 'LGB' for LightGBM classifier/regressor (if package is available)
- 'CatB' for CatBoost classifier/regressor (if package is available)
- 'lSVM' for Linear Support Vector Machine classifier/regressor
- 'kSVM' for Kernel (non-linear) Support Vector Machine classifier/regressor
- 'PA' for Passive Aggressive classifier/regressor
- 'SGD' for Stochastic Gradient Descent classifier/regressor
- 'MLP' for Multilayer Perceptron classifier/regressor

**metric: string or callable**

Metric on which the pipeline fits the models. Choose from any of the metrics described in the Metrics section or use a score (or loss) function with signature `metric(y, y_pred, **kwargs)`.

**greater_is_better: bool, optional (default=True)**

Whether the metric is a score function or a loss function, i.e. if True, a higher score is better and if False, lower is better. Will be ignored if the metric is one of the pre-defined (string) metrics.

**needs_proba: bool, optional (default=False)**

Whether the metric function requires predict_proba to get probability estimates out of a classifier. Will be ignored if the metric is one of the pre-defined (string) metrics.

**successive_halving: bool, optional (default=False)**

Fit the pipeline using a successive halving approach, that is, fitting the model on 1/N of the data, where N stands for the number of models still in the pipeline. After this, the best half of the models are selected for the next iteration. This process is repeated until only one model is left. Since models perform quite differently depending on the size of the training set, we recommend using this feature when fitting similar models (e.g. only using tree-based models).

**skip_iter: int, optional (default=0)**

Skip n last iterations of the successive halving.

**max_iter: int, optional (default=15)**

Maximum number of iterations of the BO. 0 to skip the BO and fit the model on its default parameters.

**max_time: int or float, optional (default=np.inf)**

Maximum time allowed for the BO per model (in seconds). 0 to skip the BO and fit the model on its default parameters.

**eps: int or float, optional (default=1e-08)**

Minimum hyperparameter distance between two consecutive steps in the BO.

**batch_size: int, optional (default=1)**

Size of the batch in which the objective is evaluated.

**init_points: int, optional (default=5)**

Initial number of tests the BO runs before fitting the surrogate function.

**plot_bo: bool, optional (default=False)**

Whether to plot the BO's progress as it runs. Creates a canvas with two plots: the first plot shows the score of every trial and the second shows the distance between the last consecutive steps. Don't forget to call `%matplotlib` at the start of the cell if you are using Jupyter Notebook!

**cv: int, optional (default=3)**

Strategy to fit and score the model selected after every step of the BO.

- if 1, randomly split the training data into a train and validation set
- if >1, perform a k-fold cross validation on the training set
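The two `cv` modes can be sketched as index generators over the training set. This is an illustration only: the validation fraction used for `cv=1` is not stated in this document, so `val_fraction` is a made-up parameter, and `validation_splits` is a hypothetical helper rather than ATOM code.

```python
import random


def validation_splits(n_samples, cv=3, val_fraction=0.3, seed=0):
    """Yield (train_idx, val_idx) pairs: one random split if cv == 1, else k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    if cv == 1:
        # Single hold-out split of the training data.
        n_val = max(1, round(n_samples * val_fraction))
        yield indices[n_val:], indices[:n_val]
        return
    for k in range(cv):
        val = indices[k::cv]  # every cv-th shuffled index, starting at offset k
        val_set = set(val)
        train = [i for i in indices if i not in val_set]
        yield train, val


single = list(validation_splits(10, cv=1))
folds = list(validation_splits(10, cv=4))
```

With `cv > 1` every training sample is used for validation exactly once, at the cost of fitting the candidate model `cv` times per BO step.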

**bagging: None or int, optional (default=None)**

Number of data sets (bootstrapped from the training set) to use in the bagging algorithm. If None, no bagging is performed.
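The bagging step described for `fit` can be pictured as refitting on bootstrap resamples of the training set and scoring each refit on the untouched test set. The sketch below uses a deliberately trivial stand-in model (it predicts the mean target) so that it stays self-contained; `bootstrap_scores`, `fit_mean` and `neg_mae` are hypothetical names, and nothing here is ATOM's actual implementation.

```python
import random
from statistics import mean, stdev


def bootstrap_scores(train, test, fit, score, bagging=10, seed=0):
    """Refit on `bagging` bootstrap resamples of train; score each fit on test."""
    rng = random.Random(seed)
    scores = []
    for _ in range(bagging):
        sample = [rng.choice(train) for _ in train]  # draw with replacement
        model = fit(sample)
        scores.append(score(model, test))
    return scores


def fit_mean(rows):
    """Toy 'model': the mean target value seen during training."""
    return mean(y for _, y in rows)


def neg_mae(model, rows):
    """Negative mean absolute error, so that higher is better."""
    return -mean(abs(y - model) for _, y in rows)


train = [(x, 2.0 * x) for x in range(20)]
test = [(x, 2.0 * x) for x in range(20, 25)]
scores = bootstrap_scores(train, test, fit_mean, neg_mae, bagging=10)
print(f"{mean(scores):.2f} +/- {stdev(scores):.2f}")
```

The spread of the scores, rather than a single test score, is what gives an impression of how robust the model is to changes in the training data.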

## Methods (utilities)

**stats()**

Print out a list of basic statistics on the dataset.

**scale()**

Scale all the features to mean=0 and std=1.

**report(df='dataset', rows=None, filename=None)**

Get an extensive profile analysis of the data. The report is rendered in HTML5 and CSS3 and can be accessed through the `report` attribute. Note that this method can be very slow for large datasets. Dependency: pandas-profiling.

**df: string, optional (default='dataset')**

Name of the data class attribute to get the report from.

**rows: None or int, optional (default=None)**

Number of rows selected randomly from the dataset to perform the analysis on. None to select all rows.

**filename: None or string, optional (default=None)**

Name of the file when saved (as .html). None to not save anything.

**reset_attributes(truth='dataset')**

If you change any of the class' data attributes (dataset, X, y, train, test, X_train, X_test, y_train, y_test) between pipeline steps, you should call this method to change all other data attributes to their correct values. Related attributes are updated in unison, that is, setting truth='X_train' will also update X_test, y_train and y_test, or truth='train' will also update the test set, etc.

**truth: string, optional (default='dataset')**

Data attribute that has been changed (as string).

**plot_bagging(iteration=-1, **kwargs)**

Make a boxplot of the bagging's results after fitting the class.

**iteration: int, optional (default=-1)**

Iteration of the successive_halving to plot. If -1, use the last iteration.
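To see what an "iteration" of the successive halving refers to, the schedule described for `fit` (train N models on 1/N of the data, keep the best half, repeat until one model is left) can be sketched as below. How an odd number of models is halved is an assumption here (integer division), and the ranking by score is replaced by list order to keep the example self-contained; `successive_halving_schedule` is a hypothetical helper.

```python
def successive_halving_schedule(models, n_samples):
    """Return one (models, samples_per_model) pair per iteration."""
    schedule = []
    remaining = list(models)
    while remaining:
        # Each remaining model is trained on 1/N of the data.
        schedule.append((list(remaining), n_samples // len(remaining)))
        if len(remaining) == 1:
            break
        # A real run would rank by score; this sketch keeps the list order.
        remaining = remaining[: len(remaining) // 2]
    return schedule


schedule = successive_halving_schedule(["RF", "ET", "GBM", "Tree"], n_samples=1000)
last_models, last_size = schedule[-1]  # the entry that iteration=-1 points at
```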

**plot_correlation(**kwargs)**

Make a correlation matrix plot of the dataset. Ignores non-numeric columns.

**plot_PCA(show=None, **kwargs)**

Plot the explained variance ratio of the components. Only available if PCA was applied on the dataset through the feature_selection method.

**show: int, optional (default=20)**

Number of best components to show in the plot. None for all components.

**plot_successive_halving(**kwargs)**

Make a plot of the models' scores per iteration of the successive halving.

**plot_ROC(**kwargs)**

Plot the ROC curve of all the models. Only for binary classification tasks.

**plot_PRC(**kwargs)**

Plot the precision-recall curve of all the models. Only for binary classification tasks.

Additionally, you can call different metrics as methods of the main class to get the results of the fit method on that specific metric, e.g. `atom.precision()`. For a list of the available metrics, see the Metrics section.

## Attributes

- **dataset**: Dataframe of the complete dataset.
- **X, y**: Data features and target.
- **train, test**: Train and test set.
- **X_train, y_train**: Training set features and target.
- **X_test, y_test**: Test set features and target.
- **mapping**: Dictionary of the target values mapped to their encoded integer (only for classification tasks).
- **report**: Pandas profiling report of the selected dataset (if the report method was used).
- **genetic_features**: Contains the description of the generated features and their scores (if feature_insertion was used).
- **collinear**: Dataframe of the collinear features and their correlation values (if feature_selection was used).
- **errors**: Dictionary of the encountered exceptions (if any) while fitting the models.
- **results**: Dataframe (or list of dataframes if successive_halving=True) of the results.

## Subclass methods

After fitting, the models become subclasses of the main class. They can be called upon for handy plot functions and attributes, e.g. `atom.LGB.plot_confusion_matrix()`. If successive_halving=True, the model subclass corresponds to the last fitted model.

**plot_threshold(metric=None, steps=100, **kwargs)**

Plot performance metrics against multiple threshold values. If None, the metric used to fit the model will be selected. Only for binary classification tasks.

**metric: None, string, callable or list, optional (default=None)**

Metric(s) to plot. If None, the selected metric will be the one chosen to fit the model.

**steps: int, optional (default=100)**

Number of thresholds (steps) to plot.
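The data behind a threshold plot can be computed by sweeping a decision threshold over predicted probabilities and recomputing the metric at each step. The sketch below does this for precision and recall on made-up probabilities; the convention of returning 0.0 when a denominator is zero is an assumption of this example, and the function is a hypothetical helper, not ATOM's plotting code.

```python
def precision_recall_at(y_true, proba, threshold):
    """Precision and recall when predicting 1 for proba >= threshold."""
    pred = [int(p >= threshold) for p in proba]
    tp = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
proba = [0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.9, 0.55]

steps = 5
thresholds = [i / (steps - 1) for i in range(steps)]  # 0.0, 0.25, ..., 1.0
curve = [(t, *precision_recall_at(y_true, proba, t)) for t in thresholds]
```

A larger `steps` value gives a smoother curve at the cost of more metric evaluations.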

**plot_probabilities(target=1, **kwargs)**

Plot the probability of every class in the target variable against the class selected by `target`. Only for classification tasks.

**target: int or string, optional (default=1)**

Target class to plot the probabilities against. Either the class' name or the index (0 corresponds to the first class, 1 to the second, etc.).

**plot_permutation_importance(show=20, n_repeats=10, **kwargs)**

Plot the feature importance permutation scores in a boxplot. A dictionary containing the permutation's results can be accessed through the `permutations` attribute.

**n_repeats: int, optional (default=10)**

Number of times to permute a feature.

**show: int, optional (default=20)**

Number of best features to show in the plot. None for all features.
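Permutation importance measures how much a score drops when one feature's values are shuffled, repeated `n_repeats` times per feature. The following sketch shows the idea with a toy classifier that only looks at the first feature; the model, data and function names are all assumptions made for this example and say nothing about how ATOM computes the values it plots.

```python
import random
from statistics import mean


def accuracy(model, X, y):
    """Fraction of rows for which the model predicts the target."""
    return mean(int(model(row) == target) for row, target in zip(X, y))


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Score drop per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    drops = {}
    for col in range(len(X[0])):
        drops[col] = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            X_perm = [
                row[:col] + [value] + row[col + 1:]
                for row, value in zip(X, shuffled)
            ]
            drops[col].append(baseline - accuracy(model, X_perm, y))
    return drops


def model(row):
    """Toy classifier that only uses feature 0."""
    return int(row[0] > 0.5)


data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [model(row) for row in X]
drops = permutation_importance(model, X, y, n_repeats=10)
```

Each list in `drops` holds one value per repeat, which is the kind of per-feature distribution a boxplot summarizes. Feature 1 is ignored by the toy model, so shuffling it changes nothing.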

**plot_feature_importance(show=None, **kwargs)**

Plot the normalized feature importance scores. Only works with tree-based algorithms.

**show: int, optional (default=None)**

Number of best features to show in the plot. None for all features.

**plot_ROC(**kwargs)**

Plot the ROC curve. Only for binary classification tasks.

**plot_PRC(**kwargs)**

Plot the precision-recall curve. Only for binary classification tasks.

**plot_confusion_matrix(normalize=True, **kwargs)**

Plot the confusion matrix for the model. Only for classification tasks.

**normalize: bool, optional (default=True)**

Whether to normalize the confusion matrix.

**save(filename=None)**

Save the best found model as a pickle file.

**filename: None or string, optional (default=None)**

Name of the file when saved. If None, it will be saved as 'ATOM_[model_type].pkl'.

## Subclass attributes

- **error**: If the model encountered an exception, this shows it.
- **best_params**: Get parameters of the model with highest score.
- **best_model**: Get the model with highest score (not fitted).
- **best_model_fit**: Get the model with highest score fitted on the training set.
- **predict_train**: Get the predictions on the training set.
- **predict_test**: Get the predictions on the test set.
- **predict_proba_train**: Get the predicted probabilities on the training set.
- **predict_proba_test**: Get the predicted probabilities on the test set.
- **score_train**: Metric score of the BO's selected model on the training set.
- **score_test**: Metric score of the BO's selected model on the test set.
- **bagging_scores**: Array of the bagging's results.
- **permutations**: Dictionary of the permutation's results (if plot_permutation_importance was used).
- **BO**: Dictionary containing the information of every step taken by the BO.
    - 'params': Parameters used for the model
    - 'score': Score of the chosen metric
- Any of the metrics described in the Metrics section.

## Plots

All plot methods contain the following parameters (on top of the plot-specific parameters explained in their respective documentation):

**title: None or string, optional (default=None)**

Plot's title. None for default title.

**figsize: 2d-tuple, optional (default=depends on plot)**

Figure size: format as (x, y).

**filename: None or string, optional (default=None)**

Name of the file when saved. None to not save anything.

**display: bool, optional (default=True)**

Whether to display the plot.

The plotting aesthetics can be customized with the use of the `@classmethods` described hereunder, e.g. `ATOMClassifier.set_style('white')`.

**set_style(style='darkgrid')**

Change the seaborn plotting style.

**style: string, optional (default='darkgrid')**

Name of the style to use. Possible values are: darkgrid, whitegrid, dark, white, and ticks.

**set_palette(palette='GnBu_d')**

Change the seaborn color palette.

**palette: string, optional (default='GnBu_d')**

Name of the palette to use. See the seaborn documentation on color palettes for more information.

**set_title_fontsize(fontsize=20)**

Change the fontsize of the plot's title.

**fontsize: int, optional (default=20)**

Size of the font.

**set_label_fontsize(fontsize=16)**

Change the fontsize of the plot's labels and legends.

**fontsize: int, optional (default=16)**

Size of the font.

**set_tick_fontsize(fontsize=12)**

Change the fontsize of the plot's ticks.

**fontsize: int, optional (default=12)**

Size of the font.

## Metrics

Some of the most common metrics are integrated in the ATOM class. They can be passed to the metric parameter of the fit method, called as methods of the main class, e.g. `atom.RF.accuracy()`, and they are saved as attributes of every model subclass, e.g. `atom.LDA.recall`. All metrics are calculated on the test set. For multiclass tasks, the type of averaging performed on the data is 'weighted'. The available metrics are:

- For binary classification tasks only:
    - **tn** for the number of true negatives
    - **fp** for the number of false positives
    - **fn** for the number of false negatives
    - **tp** for the number of true positives
    - **ap** for the average_precision_score
- For classification tasks only:
    - **accuracy** for the accuracy_score
    - **auc** for the roc_auc_score
    - **mcc** for the matthews_corrcoef
    - **f1** for the f1_score
    - **hamming** for the hamming_loss
    - **jaccard** for the jaccard_score
    - **logloss** for the log_loss
    - **precision** for the precision_score
    - **recall** for the recall_score
- For all tasks:
    - **mae** for the mean_absolute_error
    - **max_error** for the max_error
    - **mse** for the mean_squared_error
    - **msle** for the mean_squared_log_error
    - **r2** for the r2_score
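For readers unfamiliar with the binary classification entries, the sketch below shows how the four confusion-matrix counts relate to accuracy, precision, recall and f1. It follows the standard textbook definitions on a made-up prediction vector and is independent of ATOM and sklearn; `binary_metrics` is a hypothetical helper, and returning 0.0 for an empty denominator is a choice made for this example.

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and the scores derived from them."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "tn": tn, "fp": fp, "fn": fn, "tp": tp,
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
scores = binary_metrics(y_true, y_pred)
```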

## Dependencies

- **numpy** (>=1.17.2)
- **pandas** (>=0.25.1)
- **scikit-learn** (>=0.22)
- **tqdm** (>=4.35.0)
- **gpyopt** (>=1.2.5)
- **matplotlib** (>=3.1.0)
- **seaborn** (>=0.9.0)
- **imbalanced-learn**, optional (>=0.5.0)
- **pandas-profiling**, optional (>=2.3.0)
- **gplearn**, optional (>=0.4.1)
- **xgboost**, optional (>=0.90)
- **lightgbm**, optional (>=2.3.0)
- **catboost**, optional (>=0.19.1)
