
MultiTrain allows you to train multiple machine learning algorithms on a dataset all at once to determine the best one for that particular use case.

Project description

MultiTrain

MultiTrain is a Python module for machine learning, built to help you find the machine learning model that works best on a particular dataset.

REQUIREMENTS

MultiTrain requires:

INSTALLATION

Install MultiTrain using:

pip install MultiTrain

USAGE

CLASSIFICATION

MultiClassifier

The MultiClassifier combines many classifier estimators. Each one is fitted on the training data, and evaluation metrics (accuracy, balanced accuracy, r2 score, f1 score, precision, recall and ROC AUC score) are returned for each model.
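To make the idea concrete, here is a minimal sketch of the same pattern written with plain scikit-learn and pandas. It is not MultiTrain's implementation: the two-model shortlist, the synthetic dataset and the column names are assumptions chosen only for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, precision_score, recall_score,
                             roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary dataset, used only for illustration.
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Hypothetical shortlist of estimators (an assumption, not MultiTrain's list).
estimators = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
}

rows = []
for name, model in estimators.items():
    model.fit(X_train, y_train)                 # fit each estimator on the same split
    pred = model.predict(X_test)                # hard class predictions
    proba = model.predict_proba(X_test)[:, 1]   # positive-class scores for ROC AUC
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_test, pred),
        "balanced_accuracy": balanced_accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "roc_auc": roc_auc_score(y_test, proba),
    })

# One row per model, best accuracy first.
results = (pd.DataFrame(rows)
           .set_index("model")
           .sort_values("accuracy", ascending=False))
print(results)
```

Collecting one row per model in a single dataframe is what makes the models easy to compare side by side.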

# How to import MultiClassifier and the parameters an instance accepts

from MultiTrain import MultiClassifier
train = MultiClassifier(cores=-1, # works the same as n_jobs=-1: uses all CPU cores to speed up training
                        random_state=42, # sets a single random state that is shared across the library's functions
                        verbose=True, # set to True to display the name of each estimator as it is fitted
                        target_class='binary', # recommended: set to 'binary' or 'multiclass' so the library can adapt to the type of classification problem
                        imbalanced=True, # set to True if you are working with an imbalanced dataset
                        sampling='SMOTE', # when imbalanced is True, set to any over_sampling, under_sampling or over_under_sampling method
                        strategy='auto' # not every sampler uses this parameter; samplers that support it call it sampling_strategy,
                                        # read the imbalanced-learn documentation before using it
                        )

Continuing from the snippet above: if you are unsure which sampling techniques are available once imbalanced is set to True, the snippet below prints a list of all of them.

from MultiTrain import MultiClassifier
train = MultiClassifier()
print(train.strategies()) # returns all the under-sampling, over-sampling and over_under sampling methods available for use
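As a rough picture of what a sampler does to an imbalanced dataset, the sketch below performs plain random over-sampling with NumPy. This is a deliberately simpler stand-in for SMOTE (which synthesizes new minority samples instead of duplicating existing ones), and the function name random_oversample is hypothetical, not part of MultiTrain or imbalanced-learn.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy imbalanced dataset: 90 samples of class 0 and 10 samples of class 1.
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)


def random_oversample(X, y, rng):
    """Duplicate rows of the smaller classes until every class has as many
    rows as the largest class. Hypothetical helper for illustration only."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    index_parts = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        # Draw the missing rows with replacement from the same class.
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        index_parts.append(np.concatenate([cls_idx, extra]))
    idx = np.concatenate(index_parts)
    return X[idx], y[idx]


X_res, y_res = random_oversample(X, y, rng)
print("before:", np.bincount(y))      # class counts in the original data
print("after: ", np.bincount(y_res))  # class counts after over-sampling
```

Resampling of this kind should be applied to the training portion only, so that the test set keeps the original class distribution.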

Classifier Model Names

To return a list of all models available for training


Split

This function works like scikit-learn's train_test_split function, with some extra features. The split method is demonstrated in the code below.


If you want to run Principal Component Analysis (PCA) on your dataset to reduce its dimensionality, you can do this with the split function. See the code excerpt below.

import pandas as pd
from MultiTrain import MultiClassifier #import the module

train = MultiClassifier()
df = pd.read_csv('NameOfFile.csv')
features = df.drop("nameOfLabelColumn", axis=1)
labels = df['nameOfLabelColumn']
pretend_columns = ['columnA', 'columnB', 'columnC']
# The split function returns values, so its result must be assigned to a variable.
split = train.split(X=features, # the features of the dataset
                    y=labels,   # the labels of the dataset
                    sizeOfTest=0.2, # same as the test_size parameter in train_test_split
                    randomState=42, # value of the random state used for the split
                    dimensionality_reduction=True, # set to True to run PCA on both X_train and X_test automatically after splitting
                    normalize='StandardScaler', # with dimensionality_reduction, set to one of StandardScaler, MinMaxScaler or RobustScaler if the feature columns were not scaled before the split
                    n_components=2, # with dimensionality_reduction, this must be set to the number of components to keep
                    columns_to_scale=pretend_columns # a list of the columns in your dataset that you wish to scale
                    )
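For readers who want to see the steps that the dimensionality_reduction option describes, here is a minimal sketch of split, scale, then PCA using scikit-learn directly. It is an assumption about the general technique, not a copy of MultiTrain's internals; the synthetic data and column names are made up, and for simplicity every column is scaled instead of a selected subset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic dataset with six feature columns, used only for illustration.
X, y = make_classification(n_samples=200, n_features=6, random_state=42)
features = pd.DataFrame(X, columns=[f"column{c}" for c in "ABCDEF"])

# 1. Split first, so nothing learned from the test rows leaks into training.
X_train, X_test, y_train, y_test = train_test_split(
    features, y, test_size=0.2, random_state=42
)

# 2. Fit the scaler on the training rows only, then apply it to both parts.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. Fit PCA on the scaled training rows and project both parts to 2 components.
pca = PCA(n_components=2, random_state=42).fit(X_train_scaled)
X_train_pca = pca.transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

print(X_train_pca.shape, X_test_pca.shape)
```

Scaling before PCA matters because PCA is driven by variance, so an unscaled column with large values would otherwise dominate the components.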

Fit

Now that the dataset has been split using the split method, it is time to train on it using the fit method. Instead of the standard training in scikit-learn, catboost, or xgboost, this fit method integrates almost all available machine learning algorithms and trains them all on the dataset. It then returns a pandas dataframe with information such as which algorithms are overfitting and which algorithm has the highest accuracy. A basic code example for using the fit function is shown below.


The fit method can be used in several ways, described below.

If you used the traditional train_test_split method available in scikit-learn

import pandas as pd
from sklearn.model_selection import train_test_split
from MultiTrain import MultiClassifier
train = MultiClassifier()

df = pd.read_csv('filename.csv')

features = df.drop('labelName', axis=1)
labels = df['labelName']

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)
fit = train.fit(X_train=X_train, 
              X_test=X_test, 
              y_train=y_train, 
              y_test=y_test, 
              split_self=True, # always set this to True if you used the traditional train_test_split
              show_train_score=True, # set this to True only if you want the training-set equivalent of every metric shown in the dataframe
              return_best_model=True, # set this to True to get a dataframe containing only the best-performing model
              excel=True # when set to True, a spreadsheet report of the training is saved in your current working directory
              ) 
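The show_train_score option and the overfitting information mentioned earlier both rest on comparing training and test scores. The sketch below shows one simple way to make that comparison with scikit-learn. The helper name train_test_gap and the noisy synthetic dataset are assumptions for illustration; the source does not say how MultiTrain itself decides that a model is overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic dataset (flip_y randomizes some labels) to provoke overfitting.
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def train_test_gap(model):
    """Fit the model and return (train accuracy, test accuracy, difference).
    Hypothetical helper, not part of MultiTrain."""
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    return train_acc, test_acc, train_acc - test_acc


models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    # An unconstrained tree can memorize the training rows.
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
}

gaps = {name: train_test_gap(model) for name, model in models.items()}
for name, (train_acc, test_acc, gap) in gaps.items():
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f} gap={gap:.3f}")
```

A training score far above the test score suggests the model has memorized the training data; how large a gap counts as overfitting is a judgment call.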

If you used the split method provided by the MultiClassifier

import pandas as pd
from MultiTrain import MultiClassifier

train = MultiClassifier()
df = pd.read_csv('filename.csv')

features = df.drop('labelName', axis=1)
labels = df['labelName']

split = train.split(X=features,
                    y=labels,
                    sizeOfTest=0.2,
                    randomState=42,
                    shuffle_data=True)

fit = train.fit(splitting=True,
                split_data=split,
                show_train_score=True,
                excel=True)     

If you want to train on your dataset with KFold

import pandas as pd
from MultiTrain import MultiClassifier

train = MultiClassifier()
df = pd.read_csv('filename.csv')

features = df.drop('labelName', axis=1)
labels = df['labelName']

fit = train.fit(X=features,
                y=labels,
                kf=True, # set this to True if you want to train on your dataset with KFold
                fold=5, # the number of folds to use for KFold; higher numbers lead to longer training times
                show_train_score=True,
                excel=True)     
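For comparison, the sketch below runs 5-fold cross-validation on a single estimator using scikit-learn's KFold and cross_validate. It illustrates the general technique that the kf and fold parameters refer to, under the assumption that they map onto ordinary K-fold cross-validation; the dataset and the choice of estimator are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic binary dataset, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Five folds: each sample is used for testing exactly once.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_validate(
    LogisticRegression(max_iter=1000),
    X, y,
    cv=kfold,
    scoring=["accuracy", "f1"],
    return_train_score=True,  # also keep the training scores for each fold
)

# One score per fold; average them for a single summary number.
mean_test_accuracy = scores["test_accuracy"].mean()
mean_train_accuracy = scores["train_accuracy"].mean()
print(f"mean test accuracy:  {mean_test_accuracy:.3f}")
print(f"mean train accuracy: {mean_train_accuracy:.3f}")
```

Each additional fold means one more fit per model, which is why a higher fold count increases training time.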

REGRESSION


You can only use this code on classification problems

