A package by data scientists, for data scientists; with Scikit-learn style fit() and transform() functionality

Project description

Fast-ML is a Python package with numerous built-in functionalities that make the life of a data scientist much easier.

fast_ml follows Scikit-learn style functionality with fit() and transform() methods: fit() first learns the transformation parameters from the training dataset, and transform() then applies them to the training/validation/test datasets.

Important Note: learn the parameters by applying fit() ONLY on the training dataset, then apply transform() on the train/valid/test datasets. Whether it is missing value imputation, outlier treatment, or feature engineering for numerical/categorical variables, parameters are always learned from the training dataset on which the model trains.
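As a minimal sketch of the pattern (using the numerical imputer covered in section 4; train, valid, test and num_vars are assumed to be defined):

from fast_ml.missing_data_imputation import MissingDataImputer_Numerical

# learn the imputation parameters from the training data ONLY
imputer = MissingDataImputer_Numerical(method='median')
imputer.fit(train, num_vars)

# apply the learned parameters to every split
train = imputer.transform(train)
valid = imputer.transform(valid)
test = imputer.transform(test)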

Installing

pip install fast_ml

Table of Contents:

  1. Utilities
  2. Exploratory Data Analysis (EDA)
  3. Missing Data Analysis
  4. Missing Data Imputation
  5. Outlier Treatment
  6. Feature Engineering
  7. Feature Selection
  8. Model Development
  9. Model Evaluation

Glossary

  • df : Dataframe, refers to the dataset used for analysis
  • variable : str, refers to a single variable. Where a function requires it, pass it as a string, e.g. 'V1'
  • variables : list, refers to a list of variables. Must be passed as a list, e.g. ['V1', 'V2']. Even a single variable has to be passed in list format, e.g. ['V1']
  • target : str, refers to target variable
  • model : str, ML problem type. use 'classification' or 'clf' for classification problems and 'regression' or 'reg' for regression problems
  • method : str, refers to the various techniques available for Missing Value Imputation, Feature Engineering... as available in each module

1. Utilities

from fast_ml.utilities import reduce_memory_usage, display_all

# reduces the memory usage of the dataset by optimizing the datatypes used for storing the data
train = reduce_memory_usage(train, convert_to_category=False)
  1. reduce_memory_usage(df, convert_to_category = False)
    • This function reduces the memory used by the dataframe by downcasting columns to smaller suitable datatypes
  2. display_all(df)
    • Use this function to show all rows and all columns of the dataframe. By default pandas truncates the display to a limited number of rows and columns (see the sketch after this list)
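A short sketch of display_all, assuming a local train.csv:

import pandas as pd
from fast_ml.utilities import display_all

train = pd.read_csv('train.csv')

# show the full describe() output instead of pandas' truncated view
display_all(train.describe().T)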

2. Exploratory Data Analysis (EDA)

from fast_ml import eda

2.1) Overview

from fast_ml import eda

train = pd.read_csv('train.csv')

# One of the most useful dataframe summary views
summary_df = eda.df_info(train)
display_all(summary_df)
  1. eda.df_info(df)
    • Returns a dataframe with useful summary - variables, datatype, number of unique values, sample of unique values, missing count, missing percent
  2. eda.df_cardinality_info(df, raw_data = True)
    • Returns a dataframe summarizing the cardinality of each variable - variables, datatype, number of unique values, sample of unique values
  3. eda.df_missing_info(df, raw_data = True)
    • Returns a dataframe summarizing the missing data of each variable - variables, datatype, missing count, missing percent (see the sketch after this list)
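For example, the cardinality and missing-data views can be pulled the same way as df_info, continuing from the snippet above:

# cardinality summary for every variable
cardinality_df = eda.df_cardinality_info(train, raw_data=True)
display_all(cardinality_df)

# missing-data summary for every variable
missing_df = eda.df_missing_info(train, raw_data=True)
display_all(missing_df)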

2.2) Numerical Variables

from fast_ml import eda

train = pd.read_csv('train.csv')

# one command to get the commonly used plots for all the variables passed to the function
eda.numerical_plots_with_target(train, num_vars, target, model='clf')
  1. eda.numerical_describe(df, variables=None, method='10p')
    • Dataframe with various count, mean, std and spread statistics for all the variables passed as input
  2. eda.numerical_variable_detail(df, variable, model = None, target=None, threshold = 20)
    • Various summary statistics, spread statistics, outlier and missing value checks, transformation diagnostics... a detailed analysis of a single variable provided as input
  3. eda.numerical_plots(df, variables, normality_check = False)
    • Uni-variate plots - distribution of each numerical variable provided as input. Can also get the Q-Q plot for assessing normality
  4. eda.numerical_plots_with_target(df, variables, target, model)
    • Bi-variate plots - scatter plot of each numerical variable provided as input against the target
  5. eda.numerical_check_outliers(df, variables=None, tol=1.5, print_vars = False)
    • Checks the numerical variables provided as input for outliers, using tol as the tolerance for the outlier boundaries
  6. eda.numerical_bins_with_target(df, variables, target, model='clf', create_buckets = True, method='5p', custom_buckets=None)
    • Useful for deciding a suitable binning for a numerical variable. Displays 2 graphs: 'overall event rate' & 'within category event rate' (see the sketch after this list)
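For example, the outlier check and binning helpers can be run over the same num_vars list (treating the outlier check's return value as a summary dataframe is an assumption):

# flag variables whose values fall outside the tolerance boundaries
outliers_df = eda.numerical_check_outliers(train, variables=num_vars, tol=1.5, print_vars=False)

# compare decile bins against the target before committing to a binning scheme
eda.numerical_bins_with_target(train, num_vars, target, model='clf',
                               create_buckets=True, method='10p')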

2.3) Categorical Variables

from fast_ml import eda

train = pd.read_csv('train.csv')

# one command to get the commonly used plots for all the variables passed to the function
eda.categorical_plots_with_target(train, cat_vars, target, add_missing=True, rare_tol=5)
  1. eda.categorical_variable_detail(df, variable, model = None, target=None, rare_tol=5)
    • Various summary statistics, missing values, distributions ... a detailed analysis for a single variable provided as input
  2. eda.categorical_plots(df, variables, add_missing = True, add_rare = False, rare_tol=5)
    • Uni-variate plots - distribution of each categorical variable provided as input
  3. eda.categorical_plots_with_target(df, variables, target, model='clf', add_missing = True, rare_tol1 = 5, rare_tol2 = 10)
    • Bi-variate plots - distribution of each categorical variable provided as input against the target
  4. eda.categorical_plots_with_rare_and_target(df, variables, target, model='clf', add_missing=True, rare_tol1=5, rare_tol2=10)
    • Bi-variate plots - distribution of each categorical variable provided as input against the target, with 2 rare thresholds as inputs. Useful for deciding the rare bucketing (see the sketch after this list)
  5. eda.categorical_plots_for_miss_and_freq(df, variables, target, model = 'reg')
    • Uni-variate plots analysing the missing values and the frequent categories of each categorical variable provided as input
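For example, the rare-bucketing diagnostic from item 4, reusing the cat_vars list from the snippet above:

# distributions against the target with two rare thresholds, to compare bucketing choices
eda.categorical_plots_with_rare_and_target(train, cat_vars, target, model='clf',
                                           add_missing=True, rare_tol1=5, rare_tol2=10)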

3. Missing Data Analysis

from fast_ml.missing_data_analysis import MissingDataAnalysis

3.1) class MissingDataAnalysis

  1. explore_numerical_imputation(variable)
  2. explore_categorical_imputation(variable)
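A sketch of how the class might be used; only the two explore methods are documented above, so the constructor arguments shown here are an assumption:

from fast_ml.missing_data_analysis import MissingDataAnalysis

mda = MissingDataAnalysis(train)  # constructor arguments are an assumption

# diagnostic views comparing candidate imputation strategies for a single variable
mda.explore_numerical_imputation('V1')
mda.explore_categorical_imputation('V2')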

4. Missing Data Imputation

from fast_ml.missing_data_imputation import MissingDataImputer_Numerical, MissingDataImputer_Categorical

4.1) class MissingDataImputer_Numerical

from fast_ml.missing_data_imputation import MissingDataImputer_Numerical

train = pd.read_csv('train.csv')

num_imputer = MissingDataImputer_Numerical(method = 'median')

#Scikit-learn type fit() transform() functionality
# Use fit() only on the train dataset
num_imputer.fit(train, num_vars)

# Use transform() on train/test dataset
train = num_imputer.transform(train)
test = num_imputer.transform(test)
  • Methods:
    • 'mean'
    • 'median'
    • 'mode'
    • 'custom_value'
    • 'random'
  1. fit(df, num_vars)
  2. transform(df)

4.2) class MissingDataImputer_Categorical

from fast_ml.missing_data_imputation import MissingDataImputer_Categorical

train = pd.read_csv('train.csv')

cat_imputer = MissingDataImputer_Categorical(method = 'frequent')

#Scikit-learn type fit() transform() functionality
# Use fit() only on the train dataset
cat_imputer.fit(train, cat_vars)

# Use transform() on train/test dataset
train = cat_imputer.transform(train)
test = cat_imputer.transform(test)
  • Methods:
    • 'frequent' or 'mode'
    • 'custom_value'
    • 'random'
  1. fit(df, cat_vars)
  2. transform(df)

5. Outlier Treatment

from fast_ml.outlier_treatment import OutlierTreatment

5.1) class OutlierTreatment

  • Methods:
    • 'iqr' or 'IQR'
    • 'gaussian'
  1. fit(df, num_vars)
  2. transform(df)
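No usage snippet is shown above; a minimal sketch, assuming the constructor takes the method the same way the imputer classes do:

from fast_ml.outlier_treatment import OutlierTreatment

outlier_handler = OutlierTreatment(method='iqr')  # constructor signature assumed

# learn the outlier boundaries from the training data ONLY
outlier_handler.fit(train, num_vars)

# apply the learned boundaries on train/test dataset
train = outlier_handler.transform(train)
test = outlier_handler.transform(test)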

6. Feature Engineering

from fast_ml.feature_engineering import FeatureEngineering_Numerical, FeatureEngineering_Categorical, FeatureEngineering_DateTime

6.1) class FeatureEngineering_Numerical

from fast_ml.feature_engineering import FeatureEngineering_Numerical

num_binner = FeatureEngineering_Numerical(method = '10p', adaptive = True)

#Scikit-learn type fit() transform() functionality
# Use fit() only on the train dataset
num_binner.fit(train, num_vars)

# Use transform() on train/test dataset
train = num_binner.transform(train)
test = num_binner.transform(test)
  • Methods:
    • '5p' : [0,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100]
    • '10p' : [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    • '20p' : [0, 20, 40, 60, 80, 100]
    • '25p' : [0, 25, 50, 75, 100]
    • '95p' : [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 100]
    • '98p' : [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 98, 100]
    • 'custom' : Custom Buckets
  1. fit(df, num_vars)
  2. transform(df)
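For example, quartile binning uses the documented '25p' buckets with the same fit()/transform() pattern:

quartile_binner = FeatureEngineering_Numerical(method='25p', adaptive=True)

# learn the quartile boundaries from the training data ONLY
quartile_binner.fit(train, num_vars)

train = quartile_binner.transform(train)
test = quartile_binner.transform(test)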

6.2) class FeatureEngineering_Categorical(model=None, method='label', drop_last=False)

from fast_ml.feature_engineering import FeatureEngineering_Categorical

rare_encoder_5 = FeatureEngineering_Categorical(method = 'rare')

#Scikit-learn type fit() transform() functionality
# Use fit() only on the train dataset
rare_encoder_5.fit(train, cat_vars, rare_tol=5)

# Use transform() on train/test dataset
train = rare_encoder_5.transform(train)
test = rare_encoder_5.transform(test)
  • Methods:
    • 'rare_encoding' or 'rare'
    • 'label' or 'integer'
    • 'count'
    • 'freq'
    • 'ordered_label'
    • 'target_ordered'
    • 'target_mean'
    • 'target_prob_ratio'
    • 'target_woe'
  1. fit(df, cat_vars, target=None, rare_tol=5)
  2. transform(df)
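Target-based encoders follow the same pattern, with the target column name passed to fit() as per the signature above:

mean_encoder = FeatureEngineering_Categorical(model='clf', method='target_mean')

# encoding values are learned from the training data ONLY
mean_encoder.fit(train, cat_vars, target=target)

train = mean_encoder.transform(train)
test = mean_encoder.transform(test)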

6.3) class FeatureEngineering_DateTime(drop_orig=True)

from fast_ml.feature_engineering import FeatureEngineering_DateTime

dt_encoder = FeatureEngineering_DateTime()

#Scikit-learn type fit() transform() functionality
# Use fit() only on the train dataset
dt_encoder.fit(train, datetime_vars, prefix = 'default')

# Use transform() on train/test dataset
train = dt_encoder.transform(train)
test = dt_encoder.transform(test)
  1. fit(df, datetime_variables, prefix = 'default')
  2. transform(df)

7. Feature Selection

from fast_ml.feature_selection import get_constant_features

constant_features = get_constant_features(df, threshold=0.99, dropna=False)
# constant_features is a dataframe
display_all(constant_features)

# to get list of constant features
constant_feats = constant_features['Var'].to_list()
print(constant_feats)
  1. get_constant_features(df, threshold=0.99, dropna=False)
  2. get_duplicate_features(df)
  3. get_correlated_pairs(df, threshold=0.9)
  4. recursive_feature_elimination(model, X_train, y_train, X_valid, y_valid, X_test, y_test)
  5. variables_clustering(df, variables, method)
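The duplicate and correlation checks slot into the same workflow; a sketch, assuming they return dataframes like get_constant_features does:

from fast_ml.feature_selection import get_duplicate_features, get_correlated_pairs

# features that duplicate another column
duplicate_features = get_duplicate_features(df)
display_all(duplicate_features)

# variable pairs whose correlation exceeds the threshold
correlated_pairs = get_correlated_pairs(df, threshold=0.9)
display_all(correlated_pairs)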

8. Model Development

from fast_ml.model_development import train_valid_test_split

X_train, y_train, X_valid, y_valid, X_test, y_test = train_valid_test_split(df, target = target, 
                                                                            train_size=0.8, valid_size=0.1, test_size=0.1)

# Get the shape of all the datasets
print(X_train.shape, y_train.shape)
print(X_valid.shape, y_valid.shape)
print(X_test.shape, y_test.shape)
  1. train_valid_test_split(df, target, train_size=0.8, valid_size=0.1, test_size=0.1, method='random', sort_by_col = None, random_state=None)
  2. all_classifiers(X_train, y_train, X_valid, y_valid, X_test=None, y_test=None, threshold_by = 'ROC AUC' ,verbose = True)
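all_classifiers fits and compares a set of benchmark classifiers on the splits produced above; a sketch, assuming it returns a summary of the results:

from fast_ml.model_development import all_classifiers

results = all_classifiers(X_train, y_train, X_valid, y_valid,
                          threshold_by='ROC AUC', verbose=True)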

9. Model Evaluation

from fast_ml.model_evaluation import threshold_evaluation

threshold_df = threshold_evaluation(y_true, y_prob, start=0, end=1, step_size=0.1)

display_all(threshold_df)
  1. model_save(model, model_name)
  2. model_load(model_name)
  3. plot_confidence_interval_for_data(model, X)
  4. plot_confidence_interval_for_variable(model, X, y, variable)
  5. threshold_evaluation(y_true, y_prob, start=0, end=1, step_size=0.1)
  6. metrics_evaluation(y_true, y_pred_prob=None, y_pred=None, threshold=None, df_type='train')
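A sketch combining the evaluation helpers (the train_probs input and the return types are assumptions; 'my_model' is an illustrative name):

from fast_ml.model_evaluation import metrics_evaluation, model_save, model_load

# single-threshold metrics for the training predictions
train_metrics = metrics_evaluation(y_train, y_pred_prob=train_probs, threshold=0.5, df_type='train')

# persist and reload the fitted model
model_save(model, 'my_model')
model = model_load('my_model')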
