fast-ml

A package by data scientists, for data scientists, with Scikit-learn style fit() and transform() functionality

Project description

Fast-ML is a Python package with numerous built-in functionalities that make the life of a data scientist much easier.

fast_ml follows Scikit-learn style functionality with fit() and transform() methods: fit() first learns the transformation parameters from the training dataset, and transform() then applies them to the training/validation/test datasets.

Important Note : Learn the parameters by applying the fit() method ONLY on the train dataset, then apply transform() on the train/valid/test datasets. Whether it is Missing Value Imputation, Outlier Treatment, or Feature Engineering for Numerical/Categorical variables, the parameters are learned from the training dataset on which the model trains.
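The fit-on-train-only rule can be illustrated with a minimal sketch in plain pandas. The class MedianImputerSketch below is hypothetical and is not part of fast_ml; it only mimics the fit()/transform() pattern described above, assuming median imputation.

```python
import pandas as pd


class MedianImputerSketch:
    """Hypothetical stand-in for a fit()/transform() style imputer (not fast_ml code)."""

    def fit(self, df, variables):
        # Learn one median per variable from the training data only.
        self.medians_ = {var: df[var].median() for var in variables}
        return self

    def transform(self, df):
        # Fill missing values with the medians learned in fit().
        out = df.copy()
        for var, median in self.medians_.items():
            out[var] = out[var].fillna(median)
        return out


train = pd.DataFrame({"V1": [1.0, 2.0, None, 9.0]})
test = pd.DataFrame({"V1": [None, 100.0]})

imputer = MedianImputerSketch().fit(train, ["V1"])
train_filled = imputer.transform(train)
test_filled = imputer.transform(test)  # filled with the train median, not the test median
print(test_filled)
```

The missing test value is filled with the median learned from train, so no information from the test set leaks into the transformation.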

Installing

pip install fast_ml

Glossary

df : Dataframe, refers to dataset used for analysis

variable : str, refers to a single variable. Pass it as a string where the function requires it, ex 'V1'

variables : list type, refers to list of variables. Must be passed as list ex ['V1', 'V2']. Even a single variable has to be passed in list format. ex ['V1']

target : str, refers to target variable

model : str, ML problem type. Use 'classification' or 'clf' for classification problems and 'regression' or 'reg' for regression problems

method : str, refers to various techniques available for Missing Value Imputation, Feature Engineering... as available in each module

1. Utilities

from fast_ml.utilities import reduce_memory_usage, display_all

# reduces the memory usage of the dataset by optimizing for the datatype used for storing the data
train = reduce_memory_usage(train, convert_to_category=False)

reduce_memory_usage(df, convert_to_category = False)
- Reduces the memory used by the dataframe
display_all(df)
- Shows all rows and all columns of the dataframe. By default pandas truncates the display of large dataframes
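As a rough illustration of what such utilities can do, here is a sketch in plain pandas. The helpers downcast_numeric and show_all are hypothetical names and are not fast_ml's implementation; the sketch assumes memory is saved by storing numeric columns in the smallest dtype that fits, which for floats trades away some precision.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df):
    """Return a copy of df with numeric columns stored in smaller dtypes where possible."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float64 -> float32 halves the memory but loses precision
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


def show_all(df):
    """Print df without pandas truncating rows or columns."""
    with pd.option_context("display.max_rows", None, "display.max_columns", None):
        print(df)


df = pd.DataFrame({"a": np.arange(1000, dtype="int64"), "b": np.linspace(0, 1, 1000)})
small = downcast_numeric(df)
before = df.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(before, "->", after)
```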

2. Exploratory Data Analysis (EDA)

from fast_ml import eda

2.1) Overview

import pandas as pd
from fast_ml import eda
from fast_ml.utilities import display_all

train = pd.read_csv('train.csv')

# One of the most useful dataframe summary views
summary_df = eda.df_info(train)
display_all(summary_df)

eda.df_info(df)
- Returns a dataframe with a useful summary: variables, datatype, number of unique values, sample of unique values, missing count, missing percent
eda.df_cardinality_info(df, raw_data = True)
- Returns a dataframe with a useful summary: variables, datatype, number of unique values, sample of unique values, missing count, missing percent
eda.df_missing_info(df, raw_data = True)
- Returns a dataframe with a useful summary: variables, datatype, number of unique values, sample of unique values, missing count, missing percent
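To make the summary columns concrete, the sketch below builds a similar table with plain pandas. The function summarize and its column names are assumptions for illustration; the actual layout returned by eda.df_info may differ.

```python
import pandas as pd


def summarize(df, n_samples=3):
    """Build a one-row-per-variable summary (hypothetical column names)."""
    rows = []
    for col in df.columns:
        uniques = df[col].dropna().unique()
        rows.append({
            "variable": col,
            "dtype": str(df[col].dtype),
            "num_unique": len(uniques),
            "sample_unique": list(uniques[:n_samples]),
            "missing_count": int(df[col].isna().sum()),
            "missing_pct": round(100 * df[col].isna().mean(), 2),
        })
    return pd.DataFrame(rows).set_index("variable")


df = pd.DataFrame({"age": [25, 31, None, 31], "city": ["NY", None, "LA", "NY"]})
summary = summarize(df)
print(summary)
```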

2.2) Numerical Variables

import pandas as pd
from fast_ml import eda

train = pd.read_csv('train.csv')

# One command produces the commonly used plots for all the variables passed to the function
eda.numerical_plots_with_target(train, num_vars, target, model ='clf')

eda.numerical_describe(df, variables=None, method='10p')
- Dataframe with various count, mean, std and spread statistics for all the variables passed as input
eda.numerical_variable_detail(df, variable, model = None, target=None, threshold = 20)
- Various summary statistics, spread statistics, outliers, missing values, transformation diagnostics ... a detailed analysis of the single variable provided as input
eda.numerical_plots(df, variables, normality_check = False)
- Uni-variate plots - Distribution of each numerical variable provided as input. Can also show the Q-Q plot for assessing normality
eda.numerical_plots_with_target(df, variables, target, model)
- Bi-variate plots - Scatter plot of all the numerical variables provided as input with target.
eda.numerical_check_outliers(df, variables=None, tol=1.5, print_vars = False)
eda.numerical_bins_with_target(df, variables, target, model='clf', create_buckets = True, method='5p', custom_buckets=None)
- Useful for deciding on a suitable binning for a numerical variable. Displays 2 graphs: 'overall event rate' & 'within category event rate'

2.3) Categorical Variables

import pandas as pd
from fast_ml import eda

train = pd.read_csv('train.csv')

# One command produces the commonly used plots for all the variables passed to the function
eda.categorical_plots_with_target(train, cat_vars, target, add_missing=True, rare_tol=5)

eda.categorical_variable_detail(df, variable, model = None, target=None, rare_tol=5)
- Various summary statistics, missing values, distributions ... a detailed analysis of the single variable provided as input
eda.categorical_plots(df, variables, add_missing = True, add_rare = False, rare_tol=5)
- Uni-variate plots - distribution of all the categorical variables provided as input
eda.categorical_plots_with_target(df, variables, target, model='clf', add_missing = True, rare_tol1 = 5, rare_tol2 = 10)
- Bi-variate plots - distribution of all the categorical variables provided as input, against the target
eda.categorical_plots_with_rare_and_target(df, variables, target, model='clf', add_missing=True, rare_tol1=5, rare_tol2=10)
- Bi-variate plots - distribution of all the categorical variables provided as input, against the target, with 2 rare thresholds as inputs. Useful for deciding the rare bucketing
eda.categorical_plots_for_miss_and_freq(df, variables, target, model = 'reg')
- Uni-variate plots - distribution of all the categorical variables provided as input, against the target, with 2 rare thresholds as inputs. Useful for deciding the rare bucketing

3. Missing Data Analysis

from fast_ml.missing_data_analysis import MissingDataAnalysis

3.1) class MissingDataAnalysis

explore_numerical_imputation (variable)
explore_categorical_imputation (variable)

4. Missing Data Imputation

from fast_ml.missing_data_imputation import MissingDataImputer_Numerical, MissingDataImputer_Categorical

4.1) class MissingDataImputer_Numerical

import pandas as pd
from fast_ml.missing_data_imputation import MissingDataImputer_Numerical

train = pd.read_csv('train.csv')

num_imputer = MissingDataImputer_Numerical(df, method = 'median')

#Scikit-learn type fit() transform() functionality
# Use fit() only on the train dataset
num_imputer.fit(train, num_vars)

# Use transform() on train/test dataset
train = num_imputer.transform(train)
test = num_imputer.transform(test)

Methods:
- 'mean'
- 'median'
- 'mode'
- 'custom_value'
- 'random'

fit(df, num_vars)
transform(df)

4.2) class MissingDataImputer_Categorical

import pandas as pd
from fast_ml.missing_data_imputation import MissingDataImputer_Categorical

train = pd.read_csv('train.csv')

cat_imputer = MissingDataImputer_Categorical(df, method = 'frequent')

#Scikit-learn type fit() transform() functionality
# Use fit() only on the train dataset
cat_imputer.fit(train, cat_vars)

# Use transform() on train/test dataset
train = cat_imputer.transform(train)
test = cat_imputer.transform(test)

Methods:
- 'frequent' or 'mode'
- 'custom_value'
- 'random'

fit(df, cat_vars)
transform(df)

5. Outlier Treatment

from fast_ml.outlier_treatment import OutlierTreatment

5.1) class OutlierTreatment

Methods:
- 'iqr' or 'IQR'
- 'gaussian'

fit(df, num_vars)
transform(df)
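The 'iqr' method can be sketched as follows. This is an illustration under stated assumptions, not fast_ml's code: IqrCapperSketch is a hypothetical class, the multiplier tol=1.5 is the conventional choice, and outliers are assumed to be capped at the learned bounds rather than removed.

```python
import pandas as pd


class IqrCapperSketch:
    """Hypothetical IQR-based outlier capper with fit()/transform()."""

    def __init__(self, tol=1.5):
        self.tol = tol

    def fit(self, df, num_vars):
        # Learn lower/upper bounds per variable from the training data only.
        self.bounds_ = {}
        for var in num_vars:
            q1, q3 = df[var].quantile(0.25), df[var].quantile(0.75)
            iqr = q3 - q1
            self.bounds_[var] = (q1 - self.tol * iqr, q3 + self.tol * iqr)
        return self

    def transform(self, df):
        # Cap values that fall outside the learned bounds.
        out = df.copy()
        for var, (lower, upper) in self.bounds_.items():
            out[var] = out[var].clip(lower=lower, upper=upper)
        return out


train = pd.DataFrame({"V1": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]})
test = pd.DataFrame({"V1": [-50.0, 3.0, 7.0]})

capper = IqrCapperSketch(tol=1.5).fit(train, ["V1"])
train_capped = capper.transform(train)
test_capped = capper.transform(test)
print(capper.bounds_["V1"])
```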

6. Feature Engineering

from fast_ml.feature_engineering import FeatureEngineering_Numerical, FeatureEngineering_Categorical, FeatureEngineering_DateTime

6.1) class FeatureEngineering_Numerical

from fast_ml.feature_engineering import FeatureEngineering_Numerical

num_binner = FeatureEngineering_Numerical(method = '10p', adaptive = True)

#Scikit-learn type fit() transform() functionality
# Use fit() only on the train dataset
num_binner.fit(train, num_vars)

# Use transform() on train/test dataset
train = num_binner.transform(train)
test = num_binner.transform(test)

Methods:
- '5p' : [0,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100]
- '10p' : [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
- '20p' : [0, 20, 40, 60, 80, 100]
- '25p' : [0, 25, 50, 75, 100]
- '95p' : [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 100]
- '98p' : [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 98, 100]
- 'custom' : Custom Buckets

fit(df, num_vars)
transform(df)
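Percentile binning of this kind can be sketched with NumPy and pandas. The code below is an assumption-laden illustration using the '25p' grid from the list above; it does not reproduce fast_ml's internals (for example, the adaptive option is not modelled), and opening the outer edges to infinity is a design choice made here so unseen values still fall into a bin.

```python
import numpy as np
import pandas as pd

PERCENTILES_25P = [0, 25, 50, 75, 100]  # the '25p' grid listed above

train = pd.DataFrame({"V1": np.arange(1, 101, dtype=float)})  # values 1..100
test = pd.DataFrame({"V1": [0.0, 10.0, 60.0, 250.0]})

# "fit": learn the bin edges from the training data only.
edges = np.unique(np.percentile(train["V1"], PERCENTILES_25P))
# Open the outer edges so test values outside the train range still get a bin.
edges[0], edges[-1] = -np.inf, np.inf

# "transform": map each value to a bin index using the learned edges.
train_bins = pd.cut(train["V1"], bins=edges, labels=False)
test_bins = pd.cut(test["V1"], bins=edges, labels=False)
print(test_bins.tolist())
```

Because the edges come from train, the same value always lands in the same bin regardless of which dataset it appears in.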

6.2) class FeatureEngineering_Categorical(model=None, method='label', drop_last=False):

from fast_ml.feature_engineering import FeatureEngineering_Categorical

rare_encoder_5 = FeatureEngineering_Categorical(method = 'rare')

#Scikit-learn type fit() transform() functionality
# Use fit() only on the train dataset
rare_encoder_5.fit(train, cat_vars, rare_tol=5)

# Use transform() on train/test dataset
train = rare_encoder_5.transform(train)
test = rare_encoder_5.transform(test)

Methods:
- 'rare_encoding' or 'rare'
- 'label' or 'integer'
- 'count'
- 'freq'
- 'ordered_label'
- 'target_ordered'
- 'target_mean'
- 'target_prob_ratio'
- 'target_woe'

fit(df, cat_vars, target=None, rare_tol=5)
transform(df)
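The idea behind the 'rare' method can be sketched as below. RareEncoderSketch is a hypothetical class, not fast_ml's implementation; it assumes rare_tol is a percentage of training rows and that rare or unseen categories are replaced by the label 'Rare', both of which are assumptions.

```python
import pandas as pd


class RareEncoderSketch:
    """Hypothetical rare-category encoder with fit()/transform()."""

    def __init__(self, rare_tol=5, rare_label="Rare"):
        self.rare_tol = rare_tol      # threshold in percent (assumed meaning)
        self.rare_label = rare_label  # replacement label (assumed)

    def fit(self, df, cat_vars):
        # Remember which categories are frequent enough in the training data.
        self.frequent_ = {}
        for var in cat_vars:
            pct = df[var].value_counts(normalize=True) * 100
            self.frequent_[var] = set(pct[pct >= self.rare_tol].index)
        return self

    def transform(self, df):
        # Keep frequent categories and missing values; relabel everything else.
        out = df.copy()
        for var, keep in self.frequent_.items():
            mask = out[var].isin(keep) | out[var].isna()
            out[var] = out[var].where(mask, self.rare_label)
        return out


train = pd.DataFrame({"city": ["NY"] * 60 + ["LA"] * 37 + ["SF"] * 3})
test = pd.DataFrame({"city": ["NY", "SF", "Boston"]})

encoder = RareEncoderSketch(rare_tol=5).fit(train, ["city"])
test_encoded = encoder.transform(test)
print(test_encoded["city"].tolist())
```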

6.3) class FeatureEngineering_DateTime (drop_orig=True)

from fast_ml.feature_engineering import FeatureEngineering_DateTime

dt_encoder = FeatureEngineering_DateTime()

#Scikit-learn type fit() transform() functionality
# Use fit() only on the train dataset
dt_encoder.fit(train, datetime_vars, prefix = 'default')

# Use transform() on train/test dataset
train = dt_encoder.transform(train)
test = dt_encoder.transform(test)

fit(df, datetime_variables, prefix = 'default')
transform(df)
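Date-time feature extraction of this sort can be sketched with the pandas .dt accessor. The function add_datetime_features, the choice of extracted features, and the column naming are all assumptions for illustration; the sketch is stateless, whereas the class above follows the fit()/transform() pattern.

```python
import pandas as pd


def add_datetime_features(df, datetime_vars, drop_orig=True):
    """Derive calendar features from each datetime variable (hypothetical naming)."""
    out = df.copy()
    for var in datetime_vars:
        col = pd.to_datetime(out[var])
        out[f"{var}_year"] = col.dt.year
        out[f"{var}_month"] = col.dt.month
        out[f"{var}_day"] = col.dt.day
        out[f"{var}_dayofweek"] = col.dt.dayofweek  # Monday=0 ... Sunday=6
        out[f"{var}_is_weekend"] = col.dt.dayofweek >= 5
        if drop_orig:
            out = out.drop(columns=[var])
    return out


train = pd.DataFrame({"signup": ["2021-07-11", "2021-07-05"], "V1": [1, 2]})
features = add_datetime_features(train, ["signup"])
print(features)
```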

7. Feature Selection

from fast_ml.feature_selection import get_constant_features

constant_features = get_constant_features(df, threshold=0.99, dropna=False)
# constant_features is a dataframe
display_all(constant_features)

# to get list of constant features
constant_feats = constant_features['Var'].to_list()
print(constant_feats)

get_constant_features(df, threshold=0.99, dropna=False)
get_duplicate_features(df)
get_correlated_pairs(df, threshold=0.9)
recursive_feature_elimination(model, X_train, y_train, X_valid, y_valid, X_test, y_test)
variables_clustering (df, variables, method)
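One plausible reading of the threshold argument is the share of rows taken by a variable's most frequent value. The sketch below implements that reading in plain pandas; find_constant_features is a hypothetical function, and apart from 'Var' (which appears in the example above) the output column names are assumptions.

```python
import pandas as pd


def find_constant_features(df, threshold=0.99, dropna=False):
    """List variables whose most frequent value covers at least `threshold` of the rows."""
    rows = []
    for col in df.columns:
        shares = df[col].value_counts(normalize=True, dropna=dropna)
        if len(shares) and shares.iloc[0] >= threshold:
            rows.append({"Var": col, "Value": shares.index[0], "Share": shares.iloc[0]})
    return pd.DataFrame(rows, columns=["Var", "Value", "Share"])


df = pd.DataFrame({
    "const": [1] * 1000,              # a single value everywhere
    "flag": [0] * 995 + [1] * 5,      # quasi-constant: 99.5% zeros
    "x": range(1000),                 # all values distinct
})
result = find_constant_features(df, threshold=0.99)
print(result["Var"].to_list())
```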

8. Model Development

from fast_ml.model_development import train_valid_test_split

X_train, y_train, X_valid, y_valid, X_test, y_test = train_valid_test_split(df, target = target, 
                                                                            train_size=0.8, valid_size=0.1, test_size=0.1)

# Get the shape of all the datasets
print(X_train.shape, y_train.shape)
print(X_valid.shape, y_valid.shape)
print(X_test.shape, y_test.shape)

train_valid_test_split(df, target, train_size=0.8, valid_size=0.1, test_size=0.1, method='random', sort_by_col = None, random_state=None)
all_classifiers(X_train, y_train, X_valid, y_valid, X_test=None, y_test=None, threshold_by = 'ROC AUC' ,verbose = True)
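A random three-way split can be sketched as follows. split_three_ways is a hypothetical helper that only mirrors the return order shown above; it assumes a single shuffle followed by slicing and does not cover the sort_by_col option.

```python
import pandas as pd


def split_three_ways(df, target, train_size=0.8, valid_size=0.1, test_size=0.1, random_state=None):
    """Shuffle df once, then slice it into train/valid/test features and targets."""
    if abs(train_size + valid_size + test_size - 1.0) > 1e-9:
        raise ValueError("train_size, valid_size and test_size must sum to 1")
    shuffled = df.sample(frac=1.0, random_state=random_state)
    n_train = round(len(df) * train_size)
    n_valid = round(len(df) * valid_size)
    parts = [
        shuffled.iloc[:n_train],
        shuffled.iloc[n_train:n_train + n_valid],
        shuffled.iloc[n_train + n_valid:],
    ]
    out = []
    for part in parts:
        out.extend([part.drop(columns=[target]), part[target]])
    return tuple(out)


df = pd.DataFrame({"V1": range(100), "V2": range(100, 200), "y": [0, 1] * 50})
X_train, y_train, X_valid, y_valid, X_test, y_test = split_three_ways(df, target="y", random_state=42)
print(X_train.shape, X_valid.shape, X_test.shape)
```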

9. Model Evaluation

from fast_ml.model_evaluation import threshold_evaluation

threshold_df = threshold_evaluation(y_true, y_prob, start=0, end=1, step_size=0.1)

display_all(threshold_df)

model_save (model, model_name)
model_load (model_name)
plot_confidence_interval_for_data (model, X)
plot_confidence_interval_for_variable (model, X, y, variable)
threshold_evaluation(y_true, y_prob, start=0, end=1, step_size=0.1)
metrics_evaluation(y_true, y_pred_prob=None, y_pred=None, threshold=None, df_type='train')

Project details

Release history

3.68 : Jul 11, 2021 (this version)
3.67 : Jul 5, 2021
3.66 : May 24, 2021
3.51 : Apr 28, 2021
3.48 : Mar 30, 2021
3.39 : Feb 27, 2021
3.12 : Aug 22, 2020
2.69 : Jul 31, 2020
1.97 : Jul 12, 2020

