A package by data scientists, for data scientists, with Scikit-learn style fit() / transform() functionality
Project description
Fast-ML is a Python package with numerous built-in functionalities that make the life of a data scientist much easier
fast_ml follows Scikit-learn style functionality with fit() and transform() methods: first learn the transformation parameters from the training dataset, then transform the training/validation/test datasets
Important Note : You learn the parameters by applying the fit() method ONLY on the training dataset, and then apply transform() on the train/valid/test datasets. Be it Missing Value Imputation, Outliers, Feature Engineering for Numerical/Categorical ... parameters are always learned from the training dataset on which the model trains.
Installing
pip install fast_ml
Table of Contents:
- Utilities
- Exploratory Data Analysis (EDA)
- Missing Data Analysis
- Missing Data Imputation
- Outlier Treatment
- Feature Engineering
- Feature Selection
- Model Development
- Model Evaluation
Glossary
- df : Dataframe, refers to dataset used for analysis
- variable : str, refers to a single variable. Where required by a function it is passed as a string, e.g. 'V1'
- variables : list, refers to a list of variables. Must be passed as a list, e.g. ['V1', 'V2']. Even a single variable has to be passed in list format, e.g. ['V1']
- target : str, refers to target variable
- model : str, ML problem type. use 'classification' or 'clf' for classification problems and 'regression' or 'reg' for regression problems
- method : str, refers to the various techniques available for Missing Value Imputation, Feature Engineering ... as available in each module
1. Utilities
from fast_ml.utilities import reduce_memory_usage, display_all
# reduces the memory usage of the dataset by optimizing for the datatype used for storing the data
train = reduce_memory_usage(train, convert_to_category=False)
- reduce_memory_usage(df, convert_to_category = False)
- This function reduces the memory used by a dataframe by optimizing the datatypes used to store the data
- display_all(df)
- Use this function to show all rows and all columns of a dataframe. By default pandas truncates the display to a limited number of rows and columns; display_all shows everything (see the sketch below)
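A minimal usage sketch for display_all; the file name is a placeholder:
from fast_ml.utilities import display_all
import pandas as pd

train = pd.read_csv('train.csv')

# show the full describe() output instead of pandas' truncated view
display_all(train.describe())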
2. Exploratory Data Analysis (EDA)
from fast_ml import eda
2.1) Overview
from fast_ml import eda
train = pd.read_csv('train.csv')
# One of the most useful dataframe summary view
summary_df = eda.df_info(train)
display_all(summary_df)
- eda.df_info(df)
- Returns a dataframe with useful summary - variables, datatype, number of unique values, sample of unique values, missing count, missing percent
- eda.df_cardinality_info(df, raw_data = True)
- Returns a dataframe summarizing the cardinality of each variable - datatype and number of unique values
- eda.df_missing_info(df, raw_data = True)
- Returns a dataframe summarizing the missing data of each variable - missing count and missing percent (see the sketch below)
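A minimal sketch of the two focused summaries; the file name is a placeholder:
from fast_ml import eda
from fast_ml.utilities import display_all
import pandas as pd

train = pd.read_csv('train.csv')

# cardinality-focused summary of every variable
display_all(eda.df_cardinality_info(train, raw_data=True))

# missing-data-focused summary of every variable
display_all(eda.df_missing_info(train, raw_data=True))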
2.2) Numerical Variables
from fast_ml import eda
train = pd.read_csv('train.csv')
# one line of code to get the commonly used plots for all the variables provided to the function
eda.numerical_plots_with_target(train, num_vars, target, model='clf')
- eda.numerical_describe(df, variables=None, method='10p')
- Dataframe with various count, mean, std and spread statistics for all the variables passed as input
- eda.numerical_variable_detail(df, variable, model = None, target=None, threshold = 20)
- Various summary statistics, spread statistics, outlier and missing value checks, transformation diagnostics ... a detailed analysis of a single variable provided as input
- eda.numerical_plots(df, variables, normality_check = False)
- Uni-variate plots - distribution of each numerical variable provided as input. Can also show the Q-Q plot for assessing normality
- eda.numerical_plots_with_target(df, variables, target, model)
- Bi-variate plots - scatter plot of each numerical variable provided as input against the target
- eda.numerical_check_outliers(df, variables=None, tol=1.5, print_vars = False)
- Checks all (or the provided) numerical variables for outliers; tol is the IQR multiplier
- eda.numerical_bins_with_target(df, variables, target, model='clf', create_buckets = True, method='5p', custom_buckets=None)
- Useful for deciding a suitable binning for a numerical variable. Displays 2 graphs, 'overall event rate' & 'within category event rate' (see the sketch below)
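A short sketch of the outlier check and the binning diagnostic; num_vars and 'target' are assumed column names:
from fast_ml import eda
import pandas as pd

train = pd.read_csv('train.csv')
num_vars = ['V1', 'V2']   # assumed numerical columns

# flag variables with values outside tol * IQR
eda.numerical_check_outliers(train, variables=num_vars, tol=1.5, print_vars=True)

# decile buckets of each variable vs. the target event rate
eda.numerical_bins_with_target(train, num_vars, 'target', model='clf', create_buckets=True, method='10p')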
2.3) Categorical Variables
from fast_ml import eda
train = pd.read_csv('train.csv')
# one line of code to get the commonly used plots for all the variables provided to the function
eda.categorical_plots_with_target(train, cat_vars, target, add_missing=True, rare_tol=5)
- eda.categorical_variable_detail(df, variable, model = None, target=None, rare_tol=5)
- Various summary statistics, missing values, distributions ... a detailed analysis for a single variable provided as input
- eda.categorical_plots(df, variables, add_missing = True, add_rare = False, rare_tol=5)
- Uni-variate plots - distribution of all the categorical variables provided as input
- eda.categorical_plots_with_target(df, variables, target, model='clf', add_missing = True, rare_tol1 = 5, rare_tol2 = 10)
- Bi-variate plots - distribution of all the categorical variables provided as input, against the target
- eda.categorical_plots_with_rare_and_target(df, variables, target, model='clf', add_missing=True, rare_tol1=5, rare_tol2=10)
- Bi-variate plots - distribution of all the categorical variables provided as input, against the target, with 2 rare thresholds as input. Useful for deciding the rare bucketing (see the sketch below)
- eda.categorical_plots_for_miss_and_freq(df, variables, target, model = 'reg')
- Uni-variate plots - missing value counts and category frequencies for each categorical variable provided as input
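A short sketch contrasting the plain uni-variate plots with the rare-threshold comparison; cat_vars and 'target' are assumed column names:
from fast_ml import eda
import pandas as pd

train = pd.read_csv('train.csv')
cat_vars = ['V1', 'V2']   # assumed categorical columns

# uni-variate distributions, with a separate bar for missing values
eda.categorical_plots(train, cat_vars, add_missing=True, add_rare=False, rare_tol=5)

# compare two candidate rare thresholds (5% and 10%) against the target
eda.categorical_plots_with_rare_and_target(train, cat_vars, 'target', model='clf', add_missing=True, rare_tol1=5, rare_tol2=10)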
3. Missing Data Analysis
from fast_ml.missing_data_analysis import MissingDataAnalysis
3.1) class MissingDataAnalysis
- explore_numerical_imputation (variable)
- explore_categorical_imputation (variable)
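A hedged sketch of the two explore methods; the constructor arguments below are an assumption, as only the methods are documented here:
from fast_ml.missing_data_analysis import MissingDataAnalysis
import pandas as pd

train = pd.read_csv('train.csv')

# NOTE: constructor arguments are assumed, not documented above
mda = MissingDataAnalysis(train)

# compare candidate imputation strategies for a single variable
mda.explore_numerical_imputation('V1')
mda.explore_categorical_imputation('V2')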
4. Missing Data Imputation
from fast_ml.missing_data_imputation import MissingDataImputer_Numerical, MissingDataImputer_Categorical
4.1) class MissingDataImputer_Numerical
from fast_ml.missing_data_imputation import MissingDataImputer_Numerical
train = pd.read_csv('train.csv')
num_imputer = MissingDataImputer_Numerical(method = 'median')
#Scikit-learn type fit() transform() functionality
# Use fit() only on the train dataset
num_imputer.fit(train, num_vars)
# Use transform() on train/test dataset
train = num_imputer.transform(train)
test = num_imputer.transform(test)
- Methods:
- 'mean'
- 'median'
- 'mode'
- 'custom_value'
- 'random'
- fit(df, num_vars)
- transform(df)
4.2) class MissingDataImputer_Categorical
from fast_ml.missing_data_imputation import MissingDataImputer_Categorical
train = pd.read_csv('train.csv')
cat_imputer = MissingDataImputer_Categorical(method = 'frequent')
#Scikit-learn type fit() transform() functionality
# Use fit() only on the train dataset
cat_imputer.fit(train, cat_vars)
# Use transform() on train/test dataset
train = cat_imputer.transform(train)
test = cat_imputer.transform(test)
- Methods:
- 'frequent' or 'mode'
- 'custom_value'
- 'random'
- fit(df, cat_vars)
- transform(df)
5. Outlier Treatment
from fast_ml.outlier_treatment import OutlierTreatment
5.1) class OutlierTreatment
- Methods:
- 'iqr' or 'IQR'
- 'gaussian'
- fit(df, num_vars)
- transform(df)
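A sketch following the same fit()/transform() pattern as the imputers; passing method to the constructor is an assumption, mirroring the other classes:
from fast_ml.outlier_treatment import OutlierTreatment
import pandas as pd

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
num_vars = ['V1', 'V2']   # assumed numerical columns

out_treat = OutlierTreatment(method='iqr')   # constructor argument is an assumption

# learn the outlier boundaries from the training dataset only
out_treat.fit(train, num_vars)

# apply the learned boundaries to train/test
train = out_treat.transform(train)
test = out_treat.transform(test)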
6. Feature Engineering
from fast_ml.feature_engineering import FeatureEngineering_Numerical, FeatureEngineering_Categorical, FeatureEngineering_DateTime
6.1) class FeatureEngineering_Numerical
from fast_ml.feature_engineering import FeatureEngineering_Numerical
num_binner = FeatureEngineering_Numerical(method = '10p', adaptive = True)
#Scikit-learn type fit() transform() functionality
# Use fit() only on the train dataset
num_binner.fit(train, num_vars)
# Use transform() on train/test dataset
train = num_binner.transform(train)
test = num_binner.transform(test)
- Methods:
- '5p' : [0,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100]
- '10p' : [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
- '20p' : [0, 20, 40, 60, 80, 100]
- '25p' : [0, 25, 50, 75, 100]
- '95p' : [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 100]
- '98p' : [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 98, 100]
- 'custom' : Custom Buckets
- fit(df, num_vars)
- transform(df)
6.2) class FeatureEngineering_Categorical(model=None, method='label', drop_last=False)
from fast_ml.feature_engineering import FeatureEngineering_Categorical
rare_encoder_5 = FeatureEngineering_Categorical(method = 'rare')
#Scikit-learn type fit() transform() functionality
# Use fit() only on the train dataset
rare_encoder_5.fit(train, cat_vars, rare_tol=5)
# Use transform() on train/test dataset
train = rare_encoder_5.transform(train)
test = rare_encoder_5.transform(test)
- Methods:
- 'rare_encoding' or 'rare'
- 'label' or 'integer'
- 'count'
- 'freq'
- 'ordered_label'
- 'target_ordered'
- 'target_mean'
- 'target_prob_ratio'
- 'target_woe'
- fit(df, cat_vars, target=None, rare_tol=5)
- transform(df)
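For the target-based methods, the target is passed to fit(); a sketch with assumed column names:
from fast_ml.feature_engineering import FeatureEngineering_Categorical
import pandas as pd

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
cat_vars = ['V1', 'V2']   # assumed categorical columns

# target-mean encoding for a classification problem
mean_encoder = FeatureEngineering_Categorical(model='clf', method='target_mean')

# encodings are learned from the training dataset only
mean_encoder.fit(train, cat_vars, target='target')

train = mean_encoder.transform(train)
test = mean_encoder.transform(test)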
6.3) class FeatureEngineering_DateTime (drop_orig=True)
from fast_ml.feature_engineering import FeatureEngineering_DateTime
dt_encoder = FeatureEngineering_DateTime()
#Scikit-learn type fit() transform() functionality
# Use fit() only on the train dataset
dt_encoder.fit(train, datetime_vars, prefix = 'default')
# Use transform() on train/test dataset
train = dt_encoder.transform(train)
test = dt_encoder.transform(test)
- fit(df, datetime_variables, prefix = 'default')
- transform(df)
7. Feature Selection
from fast_ml.feature_selection import get_constant_features
constant_features = get_constant_features(df, threshold=0.99, dropna=False)
# constant_features is a dataframe
display_all(constant_features)
# to get list of constant features
constant_feats = constant_features['Var'].to_list()
print(constant_feats)
- get_constant_features(df, threshold=0.99, dropna=False)
- get_duplicate_features(df)
- get_correlated_pairs(df, threshold=0.9)
- recursive_feature_elimination(model, X_train, y_train, X_valid, y_valid, X_test, y_test)
- variables_clustering (df, variables, method)
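A sketch of the duplicate and correlation checks listed above; the returned objects are assumed to be dataframes, like get_constant_features:
from fast_ml.feature_selection import get_duplicate_features, get_correlated_pairs
from fast_ml.utilities import display_all
import pandas as pd

train = pd.read_csv('train.csv')

# variables that duplicate another variable exactly
display_all(get_duplicate_features(train))

# pairs of variables correlated above the threshold
display_all(get_correlated_pairs(train, threshold=0.9))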
8. Model Development
from fast_ml.model_development import train_valid_test_split
X_train, y_train, X_valid, y_valid, X_test, y_test = train_valid_test_split(df, target = target,
train_size=0.8, valid_size=0.1, test_size=0.1)
# Get the shape of all the datasets
print(X_train.shape, y_train.shape)
print(X_valid.shape, y_valid.shape)
print(X_test.shape, y_test.shape)
- train_valid_test_split(df, target, train_size=0.8, valid_size=0.1, test_size=0.1, method='random', sort_by_col = None, random_state=None)
- all_classifiers(X_train, y_train, X_valid, y_valid, X_test=None, y_test=None, threshold_by = 'ROC AUC' ,verbose = True)
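A sketch of all_classifiers on the split produced above; the return value is assumed to be a summary of per-model metrics:
from fast_ml.model_development import all_classifiers

results = all_classifiers(X_train, y_train, X_valid, y_valid,
                          X_test=X_test, y_test=y_test,
                          threshold_by='ROC AUC', verbose=True)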
9. Model Evaluation
from fast_ml.model_evaluation import threshold_evaluation
threshold_df = threshold_evaluation(y_true, y_prob, start=0, end=1, step_size=0.1)
display_all(threshold_df)
- model_save (model, model_name)
- model_load (model_name)
- plot_confidence_interval_for_data (model, X)
- plot_confidence_interval_for_variable (model, X, y, variable)
- threshold_evaluation(y_true, y_prob, start=0, end=1, step_size=0.1)
- metrics_evaluation(y_true, y_pred_prob=None, y_pred=None, threshold=None, df_type='train')
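A sketch tying the evaluation helpers together; the model and split come from section 8, and the file naming behaviour of model_save/model_load is assumed:
from sklearn.ensemble import RandomForestClassifier
from fast_ml.model_evaluation import model_save, model_load, metrics_evaluation

model = RandomForestClassifier().fit(X_train, y_train)

# persist the trained model and load it back (naming behaviour assumed)
model_save(model, 'rf_model')
model = model_load('rf_model')

# probability of the positive class on the validation set
y_prob = model.predict_proba(X_valid)[:, 1]

# metric summary using predicted probabilities at a chosen threshold
metrics_evaluation(y_valid, y_pred_prob=y_prob, threshold=0.5)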