
Run different validation tests on machine learning models.

Project description

FluidAI - Sanatio

sanatio (Latin) or validation (English)

noun

  1. the action of checking or proving the validity or accuracy of something.
  2. the action of making or declaring something legally or officially acceptable.
  3. recognition or affirmation that a person or their feelings or opinions are valid or worthwhile.

The FluidAI - Sanatio package provides functions to perform different types of validation, the results of which can be used to ensure that a trained machine learning model is robust and that the accuracy levels attained are dependable.

Version 2.1.0 updates

  1. Code Generation modules
  2. Collaborative filtering (CF) model training modules
  3. Correlation structure in routines

Creating a PIP Package on Pypi.org

Steps to Create a pip package

Process for Token Creation and Setup in Pypi.org and Twine

Use the fluidai username when logging in to test.pypi.org or pypi.org.

Use 192.168.1.90 to upload packages; all packages are uploaded from there.
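
For reference, a minimal sketch of a token-authenticated upload driven from Python, assuming an API token generated on pypi.org or test.pypi.org (with token auth, twine's username is the literal __token__; the token value and paths below are placeholders):

```python
# Sketch of a token-authenticated upload to test.pypi.org; the token value is
# a placeholder. Requires the `build` and `twine` packages to be installed.
import glob
import os
import subprocess

os.environ["TWINE_USERNAME"] = "__token__"   # literal username for API-token auth
os.environ["TWINE_PASSWORD"] = "pypi-..."    # the API token generated on (test.)pypi.org

# Build the sdist and wheel into dist/.
subprocess.run(["python", "-m", "build"], check=True)

# Upload everything in dist/ to the test index.
subprocess.run(
    ["twine", "upload", "--repository-url", "https://test.pypi.org/legacy/",
     *glob.glob("dist/*")],
    check=True,
)
```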

SPECSHEET / MILESTONES

  • REQUIREMENT: a package that cross-validates the results of a trained machine learning model and provides recommendations to correct and improve the results of the model finally used for predictions
  • OBJECTIVES:
  1. python package
  2. submodules with function definitions of different calculations to be performed during model validation
  3. documentation
  4. submodule with function definitions of different graphs / visualizations to be plotted
  • SPEC:
  1. the package needs to have an authentication server and authentication mechanism
  2. it needs a library / submodule of all the different functions that need to be called for performing the analysis
  3. it needs a wrapper class with functions that bind to the library of submodules defined above, one function per operation
  4. it needs another wrapper class that extends the previous one with functions that call the base class's functions in sequence to generate recommendations & overall results (see the sketch below)
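
A minimal sketch of the wrapper pattern in points 3 and 4, with illustrative bodies (the real classes are BaseRoutine and ValidationRoutine in validations/routines/helper_routines.py, whose internals may differ):

```python
# Illustrative sketch of the spec's wrapper pattern; class names mirror the
# package's BaseRoutine / ValidationRoutine, but the bodies are examples only.
from sklearn.metrics import accuracy_score, precision_score, recall_score


class BaseRoutine:
    """Wrapper with one function per validation calculation (spec point 3)."""

    def accuracy(self, predicted, actual):
        return accuracy_score(actual, predicted)

    def precision(self, predicted, actual):
        return precision_score(actual, predicted)

    def recall(self, predicted, actual):
        return recall_score(actual, predicted)


class ValidationRoutine(BaseRoutine):
    """Extends the base class, calling its functions in sequence (spec point 4)."""

    def binary_classification_routine(self, predicted, actual):
        results = {
            "accuracy": self.accuracy(predicted, actual),
            "precision": self.precision(predicted, actual),
            "recall": self.recall(predicted, actual),
        }
        # A real routine would also derive recommendations from these results.
        return results
```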

Current directory structure

sanatio/
├───authentication_server/
│   ├───authentication_server.py   
│   ├───utils/
│   │   ├───db_credential.json        # credential file
│   │   ├───utils.py
│   │   └───__init__.py
│   └───__init__.py
├───code_generation/
│   ├───generate_code.py              # code to generate new cells with functions defined in the files below
│   ├───metrics_codes.py
│   └───preprocessing_codes.py
├───training/
│   ├───automl/                       # pipeline generation functions
│   │   ├───pipeline_generation.py
│   │   └───__init__.py
│   ├───recommsys/                    # recommendation for CF functions
│   │   ├───cf.py
│   │   └───__init__.py
│   └───__init__.py
├───validations/
│   ├───db_credential.json
│   ├───graphs/                       # plotting functions
│   │   ├───graphs.py
│   │   └───__init__.py                   
│   ├───routines/                     # folder has different routines 
│   │   ├───binary_logistic_regression_routine.py 
│   │   ├───helper_routines.py        # has the base routine and validation routine class that calls different routines
│   │   ├───linear_regression_routine.py
│   │   ├───tree_based_classification_routine.py
│   │   └───__init__.py
│   ├───sanatio_enact.py              # code to enact sanatio recommendations
│   ├───session_authentication.py     # code to authenticate with main server  
│   ├───stats/
│   │   ├───stats.py                  # statistical validation functions
│   │   ├───correlation_structure.py
│   │   └───__init__.py
│   └───__init__.py
├───validation_tests/                 # test cases
│   ├───test_authentication.py
│   ├───test_automl.py
│   ├───test_enact.py
│   ├───test_graphs.py
│   ├───test_routines.py
│   ├───test_stats.py
│   └───__init__.py
└───__init__.py

Note: Subject to future changes in the file structure.

Function List

Note: only the required functions are listed below, so some functions present in the actual .py files may not appear here. HINT: you can mark an item as done by setting its Status column value to ✓.

validations/routines/helper_routines.py

| function/class | input | output | description | status |
| --- | --- | --- | --- | --- |
| BaseRoutine | Functions are called from binary_logistic_regression_routine.py, linear_regression_routine.py and tree_based_classification_routine.py | - | Class with calls to all the different validation functions defined within stats. | |
| ValidationRoutine | - | - | Child class of BaseRoutine whose functions call the parent class's functions in sequence for the different validation routines. | |
validations/session_authentication.py

| function/class | input | output | description | status |
| --- | --- | --- | --- | --- |
| set_key | str: key | - | Importable function that should be called before the routine modules are used. Sets a global variable with the auth key, which is verified by connecting to the auth server. | |
| authenticate_session | - | bool: active/inactive | Uses the global key variable (set by `set_key`) and checks with the auth server whether the key is active, returning the response accordingly. | |
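
A hypothetical usage sketch of the two functions above; the module path is assumed from the directory tree, and the key value is a placeholder:

```python
# Hypothetical usage of set_key / authenticate_session; the import path is an
# assumption based on the directory tree above.
from sanatio.validations.session_authentication import set_key, authenticate_session

set_key("my-auth-key")        # stores the key in a module-level global
if authenticate_session():    # asks the auth server whether the key is active
    print("Session active: validation routines can be run.")
else:
    print("Session inactive: check the key with the auth server.")
```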
validations/stats/stats.py

| function/class | input | output | description | status |
| --- | --- | --- | --- | --- |
| accuracy | pd.Series: predicted, actual | float: value | Returns the accuracy of the model using sklearn. | |
| precision | pd.Series: predicted, actual | float: value | Returns the precision of the model using sklearn. | |
| recall | pd.Series: predicted, actual | float: value | Returns the recall of the model using sklearn. | |
| f1_score | pd.Series: predicted, actual | float: value | Returns the F1 score of the model using sklearn. | |
| r_square_cox_snell | pd.Series: weight, X, actual | float: value | An R-square measure based on the assumption that the usual R² for linear regression depends on the likelihoods of the models. | |
| r_square_mcfadden | pd.Series: weight, X, actual | float: value | Loosely speaking, 1 minus the ratio of the log likelihood of the fitted model to that of the null model. Best values lie in the range 0.2-0.4. | |
| f_test | pd.Series: sample1, sample2 | tuple of floats: (fstatistic, pvalue) | The F-test of overall significance indicates whether the linear regression model fits the data better than a model containing no independent variables. | |
| log_loss | pd.Series: predicted, actual | float: value | Indicates how close the predicted probability is to the corresponding actual value (0 or 1 in binary classification); the more the predicted probability diverges from the actual value, the higher the log loss. | |
| r_square | pd.Series: predicted, actual | float: value | The coefficient of determination, denoted R² and pronounced "R squared": the proportion of the variation in the dependent variable that is predictable from the independent variable. | |
| r_square_adjusted | pd.Series: predicted, actual; int: no_of_features | float: value | A modified version of R-squared, adjusted for the number of predictors in the model. | |
| mse | pd.Series: predicted, actual | float: value | Mean squared error. | |
| mae | pd.Series: predicted, actual | float: value | Mean absolute error. | |
| rmse | pd.Series: predicted, actual | float: value | Root mean squared error. | |
| dunn_index | pd.DataFrame: data_points, cluster_centroids | float: value | The lowest inter-cluster distance (i.e. the smallest distance between any two cluster centroids) divided by the highest intra-cluster distance (i.e. the largest distance between any two points in any cluster). | |
| silhoutte_score | pd.DataFrame: data_points, cluster_centroids | float: value | The silhouette coefficient (silhouette score) evaluates the goodness of a clustering technique; its value ranges from -1 to 1. | |
| check_sparsity | pd.DataFrame: data | list: column names | Returns the names of the columns that have more than 40% missing values. | |
| check_correlation | pd.DataFrame: data; pd.Series: target; float (optional): pearson_threshold, vif_factor | dictionary: names of columns to remove per the Pearson, VIF, and target-correlation checks | Returns the names of the columns to drop based on Pearson correlation, variance inflation factor (VIF), and correlation with the target. | |
| mutual_information_classification | pd.DataFrame: data; pd.Series: target | dataframe: column and score | Returns the mutual information score of each column against a categorical target. | |
| mutual_information_regression | pd.DataFrame: data; pd.Series: target | dataframe: column and score | Returns the mutual information score of each column against a continuous target. | |
| chi_square | pd.DataFrame: data; list: categorical columns | dataframe: column and score | Checks the correlation between categorical variables using the chi-square test. | |
| hl_test | pd.DataFrame: data; pd.Series: actual, predicted_probability; list: categorical columns | dataframe: column and score | The Hosmer-Lemeshow (HL) test is a goodness-of-fit test for logistic regression, especially for risk prediction models: it calculates whether the observed event rates match the expected event rates in population subgroups. | |
| kurtosis_check | pd.DataFrame: data | dataframe: column and score | Kurtosis measures the degree of presence of outliers, i.e. whether the data is heavy-tailed or light-tailed relative to a normal distribution. | |
| skew_check | pd.DataFrame: data | dataframe: column and score | Skewness measures the symmetry or asymmetry of a data distribution. | |
| ranking_metrics | metrics: list of metrics; topK: int | list: metric functions from the cornac lib | Wrapper around cornac metrics used to obtain ranking results for the models, such as MAP, Precision@K, etc. | |
| rating_metrics | metrics: list of metrics | list: metric functions from the cornac lib | Wrapper around cornac metrics used to obtain rating results for the models, such as RMSE, MAE, and MSE. | |
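
To illustrate the dunn_index definition above, here is a self-contained sketch (not the package's implementation) that assigns each point to its nearest centroid before measuring intra-cluster spread:

```python
# Sketch of the Dunn index as defined above: min inter-centroid distance
# divided by max intra-cluster distance. Not the package's implementation.
import pandas as pd
from scipy.spatial.distance import cdist, pdist


def dunn_index_sketch(data_points: pd.DataFrame, cluster_centroids: pd.DataFrame) -> float:
    points = data_points.to_numpy()
    centroids = cluster_centroids.to_numpy()

    # Lowest inter-cluster distance: smallest distance between any two centroids.
    min_inter = pdist(centroids).min()

    # Assign every point to its nearest centroid.
    labels = cdist(points, centroids).argmin(axis=1)

    # Highest intra-cluster distance: largest distance between any two points
    # sharing a cluster (clusters with fewer than two points are skipped).
    max_intra = max(
        pdist(points[labels == k]).max()
        for k in range(len(centroids))
        if (labels == k).sum() > 1
    )
    return min_inter / max_intra
```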
validations/graphs/graphs.py

| function/class | input | output | description | status |
| --- | --- | --- | --- | --- |
| calibration_curve_plot | pd.Series: predicted, actual | tuple: (ax: matplotlib axes object, float: auc) | Calibration curves evaluate how well calibrated a classifier is, i.e. how its predicted probabilities for each class label differ from observed frequencies. The x-axis represents the average predicted probability in each bin; the y-axis is the ratio of positives (the proportion of positive predictions). | |
| roc | pd.DataFrame: two_class_probability; pd.Series: actual | plt | A plot of the false positive rate (x-axis) versus the true positive rate (y-axis) for a number of candidate threshold values between 0.0 and 1.0; put another way, the false alarm rate versus the hit rate. | |
| auc | pd.Series: predicted_probability, actual | plt | Returns a plot with the area under the curve given by the ROC plot. | |
| ks_chart | pd.Series: predicted, actual | plt | The K-S (Kolmogorov-Smirnov) chart measures the performance of classification models; more precisely, K-S measures the degree of separation between the positive and negative distributions. | |
| elbow | pd.DataFrame: data used for training | ax: matplotlib axes object | Plots the cluster number versus the overall inertia of the model. | |
| residual_plot | pd.Series: predicted, actual | plt | Shows the residuals on the vertical axis and the independent variable on the horizontal axis. If the points are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a nonlinear model is more appropriate. | |
| data_relation | pd.DataFrame: data; pd.Series: actual | ax: matplotlib axes object | Plots the significant relation between each individual feature and the target. | |
| confusion_matrix | pd.Series: predicted, actual | plt | A summary of prediction results on a classification problem: the numbers of correct and incorrect predictions, broken down by class. | |
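
For context, a calibration curve of the kind calibration_curve_plot describes can be reproduced with sklearn's public API; this sketch is illustrative and is not the package's code:

```python
# Sketch of a calibration curve using sklearn's public API; the data and model
# are illustrative, and this is not the package's calibration_curve_plot.
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

# y-axis: fraction of positives per bin; x-axis: mean predicted probability per bin.
prob_true, prob_pred = calibration_curve(y_test, proba, n_bins=10)

fig, ax = plt.subplots()
ax.plot(prob_pred, prob_true, marker="o", label="model")
ax.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
ax.set_xlabel("Mean predicted probability")
ax.set_ylabel("Fraction of positives")
ax.legend()
plt.show()
```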
validations/sanatio_enact.py

| function/class | input | output | description | status |
| --- | --- | --- | --- | --- |
| enact_helper | pd.DataFrame: X_train, y_train, X_test, y_test; object: model | - | Class containing all the enacting functions used to modify the data passed in. | |
| enact | pd.DataFrame: X_train, y_train, X_test, y_test; object: model | Generates an enacting report | Class used to generate a report, via the functions in enact_helper, on the resulting changes in performance. | |
training/automl/pipeline_generation.py

| function/class | input | output | description | status |
| --- | --- | --- | --- | --- |
| generate_sanatio_pipeline | pd.DataFrame: X_train, y_train, X_test, y_test; boolean: classification, regression | object: automl_model | Trains an AutoML model on the given data, generates a validation report, and returns the AutoML object. | |
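
A hypothetical call shape for generate_sanatio_pipeline, inferred from the row above; the import path and keyword arguments are assumptions, not confirmed API:

```python
# Hypothetical usage of generate_sanatio_pipeline; the import path and the
# `classification` keyword are inferred from the table, not confirmed.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

from sanatio.training.automl.pipeline_generation import generate_sanatio_pipeline

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

automl_model = generate_sanatio_pipeline(X_train, y_train, X_test, y_test, classification=True)
```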
training/recommsys/cf.py

| function/class | input | output | description | status |
| --- | --- | --- | --- | --- |
| TrainCFModel | list(object): models; pd.DataFrame: item_user_df; string: useIDName; list: metrics_list | object: model; dict: prediction; pickle: model; csv: user_idx_map | Class to train a single collaborative filtering recommendation model. | |
| CFExperiment | list(object): models; pd.DataFrame: item_user_df; string: useIDName; list: metrics_list; int: ratioSplit | Table with results for the models | Class to run experiments on different collaborative filtering recommendation models. | |
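
Both classes are described as wrappers over the cornac library; the underlying cornac experiment pattern looks like this (plain cornac, not the Sanatio CFExperiment API):

```python
# Plain cornac experiment pattern (the library the CF wrappers build on);
# this is not the Sanatio CFExperiment API itself.
import cornac
from cornac.datasets import movielens
from cornac.eval_methods import RatioSplit
from cornac.metrics import MAE, RMSE, Precision, Recall
from cornac.models import MF

# (user, item, rating) triples; MovieLens 100K ships with cornac.
data = movielens.load_feedback(variant="100K")

ratio_split = RatioSplit(data=data, test_size=0.2, rating_threshold=4.0, seed=123)
mf = MF(k=10, max_iter=25, learning_rate=0.01, lambda_reg=0.02, seed=123)

cornac.Experiment(
    eval_method=ratio_split,
    models=[mf],
    metrics=[MAE(), RMSE(), Precision(k=10), Recall(k=10)],
).run()  # prints a results table per model and metric
```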
code_generation/

| file | available functions | output | description | status |
| --- | --- | --- | --- | --- |
| generate_code.py | Functions are called from regression_search_codes.py, classification_search_codes.py, preprocessing_codes.py and metrics_codes.py | Creates a new cell in the notebook | Functions to create new notebook cells containing generated code. | |
| metrics_codes.py | gini_index | Code called by generate_code.py | Functions to create metrics code. | |
| preprocessing_codes.py | find_columns_type, numericals_summary, categoricals_summary, skew_kurtosis_summary, pearson_correlation_filter, chi_square_filter, p_val_filter, vif_filter, standardize_encoder, onehot_encoder, robust_encoder | Code called by generate_code.py | Functions to create preprocessing code. | |
| regression_search_codes.py | search_xgb_regression, search_random_forest_regression, search_linear_regression | Code called by generate_code.py | Functions to create search code for regression models. | |
| classification_search_codes.py | search_xgb_classification, search_random_forest_classification, search_logistic_regression_classification | Code called by generate_code.py | Functions to create search code for classification models. | |
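
A likely mechanism for "creating a new cell": IPython's set_next_input queues generated code as the next input cell. The sketch below shows that mechanism under the assumption that generate_code.py uses something similar; the helper name and generated snippet are hypothetical:

```python
# Sketch of injecting generated code into the next notebook cell via IPython's
# public set_next_input API; the helper name and snippet are illustrative.
from IPython import get_ipython


def create_code_cell(code: str) -> None:
    """Queue `code` as the content of a new input cell in the running notebook."""
    shell = get_ipython()
    if shell is None:
        raise RuntimeError("Not running inside an IPython/Jupyter session.")
    shell.set_next_input(code, replace=False)


create_code_cell("from sklearn.metrics import accuracy_score\n# generated metrics code ...")
```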

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release.

Built Distributions

fluidai_sanatio-2.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.4 MB)

Uploaded CPython 3.9, manylinux: glibc 2.17+ x86-64

fluidai_sanatio-2.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.6 MB)

Uploaded CPython 3.8, manylinux: glibc 2.17+ x86-64

fluidai_sanatio-2.1.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.3 MB)

Uploaded CPython 3.7m, manylinux: glibc 2.17+ x86-64

fluidai_sanatio-2.1.0-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.4 MB)

Uploaded CPython 3.6m, manylinux: glibc 2.17+ x86-64
