Gitlab Data Science and Modeling Tools
Project description
What is it?
gitlabds is a set of tools designed to make it quicker and easier to build predictive models.
Where to get it?
gitlabds can be installed directly via pip: `pip install gitlabds`.
Alternatively, you can download the source code from Gitlab at https://gitlab.com/gitlab-data/gitlabds and compile locally.
Main Features
- Data prep tools:
  - Treat outliers
  - Dummy code
  - Miss fill
  - Reduce feature space
  - Split and sample data into train/test
- Modeling tools:
References and Examples
MAD Outliers
Description
Median Absolute Deviation for outlier detection and correction. By default, will winsorize all numeric values in your dataframe that are more than 4 standard deviations above or below the median ('threshold').
gitlabds.mad_outliers(df, dv=None, min_levels=10, columns = 'all', threshold=4, inplace=False, verbose=True, windsor_threshold=0.01):
Parameters:
- df : your pandas dataframe
- dv : The column name of your outcome. Passing your outcome variable will prevent it from being winsorized. May be left blank if there is no outcome variable.
- min_levels : Only include columns that have at least the number of levels specified.
- columns : Will examine all numeric columns by default. To limit to just a subset of columns, pass a list of column names. Doing so will ignore any constraints put on by the 'dv' and 'min_levels' parameters.
- threshold : Winsorize values greater than this number of standard deviations from the median.
- inplace : Set to `True` to replace the existing dataframe. Set to `False` to create a new one.
- verbose : Set to `True` to print outputs of the winsorizing being done. Set to `False` to suppress.
- windsor_threshold : Only winsorize values that affect less than this percentage of the population.
Returns
- DataFrame with winsorized values or None if `inplace=True`.
Examples:
#Create a new df; only winsorize selected columns; suppress verbose
import gitlabds
new_df = gitlabds.mad_outliers(df = my_df, dv='my_outcome', columns = ['colA', 'colB', 'colC'], verbose=False)
#Treat outliers in place. Will winsorize values by altering the current dataframe
import gitlabds
gitlabds.mad_outliers(df = my_df, dv='my_outcome', columns = 'all', inplace=True)
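
To make the idea behind the description concrete, here is a minimal, dependency-free sketch of MAD-based clipping on a single list of numbers. It is not the gitlabds implementation: the helper name `mad_clip` is hypothetical, and scaling the MAD by 1.4826 (so it approximates a standard deviation for normally distributed data) is a common convention that is assumed here rather than confirmed for this package.

```python
from statistics import median

def mad_clip(values, threshold=4.0):
    """Clip values lying more than `threshold` scaled MADs from the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    scaled = 1.4826 * mad  # assumed scaling; approximates a standard deviation
    if scaled == 0:
        return list(values)  # no spread around the median, nothing to clip
    lo, hi = med - threshold * scaled, med + threshold * scaled
    return [min(max(v, lo), hi) for v in values]

data = [10, 11, 12, 13, 14, 15, 500]
clipped = mad_clip(data)  # only the extreme value 500 is pulled in
```

Because the median and MAD are barely affected by the extreme value, the bounds stay close to the bulk of the data, which is the usual reason to prefer this over mean and standard deviation.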
Missing Values Check
Description
Check for missing values.
gitlabds.missing_check(df=None, threshold = 0, by='column_name', ascending=True, return_missing_cols = False):
Parameters:
- df : your pandas dataframe
- threshold : The percent of missing values at which a column is considered to have missing values. For example, threshold = .10 will only display columns with more than 10% of their values missing. Defaults to 0.
- by : Column to sort by. Defaults to `column_name`. Also accepts `percent_missing`, `total_missing`, or a list.
- ascending : Sort ascending vs. descending. Defaults to ascending (ascending=True).
- return_missing_cols : Set to `True` to return a list of column names that meet the threshold criteria for missing.
Returns
- List of columns with missing values or None if `return_missing_cols=False`.
Examples:
#Check for missing values using default settings
gitlabds.missing_check(df=my_df, threshold = 0, by='column_name', ascending=True, return_missing_cols = False)
#Check for columns with more than 5% missing values and return a list of those columns
missing_list = gitlabds.missing_check(df=my_df, threshold = 0.05, by='column_name', ascending=True, return_missing_cols = True)
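
The threshold logic can be sketched without pandas. The snippet below is an illustration only, using a plain dict of lists in place of a dataframe and `None` as the missing marker; the helper names are hypothetical and are not part of gitlabds.

```python
def percent_missing(table):
    """Return {column_name: fraction of values that are None}."""
    return {name: sum(v is None for v in col) / len(col) for name, col in table.items()}

def missing_columns(table, threshold=0.0):
    """Names of columns whose missing fraction exceeds the threshold."""
    return sorted(n for n, pct in percent_missing(table).items() if pct > threshold)

table = {
    "colA": [1, None, 3, 4],        # 25% missing
    "colB": [None, None, None, 1],  # 75% missing
    "colC": [1, 2, 3, 4],           # nothing missing
}
flagged = missing_columns(table, threshold=0.5)
```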
Missing Values Fill
Description
Fill missing values using a range of different options.
gitlabds.missing_fill(df=None, columns='all', method='zero', inplace=False):
Parameters:
- df : your pandas dataframe
- columns : Columns to miss fill. Defaults to `all`, which will miss fill all columns with missing values.
- method : Options are `zero`, `median`, and `mean`. Defaults to `zero`.
- inplace : Set to `True` to replace the existing dataframe. Set to `False` to create a new one.
Returns
- DataFrame with missing values filled or None if `inplace=True`.
Examples:
#Miss fill specified columns with the mean value into a new dataframe
new_df = gitlabds.missing_fill(df=my_df, columns=['colA', 'colB', 'colC'], method='mean', inplace=False)
#Miss fill all values with zero in place.
gitlabds.missing_fill(df=my_df, columns='all', method='zero', inplace=True)
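
As a rough illustration of what the three `method` options mean, the sketch below fills `None` entries in a single list. It is a simplified stand-in, not the package's code, and the function name `fill_missing` is hypothetical.

```python
from statistics import mean, median

def fill_missing(values, method="zero"):
    """Replace None entries with 0, or with the median/mean of the observed values."""
    observed = [v for v in values if v is not None]
    if method == "zero":
        fill = 0
    elif method == "median":
        fill = median(observed)
    elif method == "mean":
        fill = mean(observed)
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in values]

col = [2, None, 4, 9]
filled_zero = fill_missing(col)
filled_median = fill_missing(col, method="median")
filled_mean = fill_missing(col, method="mean")
```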
Dummy Code
Description
Dummy code (AKA "one-hot encode") categorical and numeric columns based on the parameters specified below. Note: categorical columns will be dropped after they are dummy coded; numeric columns will not.
gitlabds.dummy_code(df, dv=None, columns='all', categorical=True, numeric=True, categorical_max_levels = 20, numeric_max_levels = 10, dummy_na=False):
Parameters:
- df : your pandas dataframe
- dv : The column name of your outcome. Passing your outcome variable will prevent it from being dummy coded. May be left blank if there is no outcome variable.
- columns : Will examine all columns by default. To limit to just a subset of columns, pass a list of column names.
- categorical : Set to `True` to attempt to dummy code any categorical column passed via the `columns` parameter.
- numeric : Set to `True` to attempt to dummy code any numeric column passed via the `columns` parameter.
- categorical_max_levels : Maximum number of levels a categorical column can have to be eligible for dummy coding.
- numeric_max_levels : Maximum number of levels a numeric column can have to be eligible for dummy coding.
- dummy_na : Set to `True` to create a dummy coded column for missing values.
Returns
- DataFrame with dummy-coded columns. Categorical columns that were dummy coded will be dropped from the dataframe.
Examples:
#Dummy code only categorical columns with a maximum of 30 levels. Do not dummy code missing values
new_df = gitlabds.dummy_code(df=my_df, dv='my_outcome', columns='all', categorical=True, numeric=False, categorical_max_levels = 30, dummy_na=False)
#Dummy code only columns specified in the `columns` parameter with a maximum of 10 levels for categorical and numeric. Also dummy code missing values
new_df = gitlabds.dummy_code(df=my_df, dv='my_outcome', columns= ['colA', 'colB', 'colC'], categorical=True, numeric=True, categorical_max_levels = 10, numeric_max_levels = 10, dummy_na=True)
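
For readers new to dummy coding, the following sketch shows the transformation on one column along with a level cap like the `*_max_levels` parameters. The helper `one_hot` and the `prefix_level` naming of the output columns are assumptions made for illustration; gitlabds may name its columns differently.

```python
def one_hot(values, prefix, max_levels=20):
    """Return one 0/1 indicator list per level, or {} if there are too many levels."""
    levels = sorted(set(values))
    if len(levels) > max_levels:
        return {}  # column is not eligible for dummy coding
    return {f"{prefix}_{lvl}": [int(v == lvl) for v in values] for lvl in levels}

plan = ["free", "premium", "free", "ultimate"]
dummies = one_hot(plan, prefix="plan")
skipped = one_hot(plan, prefix="plan", max_levels=2)  # 3 levels exceeds the cap
```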
Top Dummies
Description
Dummy codes only categorical levels above a certain threshold of the population. Useful when a column contains many levels but there is not a need or desire to dummy code every level. Currently only works for categorical columns.
gitlabds.dummy_top(df=None, dv=None, columns = 'all', min_threshold = 0.05, drop_categorial=True, verbose=True):
Parameters:
- df : your pandas dataframe
- dv : The column name of your outcome. Passing your outcome variable will prevent it from being dummy coded. May be left blank if there is no outcome variable.
- columns : Will examine all columns by default. To limit to just a subset of columns, pass a list of column names.
- min_threshold : The threshold at which levels will be dummy coded. For example, the default value of `0.05` will dummy code any categorical level that is in at least 5% of all rows.
- drop_categorial : Set to `True` to drop categorical columns after they are considered for dummy coding. Set to `False` to keep the original categorical columns in the dataframe.
- verbose : Set to `True` to print a detailed list of all dummy columns being created. Set to `False` to suppress.
Returns
- DataFrame with dummy coded columns.
Examples:
#Dummy code all categorical levels from all categorical columns whose values are in at least 5% of all rows.
new_df = gitlabds.dummy_top(df=my_df, dv='my_outcome', columns = 'all', min_threshold = 0.05, drop_categorial=True, verbose=True)
#Dummy code all categorical levels from the selected columns whose values are in at least 10% of all rows; suppress verbose printout and retain original categorical columns.
new_df = gitlabds.dummy_top(df=my_df, dv='my_outcome', columns = ['colA', 'colB', 'colC'], min_threshold = 0.10, drop_categorial=False, verbose=False)
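
The frequency filter described above can be sketched as follows. This is an illustrative stand-in rather than the package's code: the helper names and the output column naming are hypothetical, and "at least" is read as `>=`.

```python
from collections import Counter

def top_levels(values, min_threshold=0.05):
    """Levels that appear in at least `min_threshold` of all rows."""
    counts = Counter(values)
    return sorted(lvl for lvl, n in counts.items() if n / len(values) >= min_threshold)

def dummy_top_levels(values, prefix, min_threshold=0.05):
    """Create 0/1 indicator lists only for the sufficiently common levels."""
    return {
        f"{prefix}_{lvl}": [int(v == lvl) for v in values]
        for lvl in top_levels(values, min_threshold)
    }

region = ["us"] * 6 + ["eu"] * 3 + ["apac"]  # shares: 60%, 30%, 10%
dummies = dummy_top_levels(region, prefix="region", min_threshold=0.2)
```

Rows belonging to a rare level (here `apac`) simply get 0 in every indicator column.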
Remove Low Variation columns
Description
Remove columns from a dataset that do not meet the variation threshold. That is, columns will be dropped that contain a high percentage of one value.
gitlabds.remove_low_variation(df=None, dv=None, columns='all', threshold=.98, inplace=False, verbose=True):
Parameters:
- df : your pandas dataframe
- dv : The column name of your outcome. Passing your outcome variable will prevent it from being removed due to low variation. May be left blank if there is no outcome variable.
- columns : Will examine all columns by default. To limit to just a subset of columns, pass a list of column names.
- threshold : The maximum percentage one value in a column can represent. Columns that exceed this threshold will be dropped. For example, the default value of `0.98` will drop any column where one value is present in more than 98% of rows.
- inplace : Set to `True` to replace the existing dataframe. Set to `False` to create a new one.
- verbose : Set to `True` to print details of the operation. Set to `False` to suppress.
Returns
- DataFrame with low variation columns dropped or None if `inplace=True`.
Examples:
#Drop any columns (except for the outcome) where one value is present in more than 95% of rows. A new dataframe will be created.
new_df = gitlabds.remove_low_variation(df=my_df, dv='my_outcome', columns='all', threshold=.95)
#Drop any of the selected columns where one value is present in more than 99% of rows. Operation will be done in place on the existing dataframe.
gitlabds.remove_low_variation(df=my_df, dv=None, columns = ['colA', 'colB', 'colC'], threshold=.99, inplace=True)
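
The variation check amounts to comparing the share of the most common value against the threshold. A minimal sketch, assuming a dict of lists in place of a dataframe and a hypothetical helper name:

```python
from collections import Counter

def low_variation_columns(table, threshold=0.98):
    """Names of columns where a single value makes up more than `threshold` of rows."""
    flagged = []
    for name, col in table.items():
        top_count = Counter(col).most_common(1)[0][1]  # count of the most frequent value
        if top_count / len(col) > threshold:
            flagged.append(name)
    return flagged

table = {
    "is_active": [1] * 99 + [0],  # one value covers 99% of rows
    "seats": list(range(100)),    # every value is distinct
}
to_drop = low_variation_columns(table)
```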
Correlation Reduction
Description
Reduce the number of columns on a dataframe by dropping columns that are highly correlated with other columns. Note: only one of the two highly correlated columns will be dropped. Uses Pearson's correlation coefficient.
gitlabds.correlation_reduction(df=None, dv=None, threshold = 0.90, inplace=False, verbose=True):
Parameters:
- df : your pandas dataframe
- dv : The column name of your outcome. Passing your outcome variable will prevent it from being dropped. May be left blank if there is no outcome variable.
- threshold : The threshold above which columns will be dropped. If the correlation between two columns exceeds this threshold, one will be dropped from the dataframe. For example, the default value of `0.90` will identify columns that have correlations greater than 90% to each other and drop one of those columns.
- inplace : Set to `True` to replace the existing dataframe. Set to `False` to create a new one.
- verbose : Set to `True` to print details of the operation. Set to `False` to suppress.
Returns
- DataFrame with one column from each highly correlated pair dropped, or None if `inplace=True`.
Examples:
#Perform column reduction via correlation using a threshold of 95%, excluding the outcome column. A new dataframe will be created.
new_df = gitlabds.correlation_reduction(df=my_df, dv='my_outcome', threshold = 0.95, verbose=True)
#Perform column reduction via correlation using a threshold of 90%. Operation will be done in place on the existing dataframe.
gitlabds.correlation_reduction(df=my_df, dv='my_outcome', threshold = 0.90, inplace=True, verbose=True)
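
The pairwise logic can be sketched in plain Python. This is an illustration under stated assumptions, not the gitlabds implementation: the helper names are hypothetical, the absolute value of the correlation is compared against the threshold, and the later column of each correlated pair is the one dropped. Constant columns are not handled (they would cause a division by zero).

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlated_to_drop(table, threshold=0.90, keep=None):
    """For each highly correlated pair, mark the later column for dropping."""
    names = [n for n in table if n != keep]  # the outcome is never a candidate
    dropped = []
    for i, first in enumerate(names):
        if first in dropped:
            continue
        for second in names[i + 1:]:
            if second not in dropped and abs(pearson(table[first], table[second])) > threshold:
                dropped.append(second)
    return dropped

table = {
    "spend": [1, 2, 3, 4, 5],
    "spend_x2": [2, 4, 6, 8, 10],  # perfectly correlated with spend
    "returns": [5, 1, 4, 2, 3],
    "churned": [0, 0, 1, 1, 1],
}
to_drop = correlated_to_drop(table, keep="churned")
```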
Drop Categorical columns
Description
Drop all categorical columns from the dataframe. A useful step before regression modeling, as categorical variables are not used.
gitlabds.drop_categorical(df, inplace=False):
Parameters:
- df : your pandas dataframe
- inplace : Set to `True` to replace the existing dataframe. Set to `False` to create a new one.
Returns
- DataFrame with categorical columns dropped or None if `inplace=True`.
Examples:
#Dropping categorical columns and creating a new dataframe
new_df = gitlabds.drop_categorical(df=my_df)
#Dropping categorical columns in place
gitlabds.drop_categorical(df=my_df, inplace=True)
Remove Outcome Proxies
Description
Remove columns that are highly correlated with the outcome (target) column.
gitlabds.dv_proxies(df, dv, threshold=.8, inplace=False):
Parameters:
- df : your pandas dataframe
- dv : The column name of your outcome.
- threshold : The Pearson's correlation value to the outcome above which columns will be dropped. For example, the default value of `0.80` will identify and drop columns that have correlations greater than 80% to the outcome.
- inplace : Set to `True` to replace the existing dataframe. Set to `False` to create a new one.
Returns
- DataFrame with outcome proxy columns dropped or None if `inplace=True`.
Examples:
#Drop columns with correlations to the outcome greater than 70% and create a new dataframe
new_df = gitlabds.dv_proxies(df=my_df, dv='my_outcome', threshold=.7)
#Drop columns with correlations to the outcome greater than 80% in place
gitlabds.dv_proxies(df=my_df, dv='my_outcome', threshold=.8, inplace=True)
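
A proxy is a predictor that tracks the outcome so closely that it likely leaks the answer. The sketch below flags such columns using a hand-written Pearson correlation; the helper names and the use of the absolute correlation are assumptions for illustration, not confirmed behavior of `dv_proxies`.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def outcome_proxies(table, dv, threshold=0.8):
    """Predictor columns whose absolute correlation with the outcome exceeds the threshold."""
    return [
        name for name, col in table.items()
        if name != dv and abs(pearson(col, table[dv])) > threshold
    ]

table = {
    "cancel_clicks": [0, 0, 1, 1, 1, 0],  # mirrors the outcome exactly
    "logins": [3, 9, 4, 8, 5, 7],
    "churned": [0, 0, 1, 1, 1, 0],
}
proxies = outcome_proxies(table, dv="churned")
```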
Split and Sample Data
Description
This function will split your data into train and test datasets, separating the outcome from the rest of the file. The resultant datasets will be named x_train, y_train, x_test, and y_test.
gitlabds.split_data(df, train_pct=.7, dv=None, dv_threshold=.0, random_state = 5435):
Parameters:
- df : your pandas dataframe
- train_pct : The percentage of rows randomly assigned to the training dataset.
- dv : The column name of your outcome.
- dv_threshold : The minimum percentage of rows that must contain a positive instance (i.e. > 0) of the outcome. SMOTE/SMOTE-NC will be used to upsample positive instances until this threshold is reached. Can be disabled by setting to 0. Only accepts values 0 to 0.5.
- random_state : Random seed to use for splitting dataframe and for up-sampling (if needed)
Returns
- 4 dataframes for train and test and a list of model weights.
Examples:
#Split into train and test datasets with 70% of rows in train and 30% in test and change random seed.
x_train, y_train, x_test, y_test, model_weights = gitlabds.split_data(df=my_df, dv='my_outcome', train_pct=0.70, dv_threshold=0, random_state = 64522)
#Split into train and test datasets with 80% of rows in train and 20% in test; Up-sample if needed to hit 10% threshold.
x_train, y_train, x_test, y_test, model_weights = gitlabds.split_data(df=my_df, dv='my_outcome', train_pct=0.80, dv_threshold=0.1)
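
The splitting step, without the SMOTE up-sampling or model weights, can be sketched with the standard library. The helper names are hypothetical and rows are represented as plain dicts; the point is only to show how `train_pct` and `random_state` interact and how the outcome is separated from the predictors.

```python
import random

def split_rows(rows, train_pct=0.7, random_state=5435):
    """Shuffle rows reproducibly, then cut them into train and test portions."""
    rng = random.Random(random_state)  # seeded generator gives a repeatable split
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_pct)
    return shuffled[:cut], shuffled[cut:]

def separate_outcome(rows, dv):
    """Split each row into predictors (x) and the outcome value (y)."""
    x = [{k: v for k, v in row.items() if k != dv} for row in rows]
    y = [row[dv] for row in rows]
    return x, y

rows = [{"spend": i, "churned": i % 2} for i in range(10)]
train, test = split_rows(rows, train_pct=0.7)
x_train, y_train = separate_outcome(train, dv="churned")
x_test, y_test = separate_outcome(test, dv="churned")
```

Re-running with the same `random_state` yields the same split, which is why the seed is exposed as a parameter.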
MARS (pyearth) Modeling - Logistic Regression Only (For now)
Description
Create a predictive model using MARS (pyearth). Further documentation on the algorithm can be found at https://contrib.scikit-learn.org/py-earth/
gitlabds.mars_modeling(x_train, y_train, x_test , model_weights=[1,1], allow_missing =True, max_degree=1, max_terms=100, max_iter=100, model_out='model.joblib'):
Parameters:
- x_train : train "predictors" dataframe
- y_train : train outcome/dv/target dataframe
- x_test : test "predictors" dataframe
- model_weights : (Optional) Pass model weights if up-/down-sampling was performed on the datasets. Otherwise defaults to a weight of 1 for x and y dataframes
- allow_missing : If True, use missing argument to determine missingness or, if X is a pandas DataFrame, infer missingness from X.
- max_degree : The maximum degree of terms generated by the forward pass. Setting this to 2 will look at all first-order interactions between predictor columns (i.e., x_train) and the outcome (y_train).
- max_terms : The maximum number of terms generated by the forward pass. All memory is allocated at the beginning of the forward pass, so setting max_terms to a very high number on a system with insufficient memory may cause a MemoryError at the start of the forward pass.
- max_iter : Maximum number of iterations taken for the solvers to converge.
- model_out : Where to save the resultant model file (packaged as joblib).
Returns
- model fit, model equation (string), pared-down x_train and x_test dataframes that contain only the columns that hit in the model.
Examples:
#Model using MARS default settings
model, equation, x_train, x_test = gitlabds.mars_modeling(x_train, y_train, x_test)
#Model including all first-order interactions and save out pared-down x_train and x_test to new dataframes. Save the model fit to file as `my_model.mdl`.
model, equation, x_train_pared, x_test_pared = gitlabds.mars_modeling(x_train, y_train, x_test, max_degree=2, model_out='my_model.mdl')
XGBoost Modeling (Coming Soon)
Description
Description.
call:
Parameters:
- df : your pandas dataframe
- inplace : Set to `True` to replace the existing dataframe. Set to `False` to create a new one.
- verbose : Set to `True` to print outputs of the winsorizing being done. Set to `False` to suppress.
Returns
- DataFrame with winsorized values or None if `inplace=True`.
Examples:
#Example 1
#Example 2
Model Metrics
Description
Display a variety of model metrics for linear and logistic predictive models.
gitlabds.model_metrics(model, x_train, y_train, x_test, y_test, show_graphs=True, f_score = 0.50, classification = True, algo=None):
Parameters:
- model : model file from training
- x_train : train "predictors" dataframe. If using `mars_modeling`, this is the output containing only the columns used in the model (e.g. `x_train_pared`).
- y_train : train outcome/dv/target dataframe
- x_test : test "predictors" dataframe. If using `mars_modeling`, this is the output containing only the columns used in the model (e.g. `x_test_pared`).
- y_test : test outcome/dv/target dataframe
- show_graphs : Set to `True` to show visualizations.
- f_score : Cut point for determining a correct classification. Must also set classification to `True` to enable.
- classification : Set to `True` to show classification model metrics (accuracy, precision, recall, F1). Set show_graphs to `True` to display the confusion matrix.
- algo : Select the algorithm used to display additional model metrics. Currently supports `mars`, `rf`, and `xgb`.
Returns
- Prints model metrics and returns dataframes for model metrics and lift.
Examples:
#Display model metrics from an XGBoost model. Return classification metrics using a cut point of 0.30 F-Score
model_metrics, lift, class_metrics = gitlabds.model_metrics(model=model, x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test, show_graphs=True, f_score = 0.3, classification=True, algo='xgb')
#Display model metrics from a MARS model. Do not return classification metrics and suppress visualizations
model_metrics, lift = gitlabds.model_metrics(model=model, x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test, show_graphs=False, classification=False, algo='mars')
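
To show how a cut point turns model scores into the classification metrics listed above, here is a small standalone sketch. It is not the gitlabds code: the function name is hypothetical, and treating a score equal to the cut point as a positive prediction is an assumption.

```python
def classification_metrics(y_true, scores, cut_point=0.5):
    """Accuracy, precision, recall and F1 after thresholding scores at `cut_point`."""
    preds = [int(s >= cut_point) for s in scores]
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, y_true))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, y_true))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, y_true))
    tn = sum(p == 0 and t == 0 for p, t in zip(preds, y_true))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y_true), "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.2, 0.6]
at_050 = classification_metrics(y_true, scores, cut_point=0.5)
at_030 = classification_metrics(y_true, scores, cut_point=0.3)
```

Lowering the cut point from 0.5 to 0.3 catches every positive in this toy data (recall rises to 1.0) at the cost of more false positives (precision falls to 0.6).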
Marginal Effects
Description
Calculates and returns the marginal effects at the mean (MEM) for predictor fields.
gitlabds.marginal_effects(model, x_test, dv_description, field_labels=None):
Parameters:
- model : model file from training
- x_test : test "predictors" dataframe. If using `mars_modeling`, this is the output containing only the columns used in the model (e.g. `x_test_pared`).
- dv_description : Description of the outcome field to be used in text-based insights.
- field_labels : Dict of field descriptions. The key is the field/feature/predictor name. The value is descriptive text of the field. This field is optional and by default will use the field name.
Returns
- Dataframe of marginal effects.
Examples:
#Calculate marginal effects at the mean, using descriptive labels for the fields
mem = gitlabds.marginal_effects(model=model, x_test=x_test, dv_description='likelihood to churn', field_labels={'spend':'Dollars spent in last 6 months', 'returns':'Item returns in last 3 months', 'emails_sent':'Marketing emails sent in last month'})
print(mem)
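
For intuition about what a marginal effect at the mean is, the sketch below computes it for a plain logistic model, where the effect of predictor j is its coefficient times p(1 - p), with p evaluated at the mean of every predictor. This is a textbook simplification under stated assumptions: the coefficients and field values are made up, the function name is hypothetical, and a MARS model's hinge terms would need additional handling that is not shown.

```python
from math import exp

def marginal_effects_at_mean(intercept, coefs, rows):
    """MEM for a logistic model: coef_j * p * (1 - p), with p evaluated at the column means."""
    names = list(coefs)
    means = {n: sum(r[n] for r in rows) / len(rows) for n in names}
    z = intercept + sum(coefs[n] * means[n] for n in names)
    p = 1 / (1 + exp(-z))  # predicted probability for the "average" record
    return {n: coefs[n] * p * (1 - p) for n in names}

coefs = {"spend": -0.5, "returns": 1.0}  # hypothetical fitted coefficients
rows = [{"spend": 1.0, "returns": 0.0}, {"spend": 3.0, "returns": 2.0}]
mem_sketch = marginal_effects_at_mean(intercept=0.0, coefs=coefs, rows=rows)
```

Each value reads as the approximate change in predicted probability for a one-unit increase in that field, holding the others at their means.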
Prescriptions
Description
Return "actionable" prescriptions for each scored record. If no actions for a record are present, will return explanatory insights. This approach is recommended for linear/logistic methodologies only. Caution should be used if using a black box approach, as manipulating more than one prescription at a time could change a record's model score in unintended ways.
gitlabds.prescriptions(model, input_df, scored_df, actionable_fields, dv_description, field_labels=None, returned_insights=5):
Parameters:
- model : model file from training
- input_df : train "predictors" dataframe. If using `mars_modeling`, this is the output containing only the columns used in the model (e.g. `x_train_pared`).
- scored_df : dataframe containing model scores.
- actionable_fields : Dict of actionable fields. The key is the field/feature/predictor name. The value accepts one of 3 values: `Increasing` for prescriptions only when the field increases; `Decreasing` for prescriptions only when the field decreases; `Both` for when the field either increases or decreases.
- dv_description : Description of the outcome field to be used in text-based insights.
- field_labels : Dict of field descriptions. The key is the field/feature/predictor name. The value is descriptive text of the field. This field is optional and by default will use the field name.
- returned_insights : Number of insights per record to return. Defaults to 5.
Returns
- Dataframe of prescriptive actions. One row per record input.
Examples:
#Return prescriptions for the actionable fields of 'spend', 'returns', and 'emails_sent':
gitlabds.prescriptions(model=model, input_df=my_df, scored_df=my_scores, actionable_fields={'spend':'Increasing', 'returns':'Decreasing', 'emails_sent':'Both'}, dv_description='likelihood to churn', field_labels={'spend':'Dollars spent in last 6 months', 'returns':'Item returns in last 3 months', 'emails_sent':'Marketing emails sent in last month'}, returned_insights=5)
Gitlab Data Science
The handbook is the single source of truth for all of our documentation.
Contributing
We welcome contributions and improvements; please see the contribution guidelines.
License
This code is distributed under the MIT license; please see the LICENSE file.