Gitlab Data Science and Modeling Tools

What is it?

gitlabds is a set of tools designed to make it quicker and easier to build predictive models.

Where to get it?

gitlabds can be installed directly via pip: pip install gitlabds.

Alternatively, you can download the source code from Gitlab at https://gitlab.com/gitlab-data/gitlabds and compile locally.

Main Features

  • Data prep tools:
    • Treat outliers
    • Dummy code
    • Miss fill
    • Reduce feature space
    • Split and sample data into train/test
  • Modeling tools:
    • Quickly generate models using MARS (via the pyearth implementation)
    • Quickly generate models using XGBoost
    • Easily produce model metrics, feature importance, performance graphs, and lift/gains charts
    • Generate model insights and prescriptions

References and Examples

MAD Outliers

Description

Median Absolute Deviation (MAD) for outlier detection and correction. By default, this will windsor all numeric values in your dataframe that are more than 4 standard deviations above or below the median ('threshold').

gitlabds.mad_outliers(df, dv=None, min_levels=10, columns = 'all', threshold=4, inplace=False, verbose=True, windsor_threshold=0.01, output_file=None, output_method='a'):

Parameters:

  • df : your pandas dataframe
  • dv : The column name of your outcome. Entering your outcome variable here will prevent it from being windsored. May be left blank if there is no outcome variable.
  • min_levels : Only include columns that have at least the number of levels specified.
  • columns : Will examine all numeric columns by default. To limit to just a subset of columns, pass a list of column names. Doing so will ignore any constraints put on by the 'dv' and 'min_levels' parameters.
  • threshold : Windsor values greater than this number of standard deviations from the median.
  • inplace : Set to True to replace the existing dataframe. Set to False to create a new one.
  • verbose : Set to True to print outputs of windsoring being done. Set to False to suppress.
  • windsor_threshold : Only windsor values that affect less than this percentage of the population.
  • output_file: Output syntax to file (e.g. 'my_syntax.py') as a function. Defaults to None.
  • output_method: Method of writing file; 'w' to write, 'a' to append. Defaults to 'a'.

Returns

  • DataFrame with windsored values or None if inplace=True.

Examples:

#Create a new df; only windsor selected columns; suppress verbose
import gitlabds
new_df = gitlabds.mad_outliers(df = my_df, dv='my_outcome', columns = ['colA', 'colB', 'colC'], verbose=False)
#Inplace outliers. Will windsor values by altering the current dataframe
import gitlabds
gitlabds.mad_outliers(df = my_df, dv='my_outcome', columns = 'all', inplace=True)
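
To make the idea concrete, here is a minimal sketch of MAD-based windsoring for a single pandas Series. It is an illustration under stated assumptions, not the gitlabds implementation: the helper name `mad_windsor` is hypothetical, and the bounds are taken as median ± threshold × MAD, which may differ from how the library scales the deviation. It also ignores the `min_levels` and `windsor_threshold` checks.

```python
import pandas as pd

def mad_windsor(series: pd.Series, threshold: float = 4.0) -> pd.Series:
    """Clip values farther than threshold * MAD from the median.

    Hypothetical helper for illustration; not part of gitlabds.
    """
    median = series.median()
    # Median absolute deviation: the median distance from the median.
    mad = (series - median).abs().median()
    lower = median - threshold * mad
    upper = median + threshold * mad
    return series.clip(lower=lower, upper=upper)

values = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 500.0])
clipped = mad_windsor(values, threshold=4)
print(clipped.tolist())
```

In this toy series only the extreme value 500.0 falls outside the bounds, so it is the only value pulled in.
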
Missing Values Check

Description

Check for missing values.

gitlabds.missing_check(df=None, threshold = 0, by='column_name', ascending=True, return_missing_cols = False):

Parameters:

  • df : your pandas dataframe
  • threshold : The percent of missing values at which a column is considered to have missing values. For example, threshold = .10 will only display columns with more than 10% of their values missing. Defaults to 0.
  • by : Columns to sort by. Defaults to column_name. Also accepts percent_missing, total_missing, or a list.
  • ascending : Sort ascending vs. descending. Defaults to ascending (ascending=True).
  • return_missing_cols : Set to True to return a list of column names that meet the threshold criteria for missing.

Returns

  • List of column names that meet the missing-value threshold, or None if return_missing_cols=False.

Examples:

#Check for missing values using default settings
gitlabds.missing_check(df=my_df, threshold = 0, by='column_name', ascending=True, return_missing_cols = False)
#Check for columns with more than 5% missing values and return a list of those columns
missing_list = gitlabds.missing_check(df=my_df, threshold = 0.05, by='column_name', ascending=True, return_missing_cols = True) 
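
The kind of summary described above can be approximated with plain pandas. The sketch below is an assumption about the general approach, not the library's code: `missing_summary` is a hypothetical helper, and the output column names simply mirror the sort keys named in the `by` parameter.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame, threshold: float = 0.0) -> pd.DataFrame:
    """Summarise columns whose share of missing values exceeds `threshold`.

    Hypothetical helper for illustration; not part of gitlabds.
    """
    summary = pd.DataFrame({
        "total_missing": df.isna().sum(),
        "percent_missing": df.isna().mean(),
    })
    summary = summary[summary["percent_missing"] > threshold]
    return summary.rename_axis("column_name").reset_index()

my_df = pd.DataFrame({
    "colA": [1.0, np.nan, 3.0, 4.0],
    "colB": [np.nan, np.nan, np.nan, 1.0],
    "colC": [1, 2, 3, 4],
})
report = missing_summary(my_df, threshold=0.05)
missing_list = report["column_name"].tolist()
print(report)
```
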
Missing Values Fill

Description

Fill missing values using a range of different options.

gitlabds.missing_fill(df=None, columns='all', method='zero', inplace=False, output_file=None, output_method='a'):

Parameters:

  • df : your pandas dataframe
  • columns : Columns to miss fill. Defaults to 'all', which will miss fill all columns with missing values.
  • method : Options are zero, median, mean, drop_column, and drop_row. Defaults to zero.
  • inplace : Set to True to replace the existing dataframe. Set to False to create a new one.
  • output_file: Output syntax to file (e.g. 'my_syntax.py') as a function. Defaults to None.
  • output_method: Method of writing file; 'w' to write, 'a' to append. Defaults to 'a'.

Returns

  • DataFrame with missing values filled or None if inplace=True.

Examples:

#Miss fill specified columns with the mean value into a new dataframe
new_df = gitlabds.missing_fill(df=my_df, columns=['colA', 'colB', 'colC'], method='mean', inplace=False)
#Miss fill all values with zero in place.
gitlabds.missing_fill(df=my_df, columns='all', method='zero', inplace=True)   
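
For readers who want to see what the zero, mean, and median methods amount to, here is a rough pandas equivalent. `fill_missing` is a hypothetical helper written for this illustration; it always returns a copy and does not cover the drop_column or drop_row methods.

```python
import numpy as np
import pandas as pd

def fill_missing(df: pd.DataFrame, columns, method: str = "zero") -> pd.DataFrame:
    """Return a copy of df with missing values in `columns` filled.

    Hypothetical helper for illustration; not part of gitlabds.
    """
    out = df.copy()
    for col in columns:
        if method == "zero":
            out[col] = out[col].fillna(0)
        elif method == "mean":
            out[col] = out[col].fillna(out[col].mean())
        elif method == "median":
            out[col] = out[col].fillna(out[col].median())
        else:
            raise ValueError(f"unsupported method: {method}")
    return out

my_df = pd.DataFrame({"colA": [1.0, np.nan, 3.0], "colB": [np.nan, 4.0, 8.0]})
new_df = fill_missing(my_df, columns=["colA", "colB"], method="mean")
print(new_df)
```
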
Dummy Code

Description

Dummy code (AKA "one-hot encode") categorical and numeric columns based on the parameters specified below. Note: categorical columns will be dropped after they are dummy coded; numeric columns will not.

gitlabds.dummy_code(df, dv=None, columns='all', categorical=True, numeric=True, categorical_max_levels = 20, numeric_max_levels = 10, dummy_na=False, output_file=None, output_method='a'):

Parameters:

  • df : your pandas dataframe
  • dv : The column name of your outcome. Entering your outcome variable here will prevent it from being dummy coded. May be left blank if there is no outcome variable.
  • columns : Will examine all columns by default. To limit to just a subset of columns, pass a list of column names.
  • categorical : Set to True to attempt to dummy code any categorical column passed via the columns parameter.
  • numeric : Set to True to attempt to dummy code any numeric column passed via the columns parameter.
  • categorical_max_levels : Maximum number of levels a categorical column can have to be eligible for dummy coding.
  • numeric_max_levels : Maximum number of levels a numeric column can have to be eligible for dummy coding.
  • dummy_na : Set to True to create a dummy coded column for missing values.
  • output_file: Output syntax to file (e.g. 'my_syntax.py') as a function. Defaults to None.
  • output_method: Method of writing file; 'w' to write, 'a' to append. Defaults to 'a'.

Returns

  • DataFrame with dummy-coded columns. Categorical columns that were dummy coded will be dropped from the dataframe.

Examples:

#Dummy code only categorical columns with a maximum of 30 levels. Do not dummy code missing values
new_df = gitlabds.dummy_code(df=my_df, dv='my_outcome', columns='all', categorical=True, numeric=False, categorical_max_levels = 30, dummy_na=False)
#Dummy code only columns specified in the `columns` parameter with a maximum of 10 levels for categorical and numeric. Also dummy code missing values
new_df = gitlabds.dummy_code(df=my_df, dv='my_outcome', columns= ['colA', 'colB', 'colC'], categorical=True, numeric=True, categorical_max_levels = 10, numeric_max_levels = 10,  dummy_na=True)
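
As a point of comparison, the basic one-hot encoding step can be reproduced with `pandas.get_dummies`. This is only a sketch of the underlying idea with made-up column names (`plan`, `seats`); it does not apply the max-level checks described above and is not how gitlabds is necessarily implemented.

```python
import pandas as pd

my_df = pd.DataFrame({
    "plan": ["free", "pro", "free", None],  # categorical column (hypothetical)
    "seats": [1, 5, 2, 3],                  # numeric column, left untouched here
})

# One 0/1 column per level of `plan`; the original `plan` column is dropped.
# With dummy_na=False the row with a missing plan gets 0 in every dummy column.
new_df = pd.get_dummies(my_df, columns=["plan"], dummy_na=False, dtype=int)
print(new_df)
```
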
Top Dummies

Description

Dummy codes only categorical levels above a certain threshold of the population. Useful when a column contains many levels but there is not a need or desire to dummy code every level. Currently only works for categorical columns.

gitlabds.dummy_top(df=None, dv=None, columns = 'all', min_threshold = 0.05, drop_categorial=True, verbose=True, output_file=None, output_method='a'):

Parameters:

  • df : your pandas dataframe
  • dv : The column name of your outcome. Entering your outcome variable here will prevent it from being dummy coded. May be left blank if there is no outcome variable.
  • columns : Will examine all columns by default. To limit to just a subset of columns, pass a list of column names.
  • min_threshold: The threshold at which levels will be dummy coded. For example, the default value of 0.05 will dummy code any categorical level that is in at least 5% of all rows.
  • drop_categorial: Set to True to drop categorical columns after they are considered for dummy coding. Set to False to keep the original categorical columns in the dataframe.
  • verbose : Set to True to print detailed list of all dummy columns being created. Set to False to suppress.
  • output_file: Output syntax to file (e.g. 'my_syntax.py') as a function. Defaults to None.
  • output_method: Method of writing file; 'w' to write, 'a' to append. Defaults to 'a'.

Returns

  • DataFrame with dummy coded columns.

Examples:

#Dummy code all categorical levels from all categorical columns whose values are in at least 5% of all rows.
new_df = gitlabds.dummy_top(df=my_df, dv='my_outcome', columns = 'all', min_threshold = 0.05, drop_categorial=True, verbose=True)
#Dummy code all categorical levels from the selected columns whose values are in at least 10% of all rows; suppress verbose printout and retain original categorical columns.
new_df = gitlabds.dummy_top(df=my_df, dv='my_outcome', columns = ['colA', 'colB', 'colC'], min_threshold = 0.10, drop_categorial=False, verbose=False)
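
The threshold logic can be sketched in a few lines of pandas. `top_level_dummies` is a hypothetical helper, not the library function, and it assumes that a level's share is measured against non-missing rows (the default behaviour of `value_counts`).

```python
import pandas as pd

def top_level_dummies(df: pd.DataFrame, column: str, min_threshold: float = 0.05) -> pd.DataFrame:
    """Add 0/1 columns only for levels covering at least `min_threshold` of rows.

    Hypothetical helper for illustration; not part of gitlabds.
    """
    shares = df[column].value_counts(normalize=True)
    out = df.copy()
    for level in shares[shares >= min_threshold].index:
        out[f"{column}_{level}"] = (df[column] == level).astype(int)
    return out

# 60% emea, 30% amer, 10% apac (made-up data)
my_df = pd.DataFrame({"region": ["emea"] * 6 + ["amer"] * 3 + ["apac"]})
new_df = top_level_dummies(my_df, "region", min_threshold=0.25)
print(new_df.columns.tolist())
```

With a 25% threshold, only the two most common levels receive dummy columns; the rare `apac` level is left uncoded.
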
Remove Low Variation columns

Description

Remove columns from a dataset that do not meet the variation threshold. That is, columns will be dropped that contain a high percentage of one value.

gitlabds.remove_low_variation(df=None, dv=None, columns='all', threshold=.98, inplace=False, verbose=True):

Parameters:

  • df : your pandas dataframe
  • dv : The column name of your outcome. Entering your outcome variable here will prevent it from being removed due to low variation. May be left blank if there is no outcome variable.
  • columns : Will examine all columns by default. To limit to just a subset of columns, pass a list of column names.
  • threshold: The maximum percentage one value in a column can represent. Columns that exceed this threshold will be dropped. For example, the default value of 0.98 will drop any column where one value is present in more than 98% of rows.
  • inplace : Set to True to replace the existing dataframe. Set to False to create a new one.
  • verbose : Set to True to print details of the columns being dropped. Set to False to suppress.

Returns

  • DataFrame with low variation columns dropped or None if inplace=True.

Examples:

#Drop any columns (except for the outcome) where one value is present in more than 95% of rows. A new dataframe will be created.
new_df = gitlabds.remove_low_variation(df=my_df, dv='my_outcome', columns='all', threshold=.95)
#Drop any of the selected columns where one value is present in more than 99% of rows. Operation will be done in place on the existing dataframe.
gitlabds.remove_low_variation(df=my_df, dv=None, columns = ['colA', 'colB', 'colC'], threshold=.99, inplace=True)
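
The variation check itself is easy to express in pandas. The following is a minimal sketch under the assumption that missing values count as a value of their own; `low_variation_columns` is a hypothetical helper, not the gitlabds function.

```python
import pandas as pd

def low_variation_columns(df: pd.DataFrame, dv=None, threshold: float = 0.98) -> list:
    """List columns whose most common value covers more than `threshold` of rows.

    Hypothetical helper for illustration; not part of gitlabds.
    """
    flagged = []
    for col in df.columns:
        if col == dv:
            continue  # never flag the outcome column
        # value_counts sorts descending, so the first entry is the top share.
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share > threshold:
            flagged.append(col)
    return flagged

my_df = pd.DataFrame({
    "mostly_zero": [0] * 99 + [1],
    "balanced": [0, 1] * 50,
    "my_outcome": [0] * 100,
})
to_drop = low_variation_columns(my_df, dv="my_outcome", threshold=0.98)
new_df = my_df.drop(columns=to_drop)
print(to_drop)
```
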
Correlation Reduction

Description

Reduce the number of columns in a dataframe by dropping columns that are highly correlated with other columns. Note: only one of the two highly correlated columns will be dropped. Uses Pearson's correlation coefficient.

gitlabds.correlation_reduction(df=None, dv=None, threshold = 0.90, inplace=False, verbose=True):

Parameters:

  • df : your pandas dataframe
  • dv : The column name of your outcome. Entering your outcome variable here will prevent it from being dropped. May be left blank if there is no outcome variable.
  • threshold: The threshold above which columns will be dropped. If two variables exceed this threshold, one will be dropped from the dataframe. For example, the default value of 0.90 will identify columns that have correlations greater than 90% to each other and drop one of those columns.
  • inplace : Set to True to replace the existing dataframe. Set to False to create a new one.
  • verbose : Set to True to print details of the columns being dropped. Set to False to suppress.

Returns

  • DataFrame with one column from each highly correlated pair dropped or None if inplace=True.

Examples:

#Perform column reduction via correlation using a threshold of 95%, excluding the outcome column. A new dataframe will be created.
new_df = gitlabds.correlation_reduction(df=my_df, dv='my_outcome', threshold = 0.95, verbose=True)
#Perform column reduction via correlation using a threshold of 90%. Operation will be done in place on the existing dataframe.
gitlabds.correlation_reduction(df=my_df, dv='my_outcome', threshold = 0.90, inplace=True, verbose=True)
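
One common way to implement this kind of reduction is to scan the upper triangle of the correlation matrix, as sketched below. This is an assumption about the approach rather than the library's code: `correlated_columns_to_drop` is a hypothetical helper, it uses the absolute value of Pearson's correlation, and it always drops the later column of a pair.

```python
import numpy as np
import pandas as pd

def correlated_columns_to_drop(df: pd.DataFrame, dv=None, threshold: float = 0.90) -> list:
    """Pick one column from each pair whose |Pearson r| exceeds `threshold`.

    Hypothetical helper for illustration; not part of gitlabds.
    """
    features = df.drop(columns=[dv]) if dv is not None else df
    corr = features.corr(method="pearson").abs()
    # Keep only the upper triangle so each pair is considered once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]

my_df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "a_copy": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated with "a"
    "b": [5.0, 1.0, 4.0, 2.0, 3.0],
    "my_outcome": [0, 1, 0, 1, 1],
})
to_drop = correlated_columns_to_drop(my_df, dv="my_outcome", threshold=0.90)
new_df = my_df.drop(columns=to_drop)
print(to_drop)
```
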
Drop Categorical columns

Description

Drop all categorical columns from the dataframe. A useful step before regression modeling, as categorical variables are not used.

gitlabds.drop_categorical(df, inplace=False):

Parameters:

  • df : your pandas dataframe
  • inplace : Set to True to replace the existing dataframe. Set to False to create a new one.

Returns

  • DataFrame with categorical columns dropped or None if inplace=True.

Examples:

#Dropping categorical columns and creating a new dataframe
new_df = gitlabds.drop_categorical(df=my_df) 
#Dropping categorical columns in place
gitlabds.drop_categorical(df=my_df, inplace=True) 
Remove Outcome Proxies

Description

Remove columns that are highly correlated with the outcome (target) column.

gitlabds.dv_proxies(df, dv, threshold=.8, inplace=False):

Parameters:

  • df : your pandas dataframe
  • dv : The column name of your outcome.
  • threshold : The Pearson's correlation value to the outcome above which columns will be dropped. For example, the default value of 0.80 will identify and drop columns that have correlations greater than 80% to the outcome.
  • inplace : Set to True to replace the existing dataframe. Set to False to create a new one.

Returns

  • DataFrame with outcome proxy columns dropped or None if inplace=True.

Examples:

#Drop columns with correlations to the outcome greater than 70% and create a new dataframe
new_df = gitlabds.dv_proxies(df=my_df, dv='my_outcome', threshold=.7)    
#Drop columns with correlations to the outcome greater than 80% in place
gitlabds.dv_proxies(df=my_df, dv='my_outcome', threshold=.8, inplace=True)        
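
The proxy check boils down to correlating every predictor with the outcome. Below is a minimal pandas sketch; `outcome_proxies` is a hypothetical helper, the column names are made up, and comparing the absolute correlation to the threshold is an assumption about how negative correlations are treated.

```python
import pandas as pd

def outcome_proxies(df: pd.DataFrame, dv: str, threshold: float = 0.8) -> list:
    """List columns whose |Pearson r| with the outcome exceeds `threshold`.

    Hypothetical helper for illustration; not part of gitlabds.
    """
    correlations = df.drop(columns=[dv]).corrwith(df[dv]).abs()
    return correlations[correlations > threshold].index.tolist()

my_df = pd.DataFrame({
    "leaky": [0, 1, 0, 1, 1, 0],       # identical to the outcome
    "weak": [1, 2, 3, 3, 2, 1],        # only loosely related
    "my_outcome": [0, 1, 0, 1, 1, 0],
})
proxies = outcome_proxies(my_df, dv="my_outcome", threshold=0.8)
new_df = my_df.drop(columns=proxies)
print(proxies)
```
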
Split and Sample Data

Description

This function will split your data into train and test datasets, separating the outcome from the rest of the file. The resultant datasets will be named x_train, y_train, x_test, and y_test.

gitlabds.split_data(df, train_pct=.7, dv=None, dv_threshold=.0, random_state = 5435):

Parameters:

  • df : your pandas dataframe
  • train_pct : The percentage of rows randomly assigned to the training dataset.
  • dv : The column name of your outcome.
  • dv_threshold : The minimum percentage of rows that must contain a positive instance (i.e. > 0) of the outcome. SMOTE/SMOTE-NC will be used to upsample positive instances until this threshold is reached. Can be disabled by setting to 0. Only accepts values 0 to 0.5.
  • random_state : Random seed to use for splitting dataframe and for up-sampling (if needed)

Returns

  • 4 dataframes for train and test and a list of model weights.

Examples:

#Split into train and test datasets with 70% of rows in train and 30% in test and change random seed.
x_train, y_train, x_test, y_test, model_weights = gitlabds.split_data(df=my_df, dv='my_outcome', train_pct=0.70, dv_threshold=0, random_state = 64522)
#Split into train and test datasets with 80% of rows in train and 20% in test; Up-sample if needed to hit 10% threshold.
x_train, y_train, x_test, y_test, model_weights = gitlabds.split_data(df=my_df, dv='my_outcome', train_pct=0.80, dv_threshold=0.1)
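
The splitting step alone, without any up-sampling, can be sketched in pandas as follows. `simple_split` is a hypothetical helper for illustration: it does not perform SMOTE, does not return model weights, and returns the outcome as a Series rather than a dataframe.

```python
import pandas as pd

def simple_split(df: pd.DataFrame, dv: str, train_pct: float = 0.7, random_state: int = 5435):
    """Randomly split df into train/test and separate the outcome column.

    Hypothetical helper for illustration; not part of gitlabds.
    """
    train = df.sample(frac=train_pct, random_state=random_state)
    test = df.drop(train.index)  # every row not sampled into train
    x_train, y_train = train.drop(columns=[dv]), train[dv]
    x_test, y_test = test.drop(columns=[dv]), test[dv]
    return x_train, y_train, x_test, y_test

my_df = pd.DataFrame({"spend": range(10), "my_outcome": [0, 1] * 5})
x_train, y_train, x_test, y_test = simple_split(my_df, dv="my_outcome", train_pct=0.7)
print(len(x_train), len(x_test))
```
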
MARS (pyearth) Modeling - Logistic Regression Only (For now)

Description

Create a predictive model using MARS (pyearth). Further documentation on the algorithm can be found at https://contrib.scikit-learn.org/py-earth/

gitlabds.mars_modeling(x_train, y_train, x_test , model_weights=[1,1], allow_missing =True, max_degree=1, max_terms=100, max_iter=100, model_out='model.joblib'):

Parameters:

  • x_train : train "predictors" dataframe
  • y_train : train outcome/dv/target dataframe
  • x_test : test "predictors" dataframe
  • model_weights : (Optional) Pass model weights if up-/down-sampling was performed on the datasets. Otherwise defaults to a weight of 1 for x and y dataframes
  • allow_missing : If True, use missing argument to determine missingness or, if X is a pandas DataFrame, infer missingness from X.
  • max_degree : The maximum degree of terms generated by the forward pass. Setting = 2 will look at all first-order interactions between predictor columns (i.e., x_train) and the outcome (y_train).
  • max_terms : The maximum number of terms generated by the forward pass. All memory is allocated at the beginning of the forward pass, so setting max_terms to a very high number on a system with insufficient memory may cause a MemoryError at the start of the forward pass.
  • max_iter : Maximum number of iterations taken for the solvers to converge.
  • model_out : Where to save the resultant model file (packaged as joblib).

Returns

  • model fit, model equation (string), pared-down x_train and x_test dataframes that contain only the columns that hit in the model.

Examples:

#Model using MARS default settings
model, equation, x_train, x_test = gitlabds.mars_modeling(x_train, y_train, x_test)
#Model including all first-order interactions and save out pared-down x_train and x_test to new dataframes. Save the model fit to file as `my_model.mdl`.
model, equation, x_train_pared, x_test_pared = gitlabds.mars_modeling(x_train, y_train, x_test, max_degree=2, model_out='my_model.mdl')
XGBoost Modeling (Coming Soon)

Description

Documentation for this function is coming soon.

Model Metrics

Description

Display a variety of model metrics for linear and logistic predictive models.

gitlabds.model_metrics(model, x_train, y_train, x_test, y_test, show_graphs=True, f_score = 0.50, classification = True, algo=None, decile_n=10, top_features_n=20):

Parameters:

  • model : model file from training
  • x_train : train "predictors" dataframe. If using mars_modeling, this is the output containing only the columns used in the model (e.g. x_train_pared)
  • y_train : train outcome/dv/target dataframe
  • x_test : test "predictors" dataframe. If using mars_modeling, this is the output containing only the columns used in the model (e.g. x_test_pared)
  • y_test : test outcome/dv/target dataframe
  • show_graphs : Set to True to show visualizations
  • f_score : Cut point for determining a correct classification. Must also set classification to True to enable.
  • classification : Set to True to show classification model metrics (accuracy, precision, recall, F1). Set show_graphs to True to display confusion matrix.
  • algo : Select the algorithm used to display additional model metrics. Currently supports mars, rf, xgb, and None
  • decile_n : Specify the number of groups to create to calculate lift. Defaults to 10 (deciles)
  • top_features_n : Print a list of the top x features present in the model.

Returns

  • Prints model metrics and returns dataframes for model metrics and lift (plus classification metrics when classification=True).

Examples:

#Display model metrics from an XGBoost model. Return classification metrics using a cut point of 0.30 F-Score
model_metrics, lift, class_metrics = gitlabds.model_metrics(model=model, x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test, show_graphs=True, f_score = 0.3, classification=True, algo='xgb')
#Display model metrics from a MARS model. Do not return classification metrics and suppress visualizations
model_metrics, lift = gitlabds.model_metrics(model=model, x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test, show_graphs=False, classification=False, algo='mars')
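
For readers unfamiliar with lift tables, the sketch below shows one conventional way to compute lift by score group with pandas. How gitlabds builds its own lift dataframe is not documented here, so treat `lift_table` and its column names as hypothetical; the scores and outcomes are made-up values.

```python
import pandas as pd

def lift_table(y_true, y_score, decile_n: int = 10) -> pd.DataFrame:
    """Response rate and lift per score group (group 1 = highest scores).

    Hypothetical helper for illustration; not part of gitlabds.
    """
    data = pd.DataFrame({"actual": list(y_true), "score": list(y_score)})
    # Rank first so tied scores cannot break the equal-sized groups.
    ranks = data["score"].rank(method="first", ascending=False)
    data["group"] = pd.qcut(ranks, q=decile_n, labels=False) + 1
    overall_rate = data["actual"].mean()
    table = data.groupby("group")["actual"].agg(records="count", positives="sum")
    table["response_rate"] = table["positives"] / table["records"]
    # Lift: how much better a group responds than the population overall.
    table["lift"] = table["response_rate"] / overall_rate
    return table

scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]
actuals = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
lift = lift_table(actuals, scores, decile_n=5)
print(lift)
```

A lift above 1 in the top groups indicates that the model concentrates positive outcomes among its highest-scored records.
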
Marginal Effects

Description

Calculates and returns the marginal effects at the mean (MEM) for predictor fields.

gitlabds.marginal_effects(model, x_test, dv_description, field_labels=None):

Parameters:

  • model : model file from training
  • x_test : test "predictors" dataframe. If using mars_modeling, this is the output containing only the columns used in the model (e.g. x_test_pared)
  • dv_description : Description of the outcome field to be used in text-based insights.
  • field_labels : Dict of field descriptions. The key is the field/feature/predictor name. The value is descriptive text of the field. This field is optional and by default will use the field name

Returns

  • Dataframe of marginal effects.

Examples:

#Calculate marginal effects at the mean, using descriptive labels for each field
mem = gitlabds.marginal_effects(model=model, x_test=x_test, dv_description='likelihood to churn', field_labels={'spend':'Dollars spent in last 6 months', 'returns':'Item returns in last 3 months', 'emails_sent':'Marketing emails sent in last month'})
print(mem)
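
As background, for a plain logistic regression the marginal effect at the mean of a continuous predictor is its coefficient multiplied by p(1 - p), where p is the predicted probability at the predictor means. The sketch below illustrates that textbook formula with NumPy; the coefficients and data are made up, `marginal_effects_at_mean` is a hypothetical helper, and gitlabds may compute its values differently (for example, for MARS basis functions).

```python
import numpy as np

def marginal_effects_at_mean(intercept: float, coefs: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Textbook MEM for a logistic regression with continuous predictors.

    Hypothetical helper for illustration; not part of gitlabds.
    """
    x_bar = x.mean(axis=0)                    # predictor means
    linear = intercept + x_bar @ coefs        # log-odds at the means
    p = 1.0 / (1.0 + np.exp(-linear))         # predicted probability at the means
    return coefs * p * (1.0 - p)

# Made-up coefficients and data; column means are 1.0 and 2.0.
coefs = np.array([0.8, -0.4])
x = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]])
mem_values = marginal_effects_at_mean(intercept=0.0, coefs=coefs, x=x)
print(mem_values)
```

Here the log-odds at the means are zero, so p is 0.5 and each effect is one quarter of its coefficient.
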
Prescriptions

Description

Return "actionable" prescriptions and explanatory insights for each scored record. Insights first list actionable prescriptions, followed by explanatory insights. This approach is recommended for linear/logistic methodologies only. Caution should be used if using a black box approach, as manipulating more than one prescription at a time could change a record's model score in unintended ways.

gitlabds.prescriptions(model, input_df, scored_df, actionable_fields, dv_description, field_labels=None, returned_insights=5, only_actionable=False, explanation_fields='all'):

Parameters:

  • model : model file from training
  • input_df : train "predictors" dataframe. If using mars_modeling, this is the output containing only the columns used in the model (e.g. x_train_pared)
  • scored_df : dataframe containing model scores.
  • actionable_fields : Dict of actionable fields. The key is the field/feature/predictor name. The value accepts one of 3 values: Increasing for prescriptions only when the field increases; Decreasing for prescriptions only when the field decreases; Both for when the field either increases or decreases.
  • dv_description : Description of the outcome field to be used in text-based insights.
  • field_labels : Dict of field descriptions. The key is the field/feature/predictor name. The value is descriptive text of the field. This field is optional and by default will use the field name
  • returned_insights : Number of insights per record to return. Defaults to 5
  • only_actionable : Only return actionable prescriptions
  • explanation_fields : List of explainable (non-actionable insights) fields to return insights for. Defaults to 'all'

Returns

  • Dataframe of prescriptive actions. One row per record input.

Examples:

#Return prescriptions for the actionable fields of 'spend', 'returns', and 'emails_sent':
gitlabds.prescriptions(model=model, input_df=my_df, scored_df=my_scores, actionable_fields={'spend':'Increasing', 'returns':'Decreasing', 'emails_sent':'Both'}, dv_description='likelihood to churn', field_labels={'spend':'Dollars spent in last 6 months', 'returns':'Item returns in last 3 months', 'emails_sent':'Marketing emails sent in last month'}, returned_insights=5, only_actionable=True, explanation_fields=['spend', 'returns'])

Gitlab Data Science

The handbook is the single source of truth for all of our documentation.

Contributing

We welcome contributions and improvements; please see the contribution guidelines.

License

This code is distributed under the MIT license; please see the LICENSE file.
