Gitlab Data Science and Modeling Tools

What is it?

gitlabds is a set of tools designed to make it quicker and easier to build predictive models.

Where to get it?

gitlabds can be installed directly via pip: pip install gitlabds.

Alternatively, you can download the source code from Gitlab at https://gitlab.com/gitlab-data/gitlabds and compile locally.

Main Features

  • Data prep tools:
    • Treat outliers
    • Dummy code
    • Miss fill
    • Reduce feature space
    • Split and sample data into train/test
  • Modeling tools:
    • Quickly generate models using MARS (via the pyearth implementation)
    • Quickly generate models using XGBoost
    • Easily produce model metrics, feature importance, performance graphs, and lift/gains charts

References and Examples

MAD Outliers

Description

Median Absolute Deviation (MAD) for outlier detection and correction. By default, this will windsor (that is, winsorize, or clip to a boundary value) all numeric values in your dataframe that are more than 4 standard deviations above or below the median (the 'threshold').

gitlabds.mad_outliers(df, dv=None, min_levels=10, columns='all', threshold=4, inplace=False, verbose=True, windsor_threshold=0.01)

Parameters:

  • df : your pandas dataframe
  • dv : The field name of your outcome. Entering your outcome variable will prevent it from being windsored. May be left blank if there is no outcome variable.
  • min_levels : Only include fields that have at least the number of levels specified.
  • columns : Examines all numeric columns by default. To limit to just a subset of columns, pass a list of column names. Doing so will ignore any constraints set by the 'dv' and 'min_levels' parameters.
  • threshold : Windsor values greater than this number of standard deviations from the median.
  • inplace : Set to True to modify the existing dataframe in place. Set to False to create a new one.
  • verbose : Set to True to print outputs of windsoring being done. Set to False to suppress.
  • windsor_threshold : Only windsor values that affect less than this percentage of the population.

Returns

  • DataFrame with windsored values or None if inplace=True.
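
The idea behind the description above can be sketched in plain pandas. This is not the gitlabds implementation, which this page does not show; the helper name `mad_winsorize` is hypothetical, and the 1.4826 scaling constant (which makes the MAD comparable to a standard deviation for normally distributed data) is an assumption. The `min_levels` and `windsor_threshold` checks are left out.

```python
import pandas as pd


def mad_winsorize(series: pd.Series, threshold: float = 4.0) -> pd.Series:
    """Clip values lying more than `threshold` scaled MADs from the median.

    Hypothetical helper for illustration only; not part of gitlabds.
    """
    median = series.median()
    # Median absolute deviation: the median distance of each value from the median.
    mad = (series - median).abs().median()
    # Assumption: rescale the MAD so it approximates a standard deviation.
    scaled_mad = 1.4826 * mad
    if scaled_mad == 0:
        # No spread around the median, so there is nothing sensible to clip.
        return series.copy()
    lower = median - threshold * scaled_mad
    upper = median + threshold * scaled_mad
    return series.clip(lower=lower, upper=upper)


values = pd.Series([10, 11, 12, 13, 14, 15, 16, 500], dtype=float)
clipped = mad_winsorize(values, threshold=4.0)
print(clipped.tolist())
```

Only the extreme value (500) is pulled in to the upper boundary; the remaining values are inside the boundaries and stay as they are.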

Examples:

#Create a new df; only windsor selected columns; suppress verbose
import gitlabds
new_df = gitlabds.mad_outliers(df = my_df, dv='my_outcome', columns = ['colA', 'colB', 'colC'], verbose=False)
#Inplace outliers. Will windsor values by altering the current dataframe
import gitlabds
gitlabds.mad_outliers(df = my_df, dv='my_outcome', columns = 'all', inplace=True)

Miss Fill

Description

Fill missing values using a range of different options.

gitlabds.missing_fill(df=None, columns='all', method='zero', inplace=False)

Parameters:

  • df : your pandas dataframe
  • columns : The columns to miss fill. Defaults to 'all', which will miss fill every field that has missing values.
  • method : Options are zero, median, and mean. Defaults to zero.
  • inplace : Set to True to modify the existing dataframe in place. Set to False to create a new one.

Returns

  • DataFrame with missing values filled or None if inplace=True.
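
For readers who want to see what the three methods amount to, here is a minimal sketch in plain pandas. It is an illustration under assumptions, not the gitlabds source: the helper name `fill_missing` is hypothetical, and it always returns a new dataframe rather than supporting `inplace`.

```python
import numpy as np
import pandas as pd


def fill_missing(df: pd.DataFrame, columns="all", method: str = "zero") -> pd.DataFrame:
    """Return a copy of `df` with missing values filled (hypothetical helper)."""
    out = df.copy()
    if isinstance(columns, str) and columns == "all":
        # Only touch the columns that actually contain missing values.
        cols = out.columns[out.isna().any()].tolist()
    else:
        cols = list(columns)
    for col in cols:
        if method == "zero":
            fill_value = 0
        elif method == "median":
            fill_value = out[col].median()
        elif method == "mean":
            fill_value = out[col].mean()
        else:
            raise ValueError(f"Unknown method: {method}")
        out[col] = out[col].fillna(fill_value)
    return out


example = pd.DataFrame({"colA": [1.0, np.nan, 3.0], "colB": [np.nan, 4.0, 8.0]})
filled_zero = fill_missing(example, method="zero")
filled_mean = fill_missing(example, columns=["colA"], method="mean")
```

In `filled_mean` only colA is filled (with the mean of its non-missing values); colB keeps its missing value because it was not listed.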

Examples:

#Miss fill all values with zero in place.
gitlabds.missing_fill(df=my_df, columns='all', method='zero', inplace=True)
#Miss fill specified columns with the mean value into a new dataframe
new_df = gitlabds.missing_fill(df=my_df, columns=['colA', 'colB', 'colC'], method='mean', inplace=False)

Dummy Code

Description

Dummy code (AKA "one-hot encode") categorical and numeric fields based on the parameters specified below. Note: categorical fields will be dropped after they are dummy coded; numeric fields will not.

gitlabds.dummy_code(df, dv=None, columns='all', categorical=True, numeric=True, categorical_max_levels=20, numeric_max_levels=10, dummy_na=False)

Parameters:

  • df : your pandas dataframe
  • dv : The field name of your outcome. Entering your outcome variable will prevent it from being dummy coded. May be left blank if there is no outcome variable.
  • columns : Examines all columns by default. To limit to just a subset of columns, pass a list of column names.
  • categorical : Set to True to attempt to dummy code any categorical column passed via the columns parameter.
  • numeric : Set to True to attempt to dummy code any numeric column passed via the columns parameter.
  • categorical_max_levels : Maximum number of levels a categorical column can have to be eligible for dummy coding.
  • numeric_max_levels : Maximum number of levels a numeric column can have to be eligible for dummy coding.
  • dummy_na : Set to True to create a dummy coded column for missing values.

Returns

  • DataFrame with dummy-coded columns. Categorical columns that were dummy coded will be dropped from the dataframe.
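
To show what dummy coding produces, here is a small sketch using `pandas.get_dummies` directly. It only illustrates the transformation on a made-up dataframe (the `plan` and `seats` columns are assumptions); whether gitlabds uses `get_dummies` internally is not stated on this page.

```python
import pandas as pd

# Made-up data: one categorical column with a missing value, one numeric column.
example = pd.DataFrame(
    {
        "plan": ["free", "premium", "free", None],
        "seats": [1, 5, 1, 2],
    }
)

# One indicator column per level of `plan`.
# dummy_na=True adds an extra indicator column for missing values.
dummies = pd.get_dummies(example["plan"], prefix="plan", dummy_na=True, dtype=int)

# Drop the original categorical column and append its indicator columns,
# mirroring the note that categorical fields are dropped after dummy coding.
encoded = pd.concat([example.drop(columns="plan"), dummies], axis=1)
print(encoded)
```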

Examples:

#Dummy code only categorical columns with a maximum of 30 levels. Do not dummy code missing values
new_df = gitlabds.dummy_code(df=my_df, dv='my_outcome', columns='all', categorical=True, numeric=False, categorical_max_levels = 30, dummy_na=False)
#Dummy code only columns specified in the `columns` parameter with a maximum of 10 levels for categorical and numeric. Also dummy code missing values
new_df = gitlabds.dummy_code(df=my_df, dv='my_outcome', columns= ['colA', 'colB', 'colC'], categorical=True, numeric=True, categorical_max_levels = 10, numeric_max_levels = 10,  dummy_na=True)

Top Dummies

Description

Dummy codes only categorical levels above a certain threshold of the population. Useful when a field contains many levels but there is not a need or desire to dummy code every level. Currently only works for categorical columns.

gitlabds.dummy_top(df=None, dv=None, columns='all', min_threshold=0.05, drop_categorial=True, verbose=True)

Parameters:

  • df : your pandas dataframe
  • dv : The field name of your outcome. Entering your outcome variable will prevent it from being dummy coded. May be left blank if there is no outcome variable.
  • columns : Examines all columns by default. To limit to just a subset of columns, pass a list of column names.
  • min_threshold : The threshold at which levels will be dummy coded. For example, the default value of 0.05 will dummy code any categorical level that is in at least 5% of all rows.
  • drop_categorial : Set to True to drop categorical fields after they are considered for dummy coding. Set to False to keep the original categorical fields in the dataframe.
  • verbose : Set to True to print detailed list of all dummy fields being created. Set to False to suppress.

Returns

  • DataFrame with dummy coded columns.
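
The thresholding rule described above can be sketched as follows. This is an illustration only, not the gitlabds implementation: the helper name `top_level_dummies` is hypothetical, and counting each level's share against all rows (including rows with missing values) is an assumption.

```python
import pandas as pd


def top_level_dummies(series: pd.Series, min_threshold: float = 0.05) -> pd.DataFrame:
    """Indicator columns for levels present in at least `min_threshold` of rows.

    Hypothetical helper for illustration only; not part of gitlabds.
    """
    # Share of all rows taken up by each level.
    shares = series.value_counts(dropna=True) / len(series)
    keep = shares[shares >= min_threshold].index
    return pd.DataFrame(
        {f"{series.name}_{level}": (series == level).astype(int) for level in keep},
        index=series.index,
    )


# Made-up data: US is 60% of rows, DE is 30%, NZ is 10%.
country = pd.Series(["US"] * 6 + ["DE"] * 3 + ["NZ"], name="country")
dummies = top_level_dummies(country, min_threshold=0.20)
print(dummies.columns.tolist())
```

With a threshold of 0.20, only the US and DE levels get indicator columns; NZ falls below the threshold and is not dummy coded.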

Examples:

#Dummy code all categorical levels from all categorical fields whose values are in at least 5% of all rows.
new_df = gitlabds.dummy_top(df=my_df, dv='my_outcome', columns = 'all', min_threshold = 0.05, drop_categorial=True, verbose=True)
#Dummy code all categorical levels from the selected fields whose values are in at least 10% of all rows; suppress verbose printout and retain original categorical fields.
new_df = gitlabds.dummy_top(df=my_df, dv='my_outcome', columns = ['colA', 'colB', 'colC'], min_threshold = 0.10, drop_categorial=False, verbose=False)

Remove Low Variation Fields

Description

Description.

call:

Parameters:

  • df : your pandas dataframe
  • inplace : Set to True to modify the existing dataframe in place. Set to False to create a new one.
  • verbose : Set to True to print outputs of windsoring being done. Set to False to suppress.

Returns

  • DataFrame with windsored values or None if inplace=True.

Examples:

#Example 1
#Example 2

Correlation Reduction

Description

Description.

call:

Parameters:

  • df : your pandas dataframe
  • inplace : Set to True to modify the existing dataframe in place. Set to False to create a new one.
  • verbose : Set to True to print outputs of windsoring being done. Set to False to suppress.

Returns

  • DataFrame with windsored values or None if inplace=True.

Examples:

#Example 1
#Example 2

Drop Categorical Fields

Description

Description.

call:

Parameters:

  • df : your pandas dataframe
  • inplace : Set to True to modify the existing dataframe in place. Set to False to create a new one.
  • verbose : Set to True to print outputs of windsoring being done. Set to False to suppress.

Returns

  • DataFrame with windsored values or None if inplace=True.

Examples:

#Example 1
#Example 2

Remove Outcome Proxies

Description

Description.

call:

Parameters:

  • df : your pandas dataframe
  • inplace : Set to True to modify the existing dataframe in place. Set to False to create a new one.
  • verbose : Set to True to print outputs of windsoring being done. Set to False to suppress.

Returns

  • DataFrame with windsored values or None if inplace=True.

Examples:

#Example 1
#Example 2

Split and Sample Data

Description

Description.

call:

Parameters:

  • df : your pandas dataframe
  • inplace : Set to True to modify the existing dataframe in place. Set to False to create a new one.
  • verbose : Set to True to print outputs of windsoring being done. Set to False to suppress.

Returns

  • DataFrame with windsored values or None if inplace=True.

Examples:

#Example 1
#Example 2

MARS (pyearth) Modeling - Logistic Regression Only (For now)

Description

Description.

call:

Parameters:

  • df : your pandas dataframe
  • inplace : Set to True to modify the existing dataframe in place. Set to False to create a new one.
  • verbose : Set to True to print outputs of windsoring being done. Set to False to suppress.

Returns

  • DataFrame with windsored values or None if inplace=True.

Examples:

#Example 1
#Example 2

XGBoost Modeling (Coming Soon)

Description

Description.

call:

Parameters:

  • df : your pandas dataframe
  • inplace : Set to True to modify the existing dataframe in place. Set to False to create a new one.
  • verbose : Set to True to print outputs of windsoring being done. Set to False to suppress.

Returns

  • DataFrame with windsored values or None if inplace=True.

Examples:

#Example 1
#Example 2

Model Metrics

Description

Description.

call:

Parameters:

  • df : your pandas dataframe
  • inplace : Set to True to modify the existing dataframe in place. Set to False to create a new one.
  • verbose : Set to True to print outputs of windsoring being done. Set to False to suppress.

Returns

  • DataFrame with windsored values or None if inplace=True.

Examples:

#Example 1
#Example 2

Gitlab Data Science

The handbook is the single source of truth for all of our documentation.

Contributing

We welcome contributions and improvements. Please see the contribution guidelines.

License

This code is distributed under the MIT license. Please see the LICENSE file.


