gitlabds

Gitlab Data Science and Modeling Tools

Project description

What is it?

gitlabds is a set of tools designed to make it quicker and easier to build predictive models.

Where to get it?

gitlabds can be installed directly via pip: pip install gitlabds.

Alternatively, you can download the source code from Gitlab at https://gitlab.com/gitlab-data/gitlabds and compile locally.

Main Features

Data prep tools:
- Treat outliers
- Dummy code
- Miss fill
- Reduce feature space
- Split and sample data into train/test
Modeling tools:
- Quickly generate models using MARS (via the pyearth implementation)
- Quickly generate models using XGBoost
- Easily produce model metrics, feature importance, performance graphs, and lift/gains charts

References and Examples

MAD Outliers

Description

Median Absolute Deviation (MAD) for outlier detection and correction. By default, this will windsor all numeric values in your dataframe that are more than 4 standard deviations above or below the median ('threshold').

gitlabds.mad_outliers(df, dv=None, min_levels=10, columns = 'all', threshold=4, inplace=False, verbose=True, windsor_threshold=0.01):

Parameters:

df : your pandas dataframe
dv : The field name of your outcome. Entering your outcome variable will prevent it from being windsored. May be left blank if there is no outcome variable.
min_levels : Only include fields that have at least the number of levels specified.
columns : Will examine all numeric columns by default. To limit to just a subset of columns, pass a list of column names. Doing so will ignore any constraints put on by the 'dv' and 'min_levels' parameters.
threshold : Windsor values greater than this number of standard deviations from the median.
inplace : Set to True to replace the existing dataframe. Set to False to create a new one.
verbose : Set to True to print outputs of windsoring being done. Set to False to suppress.
windsor_threshold : Only windsor values that affect less than this percentage of the population.

Returns

DataFrame with windsored values or None if inplace=True.

Examples:

#Create a new df; only windsor selected columns; suppress verbose
import gitlabds
new_df = gitlabds.mad_outliers(df = my_df, dv='my_outcome', columns = ['colA', 'colB', 'colC'], verbose=False)

#Inplace outliers. Will windsor values by altering the current dataframe
import gitlabds
gitlabds.mad_outliers(df = my_df, dv='my_outcome', columns = 'all', inplace=True)
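
To make the idea behind MAD-based windsoring concrete, here is a minimal sketch in plain pandas. It is not the gitlabds implementation: the function name `mad_windsor_sketch` is hypothetical, the 1.4826 scaling constant (which makes MAD comparable to a standard deviation for normally distributed data) is an assumption, and the `min_levels` and `windsor_threshold` checks are left out.

```python
import pandas as pd


def mad_windsor_sketch(series: pd.Series, threshold: float = 4.0) -> pd.Series:
    """Clip values lying more than `threshold` scaled MADs from the median."""
    median = series.median()
    # Raw MAD: the median of the absolute deviations from the median.
    mad = (series - median).abs().median()
    if mad == 0:
        # No spread around the median, so there is no sensible bound to clip to.
        return series.copy()
    # Assumption: scale MAD by 1.4826 so it approximates a standard deviation.
    scaled_mad = 1.4826 * mad
    lower = median - threshold * scaled_mad
    upper = median + threshold * scaled_mad
    return series.clip(lower=lower, upper=upper)


values = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 500.0])
clipped = mad_windsor_sketch(values, threshold=4.0)
print(clipped.tolist())
```

In this toy series only the value 500.0 falls outside the bounds, so it is pulled down to the upper bound while the other values are left untouched.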

Miss Fill

Description

Fill missing values using a range of different options.

gitlabds.missing_fill(df=None, columns='all', method='zero', inplace=False):

Parameters:

df : your pandas dataframe
columns : Columns to miss fill. Defaults to 'all', which will miss fill all fields with missing values.
method : Options are zero, median, and mean. Defaults to zero.
inplace : Set to True to replace the existing dataframe. Set to False to create a new one.

Returns

DataFrame with missing values filled or None if inplace=True.

Examples:

#Miss fill all values with zero in place.
gitlabds.missing_fill(df=my_df, columns='all', method='zero', inplace=True)

#Miss fill specified columns with the mean value into a new dataframe
new_df = gitlabds.missing_fill(df=my_df, columns=['colA', 'colB', 'colC'], method='mean', inplace=False)
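
For readers who want to see what the three fill methods amount to, the following is a rough pandas equivalent. It is a sketch only: `missing_fill_sketch` is a hypothetical name, it always returns a new dataframe, and it assumes the selected columns are numeric.

```python
import numpy as np
import pandas as pd


def missing_fill_sketch(df, columns="all", method="zero"):
    """Return a copy of `df` with missing values filled column by column."""
    out = df.copy()
    if isinstance(columns, str) and columns == "all":
        # Only touch columns that actually contain missing values.
        columns = [c for c in out.columns if out[c].isna().any()]
    for col in columns:
        if method == "zero":
            fill_value = 0
        elif method == "median":
            fill_value = out[col].median()
        elif method == "mean":
            fill_value = out[col].mean()
        else:
            raise ValueError(f"Unknown method: {method!r}")
        out[col] = out[col].fillna(fill_value)
    return out


my_df = pd.DataFrame({"colA": [1.0, np.nan, 3.0], "colB": [np.nan, 4.0, 8.0]})
filled = missing_fill_sketch(my_df, columns=["colA", "colB"], method="mean")
print(filled)
```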

Dummy Code

Description

Dummy code (AKA "one-hot encode") categorical and numeric fields based on the parameters specified below. Note: categorical fields will be dropped after they are dummy coded; numeric fields will not.

gitlabds.dummy_code(df, dv=None, columns='all', categorical=True, numeric=True, categorical_max_levels = 20, numeric_max_levels = 10, dummy_na=False):

Parameters:

df : your pandas dataframe
dv : The field name of your outcome. Entering your outcome variable will prevent it from being dummy coded. May be left blank if there is no outcome variable.
columns : Will examine all columns by default. To limit to just a subset of columns, pass a list of column names.
categorical : Set to True to attempt to dummy code any categorical column passed via the columns parameter.
numeric : Set to True to attempt to dummy code any numeric column passed via the columns parameter.
categorical_max_levels : Maximum number of levels a categorical column can have to be eligible for dummy coding.
numeric_max_levels : Maximum number of levels a numeric column can have to be eligible for dummy coding.
dummy_na : Set to True to create a dummy coded column for missing values.

Returns

DataFrame with dummy-coded columns. Categorical columns that were dummy coded will be dropped from the dataframe.

Examples:

#Dummy code only categorical columns with a maximum of 30 levels. Do not dummy code missing values
new_df = gitlabds.dummy_code(df=my_df, dv='my_outcome', columns='all', categorical=True, numeric=False, categorical_max_levels = 30, dummy_na=False)

#Dummy code only columns specified in the `columns` parameter with a maximum of 10 levels for categorical and numeric. Also dummy code missing values
new_df = gitlabds.dummy_code(df=my_df, dv='my_outcome', columns=['colA', 'colB', 'colC'], categorical=True, numeric=True, categorical_max_levels = 10, numeric_max_levels = 10, dummy_na=True)
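
The behaviour described above (categorical fields replaced by their dummies, numeric fields kept alongside theirs) can be approximated with `pandas.get_dummies`. This is an illustrative sketch, not the gitlabds code; the sample data and column names are made up.

```python
import pandas as pd

my_df = pd.DataFrame(
    {
        "plan": ["free", "premium", "free", None],
        "seats": [1, 5, 1, 5],
        "my_outcome": [0, 1, 0, 1],
    }
)

# Only dummy code categorical columns with few enough distinct levels.
categorical_max_levels = 20
eligible = [c for c in ["plan"] if my_df[c].nunique() <= categorical_max_levels]

# get_dummies drops the original categorical column; dummy_na=True adds a
# separate indicator column for missing values.
encoded = pd.get_dummies(my_df, columns=eligible, dummy_na=True, dtype=int)

# For the numeric column, build the dummies separately and join them so the
# original numeric field stays in the dataframe.
seat_dummies = pd.get_dummies(my_df["seats"], prefix="seats", dtype=int)
encoded = encoded.join(seat_dummies)

print(sorted(encoded.columns))
```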

Top Dummies

Description

Dummy codes only categorical levels above a certain threshold of the population. Useful when a field contains many levels but there is not a need or desire to dummy code every level. Currently only works for categorical columns.

gitlabds.dummy_top(df=None, dv=None, columns = 'all', min_threshold = 0.05, drop_categorial=True, verbose=True):

Parameters:

df : your pandas dataframe
dv : The field name of your outcome. Entering your outcome variable will prevent it from being dummy coded. May be left blank if there is no outcome variable.
columns : Will examine all columns by default. To limit to just a subset of columns, pass a list of column names.
min_threshold : The threshold at which levels will be dummy coded. For example, the default value of 0.05 will dummy code any categorical level that is in at least 5% of all rows.
drop_categorial : Set to True to drop categorical fields after they are considered for dummy coding. Set to False to keep the original categorical fields in the dataframe.
verbose : Set to True to print detailed list of all dummy fields being created. Set to False to suppress.

Returns

DataFrame with dummy coded columns.

Examples:

#Dummy code all categorical levels from all categorical fields whose values are in at least 5% of all rows.
new_df = gitlabds.dummy_top(df=my_df, dv='my_outcome', columns = 'all', min_threshold = 0.05, drop_categorial=True, verbose=True)

#Dummy code all categorical levels from the selected fields whose values are in at least 10% of all rows; suppress verbose printout and retain original categorical fields.
new_df = gitlabds.dummy_top(df=my_df, dv='my_outcome', columns = ['colA', 'colB', 'colC'], min_threshold = 0.10, drop_categorial=False, verbose=False)
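
The thresholding logic can be sketched in a few lines of pandas. The helper below, `dummy_top_sketch`, is hypothetical and handles a single column; it keeps the original categorical field and does not reproduce the `dv`, drop, or verbose options.

```python
import pandas as pd


def dummy_top_sketch(df, column, min_threshold=0.05):
    """Add 0/1 columns for levels of `column` found in at least `min_threshold` of rows."""
    out = df.copy()
    # Share of all rows taken by each level (missing values are not counted).
    shares = out[column].value_counts() / len(out)
    for level in shares[shares >= min_threshold].index:
        out[f"{column}_{level}"] = (out[column] == level).astype(int)
    return out


my_df = pd.DataFrame({"colA": ["a"] * 6 + ["b"] * 3 + ["c"]})
new_df = dummy_top_sketch(my_df, "colA", min_threshold=0.25)
print(new_df.columns.tolist())
```

Here level "c" appears in only 10% of rows, so with a threshold of 0.25 it gets no dummy column, while "a" and "b" do.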

Remove Low Variation Fields

Description

Description.

call:

Parameters:

df : your pandas dataframe
inplace : Set to True to replace the existing dataframe. Set to False to create a new one.
verbose : Set to True to print outputs of windsoring being done. Set to False to suppress.

Returns

DataFrame with windsored values or None if inplace=True.

Examples:

#Example 1

#Example 2

Correlation Reduction

Description

Description.

call:

Parameters:

df : your pandas dataframe
inplace : Set to True to replace the existing dataframe. Set to False to create a new one.
verbose : Set to True to print outputs of windsoring being done. Set to False to suppress.

Returns

DataFrame with windsored values or None if inplace=True.

Examples:

#Example 1

#Example 2

Drop Categorical Fields

Description

Description.

call:

Parameters:

df : your pandas dataframe
inplace : Set to True to replace the existing dataframe. Set to False to create a new one.
verbose : Set to True to print outputs of windsoring being done. Set to False to suppress.

Returns

DataFrame with windsored values or None if inplace=True.

Examples:

#Example 1

#Example 2

Remove Outcome Proxies

Description

Description.

call:

Parameters:

df : your pandas dataframe
inplace : Set to True to replace the existing dataframe. Set to False to create a new one.
verbose : Set to True to print outputs of windsoring being done. Set to False to suppress.

Returns

DataFrame with windsored values or None if inplace=True.

Examples:

#Example 1

#Example 2

Split and Sample Data

Description

Description.

call:

Parameters:

df : your pandas dataframe
inplace : Set to True to replace the existing dataframe. Set to False to create a new one.
verbose : Set to True to print outputs of windsoring being done. Set to False to suppress.

Returns

DataFrame with windsored values or None if inplace=True.

Examples:

#Example 1

#Example 2

MARS (pyearth) Modeling - Logistic Regression Only (For now)

Description

Description.

call:

Parameters:

df : your pandas dataframe
inplace : Set to True to replace the existing dataframe. Set to False to create a new one.
verbose : Set to True to print outputs of windsoring being done. Set to False to suppress.

Returns

DataFrame with windsored values or None if inplace=True.

Examples:

#Example 1

#Example 2

XGBoost Modeling (Coming Soon)

Description

Description.

call:

Parameters:

df : your pandas dataframe
inplace : Set to True to replace the existing dataframe. Set to False to create a new one.
verbose : Set to True to print outputs of windsoring being done. Set to False to suppress.

Returns

DataFrame with windsored values or None if inplace=True.

Examples:

#Example 1

#Example 2

Model Metrics

Description

Description.

call:

Parameters:

df : your pandas dataframe
inplace : Set to True to replace the existing dataframe. Set to False to create a new one.
verbose : Set to True to print outputs of windsoring being done. Set to False to suppress.

Returns

DataFrame with windsored values or None if inplace=True.

Examples:

#Example 1

#Example 2

Gitlab Data Science

The handbook is the single source of truth for all of our documentation.

Contributing

We welcome contributions and improvements; please see the contribution guidelines.

License

This code is distributed under the MIT license; please see the LICENSE file.

Project details

This version

1.0.7

Sep 30, 2021

