GitLab Data Science and Modeling Tools


gitlabds

What is it?

gitlabds is a Python toolkit that streamlines machine learning workflows with specialized functions for data preparation, feature engineering, model evaluation, and deployment. It helps data scientists focus on modeling by providing consistent patterns for both experimentation and production pipelines.

Installation

pip install gitlabds

Requirements

  • Python 3.10 or later
  • Core dependencies:
    • pandas>=2.1.4
    • numpy>=1.26.4
    • scipy>=1.13.1
    • scikit-learn>=1.5.1
    • imbalanced-learn>=0.12.3
    • seaborn>=0.13.2
    • shap>=0.46.0
    • tqdm>=4.66.1

Main Features by Category

Data Preparation

Outlier Detection and Treatment

MAD Outliers

Description

Median Absolute Deviation (MAD) based outlier detection and correction. By default, winsorizes all numeric values in your dataframe that are more than 4 standard deviations above or below the median (the 'threshold' parameter).

gitlabds.mad_outliers(df, dv=None, min_levels=10, columns='all', threshold=4.0, auto_adjust_skew=False, verbose=True, windsor_threshold=0.01):

Parameters:

  • df : your pandas dataframe
  • dv : The column name of your outcome. Entering your outcome variable will prevent it from being winsorized. May be left blank if there is no outcome variable.
  • min_levels : Only include columns that have at least the number of levels specified.
  • columns : Will examine all numeric columns by default. To limit to just a subset of columns, pass a list of column names. Doing so will ignore any constraints imposed by the 'dv' and 'min_levels' parameters.
  • threshold : Winsorize values greater than this number of standard deviations from the median.
  • auto_adjust_skew : Whether to adjust thresholds based on column skewness.
  • verbose : Set to True to print outputs of the winsorizing being done. Set to False to suppress.
  • windsor_threshold : Only winsorize values that affect less than this percentage of the population.

Returns

  • Tuple containing:
    • The transformed DataFrame with outliers winsorized
    • Dictionary of outlier limits that can be used with apply_outliers()

Examples:

# Create a new df; only winsorize selected columns; suppress verbose
import gitlabds
new_df, outlier_limits = gitlabds.mad_outliers(df=my_df, dv='my_outcome', columns=['colA', 'colB', 'colC'], verbose=False)
# Winsorize values with skew adjustment for highly skewed data
new_df, outlier_limits = gitlabds.mad_outliers(df=my_df, threshold=3.0, auto_adjust_skew=True)
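For intuition, the core MAD winsorization step can be sketched in plain pandas. This is an illustrative sketch only, not the gitlabds implementation; the 1.4826 constant rescales the MAD to standard-deviation units for normally distributed data:

```python
import pandas as pd

def mad_winsorize(s: pd.Series, threshold: float = 4.0) -> pd.Series:
    """Clip values more than `threshold` scaled-MAD units above or below the median."""
    med = s.median()
    mad = (s - med).abs().median()
    scaled_mad = 1.4826 * mad  # consistency constant: MAD -> std-dev units under normality
    return s.clip(lower=med - threshold * scaled_mad,
                  upper=med + threshold * scaled_mad)

s = pd.Series([1.0, 2.0, 2.5, 3.0, 100.0])
clipped = mad_winsorize(s)  # the extreme 100.0 is pulled down toward the median
```

Because the limits depend only on the median and MAD, a single extreme value like 100.0 barely moves them, which is what makes MAD-based limits more robust than mean/std-based ones.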
Apply Outliers

Description

Apply previously determined outlier limits to a dataframe. This is typically used to apply the same outlier treatment to new data that was applied during model training.

gitlabds.apply_outliers(df, outlier_limits):

Parameters:

  • df : The dataframe to transform
  • outlier_limits : dictionary of outlier limits previously generated by mad_outliers()

Returns

  • DataFrame with outlier limits applied.

Examples:

# Find outliers in training data
train_df, outlier_limits = gitlabds.mad_outliers(df=train_data, dv='target', threshold=3.0)

# Apply same outlier limits to test data
test_df_transformed = gitlabds.apply_outliers(df=test_data, outlier_limits=outlier_limits)

Missing Value Handling

Missing Values

Description

Detect and optionally fill missing values in a DataFrame, with support for various filling methods and detailed reporting.

gitlabds.missing_values(df, threshold=0.0, method=None, columns="all", constant_value=None, verbose=True, operation="both")

Parameters:

  • df : Your pandas dataframe
  • threshold : The percent of missing values at which a column is considered for processing. For example, threshold=0.10 will only process columns with more than 10% missing values.
  • method : Method to fill missing values or dictionary mapping columns to methods. Options:
    • "mean": Fill with column mean (numeric only)
    • "median": Fill with column median (numeric only)
    • "zero": Fill with 0
    • "constant": Fill with the value specified in constant_value
    • "random": Fill with random values sampled from the column's distribution
    • "drop_column": Remove columns with missing values
    • "drop_row": Remove rows with any missing values in specified columns
  • columns : Columns to check and/or fill. If "all", processes all columns with missing values.
  • constant_value : Value to use when method="constant" or when specified columns use the constant method.
  • verbose : Whether to print detailed information about missing values and filling operations.
  • operation : Operation mode:
    • "check": Only check for missing values, don't fill
    • "fill": Fill missing values and return filled dataframe
    • "both": Check and fill missing values (default)

Returns

  • If operation="check": List of column names with missing values (or None)
  • If operation="fill" or "both": Tuple containing:
    • DataFrame with missing values handled
    • Dictionary with missing value information that can be used with apply_missing_fill()

Examples:

# Just check for missing values
missing_columns = gitlabds.missing_values(df, threshold=0.05, operation="check")

# Fill all columns with mean value
df_filled, missing_info = gitlabds.missing_values(df, method="mean")

# Fill different columns with different methods
df_filled, missing_info = gitlabds.missing_values(
    df, 
    method={"numeric_col": "median", "string_col": "constant"},
    constant_value="Unknown",
    verbose=True
)
Apply Missing Values

Description

Apply previously determined missing value handling to a dataframe.

gitlabds.apply_missing_values(df, missing_info):

Parameters:

  • df : The dataframe to transform
  • missing_info : Dictionary of missing value information previously generated by missing_values()

Returns

  • DataFrame with missing values handled according to the provided information.

Examples:

# Generate missing value info from training data
_, missing_info = gitlabds.missing_values(train_df, method="mean")
   
# Apply to test data
test_df_filled = gitlabds.apply_missing_values(test_df, missing_info)
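The train-then-apply pattern above boils down to learning fill statistics on training data only and reusing them on any future data; a minimal pandas sketch (illustrative only, not the gitlabds internals):

```python
import pandas as pd

train = pd.DataFrame({"tenure": [1.0, None, 3.0], "plan": ["a", None, "b"]})
test = pd.DataFrame({"tenure": [None, 5.0], "plan": [None, "a"]})

# Learn fill values on train only, then reuse them so test never leaks its own statistics
fill_values = {"tenure": train["tenure"].mean(), "plan": "Unknown"}
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)
```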

Feature Engineering

Dummy Code

Description

Dummy code (AKA "one-hot encode") categorical and numeric columns based on the parameters specified below. Note: categorical columns will be dropped after they are dummy coded; numeric columns will not.

gitlabds.dummy_code(df, dv=None, columns='all', categorical=True, numeric=True, categorical_max_levels=20, numeric_max_levels=10, dummy_na=False, prefix_sep="_dummy_", verbose=True):

Parameters:

  • df : Your pandas dataframe
  • dv : The column name of your outcome. Entering your outcome variable will prevent it from being dummy coded. May be left blank if there is no outcome variable.
  • columns : Will examine all columns by default. To limit to just a subset of columns, pass a list of column names.
  • categorical : Set to True to attempt to dummy code any categorical column passed via the columns parameter.
  • numeric : Set to True to attempt to dummy code any numeric column passed via the columns parameter.
  • categorical_max_levels : Maximum number of levels a categorical column can have to be eligible for dummy coding.
  • numeric_max_levels : Maximum number of levels a numeric column can have to be eligible for dummy coding.
  • dummy_na : Set to True to create a dummy coded column for missing values.
  • prefix_sep : String to use as separator between column name and value in dummy column names. Default is "_dummy_".
  • verbose : Set to True to print outputs of dummy coding being done. Set to False to suppress.

Returns

  • A tuple containing:
    • The transformed DataFrame with dummy-coded columns. Categorical columns that were dummy coded will be dropped from the dataframe.
    • A dictionary containing information about dummy coding that can be used with apply_dummy() to transform new data consistently.

Examples:

# Dummy code only categorical columns with a maximum of 30 levels; suppress verbose output
import gitlabds
new_df, dummy_dict = gitlabds.dummy_code(
    df=my_df, 
    dv='my_outcome', 
    columns='all', 
    categorical=True, 
    numeric=False, 
    categorical_max_levels=30, 
    verbose=False
)
# Dummy code with custom separator
new_df, dummy_dict = gitlabds.dummy_code(
    df=my_df, 
    columns=['colA', 'colB', 'colC'], 
    categorical=True, 
    numeric=True, 
    prefix_sep="_is_"
)
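The consistency that the returned dictionary provides can be sketched with plain pandas: dummy code the training data, then align any new data to those exact columns (illustrative only, not the gitlabds internals):

```python
import pandas as pd

train = pd.DataFrame({"color": ["red", "blue", "red"]})
test = pd.DataFrame({"color": ["blue", "green"]})  # 'green' never seen in training

train_d = pd.get_dummies(train, prefix_sep="_dummy_")
# Align scoring data to the training columns: unseen levels are dropped,
# and levels absent from the new data are filled with 0
test_d = (pd.get_dummies(test, prefix_sep="_dummy_")
            .reindex(columns=train_d.columns, fill_value=0))
```

This column alignment is what keeps a production scoring pipeline from breaking when new data contains unexpected categorical levels.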
Dummy Top

Description

Dummy codes only categorical levels above a certain threshold of the population. Useful when a column contains many levels but there is not a need or desire to dummy code every level. Currently only works for categorical columns.

gitlabds.dummy_top(df, dv=None, columns='all', min_threshold=0.05, drop_categorical=True, prefix_sep="_dummy_", verbose=True):

Parameters:

  • df : Your pandas dataframe
  • dv : The column name of your outcome. Entering your outcome variable will prevent it from being dummy coded. May be left blank if there is no outcome variable.
  • columns : Will examine all columns by default. To limit to just a subset of columns, pass a list of column names.
  • min_threshold: The threshold at which levels will be dummy coded. For example, the default value of 0.05 will dummy code any categorical level that is in at least 5% of all rows.
  • drop_categorical: Set to True to drop categorical columns after they are considered for dummy coding. Set to False to keep the original categorical columns in the dataframe.
  • prefix_sep : String to use as separator between column name and value in dummy column names. Default is "_dummy_".
  • verbose : Set to True to print detailed list of all dummy columns being created. Set to False to suppress.

Returns

  • A tuple containing:
    • The transformed DataFrame with dummy-coded columns for high-frequency values.
    • A dictionary containing information about dummy coding that can be used with apply_dummy() to transform new data consistently.

Examples:

# Dummy code all categorical levels from all categorical columns whose values are in at least 5% of all rows
import gitlabds
new_df, dummy_top_dict = gitlabds.dummy_top(
    df=my_df, 
    dv='my_outcome', 
    columns='all', 
    min_threshold=0.05, 
    drop_categorical=True, 
    verbose=True
)
# Dummy code all categorical levels from the selected columns whose values are in at least 10% of all rows; 
# suppress verbose printout and retain original categorical columns
new_df, dummy_top_dict = gitlabds.dummy_top(
    df=my_df, 
    dv='my_outcome', 
    columns=['colA', 'colB', 'colC'], 
    min_threshold=0.10, 
    drop_categorical=False, 
    verbose=False
)
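The thresholding logic can be sketched in plain pandas (illustrative only; the column and prefix names below are hypothetical, not gitlabds internals):

```python
import pandas as pd

s = pd.Series(["pro"] * 6 + ["free"] * 3 + ["trial"])  # 60% / 30% / 10% of rows
freqs = s.value_counts(normalize=True)
top_levels = freqs[freqs >= 0.30].index  # min_threshold=0.30: keep 'pro' and 'free'
dummies = pd.DataFrame(
    {f"plan_dummy_{level}": (s == level).astype(int) for level in top_levels}
)
```

The rare 'trial' level never gets a dummy column, which keeps wide, long-tailed categoricals from exploding into hundreds of near-empty features.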
Apply Dummy

Description

Apply previously determined dummy coding to a new dataframe. This is typically used to apply the same dummy coding to new data that was created during model training.

gitlabds.apply_dummy(df, dummy_info, drop_original=False):

Parameters:

  • df : The dataframe to transform
  • dummy_info : Dictionary of dummy coding information previously generated by dummy_code() or dummy_top()
  • drop_original : Whether to drop the original columns after dummy coding. Default is False.

Returns

  • DataFrame with dummy coding applied according to the provided information.

Examples:

# Generate dummy coding information from training data
train_df, dummy_info = gitlabds.dummy_code(df=train_data, dv='target')

# Apply to test data
test_df_transformed = gitlabds.apply_dummy(
    df=test_data, 
    dummy_info=dummy_info
)

Feature Selection

Remove Low Variation

Description

Remove columns from a dataset that do not meet the variation threshold. That is, columns will be dropped that contain a high percentage of one value.

gitlabds.remove_low_variation(df=None, dv=None, columns='all', threshold=.98, verbose=True):

Parameters:

  • df : your pandas dataframe
  • dv : The column name of your outcome. Entering your outcome variable will prevent it from being removed due to low variation. May be left blank if there is no outcome variable.
  • columns : Will examine all columns by default. To limit to just a subset of columns, pass a list of column names.
  • threshold: The maximum percentage one value in a column can represent. Columns that exceed this threshold will be dropped. For example, the default value of 0.98 will drop any column where one value is present in more than 98% of rows.
  • verbose : Set to True to print outputs of columns being dropped. Set to False to suppress.

Returns

  • DataFrame with low variation columns dropped.

Examples:

# Drop any columns (except for the outcome) where one value is present in more than 95% of rows.
new_df = gitlabds.remove_low_variation(df=my_df, dv='my_outcome', columns='all', threshold=.95)
# Drop any of the selected columns where one value is present in more than 99% of rows.
new_df = gitlabds.remove_low_variation(df=my_df, dv=None, columns=['colA', 'colB', 'colC'], threshold=.99)
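The check itself reduces to looking at each column's most frequent value; a pandas sketch (illustrative only, not the gitlabds implementation):

```python
import pandas as pd

df = pd.DataFrame({"mostly_zero": [0] * 99 + [1], "varied": range(100)})
threshold = 0.98
# value_counts(normalize=True) is sorted descending, so .iloc[0] is the modal share
to_drop = [c for c in df.columns
           if df[c].value_counts(normalize=True).iloc[0] > threshold]
reduced = df.drop(columns=to_drop)  # 'mostly_zero' is 99% one value -> dropped
```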
Correlation Reduction

Description

Reduce the number of columns in a dataframe by dropping columns that are highly correlated with other columns. Note: only one of each pair of highly correlated columns will be dropped.

gitlabds.correlation_reduction(df=None, dv=None, threshold=0.9, method="pearson", verbose=True):

Parameters:

  • df : your pandas dataframe
  • dv : The column name of your outcome. Entering your outcome variable will prevent it from being dropped. If provided, when choosing between correlated features, the one with higher correlation to the target will be kept.
  • threshold: The threshold above which columns will be dropped. If two variables exceed this threshold, one will be dropped from the dataframe. For example, the default value of 0.90 will identify columns that have correlations greater than 90% to each other and drop one of those columns.
  • method: The correlation method to use. Options are "pearson" (linear relationships), "spearman" (monotonic relationships), or "mutual_info" (any statistical dependency).
  • verbose : Set to True to print outputs of columns being dropped. Set to False to suppress.

Returns

  • DataFrame with redundant correlated columns dropped.

Examples:

# Perform column reduction via correlation using a threshold of 95%, excluding the outcome column.
new_df = gitlabds.correlation_reduction(df=my_df, dv='my_outcome', threshold=0.95, method="pearson")
# Perform column reduction using Spearman rank correlation with a threshold of 90%.
new_df = gitlabds.correlation_reduction(df=my_df, dv=None, threshold=0.90, method="spearman")
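The underlying reduction can be sketched with a pairwise correlation matrix in pandas (an illustrative sketch only; per the dv parameter above, gitlabds additionally uses correlation to the target when deciding which member of a pair to keep):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "x_near_copy": [1.0, 2.0, 3.0, 4.0, 5.1],
                   "z": [5.0, 1.0, 4.0, 2.0, 3.0]})
corr = df.corr(method="pearson").abs()
# Keep only the upper triangle so each pair is considered exactly once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
reduced = df.drop(columns=to_drop)  # 'x_near_copy' correlates ~1.0 with 'x' -> dropped
```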
Remove Outcome Proxies

Description

Remove columns that are highly correlated with the outcome (target) column.

gitlabds.remove_outcome_proxies(df, dv, threshold=.8, method="pearson", verbose=True):

Parameters:

  • df : your pandas dataframe
  • dv : The column name of your outcome.
  • threshold : The correlation value to the outcome above which columns will be dropped. For example, the default value of 0.80 will identify and drop columns that have correlations greater than 80% to the outcome.
  • method: The correlation method to use. Options are "pearson" (linear relationships), "spearman" (monotonic relationships), or "mutual_info" (any statistical dependency).
  • verbose : Set to True to print outputs of columns being dropped. Set to False to suppress.

Returns

  • DataFrame with outcome proxy columns dropped.

Examples:

# Drop columns with correlations to the outcome greater than 70%
new_df = gitlabds.remove_outcome_proxies(df=my_df, dv='my_outcome', threshold=.7)    
# Drop columns with correlations to the outcome greater than 80% using Spearman correlation
new_df = gitlabds.remove_outcome_proxies(df=my_df, dv='my_outcome', threshold=.8, method="spearman")        
Drop Categorical

Description

Drop all categorical columns from the dataframe. A useful step before regression modeling, as categorical variables are not used.

gitlabds.drop_categorical(df):

Parameters:

  • df : your pandas dataframe

Returns

  • DataFrame with categorical columns dropped.

Examples:

# Dropping categorical columns
new_df = gitlabds.drop_categorical(df=my_df) 

Memory Optimization

Memory Optimization

Description

Apply multiple memory optimization techniques to dramatically reduce DataFrame memory usage.

gitlabds.memory_optimization(df, apply_numeric_downcasting=True, apply_categorical=True, apply_sparse=True, precision_mode='balanced', verbose=True, exclude_columns=None, **kwargs):

Parameters:

  • df : Input pandas dataframe to optimize
  • apply_numeric_downcasting : Whether to downcast numeric columns to smaller data types. Defaults to True.
  • apply_categorical : Whether to convert string columns to categorical when beneficial. Defaults to True.
  • apply_sparse : Whether to apply sparse encoding for columns with many repeated values. Defaults to True.
  • precision_mode : Controls the aggressiveness of numeric downcasting. Defaults to "balanced". Options:
    • "aggressive": Maximum memory savings, may affect precision
    • "balanced": Good memory savings while preserving most precision
    • "safe": Conservative downcasting to preserve numeric precision
  • verbose : Whether to print progress and memory statistics. Defaults to True.
  • exclude_columns : List of columns to exclude from optimization. Defaults to None.
  • **kwargs : Additional arguments for optimization techniques

Returns

  • Memory-optimized pandas DataFrame.

Examples:

# Basic optimization with default settings
import gitlabds
df_optimized = gitlabds.memory_optimization(df)
# Customize optimization approach
df_optimized = gitlabds.memory_optimization(
    df,
    apply_numeric_downcasting=True,
    apply_categorical=True,
    apply_sparse=False,  # Skip sparse encoding
    precision_mode='safe',
    exclude_columns=['id', 'timestamp'],
    verbose=True
)
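Two of these techniques, numeric downcasting and categorical conversion, can be sketched in plain pandas (illustrative only, not the gitlabds implementation):

```python
import pandas as pd

df = pd.DataFrame({"counts": pd.Series(range(100), dtype="int64"),
                   "flag": ["yes", "no"] * 50})
before = df.memory_usage(deep=True).sum()

df["counts"] = pd.to_numeric(df["counts"], downcast="integer")  # int64 -> int8 here
df["flag"] = df["flag"].astype("category")  # repeated strings -> small integer codes
after = df.memory_usage(deep=True).sum()
```

Downcasting alone cuts the integer column to an eighth of its size; categorical conversion pays off whenever a string column has few distinct values relative to its length.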

Model Development

Data Splitting and Sampling

Split Data

Description

This function splits your data into train and test datasets, separating the outcome from the predictors. It supports stratified sampling, balanced upsampling for imbalanced datasets, and returns model weights that compensate for any sampling adjustments.

gitlabds.split_data(df, train_pct=0.7, dv=None, dv_threshold=0.0, random_state=5435, stratify=True, sampling_strategy=None, shuffle=True, verbose=True):

Parameters:

  • df : your pandas dataframe
  • train_pct : The percentage of rows randomly assigned to the training dataset. Defaults to 0.7 (70% train, 30% test).
  • dv : The column name of your outcome. If None, the function will return the entire dataframe split without separating features and target.
  • dv_threshold : The minimum percentage of rows that must contain a positive instance (i.e. > 0) of the outcome. SMOTE/SMOTE-NC will be used to upsample positive instances until this threshold is reached. Can be disabled by setting to 0. Only accepts values 0 to 0.5.
  • random_state : Random seed to use for splitting dataframe and for up-sampling (if needed).
  • stratify : Controls stratified sampling. If True and dv is provided, stratifies by the outcome variable. If a list of column names, stratifies by those columns. If False, does not use stratified sampling.
  • sampling_strategy : Sampling strategy for imbalanced data. If None, will use dv_threshold. See imblearn documentation for more details on acceptable values.
  • shuffle : Whether to shuffle the data before splitting.
  • verbose : Whether to print information about the splitting process.

Returns

  • A tuple containing:
    • x_train: Training features DataFrame
    • y_train: Training target Series (if dv is provided, otherwise empty Series)
    • x_test: Testing features DataFrame
    • y_test: Testing target Series (if dv is provided, otherwise empty Series)
    • model_weights: List of weights to use for modeling [negative_class_weight, positive_class_weight]

Examples:

# Basic split with default parameters (70% train, 30% test)
x_train, y_train, x_test, y_test, model_weights = gitlabds.split_data(
    df=my_df, 
    dv='my_outcome'
)
# Split with 80% training data and balancing for imbalanced target
x_train, y_train, x_test, y_test, model_weights = gitlabds.split_data(
    df=my_df, 
    dv='my_outcome', 
    train_pct=0.80, 
    dv_threshold=0.3
)
# Split with stratification on multiple variables
x_train, y_train, x_test, y_test, model_weights = gitlabds.split_data(
    df=my_df, 
    dv='my_outcome',
    stratify=['my_outcome', 'region', 'customer_segment']
)
# Split entire dataframe without separating target
train_df, _, test_df, _, _ = gitlabds.split_data(
    df=my_df, 
    dv=None, 
    train_pct=0.75
)
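The stratified split at the heart of this function (without the SMOTE upsampling step) can be sketched with scikit-learn, one of the package's core dependencies. An illustrative sketch, not the split_data internals:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"feature": range(100), "target": [0] * 80 + [1] * 20})
X, y = df.drop(columns="target"), df["target"]
x_train, x_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, shuffle=True, random_state=5435
)
# Stratifying on y preserves the 20% positive rate in both splits
```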

Model Configuration

ConfigGenerator

Description

A simple, flexible configuration builder for creating YAML files with any structure. This utility allows you to build complex, nested configuration files programmatically without being constrained to a predefined structure.

gitlabds.ConfigGenerator(**kwargs):

Parameters:

  • **kwargs : Initial configuration values to populate the configuration object with

Methods:

add(path, value)

Add or update a value at a specific path in the configuration.

  • path: String using dot-notation to specify the location (e.g., 'model.parameters.learning_rate')
  • value: Any value to set at the specified path
to_yaml(file_path)

Write the configuration to a YAML file.

  • file_path: Path to the output YAML file

Returns

  • ConfigGenerator object for method chaining

Examples:

# Initialize with some top-level parameters
config = ConfigGenerator(
    model_name="churn_prediction",
    version="1.0.0",
    unique_id="customer_id"
)

# Add nested model parameters
config.add("model.file", "xgboost_model.pkl")
config.add("model.parameters.learning_rate", 0.01)
config.add("model.parameters.max_depth", 6)

# Add preprocessing information from outlier detection and dummy coding
config.add("preprocessing.outliers", outlier_info)
config.add("preprocessing.dummy_coding", dummy_info)

# Add query information
config.add("query_parameters.query_file", "customer_data.sql")
config.add("query_parameters.lookback_months", 12)

# Save to YAML
config.to_yaml("churn_model_config.yaml")
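The dot-notation behavior of add() can be sketched with a small helper over plain dicts (illustrative only; ConfigGenerator's actual implementation may differ):

```python
def set_path(cfg: dict, path: str, value) -> dict:
    """Set a nested value via dot-notation, creating intermediate dicts as needed."""
    node = cfg
    *parents, leaf = path.split(".")
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return cfg

cfg = {}
set_path(cfg, "model.file", "xgboost_model.pkl")
set_path(cfg, "model.parameters.learning_rate", 0.01)
# cfg now holds the same nesting the YAML output would have
```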

Model Evaluation

Comprehensive Evaluation

ModelEvaluator

Description

A comprehensive framework for evaluating machine learning models, supporting both classification (binary and multi-class) and regression models. It provides extensive evaluation metrics, visualizations, and feature importance analysis.

gitlabds.ModelEvaluator(model, x_train, y_train, x_test, y_test, x_oot=None, y_oot=None, classification=True, algo=None, f_score=0.50, decile_n=10, top_features_n=20, show_all_classes=True, show_plots=True, save_plots=True, plot_dir='plots', plot_save_format='png', plot_save_dpi=300)

Parameters:

  • model : The trained model to evaluate. Must have a predict method for regression or a predict_proba method for classification.
  • x_train : Training features DataFrame.
  • y_train : Training labels (Series or DataFrame).
  • x_test : Test features DataFrame.
  • y_test : Test labels (Series or DataFrame).
  • x_oot : Optional out-of-time validation features.
  • y_oot : Optional out-of-time validation labels.
  • classification : Whether this is a classification model. If False, regression metrics will be used.
  • algo : Algorithm type for feature importance calculation. Options: 'xgb', 'rf', 'mars'. For other algorithms, use None.
  • f_score : Threshold for binary classification.
  • decile_n : Number of n-tiles for lift calculation. Defaults to 10 for deciles.
  • top_features_n : Number of top features to display in visualizations.
  • show_all_classes : Whether to show metrics for all classes in multi-class classification.
  • show_plots : Whether to display plots
  • save_plots : Whether to save plots locally
  • plot_dir : Directory to save plots
  • plot_save_format : Plot format
  • plot_save_dpi : Plot resolution

Returns

  • ModelMetricsResult object containing all evaluation metrics and results.

Key Methods:

  • evaluate() - Compute and return all metrics
  • evaluate_custom_metrics(custom_metrics) - Evaluate with additional custom metrics
  • display_metrics(results=None) - Display evaluation results in a formatted way
  • calibration_assessment() - Assess model calibration for classification models
  • get_feature_descriptives(display_results=False) - Generate descriptive statistics for features
  • plot_feature_importance(feature_importance, n_features=20) - Plot feature importance
  • plot_shap_beeswarm(n_features=20, plot_type="beeswarm") - Create SHAP visualization
  • plot_score_distribution(bins=None) - Plot distribution of predicted values
  • plot_feature_interactions(feature_pairs=None, n_top_pairs=5) - Plot feature interactions
  • plot_confusion_matrix() - Plot confusion matrix for classification models
  • plot_lift_analysis() - Plot comprehensive lift analysis
  • plot_performance_curves() - Plot ROC and precision-recall curves
  • plot_learning_history() - Plot learning curves for iterative models
  • plot_performance_comparison() - Plot model performance for out-of-time validation

Examples:

# Create an evaluator for a classification model
from gitlabds import ModelEvaluator

evaluator = ModelEvaluator(
    model=my_model,
    x_train=x_train,
    y_train=y_train,
    x_test=x_test,
    y_test=y_test,
    classification=True,
    algo='xgb'
)

# Get all evaluation metrics
results = evaluator.evaluate()

# Display metrics in a formatted way
evaluator.display_metrics(results)

# Create visualizations
evaluator.plot_feature_importance(results.feature_importance)
evaluator.plot_confusion_matrix()
evaluator.plot_performance_curves()

# Save results to file
results.metrics_df.to_csv("metrics.csv")
results.classification_metrics_df.to_csv("classification_metrics.csv")
results.feature_importance.to_csv("feature_importance.csv")

Metric Functions

Model Metrics

Description

Display a variety of model metrics for linear and logistic predictive models.

gitlabds.model_metrics(model, x_train, y_train, x_test, y_test, show_graphs=True, f_score=0.50, classification=True, algo=None, decile_n=10, top_features_n=20):

Parameters:

  • model : model file from training
  • x_train : Training features DataFrame.
  • y_train : Training labels (Series or DataFrame).
  • x_test : Test features DataFrame.
  • y_test : Test labels (Series or DataFrame).
  • show_graphs : Whether to display plots and graphs.
  • f_score : Threshold for binary classification.
  • classification : Whether this is a classification model. If False, regression metrics will be used.
  • algo : Algorithm type for feature importance calculation. Options: 'xgb', 'rf', 'mars'.
  • decile_n : Number of n-tiles for lift calculation.
  • top_features_n : Number of top features to display in feature importance.

Returns

  • For classification models: tuple of (metricx, lift, classification_metricx, top_features, decile_breaks)
  • For regression models: tuple of (metricx, top_features)

Examples:

# For a classification model
import gitlabds
metricx, lift, classification_metricx, top_features, decile_breaks = gitlabds.model_metrics(
    model=my_classifier,
    x_train=x_train,
    y_train=y_train, 
    x_test=x_test, 
    y_test=y_test,
    classification=True, 
    algo='xgb'
)

# For a regression model
metricx, top_features = gitlabds.model_metrics(
    model=my_regressor,
    x_train=x_train,
    y_train=y_train, 
    x_test=x_test, 
    y_test=y_test,
    classification=False
)

Insight Generation

Marginal Effects

Description

Calculates and returns the marginal effects at the mean (MEM) for predictor fields.

gitlabds.marginal_effects(model, x_test, dv_description, field_labels=None):

Parameters:

  • model : model file from training
  • x_test : test "predictors" dataframe.
  • dv_description : Description of the outcome field to be used in text-based insights.
  • field_labels : Dict of field descriptions. The key is the field/feature/predictor name. The value is descriptive text of the field. This parameter is optional; the field name is used by default.

Returns

  • Dataframe of marginal effects.

Examples:

# Calculate marginal effects for a trained model
import gitlabds
effects_df = gitlabds.marginal_effects(
    model=trained_model,
    x_test=test_features,
    dv_description="probability of churn",
    field_labels={
        "tenure": "Customer tenure in months",
        "monthly_charges": "Average monthly bill amount",
        "total_charges": "Total amount charged to customer"
    }
)

# Display the marginal effects
display(effects_df)
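Conceptually, a marginal effect at the mean is a finite-difference derivative of the model's prediction with respect to one feature, all other features held at their means. A minimal sketch (not the gitlabds implementation; `predict` stands in for any model's prediction function):

```python
import pandas as pd

def marginal_effect_at_mean(predict, X: pd.DataFrame, feature: str, eps: float = 1e-4):
    """Finite-difference marginal effect of `feature`, other features at their means."""
    base = X.mean().to_frame().T  # single row: every feature at its mean
    lo, hi = base.copy(), base.copy()
    lo[feature] -= eps
    hi[feature] += eps
    return ((predict(hi) - predict(lo)) / (2 * eps)).item()

# Toy linear model y = 2*a + 3*b, so the marginal effect of 'a' is 2
X = pd.DataFrame({"a": [0.0, 1.0, 2.0], "b": [1.0, 2.0, 3.0]})
predict = lambda d: 2 * d["a"].to_numpy() + 3 * d["b"].to_numpy()
effect_a = marginal_effect_at_mean(predict, X, "a")
```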
Prescriptions

Description

Return "actionable" prescriptions and explanatory insights for each scored record. Insights list actionable prescriptions first, followed by explanatory insights. This approach is recommended for linear/logistic methodologies only. Use caution with black-box approaches, as manipulating more than one prescription at a time could change a record's model score in unintended ways.

gitlabds.prescriptions(model, input_df, scored_df, actionable_fields, dv_description, field_labels=None, returned_insights=5, only_actionable=False, explanation_fields='all'):

Parameters:

  • model : model file from training
  • input_df : train "predictors" dataframe.
  • scored_df : dataframe containing model scores.
  • actionable_fields : Dict of actionable fields. The key is the field/feature/predictor name. The value accepts one of 3 values: Increasing for prescriptions only when the field increases; Decreasing for prescriptions only when the field decreases; Both for when the field either increases or decreases.
  • dv_description : Description of the outcome field to be used in text-based insights.
  • field_labels : Dict of field descriptions. The key is the field/feature/predictor name. The value is descriptive text of the field. This parameter is optional; the field name is used by default.
  • returned_insights : Number of insights per record to return. Defaults to 5.
  • only_actionable : Set to True to return only actionable prescriptions.
  • explanation_fields : List of explanatory (non-actionable) fields to return insights for. Defaults to 'all'.

Returns

  • Dataframe of prescriptive actions. One row per record input.

Examples:

# Return prescriptions for the actionable fields of 'spend', 'returns', and 'emails_sent':
results = gitlabds.prescriptions(
    model=model, 
    input_df=my_df, 
    scored_df=my_scores, 
    actionable_fields={
        'spend': 'Increasing', 
        'returns': 'Decreasing', 
        'emails_sent': 'Both'
    }, 
    dv_description='likelihood to churn', 
    field_labels={
        'spend': 'Dollars spent in last 6 months', 
        'returns': 'Item returns in last 3 months', 
        'emails_sent': 'Marketing emails sent in last month'
    }, 
    returned_insights=5, 
    only_actionable=True, 
    explanation_fields=['spend', 'returns']
)

SQL and Trend Analysis

SQL Generation

SQL Trend Query Generator

Description

Generate SQL for trend analysis across time periods. The generated SQL transforms regular data into a time-series format with columns for each time period, allowing for easy trend detection.

gitlabds.generate_sql_trend_query(snapshot_date, date_field, date_unit='MONTH', periods=12, table_name=None, group_by_fields=None, metrics=None, filters=None, output_file=None):

Parameters:

  • snapshot_date : Reference date for analysis (e.g., '2025-04-08')
  • date_field : Field name in the table that contains the date to analyze
  • date_unit : Time unit for analysis: 'DAY', 'WEEK', 'MONTH', 'QUARTER', 'YEAR'
  • periods : Number of time periods to analyze
  • table_name : Table to query data from
  • group_by_fields : Fields to group by (entity identifiers)
  • metrics : Metrics to include in analysis with their properties. Each metric is a dict with:
    • name: output column name prefix
    • source: field name in the source table
    • aggregation: function to apply (AVG, SUM, MAX, etc.)
    • condition: optional WHERE condition
    • cumulative: if True, calculate period-over-period differences
    • is_case_expression: if True, the source is already a CASE WHEN expression
    • is_expression: if True, the source is a complex expression
  • filters : SQL WHERE clause conditions as a string
  • output_file : If provided, save the generated SQL to this file

Returns:

  • The generated SQL query as a string

Examples:

# Generate SQL for monthly trend analysis
import gitlabds

# Define metrics
metrics = [
    {"name": "active_users", "source": "monthly_active_users", "aggregation": "AVG"},
    {"name": "revenue", "source": "monthly_revenue", "aggregation": "SUM"},
    {"name": "projects", "source": "projects_created", "aggregation": "MAX", "cumulative": True}
]

# Generate SQL query
sql = gitlabds.generate_sql_trend_query(
    snapshot_date='2025-04-08',
    date_field='transaction_date',
    date_unit='MONTH',
    periods=12,
    table_name='analytics.user_metrics',
    group_by_fields=['account_id'],
    metrics=metrics,
    filters="is_active = TRUE",
    output_file='trend_query.sql'
)
# Generate SQL for daily trend analysis with custom conditions
metrics = [
    {"name": "logins", "source": "user_logins", "aggregation": "SUM"},
    {"name": "premium_logins", "source": "user_logins", "aggregation": "SUM", 
     "condition": "subscription_tier = 'premium'"}
]

sql = gitlabds.generate_sql_trend_query(
    snapshot_date='2025-04-08',
    date_field='login_date',
    date_unit='DAY',
    periods=30,
    table_name='analytics.daily_logins',
    metrics=metrics
)
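The function returns a SQL string, not data, so the query still needs to be executed against your warehouse. A minimal sketch of what consuming it might look like; the DB connection and the wide one-column-per-metric-per-period result shape are assumptions inferred from the trend_analysis examples below, not gitlabds guarantees:

```python
import pandas as pd

# sql = gitlabds.generate_sql_trend_query(...)  # as generated above
# trend_data = pd.read_sql(sql, conn)           # `conn`: your warehouse connection (assumption)

# The query output is wide: one row per group_by entity, one column per
# metric per period. The column-naming convention below is assumed.
trend_data = pd.DataFrame(
    {
        "account_id": [1, 2],
        "active_users_month_1": [120, 80],  # most recent month
        "active_users_month_2": [110, 90],
        "active_users_month_3": [100, 95],
    }
).set_index("account_id")
```

Setting the entity identifier as the index, as shown, matches the shape the trend_analysis examples below expect.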

Trend Analysis

Trend Analysis

Description

Calculate trend metrics for a dataframe produced by the SQL trend generator. This function analyzes time-series data to identify patterns like consecutive increases or decreases, proportion of periods with growth or decline, and average percentage changes.

gitlabds.trend_analysis(df, metric_list=None, time_unit='month', periods=6, include_cumulative=True, exclude_fields=None, verbose=False):

Parameters:

  • df : Dataframe containing trend data with time-based columns
  • metric_list : List of metric names to analyze. If None, auto-detects metrics from columns
  • time_unit : Time unit used in the column names (month, day, week, etc.)
  • periods : Number of time periods to analyze
  • include_cumulative : Whether to use cumulative (event) metrics when available
  • exclude_fields : List of fields to exclude from auto-detection
  • verbose : Whether to display intermediate output

Returns:

  • A dataframe containing trend metrics for each specified metric, including:
    • Count of periods with decreases/increases
    • Count of consecutive decreases/increases
    • Average percentage change across periods

Examples:

# Run trend analysis on data from SQL trend query
import gitlabds

# Run the SQL query to get trend data
trend_data = run_sql_query(trend_sql)  # Your function to execute SQL
trend_data.set_index('account_id', inplace=True)

# Analyze trends for all metrics
trends_df = gitlabds.trend_analysis(
    df=trend_data,
    time_unit='month',
    periods=12,
    verbose=True
)
# Analyze trends for specific metrics
trends_df = gitlabds.trend_analysis(
    df=trend_data,
    metric_list=['active_users', 'revenue'],
    time_unit='month',
    periods=6,
    include_cumulative=True,
    exclude_fields=['has_data']
)

# Use trend metrics for customer health scoring
account_data['declining_usage'] = trends_df['consecutive_drop_active_users_period_6_months_cnt'] > 0
account_data['growth_score'] = trends_df['avg_perc_change_revenue_period_6_months'] * 100
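To make the consecutive-decrease metric concrete, here is a simplified, hypothetical reimplementation over wide period columns. This is a sketch of the metric's semantics only; the actual gitlabds.trend_analysis logic and its column-naming conventions may differ:

```python
import pandas as pd

def consecutive_drops(row, metric, periods):
    """Count consecutive period-over-period declines, starting from the most
    recent period. Assumes <metric>_month_1 is the most recent period."""
    values = [row[f"{metric}_month_{p}"] for p in range(1, periods + 1)]
    count = 0
    for newer, older in zip(values, values[1:]):
        if newer < older:
            count += 1
        else:
            break  # streak broken; stop counting
    return count

trend_data = pd.DataFrame(
    {
        "active_users_month_1": [90, 100],
        "active_users_month_2": [100, 95],
        "active_users_month_3": [110, 90],
    },
    index=pd.Index([1, 2], name="account_id"),
)

# Account 1 declined in both recent transitions; account 2 grew most recently
drops = trend_data.apply(consecutive_drops, axis=1, metric="active_users", periods=3)
```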

Gitlab Data Science

The handbook is the single source of truth for all of our documentation.

Contributing

We welcome contributions and improvements, please see the contribution guidelines.

License

This code is distributed under the MIT license, please see the LICENSE file.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gitlabds-2.0.0.tar.gz (95.2 kB view details)

Uploaded Source

Built Distribution


gitlabds-2.0.0-py3-none-any.whl (82.4 kB view details)

Uploaded Python 3

File details

Details for the file gitlabds-2.0.0.tar.gz.

File metadata

  • Download URL: gitlabds-2.0.0.tar.gz
  • Upload date:
  • Size: 95.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.17

File hashes

Hashes for gitlabds-2.0.0.tar.gz:

  • SHA256: 2c03d9fdb905e9155e43705e3c5b6e11c6821f0fc8dc82f973ce9a77bd665dce
  • MD5: 8dbf28b91bf2fc9545227b169199a067
  • BLAKE2b-256: 59052db9cfc2f6869d10f5c0d99f2d26eccdf957619a3702a0a3101b62e56eec

See more details on using hashes here.

File details

Details for the file gitlabds-2.0.0-py3-none-any.whl.

File metadata

  • Download URL: gitlabds-2.0.0-py3-none-any.whl
  • Upload date:
  • Size: 82.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.17

File hashes

Hashes for gitlabds-2.0.0-py3-none-any.whl:

  • SHA256: 20ddd3317404cbabc68274270f2d3bfcc69100a6e8a7cf76733eec888aa0a45d
  • MD5: 765571d310a478d288e171178ee01b03
  • BLAKE2b-256: 66a081d3491738bb7cb29b88bd8963c77a50d81ca4b8f0b6e174ee9934537a74

