GitLab Data Science Tools
gitlabds
What is it?
gitlabds is a Python toolkit that streamlines machine learning workflows with specialized functions for data preparation, feature engineering, model evaluation, and deployment. It helps data scientists by providing consistent patterns for both experimentation and production pipelines.
Installation
pip install gitlabds
Requirements
- Python 3.11 or later
- Core dependencies:
- pandas>=1.5.3
- numpy>=1.23.5
- scipy>=1.13.1
- scikit-learn>=1.1.1
- imbalanced-learn>=0.9.1
- seaborn>=0.13.2
- shap>=0.50.0
- tqdm>=4.66.2
- xgboost>=1.6.1
Optional Dependencies
Snowflake Feature Store
The serve_features and clear_feature_serving_locks functions require additional Snowflake dependencies.
To use these functions, install with:
pip install gitlabds[feature-store]
This will install:
- snowflake-snowpark-python
- snowflake-ml-python
If you try to use these functions without installing the optional dependencies, you'll get an error message with installation instructions.
Main Features by Category
Data Preparation
Outlier Detection and Treatment
MAD Outliers
Description
Median Absolute Deviation for outlier detection and correction. By default, this will windsor all numeric values in your dataframe that are more than 4 standard deviations above or below the median (the 'threshold' parameter).
gitlabds.mad_outliers(df, dv=None, min_levels=10, columns='all', threshold=4.0, auto_adjust_skew=False, verbose=True, windsor_threshold=0.01):
Parameters:
- df : your pandas dataframe
- dv : The column name of your outcome. Entering your outcome variable will prevent it from being windsored. May be left blank if there is no outcome variable.
- min_levels : Only include columns that have at least the number of levels specified.
- columns : Will examine all numeric columns by default. To limit to just a subset of columns, pass a list of column names. Doing so will ignore any constraints put on by the 'dv' and 'min_levels' parameters.
- threshold : Windsor values greater than this number of standard deviations from the median.
- auto_adjust_skew : Whether to adjust thresholds based on column skewness
- verbose : Set to True to print outputs of windsoring being done. Set to False to suppress.
- windsor_threshold : Only windsor values that affect less than this percentage of the population.
Returns
- Tuple containing:
- The transformed DataFrame by windsoring outliers
- Dictionary of outlier limits that can be used with apply_outliers()
Examples:
# Create a new df; only windsor selected columns; suppress verbose
import gitlabds
new_df, outlier_limits = gitlabds.mad_outliers(df=my_df, dv='my_outcome', columns=['colA', 'colB', 'colC'], verbose=False)
# Windsor values with skew adjustment for highly skewed data
new_df, outlier_limits = gitlabds.mad_outliers(df=my_df, threshold=3.0, auto_adjust_skew=True)
Apply Outliers
Description
Apply previously determined outlier limits to a dataframe. This is typically used to apply the same outlier treatment to new data that was applied during model training.
gitlabds.apply_outliers(df, outlier_limits):
Parameters:
- df : The dataframe to transform
- outlier_limits : dictionary of outlier limits previously generated by mad_outliers()
Returns
- DataFrame with outlier limits applied.
Examples:
# Find outliers in training data
train_df, outlier_limits = gitlabds.mad_outliers(df=train_data, dv='target', threshold=3.0)
# Apply same outlier limits to test data
test_df_transformed = gitlabds.apply_outliers(df=test_data, outlier_limits=outlier_limits)
Missing Value Handling
Missing Values
Description
Detect and optionally fill missing values in a DataFrame, with support for various filling methods and detailed reporting.
gitlabds.missing_values(df, threshold=0.0, method=None, columns="all", constant_value=None, verbose=True, operation="both")
Parameters:
- df : Your pandas dataframe
- threshold : The percent of missing values at which a column is considered for processing. For example, threshold=0.10 will only process columns with more than 10% missing values.
- method : Method to fill missing values or dictionary mapping columns to methods. Options:
- "mean": Fill with column mean (numeric only)
- "median": Fill with column median (numeric only)
- "zero": Fill with 0
- "constant": Fill with the value specified in constant_value
- "random": Fill with random values sampled from the column's distribution
- "drop_column": Remove columns with missing values
- "drop_row": Remove rows with any missing values in specified columns
- columns : Columns to check and/or fill. If "all", processes all columns with missing values.
- constant_value : Value to use when method="constant" or when specified columns use the constant method.
- verbose : Whether to print detailed information about missing values and filling operations.
- operation : Operation mode:
- "check": Only check for missing values, don't fill
- "fill": Fill missing values and return filled dataframe
- "both": Check and fill missing values (default)
Returns
- If operation="check": List of column names with missing values (or None)
- If operation="fill" or "both": Tuple containing:
- DataFrame with missing values handled
- Dictionary with missing value information that can be used with apply_missing_fill()
Examples:
# Just check for missing values
missing_columns = gitlabds.missing_values(df, threshold=0.05, operation="check")
# Fill all columns with mean value
df_filled, missing_info = gitlabds.missing_values(df, method="mean")
# Fill different columns with different methods
df_filled, missing_info = gitlabds.missing_values(
df,
method={"numeric_col": "median", "string_col": "constant"},
constant_value="Unknown",
verbose=True
)
Apply Missing Values
Description
Apply previously determined missing value handling to a dataframe.
gitlabds.apply_missing_values(df, missing_info):
Parameters:
- df : The dataframe to transform
- missing_info : Dictionary of missing value information previously generated by missing_values()
Returns
- DataFrame with missing values handled according to the provided information.
Examples:
# Generate missing value info from training data
_, missing_info = gitlabds.missing_values(train_df, method="mean")
# Apply to test data
test_df_filled = gitlabds.apply_missing_values(test_df, missing_info)
Feature Engineering
Dummy Code
Description
Dummy code (AKA "one-hot encode") categorical and numeric columns based on the parameters specified below. Note: categorical columns will be dropped after they are dummy coded; numeric columns will not.
gitlabds.dummy_code(df, dv=None, columns='all', categorical=True, numeric=True, categorical_max_levels=20, numeric_max_levels=10, dummy_na=False, prefix_sep="_dummy_", verbose=True):
Parameters:
- df : Your pandas dataframe
- dv : The column name of your outcome. Entering your outcome variable will prevent it from being dummy coded. May be left blank if there is no outcome variable.
- columns : Will examine all columns by default. To limit to just a subset of columns, pass a list of column names.
- categorical : Set to True to attempt to dummy code any categorical column passed via the columns parameter.
- numeric : Set to True to attempt to dummy code any numeric column passed via the columns parameter.
- categorical_max_levels : Maximum number of levels a categorical column can have to be eligible for dummy coding.
- numeric_max_levels : Maximum number of levels a numeric column can have to be eligible for dummy coding.
- dummy_na : Set to True to create a dummy coded column for missing values.
- prefix_sep : String to use as separator between column name and value in dummy column names. Default is "_dummy_".
- verbose : Set to True to print outputs of dummy coding being done. Set to False to suppress.
Returns
- A tuple containing:
- The transformed DataFrame with dummy-coded columns. Categorical columns that were dummy coded will be dropped from the dataframe.
- A dictionary containing information about dummy coding that can be used with apply_dummy() to transform new data consistently.
Examples:
# Dummy code only categorical columns with a maximum of 30 levels; suppress verbose output
import gitlabds
new_df, dummy_dict = gitlabds.dummy_code(
df=my_df,
dv='my_outcome',
columns='all',
categorical=True,
numeric=False,
categorical_max_levels=30,
verbose=False
)
# Dummy code with custom separator
new_df, dummy_dict = gitlabds.dummy_code(
df=my_df,
columns=['colA', 'colB', 'colC'],
categorical=True,
numeric=True,
prefix_sep="_is_"
)
Dummy Top
Description
Dummy codes only categorical levels above a certain threshold of the population. Useful when a column contains many levels but there is not a need or desire to dummy code every level. Currently only works for categorical columns.
gitlabds.dummy_top(df, dv=None, columns='all', min_threshold=0.05, drop_categorical=True, prefix_sep="_dummy_", verbose=True):
Parameters:
- df : Your pandas dataframe
- dv : The column name of your outcome. Entering your outcome variable will prevent it from being dummy coded. May be left blank if there is no outcome variable.
- columns : Will examine all columns by default. To limit to just a subset of columns, pass a list of column names.
- min_threshold : The threshold at which levels will be dummy coded. For example, the default value of 0.05 will dummy code any categorical level that is in at least 5% of all rows.
- drop_categorical : Set to True to drop categorical columns after they are considered for dummy coding. Set to False to keep the original categorical columns in the dataframe.
- prefix_sep : String to use as separator between column name and value in dummy column names. Default is "_dummy_".
- verbose : Set to True to print a detailed list of all dummy columns being created. Set to False to suppress.
Returns
- A tuple containing:
- The transformed DataFrame with dummy-coded columns for high-frequency values.
- A dictionary containing information about dummy coding that can be used with apply_dummy() to transform new data consistently.
Examples:
# Dummy code all categorical levels from all categorical columns whose values are in at least 5% of all rows
import gitlabds
new_df, dummy_top_dict = gitlabds.dummy_top(
df=my_df,
dv='my_outcome',
columns='all',
min_threshold=0.05,
drop_categorical=True,
verbose=True
)
# Dummy code all categorical levels from the selected columns whose values are in at least 10% of all rows;
# suppress verbose printout and retain original categorical columns
new_df, dummy_top_dict = gitlabds.dummy_top(
df=my_df,
dv='my_outcome',
columns=['colA', 'colB', 'colC'],
min_threshold=0.10,
drop_categorical=False,
verbose=False
)
Apply Dummy
Description
Apply previously determined dummy coding to a new dataframe. This is typically used to apply the same dummy coding to new data that was created during model training.
gitlabds.apply_dummy(df, dummy_info, drop_original=False):
Parameters:
- df : The dataframe to transform
- dummy_info : Dictionary of dummy coding information previously generated by dummy_code() or dummy_top()
- drop_original : Whether to drop the original columns after dummy coding. Default is False.
Returns
- DataFrame with dummy coding applied according to the provided information.
Examples:
# Generate dummy coding information from training data
train_df, dummy_info = gitlabds.dummy_code(df=train_data, dv='target')
# Apply to test data
test_df_transformed = gitlabds.apply_dummy(
df=test_data,
dummy_info=dummy_info
)
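The apply_* helpers share a generate-then-apply pattern: fit transformations once on training data, keep the returned artifacts, and replay them on new data at scoring time. A minimal end-to-end sketch, assuming train_df and score_df are pandas dataframes that share columns and a 'target' outcome:
# Fit-time: learn transformations on training data and keep the artifacts
import gitlabds
train_df, outlier_limits = gitlabds.mad_outliers(df=train_df, dv='target', verbose=False)
train_df, missing_info = gitlabds.missing_values(train_df, method='median', verbose=False)
train_df, dummy_info = gitlabds.dummy_code(df=train_df, dv='target', verbose=False)
# Score-time: replay the exact same transformations on new data
score_df = gitlabds.apply_outliers(df=score_df, outlier_limits=outlier_limits)
score_df = gitlabds.apply_missing_values(score_df, missing_info)
score_df = gitlabds.apply_dummy(df=score_df, dummy_info=dummy_info)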
Feature Selection
Remove Low Variation
Description
Remove columns from a dataset that do not meet the variation threshold. That is, columns will be dropped that contain a high percentage of one value.
gitlabds.remove_low_variation(df=None, dv=None, columns='all', threshold=.98, verbose=True):
Parameters:
- df : your pandas dataframe
- dv : The column name of your outcome. Entering your outcome variable will prevent it from being removed due to low variation. May be left blank if there is no outcome variable.
- columns : Will examine all columns by default. To limit to just a subset of columns, pass a list of column names.
- threshold : The maximum percentage one value in a column can represent. Columns that exceed this threshold will be dropped. For example, the default value of 0.98 will drop any column where one value is present in more than 98% of rows.
- verbose : Set to True to print outputs of columns being dropped. Set to False to suppress.
Returns
- DataFrame with low variation columns dropped.
Examples:
# Drop any columns (except for the outcome) where one value is present in more than 95% of rows.
new_df = gitlabds.remove_low_variation(df=my_df, dv='my_outcome', columns='all', threshold=.95)
# Drop any of the selected columns where one value is present in more than 99% of rows.
new_df = gitlabds.remove_low_variation(df=my_df, dv=None, columns=['colA', 'colB', 'colC'], threshold=.99)
Correlation Reduction
Description
Reduce the number of columns on a dataframe by dropping columns that are highly correlated with other columns. Note: only one of the two highly correlated columns will be dropped.
gitlabds.correlation_reduction(df=None, dv=None, threshold=0.9, method="pearson", verbose=True):
Parameters:
- df : your pandas dataframe
- dv : The column name of your outcome. Entering your outcome variable will prevent it from being dropped. If provided, when choosing between correlated features, the one with higher correlation to the target will be kept.
- threshold : The threshold above which columns will be dropped. If two variables exceed this threshold, one will be dropped from the dataframe. For example, the default value of 0.90 will identify columns that have correlations greater than 90% to each other and drop one of those columns.
- method : The correlation method to use. Options are "pearson" (linear relationships), "spearman" (monotonic relationships), or "mutual_info" (any statistical dependency).
- verbose : Set to True to print outputs of columns being dropped. Set to False to suppress.
Returns
- DataFrame with redundant correlated columns dropped.
Examples:
# Perform column reduction via correlation using a threshold of 95%, excluding the outcome column.
new_df = gitlabds.correlation_reduction(df=my_df, dv='my_outcome', threshold=0.95, method="pearson")
# Perform column reduction using Spearman rank correlation with a threshold of 90%.
new_df = gitlabds.correlation_reduction(df=my_df, dv=None, threshold=0.90, method="spearman")
Remove Outcome Proxies
Description
Remove columns that are highly correlated with the outcome (target) column.
gitlabds.remove_outcome_proxies(df, dv, threshold=.8, method="pearson", verbose=True):
Parameters:
- df : your pandas dataframe
- dv : The column name of your outcome.
- threshold : The correlation value to the outcome above which columns will be dropped. For example, the default value of 0.80 will identify and drop columns that have correlations greater than 80% to the outcome.
- method : The correlation method to use. Options are "pearson" (linear relationships), "spearman" (monotonic relationships), or "mutual_info" (any statistical dependency).
- verbose : Set to True to print outputs of columns being dropped. Set to False to suppress.
Returns
- DataFrame with outcome proxy columns dropped.
Examples:
# Drop columns with correlations to the outcome greater than 70%
new_df = gitlabds.remove_outcome_proxies(df=my_df, dv='my_outcome', threshold=.7)
# Drop columns with correlations to the outcome greater than 80% using Spearman correlation
new_df = gitlabds.remove_outcome_proxies(df=my_df, dv='my_outcome', threshold=.8, method="spearman")
Drop Categorical
Description
Drop all categorical columns from the dataframe. A useful step before regression modeling, since unencoded categorical variables cannot be used directly.
gitlabds.drop_categorical(df):
Parameters:
- df : your pandas dataframe
Returns
- DataFrame with categorical columns dropped.
Examples:
# Dropping categorical columns
new_df = gitlabds.drop_categorical(df=my_df)
Memory Optimization
Memory Optimization
Description
Apply multiple memory optimization techniques to dramatically reduce DataFrame memory usage.
gitlabds.memory_optimization(df, apply_numeric_downcasting=True, apply_categorical=True, apply_sparse=True, precision_mode='balanced', verbose=True, exclude_columns=None, **kwargs):
Parameters:
- df : Input pandas dataframe to optimize
- apply_numeric_downcasting : Whether to downcast numeric columns to smaller data types. Defaults to True.
- apply_categorical : Whether to convert string columns to categorical when beneficial. Defaults to True.
- apply_sparse : Whether to apply sparse encoding for columns with many repeated values. Defaults to True.
- precision_mode : Controls aggressiveness of numeric downcasting. Defaults to "balanced". Options:
- "aggressive": Maximum memory savings, may affect precision
- "balanced": Good memory savings while preserving most precision
- "safe": Conservative downcasting to preserve numeric precision
- verbose : Whether to print progress and memory statistics. Defaults to True.
- exclude_columns : List of columns to exclude from optimization. Defaults to None.
- **kwargs : Additional arguments for optimization techniques
Returns
- Memory-optimized pandas DataFrame.
Examples:
# Basic optimization with default settings
import gitlabds
df_optimized = gitlabds.memory_optimization(df)
# Customize optimization approach
df_optimized = gitlabds.memory_optimization(
df,
apply_numeric_downcasting=True,
apply_categorical=True,
apply_sparse=False, # Skip sparse encoding
precision_mode='safe',
exclude_columns=['id', 'timestamp'],
verbose=True
)
Model Development
Data Splitting and Sampling
Split Data
Description
This function splits your data into train and test datasets, separating the outcome from the rest of the dataframe. It supports stratified sampling and balanced upsampling for imbalanced datasets, and provides model weights to compensate for sampling adjustments.
gitlabds.split_data(df, train_pct=0.7, dv=None, dv_threshold=0.0, random_state=5435, stratify=True, sampling_strategy=None, shuffle=True, verbose=True):
Parameters:
- df : your pandas dataframe
- train_pct : The percentage of rows randomly assigned to the training dataset. Defaults to 0.7 (70% train, 30% test).
- dv : The column name of your outcome. If None, the function will return the entire dataframe split without separating features and target.
- dv_threshold : The minimum percentage of rows that must contain a positive instance (i.e. > 0) of the outcome. SMOTE/SMOTE-NC will be used to upsample positive instances until this threshold is reached. Can be disabled by setting to 0. Only accepts values 0 to 0.5.
- random_state : Random seed to use for splitting dataframe and for up-sampling (if needed).
- stratify : Controls stratified sampling. If True and dv is provided, stratifies by the outcome variable. If a list of column names, stratifies by those columns. If False, does not use stratified sampling.
- sampling_strategy : Sampling strategy for imbalanced data. If None, will use dv_threshold. See imblearn documentation for more details on acceptable values.
- shuffle : Whether to shuffle the data before splitting.
- verbose : Whether to print information about the splitting process.
Returns
- A tuple containing:
- x_train: Training features DataFrame
- y_train: Training target Series (if dv is provided, otherwise empty Series)
- x_test: Testing features DataFrame
- y_test: Testing target Series (if dv is provided, otherwise empty Series)
- model_weights: List of weights to use for modeling [negative_class_weight, positive_class_weight]
Examples:
# Basic split with default parameters (70% train, 30% test)
x_train, y_train, x_test, y_test, model_weights = gitlabds.split_data(
df=my_df,
dv='my_outcome'
)
# Split with 80% training data and balancing for imbalanced target
x_train, y_train, x_test, y_test, model_weights = gitlabds.split_data(
df=my_df,
dv='my_outcome',
train_pct=0.80,
dv_threshold=0.3
)
# Split with stratification on multiple variables
x_train, y_train, x_test, y_test, model_weights = gitlabds.split_data(
df=my_df,
dv='my_outcome',
stratify=['my_outcome', 'region', 'customer_segment']
)
# Split entire dataframe without separating target
train_df, _, test_df, _, _ = gitlabds.split_data(
df=my_df,
dv=None,
train_pct=0.75
)
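The returned model_weights can be passed to a weighted learner to offset any upsampling. A hedged sketch using XGBoost (not part of split_data itself), assuming the weights are inverse class-frequency style so that the ratio model_weights[1] / model_weights[0] matches XGBoost's scale_pos_weight convention:
# Weight the positive class relative to the negative class
from xgboost import XGBClassifier
clf = XGBClassifier(scale_pos_weight=model_weights[1] / model_weights[0], random_state=5435)
clf.fit(x_train, y_train)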
Model Configuration
ConfigGenerator
Description
A simple, flexible configuration builder for creating YAML files with any structure. This utility allows you to build complex, nested configuration files programmatically without being constrained to a predefined structure.
gitlabds.ConfigGenerator(**kwargs):
Parameters:
- **kwargs : Initial configuration values to populate the configuration object with
Methods:
add(path, value)
Add or update a value at a specific path in the configuration.
- path: String using dot-notation to specify the location (e.g., 'model.parameters.learning_rate')
- value: Any value to set at the specified path
to_yaml(file_path)
Write the configuration to a YAML file.
- file_path: Path to the output YAML file
Returns
- ConfigGenerator object for method chaining
Examples:
# Initialize with some top-level parameters
from gitlabds import ConfigGenerator
config = ConfigGenerator(
model_name="churn_prediction",
version="1.0.0",
unique_id="customer_id"
)
# Add nested model parameters
config.add("model.file", "xgboost_model.pkl")
config.add("model.parameters.learning_rate", 0.01)
config.add("model.parameters.max_depth", 6)
# Add preprocessing information from outlier detection and dummy coding
config.add("preprocessing.outliers", outlier_info)
config.add("preprocessing.dummy_coding", dummy_info)
# Add query information
config.add("query_parameters.query_file", "customer_data.sql")
config.add("query_parameters.lookback_months", 12)
# Save to YAML
config.to_yaml("churn_model_config.yaml")
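Because to_yaml() writes standard YAML, the file can be read back at scoring time with any YAML parser, for example PyYAML (not a gitlabds dependency):
# Load the configuration back in a scoring pipeline
import yaml
with open("churn_model_config.yaml") as f:
    config_dict = yaml.safe_load(f)
print(config_dict["model"]["parameters"]["learning_rate"])  # 0.01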
Model Evaluation
ModelEvaluator
Description
A comprehensive framework for evaluating machine learning models, supporting both classification (binary and multi-class) and regression models. It provides extensive evaluation metrics, visualizations, and feature importance analysis.
gitlabds.ModelEvaluator(model, x_train, y_train, x_test, y_test, x_oot=None, y_oot=None, classification=True, algo=None, f1_threshold=0.50, decile_n=10, top_features_n=20, show_all_classes=True, show_plots=True, save_plots=True, plot_dir='plots', plot_save_format='png', plot_save_dpi=300)
Parameters:
- model : The trained model to evaluate. Must have a predict method for regression or a predict_proba method for classification.
- x_train : Training features DataFrame.
- y_train : Training labels (Series or DataFrame).
- x_test : Test features DataFrame.
- y_test : Test labels (Series or DataFrame).
- x_oot : Optional out-of-time validation features.
- y_oot : Optional out-of-time validation labels.
- classification : Whether this is a classification model. If False, regression metrics will be used.
- algo : Algorithm type for feature importance calculation. Options: 'xgb', 'rf', 'mars'. For other algorithms, use None.
- f1_threshold : Threshold for binary classification.
- decile_n : Number of n-tiles for lift calculation. Defaults to 10 for deciles.
- top_features_n : Number of top features to display in visualizations.
- show_all_classes : Whether to show metrics for all classes in multi-class classification.
- show_plots : Whether to display plots
- save_plots : Whether to save plots locally
- plot_dir : Directory to save plots
- plot_save_format : Plot format
- plot_save_dpi : Plot resolution
Returns
- ModelMetricsResult object containing all evaluation metrics and results.
Key Methods:
- evaluate() - Compute and return all metrics
- evaluate_custom_metrics(custom_metrics) - Evaluate with additional custom metrics
- display_metrics(results=None) - Display evaluation results in a formatted way
- calibration_assessment() - Assess model calibration for classification models
- get_feature_descriptives(display_results=False) - Generate descriptive statistics for features
- plot_feature_importance(feature_importance, n_features=20) - Plot feature importance
- plot_shap_beeswarm(n_features=20, plot_type="beeswarm") - Create SHAP visualization
- plot_score_distribution(bins=None) - Plot distribution of predicted values
- plot_feature_interactions(feature_pairs=None, n_top_pairs=5) - Plot feature interactions
- plot_confusion_matrix() - Plot confusion matrix for classification models
- plot_lift_analysis() - Plot comprehensive lift analysis
- plot_performance_curves() - Plot ROC and precision-recall curves
- plot_learning_history() - Plot learning curves for iterative models
- plot_performance_comparison() - Plot model performance for out-of-time validation
Examples:
# Create an evaluator for a classification model
from gitlabds import ModelEvaluator
evaluator = ModelEvaluator(
model=my_model,
x_train=x_train,
y_train=y_train,
x_test=x_test,
y_test=y_test,
classification=True,
algo='xgb'
)
# Get all evaluation metrics
results = evaluator.evaluate()
# Display metrics in a formatted way
evaluator.display_metrics(results)
# Create visualizations
evaluator.plot_feature_importance(results.feature_importance)
evaluator.plot_confusion_matrix()
evaluator.plot_performance_curves()
# Save results to file
results.metrics_df.to_csv("metrics.csv")
results.classification_metrics_df.to_csv("classification_metrics.csv")
results.feature_importance.to_csv("feature_importance.csv")
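The expected input format for evaluate_custom_metrics is not documented here; a plausible sketch, assuming it accepts a dictionary mapping metric names to scikit-learn-style callables (verify against the gitlabds docstring before relying on this):
# Hypothetical custom metric dictionary using a scikit-learn scorer
from sklearn.metrics import balanced_accuracy_score
custom_results = evaluator.evaluate_custom_metrics({"balanced_accuracy": balanced_accuracy_score})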
Insight Generation
Marginal Effects
Description
Calculates and returns the marginal effects at the mean (MEM) for predictor fields.
gitlabds.marginal_effects(model, x_test, dv_description, field_labels=None):
Parameters:
- model : model file from training
- x_test : test "predictors" dataframe.
- dv_description : Description of the outcome field to be used in text-based insights.
- field_labels : Dict of field descriptions. The key is the field/feature/predictor name. The value is descriptive text of the field. This parameter is optional; by default the field name is used.
Returns
- Dataframe of marginal effects.
Examples:
# Calculate marginal effects for a trained model
import gitlabds
effects_df = gitlabds.marginal_effects(
model=trained_model,
x_test=test_features,
dv_description="probability of churn",
field_labels={
"tenure": "Customer tenure in months",
"monthly_charges": "Average monthly bill amount",
"total_charges": "Total amount charged to customer"
}
)
# Display the marginal effects
display(effects_df)
Prescriptions
Description
Return "actionable" prescriptions and explanatory insights for each scored record. Insights first list actionable prescriptions, followed by explanatory insights. This approach is recommended for linear/logistic methodologies only. Caution should be used with black box approaches, as manipulating more than one prescription at a time could change a record's model score in unintended ways.
gitlabds.prescriptions(model, input_df, scored_df, actionable_fields, dv_description, field_labels=None, returned_insights=5, only_actionable=False, explanation_fields='all'):
Parameters:
- model : model file from training
- input_df : train "predictors" dataframe.
- scored_df : dataframe containing model scores.
- actionable_fields : Dict of actionable fields. The key is the field/feature/predictor name. The value accepts one of 3 values: 'Increasing' for prescriptions only when the field increases; 'Decreasing' for prescriptions only when the field decreases; 'Both' for when the field either increases or decreases.
- dv_description : Description of the outcome field to be used in text-based insights.
- field_labels : Dict of field descriptions. The key is the field/feature/predictor name. The value is descriptive text of the field. This parameter is optional; by default the field name is used.
- returned_insights : Number of insights per record to return. Defaults to 5.
- only_actionable : Only return actionable prescriptions.
- explanation_fields : List of explainable (non-actionable insight) fields to return insights for. Defaults to 'all'.
Returns
- Dataframe of prescriptive actions. One row per record input.
Examples:
# Return prescriptions for the actionable fields of 'spend', 'returns', and 'emails_sent':
results = gitlabds.prescriptions(
model=model,
input_df=my_df,
scored_df=my_scores,
actionable_fields={
'spend': 'Increasing',
'returns': 'Decreasing',
'emails_sent': 'Both'
},
dv_description='likelihood to churn',
field_labels={
'spend': 'Dollars spent in last 6 months',
'returns': 'Item returns in last 3 months',
'emails_sent': 'Marketing emails sent in last month'
},
returned_insights=5,
only_actionable=True,
explanation_fields=['spend', 'returns']
)
Model Monitoring
Generate Baseline Features
Description
Generate baseline feature distributions, importance scores, and drift thresholds in a single comprehensive artifact for model monitoring.
gitlabds.generate_baseline_features(training_data, feature_importance_df, importance_method="shapley_values", n_bins=10, psi_warning=0.1, psi_critical=0.2, ks_warning=0.2, ks_critical=0.3, js_warning=0.1, js_critical=0.2, output_path="baseline_features.json"):
Parameters:
- training_data : Training feature data DataFrame
- feature_importance_df : DataFrame with columns: feature, importance
- importance_method : Method used to calculate importance (default: "shapley_values")
- n_bins : Number of bins for numerical features (default: 10)
- psi_warning, psi_critical : PSI thresholds for warning and critical drift detection
- ks_warning, ks_critical : KS statistic thresholds for warning and critical drift detection
- js_warning, js_critical : JS divergence thresholds for warning and critical drift detection
- output_path : Path to save the JSON file
Returns
- None (saves baseline artifact to JSON file)
Examples:
# Generate baseline features with default thresholds
import gitlabds
gitlabds.generate_baseline_features(
training_data=train_df,
feature_importance_df=importance_df,
importance_method="shapley_values",
output_path="model_baseline_features.json"
)
# Generate with custom drift thresholds
gitlabds.generate_baseline_features(
training_data=train_df,
feature_importance_df=importance_df,
n_bins=15,
psi_warning=0.15,
psi_critical=0.25,
output_path="custom_baseline_features.json"
)
Generate Baseline Calibration
Description
Generate baseline calibration data with train + test curves, prediction statistics, and model configuration for monitoring model calibration drift.
gitlabds.generate_baseline_calibration(train_predictions, train_actuals, test_predictions, test_actuals, model_configuration, n_bins=10, prediction_drift_warning=0.10, prediction_drift_critical=0.20, output_path="baseline_calibration.json"):
Parameters:
- train_predictions : Training set predicted probabilities
- train_actuals : Training set actual binary outcomes (0/1)
- test_predictions : Test set predicted probabilities
- test_actuals : Test set actual binary outcomes (0/1)
- model_configuration : Dictionary of model configuration parameters
- n_bins : Number of bins for calibration curve (default: 10)
- prediction_drift_warning : Warning level for prediction score drift (default: 0.10)
- prediction_drift_critical : Critical level for prediction score drift (default: 0.20)
- output_path : Path to save the JSON file
Returns
- None (saves baseline calibration to JSON file)
Examples:
# Generate baseline calibration
import gitlabds
model_config = {'f1_threshold': 0.15, 'random_state': 42}
gitlabds.generate_baseline_calibration(
train_predictions=y_train_pred,
train_actuals=y_train,
test_predictions=y_test_pred,
test_actuals=y_test,
model_configuration=model_config,
prediction_drift_warning=0.15,
prediction_drift_critical=0.25,
output_path="model_baseline_calibration.json"
)
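The prediction inputs above are positive-class probabilities; with a scikit-learn-style classifier they would typically be produced as follows (an assumption about your model object, not part of gitlabds):
# Assumed upstream step: positive-class probabilities from a sklearn-style model
y_train_pred = model.predict_proba(x_train)[:, 1]
y_test_pred = model.predict_proba(x_test)[:, 1]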
Calculate Monitoring Metrics
Description
Calculate all monitoring metrics including feature drift, prediction drift, and health status for a model scoring run.
gitlabds.calculate_monitoring_metrics(run_id, model_name, sub_model, model_version, score_date, feature_df, predictions, baseline_metrics, importance_threshold_pct=0.05):
Parameters:
- run_id : Unique identifier for this scoring run
- model_name : Model name (e.g., "propensity_model")
- sub_model : Sub model identifier
- model_version : Model version (e.g., "2.1")
- score_date : Date of scoring
- feature_df : Feature data used for scoring
- predictions : Model predictions (probabilities 0-1)
- baseline_metrics : Dictionary containing baseline_features and baseline_calibration JSON data
- importance_threshold_pct : Percentage of total importance required for a feature to be considered "important"
Returns
- Dictionary containing DataFrames for each table:
- 'scoring_summary': model_scoring_summary table
- 'feature_drift': model_feature_drift table
Examples:
# Calculate monitoring metrics for a scoring run
import gitlabds
from datetime import datetime
import json
# Load baseline metrics
with open('baseline_features.json', 'r') as f:
baseline_features = json.load(f)
with open('baseline_calibration.json', 'r') as f:
baseline_calibration = json.load(f)
baseline_metrics = {
'baseline_features': baseline_features,
'baseline_calibration': baseline_calibration
}
# Calculate metrics
results = gitlabds.calculate_monitoring_metrics(
run_id="scoring_run_123",
model_name="churn_prediction",
sub_model="high_value_customers",
model_version="2.1",
score_date=datetime.now(),
feature_df=current_features,
predictions=model_predictions,
baseline_metrics=baseline_metrics,
importance_threshold_pct=0.05
)
# Access results
scoring_summary = results['scoring_summary']
feature_drift = results['feature_drift']
Deployment
Feature Serving
Serve Features
Description
Serve features from Snowflake Feature Store with built-in distributed locking and retry logic. This function orchestrates feature retrieval from multiple feature views with support for point-in-time lookups and flexible parameter customization.
gitlabds.serve_features(session, feature_store, feature_views_dict, spine_df=None, feature_date=None, spine_timestamp_col=None, include_feature_view_timestamp_col=False, lookback_window_value=None, lookback_window_unit=None, lock_timeout_seconds=300):
Parameters:
- session : Active Snowflake Session object
- feature_store : Initialized FeatureStore instance
- feature_views_dict : Dictionary mapping feature view names to versions (e.g., {"sales_activities": "1.0"})
- spine_df : One of the following:
- SQL query string to generate the spine DataFrame
- pandas DataFrame (will be converted to Snowpark DataFrame)
- Snowpark DataFrame (used directly)
- feature_date : Date for feature retrieval (e.g., '2025-04-08'). If None, uses UDF default.
- spine_timestamp_col : Name of the timestamp column in the spine DataFrame for point-in-time lookup. Used when multiple feature_dates may be present in the feature view. If only one feature_date is present, set this to None for faster execution.
- include_feature_view_timestamp_col : Whether to include timestamp column from feature views in output. Default is False.
- lookback_window_value : Either a global integer value (e.g., 6) or a dict with feature-view-specific values
- lookback_window_unit : Either a global string value (e.g., 'months') or a dict with feature-view-specific values
- lock_timeout_seconds : Timeout in seconds for acquiring the Snowflake lock. Default is 300.
Returns
- DataFrame with combined features from all feature views
Requirements
This function requires Snowflake packages. Install with:
pip install snowflake-snowpark-python snowflake-ml-python
Examples:
# Basic usage with SQL spine
import gitlabds
from snowflake.snowpark import Session
from snowflake.ml.feature_store import FeatureStore
session = Session.builder.configs(...).create()
fs = FeatureStore(session, ...)
spine_sql = "SELECT account_id FROM accounts WHERE active = TRUE"
features = gitlabds.serve_features(
session=session,
feature_store=fs,
feature_views_dict={"sales_activities": "1.0", "marketing_activity": "2.1"},
spine_df=spine_sql,
feature_date=None
)
# With per-feature-view parameters
features = gitlabds.serve_features(
session=session,
feature_store=fs,
feature_views_dict={"sales_activities": "1.0", "product_stage": "2.1"},
spine_df=spine_df,
lookback_window_value={"sales_activities": 3, "product_stage": 6},
lookback_window_unit={"sales_activities": "days", "product_stage": "months"}
)
# With point-in-time lookup
features = gitlabds.serve_features(
session=session,
feature_store=fs,
feature_views_dict={"sales_activities": "1.0"},
spine_df=spine_df,
spine_timestamp_col="snapshot_date",
feature_date="2025-04-08"
)
Clear Feature Serving Locks
Description
Utility function to manually clear all feature serving locks. Use this if locks get stuck due to killed processes.
gitlabds.clear_feature_serving_locks(session):
Parameters:
- session : Active Snowflake Session object
Returns
- int: Number of locks cleared
Examples:
# Clear stuck locks
import gitlabds
from snowflake.snowpark import Session
session = Session.builder.configs(...).create()
locks_cleared = gitlabds.clear_feature_serving_locks(session)
print(f"Cleared {locks_cleared} locks")
SQL and Trend Analysis
SQL Trend Query Generator
Description
Generate SQL for trend analysis across time periods. The generated SQL transforms regular data into a time-series format with columns for each time period, allowing for easy trend detection.
gitlabds.generate_sql_trend_query(snapshot_date, date_field, date_unit='MONTH', periods=12, table_name=None, group_by_fields=None, metrics=None, filters=None, output_file=None):
Parameters:
- snapshot_date : Reference date for analysis (e.g., '2025-04-08')
- date_field : Field name in the table that contains the date to analyze
- date_unit : Time unit for analysis: 'DAY', 'WEEK', 'MONTH', 'QUARTER', 'YEAR'
- periods : Number of time periods to analyze
- table_name : Table to query data from
- group_by_fields : Fields to group by (entity identifiers)
- metrics : Metrics to include in analysis with their properties. Each metric is a dict with:
- name: output column name prefix
- source: field name in the source table
- aggregation: function to apply (AVG, SUM, MAX, etc.)
- condition: optional WHERE condition
- cumulative: if True, calculate period-over-period differences
- is_case_expression: if True, the source is already a CASE WHEN expression
- is_expression: if True, the source is a complex expression
- filters : SQL WHERE clause conditions as a string
- output_file : If provided, save the generated SQL to this file
Returns:
- The generated SQL query as a string
Examples:
# Generate SQL for monthly trend analysis
import gitlabds
# Define metrics
metrics = [
{"name": "active_users", "source": "monthly_active_users", "aggregation": "AVG"},
{"name": "revenue", "source": "monthly_revenue", "aggregation": "SUM"},
{"name": "projects", "source": "projects_created", "aggregation": "MAX", "cumulative": True}
]
# Generate SQL query
sql = gitlabds.generate_sql_trend_query(
snapshot_date='2025-04-08',
date_field='transaction_date',
date_unit='MONTH',
periods=12,
table_name='analytics.user_metrics',
group_by_fields=['account_id'],
metrics=metrics,
filters="is_active = TRUE",
output_file='trend_query.sql'
)
# Generate SQL for daily trend analysis with custom conditions
metrics = [
{"name": "logins", "source": "user_logins", "aggregation": "SUM"},
{"name": "premium_logins", "source": "user_logins", "aggregation": "SUM",
"condition": "subscription_tier = 'premium'"}
]
sql = gitlabds.generate_sql_trend_query(
snapshot_date='2025-04-08',
date_field='login_date',
date_unit='DAY',
periods=30,
table_name='analytics.daily_logins',
metrics=metrics
)
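The is_case_expression and is_expression flags are not exercised above; a hypothetical metric definition using them might look like the following (the SQL fragments are illustrative only and assume columns from your own schema):
# Hypothetical metrics built from SQL expressions rather than plain columns
metrics = [
    {"name": "is_paid", "source": "CASE WHEN plan != 'free' THEN 1 ELSE 0 END",
     "aggregation": "MAX", "is_case_expression": True},
    {"name": "revenue_per_user", "source": "monthly_revenue / NULLIF(monthly_active_users, 0)",
     "aggregation": "AVG", "is_expression": True}
]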
Trend Analysis
Description
Calculate trend metrics for a dataframe produced by the SQL trend generator. This function analyzes time-series data to identify patterns like consecutive increases or decreases, proportion of periods with growth or decline, and average percentage changes.
gitlabds.trend_analysis(df, metric_list=None, time_unit='month', periods=6, include_cumulative=True, exclude_fields=None, verbose=False):
Parameters:
- df : Dataframe containing trend data with time-based columns
- metric_list : List of metric names to analyze. If None, auto-detects metrics from columns
- time_unit : Time unit used in the column names (month, day, week, etc.)
- periods : Number of time periods to analyze
- include_cumulative : Whether to use cumulative (event) metrics when available
- exclude_fields : List of fields to exclude from auto-detection
- verbose : Whether to display intermediate output
Returns:
- A dataframe containing trend metrics for each specified metric, including:
- Count of periods with decreases/increases
- Count of consecutive decreases/increases
- Average percentage change across periods
Examples:
# Run trend analysis on data from SQL trend query
import gitlabds
# Run the SQL query to get trend data
trend_data = run_sql_query(trend_sql) # Your function to execute SQL
trend_data.set_index('account_id', inplace=True)
# Analyze trends for all metrics
trends_df = gitlabds.trend_analysis(
df=trend_data,
time_unit='month',
periods=12,
verbose=True
)
# Analyze trends for specific metrics
trends_df = gitlabds.trend_analysis(
df=trend_data,
metric_list=['active_users', 'revenue'],
time_unit='month',
periods=6,
include_cumulative=True,
exclude_fields=['has_data']
)
# Use trend metrics for customer health scoring
account_data['declining_usage'] = trends_df['consecutive_drop_active_users_period_6_months_cnt'] > 0
account_data['growth_score'] = trends_df['avg_perc_change_revenue_period_6_months'] * 100
GitLab Data Science
The handbook is the single source of truth for all of our documentation.
Contributing
We welcome contributions and improvements; please see the contribution guidelines.
License
This code is distributed under the MIT license, please see the LICENSE file.