Skip to main content

High-Performance ML Training, Scoring & Evaluation (Polars + GPU-Ready)

Project description

Version: 0.3.1 Build: Passing Maintenance PRs Welcome GitHub Stars

Installation

# Most up-to-date
pip install git+https://github.com/AdrianAntico/RetroFit.git#egg=retrofit

# From pypi
pip install retrofit==0.3.0

📦 RetroFit

High-Performance ML Training, Scoring & Evaluation (Polars + GPU-Ready)

Table of Contents

RetroFit is a fast, production-oriented machine learning framework designed for training, scoring, and evaluating models.

Built from the ground up with Polars, GPU acceleration, and a scalable evaluation engine, RetroFit provides a unified interface for:

  • 🧪 Model data creation & preprocessing (Polars-first)
  • 🚀 Fast training with automatic CPU/GPU switching
  • 📊 Unified scoring engine with inverse-transformation support
  • 📈 Full evaluation suite for regression, binary, and multiclass
  • 🎯 Calibration plots & tables
  • 📉 ROC, PR, and PR-ROC curves (QuickEcharts visuals)
  • 🔧 Automatic label encoding for classification/multiclass
  • 🔄 Target variable transformations

RetroFit is designed for data scientists who want speed, modern tooling, and high-quality diagnostics without boilerplate.


🔥 Key Features

⚙️ 1. Polars-Native Modeling Pipeline

RetroFit uses Polars internally for:

  • Numeric & categorical handling
  • Efficient grouping and slicing
  • Scored-data postprocessing
  • Evaluation data wrangling
  • Data preparation for CatBoost/XGBoost/LightGBM

Everything is vectorized whenever possible.


🎯 2. Target Transformations (Regression)

RetroFit supports:

  • "none"
  • "log" (auto detects ≤0 and applies min-shift)
  • "sqrt"
  • "standardize"

Transformation is applied automatically in create_model_data() and reversed in score(). User could also apply themselves before running create_model_data() and after running score().


⚡ 3. GPU-Ready Training

Enable GPU training with:

model = RetroFit(Algorithm="catboost", TargetType="regression", GPU=True)

RetroFit automatically:

  • Switches tree construction method
  • Adjusts booster settings
  • Removes CPU-only parameters
  • Ensures full CatBoost/XGBoost/LightGBM compatibility

📊 4. Unified Scoring Engine

model.score(DataName="test")

Or score external data:

model.score(NewData=df)

Outputs a Polars DataFrame with:

  • Predictions (Predict_target)
  • Probabilities (p1, class_k)
  • Inverse-transformed regression predictions

🧮 5. Full Evaluation Suite

Regression Metrics

  • MAE / MedianAE
  • MAPE
  • MSE / RMSE
  • MSLE (auto-disabled if invalid)

Binary Classification

  • Accuracy, Recall, Precision
  • TPR, FPR, TNR, FNR
  • F1, F0.5, F2
  • MCC
  • Threat score
  • Utility (custom cost matrix)
  • Full 101-point threshold curve

Multiclass Classification

  • Overall: Accuracy, macro/micro/weighted F1
  • One-vs-all threshold evaluation
  • Uses label decoder to restore original class names

🎛️ 6. Calibration Tables & Plot

Regression + Classification calibration:

  • Equal-width or quantile binning
  • MACE, RMSE, MAE, R²
  • Per-group calibration
  • QuickEcharts visualization
  • Metadata: timestamp, model name, grouping vars

📈 7. ROC / PR / PR-ROC Curves and Regression versions

RetroFit generates:

  • ROC
  • Precision-Recall
  • AUC and Average Precision
  • Metrics vs Threshold curves (Accuracy, F1, TPR, FPR, Utility, etc.)
  • Interactive QuickEcharts plots with gradient fills

🎨 8. QuickEcharts Visuals

All plots are powered by QuickEcharts:

  • Line / Area plots
  • Gradient fills
  • Auto subtitles with metrics
  • HTML export
  • Themes: 'chalk', 'dark', 'essos', 'halloween', 'infographic', 'light', 'macarons', 'purple-passion', 'roma', 'romantic', 'shine', 'vintage', 'walden', 'westeros', 'white', 'wonderland'

9. 🔍 Partial Dependence Plots (PDP)

RetroFit includes full PDP support for both numeric and categorical features to help explain model behavior.

✅ Numeric PDP

  • Automated binning (quantile or equal-width)
  • Mean actual vs mean predicted across bins
  • String-safe axis handling for QuickEcharts
  • Clean Line-chart visualization
  • Supports internal scored data or external df input
  • Returns both the PDP table and plot object

✅ Categorical PDP

  • Per-category partial dependence table
  • Mean actual vs mean predicted for each category
  • Optional sorting (feature order, actual mean, predicted mean)
  • Line-based visualization using QuickEcharts
  • Works with internal or external data
  • Returns both table and plot object

📑 10. Model Insights Reports

RetroFit can generate fully self-contained HTML Model Insights Reports for both regression and classification models.

Reports include:

  • Data summary and feature overview
  • Core metrics table (sortable & paginated)
  • Calibration tables and plots
  • Classification-specific diagnostics:
  • ROC curve
  • Precision–Recall curve
  • Metrics vs Threshold plot
  • Feature importance
  • Interaction importance (CatBoost)
  • Partial Dependence Plots
  • SHAP summary and dependence plots (tree-based models)

Reports are designed to be:

  • Analyst-friendly
  • Shareable (single HTML file)
  • Consistent with RetroFit’s evaluation engine
  • Safe for production diagnostics

Code Examples

1. Supervised Learning Examples

Below are runnable examples for all supported algorithms.
Each section includes regression, binary classification, and multiclass examples.


CatBoost Examples

Regression Training

# Setup environment
import os
import polars as pl
from PolarsFE import datasets
from QuickEcharts import Charts
from retrofit import MachineLearning as ml
from retrofit import utils


# Load some data
df = utils.make_retrofit_demo_data(
    n_rows=50_000,
    n_segments=5,
    seed=42,
)

# Get TrainData, ValidationData, and TestData
DataSets = datasets.partition_random(
    data=df,
    num_partitions=3,
    seed=42,
    percentages=[0.7, 0.2, 0.1]
)

# Initialize RetroFit
model = ml.RetroFit(TargetType="regression", Algorithm="catboost")

# Create algo-specific model data
model.create_model_data(
  TrainData=DataSets[0],
  ValidationData=DataSets[1],
  TestData=DataSets[2],
  TargetColumnName="Leads",
  NumericColumnNames=['XREGS1', 'XREGS2', 'XREGS3'],
  CategoricalColumnNames=['MarketingSegments', 'MarketingSegments2', 'MarketingSegments3'],
  TextColumnNames=None,
  WeightColumnName=None,
  Threads=-1
)

# Print default parameter settings
model.print_algo_args()

# Update algo args for GPU
model.update_model_parameters(
    task_type='GPU',
    sampling_frequency=None,
    rsm=1.0,
    iterations = 200
)

# Train Model
model.train()

# Score train, validation, and test; store internally
model.score()

# Build regression report
path = model.build_model_insights_report(
    output_path="regression_report.html",
    theme="neon",
)

# Inspect scored data
model.ScoredData["train"]
model.ScoredData["validation"]
model.ScoredData["test"]

# Evaluate scored data
global_eval = model.evaluate(
    DataName="test"
)

# Per segment
segment_eval = model.evaluate(
    DataName="test",
    ByVariables="MarketingSegments"
)

# Get variable importance
imp = model.compute_feature_importance()

# Get interaction importance
interact = model.compute_catboost_interaction_importance()

# Model Calibration Tables
cal = model.build_regression_calibration_table(
    DataName="test",
    binning="quantile"
)

# Store plot in working directory
model.plot_regression_calibration(
    DataName="test",
    n_bins=20,
    binning="quantile",
    plot_name=f"{os.getcwd()}/my_calibration_plot"
)

# Actual vs Predicted Scatterplot
model.plot_regression_scatter(
    DataName="test",
    SampleSize=15000,
    plot_name=f"{os.getcwd()}/my_scatter_plot",
    Theme="dark"
)

# Residuals vs Predicted Values
model.plot_regression_residuals_vs_predicted(
    DataName="test",
    SampleSize=15000,
    plot_name=f"{os.getcwd()}/my_residuals_plot",
    Theme="dark"
)

# Residual Distribution
model.plot_regression_residual_distribution(
    DataName="test",
    n_bins=40,
    plot_name=f"{os.getcwd()}/my_density_plot",
    Theme="dark",
)

# Get shap values
shap_train = model.compute_shap_values(split="train", attach=True)

# Shap Boxplot
shap_summary = model.plot_shap_summary(
    split="train",
    shap_attached=shap_train,
    prefix="shap_",
    top_n=20,
    max_samples=10_000,
    plot_name=f"{os.getcwd()}/my_shap_boxplot",
    Theme="dark",
)

# Create shap dependence plot
out = model.plot_shap_dependence(
    feature="XREGS1",
    split="train",
    plot_name=f"{os.getcwd()}/my_shap_pdp",
)

# Actual vs Predicted Distribution Overlay
resid_dist_ = model.plot_prediction_distribution(
    DataName="test",
    n_bins=40,
    plot_name=f"{os.getcwd()}/my_overlay_plot",
)

# Numeric Partial Dependence plot
model.plot_pdp_numeric(
    feature='XREGS1',
    DataName="test",
    plot_name=f"{os.getcwd()}/numeric_pdp",
)

# Categorical Partial Dependence plot
model.plot_pdp_categorical(
    feature='MarketingSegments',
    DataName="test",
    plot_name=f"{os.getcwd()}/categorical_pdp",
)

Classification Training

# Setup environment
import os
import polars as pl
from PolarsFE import datasets
from QuickEcharts import Charts
from retrofit import MachineLearning as ml
from retrofit import utils


# Load some data
df = utils.make_retrofit_demo_data(
    n_rows=50_000,
    n_segments=5,
    seed=42,
)

# Get TrainData, ValidationData, and TestData
DataSets = datasets.partition_random(
    data=df,
    num_partitions=3,
    seed=42,
    percentages=[0.7, 0.2, 0.1]
)

# Initialize RetroFit
model = ml.RetroFit(TargetType="classification", Algorithm="catboost")

# Create algo-specific model data
model.create_model_data(
  TrainData=DataSets[0],
  ValidationData=DataSets[1],
  TestData=DataSets[2],
  TargetColumnName="Label_binary",
  NumericColumnNames=['XREGS1','XREGS2','XREGS3'],
  CategoricalColumnNames=['MarketingSegments','MarketingSegments2','MarketingSegments3'],
  TextColumnNames=None,
  WeightColumnName=None,
  Threads=-1
)

# Print default parameter settings
model.print_algo_args()

# Update algo args for GPU
model.update_model_parameters(
    task_type='GPU',
    sampling_frequency=None,
    rsm=1.0,
    iterations=200
)

# Train Model
model.train()

# Score train, validation, and test; store internally
model.score()

# Build classification report
path = model.build_model_insights_report(
    output_path="classification_report.html",
    theme="neon",
)

# Inspect scored data
model.ScoredData["train"]
model.ScoredData["validation"]
model.ScoredData["test"]

# Evaluate scored data
global_eval = model.evaluate(
    DataName="test"
)

# Per segment
segment_eval = model.evaluate(
    DataName="test",
    ByVariables="MarketingSegments"
)

# Get variable importance
imp = model.compute_feature_importance()

# Get interaction importance
interact = model.compute_catboost_interaction_importance()

# Evaluate scored data
model.plot_classification_threshold_metrics(
    DataName="test",
    CostDict=dict(tpcost=1.0, fpcost=-1.0, fncost=-1.0, tncost=1.0),
    plot_name=f"{os.getcwd()}/my_thresh_plot",
    Theme="dark"
)

# Store plot in working directory
model.plot_classification_calibration(
    DataName="test",
    n_bins=20,
    binning="quantile",
    plot_name=f"{os.getcwd()}/my_calibration_plot"
)

# ROC Plot
model.plot_classification_roc(
    DataName="test",
    plot_name=f"{os.getcwd()}/my_roc_plot",
)

# PR Plot
model.plot_classification_pr(
    DataName="test",
    plot_name=f"{os.getcwd()}/my_pr_plot"
)

# Get shap values
shap_train = model.compute_shap_values(split="train", attach=True)

# Shap Boxplot
shap_summary = model.plot_shap_summary(
    split="train",
    shap_attached=shap_train,
    prefix="shap_",
    top_n=20,
    max_samples=10_000,
    plot_name=f"{os.getcwd()}/my_shap_boxplot",
    Theme="dark",
)

# Create shap dependence plot
out = model.plot_shap_dependence(
    feature="XREGS1",
    split="train",
    plot_name=f"{os.getcwd()}/my_shap_pdp",
)

# Numeric Partial Dependence plot
model.plot_pdp_numeric(
    feature='XREGS1',
    DataName="test",
    plot_name=f"{os.getcwd()}/numeric_pdp",
)

# Categorical Partial Dependence plot
model.plot_pdp_categorical(
    feature='MarketingSegments',
    DataName="test",
    plot_name=f"{os.getcwd()}/categorical_pdp",
)

MultiClass Training

import os
import polars as pl
from PolarsFE import datasets
from QuickEcharts import Charts
from retrofit import MachineLearning as ml

# Load some data
# Generate demo data instead of reading packaged CSVs
df = utils.make_retrofit_demo_data(
    n_rows=50_000,
    n_segments=5,
    seed=42,
)

# Get TrainData, ValidationData, and TestData
DataSets = datasets.partition_random(
    data=df,
    num_partitions=3,
    seed=42,
    percentages=[0.7, 0.2, 0.1]
)

# Initialize RetroFit
model = ml.RetroFit(TargetType="multiclass", Algorithm="catboost")

# Create algo-specific model data
model.create_model_data(
  TrainData=DataSets[0],
  ValidationData=DataSets[1],
  TestData=DataSets[2],
  TargetColumnName="Label",
  NumericColumnNames=['XREGS1', 'XREGS2', 'XREGS3'],
  CategoricalColumnNames=['MarketingSegments', 'MarketingSegments2', 'MarketingSegments3'],
  TextColumnNames=None,
  WeightColumnName=None,
  Threads=-1
)

# Print default parameter settings
model.print_algo_args()

# Update algo args for GPU
model.update_model_parameters(
    task_type='GPU',
    bootstrap_type='Bayesian',
    rsm=1.0,
    iterations=200,
    subsample=None,
    sampling_frequency=None
)

# Train Model
model.train()

# Score train, validation, and test; store internally
model.score()

# Inspect scored data
model.ScoredData["train"]
model.ScoredData["validation"]
model.ScoredData["test"]

# Evaluate scored data
global_eval = model.evaluate(
    DataName="test"
)

# Per segment
segment_eval = model.evaluate(
    DataName="test",
    ByVariables="MarketingSegments"
)

# Get variable importance
imp = model.compute_feature_importance()

# Get interaction importance
interact = model.compute_catboost_interaction_importance()

XGBoost Examples

Regression Training

# Setup Environment
import os
import polars as pl
from PolarsFE import datasets, character
from QuickEcharts import Charts
from retrofit import MachineLearning as ml

# Generate demo data instead of reading packaged CSVs
df = utils.make_retrofit_demo_data(
    n_rows=50_000,
    n_segments=5,
    seed=42,
)

# Get TrainData, ValidationData, and TestData
DataSets = datasets.partition_random(
    data=df,
    num_partitions=3,
    seed=42,
    percentages=[0.7, 0.2, 0.1]
)

# Create target encodings for categorical variables
categorical_cols = ['MarketingSegments', 'MarketingSegments2', 'MarketingSegments3']
output = character.categorical_encoding(
    data=DataSets[0],
    ML_Type="regression",
    group_variables=categorical_cols,
    target_variable="Leads",
    method="target_encoding",
    scoring=False,
    keep_original_factors=False
)

# Collect data and encodings
DataSets[0] = output['data']
encodings = output['factor_components']

# Note parameter: scoring=True
DataSets[1] = character.categorical_encoding(
    data=DataSets[1],
    ML_Type="regression",
    group_variables=categorical_cols,
    target_variable="Leads",
    method="target_encoding",
    scoring=True,
    supply_factor_level_list=encodings,
    keep_original_factors=False
)

# Note parameter: scoring=True
DataSets[2] = character.categorical_encoding(
    data=DataSets[2],
    ML_Type="regression",
    group_variables=categorical_cols,
    target_variable="Leads",
    method="target_encoding",
    scoring=True,
    supply_factor_level_list=encodings,
    keep_original_factors=False
)

# Initialize RetroFit
model = ml.RetroFit(TargetType="regression", Algorithm="xgboost")

# Create algo-specific model data
model.create_model_data(
  TrainData=DataSets[0],
  ValidationData=DataSets[1],
  TestData=DataSets[2],
  TargetColumnName="Leads",
  NumericColumnNames=[
    'XREGS1',
    'XREGS2',
    'XREGS3',
    'MarketingSegments_TargetEncode',
    'MarketingSegments2_TargetEncode',
    'MarketingSegments3_TargetEncode'
  ],
  Threads=-1
)

# Print default parameter settings
model.print_algo_args()

# Update algo args
model.update_model_parameters(
    num_boost_round=200,
    num_parallel_tree=4,
    max_depth=4
)

# Train Model
model.train()

# Score train, validation, and test; store internally
model.score()

# Build regression report
path = model.build_model_insights_report(
    output_path="regression_report.html",
    theme="neon",
)

# Inspect scored data
model.ScoredData["train"]
model.ScoredData["validation"]
model.ScoredData["test"]

# Evaluate scored data
global_eval = model.evaluate(
    DataName="test"
)

# Per segment
segment_eval = model.evaluate(
    DataName="test",
    ByVariables="MarketingSegments_TargetEncode"
)

# Get variable importance
imp = model.compute_feature_importance()

# Model Calibration Tables
cal = model.build_regression_calibration_table(
    DataName="test",
    binning="quantile"
)

# Store plot in working directory
model.plot_regression_calibration(
    DataName="test",
    n_bins=20,
    binning="quantile",
    plot_name=f"{os.getcwd()}/my_calibration_plot"
)

# Get shap values
shap_train = model.compute_shap_values(split="train", attach=True)

# Shap Boxplot
shap_summary = model.plot_shap_summary(
    split="train",
    shap_attached=shap_train,
    prefix="shap_",
    top_n=20,
    max_samples=10_000,
    plot_name=f"{os.getcwd()}/my_shap_boxplot",
    Theme="dark",
)

# Create shap dependence plot
out = model.plot_shap_dependence(
    feature="XREGS1",
    split="train",
    plot_name=f"{os.getcwd()}/my_shap_pdp",
)

# Actual vs Predicted Scatterplot
model.plot_regression_scatter(
    DataName="test",
    SampleSize=15000,
    plot_name=f"{os.getcwd()}/my_scatter_plot",
    Theme="dark"
)

# Residuals vs Predicted Values
model.plot_regression_residuals_vs_predicted(
    DataName="test",
    SampleSize=15000,
    plot_name=f"{os.getcwd()}/my_residuals_plot",
    Theme="dark"
)

# Residual Distribution
model.plot_regression_residual_distribution(
    DataName="test",
    n_bins=40,
    plot_name=f"{os.getcwd()}/my_density_plot",
    Theme="dark",
)

# Actual vs Predicted Distribution Overlay
resid_dist_ = model.plot_prediction_distribution(
    DataName="test",
    n_bins=40,
    plot_name=f"{os.getcwd()}/my_overlay_plot",
)

# Numeric Partial Dependence plot
model.plot_pdp_numeric(
    feature='XREGS1',
    DataName="test",
    plot_name=f"{os.getcwd()}/numeric_pdp",
)

# Categorical Partial Dependence plot
model.plot_pdp_categorical(
    feature='MarketingSegments',
    DataName="test",
    plot_name=f"{os.getcwd()}/categorical_pdp",
)

Classification Training

# Setup Environment
import os
import polars as pl
from PolarsFE import datasets, character
from QuickEcharts import Charts
from retrofit import MachineLearning as ml

# Generate demo data instead of reading packaged CSVs
df = utils.make_retrofit_demo_data(
    n_rows=50_000,
    n_segments=5,
    seed=42,
)

# Turn Label into a binary target variable
df = df.with_columns(
    pl.when(pl.col("XREGS1") > 200).then(1)
      .otherwise(0)
      .alias("Label")
)

# Get TrainData, ValidationData, and TestData
DataSets = datasets.partition_random(
    data=df,
    num_partitions=3,
    seed=42,
    percentages=[0.7, 0.2, 0.1]
)

# Create target encodings for categorical variables
categorical_cols = ['MarketingSegments', 'MarketingSegments2', 'MarketingSegments3']
output = character.categorical_encoding(
    data=DataSets[0],
    ML_Type="classification",
    group_variables=categorical_cols,
    target_variable="Leads",
    method="target_encoding",
    scoring=False,
    keep_original_factors=False
)

# Collect data and encodings
DataSets[0] = output['data']
encodings = output['factor_components']

# Note parameter: scoring=True
DataSets[1] = character.categorical_encoding(
    data=DataSets[1],
    ML_Type="classification",
    group_variables=categorical_cols,
    target_variable="Leads",
    method="target_encoding",
    scoring=True,
    supply_factor_level_list=encodings,
    keep_original_factors=False
)

# Note parameter: scoring=True
DataSets[2] = character.categorical_encoding(
    data=DataSets[2],
    ML_Type="classification",
    group_variables=categorical_cols,
    target_variable="Label",
    method="target_encoding",
    scoring=True,
    supply_factor_level_list=encodings,
    keep_original_factors=False
)

# Initialize RetroFit
model = ml.RetroFit(TargetType="classification", Algorithm="xgboost")

# Create algo-specific model data
model.create_model_data(
  TrainData=DataSets[0],
  ValidationData=DataSets[1],
  TestData=DataSets[2],
  TargetColumnName="Label",
  NumericColumnNames=[
    'XREGS1',
    'XREGS2',
    'XREGS3',
    'MarketingSegments_TargetEncode',
    'MarketingSegments2_TargetEncode',
    'MarketingSegments3_TargetEncode'
  ],
  Threads=-1
)

# Print default parameter settings
model.print_algo_args()

# Update algo args
model.update_model_parameters(
    num_boost_round=200,
    num_parallel_tree=4,
    max_depth=4
)

# Train Model
model.train()

# Score train, validation, and test; store internally
model.score()

# Build classification report
path = model.build_model_insights_report(
    output_path="classification_report.html",
    theme="neon",
)

# Inspect scored data
model.ScoredData["train"]
model.ScoredData["validation"]
model.ScoredData["test"]

# Evaluate scored data
global_eval = model.evaluate(
    DataName="test"
)

# Per segment
segment_eval = model.evaluate(
    DataName="test",
    ByVariables="MarketingSegments_TargetEncode"
)

# Get variable importance
imp = model.compute_feature_importance()

# Evaluate scored data
model.plot_classification_threshold_metrics(
    DataName="test",
    CostDict=dict(tpcost=1.0, fpcost=-1.0, fncost=-1.0, tncost=1.0),
    plot_name=f"{os.getcwd()}/my_thresh_plot",
    Theme="dark"
)

# Store plot in working directory
model.plot_classification_calibration(
    DataName="test",
    n_bins=20,
    binning="quantile",
    plot_name=f"{os.getcwd()}/my_calibration_plot"
)

# ROC Plot
model.plot_classification_roc(
    DataName="test",
    plot_name=f"{os.getcwd()}/my_roc_plot",
)

# PR Plot
model.plot_classification_pr(
    DataName="test",
    plot_name=f"{os.getcwd()}/my_pr_plot"
)

# Get shap values
shap_train = model.compute_shap_values(split="train", attach=True)

# Shap Boxplot
shap_summary = model.plot_shap_summary(
    split="train",
    shap_attached=shap_train,
    prefix="shap_",
    top_n=20,
    max_samples=10_000,
    plot_name=f"{os.getcwd()}/my_shap_boxplot",
    Theme="dark",
)

# Create shap dependence plot
out = model.plot_shap_dependence(
    feature="XREGS1",
    split="train",
    plot_name=f"{os.getcwd()}/my_shap_pdp",
)

# Numeric Partial Dependence plot
model.plot_pdp_numeric(
    feature='XREGS1',
    DataName="test",
    plot_name=f"{os.getcwd()}/numeric_pdp",
)

# Categorical Partial Dependence plot
model.plot_pdp_categorical(
    feature='MarketingSegments',
    DataName="test",
    plot_name=f"{os.getcwd()}/categorical_pdp",
)

MultiClass Training

# Setup Environment
import os
import polars as pl
from PolarsFE import datasets, character
from QuickEcharts import Charts
from retrofit import MachineLearning as ml

# Generate demo data instead of reading packaged CSVs
df = utils.make_retrofit_demo_data(
    n_rows=50_000,
    n_segments=5,
    seed=42,
)

# Get TrainData, ValidationData, and TestData
DataSets = datasets.partition_random(
    data=df,
    num_partitions=3,
    seed=42,
    percentages=[0.7, 0.2, 0.1]
)

# Create target encodings for categorical variables
categorical_cols = ['MarketingSegments', 'MarketingSegments2', 'MarketingSegments3']
output = character.categorical_encoding(
    data=DataSets[0],
    ML_Type="multiclass",
    group_variables=categorical_cols,
    target_variable="Leads",
    method="target_encoding",
    scoring=False,
    keep_original_factors=False
)

# Collect data and encodings
DataSets[0] = output['data']
encodings = output['factor_components']

# Note parameter: scoring=True
DataSets[1] = character.categorical_encoding(
    data=DataSets[1],
    ML_Type="multiclass",
    group_variables=categorical_cols,
    target_variable="Leads",
    method="target_encoding",
    scoring=True,
    supply_factor_level_list=encodings,
    keep_original_factors=False
)

# Note parameter: scoring=True
DataSets[2] = character.categorical_encoding(
    data=DataSets[2],
    ML_Type="multiclass",
    group_variables=categorical_cols,
    target_variable="Label",
    method="target_encoding",
    scoring=True,
    supply_factor_level_list=encodings,
    keep_original_factors=False
)

# Initialize RetroFit
model = ml.RetroFit(TargetType="multiclass", Algorithm="xgboost")

# Model Variables
drop_cols = ['CalendarDateColumn', 'Label']
features = [c for c in DataSets[2].columns if c not in drop_cols]

# Create algo-specific model data
model.create_model_data(
  TrainData=DataSets[0],
  ValidationData=DataSets[1],
  TestData=DataSets[2],
  TargetColumnName="Label",
  NumericColumnNames=features,
  Threads=-1
)

# Print default parameter settings
model.print_algo_args()

# Update algo args
model.update_model_parameters(
    num_boost_round=200,
    num_parallel_tree=4,
    max_depth=4
)

# Train Model
model.train()

# Score train, validation, and test; store internally
model.score()

# Inspect scored data
model.ScoredData["train"]
model.ScoredData["validation"]
model.ScoredData["test"]

# Evaluate scored data
global_eval = model.evaluate(
    DataName="test"
)

# Get variable importance
imp = model.compute_feature_importance()

LightGBM Examples

Regression Training

# Setup Environment
import os
import polars as pl
from PolarsFE import datasets, character
from QuickEcharts import Charts
from retrofit import MachineLearning as ml

# Generate demo data instead of reading packaged CSVs
df = utils.make_retrofit_demo_data(
    n_rows=50_000,
    n_segments=5,
    seed=42,
)

# Get TrainData, ValidationData, and TestData
DataSets = datasets.partition_random(
    data=df,
    num_partitions=3,
    seed=42,
    percentages=[0.7, 0.2, 0.1]
)

# Create target encodings for categorical variables
categorical_cols = ['MarketingSegments', 'MarketingSegments2', 'MarketingSegments3']
output = character.categorical_encoding(
    data=DataSets[0],
    ML_Type="regression",
    group_variables=categorical_cols,
    target_variable="Leads",
    method="target_encoding",
    scoring=False,
    keep_original_factors=False
)

# Collect data and encodings
DataSets[0] = output['data']
encodings = output['factor_components']

# Note parameter: scoring=True
DataSets[1] = character.categorical_encoding(
    data=DataSets[1],
    ML_Type="regression",
    group_variables=categorical_cols,
    target_variable="Leads",
    method="target_encoding",
    scoring=True,
    supply_factor_level_list=encodings,
    keep_original_factors=False
)

# Note parameter: scoring=True
DataSets[2] = character.categorical_encoding(
    data=DataSets[2],
    ML_Type="regression",
    group_variables=categorical_cols,
    target_variable="Leads",
    method="target_encoding",
    scoring=True,
    supply_factor_level_list=encodings,
    keep_original_factors=False
)

# Initialize RetroFit
model = ml.RetroFit(TargetType="regression", Algorithm="lightgbm")

# Create algo-specific model data
model.create_model_data(
  TrainData=DataSets[0],
  ValidationData=DataSets[1],
  TestData=DataSets[2],
  TargetColumnName="Leads",
  NumericColumnNames=[
    'XREGS1',
    'XREGS2',
    'XREGS3',
    'MarketingSegments_TargetEncode',
    'MarketingSegments2_TargetEncode',
    'MarketingSegments3_TargetEncode'
  ],
  Threads=-1
)

# Print default parameter settings
model.print_algo_args()

# Update algo args
model.update_model_parameters(
    num_iterations=200,
    max_depth=6,
    min_data_in_leaf=2
)

# Train Model
model.train()

# Score train, validation, and test; store internally
model.score()

# Build regression report
path = model.build_model_insights_report(
    output_path="regression_report.html",
    theme="neon",
)

# Inspect scored data
model.ScoredData["train"]
model.ScoredData["validation"]
model.ScoredData["test"]

# Evaluate scored data
global_eval = model.evaluate(
    DataName="test"
)

# Per segment
segment_eval = model.evaluate(
    DataName="test",
    ByVariables="MarketingSegments_TargetEncode"
)

# Get variable importance
imp = model.compute_feature_importance()

# Model Calibration Tables
cal = model.build_regression_calibration_table(
    DataName="test",
    binning="quantile"
)
# Store plot in working directory
model.plot_regression_calibration(
    DataName="test",
    n_bins=20,
    binning="quantile",
    plot_name=f"{os.getcwd()}/my_calibration_plot"
)

# Get shap values
shap_train = model.compute_shap_values(split="train", attach=True)

# Shap Boxplot
shap_summary = model.plot_shap_summary(
    split="train",
    shap_attached=shap_train,
    prefix="shap_",
    top_n=20,
    max_samples=10_000,
    plot_name=f"{os.getcwd()}/my_shap_boxplot",
    Theme="dark",
)

# Create shap dependence plot
out = model.plot_shap_dependence(
    feature="XREGS1",
    split="train",
    plot_name=f"{os.getcwd()}/my_shap_pdp",
)

# Actual vs Predicted Scatterplot
model.plot_regression_scatter(
    DataName="test",
    SampleSize=15000,
    plot_name=f"{os.getcwd()}/my_scatter_plot",
    Theme="dark"
)

# Residuals vs Predicted Values
model.plot_regression_residuals_vs_predicted(
    DataName="test",
    SampleSize=15000,
    plot_name=f"{os.getcwd()}/my_residuals_plot",
    Theme="dark"
)

# Residual Distribution
model.plot_regression_residual_distribution(
    DataName="test",
    n_bins=40,
    plot_name=f"{os.getcwd()}/my_density_plot",
    Theme="dark",
)

# Actual vs Predicted Distribution Overlay
resid_dist_ = model.plot_prediction_distribution(
    DataName="test",
    n_bins=40,
    plot_name=f"{os.getcwd()}/my_overlay_plot",
)

# Numeric Partial Dependence plot
model.plot_pdp_numeric(
    feature='XREGS1',
    DataName="test",
    plot_name=f"{os.getcwd()}/numeric_pdp",
)

# Categorical Partial Dependence plot
model.plot_pdp_categorical(
    feature='MarketingSegments',
    DataName="test",
    plot_name=f"{os.getcwd()}/categorical_pdp",
)

Classification Training

# Setup Environment
import os
import polars as pl
from PolarsFE import datasets, character
from QuickEcharts import Charts
from retrofit import MachineLearning as ml

# Generate demo data instead of reading packaged CSVs
df = utils.make_retrofit_demo_data(
    n_rows=50_000,
    n_segments=5,
    seed=42,
)

# Turn Label into a binary target variable
df = df.with_columns(
    pl.when(pl.col("XREGS1") > 200).then(1)
      .otherwise(0)
      .alias("Label")
)

# Get TrainData, ValidationData, and TestData
DataSets = datasets.partition_random(
    data=df,
    num_partitions=3,
    seed=42,
    percentages=[0.7, 0.2, 0.1]
)

# Create target encodings for categorical variables
categorical_cols = ['MarketingSegments', 'MarketingSegments2', 'MarketingSegments3']
output = character.categorical_encoding(
    data=DataSets[0],
    ML_Type="classification",
    group_variables=categorical_cols,
    target_variable="Leads",
    method="target_encoding",
    scoring=False,
    keep_original_factors=False
)

# Collect data and encodings
DataSets[0] = output['data']
encodings = output['factor_components']

# Note parameter: scoring=True
DataSets[1] = character.categorical_encoding(
    data=DataSets[1],
    ML_Type="classification",
    group_variables=categorical_cols,
    target_variable="Leads",
    method="target_encoding",
    scoring=True,
    supply_factor_level_list=encodings,
    keep_original_factors=False
)

# Note parameter: scoring=True
DataSets[2] = character.categorical_encoding(
    data=DataSets[2],
    ML_Type="classification",
    group_variables=categorical_cols,
    target_variable="Label",
    method="target_encoding",
    scoring=True,
    supply_factor_level_list=encodings,
    keep_original_factors=False
)

# Initialize RetroFit
model = ml.RetroFit(TargetType="classification", Algorithm="lightgbm")

# Create algo-specific model data
model.create_model_data(
  TrainData=DataSets[0],
  ValidationData=DataSets[1],
  TestData=DataSets[2],
  TargetColumnName="Label",
  NumericColumnNames=[
    'XREGS1',
    'XREGS2',
    'XREGS3',
    'MarketingSegments_TargetEncode',
    'MarketingSegments2_TargetEncode',
    'MarketingSegments3_TargetEncode'
  ],
  Threads=-1
)

# Print default parameter settings
model.print_algo_args()

# Update algo args
model.update_model_parameters(
    num_iterations=200,
    max_depth=6,
    min_data_in_leaf=2
)

# Train Model
model.train()

# Score train, validation, and test; store internally
model.score()

# Build classification report
path = model.build_model_insights_report(
    output_path="classification_report.html",
    theme="neon",
)

# Inspect scored data
model.ScoredData["train"]
model.ScoredData["validation"]
model.ScoredData["test"]

# Evaluate scored data
global_eval = model.evaluate(
    DataName="test"
)

# Per segment
segment_eval = model.evaluate(
    DataName="test",
    ByVariables="MarketingSegments_TargetEncode"
)

# Get variable importance
imp = model.compute_feature_importance()

# Evaluate scored data
model.plot_classification_threshold_metrics(
    DataName="test",
    CostDict=dict(tpcost=1.0, fpcost=-1.0, fncost=-1.0, tncost=1.0),
    plot_name=f"{os.getcwd()}/my_thresh_plot",
    Theme="dark"
)

# Store plot in working directory
model.plot_classification_calibration(
    DataName="test",
    n_bins=20,
    binning="quantile",
    plot_name=f"{os.getcwd()}/my_calibration_plot"
)

# ROC Plot
model.plot_classification_roc(
    DataName="test",
    plot_name=f"{os.getcwd()}/my_roc_plot",
)

# PR Plot
model.plot_classification_pr(
    DataName="test",
    plot_name=f"{os.getcwd()}/my_pr_plot"
)

# Get shap values
shap_train = model.compute_shap_values(split="train", attach=True)

# Shap Boxplot
shap_summary = model.plot_shap_summary(
    split="train",
    shap_attached=shap_train,
    prefix="shap_",
    top_n=20,
    max_samples=10_000,
    plot_name=f"{os.getcwd()}/my_shap_boxplot",
    Theme="dark",
)

# Create shap dependence plot
out = model.plot_shap_dependence(
    feature="XREGS1",
    split="train",
    plot_name=f"{os.getcwd()}/my_shap_pdp",
)

# Numeric Partial Dependence plot
model.plot_pdp_numeric(
    feature='XREGS1',
    DataName="test",
    plot_name=f"{os.getcwd()}/numeric_pdp",
)

# Categorical Partial Dependence plot
model.plot_pdp_categorical(
    feature='MarketingSegments',
    DataName="test",
    plot_name=f"{os.getcwd()}/categorical_pdp",
)

MultiClass Training

# Setup Environment
import os
import polars as pl
from PolarsFE import datasets, character
from QuickEcharts import Charts
from retrofit import MachineLearning as ml

# Generate demo data instead of reading packaged CSVs
df = utils.make_retrofit_demo_data(
    n_rows=50_000,
    n_segments=5,
    seed=42,
)

# Get TrainData, ValidationData, and TestData
DataSets = datasets.partition_random(
    data=df,
    num_partitions=3,
    seed=42,
    percentages=[0.7, 0.2, 0.1]
)

# Create target encodings for categorical variables
categorical_cols = ['MarketingSegments', 'MarketingSegments2', 'MarketingSegments3']
output = character.categorical_encoding(
    data=DataSets[0],
    ML_Type="multiclass",
    group_variables=categorical_cols,
    target_variable="Leads",
    method="target_encoding",
    scoring=False,
    keep_original_factors=False
)

# Collect data and encodings
DataSets[0] = output['data']
encodings = output['factor_components']

# Note parameter: scoring=True
DataSets[1] = character.categorical_encoding(
    data=DataSets[1],
    ML_Type="multiclass",
    group_variables=categorical_cols,
    target_variable="Leads",
    method="target_encoding",
    scoring=True,
    supply_factor_level_list=encodings,
    keep_original_factors=False
)

# Note parameter: scoring=True
DataSets[2] = character.categorical_encoding(
    data=DataSets[2],
    ML_Type="multiclass",
    group_variables=categorical_cols,
    target_variable="Label",
    method="target_encoding",
    scoring=True,
    supply_factor_level_list=encodings,
    keep_original_factors=False
)

# Initialize RetroFit
model = ml.RetroFit(TargetType="multiclass", Algorithm="lightgbm")

# Model Variables
drop_cols = ['CalendarDateColumn', 'Label']
features = [c for c in DataSets[2].columns if c not in drop_cols]

# Create algo-specific model data
model.create_model_data(
  TrainData=DataSets[0],
  ValidationData=DataSets[1],
  TestData=DataSets[2],
  TargetColumnName="Label",
  NumericColumnNames=features,
  Threads=-1
)

# Print default parameter settings
model.print_algo_args()

# Update algo args
model.update_model_parameters(
    num_iterations=200,
    max_depth=6,
    min_data_in_leaf=2
)

# Train Model
model.train()

# Score train, validation, and test; store internally
model.score()

# Inspect scored data
model.ScoredData["train"]
model.ScoredData["validation"]
model.ScoredData["test"]

# Evaluate scored data
global_eval = model.evaluate(
    DataName="test"
)

# Get variable importance
imp = model.compute_feature_importance()


2. Model Evaluation Visuals

Below is a gallery of example evaluation plots produced by RetroFit.

Click to expand gallery























3. Model Insights Reports

Below is a gallery of example insights report outputs produced by RetroFit.

Click to expand gallery

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

retrofit-0.3.1.tar.gz (4.4 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

retrofit-0.3.1-py3-none-any.whl (4.4 MB view details)

Uploaded Python 3

File details

Details for the file retrofit-0.3.1.tar.gz.

File metadata

  • Download URL: retrofit-0.3.1.tar.gz
  • Upload date:
  • Size: 4.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.11

File hashes

Hashes for retrofit-0.3.1.tar.gz
Algorithm Hash digest
SHA256 183f9f9a5a156fd835f3fc0416dbbe822a7987c7cdce8a9fbdae36a99c0e073a
MD5 bc20a350cde5911c6ac977d7de132694
BLAKE2b-256 e8f7182e04424a0e3a9d5b639e66ed02530d21906579de48e82202b2897e0bf7

See more details on using hashes here.

File details

Details for the file retrofit-0.3.1-py3-none-any.whl.

File metadata

  • Download URL: retrofit-0.3.1-py3-none-any.whl
  • Upload date:
  • Size: 4.4 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.11

File hashes

Hashes for retrofit-0.3.1-py3-none-any.whl
Algorithm Hash digest
SHA256 a42b0be2b795f953be0d8a6de9fe07803b8c3f89dc651b59d3d5699862faee47
MD5 243030eb9bbb575dc135bb87e1833df4
BLAKE2b-256 ff559305c948917c4fbef86d169aa3ea5a769ab30df0665926ece17b11cef072

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page