AutoML, Forecasting, NLP, Image Classification, Feature Engineering, Model Evaluation, Model Interpretation, Fast Processing.

Project description

Version: 0.1.5

Quick Note

This package is in its early stages. I'm working from the blueprint of my R package, RemixAutoML, so new releases should cause few breakages and consist mostly of non-breaking enhancements and additions.

Installation

# Most up-to-date
pip install git+https://github.com/AdrianAntico/RetroFit.git#egg=retrofit

# From pypi
pip install retrofit==0.1.5

# Check out R package RemixAutoML
https://github.com/AdrianAntico/RemixAutoML

Feature Engineering

Feature Engineering - Some of the feature engineering functions in this package can't be found anywhere else. I believe feature engineering is your best bet for improving model performance, so there are functions for every feature type: numeric, categorical, text, and date. They are all designed to generate features for both training and scoring pipelines, and they are built to run fast with low memory use. All feature engineering and data wrangling functions run on datatable or polars (your choice), so you only need to reach for big data tools when absolutely necessary.

Machine Learning

Machine Learning Training -

Machine Learning Scoring -

Machine Learning Evaluation -

Machine Learning Interpretation -

Feature Engineering

FE0 Feature Engineering: Row-Dependence

FE0_AutoLags()

Function Description

FE0_AutoLags() Automatically generate any number of lags, for any number of columns, by any number of By-Variables, using datatable or polars (chosen with the Processing argument).
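
To make the idea of a grouped lag concrete, here is a minimal sketch in plain Python rather than datatable or polars. It is not RetroFit's implementation: the function name add_group_lag, the Lag_1_Leads column name, and the sample rows are all illustrative assumptions. It only shows the logic the description refers to: sort within each By-Variable group by date, shift the column by the lag period, and fill the rows that have no history with the impute value.

```python
from collections import defaultdict


def add_group_lag(rows, value_key, date_key, by_keys, period=1, impute=-1):
    """Return copies of rows with a lag of value_key computed inside each group."""
    # Bucket rows by their By-Variable values (a single bucket if by_keys is empty).
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[key] for key in by_keys)].append(row)

    lag_name = f"Lag_{period}_{value_key}"  # hypothetical naming scheme
    out = []
    for group_rows in groups.values():
        # Sort by date so "previous row" means "previous period".
        ordered = sorted(group_rows, key=lambda r: r[date_key])
        for i, row in enumerate(ordered):
            # The first `period` rows of a group have no history, so impute.
            lagged = ordered[i - period][value_key] if i >= period else impute
            out.append({**row, lag_name: lagged})
    return out


# Made-up sample data; ISO date strings sort chronologically.
rows = [
    {"Segment": "A", "Date": "2021-01-01", "Leads": 10},
    {"Segment": "A", "Date": "2021-01-02", "Leads": 12},
    {"Segment": "B", "Date": "2021-01-01", "Leads": 7},
    {"Segment": "B", "Date": "2021-01-02", "Leads": 9},
]
lagged = add_group_lag(rows, "Leads", "Date", ["Segment"])
for row in lagged:
    print(row)
```

Because the lag is taken inside each group, the first row of segment B receives the impute value instead of the last value from segment A.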

Code Example

# QA: Test FE0_AutoLags
import pkg_resources
import timeit
import datatable as dt
import polars as pl
import retrofit
from retrofit import FeatureEngineering as fe

## No Group Example: datatable
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/BenchmarkData.csv')
data = dt.fread(FilePath)
t_start = timeit.default_timer()
Output = fe.FE0_AutoLags(
  data=data, 
  ArgsList=None, 
  LagPeriods=1, 
  LagColumnNames='Leads', 
  DateColumnName='CalendarDateColumn', 
  ByVariables=None, 
  ImputeValue=-1, 
  Sort=True, 
  Processing='datatable', 
  InputFrame='datatable', 
  OutputFrame='datatable')
t_end = timeit.default_timer()
print(t_end - t_start)
data1 = Output['data']
ArgsList = Output['ArgsList']
del Output
print(data1.names)
print(ArgsList)

## No Group Example: polars
data = pl.read_csv(FilePath)
t_start = timeit.default_timer()
Output = fe.FE0_AutoLags(
  data=data, 
  ArgsList=None, 
  LagPeriods=1, 
  LagColumnNames='Leads', 
  DateColumnName='CalendarDateColumn', 
  ByVariables=None, 
  ImputeValue=-1.0, 
  Sort=True, 
  Processing='polars', 
  InputFrame='polars', 
  OutputFrame='polars')
t_end = timeit.default_timer()
print(t_end - t_start)
data2 = Output['data']
ArgsList = Output['ArgsList']
del Output
print(data2.columns)
print(ArgsList)

## Group Example, Single Lag: datatable
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/BenchmarkData.csv') 
data = dt.fread(FilePath)
t_start = timeit.default_timer()
Output = fe.FE0_AutoLags(
  data=data, 
  ArgsList=None, 
  LagPeriods=1, 
  LagColumnNames='Leads', 
  DateColumnName='CalendarDateColumn', 
  ByVariables=['MarketingSegments','MarketingSegments2','MarketingSegments3', 'Label'], 
  ImputeValue=-1, 
  Sort=True, 
  Processing='datatable',
  InputFrame='datatable',
  OutputFrame='datatable')
t_end = timeit.default_timer()
print(t_end - t_start)
data1 = Output['data']
ArgsList = Output['ArgsList']
del Output
print(data1.names)
print(ArgsList)

## Group Example: polars
data = pl.read_csv(FilePath)
t_start = timeit.default_timer()
Output = fe.FE0_AutoLags(
  data=data, 
  ArgsList=None, 
  LagPeriods=1, 
  LagColumnNames='Leads', 
  DateColumnName='CalendarDateColumn', 
  ByVariables=['MarketingSegments','MarketingSegments2','MarketingSegments3', 'Label'], 
  ImputeValue=-1.0, 
  Sort=True, 
  Processing='polars', 
  InputFrame='polars', 
  OutputFrame='polars')
t_end = timeit.default_timer()
print(t_end - t_start)
data2 = Output['data']
ArgsList = Output['ArgsList']
del Output
print(data2.columns)
print(ArgsList)

## Group and Multiple Periods and LagColumnNames: datatable
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/BenchmarkData.csv') 
data = dt.fread(FilePath)
t_start = timeit.default_timer()
Output = fe.FE0_AutoLags(
  data=data, 
  ArgsList=None, 
  LagPeriods=[1,3,5], 
  LagColumnNames=['Leads','XREGS1'], 
  DateColumnName='CalendarDateColumn', 
  ByVariables=['MarketingSegments','MarketingSegments2','MarketingSegments3', 'Label'], 
  ImputeValue=-1, 
  Sort=True, 
  Processing='datatable', 
  InputFrame='datatable', 
  OutputFrame='datatable')
t_end = timeit.default_timer()
print(t_end - t_start)
data1 = Output['data']
ArgsList = Output['ArgsList']
del Output
print(data1.names)
print(ArgsList)

## Group and Multiple Periods and LagColumnNames: polars
data = pl.read_csv(FilePath)
t_start = timeit.default_timer()
Output = fe.FE0_AutoLags(
  data=data, 
  ArgsList=None, 
  LagPeriods=[1,3,5],
  LagColumnNames=['Leads','XREGS1'], 
  DateColumnName='CalendarDateColumn', 
  ByVariables=['MarketingSegments','MarketingSegments2','MarketingSegments3', 'Label'], 
  ImputeValue=-1.0, 
  Sort=True, 
  Processing='polars', 
  InputFrame='polars', 
  OutputFrame='polars')
t_end = timeit.default_timer()
print(t_end - t_start)
data2 = Output['data']
ArgsList = Output['ArgsList']
del Output
print(data2.columns)
print(ArgsList)

FE0_AutoRollStats()

Function Description

FE0_AutoRollStats() Automatically generate any number of moving averages, moving standard deviations, moving minimums, and moving maximums from any number of source columns, by any number of By-Variables, using datatable.
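
As a rough illustration of the four rolling statistics, the sketch below computes them over one already-sorted series using only the standard library. It is not how RetroFit computes them: the helper name rolling_stats is hypothetical, and two choices here are assumptions, namely that the window includes the current row and that incomplete windows are filled with the impute value.

```python
from statistics import mean, stdev


def rolling_stats(values, window, impute=-1):
    """Rolling mean, standard deviation, min and max over a trailing window.

    Assumes window >= 2, because a sample standard deviation needs two points.
    """
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            # Not enough history for a full window yet.
            out.append({"mean": impute, "sd": impute, "min": impute, "max": impute})
            continue
        chunk = values[i + 1 - window : i + 1]  # trailing window ending at row i
        out.append(
            {"mean": mean(chunk), "sd": stdev(chunk), "min": min(chunk), "max": max(chunk)}
        )
    return out


stats = rolling_stats([2, 4, 6, 8], window=3)
for row in stats:
    print(row)
```

For grouped data the same function would be applied to each By-Variable group separately, after sorting that group by date.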

Code Example

# Test Function
import pkg_resources
import timeit
import datatable as dt
import retrofit
from retrofit import FeatureEngineering as fe

## Group Example:
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/BenchmarkData.csv')
data = dt.fread(FilePath)
t_start = timeit.default_timer()
data = fe.FE0_AutoRollStats(
  data=data, 
  RollColumnNames='Leads', 
  DateColumnName='CalendarDateColumn', 
  ByVariables=['MarketingSegments', 'MarketingSegments2', 'MarketingSegments3', 'Label'], 
  MovingAvg_Periods=[3,5,7], 
  MovingSD_Periods=[3,5,7], 
  MovingMin_Periods=[3,5,7], 
  MovingMax_Periods=[3,5,7], 
  ImputeValue=-1, 
  Sort=True, 
  Processing='datatable', 
  InputFrame='datatable', 
  OutputFrame='datatable')
t_end = timeit.default_timer()
print(t_end - t_start)
print(data.names)
    
## Group and Multiple Periods and RollColumnNames:
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/BenchmarkData.csv') 
data = dt.fread(FilePath)
t_start = timeit.default_timer()
data = fe.FE0_AutoRollStats(
  data=data, 
  RollColumnNames=['Leads','XREGS1'], 
  DateColumnName='CalendarDateColumn', 
  ByVariables=['MarketingSegments', 'MarketingSegments2', 'MarketingSegments3', 'Label'], 
  MovingAvg_Periods=[3,5,7], 
  MovingSD_Periods=[3,5,7], 
  MovingMin_Periods=[3,5,7], 
  MovingMax_Periods=[3,5,7], 
  ImputeValue=-1, 
  Sort=True, 
  Processing='datatable', 
  InputFrame='datatable', 
  OutputFrame='datatable')
t_end = timeit.default_timer()
print(t_end - t_start)
print(data.names)

## No Group Example:
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/BenchmarkData.csv') 
data = dt.fread(FilePath)
t_start = timeit.default_timer()
data = fe.FE0_AutoRollStats(
  data=data, 
  RollColumnNames='Leads', 
  DateColumnName='CalendarDateColumn', 
  ByVariables=None, 
  MovingAvg_Periods=[3,5,7], 
  MovingSD_Periods=[3,5,7], 
  MovingMin_Periods=[3,5,7], 
  MovingMax_Periods=[3,5,7], 
  ImputeValue=-1, 
  Sort=True, 
  Processing='datatable', 
  InputFrame='datatable', 
  OutputFrame='datatable')
t_end = timeit.default_timer()
print(t_end - t_start)
print(data.names)

FE0_AutoDiff()

Function Description

FE0_AutoDiff() Automatically generate any number of differences from any number of source columns, for numeric, character, and date columns, by any number of By-Variables, using datatable.
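
The sketch below shows one way to read a lagged difference, in plain Python. It assumes that NLag1 and NLag2 are the two lag offsets being subtracted (so NLag1=0, NLag2=1 gives the current value minus the previous one); that is an interpretation of the parameter names, not something the description confirms. The helper name lag_diff and the sample values are hypothetical.

```python
from datetime import date


def lag_diff(values, n_lag1=0, n_lag2=1, impute=None):
    """Return values[t - n_lag1] - values[t - n_lag2] for each position t."""
    out = []
    for t in range(len(values)):
        if t - n_lag1 < 0 or t - n_lag2 < 0:
            # One of the two lags reaches before the start of the series.
            out.append(impute)
        else:
            out.append(values[t - n_lag1] - values[t - n_lag2])
    return out


# Numeric column: plain subtraction.
leads = [10, 12, 9]
lead_diffs = lag_diff(leads)

# Date column: subtracting dates gives timedeltas, reported here in days.
dates = [date(2021, 1, 1), date(2021, 1, 4), date(2021, 1, 5)]
date_diffs = [d.days if d is not None else None for d in lag_diff(dates)]

print(lead_diffs)
print(date_diffs)
```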

Code Example

# Test Function
import pkg_resources
import timeit
import datatable as dt
import retrofit
from retrofit import FeatureEngineering as fe

## Group Example:
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/BenchmarkData.csv')
data = dt.fread(FilePath)
t_start = timeit.default_timer()
data = fe.FE0_AutoDiff(
  data=data, 
  DateColumnName = 'CalendarDateColumn', 
  ByVariables = ['MarketingSegments', 'MarketingSegments2', 'MarketingSegments3', 'Label'], 
  DiffNumericVariables = 'Leads', 
  DiffDateVariables = 'CalendarDateColumn', 
  DiffGroupVariables = None, 
  NLag1 = 0, 
  NLag2 = 1, 
  Sort=True, 
  Processing='datatable',
  InputFrame = 'datatable', 
  OutputFrame = 'datatable')
t_end = timeit.default_timer()
print(t_end - t_start)
print(data.names)
    
## Group Example (same settings as above):
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/BenchmarkData.csv')
data = dt.fread(FilePath)
t_start = timeit.default_timer()
data = fe.FE0_AutoDiff(
  data=data, 
  DateColumnName = 'CalendarDateColumn',
  ByVariables = ['MarketingSegments', 'MarketingSegments2', 'MarketingSegments3', 'Label'], 
  DiffNumericVariables = 'Leads', 
  DiffDateVariables = 'CalendarDateColumn', 
  DiffGroupVariables = None, 
  NLag1 = 0, 
  NLag2 = 1, 
  Sort=True, 
  Processing = 'datatable',
  InputFrame = 'datatable',
  OutputFrame = 'datatable')
t_end = timeit.default_timer()
print(t_end - t_start)
print(data.names)

## No Group Example:
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/BenchmarkData.csv') 
data = dt.fread(FilePath)
t_start = timeit.default_timer()
data = fe.FE0_AutoDiff(
  data=data, 
  DateColumnName = 'CalendarDateColumn', 
  ByVariables = None, 
  DiffNumericVariables = 'Leads', 
  DiffDateVariables = 'CalendarDateColumn', 
  DiffGroupVariables = None, 
  NLag1 = 0, 
  NLag2 = 1, 
  Sort=True, 
  Processing = 'datatable',
  InputFrame = 'datatable', 
  OutputFrame = 'datatable')
t_end = timeit.default_timer()
print(t_end - t_start)
print(data.names)

FE1 Feature Engineering: Row-Independence

FE1_AutoCalendarVariables()

Function Description

FE1_AutoCalendarVariables() Automatically generate calendar variables from the date columns in your datatable.
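
For a sense of what the calendar variables named in the example ('wday', 'mday', 'wom', 'month', 'quarter', 'year') could contain, here is a small standard-library sketch. The helper name calendar_features is hypothetical, and two definitions are assumptions that may differ from RetroFit's: weekdays are numbered 1 (Monday) to 7 (Sunday), and week of month is counted in blocks of seven days from the first of the month.

```python
from datetime import date


def calendar_features(d):
    """Derive simple calendar variables from a single date."""
    return {
        "wday": d.isoweekday(),             # 1 = Monday ... 7 = Sunday (assumed convention)
        "mday": d.day,                      # day of month
        "wom": (d.day - 1) // 7 + 1,        # week of month, in 7-day blocks (assumed definition)
        "month": d.month,
        "quarter": (d.month - 1) // 3 + 1,
        "year": d.year,
    }


features = calendar_features(date(2021, 8, 17))
print(features)
```

These features depend only on the row's own date, which is why the function sits in the row-independent FE1 group.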

Code Example

# Test Function
import pkg_resources
import timeit
import datatable as dt
import retrofit
from retrofit import FeatureEngineering as fe
 
# Data can be created using the R package RemixAutoML and function FakeDataGenerator
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/BenchmarkData.csv') 
data = dt.fread(FilePath)
t_start = timeit.default_timer()
data = fe.FE1_AutoCalendarVariables(
  data=data, 
  ArgsList=None, 
  DateColumnNames = 'CalendarDateColumn', 
  CalendarVariables = ['wday','mday','wom','month','quarter','year'], 
  Processing = 'datatable', 
  InputFrame = 'datatable', 
  OutputFrame = 'datatable')
t_end = timeit.default_timer()
print(t_end - t_start)
print(data.names)

FE1_DummyVariables()

Function Description

FE1_DummyVariables() Automatically generate dummy variables for user-supplied categorical columns.
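
The following plain-Python sketch shows what dummy (one-hot) encoding of a categorical column produces. It is an illustration only: the helper name one_hot, the Column_Level naming scheme, and the sample rows are assumptions rather than RetroFit's behavior.

```python
def one_hot(rows, column):
    """Add one 0/1 indicator column per distinct level of `column`."""
    # Fix the level order up front so every row gets the same new columns.
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = dict(row)
        for level in levels:
            new_row[f"{column}_{level}"] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded, levels


rows = [{"Segment": "A"}, {"Segment": "B"}, {"Segment": "A"}]
encoded, levels = one_hot(rows, "Segment")
for row in encoded:
    print(row)
```

Returning the levels alongside the data matters for scoring: new data must be encoded with the same set of indicator columns that the model saw during training.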

Code Example

# Example: datatable
import pkg_resources
import timeit
import datatable as dt
import retrofit
from retrofit import FeatureEngineering as fe
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/BenchmarkData.csv') 
data = dt.fread(FilePath)
t_start = timeit.default_timer()
Output = fe.FE1_DummyVariables(
  data=data, 
  ArgsList=None, 
  CategoricalColumnNames=['MarketingSegments','MarketingSegments2'], 
  Processing='datatable', 
  InputFrame='datatable', 
  OutputFrame='datatable')
t_end = timeit.default_timer()
print(t_end - t_start)
data = Output['data']
ArgsList = Output['ArgsList']


# Example: polars
import retrofit
from retrofit import FeatureEngineering as fe
import polars as pl
data = pl.read_csv(FilePath)
t_start = timeit.default_timer()
Output = fe.FE1_DummyVariables(
  data=data, 
  ArgsList=None, 
  CategoricalColumnNames=['MarketingSegments','MarketingSegments2'], 
  Processing='polars', 
  InputFrame='polars', 
  OutputFrame='polars')
t_end = timeit.default_timer()
print(t_end - t_start)
data = Output['data']
ArgsList = Output['ArgsList']

FE1_ColTypeConversions()

Function Description

FE1_ColTypeConversions() Automatically convert columns to the types required by certain models.

Code Example

# Setup Environment
import pkg_resources
import timeit
import datatable as dt
import retrofit
from retrofit import DatatableFE as dtfe
from retrofit import MachineLearning as ml

# Load some data
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/RegressionData.csv') 
data = dt.fread(FilePath)

# Instantiate Feature Engineering Class
FE = dtfe.FE()

# Create some lags
data = FE.FE0_AutoLags(
    data,
    LagColumnNames=['Independent_Variable1', 'Independent_Variable2'],
    DateColumnName='DateTime',
    ByVariables='Factor_1',
    LagPeriods=[1,2],
    ImputeValue=-1,
    Sort=True,
    use_saved_args=False)

# Create some rolling stats
data = FE.FE0_AutoRollStats(
    data,
    RollColumnNames=['Independent_Variable1','Independent_Variable2'],
    DateColumnName='DateTime',
    ByVariables='Factor_1',
    MovingAvg_Periods=[1,2],
    MovingSD_Periods=[2,3],
    MovingMin_Periods=[1,2],
    MovingMax_Periods=[1,2],
    ImputeValue=-1,
    Sort=True,
    use_saved_args=False)

# Create some diffs
data = FE.FE0_AutoDiff(
    data,
    DateColumnName='DateTime',
    ByVariables=['Factor_1','Factor_2','Factor_3'],
    DiffNumericVariables='Independent_Variable1',
    DiffDateVariables=None,
    DiffGroupVariables=None,
    NLag1=0,
    NLag2=1,
    Sort=True,
    use_saved_args=False)

# Create Calendar Vars
data = FE.FE1_AutoCalendarVariables(
    data,
    DateColumnNames='DateTime',
    CalendarVariables=['wday','month','quarter'],
    use_saved_args=False)

# Type conversions for modeling
data = FE.FE1_ColTypeConversions(
    data,
    Int2Float=True,
    Bool2Float=True,
    RemoveDateCols=False,
    RemoveStrCols=False,
    SkipCols=None,
    use_saved_args=False)

FE2 Feature Engineering: Full-Data-Set

FE2_AutoDataParition()

Function Description

FE2_AutoDataParition() Automatically create training, validation, and test data sets based on random or time-based splits.
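
The difference between a random and a time-based split can be sketched in a few lines of plain Python. This is not RetroFit's code: the function name partition, its seed argument, and the rounding of partition sizes are assumptions. The output keys simply mirror the TrainData, ValidationData, and TestData names used in the examples.

```python
import random


def partition(rows, ratios, partition_type="random", date_key=None, seed=42):
    """Split rows into train/validation/test lists according to ratios."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    if partition_type == "time":
        # Oldest rows go to training, newest rows to testing.
        ordered = sorted(rows, key=lambda r: r[date_key])
    elif partition_type == "random":
        ordered = list(rows)
        random.Random(seed).shuffle(ordered)  # seeded for reproducibility
    else:
        raise ValueError("partition_type must be 'random' or 'time'")

    n = len(ordered)
    train_end = round(n * ratios[0])
    valid_end = min(n, train_end + round(n * ratios[1]))
    return {
        "TrainData": ordered[:train_end],
        "ValidationData": ordered[train_end:valid_end],
        "TestData": ordered[valid_end:],
    }


# Ten made-up daily rows.
rows = [{"Date": f"2021-01-{day:02d}", "Leads": day} for day in range(1, 11)]
time_split = partition(rows, [0.7, 0.2, 0.1], partition_type="time", date_key="Date")
random_split = partition(rows, [0.7, 0.2, 0.1])
print({name: len(part) for name, part in time_split.items()})
```

A time-based split keeps every validation and test row later than the training rows, which avoids leaking future information when lag or rolling features are involved.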

Code Example

# FE2_AutoDataParition
import pkg_resources
import timeit
import datatable as dt
import polars as pl
import retrofit
from retrofit import FeatureEngineering as fe
from retrofit import utils as u

# datatable random Example
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/BenchmarkData.csv') 
data = dt.fread(FilePath)
t_start = timeit.default_timer()
DataSets = fe.FE2_AutoDataParition(
  data=data, 
  ArgsList=None, 
  DateColumnName='CalendarDateColumn', 
  PartitionType='random', 
  Ratios=[0.70,0.20,0.10], 
  Sort = False,
  ByVariables=None, 
  Processing='datatable', 
  InputFrame='datatable', 
  OutputFrame='datatable')
t_end = timeit.default_timer()
print(t_end - t_start)
TrainData = DataSets['TrainData']
ValidationData = DataSets['ValidationData']
TestData = DataSets['TestData']
ArgsList = DataSets['ArgsList']

# polars random Example
data = pl.read_csv(FilePath)
t_start = timeit.default_timer()
DataSets = fe.FE2_AutoDataParition(
  data=data, 
  ArgsList=None, 
  DateColumnName='CalendarDateColumn', 
  PartitionType='random', 
  Ratios=[0.70,0.20,0.10], 
  ByVariables=None, 
  Sort = False,
  Processing='polars', 
  InputFrame='polars', 
  OutputFrame='polars')
t_end = timeit.default_timer()
print(t_end - t_start)
TrainData = DataSets['TrainData']
ValidationData = DataSets['ValidationData']
TestData = DataSets['TestData']
ArgsList = DataSets['ArgsList']

# datatable time Example
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/BenchmarkData.csv')
data = dt.fread(FilePath)
t_start = timeit.default_timer()
DataSets = fe.FE2_AutoDataParition(
  data=data, 
  ArgsList=None, 
  DateColumnName='CalendarDateColumn', 
  PartitionType='time', 
  Ratios=[0.70,0.20,0.10], 
  Sort = True,
  ByVariables=None, 
  Processing='datatable', 
  InputFrame='datatable', 
  OutputFrame='datatable')
t_end = timeit.default_timer()
print(t_end - t_start)
TrainData = DataSets['TrainData']
ValidationData = DataSets['ValidationData']
TestData = DataSets['TestData']
ArgsList = DataSets['ArgsList']

# polars time Example
data = pl.read_csv(FilePath)
t_start = timeit.default_timer()
DataSets = fe.FE2_AutoDataParition(
  data=data, 
  ArgsList=None, 
  DateColumnName='CalendarDateColumn', 
  PartitionType='time', 
  Ratios=[0.70,0.20,0.10], 
  ByVariables=None, 
  Sort = True,
  Processing='polars', 
  InputFrame='polars', 
  OutputFrame='polars')
t_end = timeit.default_timer()
print(t_end - t_start)
TrainData = DataSets['TrainData']
ValidationData = DataSets['ValidationData']
TestData = DataSets['TestData']
ArgsList = DataSets['ArgsList']

FE3 Feature Engineering: Model-Based

Coming soon

Machine Learning Training

ML0 Machine Learning: Prepare for Modeling

ML0_Parameters()

Function Description

ML0_Parameters() Automatically generate parameters for modeling. User can update the parameters as desired.

Code Example

# Setup Environment
import pkg_resources
import timeit
import datatable as dt
from datatable import sort, f, by
import retrofit
from retrofit import FeatureEngineering as fe
from retrofit import MachineLearning as ml

# Load some data
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/BenchmarkData.csv') 
data = dt.fread(FilePath)

# Create partitioned data sets
Data = fe.FE2_AutoDataParition(
  data=data, 
  ArgsList=None, 
  DateColumnName=None, 
  PartitionType='random', 
  Ratios=[0.7,0.2,0.1], 
  ByVariables=None, 
  Sort=False, 
  Processing='datatable', 
  InputFrame='datatable', 
  OutputFrame='datatable')

# Prepare modeling data sets
DataSets = ml.ML0_GetModelData(
  Processing='catboost',
  TrainData=Data['TrainData'],
  ValidationData=Data['ValidationData'],
  TestData=Data['TestData'],
  ArgsList=None,
  TargetColumnName='Leads',
  NumericColumnNames=['XREGS1','XREGS2','XREGS3'],
  CategoricalColumnNames=['MarketingSegments','MarketingSegments2','MarketingSegments3','Label'],
  TextColumnNames=None,
  WeightColumnName=None,
  Threads=-1,
  InputFrame='datatable')

# Get args list for algorithm and target type
ModelArgs = ml.ML0_Parameters(
  Algorithms='CatBoost', 
  TargetType="Regression", 
  TrainMethod="Train")

ML0_GetModelData()

Function Description

ML0_GetModelData() Automatically create data sets for the chosen ML algorithm. Currently supports catboost, xgboost, and lightgbm.

Code Example

# ML0_GetModelData Example:
import pkg_resources
import datatable as dt
from datatable import sort, f, by
import retrofit
from retrofit import FeatureEngineering as fe
from retrofit import MachineLearning as ml

############################################################################################
# CatBoost
############################################################################################

# Load some data
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/BenchmarkData.csv') 
data = dt.fread(FilePath)
    
# Create partitioned data sets
DataSets = fe.FE2_AutoDataParition(
  data=data, 
  ArgsList=None, 
  DateColumnName='CalendarDateColumn', 
  PartitionType='random', 
  Ratios=[0.70,0.20,0.10], 
  ByVariables=None, 
  Processing='datatable', 
  InputFrame='datatable', 
  OutputFrame='datatable')

# Collect partitioned data
TrainData = DataSets['TrainData']
ValidationData = DataSets['ValidationData']
TestData = DataSets['TestData']
del DataSets

# Create catboost data sets
DataSets = ml.ML0_GetModelData(
  TrainData=TrainData, 
  ValidationData=ValidationData, 
  TestData=TestData, 
  ArgsList=None, 
  TargetColumnName='Leads', 
  NumericColumnNames=['XREGS1', 'XREGS2', 'XREGS3'], 
  CategoricalColumnNames=['MarketingSegments','MarketingSegments2','MarketingSegments3','Label'], 
  TextColumnNames=None, 
  WeightColumnName=None, 
  Threads=-1, 
  Processing='catboost', 
  InputFrame='datatable')
  
# Collect catboost training data
catboost_train = DataSets['train_data']
catboost_validation = DataSets['validation_data']
catboost_test = DataSets['test_data']

############################################################################################
# XGBoost
############################################################################################

# Load some data
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/BenchmarkData.csv') 
data = dt.fread(FilePath)
    
# Create partitioned data sets
DataSets = fe.FE2_AutoDataParition(
  data=data, 
  ArgsList=None, 
  DateColumnName='CalendarDateColumn', 
  PartitionType='random', 
  Ratios=[0.70,0.20,0.10], 
  ByVariables=None, 
  Processing='datatable', 
  InputFrame='datatable', 
  OutputFrame='datatable')

# Collect partitioned data
TrainData = DataSets['TrainData']
ValidationData = DataSets['ValidationData']
TestData = DataSets['TestData']
del DataSets

# Create xgboost data sets
DataSets = ml.ML0_GetModelData(
  TrainData=TrainData, 
  ValidationData=ValidationData, 
  TestData=TestData, 
  ArgsList=None, 
  TargetColumnName='Leads', 
  NumericColumnNames=['XREGS1', 'XREGS2', 'XREGS3'], 
  CategoricalColumnNames=['MarketingSegments','MarketingSegments2','MarketingSegments3','Label'], 
  TextColumnNames=None, 
  WeightColumnName=None, 
  Threads=-1, 
  Processing='xgboost', 
  InputFrame='datatable')
  
# Collect xgboost training data
xgboost_train = DataSets['train_data']
xgboost_validation = DataSets['validation_data']
xgboost_test = DataSets['test_data']

############################################################################################
# LightGBM
############################################################################################

# Load some data
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/BenchmarkData.csv') 
data = dt.fread(FilePath)
    
# Create partitioned data sets
DataSets = fe.FE2_AutoDataParition(
  data=data, 
  ArgsList=None, 
  DateColumnName='CalendarDateColumn', 
  PartitionType='random', 
  Ratios=[0.70,0.20,0.10], 
  ByVariables=None, 
  Processing='datatable', 
  InputFrame='datatable', 
  OutputFrame='datatable')

# Collect partitioned data
TrainData = DataSets['TrainData']
ValidationData = DataSets['ValidationData']
TestData = DataSets['TestData']
del DataSets

# Create lightgbm data sets
DataSets = ml.ML0_GetModelData(
  TrainData=TrainData, 
  ValidationData=ValidationData, 
  TestData=TestData, 
  ArgsList=None, 
  TargetColumnName='Leads', 
  NumericColumnNames=['XREGS1', 'XREGS2', 'XREGS3'], 
  CategoricalColumnNames=['MarketingSegments','MarketingSegments2','MarketingSegments3','Label'], 
  TextColumnNames=None, 
  WeightColumnName=None, 
  Threads=-1, 
  Processing='lightgbm', 
  InputFrame='datatable')
  
# Collect lightgbm training data
lightgbm_train = DataSets['train_data']
lightgbm_validation = DataSets['validation_data']
lightgbm_test = DataSets['test_data']

ML1 Machine Learning: RetroFit Class

Class Meta Information

Class Goals

####################################
# Goals
####################################

Class Initialization
Model Initialization
Training
Feature Tuning
Grid Tuning
Model Scoring
Model Evaluation
Model Interpretation

Class Functions

####################################
# Functions
####################################

ML1_Single_Train()
ML1_Single_Score()
PrintAlgoArgs()

Class Attributes

####################################
# Attributes
####################################

self.ModelArgs = ModelArgs
self.ModelArgsNames = [*self.ModelArgs]
self.Runs = len(self.ModelArgs)
self.DataSets = DataSets
self.DataSetsNames = [*self.DataSets]
self.ModelList = dict()
self.ModelListNames = []
self.FitList = dict()
self.FitListNames = []
self.EvaluationList = dict()
self.EvaluationListNames = []
self.InterpretationList = dict()
self.InterpretationListNames = []
self.CompareModelsList = dict()
self.CompareModelsListNames = []
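
The attribute list pairs each dictionary with a list of its names. The sketch below shows that bookkeeping pattern with a hypothetical stand-in class; ModelRegistry, register_model, and the placeholder values are not part of RetroFit. The [*some_dict] expression is standard Python unpacking that yields the dictionary's keys as a list, in insertion order.

```python
class ModelRegistry:
    """Hypothetical stand-in that mimics the attribute pattern, not the RetroFit class."""

    def __init__(self, model_args, data_sets):
        self.ModelArgs = model_args
        self.ModelArgsNames = [*self.ModelArgs]   # list of the dict's keys
        self.Runs = len(self.ModelArgs)
        self.DataSets = data_sets
        self.DataSetsNames = [*self.DataSets]
        self.ModelList = dict()
        self.ModelListNames = []

    def register_model(self, name, model):
        # Keep the dict and its name list in step so names can be indexed by position.
        self.ModelList[name] = model
        self.ModelListNames.append(name)


registry = ModelRegistry(
    model_args={"Ftrl": {"placeholder_setting": 1}},  # made-up contents
    data_sets={"train_data": [], "validation_data": [], "test_data": []},
)
registry.register_model("Ftrl1", object())

print(registry.ModelArgsNames)
print(registry.DataSetsNames[2])
print(registry.ModelListNames[0])
```

Keeping a name list next to each dictionary is what allows positional lookups such as DataSetsNames[2] and ModelListNames[0] in the scoring calls.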

Ftrl Examples

Regression

####################################
# Ftrl Regression
####################################

# Setup Environment
import pkg_resources
import timeit
import datatable as dt
import retrofit
from retrofit import DatatableFE as dtfe
from retrofit import MachineLearning as ml

# Load some data
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/RegressionData.csv') 
data = dt.fread(FilePath)

# Instantiate Feature Engineering Class
FE = dtfe.FE()

# Create some lags
data = FE.FE0_AutoLags(
    data,
    LagColumnNames=['Independent_Variable1', 'Independent_Variable2'],
    DateColumnName='DateTime',
    ByVariables='Factor_1',
    LagPeriods=[1,2],
    ImputeValue=-1,
    Sort=True,
    use_saved_args=False)

# Create some rolling stats
data = FE.FE0_AutoRollStats(
    data,
    RollColumnNames=['Independent_Variable1','Independent_Variable2'],
    DateColumnName='DateTime',
    ByVariables='Factor_1',
    MovingAvg_Periods=[1,2],
    MovingSD_Periods=[2,3],
    MovingMin_Periods=[1,2],
    MovingMax_Periods=[1,2],
    ImputeValue=-1,
    Sort=True,
    use_saved_args=False)

# Create some diffs
data = FE.FE0_AutoDiff(
    data,
    DateColumnName='DateTime',
    ByVariables=['Factor_1','Factor_2','Factor_3'],
    DiffNumericVariables='Independent_Variable1',
    DiffDateVariables=None,
    DiffGroupVariables=None,
    NLag1=0,
    NLag2=1,
    Sort=True,
    use_saved_args=False)

# Create Calendar Vars
data = FE.FE1_AutoCalendarVariables(
    data,
    DateColumnNames='DateTime',
    CalendarVariables=['wday','month','quarter'],
    use_saved_args=False)

# Type conversions for modeling
data = FE.FE1_ColTypeConversions(
    data,
    Int2Float=True,
    Bool2Float=True,
    RemoveDateCols=True,
    RemoveStrCols=False,
    SkipCols=None,
    use_saved_args=False)

# Drop Text Cols (no word2vec yet)
data = data[:, [z for z in data.names if z not in ['Comment']]]

# Create partitioned data sets
DataFrames = FE.FE2_AutoDataPartition(
  data, 
  DateColumnName = None, 
  PartitionType = 'random', 
  Ratios = [0.7,0.2,0.1], 
  ByVariables = None, 
  Sort = False,
  use_saved_args = False)

# Prepare modeling data sets
ModelData = ml.ML0_GetModelData(
  Processing = 'Ftrl',
  TrainData = DataFrames['TrainData'],
  ValidationData = DataFrames['ValidationData'],
  TestData = DataFrames['TestData'],
  ArgsList = None,
  TargetColumnName = 'Adrian',
  NumericColumnNames = [z for z in list(data.names) if z not in ['Factor_1','Factor_2','Factor_3','Adrian']],
  CategoricalColumnNames = ['Factor_1', 'Factor_2', 'Factor_3'],
  TextColumnNames = None,
  WeightColumnName = None,
  Threads = -1,
  InputFrame = 'datatable')

# Get args list for algorithm and target type
ModelArgs = ml.ML0_Parameters(
  Algorithms = 'Ftrl', 
  TargetType = "Regression", 
  TrainMethod = "Train")

# Initialize RetroFit
x = ml.RetroFit(ModelArgs, ModelData, DataFrames)

# Train Model
x.ML1_Single_Train(Algorithm = 'Ftrl')

# Score data
x.ML1_Single_Score(
  DataName = x.DataSetsNames[2], 
  ModelName = x.ModelListNames[0], 
  Algorithm = 'Ftrl', 
  NewData = None)

# Scoring data names
x.DataSetsNames

# Scoring data
x.DataSets.get('Scored_test_data_Ftrl_1')

# Check ModelArgs Dict
x.PrintAlgoArgs(Algo='Ftrl')

# List of model names
x.ModelListNames

# List of model fitted names
x.FitListNames

Classification

####################################
# Ftrl Classification
####################################

# Setup Environment
import pkg_resources
import timeit
import datatable as dt
import retrofit
from retrofit import DatatableFE as dtfe
from retrofit import MachineLearning as ml

# Load some data
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/ClassificationData.csv') 
data = dt.fread(FilePath)

# Instantiate Feature Engineering Class
FE = dtfe.FE()

# Create some lags
data = FE.FE0_AutoLags(
    data,
    LagColumnNames=['Independent_Variable1', 'Independent_Variable2'],
    DateColumnName='DateTime',
    ByVariables='Factor_1',
    LagPeriods=[1,2],
    ImputeValue=-1,
    Sort=True,
    use_saved_args=False)

# Create some rolling stats
data = FE.FE0_AutoRollStats(
    data,
    RollColumnNames=['Independent_Variable1','Independent_Variable2'],
    DateColumnName='DateTime',
    ByVariables='Factor_1',
    MovingAvg_Periods=[1,2],
    MovingSD_Periods=[2,3],
    MovingMin_Periods=[1,2],
    MovingMax_Periods=[1,2],
    ImputeValue=-1,
    Sort=True,
    use_saved_args=False)

# Create some diffs
data = FE.FE0_AutoDiff(
    data,
    DateColumnName='DateTime',
    ByVariables=['Factor_1','Factor_2','Factor_3'],
    DiffNumericVariables='Independent_Variable1',
    DiffDateVariables=None,
    DiffGroupVariables=None,
    NLag1=0,
    NLag2=1,
    Sort=True,
    use_saved_args=False)

# Create Calendar Vars
data = FE.FE1_AutoCalendarVariables(
    data,
    DateColumnNames='DateTime',
    CalendarVariables=['wday','month','quarter'],
    use_saved_args=False)

# Type conversions for modeling
data = FE.FE1_ColTypeConversions(
    data,
    Int2Float=True,
    Bool2Float=True,
    RemoveDateCols=True,
    RemoveStrCols=False,
    SkipCols=None,
    use_saved_args=False)

# Drop Text Cols (no word2vec yet)
data = data[:, [z for z in data.names if z not in ['Comment']]]

# Create partitioned data sets
DataFrames = FE.FE2_AutoDataPartition(
  data, 
  DateColumnName = None, 
  PartitionType = 'random', 
  Ratios = [0.7,0.2,0.1], 
  ByVariables = None, 
  Sort = False,
  use_saved_args = False)

# Prepare modeling data sets
ModelData = ml.ML0_GetModelData(
  Processing = 'Ftrl',
  TrainData = DataFrames['TrainData'],
  ValidationData = DataFrames['ValidationData'],
  TestData = DataFrames['TestData'],
  ArgsList = None,
  TargetColumnName = 'Adrian',
  NumericColumnNames = [z for z in list(data.names) if z not in ['Factor_1','Factor_2','Factor_3','Adrian']],
  CategoricalColumnNames = ['Factor_1', 'Factor_2', 'Factor_3'],
  TextColumnNames = None,
  WeightColumnName = None,
  Threads = -1,
  InputFrame = 'datatable')

# Get args list for algorithm and target type
ModelArgs = ml.ML0_Parameters(
  Algorithms = 'Ftrl', 
  TargetType = "Classification", 
  TrainMethod = "Train")

# Initialize RetroFit
x = ml.RetroFit(ModelArgs, ModelData, DataFrames)

# Train Model
x.ML1_Single_Train(Algorithm = 'Ftrl')

# Score data
x.ML1_Single_Score(
  DataName = x.DataSetsNames[2],
  ModelName = x.ModelListNames[0],
  Algorithm = 'Ftrl',
  NewData = None)

# Scoring data names
x.DataSetsNames

# Scoring data
x.DataSets.get('Scored_test_data_Ftrl_1')

# Check ModelArgs Dict
x.PrintAlgoArgs(Algo='Ftrl')

# List of model names
x.ModelListNames

# List of model fitted names
x.FitListNames

MultiClass

####################################
# Ftrl MultiClass
####################################

# Setup Environment
import pkg_resources
import timeit
import datatable as dt
import retrofit
from retrofit import DatatableFE as dtfe
from retrofit import MachineLearning as ml

# Load some data
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/MultiClassData.csv') 
data = dt.fread(FilePath)

# Instantiate Feature Engineering Class
FE = dtfe.FE()

# Create Calendar Vars
data = FE.FE1_AutoCalendarVariables(
    data,
    DateColumnNames='DateTime',
    CalendarVariables=['wday','month','quarter'],
    use_saved_args=False)

# Type conversions for modeling
data = FE.FE1_ColTypeConversions(
    data,
    Int2Float=True,
    Bool2Float=True,
    RemoveDateCols=True,
    RemoveStrCols=False,
    SkipCols=None,
    use_saved_args=False)

# Drop Text Cols (no word2vec yet)
data = data[:, [z for z in data.names if z not in ['Comment']]]

# Create partitioned data sets
DataFrames = FE.FE2_AutoDataPartition(
  data, 
  DateColumnName = None, 
  PartitionType = 'random', 
  Ratios = [0.7,0.2,0.1], 
  ByVariables = None, 
  Sort = False,
  use_saved_args = False)

# Prepare modeling data sets
ModelData = ml.ML0_GetModelData(
  Processing = 'Ftrl',
  TrainData = DataFrames['TrainData'],
  ValidationData = DataFrames['ValidationData'],
  TestData = DataFrames['TestData'],
  ArgsList = None,
  TargetColumnName = 'Adrian',
  NumericColumnNames = [z for z in list(data.names) if z not in ['Factor_2','Factor_3','Adrian']],
  CategoricalColumnNames = ['Factor_2', 'Factor_3'],
  TextColumnNames = None,
  WeightColumnName = None,
  Threads = -1,
  InputFrame = 'datatable')

# Get args list for algorithm and target type
ModelArgs = ml.ML0_Parameters(
  Algorithms = 'Ftrl',
  TargetType = "MultiClass",
  TrainMethod = "Train")

# Initialize RetroFit
x = ml.RetroFit(ModelArgs, ModelData, DataFrames)

# Train Model
x.ML1_Single_Train(Algorithm = 'Ftrl')

# Score data
x.ML1_Single_Score(
  DataName = x.DataSetsNames[2],
  ModelName = x.ModelListNames[0],
  Algorithm = 'Ftrl',
  NewData = None)

# Scoring data names
x.DataSetsNames

# Scoring data
x.DataSets.get('Scored_test_data_Ftrl_1').names

# Check ModelArgs Dict
x.PrintAlgoArgs(Algo='Ftrl')

# List of model names
x.ModelListNames

# List of model fitted names
x.FitListNames
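
The calendar step above requests 'wday', 'month', and 'quarter' features from the DateTime column. A minimal sketch of what such features look like, using only the standard library, is shown below. The helper calendar_variables is hypothetical, and the weekday numbering follows Python's convention (Monday = 0), which may differ from the numbering the package uses.

```python
from datetime import date

def calendar_variables(d):
    """Hypothetical helper: derive weekday, month, and quarter from a date."""
    return {
        "wday": d.weekday(),                # Monday = 0 ... Sunday = 6 (Python's convention)
        "month": d.month,                   # 1 to 12
        "quarter": (d.month - 1) // 3 + 1,  # 1 to 4
    }

first = calendar_variables(date(2024, 1, 1))
last = calendar_variables(date(2024, 12, 31))
```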

CatBoost Examples

Regression

####################################
# CatBoost Regression
####################################

# Setup Environment
import pkg_resources
import timeit
import datatable as dt
import retrofit
from retrofit import DatatableFE as dtfe
from retrofit import MachineLearning as ml

# Load some data
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/RegressionData.csv') 
data = dt.fread(FilePath)

# Instantiate Feature Engineering Class
FE = dtfe.FE()

# Create some lags
data = FE.FE0_AutoLags(
    data,
    LagColumnNames=['Independent_Variable1', 'Independent_Variable2'],
    DateColumnName='DateTime',
    ByVariables='Factor_1',
    LagPeriods=[1,2],
    ImputeValue=-1,
    Sort=True,
    use_saved_args=False)

# Create some rolling stats
data = FE.FE0_AutoRollStats(
    data,
    RollColumnNames=['Independent_Variable1','Independent_Variable2'],
    DateColumnName='DateTime',
    ByVariables='Factor_1',
    MovingAvg_Periods=[1,2],
    MovingSD_Periods=[2,3],
    MovingMin_Periods=[1,2],
    MovingMax_Periods=[1,2],
    ImputeValue=-1,
    Sort=True,
    use_saved_args=False)

# Create some diffs
data = FE.FE0_AutoDiff(
    data,
    DateColumnName='DateTime',
    ByVariables=['Factor_1','Factor_2','Factor_3'],
    DiffNumericVariables='Independent_Variable1',
    DiffDateVariables=None,
    DiffGroupVariables=None,
    NLag1=0,
    NLag2=1,
    Sort=True,
    use_saved_args=False)

# Create Calendar Vars
data = FE.FE1_AutoCalendarVariables(
    data,
    DateColumnNames='DateTime',
    CalendarVariables=['wday','month','quarter'],
    use_saved_args=False)

# Type conversions for modeling
data = FE.FE1_ColTypeConversions(
    data,
    Int2Float=True,
    Bool2Float=True,
    RemoveDateCols=True,
    RemoveStrCols=False,
    SkipCols=None,
    use_saved_args=False)

# Drop Text Cols (no word2vec yet)
data = data[:, [z for z in data.names if z not in ['Comment']]]

# Create partitioned data sets
DataFrames = FE.FE2_AutoDataPartition(
  data, 
  DateColumnName = None, 
  PartitionType = 'random', 
  Ratios = [0.7,0.2,0.1], 
  ByVariables = None, 
  Sort = False,
  use_saved_args = False)

# Prepare modeling data sets
ModelData = ml.ML0_GetModelData(
  Processing = 'catboost',
  TrainData = DataFrames['TrainData'],
  ValidationData = DataFrames['ValidationData'],
  TestData = DataFrames['TestData'],
  ArgsList = None,
  TargetColumnName = 'Adrian',
  NumericColumnNames = [z for z in list(data.names) if z not in ['Factor_1','Factor_2','Factor_3','Adrian']],
  CategoricalColumnNames = ['Factor_1', 'Factor_2', 'Factor_3'],
  TextColumnNames = None,
  WeightColumnName = None,
  Threads = -1,
  InputFrame = 'datatable')

# Get args list for algorithm and target type
ModelArgs = ml.ML0_Parameters(
  Algorithms = 'CatBoost', 
  TargetType = "Regression", 
  TrainMethod = "Train")

# Update iterations to run quickly
ModelArgs.get('CatBoost').get('AlgoArgs')['iterations'] = 50

# Initialize RetroFit
x = ml.RetroFit(ModelArgs, ModelData, DataFrames)

# Train Model
x.ML1_Single_Train(Algorithm = 'CatBoost')

# Score data
x.ML1_Single_Score(
  DataName = x.DataSetsNames[2], 
  ModelName = x.ModelListNames[0],
  Algorithm = 'CatBoost',
  NewData = None)

# Scoring data names
x.DataSetsNames

# Scoring data
x.DataSets.get('Scored_test_data_CatBoost_1')

# Check ModelArgs Dict
x.PrintAlgoArgs(Algo = 'CatBoost')

# List of model names
x.ModelListNames

# List of model fitted names
x.FitListNames
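
The lag step above asks for lags 1 and 2 of two columns, grouped by Factor_1, with missing values imputed as -1. The sketch below illustrates grouped lags on a small list of dict rows. It assumes an ascending date sort within each group; the helper add_lags and the output column names are hypothetical and are not taken from FE0_AutoLags.

```python
from collections import defaultdict

def add_lags(rows, value_col, date_col, by_col, lag_periods, impute_value=-1):
    """Hypothetical helper: per-group lags on a list of dict rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[by_col]].append(row)
    out = []
    for key in sorted(groups):
        # Assumption: rows are ordered oldest to newest within each group
        ordered = sorted(groups[key], key=lambda r: r[date_col])
        for i, row in enumerate(ordered):
            new_row = dict(row)
            for k in lag_periods:
                # Value k rows earlier in the same group, else the impute value
                new_row[f"{value_col}_lag{k}"] = ordered[i - k][value_col] if i >= k else impute_value
            out.append(new_row)
    return out

rows = [
    {"Factor_1": "A", "DateTime": "2021-01-01", "Independent_Variable1": 10.0},
    {"Factor_1": "A", "DateTime": "2021-01-02", "Independent_Variable1": 11.0},
    {"Factor_1": "A", "DateTime": "2021-01-03", "Independent_Variable1": 12.0},
    {"Factor_1": "B", "DateTime": "2021-01-01", "Independent_Variable1": 20.0},
    {"Factor_1": "B", "DateTime": "2021-01-02", "Independent_Variable1": 21.0},
]
lagged = add_lags(rows, "Independent_Variable1", "DateTime", "Factor_1", [1, 2])
```

Lags never cross group boundaries: the first row of group B gets the impute value rather than the last value of group A.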

Classification

####################################
# CatBoost Classification
####################################

# Setup Environment
import pkg_resources
import timeit
import datatable as dt
import retrofit
from retrofit import DatatableFE as dtfe
from retrofit import MachineLearning as ml

# Load some data
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/ClassificationData.csv') 
data = dt.fread(FilePath)

# Instantiate Feature Engineering Class
FE = dtfe.FE()

# Create some lags
data = FE.FE0_AutoLags(
    data,
    LagColumnNames=['Independent_Variable1', 'Independent_Variable2'],
    DateColumnName='DateTime',
    ByVariables='Factor_1',
    LagPeriods=[1,2],
    ImputeValue=-1,
    Sort=True,
    use_saved_args=False)

# Create some rolling stats
data = FE.FE0_AutoRollStats(
    data,
    RollColumnNames=['Independent_Variable1','Independent_Variable2'],
    DateColumnName='DateTime',
    ByVariables='Factor_1',
    MovingAvg_Periods=[1,2],
    MovingSD_Periods=[2,3],
    MovingMin_Periods=[1,2],
    MovingMax_Periods=[1,2],
    ImputeValue=-1,
    Sort=True,
    use_saved_args=False)

# Create some diffs
data = FE.FE0_AutoDiff(
    data,
    DateColumnName='DateTime',
    ByVariables=['Factor_1','Factor_2','Factor_3'],
    DiffNumericVariables='Independent_Variable1',
    DiffDateVariables=None,
    DiffGroupVariables=None,
    NLag1=0,
    NLag2=1,
    Sort=True,
    use_saved_args=False)

# Create Calendar Vars
data = FE.FE1_AutoCalendarVariables(
    data,
    DateColumnNames='DateTime',
    CalendarVariables=['wday','month','quarter'],
    use_saved_args=False)

# Type conversions for modeling
data = FE.FE1_ColTypeConversions(
    data,
    Int2Float=True,
    Bool2Float=True,
    RemoveDateCols=True,
    RemoveStrCols=False,
    SkipCols=None,
    use_saved_args=False)

# Drop Text Cols (no word2vec yet)
data = data[:, [z for z in data.names if z not in ['Comment']]]

# Create partitioned data sets
DataFrames = FE.FE2_AutoDataPartition(
  data, 
  DateColumnName = None, 
  PartitionType = 'random', 
  Ratios = [0.7,0.2,0.1], 
  ByVariables = None, 
  Sort = False,
  use_saved_args = False)

# Prepare modeling data sets
ModelData = ml.ML0_GetModelData(
  Processing = 'catboost',
  TrainData = DataFrames['TrainData'],
  ValidationData = DataFrames['ValidationData'],
  TestData = DataFrames['TestData'],
  ArgsList = None,
  TargetColumnName = 'Adrian',
  NumericColumnNames = [z for z in list(data.names) if z not in ['Factor_1','Factor_2','Factor_3','Adrian']],
  CategoricalColumnNames = ['Factor_1', 'Factor_2', 'Factor_3'],
  TextColumnNames = None,
  WeightColumnName = None,
  Threads = -1,
  InputFrame = 'datatable')

# Get args list for algorithm and target type
ModelArgs = ml.ML0_Parameters(
  Algorithms = 'CatBoost', 
  TargetType = 'Classification', 
  TrainMethod = 'Train')

# Update iterations to run quickly
ModelArgs.get('CatBoost').get('AlgoArgs')['iterations'] = 50

# Initialize RetroFit
x = ml.RetroFit(ModelArgs, ModelData, DataFrames)

# Train Model
x.ML1_Single_Train(Algorithm = 'CatBoost')

# Score data
x.ML1_Single_Score(
  DataName = x.DataSetsNames[2], 
  ModelName = x.ModelListNames[0],
  Algorithm = 'CatBoost',
  NewData = None)

# Scoring data names
x.DataSetsNames

# Scoring data
x.DataSets.get('Scored_test_data_CatBoost_1')

# Check ModelArgs Dict
x.PrintAlgoArgs(Algo = 'CatBoost')

# List of model names
x.ModelListNames

# List of model fitted names
x.FitListNames
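
The rolling-stats step above requests moving averages, standard deviations, minimums, and maximums over several window sizes, with -1 as the impute value. As a sketch of the moving-average case only, the code below computes a trailing mean over an ordered list. The helper rolling_mean is hypothetical, and it assumes the window includes the current row, which the examples here do not confirm for FE0_AutoRollStats.

```python
from statistics import mean

def rolling_mean(values, window, impute_value=-1):
    """Hypothetical helper: trailing moving average over an ordered list."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            # Not enough history yet, so fall back to the impute value
            out.append(impute_value)
        else:
            # Assumption: the window ends at, and includes, the current row
            out.append(mean(values[i + 1 - window : i + 1]))
    return out

series = [1.0, 2.0, 3.0, 4.0]
ma2 = rolling_mean(series, window=2)
```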

MultiClass

####################################
# CatBoost MultiClass
####################################

# Setup Environment
import pkg_resources
import timeit
import datatable as dt
import retrofit
from retrofit import DatatableFE as dtfe
from retrofit import MachineLearning as ml

# Load some data
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/MultiClassData.csv') 
data = dt.fread(FilePath)

# Instantiate Feature Engineering Class
FE = dtfe.FE()

# Create Calendar Vars
data = FE.FE1_AutoCalendarVariables(
    data,
    DateColumnNames='DateTime',
    CalendarVariables=['wday','month','quarter'],
    use_saved_args=False)

# Type conversions for modeling
data = FE.FE1_ColTypeConversions(
    data,
    Int2Float=True,
    Bool2Float=True,
    RemoveDateCols=True,
    RemoveStrCols=False,
    SkipCols=None,
    use_saved_args=False)

# Drop Text Cols (no word2vec yet)
data = data[:, [z for z in data.names if z not in ['Comment']]]

# Create partitioned data sets
DataFrames = FE.FE2_AutoDataPartition(
  data, 
  DateColumnName = None, 
  PartitionType = 'random', 
  Ratios = [0.7,0.2,0.1], 
  ByVariables = None, 
  Sort = False,
  use_saved_args = False)

# Prepare modeling data sets
ModelData = ml.ML0_GetModelData(
  Processing = 'catboost',
  TrainData = DataFrames['TrainData'],
  ValidationData = DataFrames['ValidationData'],
  TestData = DataFrames['TestData'],
  ArgsList = None,
  TargetColumnName = 'Adrian',
  NumericColumnNames = [z for z in list(data.names) if z not in ['Factor_2','Factor_3','Adrian']],
  CategoricalColumnNames = ['Factor_2', 'Factor_3'],
  TextColumnNames = None,
  WeightColumnName = None,
  Threads = -1,
  InputFrame = 'datatable')

# Get args list for algorithm and target type
ModelArgs = ml.ML0_Parameters(
  Algorithms = 'CatBoost',
  TargetType = 'MultiClass',
  TrainMethod = 'Train')

# Update iterations to run quickly
ModelArgs.get('CatBoost').get('AlgoArgs')['iterations'] = 50

# Initialize RetroFit
x = ml.RetroFit(ModelArgs, ModelData, DataFrames)

# Train Model
x.ML1_Single_Train(Algorithm = 'CatBoost')

# Score data
x.ML1_Single_Score(
  DataName = x.DataSetsNames[2], 
  ModelName = x.ModelListNames[0],
  Algorithm = 'CatBoost',
  NewData = None)

# Scoring data names
x.DataSetsNames

# Scoring data
x.DataSets.get('Scored_test_data_CatBoost_1')

# Check ModelArgs Dict
x.PrintAlgoArgs(Algo = 'CatBoost')

# List of model names
x.ModelListNames

# List of model fitted names
x.FitListNames
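
The examples update training parameters in place, either with chained .get() calls or with bracket indexing. The sketch below shows why both forms work on a nested dictionary and how they differ when a key is misspelled. The dictionary model_args is a stand-in with placeholder values; the real contents returned by ML0_Parameters are not shown here.

```python
# Stand-in for the nested dictionary of arguments; the values are placeholders
model_args = {"CatBoost": {"AlgoArgs": {"iterations": 1000, "learning_rate": 0.1}}}

# Chained .get() returns the nested dict itself, so the assignment mutates model_args
model_args.get("CatBoost").get("AlgoArgs")["iterations"] = 50

# Bracket indexing is equivalent when the keys exist
model_args["CatBoost"]["AlgoArgs"]["learning_rate"] = 0.05

# With a misspelled key, bracket indexing raises a KeyError that names the key,
# whereas .get() would return None and fail later with a less direct error
try:
    model_args["Catboost"]["AlgoArgs"]["iterations"] = 50
except KeyError as err:
    missing_key = err.args[0]
```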

XGBoost Examples

Regression

####################################
# XGBoost Regression
####################################

# Setup Environment
import pkg_resources
import timeit
import datatable as dt
import retrofit
from retrofit import DatatableFE as dtfe
from retrofit import MachineLearning as ml

# Load some data
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/RegressionData.csv')
data = dt.fread(FilePath)

# Instantiate Feature Engineering Class
FE = dtfe.FE()

# Create some lags
data = FE.FE0_AutoLags(
    data,
    LagColumnNames=['Independent_Variable1', 'Independent_Variable2'],
    DateColumnName='DateTime',
    ByVariables='Factor_1',
    LagPeriods=[1,2],
    ImputeValue=-1,
    Sort=True,
    use_saved_args=False)

# Create some rolling stats
data = FE.FE0_AutoRollStats(
    data,
    RollColumnNames=['Independent_Variable1','Independent_Variable2'],
    DateColumnName='DateTime',
    ByVariables='Factor_1',
    MovingAvg_Periods=[1,2],
    MovingSD_Periods=[2,3],
    MovingMin_Periods=[1,2],
    MovingMax_Periods=[1,2],
    ImputeValue=-1,
    Sort=True,
    use_saved_args=False)

# Create some diffs
data = FE.FE0_AutoDiff(
    data,
    DateColumnName='DateTime',
    ByVariables=['Factor_1','Factor_2','Factor_3'],
    DiffNumericVariables='Independent_Variable1',
    DiffDateVariables=None,
    DiffGroupVariables=None,
    NLag1=0,
    NLag2=1,
    Sort=True,
    use_saved_args=False)

# Dummify
data = FE.FE1_DummyVariables(
  data = data,
  CategoricalColumnNames = ['Factor_1','Factor_2','Factor_3'],
  use_saved_args=False)
data = data[:, [name not in ['Factor_1','Factor_2','Factor_3'] for name in data.names]]

# Create Calendar Vars
data = FE.FE1_AutoCalendarVariables(
    data,
    DateColumnNames='DateTime',
    CalendarVariables=['wday','month','quarter'],
    use_saved_args=False)

# Type conversions for modeling
data = FE.FE1_ColTypeConversions(
    data,
    Int2Float=True,
    Bool2Float=True,
    RemoveDateCols=True,
    RemoveStrCols=False,
    SkipCols=None,
    use_saved_args=False)

# Drop Text Cols (no word2vec yet)
data = data[:, [z for z in data.names if z not in ['Comment']]]

# Create partitioned data sets
DataFrames = FE.FE2_AutoDataPartition(
  data,
  DateColumnName = None,
  PartitionType = 'random',
  Ratios = [0.7,0.2,0.1],
  ByVariables = None,
  Sort = False,
  use_saved_args = False)

# Features
Features = [z for z in list(data.names) if z not in ['Adrian','DateTime','Comment','Weights']]

# Prepare modeling data sets
ModelData = ml.ML0_GetModelData(
  Processing = 'xgboost',
  TrainData = DataFrames['TrainData'],
  ValidationData = DataFrames['ValidationData'],
  TestData = DataFrames['TestData'],
  ArgsList = None,
  TargetColumnName = 'Adrian',
  NumericColumnNames = Features,
  CategoricalColumnNames = None,
  TextColumnNames = None,
  WeightColumnName = None,
  Threads = -1,
  InputFrame = 'datatable')

# Get args list for algorithm and target type
ModelArgs = ml.ML0_Parameters(
  Algorithms = 'XGBoost',
  TargetType = "Regression",
  TrainMethod = "Train")

# Update iterations to run quickly
ModelArgs['XGBoost']['AlgoArgs']['num_boost_round'] = 50

# Initialize RetroFit
x = ml.RetroFit(ModelArgs, ModelData, DataFrames)

# Train Model
x.ML1_Single_Train(Algorithm = 'XGBoost')

# Score data
x.ML1_Single_Score(
  DataName = x.DataSetsNames[2],
  ModelName = x.ModelListNames[0],
  Algorithm = 'XGBoost',
  NewData = None)

# Scoring data names
x.DataSetsNames

# Scoring data
x.DataSets.get('Scored_test_data_XGBoost_1')

# Check ModelArgs Dict
x.PrintAlgoArgs(Algo = 'XGBoost')

# List of model names
x.ModelListNames

# List of model fitted names
x.FitListNames
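
Unlike the CatBoost examples, the XGBoost example dummifies the categorical columns and then drops the originals, so that every feature passed to the model is numeric. The sketch below illustrates that encoding on a list of dict rows. The helper dummy_variables and the generated column names are hypothetical; FE1_DummyVariables may name its output columns differently.

```python
def dummy_variables(rows, column):
    """Hypothetical helper: add one 0/1 column per level of `column`."""
    levels = sorted({row[column] for row in rows})
    out = []
    for row in rows:
        new_row = dict(row)
        for level in levels:
            new_row[f"{column}_{level}"] = 1 if row[column] == level else 0
        out.append(new_row)
    return out

rows = [{"Factor_1": "A"}, {"Factor_1": "B"}, {"Factor_1": "A"}]
encoded = dummy_variables(rows, "Factor_1")

# Drop the source column afterwards so only numeric columns remain
encoded = [{k: v for k, v in row.items() if k != "Factor_1"} for row in encoded]
```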

Classification

####################################
# XGBoost Classification
####################################

# Setup Environment
import pkg_resources
import timeit
import datatable as dt
import retrofit
from retrofit import DatatableFE as dtfe
from retrofit import MachineLearning as ml

# Load some data
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/ClassificationData.csv') 
data = dt.fread(FilePath)

# Instantiate Feature Engineering Class
FE = dtfe.FE()

# Create some lags
data = FE.FE0_AutoLags(
    data,
    LagColumnNames=['Independent_Variable1', 'Independent_Variable2'],
    DateColumnName='DateTime',
    ByVariables='Factor_1',
    LagPeriods=[1,2],
    ImputeValue=-1,
    Sort=True,
    use_saved_args=False)

# Create some rolling stats
data = FE.FE0_AutoRollStats(
    data,
    RollColumnNames=['Independent_Variable1','Independent_Variable2'],
    DateColumnName='DateTime',
    ByVariables='Factor_1',
    MovingAvg_Periods=[1,2],
    MovingSD_Periods=[2,3],
    MovingMin_Periods=[1,2],
    MovingMax_Periods=[1,2],
    ImputeValue=-1,
    Sort=True,
    use_saved_args=False)

# Create some diffs
data = FE.FE0_AutoDiff(
    data,
    DateColumnName='DateTime',
    ByVariables=['Factor_1','Factor_2','Factor_3'],
    DiffNumericVariables='Independent_Variable1',
    DiffDateVariables=None,
    DiffGroupVariables=None,
    NLag1=0,
    NLag2=1,
    Sort=True,
    use_saved_args=False)

# Dummify
data = FE.FE1_DummyVariables(
  data = data, 
  CategoricalColumnNames = ['Factor_1','Factor_2','Factor_3'],
  use_saved_args=False)
data = data[:, [name not in ['Factor_1','Factor_2','Factor_3'] for name in data.names]]

# Create Calendar Vars
data = FE.FE1_AutoCalendarVariables(
    data,
    DateColumnNames='DateTime',
    CalendarVariables=['wday','month','quarter'],
    use_saved_args=False)

# Type conversions for modeling
data = FE.FE1_ColTypeConversions(
    data,
    Int2Float=True,
    Bool2Float=True,
    RemoveDateCols=True,
    RemoveStrCols=False,
    SkipCols=None,
    use_saved_args=False)

# Drop Text Cols (no word2vec yet)
data = data[:, [z for z in data.names if z not in ['Comment']]]

# Create partitioned data sets
DataFrames = FE.FE2_AutoDataPartition(
  data, 
  DateColumnName = None, 
  PartitionType = 'random', 
  Ratios = [0.7,0.2,0.1], 
  ByVariables = None, 
  Sort = False,
  use_saved_args = False)

# Features
Features = [z for z in list(data.names) if z not in ['Adrian','DateTime','Comment','Weights']]

# Prepare modeling data sets
ModelData = ml.ML0_GetModelData(
  Processing = 'xgboost',
  TrainData = DataFrames['TrainData'],
  ValidationData = DataFrames['ValidationData'],
  TestData = DataFrames['TestData'],
  ArgsList = None,
  TargetColumnName = 'Adrian',
  NumericColumnNames = Features,
  CategoricalColumnNames = None,
  TextColumnNames = None,
  WeightColumnName = None,
  Threads = -1,
  InputFrame = 'datatable')

# Get args list for algorithm and target type
ModelArgs = ml.ML0_Parameters(
  Algorithms = 'XGBoost', 
  TargetType = "Classification", 
  TrainMethod = "Train")

# Update iterations to run quickly
ModelArgs.get('XGBoost').get('AlgoArgs')['num_boost_round'] = 50

# Initialize RetroFit
x = ml.RetroFit(ModelArgs, ModelData, DataFrames)

# Train Model
x.ML1_Single_Train(Algorithm = 'XGBoost')

# Score data
x.ML1_Single_Score(
  DataName = x.DataSetsNames[2],
  ModelName = x.ModelListNames[0],
  Algorithm = 'XGBoost',
  NewData = None)

# Scoring data names
x.DataSetsNames

# Scoring data
x.DataSets.get('Scored_test_data_XGBoost_1')

# Check ModelArgs Dict
x.PrintAlgoArgs(Algo = 'XGBoost')

# List of model names
x.ModelListNames

# List of model fitted names
x.FitListNames
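
The diff step above is called with NLag1=0 and NLag2=1. One plausible reading, and it is only an assumption here, is that the feature is the lag-NLag1 value minus the lag-NLag2 value, i.e. the current value minus the previous one. The helper lag_difference below is hypothetical and sketches that reading on a single ordered series.

```python
def lag_difference(values, n_lag1=0, n_lag2=1, impute_value=None):
    """Hypothetical helper: lag-n_lag1 value minus lag-n_lag2 value."""
    out = []
    for i in range(len(values)):
        if i < max(n_lag1, n_lag2):
            # Not enough history for the longer lag
            out.append(impute_value)
        else:
            out.append(values[i - n_lag1] - values[i - n_lag2])
    return out

diffs = lag_difference([5.0, 7.0, 4.0, 10.0])
```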

MultiClass

####################################
# XGBoost MultiClass
####################################

# Setup Environment
import pkg_resources
import timeit
import datatable as dt
import retrofit
from retrofit import DatatableFE as dtfe
from retrofit import MachineLearning as ml

# Load some data
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/MultiClassData.csv') 
data = dt.fread(FilePath)

# Instantiate Feature Engineering Class
FE = dtfe.FE()

# Dummify
data = FE.FE1_DummyVariables(
  data = data, 
  CategoricalColumnNames = ['Factor_2','Factor_3'],
  use_saved_args=False)
data = data[:, [name not in ['Factor_2','Factor_3'] for name in data.names]]

# Create Calendar Vars
data = FE.FE1_AutoCalendarVariables(
    data,
    DateColumnNames='DateTime',
    CalendarVariables=['wday','month','quarter'],
    use_saved_args=False)

# Type conversions for modeling
data = FE.FE1_ColTypeConversions(
    data,
    Int2Float=True,
    Bool2Float=True,
    RemoveDateCols=True,
    RemoveStrCols=False,
    SkipCols=None,
    use_saved_args=False)

# Drop Text Cols (no word2vec yet)
data = data[:, [z for z in data.names if z not in ['Comment']]]

# Create partitioned data sets
DataFrames = FE.FE2_AutoDataPartition(
  data, 
  DateColumnName = None, 
  PartitionType = 'random', 
  Ratios = [0.7,0.2,0.1], 
  ByVariables = None, 
  Sort = False,
  use_saved_args = False)

# Features
Features = [z for z in list(data.names) if z not in ['Adrian','DateTime','Comment','Weights']]

# Prepare modeling data sets
ModelData = ml.ML0_GetModelData(
  Processing = 'xgboost',
  TrainData = DataFrames['TrainData'],
  ValidationData = DataFrames['ValidationData'],
  TestData = DataFrames['TestData'],
  ArgsList = None,
  TargetColumnName = 'Adrian',
  NumericColumnNames = Features,
  CategoricalColumnNames = None,
  TextColumnNames = None,
  WeightColumnName = None,
  Threads = -1,
  InputFrame = 'datatable')

# Get args list for algorithm and target type
ModelArgs = ml.ML0_Parameters(
  Algorithms = 'XGBoost',
  TargetType = "MultiClass",
  TrainMethod = "Train")

# Update iterations to run quickly
ModelArgs.get('XGBoost').get('AlgoArgs')['num_boost_round'] = 50

# Initialize RetroFit
x = ml.RetroFit(ModelArgs, ModelData, DataFrames)

# Train Model
x.ML1_Single_Train(Algorithm = 'XGBoost')

# Score data
x.ML1_Single_Score(
  DataName = x.DataSetsNames[2],
  ModelName = x.ModelListNames[0],
  Algorithm = 'XGBoost',
  NewData = None)

# Scoring data names
x.DataSetsNames

# Scoring data
x.DataSets.get('Scored_test_data_XGBoost_1')

# Check ModelArgs Dict
x.PrintAlgoArgs(Algo = 'XGBoost')

# List of model names
x.ModelListNames

# List of model fitted names
x.FitListNames
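
The examples drop columns in two ways: with a list of booleans (one per column) after dummifying, and with a list of the names to keep when removing the Comment column. The sketch below shows, on a plain list of names, that the two selectors pick the same columns. The column names are taken from the examples; no datatable call is made here.

```python
from itertools import compress

names = ["Factor_2", "Factor_3", "Adrian", "Independent_Variable1"]
drop = ["Factor_2", "Factor_3"]

# Boolean selector: True for every column that should be kept
mask = [name not in drop for name in names]

# Name selector: the kept column names themselves
keep = [name for name in names if name not in drop]

# Applying the mask to the names yields the same selection
selected_by_mask = list(compress(names, mask))
```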

LightGBM Examples

Regression

####################################
# LightGBM Regression
####################################

# Setup Environment
import pkg_resources
import timeit
import datatable as dt
import retrofit
from retrofit import DatatableFE as dtfe
from retrofit import MachineLearning as ml

# Load some data
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/RegressionData.csv') 
data = dt.fread(FilePath)

# Instantiate Feature Engineering Class
FE = dtfe.FE()

# Create some lags
data = FE.FE0_AutoLags(
    data,
    LagColumnNames=['Independent_Variable1', 'Independent_Variable2'],
    DateColumnName='DateTime',
    ByVariables='Factor_1',
    LagPeriods=[1,2],
    ImputeValue=-1,
    Sort=True,
    use_saved_args=False)

# Create some rolling stats
data = FE.FE0_AutoRollStats(
    data,
    RollColumnNames=['Independent_Variable1','Independent_Variable2'],
    DateColumnName='DateTime',
    ByVariables='Factor_1',
    MovingAvg_Periods=[1,2],
    MovingSD_Periods=[2,3],
    MovingMin_Periods=[1,2],
    MovingMax_Periods=[1,2],
    ImputeValue=-1,
    Sort=True,
    use_saved_args=False)

# Create some diffs
data = FE.FE0_AutoDiff(
    data,
    DateColumnName='DateTime',
    ByVariables=['Factor_1','Factor_2','Factor_3'],
    DiffNumericVariables='Independent_Variable1',
    DiffDateVariables=None,
    DiffGroupVariables=None,
    NLag1=0,
    NLag2=1,
    Sort=True,
    use_saved_args=False)

# Dummify
data = FE.FE1_DummyVariables(
  data = data, 
  CategoricalColumnNames = ['Factor_1','Factor_2','Factor_3'],
  use_saved_args=False)
data = data[:, [name not in ['Factor_1','Factor_2','Factor_3'] for name in data.names]]

# Create Calendar Vars
data = FE.FE1_AutoCalendarVariables(
    data,
    DateColumnNames='DateTime',
    CalendarVariables=['wday','month','quarter'],
    use_saved_args=False)

# Type conversions for modeling
data = FE.FE1_ColTypeConversions(
    data,
    Int2Float=True,
    Bool2Float=True,
    RemoveDateCols=True,
    RemoveStrCols=False,
    SkipCols=None,
    use_saved_args=False)

# Drop Text Cols (no word2vec yet)
data = data[:, [z for z in data.names if z not in ['Comment']]]

# Create partitioned data sets
DataFrames = FE.FE2_AutoDataPartition(
  data, 
  DateColumnName = None, 
  PartitionType = 'random', 
  Ratios = [0.7,0.2,0.1], 
  ByVariables = None, 
  Sort = False,
  use_saved_args = False)

# Features
Features = [z for z in list(data.names) if z not in ['Adrian','DateTime','Comment','Weights']]

# Prepare modeling data sets
ModelData = ml.ML0_GetModelData(
  Processing = 'lightgbm',
  TrainData = DataFrames['TrainData'],
  ValidationData = DataFrames['ValidationData'],
  TestData = DataFrames['TestData'],
  ArgsList = None,
  TargetColumnName = 'Adrian',
  NumericColumnNames = Features,
  CategoricalColumnNames = None,
  TextColumnNames = None,
  WeightColumnName = None,
  Threads = -1,
  InputFrame = 'datatable')

# Get args list for algorithm and target type
ModelArgs = ml.ML0_Parameters(
  Algorithms = 'LightGBM', 
  TargetType = "Regression", 
  TrainMethod = "Train")

# Update iterations to run quickly
ModelArgs.get('LightGBM').get('AlgoArgs')['num_iterations'] = 50

# Initialize RetroFit
x = ml.RetroFit(ModelArgs, ModelData, DataFrames)

# Train Model
x.ML1_Single_Train(Algorithm = 'LightGBM')

# Score data
x.ML1_Single_Score(
  DataName = x.DataSetsNames[2],
  ModelName = x.ModelListNames[0],
  Algorithm = 'LightGBM')

# Scoring data names
x.DataSetsNames

# Scoring data
x.DataSets.get('Scored_test_data_LightGBM_1')

# Check ModelArgs Dict
x.PrintAlgoArgs(Algo = 'LightGBM')

# List of model names
x.ModelListNames

# List of model fitted names
x.FitListNames

Classification

####################################
# LightGBM Classification
####################################

# Setup Environment
import pkg_resources
import timeit
import datatable as dt
import retrofit
from retrofit import DatatableFE as dtfe
from retrofit import MachineLearning as ml

# Load some data
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/ClassificationData.csv') 
data = dt.fread(FilePath)

# Instantiate Feature Engineering Class
FE = dtfe.FE()

# Create some lags
data = FE.FE0_AutoLags(
    data,
    LagColumnNames=['Independent_Variable1', 'Independent_Variable2'],
    DateColumnName='DateTime',
    ByVariables='Factor_1',
    LagPeriods=[1,2],
    ImputeValue=-1,
    Sort=True,
    use_saved_args=False)

# Create some rolling stats
data = FE.FE0_AutoRollStats(
    data,
    RollColumnNames=['Independent_Variable1','Independent_Variable2'],
    DateColumnName='DateTime',
    ByVariables='Factor_1',
    MovingAvg_Periods=[1,2],
    MovingSD_Periods=[2,3],
    MovingMin_Periods=[1,2],
    MovingMax_Periods=[1,2],
    ImputeValue=-1,
    Sort=True,
    use_saved_args=False)

# Create some diffs
data = FE.FE0_AutoDiff(
    data,
    DateColumnName='DateTime',
    ByVariables=['Factor_1','Factor_2','Factor_3'],
    DiffNumericVariables='Independent_Variable1',
    DiffDateVariables=None,
    DiffGroupVariables=None,
    NLag1=0,
    NLag2=1,
    Sort=True,
    use_saved_args=False)

# Dummify
data = FE.FE1_DummyVariables(
  data = data, 
  CategoricalColumnNames = ['Factor_1','Factor_2','Factor_3'],
  use_saved_args=False)
data = data[:, [name not in ['Factor_1','Factor_2','Factor_3'] for name in data.names]]

# Create Calendar Vars
data = FE.FE1_AutoCalendarVariables(
    data,
    DateColumnNames='DateTime',
    CalendarVariables=['wday','month','quarter'],
    use_saved_args=False)

# Type conversions for modeling
data = FE.FE1_ColTypeConversions(
    data,
    Int2Float=True,
    Bool2Float=True,
    RemoveDateCols=True,
    RemoveStrCols=False,
    SkipCols=None,
    use_saved_args=False)

# Drop Text Cols (no word2vec yet)
data = data[:, [z for z in data.names if z not in ['Comment']]]

# Create partitioned data sets
DataFrames = FE.FE2_AutoDataPartition(
  data, 
  DateColumnName = None, 
  PartitionType = 'random', 
  Ratios = [0.7,0.2,0.1], 
  ByVariables = None, 
  Sort = False,
  use_saved_args = False)

# Features
Features = [z for z in list(data.names) if z not in ['Adrian','DateTime','Comment','Weights']]

# Prepare modeling data sets
ModelData = ml.ML0_GetModelData(
  Processing = 'lightgbm',
  TrainData = DataFrames['TrainData'],
  ValidationData = DataFrames['ValidationData'],
  TestData = DataFrames['TestData'],
  ArgsList = None,
  TargetColumnName = 'Adrian',
  NumericColumnNames = Features,
  CategoricalColumnNames = None,
  TextColumnNames = None,
  WeightColumnName = None,
  Threads = -1,
  InputFrame = 'datatable')

# Get args list for algorithm and target type
ModelArgs = ml.ML0_Parameters(
  Algorithms = 'LightGBM', 
  TargetType = "Classification", 
  TrainMethod = "Train")

# Update iterations to run quickly
ModelArgs.get('LightGBM').get('AlgoArgs')['num_iterations'] = 50

# Initialize RetroFit
x = ml.RetroFit(ModelArgs, ModelData, DataFrames)

# Train Model
x.ML1_Single_Train(Algorithm = 'LightGBM')

# Score data
x.ML1_Single_Score(
  DataName = x.DataSetsNames[2],
  ModelName = x.ModelListNames[0],
  Algorithm = 'LightGBM')

# Scoring data names
x.DataSetsNames

# Scoring data
x.DataSets.get('Scored_test_data_LightGBM_1')

# Check ModelArgs Dict
x.PrintAlgoArgs(Algo = 'LightGBM')

# List of model names
x.ModelListNames

# List of model fitted names
x.FitListNames

MultiClass

####################################
# LightGBM MultiClass
####################################

# Setup Environment
import pkg_resources
import timeit
import datatable as dt
import retrofit
from retrofit import DatatableFE as dtfe
from retrofit import MachineLearning as ml

# Load some data
FilePath = pkg_resources.resource_filename('retrofit', 'datasets/MultiClassData.csv') 
data = dt.fread(FilePath)

# Instantiate Feature Engineering Class
FE = dtfe.FE()

# Dummify
data = FE.FE1_DummyVariables(
  data = data, 
  CategoricalColumnNames = ['Factor_2','Factor_3'],
  use_saved_args=False)
data = data[:, [name not in ['Factor_2','Factor_3'] for name in data.names]]

# Create Calendar Vars
data = FE.FE1_AutoCalendarVariables(
    data,
    DateColumnNames='DateTime',
    CalendarVariables=['wday','month','quarter'],
    use_saved_args=False)

# Type conversions for modeling
data = FE.FE1_ColTypeConversions(
    data,
    Int2Float=True,
    Bool2Float=True,
    RemoveDateCols=True,
    RemoveStrCols=False,
    SkipCols=None,
    use_saved_args=False)

# Drop Text Cols (no word2vec yet)
data = data[:, [z for z in data.names if z not in ['Comment']]]

# Create partitioned data sets
DataFrames = FE.FE2_AutoDataPartition(
  data, 
  DateColumnName = None, 
  PartitionType = 'random', 
  Ratios = [0.7,0.2,0.1], 
  ByVariables = None, 
  Sort = False,
  use_saved_args = False)

# Features
Features = [z for z in list(data.names) if z not in ['Adrian','DateTime','Comment','Weights']]

# Prepare modeling data sets
ModelData = ml.ML0_GetModelData(
  Processing = 'lightgbm',
  TrainData = DataFrames['TrainData'],
  ValidationData = DataFrames['ValidationData'],
  TestData = DataFrames['TestData'],
  ArgsList = None,
  TargetColumnName = 'Adrian',
  NumericColumnNames = Features,
  CategoricalColumnNames = None,
  TextColumnNames = None,
  WeightColumnName = None,
  Threads = -1,
  InputFrame = 'datatable')

# Get args list for algorithm and target type
ModelArgs = ml.ML0_Parameters(
  Algorithms = 'LightGBM', 
  TargetType = "MultiClass", 
  TrainMethod = "Train")

# Update iterations to run quickly
ModelArgs.get('LightGBM').get('AlgoArgs')['num_iterations'] = 50

# Initialize RetroFit
x = ml.RetroFit(ModelArgs, ModelData, DataFrames)

# Train Model
x.ML1_Single_Train(Algorithm = 'LightGBM')

# Score data
x.ML1_Single_Score(
  DataName = x.DataSetsNames[2],
  ModelName = x.ModelListNames[0],
  Algorithm = 'LightGBM')

# Scoring data names
x.DataSetsNames

# Scoring data
x.DataSets.get('Scored_test_data_LightGBM_1')

# Check ModelArgs Dict
x.PrintAlgoArgs(Algo = 'LightGBM')

# List of model names
x.ModelListNames

# List of model fitted names
x.FitListNames

Machine Learning Evaluation

Coming Soon

Machine Learning Interpretation

Coming Soon

Machine Learning Scoring

Coming Soon
