Model Tooling library
A library for machine learning utilities
Installation
Use pip to install:
pip install ml-tooling
Contents
- Transformers - A library of transformers for use with Scikit-learn pipelines
- Model base classes - Production base classes for subclassing; guarantees the interface for use in an API
- Plotting functions - Functions for producing nice, commonly used plots such as ROC curves and confusion matrices
BaseClassModel
A base class for defining your model. Your subclass must define two methods:
- get_prediction_data() - Function that, given an input, fetches the corresponding features. Used for predicting an unseen observation
- get_training_data() - Function that retrieves all training data. Used for training and evaluating the model
Example usage
Define a class using BaseClassModel and implement the two required methods. Here we simply implement a linear regression on the Boston dataset using sklearn.datasets
from ml_tooling import BaseClassModel
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression, Ridge, LassoLars

# Define a new class
class BostonModel(BaseClassModel):
    def get_prediction_data(self, idx):
        x, _ = load_boston(return_X_y=True)
        return x[idx]  # Return given observation

    def get_training_data(self):
        return load_boston(return_X_y=True)

# Use our new class to implement a given model - any sklearn compatible estimator
linear_boston = BostonModel(LinearRegression())
results = linear_boston.score_model()

# Visualize results
results.plot.residuals()
results.plot.prediction_error()

# Save our model
linear_boston.save_model()

# Recreate model
BostonModel.load_model('.')

# Train different models and get the best performing
models_to_try = [LinearRegression(), Ridge(), LassoLars()]

# best_model will be BostonModel instantiated with the highest scoring model.
# all_results is a list of all results
best_model, all_results = BostonModel.test_models(models_to_try, metric='neg_mean_squared_error')
print(all_results)
BaseClassModel implements a number of useful methods:
save_model()
Saves the model as a binary file
load_model()
Instantiates the class with a joblib pickled model
score_model()
Loads all training data and trains the model on it, using a train_test split. Returns a Result object containing all result parameters
train_model()
Loads all training data and trains the model on all data. Typically used as the last step when model tuning is complete
set_config({'CONFIG_KEY': 'VALUE'})
Sets configuration options - existing configuration options can be seen using the .config property
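As a minimal sketch, continuing the BostonModel example above (the key shown is a placeholder, not a documented ml-tooling option):

# Inspect the current configuration
print(linear_boston.config)

# Override a setting - 'CONFIG_KEY' is a placeholder name
linear_boston.set_config({'CONFIG_KEY': 'VALUE'})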
make_prediction(*args)
Makes a prediction given an input, for example a customer number. The input is passed to the implemented get_prediction_data() method and .predict() is called on the estimator
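A minimal sketch, continuing the BostonModel example above. The estimator must be fitted first (via score_model() or train_model()), and whatever is passed to make_prediction() is forwarded to get_prediction_data(), so here it is a row index:

# Train on all data, then predict for a single observation
linear_boston.train_model()

# The index 42 is forwarded to get_prediction_data(), which fetches the features for that row
prediction = linear_boston.make_prediction(42)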
test_models([model1, model2], metric='accuracy')
Runs score_model() on each model, saving the results. Returns the best model as well as a list of all results
setup_model()
To be implemented by the user - setup_model is a classmethod which loads an untrained model. Typically this would set up a pipeline and the selected model for easy training
Returning to our previous example of the BostonModel, let us implement a setup_model method
from ml_tooling import BaseClassModel
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

class BostonModel(BaseClassModel):
    def get_prediction_data(self, idx):
        x, _ = load_boston(return_X_y=True)
        return x[idx]  # Return given observation

    def get_training_data(self):
        return load_boston(return_X_y=True)

    @classmethod
    def setup_model(cls):
        pipeline = Pipeline([
            ('scaler', StandardScaler()),
            ('clf', LinearRegression())
        ])
        return cls(pipeline)
Given this extra setup, it becomes easy to load the untrained model to train it:
model = BostonModel.setup_model()
model.train_model()
Visualizing results
When a model is scored, it returns a Result object. That object has a number of visualization options depending on the type of model:
Classifiers
- roc_curve()
- confusion_matrix()
- feature_importance()
- lift_curve()
Regressors
- prediction_error()
- residuals()
- feature_importance()
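The regression plots are demonstrated in the BostonModel example above. As a hedged sketch of the classifier plots, assume a hypothetical CancerModel defined the same way on the binary breast cancer dataset - the plots are accessed through the same results.plot accessor shown earlier:

from ml_tooling import BaseClassModel
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

class CancerModel(BaseClassModel):
    def get_prediction_data(self, idx):
        x, _ = load_breast_cancer(return_X_y=True)
        return x[idx]  # Return given observation

    def get_training_data(self):
        return load_breast_cancer(return_X_y=True)

# Score a classifier and draw the classification plots
results = CancerModel(LogisticRegression()).score_model()
results.plot.roc_curve()
results.plot.confusion_matrix()
results.plot.feature_importance()
results.plot.lift_curve()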
Transformers
The library also provides a number of transformers for working with DataFrames in a pipeline
Select
A column selector - Provide a list of columns to be passed on in the pipeline
Example
from ml_tooling.transformers import Select
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "status": ["OK", "Error", "OK", "Error"],
    "sales": [2000, 3000, 4000, 5000]
})

select = Select(['id', 'status'])
select.fit_transform(df)

Out[1]:
   id status
0   1     OK
1   2  Error
2   3     OK
3   4  Error
FillNA
Fills NA values with the value given at instantiation - passed to df.fillna()
Example
from ml_tooling.transformers import FillNA
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "status": ["OK", "Error", "OK", "Error"],
    "sales": [2000, 3000, 4000, np.nan]
})

fill_na = FillNA(0)
fill_na.fit_transform(df)

Out[1]:
   id status   sales
0   1     OK  2000.0
1   2  Error  3000.0
2   3     OK  4000.0
3   4  Error     0.0
ToCategorical
Performs one-hot encoding of categorical values through pd.Categorical. All categorical values not found in training data will be set to 0
Example
from ml_tooling.transformers import ToCategorical
import pandas as pd

df = pd.DataFrame({
    "status": ["OK", "Error", "OK", "Error"]
})

onehot = ToCategorical()
onehot.fit_transform(df)

Out[1]:
   status_Error  status_OK
0             0          1
1             1          0
2             0          1
3             1          0
FuncTransformer
Applies a given function to each column
Example
from ml_tooling.transformers import FuncTransformer
import pandas as pd

df = pd.DataFrame({
    "status": ["OK", "Error", "OK", "Error"]
})

uppercase = FuncTransformer(lambda x: x.str.upper())
uppercase.fit_transform(df)

Out[1]:
  status
0     OK
1  ERROR
2     OK
3  ERROR
Binner
Bins numerical data into supplied bins
Example
from ml_tooling.transformers import Binner
import pandas as pd

df = pd.DataFrame({
    "sales": [1500, 2000, 2250, 7830]
})

binned = Binner(bins=[0, 1000, 2000, 8000])
binned.fit_transform(df)

Out[1]:
          sales
0  (1000, 2000]
1  (1000, 2000]
2  (2000, 8000]
3  (2000, 8000]
Renamer
Renames columns to the names in the passed list - the names must be given in the same order as the DataFrame's columns
Example
from ml_tooling.transformers import Renamer
import pandas as pd

df = pd.DataFrame({
    "Total Sales": [1500, 2000, 2250, 7830]
})

rename = Renamer(['sales'])
rename.fit_transform(df)

Out[1]:
   sales
0   1500
1   2000
2   2250
3   7830
DateEncoder
Adds year, month, day and week columns based on a date field. Each component can be toggled in the initializer
from ml_tooling.transformers import DateEncoder
import pandas as pd

df = pd.DataFrame({
    "sales_date": [pd.to_datetime('2018-01-01'), pd.to_datetime('2018-02-02')]
})

dates = DateEncoder(week=False)
dates.fit_transform(df)

Out[1]:
   sales_date_day  sales_date_month  sales_date_year
0               1                 1             2018
1               2                 2             2018
FreqFeature
Converts a column into normalized frequencies
from ml_tooling.transformers import FreqFeature
import pandas as pd

df = pd.DataFrame({
    "sales_category": ['Sale', 'Sale', 'Not Sale']
})

freq = FreqFeature()
freq.fit_transform(df)

Out[1]:
   sales_category
0        0.666667
1        0.666667
2        0.333333
DFFeatureUnion
A FeatureUnion equivalent for DataFrames. Concatenates the results of multiple transformers
from ml_tooling.transformers import FreqFeature, Binner, Select, DFFeatureUnion
from sklearn.pipeline import Pipeline
import pandas as pd

df = pd.DataFrame({
    "sales_category": ['Sale', 'Sale', 'Not Sale', 'Not Sale'],
    "sales": [1500, 2000, 2250, 7830]
})

freq = Pipeline([
    ('select', Select('sales_category')),
    ('freq', FreqFeature())
])

binned = Pipeline([
    ('select', Select('sales')),
    ('bin', Binner(bins=[0, 1000, 2000, 8000]))
])

union = DFFeatureUnion([
    ('sales_category', freq),
    ('sales', binned)
])

union.fit_transform(df)

Out[1]:
   sales_category         sales
0             0.5  (1000, 2000]
1             0.5  (1000, 2000]
2             0.5  (2000, 8000]
3             0.5  (2000, 8000]
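Since the transformers follow the usual scikit-learn fit/transform contract, the fitted union can be reused on unseen rows - a small sketch continuing the example above (the new DataFrame is made up for illustration):

# New, unseen data with the same columns as the training DataFrame
new_df = pd.DataFrame({
    "sales_category": ['Sale'],
    "sales": [1200]
})

# The frequencies and bins learned during fit_transform above are applied to the new row
union.transform(new_df)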