A library for machine learning pipelines and report creation.
MLPR Library
This repository contains MLPR (Machine Learning Pipeline Report), a library under active development that aims to facilitate the creation of machine learning models in areas such as regression, forecasting, classification, and clustering. The library lets the user tune these models, generate various types of plots for post-modeling analysis, and calculate a variety of metrics.
In addition, MLPR supports the creation of metric reports using Jinja2, including the generated plots. The user can customize the report template according to their needs.
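For illustration, this is how a custom Jinja2 template can render a metrics dictionary into HTML. The template and the `metrics` variable name here are hypothetical, chosen for the example; they are not necessarily MLPR's actual report schema:

from jinja2 import Template

# Hypothetical template: `metrics` is an illustrative variable name,
# not necessarily what MLPR's report templates expect.
template = Template("""
<h1>Model report</h1>
<ul>
{% for name, value in metrics.items() %}
  <li>{{ name }}: {{ value }}</li>
{% endfor %}
</ul>
""")
html = template.render(metrics={"rmse": 54.09, "mape": 39.59})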
Using the MLPR Library
The MLPR library is a powerful tool for machine learning and data analysis. Here's a brief guide on how to use it.
Installation
Before you start, make sure you have the MLPR library installed:
pip install mlpr
Table of contents
The library currently supports the following machine learning features:
1. Regression: support for model selection over a grid of parameters, metric calculation, and plot generation. This module supports Spark DataFrames. Click here for examples.
2. Classification metrics: support for classification metrics, including user-defined custom metrics. This module supports Spark DataFrames. Click here for examples.
3. Tuning: support for tuning supervised models. Click here for examples.
4. Classification uncertainty: support for uncertainty estimation methods. Click here for examples.
5. Surrogates: support for training surrogate models, i.e., fitting a white-box or less complex model that reproduces the behavior of a black-box or more complex one (see the sketch after this list). Click here for examples.
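MLPR's surrogate API is not shown in this README, but the idea in item 5 can be sketched with plain scikit-learn: fit a simple white-box model on the predictions of a complex black-box model and measure how faithfully it mimics them.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X, y)

# The surrogate is trained on the black box's predictions, not the true labels.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))

# Fidelity: how often the surrogate agrees with the black box.
print(surrogate.score(X, black_box.predict(X)))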
Tuning
Click here for contents.
Using MLPR for model selection.
Importing the Library
First, import the necessary modules from the library:
from typing import Callable

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, cohen_kappa_score
from mlpr.ml.supervisioned.tunning.grid_search import GridSearch
Methods
Here we define a custom accuracy metric for model selection. With normalize=False, accuracy_score returns the number of correctly classified samples rather than a fraction, which is why the custom_accuracy column in the results below contains counts:
def custom_accuracy_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
return accuracy_score(y_true, y_pred, normalize=False)
Set parameters
n_samples = 1000
centers = [(0, 0), (3, 4.5)]
n_features = 2
cluster_std = 1.3
random_state = 42
cv = 5
np.random.seed(random_state)
params_split: dict[str, float | int] = {
'test_size': 0.25,
'random_state': random_state
}
params_norm: dict[str, bool] = {'with_mean': True, 'with_std': True}
model_metrics: dict[str, Callable] = {
'custom_accuracy': custom_accuracy_score,
'accuracy': accuracy_score,
'precision': precision_score,
'recall': recall_score,
'kappa': cohen_kappa_score,
'f1': f1_score,
}
Loading the Data
Load your dataset. In this example, we're generating a dataset for classification using sklearn:
X, y = make_blobs(
n_samples=n_samples,
centers=centers,
n_features=n_features,
cluster_std=cluster_std,
random_state=random_state
)
Plot the dataset
fig, ax = plt.subplots(1, 1, figsize=(14, 6))
ax.plot(X[:, 0][y == 0], X[:, 1][y == 0], "bs")
ax.plot(X[:, 0][y == 1], X[:, 1][y == 1], "g^")
ax.set_title("Dataset")
ax.set_frame_on(False)
ax.set_xticks([])
ax.set_yticks([])
fig.tight_layout()
Cross-validation
models: dict[BaseEstimator, dict] = {
RandomForestClassifier: {
'n_estimators': [10, 50, 100, 200],
'max_depth': [None, 5, 10, 15],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'random_state': [random_state]
},
GradientBoostingClassifier: {
'n_estimators': [50, 100, 200],
'learning_rate': [0.1, 0.05, 0.01, 0.005],
'subsample': [0.5, 0.8, 1.0],
'random_state': [random_state]
},
LogisticRegression: {
'C': [0.01, 0.1, 1.0, 10.0],
'penalty': ['l1', 'l2'],
'solver': ['liblinear', 'saga'],
'random_state': [random_state]
},
GaussianNB: {
'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5]
},
KNeighborsClassifier: {
'n_neighbors': [3, 5, 7, 9],
'weights': ['uniform', 'distance'],
'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']
},
SVC: {
'C': [0.01, 0.1, 1.0, 10.0],
'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
'degree': [2, 3, 4],
'gamma': ['scale', 'auto'],
'random_state': [random_state]
},
DecisionTreeClassifier: {
'criterion': ['gini', 'entropy'],
'splitter': ['best', 'random'],
'max_depth': [None, 5, 10, 15],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'random_state': [random_state]
}
}
grid_search = GridSearch(
X,
y,
params_split=params_split,
models_params=models,
normalize=True,
params_norm=params_norm,
scoring='accuracy',
metrics=model_metrics
)
grid_search.search(cv=cv, n_jobs=-1)
best_model, best_params = grid_search.get_best_model()
results: pd.DataFrame = pd.DataFrame(grid_search._metrics).T
results
| | custom_accuracy | accuracy | precision | recall | kappa | f1 |
|---|---|---|---|---|---|---|
| RandomForestClassifier | 246.0 | 0.984 | 0.983607 | 0.983607 | 0.967982 | 0.983607 |
| GradientBoostingClassifier | 244.0 | 0.976 | 0.975410 | 0.975410 | 0.951972 | 0.975410 |
| LogisticRegression | 245.0 | 0.980 | 0.983471 | 0.975410 | 0.959969 | 0.979424 |
| GaussianNB | 244.0 | 0.976 | 0.975410 | 0.975410 | 0.951972 | 0.975410 |
| KNeighborsClassifier | 245.0 | 0.980 | 0.983471 | 0.975410 | 0.959969 | 0.979424 |
| SVC | 245.0 | 0.980 | 0.983471 | 0.975410 | 0.959969 | 0.979424 |
| DecisionTreeClassifier | 242.0 | 0.968 | 0.991379 | 0.942623 | 0.935889 | 0.966387 |
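With the best estimator in hand, you can also score it directly on the held-out split that GridSearch keeps (the same X_test and y_test attributes are used for the plot below). A short check, assuming the returned estimator is already fitted, as the regression example later in this README does:

# Score the best model on the held-out test split kept by GridSearch.
y_pred = best_model.predict(grid_search.X_test)
print(f"test accuracy: {accuracy_score(grid_search.y_test, y_pred):.3f}")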
Best model
Here we can see the full dataset alongside the test split for the best classifier.
fig, ax = plt.subplots(1, 2, figsize=(14, 6))
ax[0].plot(X[:, 0][y == 0], X[:, 1][y == 0], "bs")
ax[0].plot(X[:, 0][y == 1], X[:, 1][y == 1], "g^")
ax[0].set_title("Dataset")
ax[0].set_frame_on(False)
ax[0].set_xticks([])
ax[0].set_yticks([])
ax[1].plot(
grid_search.X_test[:, 0][grid_search.y_test == 0],
grid_search.X_test[:, 1][grid_search.y_test == 0],
"bs"
)
ax[1].plot(
grid_search.X_test[:, 0][grid_search.y_test == 1],
grid_search.X_test[:, 1][grid_search.y_test == 1],
"g^"
)
ax[1].set_title(grid_search.best_model.__class__.__name__)
ax[1].set_frame_on(False)
ax[1].set_xticks([])
ax[1].set_yticks([])
fig.tight_layout()
Classification: uncertainty estimation
Click here for contents.
How to use the module for uncertainty estimation in classification tasks.
Importing the Library
First, import the necessary modules from the library:
from typing import Any, Callable
from functools import partial
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.datasets import make_blobs
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
precision_score, recall_score)
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from mlpr.ml.supervisioned.classification.uncertainty import UncertaintyPlots
from mlpr.ml.supervisioned.classification.utils import calculate_probas
from mlpr.ml.supervisioned.tunning.grid_search import GridSearch
import warnings
warnings.filterwarnings("ignore")
Parameters
Setting the parameters for the experiments. Since this dataset has three classes, precision, recall, and F1 are wrapped with functools.partial so that they use average='macro'.
random_state: int = 42
n_feats: int = 2
n_size: int = 1000
centers: list[tuple] = [
(0, 2),
(2, 0),
(5, 4.5)
]
n_class: int = len(centers)
cluster_std: list[float] = [1.4, 1.4, 0.8]
cv: int = 5
np.random.seed(random_state)
params: dict[str, Any] = {
"n_samples": n_size,
"n_features": n_feats,
"centers": centers,
"cluster_std": cluster_std,
"random_state": random_state
}
params_split: dict[str, float | int] = {
'test_size': 0.25,
'random_state': random_state
}
params_norm: dict[str, bool] = {'with_mean': True, 'with_std': True}
model_metrics: dict[str, Callable] = {
'custom_accuracy': partial(accuracy_score, normalize=False),
'accuracy': accuracy_score,
'precision': partial(precision_score, average='macro'),
'recall': partial(recall_score, average='macro'),
'kappa': cohen_kappa_score,
'f1': partial(f1_score, average='macro'),
}
Load the dataset
Here we generate a dataset for the experiments using blobs from scikit-learn.
X, y = make_blobs(
**params
)
Plot the dataset
Behavior of the dataset used in the experiment.
# `generate_colors` is not defined or imported in this excerpt; below is a
# minimal stand-in that interpolates between two hex colors (an assumption
# about what the original helper does).
from matplotlib.colors import LinearSegmentedColormap

def generate_colors(start_hex: str, end_hex: str, n: int) -> list:
    cmap = LinearSegmentedColormap.from_list("grad", [f"#{start_hex}", f"#{end_hex}"])
    return [cmap(i / max(n - 1, 1)) for i in range(n)]

markers = ['o', 'v', '^']
fig, ax = plt.subplots(1, 1, figsize=(14, 6))
colors = generate_colors("FF4B3E", "1C2127", len(np.unique(y)))
for i, k in enumerate(np.unique(y)):
ax.scatter(X[:, 0][y == k], X[:, 1][y == k], marker=markers[i % len(markers)], color=colors[i], label=f"c{i}")
ax.set_title("Dataset")
ax.set_frame_on(False)
ax.set_xticks([])
ax.set_yticks([])
for i, (center, color) in enumerate(zip(centers, colors)):
ax.scatter(
center[0],
center[1],
color="white",
linewidths=3,
marker="o",
edgecolor="black",
s=120,
label="center" if i == 0 else None
)
plt.legend()
fig.tight_layout()
Cross-validation
models: dict[BaseEstimator, dict] = {
RandomForestClassifier: {
'n_estimators': [10, 50, 100, 200],
'max_depth': [None, 5, 10, 15],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'random_state': [random_state]
},
GradientBoostingClassifier: {
'n_estimators': [50, 100, 200],
'learning_rate': [0.1, 0.05, 0.01, 0.005],
'subsample': [0.5, 0.8, 1.0],
'random_state': [random_state]
},
LogisticRegression: {
'C': [0.01, 0.1, 1.0, 10.0],
'penalty': ['l1', 'l2'],
'solver': ['liblinear', 'saga'],
'random_state': [random_state],
'max_iter': [10000]
},
GaussianNB: {
'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5]
},
SVC: {
'C': [0.01, 0.1, 1.0, 10.0],
'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
'degree': [2, 3, 4],
'gamma': ['scale', 'auto'],
'probability': [True],
'random_state': [random_state]
},
DecisionTreeClassifier: {
'criterion': ['gini', 'entropy'],
'splitter': ['best', 'random'],
'max_depth': [None, 5, 10, 15],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'random_state': [random_state]
}
}
grid_search = GridSearch(
X,
y,
params_split=params_split,
models_params=models,
normalize=True,
params_norm=params_norm,
scoring='accuracy',
metrics=model_metrics
)
grid_search.search(cv=cv, n_jobs=-1)
best_model, best_params = grid_search.get_best_model()
results: pd.DataFrame = pd.DataFrame(grid_search._metrics).T
results
| | custom_accuracy | accuracy | precision | recall | kappa | f1 |
|---|---|---|---|---|---|---|
| RandomForestClassifier | 222.0 | 0.888 | 0.883566 | 0.885902 | 0.831666 | 0.882724 |
| GradientBoostingClassifier | 221.0 | 0.884 | 0.878830 | 0.881207 | 0.825583 | 0.878421 |
| LogisticRegression | 230.0 | 0.920 | 0.915170 | 0.916987 | 0.879457 | 0.915662 |
| GaussianNB | 231.0 | 0.924 | 0.919375 | 0.921682 | 0.885539 | 0.920046 |
| SVC | 230.0 | 0.920 | 0.915170 | 0.916987 | 0.879457 | 0.915662 |
| DecisionTreeClassifier | 214.0 | 0.856 | 0.848155 | 0.845622 | 0.782325 | 0.846414 |
Probabilities
Computing the class probabilities used for the uncertainty estimates.
probas = calculate_probas(grid_search.fitted, grid_search.X_train)
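Judging by how it is indexed below, calculate_probas returns a mapping from each fitted model's class name to its per-sample probabilities; a quick way to inspect it:

# Keys are model class names (this is how `probas` is indexed below).
print(list(probas.keys()))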
Plot the best model's result
Plotting the estimated uncertainty for the best model.
up = UncertaintyPlots()
fig_un, ax_un = up.uncertainty(
model_names=[[best_model.__class__.__name__]],
probs={best_model.__class__.__name__: probas[best_model.__class__.__name__]},
X=grid_search.X_train,
figsize=(20, 6),
cmap='RdYlGn',
show_inline=True,
box_on=False
)
Plot an overview of uncertainty
Plotting an overview of the uncertainty estimated by each model. The loop below arranges the models, sorted by accuracy, into a pyramid layout: one panel in the first row, two in the second, and so on, as shown after the code.
sorted_models = results.sort_values("accuracy", ascending=False).index.tolist()
pyramid = []
i = 0
for row in range(1, len(sorted_models)):
if i + row <= len(sorted_models):
pyramid.append(sorted_models[i:i+row])
i += row
else:
break
if i < len(sorted_models):
pyramid.append(sorted_models[i:])
if len(pyramid[-1]) < len(pyramid[-2]):
pyramid[-2].extend(pyramid[-1])
pyramid = pyramid[:-1]
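For the six models above, sorted by accuracy, the loop yields rows of one, two, and three panels:

# e.g. with sorted_models == [m0, m1, m2, m3, m4, m5]:
# pyramid == [[m0], [m1, m2], [m3, m4, m5]]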
up = UncertaintyPlots()
fig_un, ax_un = up.uncertainty(
model_names=pyramid,
probs=probas,
X=grid_search.X_train,
figsize=(20, 10),
show_inline=True,
cmap='RdYlGn',
box_on=False
)
Aleatoric uncertainty and epistemic uncertainty
Here the mean of the probabilities across models is used as a proxy for aleatoric (random) uncertainty, and the variance across models as a proxy for epistemic uncertainty.
data_probas = pd.DataFrame(probas)
random: pd.Series = data_probas.mean(axis=1)
epistemic: pd.Series = data_probas.var(axis=1)
up = UncertaintyPlots()
fig_both, ax_both = up.uncertainty(
model_names=[["Random uncertainty", "Epistemic uncertainty"]],
probs={
"Random uncertainty": random,
"Epistemic uncertainty": epistemic
},
X=grid_search.X_train,
figsize=(20, 6),
cmap='RdYlGn',
show_inline=True,
box_on=False
)
Regression
Click here for contents.
How to use the module for regression problems.
Importing the Library
First, import the necessary modules from the library:
from mlpr.ml.supervisioned.regression import metrics, plots
from mlpr.ml.supervisioned.tunning.grid_search import GridSearch
from mlpr.reports.create import ReportGenerator
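The remainder of this example also uses names that are not imported above (NumPy, pandas, the scikit-learn estimators, XGBRegressor, and an rmse scorer). A minimal set of supporting imports, plus an rmse definition consistent with how it is used below:

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR
from xgboost import XGBRegressor

# `rmse` is passed to GridSearch as a metric below but never defined in this
# excerpt; this definition matches that usage.
def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))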
Loading the Data
Load your dataset. In this example, we're generating a dataset for regression using sklearn:
n_feats = 11
n_instances = 1000
n_invert = 50
n_noise = 20
cv = 10
X, y = make_regression(n_samples=n_instances, n_features=n_feats, noise=n_noise)
# introduce noise by inverting a random sample of targets
indices = np.random.choice(y.shape[0], size=n_invert, replace=False)
y[indices] = np.max(y) - y[indices]
data = pd.DataFrame(data=X, columns=[f'feature_{i}' for i in range(1, n_feats + 1)])
data['target'] = y
Set the seed
Set the random seed for reproducibility (for a fully reproducible run, this should happen before the data generation above).
n_seed = 42
np.random.seed(n_seed)
Preparing the Data
Split your data into features ($X$) and target ($y$):
X = data.drop("target", axis=1)
y = data["target"].values
Model Training
Define the parameters for your models and use GridSearch to find the best model:
models_params = {
Ridge: {
'alpha': [1.0, 10.0, 15., 20.],
'random_state': [n_seed]
},
Lasso: {
'alpha': [0.1, 1.0, 10.0],
'random_state': [n_seed]
},
SVR: {
'C': [0.1, 1.0, 10.0],
'kernel': ['linear', 'rbf']
},
RandomForestRegressor: {
'n_estimators': [10, 50, 100],
'max_depth': [None, 5, 10],
'random_state': [n_seed]
},
GradientBoostingRegressor: {
'n_estimators': [100, 200],
'learning_rate': [0.1, 0.05, 0.01],
'random_state': [n_seed]
},
XGBRegressor: {
'n_estimators': [100, 200],
'learning_rate': [0.1, 0.05, 0.01],
'random_state': [n_seed]
}
}
params_split: dict[str, float | int] = {
'test_size': 0.25,
'random_state': n_seed
}
params_norm: dict[str, bool] = {'with_mean': True, 'with_std': True}
grid_search = GridSearch(
X,
y,
params_split=params_split,
models_params=models_params,
normalize=True,
scoring='neg_mean_squared_error',
metrics={'neg_mean_squared_error': rmse},
params_norm=params_norm
)
grid_search.search(cv=5, n_jobs=-1)
best_model, best_params = grid_search.get_best_model()
Making Predictions
Use the best model to make predictions:
data_train = pd.DataFrame(
grid_search.X_train,
columns=X.columns
)
data_train["y_true"] = grid_search.y_train
data_train["y_pred"] = grid_search.best_model.predict(grid_search.X_train)
Evaluating the Model
Calculate various metrics to evaluate the performance of the model. Here k is the number of bins used to discretize the continuous target for the confusion-matrix and kappa metrics:
k = 3
rm = metrics.RegressionMetrics(
    data_train,
    "y_true",
    "y_pred"
)
results = rm.calculate_metrics(
["mape", "rmse", "kolmogorov_smirnov", "confusion_matrix", "calculate_kappa"],
{
"mape": {},
"rmse": {},
"kolmogorov_smirnov": {},
"confusion_matrix": {"n_bins": k},
"calculate_kappa": {"n_bins": k}
}
)
Results
The output is a dictionary with the calculated metrics, like this:
{'mape': 39.594540526956436,
'rmse': 54.09419440169204,
'kolmogorov_smirnov': (0.1510574018126888, 0.0010310446878578096),
'confusion_matrix': (array([[54, 57, 2, 0],
[16, 70, 21, 0],
[ 0, 37, 37, 3],
[ 0, 6, 25, 3]]),
{'precision': array([0.77142857, 0.41176471, 0.43529412, 0.5 ]),
'recall': array([0.47787611, 0.65420561, 0.48051948, 0.08823529]),
'f1_score': array([0.59016393, 0.50541516, 0.45679012, 0.15 ]),
'support': array([113, 107, 77, 34]),
'accuracy': 0.4954682779456193}),
'calculate_kappa': {0: {'confusion_matrix': array([[202, 16],
[ 59, 54]]),
'kappa_score': 0.4452885840055415,
'metrics': {'precision': array([0.77394636, 0.77142857]),
'recall': array([0.9266055 , 0.47787611]),
'f1_score': array([0.8434238 , 0.59016393]),
'support': array([218, 113]),
'accuracy': 0.7734138972809668}},
1: {'confusion_matrix': array([[124, 100],
[ 37, 70]]),
'kappa_score': 0.180085703437178,
'metrics': {'precision': array([0.77018634, 0.41176471]),
'recall': array([0.55357143, 0.65420561]),
'f1_score': array([0.64415584, 0.50541516]),
'support': array([224, 107]),
'accuracy': 0.5861027190332326}},
2: {'confusion_matrix': array([[206, 48],
[ 40, 37]]),
'kappa_score': 0.2813579394059016,
'metrics': {'precision': array([0.83739837, 0.43529412]),
'recall': array([0.81102362, 0.48051948]),
'f1_score': array([0.824 , 0.45679012]),
'support': array([254, 77]),
'accuracy': 0.7341389728096677}},
3: {'confusion_matrix': array([[294, 3],
[ 31, 3]]),
'kappa_score': 0.12297381546134645,
'metrics': {'precision': array([0.90461538, 0.5 ]),
'recall': array([0.98989899, 0.08823529]),
'f1_score': array([0.94533762, 0.15 ]),
'support': array([297, 34]),
'accuracy': 0.8972809667673716}}}}
Visualizing the Results
Plot the results using the RegressionPlots module:
rp = plots.RegressionPlots(
    data_train,
    color_palette=["#FF4B3E", "#1C2127"]
)
fig, axs = rp.grid_plot(
plot_functions=[
['graph11', 'graph12', 'graph13'],
['graph21', 'graph22'],
['graph23']
],
plot_args={
'graph11': {
"plot": "scatter",
"params": {
'y_true_col': 'y_true',
'y_pred_col': 'y_pred',
'linecolor': '#1C2127',
'worst_interval': True,
'metrics': rm.metrics["calculate_kappa"],
'class_interval': rm._class_intervals,
'method': 'recall',
'positive': True
}
},
'graph12': {
"plot": "plot_ecdf",
"params": {
'y_true_col': 'y_true',
'y_pred_col': 'y_pred'
}
},
'graph21': {
"plot": "plot_kde",
"params": {
'columns': ['y_true', 'y_pred']
}
},
'graph22': {
"plot": "plot_error_hist",
"params": {
'y_true_col': 'y_true',
'y_pred_col': 'y_pred',
'linecolor': '#1C2127'
}
},
'graph13': {
"plot": "plot_fitted",
"params": {
'y_true_col': 'y_true',
'y_pred_col': 'y_pred',
'condition': (
(
rm._worst_interval_kappa[0] <= data_train["y_true"]
) & (
data_train["y_true"] <= rm._worst_interval_kappa[1]
)
),
'sample_size': None
}
},
'graph23': {
"plot": "plot_fitted",
"params": {
'y_true_col': 'y_true',
'y_pred_col': 'y_pred',
'condition': None,
'sample_size': None
}
},
},
show_inline=True
)
Reports
Here you can see the report output.
Contact
Here you can find my contact information:
License
This project is licensed under the terms of the MIT license. For more details, see the LICENSE file in the project's root directory.