
A library for building machine learning pipelines and generating reports.


MLPR Library



:arrow_forward: For usage examples, click here.

This repository hosts a library in development named MLPR (Machine Learning Pipeline Report). It aims to facilitate the creation of machine learning models in areas such as regression, forecasting, classification, and clustering. The library lets the user tune these models, generate various plots for post-modeling analysis, and calculate a range of metrics.

In addition, MLPR supports creating metric reports with Jinja2, including the generated graphs. Users can customize the report template to their needs.
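
MLPR's own ReportGenerator API isn't shown on this page, but as a rough illustration of how a Jinja2-based report works (the template name, variables, and output path below are hypothetical):

from jinja2 import Environment, FileSystemLoader

# Hypothetical template and variables, for illustration only.
env = Environment(loader=FileSystemLoader("templates"))
template = env.get_template("report.html.j2")  # user-supplied template
html = template.render(metrics={"rmse": 54.09}, figures=["fig1.png"])
with open("report.html", "w", encoding="utf-8") as f:
    f.write(html)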

Using the MLPR Library

The MLPR library is a powerful tool for machine learning and data analysis. Here's a brief guide on how to use it.

Installation

Before you start, make sure the MLPR library is installed. You can do this by running:

pip install mlpr

Table of contents

The library currently supports the following machine learning features:

1. Regression: model selection over a parameter grid, metric calculation, and plot generation. This module supports Spark DataFrames. Click here for examples.

2. Classification metrics: built-in classification metrics, plus the ability to define your own. This module supports Spark DataFrames. Click here for examples.

3. Tuning: support for tuning supervised models. Click here for examples.

4. Classification uncertainty: support for uncertainty estimation methods. Click here for examples.

5. Surrogates: support for training surrogate models, i.e. white-box or less complex models that reproduce the behavior of a black-box or more complex model. Click here for examples.

6. Tuning (Spark support): grid search and model selection using the Spark framework for Python (pyspark).

Tuning

Click here for contents.

Using MLPR for model selection.

Importing the Library

First, import the necessary modules from the library:

from typing import Callable

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, cohen_kappa_score


from mlpr.ml.supervisioned.tunning.grid_search import GridSearch

Methods

Here we define a custom accuracy metric to use during model selection:

def custom_accuracy_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return accuracy_score(y_true, y_pred, normalize=False)
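
With normalize=False, accuracy_score returns the number of correctly classified samples rather than a fraction; for example:

custom_accuracy_score(np.array([0, 1, 1, 0]), np.array([0, 1, 0, 0]))  # returns 3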

Set parameters

n_samples = 1000
centers = [(0, 0), (3, 4.5)]
n_features = 2
cluster_std = 1.3
random_state = 42
cv = 5
np.random.seed(random_state)
params_split: dict[str, float | int] = {
    'test_size': 0.25,
    'random_state': random_state
}
params_norm: dict[str, bool] = {'with_mean': True, 'with_std': True}
model_metrics: dict[str, Callable] = {
    'custom_accuracy': custom_accuracy_score,
    'accuracy': accuracy_score,
    'precision': precision_score,
    'recall': recall_score,
    'kappa': cohen_kappa_score,
    'f1': f1_score,
}

Loading the Data

Load your dataset. In this example, we're generating a dataset for classification using sklearn:

X, y = make_blobs(
    n_samples=n_samples,
    centers=centers,
    n_features=n_features,
    cluster_std=cluster_std,
    random_state=random_state
)
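
As a quick sanity check, the generated arrays match the parameters above:

print(X.shape, y.shape)  # (1000, 2) (1000,)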

Plot the dataset

fig, ax = plt.subplots(1, 1, figsize=(14, 6))

ax.plot(X[:, 0][y == 0], X[:, 1][y == 0], "bs")
ax.plot(X[:, 0][y == 1], X[:, 1][y == 1], "g^")
ax.set_title("Dataset")
ax.set_frame_on(False)
ax.set_xticks([])
ax.set_yticks([])
fig.tight_layout()

[figure: the generated two-class dataset]

Cross-validation

models: dict[type[BaseEstimator], dict] = {
    RandomForestClassifier: {
        'n_estimators': [10, 50, 100, 200],
        'max_depth': [None, 5, 10, 15],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'random_state': [random_state]
    },
    GradientBoostingClassifier: {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.1, 0.05, 0.01, 0.005],
        'subsample': [0.5, 0.8, 1.0],
        'random_state': [random_state]
    },
    LogisticRegression: {
        'C': [0.01, 0.1, 1.0, 10.0],
        'penalty': ['l1', 'l2'],
        'solver': ['liblinear', 'saga'],
        'random_state': [random_state]
    },
    GaussianNB: {
        'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5]
    },
    KNeighborsClassifier: {
        'n_neighbors': [3, 5, 7, 9],
        'weights': ['uniform', 'distance'],
        'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']
    },
    SVC: {
        'C': [0.01, 0.1, 1.0, 10.0],
        'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
        'degree': [2, 3, 4],
        'gamma': ['scale', 'auto'],
        'random_state': [random_state]
    },
    DecisionTreeClassifier: {
        'criterion': ['gini', 'entropy'],
        'splitter': ['best', 'random'],
        'max_depth': [None, 5, 10, 15],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'random_state': [random_state]
    }
}
grid_search = GridSearch(
    X,
    y,
    params_split=params_split,
    models_params=models,
    normalize=True,
    params_norm=params_norm,
    scoring='accuracy',
    metrics=model_metrics
)
grid_search.search(cv=cv, n_jobs=-1)

best_model, best_params = grid_search.get_best_model()
results: pd.DataFrame = pd.DataFrame(grid_search._metrics).T
results
                            custom_accuracy  accuracy  precision    recall     kappa        f1
RandomForestClassifier                246.0     0.984   0.983607  0.983607  0.967982  0.983607
GradientBoostingClassifier            244.0     0.976   0.975410  0.975410  0.951972  0.975410
LogisticRegression                    245.0     0.980   0.983471  0.975410  0.959969  0.979424
GaussianNB                            244.0     0.976   0.975410  0.975410  0.951972  0.975410
KNeighborsClassifier                  245.0     0.980   0.983471  0.975410  0.959969  0.979424
SVC                                   245.0     0.980   0.983471  0.975410  0.959969  0.979424
DecisionTreeClassifier                242.0     0.968   0.991379  0.942623  0.935889  0.966387
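
The returned best_model is already fitted, so it can be scored directly on the held-out split (the X_test and y_test attributes are used the same way in the plotting code below):

y_pred = grid_search.best_model.predict(grid_search.X_test)
print(accuracy_score(grid_search.y_test, y_pred))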

Best model

Here we can see the full dataset alongside the test split used to evaluate the best classifier.

fig, ax = plt.subplots(1, 2, figsize=(14, 6))

ax[0].plot(X[:, 0][y == 0], X[:, 1][y == 0], "bs")
ax[0].plot(X[:, 0][y == 1], X[:, 1][y == 1], "g^")
ax[0].set_title("Dataset")
ax[0].set_frame_on(False)
ax[0].set_xticks([])
ax[0].set_yticks([])

ax[1].plot(
    grid_search.X_test[:, 0][grid_search.y_test == 0],
    grid_search.X_test[:, 1][grid_search.y_test == 0],
    "bs"
)
ax[1].plot(
    grid_search.X_test[:, 0][grid_search.y_test == 1],
    grid_search.X_test[:, 1][grid_search.y_test == 1],
    "g^"
)
ax[1].set_title(grid_search.best_model.__class__.__name__)
ax[1].set_frame_on(False)
ax[1].set_xticks([])
ax[1].set_yticks([])
fig.tight_layout()

[figure: the dataset alongside the test split for the best model]

Classification: uncertainty estimation

Click here for contents.

How to use the module for uncertainty estimation in classification tasks.

Importing the Library

First, import the necessary modules from the library:

from typing import Any, Callable
from functools import partial

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.datasets import make_blobs
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             precision_score, recall_score)
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

from mlpr.ml.supervisioned.classification.uncertainty import UncertaintyPlots
from mlpr.ml.supervisioned.classification.utils import calculate_probas
from mlpr.ml.supervisioned.tunning.grid_search import GridSearch

import warnings
warnings.filterwarnings("ignore")

Parameters

Setting parameters for the experiments. Since this dataset has three classes (one per center), precision, recall, and F1 are wrapped with functools.partial to use average='macro'.

random_state: int = 42
n_feats: int = 2
n_size: int = 1000
centers: list[tuple] = [
    (0, 2),
    (2, 0),
    (5, 4.5)
]
n_class: int = len(centers)
cluster_std: list[float] = [1.4, 1.4, 0.8]
cv: int = 5
np.random.seed(random_state)
params: dict[str, Any] = {
    "n_samples": n_size,
    "n_features": n_feats,
    "centers": centers,
    "cluster_std": cluster_std,
    "random_state": random_state
}
np.random.seed(random_state)
params_split: dict[str, float | int] = {
    'test_size': 0.25,
    'random_state': random_state
}
params_norm: dict[str, bool] = {'with_mean': True, 'with_std': True}

model_metrics: dict[str, Callable] = {
    'custom_accuracy': partial(accuracy_score, normalize=False),
    'accuracy': accuracy_score,
    'precision': partial(precision_score, average='macro'),
    'recall': partial(recall_score, average='macro'),
    'kappa': cohen_kappa_score,
    'f1': partial(f1_score, average='macro'),
}

Load the dataset

Here we are generating a dataset for experiments, using blobs from scikit-learn.

X, y = make_blobs(
    **params
)

Plot the dataset

Behavior of the dataset used in the experiment.

markers = ['o', 'v', '^']
fig, ax = plt.subplots(1, 1, figsize=(14, 6))

# generate_colors is a user-defined helper (a possible implementation is
# sketched after this snippet); it produces n colors between two hex codes.
colors = generate_colors("FF4B3E", "1C2127", len(np.unique(y)))

for i, k in enumerate(np.unique(y)):
    ax.scatter(X[:, 0][y == k], X[:, 1][y == k], marker=markers[i % len(markers)], color=colors[i], label=f"c{i}")

ax.set_title("Dataset")
ax.set_frame_on(False)
ax.set_xticks([])
ax.set_yticks([])
for i, (center, color) in enumerate(zip(centers, colors)):
    ax.scatter(
        center[0],
        center[1],
        color="white",
        linewidths=3,
        marker="o",
        edgecolor="black",
        s=120,
        label="center" if i == 0 else None
    )
plt.legend()
fig.tight_layout()

[figure: three-class dataset with cluster centers]
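
The generate_colors helper used above is not defined on this page; a minimal stand-in that linearly interpolates between two hex colors (an assumption for illustration, not MLPR's API) could look like:

def generate_colors(start_hex: str, end_hex: str, n: int) -> list[str]:
    # Linearly interpolate n colors between two hex codes given without '#'.
    start = np.array([int(start_hex[i:i + 2], 16) for i in (0, 2, 4)])
    end = np.array([int(end_hex[i:i + 2], 16) for i in (0, 2, 4)])
    steps = np.linspace(0, 1, n)[:, None]
    rgb = (start + (end - start) * steps).astype(int)
    return ["#{:02x}{:02x}{:02x}".format(*(int(v) for v in c)) for c in rgb]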

Cross-validation

models: dict[type[BaseEstimator], dict] = {
    RandomForestClassifier: {
        'n_estimators': [10, 50, 100, 200],
        'max_depth': [None, 5, 10, 15],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'random_state': [random_state]
    },
    GradientBoostingClassifier: {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.1, 0.05, 0.01, 0.005],
        'subsample': [0.5, 0.8, 1.0],
        'random_state': [random_state]
    },
    LogisticRegression: {
        'C': [0.01, 0.1, 1.0, 10.0],
        'penalty': ['l1', 'l2'],
        'solver': ['liblinear', 'saga'],
        'random_state': [random_state],
        'max_iter': [10000]
    },
    GaussianNB: {
        'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5]
    },
    SVC: {
        'C': [0.01, 0.1, 1.0, 10.0],
        'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
        'degree': [2, 3, 4],
        'gamma': ['scale', 'auto'],
        'probability': [True],
        'random_state': [random_state]
    },
    DecisionTreeClassifier: {
        'criterion': ['gini', 'entropy'],
        'splitter': ['best', 'random'],
        'max_depth': [None, 5, 10, 15],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'random_state': [random_state]
    }
}
grid_search = GridSearch(
    X,
    y,
    params_split=params_split,
    models_params=models,
    normalize=True,
    params_norm=params_norm,
    scoring='accuracy',
    metrics=model_metrics
)
grid_search.search(cv=cv, n_jobs=-1)

best_model, best_params = grid_search.get_best_model()
results: pd.DataFrame = pd.DataFrame(grid_search._metrics).T
results
                            custom_accuracy  accuracy  precision    recall     kappa        f1
RandomForestClassifier                222.0     0.888   0.883566  0.885902  0.831666  0.882724
GradientBoostingClassifier            221.0     0.884   0.878830  0.881207  0.825583  0.878421
LogisticRegression                    230.0     0.920   0.915170  0.916987  0.879457  0.915662
GaussianNB                            231.0     0.924   0.919375  0.921682  0.885539  0.920046
SVC                                   230.0     0.920   0.915170  0.916987  0.879457  0.915662
DecisionTreeClassifier                214.0     0.856   0.848155  0.845622  0.782325  0.846414

Probabilities

Getting the probabilities used for uncertainty estimation.

probas = calculate_probas(grid_search.fitted, grid_search.X_train)
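
Judging by how probas is used below (indexed by model class name, then passed to pd.DataFrame), it maps each fitted model's class name to a 1-D array of per-sample confidence scores:

print(list(probas))  # ['RandomForestClassifier', 'GradientBoostingClassifier', ...]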

Plot best model result

Plotting the result for the best model.

up = UncertaintyPlots()
fig_un, ax_un = up.uncertainty(
    model_names=[[best_model.__class__.__name__]],
    probs={best_model.__class__.__name__: probas[best_model.__class__.__name__]},
    X=grid_search.X_train,
    figsize=(20, 6),
    cmap='RdYlGn',
    show_inline=True,
    box_on=False
)

[figure: uncertainty map for the best model]

Plot overall uncertainty

Plotting an overview of the uncertainty estimated for each model.

sorted_models = results.sort_values("accuracy", ascending=False).index.tolist()

# Arrange the model names into a "pyramid" of subplot rows: one model in the
# first row, two in the second, three in the third, and so on. For the six
# models above this yields rows of sizes [1, 2, 3].
pyramid = []
i = 0
for row in range(1, len(sorted_models)):
    if i + row <= len(sorted_models):
        pyramid.append(sorted_models[i:i + row])
        i += row
    else:
        break

# Append any leftover models as a final row.
if i < len(sorted_models):
    pyramid.append(sorted_models[i:])

# If the final row came out shorter than the one before it, merge them.
if len(pyramid[-1]) < len(pyramid[-2]):
    pyramid[-2].extend(pyramid[-1])
    pyramid = pyramid[:-1]
up = UncertaintyPlots()
fig_un, ax_un = up.uncertainty(
    model_names=pyramid,
    probs=probas,
    X=grid_search.X_train,
    figsize=(20, 10),
    show_inline=True,
    cmap='RdYlGn',
    box_on=False
)

[figure: uncertainty maps for all models]

Aleatoric and epistemic uncertainty

The mean of the per-model confidences approximates the aleatoric (data) uncertainty, while their variance across models captures the epistemic (model) uncertainty:

data_probas = pd.DataFrame(probas)
aleatoric: pd.Series = data_probas.mean(axis=1)  # average confidence across models
epistemic: pd.Series = data_probas.var(axis=1)   # disagreement between models
up = UncertaintyPlots()
fig_both, ax_both = up.uncertainty(
    model_names=[["Aleatoric uncertainty", "Epistemic uncertainty"]],
    probs={
        "Aleatoric uncertainty": aleatoric,
        "Epistemic uncertainty": epistemic
    },
    X=grid_search.X_train,
    figsize=(20, 6),
    cmap='RdYlGn',
    show_inline=True,
    box_on=False
)

[figure: aleatoric and epistemic uncertainty maps]

Regression

Click here for contents.

How to use the module for regression problems.

Importing the Library

First, import the necessary modules from the library:

from mlpr.ml.supervisioned.regression import metrics, plots
from mlpr.ml.supervisioned.tunning.grid_search import GridSearch
from mlpr.reports.create import ReportGenerator
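
The snippet above omits several imports this example relies on (numpy, pandas, and the scikit-learn and XGBoost estimators), as well as the rmse metric passed to GridSearch below; a minimal sketch of the missing pieces:

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR
from xgboost import XGBRegressor


def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Root mean squared error, used as the scoring metric in GridSearch below.
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))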

Loading the Data

Load your dataset. In this example, we're generating a dataset for regression using sklearn:

n_feats = 11
n_instances = 1000
n_invert = 50
n_noise = 20
cv = 10
X, y = make_regression(n_samples=n_instances, n_features=n_feats, noise=n_noise)

# introduce outliers by reflecting some targets about the maximum
indices = np.random.choice(y.shape[0], size=n_invert, replace=False)
y[indices] = np.max(y) - y[indices]

data = pd.DataFrame(data=X, columns=[f'feature_{i}' for i in range(1, n_feats + 1)])
data['target'] = y

Set the seed

Set the random seed for reproducibility

n_seed = 42
np.random.seed(n_seed)

Preparing the Data

Split your data into features ($X$) and target ($y$):

X = data.drop("target", axis=1)
y = data["target"].values

Model Training

Define the parameters for your models and use GridSearch to find the best model:

models_params = {
    Ridge: {
        'alpha': [1.0, 10.0, 15., 20.],
        'random_state': [n_seed]
    },
    Lasso: {
        'alpha': [0.1, 1.0, 10.0],
        'random_state': [n_seed]
    },
    SVR: {
        'C': [0.1, 1.0, 10.0],
        'kernel': ['linear', 'rbf']
    },
    RandomForestRegressor: {
        'n_estimators': [10, 50, 100],
        'max_depth': [None, 5, 10],
        'random_state': [n_seed]
    },
    GradientBoostingRegressor: {
        'n_estimators': [100, 200],
        'learning_rate': [0.1, 0.05, 0.01],
        'random_state': [n_seed]
    },
    XGBRegressor: {
        'n_estimators': [100, 200],
        'learning_rate': [0.1, 0.05, 0.01],
        'random_state': [n_seed]
    }
}

params_split: dict[str, float | int] = {
    'test_size': 0.25,
    'random_state': n_seed
}
params_norm: dict[str, bool] = {'with_mean': True, 'with_std': True}

grid_search = GridSearch(
    X,
    y,
    params_split=params_split,
    models_params=models_params,
    normalize=True,
    scoring='neg_mean_squared_error',
    metrics={'neg_mean_squared_error': rmse},
    params_norm=params_norm
)
grid_search.search(cv=5, n_jobs=-1)

best_model, best_params = grid_search.get_best_model()

Making Predictions

Use the best model to make predictions:

data_train = pd.DataFrame(
    grid_search.X_train,
    columns=X.columns
)
data_train["y_true"] = grid_search.y_train
data_train["y_pred"] = grid_search.best_model.predict(grid_search.X_train)

Evaluating the Model

Calculate various metrics to evaluate the performance of the model. The classification-style metrics (confusion_matrix, calculate_kappa) are computed by discretizing the continuous target into n_bins intervals:

k = 3
rm = metrics.RegressionMetrics(data_train, "y_true", "y_pred")
results = rm.calculate_metrics(
    ["mape", "rmse", "kolmogorov_smirnov", "confusion_matrix", "calculate_kappa"],
    {
        "mape": {},
        "rmse": {},
        "kolmogorov_smirnov": {},
        "confusion_matrix": {"n_bins": k},
        "calculate_kappa": {"n_bins": k}
    }
)

Results

The output is a dictionary with the calculated metrics, like this:

{'mape': 39.594540526956436,
 'rmse': 54.09419440169204,
 'kolmogorov_smirnov': (0.1510574018126888, 0.0010310446878578096),
 'confusion_matrix': (array([[54, 57,  2,  0],
         [16, 70, 21,  0],
         [ 0, 37, 37,  3],
         [ 0,  6, 25,  3]]),
  {'precision': array([0.77142857, 0.41176471, 0.43529412, 0.5       ]),
   'recall': array([0.47787611, 0.65420561, 0.48051948, 0.08823529]),
   'f1_score': array([0.59016393, 0.50541516, 0.45679012, 0.15      ]),
   'support': array([113, 107,  77,  34]),
   'accuracy': 0.4954682779456193}),
 'calculate_kappa': {0: {'confusion_matrix': array([[202,  16],
          [ 59,  54]]),
   'kappa_score': 0.4452885840055415,
   'metrics': {'precision': array([0.77394636, 0.77142857]),
    'recall': array([0.9266055 , 0.47787611]),
    'f1_score': array([0.8434238 , 0.59016393]),
    'support': array([218, 113]),
    'accuracy': 0.7734138972809668}},
  1: {'confusion_matrix': array([[124, 100],
          [ 37,  70]]),
   'kappa_score': 0.180085703437178,
   'metrics': {'precision': array([0.77018634, 0.41176471]),
    'recall': array([0.55357143, 0.65420561]),
    'f1_score': array([0.64415584, 0.50541516]),
    'support': array([224, 107]),
    'accuracy': 0.5861027190332326}},
  2: {'confusion_matrix': array([[206,  48],
          [ 40,  37]]),
   'kappa_score': 0.2813579394059016,
   'metrics': {'precision': array([0.83739837, 0.43529412]),
    'recall': array([0.81102362, 0.48051948]),
    'f1_score': array([0.824     , 0.45679012]),
    'support': array([254,  77]),
    'accuracy': 0.7341389728096677}},
  3: {'confusion_matrix': array([[294,   3],
          [ 31,   3]]),
   'kappa_score': 0.12297381546134645,
   'metrics': {'precision': array([0.90461538, 0.5       ]),
    'recall': array([0.98989899, 0.08823529]),
    'f1_score': array([0.94533762, 0.15      ]),
    'support': array([297,  34]),
    'accuracy': 0.8972809667673716}}}}
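
Individual metrics can be read straight from this dictionary, for example:

print(results["rmse"])                               # 54.094...
print(results["calculate_kappa"][0]["kappa_score"])  # 0.4452...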

Visualizing the Results

Plot the results using the RegressionPlots module:

rp = plots.RegressionPlots(
    data_train,
    color_palette=["#FF4B3E", "#1C2127"]
)
fig, axs = rp.grid_plot(
    plot_functions=[
        ['graph11', 'graph12', 'graph13'],
        ['graph21', 'graph22'],
        ['graph23']
    ],
    plot_args={
        'graph11': {
            "plot": "scatter",
            "params": {
                'y_true_col': 'y_true',
                'y_pred_col': 'y_pred',
                'linecolor': '#1C2127',
                'worst_interval': True,
                'metrics': rm.metrics["calculate_kappa"],
                'class_interval': rm._class_intervals,
                'method': 'recall',
                'positive': True
            }
        },
        'graph12': {
            "plot": "plot_ecdf",
            "params": {
                'y_true_col': 'y_true',
                'y_pred_col': 'y_pred'
            }
        },
        'graph21': {
            "plot": "plot_kde",
            "params": {
                'columns': ['y_true', 'y_pred']
            }
        },
        'graph22': {
            "plot": "plot_error_hist",
            "params": {
                'y_true_col': 'y_true',
                'y_pred_col': 'y_pred',
                'linecolor': '#1C2127'
            }
        },
        'graph13': {
            "plot": "plot_fitted",
            "params": {
                'y_true_col': 'y_true',
                'y_pred_col': 'y_pred',
                'condition': (
                    (
                        rm._worst_interval_kappa[0] <= data_train["y_true"]
                    ) & (
                        data_train["y_true"] <= rm._worst_interval_kappa[1]
                    )
                ),
                'sample_size': None
            }
        },
        'graph23': {
            "plot": "plot_fitted",
            "params": {
                'y_true_col': 'y_true',
                'y_pred_col': 'y_pred',
                'condition': None,
                'sample_size': None
            }
        },
    },
    show_inline=True
)

[figure: grid of regression diagnostic plots]
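
To embed the grid in a report, the returned matplotlib figure can be saved in the usual way (the filename here is just an example):

fig.savefig("regression_plots.png", dpi=300, bbox_inches="tight")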

Reports

Here you can see the report output.

Contact

Here you can find my contact information:


License

This project is licensed under the terms of the MIT license. For more details, see the LICENSE file in the project's root directory.
