
Training and evaluation of machine learning models through encapsulated functions and classes


mlcomposer logo

:robot: Applying Machine Learning in an easy and efficient way like you never did before! :robot:





About mlcomposer

Have you ever tried to apply Machine Learning to solve a problem and found yourself lost in how much code you needed to reach your goal? Have you ever thought it was too hard? Don't be afraid: this package can certainly help you improve your skills and your modeling flow. Meet mlcomposer, a useful Python package that provides built-in classes and functions for applying Machine Learning as easily as possible.

With mlcomposer, you can:

  • Build data prep solutions with custom pipelines, using Python classes that handle the transformation steps
  • Train and extract all the information you need from basic Machine Learning tasks like classification and regression
  • Build and visualize custom evaluation charts like performance reports, confusion matrices, ROC curves, and others
  • Save models in pkl or joblib format, applying hyperparameter optimization in a single method execution
  • Compare the performance of multiple models at once
  • Much more...

Package Structure

The package is built around two main modules called transformers and trainer. The first contains custom Python classes written strategically to improve the construction of pipelines based on sklearn's native Pipeline class. The second is a powerful tool for training and evaluating Machine Learning models, with a class for each different task (binary classification, multiclass classification, and regression at this time). We will dive deep into those pieces throughout this documentation and I'm sure you will like it!

Transformers Module

As said at the top of this section, the transformers module holds custom classes with useful transformations to be applied in data prep pipelines. To make it possible to integrate these classes into a preparation pipeline built with sklearn's Pipeline, every class in the transformers module inherits from sklearn's BaseEstimator and TransformerMixin. This way, the transformation code itself is written inside a transform() method in each class, which allows a sequence of steps to be executed outside the module in a more complex Pipeline and lets users plug their own custom classes into the same preparation flow.
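
To illustrate the pattern, here is a minimal sketch of a hypothetical user-defined transformer (not part of mlcomposer) that follows the same BaseEstimator/TransformerMixin convention and could be chained alongside the module's classes:

from sklearn.base import BaseEstimator, TransformerMixin

# Hypothetical user-defined transformer, shown only to illustrate the pattern
class ColumnDropper(BaseEstimator, TransformerMixin):

    def __init__(self, cols_to_drop):
        self.cols_to_drop = cols_to_drop

    def fit(self, X, y=None):
        # Nothing to learn in this stateless transformation
        return self

    def transform(self, X):
        # Drops the configured columns from a pandas DataFrame
        return X.drop(columns=self.cols_to_drop, errors='ignore')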

If things still seem a little complicated, the table below lists all the classes built into the transformers module. After that, you will find an example of a data prep pipeline written with some of those classes.

| Class | Short Description |
| --- | --- |
| ColumnFormatter | Applies custom column formatting to a pandas DataFrame to standardize column names |
| ColumnSelection | Filters columns of a DataFrame based on a list passed as a class attribute |
| BinaryTargetMapping | Transforms a raw target column into a binary one (1 or 0) based on a positive class argument |
| DropDuplicates | Drops duplicated rows from a dataset |
| DataSplitter | Splits data into new sets using the train_test_split() function |
| CategoricalLimitter | Limits entries in categorical columns, restricting them based on an "n_cat" argument |
| CategoricalMapper | Receives a dictionary for mapping entries in categorical columns |
| DtypeModifier | Changes column dtypes based on a "mod_dict" argument |
| DummiesEncoding | Applies a dummy (one-hot) encoding process using the pd.get_dummies() method |
| FillNullData | Fills null data (only in selected columns, if needed) |
| FeatureSelection | Uses feature importance analysis to select the top k features for the model |
| LogTransformation | Applies numpy's log1p() to log-transform all numerical data |
| DynamicLogTransformation | Holds a boolean flag for applying (or not) the log transformation (so it can be tuned afterwards) |
| DynamicScaler | Applies normalization to data (StandardScaler or MinMaxScaler, based on an application flag) |
| ModelResults | Receives a set of features and an estimator and builds a DataFrame with source data and predictions |

With the table above, we can imagine a set of custom transformations that can be applied one by one in data preparation pipelines. The snippet below simulates a pipeline construction with some of those classes. The idea is to create a block of code that automatically fills null data with a dummy number, applies log transformation and scaling to numerical features, and finally applies dummy encoding to categorical features. Let's see how it can be built.

from mlcomposer.transformers import FillNullData, DummiesEncoding, DynamicLogTransformation, DynamicScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# NUM_FEATURES, COLS_TO_LOG, CAT_FEATURES_FINAL and CAT_FEATURES are
# user-defined lists with the column names of each feature group

# Building a numerical pipeline
num_pipeline = Pipeline([
    ('imputer', FillNullData(value_fill=-999)),
    ('log_transformer', DynamicLogTransformation(application=True, num_features=NUM_FEATURES, 
                                                 cols_to_log=COLS_TO_LOG)),
    ('scaler', DynamicScaler(scaler_type='Standard'))
])

# Building a categorical pipeline
cat_pipeline = Pipeline([
    ('encoder', DummiesEncoding(dummy_na=True, cat_features_final=CAT_FEATURES_FINAL))
])

# Building a complete pipeline
prep_pipeline = ColumnTransformer([
    ('num', num_pipeline, NUM_FEATURES),
    ('cat', cat_pipeline, CAT_FEATURES)
])

Done! With a few steps and some Python tricks, it was possible to build a resilient pipeline that handles all the specified transformation steps. The best part is that we didn't need to write those custom classes ourselves, since the transformers module did it for us.
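
From this point on, prep_pipeline behaves like any other sklearn transformer. A minimal usage sketch, assuming df_train and df_val are hypothetical pandas DataFrames holding the raw features:

# Fitting the pipeline on training data and transforming it
X_train_prep = prep_pipeline.fit_transform(df_train)

# Reusing the fitted pipeline on validation data without refitting
X_val_prep = prep_pipeline.transform(df_val)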


Trainer Module

Well, what can we do after designing a complete pipeline for the data preparation step? The answer is straightforward: train and evaluate Machine Learning models! And that's where the trainer module of mlcomposer comes in. With trainer, we can use ready-made classes that handle the common steps of training and evaluating models for basic ML approaches like binary and multiclass classification or linear regression.

So it's possible to describe the trainer module as an extremely powerful tool that gives the user the ability to solve complex problems and generate useful analyses with a few lines of code whenever there's a basic ML problem. It provides classes that encapsulate most of the steps needed to complete the solution. To make it clearer, the table below shows what kinds of problems the trainer module can solve or, in other words, what classes are built into the module.

| Class | Short Description |
| --- | --- |
| BinaryClassifier | Solves binary classification problems, delivering useful methods for training and evaluating multiple models at once |
| MulticlassClassifier | Solves multiclass classification problems, delivering useful methods for training and evaluating multiple models at once |
| LinearRegressor | Solves linear regression problems, delivering useful methods for training and evaluating multiple models at once |

Below is an example of the trainer module in use, going through the complete flow of a binary classification problem. Two powerful methods of the BinaryClassifier class are applied to train, evaluate, extract feature importance, and plot the confusion matrix, ROC curve, score distribution, and other visual analyses: the training_flow() and visual_analysis() methods. Take a look:

# [...] other imports, variable definitions and data preprocessing
from mlcomposer.trainer import BinaryClassifier

# Creating an object
trainer = BinaryClassifier()

# Training and evaluating models
trainer.training_flow(set_classifiers, X_train_prep, y_train, X_val_prep, y_val, 
                      features=MODEL_FEATURES, metrics_output_path=METRICS_OUTPUT_PATH,
                      models_output_path=MODELS_OUTPUT_PATH)

# Generating and saving figures for visual analysis
trainer.visual_analysis(features=MODEL_FEATURES, model_shap=MODEL_SHAP, 
                        output_path=IMGS_OUTPUT_PATH)
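
The set_classifiers argument comes from the setup elided at the top of the snippet. As an assumption (the scripts in the examples/ directory show the exact format expected by the package), it maps each model name to an estimator and its hyperparameter search space, along the lines of this hypothetical sketch:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Hypothetical structure: model name -> estimator and hyperparameter space
set_classifiers = {
    'LogisticRegression': {
        'model': LogisticRegression(max_iter=1000),
        'params': {'C': [0.1, 1.0, 10.0]}
    },
    'RandomForestClassifier': {
        'model': RandomForestClassifier(),
        'params': {'n_estimators': [100, 300], 'max_depth': [5, 10]}
    }
}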

The simple execution of these two methods on binary classification problems can generate the following output:

└── output
    ├── imgs
    │   ├── confusion_matrix.png
    │   ├── feature_importances.png
    │   ├── learning_curve.png
    │   ├── metrics_comparison.png
    │   ├── roc_curve.png
    │   ├── score_bins_percent.png
    │   ├── score_bins.png
    │   ├── score_distribution.png
    │   └── shap_analysis_modelname.png
    ├── metrics
    │   ├── metrics.csv
    │   └── top_features.csv
    └── models
        └── modelname_date.pkl

Installing the Package

The latest version of the mlcomposer package is published and available on the PyPI repository.

:pushpin: Note: as a good practice for every Python project, creating a virtual environment is recommended to keep full control over the dependencies and third-party packages in your code. The code below can be used to create a new venv on your OS.

# Creating and activating a venv on Linux
$ python -m venv <path_venv>/<name_venv>
$ source <path_venv>/<name_venv>/bin/activate

# Creating and activating a venv on Windows
$ python -m venv <path_venv>/<name_venv>
$ <path_venv>/<name_venv>/Scripts/activate

With the new venv active, all you need to do is run the code below to install the mlcomposer package with pip (upgrading pip is optional):

$ pip install --upgrade pip
$ pip install mlcomposer

The mlcomposer package is built on top of other Python packages like pandas, numpy, sklearn, and shap. Because of that, when installing mlcomposer, pip will also install all the dependencies linked to the package.

Examples

To make the package as easy as possible to use for new users, this GitHub repository includes a directory named examples/ containing Python scripts with complete examples that use the transformers and trainer modules to solve different Machine Learning problems with different approaches.

Usage Around the World

The mlcomposer package was first introduced in Kaggle's Tabular Playground Series of May 2021, in the linked notebook by user Thiago Panini. That notebook also uses the xplotter package for the Exploratory Data Analysis, which is also a good way to get insights from the data. So far, the notebook has earned a bronze medal and around 15 upvotes.


Contribution

The mlcomposer Python package is an open source implementation, and the more people use it, the happier the developers will be. So if you want to contribute to mlcomposer, please feel free to follow the best practices for coding in this GitHub repository: create new branches, open merge requests, and point out whenever you think there is a new topic to explore or a bug to fix.

Thank you very much for reading this far, and it will be a pleasure to have you as an mlcomposer user or developer.

