Fast and customizable framework for automatic ML model creation (AutoML)
LightAutoML - automatic model creation framework
LightAutoML (LAMA), a project from the Sberbank AI Lab AutoML group, is a framework for automatic classification and regression model creation.
Currently available tasks:
- binary classification
- multiclass classification
- regression
Currently we work with datasets where each row is an object with its own features and target. Multitable datasets and sequences are under construction :)
Note: for automatic creation of interpretable models we use the AutoWoE library, also made by our group.
Authors: Alexander Ryzhkov, Anton Vakhrushev, Dmitry Simakov, Vasilii Bunakov, Rinchin Damdinov, Pavel Shvets, Alexander Kirilin.
Quick tour
Let's solve the popular Kaggle Titanic competition below. There are two main ways to solve machine learning problems using LightAutoML:
- Use a ready-made preset for tabular data:
import pandas as pd
from sklearn.metrics import f1_score

from lightautoml.automl.presets.tabular_presets import TabularAutoML
from lightautoml.tasks import Task

df_train = pd.read_csv('../input/titanic/train.csv')
df_test = pd.read_csv('../input/titanic/test.csv')

automl = TabularAutoML(
    task = Task(
        name = 'binary',
        metric = lambda y_true, y_pred: f1_score(y_true, (y_pred > 0.5)*1))
)
oof_pred = automl.fit_predict(
    df_train,
    roles = {'target': 'Survived', 'drop': ['PassengerId']}
)
test_pred = automl.predict(df_test)

pd.DataFrame({
    'PassengerId':df_test.PassengerId,
    'Survived': (test_pred.data[:, 0] > 0.5)*1
}).to_csv('submit.csv', index = False)
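The custom metric in the preset example thresholds predicted probabilities at 0.5 and then scores the resulting labels with F1. The following is a minimal pure-Python sketch of that computation, written only to show what the lambda evaluates. The helper names `binarize`, `f1_binary` and `thresholded_f1` are hypothetical and are not part of LightAutoML or scikit-learn.

```python
from typing import List, Sequence


def binarize(probs: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Map probabilities to 0/1 labels; strictly above the threshold gives 1."""
    return [1 if p > threshold else 0 for p in probs]


def f1_binary(y_true: Sequence[int], y_label: Sequence[int]) -> float:
    """F1 for the positive class: 2*TP / (2*TP + FP + FN), or 0.0 if undefined."""
    pairs = list(zip(y_true, y_label))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def thresholded_f1(y_true: Sequence[int], y_prob: Sequence[float],
                   threshold: float = 0.5) -> float:
    """Same idea as f1_score(y_true, (y_pred > 0.5)*1) in the example above."""
    return f1_binary(y_true, binarize(y_prob, threshold))


# Toy data (made up for illustration, not Titanic results)
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.4, 0.3, 0.7, 0.6]
score = thresholded_f1(y_true, y_prob)
print(round(score, 4))
```

Because the comparison is strict, a probability of exactly 0.5 maps to class 0, which matches `y_pred > 0.5` in the example.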
- Build your own custom pipeline:
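import pandas as pd

from lightautoml.automl.base import AutoML
from lightautoml.ml_algo.boost_lgbm import BoostLGBM
from lightautoml.ml_algo.tuning.optuna import OptunaTuner
from lightautoml.pipelines.features.lgb_pipeline import LGBSimpleFeatures
from lightautoml.pipelines.ml.base import MLPipeline
from lightautoml.pipelines.selection.importance_based import ImportanceCutoffSelector, ModelBasedImportanceEstimator
from lightautoml.reader.base import PandasToPandasReader
from lightautoml.tasks import Task

# assumed example values for the constants used below; adjust to your machine and data
N_THREADS = 4
N_FOLDS = 5
RANDOM_STATE = 42

df_train = pd.read_csv('../input/titanic/train.csv')
df_test = pd.read_csv('../input/titanic/test.csv')

# define that machine learning problem is binary classification
task = Task("binary")
reader = PandasToPandasReader(task, cv=N_FOLDS, random_state=RANDOM_STATE)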

# create a feature selector
model0 = BoostLGBM(
    default_params={'learning_rate': 0.05, 'num_leaves': 64,
                    'seed': 42, 'num_threads': N_THREADS}
)
pipe0 = LGBSimpleFeatures()
mbie = ModelBasedImportanceEstimator()
selector = ImportanceCutoffSelector(pipe0, model0, mbie, cutoff=0)

# build first level pipeline for AutoML
pipe = LGBSimpleFeatures()
# stop after 20 iterations or after 30 seconds
params_tuner1 = OptunaTuner(n_trials=20, timeout=30)
model1 = BoostLGBM(
    default_params={'learning_rate': 0.05, 'num_leaves': 128,
                    'seed': 1, 'num_threads': N_THREADS}
)
model2 = BoostLGBM(
    default_params={'learning_rate': 0.025, 'num_leaves': 64,
                    'seed': 2, 'num_threads': N_THREADS}
)
pipeline_lvl1 = MLPipeline([
    (model1, params_tuner1),
    model2
], pre_selection=selector, features_pipeline=pipe, post_selection=None)

# build second level pipeline for AutoML
pipe1 = LGBSimpleFeatures()
model = BoostLGBM(
    default_params={'learning_rate': 0.05, 'num_leaves': 64,
                    'max_bin': 1024, 'seed': 3, 'num_threads': N_THREADS},
    freeze_defaults=True
)
pipeline_lvl2 = MLPipeline([model], pre_selection=None, features_pipeline=pipe1,
                           post_selection=None)

# build AutoML pipeline
automl = AutoML(reader, [
    [pipeline_lvl1],
    [pipeline_lvl2],
], skip_conn=False)

# train AutoML and get predictions
oof_pred = automl.fit_predict(df_train, roles = {'target': 'Survived', 'drop': ['PassengerId']})
test_pred = automl.predict(df_test)

pd.DataFrame({
    'PassengerId':df_test.PassengerId,
    'Survived': (test_pred.data[:, 0] > 0.5)*1
}).to_csv('submit.csv', index = False)
The LightAutoML framework has many ready-to-use parts and extensive customization options. To learn more, check out the Resources section.
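Both examples above name the result of `fit_predict` `oof_pred`, and the reader in the custom pipeline takes a `cv` argument. As a rough illustration of the general out-of-fold idea only (not LightAutoML's actual implementation, whose fold assignment and models differ), the sketch below predicts every row with a stand-in "model" fitted on the other folds. The functions `kfold_indices` and `out_of_fold_predictions` are hypothetical, and the stand-in model is simply the mean of the training targets.

```python
from statistics import mean
from typing import List, Sequence


def kfold_indices(n_rows: int, n_folds: int) -> List[List[int]]:
    """Assign row i to fold i % n_folds and return the validation indices per fold."""
    return [[i for i in range(n_rows) if i % n_folds == k] for k in range(n_folds)]


def out_of_fold_predictions(targets: Sequence[float], n_folds: int = 5) -> List[float]:
    """Predict each row using only rows from the other folds.

    The stand-in model is the mean of the training targets; a real pipeline
    would fit an actual learner (for example gradient boosting) at this step.
    """
    oof = [0.0] * len(targets)
    for valid_idx in kfold_indices(len(targets), n_folds):
        held_out = set(valid_idx)
        train_targets = [t for i, t in enumerate(targets) if i not in held_out]
        fold_prediction = mean(train_targets)  # "fit" on the training folds only
        for i in valid_idx:
            oof[i] = fold_prediction  # "predict" the held-out rows
    return oof


# Toy binary targets (made up for illustration)
targets = [1, 0, 0, 1, 1, 0]
oof_pred = out_of_fold_predictions(targets, n_folds=3)
print(oof_pred)
```

Since no row is ever predicted by a model that saw its own target, scores computed on out-of-fold predictions give a less optimistic estimate than scores on the training fit.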
Reference papers
Anton Vakhrushev, Alexander Ryzhkov, Dmitry Simakov, Rinchin Damdinov, Maxim Savchenko, Alexander Tuzhilin "LightAutoML: AutoML Solution for a Large Financial Services Ecosystem". arXiv:2109.01528, 2021.
Installation
Installation via pip from PyPI
To install the LAMA framework on your machine:
# Install base functionality:
pip install -U lightautoml
# Partial installation is also available
# Use extra dependencies = ['nlp', 'cv', 'report']
# Or use 'all' to install full functionality, for example:
pip install -U lightautoml[nlp]
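After installing, one way to confirm that the package is visible to the current interpreter is to query its distribution metadata. This is a small sketch using the standard library's `importlib.metadata` (Python 3.8+). The helper `installed_version` is a hypothetical name, and the distribution name `lightautoml` is assumed from the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution name assumed from `pip install -U lightautoml`
version = installed_version("lightautoml")
print("lightautoml:", version if version else "not installed")
```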
Additionally, run the following commands to enable report generation in PDF format:
# MacOS
brew install cairo pango gdk-pixbuf libffi
# Debian / Ubuntu
sudo apt-get install build-essential libcairo2 libpango-1.0-0 libpangocairo-1.0-0 libgdk-pixbuf2.0-0 libffi-dev shared-mime-info
# Fedora
sudo yum install redhat-rpm-config libffi-devel cairo pango gdk-pixbuf2
# Windows
# follow this tutorial https://weasyprint.readthedocs.io/en/stable/install.html#windows
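To get a quick hint about whether the native libraries listed above are discoverable, the standard library's `ctypes.util.find_library` can be used. This is only a rough sketch: the short names in `NATIVE_LIBS` are assumptions about how these libraries are usually named on Linux and macOS, the helper `check_native_libs` is hypothetical, and a `None` result means the lookup failed, not necessarily that the library is missing.

```python
from ctypes.util import find_library
from typing import Dict, Optional, Sequence

# Assumed short names for the libraries installed by the commands above
NATIVE_LIBS = ("cairo", "pango-1.0", "gdk_pixbuf-2.0", "ffi")


def check_native_libs(names: Sequence[str] = NATIVE_LIBS) -> Dict[str, Optional[str]]:
    """Map each short library name to what find_library resolves, or None."""
    return {name: find_library(name) for name in names}


status = check_native_libs()
for name, found in status.items():
    print(f"{name}: {found if found else 'not found by find_library'}")
```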
Installation from source code
First, install git and poetry.
# Load LAMA source code
git clone https://github.com/sberbank-ai-lab/LightAutoML.git
cd LightAutoML/
# !!!Choose only one item!!!
# 1. Global installation: Don't create virtual environment
poetry config virtualenvs.create false --local
# 2. Recommended: Create virtual environment inside your project directory
poetry config virtualenvs.in-project true
# For more information read poetry docs
# Install LAMA
poetry lock
poetry install
Resources
- Documentation of LightAutoML is available here; you can also generate it.
- Kaggle kernel examples of LightAutoML usage:
  - Tabular Playground Series April 2021 competition solution
  - Titanic competition solution (80% accuracy)
  - Titanic 12-code-lines competition solution (78% accuracy)
  - House prices competition solution
  - Natural Language Processing with Disaster Tweets solution
  - Tabular Playground Series March 2021 competition solution
  - Tabular Playground Series February 2021 competition solution
  - Interpretable WhiteBox solution
  - Custom ML pipeline elements inside existing ones
- To learn how to work with LightAutoML, see our tutorials and examples here. Some of them can be run in Google Colab:
  - Tutorial_1_basics.ipynb - get started with LightAutoML on tabular data.
  - Tutorial_2_WhiteBox_AutoWoE.ipynb - creating interpretable models.
  - Tutorial_3_sql_data_source.ipynb - shows how to use LightAutoML presets (both standalone and time utilized variants) for solving ML tasks on tabular data from an SQL database instead of CSV.
  - Tutorial_4_NLP_Interpretation.ipynb - example of using the TabularNLPAutoML preset and LimeTextExplainer.
  - Tutorial_5_uplift.ipynb - shows how to use LightAutoML for an uplift modeling task.
  - Tutorial_6_custom_pipeline.ipynb - shows how to create your own pipeline from specified blocks: pipelines for feature generation and feature selection, ML algorithms, hyperparameter optimization, etc.
  - Tutorial_7_ICE_and_PDP_interpretation.ipynb - shows how to obtain local and global interpretation of model results using ICE and PDP approaches.
Important 1: in production you do not need the profiler (it increases run time and memory consumption), so please do not turn it on. It is off by default.
Important 2: to take a look at this report after the run, please comment out the last line of the demo, which contains the report deletion command.
- LightAutoML crash courses:
- Video guides:
  - (Russian) LightAutoML webinar for Sberloga community (Alexander Ryzhkov, Dmitry Simakov)
  - (Russian) LightAutoML hands-on tutorial in Kaggle Kernels (Alexander Ryzhkov)
  - (English) Automated Machine Learning with LightAutoML: theory and practice (Alexander Ryzhkov)
  - (English) LightAutoML framework general overview, benchmarks and advantages for business (Alexander Ryzhkov)
  - (English) LightAutoML practical guide - ML pipeline presets overview (Dmitry Simakov)
- Articles about LightAutoML
Contributing to LightAutoML
If you are interested in contributing to LightAutoML, please read the Contributing Guide to get started.
Questions / Issues / Suggestions
Seek prompt advice in the Slack community or Telegram group.
Open bug reports and feature requests on GitHub issues.
Licence
This project is licensed under the Apache Licence, Version 2.0. See LICENSE file for more details.