Python package for automated hyperparameter optimization of common machine-learning algorithms

automl: Automated Machine Learning

Intro

automl is a Python project focused on automating much of the machine-learning work encountered in zero-dimensional regression and classification (and thus not multidimensional data such as for a CNN). It relies on the existing Python packages scikit-learn and Optuna, and on the model-specific packages LightGBM, CatBoost and XGBoost.

automl works by assessing the performance of various machine-learning models for a set number of trials over a pre-defined range of hyperparameters. During successive trials the hyperparameters are optimized following a user-defined methodology (the default optimisation uses Bayesian search). Unpromising trials are stopped (pruned) early by assessing performance on an incrementally increasing fraction of the training data, which saves computational resources. Hyperparameter optimization trials are stored locally on disk, so training can be resumed after an interruption. The best trials of the defined models are reloaded and combined, or stacked, to form a final model. This final model is assessed and, due to the nature of stacking, tends to outperform any of its constituent models.

automl contains several additional functionalities beyond hyperparameter optimization and stacking of models:

  • scaling of the input X-matrix (tested by default)
  • normal transformation of the y-matrix (tested by default)
  • PCA compression
  • spline transformation
  • polynomial expansion
  • categorical feature support (nominal and ordinal)
  • bagging of weak models in addition to optimized models
  • multithreading
  • feature-importance analyses with shap
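
Several of the transformations listed above have direct scikit-learn counterparts. The sketch below chains scaling, PCA compression and polynomial expansion in front of a regressor and applies a normal transformation to the target. It is only an illustration of the ideas: the specific transformers (StandardScaler, QuantileTransformer) and their settings are assumptions, not a description of what automl does internally.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import BayesianRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (PolynomialFeatures, QuantileTransformer,
                                   StandardScaler)

X, y = make_regression(n_samples=300, n_features=10, n_informative=2,
                       noise=5.0, random_state=42)

# Feature side: scale, keep enough components for 95% of the variance,
# then add pairwise interaction terms before fitting the regressor.
feature_pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    BayesianRidge(),
)

# Target side: map y to an approximately normal distribution for fitting;
# predictions are mapped back to the original scale automatically.
model = TransformedTargetRegressor(
    regressor=feature_pipeline,
    transformer=QuantileTransformer(n_quantiles=100,
                                    output_distribution="normal"),
)
model.fit(X, y)
predictions = model.predict(X)
print(predictions[:3])
```

A spline transformation could be slotted into the same pipeline with scikit-learn's SplineTransformer; it is left out here to keep the number of generated features small.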

Installation

Create a new environment so that pip install cannot break an existing setup. Include Python version 3.11.

conda create -n ENVNAME -c conda-forge python=3.11

Activate the new environment

conda activate ENVNAME

Install with pip

python3 -m pip install py-automl-lib

Optionally include the shap package for feature-importance analyses (see chapter 7 of examples/example_notebook.ipynb)

python3 -m pip install "py-automl-lib[shap]"

Use

For a more detailed example, check out examples/example_notebook.ipynb

Minimal use case regression:

from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from automl import AutomatedRegression

# synthetic regression data: 1000 samples, 10 features, 2 of them informative
X, y = make_regression(n_samples=1000, n_features=10, n_informative=2, random_state=42)

regression = AutomatedRegression(
    y=y,
    X=X,
    n_trial=10,
    timeout_study=100,
    metric_optimise=r2_score,
    optimisation_direction='maximize',
    models_to_optimize=['bayesianridge', 'lightgbm'],
    )

# run the optimisation, then inspect the summary
regression.apply()
regression.summary

Expanded options use case regression:

import pandas as pd
from optuna.samplers import TPESampler
from optuna.pruners import HyperbandPruner
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold
from automl import AutomatedRegression

X, y = make_regression(n_samples=1000, n_features=10, n_informative=2, random_state=42)

# -- adding categorical features
df_X = pd.DataFrame(X)
df_X['nine'] = pd.cut(df_X[9], bins=[-float('Inf'), -3, -1, 1, 3, float('Inf')], labels=['a', 'b', 'c', 'd', 'e'])
df_X['ten'] = pd.cut(df_X[9], bins=[-float('Inf'), -1, 1, float('Inf')], labels=['A', 'B', 'C'])
df_y = pd.Series(y)

regression = AutomatedRegression(
    y=df_y,
    X=df_X,
    test_frac=0.2,
    fit_frac=[0.2, 0.4, 0.6, 1],
    n_trial=50,
    timeout_study=600,
    timeout_trial=120,
    metric_optimise=r2_score,
    optimisation_direction='maximize',
    cross_validation=KFold(n_splits=5, shuffle=True, random_state=42),
    sampler=TPESampler(seed=42),
    pruner=HyperbandPruner(min_resource=1, max_resource='auto', reduction_factor=3),
    reload_study=False,
    reload_trial_cap=False,
    write_folder='/auto_regression_test',
    models_to_optimize=['bayesianridge', 'lightgbm'],
    nominal_columns=['nine'],
    ordinal_columns=['ten'],
    pca_value=0.95,
    spline_value={'n_knots': 5, 'degree':3},
    poly_value={'degree': 2, 'interaction_only': True},
    boosted_early_stopping_rounds=100,
    n_weak_models=5,
    random_state=42,
    )

regression.apply()
regression.summary
    
