Hyperparameter optimization for multiple machine learning algorithms using Optuna, with Scikit-learn API
OptuML: Hyperparameter Optimization for Machine Learning Algorithms using Optuna
⣰⡁ ⡀⣀ ⢀⡀ ⣀⣀ ⢀⡀ ⣀⡀ ⣰⡀ ⡀⢀ ⣀⣀ ⡇ ⠄ ⣀⣀ ⣀⡀ ⢀⡀ ⡀⣀ ⣰⡀ ⡎⢱ ⣀⡀ ⣰⡀ ⠄ ⣀⣀ ⠄ ⣀⣀ ⢀⡀ ⡀⣀
⢸ ⠏ ⠣⠜ ⠇⠇⠇ ⠣⠜ ⡧⠜ ⠘⠤ ⠣⠼ ⠇⠇⠇ ⠣ ⠇ ⠇⠇⠇ ⡧⠜ ⠣⠜ ⠏ ⠘⠤ ⠣⠜ ⡧⠜ ⠘⠤ ⠇ ⠇⠇⠇ ⠇ ⠴⠥ ⠣⠭ ⠏
OptuML (Optuna + ML) is a Python module that tunes the hyperparameters of machine learning algorithms with the Optuna framework. It exposes a scikit-learn compatible API and adds early stopping, timeouts, and parallel cross-validation on top of the optimization loop.
tl;dr
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from optuml import Optimizer
# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Create and train optimizer
clf = Optimizer(algorithm="RandomForestClassifier", n_trials=50, cv=5, scoring="accuracy")
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
Key Features
- Comprehensive Algorithm Support: A broad selection of scikit-learn classifiers and regressors, plus optional CatBoost, XGBoost, and LightGBM
- Full Scikit-learn Compatibility: Seamless integration with pipelines, cross-validation, and all sklearn tools
- Robust Optimization: Powered by Optuna with early stopping, timeout protection, and parallel execution
- Type-Safe Design: Separate optimizers for classification and regression with proper type checking
- Production Ready: Cross-platform compatibility, comprehensive error handling, and extensive validation
- Flexible Configuration: Control every aspect of the optimization process
Installation
Option A: pip (recommended)
pip install optuml
With optional algorithm support:
pip install optuml[all] # CatBoost + XGBoost + LightGBM
pip install optuml[catboost] # CatBoost only
pip install optuml[xgboost] # XGBoost only
pip install optuml[lightgbm] # LightGBM only
or upgrade:
pip install optuml --upgrade
Option B: Manual installation
# Install required dependencies
pip install optuna scikit-learn numpy
# Optional: Install additional algorithms
pip install catboost xgboost lightgbm
# Download the module
wget https://raw.githubusercontent.com/filipsPL/optuml/main/optuml/optuml.py
Quick Start
Classification Example
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from optuml import Optimizer
# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create and train optimizer
clf = Optimizer(
    algorithm="RandomForestClassifier",
    n_trials=50,
    cv=5,
    scoring="accuracy",
    random_state=42,
    show_progress_bar=True
)
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
# View results
print(f"Accuracy: {accuracy:.3f}")
print(f"Best parameters: {clf.best_params_}")
print(f"Optimization took: {clf.study_time_:.2f} seconds")
print(f"Trials completed: {clf.n_trials_completed_}")
Regression Example
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from optuml import Optimizer
# Load data
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create and train optimizer
reg = Optimizer(
    algorithm="XGBRegressor",
    n_trials=100,
    cv=5,
    scoring="r2",
    early_stopping_patience=10,  # Stop if no improvement for 10 trials
    n_jobs=-1,  # Use all CPU cores for CV
    verbose=True
)
reg.fit(X_train, y_train)
# Evaluate
y_pred = reg.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2:.3f}")
Supported Algorithms
Classification Algorithms
| Algorithm | Description | Key Features |
|---|---|---|
| SVC | Support Vector Classifier | Non-linear kernels, probability estimates |
| LogisticRegression | Logistic Regression | L1/L2/Elastic-Net regularization |
| RidgeClassifier | Ridge Classifier | L2 regularization, fast linear model |
| KNeighborsClassifier | k-Nearest Neighbors | Distance weighting, various metrics |
| RandomForestClassifier | Random Forest | Feature importance, OOB score |
| ExtraTreesClassifier | Extremely Randomized Trees | Faster than RF, reduced variance |
| AdaBoostClassifier | AdaBoost | Boosted ensemble, learning rate tuning |
| GradientBoostingClassifier | Gradient Boosting | Sequential boosting, feature subsampling |
| HistGradientBoostingClassifier | Histogram Gradient Boosting | Fast GBDT, native NaN support |
| MLPClassifier | Neural Network | Multiple architectures, early stopping |
| GaussianNB | Gaussian Naive Bayes | Fast, probabilistic |
| QDA | Quadratic Discriminant Analysis | Non-linear boundaries |
| DecisionTreeClassifier | Decision Tree | Multiple criteria, pruning |
| SGDClassifier | Stochastic Gradient Descent | Multiple losses, L1/L2/ElasticNet, online |
| CatBoostClassifier* | CatBoost | Categorical features, GPU support |
| XGBClassifier* | XGBoost | Regularization, missing values |
| LGBMClassifier* | LightGBM | Fast GBDT, leaf-wise growth |
Regression Algorithms
| Algorithm | Description | Key Features |
|---|---|---|
| SVR | Support Vector Regression | Epsilon-insensitive loss |
| LinearRegression | Linear Regression | Simple, interpretable |
| Ridge | Ridge Regression | L2 regularization, stable on collinear features |
| Lasso | Lasso Regression | L1 regularization, feature selection |
| ElasticNet | Elastic Net | L1+L2 regularization, sparse solutions |
| KNeighborsRegressor | k-Nearest Neighbors | Local regression |
| RandomForestRegressor | Random Forest | Reduces overfitting |
| ExtraTreesRegressor | Extremely Randomized Trees | Faster than RF, reduced variance |
| AdaBoostRegressor | AdaBoost | Sequential learning |
| GradientBoostingRegressor | Gradient Boosting | Sequential boosting, feature subsampling |
| HistGradientBoostingRegressor | Histogram Gradient Boosting | Fast GBDT, native NaN support |
| MLPRegressor | Neural Network | Non-linear patterns |
| DecisionTreeRegressor | Decision Tree | Non-parametric |
| SGDRegressor | Stochastic Gradient Descent | Multiple losses, L1/L2/ElasticNet, online |
| CatBoostRegressor* | CatBoost | Handles categoricals |
| XGBRegressor* | XGBoost | High performance |
| LGBMRegressor* | LightGBM | Fast GBDT, leaf-wise growth |
*Optional dependencies (install separately)
Advanced Features
Early Stopping
Stop optimization when no improvement is observed:
optimizer = Optimizer(
    algorithm="XGBClassifier",
    n_trials=1000,
    early_stopping_patience=20  # Stop after 20 trials without improvement
)
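How OptuML implements early_stopping_patience internally is not described here. As a rough illustration of the general technique, the sketch below follows Optuna's callback convention (a callable that receives the study and the finished trial, and may call study.stop()). The class name PatienceStopper and its attributes are hypothetical and are not part of OptuML.

```python
class PatienceStopper:
    """Hypothetical callback: stop after `patience` trials without improvement."""

    def __init__(self, patience, direction="maximize"):
        self.patience = patience
        # Flip the sign so that "larger is better" holds for both directions.
        self.sign = 1.0 if direction == "maximize" else -1.0
        self.best = None        # best signed value seen so far
        self.stale_trials = 0   # consecutive trials without improvement

    def __call__(self, study, trial):
        value = trial.value  # None for failed or pruned trials
        if value is not None and (self.best is None or self.sign * value > self.best):
            self.best = self.sign * value
            self.stale_trials = 0
        else:
            self.stale_trials += 1
        if self.stale_trials >= self.patience:
            study.stop()  # ask the study to halt after the current trial


# Assumed usage with a plain Optuna study (not executed here):
# study.optimize(objective, n_trials=1000, callbacks=[PatienceStopper(20)])
```

With a patience of 20, a run configured for 1000 trials ends as soon as 20 consecutive trials fail to beat the best score so far.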
Parallel Cross-Validation
Speed up optimization using multiple CPU cores:
optimizer = Optimizer(
    algorithm="RandomForestClassifier",
    n_trials=100,
    cv=10,
    n_jobs=-1  # Use all available cores
)
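For readers unfamiliar with negative n_jobs values, the sketch below shows the convention used by scikit-learn and joblib: -1 means all cores, -2 means all cores but one. It assumes that OptuML forwards n_jobs to scikit-learn's cross-validation, where work is split per fold, so no more than cv workers can be busy at once. The helper names resolve_n_jobs and effective_workers are hypothetical.

```python
import os


def resolve_n_jobs(n_jobs, n_cpus=None):
    """Translate an n_jobs value into a concrete worker count (joblib convention)."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # -1 -> all cores, -2 -> all cores but one, and so on.
        return max(1, n_cpus + 1 + n_jobs)
    return n_jobs


def effective_workers(n_jobs, cv, n_cpus=None):
    """Workers that can actually be busy when parallelism is per CV fold."""
    return min(resolve_n_jobs(n_jobs, n_cpus), cv)


print(effective_workers(n_jobs=-1, cv=10))
```

On a machine with more cores than folds, raising cv (or parallelizing elsewhere) is the only way to keep the extra cores occupied.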
Custom Scoring Metrics
Use any scikit-learn compatible scoring metric:
optimizer = Optimizer(
    algorithm="SVC",
    scoring="roc_auc",  # For classification
    # scoring="neg_mean_squared_error",  # For regression
    # scoring="f1_weighted",  # For imbalanced classes
)
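Scikit-learn scorer names follow a "greater is better" convention, which is why error metrics appear negated (neg_mean_squared_error) and why they work with the default direction of "maximize". The plain-Python sketch below illustrates the idea on made-up numbers; the model names and predictions are invented for the example.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def neg_mean_squared_error(y_true, y_pred):
    """Negated MSE, so that a larger score means a better model."""
    return -mean_squared_error(y_true, y_pred)


y_true = [3.0, 5.0, 7.0]
candidates = {
    "model_a": [3.0, 5.0, 8.0],  # one prediction off by 1
    "model_b": [1.0, 5.0, 9.0],  # two predictions off by 2
}

scores = {name: neg_mean_squared_error(y_true, pred) for name, pred in candidates.items()}
best = max(scores, key=scores.get)  # maximizing the negated error picks the lowest MSE
print(scores, best)
```

Maximizing the negated error selects model_a, the candidate with the smaller mean squared error.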
Timeout Protection
Set time limits for optimization:
optimizer = Optimizer(
    algorithm="MLPClassifier",
    timeout=300,  # Total optimization timeout (5 minutes)
    cv_timeout=30,  # Per-trial timeout (30 seconds)
    n_trials=1000  # Will stop at timeout even if trials remain
)
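The two limits interact as follows: the total budget bounds the whole search, while the per-evaluation limit bounds each trial. The sketch below shows one way such a scheme could look using only the standard library. It is not OptuML's implementation, and evaluate_with_timeout and run_trials are hypothetical names. Note that a Python thread cannot be killed, so a timed-out evaluation keeps running in the background until it finishes on its own.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def evaluate_with_timeout(func, seconds):
    """Return func(), or None if it does not finish within `seconds`."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func)
    try:
        return future.result(timeout=seconds)
    except FutureTimeout:
        return None  # treat the evaluation as a failed trial
    finally:
        executor.shutdown(wait=False)  # do not block on a still-running worker


def run_trials(objective, n_trials, timeout, cv_timeout):
    """Run up to n_trials evaluations, stopping early once `timeout` has elapsed."""
    deadline = time.monotonic() + timeout
    results = []
    for _ in range(n_trials):
        if time.monotonic() >= deadline:
            break  # total budget exhausted, remaining trials are skipped
        results.append(evaluate_with_timeout(objective, cv_timeout))
    return results


print(run_trials(lambda: 0.9, n_trials=3, timeout=5.0, cv_timeout=1.0))
```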
Access to Optuna Study
Get detailed optimization information:
# After fitting
optimizer.fit(X_train, y_train)
# Access the Optuna study object
study = optimizer.study_
print(f"Best trial: {study.best_trial.number}")
print(f"Best value: {study.best_value:.4f}")
# Plot optimization history (requires plotly)
import optuna.visualization as vis
fig = vis.plot_optimization_history(study)
fig.show()
# Plot parameter importances
fig = vis.plot_param_importances(study)
fig.show()
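Beyond the plots, the trials themselves can be inspected directly. The sketch below ranks trials by value using the number, value, and params attributes that Optuna exposes on each entry of study.trials. The helper top_trials is hypothetical, and the demo uses stand-in objects so that it runs without a fitted optimizer.

```python
from types import SimpleNamespace


def top_trials(study, k=3, maximize=True):
    """Return (number, value, params) for the k best trials that produced a value."""
    finished = [t for t in study.trials if t.value is not None]
    ranked = sorted(finished, key=lambda t: t.value, reverse=maximize)
    return [(t.number, t.value, t.params) for t in ranked[:k]]


# Stand-in for optimizer.study_; a real Optuna study offers the same attributes.
demo_study = SimpleNamespace(trials=[
    SimpleNamespace(number=0, value=0.91, params={"C": 1.0}),
    SimpleNamespace(number=1, value=None, params={"C": 100.0}),  # failed trial
    SimpleNamespace(number=2, value=0.95, params={"C": 10.0}),
])

for number, value, params in top_trials(demo_study, k=2):
    print(number, value, params)
```

After fitting, the same helper could be called as top_trials(optimizer.study_).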
Pipeline Integration
Full compatibility with scikit-learn pipelines:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Create pipeline with OptuML
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('optimizer', Optimizer(algorithm="SVC", n_trials=50))
])
# Use like any sklearn pipeline
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
Type-Specific Optimizers
For more control, use the specific optimizer classes:
from optuml.optuml import ClassifierOptimizer, RegressorOptimizer
# Classifier with all classifier-specific methods
clf = ClassifierOptimizer(
    algorithm="RandomForestClassifier",
    n_trials=100
)
clf.fit(X_train, y_train)
probas = clf.predict_proba(X_test)
# decision_function() is only available when the underlying estimator provides it
# (e.g., SVC); scikit-learn's RandomForestClassifier does not
# decision = clf.decision_function(X_test)
# Regressor with regression-specific defaults
reg = RegressorOptimizer(
    algorithm="RandomForestRegressor",
    n_trials=100,
    scoring="r2"  # Default for regressors
)
API Reference
Main Classes
Optimizer
Universal optimizer that automatically selects between classification and regression.
ClassifierOptimizer
Specialized optimizer for classification algorithms with methods like predict_proba() and decision_function().
RegressorOptimizer
Specialized optimizer for regression algorithms with appropriate default scoring metrics.
Common Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| algorithm | str | required | ML algorithm to optimize |
| n_trials | int | 100 | Number of optimization trials |
| cv | int | 5 | Cross-validation folds |
| scoring | str/None | Auto* | Scoring metric for CV |
| direction | str | "maximize" | Optimization direction |
| timeout | float/None | None | Total optimization timeout (seconds) |
| cv_timeout | float | 120 | Single CV evaluation timeout (seconds) |
| random_state | int/None | None | Random seed for reproducibility |
| n_jobs | int | 1 | Parallel jobs for CV (-1 for all cores) |
| early_stopping_patience | int/None | None | Trials without improvement before stopping |
| verbose | bool/int | False | Verbosity level |
| show_progress_bar | bool | False | Show optimization progress |
*Auto defaults: "accuracy" for classifiers, "r2" for regressors
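To make the automatic defaults concrete, the sketch below shows one way the task type and default metric could be derived from the algorithm name alone. This is an assumption for illustration only: OptuML's actual selection logic is not documented here, and infer_task, default_scoring, and CLASSIFIERS_WITHOUT_SUFFIX are hypothetical names based on the algorithm tables above.

```python
# Classifier names from the supported-algorithms table that lack a "Classifier" suffix.
CLASSIFIERS_WITHOUT_SUFFIX = {"SVC", "LogisticRegression", "GaussianNB", "QDA"}


def infer_task(algorithm):
    """Guess the task type from the algorithm name (hypothetical rule)."""
    if algorithm.endswith("Classifier") or algorithm in CLASSIFIERS_WITHOUT_SUFFIX:
        return "classification"
    return "regression"


def default_scoring(algorithm, scoring=None):
    """Keep an explicit metric; otherwise fall back to the documented defaults."""
    if scoring is not None:
        return scoring
    return "accuracy" if infer_task(algorithm) == "classification" else "r2"


for name in ["SVC", "RidgeClassifier", "Ridge", "XGBRegressor"]:
    print(name, infer_task(name), default_scoring(name))
```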
Methods
| Method | Description | Available For |
|---|---|---|
| fit(X, y) | Optimize hyperparameters and train | All |
| predict(X) | Make predictions | All |
| score(X, y) | Evaluate model performance | All |
| predict_proba(X) | Predict class probabilities | Classifiers |
| decision_function(X) | Get decision values | Some classifiers |
| get_params() | Get optimizer parameters | All |
| set_params(**params) | Set optimizer parameters | All |
Attributes (after fitting)
| Attribute | Description |
|---|---|
| best_estimator_ | Trained model with best parameters |
| best_params_ | Best hyperparameters found |
| best_score_ | Best cross-validation score |
| study_ | Optuna study object |
| study_time_ | Total optimization time (seconds) |
| n_trials_completed_ | Number of completed trials |
| classes_ | Class labels (classifiers only) |
| n_features_in_ | Number of input features |
| feature_names_in_ | Feature names (if available) |
Troubleshooting
Issue: "No successful trials completed"
Solution: Increase cv_timeout or reduce cv folds:
optimizer = Optimizer(algorithm="SVC", cv_timeout=300, cv=3)
Issue: CatBoost/XGBoost/LightGBM not available
Solution: Install optional dependencies:
pip install optuml[all]
# or individually:
pip install catboost xgboost lightgbm
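Before requesting one of the optional algorithms, it can help to check which backends are importable in the current environment. The sketch below uses only the standard library. The OPTIONAL_BACKENDS mapping and available_backends helper are hypothetical and simply mirror the starred entries in the algorithm tables.

```python
from importlib.util import find_spec

# Optional packages and the algorithm names that depend on them.
OPTIONAL_BACKENDS = {
    "catboost": ["CatBoostClassifier", "CatBoostRegressor"],
    "xgboost": ["XGBClassifier", "XGBRegressor"],
    "lightgbm": ["LGBMClassifier", "LGBMRegressor"],
}


def available_backends():
    """Map each optional package to True if it can be imported, else False."""
    return {name: find_spec(name) is not None for name in OPTIONAL_BACKENDS}


for package, installed in available_backends().items():
    status = "available" if installed else "missing"
    print(f"{package}: {status} -> {', '.join(OPTIONAL_BACKENDS[package])}")
```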
Issue: Optimization takes too long
Solutions:
- Use parallel CV: n_jobs=-1
- Set timeout: timeout=600
- Use early stopping: early_stopping_patience=10
- Reduce trials: n_trials=50
Issue: Memory errors with large datasets
Solutions:
- Use algorithms with lower memory footprint (e.g., LogisticRegression, SGDClassifier, or SGDRegressor)
- Reduce CV folds
Best Practices
- Start with fewer trials: Begin with n_trials between 20 and 50 for exploration, then increase for final optimization
- Use appropriate scoring metrics:
  - Imbalanced classification: "f1_weighted", "roc_auc"
  - Regression: "r2", "neg_mean_squared_error"
- Enable early stopping for large trial counts: Optimizer(n_trials=1000, early_stopping_patience=20)
- Set random state for reproducibility: Optimizer(random_state=42)
- Use parallel processing for faster optimization: Optimizer(n_jobs=-1)
Benchmark
See this page for benchmark results.
Citation
If you use OptuML in your research, please cite:
@software{stefaniak_optuml_2024,
author = {Filip Stefaniak},
title = {OptuML: Hyperparameter Optimization for Multiple Machine Learning Algorithms using Optuna},
year = {2024},
publisher = {Zenodo},
doi = {10.5281/zenodo.17305963},
url = {https://doi.org/10.5281/zenodo.17305963}
}