A package for automated ML model training and creation of pipelines capable of handling multiple estimators.
Orpheus
What is Orpheus?
Orpheus stands for Optimized Robust Pipelines for Heuristic Ensemble Utilization and Selection.
Orpheus provides a tool for data scientists and machine learning engineers to automate pipeline construction and optimization, and to experiment with various combinations of preprocessing techniques and estimators. Orpheus is built on top of the scikit-learn library and is compatible with all scikit-learn estimators.
It is a Python package designed to automate the process of building and optimizing machine learning pipelines. These pipelines differ from the conventional Pipeline class in scikit-learn in that a pipeline can contain multiple estimators instead of just one. This class inherits from the scikit-learn Pipeline class and is called MultiEstimatorPipeline.
Some common use cases for Orpheus include:
- Building and optimizing pipelines for regression and classification problems.
- Preprocessing data using a variety of techniques such as scaling, feature adding, and feature selection.
- Combining multiple estimators into a single pipeline.
- Evolving pipelines through stack generalization.
- Evaluating the performance of pipelines.
- Explaining features through LIME.
- Supporting custom metrics.
- Supporting time series data.
How to Use Orpheus
All steps can be controlled through a configuration file in YAML format, which is created when you first run the program with an instance of the ComponentService or PipelineOrchestrator class. You can edit this file to change the settings of all the preprocessing components. Detailed explanations of the various component settings are provided within the configuration file itself.
The preprocessing components are performed in the following order:
- Scaling component: identifies and applies the best scaler for the data.
- FeatureAdding component: adds recommended features to the data.
- FeatureRemoving component: implements various algorithms to remove poorly performing or redundant features.
- HyperTuner component: performs hyperparameter tuning through a three-round process, storing trained models and their performance. Each HyperTuner instance represents a single fold, acquired from the splits of an object that inherits from the BaseCrossValidator class in scikit-learn (e.g. TimeSeriesSplit, KFold, ShuffleSplit).
In addition to the configuration file, you can enable or disable individual components using the parameters of the ComponentService.initialize method.
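For example, the call below (mirroring the full example later in this README) runs scaling and feature removal but skips feature adding:

```python
# Enable or disable individual preprocessing components per run.
component_service.initialize(
    scale=True,            # run the Scaling component
    add_features=False,    # skip the FeatureAdding component
    remove_features=True,  # run the FeatureRemoving component
)
```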
MultiEstimatorPipeline
The MultiEstimatorPipeline class is a scikit-learn pipeline with additional functionality, the main addition being the ability to hold multiple estimators and make combined predictions with them. The estimators in a pipeline can be accessed through the MultiEstimatorPipeline.estimators attribute, a list in which the estimators are ordered by their score: the better the score, the higher the estimator's index in the list.
The scores can be updated through the score method and can be used to determine the weights of the estimators when making predictions. How the estimators are weighted by score can be inspected with the get_weights method.
Pipelines can be saved to disk and loaded again using the save and load methods.
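A minimal sketch of this workflow is shown below. The exact signatures of score, get_weights, save, and load are not spelled out in this README, so the arguments and the use of load as a class method are assumptions for illustration:

```python
from orpheus import MultiEstimatorPipeline

# Assumes `pipe` is a fitted MultiEstimatorPipeline and X_test/y_test exist.
pipe.score(X_test, y_test)   # update the estimator scores on fresh data (signature assumed)
print(pipe.get_weights())    # inspect how the estimators are weighted by score
print(pipe.estimators)       # estimators ordered by score (best = highest index)

pipe.save("my_pipeline")                           # persist the pipeline to disk
pipe = MultiEstimatorPipeline.load("my_pipeline")  # reload it later (classmethod assumed)
```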
Common Parameters
Most classes, including the components, share a common set of parameters:
- metric/scoring: a callable that takes two pd.Series objects and returns a float. This is the metric that will be optimized during pipeline execution. Examples include sklearn.metrics.mean_squared_error and sklearn.metrics.accuracy_score. Custom metric functions can also be used; in that case, they need to be registered through the PipelineOrchestrator.register_metric static method.
- config_path: a str representing the path to the configuration file of the components. This file specifies the hyperparameters and other settings for each component in the pipeline.
- maximize_scoring: a bool indicating whether to maximize or minimize the metric/scoring. If True, the pipeline will try to maximize the metric; if False, it will try to minimize it.
- verbose: an int representing the verbosity level. The higher the value, the more information is printed to the console during pipeline execution. The possible values are:
  - 0: no information is printed to the console.
  - 1: only warnings, errors, and critical messages are printed.
  - 2: only important informative messages and errors are printed.
  - 3: all messages, including errors, are printed.
In PipelineOrchestrator, if log_file_path is set, log output is written to that file instead of being printed to the console.
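As a sketch, a constructor call combining these common parameters might look as follows (assuming X_train, y_train, and a cv_obj are already defined, and that maximize_scoring and log_file_path are passed to the constructor):

```python
from sklearn.metrics import mean_squared_error
from orpheus import PipelineOrchestrator

orchestrator = PipelineOrchestrator(
    X_train,
    y_train,
    metric=mean_squared_error,    # callable taking two pd.Series, returning a float
    maximize_scoring=False,       # lower MSE is better, so minimize the metric
    config_path="./configurations.yaml",
    cv_obj=cv_obj,                # any scikit-learn cross-validator
    verbose=2,                    # important informative messages and errors
    log_file_path="orpheus.log",  # write logs to this file instead of the console
)
```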
Services
ComponentService
ComponentService is the service class that binds all preprocessing and training components together. It is responsible for all the preprocessing and training of the data. It also provides the ability to generate pipelines for the best base models and stacked models found by the hyperparameter tuning process; these pipelines include the preprocessing steps and estimators. Before the scaling process, binary features are excluded by default from the Scaling and FeatureAdding components. This prevents scaling binary features and deriving new features from them, which is generally undesirable.
In addition, the parameters ordinal_features and categorical_features can be used to specify ordinal and categorical features. These features will also be excluded from the Scaling and FeatureAdding process.
The ordinal_features parameter takes a dict as its value, where each key is a column name and each value is a list of that column's values ordered from low to high. The categorical_features parameter takes a list of column names as its value.
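For example (the column names are made up for illustration, and passing these parameters to the constructor is assumed):

```python
# Values of the ordinal column "size", ordered from low to high.
ordinal_features = {"size": ["small", "medium", "large"]}

# Plain list of column names for unordered categorical features.
categorical_features = ["color", "country"]

component_service = ComponentService(
    X_train, X_test, y_train, y_test,
    config_path=config_path,
    cv_obj=cv_obj,
    ordinal_features=ordinal_features,
    categorical_features=categorical_features,
)
```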
The estimator_list parameter allows you to provide your own list of uninitialized estimators. By default, it is set to None, and all scikit-learn estimators will be used, determined by the type of your data (classification or regression). If you wish to use only your custom estimators and exclude the default scikit-learn estimators, set use_sklearn_estimators_aside_estimator_list to False. Alternatively, estimators from other libraries with scikit-learn compatible interfaces, such as xgboost and lightgbm, can be added to estimator_list.
The parameters named above are available in both the ComponentService and PipelineOrchestrator classes.
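A sketch of passing a custom estimator list is shown below. It assumes xgboost is installed and that "uninitialized estimators" means the estimator classes themselves rather than instances:

```python
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor  # scikit-learn compatible interface

component_service = ComponentService(
    X_train, X_test, y_train, y_test,
    config_path=config_path,
    cv_obj=cv_obj,
    estimator_list=[RandomForestRegressor, XGBRegressor],  # uninitialized estimators
    use_sklearn_estimators_aside_estimator_list=False,     # use only the estimators above
)
```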
Basic usage of the ComponentService class:
```python
import pandas as pd
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.datasets import make_regression
from orpheus import ComponentService, PipelineEvolverService, MultiEstimatorPipeline

config_path = "./configurations.yaml"

# create a cross-validation object. Replace with your own cv object
cv_obj = ShuffleSplit(n_splits=3)

# create a synthetic dataset. Replace with your own data
X, y = make_regression(
    n_samples=1000,
    n_features=5,
    random_state=42,
)
X = pd.DataFrame(X)
X.columns = [f"feature_{N}" for N in range(1, X.shape[1] + 1)]
y = pd.Series(y)

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)

if __name__ == "__main__":
    # initialize the ComponentService.
    # at first runtime, the program will create a config file if it doesn't exist yet.
    # you can edit this file to change the settings of all the preprocessing components
    # before running the program again.
    component_service = ComponentService(
        X_train,
        X_test,
        y_train,
        y_test,
        config_path=config_path,
        cv_obj=cv_obj,
        n_jobs=-1,
    )

    # kick off the preprocessing and training process.
    # settings per component are read from the config file and applied
    # to the preprocessing and training process when running this method.
    component_service.initialize(
        scale=True,
        add_features=True,
        remove_features=True,
    )

    # generate fitted pipelines for the best base models and stacked models
    # found by the hyperparameter tuning process.
    # these include the preprocessing steps and estimators.
    pipe_base: MultiEstimatorPipeline = component_service.generate_pipeline_for_base_models(top_n_per_tuner=5)
    pipe_stacked: MultiEstimatorPipeline = component_service.generate_pipeline_for_stacked_models(
        top_n_per_tuner_range=[3, 5]
    )

    # evolve the pipelines through stack generalization
    evolver = PipelineEvolverService(pipe_stacked)

    evolved_pipe_hv = evolver.evolve_voting(n_jobs=4, voting="hard")
    evolved_pipe_hv.fit(X_train, y_train)
    print(evolved_pipe_hv.score(X_test, y_test))

    evolved_pipe_sv = evolver.evolve_voting(n_jobs=4, voting="soft")
    evolved_pipe_sv.fit(X_train, y_train)
    print(evolved_pipe_sv.score(X_test, y_test))
```
PipelineOrchestrator
For a simpler and more high-level user interface, you can use the PipelineOrchestrator class.
This class provides full and easy control over the entire signal flow, from the preprocessing components to model validation (ComponentService is used under the hood). It takes a heuristic approach in which the dataset is split into three partitions: the train, test, and validation sets. This is done to ensure the quality of the models afterwards.
The train set is assigned the folds by the scikit-learn cross-validation object and should generally be the largest partition.
The second partition, here called the test set, is used to evaluate the models from the earlier training process. During this process, three generations of models are created. You can change this by setting the generations parameter in the PipelineOrchestrator.build() method.
The three generations are:
- Generation 1: Base. These are the top-performing base models discovered through the hyperparameter tuning process in the HyperTuner component. Each instantiated HyperTuner object serves as a "tuner" and also represents a single cross-validation fold. The number of models per tuner is determined by the top_n_per_tuner parameter in the PipelineOrchestrator.build() method.
- Generation 2: Stacked. These meta-models are formed by combining the base models from generation 1 using various ensemble methods, such as voting, stacking, and averaging.
- Generation 3: Evolved. This is a single meta-model created by ensembling the models from generation 2.
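As a sketch, both knobs are passed to build(), alongside the component switches shown in the full example below; the values here are illustrative, not defaults:

```python
# Assumes an `orchestrator` instance like the one in the full example below.
orchestrator.build(
    generations=3,       # number of model generations to create
    top_n_per_tuner=5,   # models kept per HyperTuner instance in generation 1
    scale=True,
    add_features=False,
    remove_features=False,
)
```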
After calling the PipelineOrchestrator.build() method, the models in the created pipelines can be validated with the PipelineOrchestrator.fortify() method. Here, stress tests are executed on the models in all pipeline generations, and models that do not pass the stress tests are removed from their pipeline. The validation set is used for this process.
Hierarchy diagram
This diagram provides a visual overview of how different components and services interact within the Orpheus framework:
Flowchart
Here is a concrete example of what parts of the complete training process are automated by Orpheus:
Basic usage of the PipelineOrchestrator class:
```python
import os

import pandas as pd
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from orpheus import PipelineOrchestrator

config_path = "./configurations.yaml"

# create a cross-validation object. Replace with your own cv object
cv_obj = ShuffleSplit(n_splits=4)

# create a synthetic dataset. Replace with your own data
X, y = make_regression(
    n_samples=1000,
    n_features=5,
    random_state=42,
)
X = pd.DataFrame(X)
X.columns = [f"feature_{N}" for N in range(1, X.shape[1] + 1)]
y = pd.Series(y)

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True
)

if __name__ == "__main__":
    orchestrator = PipelineOrchestrator(
        X_train,
        y_train,
        metric=r2_score,
        config_path=config_path,
        cv_obj=cv_obj,
        verbose=3,
        n_jobs=max(1, int(os.cpu_count() / 2)),
        shuffle=True,
        ensemble_size=0.1,
        validation_size=0.1,
    )

    (
        orchestrator
        .pre_optimize(max_splits=4)
        .build(
            scale=False,
            add_features=False,
            remove_features=False,
        )
        .fortify(
            optimize_n_jobs=True,
            threshold_score=0.90,
            plot_explaining=True,
        )
    )

    # make predictions
    pred_base = orchestrator.pipelines["base"].predict(X_test)
    pred_stacked = orchestrator.pipelines["stacked"].predict(X_test)
    pred_evolved = orchestrator.pipelines["evolved"].predict(X_test)

    # get an overview of the feature importances
    explained_features = orchestrator.get_explained_features()

    # save the pipelines to disk for later use
    orchestrator.pipelines["base"].save("base_pipeline")
    orchestrator.pipelines["stacked"].save("stacked_pipeline")
    orchestrator.pipelines["evolved"].save("evolved_pipeline")
```
Because of its simpler interface, the general advice is to use the PipelineOrchestrator class for all actions, unless you have a specific reason not to, for example when you need more fine-grained control.
Explanation of features
Features can be explained through LIME (Local Interpretable Model-agnostic Explanations). Explanations are done on a per-sample basis.
This is done by the PipelineOrchestrator.fortify() method. The plot_explaining parameter controls whether the explanations are plotted.
Setting the plot_explaining parameter to True will plot the explanations for the best base model, the best stacked model, and the evolved model.
Custom metrics
Custom metrics can be registered through the PipelineOrchestrator.register_metric static method. This method takes a callable as its only parameter. The callable should take two pd.Series objects as its parameters and return a float. The first pd.Series represents the true values; the second represents the predicted values.
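A minimal sketch of registering a custom metric under this calling convention:

```python
import numpy as np
import pandas as pd
from orpheus import PipelineOrchestrator

def mean_absolute_percentage_error(y_true: pd.Series, y_pred: pd.Series) -> float:
    """Custom metric: true values first, predicted values second (assumes no zeros in y_true)."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))

# register_metric is a static method, so it is called on the class itself.
PipelineOrchestrator.register_metric(mean_absolute_percentage_error)
```

After registration, the metric can presumably be passed as the metric parameter of PipelineOrchestrator, like any built-in scikit-learn metric.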
PipelineOrchestratorProxy
Metadata about each PipelineOrchestrator run can be stored in a (SQLite) database using the experimental PipelineOrchestratorProxy class. This class takes a PipelineOrchestrator as an argument in its constructor and adds the ability to store metadata about each run in a database.
This adds interesting new functionality, such as the ability to analyse the metadata from the database and find out which configurations of the components work best for a specific dataset.
The idea behind this experimental class is to use a surrogate model to find the best configuration of the components for a specific dataset. This is done by training the surrogate model on the configuration data from the database, with the scores used as the target. With each newly proposed configuration, more iterations are produced, which deliver new data to train the surrogate model. Using this technique, the surrogate model should eventually converge to the best configuration for the dataset.
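A usage sketch is shown below. This README only states that the constructor takes a PipelineOrchestrator, so the import path, the db_path keyword, and the idea of calling build() on the proxy are assumptions for illustration:

```python
from orpheus import PipelineOrchestrator, PipelineOrchestratorProxy  # import path assumed

orchestrator = PipelineOrchestrator(
    X_train, y_train, metric=r2_score, config_path=config_path, cv_obj=cv_obj
)

# Wrap the orchestrator; metadata about each run is persisted to a SQLite database.
proxy = PipelineOrchestratorProxy(orchestrator, db_path="orpheus_runs.db")  # db_path assumed
proxy.build()  # run as usual while recording this run's configuration and scores (assumed)
```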
Tips
If overfitting is a problem when using a classifier, consider adjusting the following settings in the YAML configuration file for the HyperTuner component:
- R2_weights: these weights can be adjusted to prioritize regularization on the estimator scores. A starting point may be {"best_mean": 0.9, "lowest_stdev": 0.3, "amount_of_unique_vals": 0.3}. It is important to understand that these weights are applied on a per-estimator-population basis during the round 2 process. For instance, if the RandomForestClassifier estimator population (meaning ALL trained instances of RandomForestClassifier during round 2) had the highest mean accuracy score of 0.85 in round 2, compared to all other trained estimator populations, and "best_mean" has the highest weight, there is a significant chance that RandomForestClassifier will be the estimator advancing to round 3.
- penalty_to_score_if_overfitting: increase this value towards 1.0 to impose a heavier penalty on overfitting. This can be very useful if, for example, a classifier's predictions are too one-sided, such as predicting only one class, which can be a sign of overfitting.
If you encounter memory or performance issues due to a large dataset, consider using the random_subset parameter in the YAML configuration file. This parameter, available in the Scaling, FeatureRemoving, and HyperTuner components, extracts a random subset of the data. Note that the indices may vary with each fitting iteration, the sole exception being the FeatureRemoving component.
If the program keeps hanging, use the log_cpu_memory_usage parameter in the constructor of PipelineOrchestrator to keep track of memory and CPU usage. If the hanging occurs in PipelineOrchestrator.build(), try the timeout_duration parameter.
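A sketch of both escape hatches together; whether log_cpu_memory_usage is a boolean and what unit timeout_duration uses are assumptions:

```python
orchestrator = PipelineOrchestrator(
    X_train,
    y_train,
    metric=r2_score,
    config_path=config_path,
    cv_obj=cv_obj,
    log_cpu_memory_usage=True,  # periodically log memory and CPU usage (boolean assumed)
)
orchestrator.build(timeout_duration=3600)  # give up on hanging work after a limit (seconds assumed)
```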
Release Notes
Version 1.3.12
- Dependency updates: updated the pandas dependency to allow versions up to 3.x and added setuptools to the dependencies so the version can be read from the package.
- Refactor: removed the categorical_features_pca_variance_threshold parameter from the PipelineOrchestrator and ComponentService classes, as this parameter's use case was too specific.

Release notes for older versions will be added in the future.