A wrapper toolbox that provides compatibility layers between TPOT/Auto-Sklearn and OpenML
Arbok (AutoML wrapper toolbox for OpenML compatibility) provides wrappers for TPOT and Auto-Sklearn that act as a compatibility layer between these tools and OpenML.
The wrapper extends scikit-learn's BaseSearchCV and provides all the internal attributes that OpenML needs, such as cv_results_, best_index_, best_params_, best_score_ and classes_.
Installation
pip install arbok
Simple example
import openml

from arbok import AutoSklearnWrapper, TPOTWrapper

task = openml.tasks.get_task(31)
dataset = task.get_dataset()

# Get the AutoSklearn wrapper and pass parameters like you would to AutoSklearn
clf = AutoSklearnWrapper(
    time_left_for_this_task=3600, per_run_time_limit=360
)

# Or get the TPOT wrapper and pass parameters like you would to TPOT
clf = TPOTWrapper(
    generations=100, population_size=100, verbosity=2
)

# Execute the task
run = openml.runs.run_model_on_task(task, clf)
run.publish()
print('URL for run: %s/run/%d' % (openml.config.server, run.run_id))
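Because the wrapper follows the BaseSearchCV interface, the attributes mentioned above are also available after fitting the wrapper directly. A minimal sketch, using hypothetical toy data and short, purely illustrative time limits:

import numpy as np

from arbok import AutoSklearnWrapper

# Toy data for illustration only; in practice X and y would come from an
# OpenML dataset.
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# Short, illustrative time limits
clf = AutoSklearnWrapper(time_left_for_this_task=60, per_run_time_limit=15)
clf.fit(X, y)

# The BaseSearchCV-style attributes that OpenML relies on are now available
print(clf.best_params_)  # parameters of the best pipeline found
print(clf.best_score_)   # score of the best pipeline
print(clf.best_index_)   # index of the best entry in cv_results_
print(clf.classes_)      # class labels seen during fitting
print(clf.cv_results_)   # search results in BaseSearchCV format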
Preprocessing data
To make the wrapper more robust, we need to preprocess the data: fill in missing values and one-hot encode categorical data.
First, we get a mask that tells us whether a feature is a categorical feature or not.
dataset = task.get_dataset()
_, categorical = dataset.get_data(return_categorical_indicator=True)
categorical = categorical[:-1]  # Remove last index (which is the class)
Next, we set up a pipeline for the preprocessing. We use a ConditionalImputer, an imputer that can apply different strategies to categorical (nominal) and numerical data.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

from arbok import ConditionalImputer

preprocessor = make_pipeline(

    ConditionalImputer(
        categorical_features=categorical,
        strategy="mean",
        strategy_nominal="most_frequent"
    ),

    OneHotEncoder(
        categorical_features=categorical,
        handle_unknown="ignore",
        sparse=False
    )
)
And finally, we put everything together in one of the wrappers.
clf = AutoSklearnWrapper(
    preprocessor=preprocessor,
    time_left_for_this_task=3600,
    per_run_time_limit=360
)
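The preprocessor-equipped wrapper can then be run on the OpenML task exactly as in the simple example above, for instance:

# Run the preprocessor-equipped wrapper on the task, as in the simple example
run = openml.runs.run_model_on_task(task, clf)
run.publish()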
Limitations
- Currently only classifiers are implemented, so regression is not possible.
- For TPOT, the config_dict variable cannot be set, because this causes problems with the API.
Benchmarking
Installing the arbok package also installs the arbench CLI tool. We can generate a JSON configuration file like this:
from arbok.bench import Benchmark

bench = Benchmark()

config_file = bench.create_config_file(

    # Wrapper parameters
    wrapper={"refit": True, "verbose": False, "retry_on_error": True},

    # TPOT parameters
    tpot={
        "max_time_mins": 6,              # Max total time in minutes
        "max_eval_time_mins": 1          # Max time per candidate in minutes
    },

    # Autosklearn parameters
    autosklearn={
        "time_left_for_this_task": 360,  # Max total time in seconds
        "per_run_time_limit": 60         # Max time per candidate in seconds
    }
)
And then, we can call arbench like this:
arbench --classifier autosklearn --task-id 31 --config config.json
Or, calling arbok as a Python module:
python -m arbok --classifier autosklearn --task-id 31 --config config.json
Running a benchmark on batch systems
To run a large-scale benchmark, we can create a configuration file as above and then generate and submit jobs to a batch system as follows.
import openml

from arbok.bench import Benchmark

# We create a benchmark setup where we specify the headers, the interpreter we
# want to use, the directory where we store the jobs (.sh files), and we give
# it the config file we created earlier.
bench = Benchmark(
    headers="#PBS -lnodes=1:cpu3\n#PBS -lwalltime=1:30:00",
    python_interpreter="python3",  # Path to interpreter
    root="/path/to/project/",
    jobs_dir="jobs",
    config_file="config.json",
    log_file="log.json"
)

# Create the config file like we did in the section above
config_file = bench.create_config_file(

    # Wrapper parameters
    wrapper={"refit": True, "verbose": False, "retry_on_error": True},

    # TPOT parameters
    tpot={
        "max_time_mins": 6,              # Max total time in minutes
        "max_eval_time_mins": 1          # Max time per candidate in minutes
    },

    # Autosklearn parameters
    autosklearn={
        "time_left_for_this_task": 360,  # Max total time in seconds
        "per_run_time_limit": 60         # Max time per candidate in seconds
    }
)

# Next, we load the tasks we want to benchmark on from OpenML.
# In this case, we load a list of task id's from study 99.
tasks = openml.study.get_study(99).tasks

# Next, we create jobs for both tpot and autosklearn.
bench.create_jobs(tasks, classifiers=["tpot", "autosklearn"])

# And finally, we submit the jobs using qsub
bench.submit_jobs()
Preprocessing parameters
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import make_pipeline

from arbok import ParamPreprocessor

X = np.array([
    [1, 2, True, "foo", "one"],
    [1, 3, False, "bar", "two"],
    [np.nan, "bar", None, None, "three"],
    [1, 7, 0, "zip", "four"],
    [1, 9, 1, "foo", "five"],
    [1, 10, 0.1, "zip", "six"]
], dtype=object)

# Manually specify types, or use types="detect" to automatically detect types
types = ["numeric", "mixed", "bool", "nominal", "nominal"]

pipeline = make_pipeline(ParamPreprocessor(types="detect"), VarianceThreshold())
pipeline.fit_transform(X)
Output:
[[-0.4472136  -0.4472136   1.41421356 -0.70710678 -0.4472136  -0.4472136   2.23606798 -0.4472136  -0.4472136  -0.4472136   0.4472136  -0.4472136  -0.85226648  1.        ]
 [-0.4472136   2.23606798 -0.70710678 -0.70710678 -0.4472136  -0.4472136  -0.4472136  -0.4472136  -0.4472136   2.23606798  0.4472136  -0.4472136  -0.5831297  -1.        ]
 [ 2.23606798 -0.4472136  -0.70710678 -0.70710678 -0.4472136  -0.4472136  -0.4472136  -0.4472136   2.23606798 -0.4472136  -2.23606798  2.23606798 -1.39054004 -1.        ]
 [-0.4472136  -0.4472136  -0.70710678  1.41421356 -0.4472136   2.23606798 -0.4472136  -0.4472136  -0.4472136  -0.4472136   0.4472136  -0.4472136   0.49341743 -1.        ]
 [-0.4472136  -0.4472136   1.41421356 -0.70710678  2.23606798 -0.4472136  -0.4472136  -0.4472136  -0.4472136  -0.4472136   0.4472136  -0.4472136   1.031691    1.        ]
 [-0.4472136  -0.4472136  -0.70710678  1.41421356 -0.4472136  -0.4472136  -0.4472136   2.23606798 -0.4472136  -0.4472136   0.4472136  -0.4472136   1.30082778  1.        ]]
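Since ParamPreprocessor is used here as an ordinary scikit-learn transformer, it can presumably also be passed to one of the wrappers through the preprocessor parameter shown earlier. A minimal sketch, assuming that parameter accepts any scikit-learn transformer:

from arbok import AutoSklearnWrapper, ParamPreprocessor

# Sketch: plug ParamPreprocessor into a wrapper via the `preprocessor`
# parameter, assuming it accepts any scikit-learn transformer (as with the
# imputer/one-hot pipeline above).
clf = AutoSklearnWrapper(
    preprocessor=ParamPreprocessor(types="detect"),
    time_left_for_this_task=3600,
    per_run_time_limit=360
)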