skrobot is a Python module for designing, running and tracking Machine Learning experiments / tasks. It is built on top of scikit-learn framework.
Project description
skrobot
What is it about?
skrobot is a Python module for designing, running and tracking Machine Learning experiments / tasks. It is built on top of scikit-learn framework.
How do I install it?
PyPI
$ pip install skrobot
Development Version
The skrobot version on PyPI may always be one step behind; you can install the latest development version from the GitHub repository by executing
$ pip install git+git://github.com/medoidai/skrobot.git
Or, you can clone the GitHub repository and install skrobot from your local drive via
$ python setup.py install
Which are the components?
NOTE : Currently, skrobot can be used only for binary classification problems. Multiclass, regression and multitarget extensions are project priorities. Also, a proper module documentation is on the making after the first public release and testing (please check below).
Functionality for the module users
Component | What is this? |
---|---|
Train Task | This task can be used to fit a scikit-learn estimator on some data. |
Evaluation Cross Validation Task | This task can be used to evaluate a scikit-learn estimator on some data. |
Feature Selection Cross Validation Task | This task can be used to perform feature selection with Recursive Feature Elimination using a scikit-learn estimator on some data. |
Hyperparameters Search Cross Validation Task | This task can be used to search the best hyperparameters of a scikit-learn estimator on some data. |
Experiment | This is used to build and run experiments. It can run tasks in the context of an experiment and glue everything together to complete a modelling pipeline. |
Task Runner | This is like the Experiment component but without the "experiment" stuff. It can be used to run various tasks and glue everything together to complete a modelling pipeline. |
Functionality for the module developers extending tasks or notifiers
Component | What is this? |
---|---|
Base Task | All tasks inherit from this component. A task is a configurable and reproducible piece of code built on top of scikit-learn that can be used in machine learning pipelines. |
Base Cross Validation Task | All tasks that use cross validation functionality inherit from this component. |
Base Notifier | All notifiers inherit from this component. A notifier can be used to send success / failure notifications for tasks execution. |
More details on the components functionality?
Evaluation Cross Validation Task
-
Cross validation runs by default and can be configured to use either stratified k-folds or custom folds
-
The provided estimator is not affected and is used only as a template
-
The provided estimator can be either a scikit-learn machine learning model (e.g., LogisticRegression) or a pipeline ending with an estimator
-
The provided train / test dataset file paths can be either URLs or disk file paths
-
The provided estimator needs to be able to predict probabilities through a
predict_proba
method -
The following evaluation results can be generated on-demand for hold-out test set as well as train / validation CV folds:
- PR / ROC Curves
- Confusion Matrixes
- Classification Reports
- Performance Metrics
- False Positives
- False Negatives
-
The evaluation results can be generated either for a specifc provided threshold or for the best one found from threshold tuning
-
The threshold used along with its related performance metrics and summary metrics from all CV splits as well as hold-out test set are returned as a result
Feature Selection Cross Validation Task
-
Cross validation runs by default and can be configured to use either stratified k-folds or custom folds
-
The provided estimator can be either a scikit-learn machine learning model (e.g., LogisticRegression) or a pipeline ending with an estimator
-
The provided train dataset file path can be either a URL or a disk file path
-
The provided estimator needs to provide feature importances through either a
coef_
or afeature_importances_
attribute -
Along with the provided estimator a preprocessor can also be provided to preprocess the data before feature selection runs
-
The provided estimator and preprocessor are not affected and are used only as templates
-
The selected features can be either column names (from the original data) or column indexes (from the preprocessed data) depending on whether a preprocessor was used or not
-
The selected features are stored in a text file and also returned as a result
Train Task
-
The provided estimator is not affected and is used only as a template
-
The provided estimator can be either a scikit-learn machine learning model (e.g., LogisticRegression) or a pipeline ending with an estimator
-
The provided train dataset file path can be either a URL or a disk file path
-
The fitted estimator is stored as a pickle file and also returned as a result
Hyperparameters Search Cross Validation Task
-
Cross validation runs by default and can be configured to use either stratified k-folds or custom folds
-
The provided estimator is not affected and is used only as a template
-
The provided estimator can be either a scikit-learn machine learning model (e.g., LogisticRegression) or a pipeline ending with an estimator
-
The provided train dataset file path can be either a URL or a disk file path
-
The search can be either randomized or grid-based
-
The search results as well as the best estimator found with its related hyperparameters and score are returned as a result
-
The search results are stored in a file as a static HTML table
Experiment
-
Each experiment when it runs it leaves in a unique directory a footprint of metadata files (experiment source code, experiment ID, experiment date/time, experimenter name, experiment default / overwritten parameters in JSON format)
-
Notifications can be send after running a task, through an easy to implement API (it can be useful for teams who need to get notified for the progress of the experiment)
-
In case of error when running a task, a text file will be generated with the related error
Task Runner
-
It runs the provided tasks and saves in a file the default / overwritten parameters in JSON format
-
In case of error when running a task, a text file will be generated with the related error
Why does it exists?
It can help Data Scientists and Machine Learning Engineers:
-
to keep track of modelling experiments / tasks
-
to automate the repetitive (and boring) stuff when designing modelling pipelines
-
to spend more time on the things that truly matter when solving a problem
The people behind it
Development:
Support, testing and features recommendation:
Next Priorities?
-
Write documentation (e.g., Read the Docs) for the module!
-
Add support in the module for multiclass, regression and multitarget problems
-
Support Python 3.8
-
Build some notifiers (e.g., Slack, Trello and Discord)
-
Add Bayesian hyperparameter tuning support
Can I contribute?
Of course, the project is Free Software and you can contribute to it!
What license do you use?
See our license (LICENSE.txt) for more details.
How do I use it?
Many examples can be found in the examples directory.
Below, are some examples that use many of skrobot's components to built a machine learning modelling pipeline. Please try them and we would love to have your feedback!
Example on Titanic Dataset (auto-generated results)
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from skrobot.core import Experiment
from skrobot.tasks import TrainTask
from skrobot.tasks import FeatureSelectionCrossValidationTask
from skrobot.tasks import EvaluationCrossValidationTask
from skrobot.tasks import HyperParametersSearchCrossValidationTask
from skrobot.feature_selection import ColumnSelector
from skrobot.notification import BaseNotifier
######### Initialization Code
train_data_set_file_path = 'https://bit.ly/titanic-data-train'
test_data_set_file_path = 'https://bit.ly/titanic-data-test'
random_seed = 42
id_column = 'PassengerId'
label_column = 'Survived'
numerical_features = ['Age', 'Fare', 'SibSp', 'Parch']
categorical_features = ['Embarked', 'Sex', 'Pclass']
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer()),
('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('encoder', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer(transformers=[
('numerical_transfomer', numeric_transformer, numerical_features),
('categorical_transfomer', categorical_transformer, categorical_features)])
classifier = LogisticRegression(solver='liblinear', random_state=random_seed)
search_params = {
"classifier__C" : [ 1.e-01, 1.e+00, 1.e+01 ],
"classifier__penalty" : [ "l1", "l2" ],
"preprocessor__numerical_transfomer__imputer__strategy" : [ "mean", "median" ]
}
######### skrobot Code
# Define a Notifier (This is optional and you can implement any notifier you want, e.g. for Slack / Trello / Discord)
class ConsoleNotifier(BaseNotifier):
def notify (self, message):
print(message)
# Build an Experiment
experiment = Experiment('experiments-output', __file__).set_experimenter('echatzikyriakidis').set_notifier(ConsoleNotifier()).build()
# Run Feature Selection Task
features_columns = experiment.run(FeatureSelectionCrossValidationTask (estimator=classifier,
train_data_set_file_path=train_data_set_file_path,
preprocessor=preprocessor,
min_features_to_select=4,
id_column=id_column,
label_column=label_column,
random_seed=random_seed).stratified_folds(total_folds=5, shuffle=True))
pipe = Pipeline(steps=[('preprocessor', preprocessor),
('selector', ColumnSelector(cols=features_columns)),
('classifier', classifier)])
# Run Hyperparameters Search Task
hyperparameters_search_results = experiment.run(HyperParametersSearchCrossValidationTask (estimator=pipe,
search_params=search_params,
train_data_set_file_path=train_data_set_file_path,
id_column=id_column,
label_column=label_column,
random_seed=random_seed).random_search(n_iters=100).stratified_folds(total_folds=5, shuffle=True))
# Run Evaluation Task
evaluation_results = experiment.run(EvaluationCrossValidationTask(estimator=pipe,
estimator_params=hyperparameters_search_results['best_params'],
train_data_set_file_path=train_data_set_file_path,
test_data_set_file_path=test_data_set_file_path,
id_column=id_column,
label_column=label_column,
random_seed=random_seed,
export_classification_reports=True,
export_confusion_matrixes=True,
export_pr_curves=True,
export_roc_curves=True,
export_false_positives_reports=True,
export_false_negatives_reports=True,
export_also_for_train_folds=True).stratified_folds(total_folds=5, shuffle=True))
# Run Train Task
train_results = experiment.run(TrainTask(estimator=pipe,
estimator_params=hyperparameters_search_results['best_params'],
train_data_set_file_path=train_data_set_file_path,
id_column=id_column,
label_column=label_column,
random_seed=random_seed))
# Print in-memory results
print(features_columns)
print(hyperparameters_search_results['best_params'])
print(hyperparameters_search_results['best_index'])
print(hyperparameters_search_results['best_estimator'])
print(hyperparameters_search_results['best_score'])
print(hyperparameters_search_results['search_results'])
print(evaluation_results['threshold'])
print(evaluation_results['cv_threshold_metrics'])
print(evaluation_results['cv_splits_threshold_metrics'])
print(evaluation_results['cv_splits_threshold_metrics_summary'])
print(evaluation_results['test_threshold_metrics'])
print(train_results['estimator'])
Example on SMS Spam Collection Dataset (auto-generated results)
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.linear_model import SGDClassifier
from skrobot.core import Experiment
from skrobot.tasks import TrainTask
from skrobot.tasks import EvaluationCrossValidationTask
from skrobot.tasks import HyperParametersSearchCrossValidationTask
from skrobot.feature_selection import ColumnSelector
######### Initialization Code
train_data_set_file_path = 'https://bit.ly/sms-spam-ham-data-train'
test_data_set_file_path = 'https://bit.ly/sms-spam-ham-data-test'
field_delimiter = '\t'
random_seed = 42
pipe = Pipeline(steps=[
('column_selection', ColumnSelector(cols=['message'], drop_axis=True)),
('vectorizer', CountVectorizer()),
('tfidf', TfidfTransformer()),
('feature_selection', SelectPercentile(chi2)),
('classifier', SGDClassifier(loss='log'))])
search_params = {
'classifier__max_iter': [ 20, 50, 80 ],
'classifier__alpha': [ 0.00001, 0.000001 ],
'classifier__penalty': [ 'l2', 'elasticnet' ],
"vectorizer__stop_words" : [ "english", None ],
"vectorizer__ngram_range" : [ (1, 1), (1, 2) ],
"vectorizer__max_df": [ 0.5, 0.75, 1.0 ],
"tfidf__use_idf" : [ True, False ],
"tfidf__norm" : [ 'l1', 'l2' ],
"feature_selection__percentile" : [ 70, 60, 50 ]
}
######### skrobot Code
# Build an Experiment
experiment = Experiment('experiments-output', __file__).set_experimenter('echatzikyriakidis').build()
# Run Hyperparameters Search Task
hyperparameters_search_results = experiment.run(HyperParametersSearchCrossValidationTask (estimator=pipe,
search_params=search_params,
train_data_set_file_path=train_data_set_file_path,
field_delimiter=field_delimiter,
random_seed=random_seed).random_search().stratified_folds(total_folds=5, shuffle=True))
# Run Evaluation Task
evaluation_results = experiment.run(EvaluationCrossValidationTask(estimator=pipe,
estimator_params=hyperparameters_search_results['best_params'],
train_data_set_file_path=train_data_set_file_path,
test_data_set_file_path=test_data_set_file_path,
field_delimiter=field_delimiter,
random_seed=random_seed,
export_classification_reports=True,
export_confusion_matrixes=True,
export_pr_curves=True,
export_roc_curves=True,
export_false_positives_reports=True,
export_false_negatives_reports=True,
export_also_for_train_folds=True).stratified_folds(total_folds=5, shuffle=True))
# Run Train Task
train_results = experiment.run(TrainTask(estimator=pipe,
estimator_params=hyperparameters_search_results['best_params'],
train_data_set_file_path=train_data_set_file_path,
field_delimiter=field_delimiter,
random_seed=random_seed))
# Print in-memory results
print(hyperparameters_search_results['best_params'])
print(hyperparameters_search_results['best_index'])
print(hyperparameters_search_results['best_estimator'])
print(hyperparameters_search_results['best_score'])
print(hyperparameters_search_results['search_results'])
print(evaluation_results['threshold'])
print(evaluation_results['cv_threshold_metrics'])
print(evaluation_results['cv_splits_threshold_metrics'])
print(evaluation_results['cv_splits_threshold_metrics_summary'])
print(evaluation_results['test_threshold_metrics'])
print(train_results['estimator'])
Sample of auto-generated results
Classification Reports
Confusion Matrixes
False Negatives
False Positives
PR Curves
ROC Curves
Performance Metrics
On train / validation CV folds:
On hold-out test set:
Hyperparameters Search Results
Task Parameters Logging
Experiment Logging
Features Selected
The selected column indexes from the transformed features (this is generated when a preprocessor is used):
The selected column names from the original features (this is generated when no preprocessor is used):
Expreriment Source Code
Thank you!
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.