Modev: Model Development for Data Science Projects
Introduction
Most data science projects involve similar ingredients (loading data, defining some evaluation metrics, splitting data into different train/validation/test sets, etc.). Modev's goal is to ease these repetitive steps, without constraining the freedom data scientists need to develop models.
Installation
The easiest way is to install it from pip:
pip install modev
Alternatively, you can clone the latest release and install it manually:
git clone git@github.com:pabloarosado/modev.git
cd modev
python setup.py install
You can also install it from conda:
conda install -c pablorosado modev
Quick guide
The quickest way to get started with modev is to run a pipeline with the default settings:
import modev
pipe = modev.Pipeline()
pipe.run()
This runs a pipeline on some example data, and returns a dataframe with a ranking of approaches that perform best (given some metrics) on the data.
To get the data used in the pipeline:
pipe.get_data()
By default, modev splits the data into a playground and a test set. The test set is omitted (unless parameter execution_inputs['test_mode'] is set to True), and the playground is split into k train/dev folds, to do k-fold cross-validation. To get the indexes of train/dev/test sets:
pipe.get_indexes()
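The splitting logic described above can be sketched in plain Python. This is an illustrative reimplementation of the idea, not modev's actual code; the function names and the fold-assignment scheme are hypothetical:

```python
import random

def split_playground_test(n_examples, test_fraction=0.2, seed=0):
    # Shuffle indexes and reserve a fraction as the (initially ignored) test set.
    indexes = list(range(n_examples))
    random.Random(seed).shuffle(indexes)
    n_test = int(n_examples * test_fraction)
    return indexes[n_test:], indexes[:n_test]  # playground, test

def k_fold_indexes(playground, k=4):
    # Split the playground into k train/dev folds for cross-validation:
    # each fold uses a different slice as dev and the rest as train.
    folds = []
    for i in range(k):
        dev = playground[i::k]
        train = [idx for idx in playground if idx not in dev]
        folds.append((train, dev))
    return folds

playground, test = split_playground_test(100, test_fraction=0.2)
folds = k_fold_indexes(playground, k=4)
```

Every example ends up either in the test set or in exactly one dev fold, so the dev sets jointly cover the whole playground.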
The pipeline will load two dummy approaches (which can be accessed on pipe.approaches_function) with some parameters (which can be accessed on pipe.approaches_pars). For each fold, these approaches will be fitted to the train set and used to predict the 'color' of the examples on the dev sets. The metrics used to evaluate the performance of the approaches are listed in pipe.evaluation_pars['metrics'].
An exhaustive grid search is performed to evaluate all possible combinations of the parameters of each approach. The performance of each combination on each fold can be accessed on:
pipe.get_results()
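An exhaustive grid search amounts to enumerating the cartesian product of each approach's parameter grid. A minimal sketch of that enumeration (independent of modev's actual GridSearch implementation; the parameter names below are made up for illustration):

```python
from itertools import product

def grid_combinations(param_grid):
    # Yield one dict per combination of the listed parameter values.
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))

# Hypothetical grid: 2 x 3 = 6 combinations to evaluate on every fold.
grid = {'probability': [0.1, 0.5], 'random_state': [0, 1, 2]}
combos = list(grid_combinations(grid))
```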
To plot these results per fold for each of the metrics:
pipe.plot_results()
To plot only a certain list of metrics, pass that list as an argument to this function.
To get the final ranking of best approaches (after combining the results of different folds):
pipe.get_selected_models()
Guide
The inputs accepted by modev.Pipeline refer to the usual ingredients in a data science project (data loading, evaluation metrics, model selection method, etc.).
We define an experiment as a combination of the inputs that a pipeline accepts:
- load_inputs: Dictionary of inputs related to data loading. To read what the default function does, see the documentation of modev.etl.load_local_file.
- validation_inputs: Dictionary of inputs related to the validation method (e.g. k-fold or temporal-fold cross-validation). To read what the default function does, see the documentation of modev.validation.k_fold_playground_n_tests_split.
- execution_inputs: Dictionary of inputs related to the execution of approaches (by default, an approach consists of a class with a 'fit' and a 'predict' method). To read what the default function does, see the documentation of modev.execution.execute_model.
- evaluation_inputs: Dictionary of inputs related to evaluation metrics. To read what the default function does, see the documentation of modev.evaluation.evaluate_predictions.
- exploration_inputs: Dictionary of inputs related to the method used to explore the parameter space (e.g. grid search or random search). To read what the default does (in this case it is a class, not a function), see the documentation of modev.exploration.GridSearch.
- selection_inputs: Dictionary of inputs related to the model selection method. To read what the default function does, see the documentation of modev.selection.model_selection.
- approaches_inputs: List of dictionaries, one per approach to be used. To read what the default approaches do, see the documentation of modev.approaches.DummyPredictor and modev.approaches.RandomChoicePredictor.
If any of these inputs is not given, default values are used. Each dictionary has a key 'function' that corresponds to a particular function (or class); any other item in the dictionary is assumed to be an argument of that function.
- If 'function' is not specified, a default function (taken from one of modev's modules) is used. Arguments of the function can be given in the same dictionary (and if not given, default values are assumed).
- If a custom 'function' is specified, all parameters required by it can be given in the same dictionary.
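This 'function'-plus-arguments convention can be sketched as follows. The dispatcher below is a hypothetical illustration of the pattern, not modev's internal code, and my_loader is a made-up custom function:

```python
def run_input(input_dict, default_function):
    # Pop the callable (falling back to a default) and pass the
    # remaining items of the dictionary as keyword arguments.
    pars = dict(input_dict)
    function = pars.pop('function', default_function)
    return function(**pars)

# Hypothetical custom loader replacing a default one.
def my_loader(path, sep=','):
    return f"loaded {path} with sep '{sep}'"

result = run_input({'function': my_loader, 'path': 'data.csv'}, default_function=None)
```

Unspecified arguments (here, sep) keep the defaults declared by the function itself.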
There is one special case: approaches_inputs is not a dictionary but a list of dictionaries (given that one experiment can include several approaches). Each dictionary in the list has at least two keys:
- approach_name: Name of the approach.
- function: Actual approach (usually, a class with 'fit' and 'predict' methods).
- Any other given key is assumed to be an argument of the approach.
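A minimal example of what such a list might look like, using a made-up approach class (MajorityClassPredictor and its tie_break argument are hypothetical; only the 'fit'/'predict' interface mirrors what the text describes):

```python
class MajorityClassPredictor:
    # Minimal approach: always predict the most common training label.
    def __init__(self, tie_break=None):
        self.tie_break = tie_break  # hypothetical extra argument
        self.majority = None

    def fit(self, train_x, train_y):
        self.majority = max(set(train_y), key=train_y.count)

    def predict(self, test_x):
        return [self.majority] * len(test_x)

# One dictionary per approach; extra keys become arguments of the approach.
approaches_inputs = [
    {'approach_name': 'majority', 'function': MajorityClassPredictor, 'tie_break': None},
]

approach = approaches_inputs[0]['function'](tie_break=approaches_inputs[0]['tie_break'])
approach.fit(train_x=[1, 2, 3], train_y=['red', 'blue', 'red'])
preds = approach.predict([4, 5])
```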
An experiment can be contained in a Python module. As an example, there is a template experiment in modev.templates, which is a small variation on the default experiment.
To start a pipeline on this experiment:
from modev import Pipeline, templates
experiment = templates.experiment_01.experiment
pipe = Pipeline(**experiment)
Then run it as shown in the quick guide.