
Pyplearnr is a tool for easily and elegantly building, validating (via nested k-fold cross-validation), and testing scikit-learn pipelines.

# What
Pyplearnr is a tool designed to perform model selection, hyperparameter tuning, and model validation via nested k-fold cross-validation in a reproducible way.

# Why
I found GridSearchCV to be lacking. I wanted a tool that used a similar procedure to perform simultaneous hyperparameter tuning AND model selection, with clear input that summarizes exactly which scikit-learn pipeline steps and parameter combinations will be used, and whose results are perfectly reproducible. So, I made my own.

# How
### Use
See the [demo](https://nbviewer.jupyter.org/github/JaggedParadigm/pyplearnr/blob/master/pyplearnr_demo.ipynb) for more detailed use of pyplearnr with actual data.

Here are the basic steps:
#### 1) Place the feature data into a non-null feature matrix and a target vector
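For example, with pandas this usually means dropping rows that contain nulls and splitting off the target column. A minimal sketch, assuming the data lives in a hypothetical data.csv with a hypothetical 'target' column:

```python
import pandas as pd

# Hypothetical input: a CSV with feature columns and a 'target' column
df = pd.read_csv('data.csv').dropna()  # drop rows with nulls -> non-null features

X = df.drop('target', axis=1).values  # feature matrix
y = df['target'].values               # target vector

feature_count = X.shape[1]  # used below when sizing the PCA/feature-selection steps
```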
#### 2) Initialize the nested k-fold cross-validation object
```python
import pyplearnr as ppl

kfcv = ppl.NestedKFoldCrossValidation(outer_loop_fold_count=5,
                                      inner_loop_fold_count=5)
```
#### 3) Specify the combinatorial pipeline schematic detailing all possible model/parameter combinations

Here's an example covering model/parameter combinations with optional scaling of two types (min-max or standard), a principal component analysis using scikit-learn's sklearn.decomposition.PCA transformer directly, selection of the data transformed by the k best principal components (k between 1 and 30, the feature count here), and either a k-nearest neighbors classifier (k between 1 and 30) or a random forest classifier with a maximum depth between 2 and 5 (and a fixed random state for reproducibility).

```python
import sklearn.decomposition
from sklearn.ensemble import RandomForestClassifier

# feature_count is the number of columns in the feature matrix X (see step 1)
pipeline_schematic = [
    {'scaler': {
        'none': {},
        'min_max': {},
        'standard': {}
    }},
    {'transform': {
        'pca': {
            'sklo': sklearn.decomposition.PCA,
            'n_components': [feature_count]
        }
    }},
    {'feature_selection': {
        'select_k_best': {
            'k': range(1, feature_count + 1)
        }
    }},
    {'estimator': {
        'knn': {
            'n_neighbors': range(1, 31)
        },
        'random_forest': {
            'sklo': RandomForestClassifier,
            'max_depth': range(2, 6),
            'random_state': [57]
        }
    }}
]
```

#### 4) Run pyplearnr
```python
# Perform nested k-fold cross-validation
kfcv.fit(X, y, pipeline_schematic=pipeline_schematic,
         scoring_metric='auc', score_type='median')
```
### Methodology
The core model selection and validation method is nested k-fold cross-validation (stratified for classification). Inner-fold contests are used for model selection, and the outer folds are used to cross-validate the final winning model.

Here's the basic algorithm used by pyplearnr (a generic scikit-learn sketch of the same procedure follows the list):

- 1) Pyplearnr shuffles the data and divides it into k validation outer-folds.
- 2) For each outer-fold:
    - a) The remaining folds are combined to form the corresponding training set.
    - b) This training set is divided into k (possibly a different number of) inner-test-folds.
    - c) For each inner-test-fold, the remaining inner-test-folds are combined and used to train all pipelines/models, which are then scored on the held-out inner-test-fold.
    - d) The winner of that outer-fold's contest is the pipeline/model with the best median score over all inner-test-folds.
    - e) If there is a tie, the user is alerted and expected to choose the winner (usually the simplest pipeline, for better generalizability).
- 3) The final winning model/pipeline is the one with the most wins across the inner-test-fold contests of all outer-folds. Again, the user is expected to break any tie.
- 4) The final winning model/pipeline is trained on the full training data of each outer-fold, tested on the corresponding validation fold, and summary statistics representing its expected out-of-sample performance are presented to the user.
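Here is that procedure as a minimal sketch in plain scikit-learn; the two candidate pipelines, fold counts, and synthetic data are illustrative assumptions, not pyplearnr's internals:

```python
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=300, n_features=10, random_state=57)

# Two illustrative candidate pipelines standing in for the full combinatorial schematic
candidates = {
    'knn_5': Pipeline([('scale', StandardScaler()),
                       ('est', KNeighborsClassifier(n_neighbors=5))]),
    'rf_d3': Pipeline([('est', RandomForestClassifier(max_depth=3, random_state=57))]),
}

outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=57)
inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=57)

# Inner contests: one winner (best median inner-fold AUC) per outer-fold
outer_winners = []
for train_idx, _ in outer.split(X, y):
    X_train, y_train = X[train_idx], y[train_idx]
    scores = {name: [] for name in candidates}
    for fit_idx, test_idx in inner.split(X_train, y_train):
        for name, pipe in candidates.items():
            pipe.fit(X_train[fit_idx], y_train[fit_idx])
            proba = pipe.predict_proba(X_train[test_idx])[:, 1]
            scores[name].append(roc_auc_score(y_train[test_idx], proba))
    outer_winners.append(max(scores, key=lambda name: np.median(scores[name])))

# Overall winner: the pipeline with the most inner-contest wins
overall_winner = Counter(outer_winners).most_common(1)[0][0]

# Cross-validate the overall winner on the held-out outer validation folds
winner_scores = []
for train_idx, valid_idx in outer.split(X, y):
    pipe = candidates[overall_winner]
    pipe.fit(X[train_idx], y[train_idx])
    proba = pipe.predict_proba(X[valid_idx])[:, 1]
    winner_scores.append(roc_auc_score(y[valid_idx], proba))

print(overall_winner, 'median outer-fold AUC:', np.median(winner_scores))
```

Pyplearnr automates this bookkeeping, including the tie alerts and the final summary statistics, directly from the pipeline schematic.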


### Installation
##### Dependencies

pyplearnr requires:

- Python (>= 2.7 or >= 3.3)
- scikit-learn (>= 0.18.2)
- numpy (>= 1.13.0)
- scipy (>= 0.19.1)
- pandas (>= 0.20.2)
- matplotlib (>= 2.0.2)

For use in Jupyter notebooks with a conda installation, I recommend also having nb_conda (>= 2.2.0).

### User installation
Install by using pip:

```
pip install pyplearnr
```

For conda, you can issue the same pip command within a conda environment, or include this in your environment.yml file:

```
- pip:
  - pyplearnr
```
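
For context, a minimal environment.yml carrying that pip entry might look like the following (the environment name and the particular conda packages listed here are illustrative):

```
name: pyplearnr_env
channels:
  - defaults
dependencies:
  - python=3.6
  - scikit-learn
  - numpy
  - scipy
  - pandas
  - matplotlib
  - pip
  - pip:
    - pyplearnr
```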

Then either generate a new environment from the terminal using:

```
conda env create
```

or update an existing one (environment_name) using:

```
conda env update -n environment_name -f ./environment.yml
```

Another option is to simply clone the repository, add its location to your Python path, and import it.



