# What
Pyplearnr is a tool designed to easily and more elegantly build, select, and validate scikit-learn pipelines using nested k-fold cross-validation.
# How
### Use
See the [demo](https://nbviewer.jupyter.org/github/JaggedParadigm/pyplearnr/blob/master/pyplearnr_demo.ipynb) for use of pyplearnr.
### Installation
##### Dependencies
pyplearnr requires:
- Python (>= 2.7 or >= 3.3)
- scikit-learn (>= 0.18.2)
- numpy (>= 1.13.0)
- scipy (>= 0.19.1)
- pandas (>= 0.20.2)
- matplotlib (>= 2.0.2)

For use in Jupyter notebooks with a conda installation, I also recommend nb_conda (>= 2.2.0).
### User installation
Installation is currently handled by using pip to install directly from the GitHub repository; I'm working on making this easier.
For now, from the command line use:
```
pip install git+https://github.com/JaggedParadigm/pyplearnr.git@master
```
For conda, you can issue the same command as above, or include the following in your environment.yml file:
```
- pip:
- git+https://github.com/JaggedParadigm/pyplearnr.git#egg=pyplearnr
```
and then either generate a new environment from the terminal using:
```
conda env create
```
or update an existing one (environment_name) using:
```
conda env update -n environment_name -f ./environment.yml
```
Another option is to simply clone the repository, add its location to your Python path, and import it.
# Features
One core aspect of pyplearnr is the combinatorial pipeline schematic: a flexible specification of every combination of step (e.g. estimator), step option (e.g. knn, logistic regression), and parameter option (e.g. n_neighbors for knn or C for logistic regression). Any scikit-learn class instance you would use in a normal pipeline can be inserted, or one can be chosen from a list of supported options.
Here's an example with optional scaling, PCA (directly from the sklearn object), selection of the number of principal components to use, and the use of k-nearest neighbors with different values for the number of neighbors:
```python
pipeline_schematic = [
    {'scaler': {
        'none': {},
        'min_max': {},
        'standard': {}
    }},
    {'transform': {
        'pca': {
            'sklo': sklearn.decomposition.PCA,
            'n_components': [feature_count]
        }
    }},
    {'feature_selection': {
        'select_k_best': {
            'k': range(1, feature_count + 1)
        }
    }},
    {'estimator': {
        'knn': {
            'n_neighbors': range(1, 31)
        }
    }}
]
```
The core validation method is nested k-fold cross-validation (stratified for classification). Pyplearnr divides the data into k validation outer-folds and divides each outer-fold's training set into k test inner-folds. For each outer-fold, the best pipeline is the one with the highest inner-fold score (the median by default); the overall winner is the pipeline with the most outer-fold wins. The validation outer-folds then provide an estimate of the winner's out-of-sample scores, and the final pipeline can be used to make predictions.
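The scheme above can be sketched with plain scikit-learn primitives. This is an illustrative approximation, not pyplearnr's own API; the pipeline names and the `candidate_pipelines` dict are hypothetical, and only two candidates are shown for brevity:

```python
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hypothetical candidate pipelines (in pyplearnr these come from the schematic)
candidate_pipelines = {
    'knn_3': Pipeline([('scaler', StandardScaler()),
                       ('estimator', KNeighborsClassifier(n_neighbors=3))]),
    'knn_10': Pipeline([('scaler', StandardScaler()),
                        ('estimator', KNeighborsClassifier(n_neighbors=10))]),
}

outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
wins = Counter()
for train_idx, _ in outer.split(X, y):
    # Inner k-fold on each outer training set; rank pipelines by median score
    inner_scores = {
        name: np.median(cross_val_score(pipe, X[train_idx], y[train_idx],
                                        cv=StratifiedKFold(n_splits=5)))
        for name, pipe in candidate_pipelines.items()
    }
    wins[max(inner_scores, key=inner_scores.get)] += 1

# Winner: the pipeline with the most outer-fold wins
winner = wins.most_common(1)[0][0]

# Outer-fold scores estimate the winner's out-of-sample performance
outer_scores = cross_val_score(candidate_pipelines[winner], X, y, cv=outer)
print(winner, outer_scores.mean())
```

The final step would then be fitting the winning pipeline on all of the data before making predictions.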
# Why
I wanted a way to do what GridSearchCV does for specific estimators with any estimator in a repeatable way.
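For comparison, the closest built-in scikit-learn approach is, roughly, a GridSearchCV over a Pipeline whose final step is swapped between estimators via a list of parameter grids. This is a minimal sketch of that pattern, not a substitute for pyplearnr's nested validation:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('estimator', KNeighborsClassifier())])

# Each dict is a separate grid; the 'estimator' step itself is a parameter,
# so the search covers both estimator choice and its hyperparameters
param_grid = [
    {'estimator': [KNeighborsClassifier()],
     'estimator__n_neighbors': range(1, 31)},
    {'estimator': [LogisticRegression(max_iter=1000)],
     'estimator__C': [0.01, 0.1, 1, 10]},
]

search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
print(search.best_params_)
```

This selects a single winner with one (non-nested) cross-validation loop, which is exactly the limitation the nested scheme above is meant to address.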